Hi all,

I need a bit of help with a really strange problem we've got with our SQL cluster. The setup is 2 Windows 2003 nodes sharing a DAS, running SQL 2000 Enterprise SP4. The cluster was up and running for a couple of days, then one of our offices said they could not connect to it. This box is held in one office; we have 5 or so other offices that connect to this cluster, and only one of them could not connect.

So, network problems? Well, we can remote desktop to either node of the cluster and from there connect to the machines in the remote office. We set up Ethereal on a PC in the remote office, tried to connect to the cluster via Query Analyzer, and could see packets going back and forth happily with the IP of the cluster etc., and no other traffic on the line is affected. Cue hair being pulled.

So, we failed it over to the second node. Everything starts working again. Please explain that one. We gave up for the afternoon to save our hair.

Skip to a few days later (today now), and it all fails again, same as last time. Fail back to the first node, it still doesn't work; fail over to the second, oh, it works again. For a few minutes, then it dies again.

Heeeeeelllp!!!
Treat this as you would any network communications failure.

Name resolution
IP path resolution (ping)
SQL client connect (Query Analyzer)
Application connect

Somewhere you will find where the glitch is. I would guess DNS cache problems on the client, but that is just to impress everyone if it really is the problem.

--
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
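The layered checks listed above can be roughly scripted from a client in the affected office. This is only a minimal sketch: the function name `check_layers` is made up for illustration, a plain TCP connect stands in for both the ping and the client-connect steps, and port 1433 is assumed as the SQL Server default (a clustered or named instance may listen elsewhere). It does not log in or run a query, so a pass here only says the path is open, not that the application will work.

```python
import socket


def check_layers(host, port=1433, timeout=3.0):
    """Walk the layers in order and stop at the first one that fails.

    Returns a list of (step, ok, detail) tuples.
    Assumption: 1433 is the port the SQL Server instance listens on.
    """
    results = []

    # Step 1: name resolution.
    try:
        ip = socket.gethostbyname(host)
        results.append(("name resolution", True, ip))
    except OSError as exc:
        results.append(("name resolution", False, str(exc)))
        return results  # no point going further without an address

    # Step 2: can we open a TCP session to the SQL Server port at all?
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            results.append(("tcp connect", True, f"{ip}:{port}"))
    except OSError as exc:
        results.append(("tcp connect", False, str(exc)))

    return results
```

Running it against both the cluster's virtual name and its IP address, from a working office and from the failing one, shows at which layer the two sites start to differ.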
Hi Geoff,

Thanks for the advice. The weird thing about this is that other network connections seem fine; it's just SQL that fails, and even then it's not a total block. Name resolution seems to work fine, and I can connect to the shares on the server and copy files off, so there's nothing wrong with the physical layer. If I try to connect using Query Analyzer, it will reject my password and I'll get a login failed if I enter the wrong one, or try to open the connection if I use the correct one, so at least at some level SQL is responding to these requests.

We had an incident yesterday where the remote office lost SQL again; however, when using odbcping and isql, I seemed to be able to connect and run queries. I wouldn't treat this as gospel though, as everything was a bit rushed and confused trying to get connectivity restored. It also seemed that if you turned off the object browser in Query Analyzer you could get a connection.

I'm starting to think it might be SQL Server itself, but I don't know enough about how Query Analyzer connects to verify it, and it still doesn't explain why the ODBC application connections fail... or does it?

Any help appreciated!

Thanks,

Kristan
What do your SQL error logs say?

What is your configuration? Are you running per-processor licenses or CAL-based?

Do you have any concurrent connection limits configured? What does sp_configure show?

You said the system was SP4. Have you applied the 2187 hotfix yet?

I think you need to start fresh. You say the connections are intermittent and only for one branch. Are the lost connections always from this one branch? If so, it is your physical layer, in the network. Check your routers for dropped packets. Make sure no one is using AUTO NEGOTIATE for link speed and duplex.

Is there anything special about this branch? Is it the furthest away? Does it have the slowest connection speed? Does it have the most users?

If an outage is sporadic but affects everyone equally, it is the system resource, or at least a common point in between. If you can localize a problem (like it is affecting only one branch or one group of users), then it will be somewhere unique to them, but at a common point for them, like a router at the branch site.

Hope this helps, but without any additional data, that's about the best we've got.

Sincerely,

Anthony Thomas
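The localization rule in the post above (everyone affected means a shared resource; a subset affected means something unique to that subset) can be written down as a small decision function. This is just a sketch of the reasoning, not a diagnostic tool; the function name `localize_fault` and the site names are hypothetical.

```python
def localize_fault(site_status):
    """Apply the 'who is affected?' rule to a map of site name -> can_connect.

    Returns (verdict, failing_sites).
    """
    failing = sorted(site for site, ok in site_status.items() if not ok)
    if not failing:
        return "no fault observed", failing
    if len(failing) == len(site_status):
        # Everyone is hit equally: suspect the server or a shared path.
        return "shared resource or common path (server, cluster, core network)", failing
    # Only some sites are hit: suspect what those sites alone have in common.
    return "something unique to the failing sites (branch router, WAN link)", failing


# Hypothetical snapshot matching the thread: one branch out of several fails.
status = {"head office": True, "branch A": False, "branch B": True}
verdict, failing = localize_fault(status)
print(failing, "->", verdict)
```

Taking the same snapshot during several outages helps confirm whether it really is always the same branch before chasing that branch's router or line.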
Then definitely check your AUTO NEGOTIATE settings on both the switch ports and the NICs for all players, clients and server, for both link speed and duplex. This would explain why sometimes you have a problem and sometimes you don't, for some clients and not others. Usually these settings can get reset after patching the BIOS on the switches.

Sincerely,

Anthony Thomas
Thanks for your help and suggestions. SQL error logs don't indicate anything, and we're running per-processor. No connection limits are set, and I've not applied 2187 but will shortly.

We've got a bit deeper with this now, and it seems like it may be line related, as we've found a few more apps (Exchange for one) that seem to suffer a similar problem.

Connecting via ISQL while the outage is taking place works; however, when I do something like select * from sysdatabases (which I guess is similar to what Query Analyzer does at login if the object browser is showing), I get only the first 10-15 databases (same place each time) and it dies. Looking at packet traces from both ends, there are loads of retransmissions and out-of-sequence arrivals going on, so I'm now 99% certain it's not SQL related.

That still doesn't explain why I can run the same query on a box on exactly the same switch and it will work, whereas doing it on the cluster will fail.

We're working with the network providers at the moment, but they're trying to claim that if it works on one box then it must be cluster related.

Thanks guys,

Kristan
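One way to reproduce the "login works, but a larger result set dies at the same place" pattern without involving SQL Server at all is to pull replies of increasing size across the same path. The sketch below assumes you can run a small test peer on a machine next to the cluster; `blob_server` and `probe_sizes` are hypothetical helpers written for this illustration, and the one-line protocol (send a byte count, get that many bytes back) is invented here, not anything SQL Server speaks.

```python
import socket
import threading


def blob_server(listener):
    """Hypothetical test peer: reads a decimal byte count, replies with that many bytes."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener was closed
        with conn, conn.makefile("rb") as reader:
            text = reader.readline().decode().strip()
            conn.sendall(b"x" * int(text or 0))


def probe_sizes(host, port, sizes, timeout=5.0):
    """Return {requested_size: bytes_received}; a shortfall marks where the path stalls."""
    received = {}
    for size in sizes:
        got = 0
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(f"{size}\n".encode())
                while got < size:
                    chunk = sock.recv(65536)
                    if not chunk:
                        break  # peer closed early
                    got += len(chunk)
        except OSError:
            pass  # timeout or reset: keep the partial count
        received[size] = got
    return received
```

If small replies always arrive in full while larger ones repeatedly stall from the affected office only, that supports the view that the fault is on the path rather than in SQL Server, and gives the network provider a test case that does not depend on the cluster.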