
sql server clustering : Failover doesn't work properly with network problems.


Chris Carmichael
1/22/2004 11:56:05 AM
Hello All,

We have a SQL cluster in active/passive configuration. Unfortunately, our
network has been having issues. More specifically, the line protocol on the
Cisco 3550 that the primary node on this cluster is connected to seems to
drop off temporarily from time to time.

So the problem is that when this happens, failover does not work. It hangs
up on the first node. When this happens, you can't connect to cluster
administrator to see what is going on. Yet if you physically turn off the
primary node, then failover will complete. We tried testing failover by
unplugging the network cable from the primary node. When you do this the
failover happens without incident.

Our take on it is that clustering will only work if you have a failure at
layer 1 of the OSI model?!?! That hardly seems right for a system as robust
as SQL. Yet a layer 1 failure kicks off the failover perfectly, while losing
the line protocol on the switch (which would be layer 3-4) does not. So it
appears as though the network connection to the server is up, but you can't
communicate with it on the public side. Our heartbeat connection is simply a
crossover cable, so I don't think that is the problem.
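
To make the hypothesis above concrete, here is a minimal Python sketch of the difference between a check that only looks at physical link state and one that also probes reachability. It is a toy model, not how the cluster service actually works: the function names, the scenario table, and the probe callable are all made up for illustration.

```python
from typing import Callable, Dict, Tuple

# Toy model only: these names are hypothetical, not cluster service APIs.

def link_only_healthy(link_up: bool) -> bool:
    """Healthy whenever the NIC reports carrier (cable in, port powered)."""
    return link_up


def reachability_healthy(link_up: bool, probe: Callable[[], bool]) -> bool:
    """Healthy only if carrier is present AND a probe beyond the port succeeds."""
    return link_up and probe()


# (link_up, something beyond the switch port is reachable)
SCENARIOS: Dict[str, Tuple[bool, bool]] = {
    "cable unplugged": (False, False),
    "switch line protocol down": (True, False),
    "normal": (True, True),
}


def summarize() -> Dict[str, Dict[str, bool]]:
    """Evaluate both checks for every scenario."""
    rows = {}
    for name, (link_up, reachable) in SCENARIOS.items():
        rows[name] = {
            "link_only": link_only_healthy(link_up),
            # Bind the value now so each lambda keeps its own scenario.
            "reachability": reachability_healthy(link_up, lambda r=reachable: r),
        }
    return rows


if __name__ == "__main__":
    for name, result in summarize().items():
        print(name, result)
```

In this model, only the "switch line protocol down" case separates the two checks: the link-only check still reports healthy, which would match a node that never starts a clean failover.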

We have been pulling out our hair on this for 2 weeks. Does anyone here
have any suggestions? Or is there a way to force the cluster to fail over
EVERYTHING if any one resource dies?

Thank you!

Chris


Geoff N. Hiten
1/22/2004 4:12:44 PM
Do any of the following help?
http://support.microsoft.com/default.aspx?scid=kb;EN-US;242600

http://support.microsoft.com/default.aspx?scid=kb;EN-US;176320

http://support.microsoft.com/default.aspx?scid=kb;en-us;286342&Product=winsvr2003

You didn't specify the host OS, so I included links for 2000 and 2003. If
you are on NT4, upgrade it now. :)

My take is that the LooksAlive and IsAlive tests fail, forcing the failover,
but the IP address is still alive on the net, preventing the Virtual server
from coming up on the second node. If this is the case, you should see
duplicate IP address errors in the second node's event log.
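
As a rough illustration of that theory, the following Python sketch models an IP address that the first node never releases, so the second node's attempt to bring it online is rejected as a duplicate. Everything here is hypothetical (the ToyNetwork class, the node names, the 10.0.0.5 address); it is not the cluster service's logic, only a way to picture the half-failover state.

```python
class ToyNetwork:
    """Tracks which node currently answers for each IP address (toy model)."""

    def __init__(self):
        self.owners = {}

    def claim(self, ip, node):
        """Bring an address online on a node; refuse if another node holds it."""
        holder = self.owners.get(ip)
        if holder is not None and holder != node:
            raise RuntimeError(f"duplicate IP address {ip}: still held by {holder}")
        self.owners[ip] = node

    def release(self, ip, node):
        """Take an address offline, but only if this node actually holds it."""
        if self.owners.get(ip) == node:
            del self.owners[ip]


def fail_over(net, ip, old_node, new_node, old_node_released):
    """Try to move an address; report whether it came online on the new node."""
    if old_node_released:
        net.release(ip, old_node)
    try:
        net.claim(ip, new_node)
        return "online"
    except RuntimeError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    net = ToyNetwork()
    net.claim("10.0.0.5", "node1")  # hypothetical virtual server address
    print(fail_over(net, "10.0.0.5", "node1", "node2", old_node_released=False))
    print(fail_over(net, "10.0.0.5", "node1", "node2", old_node_released=True))
```

The first call fails because node1 is still answering for the address; the second succeeds once node1 has let go, which is roughly what powering the node off achieves.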

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com





Chris Carmichael
1/22/2004 5:21:42 PM
Thanks for the input. I looked at these scenarios and nothing seems to help.
In looking at the event logs, it looks as though the second node tried
unsuccessfully to take control of the disk array 6 times, then turned off
its cluster service. Since the IP address had already failed over, that
left the server in a half failover state. This does not happen if we just
unplug the network cable though?!?! Any other ideas?

Thank You again,

Chris


Dan Johnson
1/23/2004 9:34:56 AM
Chris, I am having the same problem: a SQL 2000/Win2000 server config
with a Cisco 6500 switch. I have 3 other clusters with the same config
that are NOT having an issue. Diags on the primary cluster node show no
issue with the NICs. The switch is showing no alignment errors or
dropped packets on these ports. I do believe that this is a network
issue, and I am going to fail over to the second node next weekend and
run there for a while to rule out the server.
Drop me an email if you have any luck!
djohnson@shc.org
Geoff N. Hiten
1/23/2004 10:09:15 AM
What is your hardware and OS config? I am aware of at least one combination
that has problems releasing the SCSI reservation when there are
communication issues.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com





Chris Carmichael
1/23/2004 5:23:01 PM
We are running the cluster on two PowerEdge 6650 servers, with quad processors
and 8 GB of RAM each. They both connect up to a Dell PowerVault 220S.
Windows 2000 and SQL 2000.


Geoff N. Hiten
1/24/2004 10:20:28 PM
Hmmm. SCSI Cluster. Not my favorite config, but you work with what you
got.

Obvious question #1: was this put together by Dell and certified as a valid
cluster? If not, at least tell me that all the parts, ESPECIALLY the PERC
cards, are on the Cluster HCL. If not, you may have found out why it didn't
make the list.

You may want to open a case with PSS if the cluster is certified. This is
beginning to sound like it may be more than a newsgroup help situation.
Note, if it is not a certified cluster, PSS likely won't touch it.

--
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
CareerBuilder.com

