sql server clustering : Cluster Failover Failed -- KB help anyone?



dwcscreenwriterextremesupreme NO[at]SPAM gmail.com
4/25/2006 12:36:30 PM
The second node in our active/active SQL cluster (Win2K SP4 / SQL 2000
SP2) failed, apparently because of a bad NIC on the public interface.

The cluster and SQL logs seem to show that the failover succeeded,
except that SQL Server Agent suddenly stopped on the first node.

However, 2 hours later, the first node logged a boatload of errors
indicating that the file shares were down and that the names for the
second node, the first node, and the cluster itself were no longer
registered with NBT (NetBIOS over TCP/IP). At this point, the Cluster
service on the node shut down everything within the cluster, including
SQL.


I have heard that this might be a result of an SQL bug regarding
clustering, but can find no corroborating documentation. Has anyone
ever heard of a problem like this, or can you point me to some KB
articles referring to similar issues? I need a post-mortem reason for
why everything just suddenly died.
dwcscreenwriterextremesupreme NO[at]SPAM gmail.com
4/25/2006 3:13:30 PM
File shares are all on a central SAN, with quorum drive on its own
spindles, sql resources for each instance on their own spindles.

The sequence of events appears to be this:

(from system log @ 22:39)
The browser has forced an election on network
\Device\NetBT_Tcpip_{210B8C37-6D9D-479A-97FA-3F6B28A0D447} because a
master browser was stopped.

(from cluster logs)
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:04.242 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:04.258 [CP] CppResourceNotify for
resource Logship2
0000055c.0000064c::2006/04/22-22:40:04.274 [FM]
FmpHandleResourceTransition: Resource Name =
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 old state=2 new state=4
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate: queuing
update type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
Dispatching seq 29499 type 0 context 8 to node 1
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
completed update seq 29499 type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpPropagateResourceState: resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 failed event.
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpHandleResourceFailure: taking resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 and dependents offline
00000788.00000794::2006/04/22-22:40:10.961 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:10.976 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:10.976 [CP] CppResourceNotify for
resource LogShip
00000788.00000794::2006/04/22-22:40:11.023 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:11.023 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>:
SmbShareDoTerminate: SmbpShareNotifyWorker Terminated... !!!
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>: Error
removing share 'Logship2'. Error 2114.
0000055c.0000064c::2006/04/22-22:40:11.039 [CP] CppResourceNotify for
resource Logship2
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RmTerminateResource:
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 is now offline
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RestartResourceTree,
Restart resource 09ccd7f0-6c52-4756-af88-a3b6e4fb6043
00000788.00000794::2006/04/22-22:40:11.039 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.


A few seconds later, the SQL network name fails its IsAlive/LooksAlive
check:

00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.


As does the cluster name itself:

0000055c.0000079c::2006/04/22-22:40:16.351 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:16.351 [CP] CppResourceNotify for
resource SQL Network Name(2P0SQLV1)
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1 failed IsAlive/LooksAlive check, error 5038.

Again, in SQL everything seemed to be fine. It was Cluster Services
itself that seemed to die.
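
For anyone building a similar post-mortem timeline, here is a minimal Python sketch that rejoins the wrapped cluster log lines quoted above and pulls out the error codes with the time each first appeared. The entry layout is inferred only from these excerpts, the function names are made up for illustration, and the two code descriptions in `ERROR_HINTS` are believed meanings that should be confirmed with `net helpmsg <code>` on the server.

```python
import re
from datetime import datetime

# Entry prefix as seen in the excerpts above (an inferred layout):
# <process id>.<thread id>::<yyyy/mm/dd-hh:mm:ss.mmm> <message>
ENTRY = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) (?P<msg>.*)$",
    re.IGNORECASE,
)

# Believed meanings only; verify each with `net helpmsg <code>`.
ERROR_HINTS = {
    2114: "NERR_ServerNotStarted: the Server service is not started",
    5038: "ERROR_RESOURCE_FAILED: the cluster resource failed",
}


def parse_entries(text):
    """Rejoin wrapped lines; return a list of (timestamp, pid, message)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = ENTRY.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            entries.append([ts, m.group("pid"), m.group("msg")])
        elif entries:
            # No prefix: this is the wrapped tail of the previous entry.
            entries[-1][2] += " " + line
    return [tuple(e) for e in entries]


def error_codes(entries):
    """Map each error code found in the messages to its first timestamp."""
    first = {}
    for ts, _pid, msg in entries:
        for code in re.findall(r"\berror (\d+)", msg, re.IGNORECASE):
            first.setdefault(int(code), ts)
    return first


SAMPLE = """\
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:10.961 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1 failed IsAlive/LooksAlive check, error 5038.
"""

entries = parse_entries(SAMPLE)
for code, ts in sorted(error_codes(entries).items(), key=lambda kv: kv[1]):
    print(ts.time(), code, ERROR_HINTS.get(code, "unknown"))
```

If 2114 really does mean the Server service was not running, it would line up with the browser election in the system log, so the state of that service around 22:39 is worth checking first.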
Geoff N. Hiten
4/25/2006 4:27:11 PM
First, what do the logs on both servers say? You should have application
logs that indicate why the first node failed to carry the complete load.
One of the more common failure causes is not configuring memory to support
both instances on the same node. The instance that failed over may not have
had enough memory to operate.
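
As a rough illustration of that memory check, here is a small Python sketch. The figures are hypothetical, the 1 GB operating system reserve is only an assumption, and `fits_on_one_node` is an illustrative name. The per-instance caps stand for each instance's 'max server memory (MB)' setting as shown by sp_configure.

```python
def fits_on_one_node(node_ram_mb, instance_max_mb, os_reserve_mb=1024):
    """True if every instance's max server memory fits on a single node."""
    return sum(instance_max_mb) + os_reserve_mb <= node_ram_mb


# Hypothetical figures: 8 GB nodes, two instances.
both_at_6gb = fits_on_one_node(8192, [6144, 6144])    # 12288 + 1024 > 8192
both_at_3_5gb = fits_on_one_node(8192, [3584, 3584])  # 7168 + 1024 = 8192
print(both_at_6gb, both_at_3_5gb)
```

If the first case matches your settings, the instance that failed over would be competing for memory the surviving instance has already been allowed to take.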

Also, if you have file shares on clustered resources (or worse, clustered
SQL server instances accessing local file shares) then you have a design
that is incompatible with clustering. Again, the error logs should tell you
what is going on.

--
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP




Geoff N. Hiten
4/25/2006 8:31:11 PM
Turn off NETBIOS on your private NIC.
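
The usual route is the adapter's TCP/IP properties (Advanced, WINS tab, "Disable NetBIOS over TCP/IP"). For scripting it, the same setting is believed to live in the per-interface `NetbiosOptions` registry value, where 2 means disabled. The Python sketch below only builds the `reg add` command text and changes nothing; the helper names are illustrative. It uses the device name from the system log purely to show the GUID extraction, since the log does not say whether that interface is the private NIC, so confirm the GUID before applying anything.

```python
import re

# NetbiosOptions values under the NetBT interface key:
# 0 = use the DHCP-supplied setting, 1 = enabled, 2 = disabled.
NETBIOS_DISABLED = 2

INTERFACES_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Services\NetBT\Parameters\Interfaces"
)


def interface_guid(device_name):
    """Pull the {GUID} out of a NetBT device name such as the one logged."""
    m = re.search(r"Tcpip_(\{[0-9A-Fa-f-]{36}\})", device_name)
    if m is None:
        raise ValueError("no interface GUID in %r" % device_name)
    return m.group(1)


def disable_netbios_command(guid):
    """Return (not run) a reg.exe command that sets NetbiosOptions to 2."""
    key = INTERFACES_KEY + "\\Tcpip_" + guid
    return 'reg add "%s" /v NetbiosOptions /t REG_DWORD /d %d /f' % (
        key,
        NETBIOS_DISABLED,
    )


# Device name copied from the system log entry; which NIC it belongs to
# is NOT known from the log and must be checked first.
logged = r"\Device\NetBT_Tcpip_{210B8C37-6D9D-479A-97FA-3F6B28A0D447}"
guid = interface_guid(logged)
command = disable_netbios_command(guid)
print(command)
```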

--
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP




dwcscreenwriterextremesupreme NO[at]SPAM gmail.com
4/26/2006 11:33:41 AM
Thank you, we'll try that.

What I'm most interested in though, is if you've ever heard of
something like this happening before. This is a server cluster that's
been up and solid for over a year, and I can't just put in my
post-mortem "mystical fairies killed the server."

Does anyone know *why* something like this might have happened, or have
any KB articles referring to similar problems?




