First, what do the logs on both servers say? You should have application
logs that indicate why the first node failed to carry the complete load.
One of the more common failure causes is not configuring memory to support
both instances on the same node. The instance that failed over may not have
had enough memory to operate.
Also, if you have file shares on clustered resources (or worse, clustered
SQL Server instances accessing local file shares), then you have a design
that is incompatible with clustering. Again, the error logs should tell you
what is going on.
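
To make the memory point concrete, here is a rough sketch of the arithmetic in Python. It only checks that the per-instance memory caps still fit when both instances land on one node. The 4 GB node, the 1 GB left for the OS, the instance names, and the function name are all made-up assumptions for illustration, not values from your cluster.

```python
def max_memory_caps(node_ram_mb, os_reserve_mb, weights):
    """Split the RAM left after an OS reserve among instances by weight.

    All inputs are hypothetical planning figures, not measured values.
    Returns a dict of instance name -> memory cap in MB.
    """
    usable = node_ram_mb - os_reserve_mb
    if usable <= 0:
        raise ValueError("OS reserve leaves no memory for SQL Server")
    total_weight = sum(weights.values())
    # Integer division rounds down, so the caps never add up to more
    # than the usable memory on the surviving node.
    return {name: usable * w // total_weight for name, w in weights.items()}


# Assumed example: 4 GB node, 1 GB kept for the OS, two equal instances.
caps = max_memory_caps(4096, 1024, {"INST1": 1, "INST2": 1})
for name, cap in caps.items():
    print(f"{name}: cap at {cap} MB")
```

Each result is the kind of number you would set as that instance's 'max server memory (MB)' option through sp_configure, so that a failover does not leave the second instance starved.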
--
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
<dwcscreenwriterextremesupreme@gmail.com> wrote in message
news:1145993790.471775.3320@y43g2000cwc.googlegroups.com...
> Our second node in an active/active SQL cluster (Win 2K SP4/ SQL2K SP2)
> failed as a result (it seems) of a bad NIC at the public interface.
>
> Cluster and SQL logs seem to support that the failover went
> successfully, except for SQL job agent suddenly stopping on the first
> node.
>
> However, 2 hours later, the first node received a boatload of errors
> indicating that the file shares were down, the second node was no
> longer registered in the NBT, the first node was no longer in the NBT
> and the cluster itself was no longer in the NBT. At this point, cluster
> services on the node shut down everything within the cluster --
> including SQL.
>
>
> I have heard that this might be a result of an SQL bug regarding
> clustering, but can find no corroborating documentation. Has anyone
> ever heard of a problem like this, or can you point me to some KB
> articles referring to similar issues? I need a post-mortem reason for
> why everything just suddenly died.
>