dotnet clr:
Windows Server 2003 Service Pack 1 causes System.Threading.Timer to
stop firing, sometimes immediately and sometimes after a while. Once a
timer dies, it never fires again.
Jamus Sprinson posted this first with a simple repro app at
http://groups-beta.google.com/group/microsoft.public.dotnet.framework.performance/browse_thread/thread/8f4430b6f6efe829/92db52b56f6ebf28
Jamus's repro app causes this problem almost immediately by using a
large number of timers. We encountered this problem in a larger
production app that only had 7 timers. Eventually some timers would
stop firing, never to fire again, while other timers continued to fire.
A variation on Jamus's repro app is to change the Timer period from
Timeout.Infinite to something like 15000 (15 seconds). This allows you
to see that sometimes all timers will fire the first time around, but
on subsequent firings some timers will start dying off. Sometimes all
timers will fire repeatedly for quite a while, and then system load
appears to cause them to drop off. We did some other tests to verify
that this isn't a problem with ThreadPool.QueueUserWorkItem or
ThreadPool.UnsafeQueueUserWorkItem. It happens with both Debug and
Release builds.
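
For anyone who wants to watch timers drop off over time, here is a minimal sketch of the periodic variation described above. It is not Jamus's code. The class name, the timer count of 100, and the two-period check interval are all assumptions made for illustration. Each timer bumps its own counter, and the main thread reports any timer whose counter did not move since the last check.

```csharp
using System;
using System.Threading;

// Hypothetical repro sketch: names and constants are illustrative assumptions.
public class TimerStallWatch
{
    const int TimerCount = 100;   // assumption: "a large number of timers"
    const int PeriodMs = 15000;   // 15 seconds, as suggested in the post

    static int[] fireCounts = new int[TimerCount];
    // Keep references in a static array so the timers are never collected.
    static Timer[] timers = new Timer[TimerCount];

    // A timer is considered stalled if its count did not change between checks.
    public static bool IsStalled(int previousCount, int currentCount)
    {
        return currentCount == previousCount;
    }

    static void OnTick(object state)
    {
        int index = (int)state;
        Interlocked.Increment(ref fireCounts[index]);
    }

    public static void Main()
    {
        // One shared delegate for all timers.
        TimerCallback callback = new TimerCallback(OnTick);
        for (int i = 0; i < TimerCount; i++)
        {
            timers[i] = new Timer(callback, i, PeriodMs, PeriodMs);
        }

        int[] previous = new int[TimerCount];
        while (true)   // runs until the process is stopped (Ctrl+C)
        {
            // Two full periods: every healthy timer fires at least once.
            Thread.Sleep(2 * PeriodMs);
            for (int i = 0; i < TimerCount; i++)
            {
                // CompareExchange with equal values acts as an atomic read.
                int current = Interlocked.CompareExchange(ref fireCounts[i], 0, 0);
                if (IsStalled(previous[i], current))
                {
                    Console.WriteLine("Timer {0} did not fire (total fires: {1})",
                                      i, current);
                }
                previous[i] = current;
            }
        }
    }
}
```

On an unaffected OS this should print nothing. On an affected machine you would expect the list of stalled timers to grow and never shrink.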
This problem only occurs on Windows Server 2003 with Service Pack 1.
We tested several OS and Service Pack variants including the .NET
Framework 1.1 with and without the .NET Framework 1.1 SP1. The culprit
is very clearly Windows Server 2003 SP1. No other OS exhibits this
behavior.
Our partial workaround is to implement our own timers using a dedicated
thread; however, this is insufficient, since we also use classes in the
.NET Framework that use System.Threading.Timer. Classes that use
System.Threading.Timer include:
System.Data.SqlClient.ConnectionPool
System.Data.SqlClient.TdsParser
System.Data.SqlClient.Lifetime.LeaseManager
System.Timers.Timer
System.Web.Caching.CacheExpires
System.Web.Caching.CacheInternal.StartCacheMemoryTimers
System.Web.HttpRuntime
System.Web.RequestQueue
System.Web.RequestTimeoutManager
System.Web.SessionState
System.Web.Util.ResourcePool
As you can see from the list, this has a fairly serious impact on
important pieces of the .NET Framework. We have also reproduced this
problem using the System.Timers.Timer, and I assume you could find ways
of reproducing it with other classes listed above.
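
For reference, the dedicated-thread workaround mentioned above can be sketched roughly as follows. This is a simplified illustration, not our production code. The class name DedicatedThreadTimer and its constructor shape are assumptions chosen to resemble System.Threading.Timer. The callback runs on the timer's own thread rather than on the thread pool, so a slow callback delays the next tick.

```csharp
using System;
using System.Threading;

// Hypothetical workaround sketch: one background thread per timer.
public sealed class DedicatedThreadTimer : IDisposable
{
    private readonly TimerCallback callback;
    private readonly object state;
    private readonly int dueTimeMs;
    private readonly int periodMs;
    private readonly ManualResetEvent stopSignal = new ManualResetEvent(false);
    private readonly Thread worker;

    public DedicatedThreadTimer(TimerCallback callback, object state,
                                int dueTimeMs, int periodMs)
    {
        if (callback == null) throw new ArgumentNullException("callback");
        this.callback = callback;
        this.state = state;
        this.dueTimeMs = dueTimeMs;
        this.periodMs = periodMs;

        worker = new Thread(new ThreadStart(Run));
        worker.IsBackground = true;   // do not keep the process alive
        worker.Start();
    }

    private void Run()
    {
        // WaitOne returns true if Dispose signaled a stop, false on timeout.
        if (stopSignal.WaitOne(dueTimeMs, false)) return;

        while (true)
        {
            try
            {
                callback(state);
            }
            catch (Exception ex)
            {
                // Swallow and log so one bad callback cannot kill the timer.
                Console.Error.WriteLine(ex);
            }

            if (periodMs == Timeout.Infinite) return;   // one-shot timer
            if (stopSignal.WaitOne(periodMs, false)) return;
        }
    }

    public void Dispose()
    {
        stopSignal.Set();
        // Avoid joining ourselves if Dispose is called from the callback.
        if (Thread.CurrentThread != worker) worker.Join();
    }
}
```

Usage mirrors the framework timer: new DedicatedThreadTimer(new TimerCallback(OnTick), null, 15000, 15000). This only helps with timers you create yourself. It does nothing for the framework classes listed above.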
One difference we saw from what Jamus reported is that this problem
reproduced for us quite easily on single-processor Windows Server 2003
machines.
The problem appears to be worse under load. I took a look at the .NET
performance counters while the repro app was running, and the only
difference I noticed between runs that lost timers and runs that didn't
was an extra GC on the successful runs. I can't see how that would be
significant, but the following comment in the Rotor source for
AddTimerCallbackEx in comthreadpool.cpp makes me nervous:
// NOTE: there is a potential race between the time we retrieve the app domain pointer,
// and the time which this thread enters the domain.
//
// To solve the race, we rely on the fact that there is a thread sync (via GC)
// between releasing an app domain's handle, and destroying the app domain. Thus
// it is important that we not go into preemptive gc mode in that window.
Another bit of weirdness we saw while debugging our production app with
7 timers is that for TimerCallback delegates pointing to different
instances of the same exact type of object, the TimerCallback's
_methodPtr field was sometimes the same as the MethodDesc table's Entry
value which points to the beginning of the method's instructions, while
at other times the _methodPtr field pointed to an instruction that does
a jmp to the beginning of the method referenced by the MethodDesc
table. I was able to see this with WinDbg and SOS using !dumpmt -MD
and !u. This seemed pretty weird since I was under the impression that
delegate signatures were "equal" if _target and _methodPtr matched.
Perhaps delegate pointers aren't always being fixed up during a GC?
However, this doesn't match what we see in Jamus's repro app, which
uses a single TimerCallback delegate for all Timers and still only some
die, so this may be yet another issue, or more likely a
misunderstanding on my part.
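
To make the equality expectation concrete, here is a small sketch of how delegate equality behaves at the managed level. The Worker class is made up for illustration. It only shows what Delegate.Equals reports and says nothing about the raw _methodPtr values seen in the debugger.

```csharp
using System;
using System.Threading;

// Hypothetical type standing in for the objects our timers call back into.
public class Worker
{
    public void OnTick(object state) { }
}

public class DelegateEqualityDemo
{
    public static void Main()
    {
        Worker a = new Worker();
        Worker b = new Worker();

        TimerCallback a1 = new TimerCallback(a.OnTick);
        TimerCallback a2 = new TimerCallback(a.OnTick);
        TimerCallback b1 = new TimerCallback(b.OnTick);

        // Same target instance and same method: equal.
        Console.WriteLine(a1.Equals(a2));               // True
        // Same method but a different target instance: not equal.
        Console.WriteLine(a1.Equals(b1));               // False
        // The method itself is the same in both cases.
        Console.WriteLine(a1.Method.Equals(b1.Method)); // True
    }
}
```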
There is also another unresolved report of slightly different
System.Threading.Timer flakiness on Windows Server 2003 here:
http://groups-beta.google.com/group/microsoft.public.dotnet.languages.csharp/browse_thread/thread/ad9c74e361a88a06/b7ff4ba0b5e1e533
Does anyone know of a hotfix or better workaround for this issue?
Oran