Groups | Blog | Home
all groups > dotnet performance > january 2006 >

dotnet performance : Performance with reading large numbers of files...



Mark Rochotte
1/22/2006 6:24:19 PM
Hi All.

I have a small test application (a condensation of the issue from a much
larger application) that recurses a directory and adds all the file names
to a string collection. The app then iterates over these names, opens the
file, and reads the entire file (or the first 4Mb if greater than 4Mb), and
closes the file.

The problem is that after 10000 or so files, the application slows down significantly.
I've played with various GC and Thread.Sleep() calls as part of the loop,
but the slowdown still occurs.

There are 29000 files or so, in a couple hundred directories, and all the
files are 100-200k in size.

Can someone explain a way to fix the issue? I'm including the source to
the app and some sample output. This particular code is for C# 1.1, but
the same issue exists for 2.0 RTM.

Thanks,

m

-----------------------------
using System;
using System.Collections.Specialized;
using System.IO;

namespace FileReadTest
{
internal class Class1
{
const string vDir = "<SOME DIR HERE>";

[STAThread]
private static void Main(string[] args)
{
DateTime sStartTime = DateTime.Now;

StringCollection sDiskFiles = new StringCollection();
RecurseDirs(new DirectoryInfo(vDir), sDiskFiles);
byte[] sBuf = new byte[1024*1024*4];

int x = 0;
foreach (string sFileName in sDiskFiles)
{
if (x++%256 == 1)
Console.WriteLine("Processing {0,6:#,##0} of {1,6:#,##0}
Elapsed: {2}", x, sDiskFiles.Count, (DateTime.Now-sStartTime));

using (FileStream sStream = File.OpenRead(sFileName))
{
sStream.Read(sBuf, 0, sBuf.Length);
sStream.Close();
}
}

}

private static void RecurseDirs(DirectoryInfo aDir, StringCollection
aList)
{
DirectoryInfo[] sDirArr = aDir.GetDirectories();
for (int x = 0; x < sDirArr.Length; x++)
{
RecurseDirs(sDirArr[x], aList);
}
FileInfo[] sFileArr = aDir.GetFiles();
for (int x = 0; x < sFileArr.Length; x++)
{
FileInfo sInfo = sFileArr[x];
aList.Add(sInfo.FullName);
}
}

}
}
-----------------------------
Processing 1 of 29,041 Elapsed: 00:00:17.3281250
Processing 257 of 29,041 Elapsed: 00:00:18.5312500 << about 1.2s per
250 files.
Processing 513 of 29,041 Elapsed: 00:00:19.9531250
Processing 769 of 29,041 Elapsed: 00:00:21.1562500
Processing 1,025 of 29,041 Elapsed: 00:00:22.5937500
Processing 1,281 of 29,041 Elapsed: 00:00:23.6718750
Processing 1,537 of 29,041 Elapsed: 00:00:24.8750000
Processing 1,793 of 29,041 Elapsed: 00:00:25.9843750
Processing 2,049 of 29,041 Elapsed: 00:00:27.3906250
Processing 2,305 of 29,041 Elapsed: 00:00:28.6718750
<snip>
Processing 12,545 of 29,041 Elapsed: 00:01:55.1875000
Processing 12,801 of 29,041 Elapsed: 00:01:58.9218750
Processing 13,057 of 29,041 Elapsed: 00:02:02.9687500
Processing 13,313 of 29,041 Elapsed: 00:02:06.7968750
Processing 13,569 of 29,041 Elapsed: 00:02:10.7812500
Processing 13,825 of 29,041 Elapsed: 00:02:14.7343750
Processing 14,081 of 29,041 Elapsed: 00:02:18.6406250
Processing 14,337 of 29,041 Elapsed: 00:02:22.2968750
Processing 14,593 of 29,041 Elapsed: 00:02:25.9687500 << about 3.7s per
250 files
<snip>
Processing 20,225 of 29,041 Elapsed: 00:04:19.5156250
Processing 20,481 of 29,041 Elapsed: 00:04:26.4531250
Processing 20,737 of 29,041 Elapsed: 00:04:33.4375000
Processing 20,993 of 29,041 Elapsed: 00:04:39.3593750
Processing 21,249 of 29,041 Elapsed: 00:04:47.6562500
Processing 21,505 of 29,041 Elapsed: 00:04:56.9843750
Processing 21,761 of 29,041 Elapsed: 00:05:07.4062500
Processing 22,017 of 29,041 Elapsed: 00:05:15.7812500 << about 8.3s per
250 files
<snip>
Processing 27,649 of 29,041 Elapsed: 00:07:20.1250000
Processing 27,905 of 29,041 Elapsed: 00:07:24.6093750
Processing 28,161 of 29,041 Elapsed: 00:07:32.8281250
Processing 28,417 of 29,041 Elapsed: 00:07:41.6406250 << about 8.8s per
250 files.

AMercer
1/23/2006 7:00:02 AM


[quoted text, click to view]



[quoted text, click to view]

By my reckoning, you have about 17s of startup, and I agree with 1.2s for
the first batch. The last batch took 7*60+41-17 sec == 444s which yields
about 3.9s per 250 files, not 8.8s.

Your 29,000 filenames could take several megabytes of storage in sDiskFiles.
You access it sequentially over several minutes. Maybe you are being hit by
paging. One possibility is to read the files as you find them in
RecurseDirs. You will still need the large sBuf, but not the large
sDiskFiles. BTW, if the files are a few hundred k max, why is sBuf a few
megs?

Finally, I think foreach in C# is slow-ish, and an alternative loop
construct may help. I use VB, so this is just a guess.

Jon Skeet [C# MVP]
1/24/2006 12:32:56 AM
[quoted text, click to view]

<snip>

[quoted text, click to view]

In some situations foreach is slower than other loop constructs - but
it'll only when nothing of any significance is going on in the loop.

Don't forget that it's only looping 29,000 times here. Based on a quick
test on my machine, the loop itself may be contributing about a
millisecond to the total time - in other words, about 0.0002% of the
total time taken.

Jon
Jon Skeet [C# MVP]
1/24/2006 12:37:00 AM
[quoted text, click to view]

I very much doubt that garbage collection is relevant here. A couple of
things which might be going on:

1) Depending on how the files were written out, it's possible that the
earlier files are less fragmented than the later files.

2) Depending on what you'd been doing with the system before running
the test, it's possible that the earlier files were cached.

3) Other system activity (virus checking etc) could have had a
significant impact

It's also possible that the files you read later happened to be the
bigger ones.

I suggest you repeat the test:
1) Defragment your disk
2) Write out the files with random content, but all the same size
3) Reboot
4) Temporarily disable any virus checkers (preferrably having
disconnected from any network)
5) Run the test (with nothing else running at the same time)

Jon
AddThis Social Bookmark Button