Groups | Blog | Home
all groups > dotnet performance > august 2005 >

dotnet performance : StreamReader->regex->Win32 api marshalling


Tim
8/31/2005 7:35:01 AM
I've written a C# .Net console app that reads massive (upto 2Gb) text files
of csv data using a StreamReader with UTF7 in configurable-size char[]
blocks, parses the read data using the .Net regex facilities and re-formats
it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp
api and so squirts the data into a SQLServer db. It all works fine, but the
performance is just not good enough - I need to at least halve the time its
taking to process the data. I've profiled the code and it looks like the
overhead is more or less evenly split between the reading/parsing on the one
hand and the data loading/bcp/Win32 api marshalling on the other hand - so
I'm looking for improvements across the whole functionality. The problem is
that I need the Regex sophistication to parse the data because its very rough
- so I need to turn the text read from the files into .Net Strings so I can
Regex them. But then I need to marshall the output Strings from the Regex
down to the BCP api because bcp is the only SQLServer i/f that provides
anything like the performance we need. However, I gather that two big .Net
performance overheads are String splitting/manipulation and .Net to Win32
ansi string marshalling :-(. Any suggestions for optimisation tweaks for
these parts of the FCL or ways to circumvent the problem areas much
Scott
8/31/2005 8:37:34 AM
[quoted text, click to view]

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

shriop
8/31/2005 8:51:10 AM
Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

[quoted text, click to view]
Tim
8/31/2005 9:56:07 AM
I did initially try writing an intermediate re-formatted text file and then
processing that using the bcp utility as a separate process (great minds
....), but the problem with that is the size of the files - they are huge and
we can't afford to replicate them as intermediate files. The performance
wasn't any better either - the overhead of writing to the intermediate text
file was just as great.
Regarding buffering - the profiling seemed to indicate that there wasn't
much slack in the processor, so I don't think there's a lot to be made up
there (its a local server so there's no network overhead).
I didn't want to complicate things by mentioning this, but a further
complication is that there is a separate thread to the file parsing component
that uses a .Net FileSystemWatcher object to look for the csv files arriving
and fires off the reading/parsing/bcping process as separate threads - so in
practice there will be many of these file processing threads running in
parallel, so I'd expect any slack in the processor to be taken up by that
anyway. So for the minute I'm just trying to optimise the file processing
code itself.

Tim

[quoted text, click to view]
David Browne
8/31/2005 10:02:37 AM

[quoted text, click to view]

Two ideas. First is to use buffering so your parsing code doesn't wait on
your BCP loads, and vice versa. Introducing a buffer between the two can
help keep the processor busy while the BCP component is waiting on network
and SQL Server work.

An excelent candidate for buffering is a file. Just run you .NET code and
output a clean file. Then take that file and BCP it into SQL Server. You
really aren't gaining anything by pipelining this stuff into SQL.

David

David Browne
8/31/2005 10:44:38 AM

[quoted text, click to view]
BCP is significantly faster than using "batched" inserts.

David

Tim
9/1/2005 2:21:03 AM
When I said the csv data was rough, I meant it! As an example: it uses a
specific configurable field delimiter for all the fields in a record except
the last field in the record which uses a \r\n. BUT ... the last field (and
in fact any other field) may also contain \r\n in its data! My solution to
this was to use the regex definition of the format of the first field in each
record to find the end of the last field. Would I be able to handle this with
your csv parser?
Thanks
Tim

[quoted text, click to view]
shriop
9/1/2005 8:15:44 AM
I'm pretty sure this is right in line with what my parser is made to
handle. If you find it can't handle it, then I'd like to see an example
of the data so I can add the functionality to handle it. You can set
the delimiter to anything you want. And end of row is commonly ended
with a /r/n, so it handles that. the /r/n in the data is no problem as
long as there are double quotes around the field itself, so the parser
has something to base the logic on and know that this is data, and not
the end of a row. I've got a free demo version that you can try it out
with. You can also feel free to email me directly with any questions.
This kind of data is why I generally recommend people not try to use
regular expressions to parse the data, the expression just gets too
hairy, leading to poor performance.

Here's what I'm picturing you're data looking like, and what it can
handle, although I don't know what specific delimiter you're using.

1,Bruce,Dunwiddie,"in here, I can have commas and
\r\n"\r\n2,Bob,Jones,some other quote here


[quoted text, click to view]
Tim
9/1/2005 9:39:12 AM
Thats the problem - the text isn't quoted. So I'm using meta-data definitions
of the SQLServer table columns that the csv data is destined for to decide
what sort of char patterns to expect in each field. But where the field is
text all bets are off - you just keep going 'til you reach whatever's defined
as the delimiter char; which is fine unless the last field in the row is text
and potentially includes the row delimiter (\r\n) in its data. As far as I
can see the only way to semi-reliably determine the end of the row,
therefore, is to find the start of the next row. Originally I was using the
entire regex definition of a record to find the start of the next row, but
I've optimised it to just look for the first column of the next row to save
time. I can't think of anything else to do - apart from drowning the person
responsible for the data format. Won't solve the problem but I'll feel a lot
better.

[quoted text, click to view]
shriop
9/1/2005 10:34:53 AM
Can I see an example of one of these problematic rows? You're pretty
much out of luck from what you say. I saw a post a while back about
exporting events from the event viewer when the description contains
newline characters and the description is not quoted. I tried to find
something, anything, to base the logic off of. But if you can't be
certain of anything to deal with this file, then it's awful hard to
parse. This is one of the reasons why I try to push using my CsvWriter
class to create csv files, so people who don't know any better don't
try to come up with what they think a csv format is and end up leaving
out handling of newlines for example.
Tim
9/2/2005 1:54:05 AM
Well as I say the regex solution gives you some hope of being able to parse
the data, but I can't see any other way to do it. The other good thing about
using regex in messy cases like this is that any bits of text that don't get
recognised for whatever reason just don't get extracted - they don't
interfere with the parsing of any subsequent 'valid' rows of data.
The data would look something like ...

1|1/9/2005|Bruce|Dunwiddie|in here, I can have commas and
\r\n anything except the bar char thats the designated field delimiter for
this particular table "\r\n
2|28/7/2005|Bob|Jones|some other text

.... so you would look for the pattern of newline followed by an int follwed
by the field delimiter as the row terminator (not forgetting the end of the
file ofcourse!).
I think the only solution is to turbocharge the regex class or give it some
steroids or something. Or maybe put some limited optional regex functionality
into your csv parser? If there was some way to submit a regex expression as
the row delimiter in your csv parser that might be a good half-way solution?

[quoted text, click to view]
shriop
9/2/2005 7:44:12 AM
Because of the way the parser is built for performance, there's no way
to include regular expressions into the parsing. It'd have to be a
totally different version of the parser.

What you're describing possibly working with regular expressions has
nothing to do with regular expressions. You can, and I think should, do
the exact same thing just doing normal parsing. I'm fine with what
you're seeing as a pattern for a row delimiter. If you stayed with
using a normal csv parser, you could read in a row, and save the
results off somewhere. Then, read in the next row. If the first
column's value is not an integer, then you know that the first column's
value should be appended to the last column's value from the last row
and that you're still in the middle of a row. Rinse, and repeat.
AddThis Social Bookmark Button