How to read sequentially from a random point in a large Xml File. (200 - 2000 MB)


How to read sequentially from a random point in a large Xml File. (200 - 2000 MB) Schwartzberg
4/3/2008 12:59:08 PM
Hello,

I have a huge XML file with a multitude of "LogEntry" nodes / text
lines.
A small sample of the XML content is below.
The file can be anywhere between 200 and 2000 MB.
My question comes in two parts.

(A)
I would like a solution (or ideas for one), in C#, to randomly
access a really huge XML file and to sequentially read only as many
nodes as memory permits, starting from a randomly selected place in
the file. If I try to load the entire file, the application either
throws an out-of-memory error or gets very slow. The user likes to
"scroll" through the file, viewing different parts, as when
scrolling through a huge Word document.

How is this (best) done?

(B)
What is, and/or how would I estimate, the maximum amount of XML or
text from the file that the application can hold in its memory? The
application is both a web application and a Windows standalone.

On a 32-bit machine with 2 GB of RAM, the virtual memory is 2 GB,
which gives an upper bound.
But I have a Java app that already fails with an unhandled heap error
when loading XML from a 200 MB file.

Any ideas, solutions, or links concerning the above (especially (A))?

One avenue is to try to base a sequential reader on a random access
stream.

I tried this idea. I based the XmlTextReader (for sequential reads) on
the FileStream (for random access), but this didn't work. There is
some test code at the bottom of this message that shows some of this.

I used the FileStream for random access via the FileStream.Seek(..)
method.
But the XmlTextReader.Read() didn't start reading from the new
position.

The following:
FileStream.Seek(<Random NewPosition>, SeekOrigin.Begin);
FileStream.Read();
would read from the new position, but it didn't affect the
positioning of XmlTextReader.Read(), even though the XmlTextReader
is based on the same FileStream.

It did, however, cause the last read of the XmlTextReader to report
the XML as invalid (when the XML was actually OK).
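
A likely explanation for this behaviour is read-ahead buffering: the XML reader pulls a block of bytes from the stream into its own buffer and parses from there, so the stream position is not tied to the node being reported. A minimal sketch that makes this visible follows. The class and method names are made up for illustration, and any seekable stream can stand in for the file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ReaderBufferingDemo
{
    // Returns the position of the underlying stream right after the
    // reader has reported its first node.
    public static long PositionAfterFirstNode(Stream xml)
    {
        using (XmlTextReader reader = new XmlTextReader(xml))
        {
            reader.Read();          // reports only the first node ...
            return xml.Position;    // ... but the stream has been read further
        }
    }
}
```

Called on a stream that starts with <Logs> (6 bytes), the returned position should be well past byte 6, because the reader has already buffered what follows. Seeking the stream afterwards does not empty that buffer, which would explain why Read() carries on from the old place.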

An alternative is to base a StreamReader on a FileStream.
The StreamReader.BaseStream is available for random access, and the
StreamReader is there for sequential read.
But I think the same problem arises as when basing the
XmlTextReader on the FileStream.
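
For the StreamReader variant there is a documented way around the stale buffer: StreamReader.DiscardBufferedData() drops whatever the reader has cached, so the next read starts at the current position of BaseStream. A minimal sketch follows, assuming UTF-8 text with one entry per line. The class name SeekableLineReader and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class SeekableLineReader : IDisposable
{
    private readonly Stream _stream;
    private readonly StreamReader _reader;

    public SeekableLineReader(Stream stream)
    {
        _stream = stream;
        _reader = new StreamReader(stream, Encoding.UTF8);
    }

    // Reads up to maxLines complete lines that start after byteOffset.
    public List<string> ReadLinesFrom(long byteOffset, int maxLines)
    {
        _stream.Seek(byteOffset, SeekOrigin.Begin);
        _reader.DiscardBufferedData();   // forget data buffered before the seek

        // An arbitrary offset usually lands in the middle of a line
        // (or even of a character), so the first line is thrown away.
        if (byteOffset > 0)
        {
            _reader.ReadLine();
        }

        var lines = new List<string>();
        string line;
        while (lines.Count < maxLines && (line = _reader.ReadLine()) != null)
        {
            lines.Add(line);
        }
        return lines;
    }

    public void Dispose()
    {
        _reader.Dispose();   // also closes the underlying stream
    }
}
```

Because the first line after a non-zero offset is always skipped, an offset that happens to sit exactly on a line start loses that line. That is harmless for scrolling to an approximate place, but not for exact bookmarks.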

As a side thought, the problem could be solved more easily if
Microsoft offered an indexing mechanism (for application purposes) on
NTFS files, but this isn't the case. It would also be easier if I
could load the huge file into a database table, but the requirement
is to use only XML files (or flat files), so this isn't an option.

This question involves several "technologies", so I am posting it on
several newsgroups.

Here's a sample of the XML:
Each "LogEntry" node is shown as a line of text in a GridView
control.

<Logs AtrA="AllTheLogs">
<Log AtrA="log1" AtrB="Machine nr 1">
<LogEntry AtrA="name1" AtrB="time" AtrC="location" />
<LogEntry AtrA="name2" AtrB="time" AtrC="location" />
<LogEntry AtrA="name3" AtrB="time" AtrC="location" />
<LogEntry AtrA="name4" AtrB="time" AtrC="location" />
</Log>
<Log AtrA="log2" AtrB="Machine nr 1">
<LogEntry AtrA="name5" AtrB="time" AtrC="location" />
<LogEntry AtrA="name6" AtrB="time" AtrC="location" />
</Log>
</Logs>



Some test code using XmlTextReader(FileStream) based on a file with
the above xml.
I used the VS debugger to look into the variables.

// Requires: using System; using System.IO; using System.Xml;
FileStream fs = null;
int i = 0;
long[] bookMarks = new long[4000];
string[] linesOfText = new string[4000];
byte[] aBuffer = new byte[1000];
char[] charBuffer = new char[1000];
try
{
    // Open read-only; OpenOrCreate would create an empty file if it is missing.
    fs = new FileStream("c:\\aXMLfile.xml", FileMode.Open, FileAccess.Read);
    XmlTextReader reader = new XmlTextReader(fs);

    long lngthOfFS = fs.Length;

    bool a = false;
    // Stop before the fixed-size arrays overflow.
    while (i < bookMarks.Length && reader.Read())
    {
        // Stream position after this Read(), inspected in the debugger.
        bookMarks[i] = fs.Position;

        if (i == 2)
        {
            // Experiment: read from and reposition the stream under the reader.
            fs.Read(aBuffer, 0, aBuffer.Length);
            fs.Position = 0;
            int count = fs.Read(aBuffer, 0, aBuffer.Length);
            for (int g = 0; g < count; g++)
            {
                charBuffer[g] = (char)aBuffer[g];
            }
        }

        linesOfText[i] = "Attribute count: "
            + reader.AttributeCount
            + ", NodeType: "
            + reader.NodeType
            + ", Name: "
            + reader.Name
            + ", value: "
            + reader.Value;
        a = reader.HasAttributes;

        if (reader.HasAttributes)
        {
            for (int ii = 0; ii < reader.AttributeCount; ii++)
            {
                reader.MoveToAttribute(ii);
                linesOfText[i] = linesOfText[i]
                    + "Attribute " + ii.ToString() + ":"
                    + ", Name: "
                    + reader.Name
                    + ", value: "
                    + reader.Value;
            }
            // Return from the last attribute to its owning element.
            reader.MoveToElement();
        }

        i++;
    }
}
catch (Exception e)
{
    String message = e.ToString();
}
finally
{
    // The stream was never locked, so Unlock() is wrong here; just close it.
    if (fs != null)
    {
        fs.Close();
    }
}

Other references:
Efficient Techniques for Modifying Large XML Files
http://msdn2.microsoft.com/en-us/library/aa302289.aspx
XML Reader with Bookmarks
http://msdn2.microsoft.com/en-us/library/aa302292.aspx
The Best of Both Worlds: Combining XPath with the XmlReader
http://msdn2.microsoft.com/en-us/library/ms950778.aspx

Comments to references:

Helena Kupkova developed an XmlBookmarkReader class (based on
XmlReader). But when XmlBookmarkReader sets a bookmark on a read
node, it caches that node and the nodes that follow, to be able to
"replay" the bookmark when it is needed. On huge files, an early
bookmark will cache the XML content of the file until the application
runs out of memory.

Dare Obasanjo's XPathReader doesn't avoid a sequential read of the
file; it tests each node read for a match against one or more XPaths.
For a new XPath, the code would have to read sequentially again from
the start of the file.




--
Regards,
Paul
Re: How to read sequentially from a random point in a large Xml File. (200 - 2000 MB) W. Jordan
4/16/2008 9:28:19 AM
Hey there,

I did not test this problem, but I know what you are talking about.
First, in order to browse and seek through the document, you should
use XmlReader.Read () and Stream.Position to go through the document
and record the start-tag offset of every critical element, then use
those offsets to Seek (). It seems that you have implemented this.

Well, if the XmlReader-on-FileStream approach does not work, you can
use the FileStream to read a chunk of text into a String (call this
a String Fragment), and then use a StringReader to read that
string.

You will probably ask: while reading, how far into the "String
Fragment" has the XmlReader gone? Yes, that's hard to tell, so we
need a trick to track the position.

I assume that the String Fragment looks like this:

FileStream positioned here
-><LogEntry AtrA="name2" AtrB="time" AtrC="location" />
<LogEntry AtrA="name3" AtrB="time" AtrC="location" />
<LogEntry AtrA="name4" AtrB="time" AtrC="location" />
</Log>
<Log AtrA="log2" AtrB="Machine nr 1">
<LogEntry AtrA="name5" AtrB="time" Atr

So you have actually got three "LogEntry" elements and some
"junk" in this string fragment (an EndElement of "Log", a
start tag of "Log", and one more incomplete "LogEntry" which does
not belong to the current "Log").
XmlReader will normally read the three "LogEntry" elements and throw
an exception when it reaches the EndElement of "Log".

You should do something to prevent this exception: edit the fragment
to add an "id" attribute to the LogEntry elements and mark the end of
the log, so that the fragment looks like this:

<LogEntry id="1" AtrA="name2" AtrB="time" AtrC="location" />
<LogEntry id="2" AtrA="name3" AtrB="time" AtrC="location" />
<LogEntry id="3" AtrA="name4" AtrB="time" AtrC="location" />
<EndOfLog/>

How do we do this to the fragment?
First we read the fragment line by line (I assume that you place
every "LogEntry" on a separate line), by using StreamReader.ReadLine,
....

line = streamReader.ReadLine ();

We use a List<String> to buffer the entries.
Before adding the line to the buffer, we need to detect whether the
line is the end of a log.

if (line.Trim () == "</Log>") {
    lines.Add ("<EndOfLog/>"); // lines is List<String>
    streamPosition = streamReader.BaseStream.Position;
    break;
}
lines.Add (line);

And we join the list into a new string:
stringFragment = String.Concat (lines.ToArray());

Then we use a StringReader to read that stringFragment, and base the
XmlReader on that StringReader. When the XmlReader meets <EndOfLog/>,
you should stop it, of course. I think you can handle the rest.
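
The steps above can be put together roughly as follows. This is only a sketch under some assumptions that are not in the thread: the byte offset is already known and points at the start of a "LogEntry" line, the file is UTF-8, and the reader is created with ConformanceLevel.Fragment so that several top-level elements are accepted. The class and method names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class LogFragmentReader
{
    // Reads LogEntry lines from byteOffset until </Log> or maxEntries
    // lines, and returns the AtrA value of each entry found.
    public static List<string> ReadEntries(Stream stream, long byteOffset, int maxEntries)
    {
        stream.Seek(byteOffset, SeekOrigin.Begin);
        // A fresh StreamReader per call, so no stale buffer is involved.
        var streamReader = new StreamReader(stream, Encoding.UTF8);

        var lines = new List<string>();
        string line;
        while (lines.Count < maxEntries && (line = streamReader.ReadLine()) != null)
        {
            if (line.Trim() == "</Log>")
            {
                lines.Add("<EndOfLog/>");   // sentinel marking the end of this log
                break;
            }
            lines.Add(line);
        }
        string stringFragment = string.Concat(lines.ToArray());

        // Fragment mode: the string has no single root element.
        var settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Fragment;

        var names = new List<string>();
        using (XmlReader xml = XmlReader.Create(new StringReader(stringFragment), settings))
        {
            while (xml.Read())
            {
                if (xml.NodeType != XmlNodeType.Element) continue;
                if (xml.Name == "EndOfLog") break;          // stop at the sentinel
                if (xml.Name == "LogEntry") names.Add(xml.GetAttribute("AtrA"));
            }
        }
        return names;
    }
}
```

Capping the number of lines keeps the fragment small regardless of how long the enclosing "Log" is, which is the point of the exercise on a 2000 MB file.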

Please let me know whether this helps.
My email wmjordanX-XSPAMX-X@163.com (remove uppercase characters
and '-').


--


Best Regards,
W. Jordan


Re: How to read sequentially from a random point in a large Xml File. (200 - 2000 MB) W. Jordan
4/16/2008 9:31:40 AM
Sorry about the previous post.
No "id" attribute is needed actually...


Re: How to read sequentially from a random point in a large Xml File. (200 - 2000 MB) W. Jordan
4/19/2008 11:00:51 AM
Today I tested with the TextReader and XmlReader, and I found that
Stream.Position cannot be used to tell an XML node's position between
Read calls, since the readers read ahead some number of chars
and cache them. So we have to write our own XML parsing
and indexing code, which is a bit difficult to implement.
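
One possible shape for such indexing code, offered only as a sketch: scan the raw bytes once, without any reader in between, and remember the byte offset of each line that starts with "<LogEntry". It assumes one entry per line, "\n" line ends, and an ASCII-compatible encoding such as UTF-8. The names LogEntryIndex, Build and stride are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogEntryIndex
{
    // One sequential pass over the bytes. Records the offset of every
    // stride-th line whose first non-blank bytes are "<LogEntry".
    public static List<long> Build(Stream stream, int stride)
    {
        byte[] marker = Encoding.ASCII.GetBytes("<LogEntry");
        var offsets = new List<long>();
        var buffer = new byte[64 * 1024];

        int state = 0;        // 0 = leading blanks, 1 = matching marker, 2 = rest of line
        int matched = 0;      // marker bytes matched so far
        long candidate = 0;   // offset where the current line's text starts
        long entryCount = 0;
        long pos = 0;         // absolute offset of the byte being examined

        stream.Seek(0, SeekOrigin.Begin);
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int k = 0; k < read; k++, pos++)
            {
                byte b = buffer[k];
                if (b == '\n') { state = 0; matched = 0; continue; }
                if (state == 2) continue;
                if (state == 0)
                {
                    if (b == ' ' || b == '\t' || b == '\r') continue;
                    state = 1;
                    candidate = pos;
                }
                if (b != marker[matched]) { state = 2; continue; }
                matched++;
                if (matched == marker.Length)
                {
                    if (entryCount % stride == 0) offsets.Add(candidate);
                    entryCount++;
                    state = 2;
                }
            }
        }
        return offsets;
    }
}
```

Each recorded offset is a true byte position, so it can be handed to FileStream.Seek before a fresh reader is created at that point. A stride greater than 1 keeps the index small: with a stride of 1000, for example, the index holds one offset per thousand entries, and the entries in between are reached by reading forward from the nearest one.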
