
dotnet xml : Unexpected Token error while using XmlParserContext



Brian Cobb
10/28/2004 12:59:10 PM
Greetings;

I have an application where I am receiving HTML fragments containing
snippets (sub-fragments?) of XML. I wish to extract the XML bits for further
processing. In the process of playing around with various ways to accomplish
this I came up with the following code:

string xmlFrag =
@"
<P>&nbsp;</P>
<P>This is a test <xml><field name=""instructor-name"">Brian
Cobb</field></xml></P>
<P>&nbsp;</P>
<P>Another test</P><xml><field name=""picture"">brianc.jpg</field></xml>";


try
{
    string subset = "<!ENTITY nbsp ' '>";
    XmlParserContext context = new XmlParserContext(
        null,                       // XmlNameTable
        null,                       // XmlNamespaceManager
        "html",                     // DOCTYPE name
#if USING_DOCTYPE_IDS
        this.publicID,              // public identifier
        this.systemID,              // system identifier
#else
        null,
        null,
#endif
#if USE_SUBSET
        subset,                     // internal DTD subset
#else
        null,
#endif
        "",                         // base URI
        "en-us",                    // xml:lang
        XmlSpace.None,
        System.Text.Encoding.UTF8
    );

    XmlValidatingReader reader =
        new XmlValidatingReader(xmlFrag, XmlNodeType.Element, context);
    reader.ValidationType = ValidationType.None;
    while(reader.Read()) { ...

Note that this.systemID = @"http://www.w3.org/TR/html4/loose.dtd"
and this.publicID = @"-//W3C//DTD HTML 4.01 Transitional//EN"
in my most recent test and that USING_DOCTYPE_IDS is defined.

The problem I am having is that on the first call to reader.Read() I get an
XmlException with the message "This is an unexpected token. The expected
token is 'TAGEND'" at Line 31, Column 3. Since there aren't 31 lines in
xmlFrag, I would surmise that the problem lies with the external DTD; however,
selecting different values for systemID and publicID produces similar
results. If I #undef USING_DOCTYPE_IDS, the code works as expected.

What I am trying to avoid is having to define my own general entities as I
have done with nbsp in my example. Fiddling with systemID and publicID is
my initial attempt to use standard DTDs to get around this (think of small
child whining "But Mommy, I don't want to define my own DTD"). But, if anyone
has a better idea how to accomplish this, I'm all ears.

thanks.

PS: if USE_SUBSET is #undef'ed I get XmlException "undefined entity nbsp".
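
A note on the error and one possible workaround. The HTML 4.01 DTDs are
written in SGML DTD syntax rather than XML DTD syntax, which would be
consistent with the parser failing at a line inside the external DTD instead
of inside xmlFrag. If the <xml> islands are themselves well-formed and never
nested, one way to sidestep the HTML (and its entities) entirely is to cut the
islands out with a regular expression and parse each one on its own. This is
only a minimal sketch under those assumptions, not the approach from the post:
XmlIslandExtractor and Extract are hypothetical names, and the code assumes a
current .NET runtime with generics.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: pulls <xml>...</xml> islands out of an HTML fragment.
static class XmlIslandExtractor
{
    // Lazy match from an opening <xml ...> tag to the nearest </xml>.
    // Singleline lets '.' match line breaks inside an island.
    // Assumption: islands are not nested inside one another.
    static readonly Regex Island =
        new Regex(@"<xml\b[^>]*>.*?</xml>", RegexOptions.Singleline);

    public static List<XmlDocument> Extract(string html)
    {
        List<XmlDocument> docs = new List<XmlDocument>();
        foreach (Match m in Island.Matches(html))
        {
            XmlDocument doc = new XmlDocument();
            // Throws XmlException if the island is not well-formed XML,
            // for example if it contains an undeclared entity such as &nbsp;.
            doc.LoadXml(m.Value);
            docs.Add(doc);
        }
        return docs;
    }
}
```

Each returned document can then be queried as usual, for example with
doc.SelectNodes("/xml/field"). Text outside the islands, including &nbsp;, is
never handed to the XML parser, so no entity declarations are needed for it.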


Stuart Celarier
11/1/2004 12:35:06 AM
Brian, even though you have XML fragments embedded in HTML, I don't see much
hope for trying to read the XML out of the HTML using an XML reader because
the HTML is not XML. It doesn't matter what DTD you are using for the HTML;
until you get to XHTML, you can't use XML technologies on HTML.

But what you can do is get an XmlReader that reads HTML. Since HTML is SGML,
you could use the SgmlReader on GotDotNet [1]. Knock yourself out. Go wild.

Cheers,
Stuart Celarier, Fern Creek

[1]
http://www.gotdotnet.com/community/usersamples/Default.aspx?query=sgmlreader
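
For readers who want to try the SgmlReader suggestion, the following is a
rough sketch of how that library is commonly wired up. It assumes the
SgmlReader assembly (namespace Sgml) is referenced; it is not part of the .NET
base class library. HtmlIslandReader and LoadHtml are hypothetical names, and
how the reader treats the non-HTML <xml> element or a fragment with several
top-level elements has not been verified here.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // external SgmlReader library (assumed to be referenced)

// Hypothetical helper: loads an HTML fragment into an XmlDocument.
static class HtmlIslandReader
{
    public static XmlDocument LoadHtml(string html)
    {
        SgmlReader reader = new SgmlReader();
        reader.DocType = "HTML";                  // use the built-in HTML DTD
        reader.CaseFolding = CaseFolding.ToLower; // <P> is reported as <p>
        reader.InputStream = new StringReader(html);

        XmlDocument doc = new XmlDocument();
        doc.Load(reader); // SgmlReader derives from XmlReader
        return doc;
    }
}
```

If the load succeeds, the embedded fields should be reachable with an XPath
query such as doc.SelectNodes("//xml/field"), with HTML entities like &nbsp;
resolved by the reader rather than by a hand-written internal subset.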
