
dotnet xml : Invalid characters in an XML Doc


Glen
1/26/2005 4:56:56 PM
I'm new to XML, so this is a newbie question. I'm reading in XML docs via a
VB.NET application and extracting node data, and I find that one of my blocks
or lines has a copyright character in it. I'm using the .NET XmlTextReader
class and the reader won't parse this character at all; it just throws an
exception and boom, the application quits. The character always appears in
the same place in the document, so I thought I could detect the node and skip
over it using the Skip method. Nope, it still blows up.

Here's the node. Notice the copyright char shows up as a block. Can anyone
point me in the right direction? TIA...
<body.end>

<tagline>Copyright ? 2005 </tagline> the cp

</body.end>





Bjoern Hoehrmann
1/27/2005 7:28:32 AM
* Glen wrote in microsoft.public.dotnet.xml:

If you do not declare a different encoding, XML processors assume that
XML documents are UTF-8 encoded (or UTF-16, if a byte order mark is
present). If that causes any trouble, the document is most likely
stored in a different encoding. You can solve your problem either by
fixing the document before passing it to the XML processor (by declaring
the right encoding) or by transcoding the document to UTF-8 (see
System.Text.Encoding for routines that could do that). I suspect the
documents are actually in the Windows-1252 encoding, so declaring the
encoding would look like

<?xml version="1.0" encoding="windows-1252"?>
<x>...</x>
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
Glen
1/27/2005 10:19:08 AM
Thanks for the heads-up. Turns out the encoding is actually
ISO 8859-1.
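
For anyone finding this thread later: the transcoding route suggested
above can be sketched roughly as follows. This is a minimal sketch, not
the code from my application. It assumes the file really is ISO 8859-1,
and the file name "story.xml" is made up for illustration. The idea is to
let a StreamReader decode the bytes with the right encoding and hand the
already-decoded characters to the XmlTextReader, so the reader never has
to guess at the encoding itself.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Xml

Module ReadLegacyXml
    Sub Main()
        ' Hypothetical file name; substitute the real document path.
        Dim fileName As String = "story.xml"

        ' Assumption: the file is ISO 8859-1, as it turned out in this thread.
        ' For Windows-1252 use Encoding.GetEncoding(1252) instead (on .NET Core
        ' and later that code page needs the code pages encoding provider).
        Dim enc As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' The StreamReader turns bytes into characters using the chosen encoding.
        Using sr As New StreamReader(fileName, enc)
            ' The XmlTextReader now receives text rather than raw bytes.
            Dim reader As New XmlTextReader(sr)

            While reader.Read()
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "tagline" Then
                    ' ReadString returns the element's text content.
                    Console.WriteLine(reader.ReadString())
                End If
            End While

            reader.Close()
        End Using
    End Sub
End Module
```

The other route, adding an encoding declaration to the top of each
document, works without code changes but means touching every file, so
which one fits better depends on whether you control the documents.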

