XmlSerializer XmlTextWriter UTF8 and Bad Characters


XmlSerializer XmlTextWriter UTF8 and Bad Characters rmgalante NO[at]SPAM yahoo.com
6/30/2005 2:18:40 PM
dotnet xml:
I have a Windows Service that reads and writes an XML file to disk
periodically. I use the XmlSerializer to serialize and deserialize the
XML file on disk. I am writing the file using an XmlTextWriter and UTF8
encoding.

I notice after long periods of time that my file has bad,
non-displayable characters. I try to load the XML file with IE, and it
tells me that the XML is bad for one of many different reasons,
depending on where the bad characters are located.

I load the XML file with WordPad and try to fix the XML file. Sometimes
it works. Most times it doesn't because there are just too many bad
characters. They always render themselves as that square character,
which is non-printable.

I've noticed the same problem with the WSDL proxy code generator. I
execute the WSDL utility to build a proxy for a web service. And when I
view the code, it's full of these non-printable characters. I have to
go through the code and fix all the bad characters.

Has anyone had similar problems with the XmlSerializer or the
XmlTextWriter?

Rob
Re: XmlSerializer XmlTextWriter UTF8 and Bad Characters Derek Harmon
6/30/2005 8:41:09 PM

What you really need is to examine the file in a hex editor, to get past
thinking about those characters as "bad," and "unprintable," and "that
square character." ;-)

If you don't have a hex editor, here is a short little command-line util
that can help get you started,

- - - hexDump.cs
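using System;
using System.IO;

public class HexDumpUtil
{
    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("hexdump <filename.xml>");
            return;
        }

        const int BlockSize = 512; // bytes per block
        const int RowSize = 32;    // bytes per output row (two groups of 16)

        using (FileStream fstream = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            byte[] buf = new byte[BlockSize];
            long offset = 0;       // file offset of the current block
            int blockNum = 0;
            int bytesRead;
            while ((bytesRead = fstream.Read(buf, 0, BlockSize)) > 0)
            {
                // Zero the unread tail so a short final block does not show
                // stale bytes left over from the previous block.
                Array.Clear(buf, bytesRead, BlockSize - bytesRead);

                Console.WriteLine("---");
                Console.WriteLine("Block #{0}", blockNum);
                Console.WriteLine();
                ++blockNum;

                // Print only the rows that contain data; bytes past the end
                // of the file in the last row appear as 00.
                for (int row = 0; row * RowSize < bytesRead; ++row)
                {
                    Console.Write("{0:X4} : ", offset + row * RowSize);
                    for (int half = 0; half < 2; ++half)
                    {
                        for (int k = 0; k < 16; ++k)
                        {
                            Console.Write("{0:X2} ", buf[row * RowSize + half * 16 + k]);
                        }
                        if (half == 0)
                        {
                            Console.Write(" | ");
                        }
                    }
                    Console.WriteLine();
                }
                offset += bytesRead;
            }
        }
    }
}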
- - -

Now let's walk through solving this. Let's pretend that I have an XML file that
looks like this in UTF-8 encoding,

- - - bad.xml (UTF-8 encoded)
<?xml version="1.0" ?>
<doc>Hello World@</doc>
- - -

This document is well formed, but for the sake of argument, let's assume the
character '@' is actually "that square character." How do you find out what
character it really is? And can it be represented in UTF-8? It might be a
typographic (curly) quote in place of a straight single or double quote, etc.

Start by running it through hexDump and redirecting the output to a file,

C:\> hexdump bad.xml > bad.lst

Open the .lst file in Notepad and you may see something like this,

- - - bad.lst
Block #0

0000 : EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E | 3D 22 31 2E 30 22 20 3F 3E 0D 0A 3C 64 6F 63 3E
0020 : 48 65 6C 6C 6F 20 57 6F 72 6C 64 E2 98 BA 3C 2F | 64 6F 63 3E 00 00 00 00 00 00 00 00 00 00 00 00
: :
- - -

Let's pick this apart. The first 3 bytes (EF BB BF) are the standard UTF-8
byte order mark; you can ignore this salutation. 3C is the '<' leading the
XMLDecl; in fact, the byte sequence 3C 3F 78 6D 6C corresponds
to "<?xml".
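
If you'd rather not eyeball the first three bytes every time, a check along
these lines can do it for you. This is only a sketch: the class name BomCheck
and the method HasUtf8Bom are made up for illustration, and it assumes the
file is small enough to read into memory in one go.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of hexDump.cs): reports whether a file
// starts with the UTF-8 byte order mark EF BB BF.
public static class BomCheck
{
    // Returns true when the buffer begins with EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("bomcheck <filename.xml>");
            return;
        }
        byte[] data = File.ReadAllBytes(args[0]);
        Console.WriteLine(HasUtf8Bom(data) ? "UTF-8 BOM present" : "No UTF-8 BOM");
    }
}
```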

The trouble happens just before the closing </doc> tag, so let's skip
ahead looking for "</", that is, the byte sequence 3C 2F. Before
3C 2F there is the sequence 57 6F 72 6C 64 E2 98 BA. This
corresponds to the text, "World@". OK, we know 57 6F 72
6C 64 is "World," so what does that leave us with? That square
character, that's right!
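
In a bigger file, hunting for the odd bytes by eye gets old fast. Everything
in plain ASCII is below 80 hex, so one way to jump straight to the suspects
is to list every run of bytes at or above 80. The following is a rough
sketch of that idea; NonAsciiScan and FindRuns are names I made up, and it
reads the whole file into memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: lists each run of consecutive non-ASCII bytes
// (values >= 0x80) together with the offset where the run starts.
public static class NonAsciiScan
{
    // Returns one line per run, formatted like "002B : E2 98 BA".
    // Adjacent multi-byte characters are reported as a single run.
    public static List<string> FindRuns(byte[] data)
    {
        List<string> runs = new List<string>();
        int i = 0;
        while (i < data.Length)
        {
            if (data[i] < 0x80)
            {
                ++i;
                continue;
            }
            int start = i;
            List<string> hex = new List<string>();
            while (i < data.Length && data[i] >= 0x80)
            {
                hex.Add(data[i].ToString("X2"));
                ++i;
            }
            runs.Add(start.ToString("X4") + " : " + string.Join(" ", hex.ToArray()));
        }
        return runs;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("nonascii <filename.xml>");
            return;
        }
        foreach (string line in FindRuns(File.ReadAllBytes(args[0])))
        {
            Console.WriteLine(line);
        }
    }
}
```

For the bytes shown in bad.lst, this should report the byte order mark at
0000 and the E2 98 BA sequence at 002B.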

What [single] character has UTF-8 encoding E2 98 BA? Well, no
point in scouring encyclopedia catalogs of the encoding standards,
the following short snippet of .NET decoding magic tells us that the
UTF-8 byte sequence, E2 98 BA, is ...

using System;
using System.IO;
using System.Text;
// . . .
// Decode the three suspect bytes as UTF-8 and print the resulting character.
byte[] buf = new byte[] { 0xE2, 0x98, 0xBA };
StreamReader read = new StreamReader( new MemoryStream( buf), Encoding.UTF8);
Console.WriteLine( read.ReadToEnd( ) );

... the happy-face character, U+263A (provided your console can display the
character; if it can't, then it shows 'that square character'). When the
console doesn't work, calling up Start | Run... | charmap.exe and searching
through the "Arial Unicode MS" font will usually do the trick.

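When the console can't draw the character, printing its code point number
works regardless of fonts, and that number is what you look up in charmap.
Here is a small sketch of that; CodePointDump and Describe are illustrative
names of my own, not anything from the framework.

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes UTF-8 bytes and names each character by
// its Unicode code point instead of trying to display it.
public static class CodePointDump
{
    // Returns the decoded code points as "U+XXXX" values separated by spaces.
    public static string Describe(byte[] utf8Bytes)
    {
        string text = Encoding.UTF8.GetString(utf8Bytes);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.Length; ++i)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                ++i; // the low surrogate was consumed with the high one
            }
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+" + codePoint.ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main(string[] args)
    {
        // The three bytes found in front of </doc> in the dump.
        Console.WriteLine(Describe(new byte[] { 0xE2, 0x98, 0xBA })); // prints U+263A
    }
}
```
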
The technique to take away from this is that even though those characters
may appear square, you will have much greater control over
pinpointing and solving the problem if you know what byte sequence is
bad. Then you can ask questions like, "why is it bad?" and "how did it
get here?" and that will put you further along on the road to success.
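
If the bytes turn out not to be valid UTF-8 at all, the framework can point
at them for you. A UTF8Encoding constructed with throwOnInvalidBytes set to
true raises a DecoderFallbackException instead of quietly substituting a
replacement character. The sketch below leans on that; Utf8Validator and
Check are names I invented, and it assumes the whole file fits in memory.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: decodes a buffer with a strict UTF-8 decoder and
// describes the first problem found, if any.
public static class Utf8Validator
{
    // Returns null when the bytes decode cleanly as UTF-8, otherwise a
    // message built from the decoder's exception.
    public static string Check(byte[] data)
    {
        // encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return null;
        }
        catch (DecoderFallbackException ex)
        {
            StringBuilder sb = new StringBuilder();
            sb.Append("Invalid UTF-8 near byte index " + ex.Index + ":");
            if (ex.BytesUnknown != null)
            {
                foreach (byte b in ex.BytesUnknown)
                {
                    sb.Append(" " + b.ToString("X2"));
                }
            }
            return sb.ToString();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("utf8check <filename.xml>");
            return;
        }
        string problem = Check(File.ReadAllBytes(args[0]));
        Console.WriteLine(problem == null ? "Valid UTF-8" : problem);
    }
}
```

A file that passes this check can still contain characters you did not
intend, like the happy face above, so it complements the hex dump rather
than replacing it.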


Derek Harmon
