<rmgalante@yahoo.com> wrote in message news:1120166320.173565.324930@g14g2000cwa.googlegroups.com...
> I notice after long periods of time that my file has bad,
> non-displayable characters. I try to load the XML file with IE, and it
> tells me that the XML is bad for one of many different reasons,
> depending on where the bad characters are located.
What you really need is to examine the file in a hex editor, to get past
thinking about those characters as "bad," and "unprintable," and "that
square character." ;-)
If you don't have a hex editor, here is a short little command-line util
that can help get you started,
- - - hexDump.cs
using System;
using System.IO;

public class HexDumpUtil
{
    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("hexdump <filename.xml>");
            return;
        }

        // Open read-only; the using blocks close the file even on error.
        using (FileStream fstream = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        using (BinaryReader bread = new BinaryReader(fstream))
        {
            byte[] buf = new byte[512];
            int pos = 0;
            int blockNum = 0;

            while (bread.Read(buf, 0, buf.Length) > 0)
            {
                Console.WriteLine("---");
                Console.WriteLine("Block #{0}", blockNum);
                Console.WriteLine();
                ++blockNum;

                // 16 rows of 32 bytes each, split into two 16-byte halves.
                for (int i = 0; i < 16; ++i)
                {
                    Console.Write("{0:X4} : ", pos);
                    for (int j = 0; j < 2; ++j)
                    {
                        for (int k = 0; k < 16; ++k)
                        {
                            Console.Write("{0:X2} ", buf[(i << 5) + (j << 4) + k]);
                        }
                        if (j == 0)
                        {
                            Console.Write(" | ");
                        }
                    }
                    Console.WriteLine();
                    pos += 32;
                }

                // Zero the buffer so a short final block is padded with 00
                // rather than stale bytes left over from the previous block.
                Array.Clear(buf, 0, buf.Length);
            }
        }
    }
}
- - -
Now let's walk through solving this. Let's pretend that I have an XML file that
looks like this in UTF-8 encoding,
- - - bad.xml (UTF-8 encoded)
<?xml version="1.0" ?>
<doc>Hello World@</doc>
- - -
This document is well formed, but for the sake of argument, let's assume the
character '@' is actually "that square character." How do you find out what
character it really is? And can it be represented in UTF-8? It might be a grave
accent (backtick) in lieu of a single quote, or a typographic quote in lieu of a
plain double quote, etc.
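On the "can it be represented in UTF-8?" question: any Unicode character can be, so it helps to know in advance what the usual suspects look like as bytes. The following is only a rough sketch going in the opposite direction from the dump (character to bytes); the class name Utf8Bytes and the method ToHex are made up for illustration.

```csharp
using System;
using System.Text;

public class Utf8Bytes   // hypothetical name, for illustration only
{
    // Returns the UTF-8 encoding of the text as space-separated hex bytes.
    public static string ToHex(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine(ToHex("'"));        // 27        (plain single quote)
        Console.WriteLine(ToHex("`"));        // 60        (grave accent / backtick)
        Console.WriteLine(ToHex("\u2018"));   // E2 80 98  (left single quotation mark)
        Console.WriteLine(ToHex("\u201C"));   // E2 80 9C  (left double quotation mark)
    }
}
```

Anything outside plain ASCII takes two or more bytes, and those are the sequences to watch for in a hex listing.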
Start by running it through hexDump and redirecting the output to a file,
C:\> hexdump bad.xml > bad.lst
Open the .lst file in Notepad and you may see something like this,
- - - bad.lst
Block #0
0000 : EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E | 3D 22 31 2E 30 22 20 3F 3E 0D 0A 3C 64 6F 63 3E
0020 : 48 65 6C 6C 6F 20 57 6F 72 6C 64 E2 98 BA 3C 2F | 64 6F 63 3E 00 00 00 00 00 00 00 00 00 00 00 00
: :
- - -
Let's pick this apart. The first 3 bytes, EF BB BF, are the standard UTF-8 byte
order mark; you can ignore this salutation. 3C is the '<' leading the
XMLDecl; in fact, the byte sequence 3C 3F 78 6D 6C corresponds
to "<?xml".
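If you would rather not eyeball the first three bytes every time, the check is easy to automate. This is a minimal sketch, not part of hexDump.cs; BomCheck and HasUtf8Bom are names invented here, and reading the whole file into memory is an assumption that only suits small files.

```csharp
using System;
using System.IO;

public class BomCheck   // hypothetical name, for illustration only
{
    // Returns true when the buffer starts with the UTF-8 byte order mark EF BB BF.
    public static bool HasUtf8Bom(byte[] data)
    {
        return data.Length >= 3
            && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("bomcheck <filename.xml>");
            return;
        }
        byte[] data = File.ReadAllBytes(args[0]);   // fine for small files
        Console.WriteLine(HasUtf8Bom(data) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```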
The trouble happens before the closing tag of </doc>, so let's skip
ahead looking for "</" or the character sequence 3C 2F. Before
3C 2F there is the sequence 57 6F 72 6C 64 E2 98 BA. This
corresponds to the text, "World@". OK, we know 57 6F 72
6C 64 is "World," so what's that leave us with? That square
character, that's right!
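Scanning by eye works for a two-line file, but for anything bigger it may be easier to let a program point at the suspect bytes. Here is a rough sketch under the assumption that "suspect" simply means "not 7-bit ASCII"; NonAsciiScan and FindNonAscii are hypothetical names, not part of the utility above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class NonAsciiScan   // hypothetical helper, not part of hexDump.cs
{
    // Returns the offset of every byte >= 0x80, i.e. every byte that is not
    // plain 7-bit ASCII and so belongs to a BOM or a multi-byte sequence.
    public static List<int> FindNonAscii(byte[] data)
    {
        List<int> offsets = new List<int>();
        for (int i = 0; i < data.Length; ++i)
        {
            if (data[i] >= 0x80)
            {
                offsets.Add(i);
            }
        }
        return offsets;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("scan <filename.xml>");
            return;
        }
        byte[] data = File.ReadAllBytes(args[0]);
        foreach (int offset in FindNonAscii(data))
        {
            // Same X4 offset format as the dump, so the two listings line up.
            Console.WriteLine("{0:X4} : {1:X2}", offset, data[offset]);
        }
    }
}
```

For the sample file above this should list offsets 0000 through 0002 (the byte order mark) and 002B through 002D (the E2 98 BA sequence).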
What [single] character has UTF-8 encoding E2 98 BA? Well, no
point in scouring encyclopedia catalogs of the encoding standards,
the following short snippet of .NET decoding magic tells us that the
UTF-8 byte sequence, E2 98 BA, is ...
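using System;
using System.IO;
using System.Text;
public class DecodeDemo   // class name is arbitrary
{
    public static void Main()
    {
        // The three suspect bytes found in the dump.
        byte[] buf = new byte[] { 0xE2, 0x98, 0xBA };
        StreamReader read = new StreamReader(new MemoryStream(buf), Encoding.UTF8);
        Console.WriteLine(read.ReadToEnd());
    }
}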
.... the happy-face character, U+263A (provided your console can display the
character; if it can't, then it shows 'that square character'). When the console
doesn't work, calling up Start | Run... | charmap.exe and searching through the
"Arial Unicode MS" font will usually do the trick.
The technique to take away from this is that even though those characters
may appear square, you gain much greater control over pinpointing and
solving the problem once you know which byte sequence is bad. Then you
can ask questions like, "why is it bad?" and "how did it get here?" and
that puts you further along the road to success.
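As one possible starting point for the "why is it bad?" question, the sketch below checks whether a byte sequence is well-formed UTF-8 at all, and prints the code point number (U+XXXX) so there is something to look up in charmap even when the console shows a square. Utf8Inspect, IsValidUtf8 and Describe are names invented for this example; only the framework calls are real.

```csharp
using System;
using System.Text;

public class Utf8Inspect   // hypothetical name, for illustration only
{
    // True when the bytes are well-formed UTF-8. The second constructor
    // argument makes the decoder throw instead of substituting U+FFFD.
    public static bool IsValidUtf8(byte[] data)
    {
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes UTF-8 bytes and formats each code point as U+XXXX.
    public static string Describe(byte[] utf8)
    {
        string text = Encoding.UTF8.GetString(utf8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.Length; ++i)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsSurrogatePair(text, i))
            {
                ++i;   // skip the low surrogate of a pair
            }
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.AppendFormat("U+{0:X4}", codePoint);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] smiley = new byte[] { 0xE2, 0x98, 0xBA };      // from the dump
        byte[] truncated = new byte[] { 0xE2, 0x98, 0x3C };   // trail byte missing

        Console.WriteLine(IsValidUtf8(smiley));      // True
        Console.WriteLine(Describe(smiley));         // U+263A
        Console.WriteLine(IsValidUtf8(truncated));   // False
    }
}
```

Keep in mind that well-formed UTF-8 is necessary but not sufficient: XML 1.0 also forbids most control characters below U+0020 (only tab, line feed and carriage return are allowed), so a perfectly valid byte sequence can still make the parser reject the file.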
Derek Harmon