I got it working after a bit. A big thank you, Jon.
Recognising the file encoding was most of the battle. It wasn't UTF-8 at all -
the Japanese chars were always 2 bytes, not 3. When I started thinking about
"Jon Skeet [C# MVP]" wrote:
> hunterb <Hunter Beanland@discussions.microsoft.com> wrote:
> > I have a file which has no BOM and contains mostly single byte chars. There
> > are numerous double byte chars (Japanese) which appear throughout. I need to
> > take the resulting Unicode and store it in a DB and display it onscreen. No
> > matter which way I open the file, convert it to Unicode/leave it as is or
> > whatever, I see all single bytes ok, but double bytes become 2 separate
> > single bytes. Surely there is an easy way to convert these mixed bytes to
> > Unicode? Below are 2 (of many) attempts at doing the conversion. I was
> > expecting that Encoding.Convert would be able to do this. My HTML charset,
> > session codepage, locale, thread culture are all set correctly for Japanese.
> > (Reading Japanese from a Unicode file works.)
> >
> > Attempt 1:
> > Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
> > FileAccess.Read, FileShare.None)
> > Dim bytUTF8(Fs.Length) As Byte
> > Fs.Read(bytUTF8, 0, bytUTF8.Length)
> > bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
> > Response.Write(Encoding.Unicode.GetString(bytUni))
> >
> > Attempt 2:
> > reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
> > System.Text.Encoding.UTF8, True)
> > bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
> > bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
> > lblMessage.Text = Encoding.Unicode.GetString(bytUni)
> >
> > In ASP3 I had to pass the text through ADO to do the conversion which was
> > very ugly to do - surely that is not required now?
>
> No. Your first problem is that you're reading the text in assuming it's
> UTF-8, then converting it *back* to UTF-8 bytes, then treating those
> bytes as if they were UTF-16 (Unicode) bytes. There's no need to
> convert them into bytes again - reader.ReadToEnd() is giving you a
> string, so just use that string!
>
> Now, that assumes that the file is *actually* in UTF-8. In my
> experience Japanese characters come out as 3 bytes in UTF-8, so you may
> actually have a Shift-JIS file instead.
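For anyone landing on this thread later, here is a minimal sketch of what reading the file as Shift-JIS could look like. It assumes the file really is Shift-JIS (code page 932); the module and function names are made up for illustration, and `Using` needs VB 2005 or later. Once you have the String there is nothing further to convert.

```vbnet
Imports System.IO
Imports System.Text

Module ShiftJisSketch
    ' Assumption: the data is Shift-JIS. On .NET Framework this lookup works
    ' out of the box; on .NET Core / .NET 5+ the code-page encodings must be
    ' registered first (System.Text.Encoding.CodePages package).
    Function GetShiftJis() As Encoding
        Return Encoding.GetEncoding("shift_jis")
    End Function

    ' Decodes raw bytes straight into a String. Strings in .NET are already
    ' UTF-16 internally, so no Encoding.Convert step is needed afterwards.
    Function DecodeShiftJis(ByVal data As Byte()) As String
        Return GetShiftJis().GetString(data)
    End Function

    ' Reads a whole text file with an explicit encoding instead of UTF-8.
    Function ReadShiftJisFile(ByVal path As String) As String
        Using reader As New StreamReader(path, GetShiftJis())
            Return reader.ReadToEnd()
        End Using
    End Function

    ' Number of bytes a given text occupies in a given encoding.
    Function CountBytes(ByVal text As String, ByVal enc As Encoding) As Integer
        Return enc.GetByteCount(text)
    End Function
End Module
```

A quick way to tell the two encodings apart is the byte count per character: hiragana "a" (U+3042) is the two bytes 82 A0 in Shift-JIS but three bytes in UTF-8, which matches the "2 bytes, not 3" observation in this thread.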
>
> You should note that your first attempt doesn't guarantee to read the
> whole file, by the way - see
> http://www.pobox.com/~skeet/csharp/readbinary.html
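A rough sketch of the point about Read, in case it helps: Stream.Read may return fewer bytes than requested, so it has to be called in a loop until it returns 0. The names below are hypothetical and this is not taken from the linked page. Note also that in VB `Dim b(n) As Byte` allocates n + 1 elements, so `Dim bytUTF8(Fs.Length)` in Attempt 1 is one byte too long and leaves a trailing zero byte.

```vbnet
Imports System.IO

Module ReadFullySketch
    ' Reads every remaining byte from a stream, however many Read calls it takes.
    Function ReadFully(ByVal source As Stream) As Byte()
        ' Upper bound 8191 gives 8192 elements (VB array bounds are inclusive).
        Dim chunk(8191) As Byte
        Using collected As New MemoryStream()
            Dim count As Integer = source.Read(chunk, 0, chunk.Length)
            ' Read returns 0 only at the end of the stream.
            While count > 0
                collected.Write(chunk, 0, count)
                count = source.Read(chunk, 0, chunk.Length)
            End While
            ' ToArray copies only the bytes actually written.
            Return collected.ToArray()
        End Using
    End Function
End Module
```

For a plain file, File.ReadAllBytes (.NET 2.0 and later) does the same job in one call; the loop is mainly useful for streams whose length is not known up front.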
>
> For more information about Unicode issues, see
> http://www.pobox.com/~skeet/csharp/unicode.html
> http://www.pobox.com/~skeet/csharp/debuggingunicode.html
>
>
> --
> Jon Skeet - <skeet@pobox.com>
> http://www.pobox.com/~skeet
> If replying to the group, please do not mail me too