dotnet general : Creating a Unicode Surrogate Pair


Chris Mullins
9/20/2003 9:19:36 PM
I've got a big Unicode character, and I'm trying to build it into a string.

The Unicode character is up around 0x10400, beyond 0xFFFF, so it's going to
require a surrogate pair.

I've been through all the logic to iterate over strings that already have
these pairs in them, but how do I encode this Unicode character INTO the
string? The string is UTF-8 encoded, but none of the things I've tried
using the encoders seem to work right...

Breaking it up into words (0x0001 and 0x0400) is obviously incorrect, as this
violates the valid range for a high surrogate.

I'm at a loss as to how to deal with it from here...

--
Chris Mullins

Jon Skeet
9/21/2003 8:13:07 AM
Try breaking the pair into 0xd801 and 0xdc00 - I believe the algorithm
is basically:

o Subtract 0x10000
o High surrogate is 0xd800+(result/0x400)
o Low surrogate is 0xdc00+(result%0x400)
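
The steps above can be sketched as follows. This is only an illustration, written in Python rather than .NET so the arithmetic is easy to check; the function name `to_surrogate_pair` is made up for this example, and the same integer operations should carry over to C# unchanged.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + offset // 0x400    # top 10 bits go in the high surrogate
    low = 0xDC00 + offset % 0x400      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))  # 0xd801 0xdc00
```

For 0x10400 the offset is 0x400, so the quotient is 1 and the remainder is 0, which gives the 0xd801 / 0xdc00 pair mentioned above.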

Of course, whatever's reading the string will need to know what to do
with the surrogate. I've managed to avoid using them so far,
fortunately...
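
As a rough idea of what the reading side has to do, here is a minimal sketch in Python (not .NET) that walks a list of 16-bit code units and recombines any high/low pair into one code point. The name `decode_utf16_units` is hypothetical, and passing unpaired surrogates through unchanged is just one possible choice; a stricter reader might reject them.

```python
def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Reverse of the split: rebuild the 20-bit offset, then add 0x10000.
            offset = (unit - 0xD800) * 0x400 + (units[i + 1] - 0xDC00)
            code_points.append(0x10000 + offset)
            i += 2
        else:
            # Ordinary 16-bit character, or an unpaired surrogate passed through as-is.
            code_points.append(unit)
            i += 1
    return code_points


print([hex(cp) for cp in decode_utf16_units([0x41, 0xD801, 0xDC00, 0x42])])
# ['0x41', '0x10400', '0x42']
```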

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet

Chris Mullins
9/21/2003 10:10:54 AM
After reading more, it looks like your suggestion is the best option for
.NET. I'm going to take any code point > 0xFFFF and break it down into a
surrogate pair, according to the algorithm found at:
http://www.unicode.org/book/ch03.pdf

This says a code point S is encoded as a high surrogate H and a low surrogate L:
H = (S-0x10000) / 0x400 + 0xD800
L = (S-0x10000) % 0x400 + 0xDC00

Now, I need to figure out what to do next...

Do I just append the high/low surrogate pair to my .NET string, or do I
have to pass this character array through the appropriate UTF-8/UTF-16
encoder to turn it into an encoded byte stream, and then somehow massage
that into my string?

--
Chris Mullins

Jon Skeet
9/22/2003 7:51:13 AM
You just append them to your string - anything which can cope with
surrogates should then recognise them appropriately.
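
To illustrate why no separate encoding step is needed before appending, here is a small sketch in Python (not .NET, whose strings are already sequences of UTF-16 code units). It assumes the pair 0xD801 / 0xDC00 for U+10400, lays the two surrogates out as UTF-16 bytes, decodes them into a string, and only then encodes the whole string as UTF-8. The variable names are made up for this example.

```python
# The two surrogates laid out as big-endian UTF-16 code units.
utf16_bytes = (0xD801).to_bytes(2, "big") + (0xDC00).to_bytes(2, "big")

# Decoding the pair yields the single character U+10400, appended after "abc".
text = "abc" + utf16_bytes.decode("utf-16-be")

# A UTF-8 encoder that understands surrogates emits one 4-byte sequence
# for the pair, not two separate 3-byte sequences.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex())  # 616263f0909080
```

The last four bytes (f0 90 90 80) are the UTF-8 form of U+10400. In .NET the equivalent step would be handing the finished string to `Encoding.UTF8.GetBytes` when a byte stream is actually needed.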

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet