dotnet internationalization : Confusion with "character" and "codepoint"...


Jochen Kalmbach [MVP]
8/24/2005 12:00:00 AM
Hi Michael!

So you think the documentation should not be corrected?


This is at least very confusing for MBCS, because the MSDN mostly says
"character" when it means "codepoint"; but there are also some
exceptions! (For example "_mbslen": here the _real_ number of characters is
returned, which adds to the confusion.)
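
To make the distinction concrete, here is a minimal sketch that counts characters in a multibyte string, as opposed to the byte count that strlen reports. It is not the CRT's implementation of "_mbslen". The helper names are hypothetical, and Shift-JIS with lead bytes 0x81-0x9F and 0xE0-0xFC is assumed as the example code page.

```c
#include <stddef.h>
#include <string.h>  /* strlen, for the byte count used in the comparison */

/* Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical helper: counts characters, not bytes, in a
   NUL-terminated Shift-JIS string. A lead byte followed by a
   trail byte counts as one character. */
size_t sjis_char_count(const char *s)
{
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)s;
    while (*p != 0) {
        if (sjis_is_lead(*p) && p[1] != 0)
            p += 2;   /* double-byte character */
        else
            p += 1;   /* single-byte character (or truncated pair) */
        ++n;
    }
    return n;
}
```

For the four bytes 0x93 0xFA 0x96 0x7B (two double-byte characters), strlen reports 4 while sjis_char_count reports 2.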

--
Greetings
Jochen

My blog about Win32 and .NET
Michael (michka) Kaplan [MS]
8/24/2005 9:11:03 AM
Low level functions in the Win32 API are not concerned with the user notion
of a "character" and they are always referring to code points when it comes
to length.

This only makes sense since the answer controls things like buffer
allocation, which would cause the functions to fail if you allocated the
number of "characters" when the number of code points was in fact greater.
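
The allocation point can be illustrated with a small portable sketch. It is not Win32 code: both helper names are hypothetical, and the input is assumed to be well-formed, NUL-terminated UTF-8. One function counts what a user would call characters, the other counts the 16-bit elements a UTF-16 buffer must hold (the quantity this post calls code points).

```c
#include <stddef.h>

/* Hypothetical helper: counts encoded characters in well-formed UTF-8
   by counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: number of 16-bit elements needed to hold the
   same text as UTF-16, excluding the terminator. A 4-byte UTF-8
   sequence (lead byte >= 0xF0) becomes a surrogate pair. */
size_t utf16_units_needed(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        unsigned char b = (unsigned char)*s;
        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte */
        n += (b >= 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the bytes F0 90 8C B0 the first function reports 1 and the second reports 2, so a buffer sized by the first count would be one element too small even before the terminator is added.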


--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.


Jochen Kalmbach [MVP]
8/24/2005 10:49:46 AM
Hi!

I think the MSDN documentation uses the word "characters" in many places
where it really means "codepoint".

For example, the documentation of "MultiByteToWideChar":

<quote>
Parameter "cchWideChar":
Specifies the size, in wide characters...
</quote>

There is no such thing as "wide characters"; there are only characters...
It should really say "(wide) codepoints".

The description of the return value also uses the term "wide characters",
which is just as confusing...

Here is a simple example:

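#include <windows.h>

/* U+10330 GOTHIC LETTER AHSA encoded as UTF-8; the literal adds the NUL */
const char utf8[] = "\xF0\x90\x8C\xB0";
wchar_t utf16[100];
/* cchWideChar is the buffer size in wchar_t elements, not in bytes */
int iRet = MultiByteToWideChar(CP_UTF8, 0, utf8, -1,
                               utf16, sizeof(utf16) / sizeof(utf16[0]));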


This is only *one* character but it has *two* codepoints in UTF16.

The expected return value should be 2 (1 character and terminating NUL).
But it returns 3 (2 codepoints and terminating NUL).
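
For readers without a Windows machine, the count of 3 can be reproduced with a portable sketch. This is not how MultiByteToWideChar is implemented; utf8_to_utf16 is a hypothetical helper that assumes well-formed UTF-8 and performs no validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter for well-formed UTF-8 only (no validation).
   Returns the number of 16-bit elements written, including the
   terminator, or 0 if dst (capacity cap elements) is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)src;
    while (*p != 0) {
        uint32_t cp;
        int extra;                       /* continuation bytes to read */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        while (extra-- > 0 && *p != 0)
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        if (cp >= 0x10000) {             /* needs a surrogate pair */
            if (n + 2 > cap) return 0;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 + (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        } else {
            if (n + 1 > cap) return 0;
            dst[n++] = (uint16_t)cp;
        }
    }
    if (n + 1 > cap) return 0;
    dst[n++] = 0;                        /* terminating NUL */
    return n;
}
```

Called with the bytes F0 90 8C B0 and a large enough buffer, it writes 0xD800, 0xDF30 and 0x0000 and returns 3, matching the value described above.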



So at the very least it is unclear where the documentation really means
"characters" and where it really means "codepoint".


--
Greetings
Jochen

My blog about Win32 and .NET
Michael (michka) Kaplan [MS]
8/24/2005 3:14:20 PM
It is something that I think ought to be corrected over time with the
various owners of the documentation pieces for the PSDK, the CRT, and so
on...

I was just offering to alleviate some of the immediate confusion....


--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.


Jochen Kalmbach [MVP]
8/26/2005 12:00:00 AM
Hi Michael!

I think we were both wrong...

After re-reading (parts of) the Unicode spec, the correct term should be
"code unit" instead of "codepoint"...

See the Unicode Glossary:
http://www.unicode.org/glossary/#code_unit
http://www.unicode.org/glossary/#code_value
http://www.unicode.org/glossary/#code_point
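
As a rough illustration of the glossary terms: one code point is encoded as a different number of code units depending on the encoding form (8-bit units in UTF-8, 16-bit units in UTF-16, 32-bit units in UTF-32). The function names below are hypothetical, and the argument is assumed to be a valid Unicode scalar value.

```c
#include <stdint.h>

/* Code units needed to encode one code point in each encoding form.
   Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate). */
int utf8_units(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

int utf16_units(uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;   /* 2 = surrogate pair */
}

int utf32_units(uint32_t cp)
{
    (void)cp;
    return 1;                      /* always one 32-bit unit */
}
```

For the code point 0x10330 this gives 4 code units in UTF-8, 2 in UTF-16 and 1 in UTF-32.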

--
Greetings
Jochen

My blog about Win32 and .NET
Michael (michka) Kaplan [MS]
8/26/2005 6:29:28 AM
Personally I have never cottoned to that definition, since it does not match
the C definition and people have no problem with phrases like "UTF-16 code
points".

But in any case, *almost* all of the Win32 APIs are not concerned with
characters.


--
MichKa [Microsoft]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure, Fonts, and Tools
Blog: http://blogs.msdn.com/michkap

This posting is provided "AS IS" with
no warranties, and confers no rights.

