First letter index


First letter index François
10/28/2004 7:37:06 AM
dotnet internationalization:
I want to build an index to a list of words, based on the first letter of
each word, for four languages (French, English, Czech, Spanish).

In English there is no problem: there are no accents, and every letter is a
single character.

In French there are accents but, as far as I know, every letter is a single
character. Capitalizing a word doesn't change its meaning.

In other languages there may be accents, a letter may be composed of more
than one character, and capitalization can change the meaning of a word.

-----
Here is an example in French of what I am after.

For the words:
Abandon
École
ennui
fuite

I would like to obtain the following entries:
A ( -> Abandon )
E ( -> École, ennui )
F ( -> fuite )

-----

I can build a lookup table for French, my mother tongue, and so have a
solution there.

But I cannot do that for Czech, etc.

-----

My questions are:

1) My goal is meaningful for French and English, but is it meaningful
for Czech and Spanish?
2) If the answer to 1) is yes, can the .NET Framework help me?

I had a look at the CompareInfo and SortKey classes.

Thank you.

Re: First letter index Michael (michka) Kaplan [MS]
10/31/2004 9:23:42 AM
Better to decompose the text (normalization form D) than to use a lookup --
then you can pick the first character off.
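
A minimal sketch of the decomposition idea on the French sample words, written
in Python for brevity; in .NET 2.0 and later the corresponding call is
String.Normalize(NormalizationForm.FormD). The helper names first_letter_nfd
and build_index are made up for this illustration, and the sketch assumes
every word is non-empty.

```python
import unicodedata


def first_letter_nfd(word):
    """Return the uppercased first code point of the NFD form of word."""
    decomposed = unicodedata.normalize("NFD", word)
    return decomposed[0].upper()


def build_index(words):
    """Group words under the letter returned by first_letter_nfd."""
    index = {}
    for word in words:
        index.setdefault(first_letter_nfd(word), []).append(word)
    return index


words = ["Abandon", "École", "ennui", "fuite"]
index = build_index(words)
for letter in sorted(index):
    print(letter, "->", ", ".join(index[letter]))
# A -> Abandon
# E -> École, ennui
# F -> fuite
```

Whether folding É into E is acceptable depends on the language of the index.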

However, even better than that is to be sure it is the right thing to do. In
English that letter is "E with a funny line on it", but in some languages
the accented and unaccented forms are considered entirely different letters;
attempting to strip the accents will lead to an unhappy user community....


--
MichKa [MS]
NLS Collation/Locale/Keyboard Technical Lead
Globalization Infrastructure and Font Technologies
Windows International Division

This posting is provided "AS IS" with
no warranties, and confers no rights.



Re: First letter index François
11/1/2004 12:08:02 PM
Thank you.

Here is another example, this time in Czech (a language of which I know
nothing).

The following words are sorted (ignoring case) with the locale cs-CZ.

Cyklopentan
Částice
Dusík
Ethanol
Fluor
Glutaraldehyd
Hydroxid
Chlor

Using the first two bytes of KeyData
[ ...CompareInfo.GetSortKey( s, CompareOptions.IgnoreCase ).KeyData ]
to detect where the first letter changes from one word to the next,
I obtain the following index.

C Č D E F G H C

Using the first two bytes of KeyData gives satisfying results for English
and almost satisfying results for French (I can't, for example, map É to E).

"C Č D E F G H C" looks weird (it should be "C Č D E F G H Ch").

The KeyData for Chlor in ( cs-CZ, CompareOptions.IgnoreCase ) reads "14 46
14 72 14 124 14 138 1 1 1 1 0". There are only 4 byte pairs instead of 5, even
though "Chlor" has 5 characters.
So the underlying API knows that "Ch" is a single letter.
It would be interesting to have a reverse map from (14,46) to Ch.
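
To make the "break on the first key element" idea concrete, here is a toy
sketch in Python. It does not read the real KeyData (whose byte values are
undocumented); instead it builds its own key from an assumed Czech alphabet
order in which "ch" is one letter placed after "h". CZECH_ALPHABET,
toy_sort_key and index_breaks are hypothetical names, the alphabet should be
checked against a real Czech collation chart, and words are assumed non-empty.
Because the key is built from a known alphabet, the reverse map from key
element to letter comes for free.

```python
import unicodedata

# Assumed primary-level Czech alphabet; "ch" counts as a single letter.
CZECH_ALPHABET = ["a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i",
                  "j", "k", "l", "m", "n", "o", "p", "q", "r", "ř", "s",
                  "š", "t", "u", "v", "w", "x", "y", "z", "ž"]
RANK = {letter: i for i, letter in enumerate(CZECH_ALPHABET)}


def toy_sort_key(word):
    """Split word into Czech letters and return their alphabet ranks."""
    text = unicodedata.normalize("NFC", word).lower()
    key, i = [], 0
    while i < len(text):
        if text.startswith("ch", i):
            letter, step = "ch", 2
        elif text[i] in RANK:
            letter, step = text[i], 1
        else:
            # Other accented letters (á, í, ...) fold to their base letter.
            letter, step = unicodedata.normalize("NFD", text[i])[0], 1
        # Anything outside the alphabet gets a rank past the last letter.
        key.append(RANK.get(letter, len(RANK)))
        i += step
    return key


def index_breaks(words):
    """Sort words and return one heading per run of equal leading ranks."""
    headings, previous = [], None
    for word in sorted(words, key=toy_sort_key):
        lead = toy_sort_key(word)[0]
        if lead != previous:
            known = lead < len(CZECH_ALPHABET)
            headings.append(CZECH_ALPHABET[lead].capitalize() if known else "#")
            previous = lead
    return headings


words = ["Chlor", "Fluor", "Cyklopentan", "Hydroxid",
         "Částice", "Dusík", "Ethanol", "Glutaraldehyd"]
print(len(toy_sort_key("Chlor")))      # 4 elements for 5 characters
print(" ".join(index_breaks(words)))   # C Č D E F G H Ch
```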

I had a look at "NFD" on the Unicode web site.
That seems interesting: it would let me map É to E in French.
Will it help me find out that "Ch" is a letter in Czech? What about the
framework and NFD?

Re: First letter index Michael (michka) Kaplan [MS]
11/1/2004 1:34:12 PM
Normalization is not doing a MAPPING -- it turns E Grave into E + Combining
Grave, so it's two characters, and the first character is an E.
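
A small Python sketch of what form D actually produces, listing the code
points after decomposition (describe_nfd is a hypothetical helper name). It
also shows two limits of the approach, stated here as observations rather
than advice: Czech Č decomposes to a plain C plus a combining caron, so
picking the first code point would file Č under C, and "Ch" is left untouched
because normalization knows nothing about digraphs.

```python
import unicodedata


def describe_nfd(text):
    """Return 'U+XXXX NAME' for each code point of the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in decomposed]


print(describe_nfd("È"))
# ['U+0045 LATIN CAPITAL LETTER E', 'U+0300 COMBINING GRAVE ACCENT']
print(describe_nfd("Č"))
# ['U+0043 LATIN CAPITAL LETTER C', 'U+030C COMBINING CARON']
print(unicodedata.normalize("NFD", "Ch") == "Ch")
# True
```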

I would *never* recommend trying to unpack the sort key to get information
as the values are not promised to remain constant.




Re: First letter index François
11/2/2004 7:12:02 AM
Thanks.

I had a look at UnicodeData.txt and the ICU website.

I will build a simple lookup table for each of the languages I intend to
support using ICU collation charts and ICU collation customization rules.

These tables will help me determine the index entry for a word (e.g. École ->
E in French; Chlor -> CH in Czech).
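
One possible shape for such per-language tables, sketched in Python. The
table contents are hand-written assumptions for illustration only (they are
not taken from the ICU charts and should be checked against them), and
INDEX_LETTERS and index_entry are hypothetical names. Each table lists the
index headings that must not be folded to a base letter; anything else falls
back to the first code point of the decomposed word. Words are assumed
non-empty.

```python
import unicodedata

# Hypothetical tables: headings that stay distinct in each language's index.
INDEX_LETTERS = {
    "en": [],
    "fr": [],
    "es": ["Ñ"],
    "cs": ["CH", "Č", "Ř", "Š", "Ž"],
}


def index_entry(word, language):
    """Return the index heading for word using the table for language."""
    upper = unicodedata.normalize("NFC", word).upper()
    # Try longer headings first so that "CH" wins over a plain "C".
    for letter in sorted(INDEX_LETTERS[language], key=len, reverse=True):
        if upper.startswith(letter):
            return letter
    # Fallback: first code point of the decomposed form (accent dropped).
    return unicodedata.normalize("NFD", upper)[0]


print(index_entry("École", "fr"))        # E
print(index_entry("Chlor", "cs"))        # CH
print(index_entry("Částice", "cs"))      # Č
print(index_entry("Cyklopentan", "cs"))  # C
```

The same word can land under different headings depending on the table:
with the "fr" table, "Chlor" is filed under C.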

I am a novice in Unicode.
Your help was precious.

Re: First letter index Michael (michka) Kaplan [MS]
11/2/2004 7:18:38 AM
Well, you may want to reconsider this plan -- there are plenty of languages
for which this is an invalid model.

But beyond that if you are using Whidbey then Unicode normalization is built
in.


