all groups > sql server odbc > july 2004 >
You're in the

sql server odbc

group:

Collation Sequences, Accented Characters and ORDER BY clauses



Collation Sequences, Accented Characters and ORDER BY clauses Mark Peck
7/19/2004 3:08:06 PM
sql server odbc: Hi,

Any help with the following query would be much appreciated.

We are developing an application in VB6 for SQL Server 2000 (SP3), running
on Win2k and have come across the following issue:

We have a 30 character attribute/column (varchar) in which we store
surnames. We now have some records where one or more characters within some
of the surnames contain accented characters. We issue queries against the
table and order by the surname, expecting an alphabetical sort. However, we
are finding some very odd results with the accented values and how SQL
Server 'sees' them. We typically use a collation of
SQL_Latin1_General_CP1_CS_AS.

Because of our findings, we created a test table that contains one
attribute, string_test varchar(30), NULLS allowed, no index. We then
populated the table with 10K records based on the following:

'A' + randomly generated ASCII character + randomly generated alphabetic
character

We then ran the following SQL:

SELECT * FROM cbs_br8151_test3 WHERE string_test LIKE 'A [aàáâãäå][ab]%'

The result we got was:

1. Aaa
2. Aàa
3. Aáa
4. Aãa
5. Aäa
6. Aåa
7. Aab
8. Aáb
9. Aâb
10. Aäb

When viewing the first 5 records in isolation, the sort order appears
correct. However, when looking at all 10 records, the sort order seems to
be based on the characters and then additionally grouped on the first and
last character. We think the sort order should have produced:

1. Aaa
2. Aab
3. Aàa
4. Aáa
5. Aáb
6. Aãa
7. Aäa
8. Aäb
9. Aåa
10. Aâb

Further testing has shown that a binary-based collation sequence produces
the same sort order as our previous query (using the CP1_CS_AS collation
sequence). So changing the collation sequence in use doesn't seem to make a
big difference, if any at all. However, we don't think that is right
either. To prove the point, I then queried for the ASCII conversion of each
of the characters:

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
FROM cbs_br8151_test3
WHERE string_test LIKE 'A[aàáâãäå][ab]%'
ORDER BY string_test

1. Aaa 65 97 97
2. Aàa 65 224 97
3. Aáa 65 225 97
4. Aãa 65 227 97
5. Aäa 65 228 97
6. Aåa 65 229 97
7. Aab 65 97 98
8. Aáb 65 225 98
9. Aâb 65 226 98
10. Aäb 65 228 98

This proves that whatever conversion of the character data is used to base
the 'sort' on, it isn't alphabetical, ASCII, or the binary values assigned
to the characters does not translate to alphabetical (which is supposed to
be the case with a case-sensitive, dictionary-based sort order is it not?).
It is performing a sort, but also seems to be applying an additional,
internal, 'group by' based on the first and last characters.

Lastly, I then selected the data into a new table with the appropriate ASCII
values in additional attributes/columns and then sorted the result set on
the ASCII values. It now provides the expected sort order (because I am not
sorting on the string itself):

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
into cbs_br8151_test4
FROM cbs_br8151_test3
WHERE string_test LIKE 'A%'

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
FROM cbs_br8151_test4
WHERE string_test LIKE 'A[aàáâãäå][ab]%'
ORDER BY first_char, second_char, third_char

1. Aaa 65 97 97
2. Aab 65 97 98
3. Aàa 65 224 97
4. Aáa 65 225 97
5. Aáb 65 225 98
6. Aâb 65 226 98
7. Aãa 65 227 97
8. Aäa 65 228 97
9. Aäb 65 228 98
10. Aåa 65 229 97


What I need to be able to do is query the varchar attribute/column, sort on
it directly (not conversions of the actual characters) and get the expected
result set. Does anybody know how to do this, or have come across this
problem before? It should be noted that we did not have a problem with the
'sort' that SQL Server provided before we started to input and use accented
characters.

If it is of any help, we have queried the database using both Query Analyzer
and also a separate ODBC-based client (ODBC being the normal connection
method for the application we are developing).

Many thanks,
Mark

RE: Collation Sequences, Accented Characters and ORDER BY clauses bartd NO[at]SPAM online.microsoft.com
7/19/2004 10:09:24 PM
Mark,

In SQL's accent sensitive collations, the presence or absence of accented
characters serves the purpose of a "tie-breaker" for otherwise identical
strings. So, for example, all a's (accented or not) are expected to sort
together in a group and to come before any b's. However, binary
collations don't behave this way and will sort based on the numeric
character code.

HTH,
Bart
------------
Bart Duncan
Microsoft SQL Server Support

Please reply to the newsgroup only - thanks.
This posting is provided "AS IS" with no warranties, and confers no
rights.


--------------------
Reply-To: "Mark Peck" <mark.peck@care.co.uk>
From: "Mark Peck" <mark.peck@care.co.uk>
Subject: Collation Sequences, Accented Characters and ORDER BY clauses
Date: Mon, 19 Jul 2004 15:08:06 +0100
Lines: 133
Organization: Care Business Solutions Ltd
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441
Message-ID: <eUzPipZbEHA.1356@TK2MSFTNGP09.phx.gbl>
Newsgroups: microsoft.public.sqlserver.odbc
NNTP-Posting-Host: host217-46-206-17.in-addr.btopenworld.com 217.46.206.17
Path:
cpmsftngxa06.phx.gbl!TK2MSFTNGXS01.phx.gbl!TK2MSFTNGXA05.phx.gbl!TK2MSFTNG
P08.phx.gbl!TK2MSFTNGP09.phx.gbl
Xref: cpmsftngxa06.phx.gbl microsoft.public.sqlserver.odbc:42275
X-Tomcat-NG: microsoft.public.sqlserver.odbc

Hi,

Any help with the following query would be much appreciated.

We are developing an application in VB6 for SQL Server 2000 (SP3), running
on Win2k and have come across the following issue:

We have a 30 character attribute/column (varchar) in which we store
surnames. We now have some records where one or more characters within
some
of the surnames contain accented characters. We issue queries against the
table and order by the surname, expecting an alphabetical sort. However,
we
are finding some very odd results with the accented values and how SQL
Server 'sees' them. We typically use a collation of
SQL_Latin1_General_CP1_CS_AS.

Because of our findings, we created a test table that contains one
attribute, string_test varchar(30), NULLS allowed, no index. We then
populated the table with 10K records based on the following:

'A' + randomly generated ASCII character + randomly generated alphabetic
character

We then ran the following SQL:

SELECT * FROM cbs_br8151_test3 WHERE string_test LIKE 'A [aàáâãäå][ab]%'

The result we got was:

1. Aaa
2. Aàa
3. Aáa
4. Aãa
5. Aäa
6. Aåa
7. Aab
8. Aáb
9. Aâb
10. Aäb

When viewing the first 5 records in isolation, the sort order appears
correct. However, when looking at all 10 records, the sort order seems to
be based on the characters and then additionally grouped on the first and
last character. We think the sort order should have produced:

1. Aaa
2. Aab
3. Aàa
4. Aáa
5. Aáb
6. Aãa
7. Aäa
8. Aäb
9. Aåa
10. Aâb

Further testing has shown that a binary-based collation sequence produces
the same sort order as our previous query (using the CP1_CS_AS collation
sequence). So changing the collation sequence in use doesn't seem to
make a
big difference, if any at all. However, we don't think that is right
either. To prove the point, I then queried for the ASCII conversion of
each
of the characters:

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
FROM cbs_br8151_test3
WHERE string_test LIKE 'A[aàáâãäå][ab]%'
ORDER BY string_test

1. Aaa 65 97 97
2. Aàa 65 224 97
3. Aáa 65 225 97
4. Aãa 65 227 97
5. Aäa 65 228 97
6. Aåa 65 229 97
7. Aab 65 97 98
8. Aáb 65 225 98
9. Aâb 65 226 98
10. Aäb 65 228 98

This proves that whatever conversion of the character data is used to base
the 'sort' on, it isn't alphabetical, ASCII, or the binary values assigned
to the characters does not translate to alphabetical (which is supposed to
be the case with a case-sensitive, dictionary-based sort order is it
not?).
It is performing a sort, but also seems to be applying an additional,
internal, 'group by' based on the first and last characters.

Lastly, I then selected the data into a new table with the appropriate
ASCII
values in additional attributes/columns and then sorted the result set on
the ASCII values. It now provides the expected sort order (because I am
not
sorting on the string itself):

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
into cbs_br8151_test4
FROM cbs_br8151_test3
WHERE string_test LIKE 'A%'

SELECT string_test, ASCII(SUBSTRING(string_test,1,1)) AS first_char,
ASCII(SUBSTRING(string_test,2,1)) AS second_char,
ASCII(SUBSTRING(string_test,3,1)) AS third_char
FROM cbs_br8151_test4
WHERE string_test LIKE 'A[aàáâãäå][ab]%'
ORDER BY first_char, second_char, third_char

1. Aaa 65 97 97
2. Aab 65 97 98
3. Aàa 65 224 97
4. Aáa 65 225 97
5. Aáb 65 225 98
6. Aâb 65 226 98
7. Aãa 65 227 97
8. Aäa 65 228 97
9. Aäb 65 228 98
10. Aåa 65 229 97


What I need to be able to do is query the varchar attribute/column, sort
on
it directly (not conversions of the actual characters) and get the
expected
result set. Does anybody know how to do this, or have come across this
problem before? It should be noted that we did not have a problem with
the
'sort' that SQL Server provided before we started to input and use
accented
characters.

If it is of any help, we have queried the database using both Query
Analyzer
and also a separate ODBC-based client (ODBC being the normal connection
method for the application we are developing).

Many thanks,
Mark



Re: Collation Sequences, Accented Characters and ORDER BY clauses Mark Peck
7/20/2004 9:35:06 AM
Hi Bart,

Many thanks for the reply - at least that explains what SQL Server is doing.

Regards,

Mark

[quoted text, click to view]

Re: Collation Sequences, Accented Characters and ORDER BY clauses bartd NO[at]SPAM online.microsoft.com
7/20/2004 2:06:38 PM
BTW, a binary collation does sort the way you said you thought SQL should
always sort. However, binary collations are also case sensitive so that
may not meet your needs.
drop table t1
create table t1 (c1 varchar (30) COLLATE Latin1_General_BIN)
insert into t1 values ('Aaa')
insert into t1 values ('Aàa')
insert into t1 values ('Aáa')
insert into t1 values ('Aãa')
insert into t1 values ('Aäa')
insert into t1 values ('Aåa')
insert into t1 values ('Aab')
insert into t1 values ('Aáb')
insert into t1 values ('Aâb')
insert into t1 values ('Aäb')

Results:
c1
------------------------------
Aaa
Aab
Aàa
Aáa
Aáb
Aâb
Aãa
Aäa
Aäb
Aåa

Bart
------------
Bart Duncan
Microsoft SQL Server Support

Please reply to the newsgroup only - thanks.
This posting is provided "AS IS" with no warranties, and confers no
rights.


--------------------
Reply-To: "Mark Peck" <mark.peck@care.co.uk>
From: "Mark Peck" <mark.peck@care.co.uk>
References: <eUzPipZbEHA.1356@TK2MSFTNGP09.phx.gbl>
<5iDeS0dbEHA.2900@cpmsftngxa06.phx.gbl>
Subject: Re: Collation Sequences, Accented Characters and ORDER BY clauses
Date: Tue, 20 Jul 2004 09:35:06 +0100
Lines: 32
Organization: Care Business Solutions Ltd
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441
Message-ID: <O7blJUjbEHA.796@TK2MSFTNGP09.phx.gbl>
Newsgroups: microsoft.public.sqlserver.odbc
NNTP-Posting-Host: host217-46-206-17.in-addr.btopenworld.com 217.46.206.17
Path: cpmsftngxa06.phx.gbl!TK2MSFTNGP08.phx.gbl!TK2MSFTNGP09.phx.gbl
Xref: cpmsftngxa06.phx.gbl microsoft.public.sqlserver.odbc:42280
X-Tomcat-NG: microsoft.public.sqlserver.odbc

Hi Bart,

Many thanks for the reply - at least that explains what SQL Server is
doing.

Regards,

Mark

[quoted text, click to view]



AddThis Social Bookmark Button