Groups | Blog | Home
all groups > sql server full text search > april 2004 >

sql server full text search : FTS is accent sensitive?


Hilary Cotter
4/23/2004 7:12:52 PM
It is not accent sensitive, or accent aware. searches on "acasã" will not
match on row containing "acasa".

[quoted text, click to view]

Emil Mustea
4/23/2004 11:14:14 PM
Hi,

I'm planning to store some romanian documents in SQL image. I understand
that I have to use neutral language.
What I need to know if FTS is accent sensitive: some peoples writes
"acasã"(means home), others simply "acasa".
The query "acasa" returns "acasa" and "acasã" or only "acasa".

Sorry for my bad english..

Thank you.

Best regards,
Emil Mustea

John Kane
4/24/2004 10:05:34 AM
Emil,
The answer to your question for SQL Server 2000 (RTM to current SP3a) is no.
I've tested this with both an accent sensitive database and an accent
insensitive database collation using the same data and same server
configuration:

select TextCol, VarcharCol, CharCol from FTSAccent
/*
TextCol VarcharCol CharCol
----------------------- ------------------------- --------------------------
----
Kahlúa HalfPipe jam cafè cafe
Kahlua HalfPipe jam cafe cafe
Halfpipe Jam? classic-recipes CD-R
Halfpipe Jam classic recipes CD R
*/

-- Accent testing for "accent insensitive" results
select TextCol from FTSAccent where contains(TextCol,'Kahlua') --
non-accented word
-- Expected Returns: 2 rows - "Kahlua HalfPipe jam" and "Kahlúa HalfPipe
jam" as this database is accent-Insensitive
-- Actual Results : 1 row - "Kahlua HalfPipe jam"

select TextCol from FTSAccent where contains(TextCol,'Kahlúa') -- accented
word
-- Expected Returns: 2 rows - "Kahlua HalfPipe jam" and "Kahlúa HalfPipe
jam" as this database is accent-Insensitive
-- Actual Results : 1 row - "Kahlúa HalfPipe jam"


-- Accent specific testing for "accent sensitive" results...
select TextCol from FTSAccent where contains(TextCol,'Kahlua') --
non-accented word
-- Expected Returns: 1 row - "Kahlua HalfPipe jam" and NOT "Kahlúa HalfPipe
jam" as this database is accent-sensitive
-- Actual Results : 1 row - "Kahlua HalfPipe jam"

select TextCol from FTSAccent where contains(TextCol,'Kahlúa') -- accented
word
-- Expected Returns: 1 row - "Kahlúa HalfPipe jam" and NOT "Kahlua HalfPipe
jam" as this database is accent-sensitive
-- Actual Results : 1 row - "Kahlua HalfPipe jam"

As you can see the actual results differs from the expected results with
both accent sensitive and accent insensitive database collations and for the
accent insensitive database collation, only the accented or non-accented
search word was returned, but not both. This may or may not be fixed in a
future service pack for SQL Server 2000.

Additionally, for SQL Server 2005 (codename Yukon) will support accent
sensitive or insensitive fulltext search via new T-SQL:
CREATE FULLTEXT CATALOG fulltext_catalog_identifier ON FILEGROUP
filegroup_identifier
IN PATH <root path> WITH ACCENT SENSITIVE | INSENSITIVE AS DEFAULT

Regards,
John



[quoted text, click to view]

Emil Mustea
4/24/2004 5:37:00 PM
[quoted text, click to view]

So it's true: searches on "acasa" will return both "acasa" and
"acasã".Correct?


[quoted text, click to view]

John Kane
4/25/2004 1:37:28 PM
You're welcome, Emil,
Would you be able to provide the ascii code for these letters from the
Romanian code page?
For example for the english accented letter: SELECT ascii('à') -- returns:
224

Thanks,
John


[quoted text, click to view]

Emil Mustea
4/25/2004 11:07:16 PM
Thank you for your detailed reply.
In Romanian we have a 2 letters with accents in the bottom of the letter
"º"(means "sh" in english) "þ"(it's sounds like zz in Pizza) . These letters
will be considered accented?


[quoted text, click to view]

Emil Mustea
4/27/2004 12:16:37 AM
º = 186
þ = 254


[quoted text, click to view]

John Kane
4/27/2004 9:14:29 PM
Emil,
Yes, I believe that the ascii code values above 127 are "extended character"
and are treated as accented characters.
The best way to confirm this is to put Romanian words that contain these
accented letters in a SQL table defined with one of the following datatypes:
Nvarchar, Nchar, or NText. Then create a FT Index on this column using the
Neutral "Language for Word Breaker" and the confirm that Romanian words that
contain these accented letters, are returned together with similar
non-accented letters using CONTAINS or FREETEXT. "The proof is always in
the pudding" - to paraphrase an old English saying .. <G>

Regards,
John


[quoted text, click to view]

AddThis Social Bookmark Button