Groups | Blog | Home
all groups > sql server full text search > april 2004 >

sql server full text search : neutral word g(b)rrrrrrrrrreaker


hseip NO[at]SPAM theconsultantsnetwork.com
4/29/2004 3:09:02 PM
Hi

1) I am using SQL Server 2000
Select @@version => Microsoft SQL Server 2000 - 8.00.194 (Intel X86)
Aug 6 2000 00:57:48 Copyright (c) 1988-2000 Microsoft Corporation
Developer Edition on Windows NT 5.0 (Build 2195: Service Pack 4)

2) I just installed SP3a for SQL Server
3) I am using the natural word breaker with

EXEC sp_configure 'default full-text language', 0
RECONFIGURE

4) the noise.enu and noise.dat files contain as the
single character "c" as a noise word.

Here is my dilemma:

A) I am searching for "C programming" where all the occurrences
of "programming" are found but not the exact occurrences of "C
programming"
because "c" is a noise word and therefore is omitted.

B) I am searching for "C++" which is found successfully.

C) Now I am taking "c" out of the noise word list to find exact
occurrences of
"C programming" which then works successfully!

D) Now I don't find "C++" anymore!!!!!! Instead I find all occurrences
of "C"

Help - PLEASE!

PS: The same issue exists with "J++" and "Visual J"

Thanks

John Kane
4/30/2004 9:38:09 PM
Hen,
First of all, thank you for the SQL Server (2000 RTM) & OS platform (Win2K)
as this is most important info.
Since you have SQL Server 2000 installed on Win2K, SQL FTS is using the
OS-supplied wordbreaker dll - infosoft.dll - and this wordbreaker is
different from NT 4.0 as well as different from Win2003 and it's new
wordbreaker dll - langwbrk.dll and therefore, you will get different results
based upon searching for C (or any other single letter) and punctuation that
is in contact with the single letter, such as C++ or J++.

I've done some testing with c vs. c++ vs. c# on both Win2K and Win2003 with
both CONTAINS and FREETEXT. Below are some of my results for this somewhat
problematic and very frequently asked question and I can supply the details,
but here are the summary:

FREETEXT will find C++ c# C- but NOT C without Error 7619 on Win2K SP3 with
SQL2K RTM when C is in noise files.
CONTAINS will find C++ c# C- but Error 7619 with C on Win2K SP3 with SQL2K
RTM when C is in noise files.

FREETEXT will find all rows contain C regardless of punctuation on Win2K SP3
with SQL2K RTM when C is NOT in noise files.
CONTAINS will find all rows contain C regardless of punctuation on Win2K SP3
with SQL2K RTM when C is NOT in noise files.

FREETEXT will find C++ C# C- but NOT C without Error 7619 on Win2003 with
SQL2K SP2 when C is in noise files.
CONTAINS will find C++ C# C- but Error 7619 with C on Win2003 with SQL2K SP2
when C is in noise files.

FREETEXT will find all rows contain C and C<punctuation> on Win2003 with
SQL2K SP2 when C is NOT in noise files
CONTAINS will find only rows containing C regardless of punctuation, but
multiple rows depending upon Capitalization on Win2003 with SQL2K SP2 when
C is NOT in noise files.

You can "have your cake, but not eat it" with this time of SQL FTS search on
just a single letter and punctuation, you best bet is to AND it with another
SQL FTS query using either a special made-up word or a special phrase, for
example CSharp or "C Sharp"...

Note, that the US_English wordbreaker will "break" words based upon
language-specific conditions, only the Neutral wordbreaker "breaks" words
based upon "white space" on any OS platform and that on Win2K SP3 and above
there was a specific fix for C++ vs. C (note capitalization).

Regards,
John



[quoted text, click to view]

hseip NO[at]SPAM theconsultantsnetwork.com
5/2/2004 2:26:19 PM
John,

Thanks for your infinite patience.

I resorted to a different solution with "like '%C programming%'" since
my queries run in batch and 99.8% of my search phrases are found
correctly. The solution with "like" is expensive and has longer run
time but since it only effects so few I am willing to do it.

Thanks

Hen


[quoted text, click to view]
John Kane
5/2/2004 8:12:51 PM
You're welcome, Hen,
Yep, sometimes it's necessary to take 10 yards and punt (to use a football
analogy) and therefore try a non-FTS solution such as T-SQL LIKE. While LIKE
has it's obvious disadvantages, and SQL FTS has a few of it's own and I'd
encourage you to use both in your "search" bag-of-tricks!

Regards,
John


[quoted text, click to view]

AddThis Social Bookmark Button