Groups | Blog | Home
all groups > sql server full text search > july 2007 >

sql server full text search : unexpected behavior with certain words "i"



Thorsten Jakobsson
7/9/2007 2:38:02 AM
Hello,
I have started to test fulltext indexing. I have run into a problem that I
don't know how to solve.
I have this data in one fulltext-column "Landsorganisationen i Sverige".
When I perform this search I get 1 hit:
select top 100 * from CA_CATALOG_New_SEARCH where
CONTAINS(FWD_ALL,'"landsorganisationen"') AND CONTAINS(FWD_ALL,'"sverige"')

But, when I do this search I get 0 hits (should be 1 hit):
select top 100 * from CA_CATALOG_New_SEARCH where
CONTAINS(FWD_ALL,'"landsorganisationen"') AND CONTAINS(FWD_ALL,'"sverige"')
AND CONTAINS(FWD_ALL,'"i"')

There is something with the word "i" that confuses the search engine. I
suspect that it is considered a special word and not indexed. But since I
want a hit, it seems like I need to know all words that won't work and
exclude them from searches. But it sounds like unnessecery work. It must be
some better way to work around this.
And if I need to know which words, were is the list?

Sincerely,
ML
7/9/2007 5:48:01 AM
The usual practice is to remove (i.e. replace with a single space) individual
letters from the noise file used when the FTI is built. Look up "noise words"
in Books Online and make sure you edit the noise file(s) for your specific
language(s).

FTI needs to be rebuilt after the noise file has been edited.


ML

---
Hilary Cotter
7/9/2007 12:15:56 PM
If you do empty your noise word list ensure that you replace it with a
single space. Otherwise it will use the indexing services noise word list.

--
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Thorsten Jakobsson" <Thorsten Jakobsson@discussions.microsoft.com> wrote in
message news:B0F51364-BDD8-4F46-88B2-CF55E3A20357@microsoft.com...
[quoted text, click to view]

Daniel Crichton
7/9/2007 1:48:49 PM
Thorsten wrote on Mon, 9 Jul 2007 02:38:02 -0700:

[quoted text, click to view]

There is a list of words held in the noise.<languagecode> files (eg.
noise.enu for Neutral English) which are not indexed. You can remove all of
these (but leave a single line with a space on it), and then repopulate your
index. Just remember to edit the correct file related to your database
locale.

The other alternative is to parse the noise word list and create a routine
to strip out these ignored words when searching - the benefit of this is
that you keep the keep the index catalog small and strip out common words
from the searches. However, it might be more work than the simple clearing
of the noise file.

Personally, I cleared all my noise word files - the performance hit was
negligible for the tests I performed on my FT searching before and after.

Dan

Thorsten Jakobsson
7/10/2007 4:54:01 AM
[quoted text, click to view]

Thanks to all!

That answers my question.

Thorsten
Hilary Cotter
7/19/2007 12:00:00 AM
You need to rebuild your catalogs. Stopping and starting SQL Server or doing
a new population won't help.

--
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Thorsten Jakobsson" <ThorstenJakobsson@discussions.microsoft.com> wrote in
message news:4034C808-FA25-477F-B32E-9C2E6C49D69B@microsoft.com...
[quoted text, click to view]

Thorsten Jakobsson
7/19/2007 5:08:04 AM
Hello, I need to ask about this again. I can't make it work. I have edited
the noise-files I can find.
In this directory C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData
I have edited noiseENG.txt, noiseENU.txt and noiseSVE.txt. I have removed
all instances of single letters.
I have for example removed single "b" from all three files. I have then
repopulated the fulltext index in every way I can think of. Still I get 0
rows from the first statement and 17 from the second, they should both return
17:
SELECT DISTINCT cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
CONTAINS(FWD_1,'"b"')
SELECT DISTINCT fwd_1,cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
FWD_1='b'

One of the ways to repopulate I have tried is this
ALTER FULLTEXT INDEX ON CA_CATALOG_New_SEARCH STOP POPULATION
ALTER FULLTEXT INDEX ON CA_CATALOG_New_SEARCH START FULL POPULATION

I have stopped the SQLServer and started, doesn't help.

What is missing, it seems like the database doesn't understand that the
files have been edited?
The documentation says that the noisefiles are called for exampl Noisesve,
but there is no file with that name, the file that exists is noiseSVE.txt.

Sincerely,
Thorsten
Thorsten Jakobsson
7/19/2007 5:52:03 AM
It finally works!

I did this
sp_configure 'default full-text language',1053
which means swedish

and after that I rebuilt the index.

It said 1033 before I ran sp_configure

Thorsten Jakobsson
7/20/2007 1:48:01 AM
This drives me crazy, I thought I had it figured out. I removed "b" from the
noiseSVE.txt and removed, created and populated the fulltext table.
It works so, I thought it was working. But when I try to search for "1" or
"ett"
("ett" is one in swedish), it just doesn't work. I have tried to remove
everything in
noiseSVE.txt and just leave a single space, when that didn't work I tried to
have just the word "XXXXXXXXXXX" in the noiseSVE.txt file. It doesn't work.

This search returns nothing:
SELECT DISTINCT fwd_all,cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
(CONTAINS(FWD_ALL,'"ett"'))

this returns 488 rows:
SELECT DISTINCT fwd_all,cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
FWD_ALL like '% ett %'

this returns 840 rows:
SELECT DISTINCT fwd_all,cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
(CONTAINS(FWD_ALL,'"b"'))

Is there a special case for single numbers? Are single numbers not indexed
in fulltext search?
But for example "10" works alright.

this returns 295 rows
SELECT DISTINCT fwd_all,cn.CA_CATALOG_ID FROM CA_CATALOG_New_SEARCH cn WHERE
(CONTAINS(FWD_ALL,'"10"'))

What is going on?

Thorsten Jakobsson
7/20/2007 2:46:03 AM
Yet another time I rebuilt the fulltext table, now it works! "ett" and "1"
works.
I have no idea why it suddenly works. There must be something that triggers
the changes in noiseSVE.txt that I don't understand.

Thorsten

[quoted text, click to view]
Hilary Cotter
7/20/2007 5:22:22 PM
single numbers are index if they are not in the noise word list.

--
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Thorsten Jakobsson" <ThorstenJakobsson@discussions.microsoft.com> wrote in
message news:B938159D-A15B-4414-B034-1714D2A4B63B@microsoft.com...
[quoted text, click to view]

AddThis Social Bookmark Button