Groups | Blog | Home
all groups > sql server full text search > april 2005 >

sql server full text search : SQL Server Indexing VS. FTS Indexing


MikeBe
4/25/2005 4:21:37 PM
I'm trying to understand the benefit of using an FTS Inverted Index
in the context of the following example:

Given a set of articles, I want to retrieve a subset that matches
certain keywords.

I text parse and extract each word (excluding noise words) from each
article (using space as the delimiter) and store the words along with
their associated article id in the following table:

ArticlesKeywords

ID:1 KEY: DOG
ID:1 KEY: CAT
ID:1 KEY: MOUSE
ID:2 KEY: DOG
ID:2 KEY: CAT

I'm not truly familiar with how SQL indexing is implemented but
aren't duplicate keys fall under the same node in the indexing tree?
A non-unique index pictures to me as a list of row address under a
certain tree branch - a "collection per key"

As far as I understand an FTS Inverted Index will essentially build a
table for each unique key - a table in which each row points to each
ID. How is this different from just putting a non-unique index on KEY
above?

Thanks,

Mike
Hilary Cotter
4/27/2005 11:04:33 AM
The benefits are basically performance and functionality. Your inverted
index has some of the functionality, but none of the performance.

Basically the SQL FTS inverted file index is b-tree like so queries are
resolved very quickly as the search engine navigates the index nodes to get
to the leaf pages where the keys and offsets occur. Then the key's are
returned to SQL Server. Depending on your query rank and position may be
included.

SQL FTS also adds teh functionality of being able to stem search phrases, so
with FreeText a search on book will match with books, book, booked, booking,
etc.

--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

MikeBe
4/28/2005 6:45:17 AM
Tnx, I don't understand your comment on performance, are you referring
to speed? Is there a huge difference in the b-tree implementation? Is
FTS 'b-tree like' magically optimized :) Actually I read some posts
of people using FTS that complain on the speed factor. As far as
ranking, are their any details on this black box?

Regarding your example if I'm not mistaken LIKE 'book%' would produce
the same result ...
Hilary Cotter
4/28/2005 10:37:59 AM
The way the full-text indexes work is using a b-tree like algorithm to
locate which rows contain the search terms. Then it uses a completely
different algorithm (contains or freetext) to calculate weights and ranking.

What happens is each word is indexed and placed in a bucket. So if there are
three words indexed alpha, Michael and zebra, there would be three buckets,
one alpha, one Michael and one zebra. A search on alpha would be directed to
the middle bucket and an evaluation would be done and it would be determined
that a< m, and then the search would be directed to the alpha bucket.

This is termed a binary search, as it divides the search into two nodes. In
reality you will be dealing with several hundred thousand indexed terms, and
you will have many buckets, the first bucket perhaps containing the works a
to aardvark. and then if your binary search points to this bucket, it might
go through several levels of buckets beneath this until if gets to the leaf
nodes.

This sort of binary search is extremely fast and by traversing these index
nodes you can to the leaf node very very quickly. Then returning the rows
and the word positions is somewhat expensive, and more expensive still is
whittling down your results sets when you have multiple search terms.

A search using the like predicate can be faster is

1) you have a clustered index on the column you are searching
2) you are searching like this select * from tablename where clusteredcolumn
like 'mik%'

These likes will use the clustered index to resolve them and this can be
faster than a fulltext search. If the above two conditions are not met, it
will not be faster than a like.

Keep in mind that the performance of your SQL FTS solution is most sensitive
to the number of rows you return. If you limit your results set to lets say
100 rows using containstable or freetexttable (i.e. select * from
containstable(tablename, *,'searchprhase',100)) you will get optimal
performance.

SQL FTS performance does degrade significantly if you are returning several
thousand rows.

For contains ranking algorithm have a look at SMART gerry salton

For FreeText have a look at Okapi BM-25

--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

MikeBe
4/28/2005 12:07:32 PM
Below...

[quoted text, click to view]

You mean a Ternary Tree ?

[quoted text, click to view]

.... and also if you use a non-clustered index on the column of a
clustered table



[quoted text, click to view]

Intresting... why? (test or by design)

[quoted text, click to view]

Good to know...


[quoted text, click to view]
MikeBe
4/28/2005 2:18:51 PM
Excellent information! Tnx.
Hilary Cotter
4/28/2005 4:36:03 PM
1) No, perhaps my example was not good - have a look at this

http://publib.boulder.ibm.com/infocenter/ids9help/index.jsp?topic=/com.ibm.adref.doc/adrefmst229.htm

and figure 17.

2 no, have a look at the execution plans when you do this

create database btree
go
use btree
create table btree
(pk char(10) not null constraint primarykey primary key, charcol char(20))
go
create index test on btree(charcol)
GO
declare @int int
declare @alpha int
declare @beta int
declare @holding char(3)
select @int=1
select @alpha=1
select @beta=1
while @int< 26
begin
while @alpha<26
begin
while @beta<26
begin
select @holding=char(64+@int)+char(64+@alpha)+char(64+@beta)
insert btree (pk, charcol) values (@holding, @holding)
select @beta=@beta+1
end
set @beta=1
select @alpha=@alpha+1
end
set @alpha=1
select @int=@int+1
end
set showplan_all on
select * from btree where pk like 'm%'

----------------------------------------------------------------------------
----------------------------------------------------------------------------
---------------------------- ----------- ----------- ----------- -----------
------------------- ------------------------------ -------------------------
----------------------------------------------------------------------------
---------------------------------------------------- -----------------------
--------- ------------------------ ------------------------ ----------------
-------- ----------- ------------------------ ------------------------------
-- -------- ------------------------------ -------- ------------------------
select * from btree where pk like 'm%'
2 1 0 NULL NULL
1
NULL 637.23169 NULL
NULL NULL 9.5086023E-3 NULL
NULL SELECT 0 NULL
|--Clustered Index Seek(OBJECT:([btree].[dbo].[btree].[primarykey]),
SEEK:([btree].[pk] >= 'Lþ' AND [btree].[pk] < 'N'),
WHERE:(like([btree].[pk], 'm%', NULL)) ORDERED FORWARD) 2 3
1 Clustered Index Seek Clustered Index Seek
OBJECT:([btree].[dbo].[btree].[primarykey]), SEEK:([btree].[pk] >= 'Lþ' AND
[btree].[pk] < 'N'), WHERE:(like([btree].[pk], 'm%', NULL)) ORDERED FORWARD
[btree].[charcol], [btree].[pk] 637.23169 8.5507222E-3
7.7945489E-4 37 9.3301767E-3
[btree].[charcol], [btree].[pk] NULL PLAN_ROW 0
1.0

(2 row(s) affected)


select * from btree (INDEX (test)) where pk like 'm%'
StmtText
StmtId NodeId Parent PhysicalOp LogicalOp
Argument
DefinedValues EstimateRows EstimateIO
EstimateCPU AvgRowSize TotalSubtreeCost OutputList
Warnings Type Parallel EstimateExecutions
----------------------------------------------------------------------------
-------------------- ----------- ----------- ----------- -------------------
----------- ------------------------------ ---------------------------------
----------------------------------------------------------- ----------------
---------------- ------------------------ ------------------------ ---------
--------------- ----------- ------------------------ -----------------------
--------- -------- ------------------------------ -------- -----------------
-------
select * from btree (INDEX (test)) where pk like 'm%'
156 1 0 NULL NULL
1
NULL 637.23169 NULL
NULL NULL 0.10877679 NULL
NULL SELECT 0 NULL
|--Index Scan(OBJECT:([btree].[dbo].[btree].[test]),
WHERE:(like([btree].[pk], 'm%', NULL))) 156 3 1
Index Scan Index Scan
OBJECT:([btree].[dbo].[btree].[test]), WHERE:(like([btree].[pk], 'm%',
NULL)), FORCEDINDEX [btree].[charcol], [btree].[pk] 637.23169
0.08868961 0.0172187 37 0.10590831
[btree].[charcol], [btree].[pk] NULL PLAN_ROW 0
1.0

(2 row(s) affected)



3) I don't know the answer to this, I would venture to say that it is by
design as most people don't want more than 100 or so rows returned. It could
be a function of hardware, I really don't know, but it is observable. You
start to notice it around 2000 or so rows.


--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

Hilary Cotter
4/28/2005 4:40:17 PM
BTW - I want to stress that these indexing seek vs scan observations are
only for

1) a clustered index and 2) when you query like this select * from tablename
where columnname like 'test%', it is not valid for this query

select * from tablename where columnname like '%test%'

This does a table scan no matter what index is on the table.

--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

AddThis Social Bookmark Button