Groups | Blog | Home
all groups > sql server full text search > september 2006 >

sql server full text search : Performance issue with CONTAINS


boblotz2001
9/5/2006 1:02:04 PM
Seemingly a simple thing but something is wrong:

SELECT * FROM PERSON WHERE company_id = 7410 AND CONTAINS(first_name,
'Daniel') takes 2.5 minutes.

SELECT person.* FROM CONTAINSTABLE(PERSON, first_name, 'Daniel') AS a
JOIN person ON a.key = person.person_id WHERE company_id = 7410 takes
under a second.

Person table has 1101665 Million rows. 3893 rows have company_id =
7410. 6836 rows contain 'Daniel'. Total of 32 rows are returned.

Catalog attributes:
Item Count: 1101665
Unique Key Count: 4271510
Catalog Size: 128 Mb

The second query runs as expected: person table performs an index seek
on the company_id index and returns 3893 rows, remote scan returns 6836
rows and a Hash Match join is used to produce 32 rows.

The second query starts the same by using the company_id index and
returning 3893 rows but then things get ugly. Remote scan show
estimated number of rows as 6836 but returns 26,510,733 rows and a
filter is then applied to filter the number of rows to 32! Needless to
say that takes a while.

Need to add the fact that the catalog contains one table 'PERSON' but
there are about
dozen columns in PERSON table being indexed.

Any thoughts?

Thanks
Hilary Cotter
9/5/2006 8:39:16 PM
This is a well known phenomena known as trimming. The best way to solve this
is by creating an indexed view which contains the data you are looking for,
i.e. company_id=7410 or you partition your table into different tables by
company_id. For large numbers of company_id this will not work.

Another option is to add company_id to your content and then search like
this

SELECT * FROM PERSON WHERE company_id = 7410 AND CONTAINS(first_name,
'Daniel' and '7410') takes 2.5 minutes.


--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

boblotz2001
9/6/2006 7:23:55 AM
Hillary,

Thanks for your reply... I am not sure if I will be able to implement
either solution. The first one is a bit drastic and carries a fair
amount of overhead with it. Besides, as you mentioned I do have a good
amount of companies. Also, doesn't sound like this change would affect
the execution plan. Index Server would still return a large recordset
and try to filter it down but now instead of millions of row it would
deal with thousands.

Second solution is also problematic because I gave company ID as an
example but there are other predicates that can be used in the query.
For example, the search might be zip_code and last name instead of
company_id and first_name. I don't want to have to include every
permutation of search criteria in the index.

I guess my point is why is this happening specifically? Is it the size
of the table, shortcoming of the technology, improper indexing on the
table, something else?

Thanks


[quoted text, click to view]
Hilary Cotter
9/6/2006 11:01:01 AM
Indexing Services as well as other search engines like Lucene allow you to
store data in the index so you can trim in the search engine and return
trimmed results. The problem is that the more data you store the slower your
searches. SQL FTS does not allow you to store data in the index and hence
you have to trim the results set as it comes back.

So other than partitioning or using a different search engine there is no
way around this.

--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Simon Sabin
9/6/2006 10:45:27 PM
Hello boblotz2001,

The lack of an ability to put the extra data into the index is a big pain
especially when the keyword is not very selective. Of the two options posed,
I would go for the addition of extra text in the search text i.e. COMP1234
It is more flexible and scalable. We implemented both and I am regretting
doing the partition one.

I woul always use containstable, I believe it simplifies the options the
optimiser has to consider.

You may want to look at SQL Turbo


Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons


[quoted text, click to view]

tony.newsgrps NO[at]SPAM gmail.com
9/10/2006 4:31:46 PM
Simon, Hilary,

What is the difference between these two queries:
SELECT * FROM PERSON WHERE company_id = 7410 AND CONTAINS(first_name,
'Daniel')

SELECT person.* FROM CONTAINSTABLE(PERSON, first_name, 'Daniel') AS a
JOIN person ON a.key = person.person_id WHERE company_id = 7410

shouldn't they return the same results? Are they interchangeable? If
not, when should I use CONTAINS? When should I use CONTAINSTABLE?

Does the same phenomena happens with FREETEXT as well?

BTW, I'm assuming this thread applies to Sql Server 2005 here, right?

Thank you,
Tony.

[quoted text, click to view]
Hilary Cotter
9/10/2006 8:04:32 PM
It applies to SQL 2000 or SQL 2005.

Containstable has some additional features that Contains does not have. One
of them is the ability to limit the range to the highest ranked X results,
ie the first 100 results. Another is that it returns the rank value, and you
can also use it to join on different tables which have a similar key.

ContainsTable returns faster results as well - mainly because only two
columns are being returned.

--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

tony.newsgrps NO[at]SPAM gmail.com
9/11/2006 8:20:51 AM
Hilary,

Thank you for your answer.
Does the trimming issue exist with ContainsTable too? If I don't set
any range limit, does ContainsTable return all records matching the
searched words or only the top X results?

[quoted text, click to view]
beginning of the thread are equivalent, so I'm confused about the whole
discussion that followed about partitioning or using another search
engine.

Tony.

[quoted text, click to view]
Hilary Cotter
9/11/2006 1:47:56 PM
Unfortunately it does.

--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.

This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]
AddThis Social Bookmark Button