Groups | Blog | Home
all groups > sql server full text search > july 2003 >

sql server full text search : DB Design / Hardware Advice to optimize FTS query performance


robnews NO[at]SPAM iGive.com
7/16/2003 8:28:28 AM
We're implementing FTS on SQL 2K from scratch. I've never done this
before - so please excuse (and highlght) newbie-type mistakes.
Looking for advice on best way to design for the following:

Goals:
Query Speed
Ease of Maintenance

Less Important (I think):
Population time

Situation:

We're creating a comparison shopping engine, more or less. Data will
be accessed via the web. Number of searches won't be high initially,
but hopefully will grow.

We're going to be receiving product description catalog datafeeds
(files) from 4 or 5 (maybe as many as 10, heck maybe more) outside
sources. Each product catalog is similar, but not exactly the same,
approximately 30 columns in each. It looks like only 2 of those
columns will need to be full-text indexed. The data is going to be
updated regularly, but doesn't change much. Each product catalog will
be updated on a different schedule (from daily to monthly).

Each catalog will range from 1,000,000 - 5,000,000 rows.

A query for "adidas running shoes" may have result rows in any of the
datafeeds.

I think my options are:

A) Map each datafeed into a single common table. Figure out how to
manage updating that common table from the changing datafeed data.

Create FTS on that table in a single FTS catalog.

Query will be simple I assume:

SELECT FT_TBL.*, KEY_TBL.RANK
FROM tablename AS FT_TBL INNER JOIN
CONTAINSTABLE (tablename, *,
'(adidas and running and shoe)',1000)
AS KEY_TBL
ON FT_TBL.CatalogID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK DESC


B) Create table per datafeed. Create FTS each table into common FTS
catalog.

Query design will be something like

Query Table A
UNION
Query Table B
UNION
Query Table C

Order entire result set by rank

Also, everything I read about hardware suggests 4+ processors, plenty
of RAM, plenty of disk (on separate channels/controllers).

Initially, however, this will go into production on 2 processor box.
Is that hopeless?

Anything I'm missing?

Andrew Cencini [MS]
7/16/2003 9:40:13 AM
Hi, Rob --

Sounds like you've read the Full-Text Deployment white paper
(http://support.microsoft.com/default.aspx?scid=/support/sql/content/2000pap
ers/fts_white%20paper.asp), and maybe Building Search Applications for the
Web
(http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/ht
ml/sql_fulltextsearch.asp) <-- this one can be useful for some additional
application-centered performance techniques.

Depending on what you are expecting the most common queries to be, it
probably would be more performant to split your data into separate tables
based on feed (that way, if it is desirable to query against only a single
feed, you can direct the query to a single table/ for multiple feeds reduce
the unions to only tables you want to query). Depending on the speed of the
dual-proc machine you have, it may be enough to satisfy your requirements,
though obviously higher-end hardware can always help :)

Thanks,
--andrew

Andrew Cencini
Program Manger
Microsoft Corp. - SQL Server Engine

--
This posting is provided "AS IS" with no warranties, and confers no rights.


[quoted text, click to view]

John Kane
7/16/2003 10:50:15 AM
Rob,
In addition to what Andrew has said below and the white paper (I was a
"contributor") referenced, SQL Server 2000 BOL title "Full-text Search
Recommendations" has additional recommendations. In terms of your goals, for
FTS query speed, it's best to limit each query to one FTS predicate
(CONTAINS* or FREETEXT*) as each FTS predicate is a "round-trip" to the FT
Catalog.

Additionally, you will need to "trap" noise words in order to avoid Error
7619, this is usually done on the client side, via methods in KB article
246800 (Q246800) "INF: Correctly Parsing Quotation Marks in FTS Queries"
http://support.microsoft.com//default.aspx?scid=kb;EN-US;246800 Also, FTS
queries for the phrase "adidas running shoes" - note, the use of double
quotes will produce different results than with search words in your query
below: '(adidas and running and shoe)'. If you are getting multiple language
from your datafeeds, it would be best to store the different languages text
in different language-specific columns and then create a FT Index for each
column (can all be in one FT Catalog) and then define the correct "Language
for Word Breaker" for each column. If this is not possible, then you must
use the Neutral "Language for Word Breaker" as this will break the multiple
language words based upon the "white space" between words. Note, if you do
use the Neutral "Language for Word Breaker", you will lose the ability to
execute FTS queries based upon the language specific FORMSOF(INFLECTIONAL)
keywords.

Since you're implementing SQL FTS from scratch, you should also consider
using Windows Server 2003 (Win2003) vs. Windows Server 2000 (Win2K) as there
are significant differences between the OS-specific word breaker dll's.
Win2K's wordbreaker infosoft.dll will sometimes not produce the expected
wordbreaking, especially if punctuation characters are next to or "touching"
the search word, while Win2003's new wordbreaker langwrbk.dll will provide
better and more expected results. Windows XP (WinXP) also has this newer
wordbreaker langwrbk.dll, so you can easily test for different results using
SQL Server 2000 Developer Edition on WinXP with a small subset of your
table's data.

Ease of maintenance can be improved via using "Change Tracking" and "Update
Index in Background" as you will not have to maintain scheduled Incremental
Population jobs. While FT Catalog Population time is less important to you,
with 1 to 5 million row tables and I'm sure that they will grow beyond those
sizes, the Full Population time can take several days and if for some reason
the FT Catalog files are lost or become corrupted, it will take several days
to re-populate the FT Catalogs. I'd recommend that you consider an addition
backup server for both your database as well as the FT Catalogs just in
case. See KB article 240867 (Q240867) "INF: How to Move, Copy, and Backup
Full-Text Catalog Folders and Files"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;240867 for
procedures on how to do this.

Since, you are planning to implement SQL FTS on a multiple CPU server, you
should know that during FT Populations, the MSSearch service can and does
cause CPU usage to hit 90+ percentage and this can cause issues for your
other SQL Server processes. You can to some degree control this via a system
stored procedure sp_fulltext_service 'resource_usage' (see the white paper
for more details), but you can also set the CPU Affinity of the "Microsoft
Search" (mssearch.exe) service via an interactive use of TaskMgr.exe
launched from a command prompt, for example:

AT <current_time_plus_1_min> /interactive taskmgr.exe

Once this TaskMgr is running, you can set the CPU affinity for the
"Microsoft Search" (mssearch.exe) service to use one or more of your CPU's
while you use sp_configure 'affinity mask' (see BOL for details) to
configure SQL Server to use the other CPU's on your server. I have
procedures and methods for using setting this up as a service such that it
can survive a re-boot on all current OS platforms. Email me directly, if
you're interested in these procedures.

Overall, SQL FTS is straight-forward to implement, you just need to plan for
contingencies and of course have higher-end hardware, especially fast disk
drives for your FT Catalogs!!

Best Regards,
John





[quoted text, click to view]
robnews NO[at]SPAM iGive.com
7/16/2003 6:16:38 PM
[quoted text, click to view]
John, Andrew - thanks.

Help me spend our money - if this went on a new (to us), dedicated
box, say dual Xeon processor, 4 GB RAM, fast cpus (2.8Mhz / 1MB Cache
look likes a sweet spot in the used box market), put it on win2003,
would you spend money on more RAM or 15K drives (vs. 10K)?

Also, I couldn't tell, are the bugs mentioned present in win2003?

I think the dedicated box takes care of lots of evils (separate
channels, maxed processor utilization, upgrade to win2003), and would
give us a near live backup db server as well (we use replication right
now to an offsite db server for backup).

And, to summarize, you both agree that the separate tables for each
datafeed, with a UNION joining the queries (most queries will be
against all tables) will yield acceptable query performance.

Also, we don't have multiple language issues (thankfully).

Again, thanks!

John Kane
7/16/2003 6:54:35 PM
Rob,
Hand's down choice is to spend your $$$ on the 15K RPM drives! MSSearch is
mainly I/O bound for performance issues but it also has a hard memory limit
of 512Mb of RAM with the sp_fulltext_service 'resource_usage' set to 5
(dedicated) and then only if the 512Mb of RAM is available, i.e., not used
by either SQL Server or the OS.

There are always bugs, it just depends if you hit them ;-)! but seriously,
the "bugs" I was referring to are in the Win2K (Windows Server 2000) and not
Win2003, relative to the wordbreaker dll's.

Yes, separate tables - where it makes sense based upon your datafeed
structure. If you're doing non-FTS query a UNION ALL should be appropriate
and give you good query performance, but be sure that you have the
appropriate "regular" (clustered and non-clustered) indexes to generate the
best optimized query. If you are doing FTS queries with multiple CONTAINS*
or FREETEXT* predicates for each table and each table has from 1 to 5
million (or more) rows, then I suspect that you will find the FTS query
performance not acceptable as each CONTAINS* or FREETEXT* predicate is a
"round trip", i.e., a complete read of the entire FT Catalog. You should
also consider using CONTAINSTABLE or FREETEXTTABLE and use the Top_N_Rank
parameter to improve your FTS query performance. Additionally, you can
create a user stored procedure (better for query plan re-use) and use temp
tables to "merge" the multiple results from each table to present a
customizable results list from the temp table to the user, for example:

CREATE PROCEDURE FindTune @Srch varchar(128)
AS
SET nocount ON
SELECT tunes.[key],ar.rank INTO #temptable from
tunes,freetexttable(artistname, *, @srch, 50) as ar
where tunes.[artistindex]=ar.[key]
UNION
SELECT [key],rank FROM freetexttable(tunes, *, @srch, 50)
UNION
select tunes.[key],ft.rank from tunes, freetexttable(albumname, *, @srch,50)
as ft
where tunes.[albumindex]=ft.[key]
/* The temporary table contains two columns; Rank and Key. Now do the main
query */
SELECT tunes.[key],best.total, tunes.title,
(select artistname from artistname where
artistname.[key]=tunes.artistindex) as artist,
(select albumname from albumname where albumname.[key]=tunes.albumindex)
as album,
(select path from tunedirs where tunedirs.[key]=tunes.dirindex) as
path, tunes.filename, tunes.filesize
FROM tunes,
(SELECT top 100 sum(rank) AS total, ett.[key] /* This bit sums the
merged rankings */
FROM #temptable as ett
GROUP BY ett.[key]
ORDER BY total DESC) AS best /* <<..which makes another table
containing the summed,sorted rankings */
WHERE best.[key] = tunes.[key]
order by best.total desc, artist,tunes.filename
drop table #temptable
go

It is good that you don't have multiple language issues!
Regards,
John



[quoted text, click to view]

Andrew Cencini [MS]
7/16/2003 7:27:50 PM
I'd prototype the system with a subset of the data and workload and look at
what is taking the bigger hit -- IO or CPU, then I'd spend the money on what
made the most sense. For example, if you're doing a lot of
FREETEXT/FREETEXTTABLE queries, that will add some additional CPU load (the
more sophisticated score computation, as well as the expansion of query
terms is primarily behind this) where you can take advantage of faster CPU
speeds (note that faster CPUs as opposed to vastly more CPUs in this case is
a reasonable thing to look at). If you are returning large result sets or
doing heavy joining, then IO will most likely be more dominant, and fast
disk will be a good use of that money.

For the time being if you have a decent dev box available to do some
preliminary tests -- ideally with dual, 1+Ghz CPUs, about a gig of RAM, and
fast disk, you can probably get a good idea of what hardware profile your
application will really require before you buy it.

Thanks,
--andrew

Andrew Cencini
Program Manger
Microsoft Corp. - SQL Server Engine

--
This posting is provided "AS IS" with no warranties, and confers no rights.

[quoted text, click to view]

AddThis Social Bookmark Button