Rob,
In addition to what Andrew has said below and the white paper (I was a
"contributor") referenced, SQL Server 2000 BOL title "Full-text Search
Recommendations" has additional recommendations. In terms of your goals, for
FTS query speed, it's best to limit each query to one FTS predicate
(CONTAINS* or FREETEXT*) as each FTS predicate is a "round-trip" to the FT
Catalog.
Additionally, you will need to "trap" noise words in order to avoid Error
7619, this is usually done on the client side, via methods in KB article
246800 (Q246800) "INF: Correctly Parsing Quotation Marks in FTS Queries"
http://support.microsoft.com//default.aspx?scid=kb;EN-US;246800 Also, FTS
queries for the phrase "adidas running shoes" - note, the use of double
quotes will produce different results than with search words in your query
below: '(adidas and running and shoe)'. If you are getting multiple language
from your datafeeds, it would be best to store the different languages text
in different language-specific columns and then create a FT Index for each
column (can all be in one FT Catalog) and then define the correct "Language
for Word Breaker" for each column. If this is not possible, then you must
use the Neutral "Language for Word Breaker" as this will break the multiple
language words based upon the "white space" between words. Note, if you do
use the Neutral "Language for Word Breaker", you will lose the ability to
execute FTS queries based upon the language specific FORMSOF(INFLECTIONAL)
keywords.
Since you're implementing SQL FTS from scratch, you should also consider
using Windows Server 2003 (Win2003) vs. Windows Server 2000 (Win2K) as there
are significant differences between the OS-specific word breaker dll's.
Win2K's wordbreaker infosoft.dll will sometimes not produce the expected
wordbreaking, especially if punctuation characters are next to or "touching"
the search word, while Win2003's new wordbreaker langwrbk.dll will provide
better and more expected results. Windows XP (WinXP) also has this newer
wordbreaker langwrbk.dll, so you can easily test for different results using
SQL Server 2000 Developer Edition on WinXP with a small subset of your
table's data.
Ease of maintenance can be improved via using "Change Tracking" and "Update
Index in Background" as you will not have to maintain scheduled Incremental
Population jobs. While FT Catalog Population time is less important to you,
with 1 to 5 million row tables and I'm sure that they will grow beyond those
sizes, the Full Population time can take several days and if for some reason
the FT Catalog files are lost or become corrupted, it will take several days
to re-populate the FT Catalogs. I'd recommend that you consider an addition
backup server for both your database as well as the FT Catalogs just in
case. See KB article 240867 (Q240867) "INF: How to Move, Copy, and Backup
Full-Text Catalog Folders and Files"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;240867 for
procedures on how to do this.
Since, you are planning to implement SQL FTS on a multiple CPU server, you
should know that during FT Populations, the MSSearch service can and does
cause CPU usage to hit 90+ percentage and this can cause issues for your
other SQL Server processes. You can to some degree control this via a system
stored procedure sp_fulltext_service 'resource_usage' (see the white paper
for more details), but you can also set the CPU Affinity of the "Microsoft
Search" (mssearch.exe) service via an interactive use of TaskMgr.exe
launched from a command prompt, for example:
AT <current_time_plus_1_min> /interactive taskmgr.exe
Once this TaskMgr is running, you can set the CPU affinity for the
"Microsoft Search" (mssearch.exe) service to use one or more of your CPU's
while you use sp_configure 'affinity mask' (see BOL for details) to
configure SQL Server to use the other CPU's on your server. I have
procedures and methods for using setting this up as a service such that it
can survive a re-boot on all current OS platforms. Email me directly, if
you're interested in these procedures.
Overall, SQL FTS is straight-forward to implement, you just need to plan for
contingencies and of course have higher-end hardware, especially fast disk
drives for your FT Catalogs!!
Best Regards,
John
[quoted text, click to view] "Andrew Cencini [MS]" <acencini@online.microsoft.com> wrote in message
news:ONwnvj7SDHA.940@TK2MSFTNGP11.phx.gbl...
> Hi, Rob --
>
> Sounds like you've read the Full-Text Deployment white paper
>
(
http://support.microsoft.com/default.aspx?scid=/support/sql/content/2000pap
> ers/fts_white%20paper.asp), and maybe Building Search Applications for the
> Web
>
(
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/ht
> ml/sql_fulltextsearch.asp) <-- this one can be useful for some additional
> application-centered performance techniques.
>
> Depending on what you are expecting the most common queries to be, it
> probably would be more performant to split your data into separate tables
> based on feed (that way, if it is desirable to query against only a single
> feed, you can direct the query to a single table/ for multiple feeds
reduce
> the unions to only tables you want to query). Depending on the speed of
the
> dual-proc machine you have, it may be enough to satisfy your requirements,
> though obviously higher-end hardware can always help :)
>
> Thanks,
> --andrew
>
> Andrew Cencini
> Program Manger
> Microsoft Corp. - SQL Server Engine
>
> --
> This posting is provided "AS IS" with no warranties, and confers no
rights.
>
>
> "Rob" <robnews@iGive.com> wrote in message
> news:f0823485.0307160728.55852810@posting.google.com...
> > We're implementing FTS on SQL 2K from scratch. I've never done this
> > before - so please excuse (and highlght) newbie-type mistakes.
> > Looking for advice on best way to design for the following:
> >
> > Goals:
> > Query Speed
> > Ease of Maintenance
> >
> > Less Important (I think):
> > Population time
> >
> > Situation:
> >
> > We're creating a comparison shopping engine, more or less. Data will
> > be accessed via the web. Number of searches won't be high initially,
> > but hopefully will grow.
> >
> > We're going to be receiving product description catalog datafeeds
> > (files) from 4 or 5 (maybe as many as 10, heck maybe more) outside
> > sources. Each product catalog is similar, but not exactly the same,
> > approximately 30 columns in each. It looks like only 2 of those
> > columns will need to be full-text indexed. The data is going to be
> > updated regularly, but doesn't change much. Each product catalog will
> > be updated on a different schedule (from daily to monthly).
> >
> > Each catalog will range from 1,000,000 - 5,000,000 rows.
> >
> > A query for "adidas running shoes" may have result rows in any of the
> > datafeeds.
> >
Rob,
Hand's down choice is to spend your $$$ on the 15K RPM drives! MSSearch is
mainly I/O bound for performance issues but it also has a hard memory limit
of 512Mb of RAM with the sp_fulltext_service 'resource_usage' set to 5
(dedicated) and then only if the 512Mb of RAM is available, i.e., not used
by either SQL Server or the OS.
There are always bugs, it just depends if you hit them ;-)! but seriously,
the "bugs" I was referring to are in the Win2K (Windows Server 2000) and not
Win2003, relative to the wordbreaker dll's.
Yes, separate tables - where it makes sense based upon your datafeed
structure. If you're doing non-FTS query a UNION ALL should be appropriate
and give you good query performance, but be sure that you have the
appropriate "regular" (clustered and non-clustered) indexes to generate the
best optimized query. If you are doing FTS queries with multiple CONTAINS*
or FREETEXT* predicates for each table and each table has from 1 to 5
million (or more) rows, then I suspect that you will find the FTS query
performance not acceptable as each CONTAINS* or FREETEXT* predicate is a
"round trip", i.e., a complete read of the entire FT Catalog. You should
also consider using CONTAINSTABLE or FREETEXTTABLE and use the Top_N_Rank
parameter to improve your FTS query performance. Additionally, you can
create a user stored procedure (better for query plan re-use) and use temp
tables to "merge" the multiple results from each table to present a
customizable results list from the temp table to the user, for example:
CREATE PROCEDURE FindTune @Srch varchar(128)
AS
SET nocount ON
SELECT tunes.[key],ar.rank INTO #temptable from
tunes,freetexttable(artistname, *, @srch, 50) as ar
where tunes.[artistindex]=ar.[key]
UNION
SELECT [key],rank FROM freetexttable(tunes, *, @srch, 50)
UNION
select tunes.[key],ft.rank from tunes, freetexttable(albumname, *, @srch,50)
as ft
where tunes.[albumindex]=ft.[key]
/* The temporary table contains two columns; Rank and Key. Now do the main
query */
SELECT tunes.[key],best.total, tunes.title,
(select artistname from artistname where
artistname.[key]=tunes.artistindex) as artist,
(select albumname from albumname where albumname.[key]=tunes.albumindex)
as album,
(select path from tunedirs where tunedirs.[key]=tunes.dirindex) as
path, tunes.filename, tunes.filesize
FROM tunes,
(SELECT top 100 sum(rank) AS total, ett.[key] /* This bit sums the
merged rankings */
FROM #temptable as ett
GROUP BY ett.[key]
ORDER BY total DESC) AS best /* <<..which makes another table
containing the summed,sorted rankings */
WHERE best.[key] = tunes.[key]
order by best.total desc, artist,tunes.filename
drop table #temptable
go
It is good that you don't have multiple language issues!
Regards,
John
[quoted text, click to view] "Rob" <robnews@iGive.com> wrote in message
news:f0823485.0307161716.1e922dd0@posting.google.com...
> "John Kane" <jt-kane@comcast.net> wrote in message
news:<#VSC2K8SDHA.1576@TK2MSFTNGP12.phx.gbl>...
> > Rob,
> > In addition to what Andrew has said below and the white paper (I was a
> > "contributor") referenced, SQL Server 2000 BOL title "Full-text Search
> > Recommendations" has additional recommendations. In terms of your goals,
for
> > FTS query speed, it's best to limit each query to one FTS predicate
> > (CONTAINS* or FREETEXT*) as each FTS predicate is a "round-trip" to the
FT
> > Catalog.
> >
> John, Andrew - thanks.
>
> Help me spend our money - if this went on a new (to us), dedicated
> box, say dual Xeon processor, 4 GB RAM, fast cpus (2.8Mhz / 1MB Cache
> look likes a sweet spot in the used box market), put it on win2003,
> would you spend money on more RAM or 15K drives (vs. 10K)?
>
> Also, I couldn't tell, are the bugs mentioned present in win2003?
>
> I think the dedicated box takes care of lots of evils (separate
> channels, maxed processor utilization, upgrade to win2003), and would
> give us a near live backup db server as well (we use replication right
> now to an offsite db server for backup).
>
> And, to summarize, you both agree that the separate tables for each
> datafeed, with a UNION joining the queries (most queries will be
> against all tables) will yield acceptable query performance.
>
> Also, we don't have multiple language issues (thankfully).
>
> Again, thanks!
>
> Rob
I'd prototype the system with a subset of the data and workload and look at
what is taking the bigger hit -- IO or CPU, then I'd spend the money on what
made the most sense. For example, if you're doing a lot of
FREETEXT/FREETEXTTABLE queries, that will add some additional CPU load (the
more sophisticated score computation, as well as the expansion of query
terms is primarily behind this) where you can take advantage of faster CPU
speeds (note that faster CPUs as opposed to vastly more CPUs in this case is
a reasonable thing to look at). If you are returning large result sets or
doing heavy joining, then IO will most likely be more dominant, and fast
disk will be a good use of that money.
For the time being if you have a decent dev box available to do some
preliminary tests -- ideally with dual, 1+Ghz CPUs, about a gig of RAM, and
fast disk, you can probably get a good idea of what hardware profile your
application will really require before you buy it.
Thanks,
--andrew
Andrew Cencini
Program Manger
Microsoft Corp. - SQL Server Engine
--
This posting is provided "AS IS" with no warranties, and confers no rights.
[quoted text, click to view] "John Kane" <jt-kane@comcast.net> wrote in message
news:eH56eZATDHA.2020@TK2MSFTNGP11.phx.gbl...
> Rob,
> Hand's down choice is to spend your $$$ on the 15K RPM drives! MSSearch is
> mainly I/O bound for performance issues but it also has a hard memory
limit
> of 512Mb of RAM with the sp_fulltext_service 'resource_usage' set to 5
> (dedicated) and then only if the 512Mb of RAM is available, i.e., not used
> by either SQL Server or the OS.
>
> There are always bugs, it just depends if you hit them ;-)! but seriously,
> the "bugs" I was referring to are in the Win2K (Windows Server 2000) and
not
> Win2003, relative to the wordbreaker dll's.
>
> Yes, separate tables - where it makes sense based upon your datafeed
> structure. If you're doing non-FTS query a UNION ALL should be appropriate
> and give you good query performance, but be sure that you have the
> appropriate "regular" (clustered and non-clustered) indexes to generate
the
> best optimized query. If you are doing FTS queries with multiple CONTAINS*
> or FREETEXT* predicates for each table and each table has from 1 to 5
> million (or more) rows, then I suspect that you will find the FTS query
> performance not acceptable as each CONTAINS* or FREETEXT* predicate is a
> "round trip", i.e., a complete read of the entire FT Catalog. You should
> also consider using CONTAINSTABLE or FREETEXTTABLE and use the Top_N_Rank
> parameter to improve your FTS query performance. Additionally, you can
> create a user stored procedure (better for query plan re-use) and use temp
> tables to "merge" the multiple results from each table to present a
> customizable results list from the temp table to the user, for example:
>
> CREATE PROCEDURE FindTune @Srch varchar(128)
> AS
> SET nocount ON
> SELECT tunes.[key],ar.rank INTO #temptable from
> tunes,freetexttable(artistname, *, @srch, 50) as ar
> where tunes.[artistindex]=ar.[key]
> UNION
> SELECT [key],rank FROM freetexttable(tunes, *, @srch, 50)
> UNION
> select tunes.[key],ft.rank from tunes, freetexttable(albumname, *,
@srch,50)
> as ft
> where tunes.[albumindex]=ft.[key]
> /* The temporary table contains two columns; Rank and Key. Now do the main
> query */
> SELECT tunes.[key],best.total, tunes.title,
> (select artistname from artistname where
> artistname.[key]=tunes.artistindex) as artist,
> (select albumname from albumname where
albumname.[key]=tunes.albumindex)
> as album,
> (select path from tunedirs where tunedirs.[key]=tunes.dirindex) as
> path, tunes.filename, tunes.filesize
> FROM tunes,
> (SELECT top 100 sum(rank) AS total, ett.[key] /* This bit sums the
> merged rankings */
> FROM #temptable as ett
> GROUP BY ett.[key]
> ORDER BY total DESC) AS best /* <<..which makes another table
> containing the summed,sorted rankings */
> WHERE best.[key] = tunes.[key]
> order by best.total desc, artist,tunes.filename
> drop table #temptable
> go
>
> It is good that you don't have multiple language issues!
> Regards,
> John
>
>
>
> "Rob" <robnews@iGive.com> wrote in message
> news:f0823485.0307161716.1e922dd0@posting.google.com...
> > "John Kane" <jt-kane@comcast.net> wrote in message
> news:<#VSC2K8SDHA.1576@TK2MSFTNGP12.phx.gbl>...
> > > Rob,
> > > In addition to what Andrew has said below and the white paper (I was a
> > > "contributor") referenced, SQL Server 2000 BOL title "Full-text Search
> > > Recommendations" has additional recommendations. In terms of your
goals,
> for
> > > FTS query speed, it's best to limit each query to one FTS predicate
> > > (CONTAINS* or FREETEXT*) as each FTS predicate is a "round-trip" to
the
> FT
> > > Catalog.
> > >
> > John, Andrew - thanks.
> >
> > Help me spend our money - if this went on a new (to us), dedicated
> > box, say dual Xeon processor, 4 GB RAM, fast cpus (2.8Mhz / 1MB Cache
> > look likes a sweet spot in the used box market), put it on win2003,
> > would you spend money on more RAM or 15K drives (vs. 10K)?
> >
> > Also, I couldn't tell, are the bugs mentioned present in win2003?
> >
> > I think the dedicated box takes care of lots of evils (separate
> > channels, maxed processor utilization, upgrade to win2003), and would
> > give us a near live backup db server as well (we use replication right
> > now to an offsite db server for backup).
> >
> > And, to summarize, you both agree that the separate tables for each
> > datafeed, with a UNION joining the queries (most queries will be
> > against all tables) will yield acceptable query performance.
> >
> > Also, we don't have multiple language issues (thankfully).
> >
> > Again, thanks!
> >
> > Rob
>
>