Groups | Blog | Home
all groups > sql server full text search > february 2007 >

sql server full text search : What is incremental?



Ian Boyd
2/23/2007 3:44:34 PM
Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing"
http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx

It says regarding a full population:

"MSSearch will issue sp_fulltext_getdata once for every row in your
table. So if you have 50 million rows in your database, this procedure will
be issued 50 million times.

"On the other hand, if you use an incremental population...this
statement will also be issued 50 million times. Why?"


That's exactly what i'd like to know. He continues:

"Because even with an incremental population, MSSearch must figure out
exactly which rows have been changed, updated, and deleted."


So why is the incremental populator not getting the id of the rows that have
changed after the last timestamp? What is so hard about:

SELECT PrimaryKeyColumn
FROM SomeTable
WHERE timestamp > 0000123456789


If full-population calls sp_fulltext_getdata for every row in the table,
and incremental-population calls sp_fulltext_getdata for every for in the
table,
what is the difference between full and incremental populating?






And while we're at it, can we please, for the love of everything holy, fix
EM so i can manage a full-text catalog from my development. And can be
please, for the love of everything holy, include some way to see what
populating method and change tracking scheme the full text indexing is
using. And can we please, for the love of everything holy, let EM show if FT
Search is running on the server. And while we're at it, can be please, for
the love of everything holy, let me manage change tracking from EM. And
while we're at it, can we please, for the love of everything holy, document
this stuff in BOL. And while we're at it, can we please, for the love of
everything holy shut me up.


Hilary Cotter
2/23/2007 4:36:33 PM
Hmmm........ this article looks familiar. Basically the indexer extracts the
entire row to look at the value of the timestamp column to determine if it
has been modified since the last time it looked at that row. If the
timestamp column is modified the full-text indexed columns will be
re-indexed.

It also checks to see if new rows are added and then compares all the key
and timestamp info returned from all the row with what is in the index to
figure out what has been deleted.

The problem with this query SELECT PrimaryKeyColumn
FROM SomeTable
WHERE timestamp > 0000123456789

is it doesn't detect deletes.

A full-index blindly reinidexes everything. The incremental only indexes the
rows which have been modified or newly inserted. So it does more
bookkeeping than indexing and can't detect if the update was on a column
which is not being full-text indexed.

While this might sound inefficient, it actually is very efficient and
reliable.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
2/26/2007 5:04:55 PM
[quoted text, click to view]

What if we call
sp_fulltext_getdata CatalogID, object_id

Which returns a list of ALL rows in the table, and their timestamps. Any
rows in our full-text catalog, but not in the returned list, have been
deleted.

Then, we call
sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID

only for those rows that that have a new timestamp than our largest.


So we get the best of both worlds. We find deletes, and we only scan rows
changed that have changed since our last incremental scan.

Hilary Cotter
2/26/2007 10:20:51 PM
Yes, you can do this, just keep in mind that each time you fire this proc
you reset an internal counter saying this is the highest timestamp, so each
time you fire it MSSearch starts thinking that the timestamp you generated
by firing this proc manually is its reference point and it can miss modified
rows.

Basically when fire sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID
yourself it throws away the values, in other words only when MSSearch fires
this proc does the row get reindexed.

So, I really don't recommend it.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/1/2007 12:00:00 AM
[quoted text, click to view]

i wasn't meaning me. i meant why isn't SQL Server doing this since it's
faster and more efficient?

Hilary Cotter
3/1/2007 12:33:23 PM
I believe it is, it first fires the proc to get a list of what has change
and updates the time stamp, then it gathers those rows and indexes them, and
then it checks to see what has been deleted.

Why do you think it works differently?

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/3/2007 12:12:26 AM
[quoted text, click to view]

From the article "10 Ways to Optimize SQL Server Full-Text Indexing"
http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx

"MSSearch will issue sp_fulltext_getdata once for every row in your
table. So if you have 50 million rows in your database, this procedure will
be issued 50 million times.

"On the other hand, if you use an incremental population...this
statement will also be issued 50 million times."

Perhaps it's a typo, and was supposed to read:
"On the other hand, if you use an incremental population...this
statement will NOT be issued 50 million times, but rather will only be
executed once for each modified row in the table. Normally there will be
only one or two rows, rather than 50 million."

Hilary Cotter
3/5/2007 12:00:00 AM
Nope, its not a typo. The row has to be extracted to get the timestamp for
that column. Consider an you kick off your indexing at 12:00, it extracts
that row at 12:50. Now, suppose that row has modified since 12:00. The
indexer has to know whether to index that row or not. The 12:00 row set says
it doesn't need to, but the Indexer needs a more up to date value, hence it
grabs the most recent timestamp.

Make sense?

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/5/2007 2:38:15 PM
[quoted text, click to view]

Consider the indexing kicks off at 12:00. It gets a lit of all primary keys
in the table, and that row's timestamp:

LastTimestamp PrimaryKey
========= =======
30043 1
32364 2
30044 3
....
32363 49584682
etc.

Any rows not in this list get purged from the FullText catalog.

Next the full-text indexer know that when it last crawled the table, the
timestamp was at 32363. So it can search through the above list, and only
fetch rows that have a newer timestamp.

Timestamp PrimaryKey
======= =======
32364 3

And so rather than having to pull the contents of 50million rows, we only
pull the contents of one row. That's a speed improvement by a factor of 50
million!

And if a row was modified during the 2 seconds it took to do our incremental
scan, we can just scan again - since we've sped up the incremental indexing
by a factor of 50 million, we can afford to scan more frequently. Perhaps
even as quickly as every 6 or 7 seconds.



[quoted text, click to view]

This sounds like a full-scan, where we scan every row no matter what. i'm
talking about an incremental full-text catalog build. Why does incremental
not do as i've described? Everyone who comes into SQL Server Full-Text
Indexing assumes it does an incremental scan - and then cannot figure out
why it's scanning the whole table.

If you're trying to tell me that incremental and full do the same thing
(i.e. scan every row in the table), then i return to my original question:
What is incremental?

Hilary Cotter
3/6/2007 4:05:40 PM
but the timestamp retrieved is the value of the timestamp when the first
list was obtained. It then needs to retrieve the timestamp again to see if
it has changed since the first full-list retrieval. I suspect the first
full-list is to get the crawl seeds and more importantly to figure out what
to delete.

Consider this case. You get the highest value timestamp and compare it with
the highest value timestamp from the last run. You only index those rows.
What about the rows which have modified while you are indexing this delta.
By only doing this delta you miss those ones.

You might want to make your case on connect to see what the office MS answer
is.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/8/2007 12:32:07 AM
[quoted text, click to view]


You're describing a solution to a problem that doesn't need to be solved. At
the very least full repopulation doesn't solve this problem either.

Consider full-repopulation:

12:00:00pm: Get list of all rows and their timestamps. Only row 37983 has
changed
12:00:02pm: Crawl row 1, in case it changed since we got the list
12:00:02pm: Crawl row 2, in case it changed since we got the list
.....
12:49:37pm: Crawl row 37983, because we know it changed
12:49:37pm: Row 3 changes - after we've crawled it
12:49:37pm: Ccrawl row 37984, in case it changed since we got the list
....
6:13:31pm: Crawl row 50123456, in case it changed since we got the list
6:13:31pm: Crawl complete.

But we missed the change to row 3. The full-repopulation STILL misses the
update. So i'll ask you: "What about the rows which have modified while you
are indexing this delta?"

Does the full rebuild algorithm perform endless passes until no rows have
changed during the full scan? No, there would never be an end. This is an
accepted limitation of the algorithm, and any changes we'll pick up the next
time a full repopulation is run. What's more, because a full repopulation is
so unnecessarily lengthy, it essentially guarantees that rows will change
during the full re-scan - and that they will get missed.


So, my original question: If an increment rebuild is the same as a full
rebuild, what's the difference between incremental and full? Or, more
simply, "What is incremental?"


Now consider incremental-repopulation as everyone assumes it works:

12:00:00pm: Get list of all rows and their timetamps. Only row 37983 has
changed
12:00:02pm: Row 3 changes
12:00:02pm: Crawl row 23983, because it has changed
12:00:02pm: Crawl complete

In both examples we miss a change to a row. But in the ideal incremental
rebuild algorithm, the window where that change could have happened is much
smaller (on the order of 50 million times smaller). And since we've already
accepted that if row changes after we've examined it, then it will just have
to get picked up next time the incremental (or full) population is run.
Additionally, since the incremental population only took 2 seconds, we can
run it much more frequently; getting updates in near realtime.



The difference is: the incremental algorithm is faster, with less load on
the server, and row changes making it into the full text catalog faster. At
least, that's what every MSSQL dba assumes the incremental repopulation
does. Until they begin seeing high load on their server, and they post into
m.p.s.f asking what SQL Server is doing.

My collary question is: why isn't SQL Server doing this already? There's
the "full" algorithm. Is there any other algorithm? Which brings me to my
primary question:

Is there an "incremental" algorithm of some sort? If yes, what does the
incremental repopulation algorithm do differently from the "full"? What is
incremental?

According to an article on TechNet, there's no difference. So then, why the
two names?


[quoted text, click to view]
i don't know what connect is. Nor do i believe i would get an answer.

Hilary Cotter
3/8/2007 3:43:54 PM
let me try to explain this again.

Full population. Every row is extracted and indexed.

Incremental population. Every row is extracted, timestamps are compared with
the last timestamp the last full or incremental population ended on and if
the timestamp is different the extracted row is reindexed. Rows which are no
longer in the table but in the catalog are removed from the catalog. Rows
which are not in the catalog are inserted.

Does this make sense? The cost of bringing the entire row is not much more
expensive from just bringing the timestamp so its not that inefficient.


--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/8/2007 4:03:53 PM
[quoted text, click to view]

If i understand what you're staying - then someone the SQL team needs a
smack.

Both "full" and "incremental" both scan scan every row, one by one, in the
database, whether the row has changed or not. Why, in the name of god, would
they not have an actual "incremental" scan? How is this not a no-brainer?

[quoted text, click to view]

i disagree. Put an index on PK, Timestamp.

Fetching every row in the database, one by one; it's like using a cursor to
do everything. There's so much to be gained by using SQL Server to do what
it's good at: operate on sets.


My god, they really should add an "incremental" update option.

Ian Boyd
3/8/2007 4:05:52 PM
[quoted text, click to view]

i see what he's trying to say.

Incremental offers no performance improvement over full, because it still
reads every single row in the database, one by one, not even as a set of
rows.

Hilary Cotter
3/8/2007 7:47:31 PM
some of the reasons are that not every table is guaranteed to have a
timestamp column on it, nor can Microsoft guarantee that their user's won't
update the pk's. Suppose you update a pk on your table. How is the indexer
supposed to know what row the data in the catalog belongs to? Microsoft
can't count on their users being as sophisticated as you obviously are.

As Simon mentions change tracking is designed to provide real time indexing.
Full-populations are designed when the majority of your data changes at one
time. Incremental is designed when large portions of your data changes at
one time.

Keep in mind that the MSSearch engine provided search services to Exchange
Content Indexing, MS Office search, Site Server Search and SharePoint Portal
Search. They needed a consistent crawl or population method for all
population/crawl types.

In SQL 2005 the crawl mechanism has been totally re-written as a database
only indexer. Still incremental population remains as it is the fastest
crawl/population type in some cases.

Its the bookkeeping that MSSearch must do in order to reconcile what is in
the catalog with what is in the table which makes the extraction all the
rows in the base table essential.

This crawl algorithm is also present in Sybase and Oracle last time I
looked.

--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

[quoted text, click to view]

Simon Sabin
3/8/2007 8:20:28 PM
Hello Ian,

Isn't the difference that incremental looks to see if the record has changed
and reindex if required. With full it just reindexes the record.


Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons


[quoted text, click to view]

Hilary Cotter
3/8/2007 9:05:28 PM
Both full and incremental extract each row. Incremental will not index each
row, but full will. Still full can be faster than incremental as it doesn't
have to do any bookeeping or index maintenance.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Simon Sabin
3/8/2007 11:14:45 PM
Hello Ian,

Thats what change tracking is for.


Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons


[quoted text, click to view]

Ian Boyd
3/9/2007 12:00:00 AM
[quoted text, click to view]

i'd be fine with a timestamp being required to to FT indexing; like some
forms of replication need a "IsUniqueIdentifier" row.

[quoted text, click to view]

You upate the PK, but don't forget that the timestamp updates with it. We'll
get the "new" values associated with that row next run.

[quoted text, click to view]

But both heavily impact SQL Server to the same degree.

[quoted text, click to view]

It's an implementation detail that Microsoft SQL Server implements it's
FullText search ability as a separate process. Integrate it into the engine,
and it's not a problem. If they need to choose a different implementation:
i'm fine with that.

[quoted text, click to view]

If this goes toward my previous statement then that's excellent news.

[quoted text, click to view]

See points 1 and 2

[quoted text, click to view]

Which means there is room for improvement and innovation on Microsoft's part
to make a superior product.

Ian Boyd
3/9/2007 12:00:00 AM
[quoted text, click to view]

From the BOL, regarding "incremental" population:

Incremental population
"...This feature requires that the indexed table have a column of the
timestamp data type."


Is there anything i'm missing?

Hilary Cotter
3/9/2007 10:53:09 AM
Comments in line.
--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

The unique identifier used in replication is used to track which rows have
changed in which tables and which servers. It does not have a timestamp
component. While you could use it for a timestamp it is too wide.
[quoted text, click to view]

That's right you update the pk, the timestamp tell the indexer that it has
been updated since the last time, but the key information is gone. So last
run it had a key value of 1 and the catalog knows that X number of words in
the index belong to row 1, but now the value of row 1 is 13424. How does it
reconcile this to remove these enteries and update them with the correct new
pk value?
[quoted text, click to view]

Not always. The extraction is typically less expensive than the indexing,
especially when you are dealing with blobs and Far Eastern content.
[quoted text, click to view]
In SQL 2005 the indexer is in process.

[quoted text, click to view]

Please see my answers to them:)
[quoted text, click to view]

Which means you better go looking to smack the entire search engine
developers, not just the MS ones.
[quoted text, click to view]

Hilary Cotter
3/9/2007 10:54:17 AM
That is 100% correct, but I wish I have a cent for every dba who has run an
incremental population on a table without a timestamp and had a full
population performed.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/10/2007 6:08:55 PM
[quoted text, click to view]

No, i was saying that's there's already a precident for mandating a column's
presence in a table to be able to use a feature. So there's no problem in
mandating a timestamp be present in order to use incremental population

[quoted text, click to view]

Okay, i guess i need to have a *detailed* example to explain how this works.

Assume we want to catalog a simple Customers table

CREATE TABLE Customers (
pk int NOT NULL PRIMARY KEY,
Firstname varchar(25),
Lastname varchar(25),
[timestamp] timestamp)

Now, assume this table has been pre-filled with two rows:

pk Firstname Lastname timestamp
1 Ian Boyd 970001
2 Hillary Cotter 970002


Incremental Population Nº1
=====================
Let's do our "initial" incremental population. We record the last timestamp
every time we run the incremental population. Since this is the first time
an incremental population has been run, we assume a "LastTimestamp" of zero.

The first step in performing an incremental population is to get every
Primary key value, and it's timestamp.

NOTE: Perhaps there is some special stored procedure that can be used to get
this list; perhaps being called sp_fulltext_getdata. We won't use that for
my algorithm, we're use something more clear - for the purposes of
explaining the incremental population algorithm.


[quoted text, click to view]

pk timestamp
1 970001
2 970002
2 row(s) affected

These are all the rows the table contains. Any entries in our full-text
catalog that do not exist in this list need to be removed from the
full-text catalog; since they have been deleted from the source database. In
our example, this is the first time the full-text indexer has been run on
our Customers table, so the full-text catalog is empty.

Next, we scan the recordset, looking for any rows with a timestamp greater
than zero. We find two rows:

pk timestamp
1 970001
2 970002

So, there are two rows that need to go into the full-text catalog. The
full-text indexer issues a statement to get the contents of the Customers
table for primary key value 1.

NOTE: This can be some special SQL Server stored procedure (perhaps named
sp_fulltext_getdata) that is used to get the values of columns for a
specific row. We're going to do it the obvious way; for the purposes of
explaining this algorithm

[quoted text, click to view]

pk Firstname Lastname timestamp
1 Ian Boyd 970001
1 row(s) affected

The full-text application now takes those values, and parses them into
whatever format it uses and stores them wherever it stores them in the
catalog. Now it get's the next row:

[quoted text, click to view]

pk Firstname Lastname timestamp
2 Hillary Cotter 970002
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.



Now let's say someone modifies a row:

[quoted text, click to view]
1 rows(s) affected

Incremental Population Nº2
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97002. It's first step is to find all
rows in the table, and their timestamps

[quoted text, click to view]
pk timestamp
1 970001
2 970003
2 row(s) affected

First we delete any rows from our catalog that are not in the database.
Since both rows are still there, we proceed to the next phase. Look through
the recordset and find rows that have a timestamp greater than 970002. There
is one row that matches:

pk timestamp
2 970003

So the indexer fetches the contents of that row:

[quoted text, click to view]

pk Firstname Lastname timestamp
2 Hilary Cotter 970002
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.



Now let's say someone updates the primary key in the table

[quoted text, click to view]


Incremental Population Nº3
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97003. It's first step is to find all
rows in the table, and their timestamps

[quoted text, click to view]
pk timestamp
1 970001
3 970004
2 row(s) affected

First we delete any rows from our catalog that are not in the database. But
now we see that our full-text catalog has data for a row with pk=2, but that
row doesn't exist in the database. So rows with pk=2 is purged from the
full-text catalog, and we proceed to the next phase. Look through the
recordset and find rows that have a timestamp greater than 970003. There is
one row that matches:

pk timestamp
3 970004

So the indexer fetches the contents of that row:

[quoted text, click to view]

pk Firstname Lastname timestamp
3 Hilary Cotter 970004
1 rows(s) affected

And now those values are put into the full-text catalog. And the catalog is
up to date.


Now let's say someone is really devious and swaps the PK values of rows 1
and 3

[quoted text, click to view]
1 row(s) affected


Incremental Population Nº4
=====================
Now, the incremental population algorithm is started up again. The last
timestamp in the table when it was run 97004. It's first step is to find all
rows in the table, and their timestamps

[quoted text, click to view]
pk timestamp
1 970007
3 970006
2 row(s) affected

First we delete any rows from our catalog that are not in the database. All
rows in our catalog exist in the database, so we proceed to the next step.
The last timestamp when we ran was 970004, so we look for rows that have a
newer timestamp, and we find two rows:

pk timestamp
1 970007
3 970006


So the indexer fetches the contents of the the first row:

[quoted text, click to view]

pk Firstname Lastname timestamp
1 Hilary Cotter 970007
1 row(s) affected

Ian Boyd
3/10/2007 8:29:04 PM
[quoted text, click to view]

The interface, or the server, should not allow it to happen. It should be an
error. Or perhaps:

|
| Cannot perform an incremental population on a table without a timestamp.
|
| Would you like to perform a full population instead?
|
| Yes No
|

The former is just poor user interface design (which i take it is fixed in
SQL2005 Management Studio.

Hilary Cotter
3/12/2007 12:00:00 AM
Ian you are describing how an incremental population runs with one
exception.
So the indexer fetches the contents of that row:

SQL Server has to poll the entire table to find out what is in it. It then
compares this with what the fulltext catalog has. It then can detect what is
deleted.

An update can be considered a delete followed by an insert.

Your method describes how to detect what is new (inserts and updates), but
it doesn't describe a method to determine what is deleted.

Can you help me to understand how your method detects deletions?

To quote you "First we delete any rows from our catalog that are not in the
database." and "Any entries in our full-text catalog that do not exist in
this list need to be removed from the full-text catalog; since they have
been deleted from the source database" and "First we delete any rows from
our catalog that are not in the database." and again 'First we delete any
rows from our catalog that are not in the database. " And one more time
"First we delete any rows from our catalog that are not in the database. "

Exactly how does the indexer know what is deleted knowing only what is new?

The algorithm that the crawl mechanism in MSSearch uses is to say send me
all the pks and timestamps and I'll compare this list with what is in the
catalog.

This way I detect what is new (in a similar manner to what you describe),
and also what is to be deleted.

Am I missing something?


--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]
Ian Boyd
3/12/2007 11:26:59 AM
[quoted text, click to view]

Let's say our table and our catalog are currently in sync

Contents on the Customers table:
pk Firstname Lastname timestamp
1 Ian Alexander Boyd 970001
2 Hilary Nathan Cotter 970002

NOTE: i'll now store first and last names in the FirstName field, so that we
can see words being broken up.
NOTE: i do not know the format that is used to store the full-text catalog;
so i'll have to make one up.

Contents of the FullTextCatalog table:
Table Key Column keyword
Customers 1 Firstname Ian
Customers 1 Firstname Alexander
Customers 1 Lastname Boyd
Customers 2 Firstname Hilary
Customers 2 Firstname Nathan
Customers 2 Lastname Cotter

And the last time the incremental population ran the timestamp in the
Customers table was 970002. The full-text indexer also must store this:

Contents of the Full-text catalog LastTimestamps table:
Table LastTimestamp
Customers 970002


Everything is setup. Now we delete a row:

[quoted text, click to view]
1 row(s) affected

Now there is only one row in the customer's table:

[quoted text, click to view]
pk Firstname Lastname timestamp
1 Ian Alexander Boyd 970001
1 row(s) affected



Now the incremental indexer comes along. First thing is does it ask for all
key-timestamp pairs:
[quoted text, click to view]
pk timestamp
1 970001
1 row(s) affected



So your question is: How does the indexer know that a row has been deleted?
Specifically, how does it know that row 2 has been deleted?

Looking through the full-text catalog, we find all rows that we've
previously indexed:

[quoted text, click to view]
pk
1
2
2 row(s) affected.

Our full-text catalog contains values from two rows. But in the list from
the database that we fetched a moment ago there is only one row:

[quoted text, click to view]
pk timestamp
1 970001
1 row(s) affected

Obviously row 2, that exists in our full-text catalog, has been delete from
the source database, and must be deleted from our full-text catalog.

[quoted text, click to view]
3 row(s) affected




This is how we detect deletes.




Perhaps now you're wanting to contrive a more pessimistic case. Let's take
the original values:

pk Firstname Lastname timestamp
1 Ian Alexander Boyd 970001
2 Hilary Nathan Cotter 970002

and delete Hilary, and switch Ian to pk value #2, resulting in:

pk Firstname Lastname
2 Ian Alexander Boyd

So now we've deleted a row, but also moved another row on top of it. How do
we handle this?

Well, starting from:
pk Firstname Lastname timestamp
1 Ian Alexander Boyd 970001
2 Hilary Nathan Cotter 970002

let's do the SQL to make the change:

[quoted text, click to view]
1 row(s) affected

Giving us:
pk Firstname Lastname timestamp
1 Ian Alexander Boyd 970001

[quoted text, click to view]
1 row(s) affected

Giving us:
pk Firstname Lastname timestamp
2 Ian Alexander Boyd 970003


Now the incremental indexing starts, and it's first step is to get a list of
all pk-timestamp pairs:

[quoted text, click to view]
pk timestamp
2 970003
1 row(s) affected

Now we know that a row has been deleted, but how does the full-text indexer
know? It looks through it's full-text catalog:


[quoted text, click to view]
pk
1
2
2 row(s) affected.

The catalog has two rows, but the current able only has one. Specifically,
row with pk=1 is no longer in the source database, while it is still in our
catalog. So let's remove that row from the catalog:

[quoted text, click to view]
1 row(s) affected


"Ah-hah!", you say. "How does it know that row 2 is not the same row 2 in
the catalog?" From the timestamps. The indexer stores the last timestamp
that was in a table the last time in ran:

[quoted text, click to view]
Table LastTimestamp
Customers 970002
1 row(s) affected.

So any rows that have a timestamp greater than 970002 need to be
re-populated. Seeking through the list in memory that we already retreived
(which i'll reproduce here for convience):

pk timestamp
2 970003

indicates that one row has been modified.

So the full-text indexer throws away any data in it's catalog for Key=2, and
puts in it's place values from the current row of pk=2.




Makes sense?

Hilary Cotter
3/12/2007 12:23:44 PM
What you describe is exactly how the indexer works. It gets stores the last
highest timestamp form the last run.

It uses this as the basis to find out what is new. It then extracts each pk
and timestamp to determine what to delete and by extension what has been
updated. It also extracts the content of the row at that time as the expense
of this hit is not much more than the cost of getting the timestamp and the
pk.

This is what I tried to explain with my statement "Incremental population.
Every row is extracted, timestamps are compared with
the last timestamp the last full or incremental population ended on and if
the timestamp is different the extracted row is reindexed. Rows which are no
longer in the table but in the catalog are removed from the catalog. Rows
which are not in the catalog are inserted.

Does this make sense? The cost of bringing the entire row is not much more
expensive from just bringing the timestamp so its not that inefficient."

Your response was "If i understand what you're staying - then someone the
SQL team needs a smack."

Am I missing something? Do I need to be smacked? Did I do a poor job of
explaining this?






--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/12/2007 2:40:55 PM
[quoted text, click to view]

No, the hit *is* much more. We will, of course, have a covering index on
pk, Timestamp

so that when the full-text indexer performs the query

[quoted text, click to view]

it only has to scan the index. This is much faster than having to scan every
row of the database.

After that, once we've found the few rows that have changed, it's much
better for the server to only fetch those specific few rows, rather than
querying every row in the table one. Hell, i'd prefer it if we didn't have
to issue a separate select for each modified row, but if we could use a
set-based operation

[quoted text, click to view]

Assuming there's less than a few hundred modified rows.


Do you disagree that having a convering index on pk,timestamp and only
querying for few dozen modified rows is better than querying for all rows in
a table, whether they are used are not - and not even as a set or a batch,
but one row at a time?


Assuming a 50,000,000 row table. If the pk is a 4-byte integer, and the
timestamp is a 4-byte ordinal, and you create a covering index on those two,
you can fit a little over 1000 rows in an 8k page. That's works out to about
50k pages of i/o. If there is then a dozen modified rows, and those rows are
randomly scattered around the database, that's 84 more page reads (assuming
b-tree traversal is log(n), with 6,250,000 pages), for a total of 50,084
pages of i/o.

On the other hand, using the current SQL Server incremental algorithm, we
have to scan every page in a table. Assume each row in the table is 1000
bytes wide. That's about 8 rows per page. That then requires 6,250,000 pages
of i/o to get the entire database.

50,084 pages to read
vs
6,250,000 pages to read

That's about 125x more i/o generated using the "brute force" method, over
than the finesse method.


So, this is where i come to the point in my head asking, "Why do i have to
explain this to the SQL Server team?"

[quoted text, click to view]

What we need to do is stop bringing in a whole row, just to get the pk and
timestamp; and only bring in the whole row when it is absoutely necessary.


Hilary Cotter
3/12/2007 11:35:35 PM
Microsoft does recommend putting an index on the timestamp column. Their
customers need the flexibility to create indexes and tables as they see fit.
Best practices might dictate using a covering index in some cases. For
example some application might require the dba not modify the table
structure in any way.

Rather than going around in circles with me you should go to connect and
make your suggestions there and have Microsoft respond to them.

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/13/2007 11:49:04 AM
[quoted text, click to view]

Where is this place?

Hilary Cotter
3/13/2007 12:23:04 PM
https://connect.microsoft.com/SQLServer/

--
Hilary Cotter

Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com



[quoted text, click to view]

Ian Boyd
3/22/2007 1:16:42 PM

[quoted text, click to view]

Checking on the status of my post, looks like they deleted it. Oh well,
maybe SQLServerX

AddThis Social Bookmark Button