all groups > sql server full text search > february 2007 >
Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing" http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx It says regarding a full population: "MSSearch will issue sp_fulltext_getdata once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times. "On the other hand, if you use an incremental population...this statement will also be issued 50 million times. Why?" That's exactly what i'd like to know. He continues: "Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted." So why is the incremental populator not getting the id of the rows that have changed after the last timestamp? What is so hard about: SELECT PrimaryKeyColumn FROM SomeTable WHERE timestamp > 0000123456789 If full-population calls sp_fulltext_getdata for every row in the table, and incremental-population calls sp_fulltext_getdata for every for in the table, what is the difference between full and incremental populating? And while we're at it, can we please, for the love of everything holy, fix EM so i can manage a full-text catalog from my development. And can be please, for the love of everything holy, include some way to see what populating method and change tracking scheme the full text indexing is using. And can we please, for the love of everything holy, let EM show if FT Search is running on the server. And while we're at it, can be please, for the love of everything holy, let me manage change tracking from EM. And while we're at it, can we please, for the love of everything holy, document this stuff in BOL. And while we're at it, can we please, for the love of everything holy shut me up.
Hmmm........ this article looks familiar. Basically the indexer extracts the entire row to look at the value of the timestamp column to determine if it has been modified since the last time it looked at that row. If the timestamp column is modified the full-text indexed columns will be re-indexed. It also checks to see if new rows are added and then compares all the key and timestamp info returned from all the row with what is in the index to figure out what has been deleted. The problem with this query SELECT PrimaryKeyColumn FROM SomeTable WHERE timestamp > 0000123456789 is it doesn't detect deletes. A full-index blindly reinidexes everything. The incremental only indexes the rows which have been modified or newly inserted. So it does more bookkeeping than indexing and can't detect if the update was on a column which is not being full-text indexed. While this might sound inefficient, it actually is very efficient and reliable. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:%23lYLut4VHHA.3980@TK2MSFTNGP02.phx.gbl... > Reading the MS article "10 Ways to Optimize SQL Server Full-Text Indexing" > http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx > > It says regarding a full population: > > "MSSearch will issue sp_fulltext_getdata once for every row in your > table. So if you have 50 million rows in your database, this procedure > will be issued 50 million times. > > "On the other hand, if you use an incremental population...this > statement will also be issued 50 million times. Why?" > > > That's exactly what i'd like to know. He continues: > > "Because even with an incremental population, MSSearch must figure out > exactly which rows have been changed, updated, and deleted." > > > So why is the incremental populator not getting the id of the rows that > have changed after the last timestamp? What is so hard about: > > SELECT PrimaryKeyColumn > FROM SomeTable > WHERE timestamp > 0000123456789 > > > If full-population calls sp_fulltext_getdata for every row in the table, > and incremental-population calls sp_fulltext_getdata for every for in the > table, > what is the difference between full and incremental populating? > > > > > > > And while we're at it, can we please, for the love of everything holy, fix > EM so i can manage a full-text catalog from my development. And can be > please, for the love of everything holy, include some way to see what > populating method and change tracking scheme the full text indexing is > using. And can we please, for the love of everything holy, let EM show if > FT Search is running on the server. And while we're at it, can be please, > for the love of everything holy, let me manage change tracking from EM. > And while we're at it, can we please, for the love of everything holy, > document this stuff in BOL. And while we're at it, can we please, for the > love of everything holy shut me up. > > >
[quoted text, click to view] > is it doesn't detect deletes.
What if we call sp_fulltext_getdata CatalogID, object_id Which returns a list of ALL rows in the table, and their timestamps. Any rows in our full-text catalog, but not in the returned list, have been deleted. Then, we call sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID only for those rows that that have a new timestamp than our largest. So we get the best of both worlds. We find deletes, and we only scan rows changed that have changed since our last incremental scan.
Yes, you can do this, just keep in mind that each time you fire this proc you reset an internal counter saying this is the highest timestamp, so each time you fire it MSSearch starts thinking that the timestamp you generated by firing this proc manually is its reference point and it can miss modified rows. Basically when fire sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID yourself it throws away the values, in other words only when MSSearch fires this proc does the row get reindexed. So, I really don't recommend it. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:OcmPnIfWHHA.2284@TK2MSFTNGP03.phx.gbl... >> is it doesn't detect deletes. > > What if we call > sp_fulltext_getdata CatalogID, object_id > > Which returns a list of ALL rows in the table, and their timestamps. Any > rows in our full-text catalog, but not in the returned list, have been > deleted. > > Then, we call > sp_fulltext_getdata CatalogID, object_id, RowPrimayKeyID > > only for those rows that that have a new timestamp than our largest. > > > So we get the best of both worlds. We find deletes, and we only scan rows > changed that have changed since our last incremental scan. >
[quoted text, click to view] > Yes, you can do this, just keep in mind that each time you fire this proc > you reset an internal counter saying this is the highest timestamp, so > each time you fire it MSSearch starts thinking that the timestamp you > generated by firing this proc manually is its reference point and it can > miss modified rows. > > Basically when fire sp_fulltext_getdata CatalogID, object_id, > RowPrimayKeyID yourself it throws away the values, in other words only > when MSSearch fires this proc does the row get reindexed.
i wasn't meaning me. i meant why isn't SQL Server doing this since it's faster and more efficient?
I believe it is, it first fires the proc to get a list of what has change and updates the time stamp, then it gathers those rows and indexes them, and then it checks to see what has been deleted. Why do you think it works differently? -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:%23XFDBGBXHHA.4188@TK2MSFTNGP06.phx.gbl... >> Yes, you can do this, just keep in mind that each time you fire this proc >> you reset an internal counter saying this is the highest timestamp, so >> each time you fire it MSSearch starts thinking that the timestamp you >> generated by firing this proc manually is its reference point and it can >> miss modified rows. >> >> Basically when fire sp_fulltext_getdata CatalogID, object_id, >> RowPrimayKeyID yourself it throws away the values, in other words only >> when MSSearch fires this proc does the row get reindexed. > > i wasn't meaning me. i meant why isn't SQL Server doing this since it's > faster and more efficient? >
[quoted text, click to view] > Why do you think it works differently?
From the article "10 Ways to Optimize SQL Server Full-Text Indexing" http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx "MSSearch will issue sp_fulltext_getdata once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times. "On the other hand, if you use an incremental population...this statement will also be issued 50 million times." Perhaps it's a typo, and was supposed to read: "On the other hand, if you use an incremental population...this statement will NOT be issued 50 million times, but rather will only be executed once for each modified row in the table. Normally there will be only one or two rows, rather than 50 million."
Nope, its not a typo. The row has to be extracted to get the timestamp for that column. Consider an you kick off your indexing at 12:00, it extracts that row at 12:50. Now, suppose that row has modified since 12:00. The indexer has to know whether to index that row or not. The 12:00 row set says it doesn't need to, but the Indexer needs a more up to date value, hence it grabs the most recent timestamp. Make sense? -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <admin@SWIFTPA.NET> wrote in message news:uWW6IKVXHHA.2640@TK2MSFTNGP06.phx.gbl... >> Why do you think it works differently? > > From the article "10 Ways to Optimize SQL Server Full-Text Indexing" > http://msdn2.microsoft.com/en-us/library/aa175787(sql.80).aspx > > "MSSearch will issue sp_fulltext_getdata once for every row in your > table. So if you have 50 million rows in your database, this procedure > will > be issued 50 million times. > > "On the other hand, if you use an incremental population...this > statement will also be issued 50 million times." > > Perhaps it's a typo, and was supposed to read: > "On the other hand, if you use an incremental population...this > statement will NOT be issued 50 million times, but rather will only be > executed once for each modified row in the table. Normally there will be > only one or two rows, rather than 50 million." >
[quoted text, click to view] > Nope, its not a typo. The row has to be extracted to get the timestamp for > that column.
Consider the indexing kicks off at 12:00. It gets a lit of all primary keys in the table, and that row's timestamp: LastTimestamp PrimaryKey ========= ======= 30043 1 32364 2 30044 3 .... 32363 49584682 etc. Any rows not in this list get purged from the FullText catalog. Next the full-text indexer know that when it last crawled the table, the timestamp was at 32363. So it can search through the above list, and only fetch rows that have a newer timestamp. Timestamp PrimaryKey ======= ======= 32364 3 And so rather than having to pull the contents of 50million rows, we only pull the contents of one row. That's a speed improvement by a factor of 50 million! And if a row was modified during the 2 seconds it took to do our incremental scan, we can just scan again - since we've sped up the incremental indexing by a factor of 50 million, we can afford to scan more frequently. Perhaps even as quickly as every 6 or 7 seconds. [quoted text, click to view] > Consider an you kick off your indexing at 12:00, it extracts that row at > 12:50. Now, suppose that row has modified since 12:00. The indexer has to > know whether to index that row or not. The 12:00 row set says it doesn't > need to, but the Indexer needs a more up to date value, hence it grabs the > most recent timestamp.
This sounds like a full-scan, where we scan every row no matter what. i'm talking about an incremental full-text catalog build. Why does incremental not do as i've described? Everyone who comes into SQL Server Full-Text Indexing assumes it does an incremental scan - and then cannot figure out why it's scanning the whole table. If you're trying to tell me that incremental and full do the same thing (i.e. scan every row in the table), then i return to my original question: What is incremental?
but the timestamp retrieved is the value of the timestamp when the first list was obtained. It then needs to retrieve the timestamp again to see if it has changed since the first full-list retrieval. I suspect the first full-list is to get the crawl seeds and more importantly to figure out what to delete. Consider this case. You get the highest value timestamp and compare it with the highest value timestamp from the last run. You only index those rows. What about the rows which have modified while you are indexing this delta. By only doing this delta you miss those ones. You might want to make your case on connect to see what the office MS answer is. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:eQthS31XHHA.1244@TK2MSFTNGP04.phx.gbl... >> Nope, its not a typo. The row has to be extracted to get the timestamp >> for that column. > > Consider the indexing kicks off at 12:00. It gets a lit of all primary > keys in the table, and that row's timestamp: > > LastTimestamp PrimaryKey > ========= ======= > 30043 1 > 32364 2 > 30044 3 > ... > 32363 49584682 > etc. > > Any rows not in this list get purged from the FullText catalog. > > Next the full-text indexer know that when it last crawled the table, the > timestamp was at 32363. So it can search through the above list, and only > fetch rows that have a newer timestamp. > > Timestamp PrimaryKey > ======= ======= > 32364 3 > > And so rather than having to pull the contents of 50million rows, we only > pull the contents of one row. That's a speed improvement by a factor of > 50 million! > > And if a row was modified during the 2 seconds it took to do our > incremental scan, we can just scan again - since we've sped up the > incremental indexing by a factor of 50 million, we can afford to scan more > frequently. Perhaps even as quickly as every 6 or 7 seconds. > > > >> Consider an you kick off your indexing at 12:00, it extracts that row at >> 12:50. Now, suppose that row has modified since 12:00. The indexer has to >> know whether to index that row or not. The 12:00 row set says it doesn't >> need to, but the Indexer needs a more up to date value, hence it grabs >> the most recent timestamp. > > This sounds like a full-scan, where we scan every row no matter what. i'm > talking about an incremental full-text catalog build. Why does > incremental not do as i've described? Everyone who comes into SQL Server > Full-Text Indexing assumes it does an incremental scan - and then cannot > figure out why it's scanning the whole table. > > If you're trying to tell me that incremental and full do the same thing > (i.e. scan every row in the table), then i return to my original question: > What is incremental? >
[quoted text, click to view] > but the timestamp retrieved is the value of the timestamp when the first > list was obtained. It then needs to retrieve the timestamp again to see if > it has changed since the first full-list retrieval. I suspect the first > full-list is to get the crawl seeds and more importantly to figure out > what to delete. > > Consider this case. You get the highest value timestamp and compare it > with the highest value timestamp from the last run. You only index those > rows. What about the rows which have modified while you are indexing this > delta. By only doing this delta you miss those ones.
You're describing a solution to a problem that doesn't need to be solved. At the very least full repopulation doesn't solve this problem either. Consider full-repopulation: 12:00:00pm: Get list of all rows and their timestamps. Only row 37983 has changed 12:00:02pm: Crawl row 1, in case it changed since we got the list 12:00:02pm: Crawl row 2, in case it changed since we got the list ..... 12:49:37pm: Crawl row 37983, because we know it changed 12:49:37pm: Row 3 changes - after we've crawled it 12:49:37pm: Ccrawl row 37984, in case it changed since we got the list .... 6:13:31pm: Crawl row 50123456, in case it changed since we got the list 6:13:31pm: Crawl complete. But we missed the change to row 3. The full-repopulation STILL misses the update. So i'll ask you: "What about the rows which have modified while you are indexing this delta?" Does the full rebuild algorithm perform endless passes until no rows have changed during the full scan? No, there would never be an end. This is an accepted limitation of the algorithm, and any changes we'll pick up the next time a full repopulation is run. What's more, because a full repopulation is so unnecessarily lengthy, it essentially guarantees that rows will change during the full re-scan - and that they will get missed. So, my original question: If an increment rebuild is the same as a full rebuild, what's the difference between incremental and full? Or, more simply, "What is incremental?" Now consider incremental-repopulation as everyone assumes it works: 12:00:00pm: Get list of all rows and their timetamps. Only row 37983 has changed 12:00:02pm: Row 3 changes 12:00:02pm: Crawl row 23983, because it has changed 12:00:02pm: Crawl complete In both examples we miss a change to a row. But in the ideal incremental rebuild algorithm, the window where that change could have happened is much smaller (on the order of 50 million times smaller). And since we've already accepted that if row changes after we've examined it, then it will just have to get picked up next time the incremental (or full) population is run. Additionally, since the incremental population only took 2 seconds, we can run it much more frequently; getting updates in near realtime. The difference is: the incremental algorithm is faster, with less load on the server, and row changes making it into the full text catalog faster. At least, that's what every MSSQL dba assumes the incremental repopulation does. Until they begin seeing high load on their server, and they post into m.p.s.f asking what SQL Server is doing. My collary question is: why isn't SQL Server doing this already? There's the "full" algorithm. Is there any other algorithm? Which brings me to my primary question: Is there an "incremental" algorithm of some sort? If yes, what does the incremental repopulation algorithm do differently from the "full"? What is incremental? According to an article on TechNet, there's no difference. So then, why the two names? [quoted text, click to view] > You might want to make your case on connect to see what the offial MS > answer is.
i don't know what connect is. Nor do i believe i would get an answer.
let me try to explain this again. Full population. Every row is extracted and indexed. Incremental population. Every row is extracted, timestamps are compared with the last timestamp the last full or incremental population ended on and if the timestamp is different the extracted row is reindexed. Rows which are no longer in the table but in the catalog are removed from the catalog. Rows which are not in the catalog are inserted. Does this make sense? The cost of bringing the entire row is not much more expensive from just bringing the timestamp so its not that inefficient. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <admin@SWIFTPA.NET> wrote in message news:uZzgeMUYHHA.588@TK2MSFTNGP06.phx.gbl... >> but the timestamp retrieved is the value of the timestamp when the first >> list was obtained. It then needs to retrieve the timestamp again to see >> if it has changed since the first full-list retrieval. I suspect the >> first full-list is to get the crawl seeds and more importantly to figure >> out what to delete. >> >> Consider this case. You get the highest value timestamp and compare it >> with the highest value timestamp from the last run. You only index those >> rows. What about the rows which have modified while you are indexing this >> delta. By only doing this delta you miss those ones. > > > You're describing a solution to a problem that doesn't need to be solved. > At the very least full repopulation doesn't solve this problem either. > > Consider full-repopulation: > > 12:00:00pm: Get list of all rows and their timestamps. Only row 37983 has > changed > 12:00:02pm: Crawl row 1, in case it changed since we got the list > 12:00:02pm: Crawl row 2, in case it changed since we got the list > .... > 12:49:37pm: Crawl row 37983, because we know it changed > 12:49:37pm: Row 3 changes - after we've crawled it > 12:49:37pm: Ccrawl row 37984, in case it changed since we got the list > ... > 6:13:31pm: Crawl row 50123456, in case it changed since we got the list > 6:13:31pm: Crawl complete. > > But we missed the change to row 3. The full-repopulation STILL misses the > update. So i'll ask you: "What about the rows which have modified while > you are indexing this delta?" > > Does the full rebuild algorithm perform endless passes until no rows have > changed during the full scan? No, there would never be an end. This is an > accepted limitation of the algorithm, and any changes we'll pick up the > next time a full repopulation is run. What's more, because a full > repopulation is so unnecessarily lengthy, it essentially guarantees that > rows will change during the full re-scan - and that they will get missed. > > > So, my original question: If an increment rebuild is the same as a full > rebuild, what's the difference between incremental and full? Or, more > simply, "What is incremental?" > > > Now consider incremental-repopulation as everyone assumes it works: > > 12:00:00pm: Get list of all rows and their timetamps. Only row 37983 has > changed > 12:00:02pm: Row 3 changes > 12:00:02pm: Crawl row 23983, because it has changed > 12:00:02pm: Crawl complete > > In both examples we miss a change to a row. But in the ideal incremental > rebuild algorithm, the window where that change could have happened is > much smaller (on the order of 50 million times smaller). And since we've > already accepted that if row changes after we've examined it, then it will > just have to get picked up next time the incremental (or full) population > is run. Additionally, since the incremental population only took 2 > seconds, we can run it much more frequently; getting updates in near > realtime. > > > > The difference is: the incremental algorithm is faster, with less load on > the server, and row changes making it into the full text catalog faster. > At least, that's what every MSSQL dba assumes the incremental repopulation > does. Until they begin seeing high load on their server, and they post > into m.p.s.f asking what SQL Server is doing. > > My collary question is: why isn't SQL Server doing this already? There's > the "full" algorithm. Is there any other algorithm? Which brings me to my > primary question: > > Is there an "incremental" algorithm of some sort? If yes, what does the > incremental repopulation algorithm do differently from the "full"? What is > incremental? > > According to an article on TechNet, there's no difference. So then, why > the two names? > > >> You might want to make your case on connect to see what the offial MS >> answer is. > i don't know what connect is. Nor do i believe i would get an answer. >
[quoted text, click to view] > Full population. Every row is extracted and indexed. > > Incremental population. Every row is extracted, timestamps are compared > with the last timestamp the last full or incremental population ended on > and if the timestamp is different the extracted row is reindexed. Rows > which are no longer in the table but in the catalog are removed from the > catalog. Rows which are not in the catalog are inserted.
If i understand what you're staying - then someone the SQL team needs a smack. Both "full" and "incremental" both scan scan every row, one by one, in the database, whether the row has changed or not. Why, in the name of god, would they not have an actual "incremental" scan? How is this not a no-brainer? [quoted text, click to view] > Does this make sense? The cost of bringing the entire row is not much more > expensive from just bringing the timestamp so its not that inefficient.
i disagree. Put an index on PK, Timestamp. Fetching every row in the database, one by one; it's like using a cursor to do everything. There's so much to be gained by using SQL Server to do what it's good at: operate on sets. My god, they really should add an "incremental" update option.
[quoted text, click to view] > Isn't the difference that incremental looks to see if the record has > changed and reindex if required. With full it just reindexes the record.
i see what he's trying to say. Incremental offers no performance improvement over full, because it still reads every single row in the database, one by one, not even as a set of rows.
some of the reasons are that not every table is guaranteed to have a timestamp column on it, nor can Microsoft guarantee that their user's won't update the pk's. Suppose you update a pk on your table. How is the indexer supposed to know what row the data in the catalog belongs to? Microsoft can't count on their users being as sophisticated as you obviously are. As Simon mentions change tracking is designed to provide real time indexing. Full-populations are designed when the majority of your data changes at one time. Incremental is designed when large portions of your data changes at one time. Keep in mind that the MSSearch engine provided search services to Exchange Content Indexing, MS Office search, Site Server Search and SharePoint Portal Search. They needed a consistent crawl or population method for all population/crawl types. In SQL 2005 the crawl mechanism has been totally re-written as a database only indexer. Still incremental population remains as it is the fastest crawl/population type in some cases. Its the bookkeeping that MSSearch must do in order to reconcile what is in the catalog with what is in the table which makes the extraction all the rows in the base table essential. This crawl algorithm is also present in Sybase and Oracle last time I looked. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:ueMtIVcYHHA.3772@TK2MSFTNGP06.phx.gbl... >> Full population. Every row is extracted and indexed. >> >> Incremental population. Every row is extracted, timestamps are compared >> with the last timestamp the last full or incremental population ended on >> and if the timestamp is different the extracted row is reindexed. Rows >> which are no longer in the table but in the catalog are removed from the >> catalog. Rows which are not in the catalog are inserted. > > If i understand what you're staying - then someone the SQL team needs a > smack. > > Both "full" and "incremental" both scan scan every row, one by one, in the > database, whether the row has changed or not. Why, in the name of god, > would they not have an actual "incremental" scan? How is this not a > no-brainer? > >> Does this make sense? The cost of bringing the entire row is not much >> more expensive from just bringing the timestamp so its not that >> inefficient. > > i disagree. Put an index on PK, Timestamp. > > Fetching every row in the database, one by one; it's like using a cursor > to do everything. There's so much to be gained by using SQL Server to do > what it's good at: operate on sets. > > > My god, they really should add an "incremental" update option. >
Hello Ian, Isn't the difference that incremental looks to see if the record has changed and reindex if required. With full it just reindexes the record. Simon Sabin SQL Server MVP http://sqlblogcasts.com/blogs/simons [quoted text, click to view] >> but the timestamp retrieved is the value of the timestamp when the >> first list was obtained. It then needs to retrieve the timestamp >> again to see if it has changed since the first full-list retrieval. I >> suspect the first full-list is to get the crawl seeds and more >> importantly to figure out what to delete. >> >> Consider this case. You get the highest value timestamp and compare >> it with the highest value timestamp from the last run. You only index >> those rows. What about the rows which have modified while you are >> indexing this delta. By only doing this delta you miss those ones. >> > You're describing a solution to a problem that doesn't need to be > solved. At the very least full repopulation doesn't solve this problem > either. > > Consider full-repopulation: > > 12:00:00pm: Get list of all rows and their timestamps. Only row 37983 > has > changed > 12:00:02pm: Crawl row 1, in case it changed since we got the list > 12:00:02pm: Crawl row 2, in case it changed since we got the list > .... > 12:49:37pm: Crawl row 37983, because we know it changed > 12:49:37pm: Row 3 changes - after we've crawled it > 12:49:37pm: Ccrawl row 37984, in case it changed since we got the list > ... > 6:13:31pm: Crawl row 50123456, in case it changed since we got the > list > 6:13:31pm: Crawl complete. > But we missed the change to row 3. The full-repopulation STILL misses > the update. So i'll ask you: "What about the rows which have modified > while you are indexing this delta?" > > Does the full rebuild algorithm perform endless passes until no rows > have changed during the full scan? No, there would never be an end. > This is an accepted limitation of the algorithm, and any changes we'll > pick up the next time a full repopulation is run. What's more, because > a full repopulation is so unnecessarily lengthy, it essentially > guarantees that rows will change during the full re-scan - and that > they will get missed. > > So, my original question: If an increment rebuild is the same as a > full rebuild, what's the difference between incremental and full? Or, > more simply, "What is incremental?" > > Now consider incremental-repopulation as everyone assumes it works: > > 12:00:00pm: Get list of all rows and their timetamps. Only row 37983 > has > changed > 12:00:02pm: Row 3 changes > 12:00:02pm: Crawl row 23983, because it has changed > 12:00:02pm: Crawl complete > In both examples we miss a change to a row. But in the ideal > incremental rebuild algorithm, the window where that change could have > happened is much smaller (on the order of 50 million times smaller). > And since we've already accepted that if row changes after we've > examined it, then it will just have to get picked up next time the > incremental (or full) population is run. Additionally, since the > incremental population only took 2 seconds, we can run it much more > frequently; getting updates in near realtime. > > The difference is: the incremental algorithm is faster, with less load > on the server, and row changes making it into the full text catalog > faster. At least, that's what every MSSQL dba assumes the incremental > repopulation does. Until they begin seeing high load on their server, > and they post into m.p.s.f asking what SQL Server is doing. > > My collary question is: why isn't SQL Server doing this already? > There's the "full" algorithm. Is there any other algorithm? Which > brings me to my primary question: > > Is there an "incremental" algorithm of some sort? If yes, what does > the incremental repopulation algorithm do differently from the "full"? > What is incremental? > > According to an article on TechNet, there's no difference. So then, > why the two names? > >> You might want to make your case on connect to see what the offial MS >> answer is. >> > i don't know what connect is. Nor do i believe i would get an answer. >
Both full and incremental extract each row. Incremental will not index each row, but full will. Still full can be faster than incremental as it doesn't have to do any bookeeping or index maintenance. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:%23z2hPWcYHHA.1220@TK2MSFTNGP03.phx.gbl... >> Isn't the difference that incremental looks to see if the record has >> changed and reindex if required. With full it just reindexes the record. > > i see what he's trying to say. > > Incremental offers no performance improvement over full, because it still > reads every single row in the database, one by one, not even as a set of > rows. >
Hello Ian, Thats what change tracking is for. Simon Sabin SQL Server MVP http://sqlblogcasts.com/blogs/simons [quoted text, click to view] >> Full population. Every row is extracted and indexed. >> >> Incremental population. Every row is extracted, timestamps are >> compared with the last timestamp the last full or incremental >> population ended on and if the timestamp is different the extracted >> row is reindexed. Rows which are no longer in the table but in the >> catalog are removed from the catalog. Rows which are not in the >> catalog are inserted. >> > If i understand what you're staying - then someone the SQL team needs > a smack. > > Both "full" and "incremental" both scan scan every row, one by one, in > the database, whether the row has changed or not. Why, in the name of > god, would they not have an actual "incremental" scan? How is this not > a no-brainer? > >> Does this make sense? The cost of bringing the entire row is not much >> more expensive from just bringing the timestamp so its not that >> inefficient. >> > i disagree. Put an index on PK, Timestamp. > > Fetching every row in the database, one by one; it's like using a > cursor to do everything. There's so much to be gained by using SQL > Server to do what it's good at: operate on sets. > > My god, they really should add an "incremental" update option. >
[quoted text, click to view] > some of the reasons are that not every table is guaranteed to have a > timestamp column on it
i'd be fine with a timestamp being required to to FT indexing; like some forms of replication need a "IsUniqueIdentifier" row. [quoted text, click to view] > nor can Microsoft guarantee that their user's won't update the pk's.
You upate the PK, but don't forget that the timestamp updates with it. We'll get the "new" values associated with that row next run. [quoted text, click to view] > As Simon mentions change tracking is designed to provide real time > indexing. Full-populations are designed when the majority of your data > changes at one time. Incremental is designed when large portions of your > data changes at one time.
But both heavily impact SQL Server to the same degree. [quoted text, click to view] > Keep in mind that the MSSearch engine provided search services to Exchange > Content Indexing, MS Office search, Site Server Search and SharePoint > Portal Search. They needed a consistent crawl or population method for all > population/crawl types.
It's an implementation detail that Microsoft SQL Server implements it's FullText search ability as a separate process. Integrate it into the engine, and it's not a problem. If they need to choose a different implementation: i'm fine with that. [quoted text, click to view] > In SQL 2005 the crawl mechanism has been totally re-written as a database > only indexer. Still incremental population remains as it is the fastest > crawl/population type in some cases.
If this goes toward my previous statement then that's excellent news. [quoted text, click to view] > Its the bookkeeping that MSSearch must do in order to reconcile what is in > the catalog with what is in the table which makes the extraction all the > rows in the base table essential.
See points 1 and 2 [quoted text, click to view] > This crawl algorithm is also present in Sybase and Oracle last time I > looked.
Which means there is room for improvement and innovation on Microsoft's part to make a superior product.
[quoted text, click to view] > some of the reasons are that not every table is guaranteed to have a > timestamp column on it
From the BOL, regarding "incremental" population: Incremental population "...This feature requires that the indexed table have a column of the timestamp data type." Is there anything i'm missing?
Comments in line. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:OzIImZlYHHA.3256@TK2MSFTNGP04.phx.gbl... >> some of the reasons are that not every table is guaranteed to have a >> timestamp column on it > > i'd be fine with a timestamp being required to to FT indexing; like some > forms of replication need a "IsUniqueIdentifier" row.
The unique identifier used in replication is used to track which rows have changed in which tables and which servers. It does not have a timestamp component. While you could use it for a timestamp it is too wide. [quoted text, click to view] > >> nor can Microsoft guarantee that their user's won't update the pk's. > > You upate the PK, but don't forget that the timestamp updates with it. > We'll get the "new" values associated with that row next run.
That's right you update the pk, the timestamp tell the indexer that it has been updated since the last time, but the key information is gone. So last run it had a key value of 1 and the catalog knows that X number of words in the index belong to row 1, but now the value of row 1 is 13424. How does it reconcile this to remove these enteries and update them with the correct new pk value? [quoted text, click to view] > >> As Simon mentions change tracking is designed to provide real time >> indexing. Full-populations are designed when the majority of your data >> changes at one time. Incremental is designed when large portions of your >> data changes at one time. > > But both heavily impact SQL Server to the same degree.
Not always. The extraction is typically less expensive than the indexing, especially when you are dealing with blobs and Far Eastern content. [quoted text, click to view] > >> Keep in mind that the MSSearch engine provided search services to >> Exchange Content Indexing, MS Office search, Site Server Search and >> SharePoint Portal Search. They needed a consistent crawl or population >> method for all population/crawl types. > > It's an implementation detail that Microsoft SQL Server implements it's > FullText search ability as a separate process. Integrate it into the > engine, and it's not a problem. If they need to choose a different > implementation: i'm fine with that. >
In SQL 2005 the indexer is in process. [quoted text, click to view] >> In SQL 2005 the crawl mechanism has been totally re-written as a database >> only indexer. Still incremental population remains as it is the fastest >> crawl/population type in some cases. > > If this goes toward my previous statement then that's excellent news. > >> Its the bookkeeping that MSSearch must do in order to reconcile what is >> in the catalog with what is in the table which makes the extraction all >> the rows in the base table essential. > > See points 1 and 2
Please see my answers to them:) [quoted text, click to view] > >> This crawl algorithm is also present in Sybase and Oracle last time I >> looked. > > Which means there is room for improvement and innovation on Microsoft's > part to make a superior product.
Which means you better go looking to smack the entire search engine developers, not just the MS ones. [quoted text, click to view] > >
That is 100% correct, but I wish I have a cent for every dba who has run an incremental population on a table without a timestamp and had a full population performed. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:%23zqUfclYHHA.4264@TK2MSFTNGP05.phx.gbl... >> some of the reasons are that not every table is guaranteed to have a >> timestamp column on it > > From the BOL, regarding "incremental" population: > > Incremental population > "...This feature requires that the indexed table have a column of the > timestamp data type." > > > Is there anything i'm missing? >
[quoted text, click to view] >> i'd be fine with a timestamp being required to to FT indexing; like some >> forms of replication need a "IsUniqueIdentifier" row. > >While you could use it for a timestamp it is too wide.
No, i was saying that's there's already a precident for mandating a column's presence in a table to be able to use a feature. So there's no problem in mandating a timestamp be present in order to use incremental population [quoted text, click to view] > That's right you update the pk, the timestamp tell the indexer that it has > been updated since the last time, but the key information is gone. So last > run it had a key value of 1 and the catalog knows that X number of words > in the index belong to row 1, but now the value of row 1 is 13424. How > does it reconcile this to remove these enteries and update them with the > correct new pk value?
Okay, i guess i need to have a *detailed* example to explain how this works. Assume we want to catalog a simple Customers table CREATE TABLE Customers ( pk int NOT NULL PRIMARY KEY, Firstname varchar(25), Lastname varchar(25), [timestamp] timestamp) Now, assume this table has been pre-filled with two rows: pk Firstname Lastname timestamp 1 Ian Boyd 970001 2 Hillary Cotter 970002 Incremental Population Nº1 ===================== Let's do our "initial" incremental population. We record the last timestamp every time we run the incremental population. Since this is the first time an incremental population has been run, we assume a "LastTimestamp" of zero. The first step in performing an incremental population is to get every Primary key value, and it's timestamp. NOTE: Perhaps there is some special stored procedure that can be used to get this list; perhaps being called sp_fulltext_getdata. We won't use that for my algorithm, we're use something more clear - for the purposes of explaining the incremental population algorithm. [quoted text, click to view] > SELECT pk, timestamp FROM Customers
pk timestamp 1 970001 2 970002 2 row(s) affected These are all the rows the table contains. Any entries in our full-text catalog that do not exist in this list need to be removed from the full-text catalog; since they have been deleted from the source database. In our example, this is the first time the full-text indexer has been run on our Customers table, so the full-text catalog is empty. Next, we scan the recordset, looking for any rows with a timestamp greater than zero. We find two rows: pk timestamp 1 970001 2 970002 So, there are two rows that need to go into the full-text catalog. The full-text indexer issues a statement to get the contents of the Customers table for primary key value 1. NOTE: This can be some special SQL Server stored procedure (perhaps named sp_fulltext_getdata) that is used to get the values of columns for a specific row. We're going to do it the obvious way; for the purposes of explaining this algorithm [quoted text, click to view] > SELECT * FROM Customers WHERE pk = 1
pk Firstname Lastname timestamp 1 Ian Boyd 970001 1 row(s) affected The full-text application now takes those values, and parses them into whatever format it uses and stores them wherever it stores them in the catalog. Now it get's the next row: [quoted text, click to view] > SELECT * FROM Customers WHERE pk=2
pk Firstname Lastname timestamp 2 Hillary Cotter 970002 1 rows(s) affected And now those values are put into the full-text catalog. And the catalog is up to date. Now let's say someone modifies a row: [quoted text, click to view] > UPDATE Customers SET Firstname = 'Hilary' WHERE pk = 2
1 rows(s) affected Incremental Population Nº2 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97002. It's first step is to find all rows in the table, and their timestamps [quoted text, click to view] > SELECT pk, timestamp FROM Customers
pk timestamp 1 970001 2 970003 2 row(s) affected First we delete any rows from our catalog that are not in the database. Since both rows are still there, we proceed to the next phase. Look through the recordset and find rows that have a timestamp greater than 970002. There is one row that matches: pk timestamp 2 970003 So the indexer fetches the contents of that row: [quoted text, click to view] > SELECT * FROM Customers WHERE pk=2
pk Firstname Lastname timestamp 2 Hilary Cotter 970002 1 rows(s) affected And now those values are put into the full-text catalog. And the catalog is up to date. Now let's say someone updates the primary key in the table [quoted text, click to view] > UPDATE Customers SET pk = 3 WHERE pk = 2
Incremental Population Nº3 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97003. It's first step is to find all rows in the table, and their timestamps [quoted text, click to view] > SELECT pk, timestamp FROM Customers
pk timestamp 1 970001 3 970004 2 row(s) affected First we delete any rows from our catalog that are not in the database. But now we see that our full-text catalog has data for a row with pk=2, but that row doesn't exist in the database. So rows with pk=2 is purged from the full-text catalog, and we proceed to the next phase. Look through the recordset and find rows that have a timestamp greater than 970003. There is one row that matches: pk timestamp 3 970004 So the indexer fetches the contents of that row: [quoted text, click to view] > SELECT * FROM Customers WHERE pk=3
pk Firstname Lastname timestamp 3 Hilary Cotter 970004 1 rows(s) affected And now those values are put into the full-text catalog. And the catalog is up to date. Now let's say someone is really devious and swaps the PK values of rows 1 and 3 [quoted text, click to view] >UPDATE Customers SET pk = 4 WHERE pk = 1 1 rows(s) affected > UPDATE Customers SETpk = 1 WHERE pk = 3 1 row(s) affected > UPDATE Customers SET pk = 3 WHERE pk = 4
1 row(s) affected Incremental Population Nº4 ===================== Now, the incremental population algorithm is started up again. The last timestamp in the table when it was run 97004. It's first step is to find all rows in the table, and their timestamps [quoted text, click to view] > SELECT pk, timestamp FROM Customers
pk timestamp 1 970007 3 970006 2 row(s) affected First we delete any rows from our catalog that are not in the database. All rows in our catalog exist in the database, so we proceed to the next step. The last timestamp when we ran was 970004, so we look for rows that have a newer timestamp, and we find two rows: pk timestamp 1 970007 3 970006 So the indexer fetches the contents of the the first row: [quoted text, click to view] > SELECT * FROM Customers WHERE pk=1
pk Firstname Lastname timestamp 1 Hilary Cotter 970007 1 row(s) affected
[quoted text, click to view] > That is 100% correct, but I wish I have a cent for every dba who has run > an incremental population on a table without a timestamp and had a full > population performed.
The interface, or the server, should not allow it to happen. It should be an error. Or perhaps: | | Cannot perform an incremental population on a table without a timestamp. | | Would you like to perform a full population instead? | | Yes No | The former is just poor user interface design (which i take it is fixed in SQL2005 Management Studio.
Ian you are describing how an incremental population runs with one exception. So the indexer fetches the contents of that row: SQL Server has to poll the entire table to find out what is in it. It then compares this with what the fulltext catalog has. It then can detect what is deleted. An update can be considered a delete followed by an insert. Your method describes how to detect what is new (inserts and updates), but it doesn't describe a method to determine what is deleted. Can you help me to understand how your method detects deletions? To quote you "First we delete any rows from our catalog that are not in the database." and "Any entries in our full-text catalog that do not exist in this list need to be removed from the full-text catalog; since they have been deleted from the source database" and "First we delete any rows from our catalog that are not in the database." and again 'First we delete any rows from our catalog that are not in the database. " And one more time "First we delete any rows from our catalog that are not in the database. " Exactly how does the indexer know what is deleted knowing only what is new? The algorithm that the crawl mechanism in MSSearch uses is to say send me all the pks and timestamps and I'll compare this list with what is in the catalog. This way I detect what is new (in a similar manner to what you describe), and also what is to be deleted. Am I missing something? -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <admin@SWIFTPA.NET> wrote in message news:OCrVVk2YHHA.4520@TK2MSFTNGP06.phx.gbl... >>> i'd be fine with a timestamp being required to to FT indexing; like some >>> forms of replication need a "IsUniqueIdentifier" row. >> >>While you could use it for a timestamp it is too wide. > > No, i was saying that's there's already a precident for mandating a > column's presence in a table to be able to use a feature. So there's no > problem in mandating a timestamp be present in order to use incremental > population > >> That's right you update the pk, the timestamp tell the indexer that it >> has been updated since the last time, but the key information is gone. So >> last run it had a key value of 1 and the catalog knows that X number of >> words in the index belong to row 1, but now the value of row 1 is 13424. >> How does it reconcile this to remove these enteries and update them with >> the correct new pk value? > > Okay, i guess i need to have a *detailed* example to explain how this > works. > > Assume we want to catalog a simple Customers table > > CREATE TABLE Customers ( > pk int NOT NULL PRIMARY KEY, > Firstname varchar(25), > Lastname varchar(25), > [timestamp] timestamp) > > Now, assume this table has been pre-filled with two rows: > > pk Firstname Lastname timestamp > 1 Ian Boyd 970001 > 2 Hillary Cotter 970002 > > > Incremental Population Nº1 > ===================== > Let's do our "initial" incremental population. We record the last > timestamp every time we run the incremental population. Since this is the > first time an incremental population has been run, we assume a > "LastTimestamp" of zero. > > The first step in performing an incremental population is to get every > Primary key value, and it's timestamp. > > NOTE: Perhaps there is some special stored procedure that can be used to > get this list; perhaps being called sp_fulltext_getdata. We won't use that > for my algorithm, we're use something more clear - for the purposes of > explaining the incremental population algorithm. > > >> SELECT pk, timestamp FROM Customers > > pk timestamp > 1 970001 > 2 970002 > 2 row(s) affected > > These are all the rows the table contains. Any entries in our full-text > catalog that do not exist in this list need to be removed from the > full-text catalog; since they have been deleted from the source database. > In our example, this is the first time the full-text indexer has been run > on our Customers table, so the full-text catalog is empty. > > Next, we scan the recordset, looking for any rows with a timestamp greater > than zero. We find two rows: > > pk timestamp > 1 970001 > 2 970002 > > So, there are two rows that need to go into the full-text catalog. The > full-text indexer issues a statement to get the contents of the Customers > table for primary key value 1. > > NOTE: This can be some special SQL Server stored procedure (perhaps named > sp_fulltext_getdata) that is used to get the values of columns for a > specific row. We're going to do it the obvious way; for the purposes of > explaining this algorithm > >> SELECT * FROM Customers WHERE pk = 1 > > pk Firstname Lastname timestamp > 1 Ian Boyd 970001 > 1 row(s) affected > > The full-text application now takes those values, and parses them into > whatever format it uses and stores them wherever it stores them in the > catalog. Now it get's the next row: > >> SELECT * FROM Customers WHERE pk=2 > > pk Firstname Lastname timestamp > 2 Hillary Cotter 970002 > 1 rows(s) affected > > And now those values are put into the full-text catalog. And the catalog > is up to date. > > > > Now let's say someone modifies a row: > >> UPDATE Customers SET Firstname = 'Hilary' WHERE pk = 2 > 1 rows(s) affected > > Incremental Population Nº2 > ===================== > Now, the incremental population algorithm is started up again. The last > timestamp in the table when it was run 97002. It's first step is to find > all rows in the table, and their timestamps > >> SELECT pk, timestamp FROM Customers > pk timestamp > 1 970001 > 2 970003 > 2 row(s) affected > > First we delete any rows from our catalog that are not in the database. > Since both rows are still there, we proceed to the next phase. Look > through the recordset and find rows that have a timestamp greater than > 970002. There is one row that matches: > > pk timestamp > 2 970003 > > So the indexer fetches the contents of that row: > >> SELECT * FROM Customers WHERE pk=2 > > pk Firstname Lastname timestamp > 2 Hilary Cotter 970002 > 1 rows(s) affected > > And now those values are put into the full-text catalog. And the catalog > is up to date. > > > > Now let's say someone updates the primary key in the table > >> UPDATE Customers SET pk = 3 WHERE pk = 2 > > > Incremental Population Nº3 > ===================== > Now, the incremental population algorithm is started up again. The last > timestamp in the table when it was run 97003. It's first step is to find > all rows in the table, and their timestamps >
[quoted text, click to view] > An update can be considered a delete followed by an insert. > > Your method describes how to detect what is new (inserts and updates), but > it doesn't describe a method to determine what is deleted. > > Can you help me to understand how your method detects deletions?
Let's say our table and our catalog are currently in sync Contents on the Customers table: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002 NOTE: i'll now store first and last names in the FirstName field, so that we can see words being broken up. NOTE: i do not know the format that is used to store the full-text catalog; so i'll have to make one up. Contents of the FullTextCatalog table: Table Key Column keyword Customers 1 Firstname Ian Customers 1 Firstname Alexander Customers 1 Lastname Boyd Customers 2 Firstname Hilary Customers 2 Firstname Nathan Customers 2 Lastname Cotter And the last time the incremental population ran the timestamp in the Customers table was 970002. The full-text indexer also must store this: Contents of the Full-text catalog LastTimestamps table: Table LastTimestamp Customers 970002 Everything is setup. Now we delete a row: [quoted text, click to view] >DELETE FROM Customers WHERE pk = 2
1 row(s) affected Now there is only one row in the customer's table: [quoted text, click to view] >SELECT * FROM Customers
pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 1 row(s) affected Now the incremental indexer comes along. First thing is does it ask for all key-timestamp pairs: [quoted text, click to view] >SELECT pk, timestamp FROM Customers
pk timestamp 1 970001 1 row(s) affected So your question is: How does the indexer know that a row has been deleted? Specifically, how does it know that row 2 has been deleted? Looking through the full-text catalog, we find all rows that we've previously indexed: [quoted text, click to view] >SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers'
pk 1 2 2 row(s) affected. Our full-text catalog contains values from two rows. But in the list from the database that we fetched a moment ago there is only one row: [quoted text, click to view] >SELECT pk, timestamp FROM Customers
pk timestamp 1 970001 1 row(s) affected Obviously row 2, that exists in our full-text catalog, has been delete from the source database, and must be deleted from our full-text catalog. [quoted text, click to view] >DELETE FROM FullTextCatalog WHERE Table='Customers' >AND Key = 2
3 row(s) affected This is how we detect deletes. Perhaps now you're wanting to contrive a more pessimistic case. Let's take the original values: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002 and delete Hilary, and switch Ian to pk value #2, resulting in: pk Firstname Lastname 2 Ian Alexander Boyd So now we've deleted a row, but also moved another row on top of it. How do we handle this? Well, starting from: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 2 Hilary Nathan Cotter 970002 let's do the SQL to make the change: [quoted text, click to view] >DELETE FROM Customers WHERE pk = 2
1 row(s) affected Giving us: pk Firstname Lastname timestamp 1 Ian Alexander Boyd 970001 [quoted text, click to view] >UPDATE Customers SET pk = 2 WHERE pk = 1
1 row(s) affected Giving us: pk Firstname Lastname timestamp 2 Ian Alexander Boyd 970003 Now the incremental indexing starts, and it's first step is to get a list of all pk-timestamp pairs: [quoted text, click to view] >SELECT pk, timestamp FROM Customers
pk timestamp 2 970003 1 row(s) affected Now we know that a row has been deleted, but how does the full-text indexer know? It looks through it's full-text catalog: [quoted text, click to view] >SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE TableName='Customers'
pk 1 2 2 row(s) affected. The catalog has two rows, but the current able only has one. Specifically, row with pk=1 is no longer in the source database, while it is still in our catalog. So let's remove that row from the catalog: [quoted text, click to view] >DELETE FROM FullTextCatalog >WHERE TableName = 'Customers' >AND Key = 1
1 row(s) affected "Ah-hah!", you say. "How does it know that row 2 is not the same row 2 in the catalog?" From the timestamps. The indexer stores the last timestamp that was in a table the last time in ran: [quoted text, click to view] >SELECT * FROM LastTiemeStamps >WHERE Table = 'Customers'
Table LastTimestamp Customers 970002 1 row(s) affected. So any rows that have a timestamp greater than 970002 need to be re-populated. Seeking through the list in memory that we already retreived (which i'll reproduce here for convience): pk timestamp 2 970003 indicates that one row has been modified. So the full-text indexer throws away any data in it's catalog for Key=2, and puts in it's place values from the current row of pk=2. Makes sense?
What you describe is exactly how the indexer works. It gets stores the last highest timestamp form the last run. It uses this as the basis to find out what is new. It then extracts each pk and timestamp to determine what to delete and by extension what has been updated. It also extracts the content of the row at that time as the expense of this hit is not much more than the cost of getting the timestamp and the pk. This is what I tried to explain with my statement "Incremental population. Every row is extracted, timestamps are compared with the last timestamp the last full or incremental population ended on and if the timestamp is different the extracted row is reindexed. Rows which are no longer in the table but in the catalog are removed from the catalog. Rows which are not in the catalog are inserted. Does this make sense? The cost of bringing the entire row is not much more expensive from just bringing the timestamp so its not that inefficient." Your response was "If i understand what you're staying - then someone the SQL team needs a smack." Am I missing something? Do I need to be smacked? Did I do a poor job of explaining this? -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:%23TqWhrLZHHA.348@TK2MSFTNGP02.phx.gbl... >> An update can be considered a delete followed by an insert. >> >> Your method describes how to detect what is new (inserts and updates), >> but it doesn't describe a method to determine what is deleted. >> >> Can you help me to understand how your method detects deletions? > > Let's say our table and our catalog are currently in sync > > Contents on the Customers table: > pk Firstname Lastname timestamp > 1 Ian Alexander Boyd 970001 > 2 Hilary Nathan Cotter 970002 > > NOTE: i'll now store first and last names in the FirstName field, so that > we can see words being broken up. > NOTE: i do not know the format that is used to store the full-text > catalog; so i'll have to make one up. > > Contents of the FullTextCatalog table: > Table Key Column keyword > Customers 1 Firstname Ian > Customers 1 Firstname Alexander > Customers 1 Lastname Boyd > Customers 2 Firstname Hilary > Customers 2 Firstname Nathan > Customers 2 Lastname Cotter > > And the last time the incremental population ran the timestamp in the > Customers table was 970002. The full-text indexer also must store this: > > Contents of the Full-text catalog LastTimestamps table: > Table LastTimestamp > Customers 970002 > > > Everything is setup. Now we delete a row: > >>DELETE FROM Customers WHERE pk = 2 > 1 row(s) affected > > Now there is only one row in the customer's table: > >>SELECT * FROM Customers > pk Firstname Lastname timestamp > 1 Ian Alexander Boyd 970001 > 1 row(s) affected > > > > Now the incremental indexer comes along. First thing is does it ask for > all key-timestamp pairs: >>SELECT pk, timestamp FROM Customers > pk timestamp > 1 970001 > 1 row(s) affected > > > > So your question is: How does the indexer know that a row has been > deleted? Specifically, how does it know that row 2 has been deleted? > > Looking through the full-text catalog, we find all rows that we've > previously indexed: > >>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE >>TableName='Customers' > pk > 1 > 2 > 2 row(s) affected. > > Our full-text catalog contains values from two rows. But in the list from > the database that we fetched a moment ago there is only one row: > >>SELECT pk, timestamp FROM Customers > pk timestamp > 1 970001 > 1 row(s) affected > > Obviously row 2, that exists in our full-text catalog, has been delete > from the source database, and must be deleted from our full-text catalog. > >>DELETE FROM FullTextCatalog WHERE Table='Customers' >>AND Key = 2 > 3 row(s) affected > > > > > This is how we detect deletes. > > > > > Perhaps now you're wanting to contrive a more pessimistic case. Let's take > the original values: > > pk Firstname Lastname timestamp > 1 Ian Alexander Boyd 970001 > 2 Hilary Nathan Cotter 970002 > > and delete Hilary, and switch Ian to pk value #2, resulting in: > > pk Firstname Lastname > 2 Ian Alexander Boyd > > So now we've deleted a row, but also moved another row on top of it. How > do we handle this? > > Well, starting from: > pk Firstname Lastname timestamp > 1 Ian Alexander Boyd 970001 > 2 Hilary Nathan Cotter 970002 > > let's do the SQL to make the change: > >>DELETE FROM Customers WHERE pk = 2 > 1 row(s) affected > > Giving us: > pk Firstname Lastname timestamp > 1 Ian Alexander Boyd 970001 > >>UPDATE Customers SET pk = 2 WHERE pk = 1 > 1 row(s) affected > > Giving us: > pk Firstname Lastname timestamp > 2 Ian Alexander Boyd 970003 > > > Now the incremental indexing starts, and it's first step is to get a list > of all pk-timestamp pairs: > >>SELECT pk, timestamp FROM Customers > pk timestamp > 2 970003 > 1 row(s) affected > > Now we know that a row has been deleted, but how does the full-text > indexer know? It looks through it's full-text catalog: > > >>SELECT DISTINCT(Key) AS pk FROM FullTextCatalog WHERE >>TableName='Customers' > pk > 1 > 2 > 2 row(s) affected. > > The catalog has two rows, but the current able only has one. Specifically, > row with pk=1 is no longer in the source database, while it is still in > our catalog. So let's remove that row from the catalog: > >>DELETE FROM FullTextCatalog >>WHERE TableName = 'Customers' >>AND Key = 1 > 1 row(s) affected > > > "Ah-hah!", you say. "How does it know that row 2 is not the same row 2 in > the catalog?" From the timestamps. The indexer stores the last timestamp > that was in a table the last time in ran: > >>SELECT * FROM LastTiemeStamps >>WHERE Table = 'Customers' > Table LastTimestamp > Customers 970002 > 1 row(s) affected. > > So any rows that have a timestamp greater than 970002 need to be > re-populated. Seeking through the list in memory that we already retreived > (which i'll reproduce here for convience): > > pk timestamp > 2 970003 > > indicates that one row has been modified. > > So the full-text indexer throws away any data in it's catalog for Key=2, > and puts in it's place values from the current row of pk=2. > > > > > Makes sense? >
[quoted text, click to view] "Hilary Cotter" <hilary.cotter@gmail.com> wrote in message news:uPNzrJMZHHA.4620@TK2MSFTNGP05.phx.gbl... > What you describe is exactly how the indexer works. It gets stores the > last highest timestamp form the last run. > > It uses this as the basis to find out what is new. It then extracts each > pk and timestamp to determine what to delete and by extension what has > been updated. It also extracts the content of the row at that time as the > expense of this hit is not much more than the cost of getting the > timestamp and the pk.
No, the hit *is* much more. We will, of course, have a covering index on pk, Timestamp so that when the full-text indexer performs the query [quoted text, click to view] >SELECT pk,timestamp FROM Customers
it only has to scan the index. This is much faster than having to scan every row of the database. After that, once we've found the few rows that have changed, it's much better for the server to only fetch those specific few rows, rather than querying every row in the table one. Hell, i'd prefer it if we didn't have to issue a separate select for each modified row, but if we could use a set-based operation [quoted text, click to view] >SELECT * FROM Customers >WHERE TimeStamp > 970002
Assuming there's less than a few hundred modified rows. Do you disagree that having a convering index on pk,timestamp and only querying for few dozen modified rows is better than querying for all rows in a table, whether they are used are not - and not even as a set or a batch, but one row at a time? Assuming a 50,000,000 row table. If the pk is a 4-byte integer, and the timestamp is a 4-byte ordinal, and you create a covering index on those two, you can fit a little over 1000 rows in an 8k page. That's works out to about 50k pages of i/o. If there is then a dozen modified rows, and those rows are randomly scattered around the database, that's 84 more page reads (assuming b-tree traversal is log(n), with 6,250,000 pages), for a total of 50,084 pages of i/o. On the other hand, using the current SQL Server incremental algorithm, we have to scan every page in a table. Assume each row in the table is 1000 bytes wide. That's about 8 rows per page. That then requires 6,250,000 pages of i/o to get the entire database. 50,084 pages to read vs 6,250,000 pages to read That's about 125x more i/o generated using the "brute force" method, over than the finesse method. So, this is where i come to the point in my head asking, "Why do i have to explain this to the SQL Server team?" [quoted text, click to view] > Does this make sense? The cost of bringing the entire row is not much more > expensive from just bringing the timestamp so its not that inefficient."
What we need to do is stop bringing in a whole row, just to get the pk and timestamp; and only bring in the whole row when it is absoutely necessary.
Microsoft does recommend putting an index on the timestamp column. Their customers need the flexibility to create indexes and tables as they see fit. Best practices might dictate using a covering index in some cases. For example some application might require the dba not modify the table structure in any way. Rather than going around in circles with me you should go to connect and make your suggestions there and have Microsoft respond to them. -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:uFqp5XNZHHA.4620@TK2MSFTNGP05.phx.gbl... > "Hilary Cotter" <hilary.cotter@gmail.com> wrote in message > news:uPNzrJMZHHA.4620@TK2MSFTNGP05.phx.gbl... >> What you describe is exactly how the indexer works. It gets stores the >> last highest timestamp form the last run. >> >> It uses this as the basis to find out what is new. It then extracts each >> pk and timestamp to determine what to delete and by extension what has >> been updated. It also extracts the content of the row at that time as the >> expense of this hit is not much more than the cost of getting the >> timestamp and the pk. > > No, the hit *is* much more. We will, of course, have a covering index on > pk, Timestamp > > so that when the full-text indexer performs the query > >>SELECT pk,timestamp FROM Customers > > it only has to scan the index. This is much faster than having to scan > every row of the database. > > After that, once we've found the few rows that have changed, it's much > better for the server to only fetch those specific few rows, rather than > querying every row in the table one. Hell, i'd prefer it if we didn't have > to issue a separate select for each modified row, but if we could use a > set-based operation > >>SELECT * FROM Customers >>WHERE TimeStamp > 970002 > > Assuming there's less than a few hundred modified rows. > > > Do you disagree that having a convering index on pk,timestamp and only > querying for few dozen modified rows is better than querying for all rows > in a table, whether they are used are not - and not even as a set or a > batch, but one row at a time? > > > Assuming a 50,000,000 row table. If the pk is a 4-byte integer, and the > timestamp is a 4-byte ordinal, and you create a covering index on those > two, you can fit a little over 1000 rows in an 8k page. That's works out > to about 50k pages of i/o. If there is then a dozen modified rows, and > those rows are randomly scattered around the database, that's 84 more page > reads (assuming b-tree traversal is log(n), with 6,250,000 pages), for a > total of 50,084 pages of i/o. > > On the other hand, using the current SQL Server incremental algorithm, we > have to scan every page in a table. Assume each row in the table is 1000 > bytes wide. That's about 8 rows per page. That then requires 6,250,000 > pages of i/o to get the entire database. > > 50,084 pages to read > vs > 6,250,000 pages to read > > That's about 125x more i/o generated using the "brute force" method, over > than the finesse method. > > > So, this is where i come to the point in my head asking, "Why do i have to > explain this to the SQL Server team?" > >> Does this make sense? The cost of bringing the entire row is not much >> more expensive from just bringing the timestamp so its not that >> inefficient." > > What we need to do is stop bringing in a whole row, just to get the pk and > timestamp; and only bring in the whole row when it is absoutely necessary. > > >
[quoted text, click to view] > Microsoft does recommend putting an index on the timestamp column. Their > customers need the flexibility to create indexes and tables as they see > fit. Best practices might dictate using a covering index in some cases. > For example some application might require the dba not modify the table > structure in any way. > > Rather than going around in circles with me you should go to connect and > make your suggestions there and have Microsoft respond to them.
Where is this place?
https://connect.microsoft.com/SQLServer/ -- Hilary Cotter Looking for a SQL Server replication book? http://www.nwsu.com/0974973602.html Looking for a FAQ on Indexing Services/SQL FTS http://www.indexserverfaq.com [quoted text, click to view] "Ian Boyd" <ian.msnews010@avatopia.com> wrote in message news:u5p8hcYZHHA.4396@TK2MSFTNGP06.phx.gbl... >> Microsoft does recommend putting an index on the timestamp column. Their >> customers need the flexibility to create indexes and tables as they see >> fit. Best practices might dictate using a covering index in some cases. >> For example some application might require the dba not modify the table >> structure in any way. >> >> Rather than going around in circles with me you should go to connect and >> make your suggestions there and have Microsoft respond to them. > > Where is this place? > >
Don't see what you're looking for? Try a search.
|
|
|