SQL Server 2005 Full Text Search on HTML documents

SQL Server has a decent full text search engine (IMO), but if you have HTML data in your database, it can be tricky searching on it. For example, if users search on the word “strong” you don’t want to bring back data like “<strong>this text is emphasized</strong>”. Also, there were early problems with SQL 2000’s word breaker, in that it didn’t treat > or < as a word delimiter (this problem has since been resolved).

Since SQL 2005 is around I thought I’d throw out a few ways I’ve noticed to store & search on HTML data.

Just Store HTML data in a varchar or text column

First of all, you can go the simple route & store it in a varchar(max) or text column, and create a full text index on that column. 

The upsides

  • It’s a simple approach
  • FTS Change Tracking will track the values for varchar columns. If you’re using a text column, changes are tracked unless made via WRITETEXT and UPDATETEXT. That’s not much of an issue with SQL 2005, though, since WRITETEXT and UPDATETEXT are now deprecated.
  • It’s easy to update values in varchar or text columns
  • You’ll get matches on all the words

The downside

  • You’ll also get matches on words inside comments and HTML tags (e.g. “font”, “arial”, “body”)

So this might be a place to start for an 80/20 HTML search engine approach, and you could maybe treat words like “font” “td” etc as noise words so they’re ignored in searches. Not a perfect solution, though, especially if your users like to search on the word “title.”

Store HTML data in an XML Column

Now that SQL 2005 has an “XML” data type column, you can store your HTML data in that instead and search on it.

Upsides:

  • Search results won’t include tagnames, attribute names, or words within comments
  • Change tracking will track XML column changes

Downsides:

  • Could be hard to update values in the column, I don’t know how easy it is to programmatically interact with XML column data types
  • Your HTML needs to be well-formed. “< font > hey there </ font>” will return an error. “<font> hey there </font>” won’t.
  • Full Text Search won’t match on tag and attribute names (good), but will match on attribute values (bad). For example, if your data is “<font face=”Arial”>hi there</font>”, searching on “font” won’t return a match, but searching on “Arial” will.

Here’s a complete script to try out in your own database (SQL Server 2005 only). It creates a new database called “ftstest” and a table called “Test”.

create database ftstest
go
use ftstest
go
sp_fulltext_database ‘enable’
go
Create fulltext catalog FTSCatalog as default
go
CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description XML)
go

insert into Test (Title, Description) values
(‘some stuff goes here’, ‘<font face=”Arial”>test1</font>’)
go
insert into Test (Title, Description) values
(‘some second row’, ‘<font> test2 </font>’)
go
insert into Test (Title, Description) values
(‘some stuff here’, ‘<font> test3 foobar </font>’)
go
insert into Test (Title, Description) values
(‘some stuff here’, ‘<font>boogie</font>’)
go

CREATE FULLTEXT INDEX ON Test(Title, Description) KEY INDEX PK_Test
GO

— these queries return data
select * from Test where FREETEXT(*,‘stuff’)
select * from Test where FREETEXT(*,‘test1’)
select * from Test where FREETEXT(*,‘test2’)
select * from Test where FREETEXT(*,‘boogie’)
select * from Test where FREETEXT(*,‘Arial’)

— these don’t
select * from Test where FREETEXT(*,‘face’)
select * from Test where FREETEXT(*,‘font’)

Store the HTML in an Image Column

This is the old standby. SQL Server can automatically ignore HTML markup in search results if you store your HTML data in a column of the image data type. You also need a second column whose value (e.g. ‘htm’) indicates the type of data.

Upsides:

  • All HTML markup is ignored for searches (except for a questionable feature where if you have spaces around your tags like this “< strong >” the tagname will be included in the search results).
  • You can actually use this feature to store & perform FTS searches on other types of documents, like PPT, PDF, DOC, etc. So it’s good if you’re doing a document management system & need to search on not only HTML documents but other kinds, too.

Downsides:

  • You have to deal with updating Image data types, which can be a huge PITA. I really wish SQL supported this for varchar or text columns.

Here’s a sample script, in this case the DescriptionContentType column contains the value ‘htm’, telling SQL Server FTS that the Description column contains HTML data & that the indexer should use the HTML iFilter:

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description image,
DescriptionContentType char(3) default ‘htm’
)
go

CREATE FULLTEXT INDEX ON Test(Title, Description TYPE COLUMN DescriptionContenttype)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description) VALUES (‘hi’,‘<strong>test</strong>’)
go

SELECT * FROM Test WHERE CONTAINS(*,‘hi’)    — returns results
SELECT * FROM Test WHERE CONTAINS(*,‘test’)    — returns results
SELECT * FROM Test WHERE CONTAINS(*,‘strong’)    — no results

Use a Separate Keywords Column

This is a complex but common approach. Basically, you store your HTML in a varchar or text column, then strip out the HML markup & store the resulting text in a separate keyword column. You then perform searches on the keyword column. 

Upsides:

  • You get to avoid working with Image columns
  • HTML markup is avoided in search results
  • Change tracking will handle the keyword column (provided it’s varchar or text w/o using WRITETEXT)

Downsides

  • You need to find or write a function to strip HTML from your column (easy enough with RegEx)
  • Extra storage space is consumed since your storing a lot of the data twice
  • You have to maintain two columns for HTML data: the HTML column, and the keyword column. Thus more work, more risk of bugs, & possibly more confusion. Plus all code that interacts with that table may need to be aware of & correctly use the columns correctly.

Here’s a SQL blurb illustrating the concept.

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
DescriptionKeywords varchar(max)
)
go

CREATE FULLTEXT INDEX ON Test(Title, DescriptionKeywords)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description, DescriptionKeywords)
VALUES (‘hi’,‘<strong>test</strong>’, fnStripHtmlFromText(‘<strong>test</strong>’))
go

SELECT * FROM Test WHERE CONTAINS(*,‘hi’)    — returns results
SELECT * FROM Test WHERE CONTAINS(*,‘test’)    — returns results
SELECT * FROM Test WHERE CONTAINS(*,‘strong’)    — no results

Notice the fnStripHtmlFromText function — that’s the function you’d need to write to strip HTML from incoming data. For better protection, you could restrict access to the table to store procedures only, and only expose the Description column, like this:

CREATE PROCEDURE spInsertTest (
    @Title varchar(1000),
    @Description varchar(max)
)
AS
BEGIN
    INSERT INTO Test (Title, Description, DescriptionKeywords)
    VALUES (@Title,@Description, fnStripHtmlFromText(@Description))

    RETURN @@IDENTITY
END

Alternately, if you needed to use raw SQL instead of stored procedures, you could use INSERT and UPDATE triggers to maintain the DescriptionKeywords column, and your SQL could just interact with the Description column. Sorta like this:

CREATE TRIGGER dbo.tuTest
ON dbo.Test
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

— update keyword column with keywords from html column
    UPDATE Test SET
        DescriptionKeywords = fnStripHtmlFromText(i.Description)
        FROM Test t INNER JOIN inserted i ON t.ID = i.ID
END

Use a Separate Image Column

Like the separate keyword column solution, above, except that instead of parsing out the keywords, you just store a second copy of your HTML data in an Image column along with a content type column. The upside is you don’t need to write an HTML keyword parser, but the downside is your keywords are in an image column (which may be a non issue since you shouldn’t interact with it directly). Here’s a sample script

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
FTSDescription image,
FTSDescriptionContenttype char(3) default ‘htm’
)
go

CREATE FULLTEXT INDEX ON Test(Title, FTSDescription TYPE COLUMN FTSDescriptionContenttype)
KEY INDEX PK_Test
go

Conclusion

Well that was a long post. The solution depends on what your goals are, but I’d recommend architecting your application in such a way that if you start out with the first, simplest solution, you can enhance your system later to a more sophisticated implementation without breaking everything. That means you should ideally be interacting with your system either via store procedures or objects. Then if you need to change the underlying database schema in order to handle a better search feature, you can do it in your DAL and/or procedures.

Since I’m doing this for an existing project (building a light CMS), I’m personally leaning towards a separate keyword or separate image column approach. I don’t want to directly interact with Image columns at all if I can help it. 🙂

0