SQL Server has a decent full text search engine (IMO), but if you have HTML data in your database, it can be tricky searching on it. For example, if users search on the word "strong" you don't want to bring back data like "<strong>this text is emphasized</strong>". Also, there were early problems with SQL 2000's word breaker, in that it didn't treat > or < as a word delimiter (this problem has since been resolved).
Since SQL 2005 is around I thought I'd throw out a few ways I've noticed to store & search on HTML data.
First of all, you can go the simple route & store it in a varchar(max) or text column, and create a full text index on that column.
The upsides
The downside
So this might be a place to start for an 80/20 HTML search engine approach, and you could maybe treat words like "font" "td" etc as noise words so they're ignored in searches. Not a perfect solution, though, especially if your users like to search on the word "title."
Now that SQL 2005 has an "XML" data type column, you can store your HTML data in that instead and search on it.
Upsides:
Downsides:
Here's a complete script to try out in your own database (SQL Server 2005 only). It creates a new database called "ftstest" and a table called "Test".
create database ftstestgouse ftstestgosp_fulltext_database 'enable'goCreate fulltext catalog FTSCatalog as defaultgoCREATE TABLE Test (ID int not null identity constraint PK_Test primary key, Title varchar(1000), Description XML)goinsert into Test (Title, Description) values('some stuff goes here', '<font face="Arial">test1</font>')goinsert into Test (Title, Description) values('some second row', '<font> test2 </font>')goinsert into Test (Title, Description) values('some stuff here', '<font> test3 foobar </font>')goinsert into Test (Title, Description) values('some stuff here', '<font>boogie</font>')goCREATE FULLTEXT INDEX ON Test(Title, Description) KEY INDEX PK_TestGO-- these queries return dataselect * from Test where FREETEXT(*,'stuff')select * from Test where FREETEXT(*,'test1')select * from Test where FREETEXT(*,'test2')select * from Test where FREETEXT(*,'boogie')select * from Test where FREETEXT(*,'Arial')-- these don'tselect * from Test where FREETEXT(*,'face')select * from Test where FREETEXT(*,'font')
This is the old standby. SQL Server can automatically ignore HTML markup in search results if you store your HTML data in a column of the image data type. You also need a second column whose value (e.g. 'htm') indicates the type of data.
Here's a sample script, in this case the DescriptionContentType column contains the value 'htm', telling SQL Server FTS that the Description column contains HTML data & that the indexer should use the HTML iFilter:
CREATE TABLE Test (ID int not null identity constraint PK_Test primary key, Title varchar(1000), Description image,DescriptionContentType char(3) default 'htm')goCREATE FULLTEXT INDEX ON Test(Title, Description TYPE COLUMN DescriptionContenttype) KEY INDEX PK_TestgoINSERT INTO Test (Title, Description) VALUES ('hi','<strong>test</strong>')goSELECT * FROM Test WHERE CONTAINS(*,'hi') -- returns resultsSELECT * FROM Test WHERE CONTAINS(*,'test') -- returns resultsSELECT * FROM Test WHERE CONTAINS(*,'strong') -- no results
This is a complex but common approach. Basically, you store your HTML in a varchar or text column, then strip out the HML markup & store the resulting text in a separate keyword column. You then perform searches on the keyword column.
Downsides
Here's a SQL blurb illustrating the concept.
CREATE TABLE Test (ID int not null identity constraint PK_Test primary key, Title varchar(1000), Description varchar(max),DescriptionKeywords varchar(max))goCREATE FULLTEXT INDEX ON Test(Title, DescriptionKeywords) KEY INDEX PK_TestgoINSERT INTO Test (Title, Description, DescriptionKeywords) VALUES ('hi','<strong>test</strong>', fnStripHtmlFromText('<strong>test</strong>'))goSELECT * FROM Test WHERE CONTAINS(*,'hi') -- returns resultsSELECT * FROM Test WHERE CONTAINS(*,'test') -- returns resultsSELECT * FROM Test WHERE CONTAINS(*,'strong') -- no results
Notice the fnStripHtmlFromText function -- that's the function you'd need to write to strip HTML from incoming data. For better protection, you could restrict access to the table to store procedures only, and only expose the Description column, like this:
CREATE PROCEDURE spInsertTest ( @Title varchar(1000), @Description varchar(max))ASBEGIN INSERT INTO Test (Title, Description, DescriptionKeywords) VALUES (@Title,@Description, fnStripHtmlFromText(@Description)) RETURN @@IDENTITYEND
Alternately, if you needed to use raw SQL instead of stored procedures, you could use INSERT and UPDATE triggers to maintain the DescriptionKeywords column, and your SQL could just interact with the Description column. Sorta like this:
CREATE TRIGGER dbo.tuTest ON dbo.Test AFTER UPDATEAS BEGIN SET NOCOUNT ON;-- update keyword column with keywords from html column UPDATE Test SET DescriptionKeywords = fnStripHtmlFromText(i.Description) FROM Test t INNER JOIN inserted i ON t.ID = i.IDEND
Like the separate keyword column solution, above, except that instead of parsing out the keywords, you just store a second copy of your HTML data in an Image column along with a content type column. The upside is you don't need to write an HTML keyword parser, but the downside is your keywords are in an image column (which may be a non issue since you shouldn't interact with it directly). Here's a sample script
CREATE TABLE Test (ID int not null identity constraint PK_Test primary key, Title varchar(1000), Description varchar(max),FTSDescription image,FTSDescriptionContenttype char(3) default 'htm')goCREATE FULLTEXT INDEX ON Test(Title, FTSDescription TYPE COLUMN FTSDescriptionContenttype) KEY INDEX PK_Testgo
Well that was a long post. The solution depends on what your goals are, but I'd recommend architecting your application in such a way that if you start out with the first, simplest solution, you can enhance your system later to a more sophisticated implementation without breaking everything. That means you should ideally be interacting with your system either via store procedures or objects. Then if you need to change the underlying database schema in order to handle a better search feature, you can do it in your DAL and/or procedures.
Since I'm doing this for an existing project (building a light CMS), I'm personally leaning towards a separate keyword or separate image column approach. I don't want to directly interact with Image columns at all if I can help it. :)
Powered by: newtelligence dasBlog 2.0.7226.0
Disclaimer The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
© Copyright 2008, Ben Strackany
E-mail