DevelopmentNow Blog
 Saturday, August 12, 2006
 
 

SQL Server has a decent full text search engine (IMO), but if you have HTML data in your database, it can be tricky searching on it. For example, if users search on the word "strong" you don't want to bring back data like "<strong>this text is emphasized</strong>". Also, there were early problems with SQL 2000's word breaker, in that it didn't treat > or < as a word delimiter (this problem has since been resolved).

Since SQL 2005 is around I thought I'd throw out a few ways I've noticed to store & search on HTML data.

Just Store HTML data in a varchar or text column

First of all, you can go the simple route & store it in a varchar(max) or text column, and create a full text index on that column. 

The upsides

  • It's a simple approach
  • FTS Change Tracking will track the values for varchar columns. If you're using a text column, changes are tracked unless made via WRITETEXT and UPDATETEXT. That's not much of an issue with SQL 2005, though, since WRITETEXT and UPDATETEXT are now deprecated.
  • It's easy to update values in varchar or text columns
  • You'll get matches on all the words

The downside

  • You'll also get matches on words inside comments and HTML tags (e.g. "font", "arial", "body")

So this might be a place to start for an 80/20 HTML search engine approach, and you could maybe treat words like "font" "td" etc as noise words so they're ignored in searches. Not a perfect solution, though, especially if your users like to search on the word "title."

Store HTML data in an XML Column

Now that SQL 2005 has an "XML" data type column, you can store your HTML data in that instead and search on it.

Upsides:

  • Search results won't include tagnames, attribute names, or words within comments
  • Change tracking will track XML column changes

Downsides:

  • Could be hard to update values in the column; I don't know how easy it is to programmatically interact with XML column data types
  • Your HTML needs to be well-formed. "< font > hey there </ font>" will return an error. "<font> hey there </font>" won't.
  • Full Text Search won't match on tag and attribute names (good), but will match on attribute values (bad). For example, if your data is "<font face="Arial">hi there</font>", searching on "font" won't return a match, but searching on "Arial" will.

Here's a complete script to try out in your own database (SQL Server 2005 only). It creates a new database called "ftstest" and a table called "Test".

create database ftstest
go
use ftstest
go
sp_fulltext_database 'enable'
go
Create fulltext catalog FTSCatalog as default
go
CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description XML)
go

insert into Test (Title, Description) values
('some stuff goes here', '<font face="Arial">test1</font>')
go
insert into Test (Title, Description) values
('some second row', '<font> test2 </font>')
go
insert into Test (Title, Description) values
('some stuff here', '<font> test3 foobar </font>')
go
insert into Test (Title, Description) values
('some stuff here', '<font>boogie</font>')
go

CREATE FULLTEXT INDEX ON Test(Title, Description) KEY INDEX PK_Test
GO

-- these queries return data
select * from Test where FREETEXT(*,'stuff')
select * from Test where FREETEXT(*,'test1')
select * from Test where FREETEXT(*,'test2')
select * from Test where FREETEXT(*,'boogie')
select * from Test where FREETEXT(*,'Arial')

-- these don't
select * from Test where FREETEXT(*,'face')
select * from Test where FREETEXT(*,'font')
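
On the "hard to update" downside: the xml data type in SQL Server 2005 has modify() and value() methods, so small edits don't necessarily mean rewriting the whole column. A minimal sketch against the Test table from the script above. The replacement text 'test1 updated' is only an illustration, and the sketch assumes the default Management Studio SET options (e.g. QUOTED_IDENTIFIER ON), which the xml methods require:

```sql
-- Replace the text inside the <font> element of row 1 (XML DML)
UPDATE Test
SET Description.modify('replace value of (/font/text())[1] with "test1 updated"')
WHERE ID = 1
go

-- Read the text content back out as plain varchar, without the markup
SELECT ID, Description.value('(/font/text())[1]', 'varchar(100)') AS DescriptionText
FROM Test
go
```

Real-world HTML rarely has a single root element like these test rows, so the XPath expressions would need to match your own markup.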

Store the HTML in an Image Column

This is the old standby. SQL Server can automatically ignore HTML markup in search results if you store your HTML data in a column of the image data type. You also need a second column whose value (e.g. 'htm') indicates the type of data.

Upsides:

  • All HTML markup is ignored for searches (except for a questionable feature where if you have spaces around your tags like this "< strong >" the tagname will be included in the search results).
  • You can actually use this feature to store & perform FTS searches on other types of documents, like PPT, PDF, DOC, etc. So it's good if you're doing a document management system & need to search on not only HTML documents but other kinds, too.

Downsides:

  • You have to deal with updating Image data types, which can be a huge PITA. I really wish SQL supported this for varchar or text columns.

Here's a sample script, in this case the DescriptionContentType column contains the value 'htm', telling SQL Server FTS that the Description column contains HTML data & that the indexer should use the HTML iFilter:

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description image,
DescriptionContentType char(3) default 'htm'
)
go

CREATE FULLTEXT INDEX ON Test(Title, Description TYPE COLUMN DescriptionContentType)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description) VALUES ('hi','<strong>test</strong>')
go

SELECT * FROM Test WHERE CONTAINS(*,'hi')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'test')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'strong')    -- no results
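
If the Image data type itself is the main pain point, SQL Server 2005 also allows a varbinary(max) column to be full-text indexed with a TYPE COLUMN, and varbinary(max) can be inserted and updated with ordinary statements. A rough sketch, assuming a hypothetical table named TestVarbinary that otherwise mirrors the script above (not tested with large documents or non-ASCII text):

```sql
-- TestVarbinary is a made-up name; the layout mirrors the Image example
CREATE TABLE TestVarbinary (
ID int not null identity constraint PK_TestVarbinary primary key,
Title varchar(1000),
Description varbinary(max),
DescriptionContentType char(3) default 'htm'
)
go

CREATE FULLTEXT INDEX ON TestVarbinary(Title, Description TYPE COLUMN DescriptionContentType)
KEY INDEX PK_TestVarbinary
go

-- plain INSERT and UPDATE work; the HTML is cast to binary on the way in
INSERT INTO TestVarbinary (Title, Description)
VALUES ('hi', CAST('<strong>test</strong>' AS varbinary(max)))
go

UPDATE TestVarbinary
SET Description = CAST('<strong>changed</strong>' AS varbinary(max))
WHERE ID = 1
go
```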

Use a Separate Keywords Column

This is a complex but common approach. Basically, you store your HTML in a varchar or text column, then strip out the HTML markup & store the resulting text in a separate keyword column. You then perform searches on the keyword column.

Upsides:

  • You get to avoid working with Image columns
  • HTML markup is avoided in search results
  • Change tracking will handle the keyword column (provided it's varchar or text w/o using WRITETEXT)

Downsides:

  • You need to find or write a function to strip HTML from your column (easy enough with RegEx)
  • Extra storage space is consumed since you're storing a lot of the data twice
  • You have to maintain two columns for HTML data: the HTML column, and the keyword column. Thus more work, more risk of bugs, & possibly more confusion. Plus all code that interacts with that table may need to be aware of & correctly use both columns.

Here's a SQL blurb illustrating the concept.

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
DescriptionKeywords varchar(max)
)
go

CREATE FULLTEXT INDEX ON Test(Title, DescriptionKeywords)
KEY INDEX PK_Test
go

INSERT INTO Test (Title, Description, DescriptionKeywords)
VALUES ('hi','<strong>test</strong>', dbo.fnStripHtmlFromText('<strong>test</strong>'))
go

SELECT * FROM Test WHERE CONTAINS(*,'hi')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'test')    -- returns results
SELECT * FROM Test WHERE CONTAINS(*,'strong')    -- no results

Notice the fnStripHtmlFromText function -- that's the function you'd need to write to strip HTML from incoming data. For better protection, you could restrict access to the table to stored procedures only, and only expose the Description column, like this:

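CREATE PROCEDURE spInsertTest (
    @Title varchar(1000),
    @Description varchar(max)
)
AS
BEGIN
    -- scalar functions must be called with their schema name (dbo.)
    INSERT INTO Test (Title, Description, DescriptionKeywords)
    VALUES (@Title,@Description, dbo.fnStripHtmlFromText(@Description))

    -- SCOPE_IDENTITY() ignores identity values generated inside triggers, unlike @@IDENTITY
    RETURN SCOPE_IDENTITY()
END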

Alternately, if you needed to use raw SQL instead of stored procedures, you could use INSERT and UPDATE triggers to maintain the DescriptionKeywords column, and your SQL could just interact with the Description column. Sorta like this:

CREATE TRIGGER dbo.tuTest
ON dbo.Test
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- update keyword column with keywords from html column
    -- (an AFTER INSERT trigger would use the same statement)
    UPDATE t SET
        DescriptionKeywords = dbo.fnStripHtmlFromText(i.Description)
        FROM Test t INNER JOIN inserted i ON t.ID = i.ID
END
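
The fnStripHtmlFromText function used in this section has to exist before any of the scripts will run, and scalar functions have to be called with their schema name (dbo.fnStripHtmlFromText). T-SQL has no built-in RegEx, so the sketch below swaps in a plain CHARINDEX/STUFF loop that replaces everything between a "<" and the next ">" with a space. It's a minimal version under those assumptions: it doesn't decode entities like &amp;, and it keeps the text inside comments, script and style blocks.

```sql
CREATE FUNCTION dbo.fnStripHtmlFromText (@Html varchar(max))
RETURNS varchar(max)
AS
BEGIN
    DECLARE @Start bigint, @End bigint

    -- find the first tag: "<" and the next ">" after it
    SET @Start = CHARINDEX('<', @Html)
    SET @End = CHARINDEX('>', @Html, @Start)

    WHILE @Start > 0 AND @End > 0
    BEGIN
        -- swap the tag for a space so neighboring words don't run together
        SET @Html = STUFF(@Html, @Start, @End - @Start + 1, ' ')
        SET @Start = CHARINDEX('<', @Html)
        SET @End = CHARINDEX('>', @Html, @Start)
    END

    RETURN LTRIM(RTRIM(@Html))
END
go

SELECT dbo.fnStripHtmlFromText('<strong>test</strong>')    -- returns 'test'
```

A stray "<" with no closing ">" simply ends the loop, so the rest of the text is left as-is.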

Use a Separate Image Column

Like the separate keyword column solution, above, except that instead of parsing out the keywords, you just store a second copy of your HTML data in an Image column along with a content type column. The upside is you don't need to write an HTML keyword parser, but the downside is your keywords are in an image column (which may be a non issue since you shouldn't interact with it directly). Here's a sample script:

CREATE TABLE Test (
ID int not null identity constraint PK_Test primary key,
Title varchar(1000),
Description varchar(max),
FTSDescription image,
FTSDescriptionContenttype char(3) default 'htm'
)
go

CREATE FULLTEXT INDEX ON Test(Title, FTSDescription TYPE COLUMN FTSDescriptionContenttype)
KEY INDEX PK_Test
go
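
One way to keep the second copy in sync without touching the Image column from application code is a trigger, along the lines of the keyword-column approach. A sketch only, assuming the Test table just above; the trigger name tiuTestFts is made up, and it relies on the default RECURSIVE_TRIGGERS OFF setting so the trigger's own UPDATE doesn't fire it again:

```sql
CREATE TRIGGER dbo.tiuTestFts
ON dbo.Test
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- copy the HTML into the image column that the full-text index reads;
    -- only varchar(max) columns are read from "inserted", never the image column
    UPDATE t SET
        FTSDescription = CAST(i.Description AS varbinary(max))
    FROM dbo.Test t INNER JOIN inserted i ON t.ID = i.ID
END
go

-- application code then only ever touches Title and Description
INSERT INTO Test (Title, Description) VALUES ('hi','<strong>test</strong>')
go
```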

Conclusion

Well that was a long post. The solution depends on what your goals are, but I'd recommend architecting your application in such a way that if you start out with the first, simplest solution, you can enhance your system later to a more sophisticated implementation without breaking everything. That means you should ideally be interacting with your system either via stored procedures or objects. Then if you need to change the underlying database schema in order to handle a better search feature, you can do it in your DAL and/or procedures.

Since I'm doing this for an existing project (building a light CMS), I'm personally leaning towards a separate keyword or separate image column approach. I don't want to directly interact with Image columns at all if I can help it. :)

August 12, 2006    Bookmark to Digg or other social bookmarking
#    Disclaimer  |  Comments [0]



 Friday, August 11, 2006
 
 

If this were an ordinary post I'd show you a bunch of code illustrating how to send multipart MIME emails using .NET. But yesterday I ran across DotNetOpenMail, an open-source mail component for .NET. And I don't believe in reinventing the wheel too much.

As a reminder, multipart MIME emails allow you to embed multiple versions of your content with different MIME types (e.g. HTML and TEXT) into a single email. That way, recipients with HTML-capable email clients will see the HTML version of your email, while older email programs will display the text version.

In .NET 1.1 (which is what I was developing in yesterday), multipart MIME emails aren't really supported, although if System.Web.Mail uses CDO.Message behind the scenes, you'll automatically get a multipart MIME email generated.

So anyhow, I happily found this open-source component & it appears to work fine for my purposes. And so I thought I'd pass along the tip.




 Wednesday, August 09, 2006
 
 

Had a few issues running a 1.1 site on Windows 2003. Things I did to resolve the issues:

  • Made sure v1.1 was selected in the ASP.NET tab in IIS Manager for that site. That fixed the issue with ASP.NET not sending the aspnet_client files to the browser.
  • Made sure the \aspnet_client\system_web\1_1_4322 files were in the wwwroot directory for that site. Also copied the latest versions of the js files from C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\ASP.NETClientFiles into the \aspnet_client\system_web\1_1_4322 wwwroot folder. That resolved the issue where no postbacks were occurring due to an old bug w/ client side validation, discussed on Thomas Freudenberg's blog.
  • Was getting a weird error "CS0016: Could not write to output file 'c:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\Temporary ASP.NET Files\xxxxx'. The directory name is invalid." Turns out the TEMP & TMP environment values were set to a user-specific account. KB825791 gives the fix .. basically changing the environment values and ensuring that the ASPNET and NETWORK SERVICE accounts have full rights to the temp directory.

Now it works. :)




 Monday, August 07, 2006
 
 

So I was having a little trouble getting full text search to work with the GUI in SQL Server Express with Advanced Services (formerly SQL Server 2005 Express SP1), so I had to do things manually. It was probably a permissions or setup issue with SQL Server Express or the tools. In addition to setting up FTS, I wanted a search query to weight columns differently in the search rankings -- something that SQL Server FTS doesn't really support.

Setting up Full Text Search

First I had to download and install SQL Server Express with Advanced Services. It's big, but comes with the goodies I wanted.

Then I connected to my SQL Server Express database using SQL Server Management Studio so I could type in some queries. If your SQL Server Express database is in your Visual Studio Project's App_Data folder, you may be out of luck -- I wasn't able to get full text search to work on those, although maybe adjusting permissions would do it.

Once connected to the database, I created a full text catalog

CREATE FULLTEXT CATALOG MyFTCatalog

Next I needed to get the name of a unique index for my table. You can only create full-text indexes on tables with a single-key unique index (e.g. an autonumber primary key index). Remember that your unique index doesn't have to be on the columns that you want to perform full text searches on.
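
If you don't know the index name offhand, the sys.indexes catalog view will list the candidates. A small sketch, assuming the table is called Listing as in the rest of this post:

```sql
-- list the unique indexes on the Listing table; any single-column,
-- non-nullable one can serve as the KEY INDEX for a full-text index
SELECT i.name, i.is_primary_key
FROM sys.indexes i
WHERE i.object_id = OBJECT_ID('Listing')
  AND i.is_unique = 1
```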

I had a table called Listing with a primary key of IdListing and three varchar fields I wanted to search on: Address, Realtor, and Notes. My table already had a unique index called PK_Listing_IdListing, so it was time to create a full-text index on the three columns I wanted to be able to search on:

CREATE FULLTEXT INDEX ON Listing (Address, Realtor, Notes)
KEY INDEX PK_Listing_IdListing
ON MyFTCatalog
WITH CHANGE_TRACKING AUTO

What the above query did is create a full-text index on those three Listing table columns and store it in the full-text catalog named MyFTCatalog. I indicated PK_Listing_IdListing as the index to help uniquely identify rows on the Listing table, and I told the Full Text Search engine to automatically update the full-text catalog if values in the table change.

Lastly I did a quick check to confirm the catalog existed and wasn't still building

SELECT FULLTEXTCATALOGPROPERTY('MyFTCatalog', 'Populatestatus')
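
In a setup script it can be handy to wait until the population has finished rather than re-running that check by hand. A minimal sketch, assuming the MyFTCatalog catalog from above; PopulateStatus returns 0 when the catalog is idle, and the two-second delay is an arbitrary choice:

```sql
-- poll until the catalog reports idle (0)
WHILE FULLTEXTCATALOGPROPERTY('MyFTCatalog', 'PopulateStatus') <> 0
BEGIN
    WAITFOR DELAY '00:00:02'
END
PRINT 'Full-text catalog is idle'
```

Note that the function returns NULL for a catalog name that doesn't exist, in which case the loop is skipped and the message prints anyway.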

And we're set up. Now it was time to query. And man is it hot in here. I guess overclocking your PC makes for a sweaty summer. Anyhow...moving on.

Performing Weighted Queries

There are plenty of pages about performing full-text queries in SQL Server. Here's a place to start.

So my first query looked like this

SELECT IdListing, Address, Realtor, Notes
FROM Listing
WHERE FREETEXT(*,'some keywords')

The * tells FTS to perform the search on all columns in the full-text index. But the query wasn't going to work for me, since it doesn't give more weight to one column over the other. Plus, in order to sort results by ranking, I needed to use the *TABLE full-text queries. I'm partial to FREETEXTTABLE because it already does all the stemming/etc for me.
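
For reference, the simplest FREETEXTTABLE form joins the ranked result set back to the base table on [KEY]. A sketch using the Listing table and columns described above; it searches all indexed columns at once, so it sorts by rank but still has no per-column weighting:

```sql
SELECT TOP 100 f.[RANK], l.IdListing, l.Address, l.Realtor, l.Notes
FROM Listing l INNER JOIN
FREETEXTTABLE(Listing, *, 'some keywords') AS f
ON l.IdListing = f.[KEY]
ORDER BY f.[RANK] DESC
```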

Then I did a UNION query like this

SELECT TOP 100 Rank, Address, Realtor, Notes
FROM
(
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Address, 'some keywords') as f
    ON l.idListing = f.[KEY]
    UNION
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Realtor, 'some keywords') as f
    ON l.idListing = f.[KEY]
    UNION
    SELECT f.Rank, l.Address, l.Realtor, l.Notes
    FROM listing l INNER JOIN
    FREETEXTTABLE(listing, Notes, 'some keywords') as f
    ON l.idListing = f.[KEY]
) as myTable
ORDER BY Rank DESC

which I quickly rewrote to

SELECT TOP 100 f.Rank, l.Address, l.Realtor, l.Notes
FROM Listing l INNER JOIN
(
SELECT Rank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
UNION
select Rank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
UNION
select Rank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')
) as f
ON l.IdListing = f.[KEY]
ORDER BY f.Rank DESC

and then added some weights to the rankings, like so.

SELECT TOP 100 f.WeightedRank, l.Address, l.Realtor, l.Notes
FROM listing l INNER JOIN
(
        SELECT Rank * 5.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
        UNION
        select Rank * 3.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
        UNION
        select Rank * 1.0 as WeightedRank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')
) as f
ON l.idListing = f.[KEY]
ORDER BY f.WeightedRank DESC

Pretty good. You have the column weighting, and you could wrap it up in a nice little stored procedure and be good to go.

However, there was one last thing I needed. I really wanted a query that would combine column rankings, so that if there were hits in multiple columns, the rank would be higher than a hit in a single column. So this is what I came up with.

SELECT TOP 100 f.WeightedRank, l.Address, l.Realtor, l.Notes
FROM listing l INNER JOIN
(
    SELECT [KEY], SUM(Rank) AS WeightedRank
    FROM
    (
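        -- UNION ALL rather than UNION: UNION would drop a row when two columns
        -- produce the same weighted rank for a [KEY], making the SUM too low
        SELECT Rank * 5.0 as Rank, [KEY] from FREETEXTTABLE(listing, Address, 'some keywords')
        UNION ALL
        select Rank * 3.0 as Rank, [KEY] from FREETEXTTABLE(listing, Realtor, 'some keywords')
        UNION ALL
        select Rank * 1.0 as Rank, [KEY] from FREETEXTTABLE(listing, Notes, 'some keywords')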
    ) as x
    GROUP BY [KEY]
) as f
ON l.idListing = f.[KEY]
ORDER BY f.WeightedRank DESC

Notice how I'm grouping the inner UNION query by [KEY] (in this case, Listing.IdListing) and SUMming the weighted ranks. That allows us to push results with hits in multiple columns higher up in the search rankings. Obviously it's not going to perform as well as a simpler query, but the ranking was important for this project.
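
Wrapped up as a stored procedure, the combined query might look something like the sketch below. The procedure name spSearchListing and the @Keywords parameter are made up for illustration, and the 5/3/1 weights are the same ones used above. It uses UNION ALL so that two equal weighted ranks for the same row aren't collapsed into one before they're summed.

```sql
CREATE PROCEDURE spSearchListing
    @Keywords nvarchar(1000)
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TOP 100 f.WeightedRank, l.Address, l.Realtor, l.Notes
    FROM Listing l INNER JOIN
    (
        -- one row per listing, with the weighted ranks of all matching columns added up
        SELECT [KEY], SUM(WeightedRank) AS WeightedRank
        FROM
        (
            SELECT [RANK] * 5.0 AS WeightedRank, [KEY] FROM FREETEXTTABLE(Listing, Address, @Keywords)
            UNION ALL
            SELECT [RANK] * 3.0 AS WeightedRank, [KEY] FROM FREETEXTTABLE(Listing, Realtor, @Keywords)
            UNION ALL
            SELECT [RANK] * 1.0 AS WeightedRank, [KEY] FROM FREETEXTTABLE(Listing, Notes, @Keywords)
        ) AS x
        GROUP BY [KEY]
    ) AS f
    ON l.IdListing = f.[KEY]
    ORDER BY f.WeightedRank DESC
END
go

EXEC spSearchListing @Keywords = N'some keywords'
```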

Conclusion

So, there ya go. Installing SQL Server Express isn't too bad, although it's a big download. Setting up Full Text Search seemed to work best for me from the command line. And, now you have a way to rank matches with different columns having different weights.

Update: An Alternate Approach

Hilary Cotter (SQL MVP & FTS guru) provided an alternate query. I did a few tests & both seemed comparable in performance, although I didn't test using very large data sets. I made a slight change to his query and added a WHERE clause so that only matches are returned.

select TOP 100
    idListing, Address, Realtor, Notes,
    RankTotal=isnull(RankAddress,0)+isnull(RankRealtor,0)+isnull(RankNotes,0)
from listing
left join (SELECT Rank * 5.0 as RankAddress, [KEY] from
    FREETEXTTABLE(listing, Address, 'Street')) as k
    on k.[key]=Listing.idListing
left join (select Rank * 3.0 as RankRealtor, [KEY] from
    FREETEXTTABLE(listing, Realtor, 'Street')) as l
    on l.[key]=Listing.idListing
left join (select Rank * 1.0 as RankNotes, [KEY] from
    FREETEXTTABLE(listing, Notes, 'Street')) as m
    on m.[key]=Listing.idListing
WHERE RankAddress IS NOT NULL OR RankRealtor IS NOT NULL OR RankNotes IS NOT NULL
ORDER BY RankTotal DESC

Hilary also provided a script (run it in Query Analyzer or in a Query Tab in SQL Mgmt Studio) to set up a test database so you can try the query out yourself. I modified it to seed the test database with a bunch of records (since with only a few records, even LIKE is faster than FTS):

create database realtor
go
use realtor
GO
sp_fulltext_database 'enable'
GO
Create fulltext catalog realtor as default
GO
create table Listing(
    idListing int not null identity constraint ListingPK primary key,
    Address varchar(200), Realtor varchar(200), Notes varchar(200))
GO
-- add initial seed records
insert into Listing(Address, Realtor, Notes)
values('123 Any Street','John Street','the word on the street is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Road','John Street','the word of mouth is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Road','John Smith','the word on the street is good')
insert into Listing(Address, Realtor, Notes)
values('123 Any Street','John Smith','the word of mouth is good')
GO
-- multiply seed records, get up over 1M rows
-- might take a while
PRINT 'Please wait a few minutes while the database is seeded'
DECLARE @i int
SET @i = 0
WHILE (@i < 18)
BEGIN
    -- each pass doubles the table: 4 rows * 2^18 = 1,048,576 rows
    insert into Listing(Address, Realtor, Notes)
    select Address, Realtor, Notes from Listing

    SET @i = @i + 1
    PRINT convert(varchar,@i)
END
PRINT 'Database has been seeded'
GO
PRINT 'Please wait a few minutes while the fulltext index is built'
GO
create fulltext index on listing(Address, Realtor, Notes)
key index ListingPK
GO
-- check the below query. When it returns zero, the FT index is done building.
SELECT FULLTEXTCATALOGPROPERTY('realtor', 'Populatestatus')
GO


 




 Wednesday, August 02, 2006
 
 

AKA "working virtual, or virtually working?"

My friend Griffin Caprio blogged about the virtues of being a virtual worker and finding wifi hotspots. I thought I'd chime in with a few tips of my own.

Insure your stuff

Make sure your computer equipment is covered. Many homeowner policies DON'T cover computers at all, or not if they're used for business. You may want to get a small umbrella business insurance policy to cover your equipment at home & on the road (think dropped laptop at the airport). Ask around for referrals, or pick a few insurance agents out of the phone book.

Host a Web Server

If you have a static IP address from your ISP, then you can configure DNS to point to a web server on your network, and host away. If you have a dynamic IP, however, then you need to use dynamic DNS to ensure that when your IP address changes, your DNS entry (www.yourcooldomain.com) points to the right IP. There are several providers. I've used DNSExit for years and it works well, but you can also check out No-IP, TZO, or DynDNS. Or others. Some routers come with built-in support for certain dynamic DNS providers, meaning a simple config change in your router is all that's needed to keep your DNS up to date.

Back up your stuff

What would you do if your computer crashed or your hard drive blew out? Would you lose any work? How long would it take you to recover? Backups are important for any IT professional, and I'd suggest an automated approach. You can go with a service like Mozy that runs on your PC and backs stuff up in the background. Or, if you have a place you can FTP files to (e.g. your ISP or an inexpensive host like e-rice or dreamhost) you can pick up a copy of WinZip 10 Pro which can regularly zip up & upload files via FTP. Remember to not only back up documents, but emails, code, and database dumps. Having an organized directory structure where your important files are makes it easier. Then, if disaster strikes, you'll be in a better position to recover. And the silver lining is maybe you'll now have a reason to get a shiny new PC.




 Thursday, July 27, 2006
 
 

Scott Mitchell wrote recently about a plug & play ASP.NET error-logging framework that he and Atif Aziz wrote for an MSDN article a while back. 

The framework is called ELMAH (Error Logging Modules and Handlers) and it's free & open source. Apparently you just install the DLL & add a few lines to your web.config, and it'll start logging errors while allowing administrators to view errors online or even access an RSS feed of recent errors. I usually install a global error handler in my ASP.NET apps & use log4net to log & email the information, but I never put together a web-based error viewer. So if there's a stable framework that wraps all that up, I'm all for it.

You can download & read more about ELMAH here. Some screenshots (courtesy of MSDN):


[Screenshot: Viewing the error log]

[Screenshot: RSS feed of recent errors]




 
 

So I was downloading the latest version of Anthem.NET to use for a Visual Studio 2003 project. I downloaded the zip, extracted, made the virtual directory, but kept getting weird security errors like

"The project location is not fully trusted by the .NET runtime. This is usually because it is either a network share or mapped to a network share not on the local machine.  If the output path is under the project location, your code will not execute as fully trusted and you may receive unexpected security exceptions."

and Visual Studio saying I can't debug the application. When I tried going to the local site in IE (localhost/Anthem-Examples-2003) I'd get

"Server cannot access application directory 'C:\Documents and Settings\Ben\My Documents\Visual Studio Projects\anthem\Anthem-Examples-2003\'. The directory does not exist or is not accessible because of security settings."

It always worked flawlessly before. So, after some dorking and searching around, I finally got it working. Here's how I did it...

  • After downloading the zip file, I right clicked it and clicked "Unblock" (some new XP SP2 security thing). Then I extracted it. That resolved the first "project location not trusted" issue.
  • I then went to Control Panel->Admin Tools->.NET 1.1. Security Wizards, and gave full trust to the Intranet Zone.
  • I then disabled simple file sharing (Control Panel->Folder Options->View tab->Uncheck simple file sharing). This allowed me to access the "Security" tab on folders.
  • I then went to the folder containing the files that I extracted from the zip file. I right-clicked the folder, went to the (newly available) Security tab, and gave the Users group (of which the ASPNET account is part) standard (read/view/execute) access to that directory.

And now it works. I think a recent Windows security update is probably to blame. But, now we're back in action.




 
 

I forgot to include links for some of the libraries I used in my talk at Code Camp:

Atlas -- http://atlas.asp.net

Anthem -- http://www.anthemdotnet.com

Prototype -- http://prototype.conio.net/

script.aculo.us -- http://script.aculo.us/

 




 Wednesday, July 26, 2006