
sql server full text search : creating custom filters for full-text indexing


Parhez Sattar
1/20/2004 10:49:58 AM
I have been searching through the MSDN library for
instructions on how to add/create custom filters for full-
text indexing of additional file types in SQL 2000 so it
will index other file types, such as .aspx, etc. Any
John Kane
1/20/2004 1:11:33 PM
Parhez,
You should be able to use filtreg.exe from the Platform SDK (see
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/ixufilt_5691.asp)
and then "associate" the .aspx file extension with the filter that is
already registered for .html files, for example:

filtreg /?
Usage: filtreg [dstExt] [srcExt]
Displays IFilter registrations. If [dstExt] and [srcExt] are specified
then [dstExt] is registered to act like [srcExt].

filtreg (edited) output (on Windows Server 2003)
....
.asp --> HTML filter (nlhtml.dll)
.aspx --> HTML filter (nlhtml.dll)
...
.htm --> HTML filter (nlhtml.dll)
.html --> HTML filter (nlhtml.dll)

You can then "associate" .aspx with .htm or .html using filtreg.exe as follows:

filtreg .aspx .html

Then re-run filtreg to confirm the association; for .aspx it will
display:
.aspx --> HTML filter (nlhtml.dll)
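
The argument order matters here: per the usage text, the first extension is the one that gets changed and the second is the one it copies. As a rough illustration only, here is a toy Python model of that rule. The dictionary and the register_like function are made up for this sketch; the real tool works on the system's IFilter registrations, not on a dictionary.

```python
# Toy model of "filtreg dstExt srcExt": dstExt is registered to act like srcExt.
# The dictionary is a stand-in for the real IFilter registrations (assumption).

def register_like(registrations, dst_ext, src_ext):
    """Return a copy of registrations where dst_ext uses src_ext's filter."""
    if src_ext not in registrations:
        raise KeyError(f"no filter registered for {src_ext}")
    updated = dict(registrations)
    updated[dst_ext] = registrations[src_ext]
    return updated

registrations = {
    ".htm": "HTML filter (nlhtml.dll)",
    ".html": "HTML filter (nlhtml.dll)",
}

# Equivalent in spirit to: filtreg .aspx .html
after = register_like(registrations, ".aspx", ".html")
for ext, name in sorted(after.items()):
    print(f"{ext} --> {name}")
```

Swapping the two arguments would change the entry for .html instead of .aspx, which is why it is worth re-running filtreg afterwards to check the result.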

Regards,
John




Parhez Sattar
1/20/2004 1:24:08 PM
John,
Would this work for SQL 2000, in addition to the Indexing
John Kane
1/20/2004 1:29:29 PM
Yes.


Parhez Sattar
1/20/2004 1:32:40 PM
How do I go about getting this filtreg.exe program? I
don't have the SDK and would prefer to not wait until the
Parhez Sattar
1/21/2004 8:36:31 AM
Hilary,
I have my content (the .aspx files) in a SQL 2000
database, as it is part of a Windows SharePoint Services
site. As far as I know, the Indexing Service can't reach
Hilary Cotter
1/21/2004 11:00:18 AM
you might want to check out

http://support.microsoft.com/default.aspx?scid=kb;en-us;311521&Product=is

as well.


John Kane
1/21/2004 3:57:43 PM

Parhez and Hilary,
Hilary, I created the reg file from the kb article and it made no difference
on my SQL 2000 SP3 server on Win2003.
However, after a bit of testing, Parhez, you can use Filtreg.exe and set
your aspx file to the text IFilter, for example:

filtreg .aspx .txt

Additionally, when you populate the Docs table, the extension column
(nvarchar(256) NULL) for your aspx files should contain "txt" as the
extension. Then run a Full Population and this will work for you...
or at least it did for me on my server.

Regards,
John




John Kane
1/21/2004 7:33:31 PM
Hi Hilary,
Agreed. However, the aspx file only contained the following:

<%@ Page language="c#" Codebehind="Home.aspx.cs" AutoEventWireup="false"
Inherits="Microsoft.ReportingServices.UI.HomePage" %>
<%@ Register TagPrefix="MSRS" Namespace="Microsoft.ReportingServices.UI"
Assembly="ReportingServicesWebUserInterface" %>

Using html or htm did not work because the file lacked the standard <html>
and meta tags, and I could only repro it with txt.
I can email the full repro script and file to you if you want to test it....

Regards,
John





Parhez Sattar
1/21/2004 8:28:43 PM
John,
I will try this tomorrow. On the IFilter end, I suspect
I can use filtreg.exe to change the association for .aspx
from HTML to Text. A question, though: how do I
make my .aspx files put .txt as the extension in the
Extension column when that is done by the application, in
this case WSS? I have always thought messing with the
data in the content database via SQL tools is a big no-no,
according to the WSS gurus out there. I know how to
tell the catalog to repopulate, but I don't know if I
should change the content of the Docs table from outside
of WSS. What are your thoughts?



Hilary Cotter
1/21/2004 9:25:41 PM
Indexing your aspx as text will index the raw code. Indexing your aspx as
html will index the document stored in an image data type as html, i.e. the
code and html formatting tags will not be indexed.

The code will not be interpreted or processed, though. Only the content that
is not html formatting will be indexed.

So if you have the following html tags in your aspx:

<html>
<head>
<META NAME="DESCRIPTION" CONTENT="The entry page to Microsoft's Web site.
Find software, solutions, answers, support, and Microsoft news." />
</head>
<body>
<h1>this is a test</h1></body></html>

the only text that will be indexed in the above doc is: this is a test
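
To make that concrete, here is a rough Python sketch that keeps only the text inside <body> and drops tags and attributes. It uses the standard library's html.parser and is only an approximation of the behaviour described above; it is not the HTML IFilter (nlhtml.dll) and makes no claim about how that filter is implemented. The class and function names are made up for the example.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep non-empty text only while inside <body>.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())

SAMPLE = """<html>
<head>
<META NAME="DESCRIPTION" CONTENT="The entry page to Microsoft's Web site.
Find software, solutions, answers, support, and Microsoft news." />
</head>
<body>
<h1>this is a test</h1></body></html>"""

def visible_text(markup):
    parser = BodyTextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

print(visible_text(SAMPLE))  # prints: this is a test
```

A page with no <body> content at all comes back as an empty string in this sketch, since there is nothing left once the markup is removed.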


John Kane
1/22/2004 10:25:37 AM
Parhez,
On the SQL Server side, you can obviously issue an UPDATE statement to
change all aspx values to txt values, but that would not help you on the WSS
side where these files are inserted into the SQL Server docs table....

I found the following related thread in the WSS newsgroup that might be
helpful to you...
"The 'Extention' column is needed by SQL Full Text Engine, and this column
is set as calculated in the docs table to take the letters following the
last dot of the file name... only when this file name corresponds to a file
include in a Document Library(field is set to '' if the DocLibId is null).
So the inability to run a full text search on an file stored as an
attachment in WSS is only linked to the way this field is calculated and
thus, is by design".
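
Read literally, the rule quoted above could be sketched like this in Python. This is only my reading of the quoted description, not the actual column definition from WSS, and the function and parameter names are made up for illustration.

```python
def computed_extension(leaf_name, doc_lib_id):
    """Letters after the last dot of the file name, but only for files that
    belong to a document library; otherwise an empty string."""
    if doc_lib_id is None:
        return ""
    _, dot, ext = leaf_name.rpartition(".")
    # A name without any dot has no extension.
    return ext if dot else ""

print(repr(computed_extension("Home.aspx", "some-library-id")))  # 'aspx'
print(repr(computed_extension("Home.aspx", None)))               # ''
```

If the column really is calculated this way, the value follows from the file name and the library membership, so there is no separate extension value to set to "txt" for a file that keeps its .aspx name.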

You might want to re-post your question in the
microsoft.public.sharepoint.windowsservices newsgroup and if you get a good
answer, could you re-post it in this newsgroup as well?

Thanks,
John





Hilary Cotter
1/22/2004 1:44:06 PM
True, such an aspx page would generate no html content. Other aspx pages
would/could.

In this case, using either html or txt as the persistent handler would be
unlikely to generate any meaningful results.


Parhez Sattar
1/23/2004 8:11:41 AM