Indexing Word Docs


Indexing Word Docs Binder
6/21/2004 12:38:45 PM
sql server full text search:
We currently have an application that OCRs a tif image and places the
recognized text in a SQL table.
The table is then indexed by the FTS service.
The app then allows you to search for any of the text and display the
corresponding tif image in a viewer.

I would also like to be able to search WORD docs for their contents using
the same catalog.

What is the proper manner to have the WORD docs indexed by the FTS service?
Do I need to extract the text from the WORD doc and store it in the table,
much like the recognized text from the OCR process?

Thanks


Re: Indexing Word Docs John Kane
6/21/2004 11:02:47 PM
Binder,
What version of SQL Server (2000 or 7.0) are you using, and on what OS
platform (NT4.0, Win2K, or Win2003) is it installed? Could you post the full
output of SELECT @@version? It is helpful in answering your question.

If you are using SQL Server 2000, you can use its new feature (not present
in SQL Server 7.0), described in the SQL Server 2000 BOL topic "Filtering
Supported File Types". This feature lets you store the binary version of the
MS Word document in an IMAGE column, define a file extension column in the
same table, and populate it with the correct value ("doc" for an MS Word
document). After you run a Full Population, you can use CONTAINS or FREETEXT
queries to full-text search the contents of the files stored in the SQL table.
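
As a rough sketch of the SQL Server 2000 steps described above (not a tested script for your system): the table name WordDocs, its columns, and the catalog name OcrCatalog are hypothetical placeholders, and the sketch assumes the database is already enabled for full-text indexing with an existing catalog, as it is for the OCR table.

```sql
-- Hypothetical names throughout: WordDocs, DocContent, DocExt, OcrCatalog.
CREATE TABLE WordDocs
(
    DocId      int         NOT NULL CONSTRAINT PK_WordDocs PRIMARY KEY,
    DocContent image       NULL,  -- binary copy of the .doc file
    DocExt     varchar(10) NULL   -- type column, e.g. 'doc' for MS Word
)
GO
-- Register the table in the existing catalog, using the primary key index
-- as the full-text key.
EXEC sp_fulltext_table @tabname = 'WordDocs', @action = 'create',
     @ftcat = 'OcrCatalog', @keyname = 'PK_WordDocs'
GO
-- Add the IMAGE column and name the column that holds the file extension,
-- so the matching document filter is used during population.
EXEC sp_fulltext_column @tabname = 'WordDocs', @colname = 'DocContent',
     @action = 'add', @type_colname = 'DocExt'
GO
EXEC sp_fulltext_table @tabname = 'WordDocs', @action = 'activate'
GO
-- Start a full population of this table only.
EXEC sp_fulltext_table @tabname = 'WordDocs', @action = 'start_full'
GO
-- Once the population has finished, query the document contents.
SELECT DocId
FROM WordDocs
WHERE CONTAINS(DocContent, '"invoice"')
GO
```

The search word "invoice" is only an example. Loading the .doc bytes into the IMAGE column is left to the application, since how that is done depends on your client code.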

If you are using SQL Server 7.0, you will need to set up a process to extract
the MS Word text, store this text in a TEXT column, and then FT Index that
column, much as you do for your OCR'ed data.
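
A minimal sketch of that extracted-text approach, assuming a hypothetical table WordText and an existing catalog called OcrCatalog (neither name comes from your system). The extraction of the text from the .doc file happens outside SQL Server and is not shown.

```sql
-- Hypothetical names throughout: WordText, FilePath, BodyText, OcrCatalog.
CREATE TABLE WordText
(
    DocId    int          NOT NULL CONSTRAINT PK_WordText PRIMARY KEY,
    FilePath varchar(255) NOT NULL, -- where the original .doc file lives
    BodyText text         NULL      -- text extracted from the Word document
)
GO
EXEC sp_fulltext_table @tabname = 'WordText', @action = 'create',
     @ftcat = 'OcrCatalog', @keyname = 'PK_WordText'
GO
EXEC sp_fulltext_column @tabname = 'WordText', @colname = 'BodyText',
     @action = 'add'
GO
EXEC sp_fulltext_table @tabname = 'WordText', @action = 'activate'
GO
-- Populates every table registered in the catalog, including this one.
EXEC sp_fulltext_catalog @ftcat = 'OcrCatalog', @action = 'start_full'
GO
-- After the population completes:
SELECT DocId, FilePath
FROM WordText
WHERE FREETEXT(BodyText, 'quarterly sales report')
GO
```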

Regards,
John



Re: Indexing Word Docs John Kane
6/22/2004 9:11:53 AM
Binder,

Q. What is the relationship between FTS and Indexing Service?
A. While they use the same underlying Microsoft Search technology, they
full-text index different data sources. Indexing Service handles the server's
files on its local disk drive, while FTS (or really the "Microsoft Search"
service [mssearch.exe]) full-text indexes textual (char, nvarchar, text, etc.)
columns in SQL Server tables. Yes, it seems to me that using the Indexing
Service should work for you.

What is the name of your app? Does it support SQL Server 2000? If so, does
it support the storage of MS Word documents in columns that are defined with
the IMAGE datatype? Is the feature that is titled "Full-text Querying of
File Data" a feature of your app, or are you referring to a feature of
SQL Server (and if so, which version)?

In addition to SQL Server's Full-text Search (FTS) component, you can also
define a "Linked Server" to the Indexing Service using MSIDXS, the "OLE
DB Provider for Microsoft Indexing Service". You would define this linked
server via sp_addlinkedserver. Below is an example from SQL Server 2000
Books Online:

G. Use the Microsoft OLE DB Provider for Indexing Service
This example creates a linked server and uses OPENQUERY to retrieve
information from both the linked server and the file system enabled for
Indexing Service.

EXEC sp_addlinkedserver FileSystem,
   'Index Server',
   'MSIDXS',
   'Web'
GO
USE pubs
GO
IF EXISTS(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
      WHERE TABLE_NAME = 'yEmployees')
   DROP TABLE yEmployees
GO
CREATE TABLE yEmployees
(
   id       int         NOT NULL,
   lname    varchar(30) NOT NULL,
   fname    varchar(30) NOT NULL,
   salary   money,
   hiredate datetime
)
GO
INSERT yEmployees VALUES
(
   10,
   'Fuller',
   'Andrew',
   $60000,
   '9/12/98'
)
GO
IF EXISTS(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS
      WHERE TABLE_NAME = 'DistribFiles')
   DROP VIEW DistribFiles
GO
CREATE VIEW DistribFiles
AS
SELECT *
FROM OPENQUERY(FileSystem,
   'SELECT Directory,
           FileName,
           DocAuthor,
           Size,
           Create,
           Write
    FROM SCOPE('' "c:\My Documents" '')
    WHERE CONTAINS(''Distributed'') > 0
      AND FileName LIKE ''%.doc%'' ')
WHERE DATEPART(yy, Write) = 1998
GO
SELECT *
FROM DistribFiles
GO
SELECT Directory,
       FileName,
       DocAuthor,
       hiredate
FROM DistribFiles D, yEmployees E
WHERE D.DocAuthor = E.FName + ' ' + E.LName
GO
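
For your scenario, one possible way to combine the two searches is sketched below. It is an untested illustration under several assumptions: OcrPages, ImagePath, and OcrText are hypothetical names for your OCR table and its columns, "d:\warehouse" is a hypothetical file warehouse directory that the Indexing Service catalog covers, and FileSystem is the linked server defined in the example above.

```sql
-- Hypothetical names: OcrPages, ImagePath, OcrText, d:\warehouse.
-- First branch: OCR text indexed by SQL Server FTS.
SELECT 'tif' AS Source, ImagePath AS FilePath
FROM OcrPages
WHERE CONTAINS(OcrText, '"invoice"')
UNION ALL
-- Second branch: Word files indexed by Indexing Service, reached through
-- the linked server. Path is the full physical path of the file.
SELECT 'doc' AS Source, Path AS FilePath
FROM OPENQUERY(FileSystem,
   'SELECT Path, FileName
    FROM SCOPE('' "d:\warehouse" '')
    WHERE CONTAINS(''invoice'') > 0
      AND FileName LIKE ''%.doc'' ')
GO
```

Note that OPENQUERY accepts only a literal query string, not a variable, so a search term supplied by the user would have to be placed into the statement by the application or through dynamic SQL.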

Regards,
John






Re: Indexing Word Docs Binder
6/22/2004 10:21:02 AM
John,

What is the relationship between FTS and Indexing Service?
It looks like the Indexing Service maintains a catalog much the same as FTS.

We have support for WORD in our app already by storing the WORD doc in our
file warehouse on the file system.
We can display the .doc file in our viewer the same as a .tif image.
We currently don't have functionality to search for data in the WORD docs,
only text from the OCR process.
Since the WORD file is already stored in the file system and referenced by
our application, I was wondering about the feature that is titled "Full-text
Querying of File Data".

It looks like it uses the Index Service to allow searching for data in files
on the file system.
Wouldn't that work for my scenario?

It appears that when we want to search for data contained in a WORD doc, we
would use the SCOPE function in our query. Otherwise, we continue to search
for text from the OCR process.

Can you provide some insight?

Thanks






Re: Indexing Word Docs Binder
6/22/2004 11:04:54 AM
System Parameters:

Windows 2000 Server

Microsoft SQL Server 2000 - 8.00.194 (Intel X86)
Aug 6 2000 00:57:48
Copyright (c) 1988-2000 Microsoft Corporation
Enterprise Edition on Windows NT 5.0 (Build 2195: Service Pack 4)





