Using full text search for searching duplicate documents


Re: Using full text search for searching duplicate documents John Kane
1/21/2004 3:30:56 PM
Auntin,
I don't believe that SQL Server's Full-Text Search, which relies on the
MSSearch service, is the best solution for your requirement.
Neither FREETEXT nor CONTAINS was designed to detect duplicate data in SQL
Server.

Regards,
John



Using full text search for searching duplicate documents Auntin Philipino
1/21/2004 5:55:15 PM
Hi,

I have a requirement wherein I have to search for duplicate documents in my
folder. I copy the contents of each document to a TEXT column and enable FTS
on this column.

Now I want to check whether the content of one document is present in another
document (maybe with a few minor changes). What is the best way to
implement this?

I am finding that FREETEXT returns results with a huge degree of
variance. And CONTAINS keeps throwing errors if the search criteria contain a
newline or any special character. How do I solve this problem?

Thanks in advance,
Philipino
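
On the CONTAINS errors mentioned above: one possible workaround is to clean the text before it is used as a search condition. Below is a minimal sketch in Python, assuming the errors come from newlines, punctuation, and embedded double quotes in the criteria. The function name `to_contains_phrase` and the `max_words` limit are made up for illustration and are not part of SQL Server.

```python
import re


def to_contains_phrase(text: str, max_words: int = 20) -> str:
    """Build a double-quoted phrase usable as a CONTAINS search condition.

    Assumption: runs of non-word characters (newlines, punctuation,
    double quotes) can safely be replaced by a single space.
    """
    # \W+ matches one or more characters that are not letters, digits,
    # or underscores, so newlines and quotes are removed here too.
    cleaned = re.sub(r"\W+", " ", text)
    words = cleaned.split()[:max_words]
    if not words:
        raise ValueError("no searchable words in input")
    # CONTAINS expects a multi-word phrase to be wrapped in double quotes.
    return '"' + " ".join(words) + '"'
```

The resulting string would then be passed as a query parameter rather than concatenated into the SQL text, for example `WHERE CONTAINS(ocr_text, @phrase)`, where `ocr_text` and `@phrase` are hypothetical names. This only avoids syntax errors; it does not make CONTAINS a duplicate detector.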

Re: Using full text search for searching duplicate documents Auntin Philipino
1/22/2004 11:01:55 AM
Hi John,

Thanks for the reply.

Actually my requirement is as follows:

I have thousands of images. I OCR them, capture their content, and dump the
OCRed data into a SQL Server table. Some of these images will be duplicates
with a little variation in their content. Say one document has a
scribbled note on it and is then scanned into an image, and another copy of the
same document is scanned without the scribbled note. Now I want these two
images to be identified as duplicates.

I was planning to use the FTS feature of SQL Server, wherein I take the
data in the first row and compare it with the rest of the data using the
FREETEXT predicate or the FREETEXTTABLE function. Now that you have mentioned
that this is not a good option for my requirement, can you please suggest an
alternative way of implementing this?

Thanks in advance,
Philipino
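
For the near-duplicate case described above (the same page scanned with and without a scribbled note), one common technique outside of full-text search is to compare overlapping word sequences ("shingles") and score the overlap with Jaccard similarity. This is a minimal sketch, not something proposed in this thread; the shingle size `k = 3` and any cut-off threshold are assumptions that would need tuning on real OCR output.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in text (lowercased words)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the shingle sets of a and b (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    clean = "the quick brown fox jumps over the lazy dog"
    noted = clean + " call bob"  # same page with a scribbled note
    print(jaccard(clean, noted))
```

Pairs scoring above a chosen threshold would be flagged as likely duplicates. Comparing every pair gets expensive with thousands of documents, so a real implementation would probably need some way to narrow down the candidates first.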


Re: Using full text search for searching duplicate documents Hilary Cotter
1/25/2004 6:03:41 PM
A comparison by byte size will be pretty good. Most other MS search engines
(Site Server Search, Exchange, SharePoint) offer DocSignature, a hash
that you can use to identify duplicates.
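
The size-plus-hash idea can be sketched in Python as follows. This is only an illustration: how DocSignature is actually computed is not described here, so SHA-256 over whitespace-normalized, lowercased text is used as a stand-in, and the names `signature` and `group_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict


def signature(text: str) -> tuple:
    """Return (byte size, SHA-256 hex digest) of the normalized text."""
    # Collapse whitespace and lowercase so trivial OCR spacing
    # differences do not change the signature.
    data = " ".join(text.split()).lower().encode("utf-8")
    return (len(data), hashlib.sha256(data).hexdigest())


def group_duplicates(docs: dict) -> list:
    """Group document ids whose signatures match exactly."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[signature(text)].append(doc_id)
    # Keep only signatures shared by more than one document.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    docs = {"a": "Hello  World", "b": "hello world\n", "c": "hello there"}
    print(group_duplicates(docs))
```

A hash like this only catches documents that are identical after normalization. A copy with an extra scribbled note will produce a different size and a different hash, so it would not be grouped with the original.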
