sql server full text search : Full text search on Chinese or Chinese/English mix?


Xin Chen
3/12/2005 11:39:17 PM
I want to use SQL Server 2005 full-text search (FTS) to search web pages I
have crawled. A page can be Chinese, English, or mixed (a Chinese article
with English phrases in it).

First question: which language's word breaker should I choose? Does the
Chinese word breaker make English content hard to search?

Second question: should I store text in different languages in different
catalogs so that I can choose a specific word breaker for each? If so, how do
I determine what language a web page is using? Most Chinese and English web
pages use the utf-8 charset, so my program cannot tell the languages apart
from the charset alone. Shouldn't SQL Server figure out which word breaker to
use automatically by examining the bytes of the utf-8 encoding of the text?

Third, what encoding should I use when I insert the content of a web page
into the full-text database: utf-8, gb2312 (Chinese), or Unicode? Does it
matter?

Your input is greatly appreciated.

Hilary Cotter
3/14/2005 10:12:01 AM
For this to work you have to tag each document with the ms.locale meta tag,
store your documents in an image or varbinary column, and then query using
the LANGUAGE keyword. The language you assign to the column is irrelevant, as
the language tags in the document take precedence. Here is an example:

CREATE TABLE blob
(pk INT NOT NULL IDENTITY(1,1) CONSTRAINT primarykey PRIMARY KEY,
 blob VARBINARY(MAX),
 blobtype VARCHAR(10))
GO

-- Assumes a full-text catalog named catalog_name already exists.
CREATE FULLTEXT INDEX ON blob
(blob TYPE COLUMN blobtype LANGUAGE 1033) -- note: LCID 1033 is American English
KEY INDEX primarykey ON catalog_name
GO

-- Note that the HTML documents we are pushing in are tagged with French
-- language meta tags.

INSERT INTO blob (blob, blobtype)
VALUES (CONVERT(VARBINARY(256), '<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>mangé</BODY></HTML>'), '.htm')

INSERT INTO blob (blob, blobtype)
VALUES (CONVERT(VARBINARY(256), '<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>manger</BODY></HTML>'), '.htm')

GO

-- Query for all stemmed forms of the French verb manger (to eat).

SELECT * FROM blob WHERE CONTAINS(*, 'FORMSOF(INFLECTIONAL, manger)', LANGUAGE 1036)

-- two rows returned.
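
A query run right after the inserts may return no rows if the full-text index is still populating. A minimal sketch for checking that, assuming the catalog is named catalog_name as in the example above:

```sql
-- PopulateStatus 0 means the catalog is idle, i.e. no population is running.
SELECT FULLTEXTCATALOGPROPERTY('catalog_name', 'PopulateStatus') AS populate_status

-- Number of items currently indexed in the catalog.
SELECT FULLTEXTCATALOGPROPERTY('catalog_name', 'ItemCount') AS item_count
```

Once populate_status reports 0, the CONTAINS query can be rerun.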



--
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html

Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com

Hilary Cotter
3/14/2005 10:16:53 AM
Maybe I didn't answer your question too well.

1) It doesn't matter which word breaker you select. For varbinary or image
columns, where the document contains language tags the iFilter understands
(HTML docs tagged with the ms.locale meta tag, or Word and other Office
docs), the embedded language tag controls the word breaker used.

2) You don't have to use separate catalogs if you are using image or
varbinary columns. For columns of other data types you will.

3) utf-8 should work.
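
Applied to the original question, the same pattern might look like the sketch below. It reuses the blob table from the earlier example. The ms.locale value ZH-CN is an assumption about what the HTML iFilter accepts for Simplified Chinese; LCID 2052 is Simplified Chinese (1028 is Traditional Chinese). The page body here is ASCII-only for illustration; a real crawled page would normally be passed from the client as raw bytes rather than built from a string literal.

```sql
-- List the languages (LCIDs) that have a word breaker registered on this server.
SELECT lcid, name FROM sys.fulltext_languages ORDER BY name

-- List the document types (file extensions) that have an iFilter registered.
SELECT document_type, path FROM sys.fulltext_document_types

-- Hypothetical mixed-language page; ZH-CN is an assumed ms.locale value.
INSERT INTO blob (blob, blobtype)
VALUES (CONVERT(VARBINARY(MAX), '<HTML><HEAD><META name="ms.locale" CONTENT="ZH-CN"></HEAD><BODY>SQL Server full text search</BODY></HTML>'), '.htm')
GO

-- Query using the Simplified Chinese word breaker (LCID 2052).
SELECT pk, blobtype FROM blob WHERE CONTAINS(blob, '"search"', LANGUAGE 2052)
```

If LCID 2052 does not appear in sys.fulltext_languages, the Chinese word breaker is not available on that server and the query language would need to be adjusted.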