Groups | Blog | Home
all groups > sql server data mining > october 2004 >

sql server data mining : Best Match


daven NO[at]SPAM cougar-automation.co.uk
10/27/2004 2:59:36 AM
Hi all,
i have a table which contains for arguments sake, beans -

Table Beans
BeanId
DecriptionId
DescriptionValue


with values such as

Harricot length 1.1
Kidney length 1.4
Harricot Colour Pale Orange
Kidney Colour Red
Harricot Notes Beans Means Heinz
Kidney Notes Delicioso con carne


How can I write a query to return the best match for say a bean of
length 1.3 and colour blue maybe relating to Heinz.

The beans are just examples the real data is going to be completely
varied, the query must be fast because I anticipate having a million
rows, but what I am most interested in is how to score the results to
find the best match and was hoping to use SQL to return the best match
rather then try and score it in the front end application. but i could
do some there.

I hope this makes sense

Thanks

Adam Machanic
10/27/2004 9:36:58 AM
First of all, you're going to have to normalize this data if you want to
search it efficiently and ensure that you don't have duplicate values or
other data integrity issues. That means changing your table to:

Beans
------
BeanId
BeanLength
BeanColour
BeanNotes
...

Second, can you specify in more detail what constitutes a 'best match'?
What logic would you like to use in order to determine whether something is
a match?


--
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--


[quoted text, click to view]

daven NO[at]SPAM cougar-automation.co.uk
10/28/2004 1:08:50 PM
Adam,

First of all thanks for your reply.

it is my intention to make the descriptions for the beans to be
extensible, so any more descriptions would just be a different id and
value combo (length 1.1, notes blah, colour red), doing what you
propose - i would have to add a different column for each description
- beanid beandescription1(length) beandescription2(colour) etc.

i dont have a problem with this if you tell me it is the way to go
regarding creating quick queries, it would however result in a lot of
rows like -

name length colour notes
haricott 1.3
haricott orange
haricott beanzmeansheinz
haricott these are another persons notes
haricott 1.2

and then someone might come along with another descriptor -

name origin length colour notes
haricott 1.3
haricott orange
haricott beanzmeansheinz
haricott these are another persons notes
haricott 1.2
haricott peru

will this hamper the speed or will it simply result in more data being
stored?

i was planning to run db sp's to remove duplicate beans and
descriptions periodically at the server.

as for determining a match thats a bit fuzzy right now, and ideally is
what i'd like to use :)

someone would provide a set of descriptions and values and the
database return the closest match to those descriptions, i was
thinking of each extension of the main app to provide its own piece of
WHERE clause, but I dont know, maybe just provide the set of values
and then a number of searches perform different scoring functions on
each value to provide a set of beans and probabilities? what do you
think?

thanks once again for the reply

regards

Adam Machanic
10/28/2004 9:37:33 PM
[quoted text, click to view]

Are these all just arbitrary descriptions? Or are they typed in the
application? That is, does the UI know the difference between a
"descriptor" and a "color"?

If the UI can differentiate, then you should definitely store these
different types in different columns; otherwise, it would be impossible (or
at least extremely difficult), for instance, to constrain such that a length
must be numeric whereas a color must be "orange", "red", or "green"... This
will also make searches much quicker -- if a user wants to search on length,
you'll be able to take advantage of an index on that column. Not so if you
have all of the values stored in an arbitrary way.

On the other hand, if the "descriptor" is always just some arbitrary
thing defined by the user, there's probably not much normalization you can
do (or that will be necessary to do).

[quoted text, click to view]

You could, perhaps, set up tables with associative values for various
columns; for instance, you might associate the color "orange" with the
colors "yellow", "red", and "burnt sienna", or you might allow a range on
lengths, e.g. +/- 0.02 ...


--
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--


daven NO[at]SPAM cougar-automation.co.uk
10/31/2004 11:26:22 AM
the ui can differentiate between the descriptions, the app is
extensible through addin's.
each addin identifies its own descriptor for the beans so it would
know whether its piece of the search would be looking for colours or
numbers, this is what lead me to the idea of each addin providing its
own piece of where clause.

addin1(colour) "colour like red"

so providing a column, is indexing the data. what about the storage?
(the fact that there would be many rows for a given bean with lots of
nulls in them) is this a problem?

Dave

[quoted text, click to view]
Adam Machanic
10/31/2004 7:10:26 PM
[quoted text, click to view]


Probably not; NULLs take up the minimum amount of disk space for any
given type... So although you will have more disk space than with the
original scheme, you probably won't have that much more...

--
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--

daven NO[at]SPAM cougar-automation.co.uk
11/2/2004 2:06:55 AM
ok matey, thats food for thought, thanks.
Dave


[quoted text, click to view]
AddThis Social Bookmark Button