
sql server data mining : Association Rules Limitation?



Wayan Sugiantara
11/29/2005 10:28:09 PM
Hi Jamie and All,



I have been using SQL Server 2005 data mining for quite a while. It worked
fine until recently, when I started using a big database. For example, for
market basket analysis we have 3.6 million cases, and the mining model failed
to process. It ran for 2 days and the server was still processing (although it
actually looked frozen). Is there any limit on the number of cases
for this algorithm? I am sure there is no such limit. If so, is
something wrong with our data or the mining model? The mining structure is
very simple and is similar to the sample from Jamie's book on Data
Mining with SQL Server 2005.



Thanks,

Wayan

Jamie MacLennan (MS)
11/30/2005 1:57:20 PM
Can you please provide more details?

E.g. machine specs, number of nested tables, what the processing dialog says,
etc.

Thanks

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

Wayan Sugiantara
12/1/2005 4:06:27 PM
Hi Jesper,

I think that might be a good solution, especially increasing the support
level.
We have only 1215 unique items, so it puzzled me why it took forever to
process.

Thanks for the help.
Wayan


cathy.sherrill NO[at]SPAM saralee.com
12/1/2005 4:20:05 PM
anonymous_user NO[at]SPAM sqlserverdatamining.com
12/1/2005 8:20:04 PM
Hi Wayan,

There isn't any limit to the number of cases per se, but you should be very careful with three parameters.

The first, which you have no control over, is the number of unique items in your dataset. You won't be able to do the same type of analysis with 1,000,000 products as you would with 1,000 products in your database.

The second parameter is MINIMUM_SUPPORT. Always start by setting this to a high number (0.50), and then decrease it until you find the right setting for your case.
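
To illustrate why the support threshold matters, here is a minimal Python sketch. It is not SQL Server's implementation; the toy baskets and the function name `frequent_items` are made up for illustration, and support is assumed to be the fraction of cases that contain an item. Lowering the threshold lets more items through, and every surviving item can then take part in larger itemsets.

```python
from collections import Counter

def frequent_items(baskets, min_support):
    """Return {item: support} for items found in at least min_support of the baskets.

    Hypothetical helper for illustration only; min_support is a fraction (0..1).
    """
    n = len(baskets)
    # Count each item at most once per basket.
    counts = Counter(item for basket in baskets for item in set(basket))
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

# Toy data, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "beer"],
    ["milk", "beer", "bread"],
    ["milk"],
]

# Start high and step down, as suggested above.
for threshold in (0.75, 0.50, 0.25):
    print(threshold, sorted(frequent_items(baskets, threshold)))
```

Starting high and stepping down keeps the early runs cheap, so you can see how quickly the number of surviving items grows before committing to a long processing job.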

The third parameter is MAXIMUM_ITEMSET_SIZE. If you set this to 2 you will only look at pairwise correlations, which is prudent if you have millions of different products. I think the default is 3, and it may be too high in your case.
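
As a rough sketch of why the itemset size matters, the Python snippet below counts how many distinct itemsets of each size could exist for the 1215 unique items mentioned in this thread. This is only an upper bound on the search space, assuming no pruning at all; the actual algorithm discards itemsets below the support threshold, so it never has to consider all of them. The function name `candidate_itemsets` is hypothetical.

```python
from math import comb

def candidate_itemsets(n_items, max_size):
    """Upper bound on the number of possible itemsets, per size, with no pruning.

    Hypothetical helper for illustration only.
    """
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}

# 1215 unique items, as reported earlier in the thread.
for size, count in candidate_itemsets(1215, 3).items():
    print(f"itemsets of size {size}: {count:,}")
```

For 1215 items this gives 737,505 possible pairs but 298,197,855 possible triples, which is why going from a maximum size of 2 to 3 can make such a difference when the support threshold is low.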

Unfortunately the algorithm is limited by the fact that the results have to fit in main memory. It uses a clever data structure for this, but it's still not unlimited.

Hope this helps,
Jesper Lind
Microsoft Research



