
sql server data mining : How to manage the risk of overfitting a data mining model to training data


Guy Mounier
10/16/2005 9:29:50 PM
Hi there,

I was wondering if anyone had a recommendation WRT managing the risk of
overfitting any of my data mining models to the training data set.

In other words, I can achieve an excellent lift with a given data
mining model on a given training data set, but realize very poor
performance on the actual data.

How can I ensure that my model will be reliable on production data?

Has anyone heard of Vapnik's Statistical Learning Theory? The
framework is supposed to help produce reliable models. How can I apply
such a framework with SQL Server 2005?

Thanks.

Guy
Jamie MacLennan (MS)
10/17/2005 5:49:32 PM
In MS Research we are working with ideas from Vapnik's papers for use in
future algorithms.

To minimize risk of poor performance using decision trees, you can use a
technique called cross-validation.

The way you would implement this mechanism is to divide your data into 10
sets and create 10 models. For each model you train with 90% of the data
and test with the remaining 10%, changing the held-out 10% for each model.
The model which scores the best out of this set should be the best fit. You
can also try varying a parameter across the models, such as
COMPLEXITY_PENALTY.
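As a rough illustration of the 10-way scheme described above, here is a minimal Python sketch. It is not SQL Server or DMX code: the majority-class "model", the accuracy scorer, and the synthetic rows are hypothetical stand-ins for a real mining model and data set.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle the row indices and deal them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, labels, train_fn, score_fn, k=10, seed=0):
    """Train k models; each is scored on the one fold it never saw."""
    folds = k_fold_indices(len(rows), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train_fn([rows[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [rows[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return scores

# Hypothetical stand-in model: always predict the majority training class.
def train_majority(rows, labels):
    return max(set(labels), key=labels.count)

# Hypothetical scorer: fraction of held-out cases predicted correctly.
def accuracy(model, rows, labels):
    return sum(1 for y in labels if y == model) / len(labels)

# Synthetic data: 100 rows, one in four is a positive case.
rows = list(range(100))
labels = [1 if r % 4 == 0 else 0 for r in rows]

fold_scores = cross_validate(rows, labels, train_majority, accuracy, k=10)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

The spread of the per-fold scores is as informative as their mean: a large spread suggests the model configuration is sensitive to which cases it was trained on.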

Setting up such a scheme using the data mining wizard and designers is
fairly complex, but it can be done much more easily using Integration
Services for sampling and merging results, ending in data mining training
transforms.

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

Guy Mounier
10/17/2005 6:28:01 PM
Thank you for your answer; I understand it. To be clear, though, the goal
is NOT to find the best fit for my training data but the most reliable
fit. In other words, I want the lift curve for a given model to vary
little as I switch data sets. For example, say I split my training data
set in two, DataSet 1 and DataSet 2, and build data mining models DM1
and DM2 using the same configuration parameters. I would consider the
model reliable if the difference between the lift curves of DM1 and DM2
is small. Otherwise, I would need to modify the configuration parameters
and re-run the data mining models. You suggest varying the
COMPLEXITY_PENALTY parameter. What other key parameters could I use to
tune the "fitness" of the data mining model and avoid overfitting?
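The DM1/DM2 comparison described above could be sketched in Python roughly as follows. This assumes the lift curve is taken in its cumulative-gains form (share of all positive cases captured in the top 20%, 40%, ... of cases ranked by score) and that "differential" means the largest gap between the two curves; both choices, and the scores and labels below, are illustrative assumptions rather than anything SQL Server prescribes.

```python
def cumulative_gains(scores, labels, n_points=10):
    """Share of all positives captured at each depth of the score-ranked list."""
    # Rank cases from highest to lowest model score.
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive case")
    curve = []
    for step in range(1, n_points + 1):
        cutoff = round(len(ranked) * step / n_points)
        curve.append(sum(ranked[:cutoff]) / total_pos)
    return curve

def max_curve_gap(curve_a, curve_b):
    """Largest vertical distance between two curves sampled at the same depths."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))

# Made-up scores from two hypothetical models (DM1, DM2) and the true outcomes.
scores_dm1 = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels_dm1 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores_dm2 = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels_dm2 = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

curve_dm1 = cumulative_gains(scores_dm1, labels_dm1, n_points=5)
curve_dm2 = cumulative_gains(scores_dm2, labels_dm2, n_points=5)
gap = max_curve_gap(curve_dm1, curve_dm2)
print(curve_dm1, curve_dm2, gap)
```

What counts as a "small" gap is a business decision; the sketch only measures it.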

I am a little disappointed that you don't support the Vapnik framework
yet, because in the business world it would bring great value to let
non-statisticians (i.e. business people) build reliable models
automatically. Virtually all business scenarios contain a certain degree
of unpredictability, represented by the space between the Ideal Lift
curve and the data mining model's lift curve. Reliability of the model
therefore becomes the only constant the business can count on. It can
know, for example, that 40% of the outcome cannot be explained by the
input parameters of the model, but that for the other 60% the model
knows what the key drivers are.

Are you aware of a data mining company called KXEN? They are applying
Statistical Learning Theory to build reliable models automatically. I
invite you to take a look at them.

BTW, I enjoy reading the book you co-authored, "Data Mining with SQL
Server 2005".

Jamie MacLennan (MS)
10/18/2005 11:05:33 AM
Thank you for your comments. FYI, KXEN is building a set of plug-in
algorithms to bring their technology inside our framework, allowing you to
use DMX, our tools, Reporting Services, Integration Services, etc. on
top of the KXEN algorithms.

They will also be present at the Paris launch event for SQL Server.

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

Jamie MacLennan (MS)
10/18/2005 11:13:10 AM
To answer your other question, the other key parameter would be
MINIMUM_SUPPORT, which prevents low-support leaves from being generated.
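To make the idea of a minimum-support rule concrete, here is a generic Python sketch of the check a tree learner might apply before accepting a split. It is an illustration only, not the Microsoft Decision Trees implementation; the function name, the case counts, and the threshold of 10 are hypothetical.

```python
def split_allowed(child_counts, minimum_support):
    """Accept a candidate split only if every resulting leaf holds enough cases."""
    return all(count >= minimum_support for count in child_counts)

# A balanced split of 1000 cases passes the check.
balanced_ok = split_allowed([480, 520], minimum_support=10)

# A split that isolates 3 cases in their own leaf is rejected,
# since a 3-case leaf is likely to reflect noise in the training data.
tiny_leaf_ok = split_allowed([997, 3], minimum_support=10)

print(balanced_ok, tiny_leaf_ok)
```

Raising the threshold yields smaller, more conservative trees; lowering it lets the tree follow the training data more closely.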

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

Predictor
10/18/2005 12:06:59 PM



See the FAQ for Usenet comp.ai.neural-nets:

http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html

...especially these sections:

What is overfitting and how can I avoid it?
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-3.html

How can generalization error be estimated?
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-11.html

What are cross-validation and bootstrapping?
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html


-Will Dwinnell
http://will.dwinnell.com