Hi there, I was wondering if anyone had a recommendation for managing the risk of overfitting my data mining models to the training data set. In other words, I can achieve excellent lift with a given data mining model on a given training data set, but see very poor performance on the actual data. How can I ensure that my model will be reliable on production data? Has anyone heard of Statistical Learning Theory from Vapnik? The framework is supposed to help produce reliable models. How can I apply such a framework with SQL Server 2005? Thanks. Guy
At MS Research we are working with ideas from Vapnik's papers in algorithms for the future. To minimize the risk of poor performance using decision trees, you can use a technique called cross-validation. To implement it, divide your data into 10 sets and create 10 models. Train each model with 90% of the data and test it with the remaining 10%, changing the held-out 10% for each model. The model which scores the best out of this set should be the best fit. You can also try varying a parameter across the models, such as COMPLEXITY_PENALTY. Setting up such a scheme using the data mining wizard and designers is fairly complex, but it can be done much more easily using Integration Services for sampling and merging results, ending in data mining training transforms. -- -Jamie MacLennan SQL Server Data Mining This posting is provided "AS IS" with no warranties, and confers no rights.
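As a rough illustration of the 10-fold scheme described above, here is a minimal Python sketch. It is not SQL Server or Integration Services code; the function names (`k_fold_indices`, `cross_validate`), the toy majority-label model, and the sample data are all hypothetical stand-ins chosen only to show the train-on-90%, test-on-10% rotation.

```python
import random
from statistics import mean, pstdev

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, train_fn, score_fn, k=10, seed=0):
    """Train k models; each one is tested on a different held-out fold."""
    folds = k_fold_indices(len(rows), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return scores

# Hypothetical stand-in model: always predict the majority training label.
def train_majority(train_rows):
    positives = sum(label for _, label in train_rows)
    return 1 if 2 * positives >= len(train_rows) else 0

def accuracy(model, test_rows):
    return mean(1.0 if label == model else 0.0 for _, label in test_rows)

# Toy data: (input, label) pairs, assumed purely for illustration.
rows = [(x, 1 if x % 3 == 0 else 0) for x in range(100)]
scores = cross_validate(rows, train_majority, accuracy, k=10)
print(f"mean={mean(scores):.3f} spread={pstdev(scores):.3f}")
```

The spread of the per-fold scores is worth looking at alongside the mean: a large spread suggests the model's quality depends heavily on which rows it happened to see.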
Thank you for your answer. Understood. To be clear, though, the goal is NOT to find the best fit for my training data, but the most reliable fit. In other words, I want the lift curve for my model to not vary a lot as I switch data sets. For example, let's say I split my training data set in two, DataSet 1 and DataSet 2, and build data mining models DM1 and DM2 using the same configuration parameters. My model DM1 or DM2 will be considered reliable if the difference between the lift curves of DM1 and DM2 is small. Otherwise, I would need to modify the configuration parameters and re-run the data mining models. You suggest varying the COMPLEXITY_PENALTY parameter. What other key parameters could I use to tune the "fitness" of the data mining model and avoid overfitting? I am a little disappointed that you don't support the Vapnik framework yet, because in the business world it would bring such high value to let non-statisticians (i.e. business people) build reliable models automatically. Virtually all business scenarios contain a certain degree of unpredictability, represented by the space between the ideal lift curve and the data mining model's lift curve. Therefore, reliability of the model becomes the only constant that the business can count on: it can know, for example, that 40% of the outcome cannot be explained by the input parameters of the model, but that for the other 60%, the model knows what the key drivers are. Are you aware of a data mining company called Kxen? They are applying Statistical Learning Theory to build reliable models automatically. I invite you to take a look at them. BTW, I enjoy reading the book you co-authored, "Data Mining with SQL Server 2005".
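The reliability check described above (compare the lift curves of two models built with the same parameters) could be sketched roughly as follows in Python. This is only an illustration under assumptions: each model's output is taken to be a list of `(score, label)` pairs with 0/1 labels, the curve is sampled at a fixed number of depth cut-offs, and "small difference" is measured as the largest gap between the two curves. The names `cumulative_lift` and `max_lift_gap` and the sample data are hypothetical.

```python
def cumulative_lift(scored_rows, n_bins=10):
    """Cumulative lift at n_bins depth cut-offs.

    scored_rows: list of (score, label) pairs, label is 0 or 1.
    Lift at a cut-off = hit rate among the top-scored rows / overall hit rate.
    """
    ranked = sorted(scored_rows, key=lambda r: r[0], reverse=True)
    total = len(ranked)
    total_pos = sum(label for _, label in ranked)
    if total_pos == 0:
        raise ValueError("lift is undefined when there are no positive rows")
    base_rate = total_pos / total
    lifts = []
    for b in range(1, n_bins + 1):
        cutoff = max(1, total * b // n_bins)
        hits = sum(label for _, label in ranked[:cutoff])
        lifts.append((hits / cutoff) / base_rate)
    return lifts

def max_lift_gap(lift_a, lift_b):
    """Largest absolute difference between two lift curves, point by point."""
    return max(abs(a - b) for a, b in zip(lift_a, lift_b))

# Made-up scored output standing in for DM1 and DM2.
dm1 = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
       (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
dm2 = [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0), (0.5, 1),
       (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]

lift_1 = cumulative_lift(dm1, n_bins=5)
lift_2 = cumulative_lift(dm2, n_bins=5)
gap = max_lift_gap(lift_1, lift_2)
print(f"max lift gap = {gap:.3f}")
```

What counts as an acceptably small gap is a business decision; the sketch only produces the number to compare against whatever threshold is chosen.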
Thank you for your comments. FYI, KXEN is building a set of plug-in algorithms to bring their technology inside our framework, allowing you to use DMX, our tools, Reporting Services, Integration Services, etc. on top of the KXEN algorithms. They will also be present at the Paris launch event for SQL Server. -- -Jamie MacLennan SQL Server Data Mining
To answer your other question, the other key parameter would be MINIMUM_SUPPORT, which prevents low-support leaves from being generated. -- -Jamie MacLennan SQL Server Data Mining
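To make the minimum-support idea above concrete, here is a small Python sketch of how such a threshold can rule out splits that would create thinly populated leaves. It is a generic illustration, not SQL Server's actual decision tree implementation; the function name `candidate_splits` and the sample values are hypothetical.

```python
def candidate_splits(sorted_values, minimum_support):
    """Return thresholds that leave at least minimum_support rows on each side.

    sorted_values: numeric attribute values, sorted ascending.
    """
    n = len(sorted_values)
    thresholds = []
    for i in range(1, n):
        if sorted_values[i] == sorted_values[i - 1]:
            continue  # no boundary between equal values
        left, right = i, n - i  # row counts in the two would-be leaves
        if left >= minimum_support and right >= minimum_support:
            thresholds.append((sorted_values[i - 1] + sorted_values[i]) / 2)
    return thresholds

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
loose = candidate_splits(values, minimum_support=1)
strict = candidate_splits(values, minimum_support=4)
print(len(loose), strict)
```

Raising the threshold shrinks the set of allowed splits, so the tree cannot carve out leaves that describe only a handful of training rows, which is one of the ways a tree overfits.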