SQL Server data mining:
I've created a view joining (in essence) customers with sales. I am then trying to predict sales amount (continuous) from customer attribute inputs. I selected DISCRETIZED(EQUAL_AREAS,4) for the predicted variable (sales amount).

What I don't understand is what "value" is telling me in the model. I can understand the number of cases associated with "node total" under value, as well as the number of cases associated with "missing". However, the other values are numbers like "832.125", with an associated case total of 593. To get that case total, the sales amount had to be between 175 and 1,490. So in short, what the heck are the actual sales ranges associated with these "equal buckets"?

Also, as to the definition of "case": I understand that in my example the case is a "customer" with a key of customer_id. But in the data mining model, the number associated with "cases" for the "all" category is a distinct count over all the input variables and the predicted variable. So even though I define the case to be customer_id, the data mining output makes the case count much higher, equal again to the distinct count of all the variables in the model.

Thanks.

--
Bob Harford
Data Warehousing Architect
AlphaInsight Corp. / Falls Church, VA
Some more searching on the group gave the answer to the EQUAL_AREAS question. I'll re-post a previous post by Jamie MacLennan:

======================================================

If your data mining model was named MyModel, the query would be

SELECT DISTINCT price, RangeMin(price), RangeMax(price) FROM MyModel

This is because the decision tree model does not support prediction of continuous numbers. When a continuous column is selected to be predicted, it is automatically DISCRETIZED. This means that an algorithm is run to break the continuous numbers into a set of distinct ranges, or buckets. For example, an Age column may be broken up into buckets like 0-18, 19-30, 31-45, 45-65, 65+ depending on the distribution of ages.

The value 129916.5 was returned because it was the midpoint of one of the automatically detected ranges. The query above will provide the midpoint, min, and max for each range the model found.

--
Jamie MacLennan
SQL Server Data Mining
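To make the "midpoint of a range" idea concrete, here is a minimal Python sketch of equal-area style bucketing. It is only an illustration: the function name equal_area_buckets and the sample sales figures are hypothetical, and it simply sorts the values and cuts them into groups with the same number of cases, which is not necessarily the algorithm SQL Server uses internally. Note that the range of 175 to 1,490 mentioned in the question has a midpoint of 832.5, close to the reported value of 832.125, which fits the explanation above.

```python
def equal_area_buckets(values, n_buckets):
    """Split values into n_buckets groups holding (nearly) the same number of cases.

    Illustrative only: a plain sort-and-cut, not SQL Server's actual algorithm.
    """
    ordered = sorted(values)
    if n_buckets < 1 or len(ordered) < n_buckets:
        raise ValueError("need at least one value per bucket")

    buckets = []
    for i in range(n_buckets):
        # Integer arithmetic spreads any remainder across the buckets.
        start = i * len(ordered) // n_buckets
        end = (i + 1) * len(ordered) // n_buckets
        chunk = ordered[start:end]
        low, high = chunk[0], chunk[-1]
        buckets.append({
            "min": low,                # comparable to RangeMin
            "max": high,               # comparable to RangeMax
            "mid": (low + high) / 2,   # the single "value" shown for the bucket
            "cases": len(chunk),
        })
    return buckets


# Hypothetical sales amounts, one per case.
sales = [10, 20, 30, 40, 175, 500, 900, 1490, 2000, 3000, 4000, 5000]
for bucket in equal_area_buckets(sales, 4):
    print(bucket)
```

Each bucket holds the same number of cases even though the ranges differ widely in width, which is why a single midpoint value says little about the range until the min and max are listed next to it.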