Newbie question on EQUAL_AREAS and cases


Newbie question on EQUAL_AREAS and cases Bob
9/20/2005 10:53:14 AM
I've created a view joining (in essence) customers with sales. I am
then trying to predict sales amount (continuous) by customer attribute
inputs. I selected DISCRETIZED(EQUAL_AREAS,4) for the predicted
variable (sales amount).

What I don't understand is what "value" is telling me in the model. I
can understand the number of cases associated with "node total" under
value, as well as the number of cases associated with "missing".
However, the other values are numbers like "832.125", with an
associated case total of 593. Now to get that case total, the sales
amount had to be between 175 and 1,490. So in short, what the heck are
the actual sales ranges associated with these "equal buckets"?

Also, as to the definition of "case", I understand that in my example
the case is a "customer" with a key of customer_id. But in the data
mining model, the number associated with "cases" for the "all" category
is a distinct count of all the input variables and the predicted
variable. So even though I define the case to be customer_id, the data
mining output makes the case count much higher and equal, again, to the
distinct count of all the variables in the model.

Thanks.

--Bob Harford
Data Warehousing Architect
AlphaInsight Corp. / Falls Church, VA
Re: Newbie question on EQUAL_AREAS and cases Bob
9/20/2005 11:36:06 AM
Some more searching on the group gave the answer to the equal_areas
question:

I'll re-post a previous post by Jamie MacLennan:
======================================================

If your data mining model was named MyModel the query would be


SELECT DISTINCT price, RangeMin(price), RangeMax(price) FROM MyModel


This is because the decision tree model does not support prediction of
continuous numbers. When a continuous column is selected to be predicted,
it is automatically DISCRETIZED. This means that an algorithm is run to
break the continuous numbers into a set of distinct ranges, or buckets.
For example, an Age column may be broken up into buckets like
0-18, 19-30, 31-45, 45-65, 65+ depending on the distribution of ages.


The value 129916.5 was returned because it was the midpoint of one of
the automatically detected ranges. The query above will provide the
midpoint, min, and max for each range the model found.


--
Jamie MacLennan
SQL Server Data Mining
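
To make the "midpoint of a range" idea concrete, here is a rough sketch in Python of equal-count bucketing. It is only an illustration of the concept and is not the algorithm Analysis Services actually runs for EQUAL_AREAS, so real bucket boundaries may differ. The function name equal_count_buckets and the sales figures are made up for the example.

```python
def equal_count_buckets(values, n_buckets):
    """Split values into n_buckets groups of (nearly) equal size.

    Returns one (min, max, midpoint, case_count) tuple per group.
    Simplified illustration only; not the Analysis Services algorithm.
    """
    ordered = sorted(values)
    total = len(ordered)
    buckets = []
    for i in range(n_buckets):
        # Integer arithmetic spreads any remainder across the buckets.
        start = i * total // n_buckets
        end = (i + 1) * total // n_buckets
        chunk = ordered[start:end]
        if not chunk:
            continue  # fewer values than buckets
        lo, hi = chunk[0], chunk[-1]
        buckets.append((lo, hi, (lo + hi) / 2, len(chunk)))
    return buckets


# Hypothetical sales amounts, one per case.
sales = [20, 35, 50, 90, 175, 300, 640, 1490, 2100, 3500, 8000, 12000]

for lo, hi, mid, count in equal_count_buckets(sales, 4):
    print(f"{lo} to {hi}: midpoint {mid}, {count} cases")
```

Each bucket is reported by its midpoint, which is why a model viewer can show a single number such as 832.125 for what is really a range of sales amounts.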