
sql server data mining : anomaly detection using tree algo



ImJoe
11/22/2005 8:52:08 AM

This question is derived from an article by Jamie MacLennan on using Excel
for anomaly detection, so it's mostly for him.

Now, when we talk about anomalies, the first question is what counts as the
'norm' or 'normal'. Let's continue with the college plan scenario. Quick
review: in this case we have data for three attributes (parentIncome,
studentIQ, parentEncouragement), and the goal is to predict college plan
tendency. I find it difficult to make sense of any speculation about
anomalies without a premise, or better a definition, of 'normal'. Along this
line of thinking, once we settle on something we consider 'normal', it
becomes the rule (or rules) for anomaly detection. For this particular case,
let's assume that IQ > 110 and income > 45000 and encouragement = Yes is the
number one rule that leads to college plan = Yes.

If this makes sense to you and, say, we want to use the tree algorithm here,
how do we embed rules to create a tree model for anomaly detection? I notice
that filters are available, at least in the model accuracy testing interface;
however, I can't simply equate rules with filters. Your thoughts?

Thanks.
Jamie MacLennan (MS)
11/22/2005 9:46:33 AM
The assumption with anomaly detection is that "normal" is described by the
majority of cases. This assumption allows you to use the tree to determine
the rules for a given dataset. If you want to specify the rules, i.e.
create an expert system, then you aren't really doing data mining anymore.
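
To illustrate the "majority defines normal" idea, here is a minimal Python
sketch. It is not the SQL Server decision tree algorithm: the records are
made up, and the splits in leaf() are fixed by hand (borrowing the thresholds
from the question) where a real tree would learn them from the data. Each
leaf's majority outcome is treated as "normal", and a case is flagged when it
disagrees with a strong majority in its own leaf.

```python
from collections import Counter, defaultdict

# Hypothetical records: (parent_income, student_iq, encouraged, plans_college)
CASES = [
    (60000, 120, True, True),
    (55000, 115, True, True),
    (70000, 125, True, True),
    (52000, 118, True, True),
    (48000, 112, True, False),   # same segment as above, opposite outcome
    (30000, 95, False, False),
    (28000, 100, False, False),
    (35000, 90, False, False),
    (32000, 98, False, True),    # same segment as above, opposite outcome
    (25000, 105, False, False),
]


def leaf(case):
    """Assign a case to a leaf using fixed, hand-picked splits.

    Assumption for illustration only: a tree algorithm would learn
    these splits from the data instead.
    """
    income, iq, encouraged, _ = case
    return (income > 45000, iq > 110, encouraged)


def find_anomalies(cases, min_support=0.75):
    """Flag cases whose outcome disagrees with a strong majority in their leaf."""
    by_leaf = defaultdict(list)
    for case in cases:
        by_leaf[leaf(case)].append(case)

    anomalies = []
    for members in by_leaf.values():
        counts = Counter(c[3] for c in members)
        majority, n = counts.most_common(1)[0]
        # Only trust the leaf's "rule" when the majority is strong enough.
        if n / len(members) >= min_support:
            anomalies.extend(c for c in members if c[3] != majority)
    return anomalies


anomalies = find_anomalies(CASES)
for case in anomalies:
    print(case)
```

The min_support threshold is a hypothetical knob: leaves without a clear
majority yield no rule, so nothing in them is flagged.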

Personally I think that the clustering method for anomaly detection is much
better, however it doesn't provide direct rules. There's a webcast linked
from sqlserverdatamining.com that describes this method.
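
As a rough illustration of clustering-based anomaly detection in general (not
the method from the webcast, and not the SQL Server clustering algorithm),
the sketch below runs a plain k-means on made-up two-dimensional points and
ranks cases by distance to their nearest cluster centre. The data, the
starting centroids, and the helper names are all assumptions.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]


def kmeans(points, centroids, iterations=10):
    """Plain k-means starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


# Hypothetical, already-scaled cases: two tight groups and one stray point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),
    (3.0, 9.0),
]

centroids = kmeans(points, centroids=[points[0], points[4]])

# Cases far from every cluster centre are the anomaly candidates.
ranked = sorted(points, key=lambda p: nearest(p, centroids)[1], reverse=True)
print(ranked[0])
```

As noted above, this kind of ranking says which cases look unusual but does
not produce readable rules the way a tree does.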

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

ImJoe
11/22/2005 1:16:10 PM
I have reservations about the notion that "normal" is described by the
majority of cases, because this lets the 'abnormal' cases contribute to
defining the 'normal' population, and I think that compromises the quality of
'normal'. But let's leave it there for now. I created tree and clustering
models for the College Plan scenario. The accuracies of the two models seem
very close. The clustering model seems difficult to read (business value
extraction, my term). For instance, the cluster diagram displays cluster 1,
cluster 2, cluster 3, etc. Although the degree of correlation (cardinality)
between clusters is shown by the thickness/brightness of the lines, I would
think:
a) if the size of a cluster could be tied to its percentage of the
population, then we would immediately know that cluster X is BIG in this
scenario; then next (naturally)
b) we want to know what this cluster is made of (the key attributes and
their respective values). The cluster characteristics feature may address
this question; however, currently it shows a bit too much. For instance,
cluster X has the same variable repeated many times, like IQ 4 times. (I'm
not suggesting it shouldn't repeat; for continuous types it's necessary.
However, I'm wondering if we could specify the number of variables shown
here, maybe through the algorithm parameter settings, though I don't see
such an option.)
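
Point (a) above amounts to reporting each cluster's share of the population.
A small Python sketch of that calculation, assuming we already have one
cluster label per case (the labels below are made up, and this is not a
feature of the cluster viewer):

```python
from collections import Counter

# Hypothetical cluster assignment, one label per case.
assignments = [
    "Cluster 1", "Cluster 2", "Cluster 1", "Cluster 3", "Cluster 1",
    "Cluster 2", "Cluster 1", "Cluster 1", "Cluster 2", "Cluster 1",
]


def cluster_shares(labels):
    """Return each cluster's percentage of the population, largest first."""
    counts = Counter(labels)
    total = len(labels)
    return {name: 100.0 * n / total for name, n in counts.most_common()}


shares = cluster_shares(assignments)
for name, pct in shares.items():
    print(f"{name}: {pct:.0f}% of cases")
```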


On another note, I need to refresh my memory on clustering concepts and read
up more on them. BOL is not sufficient for this; other than googling, any
other ideas?

Thanks.


Jamie MacLennan (MS)
11/22/2005 4:24:24 PM
Anomaly detection is looking for data that is unusual for the data set in
which it is contained. For instance it may not be unusual for a student to
be an 8-year old girl, but it would be if the set of data you were looking
at was for the Harvard Men's Glee Club.
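
The point that "unusual" depends on the containing data set can be shown with
a tiny frequency check in Python. Both data sets and the 5% cutoff are
invented for illustration; the same record is common in one set and rare in
the other.

```python
def rarity(record, dataset):
    """Fraction of the dataset that shares this record's attribute values."""
    matches = sum(1 for row in dataset if row == record)
    return matches / len(dataset)


def is_anomalous(record, dataset, cutoff=0.05):
    """Flag a record when fewer than `cutoff` of the cases look like it."""
    return rarity(record, dataset) < cutoff


record = ("F", "child")  # an 8-year-old girl, as (gender, age band)

# Hypothetical data sets that both contain the record.
elementary_school = [("F", "child")] * 25 + [("M", "child")] * 25
mens_glee_club = [("M", "adult")] * 49 + [record]

print(is_anomalous(record, elementary_school))
print(is_anomalous(record, mens_glee_club))
```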

For clustering, the best description of how the clustering in SQL 2005 works
is in my book.

--

-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.

ImJoe
11/23/2005 1:26:07 PM

My name is 'skipper' :)

I'll check it out.
