This question comes from an article by Jamie MacLennan on using Excel for anomaly detection, so it is mostly addressed to him. When we talk about an anomaly, the first question is what counts as the 'norm' or 'normal'. Let's continue with the college plan scenario. As a quick review, the data has three attributes (parentIncome, studentIQ, parentEncouragement), and the goal is to predict college plan tendency. I find it difficult to reason about anomalies without first having a premise, or better a 'definition', of 'normal'. Along this line of thinking, whatever we decide is 'normal' becomes the rule(s) for anomaly detection. For this particular case, let's assume that IQ > 110 and income > 45000 and encouragement = Yes is the number one rule that leads to college plan = Yes. Now, if this makes sense to you and, say, we want to use the tree algorithm here, how do we embed rules to create a tree model for anomaly detection? I notice that filters are available, at least in the model accuracy testing interface; however, I can't simply equate rules with filters. Your thoughts? Thanks. P.S. Lots of distractions on my end.
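To make the distinction between a hand-written rule and a learned model concrete, here is a minimal sketch of the rule above expressed as a plain predicate. The `Student` class, its field names, and the sample rows are all hypothetical and invented for illustration; this is ordinary Python, not anything from the SQL Server tooling.

```python
from dataclasses import dataclass


@dataclass
class Student:
    # Hypothetical field names mirroring the three attributes in the post.
    parent_income: float
    student_iq: int
    parent_encouragement: bool
    college_plans: bool


def rule_applies(s: Student) -> bool:
    """The poster's 'number one rule': IQ > 110, income > 45000, encouraged."""
    return s.student_iq > 110 and s.parent_income > 45000 and s.parent_encouragement


def rule_violations(students):
    """Cases the rule says should plan for college but do not."""
    return [s for s in students if rule_applies(s) and not s.college_plans]


# Invented sample rows, for illustration only.
SAMPLE = [
    Student(60000, 120, True, True),
    Student(52000, 115, True, False),   # matches the rule, yet no college plans
    Student(30000, 100, False, False),
    Student(48000, 125, True, True),
]

violations = rule_violations(SAMPLE)
```

Note that nothing here is learned from the data: the thresholds are fixed by hand, so a case is "anomalous" only relative to the author's belief, not relative to the data set.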
The assumption with anomaly detection is that "normal" is described by the majority of cases. This assumption allows you to use the tree to determine the rules for a given dataset. If you want to specify the rules yourself, i.e. create an expert system, then you aren't really doing data mining anymore. Personally I think that the clustering method for anomaly detection is much better; however, it doesn't provide direct rules. There's a webcast linked from sqlserverdatamining.com that describes this method.
--
-Jamie MacLennan
SQL Server Data Mining
This posting is provided "AS IS" with no warranties, and confers no rights.
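The "majority defines normal" idea can be sketched roughly as follows: group the cases into leaves, let each leaf's dominant outcome stand in for its rule, and flag the cases that disagree with it. This is only a stand-in, not the Microsoft Decision Trees algorithm. In particular, the split points in `leaf_key` are fixed by hand at the thresholds from the question for brevity (a real tree would learn them), and the `min_support` parameter and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def leaf_key(case):
    """Assign a case to a leaf. Split points are hard-coded here (assumption);
    a real tree algorithm would learn them from the data."""
    iq, income, encouraged, _plans = case
    return (iq > 110, income > 45000, encouraged)


def find_anomalies(cases, min_support=0.8):
    """Flag cases whose outcome disagrees with a strongly dominant
    majority outcome in their leaf."""
    leaves = defaultdict(list)
    for case in cases:
        leaves[leaf_key(case)].append(case)

    anomalies = []
    for members in leaves.values():
        outcome_counts = Counter(c[3] for c in members)
        majority, count = outcome_counts.most_common(1)[0]
        # Only leaves with a clear majority yield a usable "rule".
        if count / len(members) >= min_support:
            anomalies.extend(c for c in members if c[3] != majority)
    return anomalies


# Invented cases: (studentIQ, parentIncome, parentEncouragement, collegePlans)
CASES = [
    (120, 60000, True, True),
    (115, 52000, True, True),
    (125, 48000, True, True),
    (130, 70000, True, True),
    (118, 50000, True, False),   # disagrees with its leaf's majority
    (100, 30000, False, False),
    (95, 28000, False, False),
]

anomalies = find_anomalies(CASES)
```

The outcome attached to each leaf comes from the data rather than from the analyst, which is the difference from an expert system that the reply points out.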
I have reservations about the notion that "normal" is described by the majority of cases, because it lets the 'abnormal' cases contribute to defining the 'normal' population, which I think compromises the quality of 'normal'. But let's leave that for now. I created tree and clustering models for the College Plan scenario, and the accuracy of the two models is very close. The clustering model seems difficult to read (business value extraction, my term). For instance, the cluster diagram displays Cluster 1, Cluster 2, Cluster 3, etc. Although the degree of correlation (cardinality) between clusters is shown by the thickness/brightness of the lines, I would think: a) if the size of a cluster could be tied to its percentage of the population, we would immediately know that cluster X is BIG in this scenario; then, naturally, b) we want to know what this cluster is made of (the key attributes and their respective values). The cluster characteristics feature may address this question; however, it currently seems a bit too much. For instance, cluster X has the same variable repeated many times, such as IQ four times. I'm not suggesting it shouldn't repeat (for a continuous type it's necessary), but I'm wondering whether we could specify the number of variables shown, perhaps through the algorithm parameter settings, though I don't see such a setting. On another note, I need to refresh my memory on clustering concepts and read up more on them. BOL is not sufficient for this; other than googling, any other ideas? Thanks.
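Points a) and b) can be approximated outside the viewer once each case has a cluster label. The sketch below assumes the labels have already been obtained somehow (how they are exported is not covered here) and computes each cluster's share of the population plus one summary value per attribute. The function name, the tuple layout, and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(cases):
    """cases: list of (cluster_label, student_iq, parent_income) tuples.
    Returns, per cluster, its share of the population and one mean per attribute."""
    groups = defaultdict(list)
    for label, iq, income in cases:
        groups[label].append((iq, income))

    total = len(cases)
    summary = {}
    for label, members in groups.items():
        summary[label] = {
            "share": len(members) / total,            # point a): size as a fraction
            "mean_iq": mean(m[0] for m in members),   # point b): one line per attribute
            "mean_income": mean(m[1] for m in members),
        }
    return summary


# Invented, already-labelled cases, for illustration only.
LABELLED = (
    [("Cluster 1", 120, 60000)] * 6
    + [("Cluster 2", 105, 40000)] * 3
    + [("Cluster 3", 98, 31000)]
)

summary = summarize_clusters(LABELLED)
```

Collapsing a continuous attribute to a single mean loses the range detail that the repeated IQ rows convey, so this trades completeness for readability.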
Anomaly detection is looking for data that is unusual for the data set in which it is contained. For instance, it may not be unusual for a student to be an 8-year-old girl, but it would be if the data set you were looking at was for the Harvard Men's Glee Club. For clustering, the best description of how the clustering in SQL 2005 works is in my book.
--
-Jamie MacLennan
SQL Server Data Mining
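The point that an anomaly is relative to its data set can be illustrated with a simple frequency check: the same record is scored against two different populations. This is only a sketch of the idea, not the clustering method from the webcast; the `threshold` value, the function names, and both data sets are invented.

```python
from collections import Counter


def rarity(dataset, record):
    """Fraction of the data set made up of records identical to `record`."""
    counts = Counter(dataset)
    return counts[record] / len(dataset)


def is_anomalous(dataset, record, threshold=0.05):
    """A record is unusual if it is rare in *this* data set (threshold is an assumption)."""
    return rarity(dataset, record) < threshold


# Invented populations of (description, age) records.
ELEMENTARY_SCHOOL = [("girl", 8)] * 40 + [("boy", 8)] * 60
MENS_GLEE_CLUB = [("man", 20)] * 99 + [("girl", 8)]

record = ("girl", 8)
in_school = is_anomalous(ELEMENTARY_SCHOOL, record)
in_glee_club = is_anomalous(MENS_GLEE_CLUB, record)
```

The record itself never changes; only the population it is compared against does.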
"Jamie MacLennan (MS)" wrote:
> Anomaly detection is looking for data that is unusual for the data set in which it is contained. For instance it may not be unusual for a student to be an 8-year old girl, but it would be if the set of data you were looking at was for the Harvard Men's Glee Club.
My name is 'skipper' :)
> For clustering, the best description of how the clustering in SQL 2005 works is in my book.
I'll check it out.