I would be grateful if you could answer a few more questions about Analyzing Key Influencers.
1. When specifying the training data for a Decision Tree, there is a SUGGEST button (Recommend inputs for currently set predictable) which recommends which inputs are related to the predictable attribute. It also gives a ‘Score’ for each recommended input. What algorithm does the SUGGEST button use? Does it use a simple entropy/correlation-based algorithm or more sophisticated feature selection algorithms?
2. Can I access this ‘Score’ and recommended inputs above programmatically?
3. What feature selection algorithms are used in SQL Server 2005? Can they be invoked programmatically?
5. In the Logistic Regression mining model viewer, we get a chart which clearly shows which attributes favor which state of the predictable attribute. For example, income level < 23000 favors BikeBuyer = 0 (does not buy) with a score of 89.00. What algorithm is used to calculate the ‘Score’? Can LR be used as a feature selector in cases where the predicted attribute is binary (select the attributes that favor one state or the other with a score greater than, say, some threshold)?
6. You suggested using Naive Bayes to find AKIs. What if the input attributes are all continuous (predicted attribute binary)? Shouldn't I be going for LR?
Thanks bunches
MA
Please see inlines:
1. When specifying the training data for a Decision Tree, there is a SUGGEST button (Recommend inputs for currently set predictable) which recommends which inputs are related to the predictable attribute. It also gives a ‘Score’ for each recommended input. What algorithm does the SUGGEST button use? Does it use a simple entropy/correlation-based algorithm or more sophisticated feature selection algorithms?
Entropy based, on top of a small data sample.
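For intuition only, here is a minimal sketch of an entropy-based (information gain) scorer computed on a small data sample, similar in spirit to what SUGGEST does but not the actual implementation; the sample size and column handling are illustrative assumptions.

```python
# Illustrative sketch of entropy-based input scoring on a small sample.
# Not the actual SUGGEST implementation; sample size is an assumption.
import math
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Shannon entropy of a discrete column."""
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(df: pd.DataFrame, input_col: str, target_col: str) -> float:
    """Reduction in target entropy obtained by splitting on the input column."""
    base = entropy(df[target_col])
    conditional = sum(
        (len(group) / len(df)) * entropy(group[target_col])
        for _, group in df.groupby(input_col)
    )
    return base - conditional

def suggest_inputs(df: pd.DataFrame, target_col: str, sample_size: int = 1000):
    """Score every candidate input against the predictable on a small sample."""
    sample = df.sample(n=min(sample_size, len(df)), random_state=0)
    scores = {
        col: information_gain(sample, col, target_col)
        for col in sample.columns if col != target_col
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```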
2. Can I access this ‘Score’ and recommended inputs above programmatically?
No
3. What feature selection algorithms are used in SQL Server 2005? Can they be invoked programmatically?
They are entropy based and cannot be invoked programmatically. One could write a plug-in algorithm with the sole purpose of performing advanced feature selection/extraction operations. However, such an algorithm is not currently included in Analysis Services.
5. In the Logistic Regression mining model viewer, we get a chart which clearly shows which attributes favor which state of the predictable attribute. For example, income level < 23000 favors BikeBuyer = 0 (does not buy) with a score of 89.00. What algorithm is used to calculate the ‘Score’? Can LR be used as a feature selector in cases where the predicted attribute is binary (select the attributes that favor one state or the other with a score greater than, say, some threshold)?
The score only signifies relative importance among all the other factors. Briefly, here is how that viewer works:
- it generates a fake input set containing one row for each input attribute state, with that state populated and everything else set to NULL (Missing)
- for each row, it runs one prediction, fetches the histogram, and compares the probabilities for 0 vs. 1
- it normalizes the probabilities to scores between 0 and 100
So, assuming your model has IncomeLevel as the only input attribute, the viewer practically executes a prediction for each possible state of the IncomeLevel attribute.
If the input is continuous, then it executes a prediction for each quartile of the input attribute.
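As a rough illustration of those steps (not the actual viewer code), here is a sketch that assumes a hypothetical predict_histogram(case) helper which queries the mining model and returns the probability of each target state for a single-attribute test case; the normalization step is also an assumption.

```python
# Rough sketch of the viewer's scoring steps described above.
# predict_histogram(case) is hypothetical: it is assumed to query the mining
# model and return {target_state: probability} for one test case.
from typing import Callable, Dict, List, Tuple

def score_attribute_states(
    attribute_states: Dict[str, List[object]],  # input attribute -> its states (or quartiles)
    predict_histogram: Callable[[Dict[str, object]], Dict[int, float]],
) -> List[Tuple[str, object, int, float]]:
    raw = []
    for attribute, states in attribute_states.items():
        for state in states:
            # One fake row: this attribute state set, everything else Missing (None).
            histogram = predict_histogram({attribute: state})
            p0, p1 = histogram.get(0, 0.0), histogram.get(1, 0.0)
            favored = 0 if p0 >= p1 else 1
            raw.append((attribute, state, favored, abs(p0 - p1)))

    # Normalize the raw differences to scores between 0 and 100.
    max_diff = max((r[3] for r in raw), default=0.0) or 1.0
    return [(attr, state, fav, 100.0 * diff / max_diff)
            for attr, state, fav, diff in raw]
```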
Now, with this info, LR could be used for feature selection if the target is binary. However, there are a few potential issues:
- LR is implemented as a Neural Network without a hidden layer. On large volumes of data, training may be slow.
- LR has a feature selection parameter itself (the default is the top 255). That feature selection is entropy based, which means the LR part will only see those 255 selected features (so be sure to adjust the value of that parameter)
- for small data sets, the coefficients detected by the neural network for some of the features are not really meaningful (the training goal is reached quickly by adjusting only a few coefficients), so the relative importance may not be reliable in the lower part of the viewer (results with small scores)
If you decide to use LR, you can directly call the stored procedure used by the viewer. Furthermore, the stored proc signature allows you to pinpoint some of the inputs and sort only the others.
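For comparison, here is a minimal sketch of logistic regression as a feature selector for a binary target, using scikit-learn as a stand-in rather than the Analysis Services stored procedure mentioned above; the threshold value is an illustrative assumption.

```python
# Sketch: logistic regression as a feature selector for a binary target,
# using scikit-learn (not the Analysis Services stored procedure).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def select_features_by_lr(X: np.ndarray, y: np.ndarray, names: list, threshold: float = 0.5):
    """Keep features whose standardized coefficient magnitude exceeds the threshold."""
    # Standardize so coefficient magnitudes are comparable across features.
    X_std = StandardScaler().fit_transform(X)
    model = LogisticRegression(max_iter=1000).fit(X_std, y)
    selected = []
    for name, coef in zip(names, model.coef_[0]):
        favored_state = 1 if coef > 0 else 0  # target state this feature pushes toward
        if abs(coef) > threshold:
            selected.append((name, favored_state, abs(coef)))
    return sorted(selected, key=lambda item: item[2], reverse=True)
```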
6. You suggested using Naive Bayes to find AKIs. What if the input attributes are all continuous (predicted attribute binary)? Shouldn't I be going for LR?
Continuous inputs can be discretized to be used in Naive Bayes. Naive Bayes has a few advantages:
- it is pretty straightforward and the results are easy to interpret
- it trains very fast.
- it works with any kind of predictable targets (binary or not, discrete or continuous - via discretization)
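Here is a minimal sketch of the "discretize, then Naive Bayes" approach, using scikit-learn as a stand-in for the Analysis Services algorithm; the number of bins and the binning strategy are illustrative assumptions.

```python
# Sketch: discretize continuous inputs, then fit Naive Bayes.
# Bin count and strategy are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

def naive_bayes_on_continuous(X: np.ndarray, y: np.ndarray, n_bins: int = 5):
    """Discretize continuous inputs into equal-frequency bins, then fit Naive Bayes."""
    discretizer = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    X_binned = discretizer.fit_transform(X)
    model = CategoricalNB().fit(X_binned, y)
    return discretizer, model
```

New cases must be passed through the same discretizer before prediction, so the bins line up with those used during training.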
Thank you very much Bogdan!!
MA