This content has been marked as final.
Show 3 replies

1. Re: Minimum length description (MDL) used for Attribute importance ?
Mark KellyOracle Nov 29, 2012 9:19 AM (in response to QUANNA)Hi,
Here are some links to additional information regarding AI:
Documentation on MDL (used for attribute importance):
http://docs.oracle.com/cd/E11882_01/datamine.112/e16808/algo_mdl.htm#CHDFJHBH
The above includes some information on the data preparation we do via ADP. We first bin numerical predictors using supervised binning  a technique we added that leverages ODM decision tree functionality (and also uses the MDL principle inside).
Attribute Importance
Attribute Importance  "Units" or numerical nature of "Importance"
There is more information in the forum if you search for "Attribute Importance"
Thanks, Mark 
2. Re: Minimum length description (MDL) used for Attribute importance ?
QUANNA Dec 6, 2012 10:26 PM (in response to Mark KellyOracle)For the example below:

Here is a simple example. Imagine a target that takes on one of two values and a predictor that also takes on one of two values and the following data:
Target Predictor
1 1
1 1
1 2
2 1
2 2
2 2
We need to encode the sequence of target values. The shortest encoding depends upon your certainty with respect to the next target's value. With this data, 50% of the targets have value 1, and 50% have value 2. If this is all you know (our baseline), then the best you can do is (the entropy) 0.5 log2(0.5) = 1 bit per target value.
Note that there can be a problem with computing the entropy from counts. If the count was 0, then the entropy would be minus infinity. To avoid this, we smooth the probabilities, using what is called an msmoothing. We add m to the numerator and num_target_values * m to the denominator. For the baseline model, in this example we add 1 to the numerator and 2 to the denominator. We add two instances to our data. The smoothing in this case does not alter the baseline models probability. It is still 0.5.
To encode the model, we use a trick. Including the "added instances" we have a total of 8 targets. However, given the rules for "adding instances" , if we know what the first 7 targets are, then we know the eighth. To specify the model all we need to know is the number of values of target value 1 among the data. There are only 7 possibilities: 1,2,3,4,5,6,7. You can't have eight and you can't have 0, again, by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model log2(1/7) = log2(7). For the more general multiclass case, the model encoding is log2(n+k1 choose k1) where n is the number of rows and k is the number of target values.
So our baseline model has description length
Parent description length = 1 * 6 + log2(7) = 8.807355
When the predictor value is 1, there are two instances of target value 1 and one instance of target value 2. To keep consistent with the baseline model, we set m = m(baseline model)/number of unique predictor values = 1/2. So the conditional probabilities are:
P(Target = 1Predictor=1) = (2+0.5)/(3 + 2*0.5) = .625
P(Target = 2Predictor=1) = (1+0.5)/(3 + 2*0.5) = .375
and, similarly:
P(Target = 1Predictor=2) = (1+0.5)/(3 + 2*0.5) = .375
P(Target = 2Predictor=2) = (2+0.5)/(3 + 2*0.5) = .625
We are less uncertain about the target value, so the encoding the targets using this predictor requires fewer than 6 bits:
target sequence encoding cost =0.625*log2(0.625) * 4 0.375*log2(0.375) * 2 = 5.7266040175497892 bits
However, to encode the model, note that we do not know where the two "added instances" will fall with respect to the predictor value, so we cannot reduce our uncertainty to 7 instances. Instead we have two submodels, one for each predictor value, with 4 possibilities each. The counts could be 0,1,2,3 of target value 1 given that the predictor value is 1. Thus the cost to encode the two submodels is:
model encoding cost = 2 * log2(4) = 4 bits
Child description length = 5.7266040175497892 + 4 = 9.7266040175497892
The attribute importance is:
Parent description length  Child description length = 0.15320818258203092
Any attribute importance < 0, we truncate to 0.

I don't understand the part :
if we know what the first 7 targets are, then we know the eighth. To specify the model all we need to know is the number of values of target value 1 among the data. There are only 7 possibilities: 1,2,3,4,5,6,7. You can't have eight and you can't have 0, again, by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model log2(1/7) = log2(7). For the more general multiclass case, the model encoding is log2(n+k1 choose k1) where n is the number of rows and k is the number of target values.
> could anybody give me an explaination for that 
3. Re: Minimum length description (MDL) used for Attribute importance ?
Mark KellyOracle Dec 12, 2012 4:35 PM (in response to QUANNA)Hi,
Thank you for your interest in learning more about the indepth details of ODM's AI algorithm. We will be providing more details in a following up post. Could you share with us some of the use cases where you are applying AI?
Mark