3 Replies Latest reply on Dec 12, 2012 10:35 PM by Mark Kelly-Oracle

# Minimum length description (MDL) used for Attribute importance ?

As I know oracle used 2 part code MDL for Attribute importance, howerver i could not figure out what is the formular(how) that oracle used for calculating the importance value and the basic concept behind that formular?

Anyone have an idea please? or any document related.

Many Thanks
• ###### 1. Re: Minimum length description (MDL) used for Attribute importance ?
Hi,

Documentation on MDL (used for attribute importance):
http://docs.oracle.com/cd/E11882_01/datamine.112/e16808/algo_mdl.htm#CHDFJHBH
The above includes some information on the data preparation we do via ADP. We first bin numerical predictors using supervised binning - a technique we added that leverages ODM decision tree functionality (and also uses the MDL principle inside).

Attribute Importance
Attribute Importance - "Units" or numerical nature of "Importance"

There is more information in the forum if you search for "Attribute Importance"
Thanks, Mark
• ###### 2. Re: Minimum length description (MDL) used for Attribute importance ?
For the example below:

------------------------------------------------------------
Here is a simple example. Imagine a target that takes on one of two values and a predictor that also takes on one of two values and the following data:

Target Predictor
1 1
1 1
1 2
2 1
2 2
2 2

We need to encode the sequence of target values. The shortest encoding depends upon your certainty with respect to the next target's value. With this data, 50% of the targets have value 1, and 50% have value 2. If this is all you know (our baseline), then the best you can do is (the entropy) 0.5 log2(0.5) = 1 bit per target value.

Note that there can be a problem with computing the entropy from counts. If the count was 0, then the entropy would be minus infinity. To avoid this, we smooth the probabilities, using what is called an m-smoothing. We add m to the numerator and num_target_values * m to the denominator. For the baseline model, in this example we add 1 to the numerator and 2 to the denominator. We add two instances to our data. The smoothing in this case does not alter the baseline models probability. It is still 0.5.

To encode the model, we use a trick. Including the "added instances" we have a total of 8 targets. However, given the rules for "adding instances" , if we know what the first 7 targets are, then we know the eighth. To specify the model all we need to know is the number of values of target value 1 among the data. There are only 7 possibilities: 1,2,3,4,5,6,7. You can't have eight and you can't have 0, again, by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model -log2(1/7) = log2(7). For the more general multi-class case, the model encoding is log2(n+k-1 choose k-1) where n is the number of rows and k is the number of target values.

So our baseline model has description length

Parent description length = 1 * 6 + log2(7) = 8.807355

When the predictor value is 1, there are two instances of target value 1 and one instance of target value 2. To keep consistent with the baseline model, we set m = m(baseline model)/number of unique predictor values = 1/2. So the conditional probabilities are:

P(Target = 1|Predictor=1) = (2+0.5)/(3 + 2*0.5) = .625
P(Target = 2|Predictor=1) = (1+0.5)/(3 + 2*0.5) = .375

and, similarly:

P(Target = 1|Predictor=2) = (1+0.5)/(3 + 2*0.5) = .375
P(Target = 2|Predictor=2) = (2+0.5)/(3 + 2*0.5) = .625

We are less uncertain about the target value, so the encoding the targets using this predictor requires fewer than 6 bits:

target sequence encoding cost =-0.625*log2(0.625) * 4 -0.375*log2(0.375) * 2 = 5.7266040175497892 bits

However, to encode the model, note that we do not know where the two "added instances" will fall with respect to the predictor value, so we cannot reduce our uncertainty to 7 instances. Instead we have two sub-models, one for each predictor value, with 4 possibilities each. The counts could be 0,1,2,3 of target value 1 given that the predictor value is 1. Thus the cost to encode the two sub-models is:

model encoding cost = 2 * log2(4) = 4 bits

Child description length = 5.7266040175497892 + 4 = 9.7266040175497892

The attribute importance is:

Parent description length - Child description length = -0.15320818258203092

Any attribute importance < 0, we truncate to 0.
----------------------------------------------------------------

I don't understand the part :
if we know what the first 7 targets are, then we know the eighth. To specify the model all we need to know is the number of values of target value 1 among the data. There are only 7 possibilities: 1,2,3,4,5,6,7. You can't have eight and you can't have 0, again, by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model -log2(1/7) = log2(7). For the more general multi-class case, the model encoding is log2(n+k-1 choose k-1) where n is the number of rows and k is the number of target values.

--> could anybody give me an explaination for that
• ###### 3. Re: Minimum length description (MDL) used for Attribute importance ?
Hi,
Thank you for your interest in learning more about the in-depth details of ODM's AI algorithm. We will be providing more details in a following up post. Could you share with us some of the use cases where you are applying AI?
Mark