For the example below:
------------------------------------------------------------
Here is a simple example. Imagine a target that takes on one of two values and a predictor that also takes on one of two values and the following data:
Target  Predictor
1       1
1       1
1       2
2       1
2       2
2       2
We need to encode the sequence of target values. The shortest encoding depends upon your certainty about the next target's value. With this data, 50% of the targets have value 1 and 50% have value 2. If this is all you know (our baseline), then the best you can do is (the entropy) -0.5*log2(0.5) - 0.5*log2(0.5) = 1 bit per target value.
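As a quick arithmetic check (my own Python sketch, not from the original post), the baseline entropy falls straight out of the counts:

```python
import math

# Baseline: 3 targets of value 1, 3 targets of value 2
counts = [3, 3]
total = sum(counts)
probs = [c / total for c in counts]  # [0.5, 0.5]

# Shannon entropy in bits per target value
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # 1.0 bit per target
```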
Note that there can be a problem with computing the encoding cost from raw counts. If a count were 0, the corresponding probability would be 0 and its encoding cost, -log2(0), would be infinite. To avoid this, we smooth the probabilities using what is called m-smoothing: we add m to the numerator and num_target_values * m to the denominator. For the baseline model in this example we add 1 to the numerator and 2 to the denominator, which is equivalent to adding two instances (one of each target value) to our data. The smoothing in this case does not alter the baseline model's probability. It is still 0.5.
To encode the model, we use a trick. Including the "added instances" we have a total of 8 targets. However, given the rules for "adding instances", if we know what the first 7 targets are, then we know the eighth. To specify the model, all we need to know is the count of instances with target value 1. There are only 7 possibilities: 1, 2, 3, 4, 5, 6, 7. You can't have eight and you can't have 0, again by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model is -log2(1/7) = log2(7). For the more general multi-class case, the model encoding cost is log2(n+k-1 choose k-1), where n is the number of rows and k is the number of target values.
So our baseline model has description length
Parent description length = 1 * 6 + log2(7) = 8.807355
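Both the parent description length and the general binomial formula are easy to verify numerically (a Python sketch; `math.comb` computes the binomial coefficient):

```python
import math

n_rows = 6  # data rows
k = 2       # number of target values

# Data encoding: 6 targets at 1 bit each (baseline entropy)
data_cost = 1.0 * n_rows

# Model encoding: log2(n+k-1 choose k-1) = log2(7 choose 1) = log2(7)
model_cost = math.log2(math.comb(n_rows + k - 1, k - 1))

parent_dl = data_cost + model_cost
print(parent_dl)  # ≈ 8.807355
```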
When the predictor value is 1, there are two instances of target value 1 and one instance of target value 2. To keep consistent with the baseline model, we set m = m(baseline model)/number of unique predictor values = 1/2. So the conditional probabilities are:
P(Target = 1|Predictor=1) = (2+0.5)/(3 + 2*0.5) = .625
P(Target = 2|Predictor=1) = (1+0.5)/(3 + 2*0.5) = .375
and, similarly:
P(Target = 1|Predictor=2) = (1+0.5)/(3 + 2*0.5) = .375
P(Target = 2|Predictor=2) = (2+0.5)/(3 + 2*0.5) = .625
We are less uncertain about the target value, so encoding the targets using this predictor requires fewer than 6 bits. Each of the 6 rows is encoded at the entropy of its predictor group's distribution:
target sequence encoding cost = (-0.625*log2(0.625) - 0.375*log2(0.375)) * 6 = 5.7266040175497892 bits
However, to encode the model, note that we do not know where the two "added instances" will fall with respect to the predictor value, so we cannot reduce our uncertainty to 7 possibilities as before. Instead we have two sub-models, one for each predictor value, with 4 possibilities each: the count of target value 1 given predictor value 1 could be 0, 1, 2, or 3. Thus the cost to encode the two sub-models is:
model encoding cost = 2 * log2(4) = 4 bits
Child description length = 5.7266040175497892 + 4 = 9.7266040175497892
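The child description length can be checked the same way (a sketch; the smoothed probabilities and costs follow the numbers above):

```python
import math

m = 0.5  # smoothing: m(baseline) / number of unique predictor values = 1 / 2

# Smoothed conditional probabilities within each predictor group
p_major = (2 + m) / (3 + 2 * m)  # 0.625
p_minor = (1 + m) / (3 + 2 * m)  # 0.375

# Each of the 6 rows is encoded at the entropy of its group's distribution
per_row_entropy = -(p_major * math.log2(p_major) + p_minor * math.log2(p_minor))
target_cost = 6 * per_row_entropy  # ≈ 5.726604

# Two sub-models, 4 possible counts (0..3) each
model_cost = 2 * math.log2(4)      # 4 bits

child_dl = target_cost + model_cost
print(child_dl)  # ≈ 9.726604
```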
The attribute importance is the difference per row:
(Parent description length - Child description length) / 6 = (8.807355 - 9.726604) / 6 = -0.15320818258203092
Any attribute importance < 0 is truncated to 0.
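Putting the pieces together, the whole importance calculation (per-row difference, truncated at zero) can be sketched as:

```python
import math

# Parent: 6 targets at 1 bit each, plus log2(7) to encode the model
parent_dl = 6 + math.log2(7)

# Child: 6 rows at the smoothed conditional entropy, plus two 4-way sub-models
p1, p2 = 0.625, 0.375
child_dl = 6 * -(p1 * math.log2(p1) + p2 * math.log2(p2)) + 2 * math.log2(4)

n_rows = 6
importance = (parent_dl - child_dl) / n_rows  # ≈ -0.153208
importance = max(importance, 0.0)             # truncate negatives to 0
print(importance)  # 0.0
```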
----------------------------------------------------------------
I don't understand this part:
if we know what the first 7 targets are, then we know the eighth. To specify the model all we need to know is the number of values of target value 1 among the data. There are only 7 possibilities: 1,2,3,4,5,6,7. You can't have eight and you can't have 0, again, by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model -log2(1/7) = log2(7). For the more general multi-class case, the model encoding is log2(n+k-1 choose k-1) where n is the number of rows and k is the number of target values.
--> Could anybody give me an explanation for that?