I'm in Oracle Data Miner. I'm working with the Attribute Importance function, MDL algorithm.
I run the process, and the output includes the "Importance" column.
What is the nature of that column?
I know, of course, that if you sort on it, the bigger the number, the more important the attribute is. Easy enough.
1) Are there any "units" attached to that number? I don't think so...but I want to ask just in case.
2) More important, what is the nature of that number from the 4 types of numbers:
*Nominal - Just a label, nothing more. Can't be this since it has order.
*Ordinal - It can be ordered, but there is no relative magnitude. It must be at least this, by definition of "Importance".
*Interval - Can be added and subtracted. That is, the difference between 0.9 and 0.8 is the same as the difference between 0.3 and 0.2.
*Ratio - Can be multiplied and divided. That is, 0.8 represents four times as much predictability as 0.2.
I'm assuming it is an Ordinal value only.
But! If it is either Interval or Ratio, what is the formula? Is it proprietary? (I did a search and found that Oracle has a patent on Attribute Importance, so it might not publish the formula.) If it is not proprietary, what is it?
It is not linear, whatever it is. I dummied up a simple data set that is highly predictable, where the first column is an ID, the last column is the value to predict, and the 5 columns in the middle go from fully predictive to random:
Clearly, the first [non-ID] column is 8 of 8 predictive, the second is 7 of 8 predictive, down to the second to the last column which is 4 of 8 (i.e. non-predictive), and got this from the Attribute Importance MDL function:
Attribute Importance applies the minimum description length principle (see, for example, "Stochastic Complexity in Statistical Inquiry", Jorma Rissanen) to evaluate predictors. MDL is non-linear. It is based on entropy. The probabilities that it uses are smoothed, which makes them different from what might be expected from the raw counts (see the discussion below), and we make use of a simple two-part code (entropy is only a piece of the calculation). The other (non-entropy) part also disturbs what might be expected from an interval or ratio view of the data counts. This two-part code is MDL's mechanism to avoid over-fitting and so is required.
The MDL principle states that the minimum description length explanation is the most probable. We encode the data with a model using a simple two-part code. The first part encodes the model. The second part encodes the data using the model. The encoding that uses the fewest bits wins. To compute the attribute importance of a predictor, we compare the baseline model, the unconditional probability of the target, to the probability of the target conditioned on the predictor value. The baseline is termed the "parent" model and the conditional model is termed the "child" model.
Here is a simple example. Imagine a target that takes on one of two values and a predictor that also takes on one of two values and the following data:
We need to encode the sequence of target values. The shortest encoding depends upon your certainty with respect to the next target's value. With this data, 50% of the targets have value 1, and 50% have value 2. If this is all you know (our baseline), then the best you can do is the entropy, -(0.5 log2(0.5) + 0.5 log2(0.5)) = 1 bit per target value.
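That baseline cost is just the Shannon entropy of the target distribution. A quick sketch in Python (my own illustration, not Oracle's code):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Baseline: 50/50 split over two target values -> 1 bit per target value.
print(entropy([0.5, 0.5]))  # 1.0
```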
Note that there can be a problem with computing the entropy from counts. If a count were 0, its probability would be 0, and log2(0) is minus infinity, so the encoding cost would be infinite. To avoid this, we smooth the probabilities using what is called m-smoothing (an m-estimate): we add m to the numerator and num_target_values * m to the denominator. For the baseline model in this example, we add 1 to the numerator and 2 to the denominator; in effect we add two instances to our data, one of each target value. The smoothing in this case does not alter the baseline model's probability. It is still 0.5.
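The smoothing rule is easy to check numerically. Assuming the toy data has six rows split 3/3 (so that the two added instances give the 8 targets mentioned next), a sketch:

```python
def m_smooth(count, total, num_target_values, m=1.0):
    """m-estimate smoothing: add m to the numerator and
    num_target_values * m to the denominator."""
    return (count + m) / (total + num_target_values * m)

# Baseline: 3 of 6 targets have value 1; smoothing leaves p at 0.5.
print(m_smooth(3, 6, 2))  # 0.5
# A zero count no longer yields probability 0 (and an infinite code length).
print(m_smooth(0, 6, 2))  # 0.125
```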
To encode the model, we use a trick. Including the "added instances" we have a total of 8 targets. However, given the rules for "adding instances", if we know what the first 7 targets are, then we know the eighth. To specify the model, all we need to know is the number of instances of target value 1 among the data. There are only 7 possibilities: 1, 2, 3, 4, 5, 6, 7. You can't have 8 and you can't have 0, again by the rules for adding instances. Thus the number of possible models is the number of ways of choosing a number between 1 and 7. Assuming a uniform probability on the possible models, the probability of our baseline model is 1/7 and the cost of encoding the model is -log2(1/7) = log2(7). For the more general multi-class case, the model encoding cost is log2(C(n+k-1, k-1)), where n is the number of rows and k is the number of target values.
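As a sanity check on that counting argument (my own sketch; here n is the original row count, so n = 6 and k = 2 reproduce the 7 possibilities):

```python
from math import comb, log2

def model_cost_bits(n, k):
    """Bits to encode the model: log2 of the number of possible
    distributions of n rows' counts over k target values."""
    return log2(comb(n + k - 1, k - 1))

# Baseline in the example: n = 6 rows, k = 2 target values
# -> comb(7, 1) = 7 possible models -> log2(7) bits.
print(model_cost_bits(6, 2))
```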
When the predictor value is 1, there are two instances of target value 1 and one instance of target value 2. To keep consistent with the baseline model, we set m = m(baseline model)/number of unique predictor values = 1/2. So the conditional probabilities are:
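Plugging those counts into the smoothing rule above (my own arithmetic, shown for predictor value 1 only):

```python
m = 0.5  # m(baseline) / number of unique predictor values = 1 / 2
num_target_values = 2

# Predictor value 1: two instances of target value 1, one of target value 2.
p_t1 = (2 + m) / (3 + num_target_values * m)  # 0.625
p_t2 = (1 + m) / (3 + num_target_values * m)  # 0.375
print(p_t1, p_t2)
```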
However, to encode the model, note that we do not know where the two "added instances" will fall with respect to the predictor value, so we cannot reduce the possibilities to 7 as we did for the baseline. Instead we have two sub-models, one for each predictor value, with 4 possibilities each: the count of target value 1 given predictor value 1 could be 0, 1, 2, or 3. Thus the cost to encode the two sub-models is:
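Assuming a uniform probability over those possibilities, as was done for the baseline, the arithmetic works out as follows (my sketch, not an official Oracle formula):

```python
from math import log2

# Each sub-model has 4 equally likely possibilities (a count of
# 0, 1, 2, or 3 of target value 1 within that predictor value).
cost_per_submodel = log2(4)               # 2 bits
total_submodel_cost = 2 * cost_per_submodel
print(total_submodel_cost)  # 4.0
```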