I'm in Oracle Data Miner. I'm working with the Attribute Importance function, MDL algorithm.
I run the process, and the output includes the "Importance" column.
What is the nature of that column?
I know, of course, that if you sort on it, the bigger the number, the more important the attribute is. Easy enough.
But:
1) Are there any "units" attached to that number? I don't think so...but I want to ask just in case.
2) More important, what is the nature of that number from the 4 types of numbers:
*Nominal - Just a label, nothing more. Can't be this since it has order.
*Ordinal - It can be ordered, but there is no relative magnitude. It must be at least this, by definition of "Importance"
*Interval - Can be added and subtracted. That is, the difference between 0.9 and 0.8 is the same as the difference between 0.3 and 0.2.
*Ratio - Can be multiplied and divided. That is, 0.8 would mean four times as much predictability as 0.2.
I'm assuming it is an Ordinal value only.
But! If it is either Interval or Ratio, what is the formula? Is it proprietary? (I did a search and found that Oracle has a patent on Attribute Importance, so it might not publish the formula.) If it is not proprietary, what is it?
It is not linear, whatever it is. I dummied up a simple data set that is highly predictable, where the first column is an ID, the last column is the value to predict, and the 5 columns in the middle go from fully predictive to random:
insert into mad_diag_16 values ( 1, 0, 0, 0, 0, 0, 'A' );
insert into mad_diag_16 values ( 2, 0, 0, 0, 0, 0, 'A' );
insert into mad_diag_16 values ( 3, 0, 0, 0, 0, 0, 'A' );
insert into mad_diag_16 values ( 4, 0, 0, 0, 0, 0, 'A' );
insert into mad_diag_16 values ( 5, 0, 0, 0, 0, 1, 'A' );
insert into mad_diag_16 values ( 6, 0, 0, 0, 1, 1, 'A' );
insert into mad_diag_16 values ( 7, 0, 0, 1, 1, 1, 'A' );
insert into mad_diag_16 values ( 8, 0, 1, 1, 1, 1, 'A' );
insert into mad_diag_16 values ( 9, 1, 1, 1, 1, 1, 'B' );
insert into mad_diag_16 values ( 10, 1, 1, 1, 1, 1, 'B' );
insert into mad_diag_16 values ( 11, 1, 1, 1, 1, 1, 'B' );
insert into mad_diag_16 values ( 12, 1, 1, 1, 1, 1, 'B' );
insert into mad_diag_16 values ( 13, 1, 1, 1, 1, 0, 'B' );
insert into mad_diag_16 values ( 14, 1, 1, 1, 0, 0, 'B' );
insert into mad_diag_16 values ( 15, 1, 1, 0, 0, 0, 'B' );
insert into mad_diag_16 values ( 16, 1, 0, 0, 0, 0, 'B' );
Clearly, the first [non-ID] column is 8 of 8 predictive, the second is 7 of 8 predictive, and so on down to the second-to-last column, which is 4 of 8 (i.e. non-predictive). I got this from the Attribute Importance MDL function:
Rank  Importance
1      0.9991015040
2      0.4559745760
3      0.1883009600
4      0.0451636420
5     -0.0003967320
So, the 8 of 8 column came very close to 1.0. That makes sense.
And the 4 of 8 column basically came out 0. This also makes sense.
But notice the 7/8ths column got a 0.456 and the 6/8ths column got 0.188, and the 5/8ths column got 0.045. Why is that?
If there is some kind of proportionality along that curve, the numbers would have to be either Interval or Ratio, and there would have to be some type of formula. Any clue as to those?
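One way to probe the shape of that curve is to compare the output against a standard information-theoretic measure. The sketch below computes the mutual information (in bits) between each of the five middle columns and the target, using the same 16 rows as the inserts above. This is only a comparison under an assumption: nothing here confirms that Oracle's MDL ranking is computed this way, and the helper names (`entropy`, `mutual_information`) are illustrative, not part of any Oracle API.

```python
from collections import Counter
from math import log2

# Same 16 rows as the INSERT statements: (id, col1..col5, target).
rows = [
    (1, 0, 0, 0, 0, 0, "A"), (2, 0, 0, 0, 0, 0, "A"),
    (3, 0, 0, 0, 0, 0, "A"), (4, 0, 0, 0, 0, 0, "A"),
    (5, 0, 0, 0, 0, 1, "A"), (6, 0, 0, 0, 1, 1, "A"),
    (7, 0, 0, 1, 1, 1, "A"), (8, 0, 1, 1, 1, 1, "A"),
    (9, 1, 1, 1, 1, 1, "B"), (10, 1, 1, 1, 1, 1, "B"),
    (11, 1, 1, 1, 1, 1, "B"), (12, 1, 1, 1, 1, 1, "B"),
    (13, 1, 1, 1, 1, 0, "B"), (14, 1, 1, 1, 0, 0, "B"),
    (15, 1, 1, 0, 0, 0, "B"), (16, 1, 0, 0, 0, 0, "B"),
]

def entropy(labels):
    """Shannon entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X), in bits (hypothetical helper, not Oracle's formula)."""
    n = len(xs)
    h_y_given_x = 0.0
    for value in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == value]
        h_y_given_x += len(subset) / n * entropy(subset)
    return entropy(ys) - h_y_given_x

target = [r[-1] for r in rows]
scores = [mutual_information([r[c] for r in rows], target) for c in range(1, 6)]

for rank, score in enumerate(scores, start=1):
    print(rank, round(score, 4))
```

On this data the five values come out at roughly 1.0, 0.456, 0.189, 0.046 and 0.0, each within about 0.001 of the Importance column above. If that resemblance is not a coincidence, the number would behave like an amount of information in bits rather than a fraction of rows predicted correctly, which would explain why 7 of 8 scores far less than 7/8.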
Much thanks!
James