DBMS_PREDICTIVE_ANALYTICS and DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE

Brendan Member Posts: 236 Bronze Badge
edited May 15, 2013 3:41AM in Machine Learning
What is the difference between using the EXPLAIN procedure in DBMS_PREDICTIVE_ANALYTICS and using the DBMS_DATA_MINING.CREATE_MODEL procedure with the mining function set to DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE?

Is there a difference or do they do the same thing?

Answers

  • 56160 Member Posts: 14
    They both use the same algorithm (and underlying code).
    EXPLAIN adds some pre-processing to handle input attributes using the DATE data type and attributes with unstructured text (in 12c).
    EXPLAIN also adds post-processing to normalize the attribute importance values so that they range from 0 to 1. (Both calling sequences are sketched at the end of this discussion.)

    -Peter
  • Brendan Member Posts: 236 Bronze Badge
    But Attribute Importance in the GUI and in DBMS_DATA_MINING also gives values in the 0 to 1 range.
    They used to give -1 to +1.

    Is there a setting to get the -1 to +1 range, or are all negative values set to zero?
  • Hi Brendan,
    ODMr no longer actually creates an AI model.
    Attribute Importance is generated in the Column Filter node using an ODM function, so no model is persisted.
    The range is 0 to 1.
    When you refer to an older behavior, can you state which versions of the database you are comparing against and which version of ODMr?
    Thanks, Mark
  • Brendan Member Posts: 236 Bronze Badge
    Hi Mark
    It was in the classic version. The earlier version of the 11.2 documentation shows it in sections 9-2 and 9-3. I can email you the doc.
    Brendan
  • Hi Brendan,
    So you are correct, there is a change in how AI scales the results.
    Here is the explanation from the algorithm developer to clarify the intent.
    Thanks, Mark

    The raw score for Attribute Importance is a simple two-part MDL code measure. It views a model as an attempt to reduce communication costs, measured in transmission bits. The cost is the sum of the cost of transmitting the model and the cost of transmitting the data using the model to compress it. This gives a way of comparing a set of different models, in particular a model consisting of the target probability conditioned on a binned set of attribute values versus the prior. The benefit is measured using idealized codes, the entropies (p log p); the best possible code has a cost within a bit of the entropy. The benefit is equal to the reduction in communication cost when the attribute model is chosen relative to the prior model. It is not a good thing if that reduction is negative. In that respect, the measure differs from correlation, where the sign is a direction and the magnitude a strength. Negative values represent uninteresting attributes, so these were set to a benefit of 0.

    The problem with the raw measure is that the range of values depends on the problem: the higher the entropy of the target (the prior model), the greater the scale of the raw values. This makes it difficult for users to interpret, so to simplify we re-scaled the values; the rescaled value is the per-row benefit. (A rough illustration of this per-row benefit is sketched at the end of this discussion.)
  • Brendan Member Posts: 236 Bronze Badge
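
For reference, here is a minimal sketch of the two routes compared in this thread. The table, column, and output names (MINING_DATA, CUST_ID, AFFINITY_CARD, AI_EXPLAIN_RESULT, AI_MODEL) are placeholders, and the result queries assume the result-table and DM_RANKED_ATTRIBUTES shapes documented for EXPLAIN and GET_MODEL_DETAILS_AI:

    -- Route 1: one-shot Attribute Importance with DBMS_PREDICTIVE_ANALYTICS.EXPLAIN
    -- (no persistent model; results are written to a new table)
    BEGIN
      DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
        data_table_name     => 'MINING_DATA',      -- placeholder input table
        explain_column_name => 'AFFINITY_CARD',    -- placeholder target column
        result_table_name   => 'AI_EXPLAIN_RESULT');
    END;
    /
    SELECT attribute_name, explanatory_value, rank
    FROM   ai_explain_result
    ORDER  BY rank;

    -- Route 2: build and keep an Attribute Importance model with DBMS_DATA_MINING
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'AI_MODEL',
        mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
        data_table_name     => 'MINING_DATA',
        case_id_column_name => 'CUST_ID',          -- placeholder case id column
        target_column_name  => 'AFFINITY_CARD');
    END;
    /
    SELECT attribute_name, importance_value, rank
    FROM   TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('AI_MODEL'))
    ORDER  BY rank;

As Peter notes above, both routes are built on the same algorithm; EXPLAIN wraps it with the extra pre- and post-processing.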
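
And a rough illustration of the per-row benefit described in the algorithm note above. This is not ODM's internal code: it ignores the model-transmission half of the two-part MDL measure and only shows the data side, i.e. the drop in target entropy, in bits per row, after conditioning on one already-binned attribute, with a negative drop clipped to zero. MINING_DATA, AFFINITY_CARD (target) and AGE_BIN (a binned attribute) are again placeholders:

    -- Prior entropy of the target, H(T), in bits
    WITH prior_h AS (
      SELECT -SUM(p * LOG(2, p)) AS h0
      FROM  (SELECT COUNT(*) / SUM(COUNT(*)) OVER () AS p
             FROM   mining_data
             GROUP  BY affinity_card)
    ),
    -- Conditional entropy H(T | binned attribute), in bits
    cond_h AS (
      SELECT -SUM(p_at * LOG(2, p_t_a)) AS h1
      FROM  (SELECT COUNT(*) / SUM(COUNT(*)) OVER ()                     AS p_at,   -- P(a, t)
                    COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY age_bin) AS p_t_a   -- P(t | a)
             FROM   mining_data
             GROUP  BY age_bin, affinity_card)
    )
    -- Per-row benefit: the entropy reduction, with negative values set to 0
    SELECT GREATEST(h0 - h1, 0) AS per_row_benefit_bits
    FROM   prior_h, cond_h;

A larger value means the binned attribute compresses the target better than the prior; a value of 0 marks it as uninteresting in the sense described above.
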
This discussion has been closed.