1 Reply. Latest reply: Feb 22, 2012 8:57 AM by Mark Kelly

    Model Testing with Classification Node

    919129
      Hi,

      I am using DB 11.2, SQL Developer 3.0.04, Data Miner 11.2.0.2.04.40.
      I set up the following test case:
      Table 'T_ALL' contains all data used for mining. PK is the primary key and case-id.
      I split this table manually creating two tables:
      create table T_TEST as select * from T_ALL sample(25);
      create table T_BUILD as select * from T_ALL a1 where not exists (select 1 from T_TEST t where a1.PK = t.PK);

      Also:
      create view V_ALL as select * from T_BUILD union all select * from T_TEST;
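
      One way to sanity-check a manual split like this is to compare the distribution of target values in the two tables. The query below is only a sketch: TARGET is a hypothetical name for the classification target column, which is not named in this post.

      ```sql
      -- TARGET is a hypothetical column name standing in for the real target.
      -- For each table, show how many rows fall into each target value
      -- and what percentage of that table they represent.
      select src,
             target,
             count(*) as n,
             round(100 * ratio_to_report(count(*)) over (partition by src), 2) as pct
      from (
        select 'BUILD' as src, target from T_BUILD
        union all
        select 'TEST'  as src, target from T_TEST
      )
      group by src, target
      order by target, src;
      ```

      If the pct values for the same target differ noticeably between BUILD and TEST, the two sets do not share the same class distribution.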


      In Data Miner's Workflow GUI, I created a classification node with only one SVM model, which uses only one attribute from T_ALL. All settings were left at their defaults.

      Case 1:
      T_ALL is connected to the CLAS node. The test setting in the CLAS node is 'Split for Test: 25%'.
      When viewing test results for the model, the predictive confidence is 30%.
      The same is true if I replace T_ALL with V_ALL.

      Case 2:
      T_BUILD is connected to the CLAS node as 'Build', and T_TEST is connected to the CLAS node as 'Test'. The test setting in the CLAS node is 'Use Test Data Source for Testing'.
      When viewing test results for the model, the predictive confidence is 0%.

      Case 3:
      T_ALL is connected to the CLAS node as 'Build', and T_TEST is connected to the CLAS node as 'Test'. The test setting in the CLAS node is 'Use Test Data Source for Testing'.
      When viewing test results for the model, the predictive confidence is 27%.
      The same is true if I replace T_ALL with V_ALL.

      Q1:
      Could anyone please tell me the cause of this striking difference between Case 1 and Case 2?
      It seems that DM uses all the data for building the model when 'Split for Test: 25%' is selected for the test. (The documentation states otherwise.)

      Q2:
      How does DM actually generate the build and test datasets when 'Split for Test: 25%' is selected for the test?

      Thank you in advance,
      Peter
        • 1. Re: Model Testing with Classification Node
          Mark Kelly
          Hi Peter,
          The Classification Build node uses a stratified split implementation to ensure that the build and test data have the same distribution of target values.
          Your split does not include this treatment, so that accounts for the difference.
          We do not have a Stratified Split node for you to use at this point, but we do have a Sample node that performs something like this.
          If you set the Sample node to stratification, you can take a look at the generated SQL to get a feel for the implementation.
          Thanks, Mark
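
          For readers who want to try a stratified split by hand, the following is a minimal sketch of the general idea. It is not the SQL that Data Miner generates. TARGET is a hypothetical name for the target column, and T_SPLIT_KEYS, SPLIT_ROLE, V_BUILD_S and V_TEST_S are illustrative names as well. Within each target value, rows are ordered by a hash of PK and roughly 25% of them are assigned to the test set, so both sets keep about the same class proportions.

          ```sql
          -- Assign every case to BUILD or TEST separately within each target value.
          -- TARGET is a hypothetical column name; PK is the case id from the question.
          create table T_SPLIT_KEYS as
          select pk,
                 case
                   when row_number() over (partition by target order by ora_hash(pk))
                        <= round(0.25 * count(*) over (partition by target))
                   then 'TEST'
                   else 'BUILD'
                 end as split_role
          from T_ALL;

          -- Expose the two subsets without adding the helper column to the mining data.
          create view V_TEST_S as
          select a.*
          from T_ALL a
          join T_SPLIT_KEYS k on k.pk = a.pk
          where k.split_role = 'TEST';

          create view V_BUILD_S as
          select a.*
          from T_ALL a
          join T_SPLIT_KEYS k on k.pk = a.pk
          where k.split_role = 'BUILD';
          ```

          V_BUILD_S and V_TEST_S could then be connected to the CLAS node as 'Build' and 'Test' in the same way as in Case 2. Ordering by ORA_HASH(pk) makes the assignment repeatable, unlike sample(25), which may return different rows on each run.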