7 Replies Latest reply on Jan 4, 2016 5:29 PM by alwu-Oracle

    sem_apis.remove_duplicates

    ChrD

      Hi

       

      When the RDF table for model xxx has been created like

       

      CREATE TABLE xxx_tpl ( id RAW (16) DEFAULT SYS_GUID () PRIMARY KEY, triple sdo_rdf_triple_s);

       

      and - after inserting some duplicates - the command

       

      EXECUTE sem_apis.remove_duplicates('xxx')

       

      gives

       

      ORA-26026: unique index NN3.SYS_C0017356 initially in unusable state

      ORA-06512: at "MDSYS.SDO_RDF", line 4930

      ORA-06512: at "MDSYS.RDF_APIS", line 1363

      ORA-06512: at line 1

       

      and the model gets deleted (also after alter session set skip_unusable_indexes=true) !!!

       

      Without making the id column PRIMARY KEY the "remove_duplicates" works fine!

       

      Regards Christian

        • 1. Re: sem_apis.remove_duplicates
          alwu-Oracle

          Hi Christian,

           

          Thanks. I took a look at it and this is what I found.

           

          create TABLE family_tpl (id RAW (16) DEFAULT SYS_GUID () PRIMARY KEY, triple sdo_rdf_triple_s);

          exec sem_apis.create_sem_model('family', 'family_tpl', 'triple');

          -- add a duplicate

          insert into family_tpl (triple) values(sdo_rdf_triple_s('family', '<urn:s>', '<urn:p>','<urn:o>'));

          insert into family_tpl (triple) values(sdo_rdf_triple_s('family', '<urn:s>', '<urn:p>','<urn:o>'));

          alter session set skip_unusable_indexes=true;

          exec sem_apis.remove_duplicates('family');

           

          ERROR at line 1:

          ORA-26026: unique index SCOTT.SYS_C0029694 initially in unusable state

          ORA-06512: at "MDSYS.SDO_RDF", line 4930

          ORA-06512: at "MDSYS.RDF_APIS", line 1363

          ORA-06512: at line 1

           

          -- The data is gone from the application table

          SQL> select count(1) from family_tpl;

            COUNT(1)

          ----------

               0

           

          -- The data is still in MDSYS.RDF_LINK$ table (partition)

          SQL> select count(1) from mdsys.rdfm_family;

            COUNT(1)

          ----------

               1

           

          Yes. That ORA 26026 was reproduced. However, the data was not removed from the MDSYS.RDF_LINK$ (the view MDSYS.RDFM_FAMILY still has the data). Only the application table was truncated. The overall flow works like this: remove duplicates in RDF_LINK$ partition (by fixing the cost column), truncate application table, disable application table indices, insert data from RDF_LINK$ into the application table, and rebuild the indices. This flow is efficient because we don't actually spend resources on determining what are the duplicates, what to keep, or what to remove.

           

          We don't recommend adding an additional column to the application table. The reasons are twofold: 1) it does not contribute to a SPARQL query unless one explicitly joins the SEM_MATCH with the application table, 2) remove_duplicates logic simply does not know what value to remove/preserve for the additional column.

           

          If you have a strong requirement to have such an ID column with primary/unique key constraint, please file a bug or SR and we can fix the ORA-26026. However, please keep in mind that in this case the application table will have a new set of ID values after the remove_duplicate call.

           

          Hope it helps,


          Zhe Wu

          • 2. Re: sem_apis.remove_duplicates
            ChrD

            Hi Zhe

             

            Thanks - but not just is data gone from the application table; the model is also "gone" from mdsys.rdf_model$ view.

             

            The unique ID column could be used for reification and foreign keys - but as your workflow is, additional columns in the application table are not preserved with remove_duplicates.

             

            If it's not too slow having a flag checking for duplicates with bulk load - maybe it might be possible to check the staging table for duplicates before load?

             

            BR - Christian

            • 3. Re: sem_apis.remove_duplicates
              alwu-Oracle

              Hi Christian,

               

              Hmmm. The model was not supposed to be removed. Could you please provide a small test case so that I can reproduce locally?

               

              For reification, I would recommend using quads (named graphs) instead. You can use the 4th column (the graph name URI) to identify a triple (or a set of triples) and then make further provenance related assertions on that graph name URI. This way, you can use SPARQL to query both base triples and reified assertions.

               

              If you need to use foreign keys, then you do have to use an additional column.

               

              To remove duplicates (most if not all of the duplicates) with bulk load, you can define a view (or a physical table) and feed it to the bulk loader.

               

              e.g.

               

              create view staging_tab_uniq as select /*+ parallel(8) */ unique * from staging_tab;

               

              Hope it helps,

               

              Zhe Wu

              • 4. Re: sem_apis.remove_duplicates
                ChrD

                Hi Zhe

                 

                I thing that if you try "select * from mdsys.rdf_model$" in your own example above you won't see the "family" model after remove_duplicates.

                 

                I agree that using the graph column for reification could be an option.

                 

                And about the duplicates: it's mainly for not importing triples into the model that are already there and not duplicates in the staging table by itself.

                 

                BR - Christian

                • 5. Re: sem_apis.remove_duplicates
                  alwu-Oracle

                  Hi Christian,


                  Thanks. You are right, the metadata about the model was removed from one of the internal tables. And because the logic got interrupted so that metadata did not get a chance to be inserted back. This metadata can be restored if necessary. The graph data was still in a partition in MDSYS.RDF_LINK$.

                   

                  Regarding duplicates, it seems that in each batch of your triples, you may not have many duplicates. However, duplicates can exist across multiple batches and you want to remove them during data loading. Right?

                   

                  Cheers,

                   

                  Zhe Wu

                  • 6. Re: sem_apis.remove_duplicates
                    ChrD

                    Hi Zhe

                     

                    Yes that's right (remove duplicates during data load).

                     

                    If necessary - is there a procedure for restoring data from the link table and reestablish the metadata?

                     

                    ps: as for the primary key on ID, there is this on page 6-6 (ex 6-1) in "RDF Semantic Developers Guide 12C":

                     

                    CREATE TABLE emp_rdf_data (id number, triple sdo_rdf_triple_s);

                    ALTER TABLE emp_rdf_data add CONSTRAINT emp_rdf_data_PK PRIMARY KEY (ID);

                     

                    Cheers - Christian

                    • 7. Re: sem_apis.remove_duplicates
                      alwu-Oracle

                      Hi Christian,

                       

                      Happy New Year!

                       

                      Let's take this offline since it involves some internal data structures and additional privileges. Please email me at alan dot wu at oracle dot com.

                       

                      Thanks,

                       

                      Zhe Wu