-
1. Re: sem_apis.remove_duplicates
alwu-Oracle Dec 21, 2015 5:41 PM (in response to ChrD)Hi Christian,
Thanks. I took a look at it and this is what I found.
create TABLE family_tpl (id RAW (16) DEFAULT SYS_GUID () PRIMARY KEY, triple sdo_rdf_triple_s);
exec sem_apis.create_sem_model('family', 'family_tpl', 'triple');
-- add a duplicate
insert into family_tpl (triple) values(sdo_rdf_triple_s('family', '<urn:s>', '<urn:p>','<urn:o>'));
insert into family_tpl (triple) values(sdo_rdf_triple_s('family', '<urn:s>', '<urn:p>','<urn:o>'));
alter session set skip_unusable_indexes=true;
exec sem_apis.remove_duplicates('family');
ERROR at line 1:
ORA-26026: unique index SCOTT.SYS_C0029694 initially in unusable state
ORA-06512: at "MDSYS.SDO_RDF", line 4930
ORA-06512: at "MDSYS.RDF_APIS", line 1363
ORA-06512: at line 1
-- The data is gone from the application table
SQL> select count(1) from family_tpl;
COUNT(1)
----------
0
-- The data is still in MDSYS.RDF_LINK$ table (partition)
SQL> select count(1) from mdsys.rdfm_family;
COUNT(1)
----------
1
Yes. That ORA 26026 was reproduced. However, the data was not removed from the MDSYS.RDF_LINK$ (the view MDSYS.RDFM_FAMILY still has the data). Only the application table was truncated. The overall flow works like this: remove duplicates in RDF_LINK$ partition (by fixing the cost column), truncate application table, disable application table indices, insert data from RDF_LINK$ into the application table, and rebuild the indices. This flow is efficient because we don't actually spend resources on determining what are the duplicates, what to keep, or what to remove.
We don't recommend adding an additional column to the application table. The reasons are twofold: 1) it does not contribute to a SPARQL query unless one explicitly joins the SEM_MATCH with the application table, 2) remove_duplicates logic simply does not know what value to remove/preserve for the additional column.
If you have a strong requirement to have such an ID column with primary/unique key constraint, please file a bug or SR and we can fix the ORA-26026. However, please keep in mind that in this case the application table will have a new set of ID values after the remove_duplicate call.
Hope it helps,
Zhe Wu -
2. Re: sem_apis.remove_duplicates
ChrD Dec 22, 2015 10:37 AM (in response to alwu-Oracle)Hi Zhe
Thanks - but not just is data gone from the application table; the model is also "gone" from mdsys.rdf_model$ view.
The unique ID column could be used for reification and foreign keys - but as your workflow is, additional columns in the application table are not preserved with remove_duplicates.
If it's not too slow having a flag checking for duplicates with bulk load - maybe it might be possible to check the staging table for duplicates before load?
BR - Christian
-
3. Re: sem_apis.remove_duplicates
alwu-Oracle Dec 22, 2015 5:16 PM (in response to ChrD)Hi Christian,
Hmmm. The model was not supposed to be removed. Could you please provide a small test case so that I can reproduce locally?
For reification, I would recommend using quads (named graphs) instead. You can use the 4th column (the graph name URI) to identify a triple (or a set of triples) and then make further provenance related assertions on that graph name URI. This way, you can use SPARQL to query both base triples and reified assertions.
If you need to use foreign keys, then you do have to use an additional column.
To remove duplicates (most if not all of the duplicates) with bulk load, you can define a view (or a physical table) and feed it to the bulk loader.
e.g.
create view staging_tab_uniq as select /*+ parallel(8) */ unique * from staging_tab;
Hope it helps,
Zhe Wu
-
4. Re: sem_apis.remove_duplicates
ChrD Dec 22, 2015 7:01 PM (in response to alwu-Oracle)Hi Zhe
I thing that if you try "select * from mdsys.rdf_model$" in your own example above you won't see the "family" model after remove_duplicates.
I agree that using the graph column for reification could be an option.
And about the duplicates: it's mainly for not importing triples into the model that are already there and not duplicates in the staging table by itself.
BR - Christian
-
5. Re: sem_apis.remove_duplicates
alwu-Oracle Dec 22, 2015 9:21 PM (in response to ChrD)Hi Christian,
Thanks. You are right, the metadata about the model was removed from one of the internal tables. And because the logic got interrupted so that metadata did not get a chance to be inserted back. This metadata can be restored if necessary. The graph data was still in a partition in MDSYS.RDF_LINK$.Regarding duplicates, it seems that in each batch of your triples, you may not have many duplicates. However, duplicates can exist across multiple batches and you want to remove them during data loading. Right?
Cheers,
Zhe Wu
-
6. Re: sem_apis.remove_duplicates
ChrD Dec 23, 2015 11:19 AM (in response to alwu-Oracle)Hi Zhe
Yes that's right (remove duplicates during data load).
If necessary - is there a procedure for restoring data from the link table and reestablish the metadata?
ps: as for the primary key on ID, there is this on page 6-6 (ex 6-1) in "RDF Semantic Developers Guide 12C":
CREATE TABLE emp_rdf_data (id number, triple sdo_rdf_triple_s);
ALTER TABLE emp_rdf_data add CONSTRAINT emp_rdf_data_PK PRIMARY KEY (ID);
Cheers - Christian
-
7. Re: sem_apis.remove_duplicates
alwu-Oracle Jan 4, 2016 5:29 PM (in response to ChrD)Hi Christian,
Happy New Year!
Let's take this offline since it involves some internal data structures and additional privileges. Please email me at alan dot wu at oracle dot com.
Thanks,
Zhe Wu