Direct RDF/XML reading is not currently supported. You can use Raptor as you have already done, but optionally with a named pipe, as follows:
Using Raptor (rapper) and piping to gzip would produce a 7GB gzipped file. You do not need to split the file. You can then use zcat on that gzipped file and send the output to a named pipe. Use the named pipe as the name of the data file for sqlldr.
If necessary, materializing the gzipped file on disk can be avoided entirely by sending the rapper output directly to the named pipe. Of course, if you need to use the data more than once, materializing it on disk is the better choice.
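The named-pipe technique above can be sketched as follows. The file names are hypothetical, and since rapper and sqlldr are not part of a stock shell environment, the demo uses gzip/zcat and cat as stand-ins for the writer and reader; the real commands are shown in comments.

```shell
#!/bin/sh
# Sketch of feeding sqlldr through a named pipe (hypothetical paths).
# Real workflow:
#   writer: rapper -o ntriples uniprot.rdf > uniprot.nt   (or | gzip > uniprot.nt.gz)
#   reader: sqlldr user/pass control=uniprot.ctl data="$PIPE"
PIPE="/tmp/uniprot.pipe"
rm -f "$PIPE"
mkfifo "$PIPE"                      # create the named pipe (FIFO)

# Stand-in for the materialized gzipped N-Triples file.
printf '<s> <p> <o> .\n' | gzip > /tmp/demo.nt.gz

# Writer runs in the background, streaming decompressed data into the pipe.
zcat /tmp/demo.nt.gz > "$PIPE" &

# Reader consumes the pipe as if it were an ordinary data file;
# sqlldr ... data="$PIPE" would read it the same way.
RESULT=$(cat "$PIPE")
wait
echo "$RESULT"
rm -f "$PIPE" /tmp/demo.nt.gz
```

Because the FIFO never stores data on disk, the writer blocks until the reader opens the pipe, so the two processes stream in lockstep without materializing the intermediate file.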
I have a follow-up question regarding loading UniProt using SQL*Loader. I created a staging table called stable_uniprot with the following columns:
create table stable_uniprot (
  RDF$STC_sub  varchar2(4000) not null,
  RDF$STC_pred varchar2(4000) not null,
  RDF$STC_obj  varchar2(4000) not null
);
I then created a control file based on the example control file described in the Oracle by Example tutorial (http://www.oracle.com/technology/obe/11gr1_db/datamgmt/nci_semantic_network/nci_Semantics_les01.htm). During my load, I occasionally encounter UniProt proteins with protein sequences longer than 4000 characters. I think UniProt has a maximum of 5000 characters (because I don't see any sequences longer than that). My question is: how do I load the data containing more than 4000 characters? VARCHAR2 has an upper limit of 4000 characters. Can I use a CLOB field? If so, how would that work?
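One possible approach, sketched below, is to redefine the object column as CLOB and give SQL*Loader an explicit CHAR length larger than the longest expected value; SQL*Loader can load a CLOB inline from the data file this way. Note this is an untested sketch: the field-parsing clauses are illustrative only (I have not checked them against the OBE tutorial's exact control file), and whether a CLOB object column is accepted by the subsequent semantic bulk-load step should be verified against the documentation.

```sql
-- Hypothetical variant, assuming the staging table is altered first:
--   alter table stable_uniprot modify (RDF$STC_obj clob);
load data
infile 'uniprot.nt'
into table stable_uniprot
fields terminated by ' ' optionally enclosed by '"'
trailing nullcols
(
  RDF$STC_sub  char(4000),
  RDF$STC_pred char(4000),
  RDF$STC_obj  char(40000)  -- explicit length lets values > 4000 chars reach the CLOB
)
```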
The rows rejected by the Oracle DB during loading via SQL*Loader are stored in the "bad" file (e.g., d0.bad). You can use the "batch load" functionality, oracle.spatial.rdf.client.BatchLoader, by specifying the name of the bad file as the <N-Triplefile>. (Please refer to the documentation if you need additional details about batch load.)
As one of the current developers of the uniprot.rdf data, I would love to hear about your progress with using the RDF data in Oracle.
If you have any questions or comments you can reach me at jerven.bolleman at isb-sib.ch.
We have done this at SIB as well, although we have not loaded it into an Oracle DB since January (at least going by the current doc).
By the way, in release 14.2 there are 3,207 entries with a sequence of more than 4000 characters, and 1,414 with sequences longer than 5000 amino acids.
The longest is currently 36,805 amino acids.
Best of luck,