4 Replies Latest reply on Oct 2, 2008 8:58 AM by jerven

    loading Uniprot data

      I downloaded the UniProt RDF file from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/. The uniprot.rdf.gz file is 4 GB in size. In the past, I have taken RDF/XML files, converted them to N-triple files using Raptor, and loaded them using SQL*Loader. How do I load the uniprot.rdf.gz file? Do I need to convert it to N-triple and use SQL*Loader? Is there a method in Oracle that would allow me to load the data in its RDF/XML format? Do I need to split the 4 GB file?
        • 1. Re: loading Uniprot data
          Direct RDF/XML reading is not currently supported. You can use Raptor as you have already done before, but optionally with a named pipe, as follows:

          Using Raptor (rapper) and piping to gzip would produce a 7GB gzipped file. You do not need to split the file. You can then use zcat on that gzipped file and send the output to a named pipe. Use the named pipe as the name of the data file for sqlldr.

          I think materializing the gzipped file on disk may be avoided, if necessary, by directly sending the rapper output to the named pipe. Of course if you need to use the data more than once, materializing on disk would be a better choice.
          • 2. Re: loading Uniprot data
            I have a followup question regarding loading Uniprot using SQL*Loader. I create a staging table called stable_uniprot with the following columns:

            create table stable_uniprot (
            RDF$STC_sub varchar2(4000) not null,
            RDF$STC_pred varchar2(4000) not null,
            RDF$STC_obj varchar2(4000) not null,
            RDF$STC_sub_ext varchar2(64),
            RDF$STC_pred_ext varchar2(64),
            RDF$STC_obj_ext varchar2(64),
            RDF$STC_canon_ext varchar2(64)

            I then created a control file based on the example control file described in the Oracle by Example tutorial (http://www.oracle.com/technology/obe/11gr1_db/datamgmt/nci_semantic_network/nci_Semantics_les01.htm). During my load, I occasionally encounter Uniprot proteins with protein sequences that are greater than 4000 characters. I think Uniprot has a max of 5000 characters (because I don't see any sequences with more than 5000 characters). My question is, how do I load the data containing more than 5000 characters? varchar2 has an upper limit of 4000 characters. Can I use a CLOB field? If so, how would that work?
            • 3. Re: loading Uniprot data

              The rows rejected by Oracle DB during loading via SQL*Loader are stored in the "bad" file (e.g., d0.bad). You can use the "batch load" functionality, oracle.spatial.rdf.client.BatchLoader, by specifying the name of the bad file as the <N-Triplefile>. (Please refer to doc if you need additional details about batch load.)

              - Souri.
              • 4. Re: loading Uniprot data
                Hi Chuck,

                As one of the current developers of the uniprot.rdf data I would love to hear about your progress with using the rdf data in oracle.
                If you have any questions or comments you can reach me at jerven.bolleman at isb-sib.ch.

                We have done it at SIB as well. However, we have not since january loaded it into an Oracle DB (at least looking at the current doc).

                By the way there are 3,207 entries with a sequence of more than 4000 chars in release 14.2 and 1,414 with sequences longer 5000 amino acids.
                With longest currently being 36,805 amino acids long.

                Best of luck,