5 Replies Latest reply on Jan 6, 2009 7:38 PM by Mannamal-Oracle

    Data "smushing" with 10g Spatial

      Are there any functions or methods for merging properties of RDF nodes that have distinct RDF ID's but actually represent the same entity?

      With other tools that provide OWL support, for example, one can define a rule that checks for specific unique properties of a node and assigns an owl:sameAs property to two nodes that match. This partially gets us there by at least labeling nodes that are equivalent. However, one is left to provide code that "smushes" or aggregates the properties of these two nodes and removes the redundant node.

      So, my question is: Does Oracle Spatial 10g provide some way to merge nodes that represent the same entity, either by attaching a preprocessing method to the data import step or on-the-fly in queries?

        • 1. Re: Data "smushing" with 10g Spatial
          No, the current Oracle functionality does not include aggregating two nodes as you describe. Whatever data is provided to Oracle is kept "as-is" - it is not modified in any way, and there is no API for doing so.

          It is an interesting requirement, however. Could you explain more about what you mean by having distinct RDF IDs but representing the same entity? I am a little unclear on what you mean by RDF IDs. We are always gathering requirements, so we would like to understand more.

          • 2. Re: Data "smushing" with 10g Spatial
            An example that comes up all the time in life sciences is the non-static naming of proteins and genes. If you look across databases (and sometimes even within the same database), the same gene or protein is named differently. This naming problem extends to the URIs in RDF versions of these databases.

            Merging these datasets with a conventional RDBMS requires identifying a common key that links identical proteins/genes, or otherwise providing a translation table that holds all of the identifiers and their database sources to link common entities.

            RDF presents a unique opportunity to "automatically" aggregate the properties for nodes that represent the same protein or gene. In thinking through this, a translation "table" of triples would still be required, but linking all of the data sources could be done with a simple rule.

            For instance, if we used the BioPAX ontology, an OWL rule (I know Oracle doesn't yet support OWL, but this is a good reason to) could state that any physicalEntity (e.g. protein, gene, etc.) that matched another protein or gene entity via a cross-referenced identifier should be linked with an owl:sameAs property. Right now this is possible (albeit slowly) using open-source tools such as Jena and Pellet.

            The real magic would be if some API function could present this pair of identical nodes as a single node, either to the user or programmatically to the developer.
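
            To make the idea concrete, here is a minimal sketch of the "smushing" step in plain Python. It is not an Oracle API and not what Jena or Pellet do internally; triples are modelled as (subject, predicate, object) tuples of strings, the function name `smush` and the `ex:` names are made up for illustration, and picking the lexicographically smallest URI as the surviving node is an arbitrary assumption.

```python
OWL_SAME_AS = "owl:sameAs"


def smush(triples):
    """Rewrite triples so that nodes linked by owl:sameAs become one node.

    Assumption: each triple is a (subject, predicate, object) tuple of strings.
    """
    parent = {}  # union-find forest over nodes that appear in sameAs links

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary choice: the lexicographically smaller URI survives.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            union(s, o)

    def canonical(term):
        # Terms that never appear in a sameAs link (e.g. literals) are untouched.
        return find(term) if term in parent else term

    # Drop the sameAs links themselves and redirect everything else.
    return {(canonical(s), p, canonical(o))
            for s, p, o in triples if p != OWL_SAME_AS}


example = [
    ("ex:P1", "owl:sameAs", "ex:P2"),
    ("ex:P1", "ex:name", "TP53"),
    ("ex:P2", "ex:organism", "human"),
    ("ex:P3", "ex:interactsWith", "ex:P2"),
]
merged = smush(example)
```

            In this example both properties end up on ex:P1, and the reference from ex:P3 is redirected to it, so the redundant node ex:P2 disappears from the result.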

            Does this make sense? I'd love to speak with you more about this.
            • 3. Re: Data "smushing" with 10g Spatial

              Yes, that helps. I think I am understanding you right that the cross-referenced identifier is something that comes from the application, and that the RDF store itself does not have to determine that two entities are the same by comparing other properties of the entities (because the comparison might depend on domain knowledge).

              First, OWL support is planned in a future release (see "Oracle 11g and changes in RDF support?"). With that support, the identical entities might not be automatically identified on load, but an additional step could, through some application code, identify entities which have the same cross-referenced identifier, and add a new owl:sameAs link between the two entities.

              In 10gR2 the same thing is possible by adding a user-defined property, say :identicalTo. Then define user-defined rules to relate the two entities - example:
              if A :identicalTo B, and A rdfs:subClassOf X
              then B rdfs:subClassOf X

              An inferencing step would create these relations.
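
              As an illustration only, the effect of such a rule can be sketched in plain Python. This does not use the Oracle rulebase or inferencing API; the function name `apply_identical_to_rule` and the tuple representation of triples are assumptions made for the sketch.

```python
def apply_identical_to_rule(triples, link=":identicalTo", prop="rdfs:subClassOf"):
    """Return the triples inferred by:
    if A <link> B and A <prop> X, then B <prop> X.

    Assumption: triples are (subject, predicate, object) tuples of strings.
    The rule is applied repeatedly until nothing new is produced, so chains
    such as A -> B -> C are followed.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        links = [(s, o) for s, p, o in known if p == link]
        for a, b in links:
            for s, p, x in list(known):
                if s == a and p == prop and (b, prop, x) not in known:
                    known.add((b, prop, x))
                    changed = True
    # Only the newly derived triples are returned; the input is not modified.
    return known - set(triples)


asserted = {
    (":A", ":identicalTo", ":B"),
    (":B", ":identicalTo", ":C"),
    (":A", "rdfs:subClassOf", ":X"),
}
inferred = apply_identical_to_rule(asserted)
```

              Note that the rule as written is one-directional: nothing flows from B back to A unless a second :identicalTo link (or a second rule) is added.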

              However, in both approaches (OWL or user-defined rules), that final magic of presenting both nodes as a single node is not there. The query in application code will have to specify it: "find entities that are owl:sameAs a given entity, or :identicalTo a given entity".
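
              What the application-side query has to do can be sketched as follows, again in plain Python rather than through any Oracle query interface. The helper names `equivalents` and `properties_of` are hypothetical, and the sketch assumes the equivalence links should be followed in both directions.

```python
from collections import deque

EQUIVALENCE_PREDICATES = {"owl:sameAs", ":identicalTo"}


def equivalents(triples, start):
    """Return every node reachable from `start` over equivalence links.

    Assumption: links are treated as symmetric and transitive, so they are
    followed in both directions with a breadth-first search.
    """
    neighbours = {}
    for s, p, o in triples:
        if p in EQUIVALENCE_PREDICATES:
            neighbours.setdefault(s, set()).add(o)
            neighbours.setdefault(o, set()).add(s)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def properties_of(triples, start):
    """Collect (predicate, object) pairs of `start` and all its equivalents."""
    nodes = equivalents(triples, start)
    return {(p, o) for s, p, o in triples
            if s in nodes and p not in EQUIVALENCE_PREDICATES}


data = [
    ("ex:geneA", "owl:sameAs", "ex:geneB"),
    ("ex:geneC", ":identicalTo", "ex:geneB"),
    ("ex:geneA", "ex:symbol", "BRCA1"),
    ("ex:geneC", "ex:chromosome", "17"),
]
view = properties_of(data, "ex:geneB")
```

              The stored triples are left untouched; the merged view exists only in the query result.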

              Let us discuss this offline. You can reach me by writing to me at melliyal <dot> annamalai <at> oracle <dot> com.

              • 4. Re: Data "smushing" with 10g Spatial
                Was there some sort of resolution to this issue? I am using Oracle 11g and I am encountering the same issue. I have loaded data from Uniprot and Pubmed into two models. My Pubmed URIs look like: http://www.ncbi.nlm.nih.gov/pubmed/12260993 and my Uniprot URIs for the same item look like: http://purl.uniprot.org/pubmed/12260993. The original poster referred to this issue as smushing. This is a fairly common problem when using RDF in life sciences (there are several blogs and postings regarding this issue). I can insert triples into one (or both) of the graphs using the owl:sameAs predicate. However, I would like to avoid this if there is another solution.
                • 5. Re: Data "smushing" with 10g Spatial
                  A feature to help such data "smushing" is planned for an upcoming release.

                  In the current release, another suggested modeling mechanism (in addition to owl:sameAs) is owl:InverseFunctionalProperty. An example is below:

                  :hasID rdf:type owl:InverseFunctionalProperty.
                  <http://www.ncbi.nlm.nih.gov/pubmed/12260993> :hasID "12260993"^^xsd:integer.
                  <http://purl.uniprot.org/pubmed/12260993> :hasID "12260993"^^xsd:integer.

                  Note that this goes beyond OWL DL but it is something we can handle.
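
                  For readers unfamiliar with the semantics, the entailment this relies on can be sketched in plain Python: two subjects that share a value for an inverse functional property denote the same individual. This is only an illustration of the rule, not how Oracle implements it; the function name `same_as_from_ifp` and the string encoding of the typed literal are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def same_as_from_ifp(triples):
    """Derive owl:sameAs links from inverse functional properties.

    Assumption: triples are (subject, predicate, object) tuples of strings,
    and typed literals are kept as opaque strings.
    """
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}

    # Group subjects by (property, value) for every inverse functional property.
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            subjects_by_value[(p, o)].add(s)

    # Any two subjects sharing a value are the same individual.
    links = set()
    for subjects in subjects_by_value.values():
        for a, b in combinations(sorted(subjects), 2):
            links.add((a, "owl:sameAs", b))
    return links


PUBMED = "http://www.ncbi.nlm.nih.gov/pubmed/12260993"
UNIPROT = "http://purl.uniprot.org/pubmed/12260993"
graph = [
    (":hasID", "rdf:type", "owl:InverseFunctionalProperty"),
    (PUBMED, ":hasID", '"12260993"^^xsd:integer'),
    (UNIPROT, ":hasID", '"12260993"^^xsd:integer'),
]
links = same_as_from_ifp(graph)
```

                  With the three triples from the example above, the sketch yields a single owl:sameAs link between the Uniprot and Pubmed URIs, without that link ever being inserted by hand.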