6 Replies Latest reply: Nov 19, 2009 2:30 PM by DrClap

    Regular expressions for xml parsing

    807580
      I have an XML parsing problem that I have to solve using regular expressions; I am not allowed to use any other method. But there is a problem that I cannot seem to rap my head around. I want to extract the contents of a tag, but this tag occurs several times in the XML file and I only want the contents of one particular occurrence. Basically, the problem is as follows:

      I want to extract
      <bp:NAME ***stufff***>(I want this part)</bp:NAME>
      This tag can occur in several places. For example, here:
      <bp:ORGANISM>
      ***bunch of tags***
      <bp:NAME ***stufff***>***stufff***</bp:NAME>
      ***bunch of tags***
      </bp:ORGANISM>
      or here;
      <bp:DATABASE>
      ***bunch of tags***
      <bp:NAME ***stufff***>***stufff***</bp:NAME>
      ***bunch of tags***
      </bp:DATABASE>
      I do not want the content of those tags. I want the content of the <NAME> tag that is not between either the <ORGANISM> tags or the <DATABASE> tags. These tags can be in any order. I for the life of me cannot seem to figure this problem out. I tried several different approaches. For example I tried using the following regex
      (?:<bp:NAME [^>]*>([^<]*).*?<bp:ORGANISM>.*?</bp:ORGANISM>|
      <bp:ORGANISM>.*?</bp:ORGANISM>.*?<bp:NAME [^>]*>([^<]*))
      This kind of works: the information I want is either in the first captured group or in the second one, so I just check which group is not empty and that is the one I want. But this only works if there is only one other tag containing the name tag (in this particular regular expression, that is the organism tag). Since there is another tag (the database tag) I have to work around, and these tags can be in any order, the regular expression becomes three times as large, with six different groups in which the information I want can occur. This does not seem like a good idea to me. There has to be another way to do this, so I tried the following regex:
      (?:</bp:ORGANISM>)?.*?(?:</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)
      I thought this would get rid of any occurrences of the other tags in front of the name tag, but it doesn't work either. It seems like it is not greedy enough. Well, I think you get the point. I don't know what to try next, so I really need some help.

      Here is an example of the type of data I will run into. The tags can be in any order and they do not always have to occur. In the example below, the <DATABASE> tag is not part of the data, and the name tag I want just happens to be in front of the organism tag, but this is not always the case. The name tag I want is the first name tag in the file, namely:
      <bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
      So I don't want the name tag that is in between the organism tags.
      <bp:protein rdf:ID="CPATH-27885">
      <bp:COMMENT rdf:datatype="xsd:string">
      Belongs to the nuclear hormone receptor family. NR3 subfamily. SIMILARITY: Contains 1 nuclear receptor DNA-binding domain. WEB RESOURCE: Name=NIEHS-SNPs; URL="http://egp.gs.washington.edu/data/pgr/"; WEB RESOURCE: Name=Wikipedia; Note=Progesterone receptor entry; URL="http://en.wikipedia.org/wiki/Progesterone_receptor"; GENE SYNONYMS: NR3C3. COPYRIGHT:  Protein annotation is derived from the UniProt Consortium (http://www.uniprot.org/).  Distributed under the Creative Commons Attribution-NoDerivs License.
      </bp:COMMENT>
      <bp:SYNONYMS rdf:datatype="xsd:string">Nuclear receptor subfamily 3 group C member 3</bp:SYNONYMS>
      <bp:SYNONYMS rdf:datatype="xsd:string">PR</bp:SYNONYMS>
      <bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
      <bp:ORGANISM>
      <bp:bioSource rdf:ID="CPATH-LOCAL-112384">
      <bp:NAME rdf:datatype="xsd:string">Homo sapiens</bp:NAME>
      <bp:TAXON-XREF>
      <bp:unificationXref rdf:ID="CPATH-LOCAL-112385">
      <bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB>
      <bp:ID rdf:datatype="xsd:string">9606</bp:ID>
      </bp:unificationXref>
      </bp:TAXON-XREF>
      </bp:bioSource>
      </bp:ORGANISM>
      <bp:SHORT-NAME rdf:datatype="xsd:string">PRGR_HUMAN</bp:SHORT-NAME>
      <bp:XREF>
      <bp:relationshipXref rdf:ID="CPATH-LOCAL-112386">
      <bp:DB rdf:datatype="xsd:string">ENTREZ_GENE</bp:DB>
      <bp:ID rdf:datatype="xsd:string">5241</bp:ID>
      </bp:relationshipXref>
      </bp:XREF>
      <bp:XREF>
      <bp:unificationXref rdf:ID="CPATH-LOCAL-112387">
      <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
      <bp:ID rdf:datatype="xsd:string">P06401</bp:ID>
      </bp:unificationXref>
      </bp:XREF>
      <bp:XREF>
      <bp:unificationXref rdf:ID="CPATH-LOCAL-112388">
      <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
      <bp:ID rdf:datatype="xsd:string">A7X8B0</bp:ID>
      </bp:unificationXref>
      </bp:XREF>
      <bp:XREF>
      <bp:relationshipXref rdf:ID="CPATH-LOCAL-112389">
      <bp:DB rdf:datatype="xsd:string">GENE_SYMBOL</bp:DB>
      <bp:ID rdf:datatype="xsd:string">PGR</bp:ID>
      </bp:relationshipXref>
      </bp:XREF>
      <bp:XREF>
      <bp:relationshipXref rdf:ID="CPATH-LOCAL-112390">
      <bp:DB rdf:datatype="xsd:string">REF_SEQ</bp:DB>
      <bp:ID rdf:datatype="xsd:string">NP_000917</bp:ID>
      </bp:relationshipXref>
      </bp:XREF>
      <bp:XREF>
      <bp:unificationXref rdf:ID="CPATH-LOCAL-112391">
      <bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
      <bp:ID rdf:datatype="xsd:string">Q9UPF7</bp:ID>
      </bp:unificationXref>
      </bp:XREF>
      <bp:XREF>
      <bp:unificationXref rdf:ID="CPATH-LOCAL-113580">
      <bp:DB rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CPATH</bp:DB>
      <bp:ID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">27885</bp:ID>
      </bp:unificationXref>
      </bp:XREF>
      </bp:protein>
      Edited by: Dani3ll3 on Nov 19, 2009 2:51 AM
        • 1. Re: Regular expressions for xml parsing
          YoungWinston
          Dani3ll3 wrote:
          I have an XML parsing problem that I have to solve using regular expressions.
          Is this for school or something? Because SAX or DOM would give you access to that kind of stuff easily.
          I cannot seem to rap my head around.
          Rapping heads is not usually conducive to good thought.

          Winston
          • 2. Re: Regular expressions for xml parsing
            807580
            Yes, it is for school. I have to use Java regular expressions. I know there are better ways to do this, but I don't have any other option.
            The following regex doesn't work either. It doesn't seem to be greedy enough. It just keeps on returning the first occurrence of the name tag.
            (?:<bp:ORGANISM>.*?</bp:ORGANISM>)?.*?(?:<bp:DATABASE>.*?</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)
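A small sketch of why this pattern tends to return the first name tag. The issue seems to be less about greediness than about the fact that everything in front of `<bp:NAME ` is allowed to match nothing: when the input does not start with `<bp:ORGANISM>`, both optional groups match the empty string, and the lazy `.*?` then stops at the first `<bp:NAME ` it reaches. The class name `OptionalGroupDemo` and the one-line sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative demo only: class name and sample input are hypothetical.
class OptionalGroupDemo {

    // The pattern from the post above, unchanged.
    static final Pattern ATTEMPT = Pattern.compile(
            "(?:<bp:ORGANISM>.*?</bp:ORGANISM>)?.*?"
            + "(?:<bp:DATABASE>.*?</bp:DATABASE>)?.*?"
            + "<bp:NAME [^>]*>([^<]*)");

    // Made-up input: the unwanted name (inside ORGANISM) comes first.
    static final String SAMPLE = "<bp:protein>"
            + "<bp:ORGANISM><bp:NAME rdf:datatype=\"xsd:string\">Homo sapiens</bp:NAME></bp:ORGANISM>"
            + "<bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor</bp:NAME>"
            + "</bp:protein>";

    // Returns the content captured by the first match, or null if nothing matches.
    static String firstCapture(String xml) {
        Matcher m = ATTEMPT.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The match starts at position 0, where "<bp:ORGANISM>" is not present,
        // so the optional groups match nothing and the name inside ORGANISM is captured.
        System.out.println(firstCapture(SAMPLE)); // prints "Homo sapiens"
    }
}
```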
            Edited by: Dani3ll3 on Nov 19, 2009 3:07 AM
            • 3. Re: Regular expressions for xml parsing
              YoungWinston
              Dani3ll3 wrote:
              I do not want the content of those tags. I want the content of the <NAME> tag that is not between either the <ORGANISM> tags or the <DATABASE> tags. These tags can be in any order. I for the life of me cannot seem to figure this problem out.
              I think you might be trying to do too much at once.
              Have you tried eliminating the content of the tags that you want to ignore using String.replaceAll() first?

              Once you have the result of that, I think the regex might be a lot simpler.

              Winston
              • 4. Re: Regular expressions for xml parsing
                807580
                Thanks a lot after I did that the regular expression worked. :)
                For anyone else having the same problem, this is what I did:
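                // Remove all other occurrences of the name tag by deleting the blocks that contain them.
                // (?s) lets '.' match line breaks, in case a block spans several lines.
                String shortInput = input.replaceAll("(?s)<bp:ORGANISM>.*?</bp:ORGANISM>", "");
                shortInput = shortInput.replaceAll("(?s)<bp:DATABASE>.*?</bp:DATABASE>", "");
                // Find the name tag; after a successful find(), group(1) holds its content.
                Matcher nameMatcher = Pattern.compile("<bp:NAME [^>]*>([^<]*)").matcher(shortInput);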
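For readers who want to run this, here is a self-contained sketch of the same two-step approach (strip the unwanted blocks, then match the remaining name tag). The class name `NameExtractor`, the method `extractName`, and the sample input are made up; the `(?s)` flag is added on the assumption that the blocks can span several lines, since `.` does not match line breaks by default in Java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: class, method and sample data are hypothetical.
class NameExtractor {

    private static final Pattern NAME = Pattern.compile("<bp:NAME [^>]*>([^<]*)");

    // Made-up multi-line input with the wanted name between the two unwanted blocks.
    static final String SAMPLE =
            "<bp:protein rdf:ID=\"CPATH-27885\">\n"
            + "<bp:ORGANISM>\n<bp:NAME rdf:datatype=\"xsd:string\">Homo sapiens</bp:NAME>\n</bp:ORGANISM>\n"
            + "<bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor</bp:NAME>\n"
            + "<bp:DATABASE>\n<bp:NAME rdf:datatype=\"xsd:string\">UniProt</bp:NAME>\n</bp:DATABASE>\n"
            + "</bp:protein>";

    // Returns the content of the name tag outside ORGANISM and DATABASE, or null if none is found.
    static String extractName(String input) {
        // Step 1: delete the blocks that contain the unwanted name tags.
        String shortInput = input.replaceAll("(?s)<bp:ORGANISM>.*?</bp:ORGANISM>", "");
        shortInput = shortInput.replaceAll("(?s)<bp:DATABASE>.*?</bp:DATABASE>", "");
        // Step 2: the first remaining name tag is the one we want.
        Matcher nameMatcher = NAME.matcher(shortInput);
        return nameMatcher.find() ? nameMatcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractName(SAMPLE)); // prints "Progesterone receptor"
    }
}
```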
                • 5. Re: Regular expressions for xml parsing
                  YoungWinston
                  Dani3ll3 wrote:
                  Thanks a lot after I did that the regular expression worked. :)
                  Great!

                  Just as an additional point, you might make it a bit more flexible by enclosing the "ignore" phase in a method. Perhaps something like:
                  private String removeTagsAndContent(String xmlInput, String... ignoreTags) {
                      ... // "ignore" logic
                  }
                  I leave it to you to fill in the blanks :-).
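One possible way to fill in those blanks, offered as a sketch rather than the intended answer. It assumes, as in the patterns earlier in the thread, that the opening tags carry no attributes and that a tag is never nested inside another tag of the same name. The wrapper class `TagStripper` is made up, and the method is made static and package-private here only so that it can be run on its own.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the wrapper class is hypothetical.
class TagStripper {

    // Removes every <tag>...</tag> block (tags and content) for each name in ignoreTags.
    static String removeTagsAndContent(String xmlInput, String... ignoreTags) {
        String result = xmlInput;
        for (String tag : ignoreTags) {
            // Quote the tag name so characters such as '.' or '-' are taken literally.
            String quoted = Pattern.quote(tag);
            // DOTALL lets '.' match line breaks; the lazy .*? stops at the nearest closing tag.
            Pattern block = Pattern.compile("<" + quoted + ">.*?</" + quoted + ">", Pattern.DOTALL);
            result = block.matcher(result).replaceAll("");
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "a<bp:ORGANISM>x\ny</bp:ORGANISM>b<bp:DATABASE>z</bp:DATABASE>c";
        System.out.println(removeTagsAndContent(xml, "bp:ORGANISM", "bp:DATABASE")); // prints "abc"
    }
}
```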

                  Winston
                  • 6. Re: Regular expressions for xml parsing
                    DrClap
                    Dani3ll3 wrote:
                    Thanks a lot after I did that the regular expression worked. :)
                    Good. But remember that in real life, you would then have to apply the XML rules to get the actual contents of the text node. For example, it might be a CDATA section, or it might include characters like ampersands that have been escaped and that you need to unescape. That's why it's better to use a proper parser, as already suggested.
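To illustrate the point, here is a minimal sketch of what a parser-based version might look like using the JDK's built-in DOM API. It picks the `bp:NAME` element that is a direct child of the root element, so names nested inside other elements are skipped without any special handling, and `getTextContent()` takes care of escaped characters and CDATA sections. The class name `DomNameDemo`, the sample document, and the `http://example.org/...` namespace URIs are placeholders, not the real ones from the data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: class name, sample data and namespace URIs are hypothetical.
class DomNameDemo {

    static final String SAMPLE =
            "<bp:protein xmlns:bp=\"http://example.org/bp\" xmlns:rdf=\"http://example.org/rdf\">\n"
            + "  <bp:ORGANISM><bp:bioSource><bp:NAME>Homo sapiens</bp:NAME></bp:bioSource></bp:ORGANISM>\n"
            + "  <bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor &amp; <![CDATA[<co-factor>]]></bp:NAME>\n"
            + "</bp:protein>";

    // Returns the text of the first bp:NAME that is a direct child of the root element, or null.
    static String directChildName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            // Walk only the root's own children; nested names are never visited.
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "bp:NAME".equals(n.getNodeName())) {
                    return n.getTextContent(); // entities unescaped, CDATA unwrapped
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(directChildName(SAMPLE)); // prints "Progesterone receptor & <co-factor>"
    }
}
```

On this sample, a capture group like `([^<]*)` would stop at the start of the CDATA section and leave `&amp;` escaped, which is the kind of gap described above.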

                    It seems to me this forum is full of posts where people are doing homework questions which teach them to do things the wrong way. But of course there's nothing the student can do about that.