Forum Stats

  • 3,781,168 Users
  • 2,254,486 Discussions
  • 7,879,602 Comments

Discussions

Advice needed on performance strategy

626210
626210 Member Posts: 17
edited Feb 23, 2010 6:00PM in Berkeley DB XML
I have an XML database with the following structure (simplified example):
        <job id="123">
           <case tag="abc" code="0"/>
           <case tag="def" code="1"/>
           <case tag="ghi" code="0"/>
           <case tag="jkl" code="0"/>
                ... etc
        </job>

        <job id="456">
           <case tag="abc" code="0"/>
           <case tag="def" code="0"/>
           <case tag="ghi" code="1"/>
           <case tag="jkl" code="0"/>
                ... etc
        </job>
I want to find all matching tags where the codes differ. In this case it would get me the "def" and "ghi" tags. I do so with a nested for loop:
        let $cltn := collection("my_database")
        let $cases1 := $cltn/job[@id eq "123"]/case
        let $cases2 := $cltn/job[@id eq "456"]/case
        for $c1 in $cases1
          for $c2 in $cases2[@tag eq $c1/@tag]
            where $c1/code ne $c2/code
              return ...
              ... etc ...
That works fine except that it takes forever to run. There are many thousands of case/tags but only several dozen jobs.
I suspect I need to do one of two things:

1. Improve my indexing, or
2. Avoid the double for loop

I have tried the following indexes:
addindex "" tag edge-attribute-equality-string
addindex "" tag node-attribute-equality-string
addindex "" job/case/tag edge-attribute-equality-string (does this even make sense?)
Maybe there is a way to eliminate one or both of the for loops -- perhaps a more clever way to use "where". My suspicion is that the use of "for" forces sequential processing and indexes are simply not used.

Thanks for reading!

Edited by: rcp on Feb 21, 2010 6:00 AM
Tagged:

Answers

  • 637288
    637288 Member Posts: 488
    Hi,

    what about a query like:
    let $col := collection('my_database')
    for $i in ($col/job)[1]/case 
      where count(distinct-values($col/job/case[@tag=$i/@tag]/@code)) > 1
      return data($i/@tag)
    Concerning indexes. You will definitely need them. Your index #3 won't work, since DB XML indexes XML nodes, not the paths of XML documents. You would probably need presence edge indexes on 'case' elements, and equality edge indexes on both attributes. You can explore the query plans to see whether all indexes are being used.

    Hope this helps,
    Vyacheslav
  • 626210
    626210 Member Posts: 17
    edited Feb 21, 2010 8:57AM
    Vyacheslav,

    Thanks for the reply, but that doesn't quite do what I need. I need more than just the tags, I need to be able to output a report something like this:
    Job 123 vs. job 456:
    ghi      0        1
    def      1        0
    I was able to simplify the heart of my query. It works as well but is no faster, despite one book claiming that putting predicates in the for clasue instead of a separate where clause can be faster. Here it is with more detail on the "return"...
    let $cltn := collection("my_database")
    let $cases1 := $cltn/job[@id eq "123"]/case
    let $cases2 := $cltn/job[@id eq "456"]/case
    for $c1 in $cases1, $c2 in $cases2[@tag eq $c1/@tag and @rc ne $c1/@rc]
      return
        <tr>
        <td>$c1/@tag/string()</td><td>$c1/@rc/string()</td><td>$c2/@rc/string()</td>
        </tr>
    I tried both node and "edge-attribute-equality-string" indexes on id, tag and rc with no difference in performance.

    Edited by: rcp on Feb 21, 2010 5:53 AM
  • 637288
    637288 Member Posts: 488
    But how will your report look like if you have, say, 100 jobs? In your example you are using just 2 jobs.
    Have you checked the query plans (XmlQueryExpression.getQueryPlan()) ? Do you see the usage of all your indexes?

    Vyacheslav
    637288
  • 626210
    626210 Member Posts: 17
    edited Feb 23, 2010 11:14AM
    OK, here's what I think is the pertinent part of the queryPlan. (The names are a little different from my simplified example since this is from the actual query, but "tag" and "rc" are as above.)

    I note that the tag and rc indexes are mentioned, so I assume they are being used.
    <QueryPlanToAST>
      <NegativeNodePredicateFilterQP uri="" name="#tmp6">
        <ParentOfAttributeJoinQP>
          <ValueQP container="abatjobs.dbxml-temp13318" index="node-attribute-equality-string" operation="eq" child="tag">
            <PromoteUntyped uri="http://www.w3.org/2001/XMLSchema" name="string">
              <TreatAs>
                <Atomize>
                  <QueryPlanToAST>
                    <StepQP axis="attribute" name="tag" nodeType="attribute">
                      <VariableQP name="tc1"/>
                    </StepQP>
                  </QueryPlanToAST>
                </Atomize>
                <SequenceType type="item()?"/>
              </TreatAs>
            </PromoteUntyped>
          </ValueQP>
          <VariableQP name="tcs2"/>
        </ParentOfAttributeJoinQP>
        <AttributeJoinQP>
          <VariableQP name="#tmp6"/>
          <ValueQP container="abatjobs.dbxml-temp13318" index="node-attribute-equality-string" operation="inv(ne)" child="rc">
            <PromoteUntyped uri="http://www.w3.org/2001/XMLSchema" name="string">
              <TreatAs>
                <Atomize>
                  <QueryPlanToAST>
                    <StepQP axis="attribute" name="rc" nodeType="attribute">
                      <VariableQP name="tc1"/>
                    </StepQP>
                  </QueryPlanToAST>
                </Atomize>
                <SequenceType type="item()?"/>
              </TreatAs>
            </PromoteUntyped>
          </ValueQP>
        </AttributeJoinQP>
      </NegativeNodePredicateFilterQP>
    </QueryPlanToAST>
    Another thing I tried did help but it is more a matter of avoiding the problem then solving it. Instead of running the query against my whole database I made a temporary database that contained nothing but the two "job" documents needed for the query. If indexing "worked" as I think it should then the query should not be any faster this way, but I did observe the following:
    * With dbxml version 2.4.11 there was only a very slight improvement. (from 25 to 20 minutes)
    * With version 2.5.16 the query went down to about 5 minutes.

    Characteristics of my database:
    * anywhere from 400 to 600 "job" documents
    * approx 5000 "tag" elements per job
    * usually between 0 and 5 mismatched "rc" values for any pair of jobs
    * The application only compares selected jobs, about two dozen of them each time it runs.

    Whayt that means is that it takes about 5 minutes (newest dbxml) or 20 minutes (older dbxml) to do a couple of dozen two-way joins of a pair of 5000 entity documents where anywhere from 0 to 5 elements are being selected.

    Edited by: rcp on Feb 23, 2010 8:12 AM (a little clean up)
  • 637288
    637288 Member Posts: 488
    Hi,

    so you run your query for only 2 arbitrary jobs at a time?

    You should use the last release of DB XML if you don't have any strong reasons against it.

    Why can't you use the query I proposed, but slightly modified (assuming that you know the ids of two jobs you are working with):
    let $col := collection('my_database')
    for $i in $col/job[@id=$id1]/case 
      where count(distinct-values($col/job[@id = $id1 or @id= $id2]/case[@tag=$i/@tag]/@code)) > 1
      return data($i/@tag)
    using this query you will get the tags that a different, and knowing the jobs ids it's straightforward to create a report you want to get. Index on the 'id' attribute as well will help you to improve performance.

    Let me know if you have any problems,
    Vyacheslav
  • 626210
    626210 Member Posts: 17
    I have tried the latest dbxml. That's where I got it down to 5 minutes. Unfortunately there is a complication right now. I need to build a 32-bit version on a 64-bit system. Almost there but I need to get the right 32-bit development libs installed. Working on that.

    As for your query, I'm having a bit of trouble with it as my port of it into my scheme never returns any matches -- working on it!
  • 637288
    637288 Member Posts: 488
    Out of curiosity, why do you need 32-bit version of libs?

    As for my query, I tested it successfully on the examples you provided. So it should work :)

    Vyacheslav
  • 626210
    626210 Member Posts: 17
    I need a 32-bit version of dbxml to run on a 32-bit i686 system. That's where my current, older, dbxml is running.

    I'm sure I messed up your query, but I have several things going on at once here and need to put that on the back burner for a bit. Your query was helpful since it did give me the idea for the following, which I did get working and which knocked another 5 minutes off of the total run (down to 15 mins from an original 25 mins):
    for $case1 in $cltn/job[@id eq "job1"]/case
            where $cltn/job[@id eq "job2"]/case[@tag eq $case1/@tag and @rc ne $case1/@rc]
        return ...
  • 637288
    637288 Member Posts: 488
    Did you add the edge index on the 'id' attribute? Changing other indexes to the edge type may also improve performance.
This discussion has been closed.