Forum Stats

  • 3,782,046 Users
  • 2,254,585 Discussions
  • 7,879,901 Comments

Discussions

A simple question about indices

637288
637288 Member Posts: 488
edited Sep 16, 2009 9:35AM in Berkeley DB XML
Hi,

Suppose I have a query like
for $a in collection()//foo return $a[@bar='baz']
and I have an index for attribute bar. Is there a way to improve efficiency further by adding index for the node foo? Because then DB XML would know what documents in the container do not contain node foo and DB XML would simply skip those documents from the consideration. Yes, I know indices like edge-element-presence-none, but I can't see use of them in my query plans. Or am I doing smth wrong?

Thanks in advance,
Vyacheslav

Best Answer

  • 522770
    522770 Member Posts: 728
    Accepted Answer
    Hi Vyacheslav,

    If you've added a presence index of "foo" and it hasn't been used, it's because the cost optimizer has decided that the index isn't beneficial in you case. The cost optimizer won't always find the best query plan, but I think it's pretty good in cases like your query. You may find that adding more data or different data will give a different query plan - and may in fact use the presence index.

    John
«1

Answers

  • 597600
    597600 Member Posts: 250
    Defining the index as edge-attribute-equality-string (or -decimal or -date or whatever) should work. I have the following index and the query uses it:
    edge-attribute-equality-string for node {}:cgID
    query { for $bc in collection()//a:Broadcast where $bc/@cgID = 'dvrs' return $bc }
    Michael Ludwig
  • 637288
    637288 Member Posts: 488
    Thanks, Michael,

    but I said that I already had an attribute index and it's successfully being utilized in my queries. But then I thought that there might be further optimization by using somehow an index for the element that contains already indexed attribute.

    Vyacheslav
  • 597600
    597600 Member Posts: 250
    Hi Vyacheslav,

    maybe this is another misunderstanding, but I think that substituting an edge-attribute-equality-whatever for your node-attribute-equality-whatever is going to achieve precisely the optimization you're looking for. What this does is taking the parent of the attribute (or element) into account. In my example, I have a:Broadcast/@cgID and a:Category/@cgID and a need to differentiate between the two. With just node-, DBXML can only do this by analyzing the document after doing the index lookup to determine the parent; with edge-, the parent of the attribute (or element) is part of the index, so no further analysis (even parsing in the case of wholedoc containers) is needed. The query can be fully answered just by consulting the index.

    You could also keep your node- index if there are cases where you need it and create the edge- index in addition to cover the use case you're presenting here.

    Michael Ludwig
  • 637288
    637288 Member Posts: 488
    thanks Michael. Actually I'm using an edge-atribute-equality-string index. But my question was whether I can do some optimimzations by providing some index for the parent element. Or an edge index on the attribute is totally enough. Most like, it is.

    Vyacheslav
  • 522770
    522770 Member Posts: 728
    Accepted Answer
    Hi Vyacheslav,

    If you've added a presence index of "foo" and it hasn't been used, it's because the cost optimizer has decided that the index isn't beneficial in you case. The cost optimizer won't always find the best query plan, but I think it's pretty good in cases like your query. You may find that adding more data or different data will give a different query plan - and may in fact use the presence index.

    John
  • 637288
    637288 Member Posts: 488
    Thanks, John!

    That's what I was asking for.

    Vyacheslav
  • 597600
    597600 Member Posts: 250
    Maybe there is a flaw in my logic, but I think that for the examples given in this thread, the set of documents (or nodes) retrievable by an edge-attribute-whatever index is a strict subset of the set of documents (or nodes) retrievable by a node-element-presence index on the parent element. Which means that it is impossible for the latter to have better selectivity than the former. Which means the optimizer should never prefer the latter to the former. If this is not so, please correct my assumption.

    Michael Ludwig
  • 522770
    522770 Member Posts: 728
    Your logic about edge indexes is sound, but the query optimizer is probabilistic so I guess it could choose to use a presence index on "foo" even with an edge index on the attribute.

    John
  • 597600
    597600 Member Posts: 250
    Thanks, John. My naive understanding after a quick web search is that a probabilistic optimizer is called such in opposition to a deterministic optimizer. The former may consult statistics based on usage data, heuristics, or just throw in some plain randomness, while the letter is, well, deterministic, and acts like a pure function, free of side-effects. Is that about correct?

    Michael Ludwig
  • 522770
    522770 Member Posts: 728
    I wasn't using the term "probabilistic" in opposition to "deterministic". DB XML's optimizer is "deterministic" in that given the same data and indexes it will pick the same query plan. What I was meaning is that the underlying algorithms are an approximation of what happens when the query runs, and also take some heuristic short cuts to prune the query plan search space. As such, it may not find the best query plan - but it usually finds a good query plan.

    John
This discussion has been closed.