Maps in DB XML

637288 Member Posts: 488
edited Jan 5, 2009 7:25AM in Berkeley DB XML
Hey all!

Is it possible to store maps (I mean (key, value) pairs) in DB XML, for example in metadata? I want to index something in order to speed up document retrieval, and this document retrieval is specific to my application. Thanks in advance.

Vyacheslav

Answers

  • 597600
    597600 Member Posts: 250
    In order to speed up document retrieval, an index will help.

    Now you've been a little vague with respect to what it is you want to base your retrieval on. A map, alright. Consider that the default index (by document name) is a metadata index, and a map. The document name is the key, the document is the value.

    Could you provide an example to illustrate what you mean when referring to a map?

    Michael Ludwig
  • 637288
    637288 Member Posts: 488
    OK, I agree that I was too vague. Here is a more detailed explanation.
    I want to simulate a directory structure inside one container. In particular, I want to quickly find out which subdirectories and files contain a random folder. So I wanted to have the following structure in DB XML: pairs (folder, set of entries).
    What I was thinking about is that I could introduce a special namespace for documents which contain only two metafields, one for folder entries and one for file entries. Both metafields would be in comma-separated list format.

    For example:

    This is a document for root folder ('filesystem' is a namespace).
    docname = filesystem:/
    metafield folders = documents,papers
    metafield files = doc1.xml, doc2.xml

    This is a document for 'documents' folder:

    docname = filesystem:/documents
    metafield folders =
    metafield files = documentN.xml,stylesheet.xsl

    Is it a good idea?

    Thanks,
    Vyacheslav
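
The proposal above can be modelled outside DB XML to see what a lookup would involve. The following is a minimal sketch in plain Python, not DB XML API code: the `container` dictionary is a hypothetical stand-in for the container, and `split_csv` and `list_folder` are illustrative names that do not appear in the thread.

```python
# Hypothetical in-memory stand-in for the container:
# document name -> its two comma-separated metafields.
container = {
    "filesystem:/": {
        "folders": "documents,papers",
        "files": "doc1.xml, doc2.xml",
    },
    "filesystem:/documents": {
        "folders": "",
        "files": "documentN.xml,stylesheet.xsl",
    },
}


def split_csv(value):
    """Split a comma-separated metafield, dropping blanks and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def list_folder(path):
    """Return (subfolders, files) for a folder with one lookup by document name."""
    meta = container["filesystem:" + path]
    return split_csv(meta["folders"]), split_csv(meta["files"])


root_folders, root_files = list_folder("/")
```

The lookup itself is a single access by name, but the application has to parse the lists on its own, since the values are opaque strings as far as the database is concerned.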
  • 597600
    597600 Member Posts: 250
    "I want to quickly find out which subdirectories and files contain a random folder."

    Maybe a grammar issue here. "Subdirectories and files" are the object here, I think, and the "random folder" is the subject. So it should read "contains", not "contain". You have a folder and want to quickly find all the entries. Is this correct?
  • 637288
    637288 Member Posts: 488
    Yes, sorry! It was a grammar mistake. You understood me correctly.
  • 597600
    597600 Member Posts: 250
    I would try and take advantage of the semantics of the filesystem URIs, the pathnames. This is untested and unproven, but try encoding the hierarchy depth in the URI like this:
    /          -> 0:fs:/
    /eins      -> 1:fs:/eins
    /zwei      -> 1:fs:/zwei
    /eins/a    -> 2:fs:/eins/a
    /eins/b    -> 2:fs:/eins/b
    /eins/b/x  -> 3:fs:/eins/b/x
    /eins/c    -> 2:fs:/eins/c
    /zwei/abc  -> 2:fs:/zwei/abc
    /zwei/etc  -> 2:fs:/zwei/etc
    Each document gets this glorified path as a meta-data attribute.

    In order to find all children of "/eins", you can then query for everything starting with "2:fs:/eins". The depth is evident from the search pattern: "/eins" has depth 1, so its children must have depth 2. Prefixing the path with the depth avoids matching everything (at least partially matching) when querying for the children of the root node. Or any other node with a lot of descendants (as opposed to children), for that matter.

    An advantage here is that every document has to track only its own position. If a child node disappears, no need to call the police: there is no reference to it stored with the parent node, hence no problem.

    But maybe others have better suggestions. And I think a very similar suggestion was made a while ago on another thread by George Feinberg.

    Michael Ludwig
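
The depth-prefix encoding described in this post can be sketched in a few lines of plain Python. This is an illustration only, not DB XML code: `depth_path`, `child_prefix` and `children_of` are hypothetical helper names, and the prefix match over a list stands in for whatever index the database would use. The sketch appends a trailing slash to the search prefix (as a later reply in this thread does) so that a sibling such as `/einsx` would not match.

```python
def depth_path(path):
    """Encode a real path as '<depth>:fs:<path>'; the root '/' has depth 0."""
    depth = 0 if path == "/" else path.rstrip("/").count("/")
    return f"{depth}:fs:{path}"


def child_prefix(path):
    """Prefix shared by the encoded paths of all direct children of `path`."""
    base = path.rstrip("/")  # '' for the root
    depth = 0 if path == "/" else base.count("/")
    return f"{depth + 1}:fs:{base}/"


def children_of(path, keys):
    """Select the encoded paths that are direct children of `path`."""
    prefix = child_prefix(path)
    return [key for key in keys if key.startswith(prefix)]


paths = ["/", "/eins", "/zwei", "/eins/a", "/eins/b",
         "/eins/b/x", "/eins/c", "/zwei/abc", "/zwei/etc"]
keys = [depth_path(p) for p in paths]

eins_children = children_of("/eins", keys)
root_children = children_of("/", keys)
```

Because the depth is part of the key, the grandchild `/eins/b/x` (depth 3) is not returned when asking for the children of `/eins`, and asking for the children of the root does not match the whole tree.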
  • 522770
    522770 Member Posts: 728
    detonator413 wrote:
    I want to simulate a directory structure inside one container. In particular, I want to quickly find out which subdirectories and files contain a random folder. So I wanted to have the following structure in DB XML: pairs (folder, set of entries).
    What I was thinking about is that I could introduce a special namespace for documents which contain only two metafields, one for folder entries and one for file entries. Both metafields would be in comma-separated list format.

    DB XML understands XML well, but not CSV. Why not use XML to store this list?

    I think it would be easy enough to introduce a new (namespaced) XML document for each of your folders, which listed the contents of that folder.

    John
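
A per-folder listing document along the lines John suggests might look like the output of the sketch below. This is only one possible shape, built with Python's standard `xml.etree.ElementTree`: the namespace URI `urn:example:filesystem` and the element names `folder`, `subfolder` and `file` are assumptions made for illustration, not names from the thread or from DB XML.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:filesystem"  # hypothetical namespace URI


def folder_document(path, folders, files):
    """Build a namespaced XML document listing the contents of one folder."""
    root = ET.Element(f"{{{NS}}}folder", {"path": path})
    for name in folders:
        ET.SubElement(root, f"{{{NS}}}subfolder").text = name
    for name in files:
        ET.SubElement(root, f"{{{NS}}}file").text = name
    return root


def entries(root, kind):
    """Return the text of all direct children of the given kind."""
    return [el.text for el in root.findall(f"{{{NS}}}{kind}")]


doc = folder_document("/", ["documents", "papers"], ["doc1.xml", "doc2.xml"])
xml_text = ET.tostring(doc, encoding="unicode")
```

With the entries stored as elements rather than as a comma-separated string, each entry becomes something an XML database can address and index individually.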
  • 637288
    637288 Member Posts: 488
    Thanks Michael! That was useful.
    But I didn't completely understand: where am I supposed to store information about the eins directory, for example? If I understood correctly, we store such info in the metadata of documents (which are the leaves of our hierarchy). So in which document would we store info about the children of eins?

    And what are the disadvantages of my approach? In my case it would be easy to get the children by getting a document by name, which is very fast.
  • 597600
    597600 Member Posts: 250
    A potential difficulty with your approach is that you'll have to maintain referential integrity to keep the tree well-formed. If /eins/d is added to the tree, /eins has to be updated; same story for removal of nodes.

    Information about the children of /eins isn't stored explicitly; it is, however, implicit in the pathname. When looking at /eins, which translates into 1:fs:/eins, I know that all its children are indexed on a string starting with "2:fs:/eins/". So you'd logically ask for starts-with("2:fs:/eins/").

    You can do better, of course: Store the identity of the document as suggested; and in addition, store the parent node's identity:
    real path     depth path      parent
    /          -> 0:fs:/          SPECIAL
    /eins      -> 1:fs:/eins      0:fs:/
    /zwei      -> 1:fs:/zwei      0:fs:/
    /eins/a    -> 2:fs:/eins/a    1:fs:/eins
    /eins/b    -> 2:fs:/eins/b    1:fs:/eins
    /eins/b/x  -> 3:fs:/eins/b/x  2:fs:/eins/b
    /eins/c    -> 2:fs:/eins/c    1:fs:/eins
    /zwei/abc  -> 2:fs:/zwei/abc  1:fs:/zwei
    /zwei/etc  -> 2:fs:/zwei/etc  1:fs:/zwei
    Again, the parent node follows automatically from the path.

    To get all children of /eins, ask for parent = "1:fs:/eins".

    Of course, you'd still need to handle recursive deletion, insertion out of the tree, etc.

    And I have just realized that with the addition of the reference to the parent node, I may have removed the need to encode the depth in the path. Well, that may still be useful for something ...

    There must be tons of literature on trees out there ... Better advice than I can give is certainly available.

    Michael Ludwig
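
The point that the parent key "follows automatically from the path" can be made concrete with a short plain-Python sketch. It is an illustration under stated assumptions, not DB XML code: `depth_path`, `parent_key` and `build_parent_index` are hypothetical helper names, and the dictionary stands in for an equality index on the parent field. The marker `SPECIAL` for the root is taken from the table above.

```python
import posixpath
from collections import defaultdict


def depth_path(path):
    """Encode a real path as '<depth>:fs:<path>'; the root '/' has depth 0."""
    depth = 0 if path == "/" else path.rstrip("/").count("/")
    return f"{depth}:fs:{path}"


def parent_key(path):
    """Derive the parent's encoded path; the root gets the marker 'SPECIAL'."""
    if path == "/":
        return "SPECIAL"
    return depth_path(posixpath.dirname(path))


def build_parent_index(paths):
    """Map each parent key to the encoded paths of its direct children."""
    index = defaultdict(list)
    for path in paths:
        index[parent_key(path)].append(depth_path(path))
    return index


paths = ["/", "/eins", "/zwei", "/eins/a", "/eins/b",
         "/eins/b/x", "/eins/c", "/zwei/abc", "/zwei/etc"]
index = build_parent_index(paths)
eins_children = index["1:fs:/eins"]
```

Listing a folder then becomes an exact-match lookup on the parent key, with no string prefix matching involved.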
  • 637288
    637288 Member Posts: 488
    If I understood you correctly, your approach is suitable only for retrieving files of a particular folder. But how will you retrieve the information about subfolders of a particular folder? What should I do if I want to list folders in 'eins'?

    Thanks,
    Vyacheslav
  • 597600
    597600 Member Posts: 250
    "What should I do if I want to list folders in 'eins'?"

    Then you should do this:
    query { collection()/*[ dbxml:metadata("fs-parent") = '1:fs:/eins' ] }
    Proceed as follows:
    # Start with a fresh container.
    create vyacheslav.dbxml
    
    # Add some documents.
    put a <Folder/>
    put b <Folder/>
    put c <File/>
    put d <File/>
    put e <File/>
    
    # You can add metadata.
    help setmetadata
    
    # Store the path of each node.
    # This may or may not prove useful.
    setmetadata a "" fs-me string 0:fs:/
    setmetadata b "" fs-me string 1:fs:/eins
    setmetadata c "" fs-me string 1:fs:/zwei
    setmetadata d "" fs-me string 1:fs:/drei
    setmetadata e "" fs-me string 2:fs:/eins/huhu
    
    # Store the parent path for each node.
    # This is what you need for your query for folder contents.
    setmetadata a "" fs-parent string ROOT
    setmetadata b "" fs-parent string 0:fs:/
    setmetadata c "" fs-parent string 0:fs:/
    setmetadata d "" fs-parent string 0:fs:/
    setmetadata e "" fs-parent string 1:fs:/eins
    
    # Query to see your metadata.
    query {
            for $n in collection()/*
            return (
                    $n/dbxml:metadata("fs-me"),
                    $n/dbxml:metadata("fs-parent"),
                    "---"
            )
    }
    print
    
    # Create indexes on your "fs-parent" metadata to boost query speed.
    addindex "" fs-parent node-metadata-equality-string
    
    # Create the other one if you need it. (I don't know if you do.)
    addindex "" fs-me node-metadata-substring-string
    
    # Take a look at your indexes.
    listindexes
    
    # Fire a query.
    query { collection()/*[ dbxml:metadata("fs-parent") = '0:fs:/' ] }
    print
    
    # Another one.
    query { collection()/*[ dbxml:metadata("fs-parent") = '1:fs:/eins' ] }
    print
    
    # And another one.
    query { collection()/*[ dbxml:metadata("fs-parent") = '1:fs:/zwei' ] }
    print
    
    # Check your index is being used.
    queryplan { collection()/*[ dbxml:metadata("fs-parent") = '0:fs:/' ] }

    Of course, you'll want to do this programmatically. In particular, you'll want code that ensures the integrity of the fs tree you're encoding in the metadata. No dangling leaves etc.

    Michael Ludwig
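
One way to approach the integrity check mentioned at the end of this post is sketched below in plain Python. It does not call DB XML: the `docs` dictionary is a hypothetical snapshot of the `fs-me` and `fs-parent` values set in the shell session above, and `dangling` is an illustrative helper name. The check reports every document whose `fs-parent` value matches no document's `fs-me` value.

```python
# Hypothetical snapshot of the metadata from the session above:
# document name -> (fs-me, fs-parent)
docs = {
    "a": ("0:fs:/", "ROOT"),
    "b": ("1:fs:/eins", "0:fs:/"),
    "c": ("1:fs:/zwei", "0:fs:/"),
    "d": ("1:fs:/drei", "0:fs:/"),
    "e": ("2:fs:/eins/huhu", "1:fs:/eins"),
}


def dangling(documents, root_marker="ROOT"):
    """Return names of documents whose parent path matches no stored node."""
    known = {me for me, _parent in documents.values()}
    return sorted(
        name
        for name, (_me, parent) in documents.items()
        if parent != root_marker and parent not in known
    )


# Simulate deleting folder 'b' (/eins) without cleaning up its contents.
without_b = {name: meta for name, meta in docs.items() if name != "b"}
orphans = dangling(without_b)
```

Run against the intact snapshot, the check finds nothing; after removing `b`, document `e` is reported because its parent `1:fs:/eins` no longer exists. A real application would need to run such a check (or a recursive delete) in the same transaction as the change.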