First, I have to assume you are using node storage to get more granular updates. Without knowing the entire layout for your documents and access patterns I can only tell you what BDB XML attempts to do. In case it's not obvious, nodes (elements) that are not modified by an update are never part of the replication stream.
With updates BDB XML tries to avoid unnecessary modification and even document traversal. Here are some examples.
1. If you modify a leaf element in place you will change that node as well as any indices it may participate in. This is important to understand because it means you should avoid value indexes on elements that are not leaf elements.
2. If you modify a non-leaf element, you again change that node, any child elements that were also modified, and any affected indices.
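The advice about value indexes on non-leaf elements can be illustrated with a small sketch. It is a toy model, not BDB XML code: it assumes the indexed value of an element is its string value (the concatenated text of all its descendants), and the names `value_index_entries` and `index_churn` are hypothetical. Under that assumption, changing one leaf also changes the index entry of every indexed ancestor.

```python
import xml.etree.ElementTree as ET


def value_index_entries(xml_text, indexed_tags):
    """Return the set of (tag, value) entries for a toy value index.

    Assumption: an element's indexed value is the concatenation of all
    text beneath it, so a non-leaf element's value depends on its leaves.
    """
    root = ET.fromstring(xml_text)
    entries = set()
    for elem in root.iter():
        if elem.tag in indexed_tags:
            entries.add((elem.tag, "".join(elem.itertext())))
    return entries


def index_churn(before_xml, after_xml, indexed_tags):
    """Index entries that must be deleted or added because of an update."""
    before = value_index_entries(before_xml, indexed_tags)
    after = value_index_entries(after_xml, indexed_tags)
    return before ^ after


BEFORE = "<order><customer><name>Ann</name><city>Oslo</city></customer></order>"
AFTER = BEFORE.replace("Oslo", "Rome")

# Indexing only the leaf: one entry removed, one added.
leaf_only = index_churn(BEFORE, AFTER, {"city"})

# Also indexing the non-leaf parent: its entry changes as well.
with_parent = index_churn(BEFORE, AFTER, {"city", "customer"})
```

In this model `leaf_only` holds two entries and `with_parent` holds four, which is the extra index traffic the advice is meant to avoid.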
In your message you use the term "top-level element." I'm guessing that's just the result of an initial query and that you then use it as context for an update operation. As long as that top-level element is not itself modified then it won't be part of the replication stream.
Another hint: it is not usually a good idea to model a single document with a flat, wide structure of repeated elements if you intend to add and delete those elements. You'd be better off modeling such data as documents that come and go rather than as individual elements. If you only modify content in place it's not a big deal; it is frequent element insertion and deletion that can cause problems over time. Again, since I don't know your layout, all I can do is give general advice.
I'm using Berkeley DB XML 2.5.16 with Berkeley DB 4.8.30.
Tom,
We recently added a new replicated persistent document type to our application and I'd like to understand if there is any performance advantage associated with the way we model the elements and sub-elements contained in a replicated persistent document.
Our application currently handles sub-element modifications by updating the top-level element associated with the sub-element. So my question is: if a top-level element is updated, how granular are the changes that get replicated to the replica sites in the replication group as a result of the update? In other words, if I retrieve a top-level element from the database, modify one of its sub-elements, and then execute an update on the top-level element, does the entire top-level element get replicated to the replica sites or just the change(s) to the modified sub-element? If the entire top-level element gets sent, then it seems it would be more efficient for us to model the sub-elements as top-level elements that contain a foreign key to their parent top-level element rather than as actual sub-elements. Would this improve replication performance, given that our application treats a modification to a sub-element as an update to its top-level parent?
Replication works by having the master send all of its transaction logs to each client, and when a client gets a complete set of logs for a committed transaction, it applies the operations recorded in the logs. Basically, every committed update operation you perform on the master gets performed on each client. So every record that gets updated in your master container is going to be updated in the client containers. The real question is whether your operation updates the top-level element along with the sub-element, or just the sub-element.
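As a rough illustration of the log-shipping behavior described above, here is a minimal toy model. It is not the Berkeley DB replication API: the `Master` and `Replica` classes, the record tuples, and the `node:7` key are all hypothetical. It only shows the shape of the idea, where a replica buffers log records and applies them once it has seen the commit for that transaction.

```python
from collections import defaultdict


class Replica:
    """Toy client: buffers log records per transaction, applies on commit."""

    def __init__(self):
        self.store = {}
        self.pending = defaultdict(list)

    def receive(self, record):
        kind, txn_id = record[0], record[1]
        if kind == "put":
            _, _, key, value = record
            self.pending[txn_id].append((key, value))
        elif kind == "commit":
            # Only now do the buffered writes become visible on the replica.
            for key, value in self.pending.pop(txn_id, []):
                self.store[key] = value


class Master:
    """Toy master: logs each record it writes and ships it to every replica."""

    def __init__(self, replicas):
        self.store = {}
        self.log = []
        self.replicas = replicas

    def _ship(self, record):
        self.log.append(record)
        for replica in self.replicas:
            replica.receive(record)

    def commit_update(self, txn_id, changes):
        for key, value in changes.items():
            self.store[key] = value
            self._ship(("put", txn_id, key, value))
        self._ship(("commit", txn_id))


replica = Replica()
master = Master([replica])
# Only the record that actually changed appears in the shipped log.
master.commit_update(1, {"node:7": "new text"})
```

The point of the sketch is that the replication cost follows the number and size of records written on the master, not the scope of the query that found them.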
The answer depends on whether your container uses the node storage format or the whole document format.
If you are using whole document format then the entire document is a single record, so naturally updates are granular to the document.
If you are using node storage format, then each element node is its own record, so it really depends on how your update query is executed by DBXML. There are several ways to figure that out. You can use XmlDebugListener to track how your update query is executed and, ultimately, which elements get updated. You can also use the db_printlog utility to read the logs and see what updates are performed, or use the Berkeley DB environment method DB_ENV->set_verbose to turn on verbose output for the reads and writes that are performed, or for the messages sent from the replication master to the clients.
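The difference in granularity between the two storage formats can be sketched with a toy model. This is not how BDB XML lays out its records internally; it simply assumes one record per document in the whole document case and one record per element (holding only that element's tag, attributes, and own text) in the node storage case. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<root>"
    "<item id='1'><name>a</name><price>1</price></item>"
    "<item id='2'><name>b</name><price>2</price></item>"
    "</root>"
)


def whole_document_records(xml_text):
    """Toy whole document storage: the entire document is one record."""
    return {"doc": xml_text}


def node_records(xml_text):
    """Toy node storage: one record per element, keyed by child-index path."""
    records = {}

    def walk(elem, key):
        attrs = tuple(sorted(elem.attrib.items()))
        records[key] = (elem.tag, attrs, elem.text or "")
        for i, child in enumerate(elem):
            walk(child, key + (i,))

    walk(ET.fromstring(xml_text), ())
    return records


def changed_keys(before, after):
    """Keys of records that differ between two snapshots."""
    return sorted(k for k in before.keys() | after.keys()
                  if before.get(k) != after.get(k))


UPDATED = DOC.replace("<price>2</price>", "<price>3</price>")

whole_changed = changed_keys(whole_document_records(DOC),
                             whole_document_records(UPDATED))
node_changed = changed_keys(node_records(DOC), node_records(UPDATED))
```

Here `whole_changed` is `['doc']` (the whole document is rewritten), while `node_changed` is `[(1, 1)]`, the single `price` element under the second `item`. For a real container, the tools mentioned above remain the way to confirm what is actually written.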
In the end, there is no universal answer for what would give the best performance. The best thing you can do is test your own data formats and expected queries, and see what is best for your application.
Thanks for your responses. When I use the term "top-level element" I'm referring to any element that is declared immediately below the root element in the schema for a given document. We currently use node storage for all of our documents. My only concern is the way our application handles a change to a sub-element of a top-level element and the associated cost of replication in Berkeley DB XML. Our application always executes an update query on the top-level element when one of its sub-elements is changed, rather than executing an update on just the changed sub-element. I'm concerned about the cost of replication given this behavior.
Only modified BDB records are replicated. You say you perform the update query on the top-level element, but if that element itself is not modified as part of the query then it won't be replicated. Only nodes that are actually changed are committed and therefore replicated. That said, certain updates will modify that top-level element, such as the addition or removal of a direct child sub-element. A pure content change, such as modifying the text content of an element or adding, removing, or changing an attribute, will only affect that element. It will generally be the only node replicated, apart from any affected index entries, and index entries are small.
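The distinction above between pure content changes and structural changes can be mimicked with a small dirty-tracking sketch. It is a toy, not BDB XML's implementation: it assumes each node record also tracks its direct children, so adding a child touches the parent record. The `NodeStore` class and the node ids are hypothetical.

```python
class NodeStore:
    """Toy node store that remembers which records an update touched."""

    def __init__(self):
        self.records = {}
        self.dirty = set()

    def add_child(self, node_id, tag, parent=None, text=""):
        self.records[node_id] = {"tag": tag, "text": text,
                                 "attrs": {}, "children": []}
        self.dirty.add(node_id)
        if parent is not None:
            # Structural change: the parent's record changes too.
            self.records[parent]["children"].append(node_id)
            self.dirty.add(parent)

    def set_text(self, node_id, text):
        # Content change: only this node's record is touched, and only
        # if the value really differs.
        if self.records[node_id]["text"] != text:
            self.records[node_id]["text"] = text
            self.dirty.add(node_id)

    def set_attr(self, node_id, name, value):
        if self.records[node_id]["attrs"].get(name) != value:
            self.records[node_id]["attrs"][name] = value
            self.dirty.add(node_id)

    def commit(self):
        """Return the records that would be logged, then reset tracking."""
        changed = sorted(self.dirty)
        self.dirty.clear()
        return changed


store = NodeStore()
store.add_child("order", "order")
store.add_child("status", "status", parent="order", text="open")
store.commit()

store.set_text("status", "closed")
content_change = store.commit()      # only the edited node

store.add_child("note", "note", parent="order")
structural_change = store.commit()   # the new node and its parent
```

In this model `content_change` is `['status']` and `structural_change` is `['note', 'order']`, matching the description that the top-level element is only replicated when it is itself modified.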