You're right that the use of removePerDbMetadata in JE 5 is an improvement over JE 4. Log files are deleted at the end of a checkpoint. Often, especially when there is a large cleaning backlog, there are multiple files to be deleted in one batch. There is one MapLN in the Btree for each database. In JE 4, the MapLN for each database is flushed for each log file deleted, so the number of MapLNs written at the end of the checkpoint is the number of active databases multiplied by the number of log files deleted. In JE 5 the MapLN for each active database is only flushed once at the end of the checkpoint, for the entire batch of files deleted.
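To make the scaling difference concrete, here is a small back-of-the-envelope sketch (the class and method names are mine, not JE's) of the MapLN write counts just described:

```java
public class MapLnWriteCounts {
    // JE 4: one MapLN write per active database per deleted log file.
    static int je4MapLnWrites(int activeDbs, int filesDeleted) {
        return activeDbs * filesDeleted;
    }

    // JE 5: one MapLN write per active database for the whole batch,
    // regardless of how many files are deleted.
    static int je5MapLnWrites(int activeDbs, int filesDeleted) {
        return activeDbs;
    }

    public static void main(String[] args) {
        // e.g. 100 active DBs and 50 files deleted in one checkpoint batch:
        System.out.println(je4MapLnWrites(100, 50)); // 5000 MapLN writes in JE 4
        System.out.println(je5MapLnWrites(100, 50)); // 100 MapLN writes in JE 5
    }
}
```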
MapLNs contain information about the utilization of the log records for a database. Whenever the DB utilization changes (whenever you write to the DB), the MapLN is updated, which marks it dirty. So for an active database, its MapLN is usually dirty. Dirty MapLNs are flushed (written to the log) at specific times. In all, MapLNs are written at the following times.
1) When a DB is created, truncated or removed.
2) When a DB is evicted, which can only occur if the DB is closed.
3) At each checkpoint, for all DBs.
4) During Database.sync (deferred-write DBs only).
5) When log files are deleted at the end of a checkpoint, as described above.
6) When splitting the root of the Btree, i.e., adding another level.
7) When opening a database and modifying its configuration (key prefixing, max nodes, etc).
8) When evicting a BIN in a temporary DB, if the BIN contains deleted records. (JE 4 only.)
9) During one-time conversion of a duplicates DB to the new JE 5 format. (JE 5 only.)
10) When removing one or more branches of a Btree because one or more BINs are empty (all records in the BIN are deleted). Does not apply to deferred-write DBs. Is done asynchronously by the INCompressor background thread. I'm afraid each MapLN is written twice during this procedure.
I leave it to you to correlate the numbers of MapLNs in your log with the above factors in your app.
Thank you for the detailed information.
It turns out that in this case it is the INCompressor, and it is the MapLN for the file summary DB (_jeUtilization) that is being written over and over. That is perhaps unsurprising, since its records are the only thing being deleted in significant numbers here (working through a large cleaner backlog).
That's good to know, thanks for doing the work to figure it out! It may well be that we can make improvements in this area at a future time, and understanding this will help a lot.
One thing to consider, for large data sets, is that larger log files will make a difference, since a smaller number of log files results in fewer file deletions, less metadata, and so on. With JE 5 in particular we recommend using a log file size of 1 GB, although the default is still the original 10 MB value, for backward compatibility. There are a number of performance benefits to using larger files.
Also, a larger checkpoint interval is recommended. The original default that is still in place, 20 MB, was at one time necessary to ensure that recovery time after a crash was reasonable. With JE 5, a checkpoint interval of 200 MB results in reasonable recovery times in our tests, although each app may be different in this respect and crash-recovery testing is important. A larger checkpoint interval reduces the overall amount of writing and the amount of metadata, which in turn reduces the amount of log cleaning, and overall performance is improved.
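For reference, a minimal sketch of what these two settings might look like in a je.properties file placed in the environment home directory (the values are the ones suggested above; the same parameters can also be set programmatically via EnvironmentConfig.setConfigParam):

```properties
# Log file size: 1 GB instead of the 10 MB default (suggested for JE 5).
je.log.fileMax=1073741824

# Checkpoint interval: checkpoint after 200 MB of log written,
# instead of the 20 MB default.
je.checkpointer.bytesInterval=200000000
```

As noted above, crash-recovery time should be measured with the larger checkpoint interval before adopting it in production.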
You are welcome. As an experiment I also tried setting this database to be deferred-write, and as expected the repeated logging of the MapLNs went away. (I know a real fix is more complicated than just that.) BTW, a while back in another thread I mentioned seeing the INCompressor lagging at the end of a checkpoint after a lot of cleaning -- I'll bet that was caused by this.
I'd been meaning to ask about log file size recommendations. I'm stuck on 4.1.21 at the moment (Voldemort), and I do have a large data set. Is trying a log file size of 1GB a good idea even with this version? (I'm currently using 60MB but I've been suspecting that larger would help.)
I'll also experiment with increasing the checkpoint interval, at least somewhat.
Thanks much for the advice.
Yes, what you said about the INCompressor makes sense.
On file size, I suspect that 1 GB is beneficial for JE 4.1 as well, although we have not measured performance in that configuration. Also, be sure to use such a large file size only if your data set is larger than just a few GB -- you'll want to have 20 or more log files (to pick an arbitrary number) so the cleaner can select the least utilized files from among them.
On checkpoint interval, I believe you'll need a smaller interval on JE 4.1 than on JE 5 to get the same recovery time after a crash. There were radical improvements in JE 5 that reduced the overall amount of writing and improved recovery speed after a crash.
Note that Diego (https://forums.oracle.com/forums/profile.jspa?userID=918648) and Vinoth (https://forums.oracle.com/forums/profile.jspa?userID=918153) are working toward using Voldemort on JE 5.
What about je.log.fileCacheSize (the number of open file handles)? Do you know if increasing it from the default of 100 would be beneficial when there are a lot of log files? (I would guess it could help, but on the other hand I've never caught a stack trace in the process of opening a file.)
Yes, it is expensive to open a file and it can cause blocking. In our performance tests with large data sets we set the file cache size to 2000. You may also need to increase the max open files allowed in your OS.
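A sketch of how that setting might look in je.properties, using the value mentioned above (remember that the OS-level open-file limit must be raised to match, e.g. via ulimit -n or /etc/security/limits.conf on Linux):

```properties
# Number of log file handles JE keeps open (default is 100).
je.log.fileCacheSize=2000
```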
"10) When removing one or more branches of a Btree because one or more BINs are empty (all records in the BIN are deleted). Does not apply to deferred-write DBs. Is done asynchronously by the INCompressor background thread. I'm afraid each MapLN is written twice during this procedure."

I need to correct myself. When the INCompressor thread wakes up and finds one or more empty BINs, it performs a "reverse split" to delete each empty BIN. Each reverse split logs the BIN's ancestor INs and the database's MapLN (once). The MapLN is then logged one more time at the end of the batch of reverse splits performed by the INCompressor thread. So if there is only one empty BIN in a database, its MapLN will be logged twice. But when there are many (N) empty BINs in a database, its MapLN will be logged (N + 1) times, not (N * 2) times.
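The corrected count can be restated as a tiny sketch (the class and method names are mine, purely for illustration):

```java
public class ReverseSplitCount {
    // MapLN writes for a database with n empty BINs compressed in one
    // INCompressor batch: one write per reverse split, plus one more
    // at the end of the batch.
    static int mapLnWrites(int n) {
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(mapLnWrites(1));  // 2, matching the "twice" case
        System.out.println(mapLnWrites(10)); // 11, i.e. N + 1, not N * 2
    }
}
```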
I've opened an internal ticket [#21654] for making improvements in this area in the future.