I have recently upgraded to version 5.0.34 and was testing our application when I saw the following error:
com.sleepycat.je.EnvironmentFailureException: (JE 5.0.34) com.sleepycat.je.utilint.RelatchRequiredException UNEXPECTED_EXCEPTION: Unexpected internal Exception, may have side effects.
As a rough overview, the application has many threads (transactionally) inserting and removing into three separate tables. Commits are non synchronous (i.e. Transaction.commitNoSync() is being used). A separate thread regularly calls Environment.flushLog(true).
I've only seen this once so far, when the application was under very heavy load...
This shouldn't happen, of course. It is not an error that will cause data corruption, so you should be able to safely continue.
Today is a holiday for us, but I will look into this later this week. It would help a lot if you could save any additional stack traces, if they occur.
thanks... unfortunately I've not been able to reproduce it as yet (I added in a bit more code around the affected area to log out more contextual information if it happens again)... I will keep trying.
OK... I've been doing some more testing and have some more information.
Firstly I have seen this a few more times though by no means is this particular error easily reproducable.
I have determined that it occurs on both jdk 1.7.0 and 1.6.0_26 (both linux, 64bit) as I wanted to rule out jdk 1.7 as a factor.
Since I had never seen this error when using v4.x of BDB I looked at the code change I hade made simultaneous with the upgrade, namely that previously in order to synchronously flush the asyncronoush transaction commits to disk I was calling Environment.sync() whereas with v5.0.34 I was calling Environment.flushFlog(true) (since this had demonstrated better performance and covered exactly the use case that I desired).
After switching this code back to sync() I could not reproduce the issue. Moreover neither did I see any of the other perplexing errors that I had been investigating (see below).
The part in my code where the error was occuring is the only place where cursors are actively used. All other transactional activity is either single record inserts, or single record deletes drived by a unique key. In the area of the code where I saw this exception being thrown, looks somewhat like this:
DatabaseEntry value = new DatabaseEntry();
value.setPartial(0, 0, true);
cursor = database.openCursor(tx, null);
status = cursor.getSearchKeyRange(key, value, LockMode.RMW);
while (status == OperationStatus.SUCCESS)
keyObject = binding.entryToObject(key);
if(keyObject.getId() != id)
status = cursor.delete();
status = cursor.getNext(key, value, LockMode.RMW);
Where the idea is that the actual primary key for the table is a compound of (id,number), and we start the cursor from the key (id,0) and delete all entires until we find an entry with a different id.
Now as it happens, for the test I am running I know that there will only ever be at most one entry of the form (id, 0).
And if I change the code to remove the cursor, and instead just do a straight delete operation on the database, I again see no errors.
The other thing to note is that in addition to the Unexpected Internal Exception described above, I was seeing far more frequent LockTimeout exceptions... moreover backing off and retrying the transactions didn't seem to work - the locks seemed to be permanently locked. These timeouts also only occurred in the above part of my code, and not in any of the other transactions it was performing.
After changing to use Environment.sync() (rather than flushLog), or after removing the cursor (but keeping flishLog) these lock timeout exceptions also went away.
Finally I was seeing an even stranger (to me) error... Occaisionally my tests seemed to be completing fine, but when I came to shut down my process, closing the environment was reporting that some transactions were still uncommitted. Again - the only instances were from the transactions involving the cursor above. Now as above the only way that you can get through that code without committing is (AFAICT) by some method throwing an exception... and I was seeing no such exceptions reported.
Once again, either changing from flushLog(true) to sync(), or replacing the cursor with a straight delete stopped these issues from occurring.
So, in summary I was seeing a number of weird behaviours that only seemed to occur when I was using a cursor to delete records from my database, and when I was using flushLog() to flush the log records to disk.
Hope this is helpful... If there's anything more I can do to help you debug this issue, please let me know,
Thanks for the clear explanation. I have been working on a fix that I'm hoping you can try. My understanding of the problem is consistent with what you describe, except perhaps for the unreleased locks.
Please send me email -- mark.hayes at o.com (oracle) -- so that I can give you a fix to try out.
For the record, this bug has been fixed and will be in the next JE 5.0 patch release. The internal support ticket is [#21121]. Note that the bug does not cause data loss and does not invalidate the Environment instance.