This content has been marked as final. Show 17 replies
This is not an error and TT does not lock system tables when comitting. Please see the following description from docNote 402403.1
When transactions are being DurableCommited against a data store, it
is sometimes possible to see processes awaiting the "GrpComm" (Group
Commit) lock. This effect can be seen either by running ttXactAdmin or
as warnings in the message log. These warnings will typicaly have the
format "Unable to acquire group commit lock (error 6003)" in 6.0 and
5.1. In version 5.0 a number of retries for this lock may also be
The Group Commit lock is a peformance
optimisation. It allows one transaction to durably sync all
transactions that have already written their records to the kernel I/O
buffer but not durably committed them to disk.
waiting on the Group Commit lock does not indicate a problem with this
lock. When durable commits are occurring, most transactions spend the
majority of their time waiting to be committed durably, so it is to be
expected that there will be many waiters on this lock. In fact, in an
application where transactions are short, you will frequently find
nearly every connection waiting for the Group Commit lock. The idea is
that all of the waiters on the lock will be flushed together with a
single disk sync. This situation should not be seen as “contention” as
there is no such thing as “contention” on the Group Commit lock. It is
a performance optimization, rather than a protector of a shared
It is true that transactions can be seen waiting
for locks that are held by other transactions that in turn either hold,
or are waiting for, the Group Commit lock. This also does not
(necessarily) indicate a problem with the Group Commit lock. A
transaction cannot release its regular (db, tbl, row, etc.) locks until
it has committed, so you will frequently see this sort of lock wait
pattern in applications that use durable commits.
the group commit warning is frequently harmless. Sometimes though
failure to acquire the group-commit lock can indicate a slow disk (lock
waiters time out while the previous durable commit being synced to
disk) or high cpu contention (the lock holder can’t get enough cycles
to release the lock before the waiters time out).
lock is not needed for correctness, the transaction will not retry
after reporting a failure to get the lock. Having waited 10 seconds for
a group sync, instead it will then issue a 'sync' for itself.
MONITOR table records the number of durable commits (XACT_D_COMMITS),
and also the number of syncs (LOG_FORCES). When group commit is in
effect, you will see that XACT_D_COMMITS is much larger than
LOG_FORCES as many transactions are being durably committed in one
write to disk.
Note: in 7.0 the message was changed to this:
Warn: : 4228: 5908/00D6D098: sbLogBufGCLockENQNoErr: The commit is
performed without the benefit of group commit optimization (error 6003)
last day I saw this document, I have an doubt yet. our dsn configurations set the DurableCommit = 0, this means the log will hold on log buffer, not flush to disk immediate. And In our application, we never call the method "call ttDurableCommit" . So I don't know when will happen group commit? In checkpoint or log buffer full??
And alse from monitor result, we can see XACT_D_COMMITS is smaller than LOG_FORCES. So I guess the group commit failed is because can not get the lock of sys.tables because that time application hold the lock of that table, May I right??
TIME_OF_1ST_CONNECT: Wed Nov 19 03:38:45 2008
I think I understand the lock request timeout is means get group commit lock failed.
and still we have no durablecommit, which situation can cause group commit? when the group commit happened, application need to wait the precommit on disk??
whether it effect the performance of application? because sometimes we found the operation is slowly, most of time is normal.
btw: when the operation was slowly, we found some pagein on system, so does we need the set the parameter v_pinshm =1 to make sure all memory hold by timesten not swap . thank you...
Please keep in mind that Replication will perform a durable commit every 200ms. This is a hardcoded setting and cannot be changed or turned off. You state that the group commit optimization warnings coincide with a checkpoint. This sounds to me like the checkpoint is using I/O resources that otherwise the group commit optimization would benefit from using.
A few comments:
- Are your checkpoint and log files on different disks?
- LOG_FS_READS is 12551. Do you see this regularly increasing? It would be optimum if all log reads by replication and xla readers were done from the log buffer rather than the filesystem.
- you could use ttCkptHistory to check the amount of bytes written out at each checkpoint, and determine if those which coincided with the group commit warning, were also writing out more bytes than usual.
Okay, it looks very much like these warnings are coming from the regular durable commits done by the repagent and are likely caused by heavy I/O contention ocurring when a checkpoint runs. Although the warnings are just warnings and are not in themselves a problem it sounds like your system is not optimally configured and that could lead to other probelms in the future. As previously mentioned, it is stringly recommended that the TimesTen datastore checkpoint and transaction log files be placed on seperate disks from each other and also sepaarte to other high activity files. Bo9th checkpointing and loggign are I/O intensive operations and, unless your disks are on a high-end RAID-0 disk array then a checkpoint is likely to overload the disk and impact performance.
Just to correct a couple of other points as well:
1. Commit does not need to get any locks on system tables etc. the group commit lock is a special lock only used by durable commits.
2. The log buffer is not flushed only when it gets full. It is continually flushed by the flusher thread in the manging sub-daemon and in general should never become 'full'.
The sub-daemon flusher thread flushes log records continually. As records are posted into the log buffer by application transactions the flusher thread writes them to disk. if at any time there are no records to flush, the flusher goes to sleep on a semaphore. As soon as some records are placed in the buffer, the flusher is woken up and starts flushing again. So, flushing is continuous, not time interval based.
What kind of disk hardware are you using (rotational speed, interface type etc.)? Are both disks connected to the same controller? Are there any other files on the log and checkpoint disks or are they dedicated devices? What O/S platform are you running on?
the system is SAM, DS8000, IBM p595
they are many controllers, and many logic disks .
they are dedicated devices
AIX Version 5.3
PV COPIES IN BAND DISTRIBUTION
vpath13 286:000:000 0% 161:000:000:125:000
vpath42 286:000:000 0% 161:000:000:125:000
vpath20 286:000:000 0% 161:000:000:125:000
vpath49 286:000:000 0% 161:000:000:125:000
vpath26 286:000:000 0% 161:000:000:125:000
vpath55 286:000:000 0% 161:000:000:125:000
vpath12 286:000:000 0% 161:000:000:125:000
vpath41 286:000:000 0% 161:000:000:125:000
vpath27 286:000:000 0% 161:000:000:125:000
vpath56 286:000:000 0% 161:000:000:125:000
vpath14 286:000:000 0% 161:000:000:125:000
vpath43 286:000:000 0% 161:000:000:125:000
vpath28 286:000:000 0% 161:000:000:125:000
vpath57 286:000:000 0% 161:000:000:125:000
PV COPIES IN BAND DISTRIBUTION
vpath27 172:000:000 0% 000:000:000:067:105
vpath56 172:000:000 0% 000:000:000:067:105
vpath20 172:000:000 0% 000:000:000:067:105
vpath49 172:000:000 0% 000:000:000:067:105
vpath26 172:000:000 0% 000:000:000:067:105
vpath55 172:000:000 0% 000:000:000:067:105
vpath13 172:000:000 0% 000:000:000:067:105
vpath42 172:000:000 0% 000:000:000:067:105
vpath12 172:000:000 0% 000:000:000:067:105
vpath41 172:000:000 0% 000:000:000:067:105
vpath14 172:000:000 0% 000:000:000:067:105
vpath43 172:000:000 0% 000:000:000:067:105
vpath28 172:000:000 0% 000:000:000:067:105
vpath57 172:000:000 0% 000:000:000:067:105
Edited by: user578558 on 2009-1-15 上午1:12
Edited by: user578558 on 2009-1-15 上午1:17
Thanks for the information, but it does not tell me a great deal in terms of how disks are arranged into RAID groups and how these map to filesystems on this, and maybe other, machines. If these warnings always coinciode with checkpoints then it seems like some I/O or CPU resource issues is the underlying cause (checkpoint operations can be both I/O and CPU intensive). To get to the bottom of this would likely require detailed investigation in conjunction with an AIX expert. If the only symptoms you are seeing is these warnings but otherwise everything is working fine then I would not worry about them and just ignore them. If you really feel you need to eliminate them then you should open an SR with Oracle support.
Hi Chris, All,
One of our application processes got stuck when accessing timesten, the function back trace is as below:
fe5cc8ac nanosleep (ffb87390, ffb87388)
fc094f24 sbSleep (2a, 0, 0, 0, 0, 2) + 70
fc1c6218 sbSpinLatchGet (e4403390, 28c5320, 0, 7, fc92373c, 70) + 1a8
fc1e235c sbLogBufInsertFast (ffb8781c, 28c63b8, 0, 1, 0, 0) + 554
fc204a98 sbLogRecInsert (e440ce98, 8d4598, 0, 1, 1c, e440cea0) + 4b8
fc203e5c sbLogWrite (8, 8d4598, 0, 0, 28c5320, e4c4e928) + 12c
fc278dec sbTblStorLTupDelete (e5eaae50, 1, 0, 90bb78, e440ce0c, e48c) + 5fc
fc288edc sbTblDelete (e5eaae50, 16, 0, 8d47c8, 0, 1) + 2f0
fc3a0298 sbEeDeleteEval (fa0c93a0, f9e4c150, fa0c9410, 28c5320, fc917a2c, 0) + 478
fc38b9f4 sbEeStmtEval (ecb0, f9e4c150, 28c5320, ef88, ef8c, ef90) + 1b4
fc7a7d0c sbSqlCmdExec (0, 14017550, 0, 0, 142947a0, 2cdd238) + 338
fc0f5020 sb_sqlExec (2cdbf60, 2cdd238, 0, ffb890c4, 0, 0) + 348
fc070140 cxlSqlExecute (2cdb5b8, 1, 12c4, 2cde1e8, fc917a2c, 0) + 54
fc03f3c8 stmtExecute (2cdb5b8, ffffdffb, 16, 0, 8000, c) + 1d0
fc035f20 SQLExecute (2cdb5b8, 4c, 0, fc035efc, 27438d8, 0) + 24
The process stuck there for 20 seconds and finally was killed by monitor process.
Before the process was killed, in ttmesglog there were 5 occurrence of the 6003 error.
03:39:28.87 Timesten error 6003
03:39:44.65 Timesten error 6003
03:39:59.53 Timesten error 6003
03:40:11.39 Timesten error 6003
03:40:13.20 Timesten error 6003
03:40:58.25 Timesten error 6003
03:41:23 the application process was detected that it has been stuck at nanosleep/sbSleep/sbSpinLatchGet() for 20s and was killed.
There are two processes accessing the DB, and the other one was killed later.
Could anyone help me out? what's the possible reasons and how to avoid it happening again?
Edited by: 983078 on Jan 21, 2013 2:00 AM
The thread in the process that was killed was blocked waiting to post a log record into the in-memory log buffer. It's hard to be definitive with just the pstack but the associated group commit timeout warnings in the message log suggest that you have a logging bottleneck. TimesTen is not able to flush the log buffer to disk fast enough to make space for new records. Since posting the log record is a 'must do' operation and generally should complete very quickly it is not valid to 'give up after a timeout'.
What are your DSN settings for this database? What kind of disk storage are the transaction logs located on and is that storage shared with anything else (such as TT checkpoint files)? Do you observe any non-zero/increasing values in SYS.MONITOR.LOG_BUFFER_WAITS? What are the details of the hardware (#CPUs, type pf CPUs) and workload?
Thanks for your quick response.
Our system is running on Netra-T2000. 7 application service instances share the hardware with the support of virtualization. Just before the process was killed, we had cron jobs to clean the application log files, which are located on the same disk of the DB log files. We were running stress testing when the issue was fired, so the system was quite busy at that time -- average cpu usage was over 90%. We run similar test in each weekend, and we clean the log files at each mid-night, but we never hit such issue (application process got stuck for a long time) before, though we occasionally had 6003 errors in ttmesglog.
Here is the DB settings (direct DB access):
And the ttstatus output:
TimesTen status report as of Tue Jan 22 03:13:13 2013
Daemon pid 7726 port 17000 instance xxxxxx
TimesTen server pid 7731 started on port 17002
No TimesTen webserver running
Data store /xxxxxx/CompanyName/db/ApplicationNameData
There are 20 connections to the data store
Data store is in shared mode
Shared Memory KEY 0x0400b436 ID 1476395024
Type PID Context Connection Name ConnID
Process 10436 0x01aa9140 APP1 12
Process 11375 0x00cd8398 APP2 13
Process 9706 0x0345bbe0 APP6 9
Process 9706 0x0ad6f738 APP6 10
Process 9706 0x0afeec98 APP6 11
Replication 8509 0x00164100 APP7 4
Replication 8509 0x00183270 APP8 6
Subdaemon 7727 0x000718b8 Worker 2042
Subdaemon 7727 0x000e25b0 Flusher 2043
Subdaemon 7727 0x000f1710 Checkpoint 2047
Replication policy : Always
Replication agent is running.
Cache Agent policy : Manual
End of report