My version is 188.8.131.52.5. One of my RAC instances crashes with the message "ORA-00240: control file enqueue held for more than 120 seconds" followed by "Received an instance abort message from instance 1".
Here are the contents of the alert log file:
IPC Send timeout detected. Receiver ospid 27423 [email@example.com (LMON)]
2013-03-22 22:30:05.644000 -07:00
Errors in file /u01/app/oracle/diag/rdbms/lfgoimdb/LFGoimdb2/trace/LFGoimdb2_lmon_27423.trc:
2013-03-22 22:31:08.734000 -07:00
Errors in file /u01/app/oracle/diag/rdbms/lfgoimdb/LFGoimdb2/trace/LFGoimdb2_arc2_27691.trc (incident=15905):
ORA-00240: control file enqueue held for more than 120 seconds
Incident details in: /u01/app/oracle/diag/rdbms/lfgoimdb/LFGoimdb2/incident/incdir_15905/LFGoimdb2_arc2_27691_i15905.trc
2013-03-22 22:31:13.409000 -07:00
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
LMS0 (ospid: 27427): terminating the instance due to error 481
System state dump requested by (instance=2, osid=27427 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/lfgoimdb/LFGoimdb2/trace/LFGoimdb2_diag_27413.trc
2013-03-22 22:31:18.376000 -07:00
Dumping diagnostic data in directory=[cdmp_20130322223113], requested by (instance=2, osid=27427 (LMS0)), summary=[abnormal instance termination].
ORA-1092 : opitsk aborting process
Instance terminated by LMS0, pid = 27427
Yes, I am using NFS storage. I got this problem after I put my database into archivelog mode. All the services are up and running, but one of the two nodes crashes after some time; the other node remains functional. The storage server is very slow. I am using VMware, and there are other applications connecting to this server.
The default lock time for the control file is 900 seconds, so why is it giving this error after 120 seconds and terminating the instance?
What should I do to overcome this issue? And since I am on the latest version, can I rule out a bug?
The alert file shows the same error on the other node. As Mr. Hemant said, this error occurs because the control file is locked for more than 120 seconds, which fits my case: I am using NFS and my network is very congested, so that is probably why I am hitting it. But my questions remain: why does RAC use 120 seconds instead of 900? And why does node 1 evict node 2 when this error occurs? Node 1 is my master node, and it performs fine.
I would also appreciate suggestions for avoiding this, since out of two instances I am effectively using only one; the other node gets evicted after some time.
You said this started after archive logging was turned on. What is the size of your redo logs, and how often are they switching?
As Hemant suggested, it looks like the control file is being held too long because of the work being done on it. Check that the redo logs are not undersized, which would cause excessive work for the archiver background process.
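To put numbers on that, a quick sketch of the checks (run from SQL*Plus as a privileged user; the 7-day window is just an illustrative choice) would be:

```sql
-- Redo log group sizes and current status per thread
SELECT group#, thread#, bytes/1024/1024 AS size_mb, status
FROM   v$log;

-- Log switches per hour over the last 7 days
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 7
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;
```

A sustained rate of more than a few switches per hour on small logs is the usual sign that the logs are undersized for the workload.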
My redo log size is the default 50 MB. There is currently no load on the system, since we are not using this environment for the time being. Log switches average about 8 per day. I think increasing the redo size would make the problem worse, since the archiver might then hold the lock for even longer.
Since there is no dedicated connection between the nodes and the storage, is upgrading the hardware and network configuration the only solution? Or am I still missing something?
As far as the configuration is concerned, I cannot add more resources to this environment. How can I solve this issue?
Thanks for the reply. I will leave it at the conclusion that I am facing this problem because I am not using a dedicated connection. Since a dedicated connection is not possible in this environment, I am putting the database into noarchivelog mode.
But the question remains: is the control file lock time reduced to 120 seconds in RAC, instead of 900 seconds? And which process evicts the node after this error?
Hemant, in one of your blog posts I read that this time is 900 seconds.
Can I increase this time in RAC?
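For what it's worth, the 120-second and 900-second figures are commonly attributed to two separate hidden (underscore) parameters: one for how long a holder may keep the control file enqueue before ORA-00240 is raised, and one for how long a waiter will wait on it. The parameter names below are an assumption based on common references and may differ by version; hidden parameters should only be inspected or changed under guidance from Oracle Support. A sketch of how to check them as SYSDBA:

```sql
-- Assumed hidden parameter names -- verify with Oracle Support before relying on them.
-- Do not change underscore parameters without Support's direction.
SELECT i.ksppinm  AS parameter,
       v.ksppstvl AS current_value,
       i.ksppdesc AS description
FROM   x$ksppi  i
JOIN   x$ksppcv v ON i.indx = v.indx
WHERE  i.ksppinm IN ('_controlfile_enqueue_timeout',
                     '_controlfile_enqueue_holding_time');
```

Even if the timeout can be raised, that would only mask the underlying slow NFS I/O; the eviction itself is carried out by the cluster's lock processes (LMON/LMS, as shown by "Instance terminated by LMS0" in your alert log) after instance 1 sends the abort message.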