If you change a parameter on ASM to try to solve this issue, I assure you that you are only obscuring the root cause. The issue will still be there and will show up again when your load increases.
You need to find the root cause of the timeout and fix it. That is the real solution.
I recommend you meet with your Storage Admin, System Admin and DBA to identify and solve your issue. If this meeting is not enough, contact Oracle and IBM support.
Hello and thank you for your responses.
An SR has been opened with Oracle. We even managed to get severity 1. Unfortunately, they keep sending us around in circles, trying only to avoid any responsibility rather than helping us find the root causes. We managed to get one Oracle engineer to come to our office for a meeting with us; it was a caricature, system vs Oracle vs NetApp, a big ping-pong party. So I cannot rely on them to find a fix within a reasonable time.
So as I said, we had a meeting with the Oracle, storage and system teams.
The reasons for such a long timeout are:
- we do controller failover/failback for test purposes
- switch time is only 1s, but!
- check delays (AIX, ASM, and so on...) increase it to around 60s (which is not normal, I agree, but we are working on 2 axes: one is to find the root cause, the other is to find fixes). The description in the thread I quoted is quite accurate.
As a reminder:
- We use NetApp Storage
- We use Oracle 22.214.171.124 in a 2-node RAC, protected by Data Guard to another 2-node RAC
What came out of this meeting is:
- our client used 16 paths per LUN (8 primary + 8 secondary). The NetApp team told them it was quite useless to have so many
- we performed some tests with 8. We still got the issue, but it was somewhat less destructive.
- we then performed some tests with 4 paths (2 primary + 2 secondary). This time, we no longer faced the issue
- we have a fix. But!
- we do not really understand why it occurs. The current theory is a problem on a fiber that makes some paths fail, so the time needed to get through to a working path is too long (because it may try a lot of failing paths first). We have planned some further tests to confirm or refute it.
- we do not understand why setting _asm_hbeatiowait to 200s does not work as a workaround (even though we do understand it is a workaround, not a fix). We thought it would be enough to cover the 60s of the whole switch process, but we notice our diskgroups are randomly force-dismounted after only a few seconds.
- we asked in the SR why, and how to force ASM to wait a little longer for the disks to be presented. Apart from disktimeout and _asm_hbeatiowait (neither of which has any effect), we do not have a clue what to modify (even if we understand it is just a workaround, nothing more).
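To make the path-count theory a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which I/O is retried on each failing path in turn, each attempt costing one full per-path timeout, before a healthy path is reached. The function name and the 7.5s per-path timeout are hypothetical, chosen only so that the 16-path case lands near the observed 60s; they are not measured values.

```python
def worst_case_failover_delay(failing_paths: int,
                              per_path_timeout_s: float,
                              switch_time_s: float = 1.0) -> float:
    """Simplified model: the controller switch itself takes switch_time_s,
    then every failing path is tried once and costs one full timeout."""
    if failing_paths < 0 or per_path_timeout_s < 0 or switch_time_s < 0:
        raise ValueError("inputs must be non-negative")
    return switch_time_s + failing_paths * per_path_timeout_s


# Hypothetical per-path timeout (NOT a measured value).
PER_PATH_TIMEOUT_S = 7.5

for total_paths in (16, 8, 4):
    # Assumption: half of the paths (the primary ones) fail during the test.
    failing = total_paths // 2
    delay = worst_case_failover_delay(failing, PER_PATH_TIMEOUT_S)
    print(f"{total_paths:2d} paths -> {failing} failing -> ~{delay:.0f}s worst case")
```

Under these assumptions the delay grows linearly with the number of failing paths, which would be consistent with 16 paths being the worst case, 8 paths being less destructive, and 4 paths being fine. It does not prove the theory; only the planned tests can do that.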
But if you have any idea...
Setting _asm_hbeatiowait to 200 is a standard MOS recommendation in this type of case.
In fact, you cannot do a lot on the ASM side here, since the problem most probably lies somewhere between your storage, SAN infrastructure and OS.
So I recommend, just in case, double-checking that you applied the _asm_hbeatiowait change properly on your ASM instances.
Then, double-check your Unix SCSI timeout value. It is important that it be lower than _asm_hbeatiowait.
Then, as mentioned, contact the IBM AIX team.
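The ordering of the timeouts described above can be written down as a small sanity check. This is only a sketch: the class and function names are hypothetical, the values have to be collected by hand from the OS and the ASM instance (on AIX the disk timeout is commonly the hdisk rw_timeout attribute, stated here as an assumption), and the 30s SCSI timeout in the example is illustrative, not taken from this system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimeoutSettings:
    scsi_timeout_s: int       # OS-level disk I/O timeout, collected by hand
    asm_hbeatiowait_s: int    # value of _asm_hbeatiowait on the ASM instance
    observed_failover_s: int  # measured duration of the whole switch process


def check_timeouts(settings: TimeoutSettings) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The OS should give up on an I/O before ASM gives up on the disk.
    if settings.scsi_timeout_s >= settings.asm_hbeatiowait_s:
        problems.append("SCSI timeout is not lower than _asm_hbeatiowait")
    # The whole failover should fit inside the ASM heartbeat wait.
    if settings.observed_failover_s >= settings.asm_hbeatiowait_s:
        problems.append("observed failover does not fit within _asm_hbeatiowait")
    return problems


# Example values: 200s and 60s come from the thread, 30s is hypothetical.
example = TimeoutSettings(scsi_timeout_s=30,
                          asm_hbeatiowait_s=200,
                          observed_failover_s=60)
for problem in check_timeouts(example) or ["timeouts look consistent"]:
    print(problem)
```

If this check passes and the diskgroups are still force-dismounted after a few seconds, that suggests the parameter change is not actually in effect on every ASM instance, or that something other than the heartbeat wait is triggering the dismount.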
-- Kirill Loifman, ww.dadbm.com