I have a 1/2 rack Exadata X2-2. We had a storage cell fail and it has now been repaired and powered back on. Since the cell was down for much longer than the disk_repair_time (3.6 hours - the default), I suspect that all of the cell disks have been dropped (as per the doc). Here are my two questions:
1.] I realize that all grid disks in a storage cell are assigned to the same failure group. What's a good way to verify that, indeed, all of the cell disks in this cell have been dropped?
I was going to guess:
SELECT dg.name, d.name, d.path, d.mode_status, d.mount_status, d.state, d.failgroup
FROM v$asm_disk d, v$asm_diskgroup dg
WHERE d.mode_status = 'OFFLINE' AND d.group_number = dg.group_number
ORDER BY d.failgroup;
I would expect to see a:
mount_status of "CLOSED - Disk is present in the storage system but is not being accessed by Automatic Storage Management"
state of "UNKNOWN - Automatic Storage Management disk state is not known (typically the disk is not mounted)"
for each disk in the failgroup.
2.] After I see which ones are offline, I assume I can just do:
alter diskgroup <disk group name> online disk <disk name>;
for each affected diskgroup and disk combination, and then ASM will begin to rebalance the diskgroups. That seems like a lot of ALTER statements.
Is there a quicker way, i.e. maybe a way to online the whole failgroup in a single command?
I realize that the re-synchronization will be a time-consuming process... but it's unavoidable.
PS. I read that I can monitor the rebalance operation by doing a: SELECT * FROM v$asm_operation;
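For what it's worth, a slightly more detailed version of that monitoring query (a sketch; the columns are from the standard V$ASM_OPERATION view) would be:

```sql
-- Sketch: watch rebalance progress; EST_MINUTES is a rough remaining-time estimate
set lines 120
select group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
from   v$asm_operation;
```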
If the disks have been offline longer than the disk_repair_time value, then they should be dropped and not show up in v$asm_disk. You can verify it by checking in the alert log.
This script will check the disk status:
set lines 150
set pages 1000
col Diskgroup for a10
col Disk for a40
col "Size (MB)" for 999,999,999
select dg.name "Diskgroup",
       d.path "Disk",
       d.mode_status "Mode Status",
       d.failgroup "Fail Group",
       d.total_mb "Size (MB)",
       d.sector_size "Sector Size"
from   v$asm_disk d, v$asm_diskgroup dg
where  d.group_number = dg.group_number
order by 1,2;
If they are still listed in v$asm_disk, then you can just online them one by one, or issue:
ALTER DISKGROUP <DISKGROUP> ONLINE DISKS IN FAILGROUP <CELL_NAME>;
If they've been actually dropped, you'll need to add them in with an ALTER DISKGROUP command:
ALTER DISKGROUP <DISKGROUP> ADD FAILGROUP <CELL_NAME> DISK '<CELL_IP>/<GRIDDISK_NAME>', '<CELL_IP>/<GRIDDISK_NAME>', '<CELL_IP>/<GRIDDISK_NAME>';
Repeat for each diskgroup.
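As a concrete illustration (the diskgroup, cell, and griddisk names below are purely hypothetical; on Exadata the disk path takes the 'o/<cell IP>/<griddisk name>' discovery-string form):

```sql
-- Illustrative names only; substitute your own diskgroup, cell IP, and griddisk names
ALTER DISKGROUP DATA ADD FAILGROUP EXACEL01
  DISK 'o/192.168.10.3/DATA_CD_00_exacel01',
       'o/192.168.10.3/DATA_CD_01_exacel01',
       'o/192.168.10.3/DATA_CD_02_exacel01';
```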
Thank you Andy, that's very helpful.
I didn't realize that if the disks had been dropped, they wouldn't show up in v$asm_disk. Are you certain about this?
I see in the V$ASM_DISK documentation that there is a DROPPED state (reference: http://docs.oracle.com/cd/E11882_01/server.112/e25513/dynviews_1024.htm#sthref3121 ): DROPPED - Disk has been fully expelled from the disk group.
I also found that another good way to check for the state (ONLINE / OFFLINE) is through cellcli with:
LIST GRIDDISK ATTRIBUTES name, ASMModeStatus
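To run that check across every cell at once, dcli can wrap the cellcli call (a sketch, assuming a cell_group file that lists the storage cell hostnames):

```shell
# Sketch: query griddisk status on all cells; assumes a cell_group host list file
dcli -g cell_group -l celladmin cellcli -e \
  "LIST GRIDDISK ATTRIBUTES name, asmModeStatus, asmDeactivationOutcome"
```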
Very unexpectedly, when I looked at the status today (using the query you provided) it showed that the disks still had a state of NORMAL and mode_status of ONLINE. I can't explain that, since it had been much longer than disk_repair_time. I'll have to do some more reading and try to figure out how this could happen.
Could it be that the grid disks get automatically discovered after onlining the cell, due to the asm_diskstring setting?
(And after discovery, ASM started to rebalance the data.)
Be aware that disk_repair_time is only a time limit: once a disk has been offline longer than that, ASM drops it and starts restoring redundancy.
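If it helps, the current timer can be checked per diskgroup, and raised ahead of planned maintenance (a sketch; the diskgroup name DATA is illustrative):

```sql
-- Check the current disk_repair_time for each diskgroup
select dg.name, a.value
from   v$asm_diskgroup dg, v$asm_attribute a
where  dg.group_number = a.group_number
and    a.name = 'disk_repair_time';

-- Raise it before planned cell maintenance (illustrative diskgroup name)
ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';
```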
I suppose it could be, but I haven't read anything that states that grid disks can get automatically re-added. I've opened an SR with Oracle and hopefully they can shed some light on the topic.
I expect what you'll find is that unless you manually dropped the disks from the diskgroup, the cell "remembers" the desired configuration and will issue auto management commands to ASM to add the disks back in again once things return to wellness. If you find differently, please post your findings.
Here's the explanation offered by Oracle Support:
"I believe that the disk where never dropped because they where never officially taking offline. From disk_repair_time to drop disk it has to see the disk if in offline status. If the disk remain in normal online state but are inaccessible because of some of other actions/problems then once they become accessible the will be available regardless of what disk_repair_time is set to."
"Also from ASM Fast Mirror Resync - Example To Simulate Transient Disk Failure And Restore Disk (Doc ID 443835.1) note that we do stop drop timer when in rolling upgrade mode
"If a disk goes offline when the ASM instance is in rolling upgrade mode, the disk remains offline until the rolling upgrade has ended and the timer for dropping the disk is stopped until the ASM cluster is out of rolling upgrade mode. See "ASM Rolling Upgrade"."
Edited by: Bdub on Feb 14, 2012 5:38 PM
You should check the alert log for the ASM instance from the time that the cell went offline to ensure that the disks were offlined correctly. Just curious...what versions of the Exadata software are you on? If a griddisk (or an entire cell) goes offline, then the corresponding ASM disks should be taken offline as well. If the disks were in a good state but taken offline, then they should be automatically added back into the diskgroup when the cell comes online, even if they have been dropped from the diskgroup. The alert log will tell you exactly what happened. I wrote about something similar to this a few months ago - http://blog.oracle-ninja.com/2011/09/exadata-storage-on-demand