Oracle Database 12c: ASM 12c External Redundancy Diskgroup Handles Corrupted Blocks Better

I had the pleasure of working for a company where I came across an ASM corruption where to this day no vendor was able to find the root cause or an explanation of the corruption.

Corruption Story:

Here is what happen about 1 year ago one fine evening we were doing normal disk add operation in ASM 11.2.0.3. Our diskgroup was running low on disk space so we setup our normal maintenance to add more disk. Note the diskgroup was a external redundancy diskgroup. We added 10 disk and as the rebalance was occurring this diskgroup dismounted because the check that ASM does during the rebalance operation found corruption on block 40. When we work with Oracle support it was noted that the first 250 blocks of the ASM disk contain metadata blocks. Based on Oracle's finding one thing they noted was the corruption could have occurred anytime since the last rebalance which had occurred 3 months before. I do not want to name any names but the OS, SAN vendor and Oracle began to review the logs but could not come up with any findings as to what caused the issue.

Long story short to resolve the issue we had to recreate the diskgroup and recover the database.

This blog is going to show what happens when a specific block is corrupted and how ASM handles that issue in 11.2 and 12.1.

Note: I am going to simulate the block corruption using dd command DO NOT use this command in production as you will cause problems.

ASM 11.2.0.3

Command to Simulate Block 40 Corruption

Note: /dev/sdg1 is ASM disk DISK1 in diskgroup DG01

I execute the dd command to simulate the corrupted block.

[root@alpddbs001 ~]# dd if=/dev/zero of=/dev/sdg1 bs=4k seek=40 count=4
4+0 records in
4+0 records out
16384 bytes (16 kB) copied, 2.5e-05 seconds, 655 MB/s

Note: When I execute the above command no issue is reported in the ASM alert log.

Check ASM Diskgroup Mounted

I then login to the ASM instance as sysasm and check the the ASM diskgroup is still mounted.

SQL> select name,state from v$asm_diskgroup;

NAME STATE

------------------------------ -----------

DG01 MOUNTED

FRA01 MOUNTED

Run ASM Check Diskgroup Command

I then issue an a check of the diskgroup the same type of check that would be issued during a rebalance operation.

SQL> alter diskgroup DG01 check;

ASM Alert Log Errors

The moment that I issue the check command in ASM the ASM alert log begins to report errors.

SQL> alter diskgroup dg01 check
NOTE: starting check of diskgroup DG01
Sun Jul 14 20:40:12 2013
GMON checking disk 0 for group 1 at 9 for pid 19, osid 12226
Sun Jul 14 20:40:28 2013
WARNING: cache read a corrupt block: group=1(DG01) dsk=0 blk=40 disk=0 (DG01_0000) incarn=3916004514 au=0 blk=10 count=246
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12226.trc:
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
NOTE: a corrupted block from group DG01 was dumped to /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12226.trc
WARNING: cache read (retry) a corrupt block: group=1(DG01) dsk=0 blk=40 disk=0 (DG01_0000) incarn=3916004514 au=0 blk=40 count=1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12226.trc:
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
ERROR: cache failed to read group=1(DG01) dsk=0 blk=10 from disk(s): 0(DG01_0000)
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
NOTE: cache initiating offline of disk 0 group DG01
NOTE: process _user12226_+asm (12226) initiating offline of disk 0.3916004514 (DG01_0000) with mask 0x7e in group 1
WARNING: Disk 0 (DG01_0000) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 0/0xe9697ca2, mask = 0x6a, op = clear
Sun Jul 14 20:40:28 2013
GMON updating disk modes for group 1 at 10 for pid 19, osid 12226
ERROR: Disk 0 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Sun Jul 14 20:40:28 2013
NOTE: cache dismounting (not clean) group 1/0x0C498C6A (DG01)
NOTE: messaging CKPT to quiesce pins Unix process pid: 12412, image: oracle@alpddbs001 (B000)
Sun Jul 14 20:40:29 2013
NOTE: halting all I/Os to diskgroup 1 (DG01)
WARNING: Offline of disk 0 (DG01_0000) in group 1 and mode 0x7f failed on ASM inst 1
Sun Jul 14 20:40:29 2013
NOTE: LGWR doing non-clean dismount of group 1 (DG01)
NOTE: LGWR sync ABA=9.3204 last written ABA 9.3204
NOTE: cache dismounted group 1/0x0C498C6A (DG01)
Sun Jul 14 20:40:29 2013
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
System State dumped to trace file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12226.trc
SQL> alter diskgroup DG01 dismount force /* ASM SERVER */
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
Sun Jul 14 20:40:41 2013
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xc498c6a (DG01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_rbal_8308.trc:
ORA-15130: diskgroup "DG01" is being dismounted
Sun Jul 14 20:40:51 2013
NOTE: AMDU dump of disk group DG01 created at /orabase/diag/asm/+asm/+ASM/trace
NOTE: cache deleting context for group DG01 1/0x0c498c6a
ORA-15032: not all alterations performed
ORA-15130: diskgroup "DG01" is being dismounted
ORA-15066: offlining disk "DG01_0000" in group "DG01" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648] [40] [0 != 1]
ERROR: alter diskgroup dg01 check
Sun Jul 14 20:40:51 2013
GMON dismounting group 1 at 11 for pid 20, osid 12412
Sun Jul 14 20:40:51 2013
NOTE: Disk in mode 0x8 marked for de-assignment
NOTE: Disk in mode 0x8 marked for de-assignment
NOTE: Disk in mode 0x8 marked for de-assignment
NOTE: Disk in mode 0x8 marked for de-assignment
SUCCESS: diskgroup DG01 was dismounted
SUCCESS: alter diskgroup DG01 dismount force /* ASM SERVER */
ERROR: PST-initiated MANDATORY DISMOUNT of group DG01
Sun Jul 14 20:40:51 2013
NOTE: diskgroup resource ora.DG01.dg is offline
Sun Jul 14 20:40:56 2013
NOTE: ASM client dbtest1:dbtest1 disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_8303.trc
Sun Jul 14 20:40:56 2013
NOTE: ASM client omsrepos:omsrepos disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_8255.trc
Sun Jul 14 20:40:58 2013
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12486.trc:
ORA-17503: ksfdopn:2 Failed to open file +DG01/omsrepos/spfileomsrepos.ora
ORA-15001: diskgroup "DG01" does not exist or is not mounted
Sun Jul 14 20:40:58 2013
Errors in file /orabase/diag/asm/+asm/+ASM/trace/+ASM_ora_12488.trc:
ORA-17503: ksfdopn:2 Failed to open file +DG01/dbtest1/spfiledbtest1.ora
ORA-15001: diskgroup "DG01" does not exist or is not mounted

ASM Diskgroup Check Command Fails

The check diskgroup returns errors

alter diskgroup DG01 check

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15130: diskgroup "DG01" is being dismounted

ORA-15066: offlining disk "DG01_0000" in group "DG01" may result in a data loss

ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648]

[2] [0 != 1]

ORA-15196: invalid ASM block header [kfc.c:26077] [endian_kfbh] [2147483648]

[2] [0 != 1]

Based on this corruption Oracle ASM 11.2 does not handle the issue well and dismounts the diskgroup and all databases serviced by the ASM instance that use diskgroup DG01 are aborted.

ASM 12.1.0.1

Command to Simulate Block 40 Corruption

Note: /dev/sdg1 is ASM disk DISK1 in diskgroup DG01

I execute the dd command to simulate the corrupted block.

[root@alpddbs001 ~]# dd if=/dev/zero of=/dev/sdg1 bs=4k seek=40 count=4
4+0 records in
4+0 records out
16384 bytes (16 kB) copied, 2.5e-05 seconds, 655 MB/

Note: When I execute the above command no issue is reported in the ASM alert log.

Check ASM Diskgroup Mounted

I then login to the ASM instance as sysasm and check the the ASM diskgroup is still mounted.

SQL> select name,state from v$asm_diskgroup;

NAME STATE

------------------------------ -----------

FRA01 MOUNTED

DG01 MOUNTED

Run ASM Check Diskgroup Command

I then issue an a check of the diskgroup the same type of check that would be issued during a rebalance operation.

SQL> alter diskgroup DG01 check;

ASM Alert Log Errors

The moment that I issue the check command in ASM the ASM alert log begins to report errors. The difference in the 12.1 it is handling the error differently.

Thu Jul 11 20:07:38 2013

ERROR: Could not read the header of disk DG01_0000 (0).

NOTE: process _user20079_+asm (20079) initiating offline of disk 0.3915922415 (DG01_0000) with mask 0x7e in group 1 (DG01) without client assisting

NOTE: initiating PST update: grp 1 (DG01), dsk = 0/0xe9683bef, mask = 0x6a, op = clear

Thu Jul 11 20:07:38 2013

GMON updating disk modes for group 1 at 39 for pid 36, osid 20079

ERROR: disk 0(DG01_0000) in group 1(DG01) cannot be offlined because the disk group has external redundancy.

Thu Jul 11 20:07:38 2013

ERROR: too many offline disks in PST (grp 1)

Thu Jul 11 20:07:38 2013

ERROR: Failed to offline disk DG01_0000 (0).

NOTE: starting check of diskgroup DG01

Thu Jul 11 20:07:38 2013

GMON checking disk 0 for group 1 at 40 for pid 36, osid 20079

GMON checking disk 1 for group 1 at 41 for pid 36, osid 20079

GMON checking disk 2 for group 1 at 42 for pid 36, osid 20079

GMON checking disk 3 for group 1 at 43 for pid 36, osid 20079

Thu Jul 11 20:07:39 2013

SUCCESS: check of diskgroup DG01 found no errors

Thu Jul 11 20:07:39 2013

SUCCESS: alter diskgroup DG01 check

Thu Jul 11 20:07:39 2013

ERROR: no read quorum in group: required 1, found 0 disks

ASM Health Checker found 1 new failures

Thu Jul 11 20:08:36 2013

SQL> alter diskgroup DG01 check

Thu Jul 11 20:08:36 2013

ERROR: Could not read the header of disk DG01_0000 (0).

NOTE: process _user20079_+asm (20079) initiating offline of disk 0.3915922415 (DG01_0000) with mask 0x7e in group 1 (DG01) without client assisting

NOTE: initiating PST update: grp 1 (DG01), dsk = 0/0xe9683bef, mask = 0x6a, op = clear

Thu Jul 11 20:08:36 2013

GMON updating disk modes for group 1 at 44 for pid 36, osid 20079

ERROR: disk 0(DG01_0000) in group 1(DG01) cannot be offlined because the disk group has external redundancy.

Thu Jul 11 20:08:36 2013

ERROR: too many offline disks in PST (grp 1)

Thu Jul 11 20:08:36 2013

ERROR: Failed to offline disk DG01_0000 (0).

NOTE: starting check of diskgroup DG01

Thu Jul 11 20:08:36 2013

GMON checking disk 0 for group 1 at 45 for pid 36, osid 20079

GMON checking disk 1 for group 1 at 46 for pid 36, osid 20079

GMON checking disk 2 for group 1 at 47 for pid 36, osid 20079

GMON checking disk 3 for group 1 at 48 for pid 36, osid 20079

Thu Jul 11 20:08:37 2013

SUCCESS: check of diskgroup DG01 found no errors

Thu Jul 11 20:08:37 2013

SUCCESS: alter diskgroup DG01 check

Thu Jul 11 20:08:38 2013

ERROR: no read quorum in group: required 1, found 0 disks

Thu Jul 11 20:08:38 2013

ERROR: no read quorum in group: required 1, found 0 disks

ASM Health Checker found 1 new failures

ASM 12.1 Diskgroup Check is successful

SQL> alter diskgroup DG01 check;

Diskgroup altered.

ASM diskgroup are still mounted

SQL> select name,state from v$asm_diskgroup;

NAME STATE

------------------------------ -----------

FRA01 MOUNTED

DG01 MOUNTED

Conclusion

ASM 12.1 handled the corruption better then in 11.2.

Based on information from Oracle:

On 12.1 there is more protection for some of the metadata. Previous release (>= 11.1.0.7 and 10.2.0.5) only the disk header was protected. This was true for external redundancy as well.

Note: If more then 2 or 3 MB of disk data is loose in ASM metadata on external redundancy diskgroup you lose the entire diskgroup. This is true for 12.1 as well. To protect from this type of issue use normal or high redundancy diskgroup check out my blog on normal redundancy.

There better protection of metadata on 12.1 is transparent and works automatically with PHY_META_REPLICATED

I have been told in 12.1.0.2 there will be a new feature that will not allow OS utilities to make changes to ASM disks.

My webpages

http://db12c.blogspot.com/
http://cloudcontrol12c.blogspot.com/

http://www.youtube.com/user/jfruiz11375

Follow me on Twitter

14 comments:

UnknownJuly 12, 2013 at 5:52 AM
Great!!
SYSDBAJuly 12, 2013 at 6:00 AM
Really interesting. Thanks for sharing this.
Javier RuizJuly 12, 2013 at 6:17 AM
You're Welcome
UnknownJuly 15, 2013 at 11:25 AM
Good info Javier. How did 12.1 recover the block 40 since this details not on alertlog?
Javier RuizJuly 15, 2013 at 11:31 AM
Based on information from Oracle the protection of Metadata is transparent and automatic with PHY_META_REPLICATED. Seems the metadata is being rebuilt from the replicated blocks. Oracle did note that in external redundancy diskgroup if more the 2 or 3 MB of is lost the diskgroup would need to be rebuilt. See the info in the conclusion of the blog.
Javier RuizJuly 15, 2013 at 11:33 AM
You can also check my blog on ASM 12c Normal Redundancy testing http://db12c.blogspot.com/2013/07/asm-12101-normal-redundancy-testing.html
BelManJuly 17, 2013 at 1:21 PM
Good Job
Javier RuizJuly 17, 2013 at 4:04 PM
Thank You BelMan you may also be interested in my other blog post for ASM 12c

ASM 12101 Normal Redundancy Testing http://db12c.blogspot.com/2013/07/asm-12101-normal-redundancy-testing.html

ASM 12c New Feature Replace Command http://db12c.blogspot.com/2013/07/asm-12c-new-feature-replace-command.html

UnknownJuly 18, 2013 at 12:22 PM
Javier:
Thanks for the great post and research in 12c.
Since the header info is backed up in second last block of allocation unit 1, "kfed repair" finds header block with KFBTYP_DISKHEAD type and restores it. Hence the reason that if first 2 to 3 MB info is lost you will likely have to recreate the diskgroup. This means that we still have to backup ASM metadata using md_backup so that we can also backup attributes, templates, directories etc, otherwise one has to find all that information from ASM alert log while re-creating disk group.

Bottom line is if you are still using 10g then backup first 4k using dd command. 11g onwards use md_backup or kfed utility.
Javier RuizJuly 18, 2013 at 12:45 PM
Thanks for the reply

I agree with the md_backup but remember md_backup/md_restore does not take care of the superblocks. The true solution for preventing a diskgroup outage due to some disk issue is to use normal or high redundancy diskgroup.

ASM 12101 Normal Redundancy Testing http://db12c.blogspot.com/2013/07/asm-12101-normal-redundancy-testing.html
UnknownJanuary 19, 2015 at 8:37 AM
Javier,
Some reason 12c didnt recover the block 40 corruption on one of the disk on external redundancy.

NOTE: starting meta-data replication for disk 18 of group DG01 at FCN 0.31612998
Mon Jan 19 09:37:52 2015
WARNING: cache read a corrupt block: group=4(DG01) dsk=18 blk=40 disk=18 (DISK019) incarn=3620249718 au=0 blk=40 count=1
Mon Jan 19 09:37:52 2015
Errors in file /oragridbase/diag/asm/+asm/+ASM/trace/+ASM_ars0_5002.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [check_kfbh] [2147483666] [40] [2213122942 != 304098278]
NOTE: a corrupted block from group DG01 was dumped to /oragridbase/diag/asm/+asm/+ASM/trace/+ASM_ars0_5002.trc
WARNING: cache read (retry) a corrupt block: group=4(DG01) dsk=18 blk=40 disk=18 (DISK019) incarn=3620249718 au=0 blk=40 count=1
Mon Jan 19 09:37:52 2015
Errors in file /oragridbase/diag/asm/+asm/+ASM/trace/+ASM_ars0_5002.trc:
ORA-15196: invalid ASM block header [kfc.c:29297] [check_kfbh] [2147483666] [40] [2213122942 != 304098278]
ORA-15196: invalid ASM block header [kfc.c:29297] [check_kfbh] [2147483666] [40] [2213122942 != 304098278]
ERROR: cache failed to read group=4(DG01) dsk=18 blk=40 from disk(s): 18(DISK019) 18(DISK019)
ORA-15196: invalid ASM block header [kfc.c:29297] [check_kfbh] [2147483666] [40] [2213122942 != 304098278]
ORA-15196: invalid ASM block header [kfc.c:29297] [check_kfbh] [2147483666] [40] [2213122942 != 304098278]

NOTE: cache initiating offline of disk 18 group DG01
NOTE: process _ars0_+asm (5002) initiating offline of disk 18.3620249718 (DISK019) with mask 0x7e in group 4 (DG01) with client assisting
NOTE: initiating PST update: grp 4 (DG01), dsk = 18/0xd7c8a076, mask = 0x6a, op = clear
Mon Jan 19 09:37:52 2015
GMON updating disk modes for group 4 at 69 for pid 21, osid 5002
ERROR: disk 18(DISK019) in group 4(DG01) cannot be offlined because the disk group has external redundancy.
Mon Jan 19 09:37:52 2015
ERROR: too many offline disks in PST (grp 4)
Mon Jan 19 09:37:52 2015
NOTE: cache dismounting (not clean) group 4/0x7DD86EFF (DG01)
NOTE: messaging CKPT to quiesce pins Unix process pid: 5038, image: oracle@alpqdbs001 (B000)
Mon Jan 19 09:37:52 2015
NOTE: halting all I/Os to diskgroup 4 (DG01)
Mon Jan 19 09:37:52 2015
NOTE: LGWR doing non-clean dismount of group 4 (DG01) thread 1
NOTE: LGWR sync ABA=604.8015 last written ABA 604.8015
Mon Jan 19 09:37:52 2015
NOTE: cache dismounted group 4/0x7DD86EFF (DG01)
Mon Jan 19 09:37:52 2015
SQL> alter diskgroup DG01 dismount force /* ASM SERVER:2111336191 */
Javier RuizJanuary 19, 2015 at 10:15 AM
Was the compatible.asm set to 12.1?

Oracle Database 12c

Thursday, July 11, 2013

ASM 12c External Redundancy Diskgroup Handles Corrupted Blocks Better

Corruption Story:

ASM 11.2.0.3

Command to Simulate Block 40 Corruption

Check ASM Diskgroup Mounted

Run ASM Check Diskgroup Command

ASM Alert Log Errors

ASM Diskgroup Check Command Fails

ASM 12.1.0.1

Command to Simulate Block 40 Corruption

Check ASM Diskgroup Mounted

Run ASM Check Diskgroup Command

ASM Alert Log Errors

Conclusion

14 comments:

About Me