sitemap

RSS地图

收藏本站

设为首页

Oracle研究中心

当前位置:Oracle研究中心 > 故障案例 >

【案例】Oracle RAC非归档断电后恢复过程 大量报错ORA-0600

时间:2016-12-05 21:13   来源:Oracle研究中心   作者:网络   点击:

天萃荷净 Oracle研究中心案例分析:运维DBA反映Oracle RAC数据库环境异常断电导致RAC无法启动,经查询发现出现大量的ORA-00600报错。

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: 记一次非归档RAC恢复(大量的ORA-600 错误)

帮客户恢复了一个10205 rac(for linux),非归档,掉电(听说频繁掉电几十次),下面是处理过程:

SQL> startup mount pfile='/home/oracle/pfile.ora';
ORACLE instance started.

Total System Global Area 4294967296 bytes
Fixed SIZE                  2101736 bytes
Variable SIZE             805309976 bytes
DATABASE Buffers         3456106496 bytes
Redo Buffers               31449088 bytes
ORA-00214: control file '/u01/oradata/e200pro/control02.ctl' version 1483026
inconsistent WITH file '/u01/oradata/e200pro/control01.ctl' version 1482441

通常遇到这个情况,我们采取的方法是分别control01,control02,control03去进行尝试性mount,不过我这里发现都不行,还会抛出如下错误:

ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [1484399],[1483026], [0x000000000], [], [], [], []

无奈之下只能进行controlfile 的重建,开始头晕晕的,重建controlfile的时候少加了一些datafile,以至于后面进行了resetlogs等操作以后,想再次进行controlfile的重建发现就不行了,报如下类似错误:

CREATE CONTROLFILE REUSE DATABASE "E200PRO" RESETLOGS  NOARCHIVELOG
*
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-01189: file is from a different RESETLOGS than previous files
ORA-01110: data file 21: '/u02/oradata/datafile/archind1.dbf'

我知道可以通过其他手段去修改datafile header然后来进行重建,但是无法确认具体要修改哪些地方? 当时我参考了一下dbsnake以前的一篇详见9i & 10g datafile header文章,
大致判断如下几个地方需要修改:

kcvfhrls    --resetlogs change scn值
kcvfhprs    --Pre resetlogs change 值(即上次resetlog的scn)
kcvfhprc    --prev reset logs count
kcvfhrlc    --resetlogs time
FHRBA_SEQ   --Redo log sequence number
kccfhcsq    --controlfile sequence number 

当然,这些信息都能从X$KCVFH中查到。

后面经过和熊哥沟通,确认只需要修改如下4个地方即可:

FHPRS  offset 420
RHPRC  offset 416
FHRLS  offset 116
FHRLC  offset 112

然后必须使用resetlogs方式创建,否则会遇到如下类似错误:

CREATE CONTROLFILE REUSE DATABASE "E200PRO" NORESETLOGS  NOARCHIVELOG
*
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-01189: file is from a different RESETLOGS than previous files
ORA-01517: log member: '/u01/oradata/e200pro/redo01.log'
重建完controlfile以后,直接recover database一把,提示成功,然后直接open抛出如下错误:

Sat May 12 13:45:23 CST 2012
SMON: enabling cache recovery
Sat May 12 13:45:23 CST 2012
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0938.d2a71797):
Sat May 12 13:45:23 CST 2012
select ctime, mtime, stime from obj$ where obj# = :1
Sat May 12 13:45:23 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro1_ora_2253.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 9 with name "_SYSSMU9$" too small
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 2253
ORA-1092 signalled during: alter databas

开始以为是这个回滚段有问题,使用如下隐含参数以后发现仍然不行:
_corrupted_rollback_segments=(_SYSSMU9$)
_offline_rollback_segments=(_SYSSMU9$)

通过bbed检查该回滚段状态,发现为3,也就是online,原来如此,这里我本来想去修改该回滚段状态的,熊哥说修改的话建议直接修改为5,不过我这里没有修改。

最后熊哥建议调整scn,在mount下我手工操作,如下:

alter session set events '10015 trace name adjust_scn level 10';
发现不行,最后 alter session set events '10015 trace name adjust_scn level 9445';

依然不行,无奈只能用*._minimum_giga_scn=9445加到pfile里面,然后启动。

直接正常alter database open成功。

不过后面接下来的是一系列的ora-00600错误:
Sat May 12 15:17:52 CST 2012
Recovery of Online Redo Log: Thread 2 Group 6 Seq 2 Reading mem 0
  Mem# 0: /u02/oradata/flash_recover_area/E200PRO/onlinelog/o1_mf_6_7tvy79tz_.log
Block recovery completed at rba 2.65.16, scn 2361.1073847969
Sat May 12 15:17:52 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_11977.trc:
ORA-00600: internal error code, arguments: [4194], [52], [23], [], [], [], [], []
Sat May 12 15:17:53 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_smon_11886.trc:
ORA-00600: internal error code, arguments: [4194], [47], [38], [], [], [], [], []
Sat May 12 15:17:55 CST 2012

---该错误很容易处理,就不多说了
Sat May 12 15:47:01 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro1_j001_25397.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 239, file 1, block 458954 (2)
ORA-12012: error on auto execute of job 23
ORA-08102: index key not found, obj# 239, file 1, block 458954 (2)

Sat May 12 15:56:36 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro1_j001_29476.trc:
ORA-12012: error on auto execute of job 186
ORA-08102: index key not found, obj# 239, file 1, block 458954 (2)
Sat May 12 15:56:37 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro1_j000_29474.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sat May 12 15:56:38 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro1_j003_29482.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 239, file 1, block 458954 (2)
ORA-12012: error on auto execute of job 2
ORA-08102: index key not found, obj# 239, file 1, block 458954 (2)
Sat May 12 15:56:40 CST 2012

---这里根据obj定位到为对象i_job_next,发现为obj$上的索引,直接drop重建解决。

后面又是更为严重的13013错误,直接导致实例crash:

Sat May 12 16:22:22 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_smon_11886.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [483], [4462439], [151], [4462439], [4], []
Sat May 12 16:22:23 CST 2012
Non-fatal internal error happenned while SMON was doing flushing of monitored table stats.
SMON encountered 1 out of maximum 100 non-fatal internal errors.
Sat May 12 16:22:23 CST 2012
Trace dumping is performing id=[cdmp_20120512162223]
Sat May 12 16:22:24 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_smon_11886.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [483], [4462439], [151], [4462439], [4], []
Sat May 12 16:22:26 CST 2012
Non-fatal internal error happenned while SMON was doing flushing of monitored table stats.
SMON encountered 2 out of maximum 100 non-fatal internal errors.
Sat May 12 16:22:27 CST 2012

---483为data_object_id,可以查到对象,4462439为rdba地址,也可以进行转换

SQL> SELECT dbms_utility.data_block_address_file(4462439) Rfile# ,dbms_utility.data_block_address_block(4462439) "Block#" FROM dual;

    RFILE#     Block#
---------- ----------
         1     268135

dbv检查该block是正常,至少物理上是ok的。

[oracle@dbssvr-a datafile]$ dbv file=/u01/oradata/e200pro/system01.dbf blocksize=8192 start=268134 end=268136

DBVERIFY: Release 10.2.0.5.0 - Production on Sat May 12 16:46:14 2012
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
DBVERIFY - Verification starting : FILE = /u01/oradata/e200pro/system01.dbf
DBVERIFY - Verification complete

Total Pages Examined         : 3
Total Pages Processed (Data) : 3
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 0
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 0
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 1074016503 (2361.1074016503)

经查该对象为索引i_mon_mods$_obj,直接drop index i_mon_mods$_obj;

然后运行如下语句重建即可:
create unique index i_mon_mods$_obj on mon_mods$(obj#)
  storage (maxextents unlimited)
/

处理完ora-600 13013错误以后,数据库不再crash。

后面还有一些如下错误:
Sat May 12 16:42:36 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_13411.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sat May 12 16:42:39 CST 2012
Trace dumping is performing id=[cdmp_20120512164239]
Sat May 12 16:43:27 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_14023.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sat May 12 16:43:28 CST 2012
Trace dumping is performing id=[cdmp_20120512164328]

Sat May 12 19:26:53 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_2148.trc:
ORA-00600: internal error code, arguments: [6002], [6], [6], [4], [0], [], [], []
Sat May 12 19:26:54 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_2148.trc:
ORA-00600: internal error code, arguments: [6002], [6], [6], [4], [0], [], [], []
Sat May 12 19:26:56 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_2148.trc:
ORA-00600: internal error code, arguments: [6002], [6], [6], [4], [0], [], [], []
Sat May 12 19:26:58 CST 2012

Sun May 13 08:00:11 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_j000_31295.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sun May 13 08:00:15 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_j000_31295.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sun May 13 08:00:17 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_j000_31295.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sun May 13 08:00:21 CST 2012

如上这几个错误都跟index有关系,也就是存在逻辑错误,可以通过如下方式进行检查:

analyze table owner.table_name validate structure cascade online;

补充一些:对于分区表,要执行validate操作,必须先执行如下sql:

@ /rdbms/admin/utlvaite.sql

当然,处理方式也就很简单了,把有问题的index drop重建。

后面还陆续报出如下一系列错误:

Sun May 13 18:09:24 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_13838.trc:
ORA-00600: internal error code, arguments: [ktspgsb-1], [], [], [], [], [], [], []
Sun May 13 18:09:50 CST 2012
Trace dumping is performing id=[cdmp_20120513180950]
Sun May 13 18:10:04 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_13838.trc:
ORAhttp://www.oracleplus.net-00600: internal error code, arguments: [ktspgsb-1], [], [], [], [], [], [], []
Sun May 13 18:21:20 CST 2012

---该错误明显是涉及的表损坏了,重建表即可。
Sun May 13 18:42:58 CST 2012
Hex dump of (file 25, block 9819) in trace file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_1460.trc
Corrupt block relative dba: 0x0640265b (file 25, block 9819)
Fractured block found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x0640265b
last change scn: 0x0938.f6dc4f53 seq: 0x1 flg: 0x04
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x3f300601
check value in block header: 0xdf55
computed block checksum: 0xe34
Reread of rdba: 0x0640265b (file 25, block 9819) found same corrupted data
Sun May 13 18:42:58 CST 2012
Corrupt Block Found
         TSN = 10, TSNAME = TBLSPA_ARCH_IND
         RFN = 25, BLK = 9819, RDBA = 104867419
         OBJN = 484917, OBJD = 543662, OBJECT = PK_C_IT_RUN, SUBOBJECT =
         SEGMENT OWNER = CSMDBOWNER, SEGMENT TYPE = Index Segment
Sun May 13 18:45:31 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/udump/e200pro2_ora_31407.trc:
ORA-00600: internal error code, arguments: [kcbnew_3], [0], [4], [648035], [], [], [], []
Sun May 13 18:45:58 CST 2012

---该错误比较明显,是index block损坏,是Fractured block,drop重建即可。
Mon May 14 10:46:19 CST 2012
Recovery of Online Redo Log: Thread 2 Group 5 Seq 117 Reading mem 0
  Mem# 0: /u02/oradata/flash_recover_area/E200PRO/onlinelog/o1_mf_5_7tvy78pk_.log
Block recovery stopped at EOT rba 117.93764.16
Block recovery completed at rba 117.93764.16, scn 2361.1083355810
Mon May 14 10:46:19 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_smon_11792.trc:
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [kddummy_blkchk], [23], [3], [18018], [], [], [], []
Mon May 14 10:46:19 CST 2012
Trace dumping is performing id=[cdmp_20120514104619]
Mon May 14 10:46:22 CST 2012
Errors in file /oracle/product/10.2/admin/e200pro/bdump/e200pro2_smon_11792.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [23], [3], [18018], [], [], [], []
Mon May 14 10:46:23 CST 2012
Doing block recovery for file 23 block 3
Resuming block recovery (PMON) for file 23 block 3
Block recovery from logseq 117, block 93650 to scn 10141501141666
Mon May 14 10:46:23 CST 2012

---关于[kddummy_blkchk]错误,我在mos上搜了一下,相关的bug比较多,总的来说这里还是涉及的对象有问题,需要把相关对象重建应该就能解决。

SELECT segment_name
          FROM dba_extents
           WHERE file_id = 23
               AND 3 BETWEEN block_id AND block_id + blocks - 1;

如果说实例很快就crash了,可以通过event 10513来阻止smon去进行实例recovery,如下:
alter system set event='10513 trace name context forever,level 2' scope=spfile;
然后在对表进行处理。

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle RAC非归档断电后恢复过程 大量报错ORA-0600

本文由大师惜分飞原创分享,网址:http://www.oracleplus.net/arch/1366.html

Oracle研究中心

关键词:

ORA-00600 kccpb_sanity_check_2

Oracle报错ORA-0600解决办法

Oracle RAC断电后无法启动解决办法

ORA-00600 qertbFetchByRowID