sitemap

RSS地图

收藏本站

设为首页

Oracle研究中心

当前位置:Oracle研究中心 > 故障案例 >

【案例】Oracle RAC报错ORA-00600 [kccpb_sanity_check_2]解决办法

时间:2016-11-12 14:16   来源:Oracle研究中心   作者:网络   点击:

天萃荷净 Oracle研究中心案例分析:运维DBA反映Oracle RAC FOR ASM数据库无法启动并报错ORA-00600等错误,结合日志分析原因为SCN异常导致整体故障。
本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: 数套ASM RAC的恢复案例

前不久帮助某客户恢复了6套Oracle RAC,均为ASM,而且版本均为10.2.0.4。

1.遇到ORA-00600 错误:


Thu Dec 31 11:55:46 2015
SUCCESS: diskgroup DG1 was mounted
Thu Dec 31 11:55:50 2015
Errors in file /oracle/admin/xxx/udump/xxx1_ora_28803.trc:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [13715626], [13715623], [0x000000000], [], [], [], []
SUCCESS: diskgroup DG1 was dismounted
Thu Dec 31 11:55:51 2015
对于该错误,其实很简单,主要是因为控制文件损坏,通过重建控制文件或者利用备份的控制文件进行restore即可进行mount;甚至于我们利用控制文件快照都可以进行数据库mount;然后接着进行恢复操作。在恢复的过程中还遇到了如下的错误:

Errors in file /oracle/admin/xxx/udump/xxx1_ora_6990.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3431], [18446744072948603858], [3431], [18446744072948586897], [], [], []
Tue Jan  5 10:52:28 2016
Errors in file /oracle/admin/xxx/bdump/xxx1_arc0_8205.trc:
ORA-19504: failed to create file "+DG1/xxx/archivelog/1_2_900069464.dbf"
ORA-17502: ksfdcre:4 Failed to create file +DG1/xxx/archivelog/1_2_900069464.dbf
ORA-00600: internal error code, arguments: [kffbAddBlk04], [], [], [], [], [], [], []
Tue Jan  5 10:52:28 2016
ARC0: Error 19504 Creating archive log file to '+DG1/xxx/archivelog/1_2_900069464.dbf'
ARCH: Archival stopped, error occurred. Will continue retrying
Tue Jan  5 10:52:30 2016
ORACLE Instance xxx1 - Arc
上述的ORA-00600错误其实很简单,主要是数据块SCN的问题。

SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> ALTER DATABASE ADD LOGFILE THREAD 2
  2    GROUP 3 (
  3      '+DG/xxxx/onlinelog/group_3.271.752099989',
  4      '+DG/xxxx/onlinelog/group_3.272.752099991'
  5    ) SIZE 100M REUSE,
  6    GROUP 4 (
  7      '+DG/xxxx/onlinelog/group_4.273.752099991',
  8      '+DG/xxxx/onlinelog/group_4.274.752099993'
  9    ) SIZE 100M REUSE,
10    GROUP 6 (
11      '+DG/xxxx/onlinelog/group_6.275.752099993',
12      '+DG/xxxx/onlinelog/group_6.276.752099993'
13    ) SIZE 100M REUSE;
ALTER DATABASE ADD LOGFILE THREAD 2
*
ERROR at line 1:
ORA-01276: Cannot add file +DG/xxxx/onlinelog/group_3.271.752099989.  File hasan Oracle Managed Files file name.

由于是ORACLE RAC,因此重建控制文件之后,是需要添加redo logfile的;然而add logfile 发现报上述错误。根据Oracle metalink的一些方法均不能成功,都报上面的错误,确实很怪异。
有些人看上述的错误,可能会认为是设置了OMF的参数,其实这里并不是,我将相关参数全部修改之后,错误依旧。
这里实际上添加logfile时,只写磁盘组名称就行了,不需要写绝对路径。

2.接着在进行recover后进行open resetlogs打开时,报错ORA-01248

SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> recover database using backup controlfile until cancel;
ORA-00279: change 13300428179625 generated at 04/04/2013 12:51:35 needed for
thread 1
ORA-00289: suggestion : +DG/archivelog/arch1_752099890_12809_1.log
ORA-00280: change 13300428179625 for thread 1 is in sequence #12809

Specify log: {=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log
'+DG/archivelog/arch1_752099890_12809_1.log'
ORA-17503: ksfdopn:2 Failed to open file
+DG/archivelog/arch1_752099890_12809_1.log
ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory
'archivelog'

ORA-00308: cannot open archived log
'+DG/archivelog/arch1_752099890_12809_1.log'
ORA-17503: ksfdopn:2 Failed to open file
+DG/archivelog/arch1_752099890_12809_1.log
ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory
'archivelog'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833'

SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01248: file 42 was created in the future of incomplete recovery
ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xdidx03.ora'

这个错误还是比较少见,实际上网上那些说法,以及Oracle mos提供的解决方法我发现都不行。
无奈只能先将其offline ,然后再进行恢复。再进行open之前我查询了当前的checkpoint scn如下:

SQL> select file#,checkpoint_change# from v$datafile;

     FILE#      CHECKPOINT_CHANGE#
---------- -----------------------
         1          14731601024328
         2          14731601024328
         3          14731601024328
         4          13300428179625
         5          14731601024328
         6          14731601024328
         7          14731601024328
     .......
        39          14731601024328
        40          14731601024328
        41          14731601024328
        42          14731601024328
        43          14731601024328

43 rows selected.

SQL> c/datafile/datafile_header
  1* select file#,checkpoint_change# from v$datafile_header
SQL> /

     FILE#      CHECKPOINT_CHANGE#
---------- -----------------------
         1          14731601024328
         2          14731601024328
         3          14731601024328
         4          13300428179625
         5          14731601024328
         6          14731601024328
         7          14731601024328
         ......
        40          14731601024328
        41          14731601024328
        42          14731601024328
        43          14731601024328

43 rows selected.
由于open失败,这里我想着是不是这2个文件有问题,又用之前的快照控制文件进行recover一把,然后再次用重建的控制文件起来数据库进行recover,发现神奇的事情出现了:

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 13305808683011 generated at 01/11/2016 21:09:02 needed for
thread 1
ORA-00289: suggestion : +DG/archivelog/arch1_900882531_1_1.log
ORA-00280: change 13305808683011 for thread 1 is in sequence #1

Specify log: {=suggested | filename | AUTO | CANCEL}
auto
ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log'
ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log
ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory
'archivelog'

ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log'
ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log
ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory
'archivelog'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833'

SQL> alter database datafile 42 offline;
alter database datafile 43 offline;
Database altered.

SQL>

Database altered.

SQL>
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced

我们可以看到open 失败了,对于open失败的 情况,我们首先是看alert log,接着10046 trace。

SQL> startup nomount pfile='/tmp/p.ora';
ORACLE instance started.

Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
SQL> oradebug setmypid
Statement processed.

SQL> alter database mount;
Database altered.

SQL> recover database;
Media recovery complete.
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$"
too small
这里我又屏蔽了undo相关的参数。再次尝试发现错误依旧。再次启动,神奇的事情出现了,SCN居然倒退了?

SQL> startup mount  pfile='/tmp/p.ora';
ORACLE instance started.
Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> recover database;
Media recovery complete.

SQL> select checkpoint_change#,file# from v$datafile;

         CHECKPOINT_CHANGE#      FILE#
--------------------------- ----------
             13314398637607          1
             13314398637607          2
             13314398637607          3
             13314398637607          4
             13314398637607          5
            ......
             13314398637607         38
             13314398637607         39
             13314398637607         40
             13314398637607         41
                          0         42
                          0         43

43 rows selected.

SQL> select checkpoint_change#,file#,checkpoint_time from v$datafile_header;

         CHECKPOINT_CHANGE#      FILE# CHECKPOIN
--------------------------- ---------- ---------
             13314398637607          1 11-JAN-16
             13314398637607          2 11-JAN-16
             13314398637607          3 11-JAN-16
             13314398637607          4 11-JAN-16
             13314398637607          5 11-JAN-16
             ......
             13314398637607         39 11-JAN-16
             13314398637607         40 11-JAN-16
             13314398637607         41 11-JAN-16
             14731601024328         42 30-DEC-15
             14731601024328         43 30-DEC-15

43 rows selected.
很明显,这个133的scn 回退到了过去2年前了,出现时空穿越了。。。。 当然,open肯定还是报错:

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$"
too small
这里先不管为啥连数据文件头的SCN都倒退了(之前被offline的2个文件scn是OK的). ?通过10046 trace得到如下内容:

PARSING IN CURSOR #5 len=52 dep=1 uid=0 oct=3 lid=0 tim=1418474357830663 hv=429618617 ad='5db7ea50'
select ctime, mtime, stime from obj$ where obj# = :1
END OF STMT
PARSE #5:c=0,e=531,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357830659
BINDS #5:
kkscoacd
Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=2ad7172aa020  bln=22  avl=02  flg=05
  value=20
EXEC #5:c=1000,e=673,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357831431
WAIT #5: nam='db file sequential read' ela= 9843 file#=1 block#=218 blocks=1 obj#=-1 tim=1418474357841421
.....
FETCH #6:c=17997,e=64936,p=23,cr=566,cu=0,mis=0,r=1,dep=2,og=3,tim=1418474357907669
STAT #6 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=566 pr=23 pw=0 time=64913 us)'
STAT #6 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=1 pw=0 time=5769 us)'
WAIT #5: nam='db file sequential read' ela= 13031 file#=40 block#=167538 blocks=1 obj#=-1 tim=1418474357920819
FETCH #5:c=19996,e=89548,p=25,cr=568,cu=0,mis=0,r=0,dep=1,og=4,tim=1418474357921006

我们这里可以看到,这里报错的SQL读取了file 1 block 218,以及file 40 block 167538。
对于file 1 block 218,我dump 发现没有活动事务;而file 40 block 167538则为undo 块.

SQL> select name from v$datafile where file#=40;
NAME
--------------------------------------------------------------------------------
+DG/xxxx/datafile/undotbs4
同时dump 了这个undo 块,发现确实感觉有些异常,如下所示:

********************************************************************************
UNDO BLK:
xid: 0x0009.01b.0014320a  seq: 0xa47 cnt: 0x1   irb: 0x1   icl: 0x0   flg: 0x0000
Rec Offset      Rec Offset      Rec Offset      Rec Offset      Rec Offset
---------------------------------------------------------------------------
0x01 0x0014    
*-----------------------------
* Rec #0x1  slt: 0x1b  objn: 55417(0x0000d879)  objd: 296039  tblspc: 20(0x00000014)
*       Layer:  10 (Index)   opc: 21   rci 0x00
Undo type:  Regular undo   Last buffer split:  No
Temp Object:  No
Tablespace Undo:  No
rdba: 0x0a028e71
*-----------------------------
index general undo (branch) operations
KTB Redo
op: 0x05  ver: 0x01
op: R  itc: 2
Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0009.01b.0014320a  0x0a028e71.0a47.04  ----    1  fsc 0x0000.00000000
0x02   0x0009.02b.001428b6  0x0a028e6f.0a47.06  ----  112  fsc 0x0000.00000000
Dump kdige : block dba :0x05d630b3, seghdr dba: 0x06076e89
restore block before image

由于所有的文件头SCN 都倒退了,正常open 都报错,只能推进SCN,而且SCN必须要比这个undo block的最大SCN 还要大一些才行,通过在pfile文件中加入参数*._minimum_giga_scn即可。
SQL> conn /as sysdba
Connected to an idle instance.

SQL> startup mount pfile='/tmp/p.ora';
ORACLE instance started.
Total System Global Area 2.1475E+10 bytes
Fixed Size                  2122368 bytes
Variable Size            2399145344 bytes
Database Buffers         1.9059E+10 bytes
Redo Buffers               14651392 bytes
Database mounted.
SQL> show parameter job
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes                  integer     10
SQL> alter system set job_queue_processes=0;
System altered.

SQL> alter database open;
Database altered.

SQL> drop tablespace undotbs3 including contents and datafiles;
Tablespace dropped.

SQL> drop tablespace undotbs4 including contents and datafiles;
Tablespace dropped.
顺利打开数据库之后,立即将原有的undo 表空间进行drop 并重建。
虽然数据库是打开了,然而其中有2个数据文件之前被我们offline了,而且中间进行了resetlogs操作,因此现在无法进行正常online了。

SQL> select file#,checkpoint_change# from v$datafile;
     FILE#             CHECKPOINT_CHANGE#
---------- ------------------------------
         1                 14759124431097
         2                 14759124431097
         3                 14759124431097
         4                 14759124431097
         5                 14759124431097
        ......
        36                 14759124431097
        37                 14759124431097
        38                 14759124431097
        41                 14759124431097
        42                              0
        43                              0
        44                 14759124431097
        45                 14759124431097
43 rows selected.

SQL> alter database datafile 42 online;
alter database datafile 42 online
*
ERROR at line 1:
ORA-01190: control file or data file 42 is from before the last RESETLOGS
ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xxx03.ora'

SQL> alter database datafile 43 online;
alter database datafile 43 online
*
ERROR at line 1:
ORA-01190: control file or data file 43 is from before the last RESETLOGS
ORA-01110: data file 43: '+DG/xxxx/datafile/file_tab1_xxx05.ora'
这里用bbed 将上面2个文件头相关信息修改掉,然后进行recover,可以顺利online文件。

SQL> recover datafile 42;
Media recovery complete.

SQL> alter database datafile 42 online;
Database altered.

SQL> recover datafile 43;
Media recovery complete.

SQL> alter database datafile 43 online;
Database altered.

SQL> select file#,checkpoint_change# ,status from v$datafile;

     FILE#         Oracle?о?????    CHECKPOINT_CHANGE# STATUS
---------- ------------------------------ -------
         1                 14759124821491 SYSTEM
         2                 14759124821491 SYSTEM
         3                 14759124821491 ONLINE
         4                 14759124821491 ONLINE
         5                 14759124821491 ONLINE
      。。。。。。
        36                 14759124821491 ONLINE
        37                 14759124821491 ONLINE
        38                 14759124821491 ONLINE
        41                 14759124821491 ONLINE
        42                 14759124831966 ONLINE
        43                 14759124832115 ONLINE
        44                 14759124821491 ONLINE
        45                 14759124821491 ONLINE
43 rows selected.

SQL> alter system checkpoint;
System altered.

SQL> select file#,checkpoint_change# ,status from v$datafile;

     FILE#             CHECKPOINT_CHANGE# STATUS
---------- ------------------------------ -------
         1                 14759124832224 SYSTEM
         2                 14759124832224 SYSTEM
         3                 14759124832224 ONLINE
         4                 14759124832224 ONLINE
         5                 14759124832224 ONLINE
       ......
        38                 14759124832224 ONLINE
        41                 14759124832224 ONLINE
        42                 14759124832224 ONLINE
        43                 14759124832224 ONLINE
        44                 14759124832224 ONLINE
        45                 14759124832224 ONLINE

43 rows selected.
最后建议将数据库expdp 导出并重建。到此告一段落!

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle RAC报错ORA-00600 [kccpb_sanity_check_2]解决办法

本文由大师惜分飞原创分享,网址:http://www.oracleplus.net/arch/1243.html

Oracle研究中心

关键词:

ORA-00600

ORA-00600 [kccpb_sanity_check_2]

Oracle RAC异常恢复过程案例