sitemap

RSS地图

收藏本站

设为首页

Oracle研究中心

当前位置:Oracle研究中心 > 故障案例 >

【案例】Oracle RAC报错ORA-01555 ORA-00704异常恢复笔记

时间:2016-11-14 22:03   来源:Oracle研究中心   作者:网络   点击:

天萃荷净 Oracle研究中心案例分析:运维DBA反映Oracle RAC数据库异常无法启动报错ORA-01555 ORA-00704,分析原因为SCN异常导致,处理过程中重建控制文件等操作。

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: 清明节加班恢复的一个11gR2 rac恢复案例

数据恢复案例:由于掉电导致数据库无法open。经过初步排查,确认数据库版本为Oracle 11.2.0.3(linux RAC),数据量比较小,

大约200G左右。整个恢复过程开始看上去很顺利,仅30分钟就顺利打开了数据库,后续发现其中确实有少坑,这里跟大家简单分享一下恢复case。

首先我们来看下数据库无法open所报的错误是什么?

Sun Apr 03 20:55:36 2016
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7):
select ctime, mtime, stime from obj$ where obj# = :1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small

Error 704 happened during db open, shutting down database
USER (ospid: 19990): terminating the instance due to error 704
Instance terminated by USER, pid = 19990
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (19990) as a result of ORA-1092

这个错误其实很常见,已经遇到很多次了,处理方式也不难;大致上有两种.
1、通过10046 trace定位到有问题的数据块,然后手工去屏蔽事务;
2、推进数据库SCN
这里我选择使用推进scn的方式来进行处理。

直接通过oradebug poke修改scn;第一次修改可能是增加的scn不够大;第一次报错一样;第二次报错改变了;变成我们更加熟悉的错误:

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc  (incident=2108431):
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc
Sun Apr 03 21:09:46 2016
Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431].
Sun Apr 03 21:09:46 2016
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 23188): terminating the instance due to eOracleоrror 600

上述的这个错误处理方式其实也有2种,大致如下:
1、由于scn差距很小,因此直接适当推进scn即可。
2、bbed修改dba地址20971648 中的事务来绕过该错误。

很明显,这里我选择第1种方法更简单;这里我再次修改scn,稍微增加大一点即可;很顺利的打开了数据库。

SQL> startup mount pfile='/tmp/pfile.ora';
ORACLE instance started.

Total System Global Area 2.0243E+10 bytes
Fixed Size      2237088 bytes
Variable Size   7449087328 bytes
Database Buffers  1.2751E+10 bytes
Redo Buffers     41189376 bytes
Database mounted.

SQL> oradebug setmypid
Statement processed.

SQL> alter system set job_queue_processes=0;
System altered.

SQL> oradebug poke 0x060019598  4 0x832B8852
BEFORE: [060019598, 06001959C) = 00000000
AFTER: [060019598, 06001959C) = 832B8852

SQL> alter database open;
Database altered.

看上去整个恢复过程很简单,也就不到半小时就打开了数据库。可是当我检查数据库文件状态时,整个数据库一共有23个数据文件,其中有11个数据文件状态为missing,

这也就是说都无法识别到数据库文件。实际上此时数据库alert log中也在报如下的错误,告诉我们这部分数据文件无法识别:


Sun Apr 03 21:26:09 2016
minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
[24583] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds)
Dictionary check beginning
Tablespace 'NORMING_DATA' #10 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMING_TEMP' #11 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary,
but not in the controlfile. Adding to controlfile.
File #13 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00013' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #14 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00014' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #15 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00015' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #16 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00016' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #17 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00017' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #18 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00018' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #19 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00019' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #20 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00020' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #21 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00021' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #22 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00022' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #23 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00023' in the controlfile.
This file can no longer be recovered so it must be dropped.
Dictionary check complete

由于此时数据库已经打开了,因此为产生了一个重建控制文件的脚本,发现脚本内容如下:

STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "ORCL" RESETLOGS  ARCHIVELOG
    MAXLOGFILES 192
    MAXLOGMEMBERS 3
    MAXDATAFILES 1024
    MAXINSTANCES 32
    MAXLOGHISTORY 9344
LOGFILE
  GROUP 1 '+DATA/orcl/onlinelog/group_1.273.850670135'  SIZE 50M BLOCKSIZE 512,
  GROUP 2 '+DATA/orcl/onlinelog/group_2.274.850670135'  SIZE 50M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '+DATA/orcl/datafile/system.268.850670033',
  '+DATA/orcl/datafile/sysaux.269.850670033',
  '+DATA/orcl/datafile/undotbs1.270.850670033',
  '+DATA/orcl/datafile/users.271.850670033',
  '+DATA/orcl/datafile/undotbs2.276.850670237',
  '+DATA/orcl/datafile/datacenter',
  '+DATA/orcl/datafile/partner_platform',
  '+DATA/orcl/datafile/sw_portal',
  '+DATA/orcl/datafile/system.dbf',
  '+DATA/orcl/datafile/system_02.dbf',
  '+DATA/orcl/datafile/user_02.dbf',
  '+DATA/orcl/datafile/user_03.dbf',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00013',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00014',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00015',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00016',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00017',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00018',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00019',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00020',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00021',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00022',
  '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00023'
CHARACTER SET ZHS16GBK;

实际上我问客户,他们的反馈是之前由于控制文件损坏,客户也重建了控制文件,进行了多次恢复,而且也进行了resetlogs操作。

从上面的信息来看,不难看出客户重建控制文件的时候漏掉了11个数据文件。由于这部分文件的信息在数据字典中存在,因此在open的时候Oracle 会自动进行offline drop。

或许有人要说,直接找到文件然后重建控制文件不就行了吗?确实如此,然而实际上这里却并没有这么简单。

我进入到asm磁盘组检查文件发现有几个文件名称很奇怪,例如user_02.dbf 实际上link到了system,类似这样的情况。

这种情况下极容易出错。争取的做法查询dba_data_files进行数据文件的挨个确认。

确认好asm磁盘组漏掉的4个文件之后,还有7个文件位于文件系统中。全部添加到脚本中进行创建时发现这些文件和之前到文件到resetlogs已经完全不同了。

其实创建控制文件会报错ora-01189。因此这里还必须手工去修改这11个数据文件头的resetlogs信息;等我将resetlogs信息全部修改完毕后,可以顺利创建控制文件。

但是当我进行reconver时却发现需要之前等archivelog,进一步检查发现归档日志都全部被删掉了。
因此最后还必须的再次修改这部分数据文件的checkpoint信息,将其改成与其他正常的文件一致,最后可以顺利打开数据库,

且检查所有的数据库文件状态均为online状态,如下所示:
最后再将文件系统的文件迁移到asm磁盘组,然后添加redo信息,启动rac节点2.

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle RAC报错ORA-01555 ORA-00704异常恢复笔记

本文由大师惜分飞原创分享,网址:http://www.oracleplus.net/arch/1269.html

Oracle研究中心

关键词:

ORA-00704

ORA-01555

Oracle RAC异常恢复笔记