sitemap

RSS地图

收藏本站

设为首页

Oracle研究中心

当前位置:Oracle研究中心 > 故障案例 >

【案例】Oracle RAC报错ORA-00600 [kjbrchkmode]官方MOS解决办法

时间:2016-11-13 20:30   来源:Oracle研究中心   作者:网络   点击:

天萃荷净 Oracle研究中心案例分析:运维DBA反映Oracle RAC数据库出现报错ORA-00600 [kjbrchkmode],结合MOS官方文档分析产生原因和解决办法。lms进程异常 导致实例重启

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger 的Oracle技术博客
本文链接地址: lms进程异常 导致实例重启一例

某rac的其中一个节点重启了,希望让我们帮忙分析一下具体原因,相关的日志简单的分析了一下,

alert log如下:
Mon Jul 11 12:47:14 2011
Thread 2 advanced to log sequence 53143
Current log# 3 seq# 53143 mem# 0: /dev/rdb_redo21_500m
Mon Jul 11 12:47:54 2011
ARC3: Standby redo logfile selected for thread 2 sequence 53142 for destination LOG_ARCHIVE_DEST_2
Mon Jul 11 12:52:48 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],[], [], [], []
Mon Jul 11 12:52:51 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],[], [], [], []
Mon Jul 11 12:52:52 2011
Trace dumping is performing id=[cdmp_20110711125251]
Mon Jul 11 12:52:55 2011
Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],[], [], [], []
Mon Jul 11 12:52:55 2011
LMS0: terminating instance due to error 484
Instance terminated by LMS0, pid = 1044892
Mon Jul 11 12:53:41 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en6 192.168.182.0 configured from OCR for use as a cluster interconnect
Interface type 1 en4 10.66.81.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled

从上面的信息来看,该节点在12点52分出现ora-600内部错误,仅仅不到10s的时间,就导致lms0进程异常中止
最后导致该实例重启。

我们来看看对于的trace信息:
*** 2011-07-11 12:52:48.437
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (  means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c bl ksedst1 FFFFFFFFFFFC560   000000000 
ksedmp+0290 bl ksedst 104811C98 
ksfdmp+0018 bl 03F40AE0
kgerinv+00dc bl _ptrgl
kgeasnmierr+0040 bl kgerinv 000000000   000000000 
000000000   11021E068 
FFFFFFFFFFFCCA0 
kjbrchkmode+0230 bl 03F3FD0C
kjbriconsign+00d8 bl kjbrchkmode 000000000   FFFFFFFFFFFCCA8 
kjbmiconsign+01f0 bl kjbriconsign 10517A4BC   000000000 
000000002   42C98000E0000 
0FFFFCCF0 
kjbmpocr+1114 bl kjbmiconsign 110BFB640   11021E068 
114C3177CB0279   110C9F850 
FFFFFFFFFFFCCA0 
kjmxmpm+0238 bl 03F3FB24
kjmpmsgi+050c bl kjmxmpm 0A25B7EE4   600000000 
A25B00ED00000000 
000000000 
kjmsm+177c bl kjmpmsgi 110C6D820   1EAD6A410 
100000000   000000001 
0FFFFD780   000000000 
ksbrdp+03e0 bl _ptrgl
opirip+03fc bl 01FAFCDC
opidrv+0448 bl opirip 1102736F0   410275030 
FFFFFFFFFFFF310 
sou2o+0090 bl opidrv 320318411C   400000010 
FFFFFFFFFFFF310 
opimai_real+0150 bl 01FAB1D4
main+0098 bl opimai_real 000000000   000000000 
__start+0098 bl main 000000000   000000000 

*** 2011-07-11 12:52:51.089
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (  means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c bl ksedst1 FFFFFFFFFFFC9E0   000000000 
ksedmp+0290 bl ksedst 104811C98 
kjmpmsgi+0620 bl 03F40AE0
kjmsm+177c bl kjmpmsgi 110C6D820   1EAD6A410 
100000000   000000001 
0FFFFD780   000000000 
ksbrdp+03e0 bl _ptrgl
opirip+03fc bl 01FAFCDC
opidrv+0448 bl opirip 1102736F0   410275030 
FFFFFFFFFFFF310 
sou2o+0090 bl opidrv 320318411C   400000010 
FFFFFFFFFFFF310 
opimai_real+0150 bl 01FAB1D4
main+0098 bl opimai_real 000000000   000000000 
__start+0098 bl main 000000000   000000000 

前后2次的ora--00600信息的,虽然call stack信息有点微小差别,不过基本上一致,根据600错误以及call stack

我们在metalink上查到一个bug,对比了一下call stack信息,90%一致。虽然不能肯定的下结论,但是感觉也有点八九不离十了的味道了。

下面咱们继续,既然是lms0进程异常中止,那么我们这里就很有必要来分析一下lms0进程的状态了,于是在trace中搜索 process或lms0 找到如下信息:
PROCESS STATE
-------------
Process global information:
process: 7000000ea3a7dd8, call: 7000000ea5c9a88, xact: 0, curses: 7000000ea578b08, usrses: 7000000ea580548
----------------------------------------
SO: 7000000ea3a7dd8, type: 2, owner: 0, flag: INIT/-/-/0x00 ---从这里owner 为0 看出该so 已经处于顶层了
(process) Oracle pid=7, calls cur/top: 7000000ea5c9a88/7000000ea5c7680, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 90
last post received-location: KJCS Post snd proxy to flush msg
last process to post me: 7000000ea3a8d98 1 6
last post sent: 0 0 93
last post sent-location: kjbrcvdscn: scn acked
last process posted by me: 7000000ea3a8d98 1 6
(latch info) wait_event=0 bits=10
holding (efd=5) 7000000e8f688a0 Child gcs resource hash level=4 child#=20
Location from where latch is held: kjbmpconvert:
Context saved from call: 0
state=busy, wlstate=free
Process Group: DEFAULT, pseudo proc: 7000000e7418bc0
O/S info: user: oracle, term: UNKNOWN, ospid: 1044892
OSD pid info: Unix process pid: 1044892, image: oracle@t3smis_db_b (LMS0)

这里的gcs resource hash 是latch 名称,其level为4,可以查询v$latch.
搜索字符串7000000ea3a7dd8,发现如下内容:

SO: 7000000e1963000, type: 20, owner: 7000000ea3a7dd8, flag: -/-/-/0x00namespace [KSXP] key = [ 32 32 30 47 45 53 52 30 30 31 00 ]
----------------------------------------
SO: 7000000ea580548, type: 4, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (session) sid: 550 trans: 0, creator: 7000000ea3a7dd8, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0007-00000003, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
last wait for 'gcs remote message' blocking sess=0x0 seq=62654 wait_time=18631 seconds since wait started=4
waittime=18, poll=0, event=0
Dumping Session Wait History
for 'gcs remote message' count=1 wait_time=18631
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=751
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=248
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=74
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=2099
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=7055
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=7010
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=56
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=6202
waittime=18, poll=0, event=0
for 'gcs remote message' count=1 wait_time=803
waittime=18, poll=0, event=0
temporary object counter: 0
----------------------------------------
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 7000000e8fa1998
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 550 efd 0 whr 0 slp 0
KGX Atomic Operation Log 7000000e8fa19e0
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 550 efd 0 whr 0 slp 0
KGX Atomic Operation Log 7000000e8fa1a28
Mutex 0(0, 0) idn 0 oper NONE
Library Cache uid 550 efd 0 whr 0 slp 0
----------------------------------------
SO: 7000000e08f57d8, type: 41, owner: 7000000ea580548, flag: INIT/-/-/0x00 (dummy) nxc=0, nlb=0
----------------------------------------
SO: 7000000e765b078, type: 11, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 7000000ea3a7dd8,
event: 7, last message event: 93,
last message waited event: 93, messages read: 1
channel: (7000000e767c640) system events broadcast channel
scope: 2, event: 38038, last mesage event: 93,
publishers/subscribers: 0/141,
messages published: 1
----------------------------------------
SO: 7000000ea93e9e0, type: 19, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=16
outq=2 sndq=0 opid=7 prmb=0x0
mbg[i]=(4626201 97644168) mbg[b]=(54246 41376233) mbg[r]=(0 0)
fmq[i]=(30 22) fmq[b]=(20 8) fmq[r]=(0 0)
mop[s]=64557405 mop[q]=104801381 pendq=0 zmbq=0
------------process 0x7000000ea93e9e0--------------------
proc version : 0
Local node : 1
pid : 1044892
lkp_node : 1
svr_mode : 0
proc state : KJP_NORMAL
Last drm hb acked : 0
Total accesses : 351576
Imm. accesses : 351543
Locks on ASTQ : 0
Locks Pending AST : 0
Granted locks : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
KJM HIST LMS0:
1:1 12:1:10 7:1 6:1 10:0 2:0 19:0 18:2 17:2 16:2
15:2 12:18639 7:0 6:15 10:0 17:1 16:1 14:73:4 1:0 12:1:7
7:0 6:0 10:1 17:1 16:1 12:753 7:1 6:0 10:0 17:2
16:1 13:65521:4 20:84:3 1:0 12:1:7 7:1 6:0 10:0 17:2 16:1
12:250 7:1 6:16 10:0 17:2 16:1 14:73:5 1:1 12:1:9 7:0
6:0 10:1 17:1 16:1 12:78 7:0 6:0 10:1 17:2 21:32
16:2 12:2105 7:0 6:14
DEFER MSG QUEUE ON LMS0 IS EMPTY
SEQUENCES:
0:116284752.0 1:0.0
----------------------------------------
SO: 7000000ea5c7680, type: 3, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (call) sess: cur 7000000ea580548, rec 7000000ea578b08, usr 7000000ea580548; depth: 0
----------------------------------------
SO: 7000000ea5c9a88, type: 3, owner: 7000000ea5c7680, flag: INIT/-/-/0x00 (call) sess: cur 7000000ea578b08, rec 7000000ea578b08, usr 7000000ea580548; depth: 1
----------------------------------------
SO: 7000000ea578b08, type: 4, owner: 7000000ea5c7680, flag: INIT/-/-/0x00 (session) sid: 538 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS temporary object counter: 0
----------------------------------------
SO: 7000000e7842e80, type: 16, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (osp req holder)

从上面的rec 7000000ea578b08 我们可以判断为这个是一个递归操作,而后面的depth 1则是表示 递归操作的层次。

无奈之下,我猜测那个600错误中的字符串700000041fd2ce0 应该是个类似latch地址的东西,或者说是一个资源的标示,于是搜索字符串700000041fd2ce0,发现如下信息:
*** 2011-07-11 12:52:52.561
GLOBAL CACHE ELEMENT DUMP (address: 700000041fd2c70):
id1: 0x42c98 id2: 0xe0000 obj: 0 block: (14/273560)
lock: NL rls: 0x0000 acq: 0x0000 latch: 3
flags: 0x41 fair: 0 recovery: 0
bscn: 0x0.0 bctx: 0 write: 0 scan: 0x2000000 xflg: 0 xid: 0x0.0.0
lcp: 0 lnk: [NULL] lch: [700000053f3aa30,700000053f3aa30]
LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
flg: 0x08000000 state: SCURRENT mode: NULL
addr: 700000053f3a930 obj: 47733 cls: DATA bscn: 0x2.cb164f84
GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d1a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
Dump of buffer cache at level 1 for tsn=9, rdba=58993816
BH (700000053f3a930) file#: 14 rdba: 0x03842c98 (14/273560) class: 1 ba: 700000053602000
set: 10 blksize: 8192 bsi: 0 set-flg: 0 pwbcnt: 0
dbwrid: 0 obj: 47733 objn: 47733 tsn: 9 afn: 14
hash: [7000000e1d2bb78,7000000e1d2bb78] lru: [700000014f61a38,70000003bf83688]
lru-flags: hot_buffer
ckptq: [NULL] fileq: [NULL] objq: [700000020f63830,700000052f14028]
st: SCURRENT md: NULL tch: 0 le: 700000041fd2c70
flags: remote_transfer
LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535]
GCS RESOURCE 7000000e4ce1530 hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733
grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535
flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
xid 0x0000.000.00000000 sid 0
pkey 47733
hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0]
kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0
lb 0, hb 0, myb 3475, drmb 3475, apifrz 0
GCS SHADOW 7000000e2190070,1272 sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] pkey 47733
grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f
history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0
disk: 0x0000.00000000 write request: 0x0000.00000000
pi scn: 0x0000.00000000
msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0
GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d1a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045daa8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d9a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0
disk: 0xdda6.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d0a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0
disk: 0xdd4b.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045cfa8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
ABORTING
lms abort after exception 600
KJC Communication Dump:
state 0x5 flags 0x0 mode 0x0 inst 1 inc 16
nrcv 3 nsp 3 nrcvbuf 1000
reg_msg: sz 456 cur 229 (s:0 i:229) max 967 ini 1750
big_msg: sz 8240 cur 21 (s:0 i:21) max 529 ini 1750
rsv_msg: sz 8240 cur 0 (s:0 i:0) max 0 tot 1000
rcvr: id 0 orapid 6 ospid 1098058
rcvr: id 2 orapid 8 ospid 463194
rcvr: id 1 orapid 7 ospid 1044892
send proxy: id 0 ndst 1 (0:0 )
send proxy: id 2 ndst 1 (0:2 )
send proxy: id 1 ndst 1 (0:1 )
GES resource limits:
ges resources: cur 0 max 0 ini 11712
ges enqueues: cur 0 max 0 ini 17740
ges cresources: cur 977 max 2335
gcs resources: cur 91982 max 185193 ini 185193
gcs shadows: cur 72232 max 135376 ini 185193
KJCTS state: seq-check:no timeout:yes waitticks:0x3 highload no
GES destination context:
Dest 0 rcvr 0 inc 16 state 0x10041 tstate 0x0

这里我发现GCS RESOURCE 7000000e4ce1530 这里的7000000e4ce1530 正是600错误中的错误字符串。

我们回头再来认真的看看上面的600错误,如下:
ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []

其中字符串代码,700000041FD2CE0 7000000E4CE1530
*** 2011-07-11 06:46:53.780
kjdrvalidRMno: msg type 34 from node 0 dropped
FUSION MSG 11094cc68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
CLOSE [0x1794.80000, 47523] shadow [0,0] seq 0x2 act 1
client [700000026fd0078,1025] reqid 1369 ordered 0
grant 1 convert 0 role 0
pi [0x0.0x0] flags 0x0 state 0x20
disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
msgRM# 1536 bkt# 3826 drmbkt# 3826
pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
lb 3072, hb 4095, myb 3826, drmb 3826, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
mver 16 myver 16 seq 0.109921737 rseq 0.109921736 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
FUSION MSG 11094cce8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
CLOSE [0x178c.80000, 47523] shadow [0,0] seq 0x2 act 1
client [70000005bfc6010,1475] reqid 1921 ordered 0
grant 1 convert 0 role 0
pi [0x0.0x0] flags 0x0 state 0x20
disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
msgRM# 1536 bkt# 3825 drmbkt# 3825
pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
lb 3072, hb 4095, myb 3825, drmb 3825, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
mver 16 myver 16 seq 0.109921738 rseq 0.109921737 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
FUSION MSG 11094cd68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
CLOSE [0x17cb.80000, 47523] shadow [0,0] seq 0x2 act 1
client [70000001fff22f0,1158] reqid 1407 ordered 0
grant 1 convert 0 role 0
pi [0x0.0x0] flags 0x0 state 0x20
disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
msgRM# 1536 bkt# 3833 drmbkt# 3833
pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
lb 3072, hb 4095, myb 3833, drmb 3833, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
mver 16 myver 16 seq 0.109921739 rseq 0.109921738 flag x6d
kjdrvalidRMno: msg type 34 from node 0 dropped
FUSION MSG 11094cde8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8]
CLOSE [0x17c3.80000, 47523] shadow [0,0] seq 0x2 act 1
client [70000005bff7d90,1289] reqid 1807 ordered 0
grant 1 convert 0 role 0
pi [0x0.0x0] flags 0x0 state 0x20
disk scn 0x0.0 writereq scn 0x0.0 rreqid 0
msgRM# 1536 bkt# 3832 drmbkt# 3832
pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0
hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0]
kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20
lb 3072, hb 4095, myb 3832, drmb 3832, apifrz 1
kjmvalidate: drm drop a message RMno 1536 from 0 type 34
mver 16 myver 16 seq 0.109921740 rseq 0.109921739 flag x6d
*** 2011-07-11 12:52:48.432
GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey http://www.oracleplus.net0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d1a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS RESOURCE 7000000e4ce1530 hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733
grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535
flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
xid 0x0000.000.00000000 sid 0
pkey 47733
hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0]
kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0
lb 0, hb 0, myb 3475, drmb 3475, apifrz 0
GCS SHADOW 7000000e2190070,1272 sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] pkey 47733
grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f
history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0
disk: 0x0000.00000000 write request: 0x0000.00000000
pi scn: 0x0000.00000000
msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0
GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d1a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045daa8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0
disk: 0xdd6e.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d9a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0
disk: 0xdda6.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045d0a8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0
grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL
master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c
history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0
disk: 0xdd4b.07000000 write request: 0x05dc.00000089
pi scn: 0x0002.0045cfa8
msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0
ABORTING
*** 2011-07-11 12:52:48.437
ksedmp: internal or fatal error

ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], []

从上看,我感觉似乎跟DRM有点关系,至于是不是一定有关系,我不敢肯定,隐隐的感觉有点关联。

到最后,我们有必要来回顾一下lms进程的作用,说的简单一点,该进程的作用就是管理集群节点之间数据block的访问,并在节点之间传递cache buffer的块镜像,当lms进程跨节点请求数据时,为了确保数据的一致性,在同一时刻,保证只能有一个实例去更新一个数据块,该进程跟lmd进程协同作用,所以通常来说,lms出问题,lmd也会出问题。

我们再回到这个问题上来,lms进程为啥为异常abort?从上面gcs信息来看,是由于请求的资源和所获取到的资源不一致导致,而又由于需要保证数据的一致性,进而导致lms abort。

正常情况下,7000000E4CE1530应该跟700000041FD2CE0一样才对。
当然上面也是我的个人猜测,如果大家有什么高见,欢迎讨论。

最后我们给客户的答复是上面判断的那个bug,虽然有有点无奈,不过也没办法。

关于该bug的详细描述如下:
BOTH RAC INSTANCES CRASHED WITH ORA-600 [kjbrchkmode:last] [ID 987822.1]
--------------------------------------------------------------------------------

修改时间 27-AUG-2010 类型 PROBLEM 状态 PUBLISHED
In this Document
Symptoms
Cause
Solution
--------------------------------------------------------------------------------
Symptoms
RAC Instance crashed after the LMS process terminated with ORA-600 [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], []
The corresponding LMS tracefile showed the following call stack:
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], []
----- Call Stack Trace -----
ksedmp kgerinv kgeasnmierr kjbrchkmode kjbmprlst kjmxmpm kjmpmsgi kjmsm ksbrdp opirip opidrv sou2o opimai_real main
Cause
The issue is caused by:
BUG 6239052 INSTANCE CRASH DUE TO ORA-600 [KJBRCHKMODE LAST]
The fix for bug 6239052 is included in patchset 10.2.0.4, and 11g onwards.

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle RAC报错ORA-00600 [kjbrchkmode]官方MOS解决办法

本文由大师惜分飞原创分享,网址:http://www.oracleplus.net/arch/1258.html

Oracle研究中心

关键词:

ORA-00600

ORA-00600 [kjbrchkmode]