sitemap

RSS地图

收藏本站

设为首页

Oracle研究中心

当前位置:Oracle研究中心 > 故障案例 >

【案例】Oracle RAC中IPC Send timeout导致实例重启(reason 0x1)

时间:2016-11-04 19:34   来源:Oracle研究中心   作者:HTZ   点击:

天萃荷净 Oracle研究中心案例分析:运维DBA反映Oracle RAC实例节点重启,分析LOG日志得出是由于IPC Send timeout导致实例重启(reason 0x1)。

1 环境介绍

AIX平台2个节点数据库,同时只能启动一个节点,当启动第二个节点的时候会由于IPC timeout将第一个节点踢掉,数据库版本是
10.2.0.5

2,hosts与网卡配置

2.1 hosts信息
127.0.0.1 loopback localhost # loopback (lo0) name/address
22.111.111.215 htzpw1
22.111.111.216 htzpw2
22.111.111.208 htzpw1-vip
22.111.111.209 htzpw2-vip
#192.178.4.10 htzpw1-priv
#192.178.4.11 htzpw2-priv

111.111.1.10 htzpw1_boot1 htzpw1-priv
192.178.3.10 htzpw1_boot2
111.111.1.11 htzpw2_boot1 htzpw2-priv
192.178.3.11 htzpw2_boot2
22.111.111.175 htzpw1_per
22.111.111.176 htzpw2_per
22.111.111.130 htzpwVIP
2.2 网卡信息
node1
ifconfig -a
en4: flags=1e080863,c0
inet 111.111.1.10 netmask 0xffffff80 broadcast 111.111.1.127
inet 22.111.111.175 netmask 0xffffff80 broadcast 10.185.240.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en5: flags=1e080863,c0
inet 192.178.4.10 netmask 0xffffff80 broadcast 192.178.4.127
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en6: flags=1e080863,c0
inet 192.178.3.10 netmask 0xffffff80 broadcast 192.178.3.127
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en7: flags=1e080863,c0
inet 22.111.111.215 netmask 0xffffff80 broadcast 10.185.240.255
inet 22.111.111.208 netmask 0xffffff80 broadcast 10.185.240.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
lo0: flags=e08084b,c0
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1


node2

ifconfig -a
en4: flags=1e080863,c0
inet 111.111.1.11 netmask 0xffffff80 broadcast 192.178.2.127
inet 22.111.111.176 netmask 0xffffff80 broadcast 10.185.240.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en5: flags=1e080863,c0
inet 192.178.4.11 netmask 0xffffff80 broadcast 192.178.4.127
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en6: flags=1e080863,c0
inet 192.178.3.11 netmask 0xffffff80 broadcast 192.178.3.127
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en7: flags=1e080863,c0
inet 22.111.111.216 netmask 0xffffff80 broadcast 10.185.240.255
inet 22.111.111.209 netmask 0xffffff80 broadcast 10.185.240.255
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
lo0: flags=e08084b,c0
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

2.3 oifcfg信息
oifcfg getif
en7 22.111.111.128 global public
en4 111.112.1.0 global cluster_interconnect

node 1
en4 111.111.1.10 netmask 0xffffff80
en7 22.111.111.215 netmask 0xffffff80

node2
en4 111.111.1.11 netmask 0xffffff80
en7 22.111.111.216 netmask 0xffffff80
从上面的信息可以得到,HACMP有两个boot地址分别配置在ent4与ent6上面,在hacmp上面配置了一个永久IP地址与服务地址,
与ORACLE的 共有网络在同一个网段,ORACLE共有网络配置在en7上面。其实整个HACMP配置网络是有问题的,出现了共有网段
同时出现在ORACLE的时候与共有网络上面。在10G RAC中,HACMP在资源处的主要作用就是提供VG并发的功能,并不提供IP
服务功能。这里还有一个错误的地方,私有网络的网段是111.112.1.0这个网段,但是网卡配置的确实111.111.1这个网段,
这里肯定是错误的。

3,启动报错

启动节点2,alert日志信息如下
SuSat Mar 28 10:34:40 GMT+08:00 2015Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 en4 111.111.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 en7 22.111.111.128 configured from OCR for use as a public interface
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
processes = 4000
sessions = 4405
sga_max_size = 37580963840
__shared_pool_size = 4731174912
__large_pool_size = 16777216
__java_pool_size = 67108864
__streams_pool_size = 0
spfile = /dev/rspfile
sga_target = 37580963840
control_files = /dev/rcontrol1, /dev/rcontrol2, /dev/rcontrol3
db_block_size = 8192
__db_cache_size = 32749125632
compatible = 10.2.0.5.0
log_archive_dest_1 = location=/arch2
db_files = 2000
db_file_multiblock_read_count= 16
cluster_database = TRUE
cluster_database_instances= 2
_gc_undo_affinity = FALSE
_gc_affinity_time = 0
_gc_affinity_limit = 250
thread = 2
instance_number = 2
undo_management = AUTO
undo_tablespace = UNDOTBS2
_undo_autotune = FALSE
undo_retention = 3600
remote_os_authent = FALSE
remote_login_passwordfile= EXCLUSIVE
db_domain =
local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=22.111.111.209)(PORT=1521))
remote_listener = LISTENERS_IPAPCQ
job_queue_processes = 10
background_dump_dest = /oracle/admin/ipapcq/bdump
user_dump_dest = /oracle/admin/ipapcq/udump
core_dump_dest = /oracle/admin/ipapcq/cdump
audit_file_dest = /oracle/admin/ipapcq/adump
db_name = ipapcq
open_cursors = 1000
sql92_security = TRUE
pga_aggregate_target = 5368709120
aq_tm_processes = 10
Cluster communication is configured to use the following interface(s) for this instance
111.111.1.11
Sat Mar 28 10:34:42 GMT+08:00 2015cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=15795144
DIAG started with pid=3, OS id=14811274
PSP0 started with pid=4, OS id=15663318
LMON started with pid=5, OS id=9044900
LMD0 started with pid=6, OS id=15663464
LMS0 started with pid=7, OS id=8717302
LMS1 started with pid=8, OS id=8520698
LMS2 started with pid=9, OS id=10748526
LMS3 started with pid=10, OS id=15335548
MMAN started with pid=11, OS id=11076346
DBW0 started with pid=12, OS id=12386814
DBW1 started with pid=13, OS id=11796972
LGWR started with pid=14, OS id=17498712
CKPT started with pid=15, OS id=12255636
SMON started with pid=16, OS id=14811574
RECO started with pid=17, OS id=15270292
CJQ0 started with pid=18, OS id=9437972
MMON started with pid=19, OS id=11338422
MMNL started with pid=20, OS id=10879030
Sat Mar 28 10:34:52 GMT+08:00 2015lmon registered with NM – instance id 2 (internal mem no 1)
Sat Mar 28 10:34:52 GMT+08:00 2015Reconfiguration started (old inc 0, new inc 82)
Dynamic remastering is disabled
List of nodes:
0 1
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
* domain 0 valid = 1 according to instance 0
Sat Mar 28 10:34:53 GMT+08:00 2015 Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
Sat Mar 28 10:40:06 GMT+08:00 2015IPC Send timeout detected.Sender: ospid 15663464
Receiver: inst 1 binc 91474918 ospid 23855480
Sat Mar 28 10:40:08 GMT+08:00 2015IPC Send timeout to 0.0 inc 82 for msg type 26 from opid 6
Sat Mar 28 10:40:08 GMT+08:00 2015Communications reconfiguration: instance_number 1
Sat Mar 28 10:40:09 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008]
Sat Mar 28 10:40:09 GMT+08:00 2015Reconfiguration started (old inc 82, new inc 86)
List of nodes:
1
Nested reconfiguration detected.
Global Resource Directory frozen
* dead instance detected – domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows traversed, 0 replayed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows traversed, 0 replayed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows traversed, 0 replayed
Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows traversed, 0 replayed
Sat Mar 28 10:40:09 GMT+08:00 2015 Submitted all GCS remote-cache requests
Fix write in gcs resources
Reconfiguration complete
LCK0 started with pid=21, OS id=10879338
Sat Mar 28 10:40:10 GMT+08:00 2015ALTER DATABASE MOUNT

这里可以看到数据库在nomount阶段的时候由于ipc send timeout将节点1给关了,这里数据库识别出来的是111.111.1这个网段是私用网络,但是oifcfg出来是111.112这个网络。
下面是节点1的日志
Sat Mar 28 10:34:35 GMT+08:00 2015Reconfiguration started (old inc 80, new inc 82)
List of nodes:
0 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed
Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
Sat Mar 28 10:39:49 GMT+08:00 2015IPC Send timeout detected. Receiver ospid 23855480
Sat Mar 28 10:39:49 GMT+08:00 2015Errors in file /oracle/admin/ipapcq/bdump/ipapcq1_lmd0_23855480.trc:
Sat Mar 28 10:39:51 GMT+08:00 2015Received an instance abort message from instance 2 (reason 0x1)
Please check instance 2 alert and LMON trace files for detail.
LMS0: terminating instance due to error 481
Sat Mar 28 10:39:52 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008]
Sat Mar 28 10:39:54 GMT+08:00 2015Shutting down instance (abort)
License high water mark = 1186
Sat Mar 28 10:39:56 GMT+08:00 2015Instance terminated by LMS0, pid = 20251510
Sat Mar 28 10:39:59 GMT+08:00 2015Instance terminated by USER, pid = 25232348
这里看到节点1由于IPC SEND TIMEOUT,lms0进程将实例关闭,ipc timeout的原因是(reason 0x1),0x1说明:The Node Monitor generated the reconfiguration。

lmd0进程日志信息如下:
Shared Memory:
ID KEY
3145740 0xd36e6b18
Maximum processes: = 4000
ksxpdmp: facility 2 (SKGFAIPC) (0x1, 0x0) counts 0, 0
ksxpdmp: Dumping the osd context (brief)
SKGXPCTX: 0x11040f4d0 ctx
WAIT HISTORY
Wait Time Time since Fast reaps Wait Type Return Code
(ms) prev wait(ms) before
——— ————– ———– ——— ———–
117 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
151 0 0 NORMAL POSTED 这里是成功了的
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
80 0 0 NORMAL TIMEDOUT
wait delta 0 sec (60 msec) ctx ts 0x27b9f8e8 last ts 0x27b9f8e8
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1
blocked 10
timed wait receives 2
fast reaps since last wait 0
admno 0x4440d03e admport:
SSKGXPT 0x104118a0 flags SSKGXPT_BUFSZ_SET socket no 9 IP 111.111.1.10 UDP 34970
last bytes received: 0
context timestamp 0x27b9f8e8
done Queue
no completed requests
port Queue
SKGXPT 11046c718 portno = 165815271 port:
SSKGXPT 0x1046c728 flags SSKGXPT_BUFSZ_SET socket no 14 IP 111.111.1.10 UDP 34972
connection Queue
no pending connect disconnect operations
sends waiting to be transmitted
no sends waiting to be transmitted
pending ack Queue
no send requests pending ack
Mapped regions
ksxpdmp: Dumping the ksxp contexts
ksxpdmp: Dumping ksxp context 0x11046c8f8 client 2
这里给处了IP地址与IPC TIMEDOUT的信息。

lms0日志信息,只有下面的信息。
*** 2015-03-28 10:34:35.858
0 GCS shadows cancelled, 0 closed
885825 GCS resources traversed, 0 cancelled
LMS 0: 2443790 GCS resources on freelist, 4215772 on array, 4215775 allocated
*** 2015-03-28 10:39:51.259
Received a special admin message (type 2) from node 1
Message body: flag 0x1 data 0x0 0x1 0x0
Abort the instance
ksuitm: waiting up to [5] seconds before killing DIAG(25297038)
此时节点2上面的lmd,lms进程里面没有任何与节点1相关的一些信息。

4 分析总结

通过上面分析,怀疑是配置错误的私有网络导致的,另外HACMP配置网络错误,出现公网网络与私网网络在同一个网卡上面。

本文固定链接: http://www.htz.pw/2015/03/28/rac%e4%b8%adipc-send-timeout%e5%af%bc%e8%87%b4%e5%ae%9e%e4%be%8b%e9%87%8d%e5%90%afreason-0x1.html | 认真就输

--------------------------------------ORACLE-DBA----------------------------------------

最权威、专业的Oracle案例资源汇总之【案例】Oracle RAC中IPC Send timeout导致实例重启(reason 0x1)

本文由大师惜分飞原创分享,网址:http://www.oracleplus.net/arch/1170.html

Oracle研究中心

关键词:

Oracle RAC报错reason 0x1

RAC中IPC Send timeout导致实例重启