
[Case Study] Oracle RAC: Loss of the votedisk Causes the Node Host to Reboot

Date: 2016-11-04 19:27   Source: Oracle研究中心 (Oracle Research Center)   Author: HTZ

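天萃荷净 Oracle Research Center case analysis: an operations DBA reported that an Oracle RAC node host had rebooted. Analysis showed that the reboot was caused by the loss of the votedisk.
The test below simulates the loss of the VOTEDISK disk on one node, which leads to that host rebooting.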

1. Environment

[root@cisser2 ~]# crsctl query crs activeversion
CRS active version on the cluster is [10.2.0.5.0]

[root@cisser2 ~]# lsb_release -a
LSB Version: :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release: 5.11
Codename: Tikanga

2. Check the disk information

[root@cisser1 tmp]# dmsetup ls
disk1_votep1 (253, 5)
disk1_vote (253, 2)
disk1_ocr (253, 3)
VolGroup00-LogVol01 (253, 0)
disk1_data1 (253, 4)
VolGroup00-LogVol00 (253, 1)
disk1_ocrp1 (253, 6)

[root@cisser1 tmp]# raw -qa
/dev/raw/raw1: bound to major 253, minor 6
/dev/raw/raw4: bound to major 253, minor 5

[root@cisser1 ~]# crsctl query css votedisk
0. 0 /dev/raw/raw4

located 1 votedisk(s).

[root@cisser1 tmp]# multipath -ll
disk1_vote (36000c291eaeb9a8cb897fed3bb029eb7) dm-2 VMware,,VMware Virtual
[size=307M][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 2:0:0:0 sdb 8:16 [active][ready]
disk1_ocr (36000c293ecddddd9af5f396457322054) dm-3 VMware,,VMware Virtual
[size=307M][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 2:0:1:0 sdc 8:32 [active][ready]
disk1_data1 (36000c294078123daee865a29e3b1ea63) dm-4 VMware,,VMware Virtual
[size=50G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 2:0:2:0 sdd 8:48 [active][ready]
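From this output, /dev/raw/raw4 is the votedisk. It is bound to major 253, minor 5, which is disk1_votep1, the partition on the multipath device /dev/dm-2 (alias disk1_vote). The only path behind that multipath device is the disk sdb.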
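The lookup from raw device to device-mapper name can also be done mechanically. The following is a minimal sketch, assuming the `raw -qa` and `dmsetup ls` output formats shown above; the function names are hypothetical and the sample text is copied from this article.

```python
import re

def parse_raw_bindings(raw_qa_output):
    """Map each raw device to its (major, minor) pair from `raw -qa` output."""
    pattern = re.compile(r"^(/dev/raw/raw\d+):\s+bound to major (\d+), minor (\d+)")
    bindings = {}
    for line in raw_qa_output.splitlines():
        m = pattern.match(line.strip())
        if m:
            bindings[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return bindings

def parse_dm_devices(dmsetup_output):
    """Map (major, minor) to the device-mapper name from `dmsetup ls` output."""
    # Accept both "(253, 5)" and "(253:5)" since the separator varies by version.
    pattern = re.compile(r"^(\S+)\s+\((\d+)[,:]\s*(\d+)\)")
    devices = {}
    for line in dmsetup_output.splitlines():
        m = pattern.match(line.strip())
        if m:
            devices[(int(m.group(2)), int(m.group(3)))] = m.group(1)
    return devices

def resolve_raw_devices(raw_qa_output, dmsetup_output):
    """Return {raw device: device-mapper name} for every binding that resolves."""
    devices = parse_dm_devices(dmsetup_output)
    return {
        raw: devices[numbers]
        for raw, numbers in parse_raw_bindings(raw_qa_output).items()
        if numbers in devices
    }

RAW_QA = """/dev/raw/raw1: bound to major 253, minor 6
/dev/raw/raw4: bound to major 253, minor 5
"""

DMSETUP_LS = """disk1_votep1 (253, 5)
disk1_vote (253, 2)
disk1_ocr (253, 3)
VolGroup00-LogVol01 (253, 0)
disk1_data1 (253, 4)
VolGroup00-LogVol00 (253, 1)
disk1_ocrp1 (253, 6)
"""

if __name__ == "__main__":
    for raw, name in sorted(resolve_raw_devices(RAW_QA, DMSETUP_LS).items()):
        print(raw, "->", name)
```

On a live system the two input strings would come from running the commands as root; here they are hard-coded so the sketch stays self-contained.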

3. Delete the sdb disk

[root@cisser1 device]# pwd
/sys/block/sdb/device

[root@cisser1 device]# ls -l delete
--w------- 1 root root 4096 Mar 29 11:46 delete

[root@cisser1 dm-2]# echo 1 > /sys/block/sdb/device/delete

[root@cisser1 dm-2]# multipath -ll
disk1_vote (36000c291eaeb9a8cb897fed3bb029eb7) dm-2 ,
[size=307M][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
\_ #:#:#:# - #:# [failed][faulty]
The output shows that the path has been lost.
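A check like this is easy to automate. The sketch below assumes the legacy `multipath -ll` layout shown in this article (newer multipath-tools releases print a different layout), and the function name `faulty_maps` is hypothetical. It reports every multipath map that has at least one path in the `failed` or `faulty` state.

```python
import re

# Map header, e.g. "disk1_vote (36000c29...) dm-2 VMware,,VMware Virtual"
HEADER_RE = re.compile(r"^(\S+) \(([0-9a-f]+)\) (dm-\d+)")
# Path line, e.g. "\_ 2:0:0:0 sdb 8:16 [active][ready]"
PATH_RE = re.compile(r"\\_\s+(\S+)\s+(\S+)\s+(\S+)\s+\[(\w+)\]\[(\w+)\]")

def faulty_maps(multipath_output):
    """Return the names of maps with at least one failed or faulty path."""
    current = None
    bad = []
    for line in multipath_output.splitlines():
        header = HEADER_RE.match(line.strip())
        if header:
            current = header.group(1)
            continue
        path = PATH_RE.search(line)
        if path and current is not None:
            dm_state, checker_state = path.group(4), path.group(5)
            if dm_state == "failed" or checker_state == "faulty":
                if current not in bad:
                    bad.append(current)
    return bad

SAMPLE = r"""disk1_vote (36000c291eaeb9a8cb897fed3bb029eb7) dm-2 ,
[size=307M][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
\_ #:#:#:# - #:# [failed][faulty]
disk1_ocr (36000c293ecddddd9af5f396457322054) dm-3 VMware,,VMware Virtual
[size=307M][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 2:0:1:0 sdc 8:32 [active][ready]
"""

if __name__ == "__main__":
    print(faulty_maps(SAMPLE))
```

The path-group lines (`\_ round-robin 0 ...`) are skipped because they do not carry the two bracketed state fields that a path line ends with.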

4. Check the cssd log on node 1

[root@cisser1 cssd]# tail -f ocssd.log
Disk errors are reported from this point on:
[ CSSD]2015-03-29 13:16:35.485 [843671872] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_read
Location: fread_failed
Other: fread unable to read buffer
Dep: 5
…………………………
[ CSSD]2015-03-29 13:16:35.485 [843671872] >ERROR: clssnmvReadBlocks: read failed 1 at offset 529 of /dev/raw/raw4
[ CSSD]2015-03-29 13:16:35.485 [843671872] >TRACE: clssnmDiskStateChange: state from 4 to 3 disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:16:35.485 [832436544] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_write
Location: fwrite_faile
Other: fwrite unable to write buffer
Dep: 5

The following entries appear for the 12th time:
[ CSSD]2015-03-29 13:19:26.604 [832436544] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_read
Location: fread_failed
Other: fread unable to read buffer
Dep: 5

[ CSSD]2015-03-29 13:19:26.604 [832436544] >ERROR: clssnmvReadBlocks: read failed 1 at offset 4 of /dev/raw/raw4
[ CSSD]2015-03-29 13:19:27.814 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:27.814 [1013770560] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:19:31.820 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:31.820 [1013770560] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2015-03-29 13:19:35.615 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 19900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:36.615 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 18900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:36.825 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:36.825 [1013770560] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:19:37.617 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 17900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:37.892 [992381248] >TRACE: clssgmDispatchCMXMSG(): msg type(3) src(2) dest(1) size(420) tag(01f7002a) incarnation(5)
[ CSSD]2015-03-29 13:19:37.892 [992381248] >TRACE: clssgmHandleMasterAdd(): src(2) dest(1) size(420)
[ CSSD]2015-03-29 13:19:37.892 [992381248] >TRACE: clssgmHandleMasterAdd(): grock(SRVM.DATABASE.NODEAPPS.cisser2) memberNo(-1) node(2) client(1f7002a) type(3).
[ CSSD]2015-03-29 13:19:37.892 [992381248] >TRACE: clssgmAddMember: granted member(0) flags(0x1) node(2) grock (0x8c66cb0/SRVM.DATABASE.NODEAPPS.cisser2)
[ CSSD]2015-03-29 13:19:37.892 [992381248] >TRACE: clssgmCommonAddMember: Remote member(0) node(2) flags 0x1 0x1 grock (3/0x8c66cb0/SRVM.DATABASE.NODEAPPS.cisser2)
[ CSSD]2015-03-29 13:19:37.898 [992381248] >TRACE: clssgmDispatchCMXMSG(): msg type(4) src(2) dest(1) size(352) tag(01f8002a) incarnation(5)
[ CSSD]2015-03-29 13:19:37.898 [992381248] >TRACE: clssgmHandleMasterExit(): src(2) dest(1) size(352)
[ CSSD]2015-03-29 13:19:37.898 [992381248] >TRACE: clssgmRemoveMember: grock(SRVM.DATABASE.NODEAPPS.cisser2) member(0/0x8c013b0) nodeNum(2) flags(0x1) type(3)
[ CSSD]2015-03-29 13:19:38.618 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 16900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:39.619 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 15890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:39.701 [992381248] >TRACE: clssgmDispatchCMXMSG(): msg type(12) src(2) dest(1) size(360) tag(01f9002a) incarnation(5)
[ CSSD]2015-03-29 13:19:40.620 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 14890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:41.621 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 13890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:41.831 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:41.831 [1013770560] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:19:42.623 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 12890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:43.237 [950012224] >TRACE: clssgmAllocProc: (0x8c23c60) allocated
[ CSSD]2015-03-29 13:19:43.238 [971401536] >TRACE: Connect request from user root
[ CSSD]2015-03-29 13:19:43.238 [950012224] >TRACE: clssgmClientConnectMsg: Connect from con(0x8c013b0) proc(0x8c23c60) pid() proto(10:2:1:1)
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmRegisterClient: proc(17/0x8c23c60), client(1/0x8bd2110)
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmExecuteClientRequest: GRKJOIN recvd from client 1 (0x8bd2110)
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmJoinGrock: grock SRVM.DATABASE.NODEAPPS.cisser1 new client 0x8bd2110 with con 0x8bfdef0, requested num -1
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmAddGrockMember: adding member to grock SRVM.DATABASE.NODEAPPS.cisser1
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmAddMember: granted member(0) flags(0x1) node(1) grock (0x8c64fa0/SRVM.DATABASE.NODEAPPS.cisser1)
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmQueueGrockEvent: lockName(SRVM.DATABASE.NODEAPPS.cisser1) type(2) count (1/1) xwaiters(0) event(1) to memberNo(0)
[ CSSD]2015-03-29 13:19:43.239 [950012224] >TRACE: clssgmCommonAddMember: Local member(0) node(1) flags 0x1 0x1 grock (3/0x8c64fa0/SRVM.DATABASE.NODEAPPS.cisser1)
[ CSSD]2015-03-29 13:19:43.244 [950012224] >TRACE: clssgmExecuteClientRequest: GRKEXIT recvd from client 1 (0x8bd2110)
[ CSSD]2015-03-29 13:19:43.244 [950012224] >TRACE: clssgmExitGrock: client 1 (0x8bd2110), grock SRVM.DATABASE.NODEAPPS.cisser1, member 0
[ CSSD]2015-03-29 13:19:43.244 [950012224] >TRACE: clssgmUnregisterClient(): removing proc 17 client 1, flags 0x04000000
[ CSSD]2015-03-29 13:19:43.244 [950012224] >TRACE: clssgmRemoveMember: grock(SRVM.DATABASE.NODEAPPS.cisser1) member(0/0x8c22fa0) nodeNum(1) flags(0x1) type(3)
[ CSSD]2015-03-29 13:19:43.244 [950012224] >TRACE: clssgmUnregisterClient: client 0x8bd2110 expiring
[ CSSD]2015-03-29 13:19:43.452 [950012224] >TRACE: clssgmDeadProc: proc 0x8c23c60
[ CSSD]2015-03-29 13:19:43.452 [950012224] >TRACE: clssgmDeleteClientListener: deleting cmProc (0x8c23c60), with 0 clients
[ CSSD]2015-03-29 13:19:43.452 [950012224] >TRACE: clssgmDeleteClientListener: cleanup for proc(0x8c23c60) con(0x8c013b0) pid()
[ CSSD]2015-03-29 13:19:43.625 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 11890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:44.627 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 10890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:45.616 [832436544] >ERROR: Internal Error Information:
Category: 1234
Operation: scls_block_read
Location: fread_failed
Other: fread unable to read buffer
Dep: 5

[ CSSD]2015-03-29 13:19:45.616 [832436544] >ERROR: clssnmvReadBlocks: read failed 1 at offset 4 of /dev/raw/raw4
[ CSSD]2015-03-29 13:19:45.616 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 9900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:46.617 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 8900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:46.839 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:46.839 [1013770560] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:19:47.618 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 7900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:48.620 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 6900 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:49.621 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 5890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:50.623 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 4890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:50.845 [1013770560] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:19:50.845 [1013770560] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2015-03-29 13:19:51.625 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 3890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:52.626 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 2890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:53.627 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 1890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:54.628 [854161728] >WARNING: clssnmDiskPMT: voting device offline at 90% fatal, termination in 890 ms, disk (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:55.530 [854161728] >TRACE: clssnmDiskPMT: offline disk (200010 ms) (0//dev/raw/raw4)
[ CSSD]2015-03-29 13:19:55.530 [854161728] >ERROR: clssnmDiskPMT: Aborting, 1 of 1 voting disks unavailable
[ CSSD]2015-03-29 13:19:55.533 [854161728] >ERROR: ###################################
[ CSSD]2015-03-29 13:19:55.533 [854161728] >ERROR: clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread
[ CSSD]2015-03-29 13:19:55.533 [854161728] >ERROR: ###################################
[ CSSD]2015-03-29 13:19:55.533 [854161728] >TRACE: clssgmDiscOmonReady: omon was posted for member 1
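The key message here is "clssnmDiskPMT: Aborting, 1 of 1 voting disks unavailable". Because host 1 could no longer access the VOTEDISK disk, the CSSD process was aborted from the clssnmvDiskPingMonitorThread thread, which caused the host to reboot. The first read error was logged at 13:16:35.485 and the abort at 13:19:55.530, about 200 seconds later, which matches the "offline disk (200010 ms)" entry.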
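The countdown in the clssnmDiskPMT warnings can be turned into a projected abort time. The sketch below is an illustration only: the function name `projected_abort` is hypothetical, and the idea that the abort happens when the "termination in N ms" countdown reaches zero is inferred from this log, not taken from Oracle documentation.

```python
import re
from datetime import datetime, timedelta

# Matches the clssnmDiskPMT countdown warnings shown in the log above.
PMT_RE = re.compile(
    r"\[\s*CSSD\](\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"clssnmDiskPMT: voting device offline at (\d+)% fatal, "
    r"termination in (\d+) ms, disk \((.+)\)"
)

def projected_abort(line):
    """Return the log timestamp plus the remaining countdown, or None."""
    m = PMT_RE.search(line)
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return stamp + timedelta(milliseconds=int(m.group(3)))

# First 90% warning from node 1, copied from the log above.
SAMPLE_LINE = (
    "[ CSSD]2015-03-29 13:19:35.615 [854161728] >WARNING: clssnmDiskPMT: "
    "voting device offline at 90% fatal, termination in 19900 ms, "
    "disk (0//dev/raw/raw4)"
)

if __name__ == "__main__":
    print(projected_abort(SAMPLE_LINE))
```

For the first 90% warning the projection is 13:19:55.515, within a few milliseconds of the abort actually logged at 13:19:55.530.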

5. OCSSD log on host 2

Here host 2 detects that node 1 has failed. After that, the network heartbeat warnings begin, starting at the 50% mark.
[ CSSD]2015-03-29 13:19:58.691 [1518352704] >TRACE: clssgmPeerListener: discarded 0 future msgsfor 1
[ CSSD]2015-03-29 13:19:58.691 [1396734272] >WARNING: clssnmeventhndlr: Receive failure with node 1 (cisser1), state 3, con(0xf66b090), probe((nil)), rc=11
…………………………………
[ CSSD]2015-03-29 13:20:28.778 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 50% heartbeat fatal, eviction in 29.200 seconds seedhbimpd 0
[ CSSD]2015-03-29 13:20:28.778 [1529252160] >TRACE: clssnmPollingThread: node cisser1 (1) is impending reconfig, flag 1, misstime 30800
[ CSSD]2015-03-29 13:20:28.778 [1529252160] >TRACE: clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[ CSSD]2015-03-29 13:20:32.905 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:32.905 [1539742016] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:20:36.910 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:36.910 [1539742016] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2015-03-29 13:20:39.826 [1407633728] >TRACE: clssgmAllocateRPCIndex: allocated rpc 507 (0x2abe50b1cd90)
[ CSSD]2015-03-29 13:20:39.826 [1407633728] >TRACE: clssgmpeersend: send failed type 12, node 1, unreachable, flags 0x0, quiesced 0
[ CSSD]2015-03-29 13:20:39.826 [1407633728] >TRACE: clssgmFreeRPCIndex: freeing rpc 507
[ CSSD]2015-03-29 13:20:41.917 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:41.917 [1539742016] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:20:43.780 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 75% heartbeat fatal, eviction in 14.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:45.923 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:45.923 [1539742016] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2015-03-29 13:20:49.930 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:49.930 [1539742016] >TRACE: clssnmSendingThread: sent 4 status msgs to all nodes
[ CSSD]2015-03-29 13:20:52.784 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 5.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:53.785 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 4.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:54.787 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 3.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:54.938 [1539742016] >TRACE: clssnmSendingThread: sending status msg to all nodes
[ CSSD]2015-03-29 13:20:54.938 [1539742016] >TRACE: clssnmSendingThread: sent 5 status msgs to all nodes
[ CSSD]2015-03-29 13:20:55.788 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 2.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:56.790 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 1.200 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:57.791 [1529252160] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 0.190 seconds seedhbimpd 1
[ CSSD]2015-03-29 13:20:57.984 [1529252160] >TRACE: clssnmPollingThread: Eviction started for node cisser1 (1), flags 0x0001, state 3, wt4c 0 seedhbimpd 1

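Node 2 started evicting node 1 at 2015-03-29 13:20:57.984. By then host 1 had in fact already rebooted. The network heartbeat was lost because node 1 rebooted, and only then did the eviction messages appear.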
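The heartbeat timeout that node 2 was counting against can be estimated from its own log. This is a rough sketch under one assumption: that `misstime` is the number of milliseconds of heartbeats already missed at the moment the 50% warning is printed. The function name `implied_misscount_ms` is hypothetical.

```python
import re

def implied_misscount_ms(log_text):
    """Add the time already missed to the time left before eviction."""
    evict = re.search(r"eviction in (\d+\.\d+) seconds", log_text)
    miss = re.search(r"misstime (\d+)", log_text)
    if evict is None or miss is None:
        return None
    remaining_ms = round(float(evict.group(1)) * 1000)
    return int(miss.group(1)) + remaining_ms

# The two entries logged by node 2 at 13:20:28.778, copied from the log above.
SAMPLE_LOG = (
    "[ CSSD]2015-03-29 13:20:28.778 [1529252160] >WARNING: clssnmPollingThread: "
    "node cisser1 (1) at 50% heartbeat fatal, eviction in 29.200 seconds seedhbimpd 0\n"
    "[ CSSD]2015-03-29 13:20:28.778 [1529252160] >TRACE: clssnmPollingThread: "
    "node cisser1 (1) is impending reconfig, flag 1, misstime 30800\n"
)

if __name__ == "__main__":
    print(implied_misscount_ms(SAMPLE_LOG))
```

Under that assumption the two values add up to 60000 ms, so node 2 started the eviction about one minute after it stopped hearing from node 1, well after node 1's CSSD had aborted at 13:19:55.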

Permanent link to this article: http://www.htz.pw/2015/03/29/votedisk%e4%b8%a2%e5%a4%b1%e5%af%bc%e8%87%b4%e4%b8%bb%e6%9c%ba%e9%87%8d%e5%90%af.html | 认真就输

This article was originally shared by 惜分飞. URL: http://www.oracleplus.net/arch/1168.html