Good evening.
I have run into a problem...
My cluster "fell apart" after it was stopped and the servers were rebooted.
It is (or rather was, as of now) a 3-node cluster on RHEL5.3.
Nothing is written to the logs except messages about nodes being fenced (fencing is configured).
When the fencing service starts, it hangs for a long time, then nodes get fenced in arbitrary order, and the cluster comes up properly only on the surviving node (judging by the GFS partition from the SAN being mounted there).
The luci/ricci combination was used to stop the cluster. So I suspect some stale PIDs that were never cleaned up and that now keep the services from starting correctly and the nodes from agreeing with each other.
If anyone has experience fixing this problem, I would be very grateful.
Regards.
A piece of the log from the node that kills the second one:
May 13 11:31:40 test-node0 kernel: bonding: bond0: link status definitely up for interface eth1.
May 13 11:31:40 test-node0 kernel: DLM (built Jan 6 2010 13:26:37) installed
May 13 11:31:40 test-node0 kernel: GFS2 (built Jan 6 2010 13:27:13) installed
May 13 11:31:40 test-node0 kernel: Lock_DLM (built Jan 6 2010 13:27:19) installed
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] AIS Executive Service: started and ready to provide service.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Using default multicast address of 239.192.6.148
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] join (60 ms) send_join (0 ms) consensus (20000 ms) merge (200 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] send threads (0 threads)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP token expired timeout (495 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP token problem counter (2000 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP threshold (10 problem count)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP mode set to none.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] heartbeat_failures_allowed (0)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] max_network_delay (50 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] The network interface [10.10.10.10] is now up.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Created or loaded sequence id 4.10.10.10.10 for this ring.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering GATHER state from 15.
May 13 11:31:40 test-node0 openais[4329]: [CMAN ] CMAN 2.0.115 (built Mar 16 2010 10:29:01) started
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Service initialized 'openais CMAN membership service 2.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais extended virtual synchrony service'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster membership service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais availability management framework B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais checkpoint service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais event service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais distributed locking service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais message service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais configuration service'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster closed process group service v1.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster config database access v1.01'
May 13 11:31:40 test-node0 openais[4329]: [SYNC ] Not using a virtual synchrony filter.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Creating commit token because I am the rep.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Saving state aru 0 high seq received 0
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Storing new sequence id for ring 8
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering COMMIT state.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering RECOVERY state.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] position [0] member 10.10.10.10:
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] previous ring seq 4 rep 10.10.10.10
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] aru 0 high delivered 0 received flag 1
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Did not need to originate any messages in recovery.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Sending initial ORF token
May 13 11:31:40 test-node0 openais[4329]: [CLM ] CLM CONFIGURATION CHANGE
May 13 11:31:40 test-node0 openais[4329]: [CLM ] New Configuration:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] Members Left:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] Members Joined:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] CLM CONFIGURATION CHANGE
May 13 11:31:40 test-node0 openais[4329]: [CLM ] New Configuration:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] r(0) ip(10.10.10.10)
May 13 11:31:40 test-node0 openais[4329]: [CLM ] Members Left:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] Members Joined:
May 13 11:31:40 test-node0 openais[4329]: [CLM ] r(0) ip(10.10.10.10)
May 13 11:31:40 test-node0 openais[4329]: [SYNC ] This node is within the primary component and will provide service.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering OPERATIONAL state.
May 13 11:31:40 test-node0 openais[4329]: [CMAN ] quorum regained, resuming activity
May 13 11:31:40 test-node0 openais[4329]: [CLM ] got nodejoin message 10.10.10.10
May 13 11:31:41 test-node0 ccsd[4320]: Initial status:: Quorate
May 13 11:31:41 test-node0 qdiskd[4359]: <info> Quorum Daemon Initializing
May 13 11:31:41 test-node0 qdiskd[4359]: <crit> Initialization failed
May 13 11:31:48 test-node0 kernel: bond0: no IPv6 routers present
May 13 11:32:31 test-node0 fenced[4373]: test-node.domain not a cluster member after 3 sec post_join_delay
May 13 11:32:31 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:11 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:11 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:16 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:31 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:31 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:36 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:34:23 test-node0 fenced[4373]: fence "test-node.domain" success
May 13 11:34:23 test-node0 kernel: dlm: Using TCP for communications
May 13 11:34:24 test-node0 clvmd: Cluster LVM daemon started - connected to CMAN
May 13 11:34:25 test-node0 scsi_reserve: [error] cluster not configured for scsi reservations
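
To make the fencing timeline in the excerpt above easier to read, here is a minimal Python sketch that pulls out only the fenced and qdiskd messages and measures how long the fencing took. The names `daemon_events`, `fence_duration` and `SAMPLE` are hypothetical helpers, not part of any cluster tool, and the year 2010 is an assumption taken from the build dates in the log, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Matches syslog lines of the form "May 13 11:32:31 host daemon[pid]: message".
LINE_RE = re.compile(
    r'^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+'
    r'(?P<daemon>[\w-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$'
)

# A few lines copied from the log excerpt above.
SAMPLE = [
    'May 13 11:31:41 test-node0 qdiskd[4359]: <crit> Initialization failed',
    'May 13 11:32:31 test-node0 fenced[4373]: fencing node "test-node.domain"',
    'May 13 11:33:11 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out',
    'May 13 11:34:23 test-node0 fenced[4373]: fence "test-node.domain" success',
    'May 13 11:34:23 test-node0 kernel: dlm: Using TCP for communications',
]

def daemon_events(lines, daemons=("fenced", "qdiskd"), year=2010):
    """Return (timestamp, daemon, message) tuples for the chosen daemons.

    The year is an assumption: syslog timestamps do not include it.
    """
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("daemon") in daemons:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("daemon"), m.group("msg")))
    return events

def fence_duration(events):
    """Seconds from the first 'fencing node' message to the first 'success'."""
    start = next(ts for ts, d, msg in events
                 if d == "fenced" and msg.startswith("fencing node"))
    end = next(ts for ts, d, msg in events
               if d == "fenced" and msg.endswith("success"))
    return (end - start).total_seconds()

if __name__ == "__main__":
    events = daemon_events(SAMPLE)
    for ts, daemon, msg in events:
        print(ts.time(), daemon, msg)
    print("fencing took", fence_duration(events), "seconds")
```

On the sample lines this reports 112 seconds between the first fencing attempt (11:32:31) and the successful fence (11:34:23). To run it over a full log, the lines of the file can be passed in place of `SAMPLE`.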