The OpenNET Project / Index page


"Urgent problem with a RedHat Cluster"
Original message

Message from ZemieL on 13-May-10, 00:28
Good evening.
I have run into a problem...
The cluster "fell apart" after it was stopped and the servers were rebooted.

There is (or rather, there was until now) a 3-node cluster on RHEL 5.3.
Nothing is written to the logs except messages about a node being "shot" (fencing is configured).
When the fencing service starts, it hangs for a long time, then nodes get fenced in arbitrary order, and the cluster comes up properly only on the surviving node (judging by the GFS partition from the SAN being mounted there).

The cluster was stopped with the luci/ricci pair. So I suspect some stale PIDs that were never cleaned up and that keep the services from starting correctly and the nodes from reaching agreement.

If anyone has experience fixing this problem, I would be very grateful.

Regards.



1. Message from ZemieL on 13-May-10, 12:59
A log excerpt from the node that kills the second one.

May 13 11:31:40 test-node0 kernel: bonding: bond0: link status definitely up for interface eth1.
May 13 11:31:40 test-node0 kernel: DLM (built Jan  6 2010 13:26:37) installed
May 13 11:31:40 test-node0 kernel: GFS2 (built Jan  6 2010 13:27:13) installed
May 13 11:31:40 test-node0 kernel: Lock_DLM (built Jan  6 2010 13:27:19) installed
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] AIS Executive Service RELEASE 'subrev 1887 version 0.80.6'
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Copyright (C) 2006 Red Hat, Inc.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] AIS Executive Service: started and ready to provide service.
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Using default multicast address of 239.192.6.148
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] join (60 ms) send_join (0 ms) consensus (20000 ms) merge (200 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] send threads (0 threads)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP token expired timeout (495 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP token problem counter (2000 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP threshold (10 problem count)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] RRP mode set to none.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] heartbeat_failures_allowed (0)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] max_network_delay (50 ms)
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] The network interface [10.10.10.10] is now up.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Created or loaded sequence id 4.10.10.10.10 for this ring.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering GATHER state from 15.
May 13 11:31:40 test-node0 openais[4329]: [CMAN ] CMAN 2.0.115 (built Mar 16 2010 10:29:01) started
May 13 11:31:40 test-node0 openais[4329]: [MAIN ] Service initialized 'openais CMAN membership service 2.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais extended virtual synchrony service'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster membership service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais availability management framework B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais checkpoint service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais event service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais distributed locking service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais message service B.01.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais configuration service'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster closed process group service v1.01'
May 13 11:31:40 test-node0 openais[4329]: [SERV ] Service initialized 'openais cluster config database access v1.01'
May 13 11:31:40 test-node0 openais[4329]: [SYNC ] Not using a virtual synchrony filter.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Creating commit token because I am the rep.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Saving state aru 0 high seq received 0
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Storing new sequence id for ring 8
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering COMMIT state.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering RECOVERY state.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] position [0] member 10.10.10.10:
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] previous ring seq 4 rep 10.10.10.10
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] aru 0 high delivered 0 received flag 1
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Did not need to originate any messages in recovery.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] Sending initial ORF token
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] CLM CONFIGURATION CHANGE
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] New Configuration:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] Members Left:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] Members Joined:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] CLM CONFIGURATION CHANGE
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] New Configuration:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ]      r(0) ip(10.10.10.10)
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] Members Left:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] Members Joined:
May 13 11:31:40 test-node0 openais[4329]: [CLM  ]      r(0) ip(10.10.10.10)
May 13 11:31:40 test-node0 openais[4329]: [SYNC ] This node is within the primary component and will provide service.
May 13 11:31:40 test-node0 openais[4329]: [TOTEM] entering OPERATIONAL state.
May 13 11:31:40 test-node0 openais[4329]: [CMAN ] quorum regained, resuming activity
May 13 11:31:40 test-node0 openais[4329]: [CLM  ] got nodejoin message 10.10.10.10
May 13 11:31:41 test-node0 ccsd[4320]: Initial status:: Quorate
May 13 11:31:41 test-node0 qdiskd[4359]: <info> Quorum Daemon Initializing
May 13 11:31:41 test-node0 qdiskd[4359]: <crit> Initialization failed
May 13 11:31:48 test-node0 kernel: bond0: no IPv6 routers present
May 13 11:32:31 test-node0 fenced[4373]: test-node.domain not a cluster member after 3 sec post_join_delay
May 13 11:32:31 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:11 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:11 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:16 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:31 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:31 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:36 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:34:23 test-node0 fenced[4373]: fence "test-node.domain" success
May 13 11:34:23 test-node0 kernel: dlm: Using TCP for communications
May 13 11:34:24 test-node0 clvmd: Cluster LVM daemon started - connected to CMAN
May 13 11:34:25 test-node0 scsi_reserve: [error] cluster not configured for scsi reservations
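
To make the fencing timeline in the excerpt above easier to read, here is a minimal Python sketch that pairs each `fencing node` line with the `fence ... failed` or `fence ... success` line that follows it and reports how long each attempt took. The function name `fence_attempts`, the regular expressions, and the year 2010 (syslog timestamps omit it, so it is taken from the post date) are assumptions made for illustration; none of this is part of any cluster tool.

```python
import re
from datetime import datetime

# fenced messages copied verbatim from the log excerpt above.
LOG = """\
May 13 11:32:31 test-node0 fenced[4373]: test-node.domain not a cluster member after 3 sec post_join_delay
May 13 11:32:31 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:11 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:11 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:16 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:33:31 test-node0 fenced[4373]: agent "fence_bladecenter" reports: Connection timed out
May 13 11:33:31 test-node0 fenced[4373]: fence "test-node.domain" failed
May 13 11:33:36 test-node0 fenced[4373]: fencing node "test-node.domain"
May 13 11:34:23 test-node0 fenced[4373]: fence "test-node.domain" success
"""

# Hypothetical patterns, written to match the message shapes seen in this log only.
LINE_RE = re.compile(r'^(\w{3} +\d+ [\d:]{8}) \S+ fenced\[\d+\]: (.*)$')
START_RE = re.compile(r'^fencing node "([^"]+)"$')
RESULT_RE = re.compile(r'^fence "([^"]+)" (failed|success)$')


def parse_time(stamp):
    # Assumption: the year is 2010, taken from the date of the forum post.
    return datetime.strptime("2010 " + stamp, "%Y %b %d %H:%M:%S")


def fence_attempts(text):
    """Return (node, result, seconds) for every completed fencing attempt."""
    attempts = []
    started = {}  # node name -> time the current attempt began
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # not a fenced message
        when, msg = parse_time(m.group(1)), m.group(2)
        start = START_RE.match(msg)
        if start:
            started[start.group(1)] = when
            continue
        result = RESULT_RE.match(msg)
        if result and result.group(1) in started:
            begin = started.pop(result.group(1))
            seconds = int((when - begin).total_seconds())
            attempts.append((result.group(1), result.group(2), seconds))
    return attempts


if __name__ == "__main__":
    for node, result, secs in fence_attempts(LOG):
        print(f"{node}: {result} after {secs}s")
```

On the lines above this reports two failed attempts (40 s and 15 s, each preceded by fence_bladecenter reporting "Connection timed out") followed by a successful third attempt that took 47 s.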




