
Pool master doesn't auto switch to slave 10 years 9 months ago #24

I have two nodes pool for XCP 1.6. I have installed ha-lizard-1.6.41.4.tgz on each of the xcp hosts (xcp1, xcp2).

The test steps are as follows:
1. Create a pool with xcp1 as the pool master, then add xcp2 to the pool. I have shared NFS storage.
2. Install ha-lizard on both xcp1 and xcp2.
3. Use ha-cfg status to enable HA on the pool.
4. Create a VM on xcp1.
5. In XenCenter, click reboot on xcp1.
6. Log in to xcp2 and use xe pool-list to check the pool master's uuid (commands shown below).
7. After xcp1 has rebooted, log in to xcp1 and use xe pool-list to check the pool master's uuid.

Result of the test:
1. On xcp2, the pool master's uuid changes to xcp2's uuid.
2. On xcp1, the pool master's uuid is still xcp1's uuid.
3. If I manually issue xe pool-recover-slaves on xcp2, xcp1 then changes to slave.
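
For reference, this is how I check which host each node believes is the master (standard xe commands run in dom0):
# Show the uuid of the host this node considers the pool master
xe pool-list params=master
# Map that uuid back to a host name
xe host-list params=uuid,name-label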

Issue:
I expect ha-lizard to automatically change xcp1 into a pool slave after it reboots, without my having to issue the xe pool-recover-slaves command manually on the new master node.
Maybe the new pool master xcp2 could issue the command automatically once it sees that xcp1 has finished rebooting (for example, once xcp1 responds to ping again); a rough sketch of what I mean is below.

Please give me some suggestions on how to achieve this.
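
This is only a hypothetical sketch of the idea (not something ha-lizard provides today), run on the new master; XCP1_IP is a placeholder for xcp1's management IP:
#!/bin/bash
# Wait for the old master to answer ping again, then pull it back in as a slave.
XCP1_IP=192.0.2.10                # placeholder - xcp1's management IP
while ! ping -c 1 -W 2 "$XCP1_IP" >/dev/null 2>&1; do
    sleep 10                      # old master is not back yet
done
sleep 60                          # arbitrary grace period for xapi on xcp1 to finish starting
xe pool-recover-slaves            # point all slaves (including xcp1) at this master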

My ha-lizard setting is as follows:
DISABLED_VAPPS=()
ENABLE_LOGGING=1
FENCE_ACTION=stop
FENCE_ENABLED=1
FENCE_FILE_LOC=/etc/ha-lizard/fence
FENCE_HA_ONFAIL=1
FENCE_HEURISTICS_IPS=172.16.36.254
FENCE_HOST_FORGET=1
FENCE_IPADDRESS=
FENCE_METHOD=POOL
FENCE_MIN_HOSTS=2
FENCE_PASSWD=
FENCE_QUORUM_REQUIRED=1
FENCE_REBOOT_LONE_HOST=0
FENCE_USE_IP_HEURISTICS=1
GLOBAL_VM_HA=1
MAIL_FROM="root@localhost"
MAIL_ON=1
MAIL_SUBJECT="SYSTEM_ALERT-FROM_HOST:$HOSTNAME"
MAIL_TO="root@localhost"
MONITOR_DELAY=45
MONITOR_KILLALL=1
MONITOR_MAX_STARTS=50
MONITOR_SCANRATE=10
OP_MODE=2
PROMOTE_SLAVE=1
SLAVE_HA=1
SLAVE_VM_STAT=0
XAPI_COUNT=5
XAPI_DELAY=15
XC_FIELD_NAME='ha-lizard-enabled'
XE_TIMEOUT=10



Pool master does not auto switch to slave after reboot 10 years 9 months ago #25

- Please verify that you are running version 1.6.41.4, as this is the required version for a 2-node pool.

- You should set FENCE_HOST_FORGET to 0, as it is no longer mandatory for fencing. Leaving it set to 1 will work; however, you will then have to manually prepare a host for re-introduction to the pool if it has been fenced. Setting it to 0 will save you time in your testing.

It should not be necessary to issue the recover-slaves command in order to re-introduce the former master into the pool as a slave. The new master will perform that action as part of the fencing logic. Can you perform your test again, this time with FENCE_HOST_FORGET=0?
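
One way to make that change (just a sketch - the exact config file, and whether the setting is managed locally or pool-wide, depends on your ha-lizard install; /etc/ha-lizard/ha-lizard.conf is assumed here):
sed -i 's/^FENCE_HOST_FORGET=.*/FENCE_HOST_FORGET=0/' /etc/ha-lizard/ha-lizard.conf
grep FENCE_HOST_FORGET /etc/ha-lizard/ha-lizard.conf    # confirm the new value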


Also, it is possible that the rebooted host is joining the pool before the surviving host has a chance to fence it. Try setting:
XAPI_COUNT=2
XAPI_DELAY=10

This will speed up the fencing process considerably. Also, capture a log on the surviving host after xcp1 is rebooted; this will help in troubleshooting. If you are still experiencing trouble, you can post the log output here. Capture the log with the "ha-cfg log" command, for example as shown below.
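
For example (assuming ha-lizard is logging to syslog, as in your output format):
ha-cfg log                            # live view of ha-lizard activity on this host
grep ha-lizard /var/log/messages      # the same entries also end up in syslog, an easy way to save a copy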

Please Log in or Create an account to join the conversation.

Pool master does not auto switch to slave after reboot 10 years 9 months ago #26

Thanks for the reply. But the problem still exists.

- My ha-lizard version is 1.6.41.4

- ha-lizard setting:
DISABLED_VAPPS=()
ENABLE_LOGGING=1
FENCE_ACTION=stop
FENCE_ENABLED=1
FENCE_FILE_LOC=/etc/ha-lizard/fence
FENCE_HA_ONFAIL=1
FENCE_HEURISTICS_IPS=172.16.36.254
FENCE_HOST_FORGET=0
FENCE_IPADDRESS=
FENCE_METHOD=POOL
FENCE_MIN_HOSTS=2
FENCE_PASSWD=
FENCE_QUORUM_REQUIRED=1
FENCE_REBOOT_LONE_HOST=0
FENCE_USE_IP_HEURISTICS=1
GLOBAL_VM_HA=1
MAIL_FROM="root@localhost"
MAIL_ON=1
MAIL_SUBJECT="SYSTEM_ALERT-FROM_HOST:$HOSTNAME"
MAIL_TO="root@localhost"
MONITOR_DELAY=45
MONITOR_KILLALL=1
MONITOR_MAX_STARTS=50
MONITOR_SCANRATE=10
OP_MODE=2
PROMOTE_SLAVE=1
SLAVE_HA=1
SLAVE_VM_STAT=0
XAPI_COUNT=2
XAPI_DELAY=10
XC_FIELD_NAME='ha-lizard-enabled'
XE_TIMEOUT=10


- log (test steps are the same as before; xcp1's uuid: b7610022-406b-40e0-90ae-e31ff792898b, xcp2's uuid: 117cdfae-7678-46b9-b329-28453f04c3cc)
Jul 24 10:19:06 xcp2 ha-lizard: 17871 This host detected as pool Master
Jul 24 10:19:06 xcp2 ha-lizard: 17871 Found 2 hosts in pool
Jul 24 10:19:06 xcp2 ha-lizard: 17871 Calling function write_pool_state
Jul 24 10:19:06 xcp2 ha-lizard: 17944 Calling function autoselect_slave
Jul 24 10:19:06 xcp2 ha-lizard: 17951 Calling function check_slave_status
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: This host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:06 xcp2 ha-lizard: 17951 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:19:06 xcp2 ha-lizard: 17951 check_slave_status: Removing Master UUID from list of Hosts
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: MASTER host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:06 xcp2 ha-lizard: 17871 get_vms_on_host: Returned 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:06 xcp2 ha-lizard: 17951 check_slave_status: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:19:06 xcp2 ha-lizard: 17951 check_slave_status: Host IP Address check Status Array for Slaves = ()
Jul 24 10:19:06 xcp2 ha-lizard: 17951 check_slave_status: Failed slave count = 0
Jul 24 10:19:06 xcp2 ha-lizard: 17951 check_slave_status: No Failed slaves detected
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: 117cdfae-7678-46b9-b329-28453f04c3cc is Master UUID - excluding from list of available slaves
Jul 24 10:19:06 xcp2 ha-lizard: 17951 Function check_slave_status reported no failures: calling vm_mon
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: ha-lizard is operating mode 2 - managing pool VMs
Jul 24 10:19:06 xcp2 ha-lizard: 17871 get_vms_on_host: No VMs found on host: b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: 0 available Slave UUIDs found:
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: Retrived list of VMs for this poll: 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: Removing Control Domains from VM list
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: VM list returned = 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_state: Machine state for 74d0ebb8-d06a-7819-4a19-ad63224c82b7 returned: running
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: VM 74d0ebb8-d06a-7819-4a19-ad63224c82b7 state = running
Jul 24 10:19:06 xcp2 ha-lizard: 17951 vm_mon: 0 Eligible Halted VMs found
Jul 24 10:19:06 xcp2 ha-lizard: 17871 email: Sending ALERT email to root@localhost: write_pool_state: Error retrieving autopromote_uuid from pool configuration
Jul 24 10:19:06 xcp2 ha-lizard-NOTICE-/etc/ha-lizard/ha-lizard.sh: /etc/ha-lizard/ha-lizard.func: line 62: mail: command not found
Jul 24 10:19:06 xcp2 ha-lizard: 17871 email: SMTP Session Output:
Jul 24 10:19:06 xcp2 ha-lizard: 17944 autoselect_slave: Selected Slave: = Current slave: - ignoring update
Jul 24 10:19:06 xcp2 ha-lizard: 17871 check_ha_enabled: Checking if ha-lizard is enabled for pool: 3060de88-5d0b-18fe-df97-f8cb60d5b474
Jul 24 10:19:06 xcp2 ha-lizard: 17871 check_ha_enabled: ha-lizard is enabled
Jul 24 10:19:06 xcp2 ha-lizard: 17871 get_pool_host_list: enabled flag set - returning only hosts with enabled=true
Jul 24 10:19:07 xcp2 ha-lizard: 17871 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:07 xcp2 ha-lizard: 17871 get_pool_ip_list: returned 172.16.36.2
Jul 24 10:19:10 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:19:20 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:19:30 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:19:40 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:19:50 xcp2 ha-lizard: 17861 Spawning new instance of ha-lizard
Jul 24 10:19:50 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:19:50 xcp2 ha-lizard: 18318 Checking if this host is a Pool Master or Slave
Jul 24 10:19:50 xcp2 ha-lizard: 18318 This host's pool status = master
Jul 24 10:19:50 xcp2 ha-lizard: 18318 Checking if ha-lizard is enabled for this pool
Jul 24 10:19:50 xcp2 ha-lizard: 18318 check_ha_enabled: Checking if ha-lizard is enabled for pool: 3060de88-5d0b-18fe-df97-f8cb60d5b474
Jul 24 10:19:50 xcp2 ha-lizard: 18318 check_ha_enabled: ha-lizard is enabled
Jul 24 10:19:50 xcp2 ha-lizard: 18318 ha-lizard is enabled
Jul 24 10:19:51 xcp2 ha-lizard: 18318 Successfully updated global pool configuration settings in /etc/ha-lizard/ha-lizard.pool.conf. Settings take effect on subsequent run
Jul 24 10:19:51 xcp2 ha-lizard: 18318 DISABLED_VAPPS=() ENABLE_LOGGING=1 FENCE_ACTION=stop FENCE_ENABLED=1 FENCE_FILE_LOC=/etc/ha-lizard/fence FENCE_HA_ONFAIL=1 FENCE_HEURISTICS_IPS=172.16.36.254 FENCE_HOST_FORGET=0 FENCE_IPADDRESS= FENCE_METHOD=POOL FENCE_MIN_HOSTS=2 FENCE_PASSWD= FENCE_QUORUM_REQUIRED=1 FENCE_REBOOT_LONE_HOST=0 FENCE_USE_IP_HEURISTICS=1 GLOBAL_VM_HA=1 MAIL_FROM="root@localhost" MAIL_ON=1 MAIL_SUBJECT="SYSTEM_ALERT-FROM_HOST:$HOSTNAME" MAIL_TO="root@localhost" MONITOR_DELAY=45 MONITOR_KILLALL=1 MONITOR_MAX_STARTS=50 MONITOR_SCANRATE=10 OP_MODE=2 PROMOTE_SLAVE=1 SLAVE_HA=1 SLAVE_VM_STAT=0 XAPI_COUNT=2 XAPI_DELAY=10 XC_FIELD_NAME='ha-lizard-enabled' XE_TIMEOUT=10
Jul 24 10:19:51 xcp2 ha-lizard: 18318 This host detected as pool Master
Jul 24 10:19:51 xcp2 ha-lizard: 18318 Found 2 hosts in pool
Jul 24 10:19:51 xcp2 ha-lizard: 18318 Calling function write_pool_state
Jul 24 10:19:51 xcp2 ha-lizard: 18381 Calling function autoselect_slave
Jul 24 10:19:51 xcp2 ha-lizard: 18388 Calling function check_slave_status
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: This host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:51 xcp2 ha-lizard: 18388 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: MASTER host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:51 xcp2 ha-lizard: 18388 check_slave_status: Removing Master UUID from list of Hosts
Jul 24 10:19:51 xcp2 ha-lizard: 18318 get_vms_on_host: Returned 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: 117cdfae-7678-46b9-b329-28453f04c3cc is Master UUID - excluding from list of available slaves
Jul 24 10:19:51 xcp2 ha-lizard: 18388 check_slave_status: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:19:51 xcp2 ha-lizard: 18388 check_slave_status: Host IP Address check Status Array for Slaves = ()
Jul 24 10:19:51 xcp2 ha-lizard: 18388 check_slave_status: Failed slave count = 0
Jul 24 10:19:51 xcp2 ha-lizard: 18388 check_slave_status: No Failed slaves detected
Jul 24 10:19:51 xcp2 ha-lizard: 18388 Function check_slave_status reported no failures: calling vm_mon
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: ha-lizard is operating mode 2 - managing pool VMs
Jul 24 10:19:51 xcp2 ha-lizard: 18318 get_vms_on_host: No VMs found on host: b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: 0 available Slave UUIDs found:
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: Retrived list of VMs for this poll: 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: Removing Control Domains from VM list
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: VM list returned = 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_state: Machine state for 74d0ebb8-d06a-7819-4a19-ad63224c82b7 returned: running
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: VM 74d0ebb8-d06a-7819-4a19-ad63224c82b7 state = running
Jul 24 10:19:51 xcp2 ha-lizard: 18388 vm_mon: 0 Eligible Halted VMs found
Jul 24 10:19:51 xcp2 ha-lizard: 18318 email: Sending ALERT email to root@localhost: write_pool_state: Error retrieving autopromote_uuid from pool configuration
Jul 24 10:19:51 xcp2 ha-lizard-NOTICE-/etc/ha-lizard/ha-lizard.sh: /etc/ha-lizard/ha-lizard.func: line 62: mail: command not found
Jul 24 10:19:51 xcp2 ha-lizard: 18318 email: SMTP Session Output:
Jul 24 10:19:51 xcp2 ha-lizard: 18381 autoselect_slave: Selected Slave: = Current slave: - ignoring update
Jul 24 10:19:51 xcp2 ha-lizard: 18318 check_ha_enabled: Checking if ha-lizard is enabled for pool: 3060de88-5d0b-18fe-df97-f8cb60d5b474
Jul 24 10:19:51 xcp2 ha-lizard: 18318 check_ha_enabled: ha-lizard is enabled
Jul 24 10:19:51 xcp2 ha-lizard: 18318 get_pool_host_list: enabled flag set - returning only hosts with enabled=true
Jul 24 10:19:51 xcp2 ha-lizard: 18318 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:19:52 xcp2 ha-lizard: 18318 get_pool_ip_list: returned 172.16.36.2
Jul 24 10:20:00 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:20:10 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:20:20 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:20:30 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK
Jul 24 10:20:35 xcp2 ha-lizard: 18310 Spawning new instance of ha-lizard
Jul 24 10:20:35 xcp2 ha-lizard: 18766 Checking if this host is a Pool Master or Slave
Jul 24 10:20:35 xcp2 ha-lizard: 18766 This host's pool status = master
Jul 24 10:20:35 xcp2 ha-lizard: 18766 Checking if ha-lizard is enabled for this pool
Jul 24 10:20:35 xcp2 ha-lizard: 18766 check_ha_enabled: Checking if ha-lizard is enabled for pool: 3060de88-5d0b-18fe-df97-f8cb60d5b474
Jul 24 10:20:35 xcp2 ha-lizard: 18766 check_ha_enabled: ha-lizard is enabled
Jul 24 10:20:35 xcp2 ha-lizard: 18766 ha-lizard is enabled
Jul 24 10:20:36 xcp2 ha-lizard: 18766 Successfully updated global pool configuration settings in /etc/ha-lizard/ha-lizard.pool.conf. Settings take effect on subsequent run
Jul 24 10:20:36 xcp2 ha-lizard: 18766 DISABLED_VAPPS=() ENABLE_LOGGING=1 FENCE_ACTION=stop FENCE_ENABLED=1 FENCE_FILE_LOC=/etc/ha-lizard/fence FENCE_HA_ONFAIL=1 FENCE_HEURISTICS_IPS=172.16.36.254 FENCE_HOST_FORGET=0 FENCE_IPADDRESS= FENCE_METHOD=POOL FENCE_MIN_HOSTS=2 FENCE_PASSWD= FENCE_QUORUM_REQUIRED=1 FENCE_REBOOT_LONE_HOST=0 FENCE_USE_IP_HEURISTICS=1 GLOBAL_VM_HA=1 MAIL_FROM="root@localhost" MAIL_ON=1 MAIL_SUBJECT="SYSTEM_ALERT-FROM_HOST:$HOSTNAME" MAIL_TO="root@localhost" MONITOR_DELAY=45 MONITOR_KILLALL=1 MONITOR_MAX_STARTS=50 MONITOR_SCANRATE=10 OP_MODE=2 PROMOTE_SLAVE=1 SLAVE_HA=1 SLAVE_VM_STAT=0 XAPI_COUNT=2 XAPI_DELAY=10 XC_FIELD_NAME='ha-lizard-enabled' XE_TIMEOUT=10
Jul 24 10:20:36 xcp2 ha-lizard: 18766 This host detected as pool Master
Jul 24 10:20:36 xcp2 ha-lizard: 18766 Found 2 hosts in pool
Jul 24 10:20:36 xcp2 ha-lizard: 18766 Calling function write_pool_state
Jul 24 10:20:36 xcp2 ha-lizard: 18839 Calling function autoselect_slave
Jul 24 10:20:36 xcp2 ha-lizard: 18846 Calling function check_slave_status
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: This host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:20:36 xcp2 ha-lizard: 18846 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: MASTER host UUID found: 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:20:36 xcp2 ha-lizard: 18846 check_slave_status: Removing Master UUID from list of Hosts
Jul 24 10:20:36 xcp2 ha-lizard: 18766 get_vms_on_host: Returned 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: 117cdfae-7678-46b9-b329-28453f04c3cc is Master UUID - excluding from list of available slaves
Jul 24 10:20:36 xcp2 ha-lizard: 18846 check_slave_status: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:20:36 xcp2 ha-lizard: 18846 check_slave_status: Host IP Address check Status Array for Slaves = ()
Jul 24 10:20:36 xcp2 ha-lizard: 18846 check_slave_status: Failed slave count = 0
Jul 24 10:20:36 xcp2 ha-lizard: 18846 check_slave_status: No Failed slaves detected
Jul 24 10:20:36 xcp2 ha-lizard: 18846 Function check_slave_status reported no failures: calling vm_mon
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: ha-lizard is operating mode 2 - managing pool VMs
Jul 24 10:20:36 xcp2 ha-lizard: 18766 get_vms_on_host: No VMs found on host: b7610022-406b-40e0-90ae-e31ff792898b
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: 0 available Slave UUIDs found:
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: Retrived list of VMs for this poll: 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: Removing Control Domains from VM list
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: VM list returned = 74d0ebb8-d06a-7819-4a19-ad63224c82b7
Jul 24 10:20:36 xcp2 ha-lizard: 18766 email: Sending ALERT email to root@localhost: write_pool_state: Error retrieving autopromote_uuid from pool configuration
Jul 24 10:20:36 xcp2 ha-lizard-NOTICE-/etc/ha-lizard/ha-lizard.sh: /etc/ha-lizard/ha-lizard.func: line 62: mail: command not found
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_state: Machine state for 74d0ebb8-d06a-7819-4a19-ad63224c82b7 returned: running
Jul 24 10:20:36 xcp2 ha-lizard: 18766 email: SMTP Session Output:
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: VM 74d0ebb8-d06a-7819-4a19-ad63224c82b7 state = running
Jul 24 10:20:36 xcp2 ha-lizard: 18846 vm_mon: 0 Eligible Halted VMs found
Jul 24 10:20:36 xcp2 ha-lizard: 18839 autoselect_slave: Selected Slave: = Current slave: - ignoring update
Jul 24 10:20:36 xcp2 ha-lizard: 18766 check_ha_enabled: Checking if ha-lizard is enabled for pool: 3060de88-5d0b-18fe-df97-f8cb60d5b474
Jul 24 10:20:36 xcp2 ha-lizard: 18766 check_ha_enabled: ha-lizard is enabled
Jul 24 10:20:37 xcp2 ha-lizard: 18766 get_pool_host_list: enabled flag set - returning only hosts with enabled=true
Jul 24 10:20:37 xcp2 ha-lizard: 18766 get_pool_host_list: returned 117cdfae-7678-46b9-b329-28453f04c3cc
Jul 24 10:20:37 xcp2 ha-lizard: 18766 get_pool_ip_list: returned 172.16.36.2
Jul 24 10:20:40 xcp2 ha-lizard: 16753 ha-lizard Watchdog: ha-lizard running - OK


Many thanks.


Pool master does not auto switch to slave after reboot 10 years 9 months ago #27

It appears that xcp1 is either powered off or in maintenance mode. The line below from the log confirms this. Was the host in maintenance mode before you powered it off?
check_slave_status: Removing Slave UUID from list of Hosts - Slave: b7610022-406b-40e0-90ae-e31ff792898b is disabled or in maintenance mode

The HA logic ignores any host that is in maintenance mode. So, if you power off a host that was already in maintenance mode, it will not be treated as a failure and no steps will be taken to recover VMs (there would be nothing to recover anyway, since a host that is disabled or in maintenance mode is not running any VMs).

Can you start your test again, making sure that both hosts are enabled and joined in the same pool before simulating the failure? You can check the logs before powering off the host to ensure that xcp1 is being seen as operational; a couple of useful checks are shown below.
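
For example, both hosts should report enabled = true before you pull the plug (standard xe commands; <host-uuid> is a placeholder):
xe host-list params=uuid,name-label,enabled     # both hosts should show enabled: true
xe host-enable uuid=<host-uuid>                 # re-enable a host that is still disabled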


Pool master doesn't auto switch to slave 10 years 9 months ago #28

Actually, each of my machines takes about 5 minutes to complete a reboot.
Also, I didn't manually put any host into maintenance mode, but it seems that rebooting briefly puts the host into maintenance mode.

After running some more tests, I observed the following.

Each test is based on 2 hosts and 1 VM on the pool master, as shown below:

-pool
  -xcp1 (master)
    -vm1
  -xcp2 (slave)



-Test A
First, with FENCE_ENABLED=1 and FENCE_HOST_FORGET=0, I reboot the master (xcp1), and the slave (xcp2) is promoted to be the new master.

Second, after about 5 minutes, xcp1 comes back up and has changed to a slave. But when I check the watchdog, the message shows that xcp1 is not permitted to promote. That is to say, if I now reboot xcp2, there is no slave eligible to become the master. In this condition, HA may fail.

So I try Test B.


-Test B
First, I use the default ha-lizard settings, but set FENCE_ENABLED=0.

Second, because my machines take 5 minutes to complete a reboot, I edit the ha-lizard.func file, changing line 389, which is "sleep 15" in the function promote_slave(). I set it to sleep 300 (5 minutes) so that line 393, "xe pool-recover-slaves", can be executed successfully (see the sketch after this list).

Third, because the master will be rebooted and the VM on the master should be monitored by a slave, I set SLAVE_VM_STAT=1.

Fourth, I reboot the master (xcp1), and the slave (xcp2) is promoted to be the new master. VM1 from xcp1 is also moved to xcp2 by HA. After 5 minutes, xcp1 comes back up and successfully becomes a slave. When I check the watchdog, the message shows that xcp1 is allowed to be promoted.

Fifth, I run the same test again, this time rebooting xcp2 (now the master), and xcp1 is promoted to be the new master successfully, too.
VM1 is also moved to xcp1 successfully.
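
For reference, the edit from the second step was essentially the following (just a sketch - the line content and numbers depend on the ha-lizard version, so check the file before changing it):
grep -n "sleep 15" /etc/ha-lizard/ha-lizard.func               # confirm the line inside promote_slave()
sed -i 's/sleep 15/sleep 300/' /etc/ha-lizard/ha-lizard.func   # wait 5 minutes before xe pool-recover-slaves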


Furthermore, I add another VM on the master, as shown below:

-pool
  -xcp1 (master)
    -vm1
    -vm2
  -xcp2 (slave)


In this scenario, I also run Test B. Unfortunately, one of the VMs does not move to the new master, and it does not start on the new slave either.

Any suggestions?

Many thanks.


Pool master doesn't auto switch to slave 10 years 9 months ago #29

Regarding your Test A - is it possible you are gracefully rebooting xcp1? If so, this is not a valid test, as the server will temporarily be in maintenance mode. While it is in maintenance mode, the new master (xcp2) cannot select a slave that is allowed to promote itself, since no slave is available. If this is the case, you should simulate a real hardware failure instead. Consider rebooting with "echo b > /proc/sysrq-trigger" (example below). This does not gracefully shut down the host, so it never enters maintenance mode. You should capture the log on the surviving host in all your tests, as the logs are very descriptive and will tell you exactly what is going on.
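
For example:
# On the host you want to "fail" - immediate reboot, no shutdown scripts, no maintenance mode:
echo b > /proc/sysrq-trigger
# Meanwhile, on the surviving host, watch what ha-lizard does:
ha-cfg log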

Test B will not work well, since you have disabled fencing. Promoting a host to master requires fencing, and ejecting a slave also requires fencing; otherwise, failed VMs on a slave cannot be started. Part of the fencing logic releases any hung VDIs. The VM likely fails to boot because its VDI is not available (meaning XAPI still sees it attached to the failed host). You can verify this in the logs; they will tell you if the issue is storage related. You can also inspect the VM's disk attachments directly, as shown below.
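
A quick way to check whether XAPI still considers the disk attached somewhere (standard xe commands; <vm-uuid> is a placeholder for the VM that failed to start):
xe vbd-list vm-uuid=<vm-uuid> params=uuid,vdi-uuid,currently-attached
xe vm-list uuid=<vm-uuid> params=power-state,resident-on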

If you still have trouble, fill out a contact form with your contact details. We can try to solve this live.

Good Luck

