
TOPIC:

High load and system operations getting frozen - 4 years 7 months ago #1873

Sherbin George (Topic Author)
Hi,

We have run into a situation where system operations are freezing on our HA-Lizard pool.

The situation started when we moved a VM onto the HA-Lizard pool; it got stuck while attempting to start it. At one point I thought an xe-toolstack-restart would fix this, but there was no luck.

Afterwards, I attempted a snapshot of a running VM within the same pool, and that made the VM freeze and become unresponsive. Even a forced reboot appeared to run forever.

At last, we rebooted the slave host to see if that would fix the issue, but it didn't. We haven't touched the master yet, but its load average is staying around 10.00.

I am passing along the information I've collected from both the master and the slave. Can a master reboot bring things back to normal here?

**********************
Master

[root@XEN05 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.4.5 (api:1/proto:86-101)
srcversion: D496E56BBEBA8B1339BB34A
m:res cs ro ds p mounted fstype
1:iscsi1 StandAlone Primary/Unknown UpToDate/DUnknown r


[root@XEN05 ~]# ha-cfg status
| ha-lizard Version: 2.1.4 |
| Operating Mode: Mode [ 2 ] Managing Individual VMs in Pool |
| Host Role: master |
| Pool UUID: 33b3ff3e-c123-2f54-9c1e-bbe748d7db51 |
| Host UUID: f2b699a9-de94-4b8c-8443-d53672b8a49d |
| Master UUID: f2b699a9-de94-4b8c-8443-d53672b8a49d |
| Daemon Status: ha-lizard is running [ OK ] |
| Watchdog Status: ha-lizard-watchdog is running [ OK ] |
| HA Enabled: false |


[root@XEN05 ~]# iscsi-cfg status
| iSCSI-HA Version IHA_2.1.5 |
| Sat Aug 24 14:38:52 EDT 2019 |

| iSCSI-HA Status: Running 8644 |
| Last Updated: Sat Aug 24 14:38:50 EDT 2019 |
| HOST ROLE: MASTER |
| DRBD ROLE: iscsi1=Primary |
| DRBD CONNECTION: iscsi1 in StandAlone state |
| ISCSI TARGET: Running [expected running] |
| VIRTUAL IP: 10.10.11.3 is local |
Control + C to exit


| DRBD Status |

| version: 8.4.5 (api:1/proto:86-101) |
| srcversion: D496E56BBEBA8B1339BB34A |
| 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r
|
| ns:0 nr:0 dw:334633404 dr:1893486048 al:33628704 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:3117818488 |


Slave

[root@XEN06 ~]# service drbd status
drbd driver loaded OK; device status:
version: 8.4.5 (api:1/proto:86-101)
srcversion: D496E56BBEBA8B1339BB34A
m:res cs ro ds p mounted fstype
1:iscsi1 WFConnection Secondary/Unknown UpToDate/DUnknown C


[root@XEN06 ~]# ha-cfg status
| ha-lizard Version: 2.1.4 |
| Operating Mode: Mode [ 2 ] Managing Individual VMs in Pool |
| Host Role: slave |
| Pool UUID: 33b3ff3e-c123-2f54-9c1e-bbe748d7db51 |
| Host UUID: d481ed26-3d0c-406a-bd92-d61d63d5ca3b |
| Master UUID: f2b699a9-de94-4b8c-8443-d53672b8a49d |
| Daemon Status: ha-lizard is running [ OK ] |
| Watchdog Status: ha-lizard-watchdog is running [ OK ] |
| HA Enabled: false |


[root@XEN06 ~]# iscsi-cfg status

| iSCSI-HA Version IHA_2.1.5 |
| Sat Aug 24 14:40:13 EDT 2019 |

| iSCSI-HA Status: Running 5403 |
| Last Updated: Sat Aug 24 14:40:07 EDT 2019 |
| HOST ROLE: SLAVE |
| VIRTUAL IP: 10.10.11.3 is not local |
| ISCSI TARGET: Stopped [expected stopped] |
| DRBD ROLE: iscsi1=Secondary |
| DRBD CONNECTION: iscsi1 in WFConnection state |
Control + C to exit


| DRBD Status |

| version: 8.4.5 (api:1/proto:86-101) |
| srcversion: D496E56BBEBA8B1339BB34A |
| 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r
|
| ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:34692 |


High load and system operations getting frozen - 4 years 7 months ago #1874

Sherbin, I ran into a similar problem and later discovered the root cause was the xcp-emu-manager package version on my XCP-ng 7.6 hosts. There's a problem with version 0.0.3 or 0.0.5 (I can't remember exactly), but I know you must update it to avoid getting VMs stuck during a live migration.

The latest version of xcp-emu-manager is 1.1.2. I think you should give it a try.

Another situation I've seen was a VM getting stuck in a wait state after live-migrating it to the slave host (I opened the VM console in XOA and watched the CPU usage in nmon). Moving the VM back to the master host brought it back to normal operation.

This very same issue was solved after updating all hosts entirely with a "yum update".
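If it helps, here is a rough sketch of what I mean, assuming an XCP-ng host with its standard yum repositories configured:

~~~
# check which version of the package is currently installed (if any)
rpm -q xcp-emu-manager

# update just that package
yum update xcp-emu-manager

# or, as mentioned above, bring the whole host up to date
yum update
~~~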

Regards


Last edit: by Fabio Brizzolla.

High load and system operations getting frozen - 4 years 7 months ago #1875

Sherbin George (Topic Author)
Thank you for your reply, Fabio.

We are using 7.1 on both of the hosts where the problem exists, and I don't think the xcp-emu-manager package is installed there. We have another pool with 7.5 installed, and that one does have xcp-emu-manager.

~~~
[root@XEN05 ~]# rpm -qa | grep -i emu
emulex-be2net-11.1.196.0-1.x86_64
qemu-xen-2.2.1-4.36786.x86_64
emulex-lpfc-11.1.210.1-1.x86_64
~~~
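Querying the package by name should confirm the same thing, since rpm reports a package as not installed when it is absent:

~~~
# explicit check by package name on the 7.1 hosts
rpm -q xcp-emu-manager
~~~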


High load and system operations getting frozen - 4 years 7 months ago #1876

Salvatore Costantino
It looks like your replication link is not working (as shown in the status output of both hosts, XEN05 and XEN06). This could prevent any VMs from running on the slave if the link issue is IP related.

You can check whether the issue is IP related by pinging the floating replication IP and the peer's replication IP from each host. If there is a connectivity issue AND you are using a bonded active/active link for replication, try momentarily unplugging one of the replication Ethernet ports. This is known to clear a Linux ARP issue that sometimes appears when a Linux bridge is stacked on top of a Linux bonded link.
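For example (10.10.11.3 is the floating replication IP from your status output; the peer addresses below are placeholders, so substitute whatever replication IPs are actually configured on XEN05 and XEN06):

~~~
# run on XEN05
ping -c 3 10.10.11.3     # floating replication IP
ping -c 3 10.10.11.2     # XEN06's replication IP (placeholder address)

# run on XEN06
ping -c 3 10.10.11.3     # floating replication IP
ping -c 3 10.10.11.1     # XEN05's replication IP (placeholder address)
~~~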

If the above is not the issue, then you could have a split-brain situation. You can recover from this with a tool at /etc/iscsi-ha/scripts/drbd-sb-tool. Keep in mind that this tool will perform a complete sync from the surviving node to the peer, which could take several hours depending on the link speed and the size of the storage. The system will continue to operate while this is happening in the background.
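One way to tell whether DRBD actually declared a split brain is to check the kernel log on each host; DRBD logs a "Split-Brain detected" message when it refuses to reconnect (the log path below is the usual location in a CentOS-based dom0):

~~~
# look for the DRBD split-brain message on each host
dmesg | grep -i "split-brain"
grep -i "split-brain" /var/log/kern.log
~~~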


High load and system operations getting frozen - 4 years 7 months ago #1877

Sherbin George (Topic Author)
Hi Salvatore,

The floating IP is pingable from both hosts:

[root@XEN05 ~]# ping 10.10.11.3
PING 10.10.11.3 (10.10.11.3) 56(84) bytes of data.
64 bytes from 10.10.11.3: icmp_seq=1 ttl=64 time=0.028 ms
64 bytes from 10.10.11.3: icmp_seq=2 ttl=64 time=0.043 ms

[root@XEN06 ~]# ping 10.10.11.3
PING 10.10.11.3 (10.10.11.3) 56(84) bytes of data.
64 bytes from 10.10.11.3: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from 10.10.11.3: icmp_seq=2 ttl=64 time=0.182 ms

Also, in the current situation 4 VMs are running on the slave and 1 VM on the master (the load average is still staying around 10.00). But we haven't performed any system operations, fearing it might break things again.

Would a reboot of the master bring things back into sync, rather than using the drbd-sb-tool, while keeping all VMs on the slave? Or do you think rebooting the master will affect the VMs running on the slave too?


High load and system operations getting frozen - 4 years 7 months ago #1878

Salvatore Costantino
drbd-sb-tool is only to recover from a DRBD split brain (which is not common). It is not necessary when executing maintenance operations such as a reboot. In your case, after the reboot, the hosts should resync within a few seconds.

To reboot the master, you must put the storage into manual mode so that it can be exposed on the slave while the master reboots. Here are the steps. Before performing the steps below, migrate all VMs to the slave.
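The migration itself can be done with the xe CLI, something along these lines (the UUIDs and host name are placeholders for your actual VMs and slave host):

~~~
# list the VMs currently resident on the master
xe vm-list resident-on=<master-host-uuid> is-control-domain=false

# live-migrate each one to the slave
xe vm-migrate vm=<vm-uuid-or-name> host=<slave-host-name> live=true
~~~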

1) Disable HA - this can be done from either host and only needs to be done on a single host
ha-cfg ha-disable

2) Put each host into manual mode. This needs to be done on both hosts
iscsi-cfg manual-mode-enable

The next 2 steps must be performed with minimal delay in between.

3) Put the master into secondary mode
iscsi-cfg become-secondary

4) Put the slave into primary mode
iscsi-cfg become-primary

It is now safe to reboot the master. Once the master has rebooted, perform the following to put the pool back into normal operational mode.

The next 2 steps must be performed with minimal delay in between.

5) Put the slave into secondary mode
iscsi-cfg become-secondary

6) Put the master into primary mode
iscsi-cfg become-primary

7) On both hosts - exit manual mode
iscsi-cfg manual-mode-disable

8) Re-enable HA
ha-cfg ha-enable
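
Once everything is back in normal mode, a quick sanity check (not part of the procedure above, just what I would look at) is the DRBD and iSCSI-HA status on either host; the resource should come back to a Connected, UpToDate/UpToDate state within a short time:

~~~
# confirm DRBD has reconnected and resynced
service drbd status
cat /proc/drbd      # expect cs:Connected and ds:UpToDate/UpToDate

# confirm iSCSI-HA reports the expected roles and a connected DRBD link
iscsi-cfg status
~~~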

