
TOPIC: Testing some disaster scenarios. How to proceed?

Testing some disaster scenarios. How to proceed? 2 months 4 hours ago #1564

Hi all in the ha-lizard community,
This is the first time I've posted here.
I have two nodes installed using "halizard_nosan_installer_2.1.3" on a XenServer 7.2 installation.
Everything works great. I've created some virtual machines running on each of the hosts.
I've tried unplugging the power cord on one server to check what happens (the other one takes over all the virtual machines, and so on).

But I want to try 2 more critical scenarios.

-- Critical Scenario, TEST 1 --
Each of my XenServer hosts has one hard disk for the system and another disk for storage and iSCSI/DRBD.
I removed the storage disk from the master node to simulate a failing disk, just to see what happens if that ever occurs for real.
Then I plugged in a new hard disk (not the same disk).

The "iscsi-cfg status" output after unplugging the hard disk and after plugging in a new (blank) hard disk is the same:

1: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r

I waited some time (5 or 10 minutes) to see whether the iSCSI or DRBD daemons would pick up the new hard disk... but nothing happened.

How do I attach the new hard disk to the iSCSI environment, and then how do I make DRBD start synchronizing the disks again?

Thanks.

-- Critical Scenario, TEST 2 --
And what happens if, in the future, one of the nodes dies completely?
I assume I can keep working with just the surviving server... but what do I have to do to add a new node in substitution for the failed one?
Can I install XenServer 7.2 again, re-run the "halizard_nosan_installer_2.1.3" script, and that's all?

Thanks


Please, feel free to ask me for logs or screenshots.
Thanks.

Testing some disaster scenarios. How to proceed? 1 month 4 weeks ago #1565

Your first test scenario is an interesting case that is not specifically handled. In this scenario, your master has completely lost its non-RAID disk. Your data is still fully intact on the slave, so you could switch iscsi-ha to manual mode and start exposing the surviving storage to the pool from the slave. That would get the pool operational again.

To sync your data to the new disk, you would need to recreate the DRBD metadata on the new disk and then sync. These steps are outlined in the reference design document on the website.
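For anyone searching later, a minimal sketch of those metadata steps in dry-run form (the resource name iscsi1 and the disk path /dev/sdb are assumptions here; check your drbd.conf before running anything for real):

```shell
#!/bin/sh
# Sketch of the DRBD metadata recreation steps on the node with the new
# disk. Assumptions (adjust to your setup): DRBD resource "iscsi1", new
# blank disk at /dev/sdb. RUN=echo keeps this a dry run that only prints
# each command; set RUN= (empty) to actually execute on the degraded node.
RES=iscsi1
DISK=/dev/sdb
RUN=echo

$RUN dd if=/dev/zero bs=1M count=1 of="$DISK"   # clear any stale metadata
$RUN drbdadm create-md "$RES"                   # write fresh DRBD metadata
$RUN service drbd start                         # attach and reconnect the resource
```

After these steps, the surviving node still has to be told to overwrite the peer, as described further down in this thread.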

As an aside, we have added disk SMART status monitoring to iscsi-ha. Our next release will, at a minimum, alert on this condition.

Regarding the second scenario/question: this post discusses the same type of event and the steps required to re-introduce a new host:
halizard.com/forum/suggestion-box/272-host-failure-scenario

Testing some disaster scenarios. How to proceed? 1 month 4 weeks ago #1566

Thank you Salvatore.

I've checked the documentation ( halizard.com/images/pdf/iscsi-ha_2-node_...Server_7.x_final.pdf ).

xenserver1 --> server with the supposedly failed disk, now fitted with a new disk (data lost)
xenserver2 --> surviving node with the correct data

On my first try, the situation was: xenserver1 as iSCSI master, xenserver2 as iSCSI slave, and manual mode disabled.

[root@xenserver1 ~]# dd if=/dev/zero bs=1M count=1 of=/dev/sdb
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0108849 s, 96.3 MB/s

[root@xenserver1 ~]# drbdadm create-md iscsi1
initializing activity log
initializing bitmap (29812 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.

And then, that command on the survivor data node:

[root@xenserver2 scripts]# drbdadm -- --overwrite-data-of-peer primary iscsi1
1: State change failed: (-1) Multiple primaries not allowed by config
Command 'drbdsetup-84 primary 1 --overwrite-data-of-peer' terminated with exit code 11

Nothing worked on that first try.

And now the second try. What I did: activate manual mode on both nodes, demote xenserver1 to secondary, promote xenserver2 to primary.

[root@xenserver1 ~]# iscsi-cfg manual-mode-enable
iscsi-ha now in manual mode
Note: High Availability should be disabled if any hosts will be shutdown or rebooted

[root@xenserver2 ~]# iscsi-cfg manual-mode-enable
iscsi-ha now in manual mode
Note: High Availability should be disabled if any hosts will be shutdown or rebooted

[root@xenserver1 ~]# iscsi-cfg become-secondary
| iscsi-ha is in manual mode - current status shown below |
Storage role: Secondary [expected secondary]
Replication IP: 10.10.10.1/24 [10.10.10.3 not expected here]
iSCSI target: Stopped [expected stopped]

[root@xenserver2 scripts]# iscsi-cfg become-primary
| iscsi-ha is in manual mode - current status shown below |
Storage role: Primary [expected primary]
Replication IP: 10.10.10.2/24 10.10.10.3/24 [10.10.10.3 expected here]
iSCSI target: Running [expected running]

[root@xenserver2 scripts]# drbdadm -- --overwrite-data-of-peer primary iscsi1

After the last command, which produced no error this time, I rolled back the roles (xenserver1 as master, xenserver2 as slave, manual mode disabled on both nodes) and checked "iscsi-cfg status", but the situation is still the same:

| DRBD Status |

| version: 8.4.5 (api:1/proto:86-101) |
| srcversion: 2A6B2FA4F0703B49CA9C727 |
| 1: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r |
| ns:149408 nr:22592 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 |


Any idea?

Thank you.

Testing some disaster scenarios. How to proceed? 1 month 4 weeks ago #1567

Synchronizing to a new disk can take hours when introducing the new drive. Disabling manual mode during a sync will break your pool if the master is the host with the new disk.

Can you try again, this time leaving the pool in manual mode while syncing? You can check the DRBD status while the sync is going on, to ensure that the diskless state has cleared, with
"cat /proc/drbd"
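If it helps, that check can also be done from a script rather than by eye; a tiny sketch, just a grep over the status line (the sample lines below are the ones from this thread):

```shell
#!/bin/sh
# Report whether a /proc/drbd resource line still shows a Diskless
# backing device, i.e. the new disk has not been attached yet.
check_diskless() {
    if echo "$1" | grep -q 'ds:Diskless'; then
        echo "diskless"
    else
        echo "attached"
    fi
}

# Sample status lines taken from this thread:
check_diskless '1: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r'       # prints "diskless"
check_diskless '1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r'  # prints "attached"
```

On a live node you would feed it the real line, e.g. `check_diskless "$(grep '^ 1:' /proc/drbd)"`.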

Testing some disaster scenarios. How to proceed? 1 month 4 weeks ago #1568

Ok

1) enable manual mode on both nodes
2) demote the master to secondary (the node with the new disk)
3) promote the slave to master (the node with the correct data)
4) execute this command on the surviving node: drbdadm -- --overwrite-data-of-peer primary iscsi1
5) do not deactivate manual mode
6) do not roll back the iSCSI roles
7) check "cat /proc/drbd"

[root@xenserver1 ~]# iscsi-cfg manual-mode-enable
iscsi-ha now in manual mode
Note: High Availability should be disabled if any hosts will be shutdown or rebooted

[root@xenserver2 ~]# iscsi-cfg manual-mode-enable
iscsi-ha now in manual mode
Note: High Availability should be disabled if any hosts will be shutdown or rebooted

[root@xenserver1 ~]# iscsi-cfg become-secondary
| iscsi-ha is in manual mode - current status shown below |
Storage role: Primary [expected secondary]
Replication IP: 10.10.10.1/24 10.10.10.3/24 [10.10.10.3 not expected here]
iSCSI target: Stopped [expected stopped]

[root@xenserver2 scripts]# iscsi-cfg become-primary
| iscsi-ha is in manual mode - current status shown below |
Storage role: Primary [expected primary]
Replication IP: 10.10.10.2/24 10.10.10.3/24 [10.10.10.3 expected here]
iSCSI target: Running [expected running]

[root@xenserver2 scripts]# drbdadm -- --overwrite-data-of-peer primary iscsi1

[root@xenserver1 ~]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
srcversion: 2A6B2FA4F0703B49CA9C727

1: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r
ns:176036 nr:24784 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0


[root@xenserver2 scripts]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
srcversion: 2A6B2FA4F0703B49CA9C727

1: cs:Connected ro:Primary/Secondary ds:UpToDate/Diskless C r
ns:165072 nr:365656 dw:369852 dr:179648 al:129 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:41624


Still "Diskless".
No progress indicator for the sync process.

<< NEWS >>
It looks like it works now.
I had missed this command ---> [root@xenserver1 ~]# service drbd start
After that, "cat /proc/drbd" looks like this:

[root@xenserver1 ~]# cat /proc/drbd
version: 8.4.5 (api:1/proto:86-101)
srcversion: 2A6B2FA4F0703B49CA9C727

1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r
ns:0 nr:98304 dw:98304 dr:0 al:0 bm:0 lo:1 pe:3 ua:0 ap:0 ep:1 wo:f oos:732453884
[>....................] sync'ed: 0.1% (715284/715380)M
finish: 4:08:07 speed: 49,152 (49,152) want: 49,760 K/sec


In case of a disk failure on the secondary iSCSI node, the process is:
- xenserver1 as iSCSI primary master
- xenserver2 as iSCSI secondary slave
- remove the data hard disk on xenserver2 (now diskless)
1) put the new hard disk in xenserver2
2) enable manual mode (on both nodes)
3) dd if=/dev/zero bs=1M count=1 of=/dev/xxx (failed node)
4) drbdadm create-md iscsi1 (failed node)
5) service drbd start (failed node)
6) execute this command on the surviving node: drbdadm -- --overwrite-data-of-peer primary iscsi1
7) do not disable manual mode yet
8) check "cat /proc/drbd" until the sync process has finished
9) disable manual mode (on both nodes)
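The steps above can be condensed into a printed per-node checklist. This is a dry-run memo only, nothing in it touches DRBD; the resource name and disk path are the ones used in this thread, and the exact command to leave manual mode should be checked against iscsi-cfg's own help:

```shell
#!/bin/sh
# Prints the secondary-disk-failure recovery checklist from the steps
# above, grouped by which node each command runs on.
print_checklist() {
    cat <<'EOF'
both nodes     : iscsi-cfg manual-mode-enable
failed node    : dd if=/dev/zero bs=1M count=1 of=/dev/xxx
failed node    : drbdadm create-md iscsi1
failed node    : service drbd start
surviving node : drbdadm -- --overwrite-data-of-peer primary iscsi1
both nodes     : cat /proc/drbd   (repeat until the sync has finished)
both nodes     : disable manual mode again
EOF
}
print_checklist
```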

In case of a disk failure on the primary iSCSI node, the process is:
- xenserver1 as iSCSI primary master
- xenserver2 as iSCSI secondary slave
- remove the data hard disk on xenserver1 (now diskless)
1) put the new hard disk in xenserver1
2) enable manual mode (on both nodes)
3) demote the primary master to slave
4) promote the secondary slave to master
5) dd if=/dev/zero bs=1M count=1 of=/dev/xxx (failed node)
6) drbdadm create-md iscsi1 (failed node)
7) service drbd start (failed node)
8) execute this command on the surviving node: drbdadm -- --overwrite-data-of-peer primary iscsi1
9) do not disable manual mode yet
10) check "cat /proc/drbd" until the sync process has finished
11) demote the iSCSI master to slave
12) promote the iSCSI slave to master
13) disable manual mode (on both nodes)
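For the "until the sync process has finished" check before rolling the roles back, a sketch of a completion test on the /proc/drbd text: the sync is done when both sides report UpToDate and the out-of-sync counter shows oos:0 (the sample below is modeled on the output earlier in this thread):

```shell
#!/bin/sh
# Succeeds when a /proc/drbd dump shows both sides UpToDate and nothing
# left out of sync (oos:0 at the end of the counters line).
sync_finished() {
    echo "$1" | grep -q 'ds:UpToDate/UpToDate' &&
    echo "$1" | grep -q 'oos:0$'
}

# Sample modeled on this thread's output, after the sync completes:
status='1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r
    ns:0 nr:715380 dw:715380 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0'

if sync_finished "$status"; then
    echo "sync complete - safe to roll the roles back"
else
    echo "still syncing - keep manual mode on"
fi
```

On a live node, `sync_finished "$(cat /proc/drbd)"` in a loop with a sleep would give a crude wait-for-sync.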


And finally, I'm going to reinstall one of the systems from the very beginning, just to emulate a whole server failure.

See you and thanks!
Last Edit: 1 month 4 weeks ago by Daniel Masó.