
TOPIC:

10.10.10.3 doesn't get activated 10 years 7 months ago #49

We created a test scenario with the master in DRBD Primary state and SyncSource to an invalidated peer, but we were unable to reproduce the condition you are seeing. Judging by your logs, there may be an issue with the email alert function: the process hangs at that point and is then restarted by the watchdog. Are you able to re-create the condition? (We did so with "drbdadm invalidate all" on the secondary peer.) If so, this time set MAIL_ON=0 in /etc/iscsi-ha/iscsi-ha.conf.
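
For reference, the MAIL_ON change can be made in an editor or scripted. The following is only a minimal sketch, assuming the setting sits on its own line in KEY=VALUE form; the helper name set_mail_on is made up for illustration and is not part of iscsi-ha:

```shell
# Hypothetical helper (not part of iscsi-ha): rewrite the MAIL_ON line in a config file.
# Assumes the file contains a line of the form MAIL_ON=<value>.
set_mail_on() {
    conf="$1"
    value="$2"
    cp "$conf" "$conf.bak" || return 1      # keep a backup of the original file
    sed "s/^MAIL_ON=.*/MAIL_ON=$value/" "$conf.bak" > "$conf"
}

# Example: set_mail_on /etc/iscsi-ha/iscsi-ha.conf 0
```

All other lines in the file are left as they were, and the previous version stays next to it with a .bak suffix.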

Below is the log output from our test:
Aug 16 23:44:01 XS1 iscsi-ha: 14599 Spawning new instance of iscsi-ha
Aug 16 23:44:01 XS1 iscsi-ha: 14824 Checking if this host is a Pool Master or Slave
Aug 16 23:44:01 XS1 iscsi-ha: 14824 This host's pool status = master
Aug 16 23:44:01 XS1 iscsi-ha: 14820 auto_plug_pbd: Found LVMoISCSI SR List: 1dff3fd0-2903-35e4-5537-18545cca3bbe
Aug 16 23:44:01 XS1 iscsi-ha: 14824 DRBD Running on this host: version: 8.3.15 (api:88/proto:86-97) GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by root@XS2, 2013-08-04 23:01:14 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r---n- ns:382512644 nr:0 dw:301792 dr:382782096 al:0 bm:23411 lo:42 pe:58 ua:64 ap:0 ep:1 wo:d oos:56997420 [================>...] sync'ed: 87.1% (55660/429200)M finish: 0:09:25 speed: 100,796 (100,184) K/sec
Aug 16 23:44:01 XS1 iscsi-ha: 14824 check_drbd_resource_state: DRBD Resource: iscsi1 in Primary mode
Aug 16 23:44:01 XS1 iscsi-ha: 14824 DRBD Resource: iscsi1 in SyncSource state - expected Connected state
Aug 16 23:44:01 XS1 iscsi-ha: 14824 Mail Spool Directory Found /dev/shm/iscsi-ha-mail
Aug 16 23:44:01 XS1 iscsi-ha: 14824 email Sending ALERT email to root@localhost: DRBD Resource: iscsi1 in SyncSource state - expected Connected state

Aug 16 23:44:01 XS1 iscsi-ha: 14824 email: Message copied to /dev/shm/iscsi-ha-mail/1376711041.msg - Suppressing duplicates for 30 Minutes
Aug 16 23:44:02 XS1 iscsi-ha: 14824 iSCSI target: /etc/init.d/tgtd status = OK. [tgtd (pid 1870 1868) is running...]
Aug 16 23:44:02 XS1 iscsi-ha: 14824 local_ip_list: Local IP list returned 127.0.0.1 10.10.10.1 192.168.1.241
Aug 16 23:44:04 XS1 iscsi-ha: 27466 iscsi-ha Watchdog: iscsi-ha running - OK
Aug 16 23:44:04 XS1 iscsi-ha: 14824 check_ip_health: 10.10.10.3 response = FAIL
Aug 16 23:44:04 XS1 iscsi-ha: 14824 Mail Spool Directory Found /dev/shm/iscsi-ha-mail
Aug 16 23:44:04 XS1 iscsi-ha: 14824 email Sending ALERT email to root@localhost: check_ip_health: 10.10.10.3 response = FAIL
Aug 16 23:44:04 XS1 iscsi-ha: 14824 email: Message copied to /dev/shm/iscsi-ha-mail/1376711044.msg - Suppressing duplicates for 30 Minutes
Aug 16 23:44:04 XS1 iscsi-ha: 14824 Virtual IP 10.10.10.3 expected local, not found. Initializing..
Aug 16 23:44:04 XS1 iscsi-ha: 14824 mask_numbits: Mask 255.255.255.0 contains 24 bits.
Aug 16 23:44:04 XS1 iscsi-ha: 14824 DRBD Virtual IP: 10.10.10.3 successfully added to local interface xapi1
Aug 16 23:44:04 XS1 iscsi-ha: 14824 Updating ARP for 10.10.10.3
Aug 16 23:44:06 XS1 iscsi-ha-NOTICE-/etc/iscsi-ha/iscsi-ha.sh: ARPING 10.10.10.3 from 10.10.10.3 xapi1
Aug 16 23:44:06 XS1 iscsi-ha-NOTICE-/etc/iscsi-ha/iscsi-ha.sh: Sent 2 probes (2 broadcast(s))
Aug 16 23:44:06 XS1 iscsi-ha-NOTICE-/etc/iscsi-ha/iscsi-ha.sh: Received 0 response(s)
Aug 16 23:44:11 XS1 iscsi-ha: 14812 Spawning new instance of iscsi-ha
Aug 16 23:44:11 XS1 iscsi-ha: 15193 Checking if this host is a Pool Master or Slave
Aug 16 23:44:11 XS1 iscsi-ha: 15193 This host's pool status = master
Aug 16 23:44:11 XS1 iscsi-ha: 15189 auto_plug_pbd: Found LVMoISCSI SR List: 1dff3fd0-2903-35e4-5537-18545cca3bbe
Aug 16 23:44:11 XS1 iscsi-ha: 15193 DRBD Running on this host: version: 8.3.15 (api:88/proto:86-97) GIT-hash: 0ce4d235fc02b5c53c1c52c53433d11a694eab8c build by root@XS2, 2013-08-04 23:01:14 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r ns:383522304 nr:0 dw:301792 dr:383794576 al:0 bm:23472 lo:9 pe:36 ua:52 ap:0 ep:1 wo:d oos:55985068 [================>...] sync'ed: 87.3% (54672/429200)M finish: 0:09:12 speed: 101,220 (100,184) K/sec
Aug 16 23:44:11 XS1 iscsi-ha: 15193 check_drbd_resource_state: DRBD Resource: iscsi1 in Primary mode
Aug 16 23:44:11 XS1 iscsi-ha: 15193 DRBD Resource: iscsi1 in SyncSource state - expected Connected state
Aug 16 23:44:11 XS1 iscsi-ha: 15193 Mail Spool Directory Found /dev/shm/iscsi-ha-mail
Aug 16 23:44:11 XS1 iscsi-ha: 15193 email: Duplicate message - not sending. Content = DRBD Resource: iscsi1 in SyncSource state - expected Connected state
Aug 16 23:44:11 XS1 iscsi-ha: 15193 email: Message barred for 30 minutes
Aug 16 23:44:11 XS1 iscsi-ha: 15193 iSCSI target: /etc/init.d/tgtd status = OK. [tgtd (pid 1870 1868) is running...]
Aug 16 23:44:11 XS1 iscsi-ha: 15193 local_ip_list: Local IP list returned 127.0.0.1 10.10.10.1 10.10.10.3 192.168.1.241
Aug 16 23:44:11 XS1 iscsi-ha: 15193 Virtual IP: 10.10.10.3 discovered on host XS1
Aug 16 23:44:14 XS1 iscsi-ha: 27466 iscsi-ha Watchdog: iscsi-ha running - OK


10.10.10.3 doesn't get activated 10 years 7 months ago #50

Can you also check the following?

- Is the sendmail package installed on your host? (try "rpm -qa | grep sendmail")
- Is the mailx package installed on your host? (try "rpm -qa | grep mailx")

- Is a DNS server configured on the host for resolving domain names?

- Is the domain part of the MAIL_TO configuration parameter resolvable by your host?
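
The checks in this list could be bundled roughly as follows. This is a sketch, not an official diagnostic: the function names are invented here, and getent hosts only tests ordinary name resolution (it does not look up MX records), so treat a failure as a hint rather than proof.

```shell
# Print the domain part of a mail address (everything after the last "@").
mail_domain() {
    printf '%s\n' "${1##*@}"
}

# Run the checks from the list above. Defined only; call it on the host itself,
# passing the MAIL_TO value from your configuration.
check_mail_prereqs() {
    mail_to="$1"
    rpm -q sendmail mailx                       # are both packages installed?
    getent hosts "$(mail_domain "$mail_to")"    # does the MAIL_TO domain resolve?
}
```

For example, check_mail_prereqs root@localhost would query the two packages and then try to resolve "localhost".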


10.10.10.3 doesn't get activated 10 years 7 months ago #51

  • christ neeskens
I recreated the problem with "invalidate all", and when the rebuilding started the same problem happened again.

Then I changed the MAIL_ON parameter from 1 to 0, as you suggested, and restarted iscsi-ha. Now 10.10.10.3 is reachable again.
And everything is up and running during the SyncTarget state!

Guess the problem must be in there somewhere...


Do you maybe also have an idea why the speed is so slow?
I set the syncer to 100M.
The servers are connected over a bonded 1 Gb (active-active) connection, but the speed doesn't get much above 70M/sec; most of the time it's around 60M/sec, with drops to about 30M/sec.


10.10.10.3 doesn't get activated 10 years 7 months ago #52

  • christ neeskens
Some answers

[root@xenserver1 ~]# rpm -qa | grep sendmail
sendmail-8.13.8-8.1.el5_7

[root@xenserver1 ~]# rpm -qa | grep mailx
mailx-8.1.1-44.2.2

DNS is set up and working (using the google dns servers)
[root@xenserver1 ~]# ping google.nl
PING google.nl (82.94.234.35) 56(84) bytes of data.
64 bytes from cache.google.com (82.94.234.35): icmp_seq=1 ttl=60 time=30.3 ms
64 bytes from cache.google.com (82.94.234.35): icmp_seq=2 ttl=60 time=29.7 ms

The MAIL_TO domain is also resolvable; otherwise I wouldn't get any emails.
Anyway, this is the ping output for the MAIL_TO domain:
[root@xenserver1 ~]# ping innocom.nl
PING innocom.nl (194.109.6.98) 56(84) bytes of data.
64 bytes from whl-www.xs4all.net (194.109.6.98): icmp_seq=1 ttl=61 time=30.1 ms
64 bytes from whl-www.xs4all.net (194.109.6.98): icmp_seq=2 ttl=61 time=29.3 ms


I do get the iscsi-ha mail messages.
The same goes for several other HA-related emails. When the system is failing or rebuilding, I get about 150 emails per hour per server.


10.10.10.3 doesn't get activated 10 years 7 months ago #53

Thanks for the details and for confirming that the email alert is somehow hanging the process. A trivial function such as email should not interfere with more critical logic. We will move the email logic to a subprocess so that it cannot block other critical functions, and we will post a patch release soon.
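
To illustrate the subprocess idea in general terms (this is not the actual patch, just a sketch of the technique): the mail command is run in a backgrounded subshell so that the caller continues immediately, even if the mailer stalls. The function name send_alert_async and its argument layout are assumptions made for this example.

```shell
# Sketch only: send an alert without making the caller wait for the mailer.
# $1 = subject, $2 = body, $3 = mail command (defaults to "mail" from mailx).
# MAIL_TO is expected to be set by the caller, e.g. from the configuration file.
send_alert_async() {
    (
        printf '%s\n' "$2" | "${3:-mail}" -s "$1" "$MAIL_TO"
    ) >/dev/null 2>&1 &
}

# Example: send_alert_async "iscsi-ha ALERT" "check_ip_health: 10.10.10.3 response = FAIL"
```

With this shape, a hung mailer leaves a stray background process behind but does not delay checks such as the virtual IP setup. A real implementation would probably also want a timeout on the background job.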

FYI - some clues on why this is happening will likely be in your sendmail log file. If you find more details please post them here so that we can replicate the scenario.

Regarding the synchronization speed: our How-To uses 100M, but that value should be adapted to your environment, notably to the disk write speed and Ethernet I/O speed of the specific server hardware. Set it to 30%-40% of the lower of these two subsystem speeds.
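
As a worked example of the 30%-40% rule, here is a small sketch. The function name is invented, the input figures are made up rather than measured on any system in this thread, and both inputs are assumed to be whole numbers in the same unit as the syncer rate (M per second):

```shell
# Suggested syncer rate range: 30%-40% of the slower of two subsystems.
# Arguments: disk write speed and network speed, both as whole numbers in M/sec.
sync_rate_range() {
    disk="$1"
    net="$2"
    if [ "$disk" -lt "$net" ]; then slowest="$disk"; else slowest="$net"; fi
    echo "$(( slowest * 30 / 100 ))M-$(( slowest * 40 / 100 ))M"
}

# Example with made-up figures: 150 M/sec disk writes, 110 M/sec network.
sync_rate_range 150 110    # prints 33M-44M
```

Here the network is the slower subsystem, so the range is taken from 110: 30% is 33 and 40% is 44.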


Last edit: by Pulse Supply. Reason: mark as resolved