Zone-redundant Linux cluster in Azure



So I'm back to clustering again - as mentioned in previous blogs I've done this a lot in the past with HP-UX/AIX/Red Hat/Windows/Oracle in various forms, but never with SLES and never with Linux in Azure until this past week.

So here is my write up - hopefully it's useful to others. I couldn't find a good piece of documentation that covered the complete case of what I would call a 'typical' traditional cluster - i.e. an IP address and a filesystem moving from one node to another in the event of machine failure (there are still some use cases where this is relevant).

The following Microsoft note got me a lot of the way there (https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/high-availability-guide-suse-nfs) but there were some things that didn't work.

So to help me in future (and to hopefully help others too) here is how I built a SLES zone redundant cluster in Azure.

First up a basic picture of what it will build (don't judge my drawing skills)



So we end up with 3 machines here - the 2 'cluster' nodes, which can host the 'application' service, and a 3rd node to act as a 'split brain device' (or quorum disk, or cluster lock disk - whatever you want to call it). In Windows clustering this device has been replaced with Azure blob storage (the so-called cloud witness), but that's not directly possible here (well, I say that - there is the option of using a fencing agent instead, but that has some limitations so it was discounted).

I split the machines so that the 2 main cluster nodes are in Azure zones 1 and 2 (in West Europe) with the split brain device located in zone 3. This gives us better availability than a normal availability set, as each node sits in a physically separate datacentre.

First up we'll deal with node 3, as that is very simple (it just needs to be an iSCSI target reachable from the other 2 servers) and we can get it out of the way. All this needs to be is a very small server (a B2ms was fine for me) - it could probably be any OS that supports creating iSCSI targets, but to keep things SUSE I provisioned a SLES 15 SP1 image from the marketplace.



I provision that with a private address only, set with a static reservation (no public IPs here - though of course you could choose to add one should you wish). It doesn't need any data disks and pretty much everything else is just default settings.
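For reference, this is roughly what that provisioning looks like with the Azure CLI - a minimal sketch only: the resource group, vnet/subnet names and the image URN are placeholders for whatever you use (check az vm image list --publisher SUSE for the current SLES 15 SP1 URN).

az vm create \
  --resource-group rg-hubclus \
  --name hub \
  --image SUSE:sles-15-sp1-basic:gen1:latest \
  --size Standard_B2ms \
  --zone 3 \
  --vnet-name vnet-hubclus --subnet subnet-hubclus \
  --public-ip-address "" \
  --private-ip-address 10.10.10.76 \
  --admin-username azureuser --generate-ssh-keys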

Once that is up and running we ssh on to the server and then run the following steps:

1) Sort out the software packages and enable some iscsi stuff


zypper update

zypper remove lio-utils python-rtslib python-configshell targetcli

zypper install targetcli-fb dbus-1-python

systemctl enable targetcli

systemctl start targetcli

2) Create the iSCSI device (I just back this with a file on the OS disk) - the system I'm installing is called 'hub', so replace any instances of that name with whatever you want to call yours. My two cluster nodes are called simply node01 and node02 - again, replace those with names of your choosing.


mkdir /sbd



targetcli backstores/fileio create sbdhub /sbd/sbdhub 50M write_back=false

targetcli iscsi/ create iqn.2006-04.hub.local:hub

targetcli iscsi/iqn.2006-04.hub.local:hub/tpg1/luns/ create /backstores/fileio/sbdhub

targetcli iscsi/iqn.2006-04.hub.local:hub/tpg1/acls/ create iqn.2006-04.node01.local:node01

targetcli iscsi/iqn.2006-04.hub.local:hub/tpg1/acls/ create iqn.2006-04.node02.local:node02

targetcli saveconfig
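Before leaving this box it's worth a quick sanity check that the target is in place and listening - the first command should show the sbdhub backstore, the iqn...hub target and the two node ACLs, and the second should show something listening on port 3260:

targetcli ls

ss -tln | grep 3260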

And that's it - nothing else to do on this server at all - we just created an iSCSI device that the cluster nodes will later use for split brain detection (SBD).

Right now on to the more complicated cluster nodes.

We start off in pretty much the same way - I just order 2 VMs from the portal - this time, however, making sure to choose the SLES 15 SP1 for SAP image (as this includes the HA packages required for clustering, even though I have no intention of using SAP).


node01 I order in zone 1 and node02 in zone 2 (original, huh). To each machine I add 1 additional data disk - in my case 1TB each - you may require more or less than that, but just make sure they are the same size on both nodes.
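Again for reference, a hedged CLI equivalent of what I clicked through in the portal - the VM size, resource group and network names are placeholders, and double-check the SLES for SAP image URN with az vm image list before relying on it:

az vm create \
  --resource-group rg-hubclus \
  --name node01 \
  --image SUSE:sles-sap-15-sp1:gen1:latest \
  --size Standard_D4s_v3 \
  --zone 1 \
  --vnet-name vnet-hubclus --subnet subnet-hubclus \
  --public-ip-address "" \
  --data-disk-sizes-gb 1024 \
  --admin-username azureuser --generate-ssh-keys

# node02 is identical apart from --name node02 and --zone 2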

Once the Azure platform has done its thing I then log on to node01 and run the following steps:

1) Enable the iSCSI initiator and sbd services, so we can attach to the iSCSI device we created on the 3rd node in zone 3

systemctl enable iscsid
systemctl enable iscsi
systemctl enable sbd

2) In the /etc/iscsi/initiatorname.iscsi file, set the initiator name to match the ACL we created on the sbd node for node01

## may reject this initiator.  The InitiatorName must be unique
## for each iSCSI initiator.  Do NOT duplicate iSCSI InitiatorNames.
InitiatorName=iqn.2006-04.node01.local:node01

3) restart iscsi to pick that up

systemctl restart iscsid
systemctl restart iscsi

4) Discover and log in to the target on the 3rd node (10.10.10.76 is that node's static private IP), and set it to attach automatically at boot

iscsiadm -m discovery --type=st --portal=10.10.10.76:3260  
iscsiadm -m node -T iqn.2006-04.hub.local:hub --login --portal=10.10.10.76:3260
iscsiadm -m node -p 10.10.10.76:3260 --op=update --name=node.startup --value=automatic
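It's worth a quick check that the login actually created a session before going looking for the new disk - this just lists the active iSCSI sessions:

iscsiadm -m session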

5) Now when we run lsscsi we will see an iscsi device (/dev/sdc in the output below) 

# lsscsi
[1:0:0:0]    cd/dvd  Msft     Virtual CD/ROM   1.0   /dev/sr0
[2:0:0:0]    disk    Msft     Virtual Disk     1.0   /dev/sda
[3:0:1:0]    disk    Msft     Virtual Disk     1.0   /dev/sdb
[6:0:0:0]    disk    LIO-ORG  sbdhub           4.0   /dev/sdc

6) We then find the persistent path to that device using this command (pick the one starting scsi-3)

ls -l /dev/disk/by-id/scsi-* | grep sdc 

/dev/disk/by-id/scsi-3600140591120716e08e4ad4b3b5318c0

7) We can then create the sbd device

sbd -d /dev/disk/by-id/scsi-3600140591120716e08e4ad4b3b5318c0 -1 60 -4 120 create
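You can verify the sbd header was written correctly (and that the 60/120 second timeouts took) with the dump subcommand:

sbd -d /dev/disk/by-id/scsi-3600140591120716e08e4ad4b3b5318c0 dump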

8) And add that device to the config file

vi /etc/sysconfig/sbd

grep scsi /etc/sysconfig/sbd
SBD_DEVICE="/dev/disk/by-id/scsi-3600140522db15c1728d4b01a2e2204ed"

9) Now we enable the 'softdog' module - I honestly had no clue what this was for at first, but it's the software watchdog kernel module: sbd needs a watchdog device so a hung node can reset itself, and as Azure VMs have no hardware watchdog the MS note says to use softdog.

echo softdog | sudo tee /etc/modules-load.d/softdog.conf

modprobe -v softdog
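A quick check that the module actually loaded and exposed a watchdog device for sbd to use:

lsmod | grep softdog

ls -l /dev/watchdog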

10) now get all the packages up to date

zypper update (then reboot)

11) Now a couple of config changes

 vi /etc/systemd/system.conf

 look for this value and set to 4096

DefaultTasksMax=4096

then reload the daemon

systemctl daemon-reload


add the following to /etc/sysctl.conf:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800
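These sysctl values only take effect at boot unless you apply them by hand, so optionally run:

sysctl -p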

Then stop the Azure cloud networking agent (cloud-netconfig) from reconfiguring the interface, as it can remove the virtual IP the cluster manages

vi /etc/sysconfig/network/ifcfg-eth0 (set CLOUD_NETCONFIG_MANAGE to 'no')
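For clarity (the parameter name is upper case in the file), the line you want to end up with is:

CLOUD_NETCONFIG_MANAGE="no"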

12) Now install the Azure Python SDK package


zypper in python3-azure-sdk

13) Now make sure name resolution is working by adding entries to /etc/hosts on the server


10.10.10.69 node01
10.10.10.70 node02


14) Now we can initialize the cluster (albeit with one node at the moment) - pretty much defaults for everything - the output below is fairly self-explanatory.

ha-cluster-init -u
WARNING: chronyd.service is not configured to start at system boot.
Do you want to continue anyway (y/n)? y
  Generating SSH key
  Configuring csync2
  Generating csync2 shared key (this may take a while)...done
  csync2 checking files...done

Configure Corosync (unicast):
  This will configure the cluster messaging layer.  You will need
  to specify a network address over which to communicate (default
  is eth0's network, but you can use the network address of any
  active interface).

  Address for ring0 [10.10.10.69]
  Port for ring0 [5405]

Configure SBD:
  If you have shared storage, for example a SAN or iSCSI target,
  you can use it avoid split-brain scenarios by configuring SBD.
  This requires a 1 MB partition, accessible to all nodes in the
  cluster.  The device path must be persistent and consistent
  across all nodes in the cluster, so /dev/disk/by-id/* devices
  are a good choice.  Note that all data on the partition you
  specify here will be destroyed.

Do you wish to use SBD (y/n)? y
SBD is already configured to use /dev/disk/by-id/scsi-3600140522db15c1728d4b01a2e2204ed - overwrite (y/n)? n
  Hawk cluster interface is now running. To see cluster status, open:
  Log in with username 'hacluster', password 'linux'
WARNING: You should change the hacluster password to something more secure!
  Waiting for cluster........done
  Loading initial cluster configuration

Configure Administration IP Address:
  Optionally configure an administration virtual IP
  address. The purpose of this IP address is to
  provide a single IP that can be used to interact
  with the cluster, rather than using the IP address
  of any specific cluster node.

Do you wish to configure a virtual IP address (y/n)? n
  Done (log saved to /var/log/ha-cluster-bootstrap.log)
Right, now on to the zone 2 node (I'll switch to an orangey colour for this - by the way the colour is not some 80's throwback - it's meant to help illustrate which server the steps are being run on - see the pic at the top of the post).

Most of this is a repeat so I'll skip a lot of the explanation


systemctl enable iscsid

systemctl enable iscsi

systemctl enable sbd



 vi /etc/iscsi/initiatorname.iscsi

grep Init /etc/iscsi/initiatorname.iscsi

## Default iSCSI Initiatorname.

## If you change the InitiatorName, existing access control lists

## may reject this initiator.  The InitiatorName must be unique

## for each iSCSI initiator.  Do NOT duplicate iSCSI InitiatorNames.

InitiatorName=iqn.2006-04.node02.local:node02



systemctl restart iscsid

systemctl restart iscsi


iscsiadm -m discovery --type=st --portal=10.10.10.76:3260  

iscsiadm -m node -T iqn.2006-04.hub.local:hub --login --portal=10.10.10.76:3260

iscsiadm -m node -p 10.10.10.76:3260 --op=update --name=node.startup --value=automatic

# lsscsi

[1:0:0:0]    cd/dvd  Msft     Virtual CD/ROM   1.0   /dev/sr0

[2:0:0:0]    disk    Msft     Virtual Disk     1.0   /dev/sda

[3:0:1:0]    disk    Msft     Virtual Disk     1.0   /dev/sdb

[6:0:0:0]    disk    LIO-ORG  sbdhub           4.0   /dev/sdc


/dev/disk/by-id/scsi-3600140591120716e08e4ad4b3b5318c0 (same as on node 1)



don't re-create the sbd device on this node - just enter it into the config file



vi /etc/sysconfig/sbd



hubclus-1:~ # grep scsi /etc/sysconfig/sbd

SBD_DEVICE="/dev/disk/by-id/scsi-3600140522db15c1728d4b01a2e2204ed"


echo softdog | sudo tee /etc/modules-load.d/softdog.conf

modprobe -v softdog

zypper update

 vi /etc/systemd/system.conf

DefaultTasksMax=4096

systemctl daemon-reload


add the following to /etc/sysctl.conf:

vm.dirty_bytes = 629145600
vm.dirty_background_bytes = 314572800



vi /etc/sysconfig/network/ifcfg-eth0 (set CLOUD_NETCONFIG_MANAGE to 'no')






zypper in python3-azure-sdk

update /etc/hosts with the same entries


10.10.10.69 node01

10.10.10.70 node02


Now on both nodes we need to generate ssh keys to allow transparent login

ssh-keygen (default everything when prompted)
then copy the content of the public key into the authorized_keys file on the opposite node - after that you can ssh with no password to the opposite node (I'm assuming that if you are contemplating building a cluster you'll know the basics of how to set this up, to be honest - but there's a quick sketch below for completeness)
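A minimal sketch of that key exchange (assuming root password logins are allowed between the nodes - if not, just append the contents of /root/.ssh/id_rsa.pub to the other node's /root/.ssh/authorized_keys by hand):

# on node01
ssh-keygen                      # accept the defaults
ssh-copy-id root@node02
ssh root@node02 hostname        # should print node02 with no password prompt
# then repeat the mirror image on node02 against node01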

Now we join the second node to the cluster


ha-cluster-join

WARNING: chronyd.service is not configured to start at system boot.

Do you want to continue anyway (y/n)? y

  Join This Node to Cluster:

  You will be asked for the IP address of an existing node, from which

  configuration will be copied.  If you have not already configured

  passwordless ssh between nodes, you will be prompted for the root

  password of the existing node.



  IP address or hostname of existing node (e.g.: 192.168.1.1) []node01

  Retrieving SSH keys - This may prompt for root@node01:

/root/.ssh/id_rsa already exists - overwrite (y/n)? n

  No new SSH keys installed

  Configuring csync2...done

  Merging known_hosts

  Probing for new partitions...done

  Address for ring0 [10.10.10.70]



  Hawk cluster interface is now running. To see cluster status, open:


  Log in with username 'hacluster', password 'linux'

WARNING: You should change the hacluster password to something more secure!

  Waiting for cluster....done

  Reloading cluster configuration...Password:

Password:

done

  Done (log saved to /var/log/ha-cluster-bootstrap.log)
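A quick sanity check from either node that the join actually worked - both nodes should show as Online, along with the sbd stonith resource the bootstrap scripts set up:

crm status

crm_mon -1      # one-shot version of the same view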

And at this point we have a basic cluster in place - it hasn't really got anything useful in it yet though - so now let's add that in.

The following steps need to be done on both nodes (showing this in red......)

1) Update the corosync config - just these 2 values (in the totem section of /etc/corosync/corosync.conf) need updating, and then the service restarting



update /etc/corosync/corosync.conf file


  token:          30000
  consensus:      36000


service corosync restart
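To confirm corosync actually picked up the new values after the restart, you can query the runtime config (a hedged check - the exact key names can vary slightly between corosync versions):

corosync-cmapctl | grep -E "totem.token |totem.consensus"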



2) Then we install socat - we'll use this to host a 'dummy' service listening on a port that the load balancer probe (which we'll create later on) can check


zypper in socat 


 3) Now we prep that 1TB disk I added to be used by drbd (I find this for some reason the most impossible sequence of 4 letters to type - I get it wrong at least 50% of the time - too long working with db's I think)



Make sure you run this against the 1TB device and not the iSCSI device (sdd in my case, where the iSCSI device was sdc).



 # fdisk /dev/sdd


Welcome to fdisk (util-linux 2.33.1).

Changes will remain in memory only, until you decide to write them.

Be careful before using the write command.



Device does not contain a recognized partition table.

Created a new DOS disklabel with disk identifier 0x207d70aa.



Command (m for help): n

Partition type

   p   primary (0 primary, 0 extended, 4 free)

   e   extended (container for logical partitions)

Select (default p): p

Partition number (1-4, default 1):

First sector (2048-2147483647, default 2048):

Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-2147483647, default 2147483647):



Created a new partition 1 of type 'Linux' and of size 1024 GiB.



Command (m for help): w

The partition table has been altered.

Calling ioctl() to re-read partition table.

Syncing disks.





Now we want to initialize this disk with LVM (I don't think you need to use LVM to be honest - but I just prefer to work with it).



pvcreate /dev/sdd1

vgcreate /dev/netahub /dev/sdd1

lvcreate --name esis --size 128G /dev/netahub

now we have a 128GB block device that we will use with drbd

(As a side note, I had never even heard of DRBD before I started building this - it essentially does block-level replication over the network - it doesn't have to be used in a cluster but I guess that's the main use case)

So let's set that part up

1) Set the content of /etc/drbd.d/global_common.conf to:





global {
     usage-count no;
}
common {
     handlers {
          fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
          split-brain "/usr/lib/drbd/notify-split-brain.sh root";
          pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
     }
     startup {
          wfc-timeout 0;
     }
     options {
     }
     disk {
          md-flushes yes;
          disk-flushes yes;
          c-plan-ahead 1;
          c-min-rate 100M;
          c-fill-target 20M;
          c-max-rate 4G;
     }
     net {
          after-sb-0pri discard-younger-primary;
          after-sb-1pri discard-secondary;
          after-sb-2pri call-pri-lost-after-sb;
          protocol     C;
          tcp-cork yes;
          max-buffers 20000;
          max-epoch-size 20000;
          sndbuf-size 0;
          rcvbuf-size 0;
     }
}



2) Now we create a drbd config file for our LVM device - this file is created in /etc/drbd.d and in my case is named hub-esis.res (the filename has to match the resource name in the first line of the file). Note that we name the virtual device that will be created here as /dev/drbd0.





resource hub-esis {
     protocol     C;
     disk {
          on-io-error       detach;
     }
     on node01 {
          address   10.10.10.69:7790;
          device    /dev/drbd0;
          disk      /dev/netahub/esis;
          meta-disk internal;
     }
     on node02 {
          address   10.10.10.70:7790;
          device    /dev/drbd0;
          disk      /dev/netahub/esis;
          meta-disk internal;
     }
}



3) Now we create the actual device


 drbdadm create-md hub-esis
drbdadm up hub-esis
drbdadm new-current-uuid --clear-bitmap hub-esis
drbdadm primary --force hub-esis

Now we bring it up on the second node (node02) also

drbdadm create-md hub-esis
drbdadm up hub-esis
drbdadm down hub-esis
drbdadm up hub-esis
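From either node you can now watch the replication state (drbdadm status is the DRBD 9 way - on older DRBD 8 installs you'd cat /proc/drbd instead); node01 should show role Primary with node02 as a connected peer:

drbdadm status hub-esis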

Now we go back to the primary (node01) and wait for the 2 devices to finish syncing

drbdsetup wait-sync-resource hub-esis

So now that block device is replicating between the 2 servers - now it's active let's create a filesystem on the /dev/drbd0 device that got created

mkfs.xfs /dev/drbd0

Right - at this point we can start adding things into the actual cluster config and get something useful running.

1) We set a global default for the cluster and put it into maintenance mode (so no failovers start happening while we configure things)

crm configure property maintenance-mode=true
crm configure rsc_defaults resource-stickiness="200"

2) Let's add the drbd device to the cluster config - the first command defines the resource and the second defines the master/slave relationship


crm configure primitive drbd_hub-esis \

  ocf:linbit:drbd \

  params drbd_resource=hub-esis \

  op monitor interval="15" role="Master" \

  op monitor interval="30" role="Slave" notify="true"



crm configure ms ms-drbd_hub-esis drbd_hub-esis \

  meta master-max="1" master-node-max="1" clone-max="2" \

  clone-node-max="1" notify="true" interleave="true"
3) Now we add the filesystem definition to mount the filesystem (remember that you have to actually create the mount point on both nodes - in this case /esis)


crm configure primitive fs_esis \

  ocf:heartbeat:Filesystem \

  params device=/dev/drbd0 \

  directory=/esis  \

  fstype=xfs \

  op monitor interval="10s"
4) Now we add the virtual cluster ip (make sure the netmask and nic device are set correctly)


crm configure primitive vip_hub \

  IPaddr2 \

  params ip=10.10.10.68 cidr_netmask=28 nic=eth0 op monitor interval=10 timeout=20

5) Now we add the probeport service (I'll link back to this from the load balancer in a bit) - this just creates a dummy listener on port 55555


crm configure primitive probeport anything \

  params binfile="/usr/bin/socat" cmdline_options="-U TCP-LISTEN:55555,backlog=10,fork,reuseaddr /dev/null" \

  op monitor timeout=20s interval=10 depth=0


6) Now we group some of the objects together and define some ordering/colocation dependencies between them


crm configure group hub \

  fs_esis probeport vip_hub



crm configure order o-drbd_before_hub mandatory: \

  ms-drbd_hub-esis:promote hub:start



crm configure colocation col-hub_on_drbd mandatory: \

  hub ms-drbd_hub-esis:Master


7) Now we come out of maintenance mode

crm configure property maintenance-mode=false

At this point the cluster should be up and running and all healthy (if you have any warning messages from crm status you can clean them up with "crm resource cleanup" - I found this was sometimes essential, as if it gets into a funny state it will refuse to run on the other node - this command clears up any previous error messages and resets the state to OK).

You'll find that if you just shut down a node, the cluster package of drbd/vip/probeport/filesystem will very rapidly (just a few seconds for me) switch to the other node and everything works great. You can also test this in a more controlled way, as shown below.
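If you want a more controlled failover test than pulling the plug, the standard crm shell commands work well - putting a node in standby forces everything it is hosting to move:

crm node standby node01     # the hub group should move to node02 within a few seconds

crm status                  # confirm where everything is now running

crm node online node01      # bring node01 back so it can host resources again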

However there is one more thing to do in the portal and that is to create a load balancer - this is needed because the VIP we created via the cluster is not known to Azure, so you can't actually get to that IP from outside the servers. The load balancer we create has the same frontend IP as the cluster VIP (very confusing I know - but it works this way for Windows clusters too). The load balancer probes for something listening on port 55555 (the dummy service we created) - whichever node answers gets all the traffic. So when that service moves, so does all the traffic the load balancer is sending.

So in the load balancer screen be sure to choose a Standard load balancer (required to work across zones) and set it up something like this (ignore the error, it's just because I didn't choose the vnet etc. first)


The backend pool is just the 2 cluster nodes - the health probe looks like this:


And finally the rules screen - be sure to check the HA Ports box so all ports are load balanced (unless you know specifically which ones you need and are going to create a rule for each one). A rough CLI equivalent of these screens follows below.
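For reference, here's a rough Azure CLI equivalent of those portal screens - a sketch only, with placeholder resource group / vnet / NIC names, and with floating IP enabled on the rule (which is what lets the backend VMs answer on the 10.10.10.68 frontend address):

az network lb create -g rg-hubclus -n hub-lb --sku Standard \
  --vnet-name vnet-hubclus --subnet subnet-hubclus \
  --frontend-ip-name hub-fe --private-ip-address 10.10.10.68 \
  --backend-pool-name hub-be

az network lb probe create -g rg-hubclus --lb-name hub-lb -n probe-55555 \
  --protocol Tcp --port 55555

az network lb rule create -g rg-hubclus --lb-name hub-lb -n hub-haports \
  --protocol All --frontend-port 0 --backend-port 0 \
  --frontend-ip-name hub-fe --backend-pool-name hub-be \
  --probe-name probe-55555 --floating-ip true --idle-timeout 30

# and add each node's NIC to the backend pool (NIC names are placeholders)
az network nic ip-config address-pool add -g rg-hubclus --lb-name hub-lb \
  --address-pool hub-be --nic-name node01VMNic --ip-config-name ipconfig1
az network nic ip-config address-pool add -g rg-hubclus --lb-name hub-lb \
  --address-pool hub-be --nic-name node02VMNic --ip-config-name ipconfig1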


And that's it done - as a simple test you should now be able to ssh to the cluster IP from your laptop/desktop and you'll be routed to the host where the 'application' currently is. If you kill the first node and then connect again with ssh you'll see you end up on the other node, like magic.

There you have it - a Linux cluster in Azure across availability zones - enjoy........

As a small aside - the Hawk web interface that is mentioned during cluster creation is actually quite nice, both as a visual overview and as a management console - see the example screenshot from another test cluster below.


