High Availability Configuration with Pacemaker and Corosync
Work In Progress...
Description
Requirements
Install
Run the commands below on both nodes to install Pacemaker and Corosync:
svn update /usr/src/mor/sh_scripts
/usr/src/mor/sh_scripts/corosync/install.sh
Configuration
All examples assume two nodes with hostnames node01 and node02 that can reach each other by both hostname and IP address:
- node01 - 192.168.0.152
- node02 - 192.168.0.200
192.168.0.205 is the Virtual IP address.
All following command line examples use this convention:
- [root@node01 ~]# denotes a command which should be run on 'ONE' server in the cluster.
- [root@ALL ~]# denotes a command which should be run on 'ALL' servers (node01 and node02) in the cluster.
You should replace hostnames and IP addresses to match your setup.
The installation script installs all needed packages and configuration files. First, let's set up cluster authentication:
Copy the password from node01 to node02:
[root@node01 ~]# scp /root/hacluster_password root@node02:/root/hacluster_password
And apply the password on node02:
[root@node02 ~]# cat /root/hacluster_password | passwd --stdin hacluster
Now we can authenticate the cluster:
[root@node01 ~]# pcs cluster auth node01 node02 -u hacluster -p $(cat /root/hacluster_password)
node02: Authorized
node01: Authorized
If you get any other output, something went wrong and you should not proceed until it is fixed. If everything is OK, we can set up the cluster:
[root@node01 ~]# pcs cluster setup --name cluster_asterisk node01 node02
Destroying cluster on nodes: node01, node02...
node01: Stopping Cluster (pacemaker)...
node02: Stopping Cluster (pacemaker)...
node02: Successfully destroyed cluster
node01: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'node01', 'node02'
node01: successful distribution of the file 'pacemaker_remote authkey'
node02: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node01: Succeeded
node02: Succeeded
Synchronizing pcsd certificates on nodes node01, node02...
node02: Success
node01: Success
Restarting pcsd on the nodes in order to reload the certificates...
node02: Success
node01: Success
[root@node01 ~]#
If everything went OK, there should be no errors in the output. If this is the case, let's start the cluster:
[root@node01 ~]# pcs cluster start --all
node02: Starting Cluster...
node01: Starting Cluster...
[root@node01 ~]#
This will automatically start the Corosync and Pacemaker services on both nodes. Now let's check that Corosync is healthy and reports no errors (issue this command on each node separately):
[root@node01 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.0.152
status = ring 0 active with no faults
[root@node02 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 192.168.0.200
status = ring 0 active with no faults
If you see different output, you should investigate before proceeding. Now let's check the membership and quorum APIs. You should see both nodes with status joined:
[root@node01 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.0.152)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.0.200)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
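The two Corosync checks above can also be scripted. The following is only a sketch: the function names rings_healthy and joined_members are made up for this example, and the parsing assumes the output format shown above, which may differ in other Corosync versions.

```shell
# Succeeds only if every ring status line read from stdin says
# "active with no faults". Feed it the output of `corosync-cfgtool -s`.
rings_healthy() {
    awk '
        BEGIN { total = 0; ok = 0 }
        /status[[:space:]]*=/ {
            total++
            if ($0 ~ /active with no faults/) ok++
        }
        END { exit (total > 0 && total == ok) ? 0 : 1 }
    '
}

# Prints how many members have status "joined".
# Feed it the output of `corosync-cmapctl`.
joined_members() {
    grep -c 'members\.[0-9]*\.status (str) = joined' || true
}

# Possible usage on a cluster node (not executed here):
#   corosync-cfgtool -s | rings_healthy && echo "rings OK"
#   [ "$(corosync-cmapctl | joined_members)" -eq 2 ] && echo "both nodes joined"
```

Both functions read from stdin, so they can be tried against saved output before being pointed at a live cluster.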
Now disable STONITH and ignore the quorum policy:
[root@node01 ~]# pcs property set stonith-enabled=false
[root@node01 ~]# pcs property set no-quorum-policy=ignore
[root@node01 ~]# pcs property list
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: cluster_asterisk
dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Finally, let's check the cluster status:
[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:34:20 2018
Last change: Wed Oct 17 07:32:39 2018 by root via cibadmin on node01
2 nodes configured
0 resources configured
Online: [ node01 node02 ]
No resources
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
We can see that both nodes are online, all daemons (corosync, pacemaker, pcsd) are active (started) and enabled.
Configuring Asterisk HA solution with Virtual IP
Now that the cluster is ready, we can add resources (Virtual IP, Asterisk, httpd, opensips, etc). In this section we will show how to add the Virtual IP and Asterisk resources.
First, let's add the Virtual IP resource. Do not forget to replace the ip value and nic name with values from your setup.
[root@node01 ~]# pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.0.205 cidr_netmask=32 nic=enp0s3 op monitor interval=30s
[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:49:56 2018
Last change: Wed Oct 17 07:49:51 2018 by root via cibadmin on node01
2 nodes configured
1 resource configured
Online: [ node01 node02 ]
Full list of resources:
VirtualIP (ocf::heartbeat:IPaddr2): Started node01
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
The ip command should also confirm that the Virtual IP has been assigned to the interface:
[root@node01 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:90:37:9c brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.152/24 brd 192.168.0.255 scope global noprefixroute dynamic enp0s3
       valid_lft 564sec preferred_lft 564sec
    inet 192.168.0.205/32 brd 192.168.0.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::eb74:dc5d:cdd:df23/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
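Instead of reading the ip output by eye, the check can be scripted. This is a minimal sketch: the helper name has_ipv4 is hypothetical, and the address and interface in the usage comment are the example values from this page.

```shell
# Succeeds if the IPv4 address given as $1 appears in `ip addr show`
# output read from stdin. The trailing "/" in the fixed-string pattern
# keeps 192.168.0.20 from matching 192.168.0.205.
has_ipv4() {
    grep -qF "inet $1/"
}

# Possible usage on a cluster node (not executed here):
#   ip addr show dev enp0s3 | has_ipv4 192.168.0.205 && echo "Virtual IP is on this node"
```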
Once the Virtual IP resource is set up correctly, it is time to add the Asterisk resource. The install script places the Asterisk resource agent in the directory /usr/lib/ocf/resource.d/heartbeat
[root@node01 ~]# pcs resource create asterisk ocf:heartbeat:asterisk op monitor timeout="30"
Now add colocation so that both resources will start on the same node, and set ordering (so that VirtualIP starts before Asterisk):
[root@node01 ~]# pcs constraint colocation add asterisk with VirtualIP score=INFINITY
[root@node01 ~]# pcs constraint order VirtualIP then asterisk
Adding VirtualIP asterisk (kind: Mandatory) (Options: first-action=start then-action=start)
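To confirm that the colocation constraint has the intended effect, we can check that every started resource runs on the same node. The sketch below assumes resource lines of the form "NAME (AGENT): Started NODE", as printed by pcs resource show in the pcs version used on this page. The function name same_node is made up for this example.

```shell
# Succeeds if all resources reported as "Started" on stdin run on
# exactly one node. The node name is taken as the word that follows
# "Started", so trailing markers such as "(unmanaged)" are ignored.
same_node() {
    awk '
        { for (i = 1; i < NF; i++) if ($i == "Started") nodes[$(i + 1)] = 1 }
        END { n = 0; for (k in nodes) n++; exit (n == 1) ? 0 : 1 }
    '
}

# Possible usage on a cluster node (not executed here):
#   pcs resource show | same_node && echo "resources are colocated"
```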
Stickiness, constraints and moving resources
If one node fails, resources will be moved to the other node. After the first node recovers, should we move resources back to it or leave them running on the second one? This can be controlled using stickiness: if we set stickiness > 0, Pacemaker will prefer to leave resources running where they are and avoid moving them between nodes:
[root@node01 ~]# pcs resource defaults resource-stickiness=100
Warning: Defaults do not apply to resources which override them with their own defined values
[root@node01 ~]# pcs resource defaults
resource-stickiness: 100
[root@node01 ~]#
With this setting, resources should not be moved back to the original node after it recovers, for example after a reboot.
It is possible to move a resource manually using the pcs resource move command. For example, let's move resources from the current node to the other one:
[root@node01 ~]# pcs resource show
VirtualIP (ocf::heartbeat:IPaddr2): Started node01
asterisk (ocf::heartbeat:asterisk): Started node01
[root@node01 ~]# pcs resource move VirtualIP
Warning: Creating location constraint cli-ban-VirtualIP-on-node01 with a score of -INFINITY for resource VirtualIP on node node01.
This will prevent VirtualIP from running on node01 until the constraint is removed. This will be the case even if node01 is the last node in the cluster.
[root@node01 ~]#
[root@node01 ~]# pcs resource show
VirtualIP (ocf::heartbeat:IPaddr2): Started node02
asterisk (ocf::heartbeat:asterisk): Started node02
As you can see, the resources have been moved (as asterisk depends on VirtualIP, it is enough to move the VirtualIP resource); however, a new location constraint has been created (cli-ban-VirtualIP-on-node01). We can check constraints this way:
[root@node01 ~]# pcs constraint --full
Location Constraints:
  Resource: VirtualIP
    Disabled on: node01 (score:-INFINITY) (role: Started) (id:cli-ban-VirtualIP-on-node01)
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:
Now, even if node02 fails, resources will not be moved to node01. To remove the constraint, we can use the command pcs constraint remove <constraint-id>:
[root@node01 ~]# pcs constraint remove cli-ban-VirtualIP-on-node01
[root@node01 ~]# pcs constraint --full
Location Constraints:
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:
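Constraints left behind by manual moves are easy to forget. As a rough sketch, the helper below lists the ids of such constraints so they can be reviewed and removed. It assumes these ids start with "cli-", as in the cli-ban-VirtualIP-on-node01 example above. The function name cli_constraint_ids is hypothetical.

```shell
# Prints the ids of constraints whose id starts with "cli-",
# one per line. Feed it the output of `pcs constraint --full`.
cli_constraint_ids() {
    grep -o 'id:cli-[^)]*' | cut -d: -f2
}

# Possible usage on a cluster node (not executed here):
#   pcs constraint --full | cli_constraint_ids | while read -r id; do
#       pcs constraint remove "$id"
#   done
```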
Maintenance mode
There are times when we need to stop, inspect or do other maintenance work on resources without interference from the cluster management software. We can achieve this by putting the cluster in maintenance mode:
[root@node01 ~]# pcs property set maintenance-mode=true
[root@node01 ~]# pcs property | grep -i maintenance
maintenance-mode: true
[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 10:45:24 2018
Last change: Wed Oct 17 10:39:18 2018 by root via cibadmin on node01
2 nodes configured
2 resources configured
*** Resource management is DISABLED ***
The cluster will not attempt to start, stop or recover services
Online: [ node01 node02 ]
Full list of resources:
VirtualIP (ocf::heartbeat:IPaddr2): Started node02 (unmanaged)
asterisk (ocf::heartbeat:asterisk): Started node02 (unmanaged)
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
To return the cluster to normal operation, simply set maintenance mode to false:
[root@node01 ~]# pcs property set maintenance-mode=false
[root@node01 ~]# pcs property | grep -i maintenance
maintenance-mode: false
[root@node01 ~]#
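It is easy to forget to switch maintenance mode off again. One possible safeguard is a small wrapper that turns it on, runs a single command, and turns it off afterwards. This is only a sketch: the function name with_maintenance is made up for this example, and the Asterisk command in the usage comment is purely illustrative.

```shell
# Runs the given command with the cluster in maintenance mode and
# switches maintenance mode off again afterwards, even if the
# command fails. Returns the exit status of the command.
with_maintenance() {
    local rc=0
    pcs property set maintenance-mode=true || return 1
    "$@" || rc=$?
    pcs property set maintenance-mode=false
    return "$rc"
}

# Possible usage on a cluster node (not executed here):
#   with_maintenance asterisk -rx "core reload"
```

The wrapper does not handle the shell being killed halfway through, so it is still worth checking pcs status afterwards.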