High Availability Configuration with Pacemaker and Corosync

From Kolmisoft Wiki
Revision as of 13:09, 24 November 2022 by Gilbertas

Description

This tutorial will demonstrate how you can use Corosync and Pacemaker with a Virtual IP to create a high availability server solution in MOR.

Requirements

  • All public IP addresses have to be on the same subnet.
  • The Virtual IP has to be free (not assigned to any device on the network). The Virtual IP will be managed by Corosync itself.
  • The following ports have to be open between the nodes in the cluster:
    • UDP 5404, 5405, 5406 - used by Corosync
    • TCP 2224 - used by the pcsd service
    • TCP 3121 - used by Pacemaker
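As a sketch (assuming firewalld is in use; adapt for iptables/nftables setups), the commands to open these ports can be generated like this, then reviewed and run on both nodes:

```shell
# Sketch only: print the firewalld commands that would open the cluster ports.
# Review the output, then run it (or pipe to sh) on both nodes.
for p in 5404/udp 5405/udp 5406/udp 2224/tcp 3121/tcp; do
    echo "firewall-cmd --permanent --add-port=$p"
done
echo "firewall-cmd --reload"
```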



Install

Run the commands below on both nodes to install Pacemaker and Corosync:

svn update /usr/src/k_framework/
/usr/src/k_framework/helpers/corosync/corosync_install.sh
systemctl enable corosync
systemctl enable pacemaker

If the /usr/src/k_framework/ directory is not present, check it out manually:

svn co http://svn.kolmisoft.com/k_framework/ /usr/src/k_framework



Configuration

Firstly, we need to configure hostnames on the servers. You can find the current hostname with the command:

uname -n

We recommend using the hostname node01 for the main server and node02 for the backup server; however, you can use different hostnames if you wish. You can set the hostname on a server using the command below:

hostnamectl set-hostname your-new-hostname

If you use different hostnames, please replace node01 and node02 with the actual hostnames in the configuration examples below.

Once the hostname setup is complete, open the file /etc/hosts on the main server; you will see something like this:

192.168.0.131 node01 #This is example. Change to correct IP
192.168.0.132 node02 #Change to correct IP here as well

Replace the IP addresses with the actual IPs of both servers. If the server hostnames are not node01/node02, replace them with the actual hostnames of the servers.

Repeat this procedure on backup server too.

All examples assume that there are two nodes with hostnames node01 and node02 and they are reachable by their hostnames and IP addresses:

  • node01 - 192.168.0.152
  • node02 - 192.168.0.200

192.168.0.205 is Virtual IP address.

Also, in all following command line examples, the convention is this:

  • [root@node01 ~]# denotes a command which should be run on 'ONE' server in the cluster.
  • [root@ALL ~]# denotes a command which should be run on 'ALL' servers (node01 and node02) in the cluster.

You should replace hostnames and IP addresses to match your setup.

Authenticate and Setup Cluster

The installation script will install all needed packages and configuration files. Firstly, let's set up cluster authentication.

Copy password from node01 to node02

[root@node01 ~]#  scp /root/hacluster_password root@node02:/root/hacluster_password 

And apply password on node02

[root@node02 ~]# cat /root/hacluster_password | passwd --stdin hacluster

Now we can authenticate cluster:

[root@node01 ~]#  pcs cluster auth node01 node02 -u hacluster -p $(cat /root/hacluster_password)
node02: Authorized
node01: Authorized

If you get any other output, something went wrong and you should not proceed until it is fixed. If everything is OK, we can set up the cluster:

[root@node01 ~]# pcs cluster setup --name cluster_asterisk node01 node02
Destroying cluster on nodes: node01, node02...
node01: Stopping Cluster (pacemaker)...
node02: Stopping Cluster (pacemaker)...
node02: Successfully destroyed cluster
node01: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'node01', 'node02'
node01: successful distribution of the file 'pacemaker_remote authkey'
node02: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node01: Succeeded
node02: Succeeded
Synchronizing pcsd certificates on nodes node01, node02...
node02: Success
node01: Success
Restarting pcsd on the nodes in order to reload the certificates...
node02: Success
node01: Success
[root@node01 ~]# 

If everything went OK, there should be no errors in the output. If so, let's start the cluster:

[root@node01 ~]# pcs cluster start --all
node02: Starting Cluster...
node01: Starting Cluster...
[root@node01 ~]# 

This will automatically start Corosync and Pacemaker services on both nodes. Now let's check if Corosync is happy and there are no errors (issue this command on both nodes separately):

[root@node01 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
       id	= 192.168.0.152
       status	= ring 0 active with no faults
[root@node02 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
     id	= 192.168.0.200
     status	= ring 0 active with no faults

If you see different output, you should investigate before proceeding. Now let's check the membership and quorum APIs; you should see both nodes with status joined:

[root@node01 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.0.152) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.0.200) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined


Now disable STONITH and ignore the quorum policy:

[root@node01 ~]# pcs property set stonith-enabled=false
[root@node01 ~]# pcs property set no-quorum-policy=ignore
[root@node01 ~]# pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster_asterisk
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false


Finally, let's check the cluster status:

[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:34:20 2018
Last change: Wed Oct 17 07:32:39 2018 by root via cibadmin on node01
2 nodes configured
0 resources configured
Online: [ node01 node02 ]
No resources
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


We can see that both nodes are online and all daemons (corosync, pacemaker, pcsd) are active.

Configuring Asterisk HA solution with Virtual IP

Before proceeding we need to prepare Asterisk to be managed by Corosync with Virtual IP.

1. Make sure that Asterisk service is stopped and disabled on both nodes.

[root@ALL ~]# systemctl disable asterisk
[root@ALL ~]# systemctl stop asterisk

2. Change bindaddr from 0.0.0.0 to the Virtual IP in /etc/asterisk/sip.conf
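A minimal sketch of this edit, shown against a temporary copy of sip.conf (run the same sed against the real /etc/asterisk/sip.conf on both nodes, substituting your own Virtual IP for the example one):

```shell
# Sketch: switch bindaddr to the Virtual IP. A temporary file stands in
# for /etc/asterisk/sip.conf; 192.168.0.205 is the example Virtual IP.
SIP_CONF=$(mktemp)
printf '[general]\nbindaddr=0.0.0.0\nbindport=5060\n' > "$SIP_CONF"
sed -i 's/^bindaddr=0\.0\.0\.0/bindaddr=192.168.0.205/' "$SIP_CONF"
grep '^bindaddr' "$SIP_CONF"   # confirm the change took effect
```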

3. Add Virtual IP in file /etc/asterisk/manager.conf

permit=192.168.0.205/255.255.255.0

4. Add the Virtual IP in the GUI Servers page and the Global Settings page.

5. In the file /etc/mor/system.conf, add the variable VIRTUAL_IP with the correct virtual IP:

VIRTUAL_IP=xx.xx.xx.xx 

This variable will be used by check scripts.
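As an illustration of how a check script might consume this variable (a sketch only; a temporary file stands in for /etc/mor/system.conf, and the actual check scripts may work differently):

```shell
# Sketch: read VIRTUAL_IP from the config, then test whether this node
# currently holds the Virtual IP on any interface.
CONF=$(mktemp)                               # stand-in for /etc/mor/system.conf
echo 'VIRTUAL_IP=192.168.0.205' > "$CONF"
VIRTUAL_IP=$(grep '^VIRTUAL_IP=' "$CONF" | cut -d= -f2)
if ip addr show 2>/dev/null | grep -q "inet ${VIRTUAL_IP}/"; then
    echo "This node holds the Virtual IP (${VIRTUAL_IP})"
else
    echo "Virtual IP ${VIRTUAL_IP} is not assigned on this node"
fi
```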

Adding VirtualIP resource

Now that the cluster is ready, we can add resources (Virtual IP, Asterisk, httpd, opensips, etc.). In this section we will show how to add the Virtual IP and Asterisk resources.

Firstly, let's add the Virtual IP resource. Do not forget to replace the ip and cidr_netmask values and the nic name with values from your setup. Use the subnet of the main IP for cidr_netmask. Also, make sure that the nic name is the same on both servers; if it differs, leave the nic parameter out and the system will find a suitable nic for the Virtual IP automatically.


[root@node01 ~]# pcs resource create VirtualIP  ocf:heartbeat:IPaddr2 ip=192.168.0.205 cidr_netmask=32 nic=enp0s3 op monitor interval=30s
[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:49:56 2018
Last change: Wed Oct 17 07:49:51 2018 by root via cibadmin on node01
2 nodes configured
1 resource configured
Online: [ node01 node02 ]
Full list of resources:
 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node01
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Now let's confirm that Virtual IP has indeed been assigned to interface:

[root@node01 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:90:37:9c brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.152/24 brd 192.168.0.255 scope global noprefixroute dynamic enp0s3
       valid_lft 564sec preferred_lft 564sec
    inet 192.168.0.205/24 brd 192.168.0.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::eb74:dc5d:cdd:df23/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

If the IP is not assigned, recheck that the nic interface is correct. If the nic parameter was not used, check the output of this command:

ip -o -f inet route list match 1.2.3.4 scope link

Where 1.2.3.4 is the Virtual IP. If the output is empty, the system cannot find the interface automatically; most likely the main IP/mask is set incorrectly.
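As a sketch of what this lookup does, the chosen interface is the "dev" field of the matching route entry; here it is extracted from a sample line mimicking typical `ip -o -f inet route` output (replace the sample with your real output):

```shell
# Sketch: pull the interface name out of a route entry, roughly the way
# the automatic nic detection works. ROUTE is a hardcoded sample line.
ROUTE='192.168.0.0/24 dev enp0s3 proto kernel scope link src 192.168.0.152'
NIC=$(echo "$ROUTE" | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1)}')
echo "Virtual IP would be placed on interface: $NIC"
```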

Adding Asterisk resource

Once the Virtual IP resource is set up correctly, it is time to add the Asterisk resource (the install script places the Asterisk resource agent in /usr/lib/ocf/resource.d/heartbeat/).

[root@node01 ~]# pcs resource create asterisk ocf:heartbeat:asterisk op monitor timeout="30"

Now add a colocation constraint so that both resources start on the same node, and set ordering (so that VirtualIP starts before Asterisk):

[root@node01 ~]# pcs constraint colocation add asterisk with VirtualIP score=INFINITY 
[root@node01 ~]# pcs constraint order VirtualIP then asterisk 
Adding VirtualIP asterisk (kind: Mandatory) (Options: first-action=start then-action=start)

Let's check if Asterisk is running:

[root@node01 ~]# pcs resource show
 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node01
 asterisk	(ocf::heartbeat:asterisk):	Started node01

Now make node01 the preferred one. This means that if node01 fails and resources are moved to node02, node01 will reclaim the resources once it is back up.

 [root@node01 ~]# pcs resource defaults resource-stickiness=0
 [root@node01 ~]# pcs constraint location asterisk prefers node01=50

Run once again:

 systemctl enable corosync
 systemctl enable pacemaker

pcs status should show

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled


Reboot the servers to make sure that resources switch correctly and services start as needed.

Configuring KSR HA solution with Virtual IP

1. Install Failover software as described here.

2. Configure the nodes, then authenticate and set up the cluster as described in the Authenticate and Setup Cluster section above.

3. On all failover servers, in the file /etc/m2/system.conf, add the variable VIRTUAL_IP with the correct virtual IP:

VIRTUAL_IP=xx.xx.xx.xx

4. Virtual IP should be in the GUI servers list. Physical IPs should be added too.

5. Add Virtual IP resource.

6. Make the main node the preferred node for VirtualIP:

[root@node01 ~]# pcs resource defaults resource-stickiness=0
[root@node01 ~]# pcs constraint location VirtualIP prefers node01=50

7. Run /usr/src/m4/check.sh

[root@ALL ~]# /usr/src/m4/check.sh

8. Make changes as indicated by the check.

9. Manually grep for the physical server IPs:

[root@ALL ~]# grep -F 'physical IP' /etc/kamailio/kamailio.cfg /etc/sysconfig/rtpengine /usr/local/etc/sems/etc/b2bua_topology_hiding.sbcprofile.conf /usr/local/etc/sems/sems.conf

If grep finds anything, replace the IPs manually with the Virtual IP.
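A sketch of such a replacement, shown against a temporary file (back up and edit the real files listed in the grep; the IPs below are the example values from this page - use your own):

```shell
# Sketch: swap a physical IP for the Virtual IP in a config file,
# keeping a backup first. A temporary file stands in for the real configs.
PHYS_IP=192.168.0.152            # example physical IP
VIP=192.168.0.205                # example Virtual IP
F=$(mktemp)                      # stand-in for e.g. /etc/kamailio/kamailio.cfg
echo "listen=udp:${PHYS_IP}:5060" > "$F"
cp "$F" "$F.bak"                 # backup before editing
sed -i "s/${PHYS_IP}/${VIP}/g" "$F"
cat "$F"                         # verify the replacement
```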

10. Stop and disable services

[root@ALL ~]# systemctl disable kamailio
[root@ALL ~]# systemctl stop kamailio
[root@ALL ~]# systemctl disable sems
[root@ALL ~]# systemctl stop sems
[root@ALL ~]# systemctl disable rtpengine
[root@ALL ~]# systemctl stop rtpengine

11. Create resources for rtpengine, sems, kamailio:

pcs resource create rtpengine systemd:rtpengine
pcs resource create sems systemd:sems
pcs resource create kamailio systemd:kamailio

12. Create a new resource group (for example named ksr) and add all resources (including VirtualIP) into it. Resources in a group are started sequentially and stopped in the reverse order.

pcs resource group add ksr VirtualIP rtpengine sems kamailio

13. Check the output of pcs status. If any resource fails to start, subsequent resources will not start either. In that case, you will see an error like this (here shown for rtpengine) in the pcs status output:

Resource Group: ksr
    VirtualIP  (ocf::heartbeat:IPaddr2):       Started ksr-45
    rtpengine  (systemd:rtpengine):    Stopped
    sems       (systemd:sems): Stopped
    kamailio   (systemd:kamailio):     Stopped
Failed Resource Actions:
* rtpengine_start_0 on ksr-45 'unknown error' (1): call=20, status=complete, exitreason=,
    last-rc-change='Fri Nov 18 11:07:54 2022', queued=0ms, exec=2060ms

To fix this, check the failed resource's logs to determine the reason (for example, a wrong IP), fix it, and clear pcs with the command:

pcs resource cleanup rtpengine

(use the appropriate resource names for your scenario)

Once the error has been fixed, pcs will try to start the service again, along with all subsequent services:

Resource Group: ksr
    VirtualIP  (ocf::heartbeat:IPaddr2):       Started ksr-45
    rtpengine  (systemd:rtpengine):    Started ksr-45
    sems       (systemd:sems): Started ksr-45
    kamailio   (systemd:kamailio):     Started ksr-45

14. If all services are running, make a test call, check the SIP trace, etc.

15. To test configuration on the backup node, use this command on the main node:

 [root@node01 ~]# pcs cluster stop

And check the output on node02

[root@node02 ~]# pcs status

If there are any errors, fix them as described in step 13. If all services are running, make a test call, check the SIP trace, etc.

16. To move services to the main node, start the cluster again on the main node:

[root@node01 ~]# pcs cluster start

and check the status:

 [root@node01 ~]# pcs status

17. Enable corosync and pacemaker again on both nodes:

[root@ALL ~]# systemctl enable corosync
[root@ALL ~]# systemctl enable pacemaker


Stickiness, constraints and moving resources

If one node fails, resources will be moved to the other node. After the first node recovers, should we move the resources back or leave them running on the second node? This can be controlled with stickiness: if we set stickiness higher than the prefer score, Pacemaker will prefer to leave resources running and avoid moving them between nodes:

[root@node01 ~]#  pcs resource defaults resource-stickiness=100
Warning: Defaults do not apply to resources which override them with their own defined values
[root@node01 ~]# pcs resource defaults
resource-stickiness: 100
[root@node01 ~]# 

With this setting, resources will not be moved back to the original node if, for example, that node is rebooted.

It is possible to move a resource manually using the pcs resource move command. For example, let's move resources from the current node to the other one:

[root@node01 ~]# pcs resource show
 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node01
 asterisk	(ocf::heartbeat:asterisk):	Started node01
[root@node01 ~]# pcs resource move VirtualIP
Warning: Creating location constraint cli-ban-VirtualIP-on-node01 with a score of -INFINITY for resource VirtualIP on node node01.
This will prevent VirtualIP from running on node01 until the constraint is removed. This will be the case even if node01 is the last node in the cluster.
[root@node01 ~]# 
[root@node01 ~]# pcs resource show
VirtualIP	(ocf::heartbeat:IPaddr2):	Started node02
asterisk	(ocf::heartbeat:asterisk):	Started node02

As you can see, the resources have been moved (as asterisk depends on VirtualIP, it is enough to move the VirtualIP resource); however, a new location constraint has been created (cli-ban-VirtualIP-on-node01). We can check constraints this way:

[root@node01 ~]#  pcs constraint --full
Location Constraints:
  Resource: VirtualIP
    Disabled on: node01 (score:-INFINITY) (role: Started) (id:cli-ban-VirtualIP-on-node01)
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:

Now even if node02 fails, resources will not be moved to node01. To remove the constraint, we can use the command pcs constraint remove <constraint-id>:

[root@node01 ~]# pcs constraint remove cli-ban-VirtualIP-on-node01
[root@node01 ~]#  pcs constraint --full
Location Constraints:
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:

Maintenance mode

There are times when we need to stop, inspect, or do other maintenance work on resources without interference from the cluster management software. We can achieve this by putting the cluster in maintenance mode:

[root@node01 ~]# pcs property set maintenance-mode=true
[root@node01 ~]# pcs property | grep -i maintenance
 maintenance-mode: true
[root@node01 ~]# pcs status 
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 10:45:24 2018
Last change: Wed Oct 17 10:39:18 2018 by root via cibadmin on node01

2 nodes configured
2 resources configured

             *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ node01 node02 ] 

Full list of resources:

 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node02 (unmanaged)
 asterisk	(ocf::heartbeat:asterisk):	Started node02 (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


To move the cluster back to normal operation, simply set maintenance mode to false:

[root@node01 ~]# pcs property set maintenance-mode=false
[root@node01 ~]# pcs property | grep -i maintenance
 maintenance-mode: false
[root@node01 ~]#

Resource Clean Up

If Asterisk does not start on a node due to errors such as:

pcs status
 VirtualIP      (ocf::heartbeat:IPaddr2):       Started[ mor01 mor02 ]
 asterisk       (ocf::heartbeat:asterisk):      FAILED mor01 (blocked)

or

pcs status
Failed Resource Actions:
* asterisk_start_0 on mor01 'unknown error' (1): call=24, status=Error, exitreason=,
   last-rc-change='Sun Jun  5 03:18:00 2022', queued=0ms, exec=6175ms

try to clean up the Asterisk resource:

[root@mor01 ~]# pcs resource cleanup asterisk 
Cleaned up asterisk on mor02
Cleaned up asterisk on mor01
Waiting for 1 reply from the CRMd. OK

Replacing a Corosync Node

This example is based on a situation where node01 is being replaced.

Disable SELinux.

Make sure the node01 is completely stopped.

Give the new machine the same hostname and IP address as the old one.

Correct /etc/hosts on the node01.

Install the cluster software on node01. Run the commands below on both nodes to install Pacemaker and Corosync:

svn update /usr/src/k_framework/
/usr/src/k_framework/helpers/corosync/corosync_install.sh
systemctl enable corosync
systemctl enable pacemaker

Copy /etc/corosync/corosync.conf from node02 to node01.

Copy password from node02 to node01:

[root@node02 ~]#  scp /root/hacluster_password root@node01:/root/hacluster_password 

And apply password on node01:

[root@node01 ~]# cat /root/hacluster_password | passwd --stdin hacluster

Copy authkey from node02 to node01:

 [root@node02 ~]#  scp /etc/pacemaker/authkey root@node01:/etc/pacemaker/authkey

Restart pcs Daemon on node02:

[root@node02 ~]# systemctl restart pcsd.service

Now we can authenticate cluster:

[root@node01 ~]#  pcs cluster auth node01 node02 -u hacluster -p $(cat /root/hacluster_password)
node02: Authorized
node01: Authorized

If you get any other output, it means something went wrong and you should not proceed until this is fixed.

Remove VirtualIP resource and create it again:

[root@node01 ~]# pcs resource delete VirtualIP
[root@node01 ~]# pcs resource create VirtualIP  ocf:heartbeat:IPaddr2 ip=192.168.0.205 cidr_netmask=32 nic=enp0s3 op monitor interval=30s

Check more details about creating VirtualIP in the topics above.

Now add colocation so that both resources will start at the same node, and set ordering (so that VirtualIP would start before Asterisk):

[root@node01 ~]# pcs constraint colocation add asterisk with VirtualIP score=INFINITY 
[root@node01 ~]# pcs constraint order VirtualIP then asterisk 
Adding VirtualIP asterisk (kind: Mandatory) (Options: first-action=start then-action=start)

Make sure that everything is enabled and working by executing the command:

[root@node01 ~]# pcs status

More Information

This guide is based on the following resources, where more information can be found:

http://www.alexlinux.com/asterisk-high-availability-cluster-with-pacemaker-on-centos-7/

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Overview/

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05.html

http://linux-ha.org/doc/man-pages/re-ra-IPaddr2.html