High Availability Configuration with Pacemaker and Corosync

From Kolmisoft Wiki
Revision as of 11:39, 27 March 2019 by Gilbertas (talk | contribs)

Work In Progress...

Description



Requirements



Install

Run the commands below on both nodes to install Pacemaker and Corosync:

svn update /usr/src/mor/sh_scripts
/usr/src/mor/sh_scripts/corosync/install.sh

Copy the password file from node01 to node02:

[root@node01 ~]#  scp /root/hacluster_password root@node02:/root/hacluster_password 

And apply the password on node02:

[root@node02 ~]# echo CHANGEME > /root/hacluster_password; chmod 600 /root/hacluster_password




Configuration

All examples assume two nodes with hostnames node01 and node02, reachable by both hostname and IP address:

  • node01 - 192.168.0.152
  • node02 - 192.168.0.200

192.168.0.205 is the Virtual IP address.

The command line examples that follow use this convention:

  • [root@node01 ~]# denotes a command which should be run on 'ONE' server in the cluster.
  • [root@ALL ~]# denotes a command which should be run on 'ALL' servers (node01 and node02) in the cluster.

You should replace hostnames and IP addresses to match your setup.


The installation script installs all required packages and configuration files. First, let's set up cluster authentication.

Copy the password file from node01 to node02:

[root@node01 ~]#  scp /root/hacluster_password root@node02:/root/hacluster_password 

Then apply the password to the hacluster user on node02:

[root@node02 ~]# cat /root/hacluster_password | passwd --stdin hacluster

Now we can authenticate the cluster nodes:

[root@node01 ~]#  pcs cluster auth node01 node02 -u hacluster -p $(cat /root/hacluster_password)
node02: Authorized
node01: Authorized
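
The expected result (one Authorized line per node) can also be checked in a script. The following is a minimal sketch, assuming the output format shown above; the function name all_authorized is hypothetical and not part of pcs:

```shell
# Hypothetical helper (not part of pcs): reads `pcs cluster auth` output
# on stdin and succeeds only if every node given as an argument has its
# own "<node>: Authorized" line.
# Usage: pcs cluster auth ... | all_authorized node01 node02
all_authorized() {
    output=$(cat)
    for node in "$@"; do
        # -x matches the whole line, -F treats the node name literally.
        printf '%s\n' "$output" | grep -qxF "${node}: Authorized" || return 1
    done
    return 0
}
```

The function returns a non-zero status as soon as one node is missing or reports anything other than Authorized.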

If you get any other output, something went wrong and you should not proceed until it is fixed. If everything is OK, we can set up the cluster:

[root@node01 ~]# pcs cluster setup --name cluster_asterisk node01 node02
Destroying cluster on nodes: node01, node02...
node01: Stopping Cluster (pacemaker)...
node02: Stopping Cluster (pacemaker)...
node02: Successfully destroyed cluster
node01: Successfully destroyed cluster
Sending 'pacemaker_remote authkey' to 'node01', 'node02'
node01: successful distribution of the file 'pacemaker_remote authkey'
node02: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node01: Succeeded
node02: Succeeded
Synchronizing pcsd certificates on nodes node01, node02...
node02: Success
node01: Success
Restarting pcsd on the nodes in order to reload the certificates...
node02: Success
node01: Success
[root@node01 ~]# 

If everything went OK, there should be no errors in the output. If this is the case, let's start the cluster:

[root@node01 ~]# pcs cluster start --all
node02: Starting Cluster...
node01: Starting Cluster...
[root@node01 ~]# 

This will automatically start Corosync and Pacemaker services on both nodes. Now let's check if Corosync is happy and there are no errors (issue this command on both nodes separately):

[root@node01 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
       id	= 192.168.0.152
       status	= ring 0 active with no faults
[root@node02 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
     id	= 192.168.0.200
     status	= ring 0 active with no faults
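
Because this check has to be repeated on every node, it may be convenient to script it. A minimal sketch, assuming the output format shown above, where each ring has a "status = ... active with no faults" line; the function name rings_healthy is hypothetical:

```shell
# Hypothetical helper: reads `corosync-cfgtool -s` output on stdin and
# succeeds only if at least one ring status line is present and every
# status line says "active with no faults".
# Usage: corosync-cfgtool -s | rings_healthy
rings_healthy() {
    # Keep only the "status = ..." lines; the "Printing ring status."
    # header has no "=" and is therefore skipped.
    status_lines=$(grep 'status.*=' || true)
    [ -n "$status_lines" ] || return 1
    # Any status line without the healthy text means a problem.
    if printf '%s\n' "$status_lines" | grep -qv 'active with no faults'; then
        return 1
    fi
    return 0
}
```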

If you see different output, you should investigate before proceeding. Now let's check the membership and quorum APIs. You should see both nodes with status joined:

[root@node01 ~]# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.0.152) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.0.200) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
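
The number of joined members can be counted instead of read by eye. A minimal sketch, assuming the key layout shown above (runtime.totem.pg.mrp.srp.members.<id>.status); the function name joined_count is hypothetical:

```shell
# Hypothetical helper: reads `corosync-cmapctl | grep members` output on
# stdin and prints how many members have status "joined".
# Usage: corosync-cmapctl | grep members | joined_count
joined_count() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function status at 0.
    grep -c 'members\.[0-9][0-9]*\.status (str) = joined' || true
}
```

For the two-node cluster in this guide the expected value is 2.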


Now disable STONITH and set the no-quorum policy to ignore:

[root@node01 ~]# pcs property set stonith-enabled=false
[root@node01 ~]# pcs property set no-quorum-policy=ignore
[root@node01 ~]# pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: cluster_asterisk
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false


Finally, let's check the cluster status:

[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:34:20 2018
Last change: Wed Oct 17 07:32:39 2018 by root via cibadmin on node01
2 nodes configured
0 resources configured
Online: [ node01 node02 ]
No resources
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


We can see that both nodes are online and that all daemons (corosync, pacemaker, pcsd) are active (started) and enabled.
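
The same check can be done from a script by looking at the "Online:" line of the status output. A minimal sketch, assuming the "Online: [ node01 node02 ]" format shown above; the function name nodes_online is hypothetical:

```shell
# Hypothetical helper: reads `pcs status` output on stdin and succeeds
# only if every node given as an argument appears in the "Online:" line.
# Usage: pcs status | nodes_online node01 node02
nodes_online() {
    online=$(grep '^Online:' | head -n 1)
    for node in "$@"; do
        case " $online " in
            *" $node "*) ;;      # node is listed as online
            *) return 1 ;;       # node missing from the Online line
        esac
    done
    return 0
}
```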

Configuring Asterisk HA solution with Virtual IP

Now that the cluster is ready, we can add resources (Virtual IP, Asterisk, httpd, opensips, etc). In this section we will show how to add the Virtual IP and Asterisk resources.

First, let's add the Virtual IP resource. Do not forget to replace the ip value and the nic name with values from your setup.

[root@node01 ~]# pcs resource create VirtualIP  ocf:heartbeat:IPaddr2 ip=192.168.0.205 cidr_netmask=32 nic=enp0s3 op monitor interval=30s
[root@node01 ~]# pcs status
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 07:49:56 2018
Last change: Wed Oct 17 07:49:51 2018 by root via cibadmin on node01
2 nodes configured
1 resource configured
Online: [ node01 node02 ]
Full list of resources:
 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node01
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


The ip command should also confirm that the Virtual IP has been assigned to the interface:

[root@node01 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:90:37:9c brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.152/24 brd 192.168.0.255 scope global noprefixroute dynamic enp0s3
       valid_lft 564sec preferred_lft 564sec
    inet 192.168.0.205/32 brd 192.168.0.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet6 fe80::eb74:dc5d:cdd:df23/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
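
To find out from a script which node currently holds the Virtual IP, the ip output can be searched for the address. A minimal sketch, assuming the address is listed with a prefix length as in the output above ("inet 192.168.0.205/32"); the function name has_address is hypothetical:

```shell
# Hypothetical helper: reads `ip addr show` output on stdin and succeeds
# if the given IPv4 address is assigned to some interface.
# Usage: ip addr show | has_address 192.168.0.205
has_address() {
    # Search for the literal text "inet <address>/" so that
    # 192.168.0.20 does not match 192.168.0.205, and inet6 lines
    # are ignored.
    grep -qF "inet $1/"
}
```

Run on both nodes, this should succeed only on the node where the VirtualIP resource is started.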

Once the Virtual IP resource is set up correctly, it is time to add the Asterisk resource. The install script places the Asterisk resource agent in the directory /usr/lib/ocf/resource.d/heartbeat:

[root@node01 ~]# pcs resource create asterisk ocf:heartbeat:asterisk op monitor timeout="30"

Now add a colocation constraint so that both resources start on the same node, and set ordering (so that VirtualIP starts before Asterisk):

[root@node01 ~]# pcs constraint colocation add asterisk with VirtualIP score=INFINITY 
[root@node01 ~]# pcs constraint order VirtualIP then asterisk 
Adding VirtualIP asterisk (kind: Mandatory) (Options: first-action=start then-action=start)

Stickiness, constraints and moving resources

If one node fails, its resources are moved to the other node. After the first node recovers, should the resources move back to it or stay on the second node? This is controlled by stickiness: a stickiness greater than 0 means that Pacemaker prefers to leave resources where they are running and avoids moving them between nodes:

[root@node01 ~]#  pcs resource defaults resource-stickiness=100
Warning: Defaults do not apply to resources which override them with their own defined values
[root@node01 ~]# pcs resource defaults
resource-stickiness: 100
[root@node01 ~]# 

With this setting, resources should not be moved back to the original node when it returns, for example after a reboot.

A resource can be moved manually with the pcs resource move command. For example, let's move the resources from the current node to the other one:

[root@node01 ~]# pcs resource show
 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node01
 asterisk	(ocf::heartbeat:asterisk):	Started node01
[root@node01 ~]# pcs resource move VirtualIP
Warning: Creating location constraint cli-ban-VirtualIP-on-node01 with a score of -INFINITY for resource VirtualIP on node node01.
This will prevent VirtualIP from running on node01 until the constraint is removed. This will be the case even if node01 is the last node in the cluster.
[root@node01 ~]# 
[root@node01 ~]# pcs resource show
VirtualIP	(ocf::heartbeat:IPaddr2):	Started node02
asterisk	(ocf::heartbeat:asterisk):	Started node02

As you can see, the resources have been moved (because asterisk depends on VirtualIP, it is enough to move the VirtualIP resource). However, a new location constraint has been created (cli-ban-VirtualIP-on-node01). We can check constraints this way:

[root@node01 ~]#  pcs constraint --full
Location Constraints:
  Resource: VirtualIP
    Disabled on: node01 (score:-INFINITY) (role: Started) (id:cli-ban-VirtualIP-on-node01)
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:

Now, even if node02 fails, the resources will not be moved to node01. To remove the constraint, use the command pcs constraint remove <constraint-id>:

[root@node01 ~]# pcs constraint remove cli-ban-VirtualIP-on-node01
[root@node01 ~]#  pcs constraint --full
Location Constraints:
Ordering Constraints:
  start VirtualIP then start asterisk (kind:Mandatory) (id:order-VirtualIP-asterisk-mandatory)
Colocation Constraints:
  asterisk with VirtualIP (score:INFINITY) (id:colocation-asterisk-VirtualIP-INFINITY)
Ticket Constraints:
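
Constraints left behind by manual moves are easy to forget. A minimal sketch for finding them, assuming they can be recognised by an id starting with "cli-" as in the example above; the function name cli_constraint_ids is hypothetical:

```shell
# Hypothetical helper: reads `pcs constraint --full` output on stdin and
# prints the ids of constraints whose id starts with "cli-", one per line.
# Usage: pcs constraint --full | cli_constraint_ids
cli_constraint_ids() {
    # Capture the text between "(id:" and ")" when it begins with "cli-".
    sed -n 's/.*(id:\(cli-[^)]*\)).*/\1/p'
}
```

Each printed id can then be passed to pcs constraint remove, as shown above.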

Maintenance mode

There are times when we need to stop, inspect, or do other maintenance work on resources without interference from the cluster management software. We can achieve this by putting the cluster in maintenance mode:

[root@node01 ~]# pcs property set maintenance-mode=true
[root@node01 ~]# pcs property | grep -i maintenance
 maintenance-mode: true
[root@node01 ~]# pcs status 
Cluster name: cluster_asterisk
Stack: corosync
Current DC: node02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Oct 17 10:45:24 2018
Last change: Wed Oct 17 10:39:18 2018 by root via cibadmin on node01

2 nodes configured
2 resources configured

             *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ node01 node02 ] 

Full list of resources:

 VirtualIP	(ocf::heartbeat:IPaddr2):	Started node02 (unmanaged)
 asterisk	(ocf::heartbeat:asterisk):	Started node02 (unmanaged)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


To return the cluster to normal operation, simply set maintenance mode to false:

[root@node01 ~]# pcs property set maintenance-mode=false
[root@node01 ~]# pcs property | grep -i maintenance
 maintenance-mode: false
[root@node01 ~]#
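
Since it is easy to forget to switch maintenance mode off again, the two commands can be wrapped around the maintenance task. A minimal sketch using only the pcs property set commands shown above; the function name with_maintenance and the script path in the usage line are hypothetical:

```shell
# Hypothetical wrapper: enables maintenance mode, runs the given command,
# then disables maintenance mode again and returns the command's status.
# Usage (the script path is a placeholder):
#   with_maintenance /root/my_maintenance_task.sh
with_maintenance() {
    pcs property set maintenance-mode=true || return 1
    "$@"
    status=$?
    # Hand control back to the cluster even if the command failed.
    pcs property set maintenance-mode=false || return 1
    return "$status"
}
```

If the second pcs call fails, the function returns a non-zero status and the cluster stays in maintenance mode, so the result should still be confirmed with pcs property.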