Thursday, March 7, 2013

Clustering with DRBD, Pacemaker, CentOS 6.x - Frustrations


In the beginning

An important LAMP project had been hosted on a wonderfully stable, active/passive cluster with shared-nothing storage.  The infrastructure consisted of CentOS 5.x as the OS, DRBD 8.2 for replicated storage, and Heartbeat 2 with CRM for cluster messaging and resource management.  Resource-level fencing was provided by DOPD and node fencing via STONITH (IPMI).  Dell PowerEdge R610 servers provided the metal.

Putting the cluster together was a relatively painless exercise; all it took was a smart intern (shout out to Val Komarov, a senior CS major at Virginia Tech) and some time to configure it.  While somewhat limited, the documentation and guides available were enough to get us going with confidence and no kernel panics.


We fear change

Time passed and inevitably the LAMP site was rewritten, prompting a move to CentOS 6.x for developer support.  I wanted to retain all of the cluster functionality, reinventing as little as possible.  How much could possibly change?

Heartbeat, what happened?!

Previously we had made use of Heartbeat 2 for clustering functionality.  The Linux-HA project page provided an easy-to-use implementation guide and a relatively straightforward, unified approach to understanding how the cluster functioned.  Not surprisingly, revisiting the site revealed that most of the functionality had been decoupled and forked into several subprojects, namely Pacemaker, Corosync, Cluster Glue and Resource Agents.  At this point your head might be spinning!

So how do I create a cluster now?  I found most of the relevant information for the new implementation at the Pacemaker site.  Unfortunately, the number of versions, shells and tools listed on the site made it difficult to choose the 'best' path for building the new cluster.  Previously you had little choice but to just install the Heartbeat packages, and possibly opt to use CRM.  Furthermore, it is now recommended (by Pacemaker) to use Corosync instead of Heartbeat for cluster messaging.  Confused yet?

DRBD - 8.3, or 8.4?

I always like to choose stability over features when it comes to disk and data subsystems.  Initially I opted to deploy DRBD 8.3 from a 3rd party repository.  Unfortunately, I found myself removing 8.3 and running 8.4 to quickly resolve kernel panics when implementing resource-level fencing scripts.  This also introduced some minor changes to the DRBD configuration file and the drbdadm tool.
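
One concrete example of the kind of change: drbdadm option passing was reworked between the branches, and the old syncer section options moved into the disk and net sections.  A quick illustration using the forced-primary call (both forms are from the respective DRBD manuals):

#DRBD 8.3: extra options go before the subcommand
drbdadm -- --overwrite-data-of-peer primary data

#DRBD 8.4: options follow the subcommand
drbdadm primary --force data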

CRMSH or PCS, which shell?

Pacemaker supports multiple command line interfaces for creating and modifying cluster resources.  While PCS seems to be very active in development, I opted for CRMSH, since there is more reference material for it and it is a logical progression of Heartbeat 2's CRM commands.
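
To give a feel for the difference, here is the same pair of basic operations in each shell; the pcs forms are from the 0.9-era packages and may differ slightly by version:

#crmsh
crm status
crm node standby www1

#pcs
pcs status
pcs cluster standby www1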

Which GUI?

Although a GUI is often frowned upon, when working with complicated systems it can be invaluable, so long as it provides accurate, latency-free information and feature parity.  With Heartbeat 2, the only functional GUI we came across was a Python-based package conveniently named heartbeat-gui.  With Pacemaker, there are now four to choose from.  Heartbeat-gui is now pacemaker-mgmt (py-gui), which has since been deprecated in favor of 'Hawk'.  I installed it anyway, as I loathe most web interfaces that have replaced the desktop/console versions that preceded them.

Practical implementation

As mentioned, there are several great guides for deploying a Pacemaker cluster on the Pacemaker site.  However, they target Fedora, and there are some changes to note when using CentOS instead.

Download and install CentOS 6.3 on each cluster host.  I had originally started with CentOS 6.1 media, which forced me to use text-mode installation (the graphical installer would not start on the PowerEdge R610).

Configure Networking (www1 shown):

vi /etc/sysconfig/network-scripts/ifcfg-em1
i
#primary adapter for node access
DEVICE=em1
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.254.6.31
GATEWAY=10.254.6.1
HWADDR=00:00:00:00:00:00 #insert adapter hardware address
DNS1=8.8.8.8
DNS2=8.8.4.4
:wq

vi /etc/sysconfig/network-scripts/ifcfg-em2
i
#adapter for DRBD replication
DEVICE=em2
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.254.0.1
HWADDR=00:00:00:00:00:00 #insert adapter hardware address
MTU="9000" #enable jumbo frames
:wq

vi /etc/sysconfig/network-scripts/ifcfg-em3
i
#adapter for Corosync ring 0
DEVICE=em3
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.254.1.1
HWADDR=00:00:00:00:00:00 #insert adapter hardware address
:wq

vi /etc/sysconfig/network-scripts/ifcfg-em4
i
#adapter for Corosync ring 1
DEVICE=em4
BOOTPROTO=none
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=10.254.2.1
HWADDR=00:00:00:00:00:00 #insert adapter hardware address
:wq

vi /etc/hosts
G
O
10.254.6.31   www1
10.254.6.32   www2
10.254.6.30   clusteredwebsite.com
:wq

vi /etc/resolv.conf
G
O
nameserver 8.8.8.8
nameserver 8.8.4.4
:wq

service network restart

Update and install repository/packages:

#System and cluster tools, don't hate on the desktop
yum update
yum install pacemaker corosync httpd ipmitool fence-agents
yum groupinstall "Development Tools" "Desktop"

#DRBD
rpm -Uvh http://elrepo.org/elrepo-release-6-5.el6.elrepo.noarch.rpm
yum install drbd84-utils kmod-drbd84 --enablerepo=elrepo

#Packages for compiling pacemaker-mgmt
yum install libtool-ltdl-devel cluster-glue-libs-devel pacemaker-libs-devel gnutls gnutls-devel python-devel pam-devel
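
Before moving on, a quick sanity check that the elrepo DRBD kernel module loads against the running kernel does not hurt:

#the module should load cleanly and report a version of 8.4.x
modprobe drbd
cat /proc/drbd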

Enable manual startup (services weren't working consistently)

chkconfig corosync off
chkconfig pacemaker off
chkconfig drbd off
chkconfig httpd off

vi /etc/rc.d/rc.local
G
O
service corosync start
service pacemaker start
:wq

Edit corosync files:

mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.bak

vi /etc/corosync/corosync.conf
i
compatibility: whitetank
aisexec {
        # Run as root - this is necessary to be able to manage resources with Pacemaker
        user:        root
        group:       root
}
totem {
        version: 2
        secauth: on
        threads: 0
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 10.254.1.1
                mcastaddr: 226.94.1.1
                mcastport: 5405
                ttl: 1
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.254.2.1
                mcastaddr: 226.94.1.2
                mcastport: 5405
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}
:wq

vi /etc/corosync/service.d/pcmk
i
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
use_mgmtd: 1
}
:wq

Generate, distribute authentication key (do this once):

corosync-keygen
chmod 400 /etc/corosync/authkey
scp /etc/corosync/authkey root@www2:/etc/corosync/

Compile heartbeat-gui:

cd /root
wget https://github.com/ClusterLabs/pacemaker-mgmt/archive/master.zip
unzip master.zip
cd pacemaker-mgmt-master
./ConfigureMe configure
./ConfigureMe make
./ConfigureMe install

#set hacluster user password on each host, used for logging in with the gui
passwd hacluster

Edit drbd.conf file:

I used a RAID-1 15k SAS array for /dev/sdb with a battery-backed controller.

mv /etc/drbd.conf /etc/drbd.conf.bak
vi /etc/drbd.conf
i
resource data {
        protocol C;
        startup {
                #degr-wfc-timeout 60;
        }
        disk {
                on-io-error detach; #safety
                fencing resource-only; #safety
                no-disk-flushes; #speed, use battery backed raid
                no-md-flushes; #speed, use battery backed raid
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }

        syncer {
                rate 33M;
                verify-alg sha1;
                al-extents 3389;
        }
        net {
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }

        on www1 {
                device /dev/drbd0;
                disk /dev/sdb;
                address 10.254.0.1:7788;
                meta-disk internal;
        }
        on www2 {
                device /dev/drbd0;
                disk /dev/sdb;
                address 10.254.0.2:7788;
                meta-disk internal;
        }
}
:wq

Create DRBD disk

You can increase the speed of the initial replication either by modifying drbd.conf or using drbdadm.
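
For example, with the 8.4 tools the resync rate can be bumped on the fly (the 110M figure below is just a placeholder for whatever your replication link and array can sustain):

#temporarily raise the resync rate for the initial sync
drbdadm disk-options --resync-rate=110M data

#revert to the values in drbd.conf once the sync has finished
drbdadm adjust data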

#start the service on each host
service drbd start

#create resource on one host
drbdadm create-md data
drbdadm up data
drbdadm primary --force data

# format, mount and test data
mkdir /data
mkfs.ext4 /dev/drbd0
mount /dev/drbd0 /data
ls /data
umount /data
drbdadm down data

#on each host
service drbd stop
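
One thing the listing above glosses over: www2 needs its own metadata as well before Pacemaker can manage the resource there.  A minimal sketch, assuming the same /dev/sdb backing device on www2:

#on www2
drbdadm create-md data

#on either node, /proc/drbd shows the connection state and resync progress once both sides are up
cat /proc/drbd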

Start Services (both hosts):

service corosync start
service pacemaker start
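
Before configuring resources, it is worth confirming that both rings and both nodes are healthy; the standard tools cover it:

#both totem rings should be active with no faults
corosync-cfgtool -s

#both nodes should report online
crm_mon -1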


In the previous cluster we used DOPD and the now-deprecated ocf:heartbeat:drbd resource agent.  Now we must use ocf:linbit:drbd.

crm configure
crm(live)configure# primitive drbd_data ocf:linbit:drbd \
                    params drbd_resource="data" \
                    op monitor interval="29s" role="Master" \
                    op monitor interval="31s" role="Slave"
crm(live)configure# ms ms_drbd_data drbd_data \
                    meta master-max="1" master-node-max="1" \
                         clone-max="2" clone-node-max="1" \
                         notify="true"
crm(live)configure# primitive fs_data ocf:heartbeat:Filesystem \
                    params device="/dev/drbd0" \
                      directory="/data" fstype="ext4"
crm(live)configure# primitive ip_httpd ocf:heartbeat:IPaddr2 \
                    params ip="10.254.6.30" nic="em1"
crm(live)configure# primitive httpd lsb:httpd
crm(live)configure# group web_services fs_data ip_httpd httpd
crm(live)configure# colocation web_services_on_drbd \
                      inf: httpd ms_drbd_data:Master
crm(live)configure# order web_services_after_drbd \
                      inf: ms_drbd_data:promote web_services:start
crm(live)configure# commit
crm(live)configure# exit

Configure Stonith (substitute DRAC IPMI information)

I configured STONITH (Shoot The Other Node In The Head) to use Dell's IPMI-compliant management card, and to shut down the host rather than reboot it.  STONITH is used to prevent split-brain conditions where data may be corrupted, in addition to the resource-level fencing for DRBD.

crm configure
crm(live)configure# primitive stonith-www1 stonith:fence_ipmilan \
        params pcmk_host_list="www1" pcmk_host_check="static-list" ipaddr="WWW1iDracAddress" login="username" passwd="password" verbose="true" lanplus="true" power_wait="4" \
        op monitor interval="60s"
crm(live)configure# primitive stonith-www2 stonith:fence_ipmilan \
        params pcmk_host_list="www2" pcmk_host_check="static-list" ipaddr="WWW2iDracAddress" login="username" passwd="password" verbose="true" lanplus="true" power_wait="4" \
        op monitor interval="60s"
crm(live)configure# location lc-stonith-www1 stonith-www1 -inf: www1
crm(live)configure# location lc-stonith-www2 stonith-www2 -inf: www2
crm(live)configure# commit
crm(live)configure# exit
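
Before relying on fencing, verify the IPMI path actually works; a sketch using fence_ipmilan directly and then asking the cluster to do it (flags may vary slightly with your fence-agents/pacemaker versions):

#from www1, confirm the DRAC on www2 answers over IPMI
fence_ipmilan -a WWW2iDracAddress -l username -p password -P -o status

#then have the cluster fence the peer and watch it go down
stonith_admin --fence www2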

Up and running, now test!

The cluster should now be up and running.  You can test moving resources between cluster nodes by setting a node to standby.  You can also test cluster failover by pulling the power from the active cluster host, and fencing by disconnecting the Corosync Ethernet interfaces (one node should halt the other).
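
With crmsh, the standby test looks something like this (run from either node, with crm_mon open in another terminal):

#drain the active node, watch the resources migrate, then bring it back
crm node standby www1
crm node online www1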

Configure services to float their configurations

You can move the configuration files for Apache and other services to the DRBD disk, and symlink the original locations on each cluster host.  This allows your applications to maintain compatibility, prevents services from running on the wrong host, and saves you from having to sync files between nodes.
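
A minimal sketch for Apache (the /data/conf path is just an example location I am assuming here):

#on the active node, with /data mounted
mkdir /data/conf
mv /etc/httpd/conf/httpd.conf /data/conf/httpd.conf
ln -s /data/conf/httpd.conf /etc/httpd/conf/httpd.conf

#on the standby node, keep a backup of the local copy and create the same symlink
mv /etc/httpd/conf/httpd.conf /etc/httpd/conf/httpd.conf.orig
ln -s /data/conf/httpd.conf /etc/httpd/conf/httpd.conf

A nice side effect is that the symlink dangles while /data is unmounted, so httpd cannot accidentally start with a valid configuration on the wrong node.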

Test GUI

From X11/Gnome, you should be able to run the pacemaker-mgmt GUI.  With it installed on each host, you should be able to connect to localhost using the hacluster username and the password set previously.  Behold! (additional services configured.)
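
The client binary name depends on the build (the old heartbeat-gui shipped hb_gui, while newer pacemaker-mgmt builds install it as crm_gui), so treat this as an assumption to adjust:

#launch the management client from an X session
crm_gui &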


Comments:

  1. Thanks for the post. I'm getting 'crm: command not found' when I try running 'crm configure' on CentOS 6.4. crm is not in the pacemaker or pacemaker-cli packages!?

    Replies
    1. crm should be provided by pacemaker-cli and accessible from /usr/sbin/crm. It is possible that the 6.4 package ships pcs instead; I haven't upgraded from 6.3 to 6.4 yet.

      yum whatprovides '/usr/sbin/crm'
      Loaded plugins: fastestmirror, refresh-packagekit
      Loading mirror speeds from cached hostfile
      * base: mirrors.advancedhosters.com
      * extras: mirror.linux.duke.edu
      * rpmforge: mirror.us.leaseweb.net
      * updates: mirrors.advancedhosters.com
      pacemaker-cli-1.1.7-6.el6.x86_64 : Command line tools for controlling Pacemaker clusters
      Repo : installed
      Matched from:
      Other : Provides-match: /usr/sbin/crm

    2. Looks like that's what may have happened: http://serverfault.com/questions/487309/crm-commandcluster-managment-for-pacemaker-not-found-in-latest-centos-6

    3. You can solve this using:

      wget -P /etc/yum.repos.d/ http://download.opensuse.org/repositories/network:/ha-clustering/CentOS_CentOS-6/network:ha-clustering.repo

      yum install crmsh.x86_64 -y

    4. I have to say, yesterday this repository was still there, but today... not anymore.

      I can not find the crmsh package for CentOS 6.4.

  2. On the crm configure step: do I set it up on www1 only, or on both? And in the Stonith config, do I change ipaddr="WWW1iDracAddress" login="username" passwd="password" to the real IP/username/password, or type the literal words "WWW1iDracAddress" "username" "password"?

  3. You need to set it up on both nodes. Node 1 (www1) will be monitoring node 2 (www2) and vice versa. If one of the nodes goes offline, the remaining one will issue the STONITH shutdown using IPMI to make sure it is really down.

  4. Please update this article with the latest versions of the cluster components. Thank you!
