GFS on iSCSI shared storage

This method, based on the software versions delivered with CentOS 6.0, uses dlm_controld.pcmk and gfs_controld.pcmk, which are special versions developed to be used directly by Pacemaker. After upgrading the OS to CentOS 6.2, the RPMs providing dlm_controld.pcmk and gfs_controld.pcmk were replaced by cman, which provides the standard gfs_controld and dlm_controld. To use these two with Pacemaker, we need to enable CMAN with Corosync. This is explained in the next howto.

This howto was tested with CentOS 6.0 and will not work with CentOS 6.2.
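
If you are not sure which variant is installed on a node, a quick query of the installed packages shows which stack you are running (a minimal check; the package names are those shipped by CentOS):

# rpm -q gfs2-utils dlm-pcmk gfs-pcmk cman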


In the howto about the creation of a MySQL failover (active-passive) cluster, we showed how we used the iSCSI shared storage that we built earlier. In the case of the MySQL cluster, it was sufficient to create a standard file-system on the iSCSI LUN once connected, because the cluster, running in fail-over mode, enforces that it is mounted on only one node at a time.

But if we need a filesystem that can be accessed by more than one server at the same time, we need to use a Cluster File System. The basis of such a file-system is to share the locking among the hosts accessing it. This is what will later be called DLM, or Distributed Lock Manager. There are various products that do this:
  • OCFS (Oracle Cluster File System)
  • GFS2 (Global File System version 2)
  • GlusterFS
  • (...)
We have chosen GFS2 here because it integrates easily with Pacemaker. (Note that OCFS integrates just as easily with Pacemaker.)
Remark: GFS2 can also run outside of Pacemaker. Here we will integrate GFS2 into Pacemaker to make the DLM service resilient, just as the storage box, our iSCSI cluster, already is.
 

The GFS2 cluster will be another cluster, running at least two nodes, but not limited to two. This cluster also runs Pacemaker / Corosync, just like the iSCSI cluster itself. Besides the packages required to run Pacemaker, we need three more packages (as of CentOS 6.0):

  • gfs2-utils

  • gfs-pcmk

  • dlm-pcmk (which is a pre-requisite for gfs-pcmk)

Once these three RPMs are installed, check that the gfs2 init script is not started automatically at boot of the nodes; it has to be managed by the cluster.
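
On each node, the installation and the disabling of the automatic start can be done as follows (a minimal sketch; dlm-pcmk is pulled in automatically as a dependency of gfs-pcmk, listing it explicitly does no harm):

# yum install gfs2-utils gfs-pcmk dlm-pcmk
# chkconfig gfs2 off
# chkconfig --list gfs2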

We will need the following Resource Agents (RA) on this cluster :

  • Mount filesystem → ocf:heartbeat:Filesystem

  • Manage the GFS distributed lock manager → ocf:pacemaker:controld

  • Connect to an iSCSI target → ocf:heartbeat:iscsi

The configuration of the resources looks like this:

primitive dlm ocf:pacemaker:controld \
   op monitor interval="120"
primitive gfs ocf:pacemaker:controld \
   params daemon="gfs_controld.pcmk" args="" \
   op monitor interval="120"
primitive gfs-mount ocf:heartbeat:Filesystem \
   params device="/dev/sdb1" directory="/opt" fstype="gfs2" \
   op start interval="0" timeout="120" \
   op stop interval="0" timeout="120"
primitive iscsi1 ocf:heartbeat:iscsi \
   params portal="192.168.20.122" target="iqn.2011-11.begetest.net" \
   op start interval="0" timeout="20" \
   op stop interval="0" timeout="20" \
   op monitor interval="120" timeout="30" start-delay="0"
clone clone-dlm dlm \
   meta globally-unique="false" interleave="true"
clone clone-gfs gfs \
   meta globally-unique="false" interleave="true"
clone clone-gfs-mount gfs-mount \
   meta globally-unique="false" interleave="true"
clone clone-iscsi1 iscsi1 \
   meta globally-unique="false" interleave="true"
colocation dlm-with-gfs 100: clone-dlm clone-gfs
colocation dlm-with-iscsi 100: clone-dlm clone-iscsi1
order dlm-after-iscsi inf: clone-iscsi1 clone-dlm
order gfs-after-dlm inf: clone-dlm clone-gfs
order gfs-mount-after-gfs inf: clone-gfs clone-gfs-mount
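
One way to get this configuration into the cluster is to paste it into the crm shell; another is to save it to a file and load it in one go (a sketch, assuming the configuration above was saved as /root/gfs2.crm):

# crm configure load update /root/gfs2.crm
# crm configure show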

Explanations of the configuration :

  1. iscsi1 is the resource that discovers the iSCSI targets and their LUNs. It makes them available to the local system using the standard device files (cf. /dev/disk and its sub-directories).

  2. The iSCSI LUN is seen here as device /dev/sdb (check the /dev/disk/by-path directory to find the device file in use in your case). We partition it with fdisk, as we usually do for a standard internal disk. The partition we have created is /dev/sdb1 (a short lookup sketch follows this list).

  3. dlm is the resource starting the DLM controlling daemon, which is independent of the type of cluster filesystem we use (it is the same for OCFS, another cluster filesystem).

  4. gfs is the resource starting the GFS controlling daemon, which depends on the cluster filesystem we use. We pass the name and the arguments of the program to launch as parameters.

  5. gfs-mount is the resource that will mount the GFS2 filesystem on the node where it runs.

  6. The clone directive is used to tell Pacemaker that a resource must run on more than one node. This allows the iscsi1, dlm, gfs and gfs-mount resources to run on each node (by default, the cluster will start the clone on every node).

  7. The globally-unique parameter of the clone directive tells the cluster whether the instances of the cloned resource perform different (globally-unique is true) or identical (globally-unique is false) actions. When globally-unique is true, more than one instance of the cloned resource can run on the same host. This is useful with RAs that need to perform a special action if another instance is already running locally (like cleaning up iptables rules after a switch).
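
To illustrate point 2, here is a minimal sketch of how the iSCSI LUN can be located and partitioned; the resulting /dev/sdb is the one of our test setup and will differ in your environment:

# ls -l /dev/disk/by-path/ | grep iscsi
# fdisk /dev/sdb
# partprobe /dev/sdb

In fdisk, create a single primary partition covering the whole LUN (giving /dev/sdb1 here); partprobe, from the parted package, makes the kernel re-read the partition table without a reboot.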


Before starting the gfs-mount resource, the GFS2 filesystem needs to be created manually on the disk device. This is easily done:

  • Configure the cluster as per the configuration example above, without the gfs-mount resource and its related components (clone, colocation and order).

  • Note the iSCSI LUN device file (check in /dev/disk/by-path for this)

  • Partition it as you like using fdisk

  • Assuming that you have created a partition accessed as /dev/sdb1, run the following on one node of the cluster:
# mkfs.gfs2 -p lock_dlm -j 2 -t pcmk:test /dev/sdb1
where:
  • lock_dlm tells that we will use the distributed lock mechanism (the other possibility here would be the local lock mechanism, which only makes sense for a cluster of a single node!)

  • “-j 2” tells that you want 2 journals on the filesystem. You need at least as many journals as there are nodes that will mount this filesystem (see the note after this list for adding journals to an existing filesystem).

  • pcmk: the name of the cluster. Unless you specify another cluster name in the corosync.conf file, you must use the default name pcmk.

  • test: a name for the filesystem. You may use whatever you want here (1 to 16 characters). This name is used in the lock tables to differentiate between the various filesystems that may be connected to the same cluster.
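
If you later add a node to the cluster, the filesystem does not have to be recreated: extra journals can be added while the filesystem is mounted. A minimal sketch, assuming the filesystem is mounted on /opt as in this howto:

# gfs2_jadd -j 1 /opt

This adds one journal, bringing the total to three in our example.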

 

Once all the resources are running on all nodes, you can check that your GFS2 filesystem is mounted. Output of the mount command:

/dev/sdb1 on /opt type gfs2 (rw,relatime,hostdata=jid=1)

Then go to the /opt directory on one node and create a file. You will immediately find this file on the other node as well.
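
A minimal sketch of this check (node names are just examples):

on node 1 : # echo "hello from node 1" > /opt/testfile
on node 2 : # cat /opt/testfile
hello from node 1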

Now that you have the GFS2 filesystem mounted on each node, you can configure and add the service that will use it, for instance a web server instance running on each node.
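
As an illustration, such a web server could be added along the following lines. This is a hedged sketch only: it assumes the httpd package is installed on every node, the configfile path is the CentOS default, and the resource names are examples:

primitive website ocf:heartbeat:apache \
   params configfile="/etc/httpd/conf/httpd.conf" \
   op monitor interval="60" timeout="20"
clone clone-website website \
   meta globally-unique="false" interleave="true"
colocation website-with-gfs-mount inf: clone-website clone-gfs-mount
order website-after-gfs-mount inf: clone-gfs-mount clone-website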