02/26 Tier 2 Update (game plan)

NOTE: This is a Tier 2 update game plan for 02/26/16. Service interruptions are expected and customers have been notified.

After a one-week delay, I’ve come up with a better, non-destructive way to migrate nodes to NFS storage while keeping a rock-solid rollback plan.

First, all OS package updates promoted from staging this week should be deployed:

# onallcts "yum makecache"
# onallcts "yum -y update"

This is going to take a couple of hours to complete. After it is complete on node001, move on to the other nodes during downtime (while waiting on container mounts, etc.).

Next, all monitoring tools (Nagios and Shinken) will be turned off.
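
A sketch of that step, assuming both monitors run as standard init services (the actual service names on our hosts may differ):

# service nagios stop
# service shinken stop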

Next, all containers on node001 need to be stopped:

# ringfree-stopallcts

Next, the global vz config file needs to be adjusted.  Open /etc/vz/vz.conf in nano and change the following:

VE_ROOT=/vz/root/$VEID
VE_PRIVATE=/vz/private/$VEID

to:

VE_ROOT=/vz/nfsroot/$VEID
VE_PRIVATE=/vz/nfsprivate/$VEID

Next, another rsync is required to catch any changes from the last two days:

# for i in $(cat /etc/ringfree/manifest.tmp) ; do rsync -avz -e "ssh" --delete /vz/private/$i/ root@atl.san1.ringfree.biz:/rf-images/pbx/$i ; done

Next, all of the per-container vz conf files need to be adjusted. In /etc/sysconfig/vz-scripts/, which is symlinked to /etc/vz/conf/, every container listed in /etc/ringfree/manifest.tmp needs its private and root directory locations changed just as we did in /etc/vz/vz.conf. One way to do this in bulk is sketched below.
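
A sketch of a bulk edit with sed, assuming each container's conf is named $VEID.conf and still uses the stock /vz/private and /vz/root paths (spot-check a couple of files afterward):

# for i in $(cat /etc/ringfree/manifest.tmp) ; do sed -i -e 's|/vz/private|/vz/nfsprivate|' -e 's|/vz/root|/vz/nfsroot|' /etc/vz/conf/$i.conf ; done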

Next, the mount directories must be created:

# mkdir /vz/nfsprivate
# mkdir /vz/nfsroot

Next, the mounting variables must be created in the following files:

/etc/ringfree/nfs (The current NFS share server; initially set to the primary)
/etc/ringfree/nfsip.1 (The primary NFS share server)
/etc/ringfree/nfsip.2 (The secondary NFS share server)
/etc/ringfree/nfsmounted (Will be seeded with value of "0")
/etc/ringfree/nfsvzconf (Will be seeded to "/rf-images/vzconf")
/etc/ringfree/nfsvzroot (Will be seeded to "/rf-images/pbxroot")
/etc/ringfree/nfsvzstor (Will be seeded to "/rf-images/pbx")
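
A sketch of seeding these files; PRIMARY_NFS_IP and SECONDARY_NFS_IP are placeholders to be replaced with the real server addresses:

# echo "PRIMARY_NFS_IP" > /etc/ringfree/nfs
# echo "PRIMARY_NFS_IP" > /etc/ringfree/nfsip.1
# echo "SECONDARY_NFS_IP" > /etc/ringfree/nfsip.2
# echo "0" > /etc/ringfree/nfsmounted
# echo "/rf-images/vzconf" > /etc/ringfree/nfsvzconf
# echo "/rf-images/pbxroot" > /etc/ringfree/nfsvzroot
# echo "/rf-images/pbx" > /etc/ringfree/nfsvzstor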

Next, the OpenVZ service needs to be restarted to load in the new private and root paths.  Before restarting, vzquota must be turned off in /etc/vz/vz.conf (we don’t normally have it on by default, but confirm it is disabled before continuing).
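
A sketch of that step; DISK_QUOTA is the standard vz.conf knob for vzquota, but verify it matches our config before relying on the sed:

# sed -i 's/^DISK_QUOTA=.*/DISK_QUOTA=no/' /etc/vz/vz.conf
# service vz restart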

Next, ringfree-cloudstor will mount the NFS storage:

# ringfree-cloudstor -m

The tool ringfree-cloudstor has been updated to mount NFS storage in directories other than /vz/private and /vz/root.  This lets us roll back to local storage if something appears to be wrong or the migration takes longer than the maintenance window allows.  /etc/sysconfig/vz-scripts, however, cannot be mounted in a different directory.  As such, I have taken a snapshot of the directory to /etc/sysconfig/vz-scripts.tar.gz.
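
For reference, a snapshot like that can be taken with paths relative to /, which makes the later restore straightforward (this assumes the tarball was built the same way):

# tar -czf /etc/sysconfig/vz-scripts.tar.gz -C / etc/sysconfig/vz-scripts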

The new version of ringfree-cloudstor will mount /etc/sysconfig/vz-scripts, /vz/nfsprivate and /vz/nfsroot from the NFS shares advertised as:

  • $NFS:/rf-images/vzconf (mounted as /etc/sysconfig/vz-scripts, symlinked to /etc/vz/conf)
  • $NFS:/rf-images/pbxroot (mounted as /vz/nfsroot)
  • $NFS:/rf-images/pbx (mounted as /vz/nfsprivate)
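
Roughly, the mounts ringfree-cloudstor -m performs should be equivalent to the following (a sketch; the tool's actual mount options aren't covered here), with $NFS read from /etc/ringfree/nfs:

# NFS=$(cat /etc/ringfree/nfs)
# mount -t nfs $NFS:/rf-images/vzconf /etc/sysconfig/vz-scripts
# mount -t nfs $NFS:/rf-images/pbxroot /vz/nfsroot
# mount -t nfs $NFS:/rf-images/pbx /vz/nfsprivate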

Therefore, in case of emergency, all NFS mounts can be safely unmounted, /etc/sysconfig/vz-scripts.tar.gz can be untarred to restore all previous configs, and all local container storage will still be in /vz/private and /vz/root as it was originally.

It is already known that the initial container mounts will take some time.  Four Seasons hospice, our PBX system, and Epsilon will be restored first.  After Asterisk, Apache, and Postfix are confirmed to be running on those, the rest of the containers will be started from the container list in /etc/ringfree/manifest.tmp:

# for i in $(cat /etc/ringfree/manifest.tmp); do vzctl start $i ; done
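
Before running that bulk start, the three priority containers can be spot-checked with vzctl exec; a sketch assuming a hypothetical CTID of 101 (the httpd and master process names assume stock CentOS Apache and the Postfix master daemon):

# vzctl exec 101 pgrep asterisk
# vzctl exec 101 pgrep httpd
# vzctl exec 101 pgrep master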

Luckily, the only files needed over NFS to mount a container are Asterisk, Apache, Postfix, and the libraries they require to run.  In the future we can improve our mount times over NFS simply by placing a gigabit switch between the nodes, the NFS servers, and the router.  The nodes will ARP directly to the NFS servers over the gigabit switch, allowing gigabit network transfer.

Once all containers are confirmed to be fully mounted and load average has returned to normal levels, Nagios and Shinken will be turned back on.
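
Again a sketch, assuming the same standard init service names as the shutdown step:

# service nagios start
# service shinken start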

Finally, we need to ensure the asterisk process is running on all CTs (except revnat and revnat beta on node001).  We can do this by using onallcts with the asterisk binary to request the version number; if Asterisk is running, the binary will connect and display a version string:

# onallcts "asterisk -rx core\ show\ version"

At this point, node001 will be completely migrated.  Local storage will be left in place until it is decided to remove it.

Reversion Plan

In case there are issues mounting containers that would impact call service outside of the maintenance window, the containers mounted over NFS should be unmounted and local storage should be used instead.  After the NFS-backed containers are unmounted, /etc/vz/vz.conf should be restored from the /vz/nfsroot and /vz/nfsprivate values back to /vz/root and /vz/private.  The vz service should then be restarted, and the NFS shares themselves should be unmounted.  After they are unmounted, /etc/sysconfig/vz-scripts.tar.gz should be untarred to restore the previous configs.  Containers can at this point be mounted from local storage.
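
A rough sketch of that sequence, assuming plain umount for the shares (vzctl stop also unmounts each container) and that the snapshot tarball was created with paths relative to /:

# for i in $(cat /etc/ringfree/manifest.tmp) ; do vzctl stop $i ; done
# sed -i -e 's|/vz/nfsroot|/vz/root|' -e 's|/vz/nfsprivate|/vz/private|' /etc/vz/vz.conf
# service vz restart
# umount /etc/sysconfig/vz-scripts /vz/nfsroot /vz/nfsprivate
# tar -xzf /etc/sysconfig/vz-scripts.tar.gz -C /
# for i in $(cat /etc/ringfree/manifest.tmp) ; do vzctl start $i ; done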