diff --git a/README.md b/README.md
index 8a337c1..31a8751 100644
--- a/README.md
+++ b/README.md
@@ -5,26 +5,26 @@ Author: David Meredith + JK
 This repo contains the service and cron scripts used to run a failover gocdb instance, includes the following dirs:
 * autoEngageFailover/
- * Contians a Service script (```gocdb-autofailover.sh```) and child scripts that monitors the main production instance. If a prolonged outage is detected, the GOCDB top DNS alias 'goc.egi.eu' is swtiched from the production instance to the failover instance. This switch can also be performed manually when needed.
+ * Contians a Service script (```gocdb-autofailover.sh```) and child scripts that monitors the main production instance. If a prolonged outage is detected, the GOCDB top DNS alias 'goc.egi.eu' is swtiched from the production instance to the failover instance. This switch can also be performed manually when needed.
 * importDBdmpFile/
- * Contains a script that should be invoked by cron hourly (```1_runDbUpdate.sh```) to fetch and install a .dmp of the production DB into the local failover DB. This runs separtely from the autoEngageFailover process.
+ * Contains a script that should be invoked by cron hourly (```1_runDbUpdate.sh```) to fetch and install a .dmp of the production DB into the local failover DB. This runs separtely from the autoEngageFailover process.
 * nsupdate_goc/
- * Scripts for switching the DNS to/from the production/failover instance.
+ * Scripts for switching the DNS to/from the production/failover instance.
 * archiveDmpDownload/
- * Contains a script to download/archive dmp files in a separate process
+ * Contains a script to download/archive dmp files in a separate process
 # Packages
-* The following scripts needs to be installed and configuired for your installation:
+* The following scripts needs to be installed and configuired for your installation:
 ```
 /root/
 autoEngageFailover/ # Scripts to mon the production instance and engage failover
 |_ gocdb-autofailover.sh# MAIN SERVICE SCRIPT to mon production instance
 |_ engageFailover.sh # Child script, run if prolonged outage is detected
-
+
 importDBdmpFile/ # Scripts fetch/install a .dmp of the prod data
- |_ 1_runDbUpdate.sh # MAIN SCRIPT that can be called from cron, invokes child scripts below
+ |_ 1_runDbUpdate.sh # MAIN SCRIPT that can be called from cron, invokes child scripts below
 |_ ora11gEnvVars.sh # Setup oracle env
- |_ getDump.sh # Fetch a .dmp of the production data
+ |_ getDump.sh # Fetch a .dmp of the production data
 |_ dropGocdbUser.sh # Drops the current DB schema
 |_ loadData.sh # Load the last successfully fetched DB dmp into the RDBMS
 |_ gatherStats.sh # Oracle gathers stats to re-index
@@ -32,32 +32,32 @@ This repo contains the service and cron scripts used to run a failover gocdb ins
 nsupdate_goc/ # Scripts for switching the DNS to the failover
 |_ goc_failover.sh # Points DNS to failover instance
- |_ goc_production.sh # Points DNS to production instance
+ |_ goc_production.sh # Points DNS to production instance
 archiveDmpDownload/ # Contains script to download/archive dmp files in a separate process e.g from cron.daily
- |_ archiveDump.sh # Main script that dowloads dmp and saves in a sub-dir
- |_ archive/ # Contains archive/dmp files
+ |_ archiveDump.sh # Main script that dowloads dmp and saves in a sub-dir
+ |_ archive/ # Contains archive/dmp files
 ```
-## /root/autoEngageFailover/
+## /root/autoEngageFailover/
 Start in this dir.
 Dir contains the 'gocdb-autofailover.sh' service script which should be installed as a service in '/etc/init.d/gocdb-autofailover'. This service invokes 'engageFailover.sh' which monitors the production instance with a ping-check. If a continued outage is detected; the script starts the failover procedure which includes the
-following:
-* the gocdb admins are emailed,
+following:
+* the gocdb admins are emailed,
 * the age of the last successfully imported dmp file is
- checked to see that it is current,
+ checked to see that it is current,
 * the hourly cron that fetches the dmp file is stopped (see
- importDBdmpFile below),
+ importDBdmpFile below),
 * symbolic links to the server cert/key are updated so they
- point to the 'goc.egi.eu' cert/key (note, no longer needed as cert contains dual SAN)
+ point to the 'goc.egi.eu' cert/key (note, no longer needed as cert contains dual SAN)
 * the dnscripts are invoked to change the dns (see nsupdate_goc below).
-## /root/importDBdmpFile/
+## /root/importDBdmpFile/
 Contains scripts that fetches the .dmp file and install this dmp file into the local Oracle XE instance. The master script is '1_runDbUpdate.sh' which needs to be invoked from an hourly
@@ -70,29 +70,29 @@ cron:
 /root/importDBdmpFile/1_runDbUpdate.sh
 ```
-You will also need to:
+You will also need to:
 * generate a public/private key pair using `ssh-keygen` and ensure the public key is present on the host with the database dmp file.
 * populate `importDBdmpFile/failover_TEMPLATE.sh` with appropriate values and copy it to `/etc/gocdb/failover.sh`
-
+
 ## /root/nsupdate_goc/
 Contains the nsupdate keys and nsupdate scripts for switching the 'goc.egi.eu' top level DNS alias to point to either the
-production instance or the failover.
+production instance or the failover.
 ## /root/archiveDmpDownload/
 Contains a script that downloads the dmp file and stores the file in the archive/ sub-dir.
-The script also deletes archived files that are older than 'x' days.
-This script can be called in a separate process, e.g. from cron.daily to build a
-set of backups.
+The script also deletes archived files that are older than 'x' days.
+This script can be called in a separate process, e.g. from cron.daily to build a
+set of backups.
-#Failover Instructions
+#Failover Instructions
 * Choose from options 1) 2) 3)
-## To start/stop the auto failover service
+## To start/stop the auto failover service
 This will continuously monitor the production instance and engage the failover automatically during prolonged outages
@@ -105,8 +105,8 @@ chkconfig --list | grep gocdb-auto
 /sbin/service gocdb-autofailover status
 ```
-
-Directly (not as a service):
+
+Directly (not as a service):
 ```bash
 cd /root/autoEngageFailover
@@ -114,15 +114,15 @@ cd /root/autoEngageFailover
 ```
-## To manually engage the failover immediately
+## To manually engage the failover immediately
 E.g. for known/scheduled outages, run the following passing 'now' as the first command-line argument:
-Stop the service:
+Stop the service:
 ```
 service gocdb-autofailover stop
 ```
-Or to stop if running manually:
+Or to stop if running manually:
 ```
 cd /root/autoEngageFailover
 ./gocdb-autofailover.sh stop
@@ -136,16 +136,16 @@ Engage the failover now:
 You will need to manually revert the steps executed by the failover so the dns points back to the production instance and restore/restart the failover process. This includes:
-* restore the symlinks to the goc.dl.ac.uk server cert and key
- (see details below) (no longer needed as cert contains dual SAN)
+* restore the symlinks to the gocdb.hartree.stfc.ac.uk server cert and key
+ (see details below) (no longer needed as cert contains dual SAN)
 * restore the hourly cron to fetch the dmp of the DB
 * run nsupdate procedure to repoint 'goc.egi.eu' back to 'gocdb-base.esc.rl.ac.uk'
- MUST read /root/nsupdate_goc/nsupdateReadme.txt.
+ MUST read /root/nsupdate_goc/nsupdateReadme.txt.
 * restart the failover service
 ####Restore Walkthrough
-At end of downtime (production instance ready to be restored) first re-point DNS:
+At end of downtime (production instance ready to be restored) first re-point DNS:
 ```bash
 echo We first switch dns to point to production instance
@@ -154,7 +154,7 @@ cd /root/nsupdate_goc
 ```
-Now wait for DNS to settle, this takes approx **2hrs** and during this time the goc.egi.eu domain will
+Now wait for DNS to settle, this takes approx **2hrs** and during this time the goc.egi.eu domain will
 swtich between the failover instance and the production instance. You should monitor this using nsupdate:
 ```bash
@@ -167,7 +167,7 @@ nslookup goc.egi.eu
 Address: 130.246.143.160
 ```
-After DNS has become stable the production instance will now be serving requests.
+After DNS has become stable the production instance will now be serving requests.
 Only after this ~2hr period should we re-start failover service:
 ```bash
@@ -177,14 +177,14 @@ rm /root/autoEngageFailover/engage.lock
 mv cronRunDbUpdate.sh /etc/cron.hourly
 # Below server cert change no longer needed as cert contains dual SAN
-# This means a server restart is no longer needed.
-#echo Change server certificate and key back for goc.dl.ac.uk
-#ln -sf /etc/pki/tls/private/goc.dl.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
-#ln -sf /etc/grid-security/goc.dl.ac.uk.cert.pem /etc/grid-security/hostcert.pem
+# This means a server restart is no longer needed.
+#echo Change server certificate and key back for gocdb.hartree.stfc.ac.uk
+#ln -sf /etc/pki/tls/private/gocdb.hartree.stfc.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
+#ln -sf /etc/grid-security/gocdb.hartree.stfc.ac.uk.cert.pem /etc/grid-security/hostcert.pem
 #service httpd restart
 #service gocdb-autofailover start
 #service gocdb-autofailover status
-# gocdb-autofailover is running...
+# gocdb-autofailover is running...
 ```
 Now check the '/root/autoEngageFailover/pingCheckLog.txt' and
diff --git a/autoEngageFailover/engageFailover.sh b/autoEngageFailover/engageFailover.sh
index 845425c..6468aac 100644
--- a/autoEngageFailover/engageFailover.sh
+++ b/autoEngageFailover/engageFailover.sh
@@ -1,23 +1,23 @@
 #!/bin/bash
-# Usage: ./autoEnageFailover.sh [now]
-# where now is optional. If 'now' is specified as the first cmd line arg, then
-# the failover is engaged immediately rather than on detection of a prolongued outage.
-#
-# Script will fail early if the lockFile from previous engage is present.
+# Usage: ./autoEnageFailover.sh [now]
+# where now is optional. If 'now' is specified as the first cmd line arg, then
+# the failover is engaged immediately rather than on detection of a prolongued outage.
 #
-# Note, after the main instance has been restored, you will need to manually
-# do the following steps:
+# Script will fail early if the lockFile from previous engage is present.
+#
+# Note, after the main instance has been restored, you will need to manually
+# do the following steps:
 # Revert this swap:
-# ln -s /etc/pki/tls/private/goc.dl.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
-# ln -s /etc/grid-security/goc.dl.ac.uk.cert.pem /etc/grid-security/hostcert.pem
+# ln -s /etc/pki/tls/private/gocdb.hartree.stfc.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
+# ln -s /etc/grid-security/gocdb.hartree.stfc.ac.uk.cert.pem /etc/grid-security/hostcert.pem
 #
-# Restore hourly cron job:
-# mv /root/cronRunDbUpdate.sh /etc/cron.hourly/
+# Restore hourly cron job:
+# mv /root/cronRunDbUpdate.sh /etc/cron.hourly/
 # ====================Setup Variables===========================
-# setup log files
+# setup log files
 updateLog=/root/autoEngageFailover/pingCheckLog.txt
 errorEngageFailoverLog=/root/autoEngageFailover/errorEngageFailoverLog.txt
 lockFile=/root/autoEngageFailover/engage.lock
@@ -28,11 +28,11 @@ importDBdmpFile=/root/importDBdmpFile
 # maintainthe current fail count
 failcount=0
-# server certificate / key
-# note, in production we will use the goc.dl.ac.uk server/host cert and key which has no
-# password protecting the private key.
-userkey="/etc/pki/tls/private/goc.dl.ac.uk.key.pem"
-usercert="/etc/grid-security/goc.dl.ac.uk.cert.pem"
+# server certificate / key
+# note, in production we will use the gocdb.hartree.stfc.ac.uk server/host cert and key which has no
+# password protecting the private key.
+userkey="/etc/pki/tls/private/gocdb.hartree.stfc.ac.uk.key.pem"
+usercert="/etc/grid-security/gocdb.hartree.stfc.ac.uk.cert.pem"
 # URL to monitor for the main production instance
 pingUrl="https://goc.egi.eu/portal/GOCDB_monitor/ops_monitor_check.php"
@@ -40,17 +40,17 @@ pingUrl="https://goc.egi.eu/portal/GOCDB_monitor/ops_monitor_check.php"
 # An external url to check that local network can reach outside
 externalPingUrl="http://google.co.uk"
-# number of secs between re-pings (600secs = 10mins)
+# number of secs between re-pings (600secs = 10mins)
 sleepTime=600s
 # number of successive fails before invoking failover (30 * 10mins = 300mins = 5hrs)
 failCountLimit=30
-# email subject and to address for notification that failover is engaged
+# email subject and to address for notification that failover is engaged
 SUBJECT="gocdb failover warning"
 TO="some.body@world.com,a.n.other@world.com"
-# Determine whether to engage the failover immediately
+# Determine whether to engage the failover immediately
 ENGAGENOW="false"
 # =====================================================
@@ -65,7 +65,7 @@ if [ -n "$1" ] ; then
 fi
-# email all given args to $TO
+# email all given args to $TO
 function email {
 /bin/mail -s "$SUBJECT" "$TO" <> $updateLog
 }
@@ -132,28 +132,28 @@ fi
-# Create the log if it don't already exist
-touch $updateLog
+# Create the log if it don't already exist
+touch $updateLog
 touch $errorEngageFailoverLog
 logger "==============================Starting up $(date)====================================="
 errorLogger "===================================Starting up $(date)=================================="
 # loop if not engaging now
 if [ $ENGAGENOW == "false" ] ; then
- # loop while global failcount is less than x
+ # loop while global failcount is less than x
 while [ $failcount -lt $failCountLimit ]
 do
- pingCode=$(pingcheck)
+ pingCode=$(pingcheck)
 if [ $pingCode != 0 ]; then
- # if ping failed then increment failcount
+ # if ping failed then increment failcount
 (( failcount++ ))
 else
 # else if ping worked re-set failcount (back) to zero
 failcount=0
- #logger "ping ok $(date) : $pingUrl"
+ #logger "ping ok $(date) : $pingUrl"
 fi
-
- #echo "failcount is: $failcount, pingcode is: $pingCode"
+
+ #echo "failcount is: $failcount, pingcode is: $pingCode"
 sleep $sleepTime
 done
 fi
@@ -162,20 +162,20 @@ fi
 # 'N' consecutive failures encountered. Next invoke failover script
 # =================================================================
-# - log the date
+# - log the date
 errorLogger "=============Start Failover Swtich================="
 errorLogger "Detected successive failues on $(date)"
 errorLogger "Starting engage failover"
 email "Detected successive failures. Attempting to engage the failover - please see the logs: $updateLog $errorEngageFailoverLog"
-# While developing, force an exit here (will have to practice below using
-# the provided test.egi.eu goc domain)
+# While developing, force an exit here (will have to practice below using
+# the provided test.egi.eu goc domain)
 #exit 0
-# - Test that the last goc.dmp imported ok by parsing /root/importDBdmpFile/updateLog.txt
+# - Test that the last goc.dmp imported ok by parsing /root/importDBdmpFile/updateLog.txt
 #cd /root/importDBdmpFile
 cd $importDBdmpFile
 if [ "$(tail -1 ./updateLog.txt)" != "completed ok" ]; then
@@ -185,20 +185,20 @@ fi
 errorLogger "Attempting to move cron"
-# - Move hourly cron job to disable (don't want this to execute while in failover mode)
-mv /etc/cron.hourly/cronRunDbUpdate.sh /root
+# - Move hourly cron job to disable (don't want this to execute while in failover mode)
+mv /etc/cron.hourly/cronRunDbUpdate.sh /root
 errorLogger "Swapping server certs"
-## - Swap server cert
+## - Swap server cert
 ## Not needed e.g. if your server cert has a dual SAN
-#unlink /etc/grid-security/hostcert.pem
+#unlink /etc/grid-security/hostcert.pem
 #unlink /etc/pki/tls/private/hostkey.pem
 #ln -s /etc/grid-security/goc.egi.eu.cert.pem /etc/grid-security/hostcert.pem
 #ln -s /etc/pki/tls/private/goc.egi.eu.key.pem /etc/pki/tls/private/hostkey.pem
-## note, after the main instance has been restored, you will need to revert this swap:
-## ln -s /etc/pki/tls/private/goc.dl.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
-## ln -s /etc/grid-security/goc.dl.ac.uk.cert.pem /etc/grid-security/hostcert.pem
+## note, after the main instance has been restored, you will need to revert this swap:
+## ln -s /etc/pki/tls/private/gocdb.hartree.stfc.ac.uk.key.pem /etc/pki/tls/private/hostkey.pem
+## ln -s /etc/grid-security/gocdb.hartree.stfc.ac.uk.cert.pem /etc/grid-security/hostcert.pem
 #errorLogger "After server cert swap"
@@ -219,15 +219,15 @@ errorLogger "Swapping server certs"
 #fi
 #
 #
-## Restart apache
+## Restart apache
 #errorLogger "Restarting apache"
-#service httpd restart
+#service httpd restart
-# Finally create the lockFile to indicate the failover ran ok
+# Finally create the lockFile to indicate the failover ran ok
 touch $lockFile
 email "Failover script completed"
-# End
+# End
 errorLogger "==========================End failover switch======================="
diff --git a/nsupdate_goc/goc_failover.sh b/nsupdate_goc/goc_failover.sh
index a960ddd..e962520 100644
--- a/nsupdate_goc/goc_failover.sh
+++ b/nsupdate_goc/goc_failover.sh
@@ -1,14 +1,14 @@
-#echo "changing goc.egi.eu DNS record at ns.mui.cz to goc.dl.ac.uk"
+#echo "changing goc.egi.eu DNS record at ns.mui.cz to gocdb.hartree.stfc.ac.uk"
 #echo nsupdate -k goc.egi.eu_ns.muni.cz_key.conf <