From: Ingram, Bryan
Subject: PG Automated Backup and Failover System (long)
Msg-id: 01CCE949D2717845BA2E573DC081167E052F12@BKMAIL.sixtyfootspider.com
List: pgsql-general

I'm currently testing a system I've written for Postgres that provides
automatic failover to a backup database server in the event that the master
db server should die.

The system was designed using RedHat 6.2 and Postgres 7.0.2 in a two-node
cluster with two NICs in each node.  The extra NIC in each machine provides
a private network between the nodes.

The system employs two major components...
    1) Postgres Backup/Restore component
    2) Takeover monitoring component

Until Postgres has some kind of replication (Jan- need some beer, food,
moral support??) the best technique I could come up with to provide a
reasonably up-to-date copy of the master server was to create a process that
periodically checks the database(s) for changes, pg_dumps their contents,
alerts the slave node, and then applies the backup files on the slave node.
This is all done by an automated process with many integrity checks along
the way.  If any errors are found, a message with a descriptive line of
text is sent to my cell phone.
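
Roughly, the master-side pass boils down to something like the sketch
below.  This is a simplified illustration, not the production script: the
dbout.<db> and ready.<db> file names match the log excerpts further down,
but the change test via cmp, the paths, and the database list are
illustrative.

    #!/bin/sh
    # Sketch of the master-side backup pass, run from cron every 5 minutes.
    # BACKUPDIR is assumed to be the directory shared with the slave over NFS.
    BACKUPDIR=/db/backups

    for db in bahamas zipfind stfrancis afcard afemail test1; do
        echo "Start $db backup"
        pg_dump "$db" > "$BACKUPDIR/dbout.$db.new" || continue

        # No changes, no backups: publish only when the dump differs.
        if cmp -s "$BACKUPDIR/dbout.$db" "$BACKUPDIR/dbout.$db.new" 2>/dev/null
        then
            echo "$db has not changed.  Skipping."
            rm -f "$BACKUPDIR/dbout.$db.new"
            continue
        fi

        mv "$BACKUPDIR/dbout.$db.new" "$BACKUPDIR/dbout.$db"
        touch "$BACKUPDIR/ready.$db"    # tells the slave a fresh dump is waiting
        echo "End $db backup"
    done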

I'm aware of the shortcomings of this technique, but at this point, what
else is there?  I've set the system to check for data changes and to make
backups if necessary every 5 minutes (no changes, no backups).  While many
organizations would shudder at the idea of losing up to 5 minutes of data,
for some the risk is worth the reward: a backup server that automatically
takes over within 30 seconds.

Depending on your application, this kind of "replication" may or may not be
acceptable to you.  Also, if you deal with very large databases with
frequent data changes, you might find that 5 minutes is too short an
interval.  With Postgres 7.0.2 and a decent machine I can now load one of
our databases of ca. 1,500,000 rows in 47 seconds without fsync, as opposed
to 4+ hours with 6.5.x.  (Though that was reduced to 1.5 hours by
distributing the processes across two machines.)
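
"Without fsync" just means the backends run with fsync disabled during the
reload.  With 7.0 you can pass -F through the postmaster; a minimal
example, assuming a default install layout:

    # Start the postmaster with fsync disabled (-o hands -F to each backend).
    postmaster -D /usr/local/pgsql/data -o "-F" &

    # Reload a dump; this is where the 7.0.2 speedup shows up.
    psql test1 < dbout.test1

Just remember to restart with fsync enabled before taking real traffic.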

So, basically, component one provides an almost fresh copy of the master
server on the slave node at any given time.
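
On the slave, the restore pass walks the list of databases (dbs.dat), looks
for the cntl.<db> and ready.<db> control files, and reloads the dump when a
fresh one is waiting.  The sketch below is again illustrative; the file
names match the slave logs near the end of this message, but the locking
and reload details are simplified.

    #!/bin/sh
    # Sketch of the slave-side restore pass.  Paths are illustrative.
    BACKUPDIR=/db/backups

    for db in `cat "$BACKUPDIR/dbs.dat"`; do
        [ -f "$BACKUPDIR/cntl.$db" ]  || { echo "cntl.$db not found. SKIPPING"; continue; }
        [ -f "$BACKUPDIR/ready.$db" ] || { echo "$db NOT READY. SKIPPING"; continue; }

        touch "/tmp/lock.$db"        # keep takeover from using a half-loaded db
        cp "$BACKUPDIR/dbout.$db" "/db/restore/dbout.$db"

        # Rebuild the database from the fresh dump.
        psql template1 -c "DROP DATABASE $db"
        psql template1 -c "CREATE DATABASE $db"
        psql "$db" < "/db/restore/dbout.$db"

        rm -f "$BACKUPDIR/ready.$db" "/tmp/lock.$db"
    done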

Component two is a set of scripts which run on both the master and slave
monitoring the state of the systems.

These scripts check for error conditions every 10 seconds and monitor these
situations:

Hub/Switch/Cable Failure
Web Cluster Failures
Local Node Network Failure
Server OS/Hardware Crash
Postgres Failures
NFS Failures

All except the NFS failure will trigger a node takeover.  Error checks are
performed using pings, psql db connections, and a few other techniques.
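
A single check cycle amounts to something like this; the host name comes
from the logs, and the exact probes here are simplified:

    #!/bin/sh
    # Sketch of one 10-second check cycle on the slave.
    MASTER=dbrh1    # host name as it appears in the log excerpts

    # Network reachability: a single ping with its output discarded.
    if ! ping -c 1 "$MASTER" > /dev/null 2>&1; then
        echo "`date` $MASTER not responding to ping"
    fi

    # Postgres liveness: a trivial query against a database we know exists.
    if psql -h "$MASTER" -c "SELECT 1;" template1 > /dev/null 2>&1; then
        echo "`date` PG on MASTER is OK"
    else
        echo "`date` Postgres not responding"
    fi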

And as with component one, any error found will generate an email message.

So far, both components are passing testing with flying colors.  The node
takeover scripts are performing wonderfully as are the automated backup and
restore scripts.

I've actually been able to have testers go through Postgres/PHP-driven web
sites, pull the plug on the database server, and then have the slave server
come up and respond to requests within 30 seconds, without the user ever
seeing an error.  (Unless you consider a 30-second pause in page load an
error.)
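
One way to make the switch transparent to the web cluster is to have the
surviving node claim the database service IP; that's the generic Linux
technique sketched below.  The address and interface are illustrative, and
the real takeover scripts do considerably more checking first.

    #!/bin/sh
    # Sketch of an IP-takeover step (generic technique; illustrative values).
    SERVICE_IP=192.168.1.50     # the address the web cluster connects to

    # Claim the service address as an alias on the public interface.
    /sbin/ifconfig eth0:0 "$SERVICE_IP" netmask 255.255.255.0 up

    # Unsolicited ARP so the web cluster's ARP caches learn the new MAC fast.
    arping -U -c 3 -I eth0 "$SERVICE_IP"

    # Make sure Postgres is up and accepting TCP connections (-i).
    su - postgres -c "pg_ctl -D /usr/local/pgsql/data -o '-i' start"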

There are a lot of details to this system I can't describe here, but I'd be
happy to answer any questions.

As formal testing nears completion and production approaches, I would love
to hear from anyone who has implemented a similar system or who can help me
explore potential pitfalls I may not have considered.

Here are some logs from each component for your amusement:

Takeover Component:
Sample output from master's logs ...

Fri Sep 22 19:14:43 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:14:43 CDT 2000 dbrh1:master PG OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master PG OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master dbrh1:master slave machine is up
Fri Sep 22 19:15:01 CDT 2000 dbrh1:master NFS mount OK
Fri Sep 22 19:15:05 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:15:05 CDT 2000 dbrh1:master PG OK

From the slave's logs ...
Fri Sep 22 19:59:32 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 19:59:42 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 19:59:52 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:01 CDT 2000 dbrh2:slave NFS mount OK
Fri Sep 22 20:00:02 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:12 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:22 CDT 2000 dbrh2:slave PG on MASTER is OK

Backup/Restore Component:
On the master ...
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master ***********************
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master users.dat: 74
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start bahamas backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master bahamas has not changed.  Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start zipfind -t dealers backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master zipfind has not changed.  Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start stfrancis backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master stfrancis has not changed.  Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start afcard backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master afcard has not changed.  Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start afemail backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master afemail has not changed.  Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start test1 backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master End test1 backup 107225
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Archiving complete.
Fri Sep 22 17:15:00 CDT 2000 dbrh1:master ***********************
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master users.dat: 74
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start bahamas backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master bahamas has not changed.  Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start zipfind -t dealers backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master zipfind has not changed.  Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start stfrancis backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master stfrancis has not changed.  Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start afcard backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master afcard has not changed.  Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start afemail backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master afemail has not changed.  Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start test1 backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master End test1 backup 113757
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Archiving complete.


And on the slave...
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave dbs.dat FOUND. good
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave CNTL.template1 not found. SKIPPING
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave cntl.bahamas found. good
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave Checking for ready.bahamas
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave bahamas NOT READY. SKIPPING
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave cntl.dealers found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dbs.dat FOUND. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave CNTL.template1 not found. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.bahamas found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.bahamas
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave bahamas NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.dealers found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.dealers
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dealers NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.stfrancis found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.stfrancis
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave stfrancis NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.afcard found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.afcard
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave afcard NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.afemail found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.afemail
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave afemail NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.test1 found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.test1
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Found ready.test1
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave test1 locked
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.test1 copied
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dbout.test1 copied
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Start reload on dbout.test1
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave End reload on dbout.test1
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave 20002661520.dbout.test1 created
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave test1 unlocked
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave ----------------------


And a few generated from the slave during error conditions:
Hub/Switch Failure...
Wed Sep 13 17:21:34 CDT 2000 dbrh2:slave PG on MASTER is OK
This is where we kill the hub...
Wed Sep 13 17:21:46 CDT 2000 Postgres not responding
Wed Sep 13 17:21:46 CDT 2000 dbrh2:slave New problem.  Waiting 10 seconds for PG recovery
Wed Sep 13 17:22:08 CDT 2000 Postgres not responding
Wed Sep 13 17:22:08 CDT 2000 dbrh2:slave ACK! PG still down.  Switching Roles
Wed Sep 13 17:22:19 CDT 2000 dbrh2:slave Error. No pong from webfarm, cant takeover
Wed Sep 13 17:22:31 CDT 2000 Postgres not responding
Wed Sep 13 17:22:31 CDT 2000 dbrh2:slave ACK! PG still down.  Switching Roles
Wed Sep 13 17:22:42 CDT 2000 dbrh2:slave Error. No pong from webfarm, cant takeover

Pulling the network cable on the master:
Fri Sep 22 11:37:23 CDT 2000 dbrh2:master Ping-Pong to Web Cluster OK
Fri Sep 22 11:37:23 CDT 2000 dbrh2:master PG OK
Network cable pulled ...
Fri Sep 22 11:37:44 CDT 2000 dbrh2:master Web cluster not responding
Fri Sep 22 11:37:44 CDT 2000 dbrh2:master New problem.  Waiting 10 seconds for PONG
Fri Sep 22 11:38:05 CDT 2000 dbrh2:master Web cluster not responding
Fri Sep 22 11:38:05 CDT 2000 dbrh2:master ACK! No PONG.  giving up master

And finally ... when the slave detects that PG on the master no longer
serves requests:
Fri Sep 22 20:17:37 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:17:47 CDT 2000 Postgres not responding
Fri Sep 22 20:17:47 CDT 2000 dbrh2:slave New problem.  Waiting 10 seconds for PG recovery
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave Postgres not responding
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave ACK! PG still down.  Switching Roles
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave Invoking Master Role
Fri Sep 22 20:18:21 CDT 2000 dbrh2:master Ping-Pong to Web Cluster OK
Fri Sep 22 20:18:22 CDT 2000 dbrh2:master PG OK

Bryan Ingram
bingram@sixtyfootspider.com
