PG Automated Backup and Failover System (long) - Mailing list pgsql-general
From    | Ingram, Bryan
Subject | PG Automated Backup and Failover System (long)
Date    |
Msg-id  | 01CCE949D2717845BA2E573DC081167E052F12@BKMAIL.sixtyfootspider.com
List    | pgsql-general
I'm currently testing a system I've written for Postgres that provides automatic failover to a backup database server in the event that the master db server dies. The system was designed using RedHat 6.2 and Postgres 7.0.2 in a two-node cluster with two NICs in each node. The extra NIC in each machine provides a private network between the nodes.

The system employs two major components:

1) Postgres Backup/Restore component
2) Takeover monitoring component

Until Postgres has some kind of replication (Jan - need some beer, food, moral support??) the best technique I could come up with to provide a reasonably up-to-date copy of the master server was to create a process that periodically checks for changes to the database(s), pg_dumps the contents, alerts the slave node, then applies the backup files to the slave node. This is all done via an automated process with many integrity checks along the way. If any errors are found, a message with a descriptive line of text is sent to my cell phone. I'm aware of the shortcomings of this technique, but at this point, what else is there?

I've set the system to check for data changes, and to make backups if necessary, every 5 minutes (no changes, no backups). While many organizations would shudder at the idea of losing up to 5 minutes of data, for some the risk is worth the reward (having a backup server automatically take over within 30 seconds). Depending on your application, this kind of "replication" may or may not be acceptable to you. Also, if you deal with very large databases with frequent data changes, you might find that 5 minutes is too small an interval. With Postgres 7.0.2 and a decent machine I can now load one of our databases of ca. 1,500,000 rows in 47 seconds without sync, as opposed to 4+ hours with 6.5.x. (Though that was reduced to 1.5 hours by distributing the processes across two machines.)
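For anyone curious, the backup component could be sketched roughly like this. To be clear: these are not my actual scripts — the directory layout, the ready.<db> marker convention, and the checksum-based change check are all illustrative guesses at one way to do it.

```shell
#!/bin/sh
# Rough sketch of a master-side backup pass (run from cron every 5 minutes).
# DUMPDIR would be the NFS-shared directory the slave watches; ready.<db> is
# an assumed signaling convention, not necessarily the real one.

DUMPDIR=${DUMPDIR:-/var/lib/pgfailover}
DBS=${DBS:-"bahamas zipfind stfrancis afcard afemail test1"}

# Return success (0) if the new dump differs from the last recorded checksum.
changed() {
    new=$(md5sum "$1" | cut -d' ' -f1)
    old=$(cat "$2" 2>/dev/null)
    [ "$new" != "$old" ]
}

backup_pass() {
    for db in $DBS; do
        echo "$(date) dbrh1:master Start $db backup"
        pg_dump "$db" > "$DUMPDIR/dbout.$db.new" || continue
        if changed "$DUMPDIR/dbout.$db.new" "$DUMPDIR/sum.$db"; then
            mv "$DUMPDIR/dbout.$db.new" "$DUMPDIR/dbout.$db"
            md5sum "$DUMPDIR/dbout.$db" | cut -d' ' -f1 > "$DUMPDIR/sum.$db"
            touch "$DUMPDIR/ready.$db"   # tell the slave a fresh dump exists
            echo "$(date) dbrh1:master End $db backup"
        else
            rm -f "$DUMPDIR/dbout.$db.new"
            echo "$(date) dbrh1:master $db has not changed. Skipping."
        fi
    done
    echo "$(date) dbrh1:master Archiving complete."
}
```

The "no changes, no backups" behavior falls out of the checksum comparison: an unchanged database produces a byte-identical dump, so it is skipped.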
So, basically, component one provides an almost-fresh copy of the master server on the slave node at any given time.

Component two is a set of scripts which run on both the master and slave, monitoring the state of the systems. These scripts check for error conditions every 10 seconds and monitor these situations:

- Hub/Switch/Cable Failure
- Web Cluster Failures
- Local Node Network Failure
- Server OS/Hardware Crash
- Postgres Failures
- NFS Failures

All except the NFS failure will trigger a node takeover. Error checks are performed using pings, psql db connections and a few other techniques. And as with component one, any error found will generate an email message.

So far, both components are passing testing with flying colors. The node takeover scripts are performing wonderfully, as are the automated backup and restore scripts. I've actually been able to have testers go through Postgres/PHP-driven web sites, pull the plug on the database server, and then have the slave server come up and respond to requests within 30 seconds, without the user ever seeing an error. (Unless you consider a 30-second pause in page load an error.)

There are a lot of details to this system I can't describe here, but I'd be happy to answer any questions. As I'm nearing the completion of the formal testing phase and nearing a production phase, I would love to hear from anyone who has implemented a similar system or can possibly help me explore potential pitfalls I may not have considered.

Here are some logs from each component for your amusement:

Takeover Component:

Sample output from the master's logs ...
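The shape of the 10-second checks is roughly the following. Again, a minimal sketch rather than the real scripts: the private-network hostname and the escalation policy (one failed probe starts a grace period, a second consecutive failure triggers the role switch) are my reading of the behavior, and only the Postgres probe is shown.

```shell
#!/bin/sh
# Minimal sketch of the slave-side monitor. The real scripts also probe the
# web cluster, the NFS mount, and the local network; this shows only the
# Postgres check. MASTER is an assumed hostname on the private network.

MASTER=${MASTER:-dbrh1-private}

# Network reachability: a single ping.
net_ok() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# Postgres liveness: a trivial query fails fast if the postmaster is down.
pg_ok() {
    psql -h "$1" -c 'SELECT 1;' template1 >/dev/null 2>&1
}

check_master() {
    if net_ok "$MASTER" && pg_ok "$MASTER"; then
        echo "$(date) dbrh2:slave PG on MASTER is OK"
    else
        echo "$(date) Postgres not responding"
        return 1
    fi
}

# The real monitor loops with 'sleep 10' between checks, switches roles only
# after a second consecutive failure (the 10-second grace period visible in
# the logs), and first confirms it can still reach the web cluster itself --
# otherwise the failure may be the slave's own network, not the master.
```

The last point matters for split-brain avoidance: a slave that cannot ping the web farm must not take over, which is exactly the "No pong from webfarm, cant takeover" case in the logs.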
Fri Sep 22 19:14:43 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:14:43 CDT 2000 dbrh1:master PG OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master PG OK
Fri Sep 22 19:14:54 CDT 2000 dbrh1:master dbrh1:master slave machine is up
Fri Sep 22 19:15:01 CDT 2000 dbrh1:master NFS mount OK
Fri Sep 22 19:15:05 CDT 2000 dbrh1:master Ping-Pong to Web Cluster OK
Fri Sep 22 19:15:05 CDT 2000 dbrh1:master PG OK

From the slave's logs ...

Fri Sep 22 19:59:32 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 19:59:42 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 19:59:52 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:01 CDT 2000 dbrh2:slave NFS mount OK
Fri Sep 22 20:00:02 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:12 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:00:22 CDT 2000 dbrh2:slave PG on MASTER is OK

Backup/Restore Component:

On the master ...

Fri Sep 22 17:10:00 CDT 2000 dbrh1:master ***********************
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master users.dat: 74
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start bahamas backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master bahamas has not changed. Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start zipfind -t dealers backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master zipfind has not changed. Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start stfrancis backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master stfrancis has not changed. Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start afcard backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master afcard has not changed. Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start afemail backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master afemail has not changed. Skipping.
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Start test1 backup
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master End test1 backup 107225
Fri Sep 22 17:10:00 CDT 2000 dbrh1:master Archiving complete.

Fri Sep 22 17:15:00 CDT 2000 dbrh1:master ***********************
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master users.dat: 74
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start bahamas backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master bahamas has not changed. Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start zipfind -t dealers backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master zipfind has not changed. Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start stfrancis backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master stfrancis has not changed. Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start afcard backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master afcard has not changed. Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start afemail backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master afemail has not changed. Skipping.
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Start test1 backup
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master End test1 backup 113757
Fri Sep 22 17:15:01 CDT 2000 dbrh1:master Archiving complete.

And on the slave ...

Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave dbs.dat FOUND. good
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave CNTL.template1 not found. SKIPPING
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave cntl.bahamas found. good
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave Checking for ready.bahamas
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave bahamas NOT READY. SKIPPING
Fri Sep 22 15:07:01 CDT 2000 dbrh2:slave cntl.dealers found. good

Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dbs.dat FOUND. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave CNTL.template1 not found. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.bahamas found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.bahamas
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave bahamas NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.dealers found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.dealers
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dealers NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.stfrancis found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.stfrancis
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave stfrancis NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.afcard found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.afcard
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave afcard NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.afemail found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.afemail
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave afemail NOT READY. SKIPPING
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.test1 found. good
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Checking for ready.test1
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Found ready.test1
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave test1 locked
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave cntl.test1 copied
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave dbout.test1 copied
Fri Sep 22 15:21:01 CDT 2000 dbrh2:slave Start reload on dbout.test1
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave End reload on dbout.test1
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave 20002661520.dbout.test1 created
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave test1 unlocked
Fri Sep 22 15:21:02 CDT 2000 dbrh2:slave ----------------------

And a few generated from the slave during error conditions.

Hub/Switch Failure ...

Wed Sep 13 17:21:34 CDT 2000 dbrh2:slave PG on MASTER is OK

This is where we kill the hub ...

Wed Sep 13 17:21:46 CDT 2000 Postgres not responding
Wed Sep 13 17:21:46 CDT 2000 dbrh2:slave New problem. Waiting 10 seconds for PG recovery
Wed Sep 13 17:22:08 CDT 2000 Postgres not responding
Wed Sep 13 17:22:08 CDT 2000 dbrh2:slave ACK! PG still down. Switching Roles
Wed Sep 13 17:22:19 CDT 2000 dbrh2:slave Error. No pong from webfarm, cant takeover
Wed Sep 13 17:22:31 CDT 2000 Postgres not responding
Wed Sep 13 17:22:31 CDT 2000 dbrh2:slave ACK! PG still down. Switching Roles
Wed Sep 13 17:22:42 CDT 2000 dbrh2:slave Error. No pong from webfarm, cant takeover

Pulling the network cable on the master:

Fri Sep 22 11:37:23 CDT 2000 dbrh2:master Ping-Pong to Web Cluster OK
Fri Sep 22 11:37:23 CDT 2000 dbrh2:master PG OK

Network cable pulled ...

Fri Sep 22 11:37:44 CDT 2000 dbrh2:master Web cluster not responding
Fri Sep 22 11:37:44 CDT 2000 dbrh2:master New problem. Waiting 10 seconds for PONG
Fri Sep 22 11:38:05 CDT 2000 dbrh2:master Web cluster not responding
Fri Sep 22 11:38:05 CDT 2000 dbrh2:master ACK! No PONG. giving up master

And finally ... when the slave detects that PG on the master no longer serves requests:

Fri Sep 22 20:17:37 CDT 2000 dbrh2:slave PG on MASTER is OK
Fri Sep 22 20:17:47 CDT 2000 Postgres not responding
Fri Sep 22 20:17:47 CDT 2000 dbrh2:slave New problem. Waiting 10 seconds for PG recovery
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave Postgres not responding
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave ACK! PG still down. Switching Roles
Fri Sep 22 20:18:09 CDT 2000 dbrh2:slave Invoking Master Role
Fri Sep 22 20:18:21 CDT 2000 dbrh2:master Ping-Pong to Web Cluster OK
Fri Sep 22 20:18:22 CDT 2000 dbrh2:master PG OK

Bryan Ingram
bingram@sixtyfootspider.com