Re: Online Backup and WAL archives - Mailing list pgsql-admin
From | Morus Walter |
---|---|
Subject | Re: Online Backup and WAL archives |
Date | |
Msg-id | 16896.48660.415157.247818@tanto-xipolis.de Whole thread Raw |
In response to | Re: Online Backup and WAL archives (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-admin |
Tom Lane writes: Thanks for your answer Tom, and also thanks to the others, that answered. > > The documentation says > > ' To make use of this backup, you will need to keep around all the > > WAL segment files generated at or after the starting time of the backup. ' > > Now I'm wondering how much of these WAL segment files do I really need > > in order to recover the databases to a consistent state. > > If you are satisfied with recovering to the state shortly after you > completed the backup, then it would be sufficient to have a set of WAL > files spanning the time period in which the backup is done. Hmm. But that set is probably empty if there are few transaction during the backup. So I guess either I have to save non archived WALs as well (which seems to be problematic since they might change during saving) or wait until there is at least one WAL archived after the end of the backup. Is it save to backup non archived WALs? Probably it is, since the postmaster has to deal with incomplete writes to WALs for crash recovery anyway. > I'm dubious > that this is necessarily an improvement over a pg_dump backup, though. > It's better in the case I don't loose the WAL archive at least. Of course it's best not to need the backup at all, and second best to be able to restore to a state as close to the failure as possible. OTOH one of the worst cases I can think of, is to think to have a daily backup and find that it's unusable at the moment you need it, e.g. because you lost an WAL that's essential for recovery. > > I expect the online backup to faster on recovery than an SQL dump, since > > the latter would imply recreation of indexes during recovery. > > Is that assumption founded on any hard evidence? > It wasn't. But I did some tests and found it to be true. Backup using pg_dumpall: ------------------------ time pg_dumpall -U postgres | gzip > pg.dump.gz 37.818u 7.697s 9:44.62 7.7% 0+0k 0+0io 2108pf+0w 266.722u 15.111s 9:44.61 48.2% 0+0k 0+0io 2198pf+0w ll pg.dump.gz -rw-rw-r-- 1 morus tanto 308194343 Feb 2 09:08 pg.dump.gz Recovery: --------- - create directories (data, tablespaces) - initdb - copy config/access config from backup - start postmaster - recover using time gunzip -c pg.dump.gz | psql -U postgres template1 -> 1:08:17 - analyze all tables time vacuumdb -U postgres -a -v -z -> 0:06:13 So over all it's 1 1/4 to 1 1/2 hours to recover (once the machine is up again and you got the dump from tape). BTW: shouldn't pg_dumpall add an apropriate encoding command to avoid encoding trouble during restore? Just to make sure people won't shoot themselfs into the foot by having some encoding command in ~/.psqlrc. Online Backup: -------------- All data directories are copied by a 'tar -czf' After pg_stop_backup() is issued all archived WAL files are saved to another tar.gz file. The test was done without any transactions going on. real 9m11.644s user 7m28.789s sys 0m33.730s Backup-Size: -rw-r--r-- 1 root root 782785152 Feb 2 11:18 backup.20050202_110935.tar.gz -rw-r--r-- 1 root root 302 Feb 2 11:18 backup.wal.20050202_110935.tar.gz (the second file contains the archives WALs, so it's just the .backup file) Recovery: --------- There are no archived WALs after the 000000010000000100000039.000F51E8.backup file. I kept the non archived WALs (they are linked from data/pg_xlog) - stop postmaster - remove all files/dirs in data, tablespaces - unpack backup (tar xvzf ...) took 0:05:23 - prepare recovery.conf (note: I erroneously created i in pg_xlog, see below) - start the server - since there was nothing to redo, server came up immediatly: from the log: 2005-02-02 11:35:25 CET % LOG: database system was interrupted at 2005-02-02 11:09:35 CET 2005-02-02 11:35:25 CET % LOG: checkpoint record is at 1/390F51E8 2005-02-02 11:35:25 CET % LOG: redo record is at 1/390F51E8; undo record is at 0/0; shutdown FALSE 2005-02-02 11:35:25 CET % LOG: next transaction ID: 3679; next OID: 27566926 2005-02-02 11:35:25 CET % LOG: database system was not properly shut down; automatic recovery in progress 2005-02-02 11:35:25 CET % LOG: redo starts at 1/390F5224 2005-02-02 11:35:25 CET % LOG: record with zero length at 1/390F5260 2005-02-02 11:35:25 CET % LOG: redo done at 1/390F5224 2005-02-02 11:35:25 CET % LOG: database system is ready So the recovery time is ~ 10 minutes (ok, theoretically I would have to repeat that test because of the erroneous recovery.conf; but I don't think it would make a difference). I did a second test doing some data manipulation during the backup. I followed the recovery steps above, with the addition of upacking the archived WALs, which takes just a few seconds. Again I kept the pg_xlog. After moved the recovery.conf to the data directory the server recovered in ~ 15 seconds. 2005-02-02 12:20:12 CET % LOG: database system was interrupted at 2005-02-02 11:50:34 CET 2005-02-02 12:20:12 CET % LOG: starting archive recovery 2005-02-02 12:20:12 CET % LOG: restore_command = "cp /d1/postgresql8.0/archive/%f %p" cp: Aufruf von stat für »/d1/postgresql8.0/archive/00000001.history« nicht möglich: Datei oder Verzeichnis nicht gefunden 2005-02-02 12:20:12 CET % LOG: restored log file "000000010000000100000039" from archive 2005-02-02 12:20:12 CET % LOG: checkpoint record is at 1/390F52D8 2005-02-02 12:20:12 CET % LOG: redo record is at 1/390F52D8; undo record is at 0/0; shutdown FALSE 2005-02-02 12:20:12 CET % LOG: next transaction ID: 3687; next OID: 27562212 2005-02-02 12:20:12 CET % LOG: automatic recovery in progress 2005-02-02 12:20:14 CET % LOG: redo starts at 1/390F5314 2005-02-02 12:20:15 CET % LOG: restored log file "00000001000000010000003A" from archive 2005-02-02 12:20:16 CET % LOG: restored log file "00000001000000010000003B" from archive 2005-02-02 12:20:17 CET % LOG: restored log file "00000001000000010000003C" from archive 2005-02-02 12:20:18 CET % LOG: restored log file "00000001000000010000003D" from archive 2005-02-02 12:20:21 CET % LOG: restored log file "00000001000000010000003E" from archive 2005-02-02 12:20:22 CET % LOG: restored log file "00000001000000010000003F" from archive cp: Aufruf von stat für »/d1/postgresql8.0/archive/000000010000000100000040« nicht möglich: Datei oder Verzeichnis nichtgefunden 2005-02-02 12:20:23 CET % LOG: record with zero length at 1/407793A4 2005-02-02 12:20:23 CET % LOG: redo done at 1/40779368 cp: Aufruf von stat für »/d1/postgresql8.0/archive/000000010000000100000040« nicht möglich: Datei oder Verzeichnis nichtgefunden 2005-02-02 12:20:23 CET % LOG: archive recovery complete 2005-02-02 12:20:27 CET % LOG: database system is ready (the german error message of cp means `cannot stat file' / `file or directory not found') So recovery time still is ~ 10 minutes. This might change if there are much more WALs to redo but this should only happen if one recovers to a state significant time after the backup. The server is a 2 CPU (Xeon 1.8 GHz with Hyperthreading) machine running linux (RedHat 8.0, Kernel 2.4.18smp) with 1 GB RAM. The disk is a two disk raid 1. All data including the dumps were on this disk. Everything was done locally. postgresql version is 8.0.0. Of course the test leaves out the time it takes to load the dump from a tape or any other external storage. Since the online backup has more than 2.5 times the size of the dump this will presumably be slower for the online backup. Recovering from a dump has the advantage that one gets fresh tables and indices (so you save the next vacuuming and reindexing) but I doubt that's worth an aditional hour offline. Morus
pgsql-admin by date: