Problem with restoring from backup on 9.0.2 - Mailing list pgsql-general
From: hubert depesz lubaczewski
Subject: Problem with restoring from backup on 9.0.2
Msg-id: 20101227182454.GA11903@depesz.com
List: pgsql-general
Hi,

this mail will be relatively long, because I need to explain in detail what I do and how.

Simple summary: when restoring from a hot backup, without xlogs in pg_xlog/, but instead using recovery.conf to get xlogs from the WAL archive, I get "WAL ends before consistent recovery point" in a case where it shouldn't happen.

Details: I have the following setup. Master pg is sending WAL segments to 2 places:
- /shared/walarchive - long term storage, compressed content
- /shared/4backups - short term storage, just to do hot backups on the slave, not compressed

For sending I use omnipitr-archive, with 2 "-dr" options.

Slave pg is configured to read WAL segments from /shared/4backups (using `cp -i ...`) *and* then to switch to streaming replication. Additionally, on the slave there is an archive cleanup program, which cleans /shared/4backups, but only if the /flags/backup file is not present (it is present only while a backup is running).

To make backups I use omnipitr-backup-slave, which does (rough shell sketch below):
1. creates the /flags/backup file
2. gets a snapshot of pg_controldata
3. makes backup_label in $PGDATA, where "START WAL LOCATION" is taken from "Latest checkpoint's REDO location" from step #2, and "CHECKPOINT LOCATION" comes from "Latest checkpoint location" in #2
4. compresses the whole $PGDATA
5. gets a snapshot of pg_controldata
6. waits until a new snapshot of pg_controldata (it makes another one every 5 seconds) has a value in "Latest checkpoint location" different from the one in step #5
7. makes a .backup file in /shared/4backups that contains the same information as #3, but with "STOP WAL LOCATION" taken from "Latest checkpoint location" from #6
8. compresses /shared/4backups to a secondary backup file

Afterwards I get 2 files - data.tar.gz and xlog.tar.gz.

Now. I got those files, moved them to a 3rd server, and uncompressed them. This gave me a PGDATA directory, with xlogs in pg_xlog, and no recovery.conf.

When starting from this backup, pg starts, reads all necessary segments from pg_xlog, finishes recovery, and runs without any problem as a standalone server.

But when, after extraction and before starting, I remove all xlog files, move the .backup file from pg_xlog to /shared/walarchive, and set up recovery.conf using normal omnipitr-restore to get WAL segments from /shared/walarchive - pg starts to behave weirdly. It starts, reads all segments from the first one mentioned in backup_label to the latest one available in /shared/walarchive, and then it fails with "WAL ends before consistent recovery point" - but this is *well* past the point at which it should know that the state is consistent!
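In shell terms, the whole dance is roughly equivalent to this - a simplified sketch under my assumptions about paths and pg_controldata parsing, NOT the actual omnipitr-backup-slave code:

#!/bin/bash
# Simplified sketch of the backup procedure described above.
PGCONTROLDATA=/usr/pgsql-9.0/bin/pg_controldata

get_field() {
    # pull one value out of pg_controldata output
    "$PGCONTROLDATA" "$PGDATA" | awk -F': +' -v f="$1" '$1 == f { print $2 }'
}

touch /flags/backup                                      # step 1 - pause archive cleanup

redo=$(get_field "Latest checkpoint's REDO location")    # step 2
ckpt=$(get_field "Latest checkpoint location")

# step 3 - the "(file ...)" part is the WAL segment name derived from $redo (elided)
cat > "$PGDATA/backup_label" <<EOF
START WAL LOCATION: $redo (file ...)
CHECKPOINT LOCATION: $ckpt
START TIME: $(date -u '+%Y-%m-%d %H:%M:%S GMT')
LABEL: OmniPITR_Slave_Hot_Backup
EOF

# step 4 - tar + compress the whole $PGDATA (elided)

ckpt2=$(get_field "Latest checkpoint location")          # step 5
while [ "$(get_field 'Latest checkpoint location')" = "$ckpt2" ]; do
    sleep 5                                              # step 6 - wait for a newer checkpoint
done

# steps 7-8 - write the .backup file (STOP WAL LOCATION = the new checkpoint
# location) and compress /shared/4backups (elided)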
Files content from my situation:

=$ cat backup_label.old
START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
CHECKPOINT LOCATION: 33/23101410
START TIME: 2010-12-27 11:53:44 GMT
LABEL: OmniPITR_Slave_Hot_Backup

(the file already got renamed to .old by the pg recovery process)

=$ zcat /mnt/db/prod/walarchive/000000010000003300000022.0013B200.backup.gz
START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
STOP WAL LOCATION: 33/2C24C800 (file 00000001000000330000002C)
CHECKPOINT LOCATION: 33/23101410
START TIME: 2010-12-27 11:53:44 GMT
START TIME: 2010-12-27 12:06:17 GMT
LABEL: OmniPITR_Slave_Hot_Backup

Last 20 lines from log:

=$ tail -n 20 pg_log/postgresql-2010-12-27_172759.log
2010-12-27 17:44:16 UTC [15768]: [1934-1] LOG: restored log file "000000010000003A000000B1" from archive
2010-12-27 17:44:16 UTC [15768]: [1935-1] LOG: restored log file "000000010000003A000000B2" from archive
2010-12-27 17:44:16 UTC [15768]: [1936-1] LOG: restored log file "000000010000003A000000B3" from archive
2010-12-27 17:44:17 UTC [15768]: [1937-1] LOG: restored log file "000000010000003A000000B4" from archive
2010-12-27 17:44:17 UTC [15768]: [1938-1] LOG: restored log file "000000010000003A000000B5" from archive
2010-12-27 17:44:18 UTC [15768]: [1939-1] LOG: restored log file "000000010000003A000000B6" from archive
2010-12-27 17:44:18 UTC [15768]: [1940-1] LOG: restored log file "000000010000003A000000B7" from archive
2010-12-27 17:44:19 UTC [15768]: [1941-1] LOG: restored log file "000000010000003A000000B8" from archive
2010-12-27 17:44:19 UTC [15768]: [1942-1] LOG: restored log file "000000010000003A000000B9" from archive
2010-12-27 17:44:20 UTC [15768]: [1943-1] LOG: restored log file "000000010000003A000000BA" from archive
2010-12-27 17:44:20 UTC [15768]: [1944-1] LOG: restored log file "000000010000003A000000BB" from archive
2010-12-27 17:44:21 UTC [15768]: [1945-1] LOG: restored log file "000000010000003A000000BC" from archive
2010-12-27 17:44:21 UTC [15768]: [1946-1] LOG: restored log file "000000010000003A000000BD" from archive
2010-12-27 17:44:23 UTC [15768]: [1947-1] LOG: could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
2010-12-27 17:44:23 UTC [15768]: [1948-1] LOG: redo done at 3A/BDFFF6E8
2010-12-27 17:44:23 UTC [15768]: [1949-1] LOG: last completed transaction was at log time 2010-12-27 17:43:23.558028+00
2010-12-27 17:44:23 UTC [15768]: [1950-1] LOG: restored log file "000000010000003A000000BD" from archive
2010-12-27 17:44:23 UTC [15768]: [1951-1] FATAL: WAL ends before consistent recovery point
2010-12-27 17:44:23 UTC [15765]: [1-1] LOG: startup process (PID 15768) exited with exit code 1
2010-12-27 17:44:23 UTC [15765]: [2-1] LOG: terminating any other active server processes

First 50 lines from log:

=$ head -n 50 pg_log/postgresql-2010-12-27_172759.log
2010-12-27 17:27:59 UTC [15768]: [1-1] LOG: database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
2010-12-27 17:27:59 UTC [15768]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-12-27 17:27:59 UTC [15768]: [3-1] LOG: starting archive recovery
2010-12-27 17:28:00 UTC [15768]: [4-1] LOG: restored log file "000000010000003300000023" from archive
2010-12-27 17:28:00 UTC [15768]: [5-1] LOG: restored log file "000000010000003300000022" from archive
2010-12-27 17:28:00 UTC [15768]: [6-1] LOG: redo starts at 33/2213B200
2010-12-27 17:28:01 UTC [15768]: [7-1] LOG: restored log file "000000010000003300000023" from archive
2010-12-27 17:28:01 UTC [15768]: [8-1] LOG: restored log file "000000010000003300000024" from archive
2010-12-27 17:28:01 UTC [15768]: [9-1] LOG: restored log file "000000010000003300000025" from archive
2010-12-27 17:28:02 UTC [15768]: [10-1] LOG: restored log file "000000010000003300000026" from archive
2010-12-27 17:28:02 UTC [15768]: [11-1] LOG: restored log file "000000010000003300000027" from archive
2010-12-27 17:28:03 UTC [15768]: [12-1] LOG: restored log file "000000010000003300000028" from archive
2010-12-27 17:28:03 UTC [15768]: [13-1] LOG: restored log file "000000010000003300000029" from archive
2010-12-27 17:28:03 UTC [15768]: [14-1] LOG: restored log file "00000001000000330000002A" from archive
2010-12-27 17:28:04 UTC [15768]: [15-1] LOG: restored log file "00000001000000330000002B" from archive
2010-12-27 17:28:04 UTC [15768]: [16-1] LOG: restored log file "00000001000000330000002C" from archive
2010-12-27 17:28:04 UTC [15768]: [17-1] LOG: restored log file "00000001000000330000002D" from archive
2010-12-27 17:28:05 UTC [15768]: [18-1] LOG: restored log file "00000001000000330000002E" from archive
2010-12-27 17:28:05 UTC [15768]: [19-1] LOG: restored log file "00000001000000330000002F" from archive
2010-12-27 17:28:06 UTC [15768]: [20-1] LOG: restored log file "000000010000003300000030" from archive
2010-12-27 17:28:06 UTC [15768]: [21-1] LOG: restored log file "000000010000003300000031" from archive
2010-12-27 17:28:07 UTC [15768]: [22-1] LOG: restored log file "000000010000003300000032" from archive
2010-12-27 17:28:07 UTC [15768]: [23-1] LOG: restored log file "000000010000003300000033" from archive
2010-12-27 17:28:08 UTC [15768]: [24-1] LOG: restored log file "000000010000003300000034" from archive
2010-12-27 17:28:08 UTC [15768]: [25-1] LOG: restored log file "000000010000003300000035" from archive
2010-12-27 17:28:08 UTC [15768]: [26-1] LOG: restored log file "000000010000003300000036" from archive
2010-12-27 17:28:09 UTC [15768]: [27-1] LOG: restored log file "000000010000003300000037" from archive
2010-12-27 17:28:09 UTC [15768]: [28-1] LOG: restored log file "000000010000003300000038" from archive
2010-12-27 17:28:10 UTC [15768]: [29-1] LOG: restored log file "000000010000003300000039" from archive
2010-12-27 17:28:10 UTC [15768]: [30-1] LOG: restored log file "00000001000000330000003A" from archive
2010-12-27 17:28:11 UTC [15768]: [31-1] LOG: restored log file "00000001000000330000003B" from archive
2010-12-27 17:28:11 UTC [15768]: [32-1] LOG: restored log file "00000001000000330000003C" from archive
2010-12-27 17:28:12 UTC [15768]: [33-1] LOG: restored log file "00000001000000330000003D" from archive
2010-12-27 17:28:12 UTC [15768]: [34-1] LOG: restored log file "00000001000000330000003E" from archive
2010-12-27 17:28:12 UTC [15768]: [35-1] LOG: restored log file "00000001000000330000003F" from archive
2010-12-27 17:28:13 UTC [15768]: [36-1] LOG: restored log file "000000010000003300000040" from archive
2010-12-27 17:28:13 UTC [15768]: [37-1] LOG: restored log file "000000010000003300000041" from archive
2010-12-27 17:28:14 UTC [15768]: [38-1] LOG: restored log file "000000010000003300000042" from archive
2010-12-27 17:28:14 UTC [15768]: [39-1] LOG: restored log file "000000010000003300000043" from archive
2010-12-27 17:28:14 UTC [15768]: [40-1] LOG: restored log file "000000010000003300000044" from archive
2010-12-27 17:28:15 UTC [15768]: [41-1] LOG: restored log file "000000010000003300000045" from archive
2010-12-27 17:28:15 UTC [15768]: [42-1] LOG: restored log file "000000010000003300000046" from archive
2010-12-27 17:28:15 UTC [15768]: [43-1] LOG: restored log file "000000010000003300000047" from archive
2010-12-27 17:28:16 UTC [15768]: [44-1] LOG: restored log file "000000010000003300000048" from archive
2010-12-27 17:28:16 UTC [15768]: [45-1] LOG: restored log file "000000010000003300000049" from archive
2010-12-27 17:28:16 UTC [15768]: [46-1] LOG: restored log file "00000001000000330000004A" from archive
2010-12-27 17:28:17 UTC [15768]: [47-1] LOG: restored log file "00000001000000330000004B" from archive
2010-12-27 17:28:17 UTC [15768]: [48-1] LOG: restored log file "00000001000000330000004C" from archive
2010-12-27 17:28:18 UTC [15768]: [49-1] LOG: restored log file "00000001000000330000004D" from archive
2010-12-27 17:28:18 UTC [15768]: [50-1] LOG: restored log file "00000001000000330000004E" from archive

pg_controldata:

=$ /usr/pgsql-9.0/bin/pg_controldata .
pg_control version number:            903
Catalog version number:               201008051
Database system identifier:           5554952960675921391
Database cluster state:               in archive recovery
pg_control last modified:             Mon 27 Dec 2010 05:43:36 PM GMT
Latest checkpoint location:           3A/6377BB8
Prior checkpoint location:            37/7D93A540
Latest checkpoint's REDO location:    39/EB0350B8
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/1359777
Latest checkpoint's NextOID:          294922
Latest checkpoint's NextMultiXactId:  10752
Latest checkpoint's NextMultiOffset:  30580
Latest checkpoint's oldestXID:        654
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  1306958
Time of latest checkpoint:            Mon 27 Dec 2010 04:50:53 PM GMT
Minimum recovery ending location:     3A/B7895958
Backup start location:                33/2213B200
Current wal_level setting:            hot_standby
Current max_connections setting:      1024
Current max_prepared_xacts setting:   10
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

Lines from the log that are *not* about restored log files:

=$ grep -nvE 'LOG: restored log file' pg_log/postgresql-2010-12-27_172759.log
1:2010-12-27 17:27:59 UTC [15768]: [1-1] LOG: database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
2:2010-12-27 17:27:59 UTC [15768]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
3:2010-12-27 17:27:59 UTC [15768]: [3-1] LOG: starting archive recovery
6:2010-12-27 17:28:00 UTC [15768]: [6-1] LOG: redo starts at 33/2213B200
581:2010-12-27 17:33:00 UTC [15781]: [1-1] LOG: restartpoint starting: time
666:2010-12-27 17:33:47 UTC [15781]: [2-1] LOG: restartpoint complete: wrote 9775 buffers (14.9%); write=38.593 s, sync=5.409 s, total=46.922 s
667:2010-12-27 17:33:47 UTC [15781]: [3-1] LOG: recovery restart point at 35/13010BE8
668:2010-12-27 17:33:47 UTC [15781]: [4-1] DETAIL: last completed transaction was at log time 2010-12-27 16:00:12.119251+00
1172:2010-12-27 17:38:00 UTC [15781]: [5-1] LOG: restartpoint starting: time
1276:2010-12-27 17:38:47 UTC [15781]: [6-1] LOG: restartpoint complete: wrote 10779 buffers (16.4%); write=44.044 s, sync=2.710 s, total=47.082 s
1277:2010-12-27 17:38:47 UTC [15781]: [7-1] LOG: recovery restart point at 37/5F01F358
1278:2010-12-27 17:38:47 UTC [15781]: [8-1] DETAIL: last completed transaction was at log time 2010-12-27 16:32:19.637622+00
1787:2010-12-27 17:43:00 UTC [15781]: [9-1] LOG: restartpoint starting: time
1859:2010-12-27 17:43:36 UTC [15781]: [10-1] LOG: restartpoint complete: wrote 6314 buffers (9.6%); write=24.869 s, sync=9.906 s, total=36.024 s
1860:2010-12-27 17:43:36 UTC [15781]: [11-1] LOG: recovery restart point at 39/EB0350B8
1861:2010-12-27 17:43:36 UTC [15781]: [12-1] DETAIL: last completed transaction was at log time 2010-12-27 17:07:35.913562+00
1959:2010-12-27 17:44:23 UTC [15768]: [1947-1] LOG: could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
1960:2010-12-27 17:44:23 UTC [15768]: [1948-1] LOG: redo done at 3A/BDFFF6E8
1961:2010-12-27 17:44:23 UTC [15768]: [1949-1] LOG: last completed transaction was at log time 2010-12-27 17:43:23.558028+00
1963:2010-12-27 17:44:23 UTC [15768]: [1951-1] FATAL: WAL ends before consistent recovery point
1964:2010-12-27 17:44:23 UTC [15765]: [1-1] LOG: startup process (PID 15768) exited with exit code 1
1965:2010-12-27 17:44:23 UTC [15765]: [2-1] LOG: terminating any other active server processes

Since pg on its own starts fine from this backup, and we use omnipitr-backup-slave on many hosts and there has never been a problem with it, I'm reluctant to assume the bug is in there. So that leaves either PEBKAC or some bug in pg?

Best regards,

depesz

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
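PS. To double check the "*well* past" claim: in 9.0, LSN comparison is simply lexicographic on the (xlogid, offset) pair, so the locations from the log and pg_controldata can be compared numerically. A throwaway bash helper (my own sketch; values copy-pasted from the output above):

lsn_to_dec() {
    # "X/Y" -> decimal; ordering of the result matches PostgreSQL's LSN comparison
    local hi=${1%%/*} lo=${1##*/}
    echo $(( 16#$hi * 16#100000000 + 16#$lo ))
}

redo_done=$(lsn_to_dec 3A/BDFFF6E8)      # "redo done at" from the log
min_recovery=$(lsn_to_dec 3A/B7895958)   # "Minimum recovery ending location"
backup_stop=$(lsn_to_dec 33/2C24C800)    # "STOP WAL LOCATION" from the .backup file

(( redo_done > min_recovery )) && echo "redo ended past minimum recovery point"
(( redo_done > backup_stop ))  && echo "redo ended past backup stop location"

Both checks print their message - redo went past both the minimum recovery ending location and the backup's STOP WAL LOCATION, which is why the FATAL looks wrong to me.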