Thread: The Problem of Applying Point-in-time Recovery

From: Shih Théo
Date: 20 June 2015, 13:12:02

Dear Sir/Madam,

I am not sure whether this is the proper place to post my problem. If not, please forgive my ignorance and tell me where I should post it.

Recently I have been setting up point-in-time recovery with PostgreSQL 8.3 on Debian (for various reasons, I cannot upgrade), but I have run into some problems. I followed the instructions in the official documentation step by step:

1. First, I modified postgresql.conf to enable WAL archiving and restarted PostgreSQL.

2. Then I simply tarred the whole cluster data directory, ${PG_DATA}, to create the base backup. I called pg_start_backup('label') before running tar and pg_stop_backup() after it finished.

3. After that, I inserted some data into the database.

4. Next, I simulated a corrupted database that needed to be recovered:

4.0. Stopped PostgreSQL.

4.1. Moved the WAL files from ${PG_DATA}/pg_xlog to another directory.

4.2. Untarred the base backup and moved its contents into ${PG_DATA}, overwriting the existing files.

4.3. Created recovery.conf with the following configuration (note that the archived WAL files are stored on a remote host):

restore_command = 'rsync -a host_user@host_ip:/path/to/remote/host/wal/%f %p'

recovery_target_time = 'YYYY-mm-dd HH:MM:SS'

recovery_target_timeline = 'value'

4.4. Restarted PostgreSQL.
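For reference, the recovery.conf from step 4.3 could also be generated by a small script, so that every repeated run uses identical settings. This is a minimal Python sketch, not part of the documented procedure: the function name render_recovery_conf, its parameters and the example values are hypothetical, and it assumes the archive is reachable over rsync exactly as in the restore_command above. PostgreSQL substitutes %f and %p itself, so they are written out literally. When no timeline is given, the recovery_target_timeline line is left out, which in 8.3 means recovering along the timeline that was current when the base backup was taken.

```python
from datetime import datetime
from typing import Optional


def render_recovery_conf(remote: str, archive_dir: str, target_time: datetime,
                         target_timeline: Optional[str] = None) -> str:
    """Build recovery.conf text for a PITR restore (hypothetical helper).

    remote is e.g. 'user@host' and archive_dir the WAL archive path on it.
    %f and %p are kept literal; PostgreSQL replaces them at restore time.
    """
    # Single quotes would need escaping inside a recovery.conf value;
    # this sketch simply rejects them instead of handling that case.
    for value in (remote, archive_dir, target_timeline or ""):
        if "'" in value:
            raise ValueError("single quotes are not supported in this sketch")

    restore = f"rsync -a {remote}:{archive_dir.rstrip('/')}/%f %p"
    lines = [
        f"restore_command = '{restore}'",
        # Timestamp without a zone: interpreted in the server's time zone.
        f"recovery_target_time = '{target_time:%Y-%m-%d %H:%M:%S}'",
    ]
    if target_timeline is not None:
        lines.append(f"recovery_target_timeline = '{target_timeline}'")
    return "\n".join(lines) + "\n"
```

The returned text could then be written to ${PG_DATA}/recovery.conf just before step 4.4.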

At first, everything was fine and the recovery succeeded. The log showed that PostgreSQL restored the WAL files, and the data I had inserted in step 3 was present in the database. But when I performed the recovery repeatedly (that is, I repeated steps 4.0 through 4.4 several times), recovery failed very often. Here is the error output:

2015-06-18 20:22:02 GMT+8 LOG: restored log file "00000001000000000000002E.00000020.backup" from archive
2015-06-18 20:22:03 GMT+8 LOG: restored log file "00000001000000000000002E" from archive
2015-06-18 20:22:03 GMT+8 LOG: unexpected pageaddr 0/2A000000 in log file 0, segment 46, offset 0
2015-06-18 20:22:03 GMT+8 LOG: invalid checkpoint record
2015-06-18 20:22:03 GMT+8 FATAL: could not locate required checkpoint record
2015-06-18 20:22:03 GMT+8 HINT: If you are not restoring from a backup, try removing the file "/home/genie/db_mount_point/backup_label".
2015-06-18 20:22:03 GMT+8 LOG: startup process (PID 658) exited with exit code 1
2015-06-18 20:22:03 GMT+8 LOG: aborting startup due to startup process failure
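The file names and locations in this log can be decoded by hand. A WAL file name is three 8-digit hex fields (timeline, log file ID, segment), and a backup history file appends the offset of the backup start within that segment. The Python sketch below assumes the default 16 MB segment size (fixed at build time in 8.3); the helper names are hypothetical. It shows that 00000001000000000000002E is timeline 1, log file 0, segment 46, matching "log file 0, segment 46" in the error, so its first page should carry the address 0/2E000000. The reported pageaddr 0/2A000000 belongs to segment 42 instead. This check exists to catch a segment file whose first page header describes a different segment, as happens when a file still holds data from an older segment (PostgreSQL recycles old segment files by renaming them). The sketch only decodes the numbers; it does not show how such a file ended up being restored.

```python
SEG_SIZE = 16 * 1024 * 1024  # assumed default 16 MB WAL segment size


def parse_wal_name(name: str):
    """Split a WAL file name into (timeline, log_id, segment, offset).

    offset is only present for backup history files such as
    '00000001000000000000002E.00000020.backup'; otherwise it is 0.
    """
    base, _, rest = name.partition(".")
    if len(base) != 24:
        raise ValueError(f"not a WAL file name: {name!r}")
    tli, log_id, seg = (int(base[i:i + 8], 16) for i in (0, 8, 16))
    offset = int(rest.split(".")[0], 16) if rest else 0
    return tli, log_id, seg, offset


def lsn(log_id: int, seg: int, offset: int = 0) -> str:
    """Format a WAL location in the 'log/offset' hex style used in the log."""
    return f"{log_id:X}/{seg * SEG_SIZE + offset:X}"


def segment_of(pageaddr: str) -> int:
    """Return the segment number that a 'log/offset' page address falls in."""
    _, off = pageaddr.split("/")
    return int(off, 16) // SEG_SIZE
```

For example, the backup history file decodes to a backup start location of lsn(0, 46, 32), that is 0/2E000020, while segment_of("0/2A000000") gives 42 rather than the expected 46.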

I do not know exactly what caused the problem. Did it happen because I performed the recovery repeatedly? Any suggestions would be appreciated.

Yours faithfully