Thread: PITR problem

PITR problem

From
wstrzalka
Date:
I have some problem with setting up PITR recovery on the database.

I have archive_command set properly and logs are shipping OK. Archive
timeout is also set (5 min).

When performing pg_start_backup the WAL is lets say on position
0000000100000001000000D9, then I start copy database to the second
machine which takes me 30 minutes. In that time archive timeout is
called a few times and those file are shipped properly to the second
host. After DB is succesfully copied i'm calling pg_stop_backup. The
WAL is at the moment on position 0000000100000001000000DE.

In that moment I see on the second machine WAL files from
0000000100000001000000D9 to 0000000100000001000000DE as well as
0000000100000001000000D9.00000020.backup

The problem occurs now when I'm trying to start my standby server in
recovery mode (with pg_standby).

The output from pg_standby:
------------------------------------
Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 00000001.history
WAL file path            : /var/lib/pgsql/incoming_wal/
00000001.history
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK


Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9.00000020.backup
WAL file path            : /var/lib/pgsql/incoming_wal/
0000000100000001000000D9.00000020.backup
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK


Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9
WAL file path            : /var/lib/pgsql/incoming_wal/
0000000100000001000000D9
Restoring to...          : pg_xlog/RECOVERYXLOG
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"

--------------------------------------------------------------------------------------------------------


For the first time I start standby Postgres log says and the postgres
process goes down:
--------------------------------------------------------------------------------------------------------
restored log file "0000000100000001000000D9.00000020.backup" from
archive
could not open file "pg_xlog/0000000100000001000000D9" (log file 1,
segment 217): No such file or directory
invalid checkpoint record
could not locate required checkpoint record
If you are not restoring from a backup, try removing the file "/var/
lib/pgsql/data/backup_label".
startup process (PID 19201) was terminated by signal 6: Aborted
aborting startup due to startup process failure
--------------------------------------------------------------------------------------------------------

When I try to start PG for the second time it just stucks waiting
for ...000D9

In my opinion the problem is that when starting standby PostgresSQL
wants to recovery WAL 0000000100000001000000D9, but first deletes it,
as keep  archive history (%r) param is set to
0000000100000001000000DB

Is it a bug or I'm missing something? I can repeat the scenario with
this big DB. However it's not happening on exactly the same
environment when playing with smaller cluster (copying cluster is
shorter then archive_timeout ).


                                           Regards
                                                  WS



Re: PITR problem

From
Erik Jones
Date:
On Apr 26, 2008, at 5:11 PM, wstrzalka wrote:

> I have some problem with setting up PITR recovery on the database.
>
> I have archive_command set properly and logs are shipping OK. Archive
> timeout is also set (5 min).
>
> When performing pg_start_backup the WAL is lets say on position
> 0000000100000001000000D9, then I start copy database to the second
> machine which takes me 30 minutes. In that time archive timeout is
> called a few times and those file are shipped properly to the second
> host. After DB is succesfully copied i'm calling pg_stop_backup. The
> WAL is at the moment on position 0000000100000001000000DE.
>
> In that moment I see on the second machine WAL files from
> 0000000100000001000000D9 to 0000000100000001000000DE as well as
> 0000000100000001000000D9.00000020.backup
>
> The problem occurs now when I'm trying to start my standby server in
> recovery mode (with pg_standby).
>
> The output from pg_standby:
> ------------------------------------
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 00000001.history
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 00000001.history
> Restoring to...          : pg_xlog/RECOVERYHISTORY
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 00000001.history" "pg_xlog/RECOVERYHISTORY"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
>
>
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 0000000100000001000000D9.00000020.backup
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup
> Restoring to...          : pg_xlog/RECOVERYHISTORY
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
>
>
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 0000000100000001000000D9
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9
> Restoring to...          : pg_xlog/RECOVERYXLOG
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"
>
> --------------------------------------------------------------------------------------------------------
>
>
> For the first time I start standby Postgres log says and the postgres
> process goes down:
> --------------------------------------------------------------------------------------------------------
> restored log file "0000000100000001000000D9.00000020.backup" from
> archive
> could not open file "pg_xlog/0000000100000001000000D9" (log file 1,
> segment 217): No such file or directory
> invalid checkpoint record
> could not locate required checkpoint record
> If you are not restoring from a backup, try removing the file "/var/
> lib/pgsql/data/backup_label".
> startup process (PID 19201) was terminated by signal 6: Aborted
> aborting startup due to startup process failure
> --------------------------------------------------------------------------------------------------------
>
> When I try to start PG for the second time it just stucks waiting
> for ...000D9
>
> In my opinion the problem is that when starting standby PostgresSQL
> wants to recovery WAL 0000000100000001000000D9, but first deletes it,
> as keep  archive history (%r) param is set to
> 0000000100000001000000DB
>
> Is it a bug or I'm missing something? I can repeat the scenario with
> this big DB. However it's not happening on exactly the same
> environment when playing with smaller cluster (copying cluster is
> shorter then archive_timeout ).

What is the full pg_standby command string (restore_command=....) in
your recovery.conf.  It sound's like you have pg_standby set to delete
archived WALs and possibly have that a little too aggressive.  Do you
have the -k flag set in your pg_standby call in your restore_command?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: PITR problem

From
wstrzalka
Date:
> What is the full pg_standby command string (restore_command=....) in
> your recovery.conf.  It sound's like you have pg_standby set to delete
> archived WALs and possibly have that a little too aggressive.  Do you
> have the -k flag set in your pg_standby call in your restore_command?

My restore command is:
-----------------------------------------------------------------------------------------
restore_command = 'pg_standby -l -d -s 5 -w 0 -t /tmp/
pgsql.promote_trigger.5432 ~postgres/incoming_wal %f %p %r 2>&1 |
logger -p local1.info -t pitr-standby'
-----------------------------------------------------------------------------------------

As you can see I didn't set -k to keep fixed number of WALs, but %r
parameter and the PostgreSQL controls number of keeped files
automatically (or at least it should)


Re: PITR problem

From
Erik Jones
Date:
On Apr 29, 2008, at 3:20 AM, wstrzalka wrote:

>> What is the full pg_standby command string (restore_command=....) in
>> your recovery.conf.  It sound's like you have pg_standby set to
>> delete
>> archived WALs and possibly have that a little too aggressive.  Do you
>> have the -k flag set in your pg_standby call in your restore_command?
>
> My restore command is:
> -----------------------------------------------------------------------------------------
> restore_command = 'pg_standby -l -d -s 5 -w 0 -t /tmp/
> pgsql.promote_trigger.5432 ~postgres/incoming_wal %f %p %r 2>&1 |
> logger -p local1.info -t pitr-standby'
> -----------------------------------------------------------------------------------------
>
> As you can see I didn't set -k to keep fixed number of WALs, but %r
> parameter and the PostgreSQL controls number of keeped files
> automatically (or at least it should)

Ok, I hadn't yet set up a standby on 8.3 and so hadn't seen that the
%r macro obviates the need for the -k flag.  So...

The output from pg_standby:
------------------------------------
Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 00000001.history
WAL file path            : /var/lib/pgsql/incoming_wal/
00000001.history
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK

Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9.00000020.backup
WAL file path            : /var/lib/pgsql/incoming_wal/
0000000100000001000000D9.00000020.backup
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK

Note that here, from the start, postgres is telling the recovery
command that it only needs from 0000000100000001000000DB and on.

Here's where it gets to restoring the first actual log file:

Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9
WAL file path            : /var/lib/pgsql/incoming_wal/
0000000100000001000000D9
Restoring to...          : pg_xlog/RECOVERYXLOG
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"

Since it says 'OK' but then fails my guess is that the order of
operations goes something along the lines of this (I could be totally
off):

1. Is /var/lib/pgsql/incoming/0000000100000001000000D9 present? -> OK
2. Clean up files older than 0000000100000001000000DB -> Delete /var/
lib/pgsql/incoming/0000000100000001000000D9 and /var/lib/pgsql/
incoming/0000000100000001000000DA
3. Restore /var/lib/pgsql/incoming/0000000100000001000000D9 -> This is
where it breaks.

So, the question is:  why does does the server say that it only needs
0000000100000001000000DB and later?  Did you clear out your pg_xlog
directory before starting up the standby?

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: PITR problem

From
wstrzalka
Date:
On 29 Kwi, 17:16, e...@myemma.com (Erik Jones) wrote:
> On Apr 29, 2008, at 3:20 AM, wstrzalka wrote:
>
>
>
> >> What is the full pg_standby command string (restore_command=....) in
> >> your recovery.conf.  It sound's like you have pg_standby set to
> >> delete
> >> archived WALs and possibly have that a little too aggressive.  Do you
> >> have the -k flag set in your pg_standby call in your restore_command?
>
> > My restore command is:
> > -----------------------------------------------------------------------------------------
> > restore_command = 'pg_standby -l -d -s 5 -w 0 -t /tmp/
> > pgsql.promote_trigger.5432 ~postgres/incoming_wal %f %p %r 2>&1 |
> > logger -p local1.info -t pitr-standby'
> > -----------------------------------------------------------------------------------------
>
> > As you can see I didn't set -k to keep fixed number of WALs, but %r
> > parameter and the PostgreSQL controls number of keeped files
> > automatically (or at least it should)
>
> Ok, I hadn't yet set up a standby on 8.3 and so hadn't seen that the
> %r macro obviates the need for the -k flag.  So...
>
> The output from pg_standby:
> ------------------------------------
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 00000001.history
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 00000001.history
> Restoring to...          : pg_xlog/RECOVERYHISTORY
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 00000001.history" "pg_xlog/RECOVERYHISTORY"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
>
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 0000000100000001000000D9.00000020.backup
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup
> Restoring to...          : pg_xlog/RECOVERYHISTORY
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
>
> Note that here, from the start, postgres is telling the recovery
> command that it only needs from 0000000100000001000000DB and on.
>
> Here's where it gets to restoring the first actual log file:
>
> Trigger file             : /tmp/pgsql.promote_trigger.5432
> Waiting for WAL file     : 0000000100000001000000D9
> WAL file path            : /var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9
> Restoring to...          : pg_xlog/RECOVERYXLOG
> Sleep interval           : 5 seconds
> Max wait interval        : 0 forever
> Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/
> 0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
> Keep archive history     : 0000000100000001000000DB and later
> running restore          : OK
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
> removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"
>
> Since it says 'OK' but then fails my guess is that the order of
> operations goes something along the lines of this (I could be totally
> off):
>
> 1. Is /var/lib/pgsql/incoming/0000000100000001000000D9 present? -> OK
> 2. Clean up files older than 0000000100000001000000DB -> Delete /var/
> lib/pgsql/incoming/0000000100000001000000D9 and /var/lib/pgsql/
> incoming/0000000100000001000000DA
> 3. Restore /var/lib/pgsql/incoming/0000000100000001000000D9 -> This is
> where it breaks.
>
> So, the question is:  why does does the server say that it only needs
> 0000000100000001000000DB and later?  Did you clear out your pg_xlog
> directory before starting up the standby?
>


Yes - the param passed to %r looks bad from start.
Generally I like the %r because I don't need to worry if there are
enough WALs to continue recovery after standby reboot and I don't keep
many of the files at the same time, but I think something is wrong
with it.
And answering your question - I don't delete any files before standby
start.

So it looks like a bug for me - probably I should submit it to
pgsql.bugs - unfortunatelly ( or fortunatelly :D ) my test environment
is production now so I'll not be able to reproduce it easily.