Thread: PITR - warm standby switchover question

PITR - warm standby switchover question

From
Dan Hayes
Date:
I've followed the implementation instructions at 24.4.2:
http://www.postgresql.org/docs/current/static/warm-standby.html

And I've used the archive/restore commands from the example in F23.2:
http://www.postgresql.org/docs/current/static/pgstandby.html

This all works great.  The primary backs up the WAL archives to a mutually accessible folder and the backup system is continuously reading in the logs.  Monitoring the "standby.log" file shows that the system is working as it should.

My question is, if the main system goes down, what do I do to the backup machine to make it come alive?  Currently, when I attempt to login to the database, I get the error: "psql: FATAL:  the database is starting up"  I'm assuming this means it can't be accessed while its in continuous recovery mode.  I attempted to turn off postgresql on the main server and try connecting again, same error (was obvious, but I do things one step at a time...)  So then I removed the "recovery.conf" file on the backup server and attempted to restart postgresql.  It stopped fine, but the restart failed.

The pg_log/postgresql-XXXX-XX-XX.log file says:
LOG:  database system was interrupted while in recovery at log time 2009-04-14 17:36:14 CDT
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  could not open file "pg_xlog/0000000100000002000000DD" (log file 2, segment 221): No such file or directory
LOG:  invalid primary checkpoint record
LOG:  could not open file "pg_xlog/0000000100000002000000DC" (log file 2, segment 220): No such file or directory
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 3756) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Any ideas what I'm doing wrong?  How do I bring the standby server out of standby do act as the primary?

Re: PITR - warm standby switchover question

From
Erik Jones
Date:
On Apr 14, 2009, at 3:47 PM, Dan Hayes wrote:

> I've followed the implementation instructions at 24.4.2:
> http://www.postgresql.org/docs/current/static/warm-standby.html
>
> And I've used the archive/restore commands from the example in F23.2:
> http://www.postgresql.org/docs/current/static/pgstandby.html
>
> This all works great.  The primary backs up the WAL archives to a
> mutually accessible folder and the backup system is continuously
> reading in the logs.  Monitoring the "standby.log" file shows that
> the system is working as it should.
>
> My question is, if the main system goes down, what do I do to the
> backup machine to make it come alive?  Currently, when I attempt to
> login to the database, I get the error: "psql: FATAL:  the database
> is starting up"  I'm assuming this means it can't be accessed while
> its in continuous recovery mode.  I attempted to turn off postgresql
> on the main server and try connecting again, same error (was
> obvious, but I do things one step at a time...)  So then I removed
> the "recovery.conf" file on the backup server and attempted to
> restart postgresql.  It stopped fine, but the restart failed.
>
> The pg_log/postgresql-XXXX-XX-XX.log file says:
> LOG:  database system was interrupted while in recovery at log time
> 2009-04-14 17:36:14 CDT
> HINT:  If this has occurred more than once some data might be
> corrupted and you might need to choose an earlier recovery target.
> LOG:  could not open file "pg_xlog/0000000100000002000000DD" (log
> file 2, segment 221): No such file or directory
> LOG:  invalid primary checkpoint record
> LOG:  could not open file "pg_xlog/0000000100000002000000DC" (log
> file 2, segment 220): No such file or directory
> LOG:  invalid secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 3756) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure
>
> Any ideas what I'm doing wrong?  How do I bring the standby server
> out of standby do act as the primary?

You should use pg_standby's -t flag to specify a stop file.  Then all
you need to do is touch (create) that file and pg_standby will let the
server come out of recovery mode into normal operation mode.  Be sure
to rm or mv the recovery.conf once that is complete.

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k






Re: PITR - warm standby switchover question

From
Dan Hayes
Date:
Excellent!  Thanks.  One other quick question...  What would happen if I didn't delete the recovery.conf file?  Is that step just to prevent accidentally restarting the server with it there?

On Tue, Apr 14, 2009 at 6:26 PM, Erik Jones <ejones@engineyard.com> wrote:

On Apr 14, 2009, at 3:47 PM, Dan Hayes wrote:

I've followed the implementation instructions at 24.4.2:
http://www.postgresql.org/docs/current/static/warm-standby.html

And I've used the archive/restore commands from the example in F23.2:
http://www.postgresql.org/docs/current/static/pgstandby.html

This all works great.  The primary backs up the WAL archives to a mutually accessible folder and the backup system is continuously reading in the logs.  Monitoring the "standby.log" file shows that the system is working as it should.

My question is, if the main system goes down, what do I do to the backup machine to make it come alive?  Currently, when I attempt to login to the database, I get the error: "psql: FATAL:  the database is starting up"  I'm assuming this means it can't be accessed while its in continuous recovery mode.  I attempted to turn off postgresql on the main server and try connecting again, same error (was obvious, but I do things one step at a time...)  So then I removed the "recovery.conf" file on the backup server and attempted to restart postgresql.  It stopped fine, but the restart failed.

The pg_log/postgresql-XXXX-XX-XX.log file says:
LOG:  database system was interrupted while in recovery at log time 2009-04-14 17:36:14 CDT
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  could not open file "pg_xlog/0000000100000002000000DD" (log file 2, segment 221): No such file or directory
LOG:  invalid primary checkpoint record
LOG:  could not open file "pg_xlog/0000000100000002000000DC" (log file 2, segment 220): No such file or directory
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 3756) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Any ideas what I'm doing wrong?  How do I bring the standby server out of standby do act as the primary?

You should use pg_standby's -t flag to specify a stop file.  Then all you need to do is touch (create) that file and pg_standby will let the server come out of recovery mode into normal operation mode.  Be sure to rm or mv the recovery.conf once that is complete.

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k






Re: PITR - warm standby switchover question

From
Erik Jones
Date:
Well, if you don't delete the recovery.conf and you *do* delete
pg_standby's stop file (or it gets deleted, for example if you set it
to go under /tmp and the server is restarted for whatever reason) the
server will attempt to go back into recovery mode using your
configured recovery command in the recovery.conf.  I'm not sure
exactly what would happen if the stop file is still there but I would
expect pg_standby to kick in due to the recovery.conf, see the stop
file, and let the server finish starting up.  However, I haven't
really tested that before and I certainly wouldn't call it "clean".
The best option would probably be to write a small script to drop the
stop file, wait until it can get a valid connection to the db, the rm
the stop file and mv the recovery.conf to something like
recovery.conf.inactive or something along those lines.

On Apr 14, 2009, at 5:23 PM, Dan Hayes wrote:

> Excellent!  Thanks.  One other quick question...  What would happen
> if I didn't delete the recovery.conf file?  Is that step just to
> prevent accidentally restarting the server with it there?
>
> On Tue, Apr 14, 2009 at 6:26 PM, Erik Jones <ejones@engineyard.com>
> wrote:
>
> On Apr 14, 2009, at 3:47 PM, Dan Hayes wrote:
>
> I've followed the implementation instructions at 24.4.2:
> http://www.postgresql.org/docs/current/static/warm-standby.html
>
> And I've used the archive/restore commands from the example in F23.2:
> http://www.postgresql.org/docs/current/static/pgstandby.html
>
> This all works great.  The primary backs up the WAL archives to a
> mutually accessible folder and the backup system is continuously
> reading in the logs.  Monitoring the "standby.log" file shows that
> the system is working as it should.
>
> My question is, if the main system goes down, what do I do to the
> backup machine to make it come alive?  Currently, when I attempt to
> login to the database, I get the error: "psql: FATAL:  the database
> is starting up"  I'm assuming this means it can't be accessed while
> its in continuous recovery mode.  I attempted to turn off postgresql
> on the main server and try connecting again, same error (was
> obvious, but I do things one step at a time...)  So then I removed
> the "recovery.conf" file on the backup server and attempted to
> restart postgresql.  It stopped fine, but the restart failed.
>
> The pg_log/postgresql-XXXX-XX-XX.log file says:
> LOG:  database system was interrupted while in recovery at log time
> 2009-04-14 17:36:14 CDT
> HINT:  If this has occurred more than once some data might be
> corrupted and you might need to choose an earlier recovery target.
> LOG:  could not open file "pg_xlog/0000000100000002000000DD" (log
> file 2, segment 221): No such file or directory
> LOG:  invalid primary checkpoint record
> LOG:  could not open file "pg_xlog/0000000100000002000000DC" (log
> file 2, segment 220): No such file or directory
> LOG:  invalid secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 3756) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure
>
> Any ideas what I'm doing wrong?  How do I bring the standby server
> out of standby do act as the primary?
>
> You should use pg_standby's -t flag to specify a stop file.  Then
> all you need to do is touch (create) that file and pg_standby will
> let the server come out of recovery mode into normal operation
> mode.  Be sure to rm or mv the recovery.conf once that is complete.
>
> Erik Jones, Database Administrator
> Engine Yard
> Support, Scalability, Reliability
> 866.518.9273 x 260
> Location: US/Pacific
> IRC: mage2k
>
>
>
>
>
>

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k






Re: PITR - warm standby switchover question

From
Fujii Masao
Date:
Hi,

On Wed, Apr 15, 2009 at 9:23 AM, Dan Hayes <dhayes501@gmail.com> wrote:
> Excellent!  Thanks.  One other quick question...  What would happen if I
> didn't delete the recovery.conf file?  Is that step just to prevent
> accidentally restarting the server with it there?

recovery.conf is automatically renamed recovery.done by
postgres at the end of recovery, which would prevent
accidentally restarting warm-standby. So, you don't need to
delete or rename recovery.conf by hand after recovery.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: PITR - warm standby switchover question

From
Chander Ganesan
Date:
Fujii Masao wrote:
> Hi,
>
> On Wed, Apr 15, 2009 at 9:23 AM, Dan Hayes <dhayes501@gmail.com> wrote:
>
>> Excellent!  Thanks.  One other quick question...  What would happen if I
>> didn't delete the recovery.conf file?  Is that step just to prevent
>> accidentally restarting the server with it there?
>>
>
> recovery.conf is automatically renamed recovery.done by
> postgres at the end of recovery, which would prevent
> accidentally restarting warm-standby. So, you don't need to
> delete or rename recovery.conf by hand after recovery.
>
Also, keep in mind that the ideal solution is not to create a stop file,
but to get the last few WAL files that had not been archived prior to
the main system going down and place them in the directory that
pg_standby was copying files from.  If you simply create a stopfile and
have the system start up, you're likely to lose some transactions that
were in any WAL files that had not yet been archived (at the very least,
that's the currently-in-use WAL file at the time of the crash).  While
getting these files isn't always possible, the best practice would be to
at least try...  There are a variety of ways to provide for real-time
replication of in-use WAL files, if losing transactions in a crash is
not a viable option for you..

Once you've gotten the last few WAL's, you should see PostgreSQL
automatically detect the end of WAL logs and start up - with or without
the stopfile (so you can still create the stopfile)

Once the server has completed recovery, it won't come up in recovery
again, since (as Fujii noted) PostgreSQL will rename recovery.conf to
recovery.done - indicating that recovery has completed.  It will also
begin a new timeline, etc.

Generally speaking, after recovery completes, you probably want to
create a warm-standby backup and then log subsequent WAL files to ease
future recovery from WAL's on the "currently active" cluster.

--
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
919-463-0999/877-258-8987
http://www.otg-nc.com
Ask me about Expert PostgreSQL, PostGIS, and other Open Source training delivered worldwide!


Re: PITR - warm standby switchover question

From
Erik Jones
Date:
On Apr 15, 2009, at 12:42 AM, Fujii Masao wrote:

> Hi,
>
> On Wed, Apr 15, 2009 at 9:23 AM, Dan Hayes <dhayes501@gmail.com>
> wrote:
>> Excellent!  Thanks.  One other quick question...  What would happen
>> if I
>> didn't delete the recovery.conf file?  Is that step just to prevent
>> accidentally restarting the server with it there?
>
> recovery.conf is automatically renamed recovery.done by
> postgres at the end of recovery, which would prevent
> accidentally restarting warm-standby. So, you don't need to
> delete or rename recovery.conf by hand after recovery.

Doh!  Totally forgot that very nice piece of information (which is bad
as that's also why I never tested the alternate scenario mentioned in
my other response).

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k