Re: File system level backup of shut down standby does not work? - Mailing list pgsql-general

From Antman, Jason (CMG-Atlanta)
Subject Re: File system level backup of shut down standby does not work?
Date
Msg-id 5304ADC8.4000202@coxinc.com
Whole thread Raw
In response to Re: File system level backup of shut down standby does not work?  (Jürgen Fuchsberger <juergen.fuchsberger@uni-graz.at>)
List pgsql-general
That is correct. When the source server is shut down, it may or may not
be caught up for replication (most likely not). You'll need all of the
WAL files that haven't been applied yet; if for some reason you've got a
large amount of replication lag, you might need more than just the first.

I'm not sure the best way to achieve this - we write WALs to a shared
NFS mount, one directory per slave, so that gets mounted on the
destination/standby and it has access. I'm not sure if this is practical
without specialized hardware - our NFS is served from a NetApp storage
appliance, so we have microsecond write latency.

If you want to have a "standalone" backup that doesn't need WAL files,
it's a bit more complex. This is the best that I can figure out from our
current setup. Please be cautious with it, I'm not 100% positive this
includes everything, so I'd only try it on an isolated server that can
risk having problems. This is also a bit of a hack...

Assuming PGDATA is /var/lib/pgsql/9.0/data, as the postgres user:
1) `touch /var/lib/pgsql/9.0/data/stop.replication`
2) make sure your recovery.conf includes:
trigger_file='/var/lib/pgsql/9.0/data/stop.replication'
recovery_end_command='/var/lib/pgsql/9.0/data/stop.replication'
standby_mode='on'
3) Start the database

When recovery finishes, it will attempt to execute recovery_end_command.
Since this isn't actually executable, it raises an error, and stops
postgres. This should result in a stopped database instance, in a
consistent recovery state with all WAL files applied. You can confirm
this by checking that `pg_controldata /var/lib/pgsql/9.0/data` output
includes "Database cluster state: *shut down".

Be aware that we use an intermediate host for this. i.e. we stop a
slave, copy PGDATA *and* ALL WAL files over to another machine, start
the slave back up, and then do the recovery to a consistent state there,
to end up with a standalone PGDATA directory that can be copied wherever
we need it without any WAL files.

-Jason


On 02/19/2014 05:18 AM, Jürgen Fuchsberger wrote:
> All,
>
> One very important thing I just noted when shutting down and restarting
> my standby server:
>
> My standby server *always needs the last WAL-file* from the archive
> directory, even when the shut down was "smart". Without this the
> consistent recovery state will not be reached.
>
> 2014-02-19 11:10:20 CET LOG:  received smart shutdown request
> 2014-02-19 11:10:20 CET LOG:  shutting down
> 2014-02-19 11:10:20 CET LOG:  database system is shut down
> 2014-02-19 11:11:00 CET LOG:  database system was shut down in recovery
> at 2014-
> 02-19 11:10:20 CET
> 2014-02-19 11:11:00 CET LOG:  entering standby mode
> 2014-02-19 11:11:00 CET LOG:  incomplete startup packet
> 2014-02-19 11:11:01 CET FATAL:  the database system is starting up
>
> *2014-02-19 11:11:01 CET LOG:  restored log file*
> *"00000001000002DE000000BF" from archive*
>
> 2014-02-19 11:11:01 CET LOG:  redo starts at 2DE/BF036FA4
> 2014-02-19 11:11:01 CET FATAL:  the database system is starting up
> 2014-02-19 11:11:01 CET LOG:  consistent recovery state reached at
> 2DE/BFFFE53C
> 2014-02-19 11:11:01 CET LOG:  database system is ready to accept read
> only connections
>
> So my question is, could there be something wrong with my configuration
> or is this normal?
>
> Juergen
>
> On 02/19/2014 02:14 AM, Antman, Jason (CMG-Atlanta) wrote:
>> Juergen,
>>
>> I've seen this quite a lot in the past, as we do this multiple times a day.
>>
>> Here's the procedure we use to prevent it:
>> 1) read the PID from postmaster.pid in the data directory
>> 2) Issue "service postgresql-9.0 stop" (this does a fast shutdown with
>> -t 600)
>> 3) loop until the PID is no longer running, or a timeout is exceeded (in
>> which case we error out)
>> 4) the IMPORTANT part: `pg_controldata /path/to/data | grep "Database
>> cluster state: *shut down"` - if pg_controldata output doesn't include
>> "shut down" or "shut down in recovery", then something's amiss and the
>> backup won't be clean (error in shutdown, etc.)
>> 5) `sync`
>> 6) now take the backup
>>
>> -Jason
>>
>> On 02/17/2014 08:32 AM, Jürgen Fuchsberger wrote:
>>> Hi all,
>>>
>>> I have a master-slave configuration running the master with WAL
>>> archiving enabled and the slave in recovery mode reading back the WAL
>>> files from the master ("Log-shipping standby" as described in
>>> http://www.postgresql.org/docs/9.1/static/warm-standby.html)
>>>
>>> I take frequent backups of the standby server:
>>>
>>> 1) Stop standby server (fast shutdown).
>>> 2) Rsync to another fileserver
>>> 3) Start standby server.
>>>
>>> I just tried to recover one of these backups which *failed* with the
>>> following errors:
>>>
>>> 2014-02-17 14:27:28 CET LOG:  incomplete startup packet
>>> 2014-02-17 14:27:28 CET LOG:  database system was shut down in recovery
>>> at 2013-12-25 18:00:03 CET
>>> 2014-02-17 14:27:28 CET LOG:  could not open file
>>> "pg_xlog/00000001000001E300000061" (log file 483, segment 97): No such
>>> file or directory
>>> 2014-02-17 14:27:28 CET LOG:  invalid primary checkpoint record
>>> 2014-02-17 14:27:28 CET LOG:  could not open file
>>> "pg_xlog/00000001000001E300000060" (log file 483, segment 96): No such
>>> file or directory
>>> 2014-02-17 14:27:28 CET LOG:  invalid secondary checkpoint record
>>> 2014-02-17 14:27:28 CET PANIC:  could not locate a valid checkpoint record
>>> 2014-02-17 14:27:29 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:29 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:30 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:30 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:31 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:31 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:32 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET FATAL:  the database system is starting up
>>> 2014-02-17 14:27:33 CET LOG:  startup process (PID 26186) was terminated
>>> by signal 6: Aborted
>>> 2014-02-17 14:27:33 CET LOG:  aborting startup due to startup process
>>> failure
>>>
>>>
>>> So it seems the server is missing some WAL files which are not
>>> in the backup? Or is it simply not possible to take a backup of a
>>> standby server in recovery?
>>>
>>> Best,
>>> Juergen
>>>
>>>
>>>
>>


--

Jason Antman | Systems Engineer | CMGdigital
jason.antman@coxinc.com | p: 678-645-4155


pgsql-general by date:

Previous
From: Albe Laurenz
Date:
Subject: Re: File system level backup of shut down standby does not work?
Next
From: Jürgen Fuchsberger
Date:
Subject: Re: File system level backup of shut down standby does not work?