Thread: Re: Initiating log shipping backups when the initial filesystem-level backup went out of sync

> Hi,
> 
> A while back, I had to replace the hardware for a postgres server
> containing about 600G of data. While at it, we did an upgrade of the
> postgresql version used, and we had intended to also modify the way
> backups were being done, by moving from a daily pg_dumpall in cron to
> shipping WAL logs to an offsite location.
> 
> Due to the many changes involved, I decided that it would be easier to
> copy the data by running "ssh oldserver pg_dumpall | psql postgres",
> but I did (approximately) the following:
> 
> - Create the cluster
> - Set "wal_level = hot_standby", "archive_mode = on", and
> "archive_command = 'test ! -f /srv/pg_wal/%f && cp %p /srv/pg_wal/%f'
> in postgresql.conf.
> - Set up everything so that files written to /srv/pg_wal end up
> (eventually) on the remote server.
> - rsync the data directory to the off-site server
> - Start the server
> - Run the pg_dumpall thing
> - Start the server on the off-site location, to test if it could open
> everything, and stop it again
> - Create a file "recovery.conf" in the wrong location (at first) and
> then in the correct location (later on, after reading the documentation
> more carefully), with a line saying "restore_command = 'cp
> /srv/pg_wal/%f "%p"'" and one saying "standby_mode = on"
> - Start the server
> 
> At this point, the log file shows the following message:
> 
> 2016-08-29 13:53:22 CEST [25504-1] LOG:  database system was shut down
> in recovery at (...)
> 2016-08-29 13:53:22 CEST [25504-2] LOG:  entering standby mode
> 2016-08-29 13:53:22 CEST [25504-3] LOG:  record with zero length at
> 0/2000200
> 2016-08-29 13:53:22 CEST [25504-4] LOG:  invalid primary checkpoint
> record
> 2016-08-29 13:53:22 CEST [25504-5] LOG:  record with zero length at
> 0/2000198
> 2016-08-29 13:53:22 CEST [25504-6] LOG:  invalid secondary checkpoint
> record
> 2016-08-29 13:53:22 CEST [25504-7] PANIC:  could not locate a valid
> checkpoint record
> 2016-08-29 13:53:22 CEST [25480-1] LOG:  startup process (PID 25504)
> was terminated by signal 6: Aborted
> 2016-08-29 13:53:22 CEST [25480-2] LOG:  aborting startup due to
> startup process failure
> 
> In the pg_xlog directory on the off-site server, there is a file
> 000000010000000000000002 and one 000000010000000000000003. When I run
> md5sum, I find that the first has the same checksum as the one with the
> same name in the pg_wal directory containing the files from the
> original server; the second does not, but it *does* have the same
> checksum as the file 000000010000000000000001 in that directory (which
> does not exist in the pg_xlog directory on the off-site server).
> 
> I have retained all WAL files from 000000010000000000000001 all the way
> to the current one.
> 
> Is it still possible for me to load these WAL files into the server at
> the remote site, without having to do a filesystem-level copy of all
> the files in the postgresql database directory? If not, any hints on
> what I did wrong?
> 
> Thanks,

Ping?

Is this possible, or should I spend time to rsync files over?

Thanks,

-- 
Wouter Verhelst

My experienc is that rsync is the way to go when a replicant gets stale -- you can at least be certain you didn't miss anything and it will take much less time than the initial rsync since it only takes things that are new.

Hth,
/r/b 

On Monday, September 5, 2016, Wouter Verhelst <wouter.verhelst.ext@huawei.com> wrote:
> Hi,
>
> A while back, I had to replace the hardware for a postgres server
> containing about 600G of data. While at it, we did an upgrade of the
> postgresql version used, and we had intended to also modify the way
> backups were being done, by moving from a daily pg_dumpall in cron to
> shipping WAL logs to an offsite location.
>
> Due to the many changes involved, I decided that it would be easier to
> copy the data by running "ssh oldserver pg_dumpall | psql postgres",
> but I did (approximately) the following:
>
> - Create the cluster
> - Set "wal_level = hot_standby", "archive_mode = on", and
> "archive_command = 'test ! -f /srv/pg_wal/%f && cp %p /srv/pg_wal/%f'
> in postgresql.conf.
> - Set up everything so that files written to /srv/pg_wal end up
> (eventually) on the remote server.
> - rsync the data directory to the off-site server
> - Start the server
> - Run the pg_dumpall thing
> - Start the server on the off-site location, to test if it could open
> everything, and stop it again
> - Create a file "recovery.conf" in the wrong location (at first) and
> then in the correct location (later on, after reading the documentation
> more carefully), with a line saying "restore_command = 'cp
> /srv/pg_wal/%f "%p"'" and one saying "standby_mode = on"
> - Start the server
>
> At this point, the log file shows the following message:
>
> 2016-08-29 13:53:22 CEST [25504-1] LOG:  database system was shut down
> in recovery at (...)
> 2016-08-29 13:53:22 CEST [25504-2] LOG:  entering standby mode
> 2016-08-29 13:53:22 CEST [25504-3] LOG:  record with zero length at
> 0/2000200
> 2016-08-29 13:53:22 CEST [25504-4] LOG:  invalid primary checkpoint
> record
> 2016-08-29 13:53:22 CEST [25504-5] LOG:  record with zero length at
> 0/2000198
> 2016-08-29 13:53:22 CEST [25504-6] LOG:  invalid secondary checkpoint
> record
> 2016-08-29 13:53:22 CEST [25504-7] PANIC:  could not locate a valid
> checkpoint record
> 2016-08-29 13:53:22 CEST [25480-1] LOG:  startup process (PID 25504)
> was terminated by signal 6: Aborted
> 2016-08-29 13:53:22 CEST [25480-2] LOG:  aborting startup due to
> startup process failure
>
> In the pg_xlog directory on the off-site server, there is a file
> 000000010000000000000002 and one 000000010000000000000003. When I run
> md5sum, I find that the first has the same checksum as the one with the
> same name in the pg_wal directory containing the files from the
> original server; the second does not, but it *does* have the same
> checksum as the file 000000010000000000000001 in that directory (which
> does not exist in the pg_xlog directory on the off-site server).
>
> I have retained all WAL files from 000000010000000000000001 all the way
> to the current one.
>
> Is it still possible for me to load these WAL files into the server at
> the remote site, without having to do a filesystem-level copy of all
> the files in the postgresql database directory? If not, any hints on
> what I did wrong?
>
> Thanks,

Ping?

Is this possible, or should I spend time to rsync files over?

Thanks,

--
Wouter Verhelst

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin


--
--
Robert W. Burgholzer
 'Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity.'  - Charles Mingus