Re: 'replication checkpoint has wrong magic' on the newly cloned replicas - Mailing list pgsql-admin

From Stephen Frost
Subject Re: 'replication checkpoint has wrong magic' on the newly cloned replicas
Date
Msg-id CAOuzzgpDMuXZiMY4h0wFiiaZDxv9=Bw31G0YHV7PFQXLOYi1Jw@mail.gmail.com
In response to Re: 'replication checkpoint has wrong magic' on the newly cloned replicas (Alex Kliukin <oleksii@fastmail.com>)
Responses Re: 'replication checkpoint has wrong magic' on the newly cloned replicas
List pgsql-admin
Greetings,

On Wed, Nov 29, 2017 at 13:33 Alex Kliukin wrote:
> On 29. Nov 2017, at 18:52, Stephen Frost wrote:
>> Greetings,
>>
>> On Wed, Nov 29, 2017 at 12:41 Oleksii Kliukin wrote:
>>> Hi Stephen,
>>>
>>> On 29. Nov 2017, at 15:54, Stephen Frost wrote:
>>>> Greetings,
>>>>
>>>> * Alex Kliukin (alexk@hintbits.com) wrote:
>>>>> The cloning itself is done by copying a compressed image via ssh,
>>>>> running the following command from the replica:
>>>>>
>>>>> """ssh {master} 'cd {master_datadir} && tar -lcp --exclude "*.conf" \
>>>>>     --exclude "recovery.done" \
>>>>>     --exclude "pacemaker_instanz" \
>>>>>     --exclude "dont_start" \
>>>>>     --exclude "pg_log" \
>>>>>     --exclude "pg_xlog" \
>>>>>     --exclude "postmaster.pid" \
>>>>>     --exclude "recovery.done" \
>>>>>     * | pigz -1 -p 4' | pigz -d -p 4 | tar -xpmUv -C {slave_datadir}"""
>>>>>
>>>>> WAL archiving starts before the copy starts, as the script that clones
>>>>> the replica checks that WAL archiving is running before the cloning.
>>>>
>>>> Maybe you're doing it and haven't mentioned it, but you have to use
>>>> pg_start/stop_backup
>>>
>>> Sorry for not mentioning it, as it seemed obvious, but we are calling
>>> pg_start_backup and pg_stop_backup at the right time.
>>
>> Ah, not something I can assume, heh.
>>
>> Then it depends on which version of PG and if you’re able to run
>> start/stop on the replica or not. If you can’t run it on the replica and
>> have to run it on the primary (prior to 9.6) then you need to make sure to
>> wait for things to happen on the primary and for that to be replicated
>> before you can start.
>
> We are using exclusive backups from the master. First, the script checks
> that WAL files are shipped to the NFS, where the replica expects to find
> them (we check the md5 checksum of the file in order to make sure that the
> NFS actually delivers the file that the master has archived). Then
> pg_start_backup runs on the master and its status is checked. On success,
> the copy command runs. When the copy command finishes, pg_stop_backup is
> executed. Once pg_stop_backup finishes successfully, replica configuration
> files (postgresql.conf, pg_hba.conf, pg_ident.conf) are linked from their
> location in the repository and the replica is started.

No, you must wait until the replica has moved forward far enough and you
have to copy the backup_label file from the primary as well, otherwise PG
won’t realize you’re doing a backup-based recovery.

> This is a fairly typical procedure, which, I believe, is also well
> described in the docs.

Please provide a link to where that is because if that’s the case then we
need to correct it or remove it. This is absolutely not safe without
additional checks being done and various other magic happening (like
copying the backup_label off the primary where it’s created).

>> If you’re on 9.6 and using non-exclusive backup, you need to be sure to
>> capture the contents of the stop backup and write it into backup_label
>> before you start the system back up.
>
> We don’t use non-exclusive backups at all.

All the more likely that your procedure is causing more corruption than you
realize then.

Seriously, again, this is not easy to get right, especially when you’re
doing things that weren’t explicitly documented and supported. Using
existing tools, written by people who understand why the processes used are
safe and who have written lots of tests to verify that they are safe, is
really the recommendation that you should take away from this.

At least with 9.6 there’s proper documentation on how to run a
non-exclusive backup on a replica properly and if you very carefully follow
the procedure then you may get it right, but you will still want to test
extensively.

Thanks!

Stephen
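
The two backup_label points above can be sketched in Python, the language the quoted clone command appears to be embedded in. This is only a minimal illustration under stated assumptions, not the script discussed in the thread: the function names are hypothetical, and the caller is assumed to have already fetched the second and third columns returned by a 9.6 non-exclusive pg_stop_backup(false) (the label file text and the tablespace map text). For an exclusive backup the server creates backup_label in the primary's data directory itself, so the only thing to verify is that the file made it into the copy.

```python
from pathlib import Path
from typing import Optional


def write_stop_backup_output(datadir: str, labelfile: str,
                             spcmapfile: Optional[str] = None) -> None:
    """Persist the text returned by a non-exclusive pg_stop_backup(false).

    Assumption: `labelfile` and `spcmapfile` are the second and third
    columns of that call's result, already fetched by the caller.
    """
    root = Path(datadir)
    (root / "backup_label").write_text(labelfile)
    # The tablespace map is only written when the server returned one.
    if spcmapfile:
        (root / "tablespace_map").write_text(spcmapfile)


def clone_has_backup_label(datadir: str) -> bool:
    """Return True if the cloned data directory has a non-empty backup_label."""
    label = Path(datadir) / "backup_label"
    return label.is_file() and label.stat().st_size > 0
```

A cloning script could call clone_has_backup_label() on the replica's data directory and refuse to start the server when it returns False. The check only confirms that the file is present; it does not prove the backup is otherwise consistent.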
