Thread: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary

[bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary

From
"Tsunakawa, Takayuki"
Date:
Hello,

Our customer hit another bug of pg_rewind with PG 9.5.  The attached patch fixes this.


PROBLEM
========================================

After a long run of successful pg_rewind, the synchronized standby could not catch up the primary forever, emitting the
followingmessage repeatedly:
 

LOG:  XX000: could not read from log segment 000000060000028A00000031, offset 16384: No error


CAUSE
========================================

If the primary removes WAL files that pg_rewind is going to get, pg_rewind leaves 0-byte WAL files in the target
directoryhere:
 

[libpq_fetch.c]
            case FILE_ACTION_COPY:
                /* Truncate the old file out of the way, if any */
                open_target_file(entry->path, true);
                fetch_file_range(entry->path, 0, entry->newsize);
                break;

pg_rewind completes successfully, create recovery.conf, and then start the standby in the target cluster.  walreceiver
receivesWAL records and write them to the 0-byte WAL files.  Finally, xlog reader complains that he cannot read a WAL
page.


FIX
========================================

pg_rewind deletes the file when it finds that the primary has deleted it.


OTHER THOUGHTS
========================================

BTW, should pg_rewind really copy WAL files from the primary?  If the sole purpose of pg_rewind is to recover an
instanceto use as a standby, can pg_rewind just remove all WAL files in the target directory, because the standby can
getWAL files from the primary and/or archive?
 

Related to this, shouldn't pg_rewind avoid copying more files and directories like pg_basebackup?  Currently, pg_rewind
doesn'tcopy postmaster.pid, postmaster.opts, and temporary files/directories (pg_sql_tmp/).
 

Regards
Takayuki Tsunakawa



Attachment
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
> BTW, should pg_rewind really copy WAL files from the primary?  If the
> sole purpose of pg_rewind is to recover an instance to use as a
> standby, can pg_rewind just remove all WAL files in the target
> directory, because the standby can get WAL files from the primary
> and/or archive?

Yes, it should not copy those WAL files.  Most of the time they are going
to be meaningless.  See this recent thread:
https://www.postgresql.org/message-id/20180126023609.GH17847%40paquier.xyz
So I would rather go this way instead of having to worry about
manipulating those WAL segments as you do.  Depending on the needs, I
think that even a backpatch could be considered.

> Related to this, shouldn't pg_rewind avoid copying more files and
> directories like pg_basebackup?  Currently, pg_rewind doesn't copy
> postmaster.pid, postmaster.opts, and temporary files/directories
> (pg_sql_tmp/).

Yes, it should not copy those files.  I have a patch in the current CF
to do that:
https://commitfest.postgresql.org/17/1507/
--
Michael

Attachment
From: Michael Paquier [mailto:michael@paquier.xyz]
> Yes, it should not copy those WAL files.  Most of the time they are going
> to be meaningless.  See this recent thread:
> https://www.postgresql.org/message-id/20180126023609.GH17847%40paquier
> .xyz
> So I would rather go this way instead of having to worry about manipulating
> those WAL segments as you do.  Depending on the needs, I think that even
> a backpatch could be considered.

Thank you for information.  I didn't notice those activities going around pg_rewind.

It's a regret that Chen's patch, which limits the WAL to be copied, is not committed yet.  It looks good to be ready
forcommitter.
 


> > Related to this, shouldn't pg_rewind avoid copying more files and
> > directories like pg_basebackup?  Currently, pg_rewind doesn't copy
> > postmaster.pid, postmaster.opts, and temporary files/directories
> > (pg_sql_tmp/).
> 
> Yes, it should not copy those files.  I have a patch in the current CF to
> do that:
> https://commitfest.postgresql.org/17/1507/

Wow, what a great patch.  I think I should look at it.  But I'm afraid it won't be backpatched because it's big...

Even with your patch and Chen's one, my small patch is probably necessary to avoid leaving 0-byte or half-baked files.
I'mnot sure whether those strangely sized files would cause actual trouble, but maybe it would be healthy to try to
cleanthings up as much as possible.  (files in pg_twophase/ might emit WARNING messages, garbage server log files might
makethe DBA worried, etc.; yes, these may be just FUD.)
 

Regards
Takayuki Tsunakawa





On Thu, Mar 01, 2018 at 07:49:06AM +0000, Tsunakawa, Takayuki wrote:
> It's a regret that Chen's patch, which limits the WAL to be copied, is
> not committed yet.  It looks good to be ready for committer.

The message I sent provides reasons about why it should not be
integrated.  Particularly since the prior last checkpoint has been
removed in v11, there is always going to be a whole in WAL segments as
you need to create a checkpoint on the ex-standby after it has been
promoted so as its control file data is correctly reflected on disk.

> > > Related to this, shouldn't pg_rewind avoid copying more files and
> > > directories like pg_basebackup?  Currently, pg_rewind doesn't copy
> > > postmaster.pid, postmaster.opts, and temporary files/directories
> > > (pg_sql_tmp/).
> >
> > Yes, it should not copy those files.  I have a patch in the current CF to
> > do that:
> > https://commitfest.postgresql.org/17/1507/
>
> Wow, what a great patch.  I think I should look at it.  But I'm afraid
> it won't be backpatched because it's big...

That's a new feature.  This won't get backpatch'ed anyway.

> Even with your patch and Chen's one, my small patch is probably
> necessary to avoid leaving 0-byte or half-baked files.  I'm not sure
> whether those strangely sized files would cause actual trouble, but
> maybe it would be healthy to try to clean things up as much as
> possible.  (files in pg_twophase/ might emit WARNING messages, garbage
> server log files might make the DBA worried, etc.; yes, these may be
> just FUD.)

Yeah, I'd like to double-check what you are proposing here anyway.
Sorry but I do not have an opinion about what you have sent yet :(
The only thing I am sure of though is that for HEAD not copying files
from pg_wal from the origin is the way to do it.  For back-branches
that's another story.
--
Michael

Attachment
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote:
> Our customer hit another bug of pg_rewind with PG 9.5.  The attached
> patch fixes this.

Sorry for my late reply.  It took me unfortunately some time before
coming back to this patch.

> PROBLEM
> ========================================
>
> After a long run of successful pg_rewind, the synchronized standby
> could not catch up the primary forever, emitting the following message
> repeatedly:
>
> LOG:  XX000: could not read from log segment 000000060000028A00000031,
> offset 16384: No error

Oops.

> CAUSE
> ========================================
>
> If the primary removes WAL files that pg_rewind is going to get,
> pg_rewind leaves 0-byte WAL files in the target directory here:
>
> [libpq_fetch.c]
>             case FILE_ACTION_COPY:
>                 /* Truncate the old file out of the way, if any */
>                 open_target_file(entry->path, true);
>                 fetch_file_range(entry->path, 0, entry->newsize);
>                 break;
>
> pg_rewind completes successfully, create recovery.conf, and then start
> the standby in the target cluster.  walreceiver receives WAL records
> and write them to the 0-byte WAL files.  Finally, xlog reader
> complains that he cannot read a WAL page.

This part qualifies as a data corruption bug as some records are lost by
the backend.

> FIX
> ========================================
>
> pg_rewind deletes the file when it finds that the primary has deleted
> it.

The concept looks right to me.  I think that this does not apply only to
WAL segments though.  You could have a temporary WAL segment that has
been included in the chunk table, which is even more volatile, or a
multixact file, a snapshot file, etc.  In short anything which is not a
relation file and could disappear between the moment the file map is
built and the data is fetched from the source server.

@@ -174,7 +173,7 @@ remove_target_file(const char *path)
         return;

     snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
-    if (unlink(dstpath) != 0)
+    if (unlink(dstpath) != 0 && !ignore_error)
I see.  So the only reason why this flag exists is that if a file is
large enough so as it is split into multiple chunks, then the first
unlink will work but not the successive ones.  One chunk can be at most
1 million bytes, which is why it is essential for WAL segments.  Instead
of ignoring *all* errors, let's ignore only ENOENT and rename
ignore_error to missing_ok as well.

You need to update the comment in receiveFileChunks where an entry gets
deleted with basically what I mentioned above, say:
"If a file has been deleted on the source, remove it on the target as
well.  Note that multiple unlink() calls may happen on the same file if
multiple data chunks are associated with it, hence ignore
unconditionally anything missing.  If this file is not a relation data
file, then it has been already truncated when creating the file chunk
list at hte previous execution of the filemap."

Adding also a comment on top of remove_target_file to explain what
missing_ok does would be nice to keep track of what the code should do.

The portion about whether data from pg_wal should be transfered or not
is really an improvement that could come in a future version, and what
you are proposing here is more general, so I think that we should go
ahead with it.
--
Michael

Attachment
From: Michael Paquier [mailto:michael@paquier.xyz]
> I see.  So the only reason why this flag exists is that if a file is large
> enough so as it is split into multiple chunks, then the first unlink will
> work but not the successive ones.  One chunk can be at most
> 1 million bytes, which is why it is essential for WAL segments.  Instead
> of ignoring *all* errors, let's ignore only ENOENT and rename ignore_error
> to missing_ok as well.
> 
> You need to update the comment in receiveFileChunks where an entry gets
> deleted with basically what I mentioned above, say:
> "If a file has been deleted on the source, remove it on the target as well.
> Note that multiple unlink() calls may happen on the same file if multiple
> data chunks are associated with it, hence ignore unconditionally anything
> missing.  If this file is not a relation data file, then it has been already
> truncated when creating the file chunk list at hte previous execution of
> the filemap."
> 
> Adding also a comment on top of remove_target_file to explain what
> missing_ok does would be nice to keep track of what the code should do.

Thanks for reviewing.  All done.

Regards
Takayuki Tsunakawa


Attachment
On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
> Thanks for reviewing.  All done.

Thanks.  Please be careful with the indentation of the patch.  Attached
is a slightly-updated version with a small modification in
remove_target_file to make the code cleaner, a proposal of commit
message and pgindent applied on the code.  I am switching that as ready
for committer.
--
Michael

Attachment
On Sun, Mar 11, 2018 at 11:04:01PM +0900, Michael Paquier wrote:
> On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
> > Thanks for reviewing.  All done.
> 
> Thanks.  Please be careful with the indentation of the patch.  Attached
> is a slightly-updated version with a small modification in
> remove_target_file to make the code cleaner, a proposal of commit
> message and pgindent applied on the code.  I am switching that as ready
> for committer.
> --
> Michael

> From 62950c615d87e3e58998f295ce446eb5c80707e4 Mon Sep 17 00:00:00 2001
> From: Michael Paquier <michael@paquier.xyz>
> Date: Sun, 11 Mar 2018 22:49:55 +0900
> Subject: [PATCH] Fix handling of removed files on target server with pg_rewind
> 
> After processing the filemap to build the list of chunks that will be
> fetched from the source to rewing the target server, it is possible that
> a file which was previously processed is removed from the source.  A
> simple example of such an occurence is a WAL segment which gets recycled
> on the target in-between.  When the filemap is processed, files not
> categorized as relation files are first truncated to prepare for its
> full copy of which is going to be taken from the source, divided into a
> set of junks.

Hopefully a set of CHUNKS not JUNKS ?

Justin


On Sun, Mar 11, 2018 at 09:56:33AM -0500, Justin Pryzby wrote:
> Hopefully a set of CHUNKS not JUNKS ?

Yes. (laugh)
--
Michael

Attachment
On Sun, Mar 11, 2018 at 11:04 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote:
>> Thanks for reviewing.  All done.
>
> Thanks.  Please be careful with the indentation of the patch.  Attached
> is a slightly-updated version with a small modification in
> remove_target_file to make the code cleaner, a proposal of commit
> message and pgindent applied on the code.  I am switching that as ready
> for committer.

The patch looks good to me! Barring objection, I will commit it
and back-patch to 9.5 where pg_rewind was added.

Regards,

-- 
Fujii Masao


On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
> The patch looks good to me! Barring objection, I will commit it
> and back-patch to 9.5 where pg_rewind was added.

Thanks in advance!  I just eyeballed the patch again and I don't see
issues with what's used here.  The thing should apply smoothly back to
9.5, of course let me know if you need help or an extra pair of eyes.
--
Michael

Attachment
On Wed, Mar 28, 2018 at 11:24 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote:
>> The patch looks good to me! Barring objection, I will commit it
>> and back-patch to 9.5 where pg_rewind was added.
>
> Thanks in advance!  I just eyeballed the patch again and I don't see
> issues with what's used here.  The thing should apply smoothly back to
> 9.5, of course let me know if you need help or an extra pair of eyes.

Pushed. Thanks!

Regards,

-- 
Fujii Masao


On Thu, Mar 29, 2018 at 04:04:58AM +0900, Fujii Masao wrote:
> Pushed. Thanks!

Nice, thanks.
--
Michael

Attachment