Thread: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
[bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
"Tsunakawa, Takayuki"
Date:
Hello, Our customer hit another bug of pg_rewind with PG 9.5. The attached patch fixes this. PROBLEM ======================================== After a long run of successful pg_rewind, the synchronized standby could not catch up the primary forever, emitting the followingmessage repeatedly: LOG: XX000: could not read from log segment 000000060000028A00000031, offset 16384: No error CAUSE ======================================== If the primary removes WAL files that pg_rewind is going to get, pg_rewind leaves 0-byte WAL files in the target directoryhere: [libpq_fetch.c] case FILE_ACTION_COPY: /* Truncate the old file out of the way, if any */ open_target_file(entry->path, true); fetch_file_range(entry->path, 0, entry->newsize); break; pg_rewind completes successfully, create recovery.conf, and then start the standby in the target cluster. walreceiver receivesWAL records and write them to the 0-byte WAL files. Finally, xlog reader complains that he cannot read a WAL page. FIX ======================================== pg_rewind deletes the file when it finds that the primary has deleted it. OTHER THOUGHTS ======================================== BTW, should pg_rewind really copy WAL files from the primary? If the sole purpose of pg_rewind is to recover an instanceto use as a standby, can pg_rewind just remove all WAL files in the target directory, because the standby can getWAL files from the primary and/or archive? Related to this, shouldn't pg_rewind avoid copying more files and directories like pg_basebackup? Currently, pg_rewind doesn'tcopy postmaster.pid, postmaster.opts, and temporary files/directories (pg_sql_tmp/). Regards Takayuki Tsunakawa
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote: > BTW, should pg_rewind really copy WAL files from the primary? If the > sole purpose of pg_rewind is to recover an instance to use as a > standby, can pg_rewind just remove all WAL files in the target > directory, because the standby can get WAL files from the primary > and/or archive? Yes, it should not copy those WAL files. Most of the time they are going to be meaningless. See this recent thread: https://www.postgresql.org/message-id/20180126023609.GH17847%40paquier.xyz So I would rather go this way instead of having to worry about manipulating those WAL segments as you do. Depending on the needs, I think that even a backpatch could be considered. > Related to this, shouldn't pg_rewind avoid copying more files and > directories like pg_basebackup? Currently, pg_rewind doesn't copy > postmaster.pid, postmaster.opts, and temporary files/directories > (pg_sql_tmp/). Yes, it should not copy those files. I have a patch in the current CF to do that: https://commitfest.postgresql.org/17/1507/ -- Michael
Attachment
RE: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz] > Yes, it should not copy those WAL files. Most of the time they are going > to be meaningless. See this recent thread: > https://www.postgresql.org/message-id/20180126023609.GH17847%40paquier > .xyz > So I would rather go this way instead of having to worry about manipulating > those WAL segments as you do. Depending on the needs, I think that even > a backpatch could be considered. Thank you for information. I didn't notice those activities going around pg_rewind. It's a regret that Chen's patch, which limits the WAL to be copied, is not committed yet. It looks good to be ready forcommitter. > > Related to this, shouldn't pg_rewind avoid copying more files and > > directories like pg_basebackup? Currently, pg_rewind doesn't copy > > postmaster.pid, postmaster.opts, and temporary files/directories > > (pg_sql_tmp/). > > Yes, it should not copy those files. I have a patch in the current CF to > do that: > https://commitfest.postgresql.org/17/1507/ Wow, what a great patch. I think I should look at it. But I'm afraid it won't be backpatched because it's big... Even with your patch and Chen's one, my small patch is probably necessary to avoid leaving 0-byte or half-baked files. I'mnot sure whether those strangely sized files would cause actual trouble, but maybe it would be healthy to try to cleanthings up as much as possible. (files in pg_twophase/ might emit WARNING messages, garbage server log files might makethe DBA worried, etc.; yes, these may be just FUD.) Regards Takayuki Tsunakawa
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Thu, Mar 01, 2018 at 07:49:06AM +0000, Tsunakawa, Takayuki wrote: > It's a regret that Chen's patch, which limits the WAL to be copied, is > not committed yet. It looks good to be ready for committer. The message I sent provides reasons about why it should not be integrated. Particularly since the prior last checkpoint has been removed in v11, there is always going to be a whole in WAL segments as you need to create a checkpoint on the ex-standby after it has been promoted so as its control file data is correctly reflected on disk. > > > Related to this, shouldn't pg_rewind avoid copying more files and > > > directories like pg_basebackup? Currently, pg_rewind doesn't copy > > > postmaster.pid, postmaster.opts, and temporary files/directories > > > (pg_sql_tmp/). > > > > Yes, it should not copy those files. I have a patch in the current CF to > > do that: > > https://commitfest.postgresql.org/17/1507/ > > Wow, what a great patch. I think I should look at it. But I'm afraid > it won't be backpatched because it's big... That's a new feature. This won't get backpatch'ed anyway. > Even with your patch and Chen's one, my small patch is probably > necessary to avoid leaving 0-byte or half-baked files. I'm not sure > whether those strangely sized files would cause actual trouble, but > maybe it would be healthy to try to clean things up as much as > possible. (files in pg_twophase/ might emit WARNING messages, garbage > server log files might make the DBA worried, etc.; yes, these may be > just FUD.) Yeah, I'd like to double-check what you are proposing here anyway. Sorry but I do not have an opinion about what you have sent yet :( The only thing I am sure of though is that for HEAD not copying files from pg_wal from the origin is the way to do it. For back-branches that's another story. -- Michael
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Thu, Mar 01, 2018 at 01:26:32AM +0000, Tsunakawa, Takayuki wrote: > Our customer hit another bug of pg_rewind with PG 9.5. The attached > patch fixes this. Sorry for my late reply. It took me unfortunately some time before coming back to this patch. > PROBLEM > ======================================== > > After a long run of successful pg_rewind, the synchronized standby > could not catch up the primary forever, emitting the following message > repeatedly: > > LOG: XX000: could not read from log segment 000000060000028A00000031, > offset 16384: No error Oops. > CAUSE > ======================================== > > If the primary removes WAL files that pg_rewind is going to get, > pg_rewind leaves 0-byte WAL files in the target directory here: > > [libpq_fetch.c] > case FILE_ACTION_COPY: > /* Truncate the old file out of the way, if any */ > open_target_file(entry->path, true); > fetch_file_range(entry->path, 0, entry->newsize); > break; > > pg_rewind completes successfully, create recovery.conf, and then start > the standby in the target cluster. walreceiver receives WAL records > and write them to the 0-byte WAL files. Finally, xlog reader > complains that he cannot read a WAL page. This part qualifies as a data corruption bug as some records are lost by the backend. > FIX > ======================================== > > pg_rewind deletes the file when it finds that the primary has deleted > it. The concept looks right to me. I think that this does not apply only to WAL segments though. You could have a temporary WAL segment that has been included in the chunk table, which is even more volatile, or a multixact file, a snapshot file, etc. In short anything which is not a relation file and could disappear between the moment the file map is built and the data is fetched from the source server. @@ -174,7 +173,7 @@ remove_target_file(const char *path) return; snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path); - if (unlink(dstpath) != 0) + if (unlink(dstpath) != 0 && !ignore_error) I see. So the only reason why this flag exists is that if a file is large enough so as it is split into multiple chunks, then the first unlink will work but not the successive ones. One chunk can be at most 1 million bytes, which is why it is essential for WAL segments. Instead of ignoring *all* errors, let's ignore only ENOENT and rename ignore_error to missing_ok as well. You need to update the comment in receiveFileChunks where an entry gets deleted with basically what I mentioned above, say: "If a file has been deleted on the source, remove it on the target as well. Note that multiple unlink() calls may happen on the same file if multiple data chunks are associated with it, hence ignore unconditionally anything missing. If this file is not a relation data file, then it has been already truncated when creating the file chunk list at hte previous execution of the filemap." Adding also a comment on top of remove_target_file to explain what missing_ok does would be nice to keep track of what the code should do. The portion about whether data from pg_wal should be transfered or not is really an improvement that could come in a future version, and what you are proposing here is more general, so I think that we should go ahead with it. -- Michael
Attachment
RE: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz] > I see. So the only reason why this flag exists is that if a file is large > enough so as it is split into multiple chunks, then the first unlink will > work but not the successive ones. One chunk can be at most > 1 million bytes, which is why it is essential for WAL segments. Instead > of ignoring *all* errors, let's ignore only ENOENT and rename ignore_error > to missing_ok as well. > > You need to update the comment in receiveFileChunks where an entry gets > deleted with basically what I mentioned above, say: > "If a file has been deleted on the source, remove it on the target as well. > Note that multiple unlink() calls may happen on the same file if multiple > data chunks are associated with it, hence ignore unconditionally anything > missing. If this file is not a relation data file, then it has been already > truncated when creating the file chunk list at hte previous execution of > the filemap." > > Adding also a comment on top of remove_target_file to explain what > missing_ok does would be nice to keep track of what the code should do. Thanks for reviewing. All done. Regards Takayuki Tsunakawa
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote: > Thanks for reviewing. All done. Thanks. Please be careful with the indentation of the patch. Attached is a slightly-updated version with a small modification in remove_target_file to make the code cleaner, a proposal of commit message and pgindent applied on the code. I am switching that as ready for committer. -- Michael
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Justin Pryzby
Date:
On Sun, Mar 11, 2018 at 11:04:01PM +0900, Michael Paquier wrote: > On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote: > > Thanks for reviewing. All done. > > Thanks. Please be careful with the indentation of the patch. Attached > is a slightly-updated version with a small modification in > remove_target_file to make the code cleaner, a proposal of commit > message and pgindent applied on the code. I am switching that as ready > for committer. > -- > Michael > From 62950c615d87e3e58998f295ce446eb5c80707e4 Mon Sep 17 00:00:00 2001 > From: Michael Paquier <michael@paquier.xyz> > Date: Sun, 11 Mar 2018 22:49:55 +0900 > Subject: [PATCH] Fix handling of removed files on target server with pg_rewind > > After processing the filemap to build the list of chunks that will be > fetched from the source to rewing the target server, it is possible that > a file which was previously processed is removed from the source. A > simple example of such an occurence is a WAL segment which gets recycled > on the target in-between. When the filemap is processed, files not > categorized as relation files are first truncated to prepare for its > full copy of which is going to be taken from the source, divided into a > set of junks. Hopefully a set of CHUNKS not JUNKS ? Justin
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Sun, Mar 11, 2018 at 09:56:33AM -0500, Justin Pryzby wrote: > Hopefully a set of CHUNKS not JUNKS ? Yes. (laugh) -- Michael
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Fujii Masao
Date:
On Sun, Mar 11, 2018 at 11:04 PM, Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Mar 09, 2018 at 08:22:49AM +0000, Tsunakawa, Takayuki wrote: >> Thanks for reviewing. All done. > > Thanks. Please be careful with the indentation of the patch. Attached > is a slightly-updated version with a small modification in > remove_target_file to make the code cleaner, a proposal of commit > message and pgindent applied on the code. I am switching that as ready > for committer. The patch looks good to me! Barring objection, I will commit it and back-patch to 9.5 where pg_rewind was added. Regards, -- Fujii Masao
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote: > The patch looks good to me! Barring objection, I will commit it > and back-patch to 9.5 where pg_rewind was added. Thanks in advance! I just eyeballed the patch again and I don't see issues with what's used here. The thing should apply smoothly back to 9.5, of course let me know if you need help or an extra pair of eyes. -- Michael
Attachment
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Fujii Masao
Date:
On Wed, Mar 28, 2018 at 11:24 AM, Michael Paquier <michael@paquier.xyz> wrote: > On Wed, Mar 28, 2018 at 04:45:49AM +0900, Fujii Masao wrote: >> The patch looks good to me! Barring objection, I will commit it >> and back-patch to 9.5 where pg_rewind was added. > > Thanks in advance! I just eyeballed the patch again and I don't see > issues with what's used here. The thing should apply smoothly back to > 9.5, of course let me know if you need help or an extra pair of eyes. Pushed. Thanks! Regards, -- Fujii Masao
Re: [bug fix] pg_rewind creates corrupt WAL files, and the standbycannot catch up the primary
From
Michael Paquier
Date:
On Thu, Mar 29, 2018 at 04:04:58AM +0900, Fujii Masao wrote: > Pushed. Thanks! Nice, thanks. -- Michael