Re: BUG #14999: pg_rewind corrupts control file global/pg_control - Mailing list pgsql-bugs
From: Michael Paquier
Subject: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
Date:
Msg-id: 20180404072625.GB2208@paquier.xyz
In response to: Re: BUG #14999: pg_rewind corrupts control file global/pg_control (TipTop Labs <office@tiptop-labs.com>)
Responses: Re: BUG #14999: pg_rewind corrupts control file global/pg_control
List: pgsql-bugs
On Tue, Apr 03, 2018 at 10:13:05PM +0200, TipTop Labs wrote:
> Truncation of the file was moved to the second loop. Truncation occurs
> there if chunks are written into files at offset 0. This is the case
> for FILE_ACTION_COPY. An additional SQL "ORDER BY path, begin" ensures
> that these chunks are processed first.

Yes. I have spent a large portion of my day looking at that. And actually this is wrong as well.

First, please find attached a patch I have written while testing your changes. There is nothing much new in it, except that I added more comments and a test (other tests fail as well, but that does not matter).

I first looked at a way to avoid relying on the tweak with chunkoff, by extending the temporary table used to register the chunks with an extra column tracking the type of action taken on the file. Another option was to pass the action type of the entry directly down to receiveFileChunks(). However, both of them turned out grotty. So after that I fell back to your patch and began testing it, which is where I noticed that we can *never* guarantee the recovery of a data folder on which an error has happened in the middle of a pg_rewind. The reason for that is quite simple: even if the truncation has been moved down to the moment where the first chunk of a file is received, work may already have been done on some relation files. In particular, some of them may have been truncated down to a given size without a new range of blocks having been fetched from the source, so the data folder would be in an inconsistent state when trying to rewind it again.

If the control file from the source has been received and then a read-only error is found, then in order to perform a subsequent rewind you need to start and stop the target cluster cleanly, or manually update the control file and mark it as cleanly stopped. And this is a recipe for data corruption, as we now start a rewind based on the guarantee that the target cluster has been able to stop cleanly, with a shutdown checkpoint issued. You could always try to start and stop the cluster cleanly, but if your target instance replays WAL on data blocks which have been truncated by a previous rewind, you would need to re-create the instance from a new base backup, and that kind of damage is actually harder to diagnose.

The patch I sent previously, which distinguishes between relation files and "configuration" files, helps a bit. Still, it makes me uneasy because it cannot correctly handle failures on files which are part of a relation but need to be fully copied, so I think that we should drop this approach as well because of its complexity.

Another thing that could definitely help is to work on a patch which checks that *all* the files present on both the source and the target are writable before doing any operations, making the rewind fail nicely with the target still intact. That's quite a bit of refactoring ahead for what I think is little benefit in the long run (a rough sketch of the check is below).

What we could simply do is to document the limitation: if both the target and the source have read-only files, then the user needs to be sure to remove them from the target before doing the rewind, and re-do the linking after the operation. If an error is found while processing the rewind, then a new base backup needs to be taken. The refactoring I am mentioning two paragraphs above could definitely be done to ease the user's life, but I would treat that as a new feature and not a bug fix.
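To be concrete about that refactoring, here is a minimal sketch of the kind of pre-check I have in mind. The names here (file_entry_t, file_action_t, check_target_writable) are simplified inventions for the sketch, not pg_rewind's actual filemap structures:

#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

typedef enum
{
    FILE_ACTION_NONE,           /* nothing will be done on this file */
    FILE_ACTION_COPY,
    FILE_ACTION_TRUNCATE
} file_action_t;

typedef struct
{
    const char *path;           /* path in the target data folder */
    file_action_t action;       /* what the rewind plans to do to it */
} file_entry_t;

/*
 * Return false if any file the rewind would touch is not writable,
 * reporting the complete list so the caller can abort before having
 * modified anything in the target.
 */
static bool
check_target_writable(const file_entry_t *entries, int nentries)
{
    bool        ok = true;

    for (int i = 0; i < nentries; i++)
    {
        if (entries[i].action == FILE_ACTION_NONE)
            continue;

        if (access(entries[i].path, W_OK) != 0)
        {
            fprintf(stderr, "target file \"%s\" is not writable\n",
                    entries[i].path);
            ok = false;         /* keep scanning to report all of them */
        }
    }

    return ok;                  /* caller aborts the rewind if false */
}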
Another, much simpler approach we could take is to scan the data folder of the target before doing anything, see if some files are not writable, and report that list back to the user. If we do that even for files which are on the target but not the source, then that would be a loss for some cases; however, I would like to imagine that when using pg_rewind the configuration file set is consistent across nodes in a majority of cases, so a user could remove those non-writable files manually. This also has the advantage of being non-intrusive, so a back-patch is definitely possible. A rough sketch of such a scan is below.

Thoughts?
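For what it's worth, such a scan could be as simple as the following sketch (scan_for_readonly is an invented name, and a real version would go through pg_rewind's error reporting and skip transient files like postmaster.pid):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Recursively walk "dir" and report every regular file the current
 * user cannot write to.  Returns the number of such files found; a
 * non-zero result would make the rewind bail out before doing anything.
 */
static int
scan_for_readonly(const char *dir)
{
    DIR        *d;
    struct dirent *de;
    int         nfound = 0;

    d = opendir(dir);
    if (d == NULL)
    {
        fprintf(stderr, "could not open directory \"%s\"\n", dir);
        return 0;
    }

    while ((de = readdir(d)) != NULL)
    {
        char        path[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 ||
            strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);

        if (lstat(path, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode))
            nfound += scan_for_readonly(path);  /* recurse into subdirs */
        else if (S_ISREG(st.st_mode) && access(path, W_OK) != 0)
        {
            /* Report it; the user can remove the file (or its link)
             * from the target before retrying the rewind. */
            fprintf(stderr, "not writable: %s\n", path);
            nfound++;
        }
    }

    closedir(d);
    return nfound;
}

--
Michael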
Attachment