Thread: avoid multiple hard links to same WAL file after a crash
Hi hackers, I am splitting this off of a previous thread aimed at reducing archiving overhead [0], as I believe this fix might deserve back-patching. Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at an unfortunate moment can result in two links to the same file. My testing [1] demonstrated that it was possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting. This seems likely to lead to WAL corruption. The attached patch prevents this problem by using durable_rename() instead of durable_rename_excl() for WAL recycling. This removes the protection against accidentally overwriting an existing WAL file, but there shouldn't be one. This patch also sets the stage for reducing archiving overhead (as discussed in the other thread [0]). The proposed change to reduce archiving overhead will make it more likely that the server will attempt to re-archive segments after a crash. This might lead to archive corruption if the server concurrently writes to the same file via the aforementioned bug. [0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13 [1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
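A minimal sketch of the hazard described above, assuming the pre-v16 shape of durable_rename_excl() (the fsync calls and most error handling are omitted, and the function name is altered to flag that this is an illustration rather than the actual fd.c code):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Simplified sketch of durable_rename_excl(): link() the old name to the
     * new one, then unlink() the old name.  A crash between the two calls
     * leaves oldfile and newfile as two hard links to one inode, which is the
     * state the WAL-recycling test in [1] reproduced.
     */
    static int
    durable_rename_excl_sketch(const char *oldfile, const char *newfile)
    {
    #ifdef HAVE_WORKING_LINK
        if (link(oldfile, newfile) < 0)
            return -1;          /* EEXIST here is the "excl" protection */
        /* a crash at this point leaves two links to the same file */
        (void) unlink(oldfile);
    #else
        if (rename(oldfile, newfile) < 0)
            return -1;          /* no overwrite protection on this path */
    #endif
        return 0;
    }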
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Presently, WAL recycling uses durable_rename_excl(), which notes that a > crash at an unfortunate moment can result in two links to the same file. > My testing [1] demonstrated that it was possible to end up with two links > to the same file in pg_wal after a crash just before unlink() during WAL > recycling. Specifically, the test produced links to the same file for the > current WAL file and the next one because the half-recycled WAL file was > re-recycled upon restarting. This seems likely to lead to WAL corruption. Wow, that's bad. > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. I see that durable_rename_excl() has the following comment: "Similar to durable_rename(), except that this routine tries (but does not guarantee) not to overwrite the target file." If those are the desired semantics, we could achieve them more simply and more safely by just trying to stat() the target file and then, if it's not found, call durable_rename(). I think that would be a heck of a lot safer than what this function is doing right now. I'd actually be in favor of nuking durable_rename_excl() from orbit and putting the file-exists tests in the callers. Otherwise, someone might assume that it actually has the semantics that its name suggests, which could be pretty disastrous. If we don't want to do that, then I'd change it to do the stat-then-durable-rename thing internally, so we don't leave hard links lying around in *any* code path. Perhaps that's the right answer for the back-branches in any case, since there could be third-party code calling this function. Your proposed fix is OK if we don't want to do any of that stuff, but personally I'm much more inclined to blame durable_rename_excl() for being horrible than I am to blame the calling code for using it improvidently. -- Robert Haas EDB: http://www.enterprisedb.com
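To make the stat-then-durable-rename idea above concrete, here is a rough sketch; the helper name and its messages are invented for illustration, durable_rename() and the ereport() machinery are the real APIs, and the stat() check is inherently racy (a point that comes up again later in the thread):

    #include <errno.h>
    #include <sys/stat.h>

    /*
     * Hypothetical helper: refuse to clobber an existing target, then fall
     * through to a plain durable_rename().
     */
    static int
    durable_rename_if_absent(const char *oldfile, const char *newfile, int elevel)
    {
        struct stat st;

        if (stat(newfile, &st) == 0)
        {
            ereport(elevel,
                    (errmsg("file \"%s\" already exists, not overwriting", newfile)));
            return -1;
        }
        if (errno != ENOENT)
        {
            ereport(elevel,
                    (errcode_for_file_access(),
                     errmsg("could not stat file \"%s\": %m", newfile)));
            return -1;
        }

        /* Racy: another process could create newfile right here. */
        return durable_rename(oldfile, newfile, elevel);
    }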
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I see that durable_rename_excl() has the following comment: "Similar > to durable_rename(), except that this routine tries (but does not > guarantee) not to overwrite the target file." If those are the desired > semantics, we could achieve them more simply and more safely by just > trying to stat() the target file and then, if it's not found, call > durable_rename(). I think that would be a heck of a lot safer than > what this function is doing right now. IIUC it actually does guarantee that you won't overwrite the target file when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at all. Using stat() before rename() would therefore weaken this check for systems with working link(), but it'd probably strengthen it for systems without a working link(). > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I think there might be another problem. The man page for rename() seems to indicate that overwriting an existing file also introduces a window where the old and new path are hard links to the same file. This isn't a problem for the WAL files because we should never be overwriting an existing one, but I wonder if it's a problem for other code paths. My guess is that many code paths that overwrite an existing file are first writing changes to a temporary file before atomically replacing the original. Those paths are likely okay, too, as you can usually just discard any existing temporary files. > Your proposed fix is OK if we don't want to do any of that stuff, but > personally I'm much more inclined to blame durable_rename_excl() for > being horrible than I am to blame the calling code for using it > improvidently. I do agree that it's worth examining this stuff a bit closer. I've frequently found myself trying to reason about all the different states that callers of these functions can produce, so any changes that help simplify matters are a win in my book. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
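The temporary-file pattern alluded to above looks roughly like the following; OpenTransientFile(), pg_fsync(), CloseTransientFile() and durable_rename() are the real fd.c interfaces, while the wrapper function and its messages are invented for the example:

    /*
     * Sketch of the write-temp-file-then-rename pattern.  If a crash leaves
     * the *.tmp file behind, it can simply be discarded; the final rename is
     * what publishes the new contents.
     */
    static void
    replace_file_atomically(const char *path, const char *data, size_t len)
    {
        char        tmppath[MAXPGPATH];
        int         fd;

        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

        fd = OpenTransientFile(tmppath, O_WRONLY | O_CREAT | O_TRUNC | PG_BINARY);
        if (fd < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not create file \"%s\": %m", tmppath)));
        if (write(fd, data, len) != (ssize_t) len)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write file \"%s\": %m", tmppath)));
        if (pg_fsync(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", tmppath)));
        if (CloseTransientFile(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not close file \"%s\": %m", tmppath)));

        /* durable_rename() fsyncs the target file and its directory. */
        (void) durable_rename(tmppath, path, ERROR);
    }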
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: >> I'd actually be in favor of nuking durable_rename_excl() from orbit >> and putting the file-exists tests in the callers. Otherwise, someone >> might assume that it actually has the semantics that its name >> suggests, which could be pretty disastrous. If we don't want to do >> that, then I'd changing to do the stat-then-durable-rename thing >> internally, so we don't leave hard links lying around in *any* code >> path. Perhaps that's the right answer for the back-branches in any >> case, since there could be third-party code calling this function. > > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. Ha, so there are only a few callers of durable_rename_excl() in the PostgreSQL tree. One is basic_archive.c, which is already doing a stat() check. IIRC I only used durable_rename_excl() here to handle the case where multiple servers are writing archives to the same location. If that happened, the archiver process would begin failing. If a crash left two hard links to the same file around, we will silently succeed the next time around thanks to the compare_files() check. Besides the WAL installation code, the only other callers are in timeline.c, and both note that the use of durable_rename_excl() is for "paranoidly trying to avoid overwriting an existing file (there shouldn't be one)." So AFAICT basic_archive.c is the only caller with a strong reason for using durable_rename_excl(), and even that might not justify keeping it around. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I've attached a patch that simply removes durable_rename_excl() and replaces existing calls with durable_rename(). I noticed that Andres expressed similar misgivings about durable_rename_excl() last year [0] [1]. I can create a stat-then-durable-rename version of this for back-patching if that is still the route we want to go. [0] https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de [1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > > I see that durable_rename_excl() has the following comment: "Similar > > to durable_rename(), except that this routine tries (but does not > > guarantee) not to overwrite the target file." If those are the desired > > semantics, we could achieve them more simply and more safely by just > > trying to stat() the target file and then, if it's not found, call > > durable_rename(). I think that would be a heck of a lot safer than > > what this function is doing right now. > > IIUC it actually does guarantee that you won't overwrite the target file > when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at > all. Using stat() before rename() would therefore weaken this check for > systems with working link(), but it'd probably strengthen it for systems > without a working link(). Sure, but a guarantee that happens on only some systems isn't worth much. And, if it comes at the price of potentially having multiple hard links to the same file in obscure situations, that seems like it could easily cause more problems than this whole scheme can ever hope to solve. > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. I wonder if this is really true. I thought rename() was supposed to be atomic. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > > I think there might be another problem. The man page for rename() seems to > > indicate that overwriting an existing file also introduces a window where > > the old and new path are hard links to the same file. This isn't a problem > > for the WAL files because we should never be overwriting an existing one, > > but I wonder if it's a problem for other code paths. My guess is that many > > code paths that overwrite an existing file are first writing changes to a > > temporary file before atomically replacing the original. Those paths are > > likely okay, too, as you can usually just discard any existing temporary > > files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Looks like it's atomic in that it's not like cp + rm, but not atomic the other way you want. | If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However, there will probably be a window in which both oldpath and newpath refer to the file being renamed.
At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. From another direction, if the new segment was the currently active one, we just mustn't install it. Otherwise we don't care. So, the only thing we need to care about is segment switch. Without it, the segment that InstallXLogFileSegment found by the stat loop is known to be safe to overwrite even if it exists. When segment switch finds an existing file, it's no problem since the segment switch doesn't create a new segment. Otherwise segment switch always calls InstallXLogFileSegment. The section from searching for an empty segment slot until calling durable_rename_excl() is protected by ControlFileLock. Thus if a process is in the section, no other process can switch to a newly-created segment. If this diagnosis is correct, the comment is proved to be paranoid. > * Perform the rename using link if available, paranoidly trying to avoid > * overwriting an existing file (there shouldn't be one). As a result, I think Nathan's fix is correct in that we can safely use durable_rename() instead. And I propose to use renameat2 on Linux so that we can detect the contradicting case in the regression tests, even if only on Linux. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
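For readers following the argument, the section being described can be paraphrased as below; this is an abridged sketch of InstallXLogFileSegment(), not a literal copy, so names and details may differ between branches:

    /*
     * Abridged paraphrase of InstallXLogFileSegment(): both the search for a
     * free segment slot and the rename into place happen while holding
     * ControlFileLock, which is what serializes concurrent installers.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    if (find_free)
    {
        /* probe successive segment numbers until stat() reports ENOENT */
        while (stat(path, &stat_buf) == 0)
        {
            if ((*segno) >= max_segno)
            {
                LWLockRelease(ControlFileLock);
                return false;   /* no free slot within the allowed range */
            }
            (*segno)++;
            XLogFilePath(path, tli, *segno, wal_segment_size);
        }
    }

    /* durable_rename_excl() before the fix, durable_rename() after it */
    if (durable_rename(tmppath, path, LOG) != 0)
    {
        LWLockRelease(ControlFileLock);
        return false;
    }

    LWLockRelease(ControlFileLock);
    return true;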
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > So, the only thing we need to care is segment switch. Without it, the > segment that InstallXLogFileSegment found by the stat loop is known to > be safe to overwrite even if exists. > > When segment switch finds an existing file, it's no problem since the > segment switch doesn't create a new segment. Otherwise segment switch > always calls InstallXLogFileSegment. The section from searching for > an empty segmetn slot until calling durable_rename_excl() is protected > by ControlFileLock. Thus if a process is in the section, no other > process can switch to a newly-created segment. > > If this diagnosis is correct, the comment is proved to be paranoid. It's sometimes difficult to understand what problems really old code comments are worrying about. For example, could they have been worrying about bugs in the code? Could they have been worrying about manual interference with the pg_wal directory? It's hard to know. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> If this diagnosis is correct, the comment is proved to be paranoid. > It's sometimes difficult to understand what problems really old code > comments are worrying about. For example, could they have been > worrying about bugs in the code? Could they have been worrying about > manual interference with the pg_wal directory? It's hard to know. "git blame" can be helpful here, if you trace back to when the comment was written and then try to find the associated mailing-list discussion. (That leap can be difficult for commits pre-dating our current convention of including links in the commit message, but it's usually not *that* hard to locate contemporaneous discussion.) regards, tom lane
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi >> <horikyota.ntt@gmail.com> wrote: >>> If this diagnosis is correct, the comment is proved to be paranoid. > >> It's sometimes difficult to understand what problems really old code >> comments are worrying about. For example, could they have been >> worrying about bugs in the code? Could they have been worrying about >> manual interference with the pg_wal directory? It's hard to know. > > "git blame" can be helpful here, if you trace back to when the comment > was written and then try to find the associated mailing-list discussion. > (That leap can be difficult for commits pre-dating our current > convention of including links in the commit message, but it's usually > not *that* hard to locate contemporaneous discussion.) I traced this back a while ago. I believe the link() was first added in November 2000 as part of f0e37a8. This even predates WAL recycling, which was added in July 2001 as part of 7d4d5c0. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > >> <horikyota.ntt@gmail.com> wrote: > >>> If this diagnosis is correct, the comment is proved to be paranoid. > > > >> It's sometimes difficult to understand what problems really old code > >> comments are worrying about. For example, could they have been > >> worrying about bugs in the code? Could they have been worrying about > >> manual interference with the pg_wal directory? It's hard to know. > > > > "git blame" can be helpful here, if you trace back to when the comment > > was written and then try to find the associated mailing-list discussion. > > (That leap can be difficult for commits pre-dating our current > > convention of including links in the commit message, but it's usually > > not *that* hard to locate contemporaneous discussion.) > > I traced this back a while ago. I believe the link() was first added in > November 2000 as part of f0e37a8. This even predates WAL recycling, which > was added in July 2001 as part of 7d4d5c0. f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from somewhere out of the ML.. This patch changed XLogFileInit to support using existent files so that XLogWrite can use the new segment provided by checkpoint and still allow XLogWrite to create a new segment by itself. Just before the commit, calls to XLogFileInit were protected (or serialized) by logwr_lck. At that commit, calls to the same function were still serialized by ControlFileLockId. I *guess* that Vadim faced/noticed a race condition when he added checkpoint. Thus he introduced the link+remove protocol, but finally it became useless once the call to XLogFileInit was moved within the ControlFileLockId section. But, of course, all of this story is merely a guess. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: > At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >> I traced this back a while ago. I believe the link() was first added in >> November 2000 as part of f0e37a8. This even predates WAL recycling, which >> was added in July 2001 as part of 7d4d5c0. > > f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from > somewhere out of the ML.. This patch changed XLogFileInit to > support using existent files so that XLogWrite can use the new segment > provided by checkpoint and still allow XLogWrite to create a new > segment by itself. Yeah, I've been unable to find any discussion besides a brief reference to adding checkpointing [0]. [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I think there might be another problem. The man page for rename() seems to >> indicate that overwriting an existing file also introduces a window where >> the old and new path are hard links to the same file. This isn't a problem >> for the WAL files because we should never be overwriting an existing one, >> but I wonder if it's a problem for other code paths. My guess is that many >> code paths that overwrite an existing file are first writing changes to a >> temporary file before atomically replacing the original. Those paths are >> likely okay, too, as you can usually just discard any existing temporary >> files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Not always. For example, some old versions of MacOS have a non-atomic implementation of rename(), like prairiedog with 10.4. Even 10.5 does not handle atomicity as far as I recall. In short, it looks like a bad idea to me to rely on this at all. Some FSes have their own way of handling things, as well, but I am not much into this world. Saying that, it would be nice to see durable_rename_excl() gone as it has created quite a bit of pain for us in the past years. -- Michael
On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote: > Saying that, it would be nice to see durable_rename_excl() gone as it > has created quite a bit of pain for us in the past years. Yeah, I think this is the right thing to do. Patch upthread [0]. For back-branches, I suspect we'll want to remove all uses of durable_rename_excl() but leave the function around for any extensions that are using it. Of course, we'd also need a big comment imploring folks not to add any more callers. Another option would be to change the behavior of durable_rename_excl() to something that we think is safer (e.g., stat then rename), but that might just introduce a different set of problems. [0] https://postgr.es/m/20220408194345.GA1541826%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: >> I wonder if this is really true. I thought rename() was supposed to be atomic. > Not always. For example, some old versions of MacOS have a non-atomic > implementation of rename(), like prairiedog with 10.4. Even 10.5 does > not handle atomicity as far as I call. I think that's not talking about the same thing. POSIX requires rename(2) to replace an existing target link atomically: If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other threads throughout the renaming operation and refer either to the file referred to by new or old before the operation began. (It's that requirement that ancient macOS fails to meet.) However, I do not see any text that addresses the question of whether the old link disappears atomically with the appearance of the new link, and it seems like that'd be pretty impractical to ensure in cases like moving a link from one directory to another. (What would it even mean to say that, considering that a thread can't read the two directories at the same instant?) From a crash-safety standpoint, it'd surely be better to make the new link before removing the old, so I imagine that's what most file systems do. regards, tom lane
The readdir interface allows processes to be in the middle of reading a directory, and unless a kernel was happy to either materialize the entire directory list when the readdir starts, or lock the entire directory against modification for the entire time a process has a readdir fd open, it's always going to be possible for a process to have previously read the old directory entry and later see the new directory entry. Kernels don't do any MVCC or cmin type of games so they're not going to be able to prevent it. What's worse of course is that it may only happen in very large directories. Most directories fit on a single block and readdir may buffer up all the entries a block at a time for efficiency. So it may only be visible on very large directories that span multiple blocks.
Here is an attempt at creating something that can be back-patched. 0001 simply replaces calls to durable_rename_excl() with durable_rename() and is intended to be back-patched. 0002 removes the definition of durable_rename_excl() and is _not_ intended for back-patching. I imagine 0002 will need to be held back for v16devel. I think back-patching 0001 will encounter a couple of small obstacles. For example, the call in basic_archive won't exist on most of the back-branches, and durable_rename_excl() was named durable_link_or_rename() before v13. I don't mind producing a patch for each back-branch if needed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote: > Here is an attempt at creating something that can be back-patched. 0001 > simply replaces calls to durable_rename_excl() with durable_rename() and is > intended to be back-patched. 0002 removes the definition of > durable_rename_excl() and is _not_ intended for back-patching. I imagine > 0002 will need to be held back for v16devel. I would not mind applying 0002 on HEAD now to avoid more uses of this API, and I can get behind 0001 after thinking more about it. > I think back-patching 0001 will encounter a couple of small obstacles. For > example, the call in basic_archive won't exist on most of the > back-branches, and durable_rename_excl() was named durable_link_or_rename() > before v13. I don't mind producing a patch for each back-branch if needed. I am not sure that we have any need to backpatch this change based on the unlikeliness of the problem, TBH. One thing that is itching me a bit, like Robert upthread, is that we don't check anymore that the newfile does not exist in the code paths because we never expect one. It is possible to use stat() for that. But access() within a simple assertion would be simpler? Say something like: Assert(access(path, F_OK) != 0 && errno == ENOENT); The case for basic_archive is limited as the comment of the patch states, but that would be helpful for the two calls in timeline.c and the one in xlog.c in the long-term. And this has no need to be part of fd.c; it can be added before the durable_rename() calls. What do you think? -- Michael
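For instance, in an InstallXLogFileSegment()-style caller the check could sit immediately before the rename, along these lines (illustrative placement only, not a committed hunk):

    /*
     * We never expect the target segment to exist at this point, and with
     * assertions disabled this check costs nothing.
     */
    Assert(access(path, F_OK) != 0 && errno == ENOENT);

    if (durable_rename(tmppath, path, LOG) != 0)
    {
        LWLockRelease(ControlFileLock);
        return false;           /* durable_rename() already logged the error */
    }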
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote: > I am not sure that have any need to backpatch this change based on the > unlikeliness of the problem, TBH. One thing that is itching me a bit, > like Robert upthread, is that we don't check anymore that the newfile > does not exist in the code paths because we never expect one. It is > possible to use stat() for that. But access() within a simple > assertion would be simpler? Say something like: > Assert(access(path, F_OK) != 0 && errno == ENOENT); > > The case for basic_archive is limited as the comment of the patch > states, but that would be helpful for the two calls in timeline.c and > the one in xlog.c in the long-term. And this has no need to be part > of fd.c, this can be added before the durable_rename() calls. What do > you think? Here is a new patch set with these assertions added. I think at least the xlog.c change ought to be back-patched. The problem may be unlikely, but AFAICT the possible consequences include WAL corruption. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote: > Here is a new patch set with these assertions added. I think at least the > xlog.c change ought to be back-patched. The problem may be unlikely, but > AFAICT the possible consequences include WAL corruption. Okay, so I have applied this stuff this morning to see what the buildfarm had to say, and we have finished with a set of failures in various buildfarm members: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26 None of them liked the part where we assume that a TLI history file written by a WAL receiver should not exist beforehand, but as 025_stuck_on_old_timeline.pl is showing, a standby may attempt to retrieve a TLI history file after getting it from the archives. I was analyzing the whole thing, and it looks like a race condition. Per the buildfarm logs, we have less than 5ms between the moment the startup process retrieves the history file of TLI 2 from the archives and the moment the WAL receiver decides to check if this TLI file exists. If it does not exist, it would then retrieve it from the primary via streaming. So I guess that the sequence of events is that: - In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the existence of the history file for TLI 2, does not find it. - The startup process retrieves the file from the archives. - The WAL receiver goes through the internal loop of WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the primary's stream. Switching from durable_rename_excl() to durable_rename() would mean that we'd overwrite the TLI file received from the primary stream over what's been retrieved from the archives. That does not strike me as an issue in itself and that should be safe, so the comment is misleading, and we can live without the assertion in writeTimeLineHistoryFile() called by the WAL receiver. Now, I think that we'd better keep some belts in writeTimeLineHistory() called by the startup process at the end-of-recovery, as we should never ever have a TLI file generated when selecting a new timeline. Perhaps this should be an elog(ERROR) at least, with a check on the file existence before calling durable_rename()? Anyway, my time is constrained next week due to the upcoming Japanese Golden Week and the buildfarm has to be stable, so I have reverted the change for now. -- Michael
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote: > On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: >> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >>> I traced this back a while ago. I believe the link() was first added in >>> November 2000 as part of f0e37a8. This even predates WAL recycling, which >>> was added in July 2001 as part of 7d4d5c0. >> >> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from >> somewhere out of the ML.. This patch changed XLogFileInit to >> support using existent files so that XLogWrite can use the new segment >> provided by checkpoint and still allow XLogWrite to create a new >> segment by itself. Yes, I think that you are right here. I also suspect that the checkpoint command was facing a concurrency issue while working on the feature and that Vadim saw that this part of the implementation would be safer in the long run if we use link() followed by unlink(). > Yeah, I've been unable to find any discussion besides a brief reference to > adding checkpointing [0]. > > [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com While looking at the history of this area, I have also noticed this argument, telling also that this is a safety measure if this code were to run in parallel, but that's without counting on the control file lock held while doing this operation anyway: https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us As mentioned already upthread, f0e37a8 is the origin of the link()/unlink() business in the WAL segment initialization logic, and also note 1f159e5 that has added a rename() as extra code path for systems where link() was not working. At the end, switching directly from durable_rename_excl() to durable_rename() should be fine for the WAL segment initialization, but we could do things a bit more carefully by adding a check on the file existence before calling durable_rename() and issue a elog(LOG) if a file is found, giving a mean for the WAL recycling to give up peacefully as it does now. Per my analysis, the TLI history file created at the end of recovery ought to issue an elog(ERROR). Now, I am surprised by the third code path of durable_rename_excl(), as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause any issues, as link() should exit with EEXIST when the startup process grabs the same history file concurrently. It seems to me that in this last case using durable_rename() could be an improvement and prevent extra WAL receiver restarts as a TLI history fetched from the primary via streaming or from some archives should be the same, but we could be more careful, like the WAL init logic, by skipping the durable_rename() and issuing an elog(LOG). That would not be perfect, still a bit better than the current state of HEAD. As we are getting closer to the beta release, it looks safer to set this change aside a bit longer and wait for v16 to be opened for business on HEAD. -- Michael
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > Now, I am surprised by the third code path of durable_rename_excl(), > as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause > any issues, as link() should exit with EEXIST when the startup process > grabs the same history file concurrently. It seems to me that in this > last case using durable_rename() could be an improvement and prevent > extra WAL receiver restarts as a TLI history fetched from the primary > via streaming or from some archives should be the same, but we could > be more careful, like the WAL init logic, by skipping the > durable_rename() and issuing an elog(LOG). That would not be perfect, > still a bit better than the current state of HEAD. Skimming through at the buildfarm logs, it happens that the tests are able to see this race from time to time. Here is one such example on rorqual: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check And here are the relevant logs: 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log file "00000002.history" from archive 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching timeline history file for timeline 2 from primary server 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": File exists [...] 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started streaming WAL from primary at 0/3000000 on timeline 2 The WAL receiver upgrades the ERROR to a FATAL, and restarts streaming shortly after. Using durable_rename() would not be an issue here. -- Michael
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > At the end, switching directly from durable_rename_excl() to > durable_rename() should be fine for the WAL segment initialization, > but we could do things a bit more carefully by adding a check on the > file existence before calling durable_rename() and issue a elog(LOG) > if a file is found, giving a mean for the WAL recycling to give up > peacefully as it does now. Per my analysis, the TLI history file > created at the end of recovery ought to issue an elog(ERROR). My only concern with this approach is that it inevitably introduces a race condition. In most cases, the file existence check will prevent overwrites, but it might not always. Furthermore, we believe that such overwrites either 1) should not happen (e.g., WAL recycling) or 2) won't cause problems if they happen (e.g., when the WAL receiver writes the TLI history file). Also, these races will be difficult to test, so we won't know what breaks when they occur. My instinct is to just let the overwrites happen. That way, we are more likely to catch breakage in tests, and we'll have one less race condition to worry about. I don't mind asserting that the file doesn't exist when we don't expect it to, as that might help catch potential problems in development without affecting behavior in production. If we do want to add file existence checks, I think we'd better add a comment about the potential for race conditions. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: > Skimming through at the buildfarm logs, it happens that the tests are > able to see this race from time to time. Here is one such example on > rorqual: > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check > > And here are the relevant logs: > 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log > file "00000002.history" from archive > 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching > timeline history file for timeline 2 from primary server > 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could > not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": > File exists > [...] > 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started > streaming WAL from primary at 0/3000000 on timeline 2 > > The WAL receiver upgrades the ERROR to a FATAL, and restarts > streaming shortly after. Using durable_rename() would not be an issue > here. Thanks for investigating this one. I think I agree that we should simply switch to durable_rename() (without a file existence check beforehand). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 10:39:07AM -0700, Nathan Bossart wrote: > On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: >> The WAL receiver upgrades the ERROR to a FATAL, and restarts >> streaming shortly after. Using durable_rename() would not be an issue >> here. > > Thanks for investigating this one. I think I agree that we should simply > switch to durable_rename() (without a file existence check beforehand). Here is a new patch set. For now, I've only removed the file existence check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced that there isn't a problem here (e.g., due to concurrent .ready file creation), but since some platforms have been using rename() for some time, I don't know how worried we should be. I thought about adding some kind of locking between the WAL receiver and startup processes, but that seems excessive. Alternatively, we could just fix xlog.c as proposed earlier [0]. AFAICT that is the only caller that can experience problems due to the multiple-hard-link issue. All other callers are simply renaming a temporary file into place, and the temporary file can be discarded if left behind after a crash. [0] https://postgr.es/m/20220407182954.GA1231544%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 04:06:13PM -0700, Nathan Bossart wrote: > Here is a new patch set. For now, I've only removed the file existence > check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced > that there isn't a problem here (e.g., due to concurrent .ready file > creation), but since some platforms have been using rename() for some time, > I don't know how worried we should be. That's only about Windows these days, meaning that there is much less coverage in this code path. > I thought about adding some kind of > locking between the WAL receiver and startup processes, but that seems > excessive. Agreed. > Alternatively, we could just fix xlog.c as proposed earlier > [0]. AFAICT that is the only caller that can experience problems due to > the multiple-hard-link issue. All other callers are simply renaming a > temporary file into place, and the temporary file can be discarded if left > behind after a crash. I'd agree with removing all the callers at the end. pgrename() is quite robust on Windows, but I'd keep the two checks: the one in writeTimeLineHistory(), as the logic around findNewestTimeLine() would consider a past TLI history file as in-use even if we have a crash just after the file got created in the same path by the same standby, and the one in the WAL segment init part. Your patch does that. -- Michael
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > I'd agree with removing all the callers at the end. pgrename() is > quite robust on Windows, but I'd keep the two checks in > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > consider a past TLI history file as in-use even if we have a crash > just after the file got created in the same path by the same standby, > and the WAL segment init part. Your patch does that. As v16 is now open for business, I have revisited this change and applied 0001 to change all the callers (aka removal of the assertion for the WAL receiver when it overwrites a TLI history file). The commit log includes details about the reasoning of all the areas changed, for clarity, as of the WAL recycling part, the TLI history file part and basic_archive. -- Michael
On Tue, Jul 05, 2022 at 10:19:49AM +0900, Michael Paquier wrote: > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: >> I'd agree with removing all the callers at the end. pgrename() is >> quite robust on Windows, but I'd keep the two checks in >> writeTimeLineHistory(), as the logic around findNewestTimeLine() would >> consider a past TLI history file as in-use even if we have a crash >> just after the file got created in the same path by the same standby, >> and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() about possible concurrent use by a WAL receiver and the startup process and why that is okay. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jul 05, 2022 at 09:58:38AM -0700, Nathan Bossart wrote: > Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() > about possible concurrent use by a WAL receiver and the startup process and > why that is okay. Agreed. Adding an extra note at the top of the routine would help in the future. -- Michael
Dear team, We recently observed a few cases where Postgres running on Linux encountered an issue with WAL segment files. Specifically, two WAL segments were linked to the same physical file after Postgres ran out of memory and the OOM killer terminated one of its processes. This resulted in the WAL segments overwriting each other and Postgres failing a later recovery. We found this fix [1] that has been applied to Postgres 16, but the cases we observed were running Postgres 15. Given that older major versions will be supported for a good number of years, and the potential for irrecoverability exists (even if rare), we would like to discuss the possibility of back-patching this fix. Are there any technical reasons not to back-patch this fix to older major versions? Thank you for your consideration. Sincerely, Robert Pang [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3 On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > > I'd agree with removing all the callers at the end. pgrename() is > > quite robust on Windows, but I'd keep the two checks in > > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > > consider a past TLI history file as in-use even if we have a crash > > just after the file got created in the same path by the same standby, > > and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. > -- > Michael
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > We recently observed a few cases where Postgres running on Linux > encountered an issue with WAL segment files. Specifically, two WAL > segments were linked to the same physical file after Postgres ran out > of memory and the OOM killer terminated one of its processes. This > resulted in the WAL segments overwriting each other and Postgres > failing a later recovery. Yikes! > We found this fix [1] that has been applied to Postgres 16, but the > cases we observed were running Postgres 15. Given that older major > versions will be supported for a good number of years, and the > potential for irrecoverability exists (even if rare), we would like to > discuss the possibility of back-patching this fix. IMHO this is a good time to reevaluate. It looks like we originally didn't back-patch out of an abundance of caution, but now that this one has had time to bake, I think it's worth seriously considering, especially now that we have a report from the field. -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Andres Freund
Hi, On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote: > On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > > We recently observed a few cases where Postgres running on Linux > > encountered an issue with WAL segment files. Specifically, two WAL > > segments were linked to the same physical file after Postgres ran out > > of memory and the OOM killer terminated one of its processes. This > > resulted in the WAL segments overwriting each other and Postgres > > failing a later recovery. > > Yikes! Indeed. As chance would have it, I was asked for input on a corrupted server *today*. Eventually we found that recovery stopped early, after encountering a segment with a *newer* pageaddr than we expected. Which made me think of this issue, and indeed, the file recovery stopped at had two links. Before that the server had been crashing on a regular basis for unrelated reasons, which presumably increased the chances sufficiently to eventually hit this problem. It's a normal thing to discover the end of the WAL by finding a segment that has an older pageaddr than its name suggests. But in this case we saw a newer page address. I wonder if we should treat that differently... > > We found this fix [1] that has been applied to Postgres 16, but the > > cases we observed were running Postgres 15. Given that older major > > versions will be supported for a good number of years, and the > > potential for irrecoverability exists (even if rare), we would like to > > discuss the possibility of back-patching this fix. > > IMHO this is a good time to reevaluate. It looks like we originally didn't > back-patch out of an abundance of caution, but now that this one has had > time to bake, I think it's worth seriously considering, especially now that > we have a report from the field. Strongly agreed. I don't think the issue is actually quite as unlikely to be hit as reasoned in the commit message. The crash does indeed have to happen between the link() and unlink() - but at the end of a checkpoint we do that operation hundreds of times in a row on a busy server. And that's just after potentially doing lots of write IO during a checkpoint, filling up drive write caches / eating up IOPS/bandwidth disk quotas. Greetings, Andres Freund
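The check in question lives in the xlogreader page-header validation; a rough sketch of the distinction being suggested might look like the following, with the surrounding variable names and the second message text invented purely for illustration (this is not actual PostgreSQL code):

    /*
     * Sketch only: distinguish an older-than-expected pageaddr (the normal
     * end-of-WAL case) from a newer one (a hint the segment may have been
     * clobbered, e.g. via a stray hard link left behind by this bug).
     */
    if (hdr->xlp_pageaddr != recaddr)
    {
        if (hdr->xlp_pageaddr < recaddr)
            report_invalid_record(state,
                                  "unexpected pageaddr %X/%X in log segment %s, offset %u",
                                  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                                  fname, offset);
        else
            report_invalid_record(state,
                                  "pageaddr %X/%X is newer than expected in log segment %s, offset %u",
                                  LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                                  fname, offset);
        return false;
    }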
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > I've been double-checking the code to refresh myself with the problem, > and I don't see a reason to not apply something like the attached set > down to v13 for all these remaining branches (minus an edit of the > commit message). LGTM -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Andres Freund
On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote: > On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > > I've been double-checking the code to refresh myself with the problem, > > and I don't see a reason to not apply something like the attached set > > down to v13 for all these remaining branches (minus an edit of the > > commit message). > > LGTM Dito.
Dear Michael,
Thank you for applying this back-patch. I also appreciate everyone's input on this issue.
Sincerely,
Robert Pang
On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> > On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
> >> LGTM
> >
> > Dito.
>
> Thanks for double-checking. Done this one.
> --
> Michael
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote: > > I don't think the issue is actually quite as unlikely to be hit as reasoned in > > the commit message. The crash has indeed to happen between the link() and > > unlink() - but at the end of a checkpoint we do that operations hundreds of > > times in a row on a busy server. And that's just after potentially doing lots > > of write IO during a checkpoint, filling up drive write caches / eating up > > IOPS/bandwidth disk quots. > > Looks so, yep. Your timing and the report's timing are interesting. > > I've been double-checking the code to refresh myself with the problem, > and I don't see a reason to not apply something like the attached set > down to v13 for all these remaining branches (minus an edit of the > commit message). This became v14 commit 1f95181b44c843729caaa688f74babe9403b5850. I think it is causing timing-dependent failures in archive recovery, at restartpoints. v14 and earlier were relying on the "excl" part of durable_rename_excl() to keep WAL preallocation and recycling from stomping on archive-restored WAL files. If that's right, we need to either back-patch v15's suppression of those features during archive recovery (commit cc2c7d6 and five others), or we need to revert and fix the multiple hard links differently. v15 commit cc2c7d6 "Skip WAL recycling and preallocation during archive recovery" wrote that it fixed "a spurious durable_rename_excl() LOG message". Replacing durable_rename_excl() with durable_rename() increased consequences beyond spurious messages. I regret that I didn't make that connection during post-commit review of commit 1f95181b44c843729caaa688f74babe9403b5850. Trouble emerges during archive recovery, not during streaming or crash recovery. The symptom is "invalid magic number 0000 in log segment X, offset 0" when PreallocXlogFiles() stomps. The symptom is "unexpected pageaddr X in log segment Y, offset 0" [X < Y] when RemoveOldXlogFiles() recycling stomps. wal_recycle=off probably avoids the latter but not the former. Options I see: 1. Make v14 and v13 skip WAL recycling and preallocation during archive recovery, like newer branches do. I think that means back-patching the six commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef. 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart. Avoid multiple hard links by making durable_rename_excl() first rename oldfile to a temporary name. (I haven't thought this through in detail. It may not suffice.) I'm leaning toward (1) at least enough to see how messy the back-patch would be, since I don't like risks of designing old-branch-specific solutions when v15/v16/v17 have a proven solution. What else should we be thinking about before deciding? Arun Thirupathi collected the symptoms, traced them to commit 1f95181b44c843729caaa688f74babe9403b5850, and reported the diagnosis to me. These symptoms have not emerged in v15+. Thanks, nm
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Nathan Bossart
On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote: > Options I see: > > 1. Make v14 and v13 skip WAL recycling and preallocation during archive > recovery, like newer branches do. I think that means back-patching the six > commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef. > > 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart. > Avoid multiple hard links by making durable_rename_excl() first rename > oldfile to a temporary name. (I haven't thought this through in detail. > It may not suffice.) > > I'm leaning toward (1) at least enough to see how messy the back-patch would > be, since I don't like risks of designing old-branch-specific solutions when > v15/v16/v17 have a proven solution. What else should we be thinking about > before deciding? I agree with first attempting a back-patch. All of this stuff is from 2021-2022, so it's had time to bake and is IMHO lower risk. -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From: Michael Paquier
On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:
> On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:
>> 1. Make v14 and v13 skip WAL recycling and preallocation during archive
>> recovery, like newer branches do. I think that means back-patching the six
>> commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.
>>
>> 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
>> Avoid multiple hard links by making durable_rename_excl() first rename
>> oldfile to a temporary name. (I haven't thought this through in detail.
>> It may not suffice.)
>>
>> I'm leaning toward (1) at least enough to see how messy the back-patch would
>> be, since I don't like risks of designing old-branch-specific solutions when
>> v15/v16/v17 have a proven solution. What else should we be thinking about
>> before deciding?
>
> I agree with first attempting a back-patch. All of this stuff is from
> 2021-2022, so it's had time to bake and is IMHO lower risk.

Sorry for the late reply here (life things and the timing of Noah's
report).

Discarding the option where we would design a new solution specific to
v14 and v13, I don't really see a third option here. Choice 2 means
that we'd just leave v13 and v14 exposed to corruption, and that the
message to our user base would be that they should upgrade to v15 to
get that fixed, even though these branches still have roughly two more
years of official community support, during which users would remain
exposed to the risk of multiple links pointing to a single WAL segment,
creating corruption. As reported upthread, that state can be reached.

Choice 1 would be the logical path, even if it may not be the practical
one in terms of back-patch risks vs benefits. It would be much better
than telling people that they have to upgrade to at least v15 to make
sure that their version of Postgres is safe to use.

Saying that, I've looked at the whole thing for a good part of the day.
One thing that may make a back-patch messier is that Robert has
improved and refactored the recovery area quite a bit in 15~, so the
code relies on slightly different assumptions in 13/14 compared to 15~.
There is a piece around ThisTimeLineID, for example, which is something
that InstallXLogFileSegment() uses and also something that's touched by
1f95181b44c8. The impact of these pieces of refactoring specific to v15
is surprisingly lower than I thought it would be.

Some notes about the six commits you are mentioning, looking at them
one by one on the 13 and 14 stable branches.

First, for v14:
cc2c7d6~4: some bumps with the removal of use_lock with one flag.
Looks OK otherwise.
cc2c7d6~3: Looks OK. Based on what this addresses, this is low-risk.
cc2c7d6~2: XLogWalRcvWrite() has a conflict as an effect of
XLogWalRcvWrite() being able to be given a "tli". It still seems
correct to me to use "recvFileTLI = ThisTimeLineID".
cc2c7d6~1: 8ad6c5dbbe5a needs to be reverted first. Commit 8ad6c5dbbe5a
was introduced as a workaround for what we would back-patch here. One
conflict in walreceiver.c.
cc2c7d6: Does not conflict, which was a bit surprising, TBH.
043_no_contrecord_switch.pl gets unhappy here, on an assertion failure
in WaitForWALToBecomeAvailable() about InstallXLogFileSegmentActive.
e36cbef: Previous failure fixed by this one, with the same symptoms as
reported in the original thread that led to this commit.

Then, for v13:
cc2c7d6~4: One conflict with InstallXLogFileSegment() in xlog.c,
nothing huge.
cc2c7d6~3: Looks OK.
cc2c7d6~2: Looks OK.
cc2c7d6~1: With 8ad6c5dbbe5a reverted, same as the v14 part.
cc2c7d6: Does not conflict. 043_no_contrecord_switch.pl gets unhappy
here, same as above.
e36cbef: Same as above.

Noah, all these improvements are something you have directly worked on
in 15~. I'm hoping not to have missed something, even if I've looked at
the paths manipulating the WAL segments changed here on v13 and v14.
Could it be possible for you to double-check and also run more tests if
your environments help? Perhaps this needs specific prerequisites for
recovery? We need more confidence in the solution here: even if it's
proven to work for v15, we may still have gaps specific to v14 and/or
v13.

I am attaching a full set of patches for v14 and v13 that can be used
for these tests. WDYT?
--
Michael
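For anyone wanting to stage the same sequence locally, a rough sketch
of the v14 pass follows; the hashes are the ones cited above (their
back-patched equivalents on the stable branch may differ), and the
per-commit conflict resolution described in the notes is left out:

# Sketch only: stage the back-patch on a local REL_14_STABLE branch.
git checkout REL_14_STABLE
# cc2c7d6~4 .. cc2c7d6~2 first; minor conflicts as noted above.
git cherry-pick -x cc2c7d6~4 cc2c7d6~3 cc2c7d6~2
# 8ad6c5dbbe5a was a workaround for the problem being back-patched,
# so it goes away before the remaining commits.
git revert 8ad6c5dbbe5a
git cherry-pick -x cc2c7d6~1 cc2c7d6 e36cbef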
Attachment
- v14-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patch
- v14-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patch
- v14-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patch
- v14-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patch
- v14-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patch
- v14-0006-Skip-WAL-recycling-and-preallocation-during-arch.patch
- v14-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patch
- v13-0001-Remove-XLogFileInit-ability-to-skip-ControlFileL.patch
- v13-0002-In-XLogFileInit-fix-use_existent-postcondition-t.patch
- v13-0003-Remove-XLogFileInit-ability-to-unlink-a-pre-exis.patch
- v13-0004-Revert-Add-HINT-for-restartpoint-race-with-KeepF.patch
- v13-0005-Don-t-ERROR-on-PreallocXlogFiles-race-condition.patch
- v13-0006-Skip-WAL-recycling-and-preallocation-during-arch.patch
- v13-0007-Reset-InstallXLogFileSegmentActive-after-walrece.patch
- signature.asc
On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:
> I am attaching a full set of patches for v14 and v13 that can be used
> for these tests. WDYT?

And I forgot to mention that I've checked both v14 and v13, where I
have noticed the two patterns "invalid magic number 0000 in log
segment X, offset 0" and "unexpected pageaddr X in log segment Y" as
LOG and DEBUG entries. These are gone once the patches are applied on
both branches.

Just set max_wal_size and min_wal_size (128MB here) low enough with a
standby doing archive recovery, and bulk-load WAL on the primary to
accelerate the reproduction of the problem. It takes a couple of
minutes for these to show up with the unpatched branches.
--
Michael
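Spelled out, that reproduction setup might look roughly like the
following; the ports, archive location, and pgbench workload are
illustrative, and the log patterns above show up only intermittently,
so expect to let the load run for a few minutes:

initdb -D primary
mkdir archive
cat >> primary/postgresql.conf <<EOF
archive_mode = on
archive_command = 'cp "%p" "$PWD/archive/%f"'
min_wal_size = 128MB
max_wal_size = 128MB
EOF
pg_ctl -D primary -o '-p 5432' -l primary.log start

# Standby doing archive recovery from the same archive directory.
pg_basebackup -d 'port=5432' -D standby
cat >> standby/postgresql.conf <<EOF
restore_command = 'cp "$PWD/archive/%f" "%p"'
EOF
touch standby/standby.signal
pg_ctl -D standby -o '-p 5433' -l standby.log start

# Bulk-load WAL on the primary, then watch standby.log for the
# "invalid magic number" / "unexpected pageaddr" entries.
pgbench -i -s 100 -p 5432 postgres
pgbench -T 300 -c 4 -p 5432 postgres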
On Mon, Mar 10, 2025 at 02:25:28PM +0900, Michael Paquier wrote:
> On Thu, Mar 06, 2025 at 01:44:30PM -0600, Nathan Bossart wrote:
> > On Thu, Mar 06, 2025 at 11:30:13AM -0800, Noah Misch wrote:
> >> 1. Make v14 and v13 skip WAL recycling and preallocation during archive
> >> recovery, like newer branches do. I think that means back-patching the six
> >> commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.
> >>
> >> 2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
> >> Avoid multiple hard links by making durable_rename_excl() first rename
> >> oldfile to a temporary name. (I haven't thought this through in detail.
> >> It may not suffice.)
> >>
> >> I'm leaning toward (1) at least enough to see how messy the back-patch would
> >> be, since I don't like risks of designing old-branch-specific solutions when
> >> v15/v16/v17 have a proven solution. What else should we be thinking about
> >> before deciding?
> >
> > I agree with first attempting a back-patch. All of this stuff is from
> > 2021-2022, so it's had time to bake and is IMHO lower risk.
>
> Could it be possible for you to double-check and also run
> more tests if your environments help?

Thanks for crafting back-branch versions. I've queued a task to confirm
I get the same result. There's a test case I'll polish, too.
On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> Thanks for crafting back-branch versions. I've queued a task to confirm I get
> the same result.

Thanks for that. That helps a lot.

> There's a test case I'll polish, too.

Are you considering the addition of a TAP test in 17~ based on a wait
injection point in the checkpointer coupled with a check of the server
logs to see if we see the error patterns you've spotted?
--
Michael
On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:
> On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> > Thanks for crafting back-branch versions. I've queued a task to confirm I get
> > the same result.
>
> Thanks for that. That helps a lot.

I'll let you know when I get there.

> > There's a test case I'll polish, too.
>
> Are you considering the addition of a TAP test in 17~ based on a wait
> injection point in the checkpointer coupled with a check of the server
> logs to see if we see the error patterns you've spotted?

No, nothing involving injection points or otherwise version-specific.
On Tue, Mar 11, 2025 at 06:23:15PM -0700, Noah Misch wrote:
> On Wed, Mar 12, 2025 at 09:46:27AM +0900, Michael Paquier wrote:
> > On Tue, Mar 11, 2025 at 01:57:49PM -0700, Noah Misch wrote:
> > > Thanks for crafting back-branch versions. I've queued a task to confirm I get
> > > the same result.
> >
> > Thanks for that. That helps a lot.
>
> I'll let you know when I get there.

Your back-patches are correct. Thanks.

> > > There's a test case I'll polish, too.
> >
> > Are you considering the addition of a TAP test in 17~ based on a wait
> > injection point in the checkpointer coupled with a check of the server
> > logs to see if we see the error patterns you've spotted?
>
> No, nothing involving injection points or otherwise version-specific.

Here it is. Making it fail three times took looping 1383s, 5841s, and
2594s. Hence, it couldn't be expected to catch the regression before
commit, but it would have made sufficient buildfarm and CI noise in the
day after commit.
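To repeat that kind of looping, something along these lines works once
the attached test file is installed in the recovery suite; the
045_archive_restartpoint.pl name is the one the test ended up with
later in this thread, so adjust it if the attachment uses a different
name:

cd src/test/recovery
i=0
# Re-run the single test until it fails, counting successful runs.
while make check PROVE_TESTS=t/045_archive_restartpoint.pl; do
    i=$((i + 1))
    echo "passed $i runs so far"
done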
On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
> Your back-patches are correct. Thanks.

Thanks for double-checking. I'll move on with what I have after a
second look, as it's been a few weeks since I've looked at all these
conflicts. I am also planning to add a few notes in the commits to
mention this thread.

> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
> Hence, it couldn't be expected to catch the regression before commit, but it
> would have made sufficient buildfarm and CI noise in the day after commit.

Hmm. Not much of a fan of the addition of a test that has less than
1% of reproducibility for the problem, even if it's good to see that
this can be made portable to run down to v13.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
>> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
>> Hence, it couldn't be expected to catch the regression before commit, but it
>> would have made sufficient buildfarm and CI noise in the day after commit.

> Hmm. Not much of a fan of the addition of a test that has less than
> 1% of reproducibility for the problem, even if it's good to see that
> this can be made portable to run down to v13.

Yeah, it's good to have a test but I doubt we should commit it.
Too many buildfarm cycles will be expended for too little result.

			regards, tom lane
On Sat, Apr 05, 2025 at 11:07:13AM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
> >> Here it is. Making it fail three times took looping 1383s, 5841s, and 2594s.
> >> Hence, it couldn't be expected to catch the regression before commit, but it
> >> would have made sufficient buildfarm and CI noise in the day after commit.
>
> > Hmm. Not much of a fan of the addition of a test that has less than
> > 1% of reproducibility for the problem, even if it's good to see that
> > this can be made portable to run down to v13.
>
> Yeah, it's good to have a test but I doubt we should commit it.
> Too many buildfarm cycles will be expended for too little result.

Current extent of our archive recovery restartpoint test coverage:

$ grep -c 'restartpoint starting' $(grep -rl 'restored log file' **/log) | grep -v :0
src/bin/pg_combinebackup/tmp_check/log/002_compare_backups_pitr1.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby2.log:1
src/test/recovery/tmp_check/log/002_archiving_standby.log:1
src/test/recovery/tmp_check/log/020_archive_status_standby.log:1
src/test/recovery/tmp_check/log/035_standby_logical_decoding_standby.log:2

Since the 2025-02 releases made non-toy-size archive recoveries fail
easily, that's not enough. If the proposed 3-second test is the wrong
thing, what instead?
On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> that's not enough. If the proposed 3-second test is the wrong thing, what
> instead?

I don't have a good idea about that in ~16, TBH, but I am sure to not
be a fan of the low reproducibility rate of this test as proposed.

It's not perfect, but as the design to fix the original race condition
has been introduced in v15, why not begin with a test in 17~ using
some injection points? This should be good enough while having a good
reproduction rate, as the order of the actions in the restartpoints
would be controlled.
--
Michael
On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:
> On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> > Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> > that's not enough. If the proposed 3-second test is the wrong thing, what
> > instead?
>
> I don't have a good idea about that in ~16, TBH, but I am sure to not
> be a fan of the low reproducibility rate of this test as proposed.
> It's not perfect, but as the design to fix the original race condition
> has been introduced in v15, why not begin with a test in 17~ using
> some injection points?

Two reasons:

a) The fix ended calls to the whole range of relevant code. Hence, the
injection point placement that would have been relevant before the fix
isn't reached. In other words, there's no right place for the injection
point. (The place for the injection point would be in durable_rename(),
in the checkpointer. After the fix, the checkpointer just doesn't call
durable_rename().)

b) Stochastic tests catch defects beyond the specific one the test
author targeted. An injection point test is less likely to do that.
(That said, with reason (a), there's no known injection point test
design to compete with the stochastic design.)
On Sat, Apr 05, 2025 at 07:14:27PM +0900, Michael Paquier wrote:
> On Wed, Apr 02, 2025 at 05:29:00PM -0700, Noah Misch wrote:
>> Your back-patches are correct. Thanks.
>
> Thanks for double-checking. I'll move on with what I have after a
> second look as it's been a few weeks since I've looked at all these
> conflicts. I am also planning to add a few notes in the commits to
> mention this thread.

And applied this series on REL_14_STABLE and REL_13_STABLE, editing all
the commits to mention the thread they were previously linked to.
--
Michael
On Sat, Apr 05, 2025 at 07:09:58PM -0700, Noah Misch wrote:
> On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:
> > On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> > > Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> > > that's not enough. If the proposed 3-second test is the wrong thing, what
> > > instead?
> >
> > I don't have a good idea about that in ~16, TBH, but I am sure to not
> > be a fan of the low reproducibility rate of this test as proposed.
> > It's not perfect, but as the design to fix the original race condition
> > has been introduced in v15, why not begin with a test in 17~ using
> > some injection points?
>
> Two reasons:
>
> a) The fix ended calls to the whole range of relevant code. Hence, the
> injection point placement that would have been relevant before the fix
> isn't reached. In other words, there's no right place for the injection
> point. (The place for the injection point would be in durable_rename(), in
> the checkpointer. After the fix, the checkpointer just doesn't call
> durable_rename().)
>
> b) Stochastic tests catch defects beyond the specific one the test author
> targeted. An injection point test is less likely to do that. (That said,
> with reason (a), there's no known injection point test design to compete
> with the stochastic design.)

Tom and Michael, do you still object to the test addition, or not? If
there are no new or renewed objections by 2025-04-20, I'll proceed to
add the test.

As another data point, raising the runtime from 3s to 17s makes it
reproduce the problem 25% of the time. You can imagine a plot with axes
of runtime and percent detection; one can pick any point on that plot's
curve. Given how little wall time it takes for the buildfarm and CI to
reach a few hundred runs, I like the trade-off of 3s runtime and 1%
detection. In particular, I like it better than 17s runtime for 25%
detection. How do you see it?
Noah Misch <noah@leadboat.com> writes:
> Tom and Michael, do you still object to the test addition, or not? If there
> are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

While I still don't love it, I don't have a better proposal.

			regards, tom lane
On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Tom and Michael, do you still object to the test addition, or not? If there
> > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> While I still don't love it, I don't have a better proposal.

No objections from here to using your proposal for the time being on
all the branches. I am planning to spend some cycles thinking about a
better alternative, but I doubt that this will be portable down to v13.
--
Michael
On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > Tom and Michael, do you still object to the test addition, or not? If there
> > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20%
done (based on file name 00000001000000000000006D) at the 360s
PGCTLTIMEOUT. I can reproduce this if I test with valgrind
--trace-children=yes. With my normal valgrind settings, the whole test
file takes only 18s. I recommend one of these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't
  instrument "sh" and "cp" commands.
- Remove --trace-children=yes

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving
catches up. However, that would take skink about 30min, and I expect
little value from 30min of valgrind instrumenting the "cp" command.
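As an illustration of the first option, an animal that launches the
server under valgrind through a wrapper script could add the skip
pattern there; the wrapper layout, paths, and the extra flags below are
assumptions rather than skink's actual configuration:

#!/bin/sh
# Wrapper installed in place of the postgres binary: run the real
# binary under valgrind, but skip instrumenting children exec'd from
# /bin or /usr/bin (the "sh" and "cp" processes run by archive_command).
exec valgrind --quiet --error-exitcode=55 \
    --suppressions=/path/to/postgresql/src/tools/valgrind.supp \
    --trace-children=yes \
    --trace-children-skip='/bin/*,/usr/bin/*' \
    /path/to/install/bin/postgres.orig "$@"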
On Sun, Apr 20, 2025 at 02:53:39PM -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not? If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> Pushed as commit 714bd9e. The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
>
> pg_ctl: server does not shut down
>
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
>
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT. I can
> reproduce this if I test with valgrind --trace-children=yes. With my normal
> valgrind settings, the whole test file takes only 18s. I recommend one of
> these changes to skink:
>
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

I gave that more thought. One can be more surgical than that, via
--trace-children-skip-by-arg='*cp "*' or similar. My previous message's
two options stop valgrind instrumentation at boundaries like pg_dumpall
calling system(pg_dump ...), since that execs /bin/sh to run pg_dump.
If we wanted to make it even more explicit and surgical, skink could
use --trace-children-skip-by-arg='*valgrind-ignore-child*' combined
with:

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1404 +1404 @@ sub enable_restoring
-	  : qq{cp "$path/%f" "%p"};
+	  : qq{cp "$path/%f" "%p" # valgrind-ignore-child};
@@ -1474 +1474 @@ sub enable_archiving
-	  : qq{cp "%p" "$path/%f"};
+	  : qq{cp "%p" "$path/%f" # valgrind-ignore-child};

What's your preference?

> Andres, what do you think about making one of those skink configuration
> changes? Alternatively, I could make the test poll until archiving catches
> up. However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.
Hi,

On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah@leadboat.com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not? If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
>
> Pushed as commit 714bd9e. The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
>
> pg_ctl: server does not shut down
>
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
>
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.

Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
example for why the single-threaded archiving approach sucks so badly :)

> I can reproduce this if I test with valgrind --trace-children=yes. With my
> normal valgrind settings, the whole test file takes only 18s. I recommend
> one of these changes to skink:
>
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes

Hm. I think I used --trace-children=yes because I was thinking it was
required to track forks. But a newer version of valgrind's man page has
an important clarification:

  --trace-children=<yes|no> [default: no]
    When enabled, Valgrind will trace into sub-processes initiated via
    the exec system call. This is necessary for multi-process programs.

    Note that Valgrind does trace into the child of a fork (it would be
    difficult not to, since fork makes an identical copy of a process),
    so this option is arguably badly named. However, most children of
    fork calls immediately call exec anyway.

So there doesn't seem to be much point in using --trace-children=yes.

> Andres, what do you think about making one of those skink configuration
> changes? Alternatively, I could make the test poll until archiving catches
> up. However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.

I just changed the config to --trace-children=no. There already is a
valgrind run in progress, so it won't be in effect for the next run.

Greetings,

Andres Freund
On Fri, Apr 25, 2025 at 03:35:06PM -0400, Andres Freund wrote:
> On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> > The checkpoints and WAL creation took 30s, but archiving was only 20% done
> > (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.
>
> Huh. That seems surprisingly slow, even for valgrind. I guess it's one more
> example for why the single-threaded archiving approach sucks so badly :)

Yes! I also didn't expect that v14/v13 would run much faster.

> I just changed the config to --trace-children=no. There already is a valgrind
> run in progress, so it won't be in effect for the next run.

Works for me. I see that resolved things.