Thread: avoid multiple hard links to same WAL file after a crash
Hi hackers, I am splitting this off of a previous thread aimed at reducing archiving overhead [0], as I believe this fix might deserve back-patching. Presently, WAL recycling uses durable_rename_excl(), which notes that a crash at an unfortunate moment can result in two links to the same file. My testing [1] demonstrated that it was possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting. This seems likely to lead to WAL corruption. The attached patch prevents this problem by using durable_rename() instead of durable_rename_excl() for WAL recycling. This removes the protection against accidentally overwriting an existing WAL file, but there shouldn't be one. This patch also sets the stage for reducing archiving overhead (as discussed in the other thread [0]). The proposed change to reduce archiving overhead will make it more likely that the server will attempt to re-archive segments after a crash. This might lead to archive corruption if the server concurrently writes to the same file via the aforementioned bug. [0] https://www.postgresql.org/message-id/20220222011948.GA3850532%40nathanxps13 [1] https://www.postgresql.org/message-id/20220222173711.GA3852671%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > Presently, WAL recycling uses durable_rename_excl(), which notes that a > crash at an unfortunate moment can result in two links to the same file. > My testing [1] demonstrated that it was possible to end up with two links > to the same file in pg_wal after a crash just before unlink() during WAL > recycling. Specifically, the test produced links to the same file for the > current WAL file and the next one because the half-recycled WAL file was > re-recycled upon restarting. This seems likely to lead to WAL corruption. Wow, that's bad. > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. I see that durable_rename_excl() has the following comment: "Similar to durable_rename(), except that this routine tries (but does not guarantee) not to overwrite the target file." If those are the desired semantics, we could achieve them more simply and more safely by just trying to stat() the target file and then, if it's not found, call durable_rename(). I think that would be a heck of a lot safer than what this function is doing right now. I'd actually be in favor of nuking durable_rename_excl() from orbit and putting the file-exists tests in the callers. Otherwise, someone might assume that it actually has the semantics that its name suggests, which could be pretty disastrous. If we don't want to do that, then I'd changing to do the stat-then-durable-rename thing internally, so we don't leave hard links lying around in *any* code path. Perhaps that's the right answer for the back-branches in any case, since there could be third-party code calling this function. Your proposed fix is OK if we don't want to do any of that stuff, but personally I'm much more inclined to blame durable_rename_excl() for being horrible than I am to blame the calling code for using it improvidently. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I see that durable_rename_excl() has the following comment: "Similar > to durable_rename(), except that this routine tries (but does not > guarantee) not to overwrite the target file." If those are the desired > semantics, we could achieve them more simply and more safely by just > trying to stat() the target file and then, if it's not found, call > durable_rename(). I think that would be a heck of a lot safer than > what this function is doing right now. IIUC it actually does guarantee that you won't overwrite the target file when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at all. Using stat() before rename() would therefore weaken this check for systems with working link(), but it'd probably strengthen it for systems without a working link(). > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I think there might be another problem. The man page for rename() seems to indicate that overwriting an existing file also introduces a window where the old and new path are hard links to the same file. This isn't a problem for the WAL files because we should never be overwriting an existing one, but I wonder if it's a problem for other code paths. My guess is that many code paths that overwrite an existing file are first writing changes to a temporary file before atomically replacing the original. Those paths are likely okay, too, as you can usually just discard any existing temporary files. > Your proposed fix is OK if we don't want to do any of that stuff, but > personally I'm much more inclined to blame durable_rename_excl() for > being horrible than I am to blame the calling code for using it > improvidently. I do agree that it's worth examining this stuff a bit closer. I've frequently found myself trying to reason about all the different states that callers of these functions can produce, so any changes that help simplify matters are a win in my book. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: >> I'd actually be in favor of nuking durable_rename_excl() from orbit >> and putting the file-exists tests in the callers. Otherwise, someone >> might assume that it actually has the semantics that its name >> suggests, which could be pretty disastrous. If we don't want to do >> that, then I'd changing to do the stat-then-durable-rename thing >> internally, so we don't leave hard links lying around in *any* code >> path. Perhaps that's the right answer for the back-branches in any >> case, since there could be third-party code calling this function. > > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. Ha, so there are only a few callers of durable_rename_excl() in the PostgreSQL tree. One is basic_archive.c, which is already doing a stat() check. IIRC I only used durable_rename_excl() here to handle the case where multiple servers are writing archives to the same location. If that happened, the archiver process would begin failing. If a crash left two hard links to the same file around, we will silently succeed the next time around thanks to the compare_files() check. Besides the WAL installation code, the only other callers are in timeline.c, and both note that the use of durable_rename_excl() is for "paranoidly trying to avoid overwriting an existing file (there shouldn't be one)." So AFAICT basic_archive.c is the only caller with a strong reason for using durable_rename_excl(), and even that might not be worth keeping it around. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > I'd actually be in favor of nuking durable_rename_excl() from orbit > and putting the file-exists tests in the callers. Otherwise, someone > might assume that it actually has the semantics that its name > suggests, which could be pretty disastrous. If we don't want to do > that, then I'd changing to do the stat-then-durable-rename thing > internally, so we don't leave hard links lying around in *any* code > path. Perhaps that's the right answer for the back-branches in any > case, since there could be third-party code calling this function. I've attached a patch that simply removes durable_rename_excl() and replaces existing calls with durable_rename(). I noticed that Andres expressed similar misgivings about durable_rename_excl() last year [0] [1]. I can create a stat-then-durable-rename version of this for back-patching if that is still the route we want to go. [0] https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de [1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote: > > I see that durable_rename_excl() has the following comment: "Similar > > to durable_rename(), except that this routine tries (but does not > > guarantee) not to overwrite the target file." If those are the desired > > semantics, we could achieve them more simply and more safely by just > > trying to stat() the target file and then, if it's not found, call > > durable_rename(). I think that would be a heck of a lot safer than > > what this function is doing right now. > > IIUC it actually does guarantee that you won't overwrite the target file > when HAVE_WORKING_LINK is defined. If not, it provides no guarantees at > all. Using stat() before rename() would therefore weaken this check for > systems with working link(), but it'd probably strengthen it for systems > without a working link(). Sure, but a guarantee that happens on only some systems isn't worth much. And, if it comes at the price of potentially having multiple hard links to the same file in obscure situations, that seems like it could easily cause more problems than this whole scheme can ever hope to solve. > I think there might be another problem. The man page for rename() seems to > indicate that overwriting an existing file also introduces a window where > the old and new path are hard links to the same file. This isn't a problem > for the WAL files because we should never be overwriting an existing one, > but I wonder if it's a problem for other code paths. My guess is that many > code paths that overwrite an existing file are first writing changes to a > temporary file before atomically replacing the original. Those paths are > likely okay, too, as you can usually just discard any existing temporary > files. I wonder if this is really true. I thought rename() was supposed to be atomic. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > > I think there might be another problem. The man page for rename() seems to > > indicate that overwriting an existing file also introduces a window where > > the old and new path are hard links to the same file. This isn't a problem > > for the WAL files because we should never be overwriting an existing one, > > but I wonder if it's a problem for other code paths. My guess is that many > > code paths that overwrite an existing file are first writing changes to a > > temporary file before atomically replacing the original. Those paths are > > likely okay, too, as you can usually just discard any existing temporary > > files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Looks like it's atomic in that it's not like cp + rm, but not atomic the other way you want. | If newpath already exists, it will be atomically replaced, so that there is no point at which another processattempting to access newpath will find it missing. However, there will probably be a window in which | both oldpath and newpath refer to the file being renamed.
At Thu, 7 Apr 2022 11:29:54 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > The attached patch prevents this problem by using durable_rename() instead > of durable_rename_excl() for WAL recycling. This removes the protection > against accidentally overwriting an existing WAL file, but there shouldn't > be one. From another direction, if the new segment was the currently active one, we just mustn't install it. Otherwise we don't care. So, the only thing we need to care is segment switch. Without it, the segment that InstallXLogFileSegment found by the stat loop is known to be safe to overwrite even if exists. When segment switch finds an existing file, it's no problem since the segment switch doesn't create a new segment. Otherwise segment switch always calls InstallXLogFileSegment. The section from searching for an empty segmetn slot until calling durable_rename_excl() is protected by ControlFileLock. Thus if a process is in the section, no other process can switch to a newly-created segment. If this diagnosis is correct, the comment is proved to be paranoid. > * Perform the rename using link if available, paranoidly trying to avoid > * overwriting an existing file (there shouldn't be one). As the result, I think Nathan's fix is correct that we can safely use durable_rename() instead. And I propose to use renameat2 on Linux so that we can detect the contradicting case by the regression tests even though only on Linux. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > So, the only thing we need to care is segment switch. Without it, the > segment that InstallXLogFileSegment found by the stat loop is known to > be safe to overwrite even if exists. > > When segment switch finds an existing file, it's no problem since the > segment switch doesn't create a new segment. Otherwise segment switch > always calls InstallXLogFileSegment. The section from searching for > an empty segmetn slot until calling durable_rename_excl() is protected > by ControlFileLock. Thus if a process is in the section, no other > process can switch to a newly-created segment. > > If this diagnosis is correct, the comment is proved to be paranoid. It's sometimes difficult to understand what problems really old code comments are worrying about. For example, could they have been worrying about bugs in the code? Could they have been worrying about manual interference with the pg_wal directory? It's hard to know. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: >> If this diagnosis is correct, the comment is proved to be paranoid. > It's sometimes difficult to understand what problems really old code > comments are worrying about. For example, could they have been > worrying about bugs in the code? Could they have been worrying about > manual interference with the pg_wal directory? It's hard to know. "git blame" can be helpful here, if you trace back to when the comment was written and then try to find the associated mailing-list discussion. (That leap can be difficult for commits pre-dating our current convention of including links in the commit message, but it's usually not *that* hard to locate contemporaneous discussion.) regards, tom lane
On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi >> <horikyota.ntt@gmail.com> wrote: >>> If this diagnosis is correct, the comment is proved to be paranoid. > >> It's sometimes difficult to understand what problems really old code >> comments are worrying about. For example, could they have been >> worrying about bugs in the code? Could they have been worrying about >> manual interference with the pg_wal directory? It's hard to know. > > "git blame" can be helpful here, if you trace back to when the comment > was written and then try to find the associated mailing-list discussion. > (That leap can be difficult for commits pre-dating our current > convention of including links in the commit message, but it's usually > not *that* hard to locate contemporaneous discussion.) I traced this back a while ago. I believe the link() was first added in November 2000 as part of f0e37a8. This even predates WAL recycling, which was added in July 2001 as part of 7d4d5c0. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > On Mon, Apr 11, 2022 at 12:28:47PM -0400, Tom Lane wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Mon, Apr 11, 2022 at 5:12 AM Kyotaro Horiguchi > >> <horikyota.ntt@gmail.com> wrote: > >>> If this diagnosis is correct, the comment is proved to be paranoid. > > > >> It's sometimes difficult to understand what problems really old code > >> comments are worrying about. For example, could they have been > >> worrying about bugs in the code? Could they have been worrying about > >> manual interference with the pg_wal directory? It's hard to know. > > > > "git blame" can be helpful here, if you trace back to when the comment > > was written and then try to find the associated mailing-list discussion. > > (That leap can be difficult for commits pre-dating our current > > convention of including links in the commit message, but it's usually > > not *that* hard to locate contemporaneous discussion.) > > I traced this back a while ago. I believe the link() was first added in > November 2000 as part of f0e37a8. This even predates WAL recycling, which > was added in July 2001 as part of 7d4d5c0. f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from somwhere out of the ML.. This patch changed XLogFileInit to supportusing existent files so that XLogWrite can use the new segment provided by checkpoint and still allow XLogWrite to create a new segment by itself. Just before the commit, calls to XLogFileInit were protected (or serialized) by logwr_lck. At the commit calls to the same function were still serialized by ControlFileLockId. I *guess* that Vadim faced/noticed a race condition when he added checkpoint. Thus introduced the link+remove protocol but finally it became useless by moving the call to XLogFileInit within ControlFileLockId section. But, of course, all of story is mere a guess. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: > At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >> I traced this back a while ago. I believe the link() was first added in >> November 2000 as part of f0e37a8. This even predates WAL recycling, which >> was added in July 2001 as part of 7d4d5c0. > > f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from > somwhere out of the ML.. This patch changed XLogFileInit to > supportusing existent files so that XLogWrite can use the new segment > provided by checkpoint and still allow XLogWrite to create a new > segment by itself. Yeah, I've been unable to find any discussion besides a brief reference to adding checkpointing [0]. [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: > On Fri, Apr 8, 2022 at 12:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I think there might be another problem. The man page for rename() seems to >> indicate that overwriting an existing file also introduces a window where >> the old and new path are hard links to the same file. This isn't a problem >> for the WAL files because we should never be overwriting an existing one, >> but I wonder if it's a problem for other code paths. My guess is that many >> code paths that overwrite an existing file are first writing changes to a >> temporary file before atomically replacing the original. Those paths are >> likely okay, too, as you can usually just discard any existing temporary >> files. > > I wonder if this is really true. I thought rename() was supposed to be atomic. Not always. For example, some old versions of MacOS have a non-atomic implementation of rename(), like prairiedog with 10.4. Even 10.5 does not handle atomicity as far as I call. In short, it looks like a bad idea to me to rely on this idea at all. Some FSes have their own way of handling things, as well, but I am not much into this world. Saying that, it would be nice to see durable_rename_excl() gone as it has created quite a bit of pain for us in the past years. -- Michael
Attachment
On Mon, Apr 18, 2022 at 04:48:35PM +0900, Michael Paquier wrote: > Saying that, it would be nice to see durable_rename_excl() gone as it > has created quite a bit of pain for us in the past years. Yeah, I think this is the right thing to do. Patch upthread [0]. For back-branches, I suspect we'll want to remove all uses of durable_rename_excl() but leave the function around for any extensions that are using it. Of course, we'd also need a big comment imploring folks not to add any more callers. Another option would be to change the behavior of durable_rename_excl() to something that we think is safer (e.g., stat then rename), but that might just introduce a different set of problems. [0] https://postgr.es/m/20220408194345.GA1541826%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Apr 08, 2022 at 09:00:36PM -0400, Robert Haas wrote: >> I wonder if this is really true. I thought rename() was supposed to be atomic. > Not always. For example, some old versions of MacOS have a non-atomic > implementation of rename(), like prairiedog with 10.4. Even 10.5 does > not handle atomicity as far as I call. I think that's not talking about the same thing. POSIX requires rename(2) to replace an existing target link atomically: If the link named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall remain visible to other threads throughout the renaming operation and refer either to the file referred to by new or old before the operation began. (It's that requirement that ancient macOS fails to meet.) However, I do not see any text that addresses the question of whether the old link disappears atomically with the appearance of the new link, and it seems like that'd be pretty impractical to ensure in cases like moving a link from one directory to another. (What would it even mean to say that, considering that a thread can't read the two directories at the same instant?) From a crash-safety standpoint, it'd surely be better to make the new link before removing the old, so I imagine that's what most file systems do. regards, tom lane
The readdir interface allows processes to be in the middle of reading a directory and unless a kernel was happy to either materialize the entire directory list when the readdir starts, or lock the entire directory against modification for the entire time the a process has a readdir fd open it's always going to be possible for the a process to have previously read the old directory entry and later see the new directory entry. Kernels don't do any MVCC or cmin type of games so they're not going to be able to prevent it. What's worse of course is that it may only happen in very large directories. Most directories fit on a single block and readdir may buffer up all the entries a block at a time for efficiency. So it may only be visible on very large directories that span multiple blocks.
Here is an attempt at creating something that can be back-patched. 0001 simply replaces calls to durable_rename_excl() with durable_rename() and is intended to be back-patched. 0002 removes the definition of durable_rename_excl() and is _not_ intended for back-patching. I imagine 0002 will need to be held back for v16devel. I think back-patching 0001 will encounter a couple of small obstacles. For example, the call in basic_archive won't exist on most of the back-branches, and durable_rename_excl() was named durable_link_or_rename() before v13. I don't mind producing a patch for each back-branch if needed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote: > Here is an attempt at creating something that can be back-patched. 0001 > simply replaces calls to durable_rename_excl() with durable_rename() and is > intended to be back-patched. 0002 removes the definition of > durable_rename_excl() and is _not_ intended for back-patching. I imagine > 0002 will need to be held back for v16devel. I would not mind applying 0002 on HEAD now to avoid more uses of this API, and I can get behind 0001 after thinking more about it. > I think back-patching 0001 will encounter a couple of small obstacles. For > example, the call in basic_archive won't exist on most of the > back-branches, and durable_rename_excl() was named durable_link_or_rename() > before v13. I don't mind producing a patch for each back-branch if needed. I am not sure that have any need to backpatch this change based on the unlikeliness of the problem, TBH. One thing that is itching me a bit, like Robert upthread, is that we don't check anymore that the newfile does not exist in the code paths because we never expect one. It is possible to use stat() for that. But access() within a simple assertion would be simpler? Say something like: Assert(access(path, F_OK) != 0 && errno == ENOENT); The case for basic_archive is limited as the comment of the patch states, but that would be helpful for the two calls in timeline.c and the one in xlog.c in the long-term. And this has no need to be part of fd.c, this can be added before the durable_rename() calls. What do you think? -- Michael
Attachment
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote: > I am not sure that have any need to backpatch this change based on the > unlikeliness of the problem, TBH. One thing that is itching me a bit, > like Robert upthread, is that we don't check anymore that the newfile > does not exist in the code paths because we never expect one. It is > possible to use stat() for that. But access() within a simple > assertion would be simpler? Say something like: > Assert(access(path, F_OK) != 0 && errno == ENOENT); > > The case for basic_archive is limited as the comment of the patch > states, but that would be helpful for the two calls in timeline.c and > the one in xlog.c in the long-term. And this has no need to be part > of fd.c, this can be added before the durable_rename() calls. What do > you think? Here is a new patch set with these assertions added. I think at least the xlog.c change ought to be back-patched. The problem may be unlikely, but AFAICT the possible consequences include WAL corruption. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote: > Here is a new patch set with these assertions added. I think at least the > xlog.c change ought to be back-patched. The problem may be unlikely, but > AFAICT the possible consequences include WAL corruption. Okay, so I have applied this stuff this morning to see what the buildfarm had to say, and we have finished with a set of failures in various buildfarm members: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26 All of them did not like the part where we assume that a TLI history file written by a WAL receiver should not exist beforehand, but as 025_stuck_on_old_timeline.pl is showing, a standby may attempt to retrieve a TLI history file after getting it from the archives. I was analyzing the whole thing, and it looks like a race condition. Per the the buildfarm logs, we have less than 5ms between the moment the startup process retrieves the history file of TLI 2 from the archives and the moment the WAL receiver decides to check if this TLI file exists. If it does not exist, it would then retrieve it from the primary via streaming. So I guess that the sequence of events is that: - In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the existence of the history file for TLI 2, does not find it. - The startup process retrieves the file from the archives. - The WAL receiver goes through the internal loop of WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the primary's stream. Switching from durable_rename_excl() to durable_rename() would mean that we'd overwrite the TLI file received from the primary stream over what's been retrieved from the archives. That does not strike me as an issue in itself and that should be safe, so the comment is misleading, and we can live without the assertion in writeTimeLineHistoryFile() called by the WAL receiver. Now, I think that we'd better keep some belts in writeTimeLineHistory() called by the startup process at the end-of-recovery as I should never ever have a TLI file generated when selecting a new timeline. Perhaps this should be a elog(ERROR) at least, with a check on the file existence before calling durable_rename()? Anyway, my time is constrained next week due to the upcoming Japanese Golden Week and the buildfarm has to be stable, so I have reverted the change for now. -- Michael
Attachment
On Tue, Apr 12, 2022 at 09:27:42AM -0700, Nathan Bossart wrote: > On Tue, Apr 12, 2022 at 03:46:31PM +0900, Kyotaro Horiguchi wrote: >> At Mon, 11 Apr 2022 09:52:57 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >>> I traced this back a while ago. I believe the link() was first added in >>> November 2000 as part of f0e37a8. This even predates WAL recycling, which >>> was added in July 2001 as part of 7d4d5c0. >> >> f0e37a8 lacks discussion.. It introduced the CHECKPOINT command from >> somwhere out of the ML.. This patch changed XLogFileInit to >> supportusing existent files so that XLogWrite can use the new segment >> provided by checkpoint and still allow XLogWrite to create a new >> segment by itself. Yes, I think that you are right here. I also suspect that the checkpoint command was facing a concurrency issue while working on the feature and that Vadim saw that this part of the implementation would be safer in the long run if we use link() followed by unlink(). > Yeah, I've been unable to find any discussion besides a brief reference to > adding checkpointing [0]. > > [0] https://postgr.es/m/8F4C99C66D04D4118F580090272A7A23018D85%40sectorbase1.sectorbase.com While looking at the history of this area, I have also noticed this argument, telling also that this is a safety measure if this code were to run in parallel, but that's without counting on the control file lock hold while doing this operation anyway: https://www.postgresql.org/message-id/24974.982597735@sss.pgh.pa.us As mentioned already upthread, f0e37a8 is the origin of the link()/unlink() business in the WAL segment initialization logic, and also note 1f159e5 that has added a rename() as extra code path for systems where link() was not working. At the end, switching directly from durable_rename_excl() to durable_rename() should be fine for the WAL segment initialization, but we could do things a bit more carefully by adding a check on the file existence before calling durable_rename() and issue a elog(LOG) if a file is found, giving a mean for the WAL recycling to give up peacefully as it does now. Per my analysis, the TLI history file created at the end of recovery ought to issue an elog(ERROR). Now, I am surprised by the third code path of durable_rename_excl(), as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause any issues, as link() should exit with EEXIST when the startup process grabs the same history file concurrently. It seems to me that in this last case using durable_rename() could be an improvement and prevent extra WAL receiver restarts as a TLI history fetched from the primary via streaming or from some archives should be the same, but we could be more careful, like the WAL init logic, by skipping the durable_rename() and issuing an elog(LOG). That would not be perfect, still a bit better than the current state of HEAD. As we are getting closer to the beta release, it looks safer to let this change aside a bit longer and wait for v16 to be opened for business on HEAD. -- Michael
Attachment
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > Now, I am surprised by the third code path of durable_rename_excl(), > as of the WAL receiver doing writeTimeLineHistoryFile(), to not cause > any issues, as link() should exit with EEXIST when the startup process > grabs the same history file concurrently. It seems to me that in this > last case using durable_rename() could be an improvement and prevent > extra WAL receiver restarts as a TLI history fetched from the primary > via streaming or from some archives should be the same, but we could > be more careful, like the WAL init logic, by skipping the > durable_rename() and issuing an elog(LOG). That would not be perfect, > still a bit better than the current state of HEAD. Skimming through at the buildfarm logs, it happens that the tests are able to see this race from time to time. Here is one such example on rorqual: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check And here are the relevant logs: 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log file "00000002.history" from archive 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching timeline history file for timeline 2 from primary server 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": File exists [...] 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started streaming WAL from primary at 0/3000000 on timeline 2 The WAL receiver upgrades the ERROR to a FATAL, and restarts streaming shortly after. Using durable_rename() would not be an issue here. -- Michael
Attachment
On Sun, May 01, 2022 at 10:08:53PM +0900, Michael Paquier wrote: > At the end, switching directly from durable_rename_excl() to > durable_rename() should be fine for the WAL segment initialization, > but we could do things a bit more carefully by adding a check on the > file existence before calling durable_rename() and issue a elog(LOG) > if a file is found, giving a mean for the WAL recycling to give up > peacefully as it does now. Per my analysis, the TLI history file > created at the end of recovery ought to issue an elog(ERROR). My only concern with this approach is that it inevitably introduces a race condition. In most cases, the file existence check will prevent overwrites, but it might not always. Furthermore, we believe that such overwrites either 1) should not happen (e.g., WAL recycling) or 2) won't cause problems if they happen (e.g., when the WAL receiver writes the TLI history file). Also, these races will be difficult to test, so we won't know what breaks when they occur. My instinct is to just let the overwrites happen. That way, we are more likely to catch breakage in tests, and we'll have one less race condition to worry about. I don't mind asserting that the file doesn't exist when we don't expect it to, as that might help catch potential problems in development without affecting behavior in production. If we do want to add file existence checks, I think we'd better add a comment about the potential for race conditions. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: > Skimming through at the buildfarm logs, it happens that the tests are > able to see this race from time to time. Here is one such example on > rorqual: > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=rorqual&dt=2022-04-20%2004%3A47%3A58&stg=recovery-check > > And here are the relevant logs: > 2022-04-20 05:04:19.028 UTC [3109048][startup][:0] LOG: restored log > file "00000002.history" from archive > 2022-04-20 05:04:19.029 UTC [3109111][walreceiver][:0] LOG: fetching > timeline history file for timeline 2 from primary server > 2022-04-20 05:04:19.048 UTC [3109111][walreceiver][:0] FATAL: could > not link file "pg_wal/xlogtemp.3109111" to "pg_wal/00000002.history": > File exists > [...] > 2022-04-20 05:04:19.234 UTC [3109250][walreceiver][:0] LOG: started > streaming WAL from primary at 0/3000000 on timeline 2 > > The WAL receiver upgrades the ERROR to a FATAL, and restarts > streaming shortly after. Using durable_rename() would not be an issue > here. Thanks for investigating this one. I think I agree that we should simply switch to durable_rename() (without a file existence check beforehand). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, May 02, 2022 at 10:39:07AM -0700, Nathan Bossart wrote: > On Mon, May 02, 2022 at 07:48:18PM +0900, Michael Paquier wrote: >> The WAL receiver upgrades the ERROR to a FATAL, and restarts >> streaming shortly after. Using durable_rename() would not be an issue >> here. > > Thanks for investigating this one. I think I agree that we should simply > switch to durable_rename() (without a file existence check beforehand). Here is a new patch set. For now, I've only removed the file existence check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced that there isn't a problem here (e.g., due to concurrent .ready file creation), but since some platforms have been using rename() for some time, I don't know how worried we should be. I thought about adding some kind of locking between the WAL receiver and startup processes, but that seems excessive. Alternatively, we could just fix xlog.c as proposed earlier [0]. AFAICT that is the only caller that can experience problems due to the multiple-hard-link issue. All other callers are simply renaming a temporary file into place, and the temporary file can be discarded if left behind after a crash. [0] https://postgr.es/m/20220407182954.GA1231544%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, May 02, 2022 at 04:06:13PM -0700, Nathan Bossart wrote: > Here is a new patch set. For now, I've only removed the file existence > check in writeTimeLineHistoryFile(). I don't know if I'm totally convinced > that there isn't a problem here (e.g., due to concurrent .ready file > creation), but since some platforms have been using rename() for some time, > I don't know how worried we should be. That's only about Windows these days, meaning that there is much less coverage in this code path. > I thought about adding some kind of > locking between the WAL receiver and startup processes, but that seems > excessive. Agreed. > Alternatively, we could just fix xlog.c as proposed earlier > [0]. AFAICT that is the only caller that can experience problems due to > the multiple-hard-link issue. All other callers are simply renaming a > temporary file into place, and the temporary file can be discarded if left > behind after a crash. I'd agree with removing all the callers at the end. pgrename() is quite robust on Windows, but I'd keep the two checks in writeTimeLineHistory(), as the logic around findNewestTimeLine() would consider a past TLI history file as in-use even if we have a crash just after the file got created in the same path by the same standby, and the WAL segment init part. Your patch does that. -- Michael
Attachment
On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > I'd agree with removing all the callers at the end. pgrename() is > quite robust on Windows, but I'd keep the two checks in > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > consider a past TLI history file as in-use even if we have a crash > just after the file got created in the same path by the same standby, > and the WAL segment init part. Your patch does that. As v16 is now open for business, I have revisited this change and applied 0001 to change all the callers (aka removal of the assertion for the WAL receiver when it overwrites a TLI history file). The commit log includes details about the reasoning of all the areas changed, for clarity, as of the WAL recycling part, the TLI history file part and basic_archive. -- Michael
Attachment
On Tue, Jul 05, 2022 at 10:19:49AM +0900, Michael Paquier wrote: > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: >> I'd agree with removing all the callers at the end. pgrename() is >> quite robust on Windows, but I'd keep the two checks in >> writeTimeLineHistory(), as the logic around findNewestTimeLine() would >> consider a past TLI history file as in-use even if we have a crash >> just after the file got created in the same path by the same standby, >> and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() about possible concurrent use by a WAL receiver and the startup process and why that is okay. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jul 05, 2022 at 09:58:38AM -0700, Nathan Bossart wrote: > Thanks! I wonder if we should add a comment in writeTimeLineHistoryFile() > about possible concurrent use by a WAL receiver and the startup process and > why that is okay. Agreed. Adding an extra note at the top of the routine would help in the future. -- Michael
Attachment
Dear team, We recently observed a few cases where Postgres running on Linux encountered an issue with WAL segment files. Specifically, two WAL segments were linked to the same physical file after Postgres ran out of memory and the OOM killer terminated one of its processes. This resulted in the WAL segments overwriting each other and Postgres failing a later recovery. We found this fix [1] that has been applied to Postgres 16, but the cases we observed were running Postgres 15. Given that older major versions will be supported for a good number of years, and the potential for irrecoverability exists (even if rare), we would like to discuss the possibility of back-patching this fix. Are there any technical reasons not to back-patch this fix to older major versions? Thank you for your consideration. Sincerely, Robert Pang [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3 On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote: > > I'd agree with removing all the callers at the end. pgrename() is > > quite robust on Windows, but I'd keep the two checks in > > writeTimeLineHistory(), as the logic around findNewestTimeLine() would > > consider a past TLI history file as in-use even if we have a crash > > just after the file got created in the same path by the same standby, > > and the WAL segment init part. Your patch does that. > > As v16 is now open for business, I have revisited this change and > applied 0001 to change all the callers (aka removal of the assertion > for the WAL receiver when it overwrites a TLI history file). The > commit log includes details about the reasoning of all the areas > changed, for clarity, as of the WAL recycling part, the TLI history > file part and basic_archive. > -- > Michael
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From
Nathan Bossart
Date:
On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > We recently observed a few cases where Postgres running on Linux > encountered an issue with WAL segment files. Specifically, two WAL > segments were linked to the same physical file after Postgres ran out > of memory and the OOM killer terminated one of its processes. This > resulted in the WAL segments overwriting each other and Postgres > failing a later recovery. Yikes! > We found this fix [1] that has been applied to Postgres 16, but the > cases we observed were running Postgres 15. Given that older major > versions will be supported for a good number of years, and the > potential for irrecoverability exists (even if rare), we would like to > discuss the possibility of back-patching this fix. IMHO this is a good time to reevaluate. It looks like we originally didn't back-patch out of an abundance of caution, but now that this one has had time to bake, I think it's worth seriously considering, especially now that we have a report from the field. -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From
Andres Freund
Date:
Hi, On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote: > On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote: > > We recently observed a few cases where Postgres running on Linux > > encountered an issue with WAL segment files. Specifically, two WAL > > segments were linked to the same physical file after Postgres ran out > > of memory and the OOM killer terminated one of its processes. This > > resulted in the WAL segments overwriting each other and Postgres > > failing a later recovery. > > Yikes! Indeed. As chance would have it, I was asked for input on a corrupted server *today*. Eventually we found that recovery stopped early, after encountering a segment with a *newer* pageaddr than we expected. Which made me think of this issue, and indeed, the file recovery stopped at had two links. Before that the server had been crashing on a regular basis for unrelated reasons, which presumably increased the chances sufficiently to eventually hit this problem. It's a normal thing to discover the end of the WAL by finding a segment that has an older pageaddr than its name suggests. But in this case we saw a newer page address. I wonder if we should treat that differently... > > We found this fix [1] that has been applied to Postgres 16, but the > > cases we observed were running Postgres 15. Given that older major > > versions will be supported for a good number of years, and the > > potential for irrecoverability exists (even if rare), we would like to > > discuss the possibility of back-patching this fix. > > IMHO this is a good time to reevaluate. It looks like we originally didn't > back-patch out of an abundance of caution, but now that this one has had > time to bake, I think it's worth seriously considering, especially now that > we have a report from the field. Strongly agreed. I don't think the issue is actually quite as unlikely to be hit as reasoned in the commit message. The crash has indeed to happen between the link() and unlink() - but at the end of a checkpoint we do that operations hundreds of times in a row on a busy server. And that's just after potentially doing lots of write IO during a checkpoint, filling up drive write caches / eating up IOPS/bandwidth disk quots. Greetings, Andres Freund
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From
Nathan Bossart
Date:
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > I've been double-checking the code to refresh myself with the problem, > and I don't see a reason to not apply something like the attached set > down to v13 for all these remaining branches (minus an edit of the > commit message). LGTM -- nathan
Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
From
Andres Freund
Date:
On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote: > On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote: > > I've been double-checking the code to refresh myself with the problem, > > and I don't see a reason to not apply something like the attached set > > down to v13 for all these remaining branches (minus an edit of the > > commit message). > > LGTM Dito.
Dear Michael,
Thank you for applying this back-patch. I also appreciate everyone's input on this issue.
Sincerely,
Robert Pang
On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
>> LGTM
>
> Dito.
Thanks for double-checking. Done this one.
--
Michael