Thread: remove more archiving overhead
Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving improvements
(e.g., beb4e9b), the amount of archiving overhead has been reduced. I spent some
time measuring the remaining overhead, and the following function stood out:

	/*
	 * pgarch_archiveDone
	 *
	 * Emit notification that an xlog file has been successfully archived.
	 * We do this by renaming the status file from NNN.ready to NNN.done.
	 * Eventually, a checkpoint process will notice this and delete both the
	 * NNN.done file and the xlog file itself.
	 */
	static void
	pgarch_archiveDone(char *xlog)
	{
		char		rlogready[MAXPGPATH];
		char		rlogdone[MAXPGPATH];

		StatusFilePath(rlogready, xlog, ".ready");
		StatusFilePath(rlogdone, xlog, ".done");

		(void) durable_rename(rlogready, rlogdone, WARNING);
	}

Specifically, the call to durable_rename(), which calls fsync() a few times,
can cause a decent delay. On my machine, archiving ~12k WAL files using a
bare-bones archive module that did nothing but return true took over a minute.
When I changed this to rename() (i.e., reverting the change to pgarch.c in
1d4a0ab), the same test took a second or less.

I also spent some time investigating whether durably renaming the archive
status files was even necessary. In theory, it shouldn't be. If a crash
happens before the rename is persisted, we might try to re-archive files, but
your archive_command/archive_library should handle that. If the file was
already recycled or removed, the archiver will skip it (thanks to 6d8727f).
But when digging further, I found that WAL file recycling uses
durable_rename_excl(), which has the following note:

	 * Note that a crash in an unfortunate moment can leave you with two links to
	 * the target file.

IIUC this means that in theory, a crash at an unfortunate moment could leave
you with a .ready file, the file to archive, and another link to that file
with a "future" WAL filename. If you re-archive the file after it has been
reused, you could end up with corrupt WAL archives. I think this might
already be possible today. Syncing the directory after every rename probably
reduces the likelihood quite substantially, but if RemoveOldXlogFiles()
quickly picks up the .done file and attempts recycling before durable_rename()
calls fsync() on the directory, presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already
exists. The basic_archive contrib module fails if the archive already exists
and has different contents, and it succeeds without re-archiving if it already
exists with identical contents. Since something along these lines is
recommended as a best practice, perhaps this is okay. But I think we might be
able to do better. If we ensure that the archive_status directory is sync'd
prior to recycling/removing a WAL file, I believe we are safe from the
aforementioned problem.

The drawback of this is that we are essentially offloading more work to the
checkpointer process. In addition to everything else it must do, it will also
need to fsync() the archive_status directory many times. I'm not sure how big
of a problem this is. It should be good to ensure that archiving keeps up, and
there are far more expensive tasks that the checkpointer must perform that
could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above. Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
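For readers who want to reproduce the benchmark, a minimal sketch of a
do-nothing archive module follows. It assumes the archive-module interface as
it stood during PostgreSQL 15 development (callback registration via
_PG_archive_module_init() declared in postmaster/pgarch.h); the header
location and signatures changed in later releases, so treat this as
illustrative rather than authoritative.

	/* no_op_archive.c: a bare-bones archive module used only for benchmarking.
	 * Sketch only; assumes the PostgreSQL 15-era archive module API. */
	#include "postgres.h"

	#include "fmgr.h"
	#include "postmaster/pgarch.h"

	PG_MODULE_MAGIC;

	void		_PG_archive_module_init(ArchiveModuleCallbacks *cb);

	/* Report that archiving is "configured" so the archiver proceeds. */
	static bool
	no_op_archive_configured(void)
	{
		return true;
	}

	/* Pretend every WAL file was archived successfully without doing any work. */
	static bool
	no_op_archive_file(const char *file, const char *path)
	{
		return true;
	}

	void
	_PG_archive_module_init(ArchiveModuleCallbacks *cb)
	{
		cb->check_configured_cb = no_op_archive_configured;
		cb->archive_file_cb = no_op_archive_file;
	}

With archive_library pointing at such a module, essentially all of the
measured time is status-file bookkeeping, which is what isolates the
durable_rename() cost.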
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary. In theory, it shouldn't be. If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that. If the file
> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f). But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
>
>  * Note that a crash in an unfortunate moment can leave you with two links to
>  * the target file.
>
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename. If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives. I think this
> might already be possible today. Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

In my testing, I found that when I killed the server just before unlink()
during WAL recycling, I ended up with links to the same file in pg_wal after
restarting. My latest test produced links to the same file for the current
WAL file and the next one. Maybe WAL recycling should use durable_rename()
instead of durable_rename_excl().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> In my testing, I found that when I killed the server just before unlink()
> during WAL recycling, I ended up with links to the same file in pg_wal after
> restarting. My latest test produced links to the same file for the current
> WAL file and the next one. Maybe WAL recycling should use durable_rename()
> instead of durable_rename_excl().

Here is an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
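To illustrate why durable_rename_excl() can leave two links behind, here is a
standalone POSIX sketch of the two approaches being compared. This is not the
PostgreSQL code itself; the real implementations in fd.c add error reporting
and also fsync the files involved.

	/* Sketch of the two rename strategies discussed above (POSIX calls only).
	 * The function and parameter names are hypothetical. */
	#include <stdio.h>
	#include <unistd.h>

	/* "Rename without replace": create a second link, then drop the old name.
	 * If the process or machine dies between link() and unlink(), both names
	 * remain and point at the same inode -- the "two links" case from the
	 * comment on durable_rename_excl(). */
	static int
	rename_excl_style(const char *oldname, const char *newname)
	{
		if (link(oldname, newname) < 0)
			return -1;			/* fails if newname already exists */
		(void) unlink(oldname);	/* crash window: two links exist here */
		return 0;
	}

	/* Plain rename(): replaces the name in one step, so after a crash you
	 * normally see either the old or the new name, not both (exact guarantees
	 * depend on the filesystem).  Durability of the change still requires an
	 * fsync() on the containing directory. */
	static int
	rename_style(const char *oldname, const char *newname, int dirfd)
	{
		if (rename(oldname, newname) < 0)
			return -1;
		return fsync(dirfd);	/* persist the directory entry change */
	}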
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary. In theory, it shouldn't be. If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that. If the file

I believe we're trying to archive a WAL segment once and only once, even
though in practice we can fail to guarantee that in certain cases.

https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).

The recommended behavior of archive_command above is to avoid this kind of
corruption. If we officially admit that a WAL segment can be archived twice,
we need to revise that part (and the following part) of the doc.

> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f). But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
>
>  * Note that a crash in an unfortunate moment can leave you with two links to
>  * the target file.
>
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename. If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives. I think this

IMHO the difference would be that the single system call (maybe) cannot be
interrupted by SIGKILL, so I'm not sure that is worth considering. The
SIGKILL window can be eliminated if we could use renameat2(RENAME_NOREPLACE),
but ultimately a power loss can still cause it.

> might already be possible today. Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

If we don't fsync at every rename, we don't even need RemoveOldXlogFiles()
to pick up the .done file very quickly. Isn't that a big difference?

> I believe the current recommendation is to fail if the archive file already
> exists. The basic_archive contrib module fails if the archive already
> exists and has different contents, and it succeeds without re-archiving if
> it already exists with identical contents. Since something along these
> lines is recommended as a best practice, perhaps this is okay. But I think
> we might be able to do better. If we ensure that the archive_status
> directory is sync'd prior to recycling/removing a WAL file, I believe we
> are safe from the aforementioned problem.
>
> The drawback of this is that we are essentially offloading more work to the
> checkpointer process. In addition to everything else it must do, it will
> also need to fsync() the archive_status directory many times. I'm not sure
> how big of a problem this is. It should be good to ensure that archiving
> keeps up, and there are far more expensive tasks that the checkpointer must
> perform that could make the added fsync() calls relatively insignificant.

I don't think we want to make the checkpointer slower in exchange for making
the archiver faster. Perhaps we need to work from condensed information
rather than repeatedly scanning over distributed information like the
archive_status directory?

At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> > In my testing, I found that when I killed the server just before unlink()
> > during WAL recycling, I ended up with links to the same file in pg_wal after
> > restarting. My latest test produced links to the same file for the current
> > WAL file and the next one. Maybe WAL recycling should use durable_rename()
> > instead of durable_rename_excl().
>
> Here is an updated patch.

Is it intentional that v2 loses the xlogarchive.c part?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Thanks for taking a look!

On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote:
> https://www.postgresql.org/docs/14/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory).
>
> The recommended behavior of archive_command above is to avoid this kind of
> corruption. If we officially admit that a WAL segment can be archived twice,
> we need to revise that part (and the following part) of the doc.

Yeah, we might want to recommend succeeding if the archive already exists
with identical contents (like basic_archive does). IMO this would best be
handled in a separate thread that is solely focused on documentation updates.
AFAICT this is an existing problem.

>> IIUC this means that in theory, a crash at an unfortunate moment could
>> leave you with a .ready file, the file to archive, and another link to that
>> file with a "future" WAL filename. If you re-archive the file after it has
>> been reused, you could end up with corrupt WAL archives. I think this
>
> IMHO the difference would be that the single system call (maybe) cannot be
> interrupted by SIGKILL, so I'm not sure that is worth considering. The
> SIGKILL window can be eliminated if we could use renameat2(RENAME_NOREPLACE),
> but ultimately a power loss can still cause it.

Perhaps durable_rename_excl() could use renameat2() for systems that support
it. However, the problem would still exist for systems that have link() but
not renameat2(). In this particular case, there really shouldn't be an
existing file in pg_wal.

>> might already be possible today. Syncing the directory after every rename
>> probably reduces the likelihood quite substantially, but if
>> RemoveOldXlogFiles() quickly picks up the .done file and attempts
>> recycling before durable_rename() calls fsync() on the directory,
>> presumably the same problem could occur.
>
> If we don't fsync at every rename, we don't even need RemoveOldXlogFiles()
> to pick up the .done file very quickly. Isn't that a big difference?

Yeah, it probably increases the chances quite a bit.

> I don't think we want to make the checkpointer slower in exchange for
> making the archiver faster. Perhaps we need to work from condensed
> information rather than repeatedly scanning over distributed information
> like the archive_status directory?

v2 avoids giving the checkpointer more work.

> At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
>> > In my testing, I found that when I killed the server just before unlink()
>> > during WAL recycling, I ended up with links to the same file in pg_wal after
>> > restarting. My latest test produced links to the same file for the current
>> > WAL file and the next one. Maybe WAL recycling should use durable_rename()
>> > instead of durable_rename_excl().
>>
>> Here is an updated patch.
>
> Is it intentional that v2 loses the xlogarchive.c part?

Yes. I found that a crash at an unfortunate moment can produce multiple links
to the same file in pg_wal, which seemed bad independent of archival. By
fixing that (i.e., switching from durable_rename_excl() to durable_rename()),
we not only avoid this problem, but we also avoid trying to archive a file the
server is concurrently writing. Then, after a crash, the WAL file to archive
should either not exist (which is handled by the archiver) or contain the same
contents as any preexisting archives.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
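For what it's worth, here is a minimal sketch of the renameat2() idea
mentioned above. RENAME_NOREPLACE makes the kernel refuse to clobber an
existing target within a single atomic call, so there is no window in which
two links exist. The call is Linux-specific, and the sketch assumes a glibc
new enough (2.28 or later) to declare it; the helper name is hypothetical.

	/* Sketch: rename without replacing an existing target, in one syscall.
	 * renameat2() with RENAME_NOREPLACE is Linux-only; a portable caller
	 * would still need a fallback such as link()/unlink(). */
	#define _GNU_SOURCE
	#include <errno.h>
	#include <fcntl.h>			/* AT_FDCWD */
	#include <stdio.h>			/* renameat2(), RENAME_NOREPLACE (glibc >= 2.28) */

	static int
	rename_noreplace(const char *oldpath, const char *newpath)
	{
		if (renameat2(AT_FDCWD, oldpath, AT_FDCWD, newpath, RENAME_NOREPLACE) == 0)
			return 0;
		if (errno == EEXIST)
			return 1;			/* target already existed; nothing was renamed */
		return -1;				/* EINVAL/ENOSYS: flag or syscall unsupported */
	}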
I tested this again with many more WAL files and a much larger machine
(r5d.24xlarge with data directory on an NVMe SSD instance store volume). As
before, I am using a bare-bones archive module that does nothing but return
true. Without the patch, archiving ~120k WAL files took about 109 seconds.
With the patch, it took around 62 seconds, which amounts to a ~43% reduction
in overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> Yes. I found that a crash at an unfortunate moment can produce multiple
> links to the same file in pg_wal, which seemed bad independent of archival.
> By fixing that (i.e., switching from durable_rename_excl() to
> durable_rename()), we not only avoid this problem, but we also avoid trying
> to archive a file the server is concurrently writing. Then, after a crash,
> the WAL file to archive should either not exist (which is handled by the
> archiver) or contain the same contents as any preexisting archives.

I moved the fix for this to a new thread [0] since I think it should be
back-patched. I've attached a new patch that only contains the part related
to reducing archiving overhead.

[0] https://www.postgresql.org/message-id/20220407182954.GA1231544%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> > Yes. I found that a crash at an unfortunate moment can produce multiple
> > links to the same file in pg_wal, which seemed bad independent of archival.
> > By fixing that (i.e., switching from durable_rename_excl() to
> > durable_rename()), we not only avoid this problem, but we also avoid trying
> > to archive a file the server is concurrently writing. Then, after a crash,
> > the WAL file to archive should either not exist (which is handled by the
> > archiver) or contain the same contents as any preexisting archives.
>
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched. I've attached a new patch that only contains the part
> related to reducing archiving overhead.

While we're now after the feature freeze and thus this will need to wait for
v16, it looks like a reasonable change to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Apr 08, 2022 at 10:20:27AM -0400, Robert Haas wrote:
> On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> > Yes. I found that a crash at an unfortunate moment can produce multiple
>> > links to the same file in pg_wal, which seemed bad independent of archival.
>> > By fixing that (i.e., switching from durable_rename_excl() to
>> > durable_rename()), we not only avoid this problem, but we also avoid trying
>> > to archive a file the server is concurrently writing. Then, after a crash,
>> > the WAL file to archive should either not exist (which is handled by the
>> > archiver) or contain the same contents as any preexisting archives.
>>
>> I moved the fix for this to a new thread [0] since I think it should be
>> back-patched. I've attached a new patch that only contains the part
>> related to reducing archiving overhead.
>
> While we're now after the feature freeze and thus this will need to
> wait for v16, it looks like a reasonable change to me.

Dang, just missed it. Thanks for taking a look.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 2022/04/08 7:23, Nathan Bossart wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> Yes. I found that a crash at an unfortunate moment can produce multiple
>> links to the same file in pg_wal, which seemed bad independent of archival.
>> By fixing that (i.e., switching from durable_rename_excl() to
>> durable_rename()), we not only avoid this problem, but we also avoid trying
>> to archive a file the server is concurrently writing. Then, after a crash,
>> the WAL file to archive should either not exist (which is handled by the
>> archiver) or contain the same contents as any preexisting archives.
>
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched. I've attached a new patch that only contains the part
> related to reducing archiving overhead.

Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> Thanks for updating the patch. It looks good to me.
> Barring any objection, I'm thinking to commit it.

I don't object, but I just started to wonder whether the need to handle
re-archiving of the same file cleanly is as well-documented as it ought to be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 7/7/22 10:37, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> Thanks for updating the patch. It looks good to me.
>> Barring any objection, I'm thinking to commit it.
>
> I don't object, but I just started to wonder whether the need to
> handle re-archiving of the same file cleanly is as well-documented as
> it ought to be.

+1, but I don't think that needs to stand in the way of this patch, which
looks sensible to me as-is. I think that's what you meant, but just wanted to
be sure.

There are plenty of ways that already-archived WAL might get archived again
and this is just one of them.

Thoughts, Nathan?

Regards,
-David
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
> On 7/7/22 10:37, Robert Haas wrote:
>> On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> > Thanks for updating the patch. It looks good to me.
>> > Barring any objection, I'm thinking to commit it.
>>
>> I don't object, but I just started to wonder whether the need to
>> handle re-archiving of the same file cleanly is as well-documented as
>> it ought to be.
>
> +1, but I don't think that needs to stand in the way of this patch, which
> looks sensible to me as-is. I think that's what you meant, but just wanted
> to be sure.

Yeah, this seems like something that should be documented. I can pick this
up. I believe this is an existing problem, but this patch could make it more
likely.

> There are plenty of ways that already-archived WAL might get archived again
> and this is just one of them.

What are some of the others? I was aware of the case that was fixed in
ff9f111, where we might try to re-archive a file with different contents, but
I'm curious what other ways you've seen this happen.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 07, 2022 at 09:18:25AM -0700, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
>> On 7/7/22 10:37, Robert Haas wrote:
>>> I don't object, but I just started to wonder whether the need to
>>> handle re-archiving of the same file cleanly is as well-documented as
>>> it ought to be.
>>
>> +1, but I don't think that needs to stand in the way of this patch, which
>> looks sensible to me as-is. I think that's what you meant, but just wanted
>> to be sure.
>
> Yeah, this seems like something that should be documented. I can pick this
> up. I believe this is an existing problem, but this patch could make it
> more likely.

Here is a first try at documenting this. I'm not thrilled about the
placement, since it feels a bit buried in the backup docs, but this is where
this sort of thing lives today.

It also seems odd to stress the importance of avoiding overwriting
pre-existing archives in case multiple servers are archiving to the same
place while only offering solutions with obvious race conditions. Even
basic_archive is subject to this now that durable_rename_excl() no longer
exists. Perhaps we should make a note of that, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
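As an aside on the race-condition point: the documented "test ! -f ... &&
cp ..." style of check is not atomic, because another archiver can create the
target between the test and the copy. One way to close that window on a local
filesystem is to create the target with O_EXCL, as in the hypothetical sketch
below (the function name is made up for illustration).

	/* Hypothetical sketch: claim an archive target atomically instead of
	 * "check, then copy".  open() with O_CREAT | O_EXCL fails with EEXIST if
	 * someone else already created the file, closing the check-then-act race
	 * on a local filesystem. */
	#include <errno.h>
	#include <fcntl.h>

	static int
	claim_archive_target(const char *archive_path)
	{
		int			fd = open(archive_path, O_CREAT | O_EXCL | O_WRONLY, 0600);

		if (fd >= 0)
			return fd;			/* we own the file; copy WAL contents into it */
		if (errno == EEXIST)
			return -1;			/* someone archived this segment already */
		return -2;				/* some other error; report and retry later */
	}

Note that this creates the file empty and fills it afterwards, so a crash
mid-copy leaves a partial archive behind; that is one reason basic_archive
instead copies to a temporary file and renames it into place.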
On 7/7/22 12:18, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
>
>> There are plenty of ways that already-archived WAL might get archived again
>> and this is just one of them.
>
> What are some of the others? I was aware of the case that was fixed in
> ff9f111, where we might try to re-archive a file with different contents,
> but I'm curious what other ways you've seen this happen.

On the PG side, crashes and (IIRC) immediate shutdown.

In general, any failure in the archiver itself. Storage, memory, network,
etc. There are plenty of ways that the file might make it to storage but
postgres never gets notified, so it will retry.

Any archiver that is not tolerant of this fact is not going to be very useful
and this patch only makes it slightly more true.

Regards,
-David
On Thu, Jul 07, 2022 at 02:07:26PM -0400, David Steele wrote:
> On 7/7/22 12:18, Nathan Bossart wrote:
>> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
>>
>> > There are plenty of ways that already-archived WAL might get archived again
>> > and this is just one of them.
>>
>> What are some of the others? I was aware of the case that was fixed in
>> ff9f111, where we might try to re-archive a file with different contents,
>> but I'm curious what other ways you've seen this happen.
>
> On the PG side, crashes and (IIRC) immediate shutdown.
>
> In general, any failure in the archiver itself. Storage, memory, network,
> etc. There are plenty of ways that the file might make it to storage but
> postgres never gets notified, so it will retry.
>
> Any archiver that is not tolerant of this fact is not going to be very
> useful and this patch only makes it slightly more true.

Ah, got it, makes sense.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
> +    library to ensure that it indeed does not overwrite an existing file. When
> +    a pre-existing file is encountered, the archive library may return
> +    <literal>true</literal> if the WAL file has identical contents to the
> +    pre-existing archive. Alternatively, the archive library can return

This should likely say something about ensuring the pre-existing file is
persisted to disk before returning true.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 7/7/22 14:22, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
>> +    library to ensure that it indeed does not overwrite an existing file. When
>> +    a pre-existing file is encountered, the archive library may return
>> +    <literal>true</literal> if the WAL file has identical contents to the
>> +    pre-existing archive. Alternatively, the archive library can return
>
> This should likely say something about ensuring the pre-existing file is
> persisted to disk before returning true.

Agreed.

Also, I'd prefer to remove "indeed" from this sentence (yes, I see that was
in the original text):

+    library to ensure that it indeed does not overwrite an existing

Regards,
-David
On Thu, Jul 07, 2022 at 05:06:13PM -0400, David Steele wrote:
> On 7/7/22 14:22, Nathan Bossart wrote:
>> On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
>> > +    library to ensure that it indeed does not overwrite an existing file. When
>> > +    a pre-existing file is encountered, the archive library may return
>> > +    <literal>true</literal> if the WAL file has identical contents to the
>> > +    pre-existing archive. Alternatively, the archive library can return
>>
>> This should likely say something about ensuring the pre-existing file is
>> persisted to disk before returning true.
>
> Agreed.
>
> Also, I'd prefer to remove "indeed" from this sentence (yes, I see that was
> in the original text):
>
> +    library to ensure that it indeed does not overwrite an existing

Here's an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
At Thu, 7 Jul 2022 15:07:16 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> Here's an updated patch.

Thinking RFC'ish, the meaning of "may" and "must" is significant in this
description. On the other hand, it uses both "may" and "can", but I think
the difference between them is not significant here, or the "can" is somewhat
confusing. I think the "can" should be "may" here.

And since "must" is emphasized, doesn't "may" also need emphasis?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 7/7/22 21:56, Kyotaro Horiguchi wrote:
> At Thu, 7 Jul 2022 15:07:16 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> Here's an updated patch.
>
> Thinking RFC'ish, the meaning of "may" and "must" is significant in this
> description. On the other hand, it uses both "may" and "can", but I think
> the difference between them is not significant here, or the "can" is
> somewhat confusing. I think the "can" should be "may" here.

+1.

> And since "must" is emphasized, doesn't "may" also need emphasis?

I think emphasis only on must is fine.

Nathan, I don't see the language about being sure to persist to storage here?

Regards,
-David
On Fri, Jul 08, 2022 at 08:20:09AM -0400, David Steele wrote:
> On 7/7/22 21:56, Kyotaro Horiguchi wrote:
>> Thinking RFC'ish, the meaning of "may" and "must" is significant in this
>> description. On the other hand, it uses both "may" and "can", but I think
>> the difference between them is not significant here, or the "can" is
>> somewhat confusing. I think the "can" should be "may" here.
>
> +1.

Done.

>> And since "must" is emphasized, doesn't "may" also need emphasis?
>
> I think emphasis only on must is fine.

Yeah, I wanted to emphasize the importance of returning false in this case.
Since it's okay to return true or false in the identical/persisted file case,
I didn't think it deserved emphasis.

> Nathan, I don't see the language about being sure to persist to storage
> here?

It's here:

	When an archive library encounters a pre-existing file, it may return
	true if the WAL file has identical contents to the pre-existing archive
	and the pre-existing archive is fully persisted to storage.

Since you didn't catch it, I wonder if it needs improvement. At the very
least, perhaps we should note that one way to do the latter is to persist it
yourself before returning true, and we could point to basic_archive.c as an
example. However, I'm hesitant to make these docs too much more complicated
than they already are. WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
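To make the "identical contents and fully persisted" condition concrete, here
is a rough sketch of how an archive callback might handle a pre-existing
file. It is loosely modeled on the approach basic_archive takes (compare
contents, and make sure the pre-existing copy is durable before reporting
success), but the helper name and structure here are illustrative, not the
contrib module's actual code.

	/* Illustrative only: handling a pre-existing archive file in an
	 * archive-module callback.  files_are_identical() is a hypothetical
	 * helper that compares the two files byte by byte. */
	#include <fcntl.h>
	#include <stdbool.h>
	#include <unistd.h>

	extern bool files_are_identical(const char *a, const char *b);

	static bool
	handle_preexisting(const char *wal_path, const char *archive_path)
	{
		int			fd;

		if (!files_are_identical(wal_path, archive_path))
			return false;		/* different contents: report failure loudly */

		/*
		 * Same contents, so the segment was evidently archived before the
		 * server recorded success (e.g. a crash hit in between).  Make sure
		 * the existing copy is actually durable before telling the server
		 * it is archived; otherwise a power loss could still lose it.
		 */
		fd = open(archive_path, O_RDONLY);
		if (fd < 0)
			return false;
		if (fsync(fd) != 0)
		{
			close(fd);
			return false;
		}
		close(fd);
		return true;			/* safe to treat as already archived */
	}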
On 7/8/22 12:54, Nathan Bossart wrote:
> On Fri, Jul 08, 2022 at 08:20:09AM -0400, David Steele wrote:
>
>> Nathan, I don't see the language about being sure to persist to storage
>> here?
>
> It's here:
>
> 	When an archive library encounters a pre-existing file, it may return
> 	true if the WAL file has identical contents to the pre-existing archive
> 	and the pre-existing archive is fully persisted to storage.
>
> Since you didn't catch it, I wonder if it needs improvement. At the very
> least, perhaps we should note that one way to do the latter is to persist
> it yourself before returning true, and we could point to basic_archive.c as
> an example. However, I'm hesitant to make these docs too much more
> complicated than they already are. WDYT?

I think I wrote this before I'd had enough coffee. "fully persisted to
storage" can mean many things depending on the storage (Posix, CIFS, S3,
etc.) so I think this is fine. The basic_archive module is there for people
who would like implementation details for Posix.

Regards,
-David
On Fri, Jul 08, 2022 at 01:02:51PM -0400, David Steele wrote:
> I think I wrote this before I'd had enough coffee. "fully persisted to
> storage" can mean many things depending on the storage (Posix, CIFS, S3,
> etc.) so I think this is fine. The basic_archive module is there for people
> who would like implementation details for Posix.

Perhaps this section should warn against reading without sufficient
caffeination. :)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 2022/07/09 2:18, Nathan Bossart wrote:
> On Fri, Jul 08, 2022 at 01:02:51PM -0400, David Steele wrote:
>> I think I wrote this before I'd had enough coffee. "fully persisted to
>> storage" can mean many things depending on the storage (Posix, CIFS, S3,
>> etc.) so I think this is fine. The basic_archive module is there for people
>> who would like implementation details for Posix.

Yes. But some users might want to use basic_archive without any changes. For
them, how about documenting how basic_archive works in the basic_archive
docs? I'm just thinking that it's worth exposing the information in the
comment at the top of basic_archive.c.

Anyway, since the patches look good to me, I pushed them. Thanks a lot!
If necessary, we can keep improving the docs later.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Jul 26, 2022 at 04:26:23PM +0900, Fujii Masao wrote:
> Anyway, since the patches look good to me, I pushed them. Thanks a lot!
> If necessary, we can keep improving the docs later.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Jul 08, 2022 at 09:54:50AM -0700, Nathan Bossart wrote:
> Since it's okay to return true or false in the identical/persisted file
> case, I didn't think it deserved emphasis.

I think returning false is not-okay:

> --- a/doc/src/sgml/backup.sgml
> +++ b/doc/src/sgml/backup.sgml
> @@ -681,14 +681,28 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>    any pre-existing archive file. This is an important safety feature to
>    preserve the integrity of your archive in case of administrator error
>    (such as sending the output of two different servers to the same archive
> -  directory).
> +  directory). It is advisable to test your proposed archive library to ensure
> +  that it does not overwrite an existing file.
>   </para>
>
>   <para>
> -  It is advisable to test your proposed archive library to ensure that it
> -  indeed does not overwrite an existing file, <emphasis>and that it returns
> -  <literal>false</literal> in this case</emphasis>.
> -  The example command above for Unix ensures this by including a separate
> +  In rare cases, <productname>PostgreSQL</productname> may attempt to
> +  re-archive a WAL file that was previously archived. For example, if the
> +  system crashes before the server makes a durable record of archival success,
> +  the server will attempt to archive the file again after restarting (provided
> +  archiving is still enabled). When an archive library encounters a
> +  pre-existing file, it may return <literal>true</literal> if the WAL file has
> +  identical contents to the pre-existing archive and the pre-existing archive
> +  is fully persisted to storage. Alternatively, the archive library may
> +  return <literal>false</literal> anytime a pre-existing file is encountered,
> +  but this will require manual action by an administrator to resolve. If a

Inviting the administrator to resolve things is more dangerous than just
returning true. I recommend making this text more opinionated and simpler:
libraries must return true. Alternately, if some library has found a good
reason to return false, this paragraph could give the reason. I don't know of
such a reason, though.

> +  pre-existing file contains different contents than the WAL file being
> +  archived, the archive library <emphasis>must</emphasis> return false.
> +  </para>
On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
> Inviting the administrator to resolve things is more dangerous than just
> returning true. I recommend making this text more opinionated and simpler:
> libraries must return true. Alternately, if some library has found a good
> reason to return false, this paragraph could give the reason. I don't know of
> such a reason, though.

Your suggestion seems reasonable to me. I've attached a small patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On 8/2/22 01:02, Nathan Bossart wrote:
> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
>> Inviting the administrator to resolve things is more dangerous than just
>> returning true. I recommend making this text more opinionated and simpler:
>> libraries must return true. Alternately, if some library has found a good
>> reason to return false, this paragraph could give the reason. I don't know of
>> such a reason, though.
>
> Your suggestion seems reasonable to me. I've attached a small patch.

+1. I think this is less confusing overall.

Regards,
-David
On Mon, Aug 01, 2022 at 10:02:19PM -0700, Nathan Bossart wrote:
> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
> > Inviting the administrator to resolve things is more dangerous than just
> > returning true. I recommend making this text more opinionated and simpler:
> > libraries must return true. Alternately, if some library has found a good
> > reason to return false, this paragraph could give the reason. I don't know of
> > such a reason, though.
>
> Your suggestion seems reasonable to me. I've attached a small patch.

> --- a/doc/src/sgml/backup.sgml
> +++ b/doc/src/sgml/backup.sgml
> @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>    system crashes before the server makes a durable record of archival success,
>    the server will attempt to archive the file again after restarting (provided
>    archiving is still enabled). When an archive library encounters a
> -  pre-existing file, it may return <literal>true</literal> if the WAL file has
> +  pre-existing file, it should return <literal>true</literal> if the WAL file has
>    identical contents to the pre-existing archive and the pre-existing archive
> -  is fully persisted to storage. Alternatively, the archive library may
> -  return <literal>false</literal> anytime a pre-existing file is encountered,
> -  but this will require manual action by an administrator to resolve. If a
> +  is fully persisted to storage. If a
>    pre-existing file contains different contents than the WAL file being
>    archived, the archive library <emphasis>must</emphasis> return
>    <literal>false</literal>.

Works for me. Thanks.
On 03.08.22 09:16, Noah Misch wrote:
> On Mon, Aug 01, 2022 at 10:02:19PM -0700, Nathan Bossart wrote:
>> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
>>> Inviting the administrator to resolve things is more dangerous than just
>>> returning true. I recommend making this text more opinionated and simpler:
>>> libraries must return true. Alternately, if some library has found a good
>>> reason to return false, this paragraph could give the reason. I don't know of
>>> such a reason, though.
>>
>> Your suggestion seems reasonable to me. I've attached a small patch.
>
>> --- a/doc/src/sgml/backup.sgml
>> +++ b/doc/src/sgml/backup.sgml
>> @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>>    system crashes before the server makes a durable record of archival success,
>>    the server will attempt to archive the file again after restarting (provided
>>    archiving is still enabled). When an archive library encounters a
>> -  pre-existing file, it may return <literal>true</literal> if the WAL file has
>> +  pre-existing file, it should return <literal>true</literal> if the WAL file has
>>    identical contents to the pre-existing archive and the pre-existing archive
>> -  is fully persisted to storage. Alternatively, the archive library may
>> -  return <literal>false</literal> anytime a pre-existing file is encountered,
>> -  but this will require manual action by an administrator to resolve. If a
>> +  is fully persisted to storage. If a
>>    pre-existing file contains different contents than the WAL file being
>>    archived, the archive library <emphasis>must</emphasis> return
>>    <literal>false</literal>.
>
> Works for me. Thanks.

This documentation change only covers archive_library. How are users of
archive_command supposed to handle this?
On Sat, Sep 17, 2022 at 11:46:39AM +0200, Peter Eisentraut wrote:
>> > --- a/doc/src/sgml/backup.sgml
>> > +++ b/doc/src/sgml/backup.sgml
>> > @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>> >    system crashes before the server makes a durable record of archival success,
>> >    the server will attempt to archive the file again after restarting (provided
>> >    archiving is still enabled). When an archive library encounters a
>> > -  pre-existing file, it may return <literal>true</literal> if the WAL file has
>> > +  pre-existing file, it should return <literal>true</literal> if the WAL file has
>> >    identical contents to the pre-existing archive and the pre-existing archive
>> > -  is fully persisted to storage. Alternatively, the archive library may
>> > -  return <literal>false</literal> anytime a pre-existing file is encountered,
>> > -  but this will require manual action by an administrator to resolve. If a
>> > +  is fully persisted to storage. If a
>> >    pre-existing file contains different contents than the WAL file being
>> >    archived, the archive library <emphasis>must</emphasis> return
>> >    <literal>false</literal>.
>>
>> Works for me. Thanks.
>
> This documentation change only covers archive_library. How are users of
> archive_command supposed to handle this?

I believe users of archive_command need to do something similar to what is
described here. However, it might be more reasonable to expect
archive_command users to simply return false when there is a pre-existing
file, as the deleted text notes. IIRC that is why I added that sentence
originally.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Sep 17, 2022 at 02:54:27PM -0700, Nathan Bossart wrote:
> On Sat, Sep 17, 2022 at 11:46:39AM +0200, Peter Eisentraut wrote:
> >> > --- a/doc/src/sgml/backup.sgml
> >> > +++ b/doc/src/sgml/backup.sgml
> >> > @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
> >> >    system crashes before the server makes a durable record of archival success,
> >> >    the server will attempt to archive the file again after restarting (provided
> >> >    archiving is still enabled). When an archive library encounters a
> >> > -  pre-existing file, it may return <literal>true</literal> if the WAL file has
> >> > +  pre-existing file, it should return <literal>true</literal> if the WAL file has
> >> >    identical contents to the pre-existing archive and the pre-existing archive
> >> > -  is fully persisted to storage. Alternatively, the archive library may
> >> > -  return <literal>false</literal> anytime a pre-existing file is encountered,
> >> > -  but this will require manual action by an administrator to resolve. If a
> >> > +  is fully persisted to storage. If a
> >> >    pre-existing file contains different contents than the WAL file being
> >> >    archived, the archive library <emphasis>must</emphasis> return
> >> >    <literal>false</literal>.
> >>
> >> Works for me. Thanks.
> >
> > This documentation change only covers archive_library. How are users of
> > archive_command supposed to handle this?
>
> I believe users of archive_command need to do something similar to what is
> described here. However, it might be more reasonable to expect
> archive_command users to simply return false when there is a pre-existing
> file, as the deleted text notes. IIRC that is why I added that sentence
> originally.

What makes the answer for archive_command diverge from the answer for
archive_library?
On 18.09.22 09:13, Noah Misch wrote:
>>> This documentation change only covers archive_library. How are users of
>>> archive_command supposed to handle this?
>>
>> I believe users of archive_command need to do something similar to what is
>> described here. However, it might be more reasonable to expect
>> archive_command users to simply return false when there is a pre-existing
>> file, as the deleted text notes. IIRC that is why I added that sentence
>> originally.
>
> What makes the answer for archive_command diverge from the answer for
> archive_library?

I suspect what we are really trying to say here is

===
Archiving setups (using either archive_command or archive_library) should be
prepared for the rare case that an identical archive file is being archived a
second time. In such a case, they should compare that the source and the
target file are identical and proceed without error if so.

In some cases, it is difficult or impossible to configure archive_command or
archive_library to do this. In such cases, the archiving command or library
should error like in the case for any pre-existing target file, and operators
need to be prepared to resolve such cases manually.
===

Is that correct?
On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
> On 18.09.22 09:13, Noah Misch wrote:
> >>> This documentation change only covers archive_library. How are users of
> >>> archive_command supposed to handle this?
> >>
> >> I believe users of archive_command need to do something similar to what is
> >> described here. However, it might be more reasonable to expect
> >> archive_command users to simply return false when there is a pre-existing
> >> file, as the deleted text notes. IIRC that is why I added that sentence
> >> originally.
> >
> > What makes the answer for archive_command diverge from the answer for
> > archive_library?
>
> I suspect what we are really trying to say here is
>
> ===
> Archiving setups (using either archive_command or archive_library) should be
> prepared for the rare case that an identical archive file is being archived
> a second time. In such a case, they should compare that the source and the
> target file are identical and proceed without error if so.
>
> In some cases, it is difficult or impossible to configure archive_command or
> archive_library to do this. In such cases, the archiving command or library
> should error like in the case for any pre-existing target file, and
> operators need to be prepared to resolve such cases manually.
> ===
>
> Is that correct?

I wanted it to stop saying anything like the second paragraph, hence commit
d263ced. Implementing a proper archiving setup is not especially difficult,
and inviting the operator to work around a wrong implementation invites
damaging mistakes under time pressure.
On 9/19/22 07:39, Noah Misch wrote:
> On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
>> On 18.09.22 09:13, Noah Misch wrote:
>>>>> This documentation change only covers archive_library. How are users of
>>>>> archive_command supposed to handle this?
>>>>
>>>> I believe users of archive_command need to do something similar to what is
>>>> described here. However, it might be more reasonable to expect
>>>> archive_command users to simply return false when there is a pre-existing
>>>> file, as the deleted text notes. IIRC that is why I added that sentence
>>>> originally.
>>>
>>> What makes the answer for archive_command diverge from the answer for
>>> archive_library?
>>
>> I suspect what we are really trying to say here is
>>
>> ===
>> Archiving setups (using either archive_command or archive_library) should be
>> prepared for the rare case that an identical archive file is being archived
>> a second time. In such a case, they should compare that the source and the
>> target file are identical and proceed without error if so.
>>
>> In some cases, it is difficult or impossible to configure archive_command or
>> archive_library to do this. In such cases, the archiving command or library
>> should error like in the case for any pre-existing target file, and
>> operators need to be prepared to resolve such cases manually.
>> ===
>>
>> Is that correct?
>
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced. Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I would also not want to state that duplicate WAL is rare. In practice it is
pretty common when things are going wrong. Also, implying it is rare might
lead a user to decide they don't need to worry about it.

Regards,
-David
On Mon, Sep 19, 2022 at 6:08 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> I suspect what we are really trying to say here is
>
> ===
> Archiving setups (using either archive_command or archive_library)
> should be prepared for the rare case that an identical archive file is
> being archived a second time. In such a case, they should compare that
> the source and the target file are identical and proceed without error
> if so.
>
> In some cases, it is difficult or impossible to configure
> archive_command or archive_library to do this. In such cases, the
> archiving command or library should error like in the case for any
> pre-existing target file, and operators need to be prepared to resolve
> such cases manually.
> ===
>
> Is that correct?

I think it is. I think commit d263ced225bffe2c340175125b0270d1869138fe made
the documentation in this area substantially clearer than what we had before,
and I think that we could adjust that text to apply to both archive libraries
and archive commands relatively easily, so I'm not really sure that we need
to add text like what you propose here. However, I also don't think it would
be particularly bad if we did. An advantage of that language is that it makes
it clear what the consequences are if you fail to follow the recommendation
about handling identical files. That may be helpful to users in understanding
the behavior of the system, and might even make it more likely that they will
include proper handling of that case in their archive_command or
archive_library.

"impossible" does seem like a bit of a strong word, though. I'd recommend
leaving that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Sep 19, 2022 at 10:39 AM Noah Misch <noah@leadboat.com> wrote:
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced. Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I agree wholeheartedly with the second sentence. I do think that we need to
be a little careful not to be too prescriptive. Telling people what they have
to do when they don't really have to do it often leads to non-compliance. It
can be more productive to spell out the precise consequences of
non-compliance, so I think Peter's proposal has some merit. On the other
hand, I don't see any real problem with the current language, either.

Unfortunately, no matter what we do here, it is not likely that we'll soon be
able to eliminate the phenomenon where people use a buggy home-grown
archive_command, both because we've been encouraging that in our
documentation for a very long time, and also because the people who are most
likely to do something half-baked here are probably the least likely to
actually read the updated documentation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Sep 19, 2022 at 12:49:58PM -0400, Robert Haas wrote:
> On Mon, Sep 19, 2022 at 10:39 AM Noah Misch <noah@leadboat.com> wrote:
> > I wanted it to stop saying anything like the second paragraph, hence commit
> > d263ced. Implementing a proper archiving setup is not especially difficult,
> > and inviting the operator to work around a wrong implementation invites
> > damaging mistakes under time pressure.
>
> I agree wholeheartedly with the second sentence. I do think that we
> need to be a little careful not to be too prescriptive. Telling people
> what they have to do when they don't really have to do it often leads
> to non-compliance. It can be more productive to spell out the precise
> consequences of non-compliance, so I think Peter's proposal has some
> merit.

Good point. We could say it from a perspective like, "if you don't do this,
your setup will appear to work most of the time, but your operator will be
stuck with dangerous manual fixes at times."

> On the other hand, I don't see any real problem with the
> current language, either.
>
> Unfortunately, no matter what we do here, it is not likely that we'll
> soon be able to eliminate the phenomenon where people use a buggy
> home-grown archive_command, both because we've been encouraging that
> in our documentation for a very long time, and also because the people
> who are most likely to do something half-baked here are probably the
> least likely to actually read the updated documentation.

True. If I wanted to eliminate such setups, I would make the server
double-archive one WAL file each time the archive setup changes.