Thread: remove more archiving overhead

remove more archiving overhead

From
Nathan Bossart
Date:
Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving
improvements (e.g., beb4e9b), the amount of archiving overhead has been
reduced.  I spent some time measuring the remaining overhead, and the
following function stood out:

    /*
     * pgarch_archiveDone
     *
     * Emit notification that an xlog file has been successfully archived.
     * We do this by renaming the status file from NNN.ready to NNN.done.
     * Eventually, a checkpoint process will notice this and delete both the
     * NNN.done file and the xlog file itself.
     */
    static void
    pgarch_archiveDone(char *xlog)
    {
        char        rlogready[MAXPGPATH];
        char        rlogdone[MAXPGPATH];

        StatusFilePath(rlogready, xlog, ".ready");
        StatusFilePath(rlogdone, xlog, ".done");
        (void) durable_rename(rlogready, rlogdone, WARNING);
    }

Specifically, the call to durable_rename(), which calls fsync() a few
times, can cause a decent delay.  On my machine, archiving ~12k WAL files
using a bare-bones archive module that did nothing but return true took
over a minute.  When I changed this to rename() (i.e., reverting the change
to pgarch.c in 1d4a0ab), the same test took a second or less.
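For context, a durable rename does considerably more work than a bare rename(). The following is a rough, simplified sketch of the durable variant (modeled loosely on what PostgreSQL's fd.c does; the helper names and the fixed-size path buffer are mine, not the actual implementation):

```c
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* fsync a file or directory by path; returns 0 on success, -1 on error */
static int
fsync_path(const char *path)
{
    int     fd = open(path, O_RDONLY);
    int     ret;

    if (fd < 0)
        return -1;
    ret = fsync(fd);
    close(fd);
    return ret;
}

/*
 * Sketch of a durable rename: flush the source file, perform the rename,
 * then flush the target file and its containing directory so that the
 * directory entry itself survives a crash.  Each fsync() here is a
 * potential delay that a plain rename() avoids.
 */
int
durable_rename_sketch(const char *oldfile, const char *newfile)
{
    char    dir[1024];

    if (fsync_path(oldfile) != 0)
        return -1;
    if (rename(oldfile, newfile) != 0)
        return -1;
    if (fsync_path(newfile) != 0)
        return -1;

    /* dirname() may modify its argument, so work on a copy */
    strncpy(dir, newfile, sizeof(dir) - 1);
    dir[sizeof(dir) - 1] = '\0';
    return fsync_path(dirname(dir));
}
```

The extra fsync() calls, particularly the one on the directory, are where the per-file overhead comes from.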

I also spent some time investigating whether durably renaming the archive
status files was even necessary.  In theory, it shouldn't be.  If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that.  If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f).  But when digging further, I found that WAL file recycling uses
durable_rename_excl(), which has the following note:

     * Note that a crash in an unfortunate moment can leave you with two links to
     * the target file.

IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename.  If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives.  I think this
might already be possible today.  Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already
exists.  The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents.  Since something along these
lines is recommended as a best practice, perhaps this is okay.  But I think
we might be able to do better.  If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.
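Concretely, syncing a directory is just an fsync() on a descriptor opened on the directory itself. A minimal sketch of the operation that would be performed on archive_status before recycling or removing a WAL file (helper name mine):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush a directory's metadata (i.e., its entries) to stable storage,
 * so that any renames of status files within it are persisted before a
 * WAL file is recycled or removed.  Returns 0 on success, -1 on error.
 */
int
fsync_dir_sketch(const char *dirpath)
{
    int     fd = open(dirpath, O_RDONLY);
    int     ret;

    if (fd < 0)
        return -1;
    ret = fsync(fd);
    close(fd);
    return ret;
}
```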

The drawback of this is that we are essentially offloading more work to the
checkpointer process.  In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times.  I'm not sure
how big of a problem this is.  Ensuring that archiving keeps up is
worthwhile, and the checkpointer performs far more expensive tasks that
could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Mon, Feb 21, 2022 at 05:19:48PM -0800, Nathan Bossart wrote:
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file
> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
> 
>      * Note that a crash in an unfortunate moment can leave you with two links to
>      * the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this
> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

In my testing, I found that when I killed the server just before unlink()
during WAL recycling, I ended up with links to the same file in pg_wal after
restarting.  My latest test produced links to the same file for the current
WAL file and the next one.  Maybe WAL recycling should use durable_rename()
instead of durable_rename_excl().
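To illustrate why an exclusive rename can leave two links: on systems with link(), the "rename, but fail if the target exists" operation is typically emulated as a link() followed by an unlink(), and a crash between the two calls leaves both names pointing at the same inode. A simplified sketch of that emulation (not the actual fd.c code):

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Simplified "rename, failing if the target exists" via link()/unlink().
 * If the process is killed after link() succeeds but before unlink()
 * runs, both oldfile and newfile remain as hard links to the same file,
 * which matches the state observed after killing the server mid-recycle.
 */
int
rename_excl_sketch(const char *oldfile, const char *newfile)
{
    if (link(oldfile, newfile) != 0)
        return -1;              /* fails with EEXIST if newfile exists */
    /* <-- a crash here leaves two links to the same inode */
    return unlink(oldfile);
}
```

A plain rename() is a single atomic system call, so this particular window does not exist for durable_rename().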

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> In my testing, I found that when I killed the server just before unlink()
> during WAL recycling, I ended up with links to the same file in pg_wal after
> restarting.  My latest test produced links to the same file for the current
> WAL file and the next one.  Maybe WAL recycling should use durable_rename()
> instead of durable_rename_excl().

Here is an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
Kyotaro Horiguchi
Date:
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file

I believe the intent is to archive a WAL segment once and only once,
even though in practice we fail to do so in certain cases.

https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).

The recommended behavior of archive_command above is to avoid this
kind of corruption.  If we officially admit that a WAL segment can be
archived twice, we need to revise that part (and the following part) of
the doc.

> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recycling uses
> durable_rename_excl(), which has the following note:
>
>      * Note that a crash in an unfortunate moment can leave you with two links to
>      * the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this

IMHO the difference is that a single system call (presumably) cannot be
interrupted by SIGKILL, so I'm not sure that case is worth considering.
The SIGKILL window could be eliminated if we could use
renameat2(RENAME_NOREPLACE), but in the end a power loss can still cause
the same problem.
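For reference, the call in question is Linux-specific renameat2() with the RENAME_NOREPLACE flag, which performs the rename atomically and fails if the target exists, so there is no window in which two links exist. A sketch (hedged: requires Linux and glibc 2.28 or later; the wrapper name is mine):

```c
#define _GNU_SOURCE             /* for renameat2() and RENAME_NOREPLACE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomic rename that fails with EEXIST instead of replacing the target.
 * Unlike the link()/unlink() emulation, no intermediate state with two
 * links is ever visible, though (as noted above) a power loss during the
 * operation is still outside its guarantees.
 */
int
rename_noreplace_sketch(const char *oldfile, const char *newfile)
{
    return renameat2(AT_FDCWD, oldfile, AT_FDCWD, newfile,
                     RENAME_NOREPLACE);
}
```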

> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

If we don't fsync at every rename, RemoveOldXlogFiles doesn't even need
to pick up the .done file very quickly for the problem to occur.  Isn't
that a big difference?

> I believe the current recommendation is to fail if the archive file already
> exists.  The basic_archive contrib module fails if the archive already
> exists and has different contents, and it succeeds without re-archiving if
> it already exists with identical contents.  Since something along these
> lines is recommended as a best practice, perhaps this is okay.  But I think
> we might be able to do better.  If we ensure that the archive_status
> directory is sync'd prior to recycling/removing a WAL file, I believe we
> are safe from the aforementioned problem.
>
> The drawback of this is that we are essentially offloading more work to the
> checkpointer process.  In addition to everything else it must do, it will
> also need to fsync() the archive_status directory many times.  I'm not sure
> how big of a problem this is.  Ensuring that archiving keeps up is
> worthwhile, and the checkpointer performs far more expensive tasks that
> could make the added fsync() calls relatively insignificant.

I don't think we want to make the checkpointer slower in exchange for
making the archiver faster.  Perhaps we should work from condensed
information rather than repeatedly scanning over distributed information
like archive_status?

At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> > In my testing, I found that when I killed the server just before unlink()
> > during WAL recycling, I ended up with links to the same file in pg_wal after
> > restarting.  My latest test produced links to the same file for the current
> > WAL file and the next one.  Maybe WAL recycling should use durable_rename()
> > instead of durable_rename_excl().
> 
> Here is an updated patch.

Is it intentional that v2 loses the xlogarchive.c part?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: remove more archiving overhead

From
Nathan Bossart
Date:
Thanks for taking a look!

On Thu, Feb 24, 2022 at 02:13:49PM +0900, Kyotaro Horiguchi wrote:
> https://www.postgresql.org/docs/14/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory).
> 
> The recommended behavior of archive_command above is to avoid this
> kind of corruption.  If we officially admit that a WAL segment can be
> archived twice, we need to revise that part (and the following part) of
> the doc.

Yeah, we might want to recommend succeeding if the archive already exists
with identical contents (like basic_archive does).  IMO this would best be
handled in a separate thread that is solely focused on documentation
updates.  AFAICT this is an existing problem.

>> IIUC this means that in theory, a crash at an unfortunate moment could
>> leave you with a .ready file, the file to archive, and another link to that
>> file with a "future" WAL filename.  If you re-archive the file after it has
>> been reused, you could end up with corrupt WAL archives.  I think this
> 
> IMHO the difference is that a single system call (presumably) cannot be
> interrupted by SIGKILL, so I'm not sure that case is worth considering.
> The SIGKILL window could be eliminated if we could use
> renameat2(RENAME_NOREPLACE), but in the end a power loss can still cause
> the same problem.

Perhaps durable_rename_excl() could use renameat2() for systems that
support it.  However, the problem would still exist for systems that have
link() but not renameat2().  In this particular case, there really
shouldn't be an existing file in pg_wal.

>> might already be possible today.  Syncing the directory after every rename
>> probably reduces the likelihood quite substantially, but if
>> RemoveOldXlogFiles() quickly picks up the .done file and attempts
>> recycling before durable_rename() calls fsync() on the directory,
>> presumably the same problem could occur.
> 
> If we don't fsync at every rename, RemoveOldXlogFiles doesn't even need
> to pick up the .done file very quickly for the problem to occur.  Isn't
> that a big difference?

Yeah, it probably increases the chances quite a bit.

> I don't think we want to make the checkpointer slower in exchange for
> making the archiver faster.  Perhaps we should work from condensed
> information rather than repeatedly scanning over distributed information
> like archive_status?

v2 avoids giving the checkpointer more work.

> At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
>> > In my testing, I found that when I killed the server just before unlink()
>> > during WAL recycling, I ended up with links to the same file in pg_wal after
>> > restarting.  My latest test produced links to the same file for the current
>> > WAL file and the next one.  Maybe WAL recycling should use durable_rename()
>> > instead of durable_rename_excl().
>> 
>> Here is an updated patch.
> 
> Is it intentional that v2 loses the xlogarchive.c part?

Yes.  I found that a crash at an unfortunate moment can produce multiple
links to the same file in pg_wal, which seemed bad independent of archival.
By fixing that (i.e., switching from durable_rename_excl() to
durable_rename()), we not only avoid this problem, but we also avoid trying
to archive a file the server is concurrently writing.  Then, after a crash,
the WAL file to archive should either not exist (which is handled by the
archiver) or contain the same contents as any preexisting archives.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
I tested this again with many more WAL files and a much larger machine
(r5d.24xlarge with data directory on an NVMe SSD instance store volume).
As before, I am using a bare-bones archive module that does nothing but
return true.  Without the patch, archiving ~120k WAL files took about 109
seconds.  With the patch, it took around 62 seconds, which amounts to a
~43% reduction in overhead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> Yes.  I found that a crash at an unfortunate moment can produce multiple
> links to the same file in pg_wal, which seemed bad independent of archival.
> By fixing that (i.e., switching from durable_rename_excl() to
> durable_rename()), we not only avoid this problem, but we also avoid trying
> to archive a file the server is concurrently writing.  Then, after a crash,
> the WAL file to archive should either not exist (which is handled by the
> archiver) or contain the same contents as any preexisting archives.

I moved the fix for this to a new thread [0] since I think it should be
back-patched.  I've attached a new patch that only contains the part
related to reducing archiving overhead.

[0] https://www.postgresql.org/message-id/20220407182954.GA1231544%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
Robert Haas
Date:
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> > Yes.  I found that a crash at an unfortunate moment can produce multiple
> > links to the same file in pg_wal, which seemed bad independent of archival.
> > By fixing that (i.e., switching from durable_rename_excl() to
> > durable_rename()), we not only avoid this problem, but we also avoid trying
> > to archive a file the server is concurrently writing.  Then, after a crash,
> > the WAL file to archive should either not exist (which is handled by the
> > archiver) or contain the same contents as any preexisting archives.
>
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched.  I've attached a new patch that only contains the part
> related to reducing archiving overhead.

While we're now after the feature freeze and thus this will need to
wait for v16, it looks like a reasonable change to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Fri, Apr 08, 2022 at 10:20:27AM -0400, Robert Haas wrote:
> On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> > Yes.  I found that a crash at an unfortunate moment can produce multiple
>> > links to the same file in pg_wal, which seemed bad independent of archival.
>> > By fixing that (i.e., switching from durable_rename_excl() to
>> > durable_rename()), we not only avoid this problem, but we also avoid trying
>> > to archive a file the server is concurrently writing.  Then, after a crash,
>> > the WAL file to archive should either not exist (which is handled by the
>> > archiver) or contain the same contents as any preexisting archives.
>>
>> I moved the fix for this to a new thread [0] since I think it should be
>> back-patched.  I've attached a new patch that only contains the part
>> related to reducing archiving overhead.
> 
> While we're now after the feature freeze and thus this will need to
> wait for v16, it looks like a reasonable change to me.

Dang, just missed it.  Thanks for taking a look.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Fujii Masao
Date:

On 2022/04/08 7:23, Nathan Bossart wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> Yes.  I found that a crash at an unfortunate moment can produce multiple
>> links to the same file in pg_wal, which seemed bad independent of archival.
>> By fixing that (i.e., switching from durable_rename_excl() to
>> durable_rename()), we not only avoid this problem, but we also avoid trying
>> to archive a file the server is concurrently writing.  Then, after a crash,
>> the WAL file to archive should either not exist (which is handled by the
>> archiver) or contain the same contents as any preexisting archives.
> 
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched.  I've attached a new patch that only contains the part
> related to reducing archiving overhead.

Thanks for updating the patch. It looks good to me.
Barring any objection, I'm thinking to commit it.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: remove more archiving overhead

From
Robert Haas
Date:
On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> Thanks for updating the patch. It looks good to me.
> Barring any objection, I'm thinking to commit it.

I don't object, but I just started to wonder whether the need to
handle re-archiving of the same file cleanly is as well-documented as
it ought to be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: remove more archiving overhead

From
David Steele
Date:
On 7/7/22 10:37, Robert Haas wrote:
> On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> Thanks for updating the patch. It looks good to me.
>> Barring any objection, I'm thinking to commit it.
> 
> I don't object, but I just started to wonder whether the need to
> handle re-archiving of the same file cleanly is as well-documented as
> it ought to be.

+1, but I don't think that needs to stand in the way of this patch, 
which looks sensible to me as-is. I think that's what you meant, but 
just wanted to be sure.

There are plenty of ways that already-archived WAL might get archived 
again and this is just one of them.

Thoughts, Nathan?

Regards,
-David



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
> On 7/7/22 10:37, Robert Haas wrote:
>> On Thu, Jul 7, 2022 at 10:03 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> > Thanks for updating the patch. It looks good to me.
>> > Barring any objection, I'm thinking to commit it.
>> 
>> I don't object, but I just started to wonder whether the need to
>> handle re-archiving of the same file cleanly is as well-documented as
>> it ought to be.
> 
> +1, but I don't think that needs to stand in the way of this patch, which
> looks sensible to me as-is. I think that's what you meant, but just wanted
> to be sure.

Yeah, this seems like something that should be documented.  I can pick this
up.  I believe this is an existing problem, but this patch could make it
more likely.

> There are plenty of ways that already-archived WAL might get archived again
> and this is just one of them.

What are some of the others?  I was aware of the case that was fixed in
ff9f111, where we might try to re-archive a file with different contents,
but I'm curious what other ways you've seen this happen.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Jul 07, 2022 at 09:18:25AM -0700, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
>> On 7/7/22 10:37, Robert Haas wrote:
>>> I don't object, but I just started to wonder whether the need to
>>> handle re-archiving of the same file cleanly is as well-documented as
>>> it ought to be.
>> 
>> +1, but I don't think that needs to stand in the way of this patch, which
>> looks sensible to me as-is. I think that's what you meant, but just wanted
>> to be sure.
> 
> Yeah, this seems like something that should be documented.  I can pick this
> up.  I believe this is an existing problem, but this patch could make it
> more likely.

Here is a first try at documenting this.  I'm not thrilled about the
placement, since it feels a bit buried in the backup docs, but this is
where this sort of thing lives today.  It also seems odd to stress the
importance of avoiding overwriting pre-existing archives in case multiple
servers are archiving to the same place while only offering solutions with
obvious race conditions.  Even basic_archive is subject to this now that
durable_rename_excl() no longer exists.  Perhaps we should make a note of
that, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
David Steele
Date:
On 7/7/22 12:18, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
> 
>> There are plenty of ways that already-archived WAL might get archived again
>> and this is just one of them.
> 
> What are some of the others?  I was aware of the case that was fixed in
> ff9f111, where we might try to re-archive a file with different contents,
> but I'm curious what other ways you've seen this happen.

On the PG side, crashes and (IIRC) immediate shutdown.

In general, any failure in the archiver itself. Storage, memory, 
network, etc. There are plenty of ways that the file might make it to 
storage but postgres never gets notified, so it will retry.

Any archiver that is not tolerant of this fact is not going to be very 
useful and this patch only makes it slightly more true.

Regards,
-David



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Jul 07, 2022 at 02:07:26PM -0400, David Steele wrote:
> On 7/7/22 12:18, Nathan Bossart wrote:
>> On Thu, Jul 07, 2022 at 10:46:23AM -0400, David Steele wrote:
>> 
>> > There are plenty of ways that already-archived WAL might get archived again
>> > and this is just one of them.
>> 
>> What are some of the others?  I was aware of the case that was fixed in
>> ff9f111, where we might try to re-archive a file with different contents,
>> but I'm curious what other ways you've seen this happen.
> 
> On the PG side, crashes and (IIRC) immediate shutdown.
> 
> In general, any failure in the archiver itself. Storage, memory, network,
> etc. There are plenty of ways that the file might make it to storage but
> postgres never gets notified, so it will retry.
> 
> Any archiver that is not tolerant of this fact is not going to be very
> useful and this patch only makes it slightly more true.

Ah, got it, makes sense.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
> +    library to ensure that it indeed does not overwrite an existing file.  When
> +    a pre-existing file is encountered, the archive library may return
> +    <literal>true</literal> if the WAL file has identical contents to the
> +    pre-existing archive.  Alternatively, the archive library can return

This should likely say something about ensuring the pre-existing file is
persisted to disk before returning true.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
David Steele
Date:
On 7/7/22 14:22, Nathan Bossart wrote:
> On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
>> +    library to ensure that it indeed does not overwrite an existing file.  When
>> +    a pre-existing file is encountered, the archive library may return
>> +    <literal>true</literal> if the WAL file has identical contents to the
>> +    pre-existing archive.  Alternatively, the archive library can return
> 
> This should likely say something about ensuring the pre-existing file is
> persisted to disk before returning true.

Agreed.

Also, I'd prefer to remove "indeed" from this sentence (yes, I see that 
was in the original text):

+    library to ensure that it indeed does not overwrite an existing

Regards,
-David



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Thu, Jul 07, 2022 at 05:06:13PM -0400, David Steele wrote:
> On 7/7/22 14:22, Nathan Bossart wrote:
>> On Thu, Jul 07, 2022 at 10:51:42AM -0700, Nathan Bossart wrote:
>> > +    library to ensure that it indeed does not overwrite an existing file.  When
>> > +    a pre-existing file is encountered, the archive library may return
>> > +    <literal>true</literal> if the WAL file has identical contents to the
>> > +    pre-existing archive.  Alternatively, the archive library can return
>> 
>> This should likely say something about ensuring the pre-existing file is
>> persisted to disk before returning true.
> 
> Agreed.
> 
> Also, I'd prefer to remove "indeed" from this sentence (yes, I see that was
> in the original text):
> 
> +    library to ensure that it indeed does not overwrite an existing

Here's an updated patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
Kyotaro Horiguchi
Date:
At Thu, 7 Jul 2022 15:07:16 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> Here's an updated patch.

Thinking RFC-style, the meanings of "may" and "must" are significant in
this description.  On the other hand, it uses both "may" and "can", but
I think that their difference is either not significant or the "can"
there is somewhat confusing.  I think the "can" should be "may" here.

And since "must" is emphasized, doesn't "may" also need emphasis?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: remove more archiving overhead

From
David Steele
Date:
On 7/7/22 21:56, Kyotaro Horiguchi wrote:
> At Thu, 7 Jul 2022 15:07:16 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
>> Here's an updated patch.
> 
> Thinking RFC-style, the meanings of "may" and "must" are significant in
> this description.  On the other hand, it uses both "may" and "can", but
> I think that their difference is either not significant or the "can"
> there is somewhat confusing.  I think the "can" should be "may" here.

+1.

> And since "must" is emphasized, doesn't "may" also need emphasis?

I think emphasis only on must is fine.

Nathan, I don't see the language about being sure to persist to storage 
here?

Regards,
-David



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Fri, Jul 08, 2022 at 08:20:09AM -0400, David Steele wrote:
> On 7/7/22 21:56, Kyotaro Horiguchi wrote:
>> Thinking RFC-style, the meanings of "may" and "must" are significant in
>> this description.  On the other hand, it uses both "may" and "can", but
>> I think that their difference is either not significant or the "can"
>> there is somewhat confusing.  I think the "can" should be "may" here.
> 
> +1.

Done.

>> And since "must" is emphasized, doesn't "may" also need emphasis?
> 
> I think emphasis only on must is fine.

Yeah, I wanted to emphasize the importance of returning false in this case.
Since it's okay to return true or false in the identical/persisted file
case, I didn't think it deserved emphasis.

> Nathan, I don't see the language about being sure to persist to storage
> here?

It's here:
    When an archive library encounters a pre-existing file, it may return
    true if the WAL file has identical contents to the pre-existing archive
    and the pre-existing archive is fully persisted to storage.

Since you didn't catch it, I wonder if it needs improvement.  At the very
least, perhaps we should note that one way to do the latter is to persist
it yourself before returning true, and we could point to basic_archive.c as
an example.  However, I'm hesitant to make these docs too much more
complicated than they already are.  WDYT?
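To make that guidance concrete, a sketch of the approach an archive library could take on encountering a pre-existing target (this parallels what basic_archive does but is not its actual code; the helper names are mine): compare contents byte by byte, and only report success after making sure the pre-existing copy is flushed to storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Compare two files byte by byte; returns true iff contents are identical. */
bool
files_identical(const char *path1, const char *path2)
{
    FILE   *f1 = fopen(path1, "rb");
    FILE   *f2 = fopen(path2, "rb");
    bool    same = (f1 != NULL && f2 != NULL);
    int     c1, c2;

    while (same)
    {
        c1 = fgetc(f1);
        c2 = fgetc(f2);
        if (c1 != c2)
            same = false;
        if (c1 == EOF || c2 == EOF)
            break;
    }
    if (f1)
        fclose(f1);
    if (f2)
        fclose(f2);
    return same;
}

/*
 * Decide whether the archive callback may return true for a pre-existing
 * target: the contents must match the WAL file, and the pre-existing copy
 * must be flushed to stable storage before success is reported.
 */
bool
preexisting_archive_ok(const char *walfile, const char *archived)
{
    FILE   *f;
    bool    ok;

    if (!files_identical(walfile, archived))
        return false;           /* differing contents: the callback should fail */
    f = fopen(archived, "rb");
    if (f == NULL)
        return false;
    ok = (fsync(fileno(f)) == 0);
    fclose(f);
    return ok;
}
```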

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: remove more archiving overhead

From
David Steele
Date:
On 7/8/22 12:54, Nathan Bossart wrote:
> On Fri, Jul 08, 2022 at 08:20:09AM -0400, David Steele wrote:
> 
>> Nathan, I don't see the language about being sure to persist to storage
>> here?
> 
> It's here:
>     When an archive library encounters a pre-existing file, it may return
>     true if the WAL file has identical contents to the pre-existing archive
>     and the pre-existing archive is fully persisted to storage.
> 
> Since you didn't catch it, I wonder if it needs improvement.  At the very
> least, perhaps we should note that one way to do the latter is to persist
> it yourself before returning true, and we could point to basic_archive.c as
> an example.  However, I'm hesitant to make these docs too much more
> complicated than they already are.  WDYT?

I think I wrote this before I'd had enough coffee. "fully persisted to 
storage" can mean many things depending on the storage (Posix, CIFS, S3, 
etc.) so I think this is fine. The basic_archive module is there for 
people who would like implementation details for Posix.

Regards,
-David



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Fri, Jul 08, 2022 at 01:02:51PM -0400, David Steele wrote:
> I think I wrote this before I'd had enough coffee. "fully persisted to
> storage" can mean many things depending on the storage (Posix, CIFS, S3,
> etc.) so I think this is fine. The basic_archive module is there for people
> who would like implementation details for Posix.

Perhaps this section should warn against reading without sufficient
caffeination.  :)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Fujii Masao
Date:

On 2022/07/09 2:18, Nathan Bossart wrote:
> On Fri, Jul 08, 2022 at 01:02:51PM -0400, David Steele wrote:
>> I think I wrote this before I'd had enough coffee. "fully persisted to
>> storage" can mean many things depending on the storage (Posix, CIFS, S3,
>> etc.) so I think this is fine. The basic_archive module is there for people
>> who would like implementation details for Posix.

Yes. But some users might want to use basic_archive without any changes.
For them, how about documenting how basic_archive works in the basic_archive
docs?  I'm just thinking that it's worth exposing the information commented
at the top of basic_archive.c.

Anyway, since the patches look good to me, I pushed them. Thanks a lot!
If necessary, we can keep improving the docs later.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Tue, Jul 26, 2022 at 04:26:23PM +0900, Fujii Masao wrote:
> Anyway, since the patches look good to me, I pushed them. Thanks a lot!
> If necessary, we can keep improving the docs later.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Noah Misch
Date:
On Fri, Jul 08, 2022 at 09:54:50AM -0700, Nathan Bossart wrote:
> Since it's okay to return true or false in the identical/persisted file
> case, I didn't think it deserved emphasis.

I think returning false is not-okay:

> --- a/doc/src/sgml/backup.sgml
> +++ b/doc/src/sgml/backup.sgml
> @@ -681,14 +681,28 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>      any pre-existing archive file.  This is an important safety feature to
>      preserve the integrity of your archive in case of administrator error
>      (such as sending the output of two different servers to the same archive
> -    directory).
> +    directory).  It is advisable to test your proposed archive library to ensure
> +    that it does not overwrite an existing file.
>     </para>
>  
>     <para>
> -    It is advisable to test your proposed archive library to ensure that it
> -    indeed does not overwrite an existing file, <emphasis>and that it returns
> -    <literal>false</literal> in this case</emphasis>.
> -    The example command above for Unix ensures this by including a separate
> +    In rare cases, <productname>PostgreSQL</productname> may attempt to
> +    re-archive a WAL file that was previously archived.  For example, if the
> +    system crashes before the server makes a durable record of archival success,
> +    the server will attempt to archive the file again after restarting (provided
> +    archiving is still enabled).  When an archive library encounters a
> +    pre-existing file, it may return <literal>true</literal> if the WAL file has
> +    identical contents to the pre-existing archive and the pre-existing archive
> +    is fully persisted to storage.  Alternatively, the archive library may
> +    return <literal>false</literal> anytime a pre-existing file is encountered,
> +    but this will require manual action by an administrator to resolve.  If a

Inviting the administrator to resolve things is more dangerous than just
returning true.  I recommend making this text more opinionated and simpler:
libraries must return true.  Alternately, if some library has found a good
reason to return false, this paragraph could give the reason.  I don't know of
such a reason, though.

> +    pre-existing file contains different contents than the WAL file being
> +    archived, the archive library <emphasis>must</emphasis> return false.
> +   </para>



Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
> Inviting the administrator to resolve things is more dangerous than just
> returning true.  I recommend making this text more opinionated and simpler:
> libraries must return true.  Alternately, if some library has found a good
> reason to return false, this paragraph could give the reason.  I don't know of
> such a reason, though.

Your suggestion seems reasonable to me.  I've attached a small patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


Re: remove more archiving overhead

From
David Steele
Date:
On 8/2/22 01:02, Nathan Bossart wrote:
> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
>> Inviting the administrator to resolve things is more dangerous than just
>> returning true.  I recommend making this text more opinionated and simpler:
>> libraries must return true.  Alternately, if some library has found a good
>> reason to return false, this paragraph could give the reason.  I don't know of
>> such a reason, though.
> 
> Your suggestion seems reasonable to me.  I've attached a small patch.

+1. I think this is less confusing overall.

Regards,
-David



Re: remove more archiving overhead

From
Noah Misch
Date:
On Mon, Aug 01, 2022 at 10:02:19PM -0700, Nathan Bossart wrote:
> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
> > Inviting the administrator to resolve things is more dangerous than just
> > returning true.  I recommend making this text more opinionated and simpler:
> > libraries must return true.  Alternately, if some library has found a good
> > reason to return false, this paragraph could give the reason.  I don't know of
> > such a reason, though.
> 
> Your suggestion seems reasonable to me.  I've attached a small patch.

> --- a/doc/src/sgml/backup.sgml
> +++ b/doc/src/sgml/backup.sgml
> @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>      system crashes before the server makes a durable record of archival success,
>      the server will attempt to archive the file again after restarting (provided
>      archiving is still enabled).  When an archive library encounters a
> -    pre-existing file, it may return <literal>true</literal> if the WAL file has
> +    pre-existing file, it should return <literal>true</literal> if the WAL file has
>      identical contents to the pre-existing archive and the pre-existing archive
> -    is fully persisted to storage.  Alternatively, the archive library may
> -    return <literal>false</literal> anytime a pre-existing file is encountered,
> -    but this will require manual action by an administrator to resolve.  If a
> +    is fully persisted to storage.  If a
>      pre-existing file contains different contents than the WAL file being
>      archived, the archive library <emphasis>must</emphasis> return
>      <literal>false</literal>.

Works for me.  Thanks.



Re: remove more archiving overhead

From
Peter Eisentraut
Date:
On 03.08.22 09:16, Noah Misch wrote:
> On Mon, Aug 01, 2022 at 10:02:19PM -0700, Nathan Bossart wrote:
>> On Sat, Jul 30, 2022 at 11:51:56PM -0700, Noah Misch wrote:
>>> Inviting the administrator to resolve things is more dangerous than just
>>> returning true.  I recommend making this text more opinionated and simpler:
>>> libraries must return true.  Alternately, if some library has found a good
>>> reason to return false, this paragraph could give the reason.  I don't know of
>>> such a reason, though.
>>
>> Your suggestion seems reasonable to me.  I've attached a small patch.
> 
>> --- a/doc/src/sgml/backup.sgml
>> +++ b/doc/src/sgml/backup.sgml
>> @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>>       system crashes before the server makes a durable record of archival success,
>>       the server will attempt to archive the file again after restarting (provided
>>       archiving is still enabled).  When an archive library encounters a
>> -    pre-existing file, it may return <literal>true</literal> if the WAL file has
>> +    pre-existing file, it should return <literal>true</literal> if the WAL file has
>>       identical contents to the pre-existing archive and the pre-existing archive
>> -    is fully persisted to storage.  Alternatively, the archive library may
>> -    return <literal>false</literal> anytime a pre-existing file is encountered,
>> -    but this will require manual action by an administrator to resolve.  If a
>> +    is fully persisted to storage.  If a
>>       pre-existing file contains different contents than the WAL file being
>>       archived, the archive library <emphasis>must</emphasis> return
>>       <literal>false</literal>.
> 
> Works for me.  Thanks.

This documentation change only covers archive_library.  How are users of 
archive_command supposed to handle this?




Re: remove more archiving overhead

From
Nathan Bossart
Date:
On Sat, Sep 17, 2022 at 11:46:39AM +0200, Peter Eisentraut wrote:
>> > --- a/doc/src/sgml/backup.sgml
>> > +++ b/doc/src/sgml/backup.sgml
>> > @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
>> >       system crashes before the server makes a durable record of archival success,
>> >       the server will attempt to archive the file again after restarting (provided
>> >       archiving is still enabled).  When an archive library encounters a
>> > -    pre-existing file, it may return <literal>true</literal> if the WAL file has
>> > +    pre-existing file, it should return <literal>true</literal> if the WAL file has
>> >       identical contents to the pre-existing archive and the pre-existing archive
>> > -    is fully persisted to storage.  Alternatively, the archive library may
>> > -    return <literal>false</literal> anytime a pre-existing file is encountered,
>> > -    but this will require manual action by an administrator to resolve.  If a
>> > +    is fully persisted to storage.  If a
>> >       pre-existing file contains different contents than the WAL file being
>> >       archived, the archive library <emphasis>must</emphasis> return
>> >       <literal>false</literal>.
>> 
>> Works for me.  Thanks.
> 
> This documentation change only covers archive_library.  How are users of
> archive_command supposed to handle this?

I believe users of archive_command need to do something similar to what is
described here.  However, it might be more reasonable to expect
archive_command users to simply return false when there is a pre-existing
file, as the deleted text notes.  IIRC that is why I added that sentence
originally.
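
[For archive_command users, the handling under discussion can be sketched as
a small wrapper: succeed when the target already exists with identical
contents (a benign re-archival after a crash), fail when it exists with
different contents.  Paths and names here are hypothetical, and a real
script would also need to persist the copy durably.]

```shell
# Illustrative archive_command sketch: treat a pre-existing identical
# archive file as success, a pre-existing different file as failure,
# and otherwise copy the WAL file into the archive.
archive_cmd() {
    src="$1"; dst="$2"
    if [ -f "$dst" ]; then
        cmp -s "$src" "$dst"   # exit 0 if contents identical, else nonzero
        return $?
    fi
    cp "$src" "$dst"
}
```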

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: remove more archiving overhead

From
Noah Misch
Date:
On Sat, Sep 17, 2022 at 02:54:27PM -0700, Nathan Bossart wrote:
> On Sat, Sep 17, 2022 at 11:46:39AM +0200, Peter Eisentraut wrote:
> >> > --- a/doc/src/sgml/backup.sgml
> >> > +++ b/doc/src/sgml/backup.sgml
> >> > @@ -691,11 +691,9 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
> >> >       system crashes before the server makes a durable record of archival success,
> >> >       the server will attempt to archive the file again after restarting (provided
> >> >       archiving is still enabled).  When an archive library encounters a
> >> > -    pre-existing file, it may return <literal>true</literal> if the WAL file has
> >> > +    pre-existing file, it should return <literal>true</literal> if the WAL file has
> >> >       identical contents to the pre-existing archive and the pre-existing archive
> >> > -    is fully persisted to storage.  Alternatively, the archive library may
> >> > -    return <literal>false</literal> anytime a pre-existing file is encountered,
> >> > -    but this will require manual action by an administrator to resolve.  If a
> >> > +    is fully persisted to storage.  If a
> >> >       pre-existing file contains different contents than the WAL file being
> >> >       archived, the archive library <emphasis>must</emphasis> return
> >> >       <literal>false</literal>.
> >> 
> >> Works for me.  Thanks.
> > 
> > This documentation change only covers archive_library.  How are users of
> > archive_command supposed to handle this?
> 
> I believe users of archive_command need to do something similar to what is
> described here.  However, it might be more reasonable to expect
> archive_command users to simply return false when there is a pre-existing
> file, as the deleted text notes.  IIRC that is why I added that sentence
> originally.

What makes the answer for archive_command diverge from the answer for
archive_library?



Re: remove more archiving overhead

From
Peter Eisentraut
Date:
On 18.09.22 09:13, Noah Misch wrote:
>>> This documentation change only covers archive_library.  How are users of
>>> archive_command supposed to handle this?
>>
>> I believe users of archive_command need to do something similar to what is
>> described here.  However, it might be more reasonable to expect
>> archive_command users to simply return false when there is a pre-existing
>> file, as the deleted text notes.  IIRC that is why I added that sentence
>> originally.
> 
> What makes the answer for archive_command diverge from the answer for
> archive_library?

I suspect what we are really trying to say here is

===
Archiving setups (using either archive_command or archive_library) 
should be prepared for the rare case that an identical archive file is 
being archived a second time.  In such a case, they should compare that 
the source and the target file are identical and proceed without error 
if so.

In some cases, it is difficult or impossible to configure 
archive_command or archive_library to do this.  In such cases, the 
archiving command or library should error like in the case for any 
pre-existing target file, and operators need to be prepared to resolve 
such cases manually.
===

Is that correct?



Re: remove more archiving overhead

From
Noah Misch
Date:
On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
> On 18.09.22 09:13, Noah Misch wrote:
> >>>This documentation change only covers archive_library.  How are users of
> >>>archive_command supposed to handle this?
> >>
> >>I believe users of archive_command need to do something similar to what is
> >>described here.  However, it might be more reasonable to expect
> >>archive_command users to simply return false when there is a pre-existing
> >>file, as the deleted text notes.  IIRC that is why I added that sentence
> >>originally.
> >
> >What makes the answer for archive_command diverge from the answer for
> >archive_library?
> 
> I suspect what we are really trying to say here is
> 
> ===
> Archiving setups (using either archive_command or archive_library) should be
> prepared for the rare case that an identical archive file is being archived
> a second time.  In such a case, they should compare that the source and the
> target file are identical and proceed without error if so.
> 
> In some cases, it is difficult or impossible to configure archive_command or
> archive_library to do this.  In such cases, the archiving command or library
> should error like in the case for any pre-existing target file, and
> operators need to be prepared to resolve such cases manually.
> ===
> 
> Is that correct?

I wanted it to stop saying anything like the second paragraph, hence commit
d263ced.  Implementing a proper archiving setup is not especially difficult,
and inviting the operator to work around a wrong implementation invites
damaging mistakes under time pressure.



Re: remove more archiving overhead

From
David Steele
Date:

On 9/19/22 07:39, Noah Misch wrote:
> On Mon, Sep 19, 2022 at 06:08:29AM -0400, Peter Eisentraut wrote:
>> On 18.09.22 09:13, Noah Misch wrote:
>>>>> This documentation change only covers archive_library.  How are users of
>>>>> archive_command supposed to handle this?
>>>>
>>>> I believe users of archive_command need to do something similar to what is
>>>> described here.  However, it might be more reasonable to expect
>>>> archive_command users to simply return false when there is a pre-existing
>>>> file, as the deleted text notes.  IIRC that is why I added that sentence
>>>> originally.
>>>
>>> What makes the answer for archive_command diverge from the answer for
>>> archive_library?
>>
>> I suspect what we are really trying to say here is
>>
>> ===
>> Archiving setups (using either archive_command or archive_library) should be
>> prepared for the rare case that an identical archive file is being archived
>> a second time.  In such a case, they should compare that the source and the
>> target file are identical and proceed without error if so.
>>
>> In some cases, it is difficult or impossible to configure archive_command or
>> archive_library to do this.  In such cases, the archiving command or library
>> should error like in the case for any pre-existing target file, and
>> operators need to be prepared to resolve such cases manually.
>> ===
>>
>> Is that correct?
> 
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced.  Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I would also not want to state that duplicate WAL is rare. In practice 
it is pretty common when things are going wrong. Also, implying it is 
rare might lead a user to decide they don't need to worry about it.

Regards,
-David



Re: remove more archiving overhead

From
Robert Haas
Date:
On Mon, Sep 19, 2022 at 6:08 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
> I suspect what we are really trying to say here is
>
> ===
> Archiving setups (using either archive_command or archive_library)
> should be prepared for the rare case that an identical archive file is
> being archived a second time.  In such a case, they should compare that
> the source and the target file are identical and proceed without error
> if so.
>
> In some cases, it is difficult or impossible to configure
> archive_command or archive_library to do this.  In such cases, the
> archiving command or library should error like in the case for any
> pre-existing target file, and operators need to be prepared to resolve
> such cases manually.
> ===
>
> Is that correct?

I think it is. I think commit d263ced225bffe2c340175125b0270d1869138fe
made the documentation in this area substantially clearer than what we
had before, and I think that we could adjust that text to apply to
both archive libraries and archive commands relatively easily, so I'm
not really sure that we need to add text like what you propose here.

However, I also don't think it would be particularly bad if we did. An
advantage of that language is that it makes it clear what the
consequences are if you fail to follow the recommendation about
handling identical files. That may be helpful to users in
understanding the behavior of the system, and might even make it more
likely that they will include proper handling of that case in their
archive_command or archive_library.

"impossible" does seem like a bit of a strong word, though. I'd
recommend leaving that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: remove more archiving overhead

From
Robert Haas
Date:
On Mon, Sep 19, 2022 at 10:39 AM Noah Misch <noah@leadboat.com> wrote:
> I wanted it to stop saying anything like the second paragraph, hence commit
> d263ced.  Implementing a proper archiving setup is not especially difficult,
> and inviting the operator to work around a wrong implementation invites
> damaging mistakes under time pressure.

I agree wholeheartedly with the second sentence. I do think that we
need to be a little careful not to be too prescriptive. Telling people
what they have to do when they don't really have to do it often leads
to non-compliance. It can be more productive to spell out the precise
consequences of non-compliance, so I think Peter's proposal has some
merit. On the other hand, I don't see any real problem with the
current language, either.

Unfortunately, no matter what we do here, it is not likely that we'll
soon be able to eliminate the phenomenon where people use a buggy
home-grown archive_command, both because we've been encouraging that
in our documentation for a very long time, and also because the people
who are most likely to do something half-baked here are probably the
least likely to actually read the updated documentation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: remove more archiving overhead

From
Noah Misch
Date:
On Mon, Sep 19, 2022 at 12:49:58PM -0400, Robert Haas wrote:
> On Mon, Sep 19, 2022 at 10:39 AM Noah Misch <noah@leadboat.com> wrote:
> > I wanted it to stop saying anything like the second paragraph, hence commit
> > d263ced.  Implementing a proper archiving setup is not especially difficult,
> > and inviting the operator to work around a wrong implementation invites
> > damaging mistakes under time pressure.
> 
> I agree wholeheartedly with the second sentence. I do think that we
> need to be a little careful not to be too prescriptive. Telling people
> what they have to do when they don't really have to do it often leads
> to non-compliance. It can be more productive to spell out the precise
> consequences of non-compliance, so I think Peter's proposal has some
> merit.

Good point.  We could say it from a perspective like, "if you don't do this,
your setup will appear to work most of the time, but your operator will be
stuck with dangerous manual fixes at times."

> On the other hand, I don't see any real problem with the
> current language, either.
> 
> Unfortunately, no matter what we do here, it is not likely that we'll
> soon be able to eliminate the phenomenon where people use a buggy
> home-grown archive_command, both because we've been encouraging that
> in our documentation for a very long time, and also because the people
> who are most likely to do something half-baked here are probably the
> least likely to actually read the updated documentation.

True.  If I wanted to eliminate such setups, I would make the server
double-archive one WAL file each time the archive setup changes.