Re: remove more archiving overhead - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: remove more archiving overhead
Date
Msg-id 20220224.141349.1809640551882183816.horikyota.ntt@gmail.com
Whole thread Raw
In response to remove more archiving overhead  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: remove more archiving overhead
List pgsql-hackers
At Mon, 21 Feb 2022 17:19:48 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> I also spent some time investigating whether durably renaming the archive
> status files was even necessary.  In theory, it shouldn't be.  If a crash
> happens before the rename is persisted, we might try to re-archive files,
> but your archive_command/archive_library should handle that.  If the file

I believe we're trying to archive a WAL segment once and only once,
even though actually we fail in certain cases.

https://www.postgresql.org/docs/14/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory).

The recommended behavior of archive_command above is to avoid this
kind of corruption.  If we officially admit that a WAL segment can be
archive twice, we need to revise that part (and the following part) of
the doc.

> was already recycled or removed, the archiver will skip it (thanks to
> 6d8727f).  But when digging further, I found that WAL file recyling uses
> durable_rename_excl(), which has the following note:
>
>      * Note that a crash in an unfortunate moment can leave you with two links to
>      * the target file.
> 
> IIUC this means that in theory, a crash at an unfortunate moment could
> leave you with a .ready file, the file to archive, and another link to that
> file with a "future" WAL filename.  If you re-archive the file after it has
> been reused, you could end up with corrupt WAL archives.  I think this

IMHO the difference would be that the single system call (maybe)
cannot be interrupted by sigkill.  So I'm not sure that that is worth
considering.  The sigkill window can be elimianted if we could use
renameat(RENAME_NOREPLACE).  But finally power-loss can cause that.

> might already be possible today.  Syncing the directory after every rename
> probably reduces the likelihood quite substantially, but if
> RemoveOldXlogFiles() quickly picks up the .done file and attempts
> recycling before durable_rename() calls fsync() on the directory,
> presumably the same problem could occur.

If we don't fsync at every rename, we don't even need for
RemoveOldXlogFiles to pickup .done very quickly.  Isn't it a big
difference?

> I believe the current recommendation is to fail if the archive file already
> exists.  The basic_archive contrib module fails if the archive already
> exists and has different contents, and it succeeds without re-archiving if
> it already exists with identical contents.  Since something along these
> lines is recommended as a best practice, perhaps this is okay.  But I think
> we might be able to do better.  If we ensure that the archive_status
> directory is sync'd prior to recycling/removing a WAL file, I believe we
> are safe from the aforementioned problem.
>
> The drawback of this is that we are essentially offloading more work to the
> checkpointer process.  In addition to everything else it must do, it will
> also need to fsync() the archive_status directory many times.  I'm not sure
> how big of a problem this is.  It should be good to ensure that archiving
> keeps up, and there are far more expensive tasks that the checkpointer must
> perform that could make the added fsync() calls relatively insignificant.

I think we don't want make checkpointer slower in exchange of making
archiver faster.  Perhaps we need to work using a condensed
information rather than repeatedly scanning over the distributed
information like archive_status?

At Tue, 22 Feb 2022 11:52:29 -0800, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Tue, Feb 22, 2022 at 09:37:11AM -0800, Nathan Bossart wrote:
> > In my testing, I found that when I killed the server just before unlink()
> > during WAL recyling, I ended up with links to the same file in pg_wal after
> > restarting.  My latest test produced links to the same file for the current
> > WAL file and the next one.  Maybe WAL recyling should use durable_rename()
> > instead of durable_rename_excl().
> 
> Here is an updated patch.

Is it intentional that v2 loses the xlogarchive.c part?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: KAWAMOTO Masaya
Date:
Subject: Re: Typo in pgbench messages.
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Patch a potential memory leak in describeOneTableDetails()