remove more archiving overhead - Mailing list pgsql-hackers

From Nathan Bossart
Subject remove more archiving overhead
Date
Msg-id 20220222011948.GA3850532@nathanxps13
Whole thread Raw
Responses Re: remove more archiving overhead  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
Hi hackers,

With the addition of archive modules (5ef1eef) and other archiving
improvements (e.g., beb4e9b), the amount of archiving overhead has been
reduced.  I spent some time measuring the remaining overhead, and the
following function stood out:

    /*
     * pgarch_archiveDone
     *
     * Emit notification that an xlog file has been successfully archived.
     * We do this by renaming the status file from NNN.ready to NNN.done.
     * Eventually, a checkpoint process will notice this and delete both the
     * NNN.done file and the xlog file itself.
     */
    static void
    pgarch_archiveDone(char *xlog)
    {
        char        rlogready[MAXPGPATH];
        char        rlogdone[MAXPGPATH];

        StatusFilePath(rlogready, xlog, ".ready");
        StatusFilePath(rlogdone, xlog, ".done");
        (void) durable_rename(rlogready, rlogdone, WARNING);
    }

Specifically, the call to durable_rename(), which calls fsync() a few
times, can cause a decent delay.  On my machine, archiving ~12k WAL files
using a bare-bones archive module that did nothing but return true took
over a minute.  When I changed this to rename() (i.e., reverting the change
to pgarch.c in 1d4a0ab), the same test took a second or less.

I also spent some time investigating whether durably renaming the archive
status files was even necessary.  In theory, it shouldn't be.  If a crash
happens before the rename is persisted, we might try to re-archive files,
but your archive_command/archive_library should handle that.  If the file
was already recycled or removed, the archiver will skip it (thanks to
6d8727f).  But when digging further, I found that WAL file recyling uses
durable_rename_excl(), which has the following note:

     * Note that a crash in an unfortunate moment can leave you with two links to
     * the target file.

IIUC this means that in theory, a crash at an unfortunate moment could
leave you with a .ready file, the file to archive, and another link to that
file with a "future" WAL filename.  If you re-archive the file after it has
been reused, you could end up with corrupt WAL archives.  I think this
might already be possible today.  Syncing the directory after every rename
probably reduces the likelihood quite substantially, but if
RemoveOldXlogFiles() quickly picks up the .done file and attempts
recycling before durable_rename() calls fsync() on the directory,
presumably the same problem could occur.

I believe the current recommendation is to fail if the archive file already
exists.  The basic_archive contrib module fails if the archive already
exists and has different contents, and it succeeds without re-archiving if
it already exists with identical contents.  Since something along these
lines is recommended as a best practice, perhaps this is okay.  But I think
we might be able to do better.  If we ensure that the archive_status
directory is sync'd prior to recycling/removing a WAL file, I believe we
are safe from the aforementioned problem.

The drawback of this is that we are essentially offloading more work to the
checkpointer process.  In addition to everything else it must do, it will
also need to fsync() the archive_status directory many times.  I'm not sure
how big of a problem this is.  It should be good to ensure that archiving
keeps up, and there are far more expensive tasks that the checkpointer must
perform that could make the added fsync() calls relatively insignificant.

I've attached a patch that makes the changes discussed above.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: "tanghy.fnst@fujitsu.com"
Date:
Subject: RE: Failed transaction statistics to measure the logical replication progress
Next
From: "osumi.takamichi@fujitsu.com"
Date:
Subject: RE: Failed transaction statistics to measure the logical replication progress