Re: Relation bulk write facility - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Relation bulk write facility
Date
Msg-id 20240701205250.62.nmisch@google.com
Whole thread Raw
In response to Re: Relation bulk write facility  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Relation bulk write facility
List pgsql-hackers
On Fri, Feb 23, 2024 at 04:27:34PM +0200, Heikki Linnakangas wrote:
> Committed this. Thanks everyone!

Commit 8af2565 wrote:
> --- /dev/null
> +++ b/src/backend/storage/smgr/bulk_write.c

> +/*
> + * Finish bulk write operation.
> + *
> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
> + * the relation if needed.
> + */
> +void
> +smgr_bulk_finish(BulkWriteState *bulkstate)
> +{
> +    /* WAL-log and flush any remaining pages */
> +    smgr_bulk_flush(bulkstate);
> +
> +    /*
> +     * When we wrote out the pages, we passed skipFsync=true to avoid the
> +     * overhead of registering all the writes with the checkpointer.  Register
> +     * the whole relation now.
> +     *
> +     * There is one hole in that idea: If a checkpoint occurred while we were
> +     * writing the pages, it already missed fsyncing the pages we had written
> +     * before the checkpoint started.  A crash later on would replay the WAL
> +     * starting from the checkpoint, therefore it wouldn't replay our earlier
> +     * WAL records.  So if a checkpoint started after the bulk write, fsync
> +     * the files now.
> +     */
> +    if (!SmgrIsTemp(bulkstate->smgr))
> +    {

Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
decision is irrelevant to the !wal case.  Either we don't need fsync at all
(TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).  I
don't see any functional problem, but this likely arranges for an unnecessary
sync when a checkpoint starts between mdcreate() and here.  (The mdcreate()
sync may also be unnecessary, but that's longstanding.)

> +        /*
> +         * Prevent a checkpoint from starting between the GetRedoRecPtr() and
> +         * smgrregistersync() calls.
> +         */
> +        Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> +        MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> +
> +        if (bulkstate->start_RedoRecPtr != GetRedoRecPtr())
> +        {
> +            /*
> +             * A checkpoint occurred and it didn't know about our writes, so
> +             * fsync() the relation ourselves.
> +             */
> +            MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +            smgrimmedsync(bulkstate->smgr, bulkstate->forknum);
> +            elog(DEBUG1, "flushed relation because a checkpoint occurred concurrently");
> +        }
> +        else
> +        {
> +            smgrregistersync(bulkstate->smgr, bulkstate->forknum);
> +            MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> +        }
> +    }
> +}

This is an elegant optimization.



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: WIP: parallel GiST index builds
Next
From: Heikki Linnakangas
Date:
Subject: Re: Relation bulk write facility