Re: Relation bulk write facility - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Relation bulk write facility
Date
Msg-id cac7d1b6-8358-40be-af0b-21bc9b27d34c@iki.fi
Whole thread Raw
In response to Re: Relation bulk write facility  (Noah Misch <noah@leadboat.com>)
Responses Re: Relation bulk write facility
List pgsql-hackers
On 02/07/2024 02:24, Noah Misch wrote:
> On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote:
>> On 01/07/2024 23:52, Noah Misch wrote:
>>> Commit 8af2565 wrote:
>>>> --- /dev/null
>>>> +++ b/src/backend/storage/smgr/bulk_write.c
>>>
>>>> +/*
>>>> + * Finish bulk write operation.
>>>> + *
>>>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs
>>>> + * the relation if needed.
>>>> + */
>>>> +void
>>>> +smgr_bulk_finish(BulkWriteState *bulkstate)
>>>> +{
>>>> +    /* WAL-log and flush any remaining pages */
>>>> +    smgr_bulk_flush(bulkstate);
>>>> +
>>>> +    /*
>>>> +     * When we wrote out the pages, we passed skipFsync=true to avoid the
>>>> +     * overhead of registering all the writes with the checkpointer.  Register
>>>> +     * the whole relation now.
>>>> +     *
>>>> +     * There is one hole in that idea: If a checkpoint occurred while we were
>>>> +     * writing the pages, it already missed fsyncing the pages we had written
>>>> +     * before the checkpoint started.  A crash later on would replay the WAL
>>>> +     * starting from the checkpoint, therefore it wouldn't replay our earlier
>>>> +     * WAL records.  So if a checkpoint started after the bulk write, fsync
>>>> +     * the files now.
>>>> +     */
>>>> +    if (!SmgrIsTemp(bulkstate->smgr))
>>>> +    {
>>>
>>> Shouldn't this be "if (bulkstate->use_wal)"?  The GetRedoRecPtr()-based
>>> decision is irrelevant to the !wal case.  Either we don't need fsync at all
>>> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal).
>>
>> The point of GetRedoRecPtr() is to detect if a checkpoint has started
>> concurrently. It works for that purpose whether or not the bulk load is
>> WAL-logged. It is not compared with the LSNs of WAL records written by the
>> bulk load.
> 
> I think the significance of start_RedoRecPtr is it preceding all records
> needed to recreate the bulk write.  If start_RedoRecPtr==GetRedoRecPtr() and
> we crash after commit, we're indifferent to whether the rel gets synced at a
> checkpoint before that crash or rebuilt from WAL after that crash.  If
> start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already
> deleted, so only smgrimmedsync() suffices.  Overall, while it is not compared
> with LSNs in WAL records, it's significant only to the extent that such a WAL
> record exists.  What am I missing?

You're right. You pointed out below that we don't need to register or 
immediately fsync the relation if it was not WAL-logged, I missed that.

In the alternative universe that we did need to fsync() even in !use_wal 
case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect 
whether the last checkpoint "missed" fsyncing the files that we wrote. 
But the point is moot now.

>> Unlogged tables do need to be fsync'd. The scenario is:
>>
>> 1. Bulk load an unlogged table.
>> 2. Shut down Postgres cleanly
>> 3. Pull power plug from server, and restart.
>>
>> We talked about this earlier in the "Unlogged relation copy is not fsync'd"
>> thread [1]. I had already forgotten about that; that bug actually still
>> exists in back branches, and we should fix it..
>>
>> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi
> 
> Ah, that's right.  I agree this code suffices for unlogged.  As a further
> optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and
> always call smgrregistersync().  (For any rel, smgrimmedsync() improves on
> smgrregistersync() only if we fail to reach the shutdown checkpoint.  Without
> a shutdown checkpoint, unlogged rels get reset anyway.)
> 
>>> I don't see any functional problem, but this likely arranges for an
>>> unnecessary sync when a checkpoint starts between mdcreate() and
>>> here.  (The mdcreate() sync may also be unnecessary, but that's
>>> longstanding.)
>> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It
>> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the
>> fsync, if it knew that smgrDoPendingSyncs() will do it later. However,
>> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of
>> fsyncing it, and in that case we do still need the fsync.
> 
> We do not need the fsync in the "WAL-log the relation instead" case; see
> https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com

Ah, true, I missed that log_newpage_range() loads the pages to the 
buffer cache and dirties them. That kinds of sucks actually, I wish it 
didn't need to dirty the buffers.

> So maybe like this:
> 
>    if (use_wal) /* includes init forks */
>      current logic;
>    else if (unlogged)
>      smgrregistersync;
>    /* else temp || (permanent && wal_level=minimal): nothing to do */

Makes sense, except that we cannot distinguish between unlogged 
relations and permanent relations with !use_wal here.

It would be nice to have relpersistence flag in SMgrRelation. I remember 
wanting to have that before, although I don't remember what the context 
was exactly.

>> Fortunately, fsync() on a file that's already flushed to disk is pretty
>> cheap.
> 
> Yep.  I'm more concerned about future readers wondering why the function is
> using LSNs to decide what to do about data that doesn't appear in WAL.  A
> comment could be another way to fix that, though.

Agreed, this is all very subtle, and deserves a good comment. What do 
you think of the attached?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)
Next
From: Dean Rasheed
Date:
Subject: Re: Optimize numeric multiplication for one and two base-NBASE digit multiplicands.