Re: Unlogged relation copy is not fsync'd - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Unlogged relation copy is not fsync'd
Date
Msg-id 20230921062210.GA110358@rfd.leadboat.com
Whole thread Raw
In response to Re: Unlogged relation copy is not fsync'd  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Unlogged relation copy is not fsync'd
List pgsql-hackers
On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> On 05/09/2023 21:20, Robert Haas wrote:

> Thinking about this some more, I think this is still not 100% correct, even
> with the patch I posted earlier:
> 
> >    /*
> >     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> >     * reason is that since we're copying outside shared buffers, a CHECKPOINT
> >     * occurring during the copy has no way to flush the previously written
> >     * data to disk (indeed it won't know the new rel even exists).  A crash
> >     * later on would replay WAL from the checkpoint, therefore it wouldn't
> >     * replay our earlier WAL entries. If we do not fsync those pages here,
> >     * they might still not be on disk when the crash occurs.
> >     */
> >    if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> >        smgrimmedsync(dst, forkNum);
> 
> Let's walk through each case one by one:
> 
> 1. Temporary table. No fsync() needed. This is correct.
> 
> 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> also because we bypassed the buffer manager. Correct.

Agreed.

> 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> it, because we bypassed the buffer manager. Like the comment explains. This
> is now correct with the patch.

This has two subtypes:

3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
you wrote.

3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
AddPendingSync().  (RelationCreateStorage() could start calling
AddPendingSync() for this case.  I think we chose not to do that because there
will never be additional writes to the init fork, and smgrDoPendingSyncs()
would send the fork to FlushRelationsAllBuffers() even though the fork will
never appear in shared buffers.  On the other hand, grouping the sync with the
other end-of-xact syncs could improve efficiency for some filesystems.  Also,
the number of distinguishable cases is unpleasantly high.)

> 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> WAL-logged it, because we bypassed the buffer manager. Like the comment
> explains. Correct.
> 
> 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> it's a new relation, so we have a sync request pending for it. (An assertion
> for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> we will either fsync it or we will WAL-log all the pages if it was a small
> relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> WAL-log the pages, we have the same race condition that's explained in the
> comment, because we bypassed the buffer manager:
> 
> 1. RelationCopyStorage() copies some of the pages.
> 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> dirty segment when the relation was created)
> 3. RelationCopyStorage() copies the rest of the pages.
> 4. smgrDoPendingSyncs() WAL-logs all the pages.

smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
loads each page into the buffer manager and calls MarkBufferDirty() on each.
Hence, ...

> 5. Another checkpoint happens. It does *not* fsync the relation.

... I think this will fsync the relation.  No?

> 6. Crash.



pgsql-hackers by date:

Previous
From: "Lepikhov Andrei"
Date:
Subject: Re: Comment about set_join_pathlist_hook()
Next
From: Suraj Kharage
Date:
Subject: Re: pg_upgrade --check fails to warn about abstime