Re: Unlogged relation copy is not fsync'd - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Unlogged relation copy is not fsync'd
Date
Msg-id 58effc10-c160-b4a6-4eb7-384e95e6f9e3@iki.fi
Whole thread Raw
In response to Re: Unlogged relation copy is not fsync'd  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Unlogged relation copy is not fsync'd
Re: Unlogged relation copy is not fsync'd
List pgsql-hackers
On 05/09/2023 21:20, Robert Haas wrote:
> In other words, somehow it feels like we ought to be trying to defer
> the fsync here until a clean shutdown actually occurs, instead of
> performing it immediately.

+1

> Admittedly, the bookkeeping seems like a problem, so maybe this is
> the best we can do, but it's clearly worse than what we do in other
> cases.
I think we can do it, I have been working on a patch along those lines 
on the side. But I want to focus on a smaller, backpatchable fix in this 
thread.


Thinking about this some more, I think this is still not 100% correct, 
even with the patch I posted earlier:

>     /*
>      * When we WAL-logged rel pages, we must nonetheless fsync them.  The
>      * reason is that since we're copying outside shared buffers, a CHECKPOINT
>      * occurring during the copy has no way to flush the previously written
>      * data to disk (indeed it won't know the new rel even exists).  A crash
>      * later on would replay WAL from the checkpoint, therefore it wouldn't
>      * replay our earlier WAL entries. If we do not fsync those pages here,
>      * they might still not be on disk when the crash occurs.
>      */
>     if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
>         smgrimmedsync(dst, forkNum);

Let's walk through each case one by one:

1. Temporary table. No fsync() needed. This is correct.

2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, 
and also because we bypassed the buffer manager. Correct.

3. Unlogged rel, init fork. Needs to be fsync'd, even though we 
WAL-logged it, because we bypassed the buffer manager. Like the comment 
explains. This is now correct with the patch.

4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we 
WAL-logged it, because we bypassed the buffer manager. Like the comment 
explains. Correct.

5. Permanent rel, use_wal == false. We skip fsync, because it means that 
it's a new relation, so we have a sync request pending for it. (An 
assertion for that would be nice). At the end of transaction, in 
smgrDoPendingSyncs(), we will either fsync it or we will WAL-log all the 
pages if it was a small relation. I believe this is *wrong*. If 
smgrDoPendingSyncs() decides to WAL-log the pages, we have the same race 
condition that's explained in the comment, because we bypassed the 
buffer manager:

1. RelationCopyStorage() copies some of the pages.
2. Checkpoint happens, which fsyncs the relation (smgrcreate() 
registered a dirty segment when the relation was created)
3. RelationCopyStorage() copies the rest of the pages.
4. smgrDoPendingSyncs() WAL-logs all the pages.
5. Another checkpoint happens. It does *not* fsync the relation.
6. Crash.

WAL replay will not see the WAL-logged pages, because they were 
WAL-logged before the last checkpoint. And the contents were not fsync'd 
either.

In other words, we must do smgrimmedsync() here for permanent relations, 
even if use_wal==false, because we bypassed the buffer manager. Same 
reason we need to do it with use_wal==true.

For a backportable fix, I think we should change this to only exempt 
temporary tables, and call smgrimmedsync() for all other cases. Maybe 
would be safe to skip it in some cases, but it feels too dangerous to be 
clever here. The other similar callers of smgrimmedsync() in nbtsort.c, 
gistbuild.c, and rewriteheap.c, need similar treatment.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Aleksander Alekseev
Date:
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing
Next
From: Nikita Malakhov
Date:
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing