Re: Unlogged relation copy is not fsync'd - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Unlogged relation copy is not fsync'd |
Date | |
Msg-id | 58effc10-c160-b4a6-4eb7-384e95e6f9e3@iki.fi Whole thread Raw |
In response to | Re: Unlogged relation copy is not fsync'd (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Unlogged relation copy is not fsync'd
Re: Unlogged relation copy is not fsync'd |
List | pgsql-hackers |
On 05/09/2023 21:20, Robert Haas wrote: > In other words, somehow it feels like we ought to be trying to defer > the fsync here until a clean shutdown actually occurs, instead of > performing it immediately. +1 > Admittedly, the bookkeeping seems like a problem, so maybe this is > the best we can do, but it's clearly worse than what we do in other > cases. I think we can do it, I have been working on a patch along those lines on the side. But I want to focus on a smaller, backpatchable fix in this thread. Thinking about this some more, I think this is still not 100% correct, even with the patch I posted earlier: > /* > * When we WAL-logged rel pages, we must nonetheless fsync them. The > * reason is that since we're copying outside shared buffers, a CHECKPOINT > * occurring during the copy has no way to flush the previously written > * data to disk (indeed it won't know the new rel even exists). A crash > * later on would replay WAL from the checkpoint, therefore it wouldn't > * replay our earlier WAL entries. If we do not fsync those pages here, > * they might still not be on disk when the crash occurs. > */ > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED) > smgrimmedsync(dst, forkNum); Let's walk through each case one by one: 1. Temporary table. No fsync() needed. This is correct. 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and also because we bypassed the buffer manager. Correct. 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged it, because we bypassed the buffer manager. Like the comment explains. This is now correct with the patch. 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we WAL-logged it, because we bypassed the buffer manager. Like the comment explains. Correct. 5. Permanent rel, use_wal == false. We skip fsync, because it means that it's a new relation, so we have a sync request pending for it. (An assertion for that would be nice). At the end of transaction, in smgrDoPendingSyncs(), we will either fsync it or we will WAL-log all the pages if it was a small relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to WAL-log the pages, we have the same race condition that's explained in the comment, because we bypassed the buffer manager: 1. RelationCopyStorage() copies some of the pages. 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a dirty segment when the relation was created) 3. RelationCopyStorage() copies the rest of the pages. 4. smgrDoPendingSyncs() WAL-logs all the pages. 5. Another checkpoint happens. It does *not* fsync the relation. 6. Crash. WAL replay will not see the WAL-logged pages, because they were WAL-logged before the last checkpoint. And the contents were not fsync'd either. In other words, we must do smgrimmedsync() here for permanent relations, even if use_wal==false, because we bypassed the buffer manager. Same reason we need to do it with use_wal==true. For a backportable fix, I think we should change this to only exempt temporary tables, and call smgrimmedsync() for all other cases. Maybe would be safe to skip it in some cases, but it feels too dangerous to be clever here. The other similar callers of smgrimmedsync() in nbtsort.c, gistbuild.c, and rewriteheap.c, need similar treatment. -- Heikki Linnakangas Neon (https://neon.tech)
pgsql-hackers by date: