Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows - Mailing list pgsql-bugs
From | Noah Misch |
---|---|
Subject | Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows |
Date | |
Msg-id | 20250406181613.61.nmisch@google.com Whole thread Raw |
In response to | Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-bugs |
On Fri, Sep 06, 2024 at 11:05:26AM +1200, Thomas Munro wrote: > Subject: [PATCH v3 1/3] RelationTruncate() must set DELAY_CHKPT_START. > > Previously, it set only DELAY_CHKPT_COMPLETE. That was important, > because it meant that if the XLOG_SMGR_TRUNCATE record preceded a > XLOG_CHECKPOINT_ONLINE record in the WAL, then the truncation would also > happen on disk before the XLOG_CHECKPOINT_ONLINE record was > written. I read this commit (75818b3) with intense interest, mostly to see if it found a rule that commit 8e7e672 needs to follow and didn't (no). I did want to find or make an example of interleaved events where DELAY_CHKPT_COMPLETE is still necessary. Do any of you have such an example? Here were my unsuccessful attempts: ==== Attempt: from upthread On Wed, Oct 11, 2023 at 12:59:39PM -0400, Robert Haas wrote: > Suppose that RelationTruncate set both DELAY_CHKPT_START and > DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2 > could still choose the redo LSN after P1 logged the truncate, but it > wouldn't then be able to reach CheckPointBuffers() until after P1 had > reached RegisterSyncRequest. Note that setting *only* > DELAY_CHKPT_START isn't good enough, because then we can get this > history: > > P1: log truncate > P2: choose redo LSN > P2: CheckPointBuffers() > P1: DropRelationBuffers() > P2: ProcessSyncRequests() > P2: log checkpoint > *** system loses power *** If "choose redo LSN" happens at the point shown there, DELAY_CHKPT_START won't allow CheckPointBuffers() at the point shown there, so this interleaving doesn't happen. ==== Attempt: XLOG_SMGR_TRUNCATE+DropRelationBuffers() just after checkpoint waits for DELAY_CHKPT_START P2: choose redo LSN 1 P2: CheckPointBuffers() enter P1: DELAY_CHKPT_START enter P1: log truncate P1: DropRelationBuffers() P2: CheckPointBuffers() actual flushes P2: ProcessSyncRequests() P2: log checkpoint P2: choose redo LSN 2 (next checkpoint) *** system loses power, below were still in future *** P1: ftruncate() P1: RegisterSyncRequest() P1: DELAY_CHKPT_START exit DropRelationBuffers() made CheckPointBuffers() do less work, and the validity of the checkpoint is now tied to truncate happening eventually. However, if XLOG_CHECKPOINT_ONLINE reaches disk, that implies XLOG_SMGR_TRUNCATE reached disk first. Any recovery starting at or before LSN 1 will redo XLOG_SMGR_TRUNCATE. Data integrity is fine. ==== Attempt: replay finding "older contents than expected" A key RelationTruncate() code comment mentions this: * First, the truncation operation might drop buffers that the checkpoint * otherwise would have flushed. If it does, then it's essential that the * files actually get truncated on disk before the checkpoint record is * written. Otherwise, if replay begins from that checkpoint, the * to-be-truncated blocks might still exist on disk but have older * contents than expected, which can cause replay to fail. It's OK for the * blocks to not exist on disk at all, but not for them to have the wrong * contents. For this reason, we need to set DELAY_CHKPT_COMPLETE while * this code executes. However, I can't see how to make that happen. RelationTruncate() has AccessExclusiveLock on the relation, so no other WAL records for the truncated block range are happening while we hold DELAY_CHKPT_START. In the previous attempt, any WAL records for the truncated block range must be before XLOG_SMGR_TRUNCATE (such records tolerate older content) or after XLOG_CHECKPOINT_ONLINE (since RelationTruncate() held AccessExclusiveLock at least that late). ==== Attempt: all truncate steps just after XLOG_CHECKPOINT_REDO P1: DELAY_CHKPT_START enter P2: choose redo LSN P1: log truncate P1: DropRelationBuffers() P1: ftruncate() P1: RegisterSyncRequest() P1: DELAY_CHKPT_START exit P2: CheckPointBuffers() P2: ProcessSyncRequests() P2: log checkpoint *** system loses power *** Recovery sees the XLOG_CHECKPOINT_ONLINE, which implies it also sees and replays the XLOG_SMGR_TRUNCATE. Data integrity is fine. ==== Attempt: XLOG_SMGR_TRUNCATE before XLOG_CHECKPOINT_REDO P1: DELAY_CHKPT_START enter P1: log truncate P2: choose redo LSN P1: DropRelationBuffers() P1: ftruncate() P1: RegisterSyncRequest() P1: DELAY_CHKPT_START exit P2: CheckPointBuffers() P2: ProcessSyncRequests() P2: log checkpoint *** system loses power *** XLOG_SMGR_TRUNCATE precedes the redo point, so recovery finds no need to redo the XLOG_SMGR_TRUNCATE. Data integrity is fine.
pgsql-bugs by date: