Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows - Mailing list pgsql-bugs

From Noah Misch
Subject Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date
Msg-id 20250406181613.61.nmisch@google.com
Whole thread Raw
In response to Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On Fri, Sep 06, 2024 at 11:05:26AM +1200, Thomas Munro wrote:
> Subject: [PATCH v3 1/3] RelationTruncate() must set DELAY_CHKPT_START.
> 
> Previously, it set only DELAY_CHKPT_COMPLETE. That was important,
> because it meant that if the XLOG_SMGR_TRUNCATE record preceded a
> XLOG_CHECKPOINT_ONLINE record in the WAL, then the truncation would also
> happen on disk before the XLOG_CHECKPOINT_ONLINE record was
> written.

I read this commit (75818b3) with intense interest, mostly to see if it found
a rule that commit 8e7e672 needs to follow and didn't (no).  I did want to
find or make an example of interleaved events where DELAY_CHKPT_COMPLETE is
still necessary.  Do any of you have such an example?  Here were my
unsuccessful attempts:

==== Attempt: from upthread

On Wed, Oct 11, 2023 at 12:59:39PM -0400, Robert Haas wrote:
> Suppose that RelationTruncate set both DELAY_CHKPT_START and
> DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2
> could still choose the redo LSN after P1 logged the truncate, but it
> wouldn't then be able to reach CheckPointBuffers() until after P1 had
> reached RegisterSyncRequest. Note that setting *only*
> DELAY_CHKPT_START isn't good enough, because then we can get this
> history:
> 
> P1: log truncate
> P2:                        choose redo LSN
> P2:                        CheckPointBuffers()
> P1: DropRelationBuffers()
> P2:                        ProcessSyncRequests()
> P2:                        log checkpoint
>               *** system loses power ***

If "choose redo LSN" happens at the point shown there, DELAY_CHKPT_START won't
allow CheckPointBuffers() at the point shown there, so this interleaving
doesn't happen.


==== Attempt: XLOG_SMGR_TRUNCATE+DropRelationBuffers() just after checkpoint waits for DELAY_CHKPT_START

P2:                        choose redo LSN 1
P2:                        CheckPointBuffers() enter
P1: DELAY_CHKPT_START enter
P1: log truncate
P1: DropRelationBuffers()
P2:                        CheckPointBuffers() actual flushes
P2:                        ProcessSyncRequests()
P2:                        log checkpoint
P2:                        choose redo LSN 2 (next checkpoint)
              *** system loses power, below were still in future ***
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit

DropRelationBuffers() made CheckPointBuffers() do less work, and the validity
of the checkpoint is now tied to truncate happening eventually.  However, if
XLOG_CHECKPOINT_ONLINE reaches disk, that implies XLOG_SMGR_TRUNCATE reached
disk first.  Any recovery starting at or before LSN 1 will redo
XLOG_SMGR_TRUNCATE.  Data integrity is fine.


==== Attempt: replay finding "older contents than expected"

A key RelationTruncate() code comment mentions this:

     * First, the truncation operation might drop buffers that the checkpoint
     * otherwise would have flushed. If it does, then it's essential that the
     * files actually get truncated on disk before the checkpoint record is
     * written. Otherwise, if replay begins from that checkpoint, the
     * to-be-truncated blocks might still exist on disk but have older
     * contents than expected, which can cause replay to fail. It's OK for the
     * blocks to not exist on disk at all, but not for them to have the wrong
     * contents. For this reason, we need to set DELAY_CHKPT_COMPLETE while
     * this code executes.

However, I can't see how to make that happen.  RelationTruncate() has
AccessExclusiveLock on the relation, so no other WAL records for the truncated
block range are happening while we hold DELAY_CHKPT_START.  In the previous
attempt, any WAL records for the truncated block range must be before
XLOG_SMGR_TRUNCATE (such records tolerate older content) or after
XLOG_CHECKPOINT_ONLINE (since RelationTruncate() held AccessExclusiveLock at
least that late).


==== Attempt: all truncate steps just after XLOG_CHECKPOINT_REDO

P1: DELAY_CHKPT_START enter
P2:                        choose redo LSN
P1: log truncate
P1: DropRelationBuffers()
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit
P2:                        CheckPointBuffers()
P2:                        ProcessSyncRequests()
P2:                        log checkpoint
              *** system loses power ***

Recovery sees the XLOG_CHECKPOINT_ONLINE, which implies it also sees and
replays the XLOG_SMGR_TRUNCATE.  Data integrity is fine.


==== Attempt: XLOG_SMGR_TRUNCATE before XLOG_CHECKPOINT_REDO

P1: DELAY_CHKPT_START enter
P1: log truncate
P2:                        choose redo LSN
P1: DropRelationBuffers()
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit
P2:                        CheckPointBuffers()
P2:                        ProcessSyncRequests()
P2:                        log checkpoint
              *** system loses power ***

XLOG_SMGR_TRUNCATE precedes the redo point, so recovery finds no need to redo
the XLOG_SMGR_TRUNCATE.  Data integrity is fine.



pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18880: Can't have a database without having postgres as a db_name and username
Next
From: "Daniel Westermann (DWE)"
Date:
Subject: Re: BUG #18880: Can't have a database without having postgres as a db_name and username