Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From Thomas Munro
Subject Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date
Msg-id CA+hUKG+-2rjGZC2kwqr2NMLBcEBp4uf59QT1advbWYF_uc+0Aw@mail.gmail.com
In response to Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-bugs
On Fri, Oct 6, 2023 at 9:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... DELAY_CHKPT_COMPLETE ...
>
> About that...  If the lights go out after the truncation and the
> delayed logging of the checkpoint, how do we know the truncation has
> actually reached the disk?  mdtruncate() enqueues fsync() calls, but
> if we were already in phase 2 (see proc.h) of a checkpoint at that
> moment, they might be processed by the *next* checkpoint, not the one
> whose phase 3 we've carefully delayed there, no?
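
For reference, the phase scheme from proc.h that the above refers to, in
rough paraphrase (see src/include/storage/proc.h for the authoritative
comments; this is not a verbatim excerpt):

/*
 * A checkpoint has three phases:
 *   phase 1: determine the redo LSN
 *   phase 2: write data durably to disk (CheckPointBuffers(),
 *            ProcessSyncRequests(), ...)
 *   phase 3: WAL-log the checkpoint record
 */
#define DELAY_CHKPT_START    (1 << 0)   /* blocks the move from 1 to 2 */
#define DELAY_CHKPT_COMPLETE (1 << 1)   /* blocks the move from 2 to 3 */

/* A backend delays a checkpoint by advertising a flag in its PGPROC: */
MyProc->delayChkptFlags |= DELAY_CHKPT_COMPLETE;
/* ... work that must finish before the checkpoint record is written ... */
MyProc->delayChkptFlags &= ~DELAY_CHKPT_COMPLETE;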

I didn't look into this for very long so I might be missing something
here, but I think there could be at least one bad sequence.  If you
insert a couple of sleeps to jinx the scheduling, and hack
RelationTruncate() to request a checkpoint at a carefully chosen wrong
moment (see attached), then:

postgres=# create table t (i int);
CREATE TABLE
postgres=# insert into t select 1 from generate_series(1, 100);
INSERT 0 100
postgres=# checkpoint; -- checkpoint #1 just puts some data on disk
CHECKPOINT
postgres=# delete from t;
DELETE 100
postgres=# vacuum freeze t; -- truncates, starts unlucky checkpoint #2
VACUUM
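
The attached patch is what actually does the jinxing; just to show its
shape, a purely hypothetical stand-in (not the attachment) would be to
wedge something like this into RelationTruncate(), after the truncation
has been WAL-logged but before the sync request is registered:

/* HYPOTHETICAL scheduling jinx, NOT the attached patch: wake the
 * checkpointer once the truncation is WAL-logged, then sleep long
 * enough for its phase 2 (ProcessSyncRequests()) to go by before
 * ftruncate()/RegisterSyncRequest() run. */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE);
pg_usleep(5000000L);        /* 5s, plenty of time for the checkpointer */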

If you trace the checkpointer's system calls you will see that
base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
by checkpoint #2.  The following checkpoint #3 might eventually do it,
but if the system loses power after checkpoint #2 completes and there
is no checkpoint #3, the kernel might forget the truncation, and yet
replay starts too late to redo it.  I think that bad sequence looks
like this:

P1: log truncate
P2:                        choose redo LSN
P1: DropRelationBuffers()
P2:                        CheckPointBuffers()
P2:                        ProcessSyncRequests()
P1: ftruncate()
P1: RegisterSyncRequest()
P2:                        log checkpoint
             *** system loses power ***
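
Spelled out against the code, P1's column corresponds roughly to this
ordering inside RelationTruncate() (a simplified sketch from my reading
of storage.c/md.c, not verbatim code):

MyProc->delayChkptFlags |= DELAY_CHKPT_COMPLETE; /* delays phase 3 only */
XLogInsert(...);             /* "log truncate"                          */
/* <-- P2 may choose its redo LSN here (phase 1)                        */
DropRelationBuffers(...);    /* discard buffers past the new EOF        */
/* <-- P2 may run CheckPointBuffers() and ProcessSyncRequests() here    */
/*     (phase 2); no sync request for this relation is queued yet       */
ftruncate(...);              /* via smgrtruncate() -> mdtruncate()      */
RegisterSyncRequest(...);    /* absorbed by the *next* checkpoint       */
MyProc->delayChkptFlags &= ~DELAY_CHKPT_COMPLETE;

DELAY_CHKPT_COMPLETE only holds back the phase 2 -> 3 transition, so
nothing in that window stops phase 2 itself from running in the middle
of it.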

I realise it is a different problem from the one reported, but it's
close.  My initial thought is that perhaps we shouldn't allow a redo
LSN to be chosen until the sync request is registered, which is also
fairly close to the critical section boundaries being discussed for
the ftruncate() error case.  But that's not a phase the checkpoint delay
machinery currently knows how to delay.  And there may well be better
ways...
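
For concreteness, the shape I have in mind would be a third delay flag,
sketched here under the assumption that CreateCheckPoint() learns to
wait on it before computing the redo LSN; DELAY_CHKPT_REDO is a made-up
name and nothing like it exists in the tree today:

#define DELAY_CHKPT_REDO (1 << 2)   /* hypothetical */

MyProc->delayChkptFlags |= DELAY_CHKPT_REDO;
XLogInsert(...);        /* log the truncation                  */
smgrtruncate(...);      /* ftruncate() + RegisterSyncRequest() */
MyProc->delayChkptFlags &= ~DELAY_CHKPT_REDO;

That would make the span from "log truncate" to RegisterSyncRequest()
atomic as far as redo LSN selection is concerned, and the flag could be
cleared at roughly the point where the proposed critical section would
end.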

Attachment
