On Fri, Oct 6, 2023 at 9:18 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... DELAY_CHKPT_COMPLETE ...
>
> About that... If the lights go out after the truncation and the
> delayed logging of the checkpoint, how do we know the truncation has
> actually reached the disk? mdtruncate() enqueues fsync() calls, but
> if we were already in phase 2 (see proc.h) of a checkpoint at that
> moment, they might be processed by the *next* checkpoint, not the one
> whose phase 3 we've carefully delayed there, no?
I didn't look into this for very long so I might be missing something
here, but I think there could be at least one bad sequence. If you
insert a couple of sleeps to jinx the scheduling, and hack
RelationTruncate() to request a checkpoint at a carefully chosen wrong
moment (see attached), then:
postgres=# create table t (i int);
CREATE TABLE
postgres=# insert into t select 1 from generate_series(1, 100);
INSERT 0 100
postgres=# checkpoint; -- checkpoint #1 just puts some data on disk
CHECKPOINT
postgres=# delete from t;
DELETE 100
postgres=# vacuum freeze t; -- truncates, starts unlucky checkpoint #2
VACUUM
If you trace the checkpointer's system calls you will see that
base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
by checkpoint #2. The following checkpoint #3 might eventually do it,
but if the kernel loses power after checkpoint #2 completes and there
is no checkpoint #3, the kernel might forget the truncation, and yet
replay starts too late to redo it. I think that bad sequence looks
like this:
P1: log truncate
P2: choose redo LSN
P1: DropRelationBuffers()
P2: CheckPointBuffers()
P2: ProcessSyncRequests()
P1: ftruncate()
P1: RegisterSyncRequest()
P2: log checkpoint
*** system loses power ***
I realise it is a different problem than the one reported, but it's
close. My initial thought is that perhaps we shouldn't allow a redo
LSN to be chosen until the sync request is registered, which is also
fairly close to the critical section boundaries being discussed for
ftruncate() error case. But that's not a phase the checkpoint delay
machinery currently knows how to delay. And there may well be better
ways...