Thread: Corruption during WAL replay

Corruption during WAL replay

From
Teja Mupparti
Date:

This is my *first* attempt to submit a Postgres patch, please let me know if I missed any process or format of the patch (I used this link https://wiki.postgresql.org/wiki/Working_with_Git As reference)

 

The original bug reporting-email and the relevant discussion is here

 

https://www.postgresql.org/message-id/822113470.250068.1573246011818%40connect.xfinity.com

https://www.postgresql.org/message-id/20191206230640.2dvdjpcgn46q3ks2%40alap3.anarazel.de

https://www.postgresql.org/message-id/1880.1281020817@sss.pgh.pa.us

 

The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.

 

Warm regards,

Teja

Attachment

Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
Thanks for working on this.

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in 
> This is my *first* attempt to submit a Postgres patch, please let me know if I missed any process or format of the
patch
 

Welcome! The format looks fine to me. It would be better if it had a
commit message that explains what the patch does. (in the format that
git format-patch emits.)

> The original bug reporting-email and the relevant discussion is here
...
> The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before
thetruncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and
dropthe buffer later.
 

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I'm not sure it's acceptable to remember all to-be-deleted buffers
while truncation.

+     /*START_CRIT_SECTION();*/

Is this a point of argument?  It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

However, note that md truncates a "file" a non-atomic way. mdtruncate
truncates multiple files from the last segment toward the
beginning. If mdtruncate successfully truncated the first several
segments then failed, retaining all buffers triggers assertion failure
in mdwrite while buffer flush.


Some typos found:

+     * a backround task might flush them to the disk right after we

s/backround/background/

+ *        saved list of buffers that were marked as BM_IO_IN_PRGRESS just
s/BM_IO_IN_PRGRESS/BM_IO_IN_PROGRESS/

+ * as BM_IO_IN_PROGRES. Though the buffers are marked for IO, they
s/BM_IO_IN_PROGRES/BM_IO_IN_PROGRESS/

+ * being dicarded).
s/dicarded/discarded/

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in 
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash
beforethe truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file
anddrop the buffer later.
 
> 
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


> 
> +     /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
> 
> [1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in 
> > > The original bug reporting-email and the relevant discussion is here
> > ...
> > > The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash
beforethe truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file
anddrop the buffer later.
 
> > 
> > BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.
> 
> I don't think that's true. For any of this to be relevant the buffer has
> to be dirty. In which case BufferAlloc() has to call
> FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
> set.
> 
> What path are you thinking of? Or alternatively, what am I missing?

# I would be wrong with far low odds..

"doesn't" is overstated. Is there a case where the buffer is already
flushed by checkpoint? (If that is the case, dropping clean buffers at
marking truncate would work?)

> > I'm not sure it's acceptable to remember all to-be-deleted buffers
> > while truncation.
> 
> I don't see a real problem with it. Nor really a good alternative. Note
> that for autovacuum truncations we'll only truncate a limited number of
> buffers at once, and for most relation truncations we don't enter this
> path (since we create a new relfilenode instead).

Thank you for the opinion. I agree to that.

> > +     /*START_CRIT_SECTION();*/
> 
> > Is this a point of argument?  It is not needed if we choose the
> > strategy (c) in [1], since the protocol is aiming to allow server to
> > continue running after truncation failure.
> > 
> > [1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> 
> I think it's entirely broken to continue running after a truncation
> failure. We obviously have to first WAL log the truncation (since
> otherwise we can crash just after doing the truncation). But we cannot
> just continue running after WAL logging, but not performing the
> associated action: The most obvious reason is that otherwise a replica
> will execute the trunction, but the primary will not.

Hmm. If we allow PANIC on truncation failure why do we need to go on
the complicated steps?  Wouldn't it enough to enclose the sequence
(WAL insert - drop buffers - truncate) in a critical section? I
believed that this project aims to fix the db-breakage on truncation
failure by allowing rollback on truncation failure?

> The whole justification for that behaviour "It would turn a usually
> harmless failure to truncate, that might spell trouble at WAL replay,
> into a certain PANIC." was always dubious (since on-disk and in-memory
> state now can diverge), but it's clearly wrong once replication had
> entered the picture. There's just no alternative to a critical section
> here.

Yeah, I like that direction.

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Teja Mupparti
Date:
Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

Regards
Teja




From: Andres Freund <andres@anarazel.de>
Sent: Monday, March 30, 2020 4:31 PM
To: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Cc: tejeswarm@hotmail.com <tejeswarm@hotmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; hexexpert@comcast.net <hexexpert@comcast.net>
Subject: Re: Corruption during WAL replay
 
Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.
>
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


>
> +      /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
>
> [1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund
Attachment

Re: Corruption during WAL replay

From
Alvaro Herrera
Date:
On 2020-Mar-30, Andres Freund wrote:

> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on.  I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (causing WAL replay) seems better than the current behavior.
I'd rather put the server to wait until storage is really back.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2020-04-10 20:49:05 -0400, Alvaro Herrera wrote:
> On 2020-Mar-30, Andres Freund wrote:
> 
> > If we are really concerned with truncation failing - I don't know why we
> > would be, we accept that we have to be able to modify files etc to stay
> > up - we can add a pre-check ensuring that permissions are set up
> > appropriately to allow us to truncate.
> 
> I remember I saw a case where the datadir was NFS or some other network
> filesystem thingy, and it lost connection just before autovacuum
> truncation, or something like that -- so there was no permission
> failure, but the truncate failed and yet PG soldiered on.  I think the
> connection was re-established soon thereafter and things went back to
> "normal", with nobody realizing that a truncate had been lost.
> Corruption was discovered a long time afterwards IIRC (weeks or months,
> I don't remember).

Yea. In that case we're in a really bad state. Because we truncate after
throwing away the old buffer contents (even if dirty), we'll later read
page contents "from the past". Which won't end well...

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Masahiko Sawada
Date:
On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
>
> Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and also included the critical section
(emulatedit with try-catch block since palloc()s are causing issues in the truncate code). This time I used git
format-patch.
>

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

/*
 * We WAL-log the truncation before actually truncating, which means
 * trouble if the truncation fails. If we then crash, the WAL replay
 * likely isn't going to succeed in the truncation either, and cause a
 * PANIC. It's tempting to put a critical section here, but that cure
 * would be worse than the disease. It would turn a usually harmless
 * failure to truncate, that might spell trouble at WAL replay, into a
 * certain PANIC.
 */

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

In the above idea, remembering all buffers having to-be-truncated
table at RelationTruncate(), we reduce the time for checking buffers
at the commit time. Since we acquire AccessExclusiveLock the number of
buffers having to-be-truncated table's data never increases. A
downside would be that since we can truncate multiple relations we
need to remember all buffers of each truncated relations, which is up
to (sizeof(int) * NBuffers) in total.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
> >
> > Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and also included the critical section
(emulatedit with try-catch block since palloc()s are causing issues in the truncate code). This time I used git
format-patch.
> >
> 
> I briefly looked at the latest patch but I'm not sure it's the right
> thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> example, with the following code you changed, we will always end up
> with emitting a PANIC "failed to truncate the relation" regardless of
> the actual cause of the error.
> 
> +   PG_CATCH();
> +   {
> +       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> +           errmsg("failed to truncate the relation")));
> +   }
> +   PG_END_TRY();
> 
> And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.


> /*
>  * We WAL-log the truncation before actually truncating, which means
>  * trouble if the truncation fails. If we then crash, the WAL replay
>  * likely isn't going to succeed in the truncation either, and cause a
>  * PANIC. It's tempting to put a critical section here, but that cure
>  * would be worse than the disease. It would turn a usually harmless
>  * failure to truncate, that might spell trouble at WAL replay, into a
>  * certain PANIC.
>  */

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to to primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...


> As a second idea, I wonder if we can defer truncation until commit
> time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).


> At RelationTruncate(),
> 1. WAL logging.
> 2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.


> At CommitTransaction(),
> 3. Revisit the remembered buffers to check if the buffer still has
> table data that needs to be truncated.
> 4-a, If it has, we mark it as IO_IN_PROGRESS.
> 4-b, If it already has different table data, ignore it.
> 5, Truncate physical files.
> 6, Mark the buffer we marked at #4-a as invalid.
> 
> If an error occurs between #3 and #6 or in abort case, we revert all
> IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Masahiko Sawada
Date:
On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
> > >
> > > Thanks Andres and Kyotaro for the quick review.  I have fixed the typos and also included the critical section
(emulatedit with try-catch block since palloc()s are causing issues in the truncate code). This time I used git
format-patch.
> > >
> >
> > I briefly looked at the latest patch but I'm not sure it's the right
> > thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
> > example, with the following code you changed, we will always end up
> > with emitting a PANIC "failed to truncate the relation" regardless of
> > the actual cause of the error.
> >
> > +   PG_CATCH();
> > +   {
> > +       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
> > +           errmsg("failed to truncate the relation")));
> > +   }
> > +   PG_END_TRY();
> >
> > And the comments of RelationTruncate() mentions:
>
> I think that's just a workaround for mdtruncate not being usable in
> critical sections.
>
>
> > /*
> >  * We WAL-log the truncation before actually truncating, which means
> >  * trouble if the truncation fails. If we then crash, the WAL replay
> >  * likely isn't going to succeed in the truncation either, and cause a
> >  * PANIC. It's tempting to put a critical section here, but that cure
> >  * would be worse than the disease. It would turn a usually harmless
> >  * failure to truncate, that might spell trouble at WAL replay, into a
> >  * certain PANIC.
> >  */
>
> Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> log something and then not perform the action. This leads to to primary
> / replica getting out of sync, crash recovery potentially not completing
> (because of records referencing the should-be-truncated pages), ...
>
>
> > As a second idea, I wonder if we can defer truncation until commit
> > time like smgrDoPendingDeletes mechanism. The sequence would be:
>
> This is mostly an issue during [auto]vacuum partially truncating the end
> of the file. We intentionally release the AEL regularly to allow other
> accesses to continue.
>
> For transactional truncations we don't go down this path (as we create a
> new relfilenode).
>
>
> > At RelationTruncate(),
> > 1. WAL logging.
> > 2. Remember buffers to be dropped.
>
> You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Regards,

[1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.deDoing
sync before truncation


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
> > > /*
> > >  * We WAL-log the truncation before actually truncating, which means
> > >  * trouble if the truncation fails. If we then crash, the WAL replay
> > >  * likely isn't going to succeed in the truncation either, and cause a
> > >  * PANIC. It's tempting to put a critical section here, but that cure
> > >  * would be worse than the disease. It would turn a usually harmless
> > >  * failure to truncate, that might spell trouble at WAL replay, into a
> > >  * certain PANIC.
> > >  */
> >
> > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> > log something and then not perform the action. This leads to to primary
> > / replica getting out of sync, crash recovery potentially not completing
> > (because of records referencing the should-be-truncated pages), ...

It is introduced in 2008 by 3396000684, for 8.4.  So it can be said as
an overlook when introducing log-shipping.

The reason other operations like INSERTs (that extends the underlying
file) are "safe" after an extension failure is the following
operations are performed in shared buffers as if the new page exists,
then tries to extend the file again.  So if we continue working after
truncation failure, we need to disguise on shared buffers as if the
truncated pages are gone.  But we don't have a room for another flag
in buffer header.  For example, BM_DIRTY && !BM_VALID might be able to
be used as the state that the page should have been truncated but not
succeeded yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far hopeless
than extension failure in most cases and I doubt that it's good to
introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

> > > As a second idea, I wonder if we can defer truncation until commit
> > > time like smgrDoPendingDeletes mechanism. The sequence would be:
> >
> > This is mostly an issue during [auto]vacuum partially truncating the end
> > of the file. We intentionally release the AEL regularly to allow other
> > accesses to continue.
> >
> > For transactional truncations we don't go down this path (as we create a
> > new relfilenode).
> >
> >
> > > At RelationTruncate(),
> > > 1. WAL logging.
> > > 2. Remember buffers to be dropped.
> >
> > You definitely cannot do that, as explained above.
> 
> Ah yes, you're right.
> 
> So it seems to me currently what we can do for this issue would be to
> enclose the truncation operation in a critical section. IIUC it's not
> enough just to reverse the order of dropping buffers and physical file
> truncation because it cannot solve the problem of inconsistency on the
> standby. And as Horiguchi-san mentioned, there is no need to reverse
> that order if we envelop the truncation operation by a critical
> section because we can recover page changes during crash recovery. The
> strategy of writing out all dirty buffers before dropping buffers,
> proposed as (a) in [1], also seems not enough.

Agreed.  Since it's not acceptable ether WAL-logging->not-performing
nor performing->WAL-logging, there's no other way than working as if
truncation is succeeded (and try again) even if it actually
failed. But it would be too complex.

Just making it a critical section seems the right thing here.


> [1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> Doing sync before truncation

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Teja Mupparti
Date:
Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of the dropping the buffers and the truncation.

 Repro details (when full page write = off)

         1) Page on disk has empty LP 1, Insert into page LP 1
         2) checkpoint START (Recovery REDO eventually starts here)
         3) Delete all rows on the page (page is empty now)
         4) Autovacuum kicks in and truncates the pages
                 DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
         5) Checkpoint completes
         6) Crash
         7) smgrtruncate - Not reached (this is where we do the physical truncate)

 Now the crash-recovery starts

         Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

Doing recovery, truncate is even not reached, a WAL replay of the truncation will happen in the future but the recovery fails (repeatedly) even before reaching that point.

Best regards,
Teja


From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Monday, April 13, 2020 7:35 PM
To: masahiko.sawada@2ndquadrant.com <masahiko.sawada@2ndquadrant.com>
Cc: andres@anarazel.de <andres@anarazel.de>; tejeswarm@hotmail.com <tejeswarm@hotmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; hexexpert@comcast.net <hexexpert@comcast.net>
Subject: Re: Corruption during WAL replay
 
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
> > > /*
> > >  * We WAL-log the truncation before actually truncating, which means
> > >  * trouble if the truncation fails. If we then crash, the WAL replay
> > >  * likely isn't going to succeed in the truncation either, and cause a
> > >  * PANIC. It's tempting to put a critical section here, but that cure
> > >  * would be worse than the disease. It would turn a usually harmless
> > >  * failure to truncate, that might spell trouble at WAL replay, into a
> > >  * certain PANIC.
> > >  */
> >
> > Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
> > log something and then not perform the action. This leads to to primary
> > / replica getting out of sync, crash recovery potentially not completing
> > (because of records referencing the should-be-truncated pages), ...

It is introduced in 2008 by 3396000684, for 8.4.  So it can be said as
an overlook when introducing log-shipping.

The reason other operations like INSERTs (that extends the underlying
file) are "safe" after an extension failure is the following
operations are performed in shared buffers as if the new page exists,
then tries to extend the file again.  So if we continue working after
truncation failure, we need to disguise on shared buffers as if the
truncated pages are gone.  But we don't have a room for another flag
in buffer header.  For example, BM_DIRTY && !BM_VALID might be able to
be used as the state that the page should have been truncated but not
succeeded yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far hopeless
than extension failure in most cases and I doubt that it's good to
introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

> > > As a second idea, I wonder if we can defer truncation until commit
> > > time like smgrDoPendingDeletes mechanism. The sequence would be:
> >
> > This is mostly an issue during [auto]vacuum partially truncating the end
> > of the file. We intentionally release the AEL regularly to allow other
> > accesses to continue.
> >
> > For transactional truncations we don't go down this path (as we create a
> > new relfilenode).
> >
> >
> > > At RelationTruncate(),
> > > 1. WAL logging.
> > > 2. Remember buffers to be dropped.
> >
> > You definitely cannot do that, as explained above.
>
> Ah yes, you're right.
>
> So it seems to me currently what we can do for this issue would be to
> enclose the truncation operation in a critical section. IIUC it's not
> enough just to reverse the order of dropping buffers and physical file
> truncation because it cannot solve the problem of inconsistency on the
> standby. And as Horiguchi-san mentioned, there is no need to reverse
> that order if we envelop the truncation operation by a critical
> section because we can recover page changes during crash recovery. The
> strategy of writing out all dirty buffers before dropping buffers,
> proposed as (a) in [1], also seems not enough.

Agreed.  Since it's not acceptable ether WAL-logging->not-performing
nor performing->WAL-logging, there's no other way than working as if
truncation is succeeded (and try again) even if it actually
failed. But it would be too complex.

Just making it a critical section seems the right thing here.


> [1] https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de
> Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Corruption during WAL replay

From
Masahiko Sawada
Date:
On Wed, 15 Apr 2020 at 04:04, Teja Mupparti <tejeswarm@hotmail.com> wrote:
>
> Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate,
butI just want to emphasize the need for reversing the order of the dropping the buffers and the truncation.
 
>
>  Repro details (when full page write = off)
>
>          1) Page on disk has empty LP 1, Insert into page LP 1
>          2) checkpoint START (Recovery REDO eventually starts here)
>          3) Delete all rows on the page (page is empty now)
>          4) Autovacuum kicks in and truncates the pages
>                  DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
>          5) Checkpoint completes
>          6) Crash
>          7) smgrtruncate - Not reached (this is where we do the physical truncate)
>
>  Now the crash-recovery starts
>
>          Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on
diskwith no insert)
 
>

I agree that when replaying the deletion of (3) the page LP 1 is
empty, but does that replay really fail with PANIC? I guess that we
record that page into invalid_page_tab but don't raise a PANIC in this
case.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Wait profiling

From
Daniel Wood
Date:
After nearly 5 years does something like the following yet exist?

I feel that it would be useful to have the following two things.  One PG enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

62.6% BufFileRead
50.0% CPU
  9.3% LWLock

It uses a PG timer to do this.

2) An extension based function like: select pg_wait_profile(pid, nSeconds, timerFrequency) to return the same thing for an already running query.  Useful if you want examine some already long running query that is taking too long.

Neither of these would be doing the heavy weight pg_stat_activity but directly poll the wait event in PROC.  I've already coded the EXPLAIN option.

Furthermore, can't we just remove the following "IF" test from pgstat_report_wait_{start,end}?
if (!pgstat_track_activities || !proc)
    return;
Just do the assignment of wait_event_info always.  We should use a dummy PGPROC assigned to MyProc until we assign the one in the procarray in shared memory.  That way we don't need the "!proc" test.
About the only thing I'd want to verify is whether wait_event_info is on the same cache lines as anything else having to do with snapshots.

If I recall correctly the blanks lines above I've used to make this more readable will disappear.  :-(

- Dan Wood

Re: Wait profiling

From
Alvaro Herrera
Date:
On 2020-Jul-10, Daniel Wood wrote:

> After nearly 5 years does something like the following yet exist?
> https://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_events which implement pretty much
what Ildus describes there.

> 1) An option to "explain" to produce a wait events profile.
> postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
> ...
> Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling.  I
haven't tested it yet ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Wait profiling

From
Julien Rouhaud
Date:
On Fri, Jul 10, 2020 at 10:37 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> On 2020-Jul-10, Daniel Wood wrote:
>
> > After nearly 5 years does something like the following yet exist?
> > https://www.postgresql.org/message-id/559D4729.9080704@postgrespro.ru
>
> Yes, we have pg_stat_activity.wait_events which implement pretty much
> what Ildus describes there.
>
> > 1) An option to "explain" to produce a wait events profile.
> > postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
> > ...
> > Execution time: 23111.231 ms
>
> There's an out-of-core extension, search for pg_wait_sampling.  I
> haven't tested it yet ...

I use it, and I know multiple people that are also using it (or about
to, it's currently being packaged) in production.  It's working quite
well and is compatible with pg_stat_statements' queryid.  You can see
some examples of dashboards that can be built on top of this extension
at https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html.



Re: Corruption during WAL replay

From
Heikki Linnakangas
Date:
On 14/04/2020 22:04, Teja Mupparti wrote:
> Thanks Kyotaro and Masahiko for the feedback. I think there is a 
> consensus on the critical-section around truncate,

+1

> but I just want to emphasize the need for reversing the order of the
> dropping the buffers and the truncation.
> 
>   Repro details (when full page write = off)
> 
>           1) Page on disk has empty LP 1, Insert into page LP 1
>           2) checkpoint START (Recovery REDO eventually starts here)
>           3) Delete all rows on the page (page is empty now)
>           4) Autovacuum kicks in and truncates the pages
>                   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 
> on disk still empty
>           5) Checkpoint completes
>           6) Crash
>           7) smgrtruncate - Not reached (this is where we do the 
> physical truncate)
> 
>   Now the crash-recovery starts
> 
>           Delete-log-replay (above step-3) reads page with empty LP 1 
> and the delete fails with PANIC (old page on disk with no insert)
> 
> Doing recovery, truncate is even not reached, a WAL replay of the 
> truncation will happen in the future but the recovery fails (repeatedly) 
> even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and 
truncating the file would open a different issue:

   1) Page on disk has empty LP 1, Insert into page LP 1
   2) checkpoint START (Recovery REDO eventually starts here)
   3) Delete all rows on the page (page is empty now)
   4) Autovacuum kicks in and starts truncating
   5) smgrtruncate() truncates the file
   6) checkpoint writes out buffers for pages that were just truncated 
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before 
truncating the file to fix that, but I'm wary of that approach. Firstly, 
it requires scanning the buffers that are dropped twice, which can take 
a long time. I remember that people have already complained that 
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers 
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems 
bad. For starters, because you're not holding buffer's I/O lock, I 
believe the checkpointer would busy-wait on the buffers until the 
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from 
completing, until all in-progress truncations have completed. We have a 
mechanism to wait out in-progress commits at the beginning of a 
checkpoint, right after the redo point has been established. See 
comments around the GetVirtualXIDsDelayingChkpt() function call in 
CreateCheckPoint(). We could have a similar mechanism to wait out the 
truncations before *completing* a checkpoint.

- Heikki



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> On 14/04/2020 22:04, Teja Mupparti wrote:
> > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > consensus on the critical-section around truncate,
> 
> +1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.


> > but I just want to emphasize the need for reversing the order of the
> > dropping the buffers and the truncation.
> > 
> >   Repro details (when full page write = off)
> > 
> >           1) Page on disk has empty LP 1, Insert into page LP 1
> >           2) checkpoint START (Recovery REDO eventually starts here)
> >           3) Delete all rows on the page (page is empty now)
> >           4) Autovacuum kicks in and truncates the pages
> >                   DropRelFileNodeBuffers - Dirty page NOT written, LP 1
> > on disk still empty
> >           5) Checkpoint completes
> >           6) Crash
> >           7) smgrtruncate - Not reached (this is where we do the
> > physical truncate)
> > 
> >   Now the crash-recovery starts
> > 
> >           Delete-log-replay (above step-3) reads page with empty LP 1
> > and the delete fails with PANIC (old page on disk with no insert)
> > 
> > Doing recovery, truncate is even not reached, a WAL replay of the
> > truncation will happen in the future but the recovery fails (repeatedly)
> > even before reaching that point.
> 
> Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
> truncating the file would open a different issue:
> 
>   1) Page on disk has empty LP 1, Insert into page LP 1
>   2) checkpoint START (Recovery REDO eventually starts here)
>   3) Delete all rows on the page (page is empty now)
>   4) Autovacuum kicks in and starts truncating
>   5) smgrtruncate() truncates the file
>   6) checkpoint writes out buffers for pages that were just truncated away,
> expanding the file again.
> 
> Your patch had a mechanism to mark the buffers as io-in-progress before
> truncating the file to fix that, but I'm wary of that approach. Firstly, it
> requires scanning the buffers that are dropped twice, which can take a long
> time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.


> I remember that people have already complained that
> DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
> once.

But that's when dropping many relations, normally. E.g. at the end of a
regression test.


> More importantly, abusing the BM_IO_INPROGRESS flag for this seems
> bad. For starters, because you're not holding buffer's I/O lock, I
> believe the checkpointer would busy-wait on the buffers until the
> truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that makes io locks into
condition variables. Then we can fairly easily have many many buffers io
locked.  Obviously there's some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.


> Perhaps a better approach would be to prevent the checkpoint from
> completing, until all in-progress truncations have completed. We have a
> mechanism to wait out in-progress commits at the beginning of a checkpoint,
> right after the redo point has been established. See comments around the
> GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
> have a similar mechanism to wait out the truncations before *completing* a
> checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Anastasia Lubennikova
Date:
Status update for a commitfest entry.

I see quite a few unanswered questions in the thread since the last patch version was sent. So, I move it to "Waiting
onAuthor". 

The new status of this patch is: Waiting on Author

Re: Corruption during WAL replay

From
Masahiko Sawada
Date:
On Tue, Aug 18, 2020 at 3:22 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncate,
> >
> > +1
>
> I'm inclined to think that we should do that independent of the far more
> complicated fix for other related issues.

+1

If we had a critical section in RelationTruncate(), crash recovery
would continue failing until the situation of the underlying file is
recovered if a PANIC happens. The current comment in
RelationTruncate() says it’s worse than the disease. But considering
physical replication, as Andres mentioned, a failure to truncate the
file after logging WAL is no longer a harmless failure. Also, the
critical section would be necessary even if we reversed the order of
truncation and dropping buffers and resolved the issue. So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: Corruption during WAL replay

From
Anastasia Lubennikova
Date:
On 06.11.2020 14:40, Masahiko Sawada wrote:
>
> So I agree to
> proceed with the patch that adds a critical section independent of
> fixing other related things discussed in this thread. If Teja seems
> not to work on this I’ll write the patch.
>
> Regards,
>
>
> --
> Masahiko Sawada
> EnterpriseDB:  https://www.enterprisedb.com/
>
>
Status update for a commitfest entry.

The commitfest is closed now. As this entry is a bug fix, I am moving it 
to the next CF.
Are you planning to continue working on it?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncate,
> > 
> > +1
> 
> I'm inclined to think that we should do that independent of the far more
> complicated fix for other related issues.
...
> > Perhaps a better approach would be to prevent the checkpoint from
> > completing, until all in-progress truncations have completed. We have a
> > mechanism to wait out in-progress commits at the beginning of a checkpoint,
> > right after the redo point has been established. See comments around the
> > GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
> > have a similar mechanism to wait out the truncations before *completing* a
> > checkpoint.
> 
> What I outlined earlier *is* essentially a way to do so, by preventing
> checkpointing from finishing the buffer scan while a dangerous state
> exists.

Seems reasonable. The attached does that. It actually works for the
initial case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1233448481..e1d3068f14 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3062,8 +3062,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+    MyProc->delayChkpt = DELAY_CHKPT_START;
 
     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3089,7 +3089,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index fc18b77832..1b74a229d6 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     proc->lxid = (LocalTransactionId) xid;
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = DELAY_CHKPT_NONE;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->backendId = InvalidBackendId;
@@ -1108,7 +1108,8 @@ EndPrepare(GlobalTransaction gxact)
 
     START_CRIT_SECTION();
 
-    MyProc->delayChkpt = true;
+    Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+    MyProc->delayChkpt = DELAY_CHKPT_START;
 
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1151,7 +1152,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2199,7 +2200,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();
 
     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+    MyProc->delayChkpt = DELAY_CHKPT_START;
 
     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2247,7 +2249,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);
 
     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
     END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd4..000206e506 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt = DELAY_CHKPT_START;
 
         SetCurrentTransactionStopTimestamp();
 
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt = DELAY_CHKPT_NONE;
         END_CRIT_SECTION();
     }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede93ad7fd..5b0a653408 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9024,18 +9024,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);
 
     CheckPointGuts(checkPoint.redo, flags);
 
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 7052dc245e..51f2581c06 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -923,7 +923,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert(MyProc->delayChkpt == DELAY_CHKPT_START);
 
     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..b75ea97b04 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,16 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
     RelationPreTruncate(rel);
 
+    /*
+     * If the file truncation fails but the concurrent checkpoint completes
+     * just before that, the next crash recovery can fail due to WAL records
+     * inconsistent with the untruncated pages. To avoid that situation we
+     * delay the checkpoint completion until we confirm the truncation to be
+     * successful.
+     */
+    Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+    MyProc->delayChkpt = DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -373,6 +383,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
+
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8f2c482bc8..52d70019c4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3615,7 +3615,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
     {
         XLogRecPtr    lsn = InvalidXLogRecPtr;
         bool        dirtied = false;
-        bool        delayChkpt = false;
+        int            delayChkpt = DELAY_CHKPT_NONE;
         uint32        buf_state;
 
         /*
@@ -3665,7 +3665,8 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert(MyProc->delayChkpt == DELAY_CHKPT_NONE);
+            MyProc->delayChkpt = delayChkpt = DELAY_CHKPT_START;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }
 
@@ -3698,7 +3699,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);
 
         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt = DELAY_CHKPT_NONE;
 
         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cf12eda504..20757274ec 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -655,7 +655,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = DELAY_CHKPT_NONE;
+
         proc->recoveryConflictPending = false;
 
         /* must be cleared with xid/xmin: */
@@ -694,7 +697,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = DELAY_CHKPT_NONE;
+
     proc->recoveryConflictPending = false;
 
     /* must be cleared with xid/xmin: */
@@ -2929,7 +2935,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having delayChkpt set to the specified value
+ * in their PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2943,13 +2950,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, DelayChkptType type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;
 
+    Assert(type != DELAY_CHKPT_NONE);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2961,7 +2970,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];
 
-        if (proc->delayChkpt)
+        if (proc->delayChkpt == type)
         {
             VirtualTransactionId vxid;
 
@@ -2987,12 +2996,15 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids,
+                             DelayChkptType type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;
 
+    Assert(type != DELAY_CHKPT_NONE);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);
 
     for (index = 0; index < arrayP->numProcs; index++)
@@ -3003,7 +3015,7 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 
         GET_VXID_FROM_PGPROC(vxid, *proc);
 
-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if (proc->delayChkpt == type &&    VirtualTransactionIdIsValid(vxid))
         {
             int            i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9b6aa2fe0d..b7f9310afe 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -392,7 +392,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -573,7 +573,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = DELAY_CHKPT_NONE;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 989c5849d4..ca764a1a72 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -78,6 +78,14 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX
 
+/* type for PGPROC.delayChkpt */
+typedef enum DelayChkptType
+{
+    DELAY_CHKPT_NONE = 0,
+    DELAY_CHKPT_START,
+    DELAY_CHKPT_COMPLETE
+} DelayChkptType;
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -181,7 +189,8 @@ struct PGPROC
     LOCKMASK    heldLocks;        /* bitmask for lock types already held on this
                                  * lock object by this backend */
 
-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    DelayChkptType    delayChkpt;    /* if this proc delays checkpoint start and/or
+                                 * completion.  */
 
     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..e560cada2b 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -15,11 +15,11 @@
 #define PROCARRAY_H
 
 #include "storage/lock.h"
+#include "storage/proc.h"
 #include "storage/standby.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -59,8 +59,10 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids,
+                                                         DelayChkptType type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, DelayChkptType type);
 
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);

Re: Corruption during WAL replay

From
Robert Haas
Date:
On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> used each other, but MarkBufferDirtyHint which delays checkpoint start
> is called in RelationTruncate while delaying checkpoint completion.
> That is not a strange nor harmful behavior.  I changed delayChkpt to a
> bitmap integer from an enum so that both barrier are separately
> triggered.
>
> I'm not sure this is the way to go here, though.  This fixes the issue
> of a crash during RelationTruncate, but the issue of smgrtruncate
> failure during RelationTruncate still remains (unless we treat that
> failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I
think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I like this patch.

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.
I think two independent boolean flags, one saying "I'm preventing
checkpoint start" and one saying "I'm preventing checkpoint completion",
would be much less confusing and also more future-proof.  Who's to say
that we won't ever need both states to be set in the same process?

I also dislike the fact that the patch has made procarray.h depend
on proc.h ... maybe I'm wrong, but I thought that there was a reason
for keeping those independent, if indeed this hasn't actually resulted
in a circular-includes situation.  If we avoid inventing that enum type
then there's no need for that.  If we do need an enum, maybe it could
be put in some already-common prerequisite header.

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I like this patch.
>
> I think the basic idea is about right, but I'm not happy with the
> three-way delayChkpt business; that seems too cute by three-quarters.
> I think two independent boolean flags, one saying "I'm preventing
> checkpoint start" and one saying "I'm preventing checkpoint completion",
> would be much less confusing and also more future-proof.  Who's to say
> that we won't ever need both states to be set in the same process?

Nobody, but the version of the patch that I was looking at uses a
separate bit for each one:

+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START (1<<0)
+#define DELAY_CHKPT_COMPLETE (1<<1)

One could instead use separate Booleans, but there doesn't seem to be
anything three-way about this?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the basic idea is about right, but I'm not happy with the
>> three-way delayChkpt business; that seems too cute by three-quarters.

> Nobody, but the version of the patch that I was looking at uses a
> separate bit for each one:

> +/* symbols for PGPROC.delayChkpt */
> +#define DELAY_CHKPT_START (1<<0)
> +#define DELAY_CHKPT_COMPLETE (1<<1)

Hm, that's not in the patch version that the CF app claims to be
latest [1].  It does this:

+/* type for PGPROC.delayChkpt */
+typedef enum DelayChkptType
+{
+    DELAY_CHKPT_NONE = 0,
+    DELAY_CHKPT_START,
+    DELAY_CHKPT_COMPLETE
+} DelayChkptType;

which seems like a distinct disimprovement over what you're quoting.

            regards, tom lane

[1] https://www.postgresql.org/message-id/20210106.173327.1444585955309078930.horikyota.ntt@gmail.com



Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
Thaks for looking this, Robert and Tom.

At Fri, 24 Sep 2021 16:22:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think the basic idea is about right, but I'm not happy with the
> >> three-way delayChkpt business; that seems too cute by three-quarters.
> 
> > Nobody, but the version of the patch that I was looking at uses a
> > separate bit for each one:
> 
> > +/* symbols for PGPROC.delayChkpt */
> > +#define DELAY_CHKPT_START (1<<0)
> > +#define DELAY_CHKPT_COMPLETE (1<<1)
> 
> Hm, that's not in the patch version that the CF app claims to be
> latest [1].  It does this:
> 
> +/* type for PGPROC.delayChkpt */
> +typedef enum DelayChkptType
> +{
> +    DELAY_CHKPT_NONE = 0,
> +    DELAY_CHKPT_START,
> +    DELAY_CHKPT_COMPLETE
> +} DelayChkptType;
> 
> which seems like a distinct disimprovement over what you're quoting.
 
Yeah, that is because the latest patch is not attached as *.patch/diff
but *.txt.  I didn't name it as *.patch in order to avoid noise patch
in that thread although it was too late. On the contrary that seems to
have lead in another trouble..

Tom's concern is right.  Actually both the two events can happen
simultaneously but the latest *.patch.txt treats that case as Robert
said.

One advantage of having the two flags as one bitmap integer is it
slightly simplifies the logic in GetVirtualXIDsDelayingChkpt and
HaveVirtualXIDsDelayingChkpt. On the other hand it very slightly
complexifies how to set/reset the flags.

GetVirtualXIDsDelayingChkpt:
+        if ((proc->delayChkpt & type) != 0)

vs

+        if (delayStart)
+            delayflag = proc->delayChkptStart;
+        else
+            delayflag = proc->delayChkptEnd;
+        if (delayflag != 0)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
Thank you for the comments! (Sorry for the late resopnse.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> > used each other, but MarkBufferDirtyHint which delays checkpoint start
> > is called in RelationTruncate while delaying checkpoint completion.
> > That is not a strange nor harmful behavior.  I changed delayChkpt to a
> > bitmap integer from an enum so that both barrier are separately
> > triggered.
> >
> > I'm not sure this is the way to go here, though.  This fixes the issue
> > of a crash during RelationTruncate, but the issue of smgrtruncate
> > failure during RelationTruncate still remains (unless we treat that
> > failure as PANIC?).
> 
> I like this patch. As I understand it, we're currently cheating by
> allowing checkpoints to complete without necessarily flushing all of
> the pages that were dirty at the time we fixed the redo pointer out to
> disk. We think this is OK because we know that those pages are going
> to get truncated away, but it's not really OK because when the system
> starts up, it has to replay WAL starting from the checkpoint's redo
> pointer, but the state of the page is not the same as it was at the
> time when the redo pointer was the end of WAL, so redo fails. In the
> case described in
> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
> modifications are made to the page before the redo pointer is fixed
> and those changes never make it to disk, but the truncation also never
> makes it to the disk either. With this patch, that can't happen,
> because no checkpoint can intervene between when we (1) decide we're
> not going to bother writing those dirty pages and (2) actually
> truncate them away. So either the pages will get written as part of
> the checkpoint, or else they'll be gone before the checkpoint
> completes. In the latter case, I suppose redo that would have modified
> those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

> In RelationTruncate, I suggest that we ought to clear the
> delay-checkpoint flag before rather than after calling
> FreeSpaceMapVacuumRange. Since the free space map is not fully
> WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

> think it might be better if MarkBufferDirtyHint stays closer to the
> existing coding and just uses a Boolean and an if-test to decide
> whether to clear the bit, instead of inventing a new mechanism. I
> don't really see anything wrong with the new mechanism, but I think
> it's better to keep the patch minimal.

Yeah, that was a a kind of silly. Fixed.

> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate.

+     * Delay the concurrent checkpoint's completion until this truncation
+     * successfully completes, so that we don't establish a redo-point between
+     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+     * file content against the WAL records after the REDO position and future
+     * recovery fails.

However, a problem for me for now is that I cannot reproduce the
problem.

To avoid further confusion, the attached is named as *.patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc..17357179e3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c..b7dc84d6e3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     proc->lxid = (LocalTransactionId) xid;
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->backendId = InvalidBackendId;
@@ -1109,7 +1109,8 @@ EndPrepare(GlobalTransaction gxact)
 
     START_CRIT_SECTION();
 
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1152,7 +1153,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2215,7 +2216,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();
 
     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2263,7 +2265,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);
 
     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6597ec45a9..4a1a0c3c1f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;
 
         SetCurrentTransactionStopTimestamp();
 
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..a4d564323a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9153,18 +9153,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);
 
     CheckPointGuts(checkPoint.redo, flags);
 
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (0 && nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b492c656d7..f7a1f981d5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -978,7 +978,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);
 
     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..be9c0e107f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,16 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
     RelationPreTruncate(rel);
 
+    /*
+     * Delay the concurrent checkpoint's completion until this truncation
+     * successfully completes, so that we don't establish a redo-point between
+     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+     * file content against the WAL records after the REDO position and future
+     * recovery fails.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -366,6 +376,10 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     /* Do the real work to truncate relation forks */
     smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);
 
+
+    /* FSM is not WAL-logged, finish the critical section here. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..c277dc3e1e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3921,7 +3921,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }
 
@@ -3954,7 +3956,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);
 
         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe..1bc4ea15e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;
 
         /* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;
 
     /* must be cleared with xid/xmin: */
@@ -3026,7 +3032,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having delayChkpt set to the specified value
+ * in their PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3040,13 +3047,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;
 
+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3058,7 +3067,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];
 
-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;
 
@@ -3084,12 +3093,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;
 
+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);
 
     for (index = 0; index < arrayP->numProcs; index++)
@@ -3100,7 +3111,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 
         GET_VXID_FROM_PGPROC(vxid, *proc);
 
-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..95fdf990e7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -394,7 +394,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index be67d8a861..b9be2454c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -79,6 +79,10 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX
 
+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START        (1<<0) 
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -184,7 +188,8 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */
 
-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* if this proc delays checkpoint start and/or
+                                 * completion.  */
 
     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..ec40130466 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -15,11 +15,11 @@
 #define PROCARRAY_H
 
 #include "storage/lock.h"
+#include "storage/proc.h"
 #include "storage/standby.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);
 
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);

Re: Corruption during WAL replay

From
Daniel Shelepanov
Date:
On 27.09.2021 11:30, Kyotaro Horiguchi wrote:
> Thank you for the comments! (Sorry for the late resopnse.)
>
> At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
>> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
>> <horikyota.ntt@gmail.com> wrote:
>>> The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
>>> used each other, but MarkBufferDirtyHint which delays checkpoint start
>>> is called in RelationTruncate while delaying checkpoint completion.
>>> That is not a strange nor harmful behavior.  I changed delayChkpt to a
>>> bitmap integer from an enum so that both barrier are separately
>>> triggered.
>>>
>>> I'm not sure this is the way to go here, though.  This fixes the issue
>>> of a crash during RelationTruncate, but the issue of smgrtruncate
>>> failure during RelationTruncate still remains (unless we treat that
>>> failure as PANIC?).
>> I like this patch. As I understand it, we're currently cheating by
>> allowing checkpoints to complete without necessarily flushing all of
>> the pages that were dirty at the time we fixed the redo pointer out to
>> disk. We think this is OK because we know that those pages are going
>> to get truncated away, but it's not really OK because when the system
>> starts up, it has to replay WAL starting from the checkpoint's redo
>> pointer, but the state of the page is not the same as it was at the
>> time when the redo pointer was the end of WAL, so redo fails. In the
>> case described in
>> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
>> modifications are made to the page before the redo pointer is fixed
>> and those changes never make it to disk, but the truncation also never
>> makes it to the disk either. With this patch, that can't happen,
>> because no checkpoint can intervene between when we (1) decide we're
>> not going to bother writing those dirty pages and (2) actually
>> truncate them away. So either the pages will get written as part of
>> the checkpoint, or else they'll be gone before the checkpoint
>> completes. In the latter case, I suppose redo that would have modified
>> those pages will just be skipped, thus dodging the problem.
> I think your understanding is right.
>
>> In RelationTruncate, I suggest that we ought to clear the
>> delay-checkpoint flag before rather than after calling
>> FreeSpaceMapVacuumRange. Since the free space map is not fully
>> WAL-logged, anything we're doing there should be non-critical. Also, I
> Agreed and fixed.
>
>> think it might be better if MarkBufferDirtyHint stays closer to the
>> existing coding and just uses a Boolean and an if-test to decide
>> whether to clear the bit, instead of inventing a new mechanism. I
>> don't really see anything wrong with the new mechanism, but I think
>> it's better to keep the patch minimal.
> Yeah, that was a a kind of silly. Fixed.
>
>> As you say, this doesn't fix the problem that truncation might fail.
>> But as Andres and Sawada-san said, the solution to that is to get rid
>> of the comments saying that it's OK for truncation to fail and make it
>> a PANIC. However, I don't think that change needs to be part of this
>> patch. Even if we do that, we still need to do this. And even if we do
>> this, we still need to do that.
> Ok. Addition to the aboves, I rewrote the comment in RelatinoTruncate.
>
> +     * Delay the concurrent checkpoint's completion until this truncation
> +     * successfully completes, so that we don't establish a redo-point between
> +     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
> +     * file content against the WAL records after the REDO position and future
> +     * recovery fails.
>
> However, a problem for me for now is that I cannot reproduce the
> problem.
>
> To avoid further confusion, the attached is named as *.patch.
>
> regards.
>
Hi. This is my first attempt to review a patch so feel free to tell me 
if I missed something.

As of today's state of REL_14_STABLE 
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible 
using the script provided by Daniel Wood in this 
(1335373813.287510.1573611814107@connect.xfinity.com) message. Also, the 
latest patch seems not to be applicable and requires some minor tweaks.


Regards,

Daniel Shelepanov




Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Mon, 24 Jan 2022 23:33:20 +0300, Daniel Shelepanov <deniel1495@mail.ru> wrote in 
> Hi. This is my first attempt to review a patch so feel free to tell me
> if I missed something.

Welcome!

> As of today's state of REL_14_STABLE
> (ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is
> reproducible using the script provided by Daniel Wood in this
> (1335373813.287510.1573611814107@connect.xfinity.com) message. Also,
> the latest patch seems not to be applicable and requires some minor
> tweaks.

Thanks for the info.  The reason for my failure is checksum was
enabled.. After disalbing both fpw and checksum (and wal_log_hints)
allows me to reproduce the issue.  And what I found is:

v3 patch:
 >    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
!>    if (0 && nvxids > 0)
 >    {

Ugggggggh!  It looks like a debugging tweak but it prevents everything
from working.

The attached is the fixed version and it surely works with the repro.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e4155f4e777f31c019cbbd00f41c7b5eb2320ef7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 26 Jan 2022 15:07:58 +0900
Subject: [PATCH v4] Delay checkpoint completion until truncation completes

When a relation is being truncated, the buffers for the relation are
dropped without being written to underlying relation file, so after
that moment the content of the to-be-truncated file is not guaranteed
to be consistent. This is fine in the normal cases where the file is
eventually removed. However, if a checkpoint after the buffer dropping
but finally the file truncation doesn't happen due to server crash or
filesystem failure, the files left behind is inconsistent with WAL.

This commit delays checkpoint completion until all ongoing file
truncation are gone to prevent such inconsistency.
---
 src/backend/access/transam/multixact.c  |  6 +++---
 src/backend/access/transam/twophase.c   | 12 +++++++-----
 src/backend/access/transam/xact.c       |  5 +++--
 src/backend/access/transam/xlog.c       | 16 +++++++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 13 +++++++++++++
 src/backend/storage/buffer/bufmgr.c     |  6 ++++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++++++++-------
 src/backend/storage/lmgr/proc.c         |  4 ++--
 src/include/storage/proc.h              |  7 ++++++-
 src/include/storage/procarray.h         |  7 ++++---
 11 files changed, 76 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 806f2e43ba..70593238ab 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 271a3146db..2b1a25e723 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -474,7 +474,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1165,7 +1165,8 @@ EndPrepare(GlobalTransaction gxact)
 
     START_CRIT_SECTION();
 
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1208,7 +1209,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2267,7 +2268,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();
 
     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2315,7 +2317,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);
 
     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c9516e03fa..9e6acf954c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1386,8 +1386,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;
 
         SetCurrentTransactionStopTimestamp();
 
@@ -1488,7 +1489,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..c2729e4222 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9304,18 +9304,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);
 
     CheckPointGuts(checkPoint.redo, flags);
 
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c260310c4c..d66ec015ea 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -978,7 +978,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);
 
     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..7af8278a50 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
     RelationPreTruncate(rel);
 
+    /*
+     * Delay the concurrent checkpoint's completion until this truncation
+     * successfully completes. Otherwise the checkpoint may leave inconsistent
+     * relation files if the following file truncation doesn't happen due to
+     * file system failure or server crash.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -366,6 +375,10 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     /* Do the real work to truncate relation forks */
     smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);
 
+
+    /* FSM is not WAL-logged, close the critical section here. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2512e750c..1bf064ba5e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3910,7 +3910,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckPoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }
 
@@ -3943,7 +3945,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);
 
         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 3be6040289..90628d150b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;
 
         /* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;
 
     /* must be cleared with xid/xmin: */
@@ -3039,7 +3045,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3053,13 +3060,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;
 
+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3071,7 +3080,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];
 
-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;
 
@@ -3097,12 +3106,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;
 
+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);
 
     for (index = 0; index < arrayP->numProcs; index++)
@@ -3113,7 +3124,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 
         GET_VXID_FROM_PGPROC(vxid, *proc);
 
-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e306b04738..7611ad0f0c 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -392,7 +392,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -577,7 +577,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a58888f9e9..7b1f7a6ac7 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -86,6 +86,10 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX
 
+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -191,7 +195,8 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */
 
-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* if this proc delays checkpoint start and/or
+                                 * completion.  */
 
     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index e03692053e..dcdb01be53 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -15,11 +15,11 @@
 #define PROCARRAY_H
 
 #include "storage/lock.h"
+#include "storage/proc.h"
 #include "storage/standby.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);
 
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
-- 
2.27.0


Re: Corruption during WAL replay

From
Robert Haas
Date:
On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> The attached is the fixed version and it surely works with the repro.

Hi,

I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more clear exactly what the bug is here, and
also in storage/proc.h, where I tried to clarify both the use of the
DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
version, I plan to commit it.

I think it should be back-patched, too, but that looks like a bit of a
pain. I think every back-branch will require different adjustments.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > The attached is the fixed version and it surely works with the repro.
> 
> Hi,
> 
> I spent the morning working on this patch and came up with the
> attached version. I wrote substantial comments in RelationTruncate(),
> where I tried to make it more clear exactly what the bug is here, and
> also in storage/proc.h, where I tried to clarify both the use of the
> DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
> version, I plan to commit it.

Thanks for taking this and for the time.  The additional comments
seems describing the flags more clearly.

storage.c:
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated buffers might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the buffers to not exist on disk at all, but not for them to have the
+     * wrong contents.

FWIW, this seems like slightly confusing between buffer and its
content.  I can read it correctly so I don't mind if it is natural
enough.

Otherwise all the added/revised comments looks fine. Thanks for the
labor.

> I think it should be back-patched, too, but that looks like a bit of a
> pain. I think every back-branch will require different adjustments.

I'll try that, if you are already working on it, please inform me. (It
may more than likely be too late..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Wed, Mar 16, 2022 at 1:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> storage.c:
> +        * Make sure that a concurrent checkpoint can't complete while truncation
> +        * is in progress.
> +        *
> +        * The truncation operation might drop buffers that the checkpoint
> +        * otherwise would have flushed. If it does, then it's essential that
> +        * the files actually get truncated on disk before the checkpoint record
> +        * is written. Otherwise, if reply begins from that checkpoint, the
> +        * to-be-truncated buffers might still exist on disk but have older
> +        * contents than expected, which can cause replay to fail. It's OK for
> +        * the buffers to not exist on disk at all, but not for them to have the
> +        * wrong contents.
>
> FWIW, this seems like slightly confusing between buffer and its
> content.  I can read it correctly so I don't mind if it is natural
> enough.

Hmm. I think the last two instances of "buffers" in this comment
should actually say "blocks".

> I'll try that, if you are already working on it, please inform me. (It
> may more than likely be too late..)

If you want to take a crack at that, I'd be delighted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> Hmm. I think the last two instances of "buffers" in this comment
> should actually say "blocks".

Ok. I replaced them with "blocks" and it looks nicer. Thanks!

> > I'll try that, if you are already working on it, please inform me. (It
> > may more than likely be too late..)
> 
> If you want to take a crack at that, I'd be delighted.

Finally, no two of from 10 to 14 doesn't accept the same patch.

As a cross-version check, I compared all combinations of the patches
for two adjacent versions and confirmed that no hunks are lost.

All versions pass check world.


The differences between each two adjacent versions are as follows.

master->14:

 A hunk fails due to the change in how to access rel->rd_smgr.

14->13:

  Several hunks fail due to simple context differences.

13->12:

 Many hunks fail due to the migration of delayChkpt from PGPROC to
 PGXACT and the context difference due to change of FSM trancation
 logic in RelationTruncate.

12->11:

 Several hunks fail due to the removal of volatile qalifier from
 pointers to PGPROC/PGXACT.

11-10:

 A hunk fails due to the context difference due to an additional
 member tempNamespaceId of PGPROC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c88858d3e5681005ba0396b7e7ebcde4322b3308 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 15 Mar 2022 12:29:14 -0400
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6a70d49738..9f65c600d0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3088,8 +3088,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3115,7 +3115,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 874c8ed125..4dc8ccc12b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -475,7 +475,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1164,7 +1164,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1207,7 +1208,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2266,7 +2267,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2314,7 +2316,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8964ddf3eb..3596a7d734 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1387,8 +1387,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1489,7 +1490,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f436471b27..ece71b9208 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6503,18 +6503,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f4eb54b63c..462e23503e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1011,7 +1011,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..ce5568ff08 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..11005edc73 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3911,7 +3911,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckPoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3944,7 +3946,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 13d192ec2b..735763cc24 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -698,7 +698,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)

         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         /* must be cleared with xid/xmin: */
@@ -737,7 +740,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* must be cleared with xid/xmin: */
@@ -3053,7 +3059,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3067,13 +3074,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3085,7 +3094,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -3111,12 +3120,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -3127,7 +3138,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 90283f8a9f..df080cd332 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -393,7 +393,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -578,7 +578,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index a58888f9e9..36ecf7d005 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -86,6 +86,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -191,7 +226,7 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index e03692053e..1b2cfac5ad 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0

From 71493542cda97f75d0737e3434d9aaab2beadd5f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 14:54:25 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b643564f16..50d8bab9e2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7cc76c1db7..dea3f485f7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -474,7 +474,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1165,7 +1165,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1208,7 +1209,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2275,7 +2276,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2323,7 +2325,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 514044f3db..c5e7261921 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1335,8 +1335,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1437,7 +1438,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3e71aea71f..7cc49819f0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9228,18 +9228,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b153fad594..1af4a90c41 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -925,7 +925,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..fa5682dce8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ffc6056c60..a55545a187 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3946,7 +3946,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3979,7 +3981,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index f047f9a242..ae71d7538b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)

         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         /* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* must be cleared with xid/xmin: */
@@ -3043,7 +3049,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3057,13 +3064,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3075,7 +3084,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -3101,12 +3110,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -3117,7 +3128,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 2575ea1ca0..c50a419a54 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -394,7 +394,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index cfabfdbedf..b78012ec2b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -86,6 +86,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -191,7 +226,7 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..93de230a32 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0

From f1832b4aaa3fcd06777a1d3bd9e322b3d85dd634 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:11:22 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 29 ++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 37 ++++++++++++++++++++++++-
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 120 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 7990b5e5dd..3e6443fd41 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3071,8 +3071,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3098,7 +3098,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b1a221849a..716c17c98f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -476,7 +476,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1170,7 +1170,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1213,7 +1214,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2286,7 +2287,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2334,7 +2336,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fb6220e491..da6ce5a09e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1308,8 +1308,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1410,7 +1411,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7bef438d9a..9522c6531f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9022,18 +9022,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b21679f09e..5cff486d9e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -904,7 +904,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 74216785b7..0eb14cc885 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)

     RelationPreTruncate(rel);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -363,13 +379,24 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work to truncate relation forks */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
      * all-free, and would be preferentially selected.
+     *
+     * NB: There's no point in delaying checkpoints until this is done.
+     * Because the FSM is not WAL-logged, we have to be prepared for the
+     * possibility of corruption after a crash anyway.
      */
     if (need_fsm_vacuum)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 597afedef7..033ef46811 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3647,7 +3647,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3680,7 +3682,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 02b157243e..725680f34f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -434,7 +434,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -456,7 +459,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2272,7 +2278,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2286,13 +2293,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2304,7 +2313,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];

-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2330,12 +2339,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2346,7 +2357,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0d70b03eeb..f3a6c598bf 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -396,7 +396,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -578,7 +578,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index b3ea1a2586..5798b91186 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -83,6 +83,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -149,7 +184,7 @@ struct PGPROC
     LOCKMASK    heldLocks;        /* bitmask for lock types already held on this
                                  * lock object by this backend */

-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     /*
      * Info to allow us to wait for synchronous replication, if needed.
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 200ef8db27..4dee2dab10 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0

From 3eb3c1df1fbccd7eb3dc0dcc1ed99938e5c12e44 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:32:38 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 09748905a8..757346cbbb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6def1820ca..602ca41054 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -477,7 +477,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1187,7 +1187,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1230,7 +1231,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2337,7 +2338,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2385,7 +2387,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9c6b87c6ec..9d23298b2b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1306,8 +1306,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1408,7 +1409,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a30314bc83..9135985eaf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8920,18 +8920,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 24a6f3148b..b51b0edd67 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f899b25c0e..5a6324fec4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -252,6 +253,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -290,8 +307,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01c09fd532..7d11b0963f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3514,7 +3514,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3547,7 +3549,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index ec7e210226..39093253fe 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -434,7 +434,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -456,7 +459,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2261,7 +2267,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2275,13 +2282,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2294,7 +2303,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         PGPROC       *proc = &allProcs[pgprocno];
         PGXACT       *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2320,12 +2329,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2337,7 +2348,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4850df2e14..59291e01f4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -397,7 +397,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 43d0854a41..2a16fd23d4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -76,6 +76,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -232,8 +267,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index d1dc0ffe28..d9ca460efc 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0

From 30fd7eea362f38a64f62fc91123bc387dabed15f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:36:10 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ad9e7ff8f0..5612db0e21 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8b402c3a1d..769a5fd714 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -476,7 +476,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1175,7 +1175,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1218,7 +1219,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2352,7 +2353,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2400,7 +2402,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e32b05d17f..5a86b6575e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1239,8 +1239,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1341,7 +1342,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c68dc1b9a8..53e109b0aa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9064,18 +9064,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index c033e7bd4c..a8c140b06f 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..5d6f456c70 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,6 +27,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -248,6 +249,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -286,8 +303,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 459151519a..027d5067a0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3471,7 +3471,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3504,7 +3506,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 465ca66857..d88d955091 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -433,7 +433,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -455,7 +458,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2267,7 +2273,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2281,13 +2288,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2300,7 +2309,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         volatile PGPROC *proc = &allProcs[pgprocno];
         volatile PGXACT *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2326,12 +2335,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2343,7 +2354,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 69a1e37289..aaecfa67b7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -380,7 +380,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -562,7 +562,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 95c9592b21..e76ca8a11e 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -76,6 +76,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -232,8 +267,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index a3a1bf724c..a69632a70c 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0

From f0b1e3bee795a54d2a701889dd5956283fbc2cf6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 17 Mar 2022 19:40:45 +0900
Subject: [PATCH] Fix possible recovery trouble if TRUNCATE overlaps a
 checkpoint.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

If TRUNCATE causes some buffers to be invalidated and thus the
checkpoint does not flush them, TRUNCATE must also ensure that the
corresponding files are truncated on disk. Otherwise, a replay
from the checkpoint might find that the buffers exist but have
the wrong contents, which may cause replay to fail.

Report by Teja Mupparti. Patch by Kyotaro Horiguchi, per a design
suggestion from Heikki Linnakangas, with some changes to the
comments by me. Review of this and a prior patch that approached
the issue differently by Heikki Linnakangas, Andres Freund, Álvaro
Herrera, Masahiko Sawada, and Tom Lane.

Back-patch to all supported versions.

Discussion: http://postgr.es/m/BYAPR06MB6373BF50B469CA393C614257ABF00@BYAPR06MB6373.namprd06.prod.outlook.com
---
 src/backend/access/transam/multixact.c  |  6 ++--
 src/backend/access/transam/twophase.c   | 12 ++++----
 src/backend/access/transam/xact.c       |  5 ++--
 src/backend/access/transam/xlog.c       | 16 +++++++++--
 src/backend/access/transam/xloginsert.c |  2 +-
 src/backend/catalog/storage.c           | 26 ++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c     |  6 ++--
 src/backend/storage/ipc/procarray.c     | 26 ++++++++++++-----
 src/backend/storage/lmgr/proc.c         |  4 +--
 src/include/storage/proc.h              | 38 +++++++++++++++++++++++--
 src/include/storage/procarray.h         |  5 ++--
 11 files changed, 117 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cdaf499348..1e52972bbf 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3069,8 +3069,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyPgXact->delayChkpt);
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3096,7 +3096,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);

-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3eb33be69b..c61b2736a1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -478,7 +478,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     }
     pgxact->xid = xid;
     pgxact->xmin = InvalidTransactionId;
-    pgxact->delayChkpt = false;
+    pgxact->delayChkpt = 0;
     pgxact->vacuumFlags = 0;
     proc->pid = 0;
     proc->databaseId = databaseid;
@@ -1159,7 +1159,8 @@ EndPrepare(GlobalTransaction gxact)

     START_CRIT_SECTION();

-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1191,7 +1192,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2284,7 +2285,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();

     /* See notes in RecordTransactionCommit */
-    MyPgXact->delayChkpt = true;
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_START;

     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2332,7 +2334,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);

     /* Checkpoint can proceed now */
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

     END_CRIT_SECTION();

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 25a3a4f97e..ccd99c38c2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1247,8 +1247,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyPgXact->delayChkpt = true;
+        MyPgXact->delayChkpt |= DELAY_CHKPT_START;

         SetCurrentTransactionStopTimestamp();

@@ -1349,7 +1350,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyPgXact->delayChkpt = false;
+        MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e8bdde764..5087b5fe0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9022,18 +9022,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);

     CheckPointGuts(checkPoint.redo, flags);

+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 579d8de775..6ff19814d4 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -899,7 +899,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyPgXact->delayChkpt);
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) != 0);

     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde00ca..729fb92c5f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -249,6 +250,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     if (vm)
         visibilitymap_truncate(rel, nblocks);

+    /*
+     * Make sure that a concurrent checkpoint can't complete while truncation
+     * is in progress.
+     *
+     * The truncation operation might drop buffers that the checkpoint
+     * otherwise would have flushed. If it does, then it's essential that
+     * the files actually get truncated on disk before the checkpoint record
+     * is written. Otherwise, if reply begins from that checkpoint, the
+     * to-be-truncated blocks might still exist on disk but have older
+     * contents than expected, which can cause replay to fail. It's OK for
+     * the blocks to not exist on disk at all, but not for them to have the
+     * wrong contents.
+     */
+    Assert((MyPgXact->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyPgXact->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -287,8 +304,15 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
             XLogFlush(lsn);
     }

-    /* Do the real work */
+    /*
+     * This will first remove any buffers from the buffer pool that should no
+     * longer exist after truncation is complete, and then truncate the
+     * corresponding files on disk.
+     */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+
+    /* We've done all the critical work, so checkpoints are OK now. */
+    MyPgXact->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
 }

 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bafe91ab0d..0b7bdb8634 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3469,7 +3469,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyPgXact->delayChkpt = delayChkpt = true;
+            Assert((MyPgXact->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyPgXact->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }

@@ -3502,7 +3504,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);

         if (delayChkpt)
-            MyPgXact->delayChkpt = false;
+            MyPgXact->delayChkpt &= ~DELAY_CHKPT_START;

         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index d739812f23..134b63f28b 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -433,7 +433,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
         pgxact->xmin = InvalidTransactionId;
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-        pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        pgxact->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;

         Assert(pgxact->nxids == 0);
@@ -455,7 +458,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xmin = InvalidTransactionId;
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-    pgxact->delayChkpt = false; /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    pgxact->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;

     /* Clear the subtransaction-XID cache too while holding the lock */
@@ -2259,7 +2265,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * critical sections, as shown by having specified delayChkpt bits set in their
+ * PGXACT.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -2273,13 +2280,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;

+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -2292,7 +2301,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         volatile PGPROC *proc = &allProcs[pgprocno];
         volatile PGXACT *pgxact = &allPgXact[pgprocno];

-        if (pgxact->delayChkpt)
+        if ((pgxact->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;

@@ -2318,12 +2327,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;

+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);

     for (index = 0; index < arrayP->numProcs; index++)
@@ -2335,7 +2346,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)

         GET_VXID_FROM_PGPROC(vxid, *proc);

-        if (pgxact->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((pgxact->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 857dfdab09..e5370df019 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -377,7 +377,7 @@ InitProcess(void)
     MyProc->databaseId = InvalidOid;
     MyProc->roleId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -550,7 +550,7 @@ InitAuxiliaryProcess(void)
     MyProc->databaseId = InvalidOid;
     MyProc->roleId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyPgXact->delayChkpt = false;
+    MyPgXact->delayChkpt = 0;
     MyPgXact->vacuumFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 947f69d634..d8dd7bf5e1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -75,6 +75,41 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX

+/*
+ * Flags for PGPROC.delayChkpt
+ *
+ * These flags can be used to delay the start or completion of a checkpoint
+ * for short periods. A flag is in effect if the corresponding bit is set in
+ * the PGPROC of any backend.
+ *
+ * For our purposes here, a checkpoint has three phases: (1) determine the
+ * location to which the redo pointer will be moved, (2) write all the
+ * data durably to disk, and (3) WAL-log the checkpoint.
+ *
+ * Setting DELAY_CHKPT_START prevents the system from moving from phase 1
+ * to phase 2. This is useful when we are performing a WAL-logged modification
+ * of data that will be flushed to disk in phase 2. By setting this flag
+ * before writing WAL and clearing it after we've both written WAL and
+ * performed the corresponding modification, we ensure that if the WAL record
+ * is inserted prior to the new redo point, the corresponding data changes will
+ * also be flushed to disk before the checkpoint can complete. (In the
+ * extremely common case where the data being modified is in shared buffers
+ * and we acquire an exclusive content lock on the relevant buffers before
+ * writing WAL, this mechanism is not needed, because phase 2 will block
+ * until we release the content lock and then flush the modified data to
+ * disk.)
+ *
+ * Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
+ * to phase 3. This is useful if we are performing a WAL-logged operation that
+ * might invalidate buffers, such as relation truncation. In this case, we need
+ * to ensure that any buffers which were invalidated and thus not flushed by
+ * the checkpoint are actaully destroyed on disk. Replay can cope with a file
+ * or block that doesn't exist, but not with a block that has the wrong
+ * contents.
+ */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 /*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
@@ -217,8 +252,7 @@ typedef struct PGXACT

     uint8        vacuumFlags;    /* vacuum-related flags, see above */
     bool        overflowed;
-    bool        delayChkpt;        /* true if this proc delays checkpoint start;
-                                 * previously called InCommit */
+    int            delayChkpt;        /* for DELAY_CHKPT_* flags */

     uint8        nxids;
 } PGXACT;
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 08b4b030bb..2b60b27604 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -92,8 +92,9 @@ extern TransactionId GetOldestXmin(Relation rel, int flags);
 extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);

 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
--
2.27.0


Re: Corruption during WAL replay

From
Robert Haas
Date:
On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Finally, no two of from 10 to 14 doesn't accept the same patch.
>
> As a cross-version check, I compared all combinations of the patches
> for two adjacent versions and confirmed that no hunks are lost.
>
> All versions pass check world.

Thanks, committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Thanks, committed.

Some of the buildfarm is seeing failures in the pg_checksums test.

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Thu, Mar 24, 2022 at 6:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Thanks, committed.
>
> Some of the buildfarm is seeing failures in the pg_checksums test.

Hmm. So the tests seem to be failing because 002_actions.pl stops the
database cluster, runs pg_checksums (which passes), writes some zero
bytes over the line pointer array of the first block of pg_class, and
then runs pg_checksums again. In the failing buildfarm runs,
pg_checksums fails to detect the corruption: the second run succeeds,
while pg_checksums expects it to fail. That's pretty curious, because
if the database cluster is stopped, and things are OK at that point,
then how could a server bug of any kind cause a Perl script to be
unable to corrupt a file on disk?

A possible clue is that I also see a few machines failing in
recoveryCheck. And the code that is failing there looks like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
    my ($stdout, $stderr);
    $node_primary3->psql('postgres',
                         "\\a\\t\nSELECT * FROM pg_stat_activity",
                         stdout => \$stdout, stderr => \$stderr);
    diag $stdout, $stderr;
    $node_primary3->stop('fast');
    $node_standby3->stop('fast');
    die "could not determine walsender pid, can't continue";
}

And the failure looks like this:

#   Failed test 'have walsender pid 1047504
# 1047472'
#   at t/019_replslot_limit.pl line 343.

That sure looks like there are multiple walsender PIDs active, and the
pg_stat_activity output confirms it. 1047504 is running
START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1 and 1047472 is
running START_REPLICATION SLOT "pg_basebackup_1047472" 0/600000
TIMELINE 1.

Both of these failures could possibly be explained by some failure of
things to shut down properly, but it's not the same things. In the
first case, the database server would have had to still be running
after we run $node->stop, and it would have had to overwrite the bad
contents of pg_class with some good contents. In the second case, the
cluster's supposed to still be running, but the backends that were
creating those replication slots should have exited sooner.

I've been running the pg_checksums test in a loop here for a bit now
in the hopes of being able to reproduce the failure, but it doesn't
seem to want to fail here. And I've also looked over the commit and I
can't quite see how it would cause a process, or the cluster, to fail
to shutdown, unless perhaps it's the checkpointer that gets stuck, but
that doesn't really seem to match the symptoms.

Any ideas?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Thu, Mar 24, 2022 at 8:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Any ideas?

And ... right after hitting send, I see that the recovery check
failures are under separate troubleshooting and thus probably
unrelated. But that leaves me even more confused. How can a change to
only the server code cause a client utility to fail to detect
corruption that is being created by Perl while the server is stopped?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> And ... right after hitting send, I see that the recovery check
> failures are under separate troubleshooting and thus probably
> unrelated.

Yeah, we've been chasing those for months.

> But that leaves me even more confused. How can a change to
> only the server code cause a client utility to fail to detect
> corruption that is being created by Perl while the server is stopped?

Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
Now I see they're not, which indeed puts quite a different spin on
things.  Your thought about maybe the server isn't shut down yet is
interesting --- did 412ad7a55 touch anything about the shutdown
sequence?

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Thu, Mar 24, 2022 at 8:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
> Now I see they're not, which indeed puts quite a different spin on
> things.  Your thought about maybe the server isn't shut down yet is
> interesting --- did 412ad7a55 touch anything about the shutdown
> sequence?

I hate to say "no" because the evidence suggests that the answer might
be "yes" -- but it definitely isn't intending to change anything about
the shutdown sequence. It just introduces a mechanism to backends to
force the checkpointer to delay writing the checkpoint record.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I hate to say "no" because the evidence suggests that the answer might
> be "yes" -- but it definitely isn't intending to change anything about
> the shutdown sequence. It just introduces a mechanism to backends to
> force the checkpointer to delay writing the checkpoint record.

Wait a minute, I think we may be barking up the wrong tree.

The three commits that serinus saw as new in its first failure were

ce95c54376 Thu Mar 24 20:33:13 2022 UTC  Fix pg_statio_all_tables view for multiple TOAST indexes.
7dac61402e Thu Mar 24 19:51:40 2022 UTC  Remove unused module imports from TAP tests
412ad7a556 Thu Mar 24 18:52:28 2022 UTC  Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.

I failed to look closely at dragonet, but I now see that its
first failure saw

ce95c54376 Thu Mar 24 20:33:13 2022 UTC  Fix pg_statio_all_tables view for multiple TOAST indexes.
7dac61402e Thu Mar 24 19:51:40 2022 UTC  Remove unused module imports from TAP tests

serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
confident that the failure is repeatable for them.  That means that the
culprit must be ce95c54376 or 7dac61402e, not anything nearby such as
412ad7a556.

It's *really* hard to see how the pg_statio_all_tables change could
have affected this.  So that leaves 7dac61402e, which did this to
the test script that's failing:

 use strict;
 use warnings;
-use Config;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;

Discuss.

            regards, tom lane



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 20:39:27 -0400, Robert Haas wrote:
> But that leaves me even more confused. How can a change to only the server
> code cause a client utility to fail to detect corruption that is being
> created by Perl while the server is stopped?

I guess it could somehow cause the first page to be all zeroes, in which case
overwriting it with more zeroes wouldn't cause a problem that pg_checksums can
see?  But I have a somewhat more realistic idea:

I'm suspicious of pg_checksums --filenode. If I understand correctly
--filenode scans the data directory, including all tablespaces, for a file
matching that filenode. If we somehow end up with a leftover file in the pre
ALTER TABLE SET TABLESPACE location, it'd not notice that there *also* is a
file in a different place?

Perhaps the --filenode mode should print out the file location...


Randomly noticed: The test fetches the block size without doing anything with
it afaics.

Andres



Re: Corruption during WAL replay

From
Tom Lane
Date:
I wrote:
> ... So that leaves 7dac61402e, which did this to
> the test script that's failing:
 
>  use strict;
>  use warnings;
> -use Config;
>  use PostgreSQL::Test::Cluster;
>  use PostgreSQL::Test::Utils;

> Discuss.

Another thing that seems quite baffling, but is becoming clearer by
the hour, is that only serinus and dragonet are seeing this failure.
How is that?  They're not very similarly configured --- one is gcc,
one clang, and one uses jit and one doesn't.  They do share the same
perl version, 5.34.0; but so do twenty-three other animals, many of
which have reported in cleanly.  I'm at a loss to explain that.
Andres, can you think of anything that's peculiar to those two
animals?

            regards, tom lane



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> confident that the failure is repeatable for them.

That's weird. They run on the same host, but otherwise they have very little
in common. There's plenty other animals running on the same machine that
didn't report errors.

I copied serinus' configuration, ran the tests repeatedly, without reproducing
the failure so far. Odd.

Combined with the replslot failure I'd be prepared to think the machine has
issues, except that the replslot thing triggered on other machines too.


I looked through logs on the machine without finding anything indicating
something odd.

I turned on keep_error_builds for serinus. Hopefully that'll leave us with
on-disk files to inspect.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 21:59:08 -0400, Tom Lane wrote:
> Another thing that seems quite baffling, but is becoming clearer by
> the hour, is that only serinus and dragonet are seeing this failure.
> How is that?  They're not very similarly configured --- one is gcc,
> one clang, and one uses jit and one doesn't.  They do share the same
> perl version, 5.34.0; but so do twenty-three other animals, many of
> which have reported in cleanly.  I'm at a loss to explain that.
> Andres, can you think of anything that's peculiar to those two
> animals?

No, I'm quite baffled myself. As I noted in an email I just sent, before
reading this one, I can't explain it, and at least in simple attempts, can't
reproduce it either. And there are animals much closer to each other than
those two...

I forced a run while writing the other email, with keep_error_whatnot, and I
just saw it failing... Looking whether there's anything interesting to glean.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Thomas Munro
Date:
On Fri, Mar 25, 2022 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> > serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> > confident that the failure is repeatable for them.
>
> That's weird. They run on the same host, but otherwise they have very little
> in common. There's plenty other animals running on the same machine that
> didn't report errors.

One random thing I've noticed about serinus is that it seems to drop
UDP packets more than others, but dragonet apparently doesn't:

tmunro=> select animal, count(*) from run where result = 'FAILURE' and
'stats' = any(fail_tests) and snapshot > now() - interval '3 month'
group by 1 order by 2 desc;
    animal    | count
--------------+-------
 serinus      |    14
 flaviventris |     6
 mandrill     |     2
 bonito       |     1
 seawasp      |     1
 crake        |     1
 sungazer     |     1
(7 rows)

Example: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2001:00:14



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-25 15:23:24 +1300, Thomas Munro wrote:
> One random thing I've noticed about serinus is that it seems to drop
> UDP packets more than others, but dragonet apparently doesn't:

Serinus is built with optimization. Which I guess could lead to other backends
reporting stats more quickly? And of course could lead to running more often
(due to finishing before the next cron invocation). I think I've also
configured my animals to run more often than many other owners.

So I'm not sure how much can be gleaned from raw "failure counts" without
taking the number of runs into account as well?

- Andres



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 19:20:10 -0700, Andres Freund wrote:
> I forced a run while writing the other email, with keep_error_whatnot, and I
> just saw it failing... Looking whether there's anything interesting to glean.

Unfortunately the test drops the table and it doesn't report the filepath of
the failure. So I haven't learned much from the data dir so far.


I still don't see a failure when running the tests in a separate source
tree. Can't explain that. Going to try to get closer to the buildfarm script
run - it'd be a whole lot easier to be able to edit the source of the test and
reproduce...

Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
leave dragonet's alone for now.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Thomas Munro
Date:
On Fri, Mar 25, 2022 at 3:35 PM Andres Freund <andres@anarazel.de> wrote:
> So I'm not sure how much can be gleaned from raw "failure counts" without
> taking the number of runs into account as well?

Ah, right, it does indeed hold the record for most runs in 3 months,
and taking runs into account its "stats" failure rate is clustered
with mandrill and seawasp.  Anyway, clearly not relevant because
dragonet doesn't even show up in the list.

    animal     | runs | stats_test_fail_fraction
---------------+------+--------------------------
 mandrill      |  158 |       0.0126582278481013
 seawasp       |   85 |       0.0117647058823529
 serinus       | 1299 |       0.0107775211701309
 sungazer      |  174 |      0.00574712643678161
 flaviventris  | 1292 |      0.00464396284829721
 bonito        |  313 |      0.00319488817891374
 crake         |  743 |      0.00134589502018843



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 19:43:02 -0700, Andres Freund wrote:
> Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
> leave dragonet's alone for now.

Turns out they had the same dir. But it didn't help.

I haven't yet figured out why, but I now *am* able to reproduce the problem in
the buildfarm built tree.  Wonder if there's a path length issue or such
somewhere?

Either way, I can now manipulate the tests and still repro. I made the test
abort after the first failure.

hexedit shows that the file is modified, as we'd expect:
00000000   00 00 00 00  C0 01 5B 01  16 7D 00 00  A0 03 C0 03  00 20 04 20  00 00 00 00  00 00 00 00  00 00 00 00
......[..}........ ............
 
00000020   00 9F 38 00  80 9F 38 00  60 9F 38 00  40 9F 38 00  20 9F 38 00  00 9F 38 00  E0 9E 38 00  C0 9E 38 00
..8...8.`.8.@.8..8...8...8...8.
 

And we are checking the right file:

bf@andres-postgres-edb-buildfarm-v1:~/build/buildfarm-serinus/HEAD/pgsql.build$
tmp_install/home/bf/build/buildfarm-serinus/HEAD/inst/bin/pg_checksums--check -D
/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata
--filenode16391 -v
 
pg_checksums: checksums verified in file
"/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata/pg_tblspc/16387/PG_15_202203241/5/16391"
Checksum operation completed
Files scanned:   1
Blocks scanned:  45
Bad checksums:  0
Data checksum version: 1

If I twiddle further bits, I see that page failing checksum verification, as
expected.

I made the script copy the file before twiddling it around:
00000000   00 00 00 00  C0 01 5B 01  16 7D 00 00  A0 03 C0 03  00 20 04 20  00 00 00 00  E0 9F 38 00  C0 9F 38 00
......[..}........ ......8...8.
 
00000020   A0 9F 38 00  80 9F 38 00  60 9F 38 00  40 9F 38 00  20 9F 38 00  00 9F 38 00  E0 9E 38 00  C0 9E 38 00
..8...8.`.8.@.8..8...8...8...8.
 

So it's indeed modified.


The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?


Gotta make some food. Suggestions what exactly to look at welcome.


Greetings,

Andres Freund

PS: I should really rename the hostname of that machine one of these days...



Re: Corruption during WAL replay

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> The only thing I can really conclude here is that we apparently end up with
> the same checksum for exactly the modifications we are doing? Just on those
> two damn instances? Reliably?

IIRC, the table's OID or relfilenode enters into the checksum.
Could it be that assigning a specific OID to the table allows
this to happen, and these two animals are somehow assigning
that OID while others are using some slightly different OID?

            regards, tom lane



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-25 00:08:20 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The only thing I can really conclude here is that we apparently end up with
> > the same checksum for exactly the modifications we are doing? Just on those
> > two damn instances? Reliably?
> 
> IIRC, the table's OID or relfilenode enters into the checksum.
> Could it be that assigning a specific OID to the table allows
> this to happen, and these two animals are somehow assigning
> that OID while others are using some slightly different OID?

It's just the block number that goes into it.

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

If I add a 'SELECT txid_current()' before the CREATE TABLE in
check_relation_corruption(), the test doesn't fail anymore, because there's an
additional WAL record.

16bit checksums for the win.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh.  I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals?  If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations.  OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

> 16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help?  I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.

            regards, tom lane



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-24 21:54:38 -0700, Andres Freund wrote:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

This actually explains how the issue could start to be visible with
ce95c543763. It changes the amount of WAL initdb generates and therefore
influences what LSN the page ends up with.  I've verified that the failing
test is reproducible with ce95c543763, but not its parent 7dac61402e3. While
of course ce95c543763 isn't "actually responsible".

Ah, and that's finally also the explanation why I couldn't reproduce the
failure it in a different directory, with an otherwise identically configured
PG: The length of the path to the tablespace influences the size of the
XLOG_TBLSPC_CREATE record.


Not sure what to do here... I guess we can just change the value we overwrite
the page with and hope to not hit this again? But that feels deeply deeply
unsatisfying.

Perhaps it would be enough to write into multiple parts of the page? I am very
much not a cryptographical expert, but the way pg_checksum_block() works, it
looks to me that "multiple" changes within a 16 byte chunk have a smaller
influence on the overall result than the same "amount" of changes to separate
16 byte chunks.


I might have to find a store still selling strong beverages at this hour.

- Andres



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-25 01:23:00 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I do see that the LSN that ends up on the page is the same across a few runs
> > of the test on serinus. Which presumably differs between different
> > animals. Surprised that it's this predictable - but I guess the run is short
> > enough that there's no variation due to autovacuum, checkpoints etc.
> 
> Uh-huh.  I'm not surprised that it's repeatable on a given animal.
> What remains to be explained:
> 
> 1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
> culprit after all, by slightly changing the amount of catalog data
> written during initdb, and thus moving the initial LSN.

Yep, verified that (see mail I just sent).


> 2. Why just these two animals?  If initial LSN is the critical thing,
> then the results of "locale -a" would affect it, so platform
> dependence is hardly surprising ... but I'd have thought that all
> the animals on that host would use the same initial set of
> collations.

I think it's the animal's name that makes the difference, due to the
tablespace path lenght thing. And while I was confused for a second by

petalura
pogona
serinus
dragonet

failing, despite different name lengths, it still makes sense: We MAXALIGN the
start of records. Which explains why flaviventris didn't fail the same way.


> As for a fix, would damaging more of the page help?  I guess
> it'd just move around the one-in-64K chance of failure.

As I wrote in the other email, I think spreading the changes out wider might
help. But it's still not great. However:

> Maybe we have to intentionally corrupt (e.g. invert) the
> checksum field specifically.

seems like it'd do the trick? Even a single bit change of the checksum ought
to do, as long as we don't set it to 0.

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Ah, and that's finally also the explanation why I couldn't reproduce the
> failure it in a different directory, with an otherwise identically configured
> PG: The length of the path to the tablespace influences the size of the
> XLOG_TBLSPC_CREATE record.

Ooooohhh ... yeah, that could explain a lot of cross-animal variation.

> Not sure what to do here... I guess we can just change the value we overwrite
> the page with and hope to not hit this again? But that feels deeply deeply
> unsatisfying.

AFAICS, this strategy of whacking a predetermined chunk of the page with
a predetermined value is going to fail 1-out-of-64K times.  We have to
change the test so that it's guaranteed to produce an invalid checksum.
Inverting just the checksum field, without doing anything else, would
do that ... but that feels pretty unsatisfying too.

            regards, tom lane



Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Not sure what to do here... I guess we can just change the value we overwrite
> > the page with and hope to not hit this again? But that feels deeply deeply
> > unsatisfying.
> 
> AFAICS, this strategy of whacking a predetermined chunk of the page with
> a predetermined value is going to fail 1-out-of-64K times.

Yea. I suspect that the way the modifications and checksumming are done are
actually higher chance than 1/64k. But even it actually is 1/64k, it's not
great to wait for (#animals * #catalog-changes) to approach a decent
percentage of 1/64k.


I'm was curious whether there have been similar issues in the past. Querying
the buildfarm logs suggests not, at least not in the pg_checksums test.


> We have to change the test so that it's guaranteed to produce an invalid
> checksum.  Inverting just the checksum field, without doing anything else,
> would do that ... but that feels pretty unsatisfying too.

We really ought to find a way to get to wider checksums :/

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Fri, Mar 25, 2022 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
> We really ought to find a way to get to wider checksums :/

Eh, let's just use longer names for the buildfarm animals and call it good. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
>> AFAICS, this strategy of whacking a predetermined chunk of the page with
>> a predetermined value is going to fail 1-out-of-64K times.

> Yea. I suspect that the way the modifications and checksumming are done are
> actually higher chance than 1/64k. But even it actually is 1/64k, it's not
> great to wait for (#animals * #catalog-changes) to approach a decent
> percentage of 1/64k.

Exactly.

> I'm was curious whether there have been similar issues in the past. Querying
> the buildfarm logs suggests not, at least not in the pg_checksums test.

That test has only been there since 2018 (b34e84f16).  We've probably
accumulated a couple hundred initial-catalog-contents changes since
then, so maybe this failure arrived right on schedule :-(.

> We really ought to find a way to get to wider checksums :/

That'll just reduce the probability of failure, not eliminate it.

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That'll just reduce the probability of failure, not eliminate it.

I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That'll just reduce the probability of failure, not eliminate it.

> I mean, if the expected time to the first failure on even 1 machine
> exceeds the time until the heat death of the universe by 10 orders of
> magnitude, it's probably good enough.

Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> That'll just reduce the probability of failure, not eliminate it.
>
> > I mean, if the expected time to the first failure on even 1 machine
> > exceeds the time until the heat death of the universe by 10 orders of
> > magnitude, it's probably good enough.
>
> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
> extend the MTTF to more than the project's likely lifespan, but that
> doesn't mean we couldn't get unlucky next week.

I suspect that the number of bits Andres wants to add is no less than 48.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
>> extend the MTTF to more than the project's likely lifespan, but that
>> doesn't mean we couldn't get unlucky next week.

> I suspect that the number of bits Andres wants to add is no less than 48.

I dunno.  Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

            regards, tom lane



Re: Corruption during WAL replay

From
Robert Haas
Date:
On Fri, Mar 25, 2022 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I dunno.  Compatibility and speed concerns aside, that seems like an awful
> lot of bits to be expending on every page compared to the value.

I dunno either, but over on the TDE thread people seemed quite willing
to expend like 16-32 *bytes* for page verifiers and nonces and things.
For compatibility and speed reasons, I doubt we could ever get by with
doing that in every cluster, but I do have some hope of introducing
something like that someday at least as an optional feature. It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Corruption during WAL replay

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... It's not
> like a 16-bit checksum was state-of-the-art even when we introduced
> it. We just did it because we had 2 bytes that we could repurpose
> relatively painlessly, and not any larger number. And that's still the
> case today, so at least in the short term we will have to choose some
> other solution to this problem.

Indeed.  I propose the attached, which also fixes the unsafe use
of seek() alongside syswrite(), directly contrary to what "man perlfunc"
says to do.

            regards, tom lane

diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
index 62c608eaf6..8c70453a45 100644
--- a/src/bin/pg_checksums/t/002_actions.pl
+++ b/src/bin/pg_checksums/t/002_actions.pl
@@ -24,6 +24,7 @@ sub check_relation_corruption
     my $tablespace = shift;
     my $pgdata     = $node->data_dir;

+    # Create table and discover its filesystem location.
     $node->safe_psql(
         'postgres',
         "CREATE TABLE $table AS SELECT a FROM generate_series(1,10000) AS a;
@@ -37,9 +38,6 @@ sub check_relation_corruption
     my $relfilenode_corrupted = $node->safe_psql('postgres',
         "SELECT relfilenode FROM pg_class WHERE relname = '$table';");

-    # Set page header and block size
-    my $pageheader_size = 24;
-    my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
     $node->stop;

     # Checksums are correct for single relfilenode as the table is not
@@ -55,8 +53,12 @@ sub check_relation_corruption

     # Time to create some corruption
     open my $file, '+<', "$pgdata/$file_corrupted";
-    seek($file, $pageheader_size, SEEK_SET);
-    syswrite($file, "\0\0\0\0\0\0\0\0\0");
+    my $pageheader;
+    sysread($file, $pageheader, 24) or die "sysread failed";
+    # This inverts the pd_checksum field (only); see struct PageHeaderData
+    $pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
+    sysseek($file, 0, 0) or die "sysseek failed";
+    syswrite($file, $pageheader) or die "syswrite failed";
     close $file;

     # Checksum checks on single relfilenode fail

Re: Corruption during WAL replay

From
Andres Freund
Date:
Hi,

On 2022-03-25 11:50:48 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > ... It's not
> > like a 16-bit checksum was state-of-the-art even when we introduced
> > it. We just did it because we had 2 bytes that we could repurpose
> > relatively painlessly, and not any larger number. And that's still the
> > case today, so at least in the short term we will have to choose some
> > other solution to this problem.
> 
> Indeed.  I propose the attached, which also fixes the unsafe use
> of seek() alongside syswrite(), directly contrary to what "man perlfunc"
> says to do.

That looks reasonable. Although I wonder if we loose something by not testing
the influence of the rest of the block - but I don't really see anything.

The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
which presumably has the same collision risks. Perhaps we should put a
function into Cluster.pm and use it from both?

Greetings,

Andres Freund



Re: Corruption during WAL replay

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
> which presumably has the same collision risks.

Oooh, I missed that.

> Perhaps we should put a
> function into Cluster.pm and use it from both?

+1, I'll make it so.

            regards, tom lane



Re: Corruption during WAL replay

From
Dagfinn Ilmari Mannsåker
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Robert Haas <robertmhaas@gmail.com> writes:
>> ... It's not
>> like a 16-bit checksum was state-of-the-art even when we introduced
>> it. We just did it because we had 2 bytes that we could repurpose
>> relatively painlessly, and not any larger number. And that's still the
>> case today, so at least in the short term we will have to choose some
>> other solution to this problem.
>
> Indeed.  I propose the attached, which also fixes the unsafe use
> of seek() alongside syswrite(), directly contrary to what "man perlfunc"
> says to do.

LGTM, but it would be good to include $! in the die messages.

- ilmari

>             regards, tom lane
>
> diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
> index 62c608eaf6..8c70453a45 100644
> --- a/src/bin/pg_checksums/t/002_actions.pl
> +++ b/src/bin/pg_checksums/t/002_actions.pl
> @@ -24,6 +24,7 @@ sub check_relation_corruption
>      my $tablespace = shift;
>      my $pgdata     = $node->data_dir;
>  
> +    # Create table and discover its filesystem location.
>      $node->safe_psql(
>          'postgres',
>          "CREATE TABLE $table AS SELECT a FROM generate_series(1,10000) AS a;
> @@ -37,9 +38,6 @@ sub check_relation_corruption
>      my $relfilenode_corrupted = $node->safe_psql('postgres',
>          "SELECT relfilenode FROM pg_class WHERE relname = '$table';");
>  
> -    # Set page header and block size
> -    my $pageheader_size = 24;
> -    my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
>      $node->stop;
>  
>      # Checksums are correct for single relfilenode as the table is not
> @@ -55,8 +53,12 @@ sub check_relation_corruption
>  
>      # Time to create some corruption
>      open my $file, '+<', "$pgdata/$file_corrupted";
> -    seek($file, $pageheader_size, SEEK_SET);
> -    syswrite($file, "\0\0\0\0\0\0\0\0\0");
> +    my $pageheader;
> +    sysread($file, $pageheader, 24) or die "sysread failed";
> +    # This inverts the pd_checksum field (only); see struct PageHeaderData
> +    $pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
> +    sysseek($file, 0, 0) or die "sysseek failed";
> +    syswrite($file, $pageheader) or die "syswrite failed";
>      close $file;
>  
>      # Checksum checks on single relfilenode fail



Re: Corruption during WAL replay

From
Tom Lane
Date:
=?utf-8?Q?Dagfinn_Ilmari_Manns=C3=A5ker?= <ilmari@ilmari.org> writes:
> LGTM, but it would be good to include $! in the die messages.

Roger, will do.

            regards, tom lane



Re: Corruption during WAL replay

From
Michael Paquier
Date:
On Fri, Mar 25, 2022 at 10:34:49AM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Adding another 16 bits won't get you to that, sadly.  Yeah, it *might*
>>> extend the MTTF to more than the project's likely lifespan, but that
>>> doesn't mean we couldn't get unlucky next week.
>
>> I suspect that the number of bits Andres wants to add is no less than 48.
>
> I dunno.  Compatibility and speed concerns aside, that seems like an awful
> lot of bits to be expending on every page compared to the value.

Err.  And there are not that many bits that could be recycled for this
purpose in the current page layout, aren't there?
--
Michael

Attachment

Re: Corruption during WAL replay

From
Kyotaro Horiguchi
Date:
At Thu, 24 Mar 2022 15:33:29 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > All versions pass check world.
> 
> Thanks, committed.

(I was overwhelmed by the flood of following discussion..)

Anyway, thanks for picking up this and committing!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Corruption during WAL replay

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Mar 25, 2022 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I dunno.  Compatibility and speed concerns aside, that seems like an awful
> > lot of bits to be expending on every page compared to the value.
>
> I dunno either, but over on the TDE thread people seemed quite willing
> to expend like 16-32 *bytes* for page verifiers and nonces and things.

Absolutely.

> For compatibility and speed reasons, I doubt we could ever get by with
> doing that in every cluster, but I do have some hope of introducing
> something like that someday at least as an optional feature. It's not
> like a 16-bit checksum was state-of-the-art even when we introduced
> it. We just did it because we had 2 bytes that we could repurpose
> relatively painlessly, and not any larger number. And that's still the
> case today, so at least in the short term we will have to choose some
> other solution to this problem.

I agree that this would be great as an optional feature.  Those patches
to allow the system to be built with reserved space for $whatever would
allow us to have a larger checksum for those who want it and perhaps a
nonce with TDE for those who wish that in the future.  I mentioned
before that I thought it might be a good way to introduce page-level
epochs for 64bit xids too though it never seemed to get much traction.

Anyhow, this whole thread has struck me as a good reason to polish those
patches off and add on top of them an extended checksum ability, first,
independent of TDE, and remove the dependency of those patches from the
TDE effort and instead allow it to just leverage that ability.  I still
suspect we'll have some folks who will want TDE w/o a per-page nonce and
that could be an option but we'd be able to support TDE w/ integrity
pretty easily, which would be fantastic.

Thanks,

Stephen

Attachment

Re: Corruption during WAL replay

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 12:34 PM Stephen Frost <sfrost@snowman.net> wrote:
> Anyhow, this whole thread has struck me as a good reason to polish those
> patches off and add on top of them an extended checksum ability, first,
> independent of TDE, and remove the dependency of those patches from the
> TDE effort and instead allow it to just leverage that ability.  I still
> suspect we'll have some folks who will want TDE w/o a per-page nonce and
> that could be an option but we'd be able to support TDE w/ integrity
> pretty easily, which would be fantastic.

Yes, I like that idea. Once we get beyond feature freeze, perhaps we
can try to coordinate to avoid duplication of effort -- or absence of
effort.

-- 
Robert Haas
EDB: http://www.enterprisedb.com