Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node
Msg-id 20190924014019.GB2012@paquier.xyz
In response to Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Mon, Sep 23, 2019 at 01:45:14PM +0200, Tomas Vondra wrote:
> On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
>> Is this an issue?
>> Can we fix like this?
>> Thanks!
>>
>
> I do think it is a valid issue. No opinion on the fix yet, though.
> The report was sent on Saturday, so patience ;-)

And for some others it was even a longer weekend.  Anyway, the problem
can be reproduced if you apply the attached patch, which introduces a
failure point, and then run the following commands:
create table aa as select 1;
delete from aa;
\! touch /tmp/truncate_flag
vacuum aa;
\! rm /tmp/truncate_flag
vacuum aa; -- panic on standby

This also points out that there are other things to worry about than
interruptions.  For example, DropRelFileNodeLocalBuffers() could lead
to an ERROR, and this happens before the physical truncation is done
but after the WAL record is replayed on the standby, so any failure
happening in the truncation phase before the work is done would be a
problem.  However, we are talking about failures which should not
happen, and these are elog() calls.  It would be tempting to add a
critical section here, but we could still have problems if we have a
failure after the WAL record has been flushed, which means that it
would be replayed on the standby, and the surrounding comments are
clear about that.  In short, as a matter of safety, I'd like to think
that what you are suggesting is rather acceptable (that is, hold
interrupts before the WAL record is written and release them after the
physical truncation), so that truncation avoids the failures that can
be avoided.
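
To make the ordering concrete, here is a toy model in C.  It is not
PostgreSQL source: Relation, truncate_relation(), check_for_interrupts()
and the globals are all hypothetical stand-ins.  The holdoff counter
only imitates what HOLD_INTERRUPTS() and RESUME_INTERRUPTS() do with
InterruptHoldoffCount in the real server.  It is meant to show the
window between logging the truncation and doing it, nothing more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a relation: only its length in blocks. */
typedef struct { int nblocks; } Relation;

int  interrupt_holdoff = 0;        /* imitates InterruptHoldoffCount */
bool interrupt_pending = false;    /* e.g. a cancel request has arrived */
bool wal_truncate_logged = false;  /* truncate record written and flushed */

/*
 * Returns false when a pending interrupt is serviced, which in the
 * real server would surface as an ERROR.  Nothing is serviced while
 * interrupts are held; the request simply stays pending.
 */
bool check_for_interrupts(void)
{
    if (interrupt_pending && interrupt_holdoff == 0)
    {
        interrupt_pending = false;
        return false;
    }
    return true;
}

/*
 * Log the truncation, then perform it.  With hold = true the
 * interrupt window between the two steps is closed.
 * Returns false if the physical truncation was skipped.
 */
bool truncate_relation(Relation *rel, int new_nblocks, bool hold)
{
    if (hold)
        interrupt_holdoff++;       /* like HOLD_INTERRUPTS() */

    /* From here on the standby will replay the truncation. */
    wal_truncate_logged = true;

    /* Window: an interrupt here leaves the local file untouched. */
    if (!check_for_interrupts())
        return false;

    rel->nblocks = new_nblocks;    /* physical truncation */

    if (hold)
        interrupt_holdoff--;       /* like RESUME_INTERRUPTS() */
    return true;
}
```

Without the hold, an interrupt arriving in the window leaves the
record logged but the relation at its old length, which is the
divergence reproduced above.  With the hold, the request stays pending
until the truncation is done.  Note that this only covers interrupts;
an elog(ERROR) raised inside the window is not modeled and would not be
prevented by holding interrupts.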

Do others have thoughts to share on the matter?
--
Michael

Attachment
