On Mon, Sep 23, 2019 at 01:45:14PM +0200, Tomas Vondra wrote:
> On Mon, Sep 23, 2019 at 03:48:50PM +0800, Thunder wrote:
>> Is this an issue?
>> Can we fix like this?
>> Thanks!
>>
>
> I do think it is a valid issue. No opinion on the fix yet, though.
> The report was sent on saturday, so patience ;-)
And for some others it was even a longer weekend. Anyway, the problem
can be reproduced if you apply the attached which introduces a failure
point, and then if you run the following commands:
create table aa as select 1;
delete from aa;
\! touch /tmp/truncate_flag
vacuum aa;
\! rm /tmp/truncate_flag
vacuum aa; -- panic on standby
This also points out that there are other things to worry about than
interruptions, as for example DropRelFileNodeLocalBuffers() could lead
to an ERROR, and this happens before the physical truncation is done
but after the WAL record is replayed on the standby, so any failures
happening at the truncation phase before the work is done would be a
problem. However we are talking about failures which should not
happen and these are elog() calls. It would be tempting to add a
critical section here, but we could still have problems if we have a
failure after the WAL record has been flushed, which means that it
would be replayed on the standby, and the surrounding comments are
clear about that. In short, as a matter of safety I'd like to think
that what you are suggesting is rather acceptable (aka hold interrupts
before the WAL record is written and release after the physical
truncate), so as truncation avoids failures possible to avoid.
Do others have thoughts to share on the matter?
--
Michael