Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node
Date
Msg-id CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
Whole thread Raw
In response to Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node  (Michael Paquier <michael@paquier.xyz>)
Responses Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node
List pgsql-hackers
On Fri, Sep 27, 2019 at 3:14 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Sep 26, 2019 at 01:13:56AM +0900, Fujii Masao wrote:
> > On Tue, Sep 24, 2019 at 10:41 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> This also points out that there are other things to worry about than
> >> interruptions, as for example DropRelFileNodeLocalBuffers() could lead
> >> to an ERROR, and this happens before the physical truncation is done
> >> but after the WAL record is replayed on the standby, so any failures
> >> happening at the truncation phase before the work is done would be a
> >> problem.  However we are talking about failures which should not
> >> happen and these are elog() calls.  It would be tempting to add a
> >> critical section here, but we could still have problems if we have a
> >> failure after the WAL record has been flushed, which means that it
> >> would be replayed on the standby, and the surrounding comments are
> >> clear about that.
> >
> > Could you elaborate what problem adding a critical section there occurs?
>
> Wrapping the call of smgrtruncate() within RelationTruncate() to use a
> critical section would make things worse from the user perspective on
> the primary, no?  If the physical truncation fails, we would still
> fail WAL replay on the standby, but instead of generating an ERROR in
> the session of the user attempting the TRUNCATE, the whole primary
> would be taken down.

Thanks for elaborating that! Understood.

But this can cause subsequent recovery to always fail with invalid-pages error
and the server not to start up. This is bad. So, to allviate the situation,
I'm thinking it would be worth adding something like igore_invalid_pages
developer parameter. When this parameter is set to true, the startup process
always ignores invalid-pages errors. Thought?

Regards,

-- 
Fujii Masao



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Hooks for session start and end, take two
Next
From: Michael Paquier
Date:
Subject: Re: PATCH: standby crashed when replay block which truncated instandby but failed to truncate in master node