Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Date
Msg-id CALj2ACV+acrnWUdwSNUXNzXLNA+kFkfuT8t=wiMoPhBWKrWUeA@mail.gmail.com
In response to Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work  (Andres Freund <andres@anarazel.de>)
Responses Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
List pgsql-hackers
On Fri, Jan 14, 2022 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing.
>
> This seems like it'd make failures to remove the files practically
> invisible. Which'd have its own set of problems?
>
> What motivated proposing this change?

We had an issue where many mapping files were generated during crash
recovery, and the end-of-recovery checkpoint was taking a long time. We
had to intervene manually and delete some of the mapping files
(although it may not sound sensible) to make the end-of-recovery
checkpoint faster. Because of a race between the manual deletion and
the checkpoint's own deletion, unlink failed with an error, which
crashed the server. The server then entered recovery again, wasting
all of the earlier recovery work.

In summary, with the changes proposed in this thread for
CheckPointLogicalRewriteHeap (emitting LOG-only messages for unlink
failures and continuing with the other files), together with the
existing code in CheckPointSnapBuild, I'm sure we can avoid wasting the
recovery work that has already been done when unlink fails for any
reason.
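
To make the intended behavior concrete, here is a minimal standalone
sketch in plain C. It is not the patch and does not use any PostgreSQL
internals: the function name remove_files_log_only, the prefix-matching
rule, and the "LOG:" lines written to stderr (standing in for a
LOG-level server message) are all assumptions made for illustration.
The point it shows is only the control flow: a failed unlink is
reported and counted, and the loop continues with the next file.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): remove every entry in
 * dirpath whose name starts with prefix.  A failed unlink() is reported
 * on stderr and counted, and the loop moves on to the next entry
 * instead of aborting the whole pass.
 *
 * Returns the number of entries that could not be removed, or -1 if the
 * directory itself could not be opened.
 */
static int
remove_files_log_only(const char *dirpath, const char *prefix)
{
    DIR        *dir;
    struct dirent *de;
    size_t      prefixlen;
    int         failures = 0;

    assert(dirpath != NULL && prefix != NULL);
    prefixlen = strlen(prefix);

    dir = opendir(dirpath);
    if (dir == NULL)
    {
        fprintf(stderr, "LOG: could not open directory \"%s\": %s\n",
                dirpath, strerror(errno));
        return -1;
    }

    while ((de = readdir(dir)) != NULL)
    {
        char        path[4096];
        int         n;

        /* Never touch the directory self/parent entries. */
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        /* Only consider entries that match the requested prefix. */
        if (strncmp(de->d_name, prefix, prefixlen) != 0)
            continue;

        n = snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        if (n < 0 || (size_t) n >= sizeof(path))
        {
            /* Do not unlink a truncated path; report and keep going. */
            fprintf(stderr, "LOG: path too long for \"%s\"\n", de->d_name);
            failures++;
            continue;
        }

        if (unlink(path) < 0)
        {
            /*
             * Log-only handling: this covers the case where the file was
             * already deleted by someone else (ENOENT) as well as any
             * other unlink failure.
             */
            fprintf(stderr, "LOG: could not remove file \"%s\": %s\n",
                    path, strerror(errno));
            failures++;
            continue;
        }
    }

    closedir(dir);
    return failures;
}
```

A caller could use the returned count to decide whether to emit a
summary message, without giving up the work already done. Whether the
failures stay visible enough this way is exactly the concern raised
above, so the sketch deliberately reports every failure rather than
suppressing it.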

Regards,
Bharath Rupireddy.


