Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Antonin Houska
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id 27476.1632752108@antos
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs
List pgsql-hackers
Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Fri, Sep 24, 2021 at 4:44 PM Antonin Houska <ah@cybertec.at> wrote:
> >
> > Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > On Mon, Sep 20, 2021 at 10:24 AM Antonin Houska <ah@cybertec.at> wrote:
> > > >
> > > > Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > On Fri, Sep 17, 2021 at 9:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > > > > >
> > > > > > > On Tue, Sep 14, 2021 at 10:51:42AM +0200, Antonin Houska wrote:
> > > > > >
> > > > > > * What happened with the idea of abandoning discard worker for the sake
> > > > > >   of simplicity? From what I see limiting everything to foreground undo
> > > > > >   could reduce the core of the patch series to the first four patches
> > > > > >   (forgetting about test and docs, but I guess it would be enough at
> > > > > >   least for the design review), which is already less overwhelming.
> >
> > > > What we can miss, at least for the cleanup of the orphaned files, is the *undo
> > > > worker*. In this patch series the cleanup is handled by the startup process.
> > > >
> > >
> > > Okay, I think various people at different point of times has suggested
> > > that idea. I think one thing we might need to consider is what to do
> > > in case of a FATAL error? In case of FATAL error, it won't be
> > > advisable to execute undo immediately, so would we upgrade the error
> > > to PANIC in such cases. I remember vaguely that for clean up of
> > > orphaned files that can happen rarely and someone has suggested
> > > upgrading the error to PANIC in such a case but I don't remember the
> > > exact details.
> >
> > Do you mean FATAL error during normal operation?
> >
>
> Yes.
>
> > As far as I understand, even
> > zheap does not rely on immediate UNDO execution (otherwise it'd never
> > introduce the undo worker), so FATAL only means that the undo needs to be
> > applied later so it can be discarded.
> >
>
> Yeah, zheap either applies undo later via background worker or next
> time before dml operation if there is a need.
>
> > As for the orphaned files cleanup feature with no undo worker, we might need
> > PANIC to ensure that the undo is applied during restart and that it can be
> > discarded, otherwise the unapplied undo log would stay there until the next
> > (regular) restart and it would block discarding. However upgrading FATAL to
> > PANIC just because the current transaction created a table seems quite
> > rude.
> >
>
> True, I guess but we can once see in what all scenarios it can
> generate FATAL during that operation.

By "that operation" you mean "CREATE TABLE"?

It's not about FATAL during CREATE TABLE, rather it's about FATAL anytime
during a transaction. Whichever operation caused the FATAL error, we'd need to
upgrade it to PANIC as long as the transaction has some undo.

Although the postgres core probably does not raise FATAL errors too often (OOM
conditions seem to be the typical cause), I'm still not enthusiastic about
idea that the undo feature turns such errors into PANIC.

I wonder what the reason to avoid undoing transaction on FATAL is. If it's
about possibly long duration of the undo execution, deletion of orphaned files
(relations or the whole databases) via undo shouldn't make things worse
because currently FATAL also triggers this sort of cleanup immediately, it's
just implemented in different ways.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



pgsql-hackers by date:

Previous
From: "Euler Taveira"
Date:
Subject: Re: row filtering for logical replication
Next
From: Robert Haas
Date:
Subject: Re: when the startup process doesn't (logging startup delays)