Re: using an end-of-recovery record in all cases - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: using an end-of-recovery record in all cases
Date
Msg-id 20220420170224.GA2579385@nathanxps13
In response to Re: using an end-of-recovery record in all cases  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: using an end-of-recovery record in all cases  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Apr 20, 2022 at 09:26:07AM -0400, Robert Haas wrote:
> I was talking with Thomas Munro yesterday and he thinks there is a
> problem with relfilenode reuse here. In normal running, when a
> relation is dropped, we leave behind a 0-length file until the next
> checkpoint; this keeps that relfilenode from being used even if the
> OID counter wraps around. If we didn't do that, then imagine that
> while running with wal_level=minimal, we drop an existing relation,
> create a new relation with the same OID, load some data into it, and
> crash, all within the same checkpoint cycle, then we will be able to
> replay the drop, but we will not be able to restore the relation
> contents afterward because at wal_level=minimal they are not logged.
> Apparently, we don't create tombstone files during recovery because we
> know that there will be a checkpoint at the end.

In the example you provided, won't the tombstone file already be present
before the crash?  During recovery, the tombstone file will be removed, and
the new relation wouldn't use the same relfilenode anyway.  I'm probably
missing something obvious here.

I do see the problem if we drop an existing relation, crash, reuse the
filenode, and then crash again (all within the same checkpoint cycle).  The
first recovery would remove the tombstone file, and the second recovery
would wipe out the new relation's files.
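
To make the two-crash sequence concrete, here is a rough toy model in Python.
It is only a sketch and not PostgreSQL code: the ToyCluster class, its
methods, and the filenode numbers 1000 and 1001 are all made up for
illustration.  It assumes that at wal_level=minimal only the drop and the
create are WAL-logged (the loaded data is not), and that recovery that does
not end with a checkpoint leaves the redo start point unchanged.

```python
class ToyCluster:
    """Toy model: relfilenode -> file contents, plus WAL since the last checkpoint."""

    TOMBSTONE = "tombstone"

    def __init__(self):
        self.files = {}   # hypothetical on-disk state
        self.wal = []     # records written since the last checkpoint

    def create_and_load(self, node, data):
        # A filenode is unusable while any file (even a tombstone) exists for it.
        if node in self.files:
            return False
        self.wal.append(("create", node))  # creation is logged ...
        self.files[node] = data            # ... the data is not (wal_level=minimal)
        return True

    def drop(self, node):
        # Normal running: log the drop and leave a 0-length tombstone behind.
        self.wal.append(("drop", node))
        self.files[node] = self.TOMBSTONE

    def checkpoint(self):
        # A checkpoint removes tombstones and advances the redo start point.
        self.files = {n: c for n, c in self.files.items() if c != self.TOMBSTONE}
        self.wal = []

    def crash_and_recover(self, tombstones_in_recovery, checkpoint_at_end):
        # Files are assumed to survive the crash; WAL is replayed from the
        # last checkpoint.
        for op, node in list(self.wal):
            if op == "create":
                self.files.setdefault(node, "")  # recreate as empty if missing
            elif op == "drop":
                if tombstones_in_recovery:
                    self.files[node] = self.TOMBSTONE
                else:
                    self.files.pop(node, None)   # remove immediately
        if checkpoint_at_end:
            self.checkpoint()


def run_scenario(tombstones_in_recovery, checkpoint_at_end):
    """Drop, crash, reuse the filenode if possible, crash again."""
    c = ToyCluster()
    c.files[1000] = "old data"  # relation that existed at the last checkpoint
    c.drop(1000)
    c.crash_and_recover(tombstones_in_recovery, checkpoint_at_end)
    node = 1000
    if not c.create_and_load(node, "new data"):
        node = 1001             # filenode 1000 is still blocked, pick another
        c.create_and_load(node, "new data")
    c.crash_and_recover(tombstones_in_recovery, checkpoint_at_end)
    return node, c.files[node]
```

In this model, run_scenario(False, False) ends with filenode 1000 reused and
its file empty, which is the data loss described above.  Either ending
recovery with a checkpoint or creating tombstones during recovery keeps
"new data" intact.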

> With the existing use of the end-of-recovery record, we always know
> that wal_level>minimal, because we're only using it on standbys. But
> with this use that wouldn't be true any more. So I guess we need to
> start creating tombstone files even during recovery, or else do
> something like what Dilip coded up in
> http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
> which I think would be a better solution at least in the long term.

IMO creating tombstone files during recovery would be good if only to
reduce the branching a bit.  I suppose removing the files immediately during
recovery might be an optimization in some cases, but I am skeptical that it
makes much of a difference in practice.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


