Re: Make relfile tombstone files conditional on WAL level - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Make relfile tombstone files conditional on WAL level
Msg-id CA+TgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg@mail.gmail.com
In response to Re: Make relfile tombstone files conditional on WAL level  (Andres Freund <andres@anarazel.de>)
Responses Re: Make relfile tombstone files conditional on WAL level
List pgsql-hackers
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund <andres@anarazel.de> wrote:
> I guess there's a somewhat hacky way to get somewhere without actually
> increasing the size. We could take 3 bytes from the fork number and use that
> to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> everyone.
>
> It's not like we can use those bytes in a useful way, due to alignment
> requirements. Declaring that the high 7 bytes are for the relNode portion and
> the low byte for the fork would still allow efficient comparisons and doesn't
> seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocates the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.
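
To make the layout and the non-looping allocation concrete, here is a
minimal sketch in C. It is not PostgreSQL code: the names PackedRelFork,
pack_relfork, packed_relnode, packed_fork and alloc_relnode are
hypothetical, and the counter is a plain variable where a real
implementation would need shared state and WAL logging. It also spells
out one reading of the WAL argument: 2^64 bytes of WAL divided by 2^56
relfilenodes leaves 256 bytes of WAL per allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: high 56 bits = relfilenode, low 8 bits = fork number. */
#define RELNODE_BITS 56
#define FORK_BITS    8
#define RELNODE_MAX  ((UINT64_C(1) << RELNODE_BITS) - 1)

/* 2^64 bytes of WAL / 2^56 relfilenodes = 256 bytes of WAL per relfilenode. */
#define WAL_BYTES_PER_RELNODE (UINT64_C(1) << (64 - RELNODE_BITS))

typedef uint64_t PackedRelFork;     /* hypothetical type name */

/* Pack a 56-bit relfilenode and an 8-bit fork number into one 64-bit key. */
static inline PackedRelFork
pack_relfork(uint64_t relnode, uint8_t fork)
{
    assert(relnode <= RELNODE_MAX);
    return (relnode << FORK_BITS) | fork;
}

static inline uint64_t
packed_relnode(PackedRelFork p)
{
    return p >> FORK_BITS;
}

static inline uint8_t
packed_fork(PackedRelFork p)
{
    return (uint8_t) (p & 0xFF);
}

/*
 * Hand out the smallest never-assigned value; no retry loop is needed.
 * Returns 1 and stores the value in *out, or 0 if the counter is exhausted.
 * A real version would WAL-log the advance of *counter (assumption).
 */
static int
alloc_relnode(uint64_t *counter, uint64_t *out)
{
    if (*counter > RELNODE_MAX)
        return 0;
    *out = (*counter)++;
    return 1;
}
```

Because the relfilenode occupies the high bits, comparing two packed
values as plain 64-bit integers orders them by relfilenode first and by
fork second, which is the efficient comparison the quoted text refers
to.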

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that "some relations of database \"%s\" are already in
tablespace \"%s\"" because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unlinked files that Windows is still returning because
they're still open -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

I think, anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com
