Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Andres Freund
Subject Re: making relfilenodes 56 bits
Date
Msg-id 20220702040840.6hkhdr6nv3u2pqc3@alap3.anarazel.de
Whole thread Raw
In response to Re: making relfilenodes 56 bits  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: making relfilenodes 56 bits
List pgsql-hackers
Hi,

I'm not feeling inspired by "locator", tbh. But I don't really have a great
alternative, so ...


On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut@localhost.localdomain>
> Date: Sat, 25 Jun 2022 10:43:12 +0530
> Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
>  relfilenumber

I don't think we have abbreviated buffer as 'buff' in many places? I think we
should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.



> Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

>  /*
> + * GenerateNewRelFileNumber
> + *
> + * Similar to GetNewObjectId but instead of new Oid it generates new
> + * relfilenumber.
> + */
> +RelFileNumber
> +GetNewRelFileNumber(void)
> +{
> +    RelFileNumber        result;
> +
> +    /* Safety check, we should never get this far in a HS standby */

Normally we don't capitalize the first character of a comment that's not a
full sentence (i.e. ending with a punctuation mark).

> +    if (RecoveryInProgress())
> +        elog(ERROR, "cannot assign RelFileNumber during recovery");
> +
> +    LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);
> +
> +    /* Check for the wraparound for the relfilenumber counter */
> +    if (unlikely (ShmemVariableCache->nextRelFileNumber > MAX_RELFILENUMBER))
> +        elog(ERROR, "relfilenumber is out of bound");
> +
> +    /* If we run out of logged for use RelFileNumber then we must log more */

"logged for use" - looks like you reformulated the sentence incompletely.


> +    if (ShmemVariableCache->relnumbercount == 0)
> +    {
> +        XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> +                                 VAR_RFN_PREFETCH);

I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.


> diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
> index e1f4eef..1cf039c 100644
> --- a/src/include/catalog/pg_class.h
> +++ b/src/include/catalog/pg_class.h
> @@ -31,6 +31,10 @@
>   */
>  CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,RelationRelation_Rowtype_Id)
BKI_SCHEMA_MACRO
>  {
> +    /* identifier of physical storage file */
> +    /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> +    int64        relfilenode BKI_DEFAULT(0);
> +
>      /* oid */
>      Oid            oid;
>  
> @@ -52,10 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
>      /* access method; 0 if not a table / index */
>      Oid            relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
>  
> -    /* identifier of physical storage file */
> -    /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
> -    Oid            relfilenode BKI_DEFAULT(0);
> -
>      /* identifier of table space for relation (0 means default for database) */
>      Oid            reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);
>

What's the story behind moving relfilenode to the front? Alignment
consideration? Seems odd to move the relfilenode before the oid. If there's an
alignment issue, can't you just swap it with reltablespace or such to resolve
it?



> From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> From: dilip kumar <dilipbalaut@localhost.localdomain>
> Date: Wed, 29 Jun 2022 13:24:32 +0530
> Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
>  checkpoint
> 
> Currently, we can not remove the unused relfilenode until the
> next checkpoint because if we remove them immediately then
> there is a risk of reusing the same relfilenode for two
> different relations during single checkpoint due to Oid
> wraparound.

Well, not quite "currently", because at this point we've fixed that in a prior
commit ;)


> Now as part of the previous patch set we have made relfilenode
> 56 bit wider and removed the risk of wraparound so now we don't
> need to wait till the next checkpoint for removing the unused
> relation file and we can clean them up on commit.

Hm. Wasn't there also some issue around crash-restarts benefiting from having
those files around until later? I think what I'm remembering is what is
referenced in this comment:


> - * For regular relations, we don't unlink the first segment file of the rel,
> - * but just truncate it to zero length, and record a request to unlink it after
> - * the next checkpoint.  Additional segments can be unlinked immediately,
> - * however.  Leaving the empty file in place prevents that relfilenumber
> - * from being reused.  The scenario this protects us from is:
> - * 1. We delete a relation (and commit, and actually remove its file).
> - * 2. We create a new relation, which by chance gets the same relfilenumber as
> - *      the just-deleted one (OIDs must've wrapped around for that to happen).
> - * 3. We crash before another checkpoint occurs.
> - * During replay, we would delete the file and then recreate it, which is fine
> - * if the contents of the file were repopulated by subsequent WAL entries.
> - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> - * file after populating it (as we do at wal_level=minimal), the contents of
> - * the file would be lost forever.  By leaving the empty file until after the
> - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> - * safe, because relfilenumber assignment skips over any existing file.

This isn't related to oid wraparound, just crashes. It's possible that the
XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
it still ought to be explained here, I think.



> + * Note that now we can immediately unlink the first segment of the regular
> + * relation as well because the relfilenumber is 56 bits wide since PG 16.  So
> + * we don't have to worry about relfilenumber getting reused for some unrelated
> + * relation file.

I'm doubtful it's a good idea to start dropping at the first segment. I'm
fairly certain that there's smgrexists() checks in some places, and they'll
now stop working, even if there are later segments that remained, e.g. because
of an error in the middle of removing later segments.



Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: Assert name/short_desc to prevent SHOW ALL segfault
Next
From: Amit Kapila
Date:
Subject: Re: Handle infinite recursion in logical replication setup