Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: making relfilenodes 56 bits
Date
Msg-id CAFiTN-uiAngcW50Trwa94F1EWY2BxEx+B38QSyX3DtV3dzEGhA@mail.gmail.com
Whole thread Raw
In response to Re: making relfilenodes 56 bits  (Andres Freund <andres@anarazel.de>)
Responses Re: making relfilenodes 56 bits
Re: making relfilenodes 56 bits
List pgsql-hackers
On Sat, Jul 2, 2022 at 9:38 AM Andres Freund <andres@anarazel.de> wrote:

Thanks for the review,

> I'm not feeling inspired by "locator", tbh. But I don't really have a great
> alternative, so ...
>
>
> On 2022-07-01 16:12:01 +0530, Dilip Kumar wrote:
> > From f07ca9ef19e64922c6ee410707e93773d1a01d7c Mon Sep 17 00:00:00 2001
> > From: dilip kumar <dilipbalaut@localhost.localdomain>
> > Date: Sat, 25 Jun 2022 10:43:12 +0530
> > Subject: [PATCH v4 2/4] Preliminary refactoring for supporting larger
> >  relfilenumber
>
> I don't think we have abbreviated buffer as 'buff' in many places? I think we
> should either spell buffer out or use 'buf'. This is in regard to BuffTag etc.

Okay, I will change it to 'buf'

> > Subject: [PATCH v4 3/4] Use 56 bits for relfilenumber to avoid wraparound

> Normally we don't capitalize the first character of a comment that's not a
> full sentence (i.e. ending with a punctuation mark).

Okay.

> "logged for use" - looks like you reformulated the sentence incompletely.

Right, I will fix it.

> > +     if (ShmemVariableCache->relnumbercount == 0)
> > +     {
> > +             XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > +                                                              VAR_RFN_PREFETCH);
>
> I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.

Maybe we can change to LogNextRelFileNumber()?

> What's the story behind moving relfilenode to the front? Alignment
> consideration? Seems odd to move the relfilenode before the oid. If there's an
> alignment issue, can't you just swap it with reltablespace or such to resolve
> it?

Because of a test case added in this commit
(79b716cfb7a1be2a61ebb4418099db1258f35e30).  Even I did not like to
move relfilenode before oid, but under this commit it is expected this
to aligned as well as before any NameData check this comments

===
+--
+--  Keep such columns before the first NameData column of the
+-- catalog, since packagers can override NAMEDATALEN to an odd number.
+--
===

>
> > From f6e8e0e7412198b02671e67d1859a7448fe83f38 Mon Sep 17 00:00:00 2001
> > From: dilip kumar <dilipbalaut@localhost.localdomain>
> > Date: Wed, 29 Jun 2022 13:24:32 +0530
> > Subject: [PATCH v4 4/4] Don't delay removing Tombstone file until next
> >  checkpoint
> >
> > Currently, we can not remove the unused relfilenode until the
> > next checkpoint because if we remove them immediately then
> > there is a risk of reusing the same relfilenode for two
> > different relations during single checkpoint due to Oid
> > wraparound.
>
> Well, not quite "currently", because at this point we've fixed that in a prior
> commit ;)

Right, I will change, but I'm not sure whether we want to commit 0003
and 0004 as an independent patch or as a simple patch.

> > Now as part of the previous patch set we have made relfilenode
> > 56 bit wider and removed the risk of wraparound so now we don't
> > need to wait till the next checkpoint for removing the unused
> > relation file and we can clean them up on commit.
>
> Hm. Wasn't there also some issue around crash-restarts benefiting from having
> those files around until later? I think what I'm remembering is what is
> referenced in this comment:

I think due to wraparound if relfilenode gets reused by another
relation in the same checkpoint then there was an issue in crash
recovery with wal level minimal.  But the root of the issue is a
wraparound, right?

>
> > - * For regular relations, we don't unlink the first segment file of the rel,
> > - * but just truncate it to zero length, and record a request to unlink it after
> > - * the next checkpoint.  Additional segments can be unlinked immediately,
> > - * however.  Leaving the empty file in place prevents that relfilenumber
> > - * from being reused.  The scenario this protects us from is:
> > - * 1. We delete a relation (and commit, and actually remove its file).
> > - * 2. We create a new relation, which by chance gets the same relfilenumber as
> > - *     the just-deleted one (OIDs must've wrapped around for that to happen).
> > - * 3. We crash before another checkpoint occurs.
> > - * During replay, we would delete the file and then recreate it, which is fine
> > - * if the contents of the file were repopulated by subsequent WAL entries.
> > - * But if we didn't WAL-log insertions, but instead relied on fsyncing the
> > - * file after populating it (as we do at wal_level=minimal), the contents of
> > - * the file would be lost forever.  By leaving the empty file until after the
> > - * next checkpoint, we prevent reassignment of the relfilenumber until it's
> > - * safe, because relfilenumber assignment skips over any existing file.
>
> This isn't related to oid wraparound, just crashes. It's possible that the
> XLogFlush() in XLogPutNextRelFileNumber() prevents such a scenario, but if so
> it still ought to be explained here, I think.

I think the root cause of the problem is oid reuse which is due to
relfilenode wraparound, and the problem is created if there is a crash
after that.  Now, we have removed the wraparound so there won't be any
more reuse of the relfilenode so there is no problem during crash
recovery.

 In XLogPutNextRelFileNumber() we need XLogFlush() to just ensure that
after the crash recovery we do not reuse the relfilenode because now
we are not checking for the existing disk file as we have completely
removed the wraparound.

So in short the problem this comment was explaining is related to if
relfilenode get reused in same checkpoint due to wraparound then crash
recovery will loose the content of the new related which has reused
the relfilenode at wal level minimal. But, by adding XLogFlush() in
XLogPutNextRelFileNumber() we are ensuring that after crash recovery
we do not reuse the same relfilenode so we ensure this wal go to disk
first before we create the relation file on the disk.

>
> > + * Note that now we can immediately unlink the first segment of the regular
> > + * relation as well because the relfilenumber is 56 bits wide since PG 16.  So
> > + * we don't have to worry about relfilenumber getting reused for some unrelated
> > + * relation file.
>
> I'm doubtful it's a good idea to start dropping at the first segment. I'm
> fairly certain that there's smgrexists() checks in some places, and they'll
> now stop working, even if there are later segments that remained, e.g. because
> of an error in the middle of removing later segments.

Okay, so you mean to say that we can first drop the remaining segment
and at last we drop the segment 0 right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Handle infinite recursion in logical replication setup
Next
From: Robert Haas
Date:
Subject: Re: making relfilenodes 56 bits