Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Andres Freund
Subject Re: making relfilenodes 56 bits
Date
Msg-id 20220702192951.irfhpydyisgkphtj@alap3.anarazel.de
Whole thread Raw
In response to Re: making relfilenodes 56 bits  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: making relfilenodes 56 bits
Re: making relfilenodes 56 bits
List pgsql-hackers
Hi,

On 2022-07-02 14:23:08 +0530, Dilip Kumar wrote:
> > > +     if (ShmemVariableCache->relnumbercount == 0)
> > > +     {
> > > +             XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber +
> > > +                                                              VAR_RFN_PREFETCH);
> >
> > I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful.
> 
> Maybe we can change to LogNextRelFileNumber()?

Much better.


Hm. Now that I think about it, isn't the XlogFlush() in
XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the
cost across a number of GetNewRelFileNumber() calls, but still, an additional
f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
zero fsyncs when creating a sequence or table inside a transaction (there are
some for indexes, but there's patches to fix that).

Not that I really see an obvious alternative.

I guess we could try to invent a flush-log-before-write type logic for
relfilenodes somehow? So that the first block actually written to a file needs
to ensure the WAL record that created the relation is flushed. But getting
that to work reliably seems nontrivial.


One thing that would be good is to add an assertion to a few places ensuring
that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
in the recovery path.


Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
is 8192, but you chose 64 for VAR_RFN_PREFETCH?

I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
relfilenode.


> > What's the story behind moving relfilenode to the front? Alignment
> > consideration? Seems odd to move the relfilenode before the oid. If there's an
> > alignment issue, can't you just swap it with reltablespace or such to resolve
> > it?
> 
> Because of a test case added in this commit
> (79b716cfb7a1be2a61ebb4418099db1258f35e30).  Even I did not like to
> move relfilenode before oid, but under this commit it is expected this
> to aligned as well as before any NameData check this comments
> 
> ===
> +--
> +--  Keep such columns before the first NameData column of the
> +-- catalog, since packagers can override NAMEDATALEN to an odd number.
> +--
> ===

This is embarassing. Trying to keep alignment match between C and catalog
alignment on AIX, without actually making the system understand the alignment
rules, is a remarkably shortsighted approach.

I started a separate thread about it, since it's not really relevant to this thread:
https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de

Maybe we could at least make the field order to be something like
  oid, relam, relfilenode, relname

that should be ok alignment wise, keep oid first, and seems to make sense from
an "importance" POV? Can't really interpret later fields without knowing relam
etc.



> > > Now as part of the previous patch set we have made relfilenode
> > > 56 bit wider and removed the risk of wraparound so now we don't
> > > need to wait till the next checkpoint for removing the unused
> > > relation file and we can clean them up on commit.
> >
> > Hm. Wasn't there also some issue around crash-restarts benefiting from having
> > those files around until later? I think what I'm remembering is what is
> > referenced in this comment:
> 
> I think due to wraparound if relfilenode gets reused by another
> relation in the same checkpoint then there was an issue in crash
> recovery with wal level minimal.  But the root of the issue is a
> wraparound, right?

I'm not convinced the tombstones were required solely in the oid wraparound
case before, despite the comment saying so, with wal_level=minimal. I gotta do
some non-work stuff for a bit, so I need to stop pondering this now :)

I think it might be a good idea to have a few weeks in which we do *not*
remove the tombstones, but have assertion checks against such files existing
when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
a bit later.


> > I'm doubtful it's a good idea to start dropping at the first segment. I'm
> > fairly certain that there's smgrexists() checks in some places, and they'll
> > now stop working, even if there are later segments that remained, e.g. because
> > of an error in the middle of removing later segments.
> 
> Okay, so you mean to say that we can first drop the remaining segment
> and at last we drop the segment 0 right?

I'd use the approach Robert suggested and delete from the end, going down.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: AIX support - alignment issues
Next
From: Steve Chavez
Date:
Subject: Re: Add red-black tree missing comparison searches