Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: problems with making relfilenodes 56-bits
Msg-id CAEze2Wjc42N7vzODXfviuPt3cq-TcbynjouSRMw=LMZmmX7+0A@mail.gmail.com
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
Responses Re: problems with making relfilenodes 56-bits
List pgsql-hackers
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I think it'd be interesting to look at per-record-type stats between two
> > > equivalent workloads, to see where practical workloads suffer the most
> > > (possibly with fpw=off, to make things more repeatable).
> >
> > I would expect, and Dilip's results seem to confirm, the effect to be
> > pretty uniform: basically, nearly every record gets bigger by 4 bytes.
> > That's because most records contain at least one block reference, and
> > if they contain multiple block references, likely all but one will be
> > marked BKPBLOCK_SAME_REL, so we pay the cost just once.
>
> But it doesn't really matter that much if an already large record gets a bit
> bigger. Whereas it does matter if it's a small record. Focussing on optimizing
> the record types where the increase is large seems like a potential way
> forward to me, even if we can't find something generic.
>
>
> > I thought about trying to buy back some space elsewhere, and I think
> > that would be a reasonable approach to getting this committed if we
> > could find a way to do it. However, I don't see a terribly obvious way
> > of making it happen.
>
> I think there's plenty potential...
>
>
> > Trying to do it by optimizing specific WAL record
> > types seems like a real pain in the neck, because there's tons of
> > different WAL records that all have the same problem.
>
> I am not so sure about that. Improving a bunch of the most frequent small
> records might buy you back enough on just about every workload to be OK.
>
> I put the top record sizes for an installcheck run with full_page_writes off
> at the bottom. Certainly our regression tests aren't generally
> representative. But I think it still decently highlights how just improving a
> few records could buy you back more than enough.
>
>
> > Trying to do it in a generic way makes more sense, and the fact that we have
> > 2 padding bytes available in XLogRecord seems like a place to start looking,
> > but the way forward from there is not clear to me.
>
> Random idea: xl_prev is large. Store a full xl_prev in the page header, but
> only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments, or other issues in the underlying file system. For
small records, the chance of returning incorrect data would be
significant (approximately the chance of a record boundary falling on
the underlying page boundary * the chance of a record of the same
MAXALIGN-adjusted size sitting just before the persistence boundary).
That issue is part of the reason why my proposed change upthread still
contains the full xl_prev.

A different idea is removing most block_ids from the record, and
optionally reducing per-block length fields to 1B. Used block ids are
effectively always sequential, and we only allow 33+4 valid values
(which fit in 6 bits), so we can use the remaining 2 bits to
distinguish between 'the block belonging to this ID field has at most
255B of data registered' and 'blocks up to this ID follow sequentially
without their own block ID'. That would save 2N-1 total bytes for N
blocks. It is scraping the bottom of the barrel, but I think it is
quite possible.
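
To make the 2N-1 figure concrete, here is a rough sketch of the
arithmetic under one reading of the idea. It assumes today's fixed
per-block header is id (1B) + fork_flags (1B) + data_length (2B), that
only one explicit block ID remains for a sequential run, and that every
block has at most 255B of registered data. The function names and the
proposed layout are hypothetical, not an actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of regular block IDs (0..32); the "33" in "33+4". */
#define SKETCH_MAX_BLOCKS 33

/* Fixed per-block header bytes today: id + fork_flags + data_length. */
static size_t
block_hdr_bytes_current(size_t nblocks)
{
    assert(nblocks <= SKETCH_MAX_BLOCKS);
    return nblocks * (sizeof(uint8_t) + sizeof(uint8_t) + sizeof(uint16_t));
}

/*
 * Hypothetical layout: a single ID byte covers a sequential run of
 * blocks, and each block keeps fork_flags plus a 1-byte length.
 * Assumes every block has <= 255 bytes of registered data.
 */
static size_t
block_hdr_bytes_proposed(size_t nblocks)
{
    assert(nblocks <= SKETCH_MAX_BLOCKS);
    if (nblocks == 0)
        return 0;
    return sizeof(uint8_t) + nblocks * (sizeof(uint8_t) + sizeof(uint8_t));
}

/* Bytes saved for a record referencing nblocks sequential blocks. */
static size_t
block_hdr_bytes_saved(size_t nblocks)
{
    return block_hdr_bytes_current(nblocks) - block_hdr_bytes_proposed(nblocks);
}
```

The saving splits into N-1 dropped ID bytes plus N shortened length
fields. Block images, relation identifiers and block numbers are left
out because this idea does not change them.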

Lastly, we could add XLR_BLOCK_ID_DATA_MED for data lengths >255 and
up to UINT16_MAX. That would save 2 bytes for records that only just
pass the 255B barrier, where 2B is still a fairly significant part of
the record size.
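
As a quick illustration of where the 2 bytes come from, the sketch
below compares the data header size with and without a medium variant.
It assumes the existing short header is a 1B id plus a 1B length and the
long header is a 1B id plus a 4B length; the MED variant (1B id plus a
2B length) and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Data header bytes with only SHORT (1B length) and LONG (4B length). */
static size_t
data_hdr_bytes_current(uint32_t len)
{
    if (len == 0)
        return 0;               /* no data, no header */
    if (len <= UINT8_MAX)
        return sizeof(uint8_t) + sizeof(uint8_t);
    return sizeof(uint8_t) + sizeof(uint32_t);
}

/* Same, with a hypothetical MED variant carrying a 2B length. */
static size_t
data_hdr_bytes_with_med(uint32_t len)
{
    if (len == 0)
        return 0;
    if (len <= UINT8_MAX)
        return sizeof(uint8_t) + sizeof(uint8_t);
    if (len <= UINT16_MAX)
        return sizeof(uint8_t) + sizeof(uint16_t);
    return sizeof(uint8_t) + sizeof(uint32_t);
}

/* Bytes saved by the MED variant for a given data length. */
static size_t
data_hdr_bytes_saved(uint32_t len)
{
    size_t      cur = data_hdr_bytes_current(len);
    size_t      med = data_hdr_bytes_with_med(len);

    assert(cur >= med);         /* MED never makes the header larger */
    return cur - med;
}
```

Only lengths from 256 through UINT16_MAX benefit; shorter and longer
data keep their current header size.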

Kind regards,

Matthias van de Meent


