Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CAEze2Wjd3jY_UhhOGdGGnC6NO=+NmtNOmd=JaYv-v-nwBAiXXA@mail.gmail.com
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
> > On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
> > > On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
> > > I thought about trying to buy back some space elsewhere, and I think
> > > that would be a reasonable approach to getting this committed if we
> > > could find a way to do it. However, I don't see a terribly obvious way
> > > of making it happen.
> >
> > I think there's plenty potential...
>
> I lightly dusted off my old varint implementation from [1] and converted the
> RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> isn't meant as a serious patch, but an experiment to see if this is a path
> worth pursuing.
>
> A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> (for increased reproducibility) shows a decent saving:
>
> master: 241106544 - 230 MB
> varint: 227858640 - 217 MB

I think a significant part of this improvement comes from the premise
of starting with a fresh database. The tablespace OID will indeed most
likely be low, but the database OID may very well end up anywhere in
the OID space: if concurrent workloads in the cluster include updating
(potentially unlogged) TOASTed columns, and the databases are not
created in one "big bang" but over the lifetime of the cluster, the
shared OID counter keeps advancing between database creations. In that
case the dbOid will consume 5 bytes for a significant fraction of
databases (anything with OID >= 2^28).
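
For reference: with a 7-bits-per-byte (LEB128-style) varint encoding
(my assumption here; your prototype stores the data big endian and may
use a different length scheme) each additional 7 bits of the value
costs one more encoded byte, which is where that 5-byte figure for
OID >= 2^28 comes from:

#include <stddef.h>
#include <stdint.h>

/*
 * Illustration only: LEB128-style unsigned varint encoder.  Returns
 * the number of bytes written to buf (at most 10 for a uint64).
 */
static size_t
varint_encode_uint64(uint64_t value, uint8_t *buf)
{
    size_t      len = 0;

    do
    {
        uint8_t     byte = value & 0x7F;

        value >>= 7;
        if (value != 0)
            byte |= 0x80;       /* continuation bit: more bytes follow */
        buf[len++] = byte;
    } while (value != 0);

    return len;
}

/*
 * varint_encode_uint64(200, buf)            => 2 bytes
 * varint_encode_uint64((1 << 28) - 1, buf)  => 4 bytes
 * varint_encode_uint64(1 << 28, buf)        => 5 bytes
 */

So any database whose OID ended up above 2^28 pays an extra byte for
every record that carries its dbOid.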

My point being: I don't think we should have WAL performance that
differs between databases depending on which OID happened to be
assigned to the database.
In addition, this variable-length encoding of the relfilenode means
that performance would degrade over time, as a relation's
relfilelocator gets updated to a wider number (through VACUUM FULL or
other relfilelocator cycling; e.g. re-importing a database). For
maximum performance you'd have to tune your database to have the
lowest possible database, namespace and relfilelocator numbers, which
(in older clusters) implies hacking into the catalogs; that seems like
an antipattern.

I would have much less issue with this if we had separate counters
per database (and approximately incremental dbOids), but that's not
the case right now.

> The average record size goes from 102.7 to 95.7 bytes excluding the remaining
> FPIs, 118.1 to 111.0 including FPIs.

That's quite promising.

> There's plenty other spots that could be converted (e.g. the record length
> which rarely needs four bytes), this is just meant as a demonstration.

Agreed.

> I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
> profile does show pg_varint_decode_uint64(), but partially that seems to be
> offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.
>
> That would be tolerable, I think, because waldump --stats pretty much doesn't
> do anything with the WAL.
>
> But I suspect there's plenty of optimization potential in the varint
> code. Right now it e.g. stores data as big endian, and the bswap instructions
> do show up. And a lot of my bit-maskery could be optimized too.

One thing that comes to mind is that we will never see a dbOid < 2^8
(and rarely one < 2^14; nor an spcOid < 2^8, for that matter), so
we'll probably waste at least one or two bits in the encoding of those
values. That's not the end of the world, but it'd probably be better
if we could improve on that: up to 6% of the field's disk usage would
be wasted on an always-set bit.
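
Purely as an illustration of what I mean (this is not in the attached
patches, and the names are made up): if the encoding could rely on a
floor such as "dbOid >= 256", subtracting that floor before encoding
and adding it back on decode would keep those always-set bits from
costing encoded space:

#include <stddef.h>
#include <stdint.h>

/* varint_encode_uint64() as sketched earlier in this mail */
extern size_t varint_encode_uint64(uint64_t value, uint8_t *buf);

/* Hypothetical floor; only usable if the invariant actually holds. */
#define DBOID_ENCODE_FLOOR  UINT64_C(256)

static size_t
encode_dboid(uint32_t dboid, uint8_t *buf)
{
    /*
     * Caller must guarantee dboid >= DBOID_ENCODE_FLOOR; the decode
     * side adds the floor back after decoding the varint.
     */
    return varint_encode_uint64(dboid - DBOID_ENCODE_FLOOR, buf);
}

That only reclaims a byte when the unbiased value happens to sit just
above a 7-bit boundary, so it may not be worth the complexity; it is
just to show where the wasted bits come from.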

----

Attached is a prototype patchset that reduces the WAL record size in
many common cases. It is still a prototype: it fails tests due to a
locking issue in prepared_xacts whose source I have not yet been able
to find, and it could use some more polishing, but the base case seems
quite good. I haven't run the numbers yet, though.

0001 - Extract xl_rminfo from xl_info
See [0] for more information on why that's useful; the patch was
pulled from there. It is mainly included here to reduce the size of
0002, and mostly consists of find-and-replace changes to how rmgrs
extract their bits from xl_info.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1].
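
To give an idea of the general shape (the struct, field names and flag
assignments below are made up for illustration; the actual layout is
in the attached 0002): the record starts with a short fixed prefix
that carries a flags field, and header fields such as the xid or a
wide length field are only written when the corresponding flag is set.

#include <stdint.h>

/*
 * Illustration only, not the layout from the attached patch: a small
 * fixed prefix, followed by fields that are present only when their
 * flag bit is set.
 */
typedef struct XLogRecordPrefix
{
    uint8_t     xlr_flags;      /* which optional fields follow */
    uint8_t     xlr_rmid;       /* resource manager id */
} XLogRecordPrefix;

#define XLR_HAS_XID          0x01   /* 4-byte TransactionId follows */
#define XLR_HAS_LONG_LENGTH  0x02   /* 4-byte total length, else 1 byte */

Records that need neither an xid nor a long length field then simply
don't pay for those bytes.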

0003 - Rework XLogRecordBlockHeader
This patch could be applied on current HEAD, and potentially saves
some bytes per registered block/buffer in the WAL record (up to 2
bytes for the first block, up to 3 for each block after that). See the
patch's commit message for detailed information.

Kind regards,

Matthias van de Meent

[0] https://postgr.es/m/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg%40mail.gmail.com
[1] https://postgr.es/m/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com

