Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CAEze2WihN2NCCFpUYtFRS0Bs5eV0KVYmbkxiH7oBdPJ32McPbA@mail.gmail.com
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, 12 Oct 2022 at 23:13, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:
> > On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:
> > > I light dusted off my old varint implementation from [1] and converted the
> > > RelFileLocator and BlockNumber from fixed width integers to varint ones. This
> > > isn't meant as a serious patch, but an experiment to see if this is a path
> > > worth pursuing.
> > >
> > > A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
> > > (for increased reproducability) shows a decent saving:
> > >
> > > master: 241106544 - 230 MB
> > > varint: 227858640 - 217 MB
> >
> > I think a significant part of this improvement comes from the premise
> > of starting with a fresh database. tablespace OID will indeed most
> > likely be low, but database OID may very well be linearly distributed
> > if concurrent workloads in the cluster include updating (potentially
> > unlogged) TOASTed columns and the databases are not created in one
> > "big bang" but over the lifetime of the cluster. In that case DBOID
> > will consume 5B for a significant fraction of databases (anything with
> > OID >=2^28).
> >
> > My point being: I don't think we should have WAL performance that
> > differs between databases depending on which OID was assigned to
> > that database.
>
> To me this is raising the bar to an absurd level. Some minor space usage
> increase after oid wraparound and for very large block numbers isn't a huge
> issue - if you're in that situation you already have a huge amount of wal.

I didn't mean to block all varlen encoding; I just want to make clear
that I don't think it's great for performance testing and consistency
across installations if WAL size (and thus part of your performance)
depends on the actual database/relation/tablespace combination you're
running your workload in.

With the 56-bit relfilenode, the size of a block reference would
realistically range between 7 bytes and 23 bytes:

- tblspc=0 = 1B
  db=16386 = 3B
  rel=797 = 2B (797 = 4 * default # of data relations in a fresh DB in PG14, + 1)
  block=0 = 1B

vs

- tblspc>=2^28 = 5B
  db>=2^28 = 5B
  rel>=2^49 = 8B
  block>=2^28 = 5B

That's a difference of 16 bytes, of which only the block number can
realistically be directly influenced by the user ("just don't have
relations larger than X blocks").

If applied to Dilip's pgbench transaction data, that would imply a
minimum per-transaction WAL usage of 509 bytes and a maximum of 609
bytes. That is nearly a 20% difference in WAL size based only on the
location of your data, and I'm just not comfortable with that. Users
have little to no control over the internal IDs we assign to these
fields, yet those IDs would affect performance fairly significantly.

(The percentage difference between min and max WAL size changes by
less than 0.1% after accounting for record alignment.)

> > 0002 - Rework XLogRecord
> > This makes many fields in the xlog header optional, reducing the size
> > of many xlog records by several bytes. This implements the design I
> > shared in my earlier message [1].
> >
> > 0003 - Rework XLogRecordBlockHeader.
> > This patch could be applied on current head, and saves some bytes in
> > per-block data. It potentially saves some bytes per registered
> > block/buffer in the WAL record (max 2 bytes for the first block, after
> > that up to 3). See the patch's commit message in the patch for
> > detailed information.
>
> The amount of complexity these two introduce seems quite substantial to
> me. Both from a maintenance and a runtime perspective. I think we'd be better
> off using building blocks like variable-length encoded values than open
> coding it in many places.

I guess that's true for length fields, but I don't think dynamic
header field presence (the 0002 rework, and the omission of
data_length in 0003) is that bad. We already have dynamic data
inclusion through the 25x block IDs; I'm not sure why we couldn't do
the same more compactly with bitfields as indicators instead (hence
the dynamic header size).

As for complexity, I think my current patchset is mostly complex due
to a lack of tooling. Note that decoding makes common use of
COPY_HEADER_FIELD, for which we don't really have an equivalent in
XLogRecordAssemble. I think the code for 0002 would improve
significantly in readability if such a construct were available.

To reduce complexity in 0003, I could drop the 'repeat id'
optimization; that would simplify the code significantly, at the cost
of no longer saving 1 byte per registered block after the first.

Kind regards,

Matthias van de Meent


