Re: problems with making relfilenodes 56-bits - Mailing list pgsql-hackers

From Robert Haas
Subject Re: problems with making relfilenodes 56-bits
Date
Msg-id CA+TgmoZQm0v0MmpQdKBWiKdEDctuQX=mXuYc9tSnPM4jKvj67Q@mail.gmail.com
In response to Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
Responses Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Oct 12, 2022 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:
> > I think a significant part of this improvement comes from the premise
> > of starting with a fresh database. tablespace OID will indeed most
> > likely be low, but database OID may very well be linearly distributed
> > if concurrent workloads in the cluster include updating (potentially
> > unlogged) TOASTed columns and the databases are not created in one
> > "big bang" but over the lifetime of the cluster. In that case DBOID
> > will consume 5B for a significant fraction of databases (anything with
> > OID >=2^28).
> >
> > My point being: I don't think that we should have different WAL
> > performance in databases which is dependent on which OID was assigned
> > to that database.
>
> To me this is raising the bar to an absurd level. Some minor space usage
> increase after oid wraparound and for very large block numbers isn't a huge
> issue - if you're in that situation you already have a huge amount of wal.

I have to admit that I worried about the same thing that Matthias
raises, more or less. But I don't know whether I'm right to be
worried. A variable-length representation of any kind is essentially a
gamble that values requiring fewer bytes will be more common than
values requiring more bytes, and by enough to justify the overhead
that the method has. And, you want it to be more common for each
individual user, not just overall. For example, more people are going
to have small relations than large ones, but nobody wants performance
to drop off a cliff when the relation passes a certain size threshold.
Now, it wouldn't drop off a cliff here, but what about someone with a
really big, append-only relation? Won't they just end up writing more
to WAL than with the present system?

Maybe not. They might still have some writes to relations other than
the very large, append-only relation, and then they could still win.
Also, if we assume that the overhead of the variable-length
representation is never more than 1 byte beyond what is needed to
represent the underlying quantity in the minimal number of bytes, they
are only going to lose if their relation is already more than half the
maximum theoretical size, and if that is the case, they are in danger
of hitting the size limit anyway. You can argue that there's still a
risk here, but it doesn't seem like that bad of a risk.
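The "never more than 1 byte beyond the minimal representation" premise can be made concrete with a generic continuation-byte varint (an LEB128-style sketch only; the actual encoding used by the patch may differ, and under this particular scheme the crossover for a 32-bit block number falls at 2^28 rather than at half the maximum size):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

# Small values win big; only values needing more than 28 bits
# cost a 5th byte relative to a fixed 4-byte field.
print(len(encode_varint(100)))        # 1 byte vs. 4 fixed
print(len(encode_varint(2**28 - 1)))  # 4 bytes: break-even
print(len(encode_varint(2**28)))      # 5 bytes: 1 byte worse than fixed
```

The gamble Robert describes is visible directly in the byte counts: the encoding pays off exactly when most encoded values sit below the break-even point.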

But the same thing is not so obvious for, let's say, database OIDs.
What if you just have one or a few databases, but due to the previous
history of the cluster, their OIDs just happen to be big? Then you're
just behind where you would have been without the patch. Granted, if
this happens to you, you will be in the minority, because most users
are likely to have small database OIDs, but the fact that other people
are writing less WAL on average isn't going to make you happy about
writing more WAL on average. And even for a user for which that
doesn't happen, it's not at all unlikely that the gains they see will
be less than what we see on a freshly-initdb'd database.
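The database-OID concern can be quantified under a generic continuation-byte encoding (illustrative only; the OID values below are hypothetical, and the patch's real encoding may draw the boundaries differently):

```python
def varint_size(n: int) -> int:
    """Bytes needed for n under a 7-bits-per-byte continuation encoding."""
    return max(1, -(-n.bit_length() // 7))  # ceil(bit_length / 7), minimum 1

# Two hypothetical clusters, each containing a single database:
print(varint_size(5))      # freshly-initdb'd cluster: 1 byte, saves 3 vs. fixed 4
print(varint_size(2**28))  # OID assigned late in cluster life: 5 bytes, 1 worse
```

The cluster with the large OID pays the extra byte on every affected record, regardless of how small its workload is, which is why the per-user (rather than aggregate) distribution of values matters.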

So I don't really know what the answer is here. I don't think this
technique sucks, but I don't think it's necessarily a categorical win
for every case, either. And it even seems hard to reason about which
cases are likely to be wins and which cases are likely to be losses.

> > 0002 - Rework XLogRecord
> > This makes many fields in the xlog header optional, reducing the size
> > of many xlog records by several bytes. This implements the design I
> > shared in my earlier message [1].
> >
> > 0003 - Rework XLogRecordBlockHeader.
> > This patch could be applied on current head, and saves some bytes in
> > per-block data. It potentially saves some bytes per registered
> > block/buffer in the WAL record (max 2 bytes for the first block, after
> > that up to 3). See the commit message in the patch for detailed
> > information.
>
> The amount of complexity these two introduce seems quite substantial to
> me. Both from a maintenance and a runtime perspective. I think we'd be better
> off using building blocks like variable lengths encoded values than open
> coding it in many places.

I agree that this looks pretty ornate as written, but I think there
might be some good ideas in here, too. It is also easy to reason about
this kind of thing at least in terms of space consumption. It's a bit
harder to know how things will play out in terms of CPU cycles and
code complexity.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


