Re: better page-level checksums - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: better page-level checksums
Date
Msg-id CAH2-Wznckc7EMLVZVR5eRWQVhP0VG-EGxG4UrBcPXAG17SuBeA@mail.gmail.com
Whole thread Raw
In response to Re: better page-level checksums  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: better page-level checksums
List pgsql-hackers
On Tue, Jun 14, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 13, 2022 at 6:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Anyway, I can see how it would be useful to be able to know the offset
> > of a nonce or of a hash digest on any given page, without access to a
> > running server. But why shouldn't that be possible with other designs,
> > including designs closer to what I've outlined?
>
> I don't know what you mean by this. As far as I'm aware, the only
> design you've outlined is one where the space wasn't at the same
> offset on every page.

I am skeptical of that particular aspect, yes. Though I would define
it the other way around (now the true special area struct isn't
necessarily at the same offset for a given AM, at least across data
directories).

My main concern is maintaining the ability to interpret much about the
contents of a page without context, and to not make it any harder to
grow the special area dynamically -- which is a broader concern.
Your patch isn't going to be the last one that wants to do something
with the special area. This needs to be carefully considered.

I see a huge amount of potential for adding new optimizations that use
subsidiary space on the page, presumably implemented via a special
area that can grow dynamically. For example, an ad-hoc compression
technique for heap pages that temporarily "absorbs" some extra
versions in the event of opportunistic pruning running and failing to
free enough space. Such a design would operate on similar principles
to deduplication in unique indexes, where the goal is to buy time
rather than buy space. When we fail to keep the contents of a heap
page together today, we often barely fail, so I expect something like
this to have an outsized impact on some workloads.

> In general, I was imagining that you'd need to look at the control
> file to understand how much space had been reserved per page in this
> particular cluster. I agree that's a bit awkward, especially for
> pg_filedump. However, pg_filedump and I think also some code internal
> to PostgreSQL try to figure out what kind of page we've got by looking
> at the *size* of the special space. It's only good luck that we
> haven't had a collision there yet, and continuing to rely on that
> seems like a dead end. Perhaps we should start including a per-AM
> magic number at the beginning of the special space.

It's true that that approach is just a hack -- we probably can do
better. I don't think that it's okay to break it, though. At least not
without providing a comparable alternative, that doesn't rely on
context from the control file.

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Small TAP improvements
Next
From: Tom Lane
Date:
Subject: Re: Small TAP improvements