Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas
Date
Msg-id CAH2-WzkqGMbc2bbm2zwoSpy2RpH0KSvhMcyD6qWewPUbBy8gdg@mail.gmail.com
In response to Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas
List pgsql-hackers
On Thu, Apr 7, 2022 at 7:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Because there's no place to put them in the existing page format. We
> jammed checksums into the 2-byte field that had previously been set
> aside for the TLI, but that wasn't really an ideal solution because it
> meant we ended up with a checksum that is only 16 bits wide. However,
> the 2 bytes set aside for the TLI weren't really being used
> effectively anyway, so repurposing them was relatively easy, and a
> 16-bit checksum is better than nothing.

But if we were in a green-field situation we'd probably not want to
use up several bytes for a nonce anyway. You said so yourself.

> I do understand that there are significant challenges and performance
> concerns around having these kinds of initdb-controlled page layout
> changes, so the future of that patch is unclear.

Why does it need to be at initdb time?

Though I cannot prove it, I suspect that the original intent of the
special area was to support an additional (though typically small)
variable-length array that works a little like the current line
pointer array. This array would have to grow backwards (newer items
get appended at earlier physical offsets), unlike our line pointer
array (which gets appended to at the end, in the simple and obvious
way). Growing backwards like this happens with DB systems that store
their line pointer array at the end of the page (the traditional
approach from the System R days, I believe).

Supporting a variable-length special area array like this would mean
that any time you add a new item to the variable-sized array in the
special area, the page's entire tuple space has to be memmove()'d
backwards by a couple of bytes to create the required space. And so
the relevant bufpage.c routine would have to adjust the whole line
pointer array such that each lp_off received a compensating
adjustment. The array might only be for some kind of page-level
transaction metadata, something like that -- shifting it around is
pretty expensive (reusing existing slots isn't too expensive, though).
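
A minimal sketch of the shift described above, using a toy page model
rather than PostgreSQL's real bufpage.c structures. ToyPage and the
toy_* functions are hypothetical names made up for illustration; the
only point is to show the memmove() of the tuple space and the
compensating adjustment to each lp_off when the special area grows
backwards.

```c
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 256
#define TOY_MAX_ITEMS 8

/* Toy stand-in for a page; NOT PostgreSQL's PageHeaderData. */
typedef struct ToyPage
{
    uint16_t lower;                  /* end of the (notional) line pointer array */
    uint16_t upper;                  /* start of tuple space */
    uint16_t special;                /* start of special area */
    uint16_t nitems;                 /* number of line pointers in use */
    uint16_t lp_off[TOY_MAX_ITEMS];  /* tuple offsets into data[] */
    uint16_t lp_len[TOY_MAX_ITEMS];  /* tuple lengths */
    unsigned char data[TOY_PAGE_SIZE];
} ToyPage;

static void
toy_page_init(ToyPage *p, uint16_t special_size)
{
    memset(p, 0, sizeof(*p));
    p->lower = 0;
    p->special = (uint16_t) (TOY_PAGE_SIZE - special_size);
    p->upper = p->special;
}

/* Tuples are placed just below the previous one; returns item index or -1. */
static int
toy_page_add_tuple(ToyPage *p, const void *tuple, uint16_t len)
{
    if (p->nitems >= TOY_MAX_ITEMS || p->upper - p->lower < len)
        return -1;
    p->upper = (uint16_t) (p->upper - len);
    memcpy(p->data + p->upper, tuple, len);
    p->lp_off[p->nitems] = p->upper;
    p->lp_len[p->nitems] = len;
    return p->nitems++;
}

/*
 * Grow the special area by nbytes at its low end: slide the whole tuple
 * space down, then fix up every line pointer. Returns 0, or -1 if the
 * page lacks the free space.
 */
static int
toy_special_grow(ToyPage *p, uint16_t nbytes)
{
    if (p->upper - p->lower < nbytes)
        return -1;

    /* Regions overlap, so memmove() rather than memcpy(). */
    memmove(p->data + p->upper - nbytes,
            p->data + p->upper,
            (size_t) (p->special - p->upper));

    p->upper = (uint16_t) (p->upper - nbytes);
    p->special = (uint16_t) (p->special - nbytes);

    /* Each lp_off receives the same compensating adjustment. */
    for (int i = 0; i < p->nitems; i++)
        p->lp_off[i] = (uint16_t) (p->lp_off[i] - nbytes);

    /* The new slot is the lowest-addressed part of the special area. */
    memset(p->data + p->special, 0, nbytes);
    return 0;
}
```

The cost is proportional to the size of the tuple space plus the number
of line pointers, which is why reusing an existing slot is so much
cheaper than adding one.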

Why can't it work like that? You don't really need to build the full
set of bufpage.c facilities (though it might not be a bad idea to
fully support these variable-length arrays, which seem like they might
come in handy). That seems perfectly compatible with what Matthias
wants to do, provided we're willing to deem the special area struct
(e.g. BTOpaque) as always coming "first" (which is essentially the
same as his current proposal anyway). You can even do the same thing
yourself for the nonce (use a fixed, known offset), with relatively
modest effort. You'd need to have AM-specific knowledge (it would
stack right on top of Matthias's technique), but that doesn't seem all
that hard. There are plenty of remaining status bits in BTOpaque, and
probably all other index AM special areas.
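
One way to read the fixed-offset idea, as a rough sketch only:
ToyOpaque, TOY_HAS_NONCE, TOY_NONCE_SIZE and the toy_* helpers below
are all hypothetical, and the layout is an assumption rather than
anything taken from the patch under discussion. The AM's struct sits at
the very end of the page, so it can be located from the block size
alone, and a spare status bit says whether a nonce sits directly below
it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BLCKSZ      8192
#define TOY_HAS_NONCE   0x0001  /* hypothetical spare status bit */
#define TOY_NONCE_SIZE  8       /* hypothetical nonce width in bytes */

/* Toy stand-in for an index AM's special area struct (12 bytes). */
typedef struct ToyOpaque
{
    uint32_t prev;
    uint32_t next;
    uint16_t level;
    uint16_t flags;
} ToyOpaque;

/* Allocate a zeroed page; malloc-family memory is suitably aligned. */
static unsigned char *
toy_page_alloc(void)
{
    return calloc(1, TOY_BLCKSZ);
}

/*
 * The struct always comes "first" in the backwards-growing special
 * area, i.e. at the highest offsets, so no pd_special lookup is needed.
 */
static ToyOpaque *
toy_get_opaque(unsigned char *page)
{
    return (ToyOpaque *) (page + TOY_BLCKSZ - sizeof(ToyOpaque));
}

/* The nonce, if present, sits at a fixed offset just below the struct. */
static unsigned char *
toy_get_nonce(unsigned char *page)
{
    if ((toy_get_opaque(page)->flags & TOY_HAS_NONCE) == 0)
        return NULL;
    return page + TOY_BLCKSZ - sizeof(ToyOpaque) - TOY_NONCE_SIZE;
}
```

Both lookups are compile-time constant offsets from the page start; the
only AM-specific knowledge needed is the struct size and which flag bit
to test.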

-- 
Peter Geoghegan


