Thread: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

Andres has suggested that I work on teaching nbtree to accommodate
variable-width, logical table identifiers, such as those required for
indirect indexes, or clustered indexes, where secondary indexes must
use a logical primary key value instead of a heap TID. I'm not
currently committed to working on this as a project, but I definitely
don't want to make it any harder. This has caused me to think about
the problem as it relates to the new on-disk representation for v4
nbtree indexes in Postgres 12. I do have a minor misgiving about one
particular aspect of what I came up with: The precise details of how
we represent heap TID in pivot tuples seem like they might make things
harder than they need to be for a future logical/varwidth table
identifier project. This probably isn't worth doing anything about
now, but it seems worth discussing now, just in case.

The macro BTreeTupleGetHeapTID() can be used to get a pointer to an
ItemPointerData (an ItemPointer) for the heap TID column if any is
available, regardless of whether the tuple is a non-pivot tuple
(points to the heap) or a pivot tuple (belongs in internal/branch
pages, and points to a block in the index, but needs to store heap TID
as well). In the non-pivot case the ItemPointer points to the start of
the tuple (raw IndexTuple field), while in the pivot case it points to
itup + IndexTupleSize() - sizeof(ItemPointerData). This interface
seems like the right thing to me; it's simple, low-context, works just
as well with INCLUDE indexes, and makes it fast to determine if there
are any truncated suffix attributes. However, I don't like the way the
alignment padding works -- there is often "extra" padding *between*
the last untruncated suffix attribute and the heap TID.
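
To spell that out, the v4 logic is roughly the following (a simplified
sketch written as a function -- the real thing is a macro in nbtree.h):

    /*
     * Simplified sketch of BTreeTupleGetHeapTID().  INDEX_ALT_TID_MASK
     * marks pivot tuples, and BT_HEAP_TID_ATTR (a status bit carried in
     * the t_tid offset) marks pivot tuples whose heap TID was not
     * truncated away.
     */
    static inline ItemPointer
    btree_tuple_get_heap_tid(IndexTuple itup)
    {
        if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
            return &itup->t_tid;    /* non-pivot: TID is the first field */

        if (ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_HEAP_TID_ATTR)
            return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
                                  sizeof(ItemPointerData));

        return NULL;                /* pivot tuple, heap TID truncated away */
    }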

It seems like any MAXALIGN() padding should all be at the end -- the
only padding between tuples should be based on the *general*
requirement for the underlying data types, regardless of whether or
not we're dealing with the special heap TID representation in pivot
tuples. We should eliminate what could be viewed as a special case.
This approach is probably going to be easier to generalize later.
There can be a design where the logical/varwidth attribute can be
accessed either by using the usual index_getattr() stuff, or using an
interface like BTreeTupleGetHeapTID() to get to it quickly. We'd have
to store an offset to the final/identifier attribute in the header to
make that work, because we couldn't simply assume a particular width
(like 6 bytes), but that seems straightforward. (I imagine that
there'd be less difference between pivot and non-pivot tuples with
varwidth identifiers than there is currently with heap TID, since we
won't have to worry about pg_upgrade.)
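
To make that concrete, the tuple header might look something like this
(purely hypothetical -- the names and field widths are made up):

    /*
     * Hypothetical header for index tuples with a varwidth table
     * identifier (illustrative only).  An explicit offset lets a
     * BTreeTupleGetHeapTID()-style accessor jump straight to the
     * identifier without walking the earlier attributes.
     */
    typedef struct VarTidIndexTupleHeader
    {
        uint16      t_info;     /* number of attributes + flag bits */
        uint16      t_identoff; /* byte offset of the table identifier */
    } VarTidIndexTupleHeader;

    /* identifier starts at (char *) itup + header->t_identoff */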

nbtinsert.c is very MAXALIGN()-heavy, and currently insists that
index tuples have a MAXALIGN()'d size, but that doesn't seem
necessary or desirable to me. After all, we don't do that within
heapam -- we can just rely on the bufpage.c routines to allocate a
MAXALIGN()'d space for the whole tuple, while still making the lp_len
field in the line pointer use the original size (i.e. size with
un-MAXALIGN()'ed tuple data area). I've found that it's quite possible
to get the nbtree code to store the tuple size (lp_len and redundant
IndexTupleSize() representation) this way, just like heapam always
has. This has some useful consequences: BTreeTupleGetHeapTID()
continues to work with the special pivot tuple representation, while
_bt_truncate() never "puts padding in the wrong place" when it must
add a heap TID because there are many duplicates, and no split point
can be found that avoids that (one that "truncates the heap TID
attribute" away). I
could make this work without breaking the regression tests in about 10
minutes, which is at least encouraging (it was a bit tricky, though).
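
The bufpage.c behavior being relied on looks roughly like this
(condensed from PageAddItemExtended(), not a verbatim copy):

    /*
     * The space allocated for the tuple is MAXALIGN()'d, but lp_len
     * keeps the caller's original, un-MAXALIGN()'d size.
     */
    alignedSize = MAXALIGN(size);
    upper = (int) phdr->pd_upper - alignedSize;
    ItemIdSetNormal(itemId, upper, size);       /* lp_len = original size */
    memcpy((char *) page + upper, item, size);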

This also results in an immediate though small benefit for v4 nbtree
indexes: _bt_truncate() produces smaller pivot tuples in a few cases.
For example, indexes with one or two boolean fields will have pivot
tuples that are 15 bytes and 16 bytes in length respectively,
occupying 16 bytes of tuple space on internal pages. The saving comes
because we can reuse the alignment padding hole that was left empty in
the original non-pivot index tuple that the new pivot tuple is formed
from. Currently, the size of these pivot tuples would be 24
bytes, so we're occasionally saving a MAXALIGN() quantum in space this
way. It is unlikely that anyone would actually care very much about
these kinds of space savings, but at the same time it feels more
elegant to me. The heap TID may not have a pg_attribute entry, but
ISTM that the on-disk representation should not have padding "in the
wrong place", on general principle.

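Spelling out the single-boolean case (offsets per the numbers above; a
sketch, not actual pageinspect output):

    /*
     * Pivot tuple layout, index with a single boolean key column:
     *
     * Before: [8B header][1B bool][7B pad][6B heap TID][2B pad] = 24 bytes
     * After:  [8B header][1B bool][6B heap TID]                 = 15 bytes,
     *         occupying 16 bytes once bufpage.c MAXALIGN()s the space.
     */
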
Thoughts?
--
Peter Geoghegan



Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> Andres has suggested that I work on teaching nbtree to accommodate
> variable-width, logical table identifiers, such as those required for
> indirect indexes, or clustered indexes, where secondary indexes must
> use a logical primary key value instead of a heap TID. I'm not
> currently committed to working on this as a project, but I definitely
> don't want to make it any harder. This has caused me to think about
> the problem as it relates to the new on-disk representation for v4
> nbtree indexes in Postgres 12. I do have a minor misgiving about one
> particular aspect of what I came up with: The precise details of how
> we represent heap TID in pivot tuples seem like they might make things
> harder than they need to be for a future logical/varwidth table
> identifier project. This probably isn't worth doing anything about
> now, but it seems worth discussing now, just in case.

This seems like it would be helpful for global indexes as well, wouldn't
it?

> This also results in an immediate though small benefit for v4 nbtree
> indexes: _bt_truncate() produces smaller pivot tuples in a few cases.
> For example, indexes with one or two boolean fields will have pivot
> tuples that are 15 bytes and 16 bytes in length respectively,
> occupying 16 bytes of tuple space on internal pages. The saving comes
> because we can reuse the alignment padding hole that was left empty in
> the original non-pivot index tuple that the new pivot tuple is formed
> from. Currently, the size of these pivot tuples would be 24
> bytes, so we're occasionally saving a MAXALIGN() quantum in space this
> way. It is unlikely that anyone would actually care very much about
> these kinds of space savings, but at the same time it feels more
> elegant to me. The heap TID may not have a pg_attribute entry, but
> ISTM that the on-disk representation should not have padding "in the
> wrong place", on general principle.
>
> Thoughts?

I agree with trying to avoid having padding 'in the wrong place' and if
it makes some indexes smaller, great, even if they're unlikely to be
interesting in the vast majority of cases, they may still exist out
there.  Of course, this is provided that it doesn't overly complicate
the code, but it sounds like it wouldn't be too bad in this case.

Thanks!

Stephen

Hi,

On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> Andres has suggested that I work on teaching nbtree to accommodate
> variable-width, logical table identifiers, such as those required for
> indirect indexes, or clustered indexes, where secondary indexes must
> use a logical primary key value instead of a heap TID.

I think it's two more cases:

- table AMs that want to support tables that are bigger than 32TB. That
  used to be unrealistic, but it's not anymore. Especially when the need
  to VACUUM etc is largely removed / reduced.
- global indexes (for cross-partition unique constraints and such),
  which need a partition identifier as part of the tid (or as part of
  the index key, but I think that actually makes interaction with
  indexam from other layers more complicated - the inside of the index
  may want to represent it as a column, but to the outside that
  ought not to be visible)



> Thoughts?

Seems reasonable to me.

I, more generally, wonder if there's not a case to squeeze out more
padding than "just" what you describe (since we IIRC don't frequently
keep pointers into such tuples anyway, and definitely don't for byval
attrs). But that's very likely better done separately.

Greetings,

Andres Freund



On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost <sfrost@snowman.net> wrote:
> This seems like it would be helpful for global indexes as well, wouldn't
> it?

Yes, though that should probably work by reusing what we already do
with heap TID (use standard IndexTuple fields on the leaf level for
heap TID), plus an additional identifier for the partition number that
is located at the physical end of the tuple. IOW, I think that this
might benefit from a design that is half way between what we already
do with heap TIDs and what we would be required to do to make varwidth
logical row identifiers in tables work -- the partition number is
varwidth, though often only a single byte.

> I agree with trying to avoid having padding 'in the wrong place' and if
> it makes some indexes smaller, great, even if they're unlikely to be
> interesting in the vast majority of cases, they may still exist out
> there.  Of course, this is provided that it doesn't overly complicate
> the code, but it sounds like it wouldn't be too bad in this case.

Here is what it took:

* Removed the "conservative" MAXALIGN() within index_form_tuple(),
bringing it in line with heap_form_tuple(), which only MAXALIGN()s so
that the first attribute in the tuple's data area can safely be accessed
on alignment-picky platforms, but doesn't do the same with data_len.

* Removed most of the MAXALIGN()s from nbtinsert.c, except one that
considers if a page split is required.

* Didn't change the nbtsplitloc.c code, because we need to assume
MAXALIGN()'d space quantities there. We continue to not trust the
reported tuple length to be MAXALIGN()'d, which is now essential
rather than just defensive.

* Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the
whole tuple size in the case where the new pivot tuple requires a heap
TID representation. We access TIDs as three 2-byte integers, so this
is necessary for alignment-picky platforms.
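
In _bt_truncate() terms, the sizing for a new pivot that needs a heap
TID ends up looking something like this (a sketch of the approach, not
the actual patch -- status-bit bookkeeping omitted):

    IndexTuple  tidpivot;
    Size        newsize;

    /* SHORTALIGN() the truncated keys, then append the heap TID */
    newsize = SHORTALIGN(IndexTupleSize(pivot));
    tidpivot = palloc0(newsize + sizeof(ItemPointerData));
    memcpy(tidpivot, pivot, IndexTupleSize(pivot));
    tidpivot->t_info &= ~INDEX_SIZE_MASK;
    tidpivot->t_info |= newsize + sizeof(ItemPointerData);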

I will pursue this as a project for PostgreSQL 13. It doesn't affect
on-disk compatibility, because BTreeTupleGetHeapTID() works just as
well with either the existing scheme, or this new one. Having the
"real" tuple length available will make it easier to implement "true"
suffix truncation, where we truncate *within* a text attribute (i.e.
generate a new, shorter value using new opclass infrastructure).
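
(To illustrate "true" suffix truncation: a split point that falls
between "Johnson" and "Jones" could get by with the pivot "Jon",
rather than storing all of "Jones".)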

-- 
Peter Geoghegan



On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote:
> I, more generally, wonder if there's not a case to squeeze out more
> padding than "just" what you describe (since we IIRC don't frequently
> keep pointers into such tuples anyway, and definitely don't for byval
> attrs). But that's very likely better done separately.

There is one way that that is kind of relevant here. The current
requirement seems to be that *any* sort of tuple header be
MAXALIGN()'d, because in the worst case the first attribute needs to
be accessed at a MAXALIGN()'d boundary on an alignment-picky platform.
That isn't so bad today, since we usually find a reasonably good way
to put those 8 bytes (or 23/24 bytes in the case of heap tuples) to
use. However, with varwidth table identifiers, the only use of those 8
bytes that I can think of is the offset to the identifier (or perhaps
its length), plus the usual t_info stuff. We'd almost invariably waste
4 or 5 bytes, which seems like a problem to me.

-- 
Peter Geoghegan



Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost <sfrost@snowman.net> wrote:
> > This seems like it would be helpful for global indexes as well, wouldn't
> > it?
>
> Yes, though that should probably work by reusing what we already do
> with heap TID (use standard IndexTuple fields on the leaf level for
> heap TID), plus an additional identifier for the partition number that
> is located at the physical end of the tuple. IOW, I think that this
> might benefit from a design that is half way between what we already
> do with heap TIDs and what we would be required to do to make varwidth
> logical row identifiers in tables work -- the partition number is
> varwidth, though often only a single byte.

Yes, global indexes for partitioned tables could potentially be simpler
than the logical row identifiers, but maybe it'd be useful to just have
one implementation based around logical row identifiers which ends up
working for global indexes as well as the other types of indexes and
table access methods.

If we thought that the 'single-byte' partition number covered enough
use-cases then we could possibly consider supporting them for partitions
by just 'stealing' a byte from BlockIdData and having the per-partition
size be limited to 4TB when a global index exists on the partitioned
table.  That's certainly not an ideal limitation but it might be
appealing to some users who really would like global indexes and could
possibly require less to implement, though there's a lot of other things
that would have to be done to have global indexes.  Anyhow, just some
random thoughts that I figured I'd share in case there might be
something there worth thinking about.

> > I agree with trying to avoid having padding 'in the wrong place' and if
> > it makes some indexes smaller, great, even if they're unlikely to be
> > interesting in the vast majority of cases, they may still exist out
> > there.  Of course, this is provided that it doesn't overly complicate
> > the code, but it sounds like it wouldn't be too bad in this case.
>
> Here is what it took:
>
> * Removed the "conservative" MAXALIGN() within index_form_tuple(),
> bringing it in line with heap_form_tuple(), which only MAXALIGN()s so
> that the first attribute in the tuple's data area can safely be accessed
> on alignment-picky platforms, but doesn't do the same with data_len.
>
> * Removed most of the MAXALIGN()s from nbtinsert.c, except one that
> considers if a page split is required.
>
> * Didn't change the nbtsplitloc.c code, because we need to assume
> MAXALIGN()'d space quantities there. We continue to not trust the
> reported tuple length to be MAXALIGN()'d, which is now essential
> rather than just defensive.
>
> * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the
> whole tuple size in the case where the new pivot tuple requires a heap
> TID representation. We access TIDs as three 2-byte integers, so this
> is necessary for alignment-picky platforms.
>
> I will pursue this as a project for PostgreSQL 13. It doesn't affect
> on-disk compatibility, because BTreeTupleGetHeapTID() works just as
> well with either the existing scheme, or this new one. Having the
> "real" tuple length available will make it easier to implement "true"
> suffix truncation, where we truncate *within* a text attribute (i.e.
> generate a new, shorter value using new opclass infrastructure).

This sounds pretty good to me, though I'm not nearly as close to the
code there as you are.

Thanks!

Stephen

On Mon, Apr 22, 2019 at 10:32 AM Stephen Frost <sfrost@snowman.net> wrote:
> Yes, global indexes for partitioned tables could potentially be simpler
> than the logical row identifiers, but maybe it'd be useful to just have
> one implementation based around logical row identifiers which ends up
> working for global indexes as well as the other types of indexes and
> table access methods.

Maybe so. I think that we'd have to actually try it out to know for sure.

> If we thought that the 'single-byte' partition number covered enough
> use-cases then we could possibly consider supporting them for partitions
> by just 'stealing' a byte from BlockIdData and having the per-partition
> size be limited to 4TB when a global index exists on the partitioned
> table.

I don't think that that would make it any easier to implement.

> This sounds pretty good to me, though I'm not nearly as close to the
> code there as you are.

I'm slightly concerned that I may have broken an index_form_tuple()
caller from some other access method, but they all seem to not trust
index_form_tuple() to have MAXALIGN()'d on their behalf, just like
nbtree (nbtree won't when I'm done, though, because it will actively
try to preserve the "real" tuple size). It's convenient to me that no
caller seems to rely on the index_form_tuple() MAXALIGN() that I want
to remove.

-- 
Peter Geoghegan



On Mon, Apr 22, 2019 at 1:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Yes, though that should probably work by reusing what we already do
> with heap TID (use standard IndexTuple fields on the leaf level for
> heap TID), plus an additional identifier for the partition number that
> is located at the physical end of the tuple. IOW, I think that this
> might benefit from a design that is half way between what we already
> do with heap TIDs and what we would be required to do to make varwidth
> logical row identifiers in tables work -- the partition number is
> varwidth, though often only a single byte.

I think we're likely to have a problem with global indexes + DETACH
PARTITION that is similar to the problem we now have with DROP COLUMN.

If you drop or detach a partition, you can either (a) perform, as part
of that operation, a scan of every global index to remove all
references to the former partition, or (b) tell each global index
that all references to that partition number ought to be regarded as
dead index tuples.  (b) makes detaching partitions faster and (a)
seems hard to make rollback-safe, so I'm guessing we'll end up with
(b).

But that means that if someone repeatedly attaches and detaches
partitions, the partition numbers could get quite big.  And even
without that somebody could have a lot of partitions.  So while I do
not disagree that the partition number could be variable-width and
sometimes only 1 payload byte, I think we had better make sure to
design the system in such a way that it scales to at least 4 payload
bytes, because I have no faith that anything less will be sufficient
for our demanding user base.

We don't want people to be able to exhaust the supply of partition
numbers the way they can exhaust the supply of attribute numbers by
adding and dropping columns repeatedly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Apr 24, 2019 at 5:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> If you drop or detach a partition, you can either (a) perform, as part
> of that operation, a scan of every global index to remove all
> references to the former partition, or (b) tell each global index
> that all references to that partition number ought to be regarded as
> dead index tuples.  (b) makes detaching partitions faster and (a)
> seems hard to make rollback-safe, so I'm guessing we'll end up with
> (b).

I agree that (b) is the way to go.

> We don't want people to be able to exhaust the supply of partition
> numbers the way they can exhaust the supply of attribute numbers by
> adding and dropping columns repeatedly.

I agree that a partition numbering system needs to be able to
accommodate arbitrarily-many partitions over time. It wouldn't have
occurred to me to do it any other way. It is far far easier to make
this work than it would be to retrofit varwidth attribute numbers. We
won't have to worry about the HeapTupleHeaderGetNatts()
representation. At the same time, nothing stops us from representing
partition numbers in a simpler though less space efficient way in
system catalogs.

The main point of having global indexes is to be able to push down the
partition number and use it during index scans. We can store the
partition number at the end of the tuple on leaf pages, so that it's
easily accessible (important for VACUUM), while continuing to use the
IndexTuple fields for heap TID. On internal pages, the IndexTuple
fields must be used for the downlink (block number of child), so both
partition number and heap TID have to go towards the end of the tuples
(this happens just with heap TID on Postgres 12). Of course, suffix
truncation will manage to consistently get rid of both in most cases,
especially when the global index is a unique index.

The hard part is how to do varwidth encoding for space-efficient
partition numbers while continuing to use IndexTuple fields for heap
TID on the leaf level, *and* also having a
BTreeTupleGetHeapTID()-style macro to get partition number without
walking the entire index tuple. I suppose you could make the byte at
the end of the tuple indicate that there are in fact 31 bits total
when its high bit is set -- otherwise it's a 7 bit integer. Something
like that may be the way to go. The alignment rules seem to make it
worthwhile to keep the heap TID in the tuple header; it seems
inherently necessary to have a MAXALIGN()'d tuple header, so finding a
way to consistently put the first MAXALIGN() quantum to good use seems
wise.
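
A sketch of that encoding (illustrative only -- the byte order and the
helper are made up):

    /*
     * Decode a varwidth partition number stored at the physical end of
     * an index tuple.  The last byte is either a 7 bit partition number
     * (high bit clear), or the low 7 bits of a 31 bit partition number
     * whose remaining bits live in the preceding 3 bytes (high bit set).
     */
    static inline uint32
    decode_partition_number(unsigned char *tupend)
    {
        unsigned char last = tupend[-1];

        if ((last & 0x80) == 0)
            return last;            /* common case: one payload byte */

        return ((uint32) (last & 0x7F)) |
               ((uint32) tupend[-2] << 7) |
               ((uint32) tupend[-3] << 15) |
               ((uint32) tupend[-4] << 23);
    }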

-- 
Peter Geoghegan



On Wed, Apr 24, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> The hard part is how to do varwidth encoding for space-efficient
> partition numbers while continuing to use IndexTuple fields for heap
> TID on the leaf level, *and* also having a
> BTreeTupleGetHeapTID()-style macro to get partition number without
> walking the entire index tuple. I suppose you could make the byte at
> the end of the tuple indicate that there are in fact 31 bits total
> when its high bit is set -- otherwise it's a 7 bit integer. Something
> like that may be the way to go. The alignment rules seem to make it
> worthwhile to keep the heap TID in the tuple header; it seems
> inherently necessary to have a MAXALIGN()'d tuple header, so finding a
> way to consistently put the first MAXALIGN() quantum to good use seems
> wise.

It's even harder than that if you want to make it possible to walk the
tuple from either direction, which also seems useful. You want to be
able to jump straight to the end of the tuple to get the partition
number, while at the same time being able to access it in the usual
way, as if it was just another attribute.

Ugh, this is a mess. It would be so much easier if we had a tuple
representation that stored attribute offsets right in the header.

--
Peter Geoghegan



On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> > Andres has suggested that I work on teaching nbtree to accommodate
> > variable-width, logical table identifiers, such as those required for
> > indirect indexes, or clustered indexes, where secondary indexes must
> > use a logical primary key value instead of a heap TID.

I'm revisiting this thread now because it may have relevance to the
nbtree deduplication patch. If nothing else, the patch further commits
us to the current heap TID format by making assumptions about the
width of posting lists with 6 byte TIDs. Though I suppose a posting
list almost has to have fixed width TIDs to perform acceptably.
Database systems with a varwidth TID format probably don't support
anything like posting lists.

> I think it's two more cases:
>
> - table AMs that want to support tables that are bigger than 32TB. That
>   used to be unrealistic, but it's not anymore. Especially when the need
>   to VACUUM etc is largely removed / reduced.

Can we steal some bits that are currently used for offset number
instead? 16 bits is far more than we ever need to use for heap offset
numbers in practice. (I wonder if this would also have benefits for
the representation of in-memory bitmaps?)
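
(A sketch of the arithmetic, assuming standard 8kB heap pages:
MaxHeapTuplesPerPage is 291, so 9 of the 16 offset bits would suffice,
and the 7 stolen bits would multiply the current 32TB table size limit
by 128.)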

> - global indexes (for cross-partition unique constraints and such),
>   which need a partition identifier as part of the tid (or as part of
>   the index key, but I think that actually makes interaction with
>   indexam from other layers more complicated - the inside of the index
>   may want to represent it as a column, but to the outside that
>   ought not to be visible)

Can we just use an implementation level attribute for this? Would it
be so bad if we weren't able to jump straight to the partition number
without walking through the tuple when the tuple has varwidth
attributes? (If that isn't acceptable, then we can probably make it
work for global indexes without having to generalize everything.)

In general, Postgres heap TIDs are not stable identifiers of rows
(they only stably identify HOT chains). This is not the case in other
database systems, which may go to great trouble to make it possible to
assume that TIDs are stable over time (Andy Pavlo says that our TIDs
are physical while Oracle's are logical -- I don't like that
terminology, but know what he means). Generalizing the nbtree AM to be
able to work with an arbitrary type of table row identifier that isn't
at all like a TID raises tricky definitional questions. It would have
to work in a way that made the new variety of table row identifier
stable, which is a significant new requirement (and one that zheap is
clearly not interested in).

If we're willing to support something like a clustered index in
nbtree, I wonder what we need to do to make things like numeric
display scale still work. For bonus points, describe how unique
checking works with a secondary unique index.

I am not suggesting that these issues are totally insurmountable. What
I am saying is this: If we already had "stable logical" TIDs instead
of "mostly physical TIDs", then generalizing nbtree index tuples to
store arbitrary table row identifiers would more or less be all about
the data structure managed by nbtree. But that isn't the case, and
that strongly discourages me from working on this -- we shouldn't talk
about the problem as if it is mostly just a matter of settling on the
best index tuple format. Frankly I am not very enthusiastic about
working on a project that has unclear scope and unclear benefits for
users.

-- 
Peter Geoghegan



Hi,

On 2019-10-30 11:33:21 -0700, Peter Geoghegan wrote:
> On Mon, Apr 22, 2019 at 9:35 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> > > Andres has suggested that I work on teaching nbtree to accommodate
> > > variable-width, logical table identifiers, such as those required for
> > > indirect indexes, or clustered indexes, where secondary indexes must
> > > use a logical primary key value instead of a heap TID.
> 
> I'm revisiting this thread now because it may have relevance to the
> nbtree deduplication patch. If nothing else, the patch further commits
> us to the current heap TID format by making assumptions about the
> width of posting lists with 6 byte TIDs.

I'd much rather not entrench this further, even leaving global indexes
aside. The 4 byte block number is a significant limitation for heap
tables too, and we should lift that at some point not too far away.
Then there's also other AMs that could really use a wider tid space.


> Though I suppose a posting list almost has to have fixed width TIDs to
> perform acceptably.

Hm. It's not clear to me why that is?


> > I think it's two more cases:
> >
> > - table AMs that want to support tables that are bigger than 32TB. That
> >   used to be unrealistic, but it's not anymore. Especially when the need
> >   to VACUUM etc is largely removed / reduced.
> 
> Can we steal some bits that are currently used for offset number
> instead? 16 bits is far more than we ever need to use for heap offset
> numbers in practice.

I think that's a terrible idea. For one, some AMs will have significantly
higher limits, especially taking compression and larger block sizes into
account. Also not all AMs need identifiers tied so closely to a disk
position, e.g. zedstore does not.  We shouldn't hack ever more
information into the offset, given that background.


> (I wonder if this would also have benefits for the representation of
> in-memory bitmaps?)

Hm. Not sure how?


> > - global indexes (for cross-partition unique constraints and such),
> >   which need a partition identifier as part of the tid (or as part of
> >   the index key, but I think that actually makes interaction with
> >   indexam from other layers more complicated - the inside of the index
> >   may want to represent it as a column, but to the outside that
> >   ought not to be visible)
> 
> Can we just use an implementation level attribute for this? Would it
> be so bad if we weren't able to jump straight to the partition number
> without walking through the tuple when the tuple has varwidth
> attributes? (If that isn't acceptable, then we can probably make it
> work for global indexes without having to generalize everything.)

Having to walk through the index tuple might be acceptable - in all
likelihood we'll have to do so anyway.  It does however not *really*
resolve the issue that we still need to pass something tid-like back from the
indexam, so we can fetch the associated tuple from the heap, or add the
tid to a bitmap. But that could be done separately from the index
internal data structures.


> Generalizing the nbtree AM to be able to work with an arbitrary type
> of table row identifier that isn't at all like a TID raises tricky
> definitional questions. It would have to work in a way that made the
> new variety of table row identifier stable, which is a significant new
> requirement (and one that zheap is clearly not interested in).

Hm. I don't see why different types of TID would imply them being
stable?


> I am not suggesting that these issues are totally insurmountable. What
> I am saying is this: If we already had "stable logical" TIDs instead
> of "mostly physical TIDs", then generalizing nbtree index tuples to
> store arbitrary table row identifiers would more or less be all about
> the data structure managed by nbtree. But that isn't the case, and
> that strongly discourages me from working on this -- we shouldn't talk
> about the problem as if it is mostly just a matter of settling on the
> best index tuple format.



> Frankly I am not very enthusiastic about working on a project that has
> unclear scope and unclear benefits for users.

Why would properly supporting AMs like zedstore, global indexes,
"indirect" indexes etc not benefit users?

Greetings,

Andres Freund



On Wed, Oct 30, 2019 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
> I'd much rather not entrench this further, even leaving global indexes
> aside. The 4 byte block number is a significant limitation for heap
> tables too, and we should lift that at some point not too far away.
> Then there's also other AMs that could really use a wider tid space.

I agree that that limitation is a problem that should be fixed before
too long. But the solution probably shouldn't be a radical departure
from what we have today. The vast majority of tables are not affected
by the TID space limitation. Those tables that are can tolerate
supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
used for the higher portion of the heap TID space alone.

The idea here is that TID is varwidth, but actually uses the existing
heap TID format most of the time. For larger tables it uses a wider
fixed width struct that largely works the same as the old 6 byte
struct.

> > Though I suppose a posting list almost has to have fixed width TIDs to
> > perform acceptably.
>
> Hm. It's not clear to me why that is?

Random access matters for things like determining the correct offset
to split a posting list at. This is needed in the event of an
overlapping insertion of a new duplicate tuple whose heap TID falls
within the range of the posting list. Also, I want to be able to scan
posting lists backwards for backwards scans. In general, fixed-width
TIDs make the page space accounting fairly simple, which matters a lot
in nbtree.

I can support varwidth TIDs in the future pretty well if the TID
doesn't have to be *arbitrarily* wide. Individual posting lists can
themselves either use 6 byte or 8 byte TIDs, preserving the ability to
access a posting list entry at random using simple pointer arithmetic.
This makes converting over index AMs a lot less painful -- it'll be
pretty easy to avoid mixing together the 6 byte and 8 byte structs.
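
Something like this (untested sketch -- the per-posting-list TID width
would come from status bits that the index AM maintains itself):

    /*
     * Illustrative only: every entry in a given posting list has the
     * same width, so random access is simple pointer arithmetic.
     * "tidwidth" is 6 or 8, and is never mixed within one posting list.
     */
    static inline char *
    posting_list_entry(char *plist, int tidwidth, int n)
    {
        return plist + (Size) n * tidwidth;
    }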

> > Can we steal some bits that are currently used for offset number
> > instead? 16 bits is far more than we ever need to use for heap offset
> > numbers in practice.
>
> I think that's a terrible idea. For one, some AMs will have significantly
> higher limits, especially taking compression and larger block sizes into
> account. Also not all AMs need identifiers tied so closely to a disk
> position, e.g. zedstore does not.  We shouldn't hack ever more
> information into the offset, given that background.

Fair enough, but somebody needs to cut some scope here.

> Having to walk through the index tuple might be acceptable - in all
> likelihood we'll have to do so anyway.  It does however not *really*
> resolve the issue that we still need to pass something tid-like back from the
> indexam, so we can fetch the associated tuple from the heap, or add the
> tid to a bitmap. But that could be done separately from the index
> internal data structures.

I agree.

> > Generalizing the nbtree AM to be able to work with an arbitrary type
> > of table row identifier that isn't at all like a TID raises tricky
> > definitional questions.

> Hm. I don't see why different types of TID would imply them being
> stable?

It is unclear what it means. I would like to see a sketch of a design
for varwidth TIDs that balances everybody's concerns. I don't think
"indirect" indexes are a realistic goal for Postgres. VACUUM is just
too messy there (as is any other garbage collection mechanism).
Zedstore and Zheap don't change this.

> > Frankly I am not very enthusiastic about working on a project that has
> > unclear scope and unclear benefits for users.
>
> Why would properly supporting AMs like zedstore, global indexes,
> "indirect" indexes etc not benefit users?

Global indexes seem doable.

I don't see how "indirect" indexes can ever work in Postgres. I don't
know exactly what zedstore needs here, but maybe it can work well with
a less ambitious design for varwidth TIDs along the lines I've
sketched.

-- 
Peter Geoghegan



Hi,

On 2019-10-30 12:37:50 -0700, Peter Geoghegan wrote:
> On Wed, Oct 30, 2019 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd much rather not entrench this further, even leaving global indexes
> > aside. The 4 byte block number is a significant limitation for heap
> > tables too, and we should lift that at some point not too far away.
> > Then there's also other AMs that could really use a wider tid space.
> 
> I agree that that limitation is a problem that should be fixed before
> too long. But the solution probably shouldn't be a radical departure
> from what we have today. The vast majority of tables are not affected
> by the TID space limitation. Those tables that are can tolerate
> supporting fixed width "long" TIDs (perhaps 8 bytes long?) that are
> used for the higher portion of the heap TID space alone.
> 
> The idea here is that TID is varwidth, but actually uses the existing
> heap TID format most of the time. For larger tables it uses a wider
> fixed width struct that largely works the same as the old 6 byte
> struct.

I assume you mean that the index would dynamically recognize when it
needs the wider tids ("for the higher portion")? If so, yea, that makes
sense to me. Would that need to encode the 6/8byteness of a tid on a
per-element basis? Or are you thinking of being able to upconvert in a
general way?


> > > Though I suppose a posting list almost has to have fixed width TIDs to
> > > perform acceptably.
> >
> > Hm. It's not clear to me why that is?
> 
> Random access matters for things like determining the correct offset
> to split a posting list at. This is needed in the event of an
> overlapping insertion of a new duplicate tuple whose heap TID falls
> within the range of the posting list. Also, I want to be able to scan
> posting lists backwards for backwards scans. In general, fixed-width
> TIDs make the page space accounting fairly simple, which matters a lot
> in nbtree.

If we had variable width tids varying by more than 2 bytes, it might be
reasonable for cases like this to store all tids padded to the width of
the widest tid. I think that'd still be pretty OK, because you'd only
pay the price if you actually have long tids, and only on pages where
such tids are referenced. Obviously that means that such a posting list
could grow more than by the size of the inserted tid upon insertion, but
that might still be OK?  That'd require storing the width of the posting
list elements somewhere, unfortunately - not sure if that's a problem?

ISTM if we had varint style tids, we'd probably still save space on
average for today's heap that way. How important is it for you to be
able to split out the "block offset" and "page offset" bits?

I'm somewhat inclined to think that tids should just be a varint (or
maybe two, if we want to make it slightly simpler to keep compat to how
they look today), and that the AM internally makes sense of that.
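
For illustration, a varint encoder along those lines could look like
this (LEB128-style sketch, not a concrete format proposal):

    /*
     * 7 payload bits per byte, high bit set on every byte except the
     * last.  Today's 48 bit tids would take between 1 and 7 bytes.
     */
    static int
    encode_varint_tid(uint64 val, unsigned char *buf)
    {
        int     len = 0;

        do
        {
            unsigned char b = val & 0x7F;

            val >>= 7;
            if (val != 0)
                b |= 0x80;      /* continuation bit */
            buf[len++] = b;
        } while (val != 0);

        return len;
    }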


> I can support varwidth TIDs in the future pretty well if the TID
> doesn't have to be *arbitrarily* wide.

I think it'd be perfectly reasonable to have a relatively low upper
limit for tid width. Not 8 bytes, but also not 200 bytes.


> Individual posting lists can themselves either use 6 byte or 8 byte
> TIDs, preserving the ability to access a posting list entry at random
> using simple pointer arithmetic.  This makes converting over index AMs
> a lot less painful -- it'll be pretty easy to avoid mixing together
> the 6 byte and 8 byte structs.

With varint style tids as I suggested, that ought to be fairly simple?


> I don't think "indirect" indexes are a realistic goal for
> Postgres. VACUUM is just too messy there (as is any other garbage
> collection mechanism).  Zedstore and Zheap don't change this.

Hm, I don't think there's a fundamental problem, but let's leave that
aside, there's enough other reasons to improve this.

Greetings,

Andres Freund



On Wed, Oct 30, 2019 at 1:25 PM Andres Freund <andres@anarazel.de> wrote:
> I assume you mean that the index would dynamically recognize when it
> needs the wider tids ("for the higher portion")? If so, yea, that makes
> sense to me. Would that need to encode the 6/8byteness of a tid on a
> per-element basis? Or are you thinking of being able to upconvert in a
> general way?

I think that index access methods should continue to control the
layout of TIDs/item pointers, while mostly just treating them as
fixed-width integers. Maybe you just have 6 and 8 byte structs, or
maybe you also have a 16 byte struct, but it isn't really variable
length to the index AM (it's more like a union). It becomes the index
AM's responsibility to remember which TID width applies at a per tuple
level (or per posting list level, etc). In general, index AMs have to
"supply their own status bits", which should be fine (the alternative
is that they refuse to support anything other than the traditional 6
byte TID format).

Table AMs don't get to supply their own operator class for sorting
these integers -- they had better be happy with the TID sort order
that is baked in (ascending int order) in the context of index scans
that return duplicates, and things like that. There is probably
generic infrastructure for up-converting, too, but the index AM is
fundamentally in the driving seat with this design.
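
As a sketch (the names and the 8 byte layout are made up):

    /*
     * Illustrative only: fixed-width TID formats viewed as a union.
     * The index AM treats whichever member applies as a plain array
     * element; nothing is variable length from its point of view.
     */
    typedef struct Tid8
    {
        uint16          bi_hi_ext;  /* extra high block number bits */
        ItemPointerData base;       /* classic 6 byte TID */
    } Tid8;

    typedef union GenericTid
    {
        ItemPointerData tid6;       /* today's 6 byte format */
        Tid8            tid8;       /* hypothetical 8 byte format */
    } GenericTid;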

> If we had variable width tids varying by more than 2 bytes, it might be
> reasonable for cases like this to store all tids padded to the width of
> the widest tid. I think that'd still be pretty OK, because you'd only
> pay the price if you actually have long tids, and only on pages where
> such tids are referenced. Obviously that means that such a posting list
> could grow more than by the size of the inserted tid upon insertion, but
> that might still be OK?  That'd require storing the width of the posting
> list elements somewhere, unfortunately - not sure if that's a problem?

My solution is to require the index AM to look after itself. The index
AM is responsible for not mixing them up. For nbtree with
deduplication, this means that having different width TIDs makes it
impossible to merge together posting lists/tuples. For GIN, this means
that we can expand the width of the TIDs in the posting list to match
the widest TID. We can then make it into a posting tree if necessary
-- GIN has the benefit of always being able to fall back on the option
of making a new posting tree (unlike nbtree with deduplication). GIN's
B-Tree is very primitive in some ways (no deletion of items in the
entry tree, no deletion of pages in the entry tree), which gives it
the ability to safely fall back on creating a new posting tree when it
runs out of space.

> ISTM if we had varint style tids, we'd probably still save space on
> average for today's heap that way. How important is it for you to be
> able to split out the "block offset" and "page offset" bits?

Pretty important. The nbtree deduplication patch is very compelling
because it almost offers the best of both worlds -- the concurrency
characteristics of today's nbtree, combined with very much improved
space efficiency. Keeping the space accounting as simple as possible
seems like a big part of why this is possible at all. There is only
one new type of WAL record required for deduplication, and they're
pretty small. (Existing WAL records are changed to support posting
list splits, but these are small, low-overhead changes.)

> I'm somewhat inclined to think that tids should just be a varint (or
> maybe two, if we want to make it slightly simpler to keep compat to how
> they look today), and that the AM internally makes sense of that.

I am opposed to adding anything that is truly varint. The index access
method ought to be able to have fine control over the layout, without
being burdened by an overly complicated TID representation.

> > I can support varwidth TIDs in the future pretty well if the TID
> > doesn't have to be *arbitrarily* wide.
>
> I think it'd be perfectly reasonable to have a relatively low upper
> limit for tid width. Not 8 bytes, but also not 200 bytes.

My point is that we should find a way to make TIDs continue to be an
array of fixed width integers in any given context. Lots of index
access method code can be ported in a relatively straightforward
fashion this way. This has some downsides, but I believe that they're
worth it.

> > Individual posting lists can themselves either use 6 byte or 8 byte
> > TIDs, preserving the ability to access a posting list entry at random
> > using simple pointer arithmetic.  This makes converting over index AMs
> > a lot less painful -- it'll be pretty easy to avoid mixing together
> > the 6 byte and 8 byte structs.
>
> With varint style tids as I suggested, that ought to be fairly simple?

nbtree probably won't be able to tolerate having to widen every TID in
a posting list all at once when a new tuple is inserted whose TID is
one byte wider and belongs in that same posting list (as I said,
keeping the space accounting simple is particularly important for
nbtree). This even seems hard for GIN, which thinks of TIDs as an
array of fixed width ints in many contexts. BRIN revmap pages are also
mostly just arrays of 6 byte item pointers that rely on simple pointer
arithmetic to do random access.

-- 
Peter Geoghegan