Thread: MaxOffsetNumber for Table AMs

MaxOffsetNumber for Table AMs

From: Jeff Davis
The notion of TID is based on pages and line pointers, which makes
sense for heapam, but that's not likely to make sense for a custom
table AM.

The obvious answer is to make a simple mapping between a TID and
whatever makes sense to the AM (for the sake of discussion, let's say a
plain row number).

The most natural thing would be to say that we have 48 bits, so it can
just be a 48-bit number. Of course, there are some restrictions on
valid values that complicate this:

  * InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
  * InvalidOffsetNumber of 0. Not a problem.
  * MaxOffsetNumber of 2048. Does this limit really apply to table AMs?
It just seems like it's used when scanning heap or index pages for
stack-allocated arrays. For a table AM it appears to waste 5 bits.
  * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
offset number. Is this a bug?
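
To make that concrete, the mapping I have in mind looks something like
this (a sketch only -- the 2047 usable offsets per block, and the
names, are placeholders for whatever the real limits turn out to be):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Sketch: map a plain row number to a TID and back. Offsets are
 * 1-based to avoid InvalidOffsetNumber (0); the caller must keep
 * rownum small enough that the block number stays below
 * InvalidBlockNumber (0xFFFFFFFF).
 */
#define USABLE_OFFSETS 2047     /* placeholder; the point of this mail */

static inline void
rownum_to_tid(uint64 rownum, ItemPointer tid)
{
    ItemPointerSet(tid,
                   (BlockNumber) (rownum / USABLE_OFFSETS),
                   (OffsetNumber) (rownum % USABLE_OFFSETS + 1));
}

static inline uint64
tid_to_rownum(ItemPointer tid)
{
    return (uint64) ItemPointerGetBlockNumber(tid) * USABLE_OFFSETS +
        (ItemPointerGetOffsetNumber(tid) - 1);
}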

As a table AM author, I'd like to know what the real limits are so that
I can use whatever bits are available to map from TID to row number and
back, without worrying that something will break in the future. A few
possibilities:

  1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
  2. Change MaxOffsetNumber to 2047. This seems likely to break
extensions that rely on it.
  3. Define MaxOffsetNumber as 65536 and introduce a new
MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
to fix itemptr_to_uint64().

Thoughts?

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Matthias van de Meent
On Fri, 30 Apr 2021, 09:46 Jeff Davis, <pgsql@j-davis.com> wrote:

> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.

> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).

> The most natural thing would be to say that we have 48 bits, so it can
> just be a 48-bit number. Of course, there are some restrictions on
> valid values that complicate this:

>   * InvalidBlockNumber of 0xFFFFFFFF. Not a problem.
>   * InvalidOffsetNumber of 0. Not a problem.
>   * MaxOffsetNumber of 2048. Does this limit really apply to table AMs?

MaxOffsetNumber is not per se 2048. It is BLCKSZ / sizeof(ItemIdData), which is only 2048 for an 8kiB BLCKSZ. As we support BLCKSZ up to 32kiB, MaxOffsetNumber can be as large as 8192.
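
For reference, off.h defines it as:

#define MaxOffsetNumber ((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))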
 
Other than that, I believe you've also missed the special offset numbers specified in itemptr.h (SpecTokenOffsetNumber and MovedPartitionsOffsetNumber). I am not familiar enough with how these OffsetNumber values are used, but they might also limit the values any tableAM can use for its TIDs.
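
For reference, those are:

#define SpecTokenOffsetNumber        0xfffe
#define MovedPartitionsOffsetNumber  0xfffd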

> It just seems like it's used when scanning heap or index pages for
> stack-allocated arrays. For a table AM it appears to waste 5 bits.

MaxOffsetNumber belongs to postgres' Page layout: it is defined as the maximum number of item pointers that could exist on a page, and AFAIK it should be used for postgres' Page layout only. Nothing can or should change that. If any code asserts that limit on the ip_posid of table tuples that might not be heapam tuples, then I believe that is probably a mistake in postgres, and it should be fixed.

>   * ginpostinglist.c:itemptr_to_uint64() seems to think 2047 is the max
> offset number. Is this a bug?

No. The documentation for that function explicitly mentions that these item pointers are optimized for storage when using the heap tableam, and that that code will be changed once there exist tableAMs with different TID / ip_posid constraints (see the comment on lines 32 and 33 of that file).

Note that the limiting number that itemptr_to_uint64 should mention for bit calculations is actually MaxHeapTuplesPerPage, which is about one seventh of MaxOffsetNumber. The resulting number of bits reserved is not a miscalculation though, because MaxHeapTuplesPerPage (for 32kiB BLCKSZ) requires the mentioned 11 bits, and adapting the bit swizzling for multiple page sizes was apparently not considered worth the effort.
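
The encoding there looks roughly like this:

#define MaxHeapTuplesPerPageBits    11

static uint64
itemptr_to_uint64(const ItemPointer iptr)
{
    uint64      val;

    Assert(ItemPointerIsValid(iptr));
    Assert(GinItemPointerGetOffsetNumber(iptr) <
           (1 << MaxHeapTuplesPerPageBits));

    /* 11 bits for the offset, the rest for the block number */
    val = GinItemPointerGetBlockNumber(iptr);
    val <<= MaxHeapTuplesPerPageBits;
    val |= GinItemPointerGetOffsetNumber(iptr);

    return val;
}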

> As a table AM author, I'd like to know what the real limits are so that
> I can use whatever bits are available to map from TID to row number and
> back, without worrying that something will break in the future. A few
> possibilities:

>   1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().

I believe that this is the right way when there exist tableAMs that use those upper 5 bits.
 
>   2. Change MaxOffsetNumber to 2047. This seems likely to break
> extensions that rely on it.

If you're going to change MaxOffsetNumber, I believe that it's better to change it to ((BLCKSZ - sizeof(PageHeaderData)) / sizeof(ItemIdData)), as that is the maximum number of ItemIds you could fit on a Page that has no page opaque.

>   3. Define MaxOffsetNumber as 65536 and introduce a new
> MaxItemsPerPage as 2048 for the stack-allocated arrays. We'd still need
> to fix itemptr_to_uint64().

I believe that that breaks more things than otherwise required. ip_posid is already limited to uint16, so I see no reason to add a constant that would assert that the value of any uint16 is less than its max value plus one.


With regards,

Matthias van de Meent

Re: MaxOffsetNumber for Table AMs

From: Tom Lane
Jeff Davis <pgsql@j-davis.com> writes:
> The notion of TID is based on pages and line pointers, which makes
> sense for heapam, but that's not likely to make sense for a custom
> table AM.
> The obvious answer is to make a simple mapping between a TID and
> whatever makes sense to the AM (for the sake of discussion, let's say a
> plain row number).

I'm inclined to think that when we get around to doing something
about this, we need to make a much bigger change than just poking
at the margins of type tid.

My thought at the moment is that all APIs above the AM level ought
to be redefined to use uint64 for tuple identifiers.  heapam and
related index AMs could map block + offset into that in some
convenient way, and other AMs could do what they like.
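
Just as illustration, the packing could be as simple as:

/* illustrative only: block number in the high bits, offset in the low 16 */
static inline uint64
heap_tid_to_uint64(ItemPointer tid)
{
    return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        (uint64) ItemPointerGetOffsetNumber(tid);
}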

Andres seems to feel that we should try to allow variable-width
tupleids, but I'm afraid that the cost/benefit ratio for that
would be pretty poor.

            regards, tom lane



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 8:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.  heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

I agree. It'll be easier for a new table AM to be developed with that
constraint than it will be to retrofit it to every index AM. It
probably wouldn't be that hard to make nbtree deduplication and GIN
posting list compression work with uint64 TIDs. But variable-width
TIDs are a very different matter.

Compatibility with index AMs is more than a matter of switching out
the tuple identifier -- if you invent something that has totally
different performance characteristics for index AMs, then it's likely
to break tacit assumptions about the cost of certain things. For
example, index tuple deletion probably relies on the fact that there
just aren't that many table blocks to visit (to get an XID for recovery
conflict purposes) in practice due to various locality-related
effects. Plus deduplication ameliorates problems with version churn in
indexes -- presumably the same problems will exist when any new table
AM is used, and so it'll be useful to address the same problems in the
same way.

I agree that it might well be useful to make TIDs fully logical (not
"physiological" -- physical across blocks, logical within blocks) for
some new table AM. Even then, it would still definitely be a good idea
to make these logical TID values correlate with the physical table
structure in some way. Plus TIDs should still be fixed size. If a new
table AM can't do it that way then that certainly needs to be
justified -- it's unreasonable to imagine that it simply isn't the
table AM's problem to solve.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.

One challenge might be reliance on InvalidOffsetNumber as a special
value in a number of places (e.g. bitmap index scans). That doesn't
seem like a major problem though.

> heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.

Do you mean that indexes would be expected to hold a uint64, a 48-bit
int (that directly maps to a uint64), or still hold an ItemPointerData?

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Tom Lane
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2021-04-30 at 11:06 -0400, Tom Lane wrote:
>> My thought at the moment is that all APIs above the AM level ought
>> to be redefined to use uint64 for tuple identifiers.

> Do you mean that indexes would be expected to hold a uint64, a 48-bit
> int (that directly maps to a uint64), or still hold an ItemPointerData?

ISTM that would be up to the index AM.  We'd need some interlocks on
which index AMs could be used with which table AMs in any case, I think.

It'd likely not be hard for existing index AMs to be repurposed to store
"any 48-bit TID", but extending them to full 64-bit TIDs may be
impractical.

I think the hard part may really be in places like tidbitmap.c, which
one would wish to be AM-independent, but right now it's quite specific
to heap-style TIDs.  Maybe we can think of a way to parameterize it.

            regards, tom lane



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 12:04 +0200, Matthias van de Meent wrote:
> Other than that, I believe you've also missed the special offset
> numbers specified in itemptr.h (SpecTokenOffsetNumber and
> MovedPartitionsOffsetNumber). I am not well enough aware of the usage
> of these OffsetNumber values, but those might also be limiting the
> values any tableAM can use for their TIDs.

Yes, thank you. If it is treated specially in a heap tuple, it can't be
a regular TID.

> > It just seems like it's used when scanning heap or index pages for
> > stack-allocated arrays. For a table AM it appears to waste 5 bits.
> 
> MaxOffsetNumber is used for postgres' Page layout, of which the
> MaxOffsetNumber is defined as how many item pointers could exist on a
> page, and AFAIK should be used for postgres' Page layout only. No
> thing can or should change that. If any code asserts limitations to
> the ip_posid of table tuples that could also not be tableam tuples,
> then I believe that is probably a mistake in postgres, and that
> should be fixed.

A name that would better fit your definition would be something like
"MaxItemsPerPage".

The name "MaxOffsetNumber" implies that any number past that must be
either invalid or special. But it seems like you are saying that if I
use an offset number of 5000 in my table AM, then that's fine and
should be treated like a normal TID.

> No. The documentation for that function explicitly mentions that
> these item pointers are optimized for storage when using the heap
> tableam, and that that code will be changed once there exist tableAMs
> with different TID / ip_posid constraints (see the comment on lines
> 32 and 33 of that file). 

Thank you.

I'm a table AM author, and I'd like to use whatever the real range of
TIDs is. Does that mean it's time to change that code in
ginpostinglist.c now?

> >   1. Keep MaxOffsetNumber as 2048 and fix itemptr_to_uint64().
> 
> I believe that this is the right way when there exist tableAMs that
> use those upper 5 bits.

Does that mean we should declare the valid range of offsets to be
between 1 and 0xfffc (inclusive)?

I'm trying to settle on a mapping now that's reasonably stable, so
that I don't have to worry that something will break later and force
me to reindex all tables that use my table AM.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My thought at the moment is that all APIs above the AM level ought
> to be redefined to use uint64 for tuple identifiers.  heapam and
> related index AMs could map block + offset into that in some
> convenient way, and other AMs could do what they like.
>
> Andres seems to feel that we should try to allow variable-width
> tupleids, but I'm afraid that the cost/benefit ratio for that
> would be pretty poor.

There are two major reasons why I want variable-width tuple IDs. One
is global indexes, where you need as many bits as the AMs implementing
the partitions need, plus some extra bits to identify which partition
is relevant for a particular tuple. No fixed number of bits that you
make available can ever be sufficient here, because a global index
always needs to have extra bits compared to a partition-local index;
if you let the partition-local index use more bits, the global index
now needs even more space. The other is indirect indexes, work Álvaro
proposed a few years ago, where the index entry points to the primary
key value rather than a TID. The space needs for that are based on the
type of the primary key column. This proposal solves neither of those
problems.

Another problem in this general area is that there is quite a bit of
code that thinks a TID is specifically a block number and an offset,
like the Bitmap Index/Heap Scan code, for example. But making tuple
IDs wider doesn't help with that problem either.

What problem do you think this proposal does solve?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> Compatibility with index AMs is more than a matter of switching out
> the tuple identifier -- if you invent something that has totally
> different performance characteristics for index AMs, then it's likely
> to break tacit assumptions about the cost of certain things.

I think it would be reasonable to document and expect that table AMs
offer some locality of access for tuples with similar IDs. Do you think
we need something stronger than that?

> Plus deduplication ameliorates problems with version churn
> in
> indexes -- presumably the same problems will exist when any new table
> AM is used, and so it'll be useful to address the same problems in
> the
> same way.

I got lost after "presumably the same problems", can you explain?

> I agree that it might well be useful to make TIDs fully logical (not
> "physiological" -- physical across blocks, logical within blocks) for
> some new table AM. Even then, it would still definitely be a good
> idea
> to make these logical TID values correlate with the physical table
> structure in some way.

Agreed.

Regards,
    Jeff Davis




Re: MaxOffsetNumber for Table AMs

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 30, 2021 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres seems to feel that we should try to allow variable-width
>> tupleids, but I'm afraid that the cost/benefit ratio for that
>> would be pretty poor.

> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here,

I agree that global indexes need more bits, but it doesn't necessarily
follow that we must have variable-width TIDs.  We could for example
say that "real" TIDs are only 48 bits and index AMs that want to be
usable as global indexes must be capable of handling 64-bit TIDs,
leaving 16 bits for partition ID.  A more forward-looking definition
would require global index AMs to store 96 bits (partition OID plus
64-bit TID).  Either way would be far simpler for every moving part
involved than going over to full varlena TIDs.
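
By way of illustration (the names are hypothetical):

typedef struct GlobalTid
{
    Oid         partoid;        /* partition's OID, 32 bits */
    uint64      tid;            /* table AM's 64-bit tuple identifier */
} GlobalTid;                    /* 96 bits on disk, ignoring alignment */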

> Another problem in this general area is that there is quite a bit of
> code that thinks a TID is specifically a block number and an offset,
> like the Bitmap Index/Heap Scan code, for example. But making tuple
> IDs wider doesn't help with that problem either.

Agreed, that's an area that will need a lot of thought for anything that
we do here.  But varlena TIDs surely do not make that easier to fix.

> What problem do you think this proposal does solve?

Accommodating table AMs that want more than 48 bits for a TID.
We're already starting to run up against the fact that that's not
enough bits for plausible use-cases.  64 bits may someday in the far
future not be enough either, but I think that's a very long way off.

            regards, tom lane



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 10:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There are two major reasons why I want variable-width tuple IDs. One
> > is global indexes, where you need as many bits as the AMs implementing
> > the partitions need, plus some extra bits to identify which partition
> > is relevant for a particular tuple. No fixed number of bits that you
> > make available can ever be sufficient here,
>
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs.  We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID.  A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID).  Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

The question of how the on-disk format of indexes needs to be changed
to accommodate global indexes seems like an entirely separate question
from how we go about expanding or redefining TIDs.

Global indexes should work by adding an extra column that is somewhat
like a TID, that may even have its own pg_attribute entry. It's much
more natural to make the partition number a separate column IMV --
nbtree suffix truncation and deduplication can work in about the same
way as before. Plus you'll need to do predicate pushdown using the
partition identifier in some scenarios anyway. You can make the
partition identifier variable-width without imposing the cost and
complexity of variable-width TIDs on index AMs.

I believe that the main reason why there have been so few problems
with any of the nbtree work in the past few releases is that it
avoided certain kinds of special cases. Any special cases in the
on-disk format and in the space accounting used when choosing a split
point ought to be avoided at all costs. We can probably afford to add
a lot of complexity to make global indexes work, but it ought to be
contained to cases that actually use global indexes in an obvious way.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 1:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that global indexes need more bits, but it doesn't necessarily
> follow that we must have variable-width TIDs.  We could for example
> say that "real" TIDs are only 48 bits and index AMs that want to be
> usable as global indexes must be capable of handling 64-bit TIDs,
> leaving 16 bits for partition ID.  A more forward-looking definition
> would require global index AMs to store 96 bits (partition OID plus
> 64-bit TID).  Either way would be far simpler for every moving part
> involved than going over to full varlena TIDs.

16 bits is not much for a partition identifier. We've already had
complaints about INNER_VAR being too small, so apparently there are
people who want to use really large numbers of partitions. But even if
we imagine a hypothetical world where nobody uses more than a couple
thousand partitions at once, it's very reasonable to want to avoid
recycling partition identifiers so that detaching a partition can be
O(1), and there's no way that's going to be viable if the whole
address space is only 16 bits, because with time series data people
are going to be continually creating new partitions and dropping old
ones. I would guess that it probably is viable with 32 bits, but we'd
have to have a mapping layer rather than using the OID directly to
avoid wraparound collisions.

Now this problem can be avoided by just requiring the AM to store more
bits, exactly as you say. I suspect 96 bits is large enough for all of
the practical use cases people have, or at least within spitting
distance. But it strikes me as relatively inefficient to say that
we're always going to store 96 bits for every TID. I certainly don't
think we want to break on-disk compatibility and widen every existing
btree index by changing all the 6-byte TIDs they're storing now to
store 12-byte TIDs that are at least half zero bytes, so I think
we're bound to end up with at least two options: 6 and 12. But
variable-width would be a lot nicer. You could store small TIDs and
small partition identifiers very compactly, and only use the full
number of bytes when the situation demands it.

> > What problem do you think this proposal does solve?
>
> Accommodating table AMs that want more than 48 bits for a TID.
> We're already starting to run up against the fact that that's not
> enough bits for plausible use-cases.  64 bits may someday in the far
> future not be enough either, but I think that's a very long way off.

Do people actually want to store more than 2^48 rows in a table, or is
this more about the division of a TID into a block number and an item
number?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 12:51 -0400, Robert Haas wrote:
> There are two major reasons why I want variable-width tuple IDs. One
> is global indexes, where you need as many bits as the AMs
> implementing
> the partitions need, plus some extra bits to identify which partition
> is relevant for a particular tuple. No fixed number of bits that you
> make available can ever be sufficient here, because a global index
> always needs to have extra bits compared to a partition-local index;
> if you let the partition-local index use more bits, the global index
> now needs even more space. The other is indirect indexes, work Álvaro
> proposed a few years ago, where the index entry points to the primary
> key value rather than a TID. The space needs for that are based on
> the
> type of the primary key column. This proposal solves neither of those
> problems.

The particular problem I have now is that Table AMs seem to support
indexes just fine, but TIDs are under-specified so I don't know what I
really have to work with. BlockNumber seems well-specified as
0..0xFFFFFFFE (inclusive), but I don't know what the valid range of
OffsetNumber is for the purposes of a TableAM.

Part of changing to uint64 would be specifying the TIDs in a way that I
could rely on in the future.

The problems you mention above are above the table AM layer, so they
seem orthogonal. There would still need to be an ID that table AMs can
use to fetch a tuple from a particular physical table.

In the future we may support primary unique indexes at the table AM
layer, which would get more interesting. I can see an argument for a
TID being an arbitrary datum in that case, but I haven't really
considered the design implications. Is this what you are suggesting?

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 1:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Global indexes should work by adding an extra column that is somewhat
> like a TID, that may even have its own pg_attribute entry. It's much
> more natural to make the partition number a separate column IMV --
> nbtree suffix truncation and deduplication can work in about the same
> way as before. Plus you'll need to do predicate pushdown using the
> partition identifier in some scenarios anyway. You can make the
> partition identifier variable-width without imposing the cost and
> complexity of variable-width TIDs on index AMs.

I agree up to a point but ... are you imagining that the TID continues
to have its own special place in the page, while the partition
identifier is stored more like a regular tuple column? Because it
seems to me that it would be better to try to eliminate the
special-storage case, just like we did for OIDs. If you want a 6-byte
TID, put a 6-byte column in the tuple for it. If you also want a
partition identifier, put an extra column in the tuple for that. If
you want a wider TID or a varlena TID, well, put columns for that into
the tuple instead of the 6-byte column you'd normally put. This seems
extremely flexible and a lot more aesthetically appealing than what we
have today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 10:04 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2021-04-30 at 08:36 -0700, Peter Geoghegan wrote:
> > Compatibility with index AMs is more than a matter of switching out
> > the tuple identifier -- if you invent something that has totally
> > different performance characteristics for index AMs, then it's likely
> > to break tacit assumptions about the cost of certain things.
>
> I think it would be reasonable to document and expect that table AMs
> offer some locality of access for tuples with similar IDs. Do you think
> we need something stronger than that?

I don't know. This conversation is still too abstract for me to be
able to take a firm position. ISTM that we tend to talk about the
table AM in excessively abstract terms. It would be a lot easier if we
had clear fixed goals for a small number of additional table AMs.

> > Plus deduplication ameliorates problems with version churn
> > in
> > indexes -- presumably the same problems will exist when any new table
> > AM is used, and so it'll be useful to address the same problems in
> > the
> > same way.
>
> I got lost after "presumably the same problems", can you explain?

Well, there are now two things in nbtree that specifically help with
version churn caused by non-HOT updates: deduplication, and bottom-up
index deletion (especially the latter). Presumably any new table AM
will have something like non-HOT updates. Though they may rarely be a
problem (say because the new table AM isn't really for OLTP), whenever
they are a problem they'll be a very big problem. It seems like a good
idea to have the same index AM level protections against accumulating
version-churn index tuples in an unbounded way.

More generally, it seems like a good idea to try to make new table AMs
reasonably close to heapam insofar as possible. The reality is that
everything evolved around heapam, and that that's likely to matter in
all kinds of ways that nobody fully understands just yet. We have a
somewhat abstract table AM interface, which is good, but that doesn't
mean that table AMs can be designed as if it was a green field
situation.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> ISTM that would be up to the index AM.  We'd need some interlocks on
> which index AMs could be used with which table AMs in any case, I
> think.

I'm not sure why? It seems like we should be able to come up with
something that's generic enough.

> I think the hard part may really be in places like tidbitmap.c, which
> one would wish to be AM-independent, but right now it's quite
> specific
> to heap-style TIDs.  Maybe we can think of a way to parameterize it.

For my particular AM, being able to have a parameterized granularity
might be nice, but not required.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 1:37 PM Jeff Davis <pgsql@j-davis.com> wrote:
> The particular problem I have now is that Table AMs seem to support
> indexes just fine, but TIDs are under-specified so I don't know what I
> really have to work with. BlockNumber seems well-specified as
0..0xFFFFFFFE (inclusive), but I don't know what the valid range of
> OffsetNumber is for the purposes of a TableAM.

I agree that this is a problem.

> Part of changing to uint64 would be specifying the TIDs in a way that I
> could rely on in the future.

I mean, from my perspective, the problem here is that the abstraction
layer is leaky and things outside of the table AM layer know what heap
is doing under the hood, and rely on it. If we could refactor the
abstraction to be less leaky, it would be clearer what assumptions
table AM authors can make. If we can't, any specification doesn't seem
worth much.

> In the future we may support primary unique indexes at the table AM
> layer, which would get more interesting. I can see an argument for a
> TID being an arbitrary datum in that case, but I haven't really
> considered the design implications. Is this what you are suggesting?

I think that would be the best long-term plan. I guess I have two
distinguishable concerns. One is that I want to be able to have
indexes with a payload that's not just a 6-byte TID; e.g. adding a
partition identifier to support global indexes, or replacing the
6-byte TID with a primary key reference to support indirect indexes.
The other concern is to be able to have table AMs that use arbitrary
methods to identify a tuple. For example, if somebody implemented an
index-organized table, the "TID" would really be the primary key.

Even though these are distinguishable concerns, they basically point
in the same direction as far as index layout is concerned. The
implications for the table AM layer are somewhat different in the two
cases, but both argue that some places that are now talking about TIDs
should be changed to talk about Datums or something of that sort.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 10:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I agree up to a point but ... are you imagining that the TID continues
> to have its own special place in the page, while the partition
> identifier is stored more like a regular tuple column? Because it
> seems to me that it would be better to try to eliminate the
> special-storage case, just like we did for OIDs.

I agree in principle, but making that work well is very hard in
practice because of the format of IndexTuple -- which bleeds into
everything. That TID is special is probably a natural consequence of
the fact that we don't have an offset-based format of the kind you see
in other DB systems -- systems that don't emphasize extensibility. We
cannot jump to a hypothetical TID attribute inexpensively inside code
like _bt_compare() because we don't have a cheap way to jump straight
to the datum for any attribute. So we just store TID in IndexTuple
directly instead. Imagine how much more expensive VACUUM would be if
it had to grovel through the IndexTuple format.
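
For reference, TID is cheap to reach precisely because it lives in the
fixed header of every index tuple (itup.h), ahead of any attribute
data:

typedef struct IndexTupleData
{
    ItemPointerData t_tid;      /* reference TID to heap tuple */
    unsigned short t_info;      /* various info about tuple */
} IndexTupleData;               /* MORE DATA FOLLOWS AT END OF STRUCT */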

I wonder how the same useful performance characteristics can be
maintained with a variable-width TID design. If you solve the problem
by changing IndexTuple, then you are kind of obligated to not use
varlena headers to keep the on-disk size manageable. Who knows where
it all ends?

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 10:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think that would be the best long-term plan. I guess I have two
> distinguishable concerns. One is that I want to be able to have
> indexes with a payload that's not just a 6-byte TID; e.g. adding a
> partition identifier to support global indexes, or replacing the
> 6-byte TID with a primary key reference to support indirect indexes.
> The other concern is to be able to have table AMs that use arbitrary
> methods to identify a tuple. For example, if somebody implemented an
> index-organized table, the "TID" would really be the primary key.
>
> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about TIDs
> should be changed to talk about Datums or something of that sort.

I don't know how it's possible to do any of this without first
addressing what the table AM does in cases where heapam currently does
a non-HOT update. You obviously cannot have the equivalent of
duplicate TIDs when your new table AM runs into these scenarios. So
what do you do instead? How do you make your clustered index/IoT style
identifiers (i.e. your strictly logical TID-like identifiers) deal
with that case?

ISTM that this is by far the biggest issue with generalizing the table
AM interface for use by a table AM that isn't already very much like heapam. I
am always surprised to be the one that has to point it out during
these discussions. It's a huge issue.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I agree in principle, but making that work well is very hard in
> practice because of the format of IndexTuple -- which bleeds into
> everything. That TID is special is probably a natural consequence of
> the fact that we don't have an offset-based format of the kind you see
> in other DB systems -- systems that don't emphasize extensibility. We
> cannot jump to a hypothetical TID attribute inexpensively inside code
> like _bt_compare() because we don't have a cheap way to jump straight
> to the datum for any attribute. So we just store TID in IndexTuple
> directly instead. Imagine how much more expensive VACUUM would be if
> it had to grovel through the IndexTuple format.

I can't imagine that, so maybe you want to enlighten me? I see that
there's a potential problem there, and I'm glad you pointed it out
because I hadn't thought about it previously ... but if you always put
the column or columns that VACUUM would need first, it's not obvious
to me that it would be all that expensive. Deforming the tuple to a
sufficient degree to extract the first column, which would even be
fixed-width, shouldn't take much work.

> I wonder how the same useful performance characteristics can be
> maintained with a variable-width TID design. If you solve the problem
> by changing IndexTuple, then you are kind of obligated to not use
> varlena headers to keep the on-disk size manageable. Who knows where
> it all ends?

What's wrong with varlena headers? It would end up being a 1-byte
header in practically every case, and no variable-width representation
can do without a length word of some sort. I'm not saying varlena is
as efficient as some new design could hypothetically be, but it
doesn't seem like it'd be a big enough problem to stress about. If you
used a variable-width representation for integers, you might actually
save bytes in a lot of cases. An awful lot of the TIDs people store in
practice probably contain several zero bytes, and if we make them
wider, that's going to be even more true.
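
Something like the usual base-128 varint scheme would do (a sketch,
not anything that exists in the tree today):

static int
encode_varint(uint64 val, unsigned char *buf)
{
    int         len = 0;

    do
    {
        unsigned char byte = val & 0x7F;

        val >>= 7;
        if (val != 0)
            byte |= 0x80;       /* continuation bit */
        buf[len++] = byte;
    } while (val != 0);

    return len;                 /* 1..10 bytes for a uint64 */
}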

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 10:50 -0700, Peter Geoghegan wrote:
> I don't know. This conversation is still too abstract for me to be
> able to take a firm position. ISTM that we tend to talk about the
> table AM in excessively abstract terms. It would be a lot easier if
> we
> had clear fixed goals for a small number of additional table AMs.

https://github.com/citusdata/citus/tree/master/src/backend/columnar

My colleagues and I have been working on a "columnar" table AM. It
doesn't currently support indexes, but it would be useful to support
them.

The basic idea is we have "stripes" of ~150000 tuples that are
rearranged and compressed, and stored in an smgr-controlled file that
goes through the buffer cache and uses generic WAL.

To support indexes, we could do our own lookups from a "row number" to
a particular offset where we can find and decompress the stripe that
holds that row number, and then scan forward in the stripe to find the
particular row. This will be terrible for random access, but [*waves
hands*] we will keep state and use a few optimizations so that this is
not terribly slow for clustered access.
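
Roughly like this (a sketch -- the names are hypothetical, not the
actual columnar code):

typedef struct StripeMapEntry
{
    uint64      first_row;      /* first row number in the stripe */
    uint64      row_count;      /* number of rows in the stripe */
    uint64      file_offset;    /* where the compressed stripe starts */
} StripeMapEntry;

/* binary search a sorted stripe map for the stripe holding rownum */
static StripeMapEntry *
find_stripe(StripeMapEntry *map, int nstripes, uint64 rownum)
{
    int         lo = 0;
    int         hi = nstripes - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (rownum < map[mid].first_row)
            hi = mid - 1;
        else if (rownum >= map[mid].first_row + map[mid].row_count)
            lo = mid + 1;
        else
            return &map[mid];
    }
    return NULL;                /* no stripe holds that row number */
}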

Granted, TID lookup on columnar will be much slower than for a heap
(and we can use a CustomScan so that the costs reflect that). But it
will satisfy important use cases:

  1. Indexes defined on partition parent tables. Even if the index is
never used for queries, we don't want to throw an error when defining
the partitioned parent index.
  2. Unique indexes and exclusion constraints.
  3. Clustered index scans can still be reasonably fast.
  4. Could be used for UPDATE/DELETE as well.

> More generally, it seems like a good idea to try to make new table
> AMs
> reasonably close to heapam insofar as possible. The reality is that
> everything evolved around heapam, and that that's likely to matter in
> all kinds of ways that nobody fully understands just yet.

Agreed. I think of this as an evolving situation where we take steps
toward a better abstraction.

One (hopefully reasonable) step I'd like to take is a well-specified
TID.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 13:56 -0400, Robert Haas wrote:
> I think that would be the best long-term plan.

We should still have *some* answer in the short term for table AM
authors like me. If I use offset numbers as high as MaxOffsetNumber,
then itemptr_to_uint64 will fail. If I base my calculations for the TID
to row number mapping on MaxOffsetNumber at all, then it will break if
we change MaxOffsetNumber (as was suggested[1]).

My takeaway so far is that the only safe thing to do is hard-code it
to 2000. I can do that until we settle on something better (at which
point I can force a reindex, I suppose).

[1] 
https://postgr.es/m/CAEze2Wit1EtHHBHJ+CYvBPthrWUzu2Vqc-BmzU3ApK3iotHriw@mail.gmail.com

> Even though these are distinguishable concerns, they basically point
> in the same direction as far as index layout is concerned. The
> implications for the table AM layer are somewhat different in the two
> cases, but both argue that some places that are now talking about
> TIDs
> should be changed to talk about Datums or something of that sort.

Logically, that makes a lot of sense to me. Peter seems to have quite a
few practical implementation concerns though, so it could be a long
road.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 30, 2021 at 2:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I agree in principle, but making that work well is very hard in
> > practice because of the format of IndexTuple -- which bleeds into
> > everything. That TID is special is probably a natural consequence of
> > the fact that we don't have an offset-based format of the kind you see
> > in other DB systems -- systems that don't emphasize extensibility. We
> > cannot jump to a hypothetical TID attribute inexpensively inside code
> > like _bt_compare() because we don't have a cheap way to jump straight
> > to the datum for any attribute. So we just store TID in IndexTuple
> > directly instead. Imagine how much more expensive VACUUM would be if
> > it had to grovel through the IndexTuple format.
>
> I can't imagine that, so maybe you want to enlighten me? I see that
> there's a potential problem there, and I'm glad you pointed it out
> because I hadn't thought about it previously ... but if you always put
> the column or columns that VACUUM would need first, it's not obvious
> to me that it would be all that expensive.

Maybe. The point is that it is a problem that needs to be solved.

> Deforming the tuple to a
> sufficient degree to extract the first column, which would even be
> fixed-width, shouldn't take much work.

I think that it's reasonable to impose some cost on index AMs here,
but that needs to be bounded sensibly and unambiguously. For example,
it would probably be okay if you had either 6-byte or 8-byte TIDs, but
no other variations. You could require index AMs (the subset of index
AMs that are ever able to store 8 byte TIDs) to directly encode which
width they're dealing with at the level of each IndexTuple. That would
create some problems for nbtree deduplication, especially in boundary
cases, but ISTM that you can manage the complexity by sensibly
restricting how the TIDs work across the board. For example, the TIDs
should always work like unsigned integers -- the table AM must be
willing to work with that restriction.

You'd then have posting list tuples in nbtree whose TIDs were all
either 6 bytes or 8 bytes wide, with a mix of each possible (though
not particularly likely) on the same leaf page. Say when you have a
table that exceeds the current MaxBlockNumber restrictions. It would
be relatively straightforward for nbtree deduplication to simply
refuse to mix 6-byte and 8-byte datums together to avoid complexity in
boundary cases. The deduplication pass logic has the flexibility that
this requires already.

> > I wonder how the same useful performance characteristics can be
> > maintained with a variable-width TID design. If you solve the problem
> > by changing IndexTuple, then you are kind of obligated to not use
> > varlena headers to keep the on-disk size manageable. Who knows where
> > it all ends?
>
> What's wrong with varlena headers? It would end up being a 1-byte
> header in practically every case, and no variable-width representation
> can do without a length word of some sort. I'm not saying varlena is
> as efficient as some new design could hypothetically be, but it
> doesn't seem like it'd be a big enough problem to stress about. If you
> used a variable-width representation for integers, you might actually
> save bytes in a lot of cases. An awful lot of the TIDs people store in
> practice probably contain several zero bytes, and if we make them
> wider, that's going to be even more true.

Maybe all of this is true, and maybe it works out to be the best path
forward in the long term, all things considered. But whether or not
that's true is crucially dependent on what real practical table AMs
(of which there will only ever be a tiny number) actually need to do.
Why should we assume that the table AM cannot accept some
restrictions? What good does it do to legalistically define the
problem as a problem for index AMs to solve?

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 2:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I don't know how it's possible to do any of this without first
> addressing what the table AM does in cases where heapam currently does
> a non-HOT update.

Why can't it do what it does already? It's not broken for heap, so why
should it be broken for anything else? And why are non-HOT updates
specifically a problem?

> You obviously cannot have the equivalent of
> duplicate TIDs when your new table AM runs into these scenarios. So
> what do you do instead? How do you make your clustered index/IoT style
> identifiers (i.e. your strictly logical TID-like identifiers) deal
> with that case?

Is the problem you're worried about here that, with something like an
index-organized table, you can have multiple row versions that have
the same logical tuple ID, i.e. primary key value? And that the
interfaces aren't well-suited to that? Because that's a problem I have
thought about and can comment on, even though I think the question of
having multiple versions with the same TID is distinguishable from the
question of how *wide* TIDs should be. But maybe that's not what you
are talking about here, in which case I guess I need a clearer
explanation of the concern.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 12:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Why can't it do what it does already? It's not broken for heap, so why
> should it be broken for anything else? And why are non-HOT updates
> specifically a problem?

No reason.

> > You obviously cannot have the equivalent of
> > duplicate TIDs when your new table AM runs into these scenarios. So
> > what do you do instead? How do you make your clustered index/IoT style
> > identifiers (i.e. your strictly logical TID-like identifiers) deal
> > with that case?
>
> Is the problem you're worried about here that, with something like an
> index-organized table, you can have multiple row versions that have
> the same logical tuple ID, i.e. primary key value? And that the
> interfaces aren't well-suited to that? Because that's a problem I have
> thought about and can comment on, even though I think the question of
> having multiple versions with the same TID is distinguishable from the
> question of how *wide* TIDs should be. But maybe that's not what you
> are talking about here, in which case I guess I need a clearer
> explanation of the concern.

That's what I'm talking about. I'd like to hear what you think about it.

It's not exactly a narrow concern. For one thing, it is enough to
totally validate my suggestion about how we might widen TIDs and still
have nbtree deduplication.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Robert Haas
On Fri, Apr 30, 2021 at 3:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Is the problem you're worried about here that, with something like an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? And that the
> > interfaces aren't well-suited to that? Because that's a problem I have
> > thought about and can comment on, even though I think the question of
> > having multiple versions with the same TID is distinguishable from the
> > question of how *wide* TIDs should be. But maybe that's not what you
> > are talking about here, in which case I guess I need a clearer
> > explanation of the concern.
>
> That's what I'm talking about. I'd like to hear what you think about it.

OK. I thought about this in regards to zheap, which has this exact
problem, because it wants to do so-called "in place" updates where the
new version of the row goes right on top of the old one in the table
page, and the old version of the row gets written into the undo log.
Just to keep things simple, we said that initially we'd only use this
in-place update strategy when no indexed columns were changed, so that
there's only ever one set of index entries for a given TID. In that
model, the index AMs don't really need to care that there are actually
multiple tuples for the same TID, because those tuples differ only in
columns that the index doesn't care about anyway. An index scan has to
be careful to fetch the correct version of the tuple, but it has a
Snapshot available, so it can do that. However, there's no easy and
efficient way to handle updates and deletes. Suppose for example that
a tuple has been updated 5 times, creating versions t1..t5. t5 is now
in the zheap page, and the other versions are in the undo. t5 points
to t4 which points to t3 and so forth. Now an updater comes along and
let's say that the updater's snapshot sees t2. It may be that t3..t5
are *uncommitted* updates in which case the attempt to update t2 may
succeed if the transactions that performed them abort, or it may be
that the updating transactions have committed, in which case we're
going to have to fail. But that decision isn't made by the scan that
sees t2; it happens when the TID reaches the ModifyTable node. So what
zheap ends up doing is finding the right tuple version during the
scan, by making use of the snapshot, and then having to go repeat that
work when it's time to try to perform the update. It would be nice to
avoid this. If we could feed system columns from the scan through to
the update, we could pass along an undo pointer and avoid the extra
overhead. So it seems to me, so far anyway, that there's no very
fundamental problem here, but there is an efficiency issue which we
could address if we had a bit more planner and executor infrastructure
to help us out.

Now in the long run the vision for zheap was that we'd eventually want
to do in-place updates even when indexed columns have been modified,
and this gets a whole lot trickier, because now there can be multiple
sets of index entries pointing at the same TID which don't agree on
the values of the indexed columns. As old row versions die off, some
of those pointers need to be cleaned out, and others do not. I thought
we might solve this problem by something akin to retail index
deletion: have an update or delete on a zheap tuple go re-find the
associated index entries and mark them for possible cleanup, and then
vacuum can ignore all unmarked tuples. There might be some efficiency
problems with this idea I hadn't considered, based on your remarks
today. But regardless of the wisdom or folly of this approach, the
broader point is that we can't assume that all heap types are going to
have the same maintenance requirements. I think most of them are going
to have some kind of maintenance operation that needs to or at least
can optionally be performed from time to time, but it might be
triggered by completely different criteria than vacuum. New table AMs
might well choose to use 64-bit XIDs, avoiding the need for wraparound
processing altogether. Maybe they have such good opportunistic cleanup
mechanisms that periodic vacuum for bloat isn't even really needed.
Maybe they bloat when updates and deletes commit but not when inserts
and updates abort, because those cases are handled via some other
mechanism. Who knows, really? It's hard to predict what
not-yet-written AMs might care about, and even if we knew, it seems
crazy to try to rewrite the way vacuum works to cater to those needs
before we actually have some working AMs to use as a testbed.

It strikes me that certain interesting cases might not really need
anything very in-depth here. For example, consider indirect indexes,
where the index references the primary key value rather than the TID.
Well, the indirect index should probably be vacuumed periodically to
prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
because it doesn't contain TIDs. BRIN indexes, BTW, also don't contain
TIDs. Either could, therefore, be optionally vacuumed after vacuum has
done absolutely everything else, even truncate the table, or they
could be vacuumed on a completely separate schedule that doesn't have
anything to do with table vacuuming. I suppose we'd have to come up
with some solution, but I don't think it would need to be fully
general; it could just be good enough for that particular feature,
since fully general seems rather impossible anyway. So I feel like
it's pretty fair to just defer this question. Without some solution
you can't entirely finish a project like indirect indexes, but without
variable-width index payloads you can't even start it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From: Jeff Davis
On Fri, 2021-04-30 at 12:29 -0700, Peter Geoghegan wrote:
> > Is the problem you're worried about here that, with something like
> > an
> > index-organized table, you can have multiple row versions that have
> > the same logical tuple ID, i.e. primary key value? 
> 
> That's what I'm talking about. I'd like to hear what you think about
> it.

FWIW, this is not a problem in my table AM. I am fine having different
TIDs for each version, just like heapam.

For index-organized tables it does seem like an interesting problem.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> FWIW, this is not a problem in my table AM. I am fine having different
> TIDs for each version, just like heapam.

This means that we are largely in agreement about the general nature
of the problem. That seems like a good basis to redefine TID-like
identifiers so that they can accommodate what you want to do.

> For index-organized tables it does seem like an interesting problem.

I strongly suspect that index-organized tables (or indirect indexes,
or anything else that assumes that TID-like identifiers map directly
to logical rows as opposed to physical versions) are going to break
too many assumptions to ever be tractable. Assuming I have that right,
it would advance the discussion if we could all agree on that being a
non-goal for the tableam interface in general. This would allow us to
clearly discuss how to solve the remaining problem of accommodating
column stores and suchlike. That seems hard, but much more tractable.

The fact that the tableam has almost no non-goals has always bothered
me a bit. Especially on this particular point about purely logical
TID-like identifiers.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From: Peter Geoghegan
On Fri, Apr 30, 2021 at 2:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> OK. I thought about this in regards to zheap, which has this exact
> problem, because it wants to do so-called "in place" updates where the
> new version of the row goes right on top of the old one in the table
> page, and the old version of the row gets written into the undo log.
> Just to keep things simple, we said that initially we'd only use this
> in-place update strategy when no indexed columns were changed, so that
> there's only ever one set of index entries for a given TID. In that
> model, the index AMs don't really need to care that there are actually
> multiple tuples for the same TID, because those tuples differ only in
> columns that the index doesn't care about anyway. An index scan has to
> be careful to fetch the correct version of the tuple, but it has a
> Snapshot available, so it can do that.

Right. So zheap (in the current prototype implementation) is like
heapam with its HOT optimization, except that it isn't subject to the
same limitations with regard to fitting heap tuples on the same heap
page to keep the same HOT chain going over time. You kind of have the
moral equivalent of a HOT chain that can largely live in UNDO. That
seems like a very useful thing on its own. A lot of the problems with
HOT are in this area -- maybe the vast majority, even.

A remaining problem is that we must generate a new round of index
tuples for each and every index when only one indexed column is
logically modified by an UPDATE statement. I think that this is much
less of a problem now due to bottom-up index deletion. Sure, it sucks
that we still have to dirty the page at all. But it's nevertheless
true that it all but eliminates version-driven page splits, which are
where almost all of the remaining downside is. It's very reasonable to
now wonder if this particular all-indexes problem is worth solving at
all in light of that. (Modern hardware characteristics also make a
comprehensive fix less valuable in practice.)

> However, there's no easy and
> efficient way to handle updates and deletes. Suppose for example that
> a tuple has been updated 5 times, creating versions t1..t5. t5 is now
> in the zheap page, and the other versions are in the undo. t5 points
> to t4 which points to t3 and so forth. Now an updater comes along and
> let's say that the updater's snapshot sees t2. It may be that t3..t5
> are *uncommitted* updates in which case the attempt to update t2 may
> succeed if the transaction that performed then aborts, or it may be
> that the updating transactions have committed, in which case we're
> going to have to fail. But that decision isn't made by the scan that
> sees t3; it happens when the TID reaches the ModifyTable node. So what
> zheap ends up doing is finding the right tuple version during the
> scan, by making use of the snapshot, and then having to go repeat that
> work when it's time to try to perform the update. It would be nice to
> avoid this.

I believe that this is another consequence of the fact that Postgres
versions tuples, not pages. This is not a minor theoretical point.
It's very different to what Oracle does. It's almost a necessary
consequence of our basic approach to extensibility, because you can
have things like index tuples whose values are equal but visibly
distinct (e.g., the numeric value '5.0' is equal to but distinct from
'5'). It also has a lot to do with how crash recovery works.

> If we could feed system columns from the scan through to
> the update, we could pass along an undo pointer and avoid the extra
> overhead. So it seems to me, so far anyway, that there's no very
> fundamental problem here, but there is an efficiency issue which we
> could address if we had a bit more planner and executor infrastructure
> to help us out.

FWIW you don't necessarily have to do the EPQ stuff. You could in
theory do a statement-level rollback, and repeat. The EPQ mechanism is
unique to Postgres. Maybe it doesn't matter, but I don't think that
it's essential to follow this in other table AMs.

> Now in the long run the vision for zheap was that we'd eventually want
> to do in-place updates even when indexed columns have been modified,
> and this gets a whole lot trickier, because now there can be multiple
> sets of index entries pointing at the same TID which don't agree on
> the values of the indexed columns.

It's much easier when you have a very simple type system that doesn't
allow differences like my "numeric '5.0' vs '5'" example -- a system
that is built for this from the ground up. If there are meaningful
semantic differences among opclass-equal index tuples, then we can
never assume that index tuples will always be locatable after an
update affecting indexed columns (if only because we need to preserve
the '5.0' and '5' variants in an index on a numeric column).

If we could at least be sure that two index tuples that point to the
same stable/logical zheap TID (in a world where TIDs were stable
identifiers of logical rows) were nevertheless unique, then we'd be
able to uniquely identify each index tuple during retail index tuple
deletion -- they'd still have distinct key values in the index tuple
overall. That assumption isn't workable in Postgres, though.

I'm pretty sure that there will be a bunch of tacit assumptions like
this that would shake out all over the place. You'd have to actually
pursue this design to figure out what they were, but I'm pretty sure
many more exist. In any case this one example seems sufficient to make
me doubt the whole enterprise.

> As old row versions die off, some
> of those pointers need to be cleaned out, and others do not. I thought
> we might solve this problem by something akin to retail index
> deletion: have an update or delete on a zheap tuple go re-find the
> associated index entries and mark them for possible cleanup, and then
> vacuum can ignore all unmarked tuples. There might be some efficiency
> problems with this idea I hadn't considered, based on your remarks
> today. But regardless of the wisdom or folly of this approach, the
> broader point is that we can't assume that all heap types are going to
> have the same maintenance requirements.

No, we can't. But we had better have a generalized definition that
accounts for what variation is acceptable, and (most importantly) what
variation *isn't* acceptable.

> I think most of them are going
> to have some kind of maintenance operation that need to or at least
> can optionally be performed from time to time, but it might be
> triggered by completely different criteria than vacuum. New table AMs
> might well choose to use 64-bit XIDs, avoiding the need for wraparound
> processing altogether. Maybe they have such good opportunistic cleanup
> mechanisms that periodic vacuum for bloat isn't even really needed.
> Maybe they bloat when updates and deletes commit but not when inserts
> and updates abort, because those cases are handled via some other
> mechanism. Who knows, really? It's hard to predict what
> not-yet-written AMs might care about, and even if we knew, it seems
> crazy to try to rewrite the way vacuum works to cater to those needs
> before we actually have some working AMs to use as a testbed.

Nothing is certain, but frankly I just don't believe that anybody is
ever going to solve this problem in Postgres. The fundamental
assumption that TIDs are not stable identifiers of logical rows (they
point to versions) is just too baked into everything. And the
downsides of that design can be fixed in a localized way.

On the other hand, Jeff and I agree about the parameters of the
discussion here. I can see myself doing work inside nbtree to
facilitate his work. But that is made a lot less likely by the general
lack of agreement about what ought to ultimately be possible. There is
a very real cost to indefinitely deferring making a hard choice about
what we can rule out for table AMs. It's *not* free.

How long should that situation be allowed to continue for? This is not
a rhetorical question -- maybe it should be timeboxed in some way.
Right now the tableam can in theory do anything. Nobody knows how, but anything
is possible!

> It strikes me that certain interesting cases might not really need
> anything very in-depth here. For example, consider indirect indexes,
> where the index references the primary key value rather than the TID.
> Well, the indirect index should probably be vacuumed periodically to
> prevent bloat, but it doesn't need to be vacuumed to recycle TIDs
> because it doesn't contain TIDs.

Why is that okay, though? How can you get away with not having
version-based TID-like identifiers here? I would be willing to accept
an answer like "it's unclear, but it must be possible" if there was no
downside. But as I said, there is a downside.

> BRIN indexes, BTW, also don't contain
> TIDs.

BRIN indexes aren't suitable as indirect indexes, though.

> Either could, therefore, be optionally vacuumed after vacuum has
> done absolutely everything else, even truncate the table, or they
> could be vacuumed on a completely separate schedule that doesn't have
> anything to do with table vacuuming.

I think that the question of how TID-like identifiers work across
table AMs is fundamentally distinct from how VACUUM works. I think
that we'll probably always have something like VACUUM. That doesn't
mean that we cannot ultimately teach VACUUM to run far less often.
There is a problem with things being overly coupled inside VACUUM, so
our practical experience with VACUUM isn't necessarily a reliable
indicator of how much of a problem VACUUM is long term.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Fri, Apr 30, 2021 at 5:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I strongly suspect that index-organized tables (or indirect indexes,
> or anything else that assumes that TID-like identifiers map directly
> to logical rows as opposed to physical versions) are going to break
> too many assumptions to ever be tractable. Assuming I have that right,
> it would advance the discussion if we could all agree on that being a
> non-goal for the tableam interface in general.

I *emphatically* disagree with the idea of ruling such things out
categorically. This is just as naive as the TODO's statement that we
do not want "All backends running as threads in a single process".
Does anyone really believe that we don't want that any more? I
believed it 10 years ago, but not any more. It's costing us very
substantially not only in that it makes parallel query more
complicated and fragile, but more importantly in that we can't scale
up to connection counts that other databases can handle because we use
up too many operating system resources. Supporting threading in
PostgreSQL isn't a project that someone will pull off over a long
weekend and it's not something that has to be done tomorrow, but it's
pretty clearly the future.

So here. The complexity of getting a table AM that does anything
non-trivial working is formidable, and I don't expect it to happen
right away. Picking one that is essentially block-based and can use
48-bit TIDs is very likely the right initial target because that's the
closest we have now, and there's no sense attacking the hardest
variant of the problem first. However, as with the
threads-vs-processes example, I strongly suspect that having only one
table AM is leaving vast amounts of performance on the table. To say
that we're never going to pursue the parts of that space that require
a different kind of tuple identifier is to permanently write off tons
of ideas that have produced promising results in other systems. Let's
not do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Fri, Apr 30, 2021 at 6:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
> A remaining problem is that we must generate a new round of index
> tuples for each and every index when only one indexed column is
> logically modified by an UPDATE statement. I think that this is much
> less of a problem now due to bottom-up index deletion. Sure, it sucks
> that we still have to dirty the page at all. But it's nevertheless
> true that it all but eliminates version-driven page splits, which are
> where almost all of the remaining downside is. It's very reasonable to
> now wonder if this particular all-indexes problem is worth solving at
> all in light of that. (Modern hardware characteristics also make a
> comprehensive fix less valuable in practice.)

It's reasonable to wonder. I think it depends on whether the problem
is bloat or just general slowness. To the extent that the problem is
bloat, bottom-up index deletion will help a lot, but it's not going to
help with slowness because, as you say, we still have to dirty the
pages. And I am pretty confident that slowness is a very significant
part of the problem here. It's pretty common for people migrating from
another database system to have, for example, a table with 10 indexes
and then repeatedly update a column that is covered by only one of
those indexes. Now, with bottom-up index deletion, this should cause a
lot less bloat, and that's good. But you still have to update all 10
indexes in the foreground, and that's bad, because the alternative is
to find just the one affected index and update it twice -- once to
insert the new tuple, and a second time to delete-mark the old tuple.
10 is a lot more than 2, and that's even ignoring the cost of deferred
cleanup on the other 9 indexes. So I don't really expect this to get
us out of the woods. Somebody whose workload runs five times slower on
a pristine data load is quite likely to give up on using PostgreSQL
before bloat even enters the picture.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 7:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> So here. The complexity of getting a table AM that does anything
> non-trivial working is formidable, and I don't expect it to happen
> right away. Picking one that is essentially block-based and can use
> 48-bit TIDs is very likely the right initial target because that's the
> closest we have now, and there's no sense attacking the hardest
> variant of the problem first.

It doesn't have to be block-based -- that's not what Jeff is
proposing. It just has to be able to accept the restriction that
indexes must have a unique TID-like identifier for each version (not
quite a version actually -- whatever the equivalent of a HOT chain
is). This is a restriction that Jeff had pretty much planned on
working within before starting this thread (I know this because we
spoke about it privately).

It's quite possible to rule out an index-organized table design
without ruling out a column store with logical TID-like identifiers,
that aren't block-based. It's fair to wonder if not tightening up the
rules for TID-like identifiers is actually helping table AM authors in
practice. I think it's actually making things harder.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 8:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's reasonable to wonder. I think it depends on whether the problem
> is bloat or just general slowness. To the extent that the problem is
> bloat, bottom-up index deletion will help a lot, but it's not going to
> help with slowness because, as you say, we still have to dirty the
> pages. And I am pretty confident that slowness is a very significant
> part of the problem here.

It's all going to depend on the workload of course -- we'll need to wait
and see what users still complain about with Postgres 14 to really
have some idea. You only freshly dirty those leaf pages that weren't
already dirty, and the costs will be much more linear, so it's a
complicated question.

Here is a more modest statement that might be more convincing: The
*relative* importance of making something like HOT more robust to
things like long-running xacts has increased now that we have
bottom-up index deletion. We could improve things here by adding
something like zheap, which allows a HOT chain to mostly live in UNDO,
and therefore pretty much become arbitrarily long. This seems
plausible because users will accept that UPDATEs that modify one or
more indexed columns kinda suck, as long as there is never any truly
pathological performance. Whereas users will not easily accept that
HOT (or something like it) doesn't quite work well enough to make
relation sizes stable when they avoid updating indexed columns.

I don't think that even the absence of UPDATEs that logically modify
indexes and the absence of long running transactions (that hold up
cleanup) is sufficient to make HOT work well enough to keep table
sizes stable over time. Minor inefficiencies (e.g. LP_DEAD line
pointer bloat) will tend to accumulate over time, leading to heap
fragmentation.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Mon, May 3, 2021 at 11:26 AM Peter Geoghegan <pg@bowt.ie> wrote:
> It just has to be able to accept the restriction that
> indexes must have a unique TID-like identifier for each version (not
> quite a version actually -- whatever the equivalent of a HOT chain
> is). This is a restriction that Jeff had pretty much planned on
> working within before starting this thread (I know this because we
> spoke about it privately).

Well, I think what I'm saying is that I'm not on board with such a restriction.

If you're just saying that it has to be possible to identify rows
somehow, I am in full agreement, and I think the universe is on board
as well.

But if you're saying those identifiers have to be fixed-width and 48
(or even 64) bits, I disagree that we wish to have such a requirement
in perpetuity.

That'd be like going around to automobile manufacturers in 1925 and
asking them to agree that all future cars ever manufactured must have
a clutch.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> But if you're saying those identifiers have to be fixed-width and 48
> (or even 64) bits, I disagree that we wish to have such a requirement
> in perpetuity.

Once you require that TID-like identifiers must point to particular
versions (as opposed to particular logical rows), you also virtually
require that the identifiers must always be integer-like (though not
necessarily block-based and not necessarily 6 bytes). You've
practically ensured that clustered index tables (and indirect indexes)
will never be possible by accepting this. Those designs are the only
real reason to have truly variable-length TID-like identifiers IMV (as
opposed to 2 or perhaps even 3 standard TID widths).

You don't accept any of that, though. Fair enough. I predict that
avoiding making a hard choice will make Jeff's work here a lot harder,
though.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Mon, May 3, 2021 at 1:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> You don't accept any of that, though. Fair enough. I predict that
> avoiding making a hard choice will make Jeff's work here a lot harder,
> though.

I don't really think so, or at least I don't see a reason why it
should. As things stand today, I don't think it's possible for a table
AM author to make any other choice than to assume that their TIDs have
to look and work like heap TIDs; that is, there had better be a block
number portion and an item number portion, and the item number had
better be smaller than MaxOffsetNumber, and if you want bitmap scans
to run reasonably quickly, the block number had also better correspond
to physical locality to some degree. It's not clear to me how exactly
someone would go about fixing all of that, but I think it would be
great if they did. Even if that person wanted to assume for purposes
of their own patch that fixed-width, integer-like TIDs are the only
thing we care about, that would be fine with me. Getting to a point
where the available 48 bits can be used in whatever way the table AM
author wants is clearly better than what we have now.

Now I'm personally of the opinion that we shouldn't be content to stop
there, but so what? I'm not insisting that Jeff or anyone else has to
work on this problem, or that they have to fix more of it rather than
less. I hope that nobody's going to try to back us into a corner by
making design decisions that deliberately complicate the possibility
of future improvements in that area, and that's about it. I don't
really understand why you think that's unreasonable, or even
problematic. I can't see that any way in which the assumption that we
will NEVER want to further generalize the TID concept simplifies
anything anyone wants to do today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Matthias van de Meent
Date:
On Mon, 3 May 2021 at 19:00, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 3, 2021 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > But if you're saying those identifiers have to be fixed-width and 48
> > (or even 64) bits, I disagree that we wish to have such a requirement
> > in perpetuity.
>
> Once you require that TID-like identifiers must point to particular
> versions (as opposed to particular logical rows), you also virtually
> require that the identifiers must always be integer-like (though not
> necessarily block-based and not necessarily 6 bytes). You've
> practically ensured that clustered index tables (and indirect indexes)
> will never be possible by accepting this.

For IoTs (index-organized tables), as far as I know, one of the
constraints is that there exists
some unique constraint on the table, which also defines the ordering.
Assuming that that is the case, we can use <unique key> + <inserting
transaction id> to identify tuple versions.

With regards,

Matthias van de Meent.



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 10:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really think so, or at least I don't see a reason why it
> should. As things stand today, I don't think it's possible for a table
> AM author to make any other choice than to assume that their TIDs have
> to look and work like heap TIDs; that is, there had better be a block
> number portion and an item number portion, and the item number had
> better be smaller than MaxOffsetNumber, and if you want bitmap scans
> to run reasonably quickly, the block number had also better correspond
> to physical locality to some degree. It's not clear to me how exactly
> someone would go about fixing all of that, but I think it would be
> great if they did. Even if that person wanted to assume for purposes
> of their own patch that fixed-width, integer-like TIDs are the only
> thing we care about, that would be fine with me. Getting to a point
> where the available 48 bits can be used in whatever way the table AM
> author wants is clearly better than what we have now.

I don't think it's much good to just do that. You probably need a full
64-bits for something like a column store. But that's all you need.

> Now I'm personally of the opinion that we shouldn't be content to stop
> there, but so what? I'm not insisting that Jeff or anyone else has to
> work on this problem, or that they have to fix more of it rather than
> less. I hope that nobody's going to try to back us into a corner by
> making design decisions that deliberately complicate the possibility
> of future improvements in that area, and that's about it. I don't
> really understand why you think that's unreasonable, or even
> problematic. I can't see that any way in which the assumption that we
> will NEVER want to further generalize the TID concept simplifies
> anything anyone wants to do today.

It creates ambiguity of the kind that deters related improvements. I
for one am not comfortable with (say) working on generalizing TID to
the extent required to facilitate Jeff's work if that obligates me to
make some legalistic and wholly insincere statement about future
improvements to the definition of TID still being quite possible (to
facilitate indirect indexes, or whatever). The truth is that I cannot
possibly know if facilitating Jeff's work in the short term blocks off
other things in the long term -- because I don't actually have a clue
how these other things could ever really be implemented sensibly in
any case.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 09:59 -0700, Peter Geoghegan wrote:
> You don't accept any of that, though. Fair enough. I predict that
> avoiding making a hard choice will make Jeff's work here a lot
> harder,
> though.

For the purposes of this discussion, what's making my life difficult is
that we don't have a good definition for TID, leaving me with two
options:

  1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
bunch of address space; or
  2. risk that the mapping between TID and row number could break at any
time

And compounding that, it seems that there's a bug in GIN that doesn't
honor MaxOffsetNumber, so actually neither of the rules above works
either. Instead, I need to use 2047 as the max offset number, which has
no real basis in the postgres design, but I'd be stuck with it for a
long time.

What I'm looking for is:
  * A declaration of what the actual maximum valid offset number is,
and that it will be stable enough to use for table AMs for now. (This
maximum valid offset number may or may not be called MaxOffsetNumber,
and may or may not be tied to the maximum number of items that fit on a
page.)
  * A confirmation that this GIN behavior is a bug that should be
fixed, now that there are table AMs in existence that need it fixed.

Even if we fix this in v15, we still need some guidance for what table
AMs should do in earlier versions.

If we change the way tuple IDs are represented or the table AM in v15
or beyond, that may require a REINDEX for indexes on some table AMs. As
long as we have some robust way to check that a REINDEX is necessary,
that's fine with me.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 10:22 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> For IoTs, as far as I know, one of the constraints is that there exists
> some unique constraint on the table, which also defines the ordering.
> Assuming that that is the case, we can use <unique key> + <inserting
> transaction id> to identify tuple versions.

Perhaps that's true in theory, but the resulting design seems likely
to be useless in the end. In any case I believe that systems that
generally use a heap but give you the choice of using an IoT (I'm
really thinking of Oracle) tend to not have many users that actually
avail themselves of IoTs. On modern flash storage the trade-off made by an IoT or
clustered index design seems like the wrong one on average. You're
saving about 1 I/O on average with a PK lookup, which just isn't that
much of an upside compared to the many downsides.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 13:22 -0400, Robert Haas wrote:
> to look and work like heap TIDs; that is, there had better be a block
> number portion and an item number portion,

Right (at least for now).

> and the item number had
> better be smaller than MaxOffsetNumber,

That's not clear to me at all, and is the whole reason I began this
thread.

  a. You say "smaller than MaxOffsetNumber", but that's a little weird.
If an offset can't be MaxOffsetNumber, it's not really the maximum, is
it?
  b. If you actually meant "less than or equal to MaxOffsetNumber",
that will fail with the GIN posting list issue raised in my first
email. Do you agree that's a bug?
  c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1?
Right now, MaxOffsetNumber is poorly named, because it actually
represents a number slightly higher than the maximum number of items
that can fit on a page. That essentially wastes 5 bits of address space
for no obvious reason (the special values are quoted just below).
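
From memory -- worth double-checking against itemptr.h -- those special
values are:

    #define SpecTokenOffsetNumber        0xfffe
    #define MovedPartitionsOffsetNumber  0xfffd

so the largest offset number conceivably available to a table AM would
be 0xfffc (65532), far above MaxOffsetNumber's 2048.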

> and if you want bitmap scans
> to run reasonably quickly, the block number had also better
> correspond
> to physical locality to some degree.

Right (at least for now).

Regards,
    Jeff Davis






Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 10:38 -0700, Peter Geoghegan wrote:
> I don't think it's much good to just do that. You probably need a
> full
> 64-bits for something like a column store. But that's all you need.

I would definitely like that for citus columnar, and it would
definitely make it easier to manage the address space, but I won't
demand it today. 48 bits is a workable tuple address space for many
purposes, especially when you factor in logical partitioning.

I will be dealing with gaps though, so wasting 5 bits of address space
(2^16 / MaxOffsetNumber = 32) to bring it down to 43 bits is not great.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 10:57 AM Jeff Davis <pgsql@j-davis.com> wrote:
> For the purposes of this discussion, what's making my life difficult is
> that we don't have a good definition for TID, leaving me with two
> options:
>
>   1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
> bunch of address space; or

tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE,
which is much lower than MaxOffsetNumber (291 vs. 2048 -- roughly 7x lower). I
wonder what that means for your design.

>   2. risk the mapping between TID and row number could break at any
> time

Though this clearly is the immediate problem for you, I think that the
real problem is that the table AM kind of tacitly assumes that there
is a universality to item pointer TIDs -- which is obviously not true.
It might be useful for you to know what assumptions index AMs can make
about how TIDs work in general, but I think that you really need an
index-AM level infrastructure that advertises the capabilities of each
index AM with respect to handling each possible variation (I suppose
you have heapam, 6 byte uint, and maybe 8 byte uint).

The easiest reasonable short term design for you is probably to find a
way to make 6 byte TIDs into 48-bit unsigned integers (perhaps only
conceptually), at least in contexts where the columnar table AM is
used. You'll still need the index AM for that. This at least makes
64-bit TID-like identifiers a relatively simple conceptual shift.
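
Something like the following, purely as a sketch (the helper names are
invented, and the caller still has to steer clear of InvalidBlockNumber
and InvalidOffsetNumber):

    /* requires postgres.h and storage/itemptr.h */
    static inline uint64
    tid_to_uint48(ItemPointer tid)
    {
        /* view a 6-byte ItemPointerData as one 48-bit integer */
        return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
               (uint64) ItemPointerGetOffsetNumber(tid);
    }

    static inline void
    uint48_to_tid(uint64 val, ItemPointer tid)
    {
        ItemPointerSet(tid, (BlockNumber) (val >> 16),
                       (OffsetNumber) (val & 0xffff));
    }

Once the AM thinks in those terms, widening the integer later on is a
much smaller conceptual change.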

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Matthias van de Meent
Date:
On Mon, 3 May 2021 at 20:43, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 3, 2021 at 10:57 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > For the purposes of this discussion, what's making my life difficult is
> > that we don't have a good definition for TID, leaving me with two
> > options:
> >
> >   1. be overly conservative, accept MaxOffsetNumber=2048, wasting a
> > bunch of address space; or
>
> tidbitmap.c uses MaxHeapTuplesPerPage as its MAX_TUPLES_PER_PAGE,
> which is much lower than MaxOffsetNumber (it's almost 10x lower). I
> wonder what that means for your design.

One could relatively easily disable bitmap scans on the table AM by
not installing the relevant bitmap support functions on the registered
TableAM structure, and thus not touch that problem. Some indexes will
then never be accessed due to the bitmap scan requirement of their
IndexAM (gin, brin, bloom, to name a few), and as such won't make
sense to create on that table, but that's about it I think. We might
want to add some safeguards so that bitmapscan-only indexams aren't
used on tableams that don't support it, but I believe that's a
nice-to-have and not critical, on a similar level to the deduplication
of constraint indexes.
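
Concretely that would just mean leaving the optional callbacks unset,
along these lines (a fragment only; the struct name is made up, the
required callbacks are elided, and the fields are as in tableam.h):

    static const TableAmRoutine columnar_am_methods = {
        .type = T_TableAmRoutine,
        /* ... required callbacks would be filled in here ... */
        .scan_bitmap_next_block = NULL, /* opt out of bitmap scans */
        .scan_bitmap_next_tuple = NULL,
    };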

With regards,

Matthias van de Meent



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 12:06 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> One could relatively easily disable bitmap scans on the table AM by
> not installing the relevant bitmap support functions on the registered
> TableAM structure, and thus not touch that problem.

I have no idea how much it'll hurt things if the column store table AM
supports no analogue of bitmap scans.

> Some indexes will
> then never be accessed due to the bitmap scan requirement of their
> IndexAM (gin, brin, bloom, to name a few), and as such won't make
> sense to create on that table, but that's about it I think.

Right. More formally: if this restriction is accepted by a table AM
(say the column store table AM), then any index AM with amgettuple set
to NULL cannot ever be used (it should probably be treated as an error
condition at CREATE INDEX time).
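
(Something along these lines during DefineIndex(), say -- purely a
sketch, nothing like this exists today, and "indexRel"/"tableRel" stand
in for whatever would actually be in scope there:)

    if (indexRel->rd_indam->amgettuple == NULL &&
        tableRel->rd_tableam->scan_bitmap_next_block == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("index access method only supports bitmap scans, "
                        "which the table access method does not provide")));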

If this really is the best path forward (again, no idea if that's
true) then that would conveniently make it pretty easy to solve the
GIN posting list issue raised by Jeff. It just wouldn't matter -- GIN
indexes cannot be used with the column store anyway.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Fri, 2021-04-30 at 10:55 -0700, Jeff Davis wrote:
> On Fri, 2021-04-30 at 12:35 -0400, Tom Lane wrote:
> > ISTM that would be up to the index AM.  We'd need some interlocks
> > on
> > which index AMs could be used with which table AMs in any case, I
> > think.
> 
> I'm not sure why? It seems like we should be able to come up with
> something that's generic enough.

Another point: the idea of supporting only some kinds of indexes
doesn't mix well with partitioning. If you declare an index on the
parent, we should do something reasonable if one partition's table AM
doesn't support that index AM.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 2:03 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Another point: the idea of supporting only some kinds of indexes
> doesn't mix well with partitioning. If you declare an index on the
> parent, we should do something reasonable if one partition's table AM
> doesn't support that index AM.

Sure, but it either makes sense for the columnar table AM to support
bitmap scans (or some analogous type of scan that works only slightly
differently) or it doesn't. It's not at all clear which it is right now.

If it makes sense then it will of course be necessary to describe what
"bitmap scan" actually means with the columnar storage table AM (plus
you'll still need to make some in-core changes to places like
tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's
going to be a bit annoying in the partitioning scenario you describe,
but some things are bound to be *inherently* impossible, so it can't be
helped.

It seems senseless to *require* table AMs to support something like a
bitmap scan. I don't think it's a coincidence that GIN is the index AM
that looks like it presents at least 2 problems for the columnar table
AM. To me this suggests that this will need a much higher level
discussion.


--
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote:
> Sure, but it either makes sense for the columnar table AM to support
> bitmap scans (or some analogous type of scan that works only slightly
> differently) or it doesn't. It's not at all clear which it is right
> now.

It makes sense for my columnar table AM -- there's TID locality.

> If it makes sense then it will of course be necessary to describe
> what
> "bitmap scan" actually means with the columnar storage table AM (plus
> you'll still need to make some in-core changes to places like
> tidbitmap.c). OTOH if it doesn't make sense then that's that -- it's
> going to be a bit annoying in the partitioning scenario you describe,
> but some things are bound to be *inherently* impossible, so it can't
> be
> helped.

I don't see why in-core changes are a strict requirement. It doesn't
make too much difference if a lossy TID doesn't correspond exactly to
the columnar layout -- it should be fine as long as there's locality,
right?

> It seems senseless to *require* table AMs to support something like a
> bitmap scan. 

I am not yet convinced that it's "senseless", but it is optional and
there's probably a reason that it's not required.

We still need to address the fact that two features have had a minor
collision: indexes on a partitioned table and table AMs that don't
necessarily support all index types. It's not good to just throw an
error, because we could be forcing the user to manually manage the
indexes on hundreds of partitions just because some tables have a
different AM and it doesn't support the index type.

We probably want to do something about that, but as far as I can tell,
it's not a problem for columnar right now.

> I don't think it's a coincidence that GIN is the index AM
> that looks like it presents at least 2 problems for the columnar
> table
> AM. To me this suggests that this will need a much higher level
> discussion.

One problem is that ginpostinglist.c disallows offset numbers higher
than MaxOffsetNumber - 1. At best, that's a confusing
and unnecessary off-by-one error that we happen to be stuck with
because it affects the on-disk format. Now that I'm past that
particular confusion, I can live with a workaround until we do
something better.

What is the other problem with GIN?

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 5:15 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I don't see why in-core changes are a strict requirement. It doesn't
> make too much difference if a lossy TID doesn't correspond exactly to
> the columnar layout -- it should be fine as long as there's locality,
> right?

But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its
MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with
default BLCKSZ. I doubt that that restriction is something that you
can afford to live with, even just for the time being.
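
The definitions in question, from memory (worth double-checking):

    /* tidbitmap.c */
    #define MAX_TUPLES_PER_PAGE  MaxHeapTuplesPerPage

    /* access/htup_details.h -- works out to 291 with the default 8kB BLCKSZ */
    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))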

> > It seems senseless to *require* table AMs to support something like a
> > bitmap scan.
>
> I am not yet convinced that it's "senseless", but it is optional and
> there's probably a reason that it's not required.

I mean it's senseless to require it in the general case.

> We still need to address the fact that two features have had a minor
> collision: indexes on a partitioned table and table AMs that don't
> necessarily support all index types. It's not good to just throw an
> error, because we could be forcing the user to manually manage the
> indexes on hundreds of partitions just because some tables have a
> different AM and it doesn't support the index type.

I don't see why that's necessarily a problem. Why, in general, should
every table AM be able to support every index AM?

I find it puzzling that nobody can find one single thing that the
table AM interface *can't* do. What are the chances that the
abstraction really is perfect?

> > I don't think it's a coincidence that GIN is the index AM
> > that looks like it presents at least 2 problems for the columnar
> > table
> > AM. To me this suggests that this will need a much higher level
> > discussion.
>
> One problem is that ginpostinglist.c restricts the use of offset
> numbers higher than MaxOffsetNumber - 1. At best, that's a confusing
> and unnecessary off-by-one error that we happen to be stuck with
> because it affects the on-disk format. Now that I'm past that
> particular confusion, I can live with a workaround until we do
> something better.
>
> What is the other problem with GIN?

I just meant the tidbitmap.c stuff, and so on. There is really one big
problem: GIN leverages the fact that bitmap scans are all that it
supports in many different ways. The reality is that it was designed
to work with heapam -- that's how it evolved. It seems rather unlikely
that problems are confined to this ginpostinglist.c representational
issue -- which is very surface-level. The only way to figure it out is
to try to make it work and see what happens, though, so perhaps it
isn't worth discussing any further until that happens.

--
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 18:12 -0700, Peter Geoghegan wrote:
> But look at the details: tidbitmap.c uses MaxHeapTuplesPerPage as its
> MAX_TUPLES_PER_PAGE, which seems like a problem -- that's 291 with
> default BLCKSZ. I doubt that that restriction is something that you
> can afford to live with, even just for the time being.

Oh, you're right. I missed that MaxHeapTuplesPerPage was an order of
magnitude smaller.

> I don't see why that's necessarily a problem. Why, in general, should
> every table AM be able to support every index AM?

I didn't propose that every table AM needs to support every index type,
just that we should do something or at least document something. It's
pretty frustrating to have to fall back to manually managing the
indexes for dozens or hundreds of partitions when you make use of
multiple table AMs.

We might be conflating support for index AMs with support for features
like bitmap scans. If a certain kind of index fails at CREATE INDEX
time, that's painful for the partitioning case. But here it's more like
the CREATE INDEX would succeed but it would just never be used, which
is a different kind of frustrating.

Whatever we do or don't do, we should try to avoid surprises. I expect
table AMs to be used heavily with partitioning.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Andres Freund
Date:
Hi,

On 2021-04-30 11:51:07 -0700, Peter Geoghegan wrote:
> I think that it's reasonable to impose some cost on index AMs here,
> but that needs to be bounded sensibly and unambiguously. For example,
> it would probably be okay if you had either 6 byte or 8 byte TIDs, but
> no other variations. You could require index AMs (the subset of index
> AMs that are ever able to store 8 byte TIDs) to directly encode which
> width they're dealing with at the level of each IndexTuple. That would
> create some problems for nbtree deduplication, especially in boundary
> cases, but ISTM that you can manage the complexity by sensibly
> restricting how the TIDs work across the board.

> For example, the TIDs should always work like unsigned integers -- the
> table AM must be willing to work with that restriction.

Isn't that more a question of the encoding than the concrete representation?


> You'd then have posting lists tuples in nbtree whose TIDs were all
> either 6 bytes or 8 bytes wide, with a mix of each possible (though
> not particularly likely) on the same leaf page. Say when you have a
> table that exceeds the current MaxBlockNumber restrictions. It would
> be relatively straightforward for nbtree deduplication to simply
> refuse to mix 6 byte and 8 byte datums together to avoid complexity in
> boundary cases. The deduplication pass logic has the flexibility that
> this requires already.

Which nbtree cases do you think would have an easier time supporting
switching between 6 or 8 byte tids than supporting fully variable width
tids?  Given that IndexTupleData already is variable-width, it's not
clear to me why supporting two distinct sizes would be harder than a
fully variable size?  I assume it's things like BTDedupState->htids?



> > What's wrong with varlena headers? It would end up being a 1-byte
> > header in practically every case, and no variable-width representation
> > can do without a length word of some sort. I'm not saying varlena is
> > as efficient as some new design could hypothetically be, but it
> > doesn't seem like it'd be a big enough problem to stress about. If you
> > used a variable-width representation for integers, you might actually
> > save bytes in a lot of cases. An awful lot of the TIDs people store in
> > practice probably contain several zero bytes, and if we make them
> > wider, that's going to be even more true.
> 
> Maybe all of this is true, and maybe it works out to be the best path
> forward in the long term, all things considered. But whether or not
> that's true is crucially dependent on what real practical table AMs
> (of which there will only ever be a tiny number) actually need to do.
> Why should we assume that the table AM cannot accept some
> restrictions? What good does it do to legalistically define the
> problem as a problem for index AMs to solve?

I don't think anybody is arguing that AMs cannot accept any restrictions? I do
think it's pretty clear that it's not entirely obvious what the concrete set
of proper restrictions would be -- ones where we won't end up needing
to re-evaluate the limits in a few years.

If you add to that the fact that variable-width tids will often end up
considerably smaller than our current tids, it's not obvious why we should use
bitspace somewhere to indicate an 8 byte tid instead of a a variable-width
tid?

Greetings,

Andres Freund



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Mon, May 3, 2021 at 2:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> That's not clear to me at all, and is the whole reason I began this
> thread.
>
>   a. You say "smaller than MaxOffsetNumber", but that's a little weird.
> If an offset can't be MaxOffsetNumber, it's not really the maximum, is
> it?

I wasn't trying to be that precise. I see that OffsetNumberIsValid()
returns true if the offset is <= MaxOffsetNumber, so therefore I agree
that using exactly MaxOffsetNumber ought to work.

>   b. If you actually meant "less than or equal to MaxOffsetNumber",
> that will fail with the GIN posting list issue raised in my first
> email. Do you agree that's a bug?

Given the above, yes.

>   c. Why can't we go all the way up to MovedPartitionsOffsetNumber - 1?
> Right now, MaxOffsetNumber is poorly named, because it actually
> represents a number slightly higher than the maximum number of
> items that can fit on a page. That essentially wastes 5 bits of address
> space for no obvious reason.

Because of stuff like this:

[rhaas EDBAS]$ git grep -F '[MaxOffsetNumber'
src/backend/access/gist/gistvacuum.c:  OffsetNumber todelete[MaxOffsetNumber];
src/backend/access/gist/gistvacuum.c:  OffsetNumber todelete[MaxOffsetNumber];
src/backend/access/gist/gistvacuum.c:  BlockNumber leafs_to_delete[MaxOffsetNumber];
src/backend/access/hash/hash.c:  OffsetNumber deletable[MaxOffsetNumber];
src/backend/access/hash/hashinsert.c:  OffsetNumber deletable[MaxOffsetNumber];
src/backend/access/hash/hashovfl.c:  OffsetNumber deletable[MaxOffsetNumber];

Maybe changing those places to use dynamic allocation wouldn't hurt
anything in terms of performance, but I'm not sure. Making them 32
times larger categorically does not seem like a good idea.
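
(By dynamic allocation I mean something like this, sketched against one
of the call sites above, with "page" being the page being processed:)

    /* today: OffsetNumber deletable[MaxOffsetNumber]; instead: */
    OffsetNumber *deletable = (OffsetNumber *)
        palloc(PageGetMaxOffsetNumber(page) * sizeof(OffsetNumber));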

There might be other dependencies on this value in other parts of the
code; I'm not sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote:
> It seems senseless to *require* table AMs to support something like a
> bitmap scan.

I thought about this some more, and this framing is backwards.
ItemPointers are fundamental to the table AM API: they are passed in to
required methods, and expected to be returned[1].

Bitmap scans are optional, but that should be determined by whether the
author wants to implement the bitmap scan methods of their table AM.
The fine details of ItemPointer representation should not be making the
decision for them.

We still need to answer the core question that started this thread:
what the heck is an ItemPointer, anyway?

After looking at itemptr.h, off.h, ginpostinglist.c and tidbitmap.c, it
seems that an ItemPointer is a block number from [0, 0xFFFFFFFE]; and
an offset number from [1, MaxHeapTuplesPerPage] which is by default [1,
291].
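
In other words, a table AM that wants a dense row number can map it
through an ItemPointer like this -- a sketch with invented helper
names, obeying the ranges above:

    /* requires storage/itemptr.h and access/htup_details.h */
    #define ROWS_PER_BLOCK  MaxHeapTuplesPerPage    /* 291 by default */

    static inline void
    rownum_to_itemptr(uint64 rownum, ItemPointer tid)
    {
        /* 0xFFFFFFFF is InvalidBlockNumber; offsets are 1-based */
        Assert(rownum / ROWS_PER_BLOCK < InvalidBlockNumber);
        ItemPointerSet(tid,
                       (BlockNumber) (rownum / ROWS_PER_BLOCK),
                       (OffsetNumber) ((rownum % ROWS_PER_BLOCK) + 1));
    }

    static inline uint64
    itemptr_to_rownum(ItemPointer tid)
    {
        return (uint64) ItemPointerGetBlockNumber(tid) * ROWS_PER_BLOCK +
               (ItemPointerGetOffsetNumber(tid) - 1);
    }

It wastes most of the 16-bit offset range, per the complaint upthread,
but it stays inside the limits.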

Attached is a patch that clarifies what I've found so far and gives
clear guidance to table AM authors. Before I commit this I'll make sure
that following the guidance actually works for the columnar AM.

Regards,
    Jeff Davis

[1] Even for the current version of columnar, which doesn't support
indexes or updates, we implemented a hack to provide dummy TIDs because
some places expect them (see analyze.c:compare_rows()).


Attachment

Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Tue, 2021-05-04 at 12:56 -0400, Robert Haas wrote:
> >   b. If you actually meant "less than or equal to MaxOffsetNumber",
> > that will fail with the GIN posting list issue raised in my first
> > email. Do you agree that's a bug?
> 
> Given the above, yes.

If we just subtracted one, it would fit in 11 bits, and that would be
fine because zero is invalid anyway. Unfortunately, it's on disk, so I
think we are stuck with it.
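
For anyone who hasn't looked, the encoding is roughly this (paraphrased
from ginpostinglist.c -- the real code spells the accessors
differently):

    #define MaxHeapTuplesPerPageBits  11    /* 2^11 = 2048 */

    static uint64
    itemptr_to_uint64(const ItemPointer iptr)
    {
        uint64  val;

        /* the offset must fit in 11 bits, so <= 2047 */
        Assert(ItemPointerGetOffsetNumber(iptr) <
               (1 << MaxHeapTuplesPerPageBits));

        val = ItemPointerGetBlockNumber(iptr);
        val <<= MaxHeapTuplesPerPageBits;
        val |= ItemPointerGetOffsetNumber(iptr);
        return val;
    }

An offset of exactly MaxOffsetNumber (2048) trips that assertion, which
is where the 2047 limit comes from.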

Regardless, the other limitation in tidbitmap.c is more strict anyway
(MaxHeapTuplesPerPage=291).

> > Because of stuff like this:
> 
> [rhaas EDBAS]$ git grep -F '[MaxOffsetNumber'
> src/backend/access/gist/gistvacuum.c:        OffsetNumber
> todelete[MaxOffsetNumber];
> src/backend/access/gist/gistvacuum.c:        OffsetNumber
> todelete[MaxOffsetNumber];
> src/backend/access/gist/gistvacuum.c:        BlockNumber
> leafs_to_delete[MaxOffsetNumber];
> src/backend/access/hash/hash.c:      OffsetNumber
> deletable[MaxOffsetNumber];
> src/backend/access/hash/hashinsert.c:    OffsetNumber
> deletable[MaxOffsetNumber];
> src/backend/access/hash/hashovfl.c:      OffsetNumber
> deletable[MaxOffsetNumber];

I don't think those are problems because they represent items on an
*index* page, not ItemPointers coming from a table.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Tue, May 4, 2021 at 11:52 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2021-05-03 at 15:07 -0700, Peter Geoghegan wrote:
> > It seems senseless to *require* table AMs to support something like a
> > bitmap scan.
>
> I thought about this some more, and this framing is backwards.
> ItemPointers are fundamental to the table AM API: they are passed in to
> required methods, and expected to be returned[1].

I prefer my framing, but okay, let's go with yours. What difference
does it make?

The fact that we're starting with the table AM API doesn't change the
fundamental fact that quite a few implementation details that are
local to code like the GIN AM and tidbitmap.c were (rightly or
wrongly) simply built with heapam in mind. The fact that that's true
is hardly surprising, and hardly argues against the idea of having a
table AM to begin with. There is no getting around the need to talk
about the first principles here, and to talk about the specific
implications for your particular table AM (perhaps others too).

Abstractions are only useful when they serve concrete implementations.
Of course they should be as general and abstract as possible -- but no
more.

> Bitmap scans are optional, but that should be determined by whether the
> author wants to implement the bitmap scan methods of their table AM.
> The fine details of ItemPointer representation should not be making the
> decision for them.

A distinction without a difference. If bitmap scans are optional and
some index AMs are 100% built from the ground up to work only with
bitmap scans, then those index AMs are effectively optional (or
optional to the extent that bitmap scans themselves are optional). I
have absolutely no idea how it would be possible to make GIN work
without bitmap scans. It would be so different that it wouldn't
be GIN anymore.

I think maybe it is possible for GIN to work with your column store
table AM in particular. Why aren't we talking about that concrete
issue, or something like that? We're talking about this abstraction as
if it must already be perfect, and therefore the standard by which
every other thing needs to be measured. But why?

> We still need to answer the core question that started this thread:
> what the heck is an ItemPointer, anyway?
>
> After looking at itemptr.h, off.h, ginpostinglist.c and tidbitmap.c, it
> seems that an ItemPointer is a block number from [0, 0xFFFFFFFe]; and
> an offset number from [1, MaxHeapTuplesPerPage] which is by default [1,
> 291].
>
> Attached is a patch that clarifies what I've found so far and gives
> clear guidance to table AM authors. Before I commit this I'll make sure
> that following the guidance actually works for the columnar AM.

I don't get what the point of this patch is. Obviously all of the
particulars here are just accidents of history that we ought to change
sooner or later anyway. I don't have any objection to writing them all
down someplace official. But what difference does it make if there is
no underlying *general* set of principles behind any of it? This
definition of a TID can break at any time because it just isn't useful
or general. This is self-evident -- your definition includes
MaxHeapTuplesPerPage! How could that possibly be anything other than
an accident whose details are completely arbitrary and therefore
subject to change at any time?

This is not necessarily a big deal! We can fix it by reconciling
things in a pragmatic, bottom-up way. That's what I expected would
happen all along. The table AM is not the Ark of the Covenant (just
like tidbitmap.c, or anything else).

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Mon, May 3, 2021 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
> > For example, the TIDs should always work like unsigned integers -- the
> > table AM must be willing to work with that restriction.
>
> Isn't that more a question of the encoding than the concrete representation?

I don't think so, no. How does B-Tree deduplication work without
something like that? The fact of the matter is that things are very
tightly coupled in all kinds of ways. I'm all for decoupling them to
the extent required to facilitate a new and useful table AM. But I am
unlikely to commit to months of work based on abstract arguments and
future work. I think that you'll find that I'm not the only one that
sees it that way.

> > You'd then have posting lists tuples in nbtree whose TIDs were all
> > either 6 bytes or 8 bytes wide, with a mix of each possible (though
> > not particularly likely) on the same leaf page. Say when you have a
> > table that exceeds the current MaxBlockNumber restrictions. It would
> > be relatively straightforward for nbtree deduplication to simply
> > refuse to mix 6 byte and 8 byte datums together to avoid complexity in
> > boundary cases. The deduplication pass logic has the flexibility that
> > this requires already.
>
> Which nbtree cases do you think would have an easier time supporting
> switching between 6 or 8 byte tids than supporting fully variable width
> tids?  Given that IndexTupleData already is variable-width, it's not
> clear to me why supporting two distinct sizes would be harder than a
> fully variable size?  I assume it's things like BTDedupState->htids?

Stuff like that, yeah. The space utilization stuff inside
nbtsplitloc.c and nbtdedup.c pretty much rests on the assumption that
TIDs are fixed width. Obviously there are some ways in which that
could be revised if there was a really good reason to do so -- like an
actual concrete reason with some clear basis in reality. You have no
obligation to make me happy, but FYI I find arguments like "but why
wouldn't you just allow arbitrary-width TIDs?" to be deeply
unconvincing. Do you really expect me to do a huge amount of work and
risk a lot of new bugs, just to facilitate something that may or may
not ever happen? Would you do that if you were in my position?

> I don't think anybody is arguing that AMs cannot accept any restrictions? I do
> think it's pretty clear that it's not entirely obvious what the concrete set
> of proper restrictions would be -- ones where we won't end up needing
> to re-evaluate the limits in a few years.

I'm absolutely fine with the fact that the table AM has these issues
-- I would expect it. I would like to help! I just find these wildly
abstract discussions to be close to a total waste of time. The idea
that we should let a thousand table AM flowers bloom and then review
what to do seems divorced from reality. Even if the table AM becomes
wildly successful there will still only have been maybe 2 - 4 table
AMs that ever really had a chance. Supposing that we have no idea what
they could possibly look like just yet is just navel gazing.

> If you add to that the fact that variable-width tids will often end up
> considerably smaller than our current tids, it's not obvious why we should use
> bitspace somewhere to indicate an 8 byte tid instead of a a variable-width
> tid?

It's not really the space overhead. It's the considerable complexity
that it would add.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Andres Freund
Date:
Hi,

On 2021-05-04 14:13:36 -0700, Peter Geoghegan wrote:
> On Mon, May 3, 2021 at 10:01 PM Andres Freund <andres@anarazel.de> wrote:
> > > For example, the TIDs should always work like unsigned integers -- the
> > > table AM must be willing to work with that restriction.
> >
> > Isn't that more a question of the encoding than the concrete representation?
> 
> I don't think so, no. How does B-Tree deduplication work without
> something like that? The fact of the matter is that things are very
> tightly coupled in all kinds of ways.

What does the deduplication actually require from tids? Isn't it just
that you need to be able to compare tids?



> > > You'd then have posting lists tuples in nbtree whose TIDs were all
> > > either 6 bytes or 8 bytes wide, with a mix of each possible (though
> > > not particularly likely) on the same leaf page. Say when you have a
> > > table that exceeds the current MaxBlockNumber restrictions. It would
> > > be relatively straightforward for nbtree deduplication to simply
> > > refuse to mix 6 byte and 8 byte datums together to avoid complexity in
> > > boundary cases. The deduplication pass logic has the flexibility that
> > > this requires already.
> >
> > Which nbtree cases do you think would have an easier time supporting
> > switching between 6 or 8 byte tids than supporting fully variable width
> > tids?  Given that IndexTupleData already is variable-width, it's not
> > clear to me why supporting two distinct sizes would be harder than a
> > fully variable size?  I assume it's things like BTDedupState->htids?
> 
> Stuff like that, yeah. The space utilization stuff inside
> nbtsplitloc.c and nbtdedup.c pretty much rests on the assumption that
> TIDs are fixed width.

Hm. It doesn't seem like that'd be all that hard to adjust / that
it'd be meaningfully easier to support only one other type of tid width.


> Obviously there are some ways in which that could be revised if there
> was a really good reason to do so -- like an actual concrete reason
> with some clear basis in reality.

The example of indirect indexes has been brought up repeatedly - you
just didn't respond to it?


> You have no obligation to make me happy, but FYI I find arguments like
> "but why wouldn't you just allow arbitrary-width TIDs?" to be deeply
> unconvincing. Do you really expect me to do a huge amount of work and
> risk a lot of new bugs, just to facilitate something that may or may
> not ever happen? Would you do that if you were in my position?

So far nobody has expressed any expectation of you doing specific work
in this thread as far as I can see? I certainly didn't intend to. I
think it's perfectly normal to discuss tradeoffs and disagree about
them?

Greetings,

Andres Freund



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Tue, 2021-05-04 at 13:51 -0700, Peter Geoghegan wrote:
> I think maybe it is possible for GIN to work with your column store
> table AM in particular. Why aren't we talking about that concrete
> issue, or something like that?

Happy to.

At this point I'd rather obey the constraint that the offset number
falls in the range [1, MaxHeapTuplesPerPage] so that columnar will have
bitmap index scans and GIN. If you see a way to work around this
limitation and still have GIN and bitmap index support, so much the
better.
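
For concreteness, here is a minimal sketch (not columnar's actual code)
of how a logical row number could round-trip through a TID under that
constraint. The constant is MaxHeapTuplesPerPage's value at the default
8kB BLCKSZ:

#include <stdint.h>

#define TUPLES_PER_PAGE 291     /* MaxHeapTuplesPerPage at 8kB BLCKSZ */

/* Hypothetical mapping: offsets stay in [1, TUPLES_PER_PAGE], so each
 * block number covers 291 consecutive row numbers. */
static void
rownum_to_tid(uint64_t rownum, uint32_t *block, uint16_t *offset)
{
    *block = (uint32_t) (rownum / TUPLES_PER_PAGE);
    *offset = (uint16_t) (rownum % TUPLES_PER_PAGE) + 1;
}

static uint64_t
tid_to_rownum(uint32_t block, uint16_t offset)
{
    return (uint64_t) block * TUPLES_PER_PAGE + (offset - 1);
}

Under these assumptions the usable TID space is roughly 2^32 * 291, or
about 1.2 trillion row numbers -- plenty in absolute terms, but it
shrinks quickly once gaps are left behind.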

The cost of obeying this limitation is that, in a workload involving
lots of small transactions, columnar might run out of TID space and
force a VACUUM FULL. In that case, VACUUM FULL is probably a good idea
anyway (to coalesce the tuples for better compression), but forcing it
is obviously not ideal. The reason columnar will run out of TID space
more quickly for small operations is that small inserts might
reserve more TIDs than they actually use, leaving gaps; and small
updates/deletes will fragment the TID space.

The benefit of obeying this limitation is that I expect that bitmap
index scans will work well with columnar because they avoid random
access. And it seems like a nice benefit if we can support the full
range of index AMs for columnar.

> I don't get what the point of this patch is. Obviously all of the
> particulars here are just accidents of history that we ought to
> change
> sooner or later anyway.

The point is if "sooner" turns into "later" then we at least have some
guidance for table AM authors in the interim. But if nobody else thinks
that's useful, then so be it.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Tue, May 4, 2021 at 5:40 PM Andres Freund <andres@anarazel.de> wrote:
> What does the deduplication actually require from tids? Isn't it just
> that you need to be able to compare tids?

It's hard to know for sure what is essential to the design, and what
can be discarded. Though I can say for sure that it depends on TIDs
being cheap to compare.
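
For reference, a minimal standalone sketch of that ordering (cf.
ItemPointerCompare in itemptr.c; this is not the actual postgres code):
block number is the high-order part, so a TID compares like a 48-bit
unsigned integer.

#include <stdint.h>

/* Standalone sketch of TID ordering. */
typedef struct TidSketch
{
    uint32_t block;     /* BlockNumber */
    uint16_t offset;    /* OffsetNumber */
} TidSketch;

static int
tid_cmp(TidSketch a, TidSketch b)
{
    uint64_t ka = ((uint64_t) a.block << 16) | a.offset;
    uint64_t kb = ((uint64_t) b.block << 16) | b.offset;

    return (ka > kb) - (ka < kb);
}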

> Hm. It doesn't seem like that'd be all that hard to adjust / like
> it'd be meaningfully easier to support only one other tid width.

It depends on equi-sized TIDs within a posting list, too -- see
"Posting list splits" from the nbtree README for some idea of what I
mean. The reason that nbtree deduplication could be enabled by default
(the reason why it has very little downside) is because there are
virtually no special cases, and because the WAL overhead for posting
list splits is so low (it's only a tiny bit higher than a simple
btinsert()).

Anyway, deduplication could probably still work in about the same way
if there were some sensible limits on how generalized TIDs could work.

> > Obviously there are some ways in which that could be revised if there
> > was a really good reason to do so -- like an actual concrete reason
> > with some clear basis in reality.
>
> The example of indirect indexes has been brought up repeatedly - you
> just didn't respond to it?

I did respond to it -- in detail. I don't see how it will ever be
possible to make indirect indexes work -- at least within anything
like the current framework. And even if I were wrong there, it still
wouldn't mean that the project was particularly worth pursuing.

Whether or not you find my arguments about indirect indexes convincing
is ultimately beside the point. The fact is that I just don't believe
that that's ever going to happen (it is a fact that that's what I
believe). This will discourage me from becoming involved in anything
that touches on how the table AM thinks about TIDs, particularly in
index AMs. As far as I'm concerned this TID stuff is up in the air
(except maybe for something like zheap which is sufficiently close to
heapam that it doesn't matter).

> So far nobody has expressed any expectation of you doing specific work
> in this thread as far as I can see? I certainly didn't intend to. I
> think it's perfectly normal to discuss tradeoffs and disagree about
> them?

That's why I was careful to say "You have no obligation to make me
happy". Of course it's true that I could easily just not get involved
-- that was never my concern.

Here is my concern: I have an obligation to make it clear that I think
that you really ought to straighten out this business with
generalizing TIDs before too long. Not because I say so, but because
it's holding up progress in general. If you aren't getting cooperation
from people who work on indexing (could be somebody else), then
consider the possibility that this business with TIDs and bitmap scans
has a lot to do with it. Most people are not as outspoken as I am.

I'm not at all surprised that this happened, simply because of the
history -- it makes sense. I'm glad that the table AM interface
exists, and it was always going to be something that required
refinement over time. I want the table AM to be successful. If I
didn't care then I would say nothing at all.
-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Tue, May 4, 2021 at 9:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Here is my concern: I have an obligation to make it clear that I think
> that you really ought to straighten out this business with
> generalizing TIDs before too long. Not because I say so, but because
> it's holding up progress in general. If you aren't getting cooperation
> from people who work on indexing (could be somebody else), then
> consider the possibility that this business with TIDs and bitmap scans
> has a lot to do with it. Most people are not as outspoken as I am.

It seems to me that we're doing a lot of disagreeing given that, as I
see it, there are only relatively minor differences between the
positions of the various people here. Andres and I are, I think,
relatively convinced that variable-width TIDs would let us do things
that aren't otherwise possible, and that those things are potentially
useful and I would even venture to say cool. I don't believe you
disagree with that, but you think it's going to be too much work to
implement. Fair enough; anyone who is interested can try it and see
how far they get. Anyone who thinks it's going to be impossibly hard
probably will prefer not to try, and that's OK too.

But if we take that off the table, what about a less-ambitious
generalization of the TID mechanism? I can't really see anyone putting
up a serious argument against allowing all 48 bits of space available
in the existing TID format to be used, which, as Jeff points out, is
not currently the case. So anyone who wants to try to write that patch
is free to do so. I don't have a clear idea how to make that work, to
be honest, but my limited supply of ideas need not prevent anyone else
from trying theirs.

There might be some slight disagreement about whether it's useful to
generalize TIDs from a 48-bit address space to a 64-bit address space
without making it fully general. Like Andres, I am unconvinced that's
meaningfully easier, and I am convinced that it's meaningfully less
good, but other people can disagree and that's fine. I'm perfectly
willing to change my opinion if somebody shows up with a patch that
demonstrates the value of this approach.

The main point here is one that I think you made a few emails back:
the limitations of the current system are defined by what will
actually work with the code as it exists today, not some mailing list
discussion. It's too early for the project to commit to stability in
this area; we have not managed to get a single AM apart from heapam
into core, and that situation doesn't appear likely to change in the
near future. If and when we have say 5 of those we can probably
articulate some intelligent ideas about what we think the patterns
that need to hold for future AMs are, but it's reckless to extrapolate
from 1 working example, and right now that's all we have.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 7:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that we're doing a lot of disagreeing given that, as I
> see it, there are only relatively minor differences between the
> positions of the various people here.

I'm being very vocal here because I'm concerned that we're going about
generalizing TIDs in the wrong way. To me it feels like there is a
loss of perspective about what really matters. There just aren't that
many table AM TID designs that could ever work, and even among those
schemes that could ever work there is a pretty clear hierarchy. This
blue sky thinking about generalizing TIDs 2 years in seems *weird* to
me.

Nobody is obligated to reassure me. I felt that this had to be said.

> Andres and I are, I think,
> relatively convinced that variable-width TIDs would let us do things
> that aren't otherwise possible, and that those things are potentially
> useful and I would even venture to say cool. I don't believe you
> disagree with that, but you think it's going to be too much work to
> implement. Fair enough; anyone who is interested can try it and see
> how far they get. Anyone who thinks it's going to be impossibly hard
> probably will prefer not to try, and that's OK too.

I think that's accurate. But it's easy to not disagree with the idea
that variable-width TIDs might lead to something interesting. Talk is
cheap.

No other database system has something like indirect indexes. They
have clustered indexes, but that's rather different. I think that
indirect indexes were a design that was concerned about the issue of
write amplification from non-HOT updates. But do I even remember the
details correctly? We're talking about indirect indexes as if it were
an idea whose high-level user-visible goals were clear, but I don't
even think that that much is true. This kind of thing concerns me. It
very much feels like failing to see the forest for the trees.

> But if we take that off the table, what about a less-ambitious
> generalization of the TID mechanism? I can't really see anyone putting
> up a serious argument against allowing all 48 bits of space available
> in the existing TID format to be used, which, as Jeff points out, is
> not currently the case. So anyone who wants to try to write that patch
> is free to do so. I don't have a clear idea how to make that work, to
> be honest, but my limited supply of ideas need not prevent anyone else
> from trying theirs.

I agree that we should focus on what we can agree on. It seems as if
we all more or less agree on this much.

> There might be some slight disagreement about whether it's useful to
> generalize TIDs from a 48-bit address space to a 64-bit address space
> without making it fully general. Like Andres, I am unconvinced that's
> meaningfully easier, and I am convinced that it's meaningfully less
> good, but other people can disagree and that's fine. I'm perfectly
> willing to change my opinion if somebody shows up with a patch that
> demonstrates the value of this approach.

It's going to be hard if not impossible to provide empirical evidence
for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are
the way to go. Same with any other scheme. We're talking way too much
about TIDs themselves and way too little about table AM use cases, the
way the data structures might work in new table AMs, and so on.

> The main point here is one that I think you made a few emails back:
> the limitations of the current system are defined by what will
> actually work with the code as it exists today, not some mailing list
> discussion.

Right.

> It's too early for the project to commit to stability in
> this area; we have not managed to get a single AM apart from heapam
> into core, and that situation doesn't appear likely to change in the
> near future.

I would be happy if we could commit to committing to stability. I
really don't think that it's *that* hard to move significantly closer
to a design that describes just how close to heapam a table AM should
be. It doesn't commit the table AM to all that many details.

> If and when we have say 5 of those we can probably
> articulate some intelligent ideas about what we think the patterns
> that need to hold for future AMs are, but it's reckless to extrapolate
> from 1 working example, and right now that's all we have.

I don't think that there will be 5 table AMs that are credible to
users at any point in the future. In any case there only needs to be 1
or 2 good ones for the table AM to have been a resounding success.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm being very vocal here because I'm concerned that we're going about
> generalizing TIDs in the wrong way. To me it feels like there is a
> loss of perspective about what really matters.

Well, which things matter is a question of opinion, not fact.

> No other database system has something like indirect indexes. They
> have clustered indexes, but that's rather different.

I don't think this is true at all. If you have a clustered index -
i.e. the table is physically arranged according to the index ordering
- then your secondary indexes all pretty much have to be what we're
calling indirect indexes. They can hardly point to a physical
identifier if rows are being moved around. I believe InnoDB works this
way, and I think Oracle's index-organized tables do too. I suspect
there are other examples.

> > There might be some slight disagreement about whether it's useful to
> > generalize TIDs from a 48-bit address space to a 64-bit address space
> > without making it fully general. Like Andres, I am unconvinced that's
> > meaningfully easier, and I am convinced that it's meaningfully less
> > good, but other people can disagree and that's fine. I'm perfectly
> > willing to change my opinion if somebody shows up with a patch that
> > demonstrates the value of this approach.
>
> It's going to be hard if not impossible to provide empirical evidence
> for the proposition that 64-bit wide TIDs (alongside 48-bit TIDs) are
> the way to go. Same with any other scheme. We're talking way too much
> about TIDs themselves and way too little about table AM use cases, the
> way the data structures might work in new table AMs, and so on.

I didn't mean that it has to be a test result showing that 64-bit TIDs
outperform 56-bit TIDs or something. I just meant there has to be a
reason to believe it's good, which could be based on a discussion of
use cases or whatever. If we *don't* have a reason to believe it's
good, we shouldn't do it.

My point is that so far I am not seeing a whole lot of value of this
proposed approach. For a 64-bit TID to be valuable to you, one of two
things has to be true: you either don't care about having indexes that
store TIDs on your new table type, or the index types you want to use
can store those 64-bit TIDs. Now, I have not yet heard of anyone
working on a table AM who does not want to be able to support adding
btree indexes. There may be someone that I don't know about, and if
so, fine. But otherwise, we need a way to store them. And that
requires changing the page format for btree indexes. But surely we do
not want to make all TIDs everywhere wider in future btree versions,
so at least two TID widths - 6 bytes and 8 bytes - would have to be
supported. And if we're at all going to do that, I think it's
certainly worth asking whether supporting varlena TIDs would really be
all that much harder. You seem to think it is, and you might be right,
but I'm not ready to give up, because I do not see how we are ever
going to get global indexes or indirect indexes without doing it, and
those would be good features to have.

If we can't ever get them, so be it, but you seem to kind of be saying
that things like global indexes and indirect indexes are hard, and
therefore they don't count as reasons why we might want variable-width
TIDs. But one very large reason why those things are hard is that they
require variable-width TIDs, so AFAICS this boils down to saying that
we don't want the feature because it's hard to implement. But we
should not conflate feasibility with desirability. I am quite sure
that lots of people want global indexes. The number of people who want
indirect indexes is in my estimation much smaller, but it's probably
not zero, or else Alvaro wouldn't have tried his hand at writing a
patch. Whether we can *get* those things is in doubt; whether it will
happen in the near future is very much in doubt. But I at least am not
in doubt about whether people want it, because I hear complaints about
the lack of global indexes on an almost-daily basis. If those
complaints are all from people hoping to fake me out into spending
time on something that is worthless to them, my colleagues are very
good actors.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Wed, 2021-05-05 at 10:27 -0400, Robert Haas wrote:
> It's too early for the project to commit to stability in
> this area; we have not managed to get a single AM apart from heapam
> into core

"In core" shouldn't matter. In fact, if it's in core, stability of the
APIs is much less important.

>  If and when we have say 5 of those

That seems like a standard that we won't reach in any reasonable amount
of time.

> we can probably
> articulate some intelligent ideas about what we think the patterns
> that need to hold for future AMs are, but it's reckless to
> extrapolate
> from 1 working example, and right now that's all we have.

We should count columnar as a second example. While it doesn't support
everything that heap does, we are actively working on it and it's
gaining features quickly. It's also showing some impressive real-world
results.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I'm being very vocal here because I'm concerned that we're going about
> > generalizing TIDs in the wrong way. To me it feels like there is a
> > loss of perspective about what really matters.
>
> Well, which things matter is a question of opinion, not fact.

I'm not trying to win an argument here. I am giving an opinion in the
hopes that it leads to some kind of useful synthesis, based on all of
our opinions.

> > No other database system has something like indirect indexes. They
> > have clustered indexes, but that's rather different.
>
> I don't think this is true at all. If you have a clustered index -
> i.e. the table is physically arranged according to the index ordering
> - then your secondary indexes all pretty much have to be what we're
> calling indirect indexes. They can hardly point to a physical
> identifier if rows are being moved around. I believe InnoDB works this
> way, and I think Oracle's index-organized tables do too. I suspect
> there are other examples.

But these systems don't have indirect indexes *on a heap table*! Why
would they ever do it that way? They already have rowid/TID as a
stable identifier of logical rows, so having indirect indexes that
point to a heap table's rows would be strictly worse than the generic
approach for indexes on a heap table.

What we call indirect indexes seem to me to be a failed attempt to
solve the "TID is not a stable identifier of logical row" issue that
is baked-in to Postgres. If I thought it was worth solving that
problem then I suppose I'd solve it directly. The "indirection" of
indirect indexes actually buys you nothing! It just moves some of the
problem somewhere else, at the cost of even more complexity. Indirect
indexes (without a clustered index) are a muddled idea.

Of course I accept that clustered indexes make sense in general
(though less and less these days). But the fact that these systems
"use indirect indexes" for secondary indexes is precisely why
clustered indexes don't seem like a great design with modern hardware!
Should we invest a huge amount of work in order to have all of the
disadvantages, and none of the advantages?

> My point is that so far I am not seeing a whole lot of value of this
> proposed approach. For a 64-bit TID to be valuable to you, one of two
> things has to be true: you either don't care about having indexes that
> store TIDs on your new table type, or the index types you want to use
> can store those 64-bit TIDs. Now, I have not yet heard of anyone
> working on a table AM who does not want to be able to support adding
> btree indexes. There may be someone that I don't know about, and if
> so, fine. But otherwise, we need a way to store them. And that
> requires changing the page format for btree indexes. But surely we do
> not want to make all TIDs everywhere wider in future btree versions,
> so at least two TID widths - 6 bytes and 8 bytes - would have to be
> supported.

I agree that we don't want a performance/space overhead for simple
cases that are quite happy with the current format.

> And if we're at all going to do that, I think it's
> certainly worth asking whether supporting varlena TIDs would really be
> all that much harder. You seem to think it is, and you might be right,
> but I'm not ready to give up, because I do not see how we are ever
> going to get global indexes or indirect indexes without doing it, and
> those would be good features to have.

I think that global indexes are well worth having, and should be
solved some completely different way. The partition key can be an
additive thing. I strongly suspect that indirect indexes (without a
clustered index) are 100% useless in both theory and practice, so
naturally I have little to no interest.

The difficulty of supporting (say) 6 byte and 8 byte TIDs together is
vastly lower than variable-width TIDs, for all kinds of reasons. See
my remarks to Andres upthread about deduplication.

> If we can't ever get them, so be it, but you seem to kind of be saying
> that things like global indexes and indirect indexes are hard, and
> therefore they don't count as reasons why we might want variable-width
> TIDs. But one very large reason why those things are hard is that they
> require variable-width TIDs, so AFAICS this boils down to saying that
> we don't want the feature because it's hard to implement.

More like very hard to implement for a very low benefit.

> But we
> should not conflate feasibility with desirability. I am quite sure
> that lots of people want global indexes.

I do too!

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Wed, 2021-05-05 at 08:50 -0700, Peter Geoghegan wrote:
> There just aren't that
> many table AM TID designs that could ever work, and even among those
> schemes that could ever work there is a pretty clear hierarchy. This
> blue sky thinking about generalizing TIDs 2 years in seems *weird* to
> me.

I am happy to keep table AM discussions concrete, as I have plenty of
concrete problems which I would like to turn into proposals.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Wed, May 5, 2021 at 1:13 PM Jeff Davis <pgsql@j-davis.com> wrote:
> "In core" shouldn't matter. In fact, if it's in core, stability of the
> APIs is much less important.

I don't know what to say here. I think it's unrealistic to believe
that a very new API that has only 1 in-core user is going to be fully
stable, or that we can know how it might evolve. I can understand why
you and probably other people want that, but if somebody figures out a
way to make some part of core significantly better and it requires
changing that API, they're going to change the API, not give up on the
idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 10:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what to say here. I think it's unrealistic to believe
> that a very new API that has only 1 in-core user is going to be fully
> stable, or that we can know how it might evolve. I can understand why
> you and probably other people want that, but if somebody figures out a
> way to make some part of core significantly better and it requires
> changing that API, they're going to change the API, not give up on the
> idea.

I strongly agree. More generally, we need to decide what downsides
we're willing to live with.

What we have right now has little chance of failing. It also has
little chance of succeeding (except for something like zheap, which
can presumably get by with heapam's idea of TID).

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Wed, May 5, 2021 at 1:15 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I don't think this is true at all. If you have a clustered index -
> > i.e. the table is physically arranged according to the index ordering
> > - then your secondary indexes all pretty much have to be what we're
> > calling indirect indexes. They can hardly point to a physical
> > identifier if rows are being moved around. I believe InnoDB works this
> > way, and I think Oracle's index-organized tables do too. I suspect
> > there are other examples.
>
> But these systems don't have indirect indexes *on a heap table*! Why
> would they ever do it that way? They already have rowid/TID as a
> stable identifier of logical rows, so having indirect indexes that
> point to a heap table's rows would be strictly worse than the generic
> approach for indexes on a heap table.

One advantage of indirect indexes is that you can potentially avoid a
lot of writes to the index. If a non-HOT update is performed, but the
primary key is not updated, the index does not need to be touched. I
think that's a potentially significant savings, even if bottom-up
index deletion would have prevented the page splits. Similarly, you
can mark a dead line pointer unused without having to scan the
indirect index, because the index isn't pointing to that dead line
pointer anyway.

Hmm, but I guess you have another cleanup problem. What prevents
someone from inserting a new row with the same primary key as a
previously-deleted row but different values in some indirectly-indexed
column? Then the old index entries, if still present, could mistakenly
refer to the new row. I don't know whether Alvaro thought of that
problem when he was working on this previously, or whether he solved
it somehow. Possibly that's a big enough problem that the whole idea
is dead in the water, but it's not obvious to me that this is so.

And, anyway, this whole argument is predicated on the fact that the
only table AM we have right now is heapam. If we had a table AM that
organized the data by primary key value, we'd still want to be able to
have secondary indexes, and they'd have to use the primary key value
as the TID.

> I think that global indexes are well worth having, and should be
> solved some completely different way. The partition key can be an
> additive thing.

I agree that the partition identifier should be an additive thing, but
where would we add it? It seems to me that the obvious answer is to
make it a column of the index tuple. And if we can do that, why can't
we put whatever kind of TID-like stuff people want in the index tuple,
too? Maybe part of the problem here is that I don't actually
understand how posting lists are represented...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote:
> What we have right now has little chance of failing. It also has
> little chance of succeeding (except for something like zheap, which
> can presumably get by with heapam's idea of TID).

What has little chance of succeeding? Table AMs?

And why isn't columnar an example of something that can "get by with
heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
complaint this whole thread is that it's undefined, not that it's
completely unworkable.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 10:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
> One advantage of indirect indexes is that you can potentially avoid a
> lot of writes to the index. If a non-HOT update is performed, but the
> primary key is not updated, the index does not need to be touched. I
> think that's a potentially significant savings, even if bottom-up
> index deletion would have prevented the page splits. Similarly, you
> can mark a dead line pointer unused without having to scan the
> indirect index, because the index isn't pointing to that dead line
> pointer anyway.

As I said, this is equivalent to solving the "TID is a stable
identifier of logical row" issue (an exceptionally hard problem that I
don't think is worth solving), except that you make the secondary
indexes have potentially larger keys for no benefit. Sure, you can
consistently refer to a logical row using its PK value (assuming you
have this whole two-phase locking infrastructure), but why wouldn't
you "just" solve the problem with TID directly instead? What does
involving PK values actually buy you? I am pretty sure that the answer
is "less than nothing".

It is still true that I'm arguing against ever having a clustered
index table AM, which would be somewhat useful to users (that much
I'll own). The main reason for that is that we'd still be required
to solve the "TID is a stable identifier of logical row" issue, except
it's not a physiological TID/rowid (it's a fully logical row
identifier). So everything seems to lead back to that hard problem
seeming insoluble.

> Hmm, but I guess you have another cleanup problem. What prevents
> someone from inserting a new row with the same primary key as a
> previously-deleted row but different values in some indirectly-indexed
> column?

Two-phase locking in indexes stops it. Note that this is pretty much
what happens in Oracle -- it's not just SQL Server.

This is why we have such rich indexing extensibility -- indexes are
strictly physical data structures in Postgres.

> And, anyway, this whole argument is predicated on the fact that the
> only table AM we have right now is heapam. If we had a table AM that
> organized the data by primary key value, we'd still want to be able to
> have secondary indexes, and they'd have to use the primary key value
> as the TID.

But Jeff has a design for the columnstore table AM where TIDs are
essentially logical (not physiological like those of heapam), but that
nevertheless will work with the design around TIDs that I have in
mind. "Logical identifiers" versus "Logical identifiers that stably
identify logical rows" seems like a subtle but important distinction
here.

Of course I cannot yet rule out the possibility that this approach to
TIDs will always be good enough. But it sure seems like it might be,
and starting with the assumption that it is and working backwards
seems like a good way to attack the problem as a practical matter.

> > I think that global indexes are well worth having, and should be
> > solved some completely different way. The partition key can be an
> > additive thing.
>
> I agree that the partition identifier should be an additive thing, but
> where would we add it? It seems to me that the obvious answer is to
> make it a column of the index tuple.

Right.

> And if we can do that, why can't
> we put whatever kind of TID-like stuff people want in the index tuple,
> too? Maybe part of the problem here is that I don't actually
> understand how posting lists are represented...

You want to use the partition identifier for predicate push-down and
stuff anyway, so making it part of the TID doesn't seem particularly
natural to me.

"Posting list splits" from the nbtree README will give you some idea
of why I care about making TIDs integer-like and equi-sized within any
given index tuple. There will be similar considerations for GIN.
Though I think that nbtree deduplication is important enough on its
own to try to preserve across table AMs.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Andres Freund
Date:
Hi,

On 2021-05-05 13:32:57 -0400, Robert Haas wrote:
> I don't know what to say here. I think it's unrealistic to believe
> that a very new API that has only 1 in-core user is going to be fully
> stable, or that we can know how it might evolve. I can understand why
> you and probably other people want that, but if somebody figures out a
> way to make some part of core significantly better and it requires
> changing that API, they're going to change the API, not give up on the
> idea.

Yea. I think it would be actively *bad* if tableam were too
stable. tableam is at best an 80% solution to the abstraction needs
(those 80% were pretty painful to achieve already, I don't think we
could have gotten much more initially). If we get cornered into not
evolving the API because of 2-3 external users, we're a) going to live
with a leaky abstraction for much longer and b) going to get more
hesitant to work incrementally. Both would be bad.

Greetings,

Andres Freund



Re: MaxOffsetNumber for Table AMs

From
Andres Freund
Date:
Hi,

On 2021-05-05 10:56:56 -0700, Jeff Davis wrote:
> On Wed, 2021-05-05 at 10:48 -0700, Peter Geoghegan wrote:
> > What we have right now has little chance of failing. It also has
> > little chance of succeeding (except for something like zheap, which
> > can presumably get by with heapam's idea of TID).
> 
> What has little chance of succeeding? Table AMs?
> 
> And why isn't columnar an example of something that can "get by with
> heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
> complaint this whole thread is that it's undefined, not that it's
> completely unworkable.

Agreed. And we can increase the fit a good bit without needing invasive
all-over changes -- with those changes often even helping heap.

E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem
even for heap - we waste a lot of space that's not commonly used. A
better datastructure would help (radix-tree-like, I'd say, but several
tree-shaped approaches seem possible).
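
To illustrate the waste, here is a paraphrased sketch of how a
tidbitmap-style per-page entry ends up being sized (constants are the
8kB-BLCKSZ values; this is not the exact source):

#include <stdint.h>

#define BITS_PER_BITMAPWORD 64
#define MAX_TUPLES_PER_PAGE 291   /* MaxHeapTuplesPerPage at 8kB BLCKSZ */
#define WORDS_PER_PAGE \
    ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)

/* One entry per heap page, with one bit per *possible* line pointer:
 * 5 words / 320 bits here, even though typical heap pages hold far
 * fewer tuples than the theoretical maximum. */
typedef struct PagetableEntrySketch
{
    uint32_t blockno;
    uint64_t words[WORDS_PER_PAGE];
} PagetableEntrySketch;

A tree-shaped structure could size this per actual page contents rather
than per the worst case.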

Greetings,

Andres Freund



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 10:56 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What has little chance of succeeding? Table AMs?
>
> And why isn't columnar an example of something that can "get by with
> heapam's idea of TID"? I mean, it's not a perfect fit, but my primary
> complaint this whole thread is that it's undefined, not that it's
> completely unworkable.

I think that it could be fairly workable with moderate effort (maybe
even without that effort, but that doesn't seem so appealing). To do
it well we have to actually generalize TIDs sensibly. And I think that
that means admitting that we'll never solve the "TID is a stable
identifier of a logical row, not a physical version" problem. ISTM
that that's the problem that is at the root of everything here.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> Agreed. And we can increase the fit a good bit without needing invasive
> all-over changes -- with those changes often even helping heap.
>
> E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem
> even for heap - we waste a lot of space that's not commonly used. A
> better datastructure would help (radix-tree-like, I'd say, but several
> tree-shaped approaches seem possible).

Agreed -- even if we only cared about heapam we still ought to do
something about tidbitmap.c's use of MaxHeapTuplesPerPage.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Wed, 2021-05-05 at 11:22 -0700, Andres Freund wrote:
> Yea. I think it would be actively *bad* if tableam were too
> stable. tableam is at best an 80% solution to the abstraction needs
> (those 80% were pretty painful to achieve already, I don't think we
> could have gotten much more initially). If we get cornered into not
> evolving the API because of 2-3 external users, we're a) going to live
> with a leaky abstraction for much longer and b) going to get more
> hesitant to work incrementally. Both would be bad.

Like anything, we make the decision at the time we have a reason to
break something. But why are extensions disfavored in this
calculation vs. in-core? Isn't it a lot easier to update in-core code
to new APIs?

Evolving the API is one thing, but we should be more careful about
things that could affect on-disk state like ItemPointer
representations. By "more careful", I don't mean that we reject all
proposals; I mean that we don't casually impose new limits in other
parts of the system that happen to work for heapam but will cause table
AM extensions to break.

Regards,
    Jeff Davis





Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 12:09 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Like anything, we make the decision at the time we have a reason to
> break something. But why are are exensions disfavored in this
> calculation vs. in-core? Isn't it a lot easier to update in-core code
> to new APIs?

We don't really have an API for how TIDs behave (unless you happen to
want to emulate heapam, which is reasonable and was expected). It's
unspecified because nobody knows what it is (or what it should be)
just yet. AFAICT there is no TID API to break.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Matthias van de Meent
Date:
On Wed, 5 May 2021 at 19:15, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 5, 2021 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I'm being very vocal here because I'm concerned that we're going about
> > > generalizing TIDs in the wrong way. To me it feels like there is a
> > > loss of perspective about what really matters.
> >
> > Well, which things matter is a question of opinion, not fact.
>
> I'm not trying to win an argument here. I am giving an opinion in the
> hopes that it leads to some kind of useful synthesis, based on all of
> our opinions.
>
> > > No other database system has something like indirect indexes. They
> > > have clustered indexes, but that's rather different.
> >
> > I don't think this is true at all. If you have a clustered index -
> > i.e. the table is physically arranged according to the index ordering
> > - then your secondary indexes all pretty much have to be what we're
> > calling indirect indexes. They can hardly point to a physical
> > identifier if rows are being moved around. I believe InnoDB works this
> > way, and I think Oracle's index-organized tables do too. I suspect
> > there are other examples.
>
> But these systems don't have indirect indexes *on a heap table*! Why
> would they ever do it that way? They already have rowid/TID as a
> stable identifier of logical rows, so having indirect indexes that
> point to a heap table's rows would be strictly worse than the generic
> approach for indexes on a heap table.
>
> What we call indirect indexes seem to me to be a failed attempt to
> solve the "TID is not a stable identifier of logical row" issue that
> is baked-in to Postgres. If I thought it was worth solving that
> problem then I suppose I'd solve it directly. The "indirection" of
> indirect indexes actually buys you nothing! It just moves some of the
> problem somewhere else, at the cost of even more complexity. Indirect
> indexes (without a clustered index) are a muddled idea.
>
> Of course I accept that clustered indexes make sense in general
> (though less and less these days). But the fact that these systems
> "use indirect indexes" for secondary indexes is precisely why
> clustered indexes don't seem like a great design with modern hardware!
> Should we invest a huge amount of work in order to have all of the
> disadvantages, and none of the advantages?
>
> > My point is that so far I am not seeing a whole lot of value of this
> > proposed approach. For a 64-bit TID to be valuable to you, one of two
> > things has to be true: you either don't care about having indexes that
> > store TIDs on your new table type, or the index types you want to use
> > can store those 64-bit TIDs. Now, I have not yet heard of anyone
> > working on a table AM who does not want to be able to support adding
> > btree indexes. There may be someone that I don't know about, and if
> > so, fine. But otherwise, we need a way to store them. And that
> > requires changing the page format for btree indexes. But surely we do
> > not want to make all TIDs everywhere wider in future btree versions,
> > so at least two TID widths - 6 bytes and 8 bytes - would have to be
> > supported.
>
> I agree that we don't want a performance/space overhead for simple
> cases that are quite happy with the current format.
>
> > And if we're at all going to do that, I think it's
> > certainly worth asking whether supporting varlena TIDs would really be
> > all that much harder. You seem to think it is, and you might be right,
> > but I'm not ready to give up, because I do not see how we are ever
> > going to get global indexes or indirect indexes without doing it, and
> > those would be good features to have.
>
> I think that global indexes are well worth having, and should be
> solved some completely different way. The partition key can be an
> additive thing.

I believe that it cannot be "just" an additive thing, at least not
through a normal INCLUDEd column, as you'd get duplicate TIDs in the
index, with its related problems. You also cannot add it as a key
column, as this would disable UNIQUE indexes; one of the largest use
cases of global indexes. So, you must create specialized
infrastructure for this identifier.

And when we're already adding specialized infrastructure, then this
should probably be part of a new TID infrastructure.

And if we're going to change TID infrastructure to allow for more
sizes (as we'd need normal TableAM TIDs, and global index
partition-identifying TIDs), I'd argue that it should not be too much
more difficult to create an infrastructure for 'new TID' in which the
table AM supplies type, size and strict ordering information for these
'new TIDs'.

And if this 'new TID' size is not going to be defined by the index AM
but by the indexed object (be it a table or a 'global' or whatever
we'll build indexes on), I see no reason why this 'new TID'
infrastructure couldn't eventually support variable length TIDs; or
constant sized usertype TIDs (e.g. the 3 int columns of the primary
key of a clustered table).

The only requirements that I believe to be fundamental for any kind of TID are

1.) Uniqueness during the lifecycle of the tuple, from creation,
through its live and dead states, until it is fully dereferenced from
all indexes;
2.) There exists a strict ordering of all TIDs of that type;

And maybe to supply some form of efficiency to the underlying tableAM:

3.) There should be an equivalent of a bitmap for that TID type.

For the nbtree deduplication subsystem, and for gin posting lists to
be able to work efficiently, the following must also hold:

4.) The TID type has a fixed size, preferably efficiently packable.

Only the last requirement cannot be met with varlena TID types. But,
as I also believe that not all indexes can be expected to work (well)
for all kinds of TableAM, I don't see how this would be a blocking
issue.
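
Put together, the requirements above might be expressed as a descriptor
the table AM supplies -- a purely illustrative sketch, every name here
hypothetical and not a proposed interface:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative 'new TID' descriptor, one per TID type; requirement 1
 * (uniqueness over the tuple's lifecycle) is a contract on the table
 * AM rather than a callback. */
typedef struct TidTypeSketch
{
    int16_t fixed_size;     /* requirement 4: >0 = width in bytes;
                             * -1 = varlena, which forgoes dedup and
                             * posting lists per the caveat above */

    /* requirement 2: strict total ordering over TIDs of this type */
    int     (*cmp)(const void *a, const void *b);

    /* requirement 3: membership in a bitmap-scan-style structure */
    void    (*bitmap_add)(void *bitmap, const void *tid);
    bool    (*bitmap_contains)(const void *bitmap, const void *tid);
} TidTypeSketch;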

> I strongly suspect that indirect indexes (without a
> clustered index) are 100% useless in both theory and practice, so
> naturally I have little to no interest.
>
> The difficulty of supporting (say) 6 byte and 8 byte TIDs together is
> vastly lower than variable-width TIDs, for all kinds of reasons. See
> my remarks to Andres upthread about deduplication.

I believe that deduplication is amazing when it works, but it should
not be a blocker for new TID infrastructure (assuming it still works
by default for nbtree+heap). As an example: numeric columns cannot be
deduplicated, and that wasn't considered a blocker for deduplication
to be merged.

The only reason I can think of why varlena TIDs cannot be efficiently
deduplicated is storage and search efficiency: I cannot binary
search a packed array of varlena attributes, but I can binary
search through packed fixed-size attributes. Any fixed-size TID can
realistically be deduplicated, assuming enough effort is put into the
patch implementing the new TID infrastructure.

> > If we can't ever get them, so be it, but you seem to kind of be saying
> > that things like global indexes and indirect indexes are hard, and
> > therefore they don't count as reasons why we might want variable-width
> > TIDs.But one very large reason why those things are hard is that they
> > require variable-width TIDs, so AFAICS this boils down to saying that
> > we don't want the feature because it's hard to implement.
>
> More like very hard to implement for a very low benefit.

Complicated optimizations have been merged which had only a small gain.

Storage gains for index-oriented tables can be as large as the size of
the primary key, by not having to store all primary key values in both
the index and the table; that can be around 100% of the table's size in
the least efficient case of a PK over all columns.

Yes, this might be indeed only a 'small gain' for access latency, but
not needing to store another copy of your data (and keeping it in
cache, etc.) is a significant win in my book.


Regarding losing deduplication in btrees when we have varlena TIDs:
This loss in [storage] efficiency can be partially mitigated by
implementing prefix truncation/prefix deduplication, so the loss would
not necessarily be too problematic once PT/PD is implemented.



With regards,

Matthias van de Meent



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I believe that it cannot be "just" an additive thing, at least not
> through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> index, with its related problems. You also cannot add it as a key
> column, as this would disable UNIQUE indexes; one of the largest use
> cases of global indexes. So, you must create specialized
> infrastructure for this identifier.

You're just quibbling about the precise words that I used. Of course
it is true that there must be some sense in which a global index
partition key attribute will need to be special to the implementation
-- how else could a global index enforce uniqueness? That was clearly
implied.

> And when we're already adding specialized infrastructure, then this
> should probably be part of a new TID infrastructure.

This is a non-sequitur.

> And if we're going to change TID infrastructure to allow for more
> sizes (as we'd need normal TableAM TIDs, and global index
> partition-identifying TIDs), I'd argue that it should not be too much
> more difficult to create an infrastructure for 'new TID' in which the
> table AM supplies type, size and strict ordering information for these
> 'new TIDs'.
>
> And if this 'new TID' size is not going to be defined by the index AM
> but by the indexed object (be it a table or a 'global' or whatever
> we'll build indexes on), I see no reason why this 'new TID'
> infrastructure couldn't eventually support variable length TIDs; or
> constant sized usertype TIDs (e.g. the 3 int columns of the primary
> key of a clustered table).

You're not considering the big picture. It's not self-evident that
anybody will ever have much use for a variable-width TID in their
table AM, at least beyond some fairly simple scheme -- because of the
fundamental issue of TID not working as a stable identifier of logical
rows in Postgres. If it were very clear that there would be *some*
significant benefit then the costs might start to look reasonable. But
there isn't. "Build it and they will come" is not at all convincing to
me.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Matthias van de Meent
Date:
On Wed, 5 May 2021 at 22:09, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 5, 2021 at 12:43 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I believe that it cannot be "just" an additive thing, at least not
> > through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> > index, with its related problems. You also cannot add it as a key
> > column, as this would disable UNIQUE indexes; one of the largest use
> > cases of global indexes. So, you must create specialized
> > infrastructure for this identifier.
>
> You're just quibbling about the precise words that I used. Of course
> it is true that there must be some sense in which a global index
> partition key attribute will need to be special to the implementation
> -- how else could a global index enforce uniqueness? That was clearly
> implied.

This implication was not 100% clear to me, and the last thread on
global indexes that implemented it through INCLUDEd columns didn't
mention this. As such, I wanted to explicitly mention that this
partition/table identifier would need to be part of the keyspace.

> > And when we're already adding specialized infrastructure, then this
> > should probably be part of a new TID infrastructure.
>
> This is a non-sequitur.

I may have skipped some reasoning:

I believe that the TID is the unique identifier of that tuple, within context.

For normal indexes, the TID as supplied directly by the TableAM is
sufficient, as the context is that table.
For global indexes, this TID must include enough information to relate
it to the table the tuple originated from.
In the whole database, that would be the OID of the table + the TID as
supplied by the table.

As such, the identifier of the logical row (which can be called the
TID), as stored in index tuples in global indexes, would need to
consist of the TableAM supplied TID + the (local) id of the table
containing the tuple. Assuming we're in agreement on that part, I
would think it would be consistent to put this in TID infrastructure,
such that all indexes that use such new TID infrastructure can be
defined to be global with only minimal effort.
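
To make that concrete, a sketch of such a composite identifier -- the
names are hypothetical, not a proposed API:

#include <stdint.h>

/* Hypothetical global-index TID: the containing table's OID plus the
 * TableAM-supplied TID (shown split into its block/offset parts).
 * Comparing the table first keeps each table's TIDs contiguous. */
typedef struct GlobalTidSketch
{
    uint32_t table_oid;
    uint32_t block;
    uint16_t offset;
} GlobalTidSketch;

static int
global_tid_cmp(const GlobalTidSketch *a, const GlobalTidSketch *b)
{
    if (a->table_oid != b->table_oid)
        return (a->table_oid > b->table_oid) ? 1 : -1;
    if (a->block != b->block)
        return (a->block > b->block) ? 1 : -1;
    return (a->offset > b->offset) - (a->offset < b->offset);
}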

> > And if we're going to change TID infrastructure to allow for more
> > sizes (as we'd need normal TableAM TIDs, and global index
> > partition-identifying TIDs), I'd argue that it should not be too much
> > more difficult to create an infrastructure for 'new TID' in which the
> > table AM supplies type, size and strict ordering information for these
> > 'new TIDs'.
> >
> > And if this 'new TID' size is not going to be defined by the index AM
> > but by the indexed object (be it a table or a 'global' or whatever
> > we'll build indexes on), I see no reason why this 'new TID'
> > infrastructure couldn't eventually support variable length TIDs; or
> > constant sized usertype TIDs (e.g. the 3 int columns of the primary
> > key of a clustered table).
>
> You're not considering the big picture. It's not self-evident that
> anybody will ever have much use for a variable-width TID in their
> table AM, at least beyond some fairly simple scheme -- because of the
> fundamental issue of TID not working as a stable identifier of logical
> rows in Postgres.

ZHeap states that it can implement stable TIDs within limits, as IIRC
it requires retail index deletion support for all indexes on the
updated columns of that table. I fail to see why this same
infrastructure could not be used for supporting clustered tables,
while hard-enforcing the limits that are only soft-enforced in ZHeap
(that is, only allowing index AMs that support retail index tuple
deletion).

> If it were very clear that there would be *some*
> significant benefit then the costs might start to look reasonable. But
> there isn't. "Build it and they will come" is not at all convincing to
> me.

Clustered tables / Index-oriented Tables are very useful for tables of
which most columns are contained in the PK, or otherwise are often
ordered by their PK. I don't know of any way that would allow us to
build a clustered table _without_ including the primary key in some
form into the TID, or otherwise introducing a layer of indirection
that would undo the clustered access implied by the clustered
table.

Additionally, compacting/re-clustering a table would be _much_ cheaper
for clustered tables, as the indexes attached to that table would not
need rebuilding: all TIDs will stay valid across the clustering
operation.

With regards,

Matthias van de Meent



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I believe that the TID is the unique identifier of that tuple, within context.
>
> For normal indexes, the TID as supplied directly by the TableAM is
> sufficient, as the context is that table.
> For global indexes, this TID must include enough information to relate
> it to the table the tuple originated from.

Clearly something like a partition identifier column is sometimes just
like a regular user-visible column, though occasionally not like one
-- whichever is useful to the implementation in each context. For
example, we probably want to do predicate pushdown, maybe with real
cataloged operators that access the column like any other user-created
column (the optimizer knows about the column, which even has a
pg_attribute entry). Note that we only ever access the TID column
using an insertion scankey today -- so there are several ways in which
the partition identifier really would be much more like a user column
than tid/scantid ever was.

The TID is a key column for most purposes as of Postgres 12 (at least
internally). That didn't break all unique indexes due to the existence
of non-unique TIDs across duplicates! Insertions that must call
_bt_check_unique() can deal with the issue directly, by temporarily
unsetting scantid.
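
A toy illustration of that trick, using made-up names rather than
nbtree's real data structures: keys compare on (value, TID), but a
uniqueness check leaves scantid unset so that all duplicates of a value
compare as equal.

#include <stddef.h>
#include <stdint.h>

/* Toy "TID as tiebreaker key column" comparison.  With scantid set we
 * get a total order over (value, tid); with scantid == NULL comparison
 * stops at the value, which is what a uniqueness check wants. */
typedef struct ToyInsertionKey
{
    int64_t         value;
    const uint64_t *scantid;    /* NULL while checking uniqueness */
} ToyInsertionKey;

static int
toy_key_cmp(ToyInsertionKey key, int64_t value, uint64_t tid)
{
    if (key.value != value)
        return (key.value > value) ? 1 : -1;
    if (key.scantid == NULL)
        return 0;               /* all duplicates look equal */
    return (*key.scantid > tid) - (*key.scantid < tid);
}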

We can easily do roughly the same thing here: be slightly creative
about how we interpret whether or not the partition identifier is
"just another key column" across each context. This is also similar to
the way the implementation is slightly creative about NULL values,
which are not equal to any other value to the user, but are
nevertheless just another value from the domain of indexed values to
the nbtree implementation. Cleverly defining the semantics of keys to
get better performance and to avoid the need for special case code is
more or less a standard technique.

> In the whole database, that would be the OID of the table + the TID as
> supplied by the table.
>
> As such, the identifier of the logical row (which can be called the
> TID), as stored in index tuples in global indexes, would need to
> consist of the TableAM supplied TID + the (local) id of the table
> containing the tuple.

2 points:

1. Clearly you need to use the partition identifier with the TID in
order to look up the version in the table -- you need to use both
together in global indexes. But it can still work in much the same way
as it would in a standard index -- it's just that you handle that
extra detail as well. That's what I meant by additive.

2. If a TID points to a version of a row (or whatever you want to call
the generalized version of a HOT chain -- almost the same thing), then
of course you can always map it back to the logical row. That must
always be true. It is equally true within a global index.

Points 1 and 2 above seem obvious to me...so I think we agree on that
much. I just don't know how you go from here to "we need
variable-width TIDs". In all sincerity, I am confused because to me it
just seems as if you're asserting that it must be necessary to have
variable width TIDs again and again, without ever getting around to
justifying it. Or even trying to.

> Assuming we're in agreement on that part, I
> would think it would be consistent to put this in TID infrastructure,
> such that all indexes that use such new TID infrastructure can be
> defined to be global with only minimal effort.

Abstract definitions can be very useful, but ultimately they're just
tools. They're seldom useful as a starting point in my experience. I
try to start with the reality on the ground, and perhaps arrive at
some kind of abstract model or idea much later.

> ZHeap states that it can implement stable TIDs within limits, as IIRC
> it requires retail index deletion support for all indexes on the
> updated columns of that table.

Whether or not that's true is not at all clear. What is true is that
the prototype version of zheap that we have as of today is notable in
that it more or less allows the moral equivalent of a HOT chain to be
arbitrarily long (or much longer, at least). To the best of my
knowledge there is nothing about retail index tuple deletion in the
design, except perhaps something vague and aspirational.

> I fail to see why this same
> infrastructure could not be used for supporting clustered tables,
> while enforcing these limits only soft enforced in ZHeap (that is,
> only allowing index AMs that support retail index tuple deletion).

You're ignoring an ocean of complexity here. Principally the need to
implement something like two-phase locking (key value locking) in
indexes to make this work, but also the need to account for how
fundamentally redefining TID breaks things. To say nothing of how this
might affect crash recovery.

> > If it was very clear that there would be *some*
> > significant benefit then the costs might start to look reasonable. But
> > there isn't. "Build it and they will come" is not at all convincing to
> > me.
>
> Clustered tables / Index-oriented Tables are very useful for tables of
> which most columns are contained in the PK, or otherwise are often
> ordered by their PK.

I'm well aware of the fact that clustered index based tables are
sometimes more useful than heap-based tables.

-- 
Peter Geoghegan



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I believe that it cannot be "just" an additive thing, at least not
> through a normal INCLUDEd column, as you'd get duplicate TIDs in the
> index, with its related problems. You also cannot add it as a key
> column, as this would disable UNIQUE indexes; one of the largest use
> cases of global indexes. So, you must create specialized
> infrastructure for this identifier.
>
> And when we're already adding specialized infrastructure, then this
> should probably be part of a new TID infrastructure.
>
> And if we're going to change TID infrastructure to allow for more
> sizes (as we'd need normal TableAM TIDs, and global index
> partition-identifying TIDs), I'd argue that it should not be too much
> more difficult to create an infrastructure for 'new TID' in which the
> table AM supplies type, size and strict ordering information for these
> 'new TID's.
>
> And if this 'new TID' size is not going to be defined by the index AM
> but by the indexed object (be it a table or a 'global' or whatever
> we'll build indexes on), I see no reason why this 'new TID'
> infrastructure couldn't eventually support variable length TIDs; or
> constant sized usertype TIDs (e.g. the 3 int columns of the primary
> key of a clustered table).
>
> The only requirements that I believe to be fundamental for any kind of TID are
>
> 1.) Uniqueness during the lifecycle of the tuple, from creation to
> life to dead to fully dereferenced from all indexes;
> 2.) There exists a strict ordering of all TIDs of that type;
>
> And maybe to supply some form of efficiency to the underlying tableAM:
>
> 3.) There should be an equivalent of bitmap for that TID type.
>
> For the nbtree deduplication subsystem, and for gin posting lists to
> be able to work efficiently, the following must also hold:
>
> 4.) The TID type has a fixed size, preferably efficiently packable.
>
> Only the last requirement cannot be met with varlena TID types. But,
> as I also believe that not all indexes can be expected to work (well)
> for all kinds of TableAM, I don't see how this would be a blocking
> issue.

+1 to all of that.

> Storage gains for index-oriented tables can become as large as the
> size of the primary key by not having to store all primary key values
> in both the index and the table; which can thus be around 100% of a
> table in the least efficient cases of having a PK over all columns.
>
> Yes, this might be indeed only a 'small gain' for access latency, but
> not needing to store another copy of your data (and keeping it in
> cache, etc.) is a significant win in my book.

This is a really good point. Also, if the table is ordered by a
synthetic logical TID, range scans on the primary key will be less
efficient than if the primary key is itself the TID. We have the
ability to CLUSTER on an index for good reasons, and "Automatically
maintain clustering on a table" has been on the todo list forever.
It's hard to imagine this will ever be achieved with the current heap,
though: the way to get there is to have a table AM for which this is
an explicit goal.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Hannu Krosing
Date:
How hard would it be to declare TID as current ItemPointerData with
some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?).
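
For illustration, a hedged sketch of what the resulting validity rule
might look like (the function name is made up; the reserved values are
the existing ones from itemptr.h):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Hypothetical check for TIDs handed out by a non-heap table AM */
    static inline bool
    TableAmTidIsUsable(ItemPointer tid)
    {
        OffsetNumber off = ItemPointerGetOffsetNumberNoCheck(tid);

        if (!ItemPointerIsValid(tid))   /* rejects InvalidOffsetNumber (0) */
            return false;
        if (ItemPointerGetBlockNumberNoCheck(tid) == InvalidBlockNumber)
            return false;
        if (off == SpecTokenOffsetNumber ||         /* 0xfffe */
            off == MovedPartitionsOffsetNumber ||   /* 0xfffd */
            off == 0xffff)                          /* presumably reserved too */
            return false;
        return true;
    }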

And then commit to fixing usage outside access/heap/ which depends on
a small value for MaxHeapTuplesPerPage; currently only
executor/nodeTidscan.c, access/gin/ginpostinglist.c and
access/brin/brin_*.c

There is also MaxHeapTuplesPerPage usage in
./backend/storage/page/bufpage.c but it seems to be all in
heapam-dependent functions (PageRepairFragmentation(),
PageGetHeapFreeSpace() and a few others) which most likely should be
moved to access/heap/

Doing it this way would leave us with some manageable complexity in
mapping from a TID to a 48-bit integer, and/or in handling the 3
reserved positions in the 2^32 range.
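
For illustration, a hedged sketch of such a mapping (the function names
are made up; it simply packs out the reserved offsets):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* usable offsets run 1..0xfffc, staying below 0xfffd/0xfffe/0xffff */
    #define USABLE_OFFSETS_PER_BLOCK ((uint64) (MovedPartitionsOffsetNumber - 1))

    static inline uint64
    tid_to_rownumber(ItemPointer tid)
    {
        return (uint64) ItemPointerGetBlockNumber(tid) * USABLE_OFFSETS_PER_BLOCK +
            (ItemPointerGetOffsetNumber(tid) - 1);
    }

    static inline void
    rownumber_to_tid(uint64 rownum, ItemPointer tid)
    {
        /* caller must keep rownum below 2^32 - 1 blocks' worth of offsets */
        ItemPointerSet(tid,
                       (BlockNumber) (rownum / USABLE_OFFSETS_PER_BLOCK),
                       (OffsetNumber) ((rownum % USABLE_OFFSETS_PER_BLOCK) + 1));
    }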

------------
Hannu Krosing


On Wed, May 5, 2021 at 8:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 5, 2021 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> > Agreed. And we can increase the fit a good bit without needing invasive
> > all-over changes. With those changes often even helping heap.
> >
> > E.g. tidbitmap.c's hardcoded use of MaxHeapTuplesPerPage is a problem
> > even for heap - we waste a lot of space that's not commonly used. A
> > better datastructure (radix-tree-like I'd say, but several tree-shaped
> > approaches seem possible).
>
> Agreed -- even if we only cared about heapam we still ought to do
> something about tidbitmap.c's use of MaxHeapTuplesPerPage.
>
> --
> Peter Geoghegan
>
>



Re: MaxOffsetNumber for Table AMs

From
Hannu Krosing
Date:
On Thu, May 6, 2021 at 3:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 5, 2021 at 3:43 PM Matthias van de Meent
> > Storage gains for index-oriented tables can become as large as the
> > size of the primary key by not having to store all primary key values
> > in both the index and the table; which can thus be around 100% of a
> > table in the least efficient cases of having a PK over all columns.
> >
> > Yes, this might be indeed only a 'small gain' for access latency, but
> > not needing to store another copy of your data (and keeping it in
> > cache, etc.) is a significant win in my book.
>
> This is a really good point. Also, if the table is ordered by a
> synthetic logical TID, range scans on the primary key will be less
> efficient than if the primary key is itself the TID. We have the
> ability to CLUSTER on an index for good reasons, and "Automatically
> maintain clustering on a table" has been on the todo list forever.
> It's hard to imagine this will ever be achieved with the current heap,
> though: the way to get there is to have a table AM for which this is
> an explicit goal.

But would this not have the downside that all the secondary indexes
will blow up as they now need to have the full table row as the TID  ?

-----
Hannu Krosing



Re: MaxOffsetNumber for Table AMs

From
Jeff Davis
Date:
On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> How hard would it be to declare TID as current ItemPointerData with
> some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?).

I don't think there's consensus in this thread that we want to do that,
but I'd be fine with it.

It's possible but not trivial. tidbitmap.c would be the biggest
challenge, I think.

Regards,
    Jeff Davis






Re: MaxOffsetNumber for Table AMs

From
Hannu Krosing
Date:
-----
Hannu Krosing

On Thu, May 6, 2021 at 4:53 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> > How hard would it be to declare TID as current ItemPointerData with
> > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?).
>
> I don't think there's consensus in this thread that we want to do that,
> but I'd be fine with it.

Sure. I just proposed this as a Minimal Viable Change.

Just hoping that we can agree on an interim solution which addresses
the immediate problem, and then continue arguing about the ideal way
for the rest of the v15 cycle (and the v16 and v17 cycles ;))

>
> It's possible but not trivial. tidbitmap.c would be the biggest
> challenge, I think.

I think all these places (tidbitmap, gin, brin) rely on a "relatively
small" MaxHeapTuplesPerPage as an upper bound for some allocations, and
then still allocate a lot more than needed.

One can get to 291 TIDs / page only when you create a table with no
columns, or with fewer than 8 columns which are all set to NULL. A
table with a single non-null boolean column can already fit only 226
tuples per page.
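
For reference, the 291 ceiling falls straight out of the existing
definition in access/htup_details.h:

    #define MaxHeapTuplesPerPage    \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

    /* with 8kB pages: (8192 - 24) / (24 + 4) = 291 */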

It is definitely a non-trivial amount of work to rewrite these three,
but going to the (almost) full 48 bits from the current effective width
of theoretically a little over 40 bits would expand the number of
addressable tuples about 225 times (2^48 / (2^32 * 291) is roughly
225).

Of course it would be extra nice to also somehow encode the 3 special
ItemPointerData values (NULL, 0xfffe, 0xfffd) "somewhere else" and get
an option of an uninterrupted 48-bit address space for non-heap table
AMs, but this would likely be much more disruptive, if possible at
all.
We could still check whether they are used outside of heapam and maybe
just fix those places.

>
> Regards,
>         Jeff Davis
>
>
>



Re: MaxOffsetNumber for Table AMs

From
Robert Haas
Date:
On Wed, May 5, 2021 at 10:53 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2021-05-06 at 03:26 +0200, Hannu Krosing wrote:
> > How hard would it be to declare TID as current ItemPointerData with
> > some values prohibited (NULL, SpecTokenOffsetNumber = 0xfffe,
> > MovedPartitionsOffsetNumber = 0xfffd, presumably also 0xffff ?).
>
> I don't think there's consensus in this thread that we want to do that,
> but I'd be fine with it.
>
> It's possible but not trivial. tidbitmap.c would be the biggest
> challenge, I think.

I think that would be fine, too. I don't think it's the ideal
situation, but it seems like a clear improvement over what we have
now. We might want to reserve a few values for future projects that
might need distinguished values like SpecTokenOffsetNumber or
MovedPartitionsOffsetNumber, though, so we don't completely box
ourselves into a corner.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: MaxOffsetNumber for Table AMs

From
Matthias van de Meent
Date:
On Thu, 6 May 2021 at 01:22, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 5, 2021 at 3:18 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I believe that the TID is the unique identifier of that tuple, within context.
> >
> > For normal indexes, the TID as supplied directly by the TableAM is
> > sufficient, as the context is that table.
> > For global indexes, this TID must include enough information to relate
> > it to the table the tuple originated from.
>
> Clearly something like a partition identifier column is sometimes just
> like a regular user-visible column, though occasionally not like one
> -- whichever is useful to the implementation in each context. For
> example, we probably want to do predicate pushdown, maybe with real
> cataloged operators that access the column like any other user-created
> column (the optimizer knows about the column, which even has a
> pg_attribute entry). Note that we only ever access the TID column
> using an insertion scankey today -- so there are several ways in which
> the partition identifier really would be much more like a user column
> than tid/scantid ever was.
>
> The TID is a key column for most purposes as of Postgres 12 (at least
> internally). That didn't break all unique indexes due to the existence
> of non-unique TIDs across duplicates! Insertions that must call
> _bt_check_unique() can deal with the issue directly, by temporarily
> unsetting scantid.
>
> We can easily do roughly the same thing here: be slightly creative
> about how we interpret whether or not the partition identifier is
> "just another key column" across each context. This is also similar to
> the way the implementation is slightly creative about NULL values,
> which are not equal to any other value to the user, but are
> nevertheless just another value from the domain of indexed values to
> the nbtree implementation. Cleverly defining the semantics of keys to
> get better performance and to avoid the need for special case code is
> more or less a standard technique.
>
> > In the whole database, that would be the OID of the table + the TID as
> > supplied by the table.
> >
> > As such, the identifier of the logical row (which can be called the
> > TID), as stored in index tuples in global indexes, would need to
> > consist of the TableAM supplied TID + the (local) id of the table
> > containing the tuple.
>
> 2 points:
>
> 1. Clearly you need to use the partition identifier with the TID in
> order to look up the version in the table -- you need to use both
> together in global indexes. But it can still work in much the same way
> as it would in a standard index -- it's just that you handle that
> extra detail as well. That's what I meant by additive.
>
> 2. If a TID points to a version of a row (or whatever you want to call
> the generalized version of a HOT chain -- almost the same thing), then
> of course you can always map it back to the logical row. That must
> always be true. It is equally true within a global index.
>
> Points 1 and 2 above seem obvious to me...so I think we agree on that
> much.

Yes.

> I just don't know how you go from here to "we need
> variable-width TIDs". In all sincerity, I am confused because to me it
> just seems as if you're asserting that it must be necessary to have
> variable width TIDs again and again, without ever getting around to
> justifying it. Or even trying to.

See below. I'm not saying we need it _right now_, but at the very
least I'd like to argue that we should not close the door on varlena
TIDs, because there _are_ reasons for those TID types. See also below.

> > Assuming we're in agreement on that part, I
> > would think it would be consistent to put this in TID infrastructure,
> > such that all indexes that use such new TID infrastructure can be
> > defined to be global with only minimal effort.
>
>  Abstract definitions can be very useful, but ultimately they're just
> tools. They're seldom useful as a starting point in my experience. I
> try to start with the reality on the ground, and perhaps arrive at
> some kind of abstract model or idea much later.

Although I agree that abstract definitions are tools, I disagree that
they're seldom useful as a starting point. Many have implemented
B(+)-trees based on the original paper precisely because of the
guarantees provided by the abstract definition in that paper. When
trying to create completely new things I would agree that starting
with the abstract is probably not the right idea, but we're not in a
green field. There are many examples of storage engines / table AMs
outside PostgreSQL, and I believe that we can draw lessons from those
for the potential extensibility of PostgreSQL.

> > ZHeap states that it can implement stable TIDs within limits, as IIRC
> > it requires retail index deletion support for all indexes on the
> > updated columns of that table.
>
> Whether or not that's true is not at all clear. What is true is that
> the prototype version of zheap that we have as of today is notable in
> that it more or less allows the moral equivalent of a HOT chain to be
> arbitrarily long (or much longer, at least). To the best of my
> knowledge there is nothing about retail index tuple deletion in the
> design, except perhaps something vague and aspirational.

I'm fairly certain that ZHeap's todo list item no. 1 details retail
index tuple deletion as I understand it:

"Delete marking in indexes: This will allow inplace updates even when
index columns are updated and additionally with this we can avoid the
need for a dedicated vacuum process to perform retail deletes."

If I read this correctly, this means asking the index AM to delete the
specific index tuple corresponding to the deleted item -- i.e. retail
index tuple deletion.

> > I fail to see why this same
> > infrastructure could not be used for supporting clustered tables,
> > while enforcing these limits only soft enforced in ZHeap (that is,
> > only allowing index AMs that support retail index tuple deletion).
>
> You're ignoring an ocean of complexity here. Principally the need to
> implement something like two-phase locking (key value locking) in
> indexes to make this work, but also the need to account for how
> fundamentally redefining TID breaks things. To say nothing of how this
> might affect crash recovery.

An ocean of complexity that would (to the best of my knowledge) be
mostly delegated to the table AM that implements these varlena TIDs.
I'm not saying that varlena TIDs are a one-size-fits-all solution, but
I want to make clear that varlena TIDs are required for certain table
AMs that would otherwise be impossible.

> > > If it was very clear that there would be *some*
> > > significant benefit then the costs might start to look reasonable. But
> > > there isn't. "Build it and they will come" is not at all convincing to
> > > me.
> >
> > Clustered tables / Index-oriented Tables are very useful for tables of
> > which most columns are contained in the PK, or otherwise are often
> > ordered by their PK.
>
> I'm well aware of the fact that clustered index based tables are
> sometimes more useful than heap-based tables.

Then would you also agree that for persistently clustered tables to
work in any form of efficiency, they would need to support variable
length identifiers? Let me explain my reasoning:

If the TableAM TID contains information about its physical location,
we cannot implement persistent clustered tables, due to the
instability of the physical location of the base tuple (see also how
we cannot guarantee the physical location of an index tuple in the
index, only its logical location).
That being the case, the TID must be fully logical. I hope you agree
at least up to here.

The TID must order in the same manner as the table's clustering, as
you would want to be able to find the tuple from the table, and the
only guarantee you have is that the table has an ordering. If the TID
does not have the same ordering, you cannot find the correct tuple (or
you'd need a TID -> key mapping, which would mean you'd lose a lot of
the benefits you got from the clustering).

If the TID must share an ordering with the clustering of the table,
then it should also have a similar key space as the clustering: I
could cluster a table on a text column, but any fixed-size TID cannot
ever hope to correctly identify the position of a certain text string
within an arbitrary sorted list of text strings. You would have to use
that text string itself to find its location in the list. As such, I
believe that the tuple identifier (that is, the identifier using which
we can locate the tuple data) for clustered / index-oriented tables is
at the very least the set of columns that the table is clustered on.
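
To make that concrete, here is a purely hypothetical sketch of what a
fully logical "TID" for a table clustered on a text column would have
to look like (nothing like this exists today, and the binary memcmp()
ordering ignores collations for simplicity):

    #include "postgres.h"

    /* the variable-width "TID" essentially *is* the clustering key */
    typedef struct ClusteredTid
    {
        int32   vl_len_;                    /* varlena header */
        char    key[FLEXIBLE_ARRAY_MEMBER]; /* clustering key bytes */
    } ClusteredTid;

    /* the strict ordering (requirement 2 upthread) is the key's ordering */
    static int
    clustered_tid_cmp(ClusteredTid *a, ClusteredTid *b)
    {
        Size    alen = VARSIZE(a) - VARHDRSZ;
        Size    blen = VARSIZE(b) - VARHDRSZ;
        int     cmp = memcmp(a->key, b->key, Min(alen, blen));

        if (cmp != 0)
            return cmp;
        return (alen > blen) - (alen < blen);
    }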

Do note that this is all about the TID supplied by a TableAM that
would implement persistent clustering, which is how MySQL has
implemented tables. I am unaware of a scheme that would maintain
persistent clustering/ordering across the table on key columns without
also leaking these key columns into the identifier that is used to
find the tuple in the table.

With regards,

Matthias van de Meent



Re: MaxOffsetNumber for Table AMs

From
Peter Geoghegan
Date:
On Thu, May 6, 2021 at 4:10 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> See below. I'm not saying we need it _right now_, but at the very
> least I'd like to argue that we should not close the door on varlena
> TIDs, because there _are_ reasons for those TID types. See also below.

Perhaps I was a bit too strident. It's a question of trade-offs.

> Although I agree that abstract definitions are tools, I disagree that
> they're seldom useful as a starting point. Many have implemented b
> (plus) trees based on the original paper exactly due to the guarantees
> that are provided by the abstract definition as defined in the paper.

I was unclear. I agree with this much. I don't think it applies in
this situation, though.

There are many competing considerations besides that. What I'm
concerned about is avoiding a strategically ruinous direction that
works against our strengths -- we should play to our strengths. For
example, it's no coincidence that Postgres is the only system that has
B-Tree deduplication and enables it by default, with next to no
downside. This is more or less a consequence of the fact that indexes
in Postgres don't need their own locks, unlike any relational DB system
that more or less uses a design like ARIES -- indexes are just
physical data structures that back the logical database. They are not
part of the logical database per se.

Consider a 2015 paper from Goetz Graefe about index locking, entitled
"Orthogonal key-value locking":

https://subs.emis.de/LNI/Proceedings/Proceedings241/237.pdf

Some kind of locking in indexes (something like ARIES/IM or ARIES/KVL)
is needed to make TIDs stable identifiers of logical rows -- even in a
system like Oracle (but most obviously in a system like InnoDB or SQL
Server). According to Graefe, "further improvements [to index
concurrency control] are possible despite multiple available
techniques and decades with little progress". This is why terms like
"ruinously expensive" don't feel like hyperbole when talking about
pursuing TID stability/clustered indexes/whatever -- it is a radically
different design. And maybe even a basically inferior design these
days, all things considered.

I think that the simple approach to storage taken by Postgres has aged
incredibly well -- there are many ways in which it suits modern
hardware more than traditional designs. Modern hardware is generally
much more sensitive to the cost of any kind of synchronization [1],
but is otherwise very fast in other ways -- rather a reversal from the
environment that ARIES was created in. While there are certainly
downsides to the Postgres approach to storage, concurrency control and
recovery, these downsides can be fixed directly. Including in the ways
we've been discussing on other threads these past few months. I think
that we can push the idea of teaching heapam (and maybe nbtree) just
enough about the logical database to not make really dumb decisions in
some cases. Why not see how far that can go first?

Graefe's perspective in the key-value locking paper is roughly the
Microsoft SQL Server perspective. It's not hard for me to imagine why
he undertook to research index locking used for concurrency control
(not to be confused with what they call latches and what we call
LWLocks) while still working for Microsoft. The TPC-E benchmark is
something that Microsoft alone targets with SQL Server (no other DB
system has an official entry on the TPC website). Pity TPC-E isn't
more successful, since it seems to have a lot more real world
relevance than TPC-C. I bet we could learn a lot from it because it's
actually realistic (TPC-C isn't that realistic, and TPC-B/pgbench is
the furthest possible thing from reality).

The best analysis I've been able to find about TPC-E is probably "From
A to E: Analyzing TPC’s OLTP Benchmarks" [2]. It concludes that the
big bottleneck for TPC-E is "logical lock contention", which is the
overwhelming bottleneck at high client counts. Though I haven't run
TPC-E properly myself (because of issues with bitrot), I'd speculate
that this gives Postgres an edge in TPC-E. I've heard quite a few
reports of Postgres having much faster writes compared to any of the
big proprietary systems. Some of these reports are many years old. And
yet we routinely seem to talk about our approach as if it's obviously
inferior. The main weakness that Postgres storage has seems to me to
be largely in the area of stability, perhaps only in theoretically
rare or extreme cases that nevertheless cause real problems in the
real world.

Of course I cannot prove that adopting stable TIDs necessarily means
accepting a similar burden from lock contention in an affected table
AM, or in general. I cannot prove much of anything when we're
considering such an abstract question. But perhaps this gives you a
better idea about where my skepticism comes from.

> When trying to create completely new things I would agree that
> starting with the abstract is probably not the right idea, but we're
> not in a green field. There are many examples for storage engines /
> table AMs outside PostgreSQL, and I believe that we can take learnings
> from those for the potential extendability of PostgreSQL.

I agree, up to a point. Perhaps the reason I came up with bottom-up
index deletion was that my reading of the DB research literature led
me to consider the logical database in the context of index AMs. Turns
out we can use a dirt cheap technique to give B-Tree indexes some
basic ideas about the logical database, without having to go to the
expense of making nbtree truly have authoritative information about
it. A little appreciation of other designs can go a long way. We can
afford to have weaknesses, because we clearly also have strengths. We
only need to not have an Achilles' heel.

> "Delete marking in indexes: This will allow inplace updates even when
> index columns are updated and additionally with this we can avoid the
> need for a dedicated vacuum process to perform retail deletes."
>
> If I read this correctly, this means asking the index AM to delete the
> specific index tuple corresponding with the deleted item AKA retail
> index tuple deletion.

Dealing with this at the level of nbtree/the physical data structure
is very easy. The problem is with concurrency control, especially
within index AMs. This is about the logical contents of the database.
Being able to think of indexes as basically just data structures is a
huge luxury for us.

> Then would you also agree that for persistently clustered tables to
> work in any form of efficiency, they would need to support variable
> length identifiers? Let me explain my reasoning:

Yes, I definitely accept that. It must be true that clustered indexes
(not to be confused with indirect indexes + a heap) have *some* merit.
I just doubt that it's a promising area
for us. The cost/risks are very high, and the benefits are probably
quite low. We don't have to tackle the fundamental problem in order to
add a column store table AM with logical-ish TID-like identifiers --
at least not according to Jeff. Why not consider a low cost solution
like that for each table AM?

> If the TableAM TID contains information about it's physical location,
> we cannot implement persistent clustered tables due to instability of
> this physical location of the base tuple (see also how we cannot
> guarantee the physical location of an index tuple in the index, only
> the logical location).
> That being the case, the TID must be fully logical. I hope you agree
> at least up to here.

Absolutely. Agreed.

> If the TID must share an ordering with the clustering of the table,
> then it should also have a similar key space as the clustering: I
> could cluster a table on a text column, but any fixed size TID cannot
> ever dream to be used to correctly identify the position of a certain
> text string within an arbitrary sorted list of text strings. You'll
> have to use that text string to find it's location in the list. As
> such, I believe that the tuple identifier (or also: the identifier
> using which we can locate the tuple data corresponding to that
> identifier) for clustered tables / index oriented tables is at the
> very least the set of columns that that table is clustered on.

I agree. I think that looking at MySQL/InnoDB here might be very useful.

InnoDB secondary indexes use stable identifiers of the logical row in
the table -- which is just the primary key value for the relevant row
in all cases. Systems like Oracle and SQL Server can use TIDs/RIDs
with a heap based table, which is stable in the same way (sometimes at
great cost elsewhere). All three systems are more or less the same as
each other as far as the fundamental nature of TID-like identifiers is
concerned (and rather unlike Postgres) -- TIDs/RIDs/pointers in
secondary indexes are all stable identifiers of logical rows. Andy
Pavlo has said that InnoDB/Oracle TIDs are logical identifiers, while
Postgres TIDs are physical identifiers. (I don't find that particular
terminology useful because it confuses issues elsewhere, but I know
what he meant.)

No less an authority than Mark Callaghan (well known in the MySQL
community for his work at Google and later Facebook) did a comparative
benchmark between Postgres and MySQL (InnoDB) back in January. I'm a
big fan of his analysis of LSM/storage economics stuff, so naturally I
read his report on this with great interest. The high level outcome
was that Postgres 13 was mostly faster than MySQL 8, and Postgres 11
was mostly slower than MySQL 5.6:

https://smalldatum.blogspot.com/2021/01/sysbench-postgres-vs-mysql-and-impact.html

But the details are what really interests me. One very notable
performance gap was between MySQL 8 + InnoDB and Postgres 13 with
updates that affect an indexed column. Apparently one system could be
as much as 4x slower with sysbench index updates, which was notably the
straggler among the tests performed.
Reportedly the dreaded write amplification from updates is to blame.
Just like that tawdry Uber blog post from 2016 warned of!

But guess what? *Postgres* was the system that was faster and had
*significantly less* write amplification from updates to secondary
indexes -- to the extent that it really stood out as an *advantage*
(for this particular set of tests). The pertinent information about
write amplification with index updates is under the "Write-heavy"
section: "The largest difference is for the 2nd test (update-index)
that requires index maintenance." ... "Write-amp (wKB/o) is worse for
MySQL especially for the update-index test". So...what do you think of
that?

This analysis was based on Postgres 13, not Postgres 14, so presumably
we'd do better now that we have bottom-up index deletion.

I don't mean to suggest that clearly we're living in opposite land,
and Postgres was actually the one that did better with index updates
all along -- as Callaghan says, it's one benchmark that shouldn't be
overinterpreted and is mostly useful to get the right high level
intuitions about how each system performs. I believe that sysbench has
2 or 3 indexes here, including the PK, and I wouldn't be surprised if
the final result flipped in favor of InnoDB with more indexes -- even
with bottom-up index deletion. All I'm saying is that it seems
extremely premature to assume that there are very significant
advantages to generalizing TID-like identifiers to be stable, be it in
a clustered index table AM design like InnoDB or some other kind of
table AM. Maybe there are huge advantages, *and* maybe it's even worth
the enormous trouble and risk (to say nothing of the opportunity cost
of not improving what already seems to work quite well most of the
time). But I tend to doubt it, and I'd prefer to work on simple
incremental approaches that already seem to be working.

[1] https://pathelland.substack.com/p/i-am-so-glad-im-uncoordinated
[2] https://www.researchgate.net/publication/262275971_From_A_to_E_Analyzing_TPC's_OLTP_Benchmarks_--_The_obsolete_the_ubiquitous_the_unexplored
--
Peter Geoghegan