Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: MaxOffsetNumber for Table AMs
Date
Msg-id CAEze2Wjms+6WZqa2qo+g19tTSdksBO-gmLDAxC=0LF5WX_eXkw@mail.gmail.com
Whole thread Raw
In response to Re: MaxOffsetNumber for Table AMs  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: MaxOffsetNumber for Table AMs
Re: MaxOffsetNumber for Table AMs
List pgsql-hackers
On Wed, 5 May 2021 at 19:15, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 5, 2021 at 9:42 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, May 5, 2021 at 11:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I'm being very vocal here because I'm concerned that we're going about
> > > generalizing TIDs in the wrong way. To me it feels like there is a
> > > loss of perspective about what really matters.
> >
> > Well, which things matter is a question of opinion, not fact.
>
> I'm not trying to win an argument here. I am giving an opinion in the
> hopes that it leads to some kind of useful synthesis, based on all of
> our opinions.
>
> > > No other database system has something like indirect indexes. They
> > > have clustered indexes, but that's rather different.
> >
> > I don't think this is true at all. If you have a clustered index -
> > i.e. the table is physically arranged according to the index ordering
> > - then your secondary indexes all pretty much have to be what we're
> > calling indirect indexes. They can hardly point to a physical
> > identifier if rows are being moved around. I believe InnoDB works this
> > way, and I think Oracle's index-organized tables do too. I suspect
> > there are other examples.
>
> But these systems don't have indirect indexes *on a heap table*! Why
> would they ever do it that way? They already have rowid/TID as a
> stable identifier of logical rows, so having indirect indexes that
> point to a heap table's rows would be strictly worse than the generic
> approach for indexes on a heap table.
>
> What we call indirect indexes seem to me to be a failed attempt to
> solve the "TID is not a stable identifier of logical row" issue that
> is baked-in to Postgres. If I thought it was worth solving that
> problem then I suppose I'd solve it directly. The "indirection" of
> indirect indexes actuallys buys you nothing! It just moves some of the
> problem somewhere else, at the cost of even more complexity. Indirect
> indexes (without a clustered index) are a muddled idea.
>
> Of course I accept that clustered indexes make sense in general
> (though less and less these days). But the fact that these systems
> "use indirect indexes" for secondary indexes is precisely why
> clustered indexes don't seem like a great design with modern hardware!
> Should we invest a huge amount of work in order to have all of the
> disadvantages, and none of the advantages?
>
> > My point is that so far I am not seeing a whole lot of value of this
> > proposed approach. For a 64-bit TID to be valuable to you, one of two
> > things has to be true: you either don't care about having indexes that
> > store TIDs on your new table type, or the index types you want to use
> > can store those 64-bit TIDs. Now, I have not yet heard of anyone
> > working on a table AM who does not want to be able to support adding
> > btree indexes. There may be someone that I don't know about, and if
> > so, fine. But otherwise, we need a way to store them. And that
> > requires changing the page format for btree indexes. But surely we do
> > not want to make all TIDs everywhere wider in future btree versions,
> > so at least two TID widths - 6 bytes and 8 bytes - would have to be
> > supported.
>
> I agree that we don't want a performance/space overhead for simple
> cases that are quite happy with the current format.
>
> > And if we're at all going to do that, I think it's
> > certainly worth asking whether supporting varlena TIDs would really be
> > all that much harder. You seem to think it is, and you might be right,
> > but I'm not ready to give up, because I do not see how we are ever
> > going to get global indexes or indirect indexes without doing it, and
> > those would be good features to have.
>
> I think that global indexes are well worth having, and should be
> solved some completely different way. The partition key can be an
> additive thing.

I believe that it cannot be "just" an additive thing, at least not
through a normal INCLUDEd column, as you'd get duplicate TIDs in the
index, with its related problems. You also cannot add it as a key
column, as this would disable UNIQUE indexes; one of the largest use
cases of global indexes. So, you must create specialized
infrastructure for this identifier.

And when we're already adding specialized infrastructure, then this
should probably be part of a new TID infrastructure.

And if we're going to change TID infrastructure to allow for more
sizes (as we'd need normal TableAM TIDs, and global index
partition-identifying TIDs), I'd argue that it should not be too much
more difficult to create an infrastructure for 'new TID' in which the
table AM supplies type, size and strict ordering information for these
'new TID's.

And if this 'new TID' size is not going to be defined by the index AM
but by the indexed object (be it a table or a 'global' or whatever
we'll build indexes on), I see no reason why this 'new TID'
infrastructure couldn't eventually support variable length TIDs; or
constant sized usertype TIDs (e.g. the 3 int columns of the primary
key of a clustered table).

The only requirements that I believe to be fundamental for any kind of TID are

1.) Uniqueness during the lifecycle of the tuple, from creation to
life to dead to fully dereferenced from all indexes;
2.) There exists a strict ordering of all TIDs of that type;

And maybe to supply some form of efficiency to the underlying tableAM:

3.) There should be an equivalent of bitmap for that TID type.

For the nbtree deduplication subsystem, and for gin posting lists to
be able to work efficiently, the following must also hold:

4.) The TID type has a fixed size, preferably efficiently packable.

Only the last requirement cannot be met with varlena TID types. But,
as I also believe that not all indexes can be expected to work (well)
for all kinds of TableAM, I don't see how this would be a blocking
issue.

> I strongly suspect that indirect indexes (without a
> clustered index) are 100% useless in both theory and practice, so
> naturally I have little to no interest.
>
> The difficulty of supporting (say) 6 byte and 8 byte TIDs together is
> vastly lower than variable-width TIDs, for all kinds of reasons. See
> my remarks to Andres upthread about deduplication.

I believe that deduplication is amazing when it works, but it should
not be a blocker for new TID infrastructure (assuming it still works
by default for nbtree+heap). As an example: numeric columns cannot be
deduplicated, and that wasn't considered a blocker for deduplication
to be merged.

The only reasons I can think of why varlena TIDs cannot be efficiently
deduplicated is the storage and search efficiency: I cannot binary
search in a packed array of varlena attributes, but I can binary
search through packed fixed-size attributes. Any fixed-size TID can
realistically be deduplicated, assuming enough effort is put into the
patch implementing the new TID infrastructure.

> > If we can't ever get them, so be it, but you seem to kind of be saying
> > that things like global indexes and indirect indexes are hard, and
> > therefore they don't count as reasons why we might want variable-width
> > TIDs.But one very large reason why those things are hard is that they
> > require variable-width TIDs, so AFAICS this boils down to saying that
> > we don't want the feature because it's hard to implement.
>
> More like very hard to implement for a very low benefit.

Complicated optimizations have been merged which had only a small gain.

Storage gains for index-oriented tables can become as large as the
size of the primary key by not having to store all primary key values
in both the index and the table; which can thus be around 100% of a
table in the least efficient cases of having a PK over all columns.

Yes, this might be indeed only a 'small gain' for access latency, but
not needing to store another copy of your data (and keeping it in
cache, etc.) is a significant win in my book.


Regarding losing deduplication in btrees when we have varlena TIDs:
This loss in [storage] efficiency can be partially mitigated by
implementing prefix truncation/prefix deduplication, and as such that
loss would not necessarily be too problematic when PT/PD is
implemented.



With regards,

Matthias van de Meent



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: MaxOffsetNumber for Table AMs
Next
From: Tom Lane
Date:
Subject: Dubious assertion in RegisterDynamicBackgroundWorker