Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: MaxOffsetNumber for Table AMs
Date
Msg-id CAH2-WzkR1jxNxse=_9xjh7hCw5+=OPSd5sACRQjjik13GMHSJw@mail.gmail.com
Whole thread Raw
In response to Re: MaxOffsetNumber for Table AMs  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
On Thu, May 6, 2021 at 4:10 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> See below. I'm not saying we need it _right now_, but at the very
> least I'd like to argue that we should not close the door on varlena
> TIDs, because there _are_ reasons for those TID types. See also below.

Perhaps I was a bit too strident. It's a question of trade-offs.

> Although I agree that abstract definitions are tools, I disagree that
> they're seldom useful as a starting point. Many have implemented b
> (plus) trees based on the original paper exactly due to the guarantees
> that are provided by the abstract definition as defined in the paper.

I was unclear. I agree with this much. I don't think it applies in
this situation, though.

There are many competing considerations besides that. What I'm
concerned about is avoiding a strategically ruinous direction that
works against our strengths -- we should play to our strengths. For
example, it's no coincidence that Postgres is the only system that has
B-Tree deduplication and enables it by default, with next to no
downside. This is more or less a consequence of the fact that indexes
in Postgres don't need their own locks, unlike any relation DB system
that more or less uses a design like ARIES -- indexes are just
physical data structures that back the logical database. They are not
part of the logical database per se.

Consider a 2015 paper from Goetz Graefe about index locking, entitled
"Orthogonal key-value locking":

https://subs.emis.de/LNI/Proceedings/Proceedings241/237.pdf

Some kind of locking in indexes (something like ARIES/IM or ARIES/KVL)
is needed to make TIDs stable identifiers of logical rows -- even in a
system like Oracle (but most obviously in a system like InnoDB or SQL
Server). According to Graefe, "further improvements [to index
concurrency control] are possible despite multiple available
techniques and decades with little progress". This is why terms like
"ruinously expensive" don't feel like hyperbole when talking about
pursuing TID stability/clustered indexes/whatever -- it is a radically
different design. And maybe even a basically inferior design these
days, all things considered.

I think that the simple approach to storage taken by Postgres has aged
incredibly well -- there are many ways in which it suits modern
hardware more than traditional designs. Modern hardware is generally
much more sensitive to the cost of any kind of synchronization [1],
but is otherwise very fast in the ways -- rather a reversal from the
environment that ARIES was created in. While there are certainly
downsides to the Postgres approach to storage, concurrency control and
recovery, these downsides can be fixed directly. Including in the ways
we've been discussing on other threads these past few months. I think
that we can push the idea of teaching heapam (and maybe nbtree) just
enough about the logical database to not make really dumb decisions in
some cases. Why not see how far that can go first?

Graefe's perspective in the key-value locking paper is roughly the
Microsoft SQL Server perspective. It's not hard for me to imagine why
he undertook to research index locking used for concurrency control
(not to be confused with what they call latches and what we call
LWLocks) while still working for Microsoft. The TPC-E benchmark is
something that Microsoft alone targets with SQL Server (no other DB
system has an official entry on the TPC website). Pity TPC-E isn't
more successful, since it seems to have a lot more real world
relevance than TPC-C. I bet we could learn a lot from it because it's
actually realistic (TPC-C isn't that realistic, and TPC-B/pgbench is
the furthest possible thing from reality).

The best analysis I've been able to find about TPC-E is probably "From
A to E: Analyzing TPC’s OLTP Benchmarks" [2]. It concludes that the
big bottleneck for TPC-E is "logical lock contention", which is the
overwhelming bottleneck at high client counts. Though I haven't run
TPC-E properly myself (because of issues with bitrot), I'd speculate
that this gives Postgres an edge in TPC-E. I've heard quite a few
reports of Postgres having much faster writes compared to any of the
big proprietary systems. Some of these reports are many years old. And
yet we routinely seem to talk about our approach as if it's obviously
inferior. The main weakness that Postgres storage has seems to me to
be largely in the area of stability, perhaps only in theoretically
rare or extreme cases that nevertheless cause real problems in the
real world.

Of course I cannot prove that adopting stable TIDs necessarily means
accepting a similar burden from lock contention in an affected table
AM, or in general. I cannot prove much of anything when we're
considering such an abstract question. But perhaps this gives you a
better idea about where my skepticism comes from.

> When trying to create completely new things I would agree that
> starting with the abstract is probably not the right idea, but we're
> not in a green field. There are many examples for storage engines /
> table AMs outside PostgreSQL, and I believe that we can take learnings
> from those for the potential extendability of PostgreSQL.

I agree, up to a point. Perhaps the reason I came up with bottom-up
index deletion was that my reading of the DB research literature led
me to consider the logical database in the context of index AMs. Turns
out we can use a dirt cheap technique to give B-Tree indexes some
basic ideas about the logical database, without having to go to the
expense of making nbtree truly have authoritative information about
it. A little appreciation of other designs can go a long way. We can
afford to have weaknesses, because we clearly also have strengths. We
only need to not have an Achilles' heel.

> "Delete marking in indexes: This will allow inplace updates even when
> index columns are updated and additionally with this we can avoid the
> need for a dedicated vacuum process to perform retail deletes."
>
> If I read this correctly, this means asking the index AM to delete the
> specific index tuple corresponding with the deleted item AKA retail
> index tuple deletion.

Dealing with this at the level of nbtree/the physical data structure
is very easy. The problem is with concurrency control, especially
within index AMs. This is about the logical contents of the database.
Being able to think of indexes as basically just data structures is a
huge luxury for us.

> Then would you also agree that for persistently clustered tables to
> work in any form of efficiency, they would need to support variable
> length identifiers? Let me explain my reasoning:

Yes, I definitely accept that. It must be true that clustered indexes
(not to be confused with indirect indexes + a heap) have *some* merit
-- that much has to be true. I just doubt that it's a promising area
for us. The cost/risks are very high, and the benefits are probably
quite low. We don't have to tackle the fundamental problem in order to
add a column store table AM with logical-ish TID-like identifiers --
at least not according to Jeff. Why not consider a low cost solution
like that for each table AM?

> If the TableAM TID contains information about it's physical location,
> we cannot implement persistent clustered tables due to instability of
> this physical location of the base tuple (see also how we cannot
> guarantee the physical location of an index tuple in the index, only
> the logical location).
> That being the case, the TID must be fully logical. I hope you agree
> at least up to here.

Absolutely. Agreed.

> If the TID must share an ordering with the clustering of the table,
> then it should also have a similar key space as the clustering: I
> could cluster a table on a text column, but any fixed size TID cannot
> ever dream to be used to correctly identify the position of a certain
> text string within an arbitrary sorted list of text strings. You'll
> have to use that text string to find it's location in the list. As
> such, I believe that the tuple identifier (or also: the identifier
> using which we can locate the tuple data corresponding to that
> identifier) for clustered tables / index oriented tables is at the
> very least the set of columns that that table is clustered on.

I agree. I think that looking at MySQL/InnoDB here might be very useful.

InnoDB secondary indexes use stable identifiers of the logical row in
the table -- which is just the primary key value for the relevant row
in all cases. Systems like Oracle and SQL Server can use TIDs/RIDs
with a heap based table, which is stable in the same way (sometimes at
great cost elsewhere). All three systems are more or less the same as
each other as far as the fundamental nature of TID-like identifiers is
concerned (and rather unlike Postgres) -- TIDs/RIDs/pointers in
secondary indexes are all stable identifiers of logical rows. Andy
Pavlo has said that InnoDB/Oracle TIDs are logical identifiers, while
Postgres TIDs are physical identifiers. (I don't find that particular
terminology useful because it confuses issues elsewhere, but I know
what he meant.)

No less an authority than Mark Callaghan (well known in the MySQL
community for his work at Google and later Facebook) did a comparative
benchmark between Postgres and MySQL (InnoDB) back in January. I'm a
big fan of his analysis of LSM/storage economics stuff, so naturally I
read his report on this with great interest. The high level outcome
was that Postgres 13 was mostly faster than MySQL 8, and Postgres 11
was mostly slower than MySQL 5.6:

https://smalldatum.blogspot.com/2021/01/sysbench-postgres-vs-mysql-and-impact.html

But the details are what really interests me. One very notable
performance gap was between MySQL 8 + InnoDB and Postgres 13 with
updates that affect an indexed column. Apparently one system could be
as much as 4x slower with sysbench index updates, which is notably the
straggler among the tests performed, for one of the systems.
Reportedly the dreaded write amplification from updates is to blame.
Just like that tawdry Uber blog post from 2016 warned of!

But guess what? *Postgres* was the system that was faster and had
*significantly less* write amplification from updates to secondary
indexes -- to the extent that it really stood out as an *advantage*
(for this particular set of tests). The pertinent information about
write amplification with index updates is under the "Write-heavy"
section: "The largest difference is for the 2nd test (update-index)
that requires index maintenance." ... "Write-amp (wKB/o) is worse for
MySQL especially for the update-index test". So...what do you think of
that?

This analysis was based on Postgres 13, not Postgres 14, so presumably
we'd do better now that we have bottom-up index deletion.

I don't mean to suggest that clearly we're living in opposite land,
and Postgres was actually the one that did better with index updates
all along -- as Callaghan says, it's one benchmark that shouldn't be
overinterpreted and is mostly useful to get the right high level
intuitions about how each system performs. I believe that sysbench has
2 or 3 indexes here, including the PK, and I wouldn't be surprised if
the final result flipped in favor of InnoDB with more indexes -- even
with bottom-up index deletion. All I'm saying is that it seems
extremely premature to assume that there are very significant
advantages to generalizing TID-like identifiers to be stable, be it in
a clustered index table AM design like InnoDB or some other kind of
table AM. Maybe there are huge advantages, *and* maybe it's even worth
the enormous trouble and risk (to say nothing of the opportunity cost
of not improving what already seems to work quite well most of the
time). But I tend to doubt it, and I'd prefer to work on simple
incremental approaches that already seem to be working.

[1] https://pathelland.substack.com/p/i-am-so-glad-im-uncoordinated
[2]
https://www.researchgate.net/publication/262275971_From_A_to_E_Analyzing_TPC's_OLTP_Benchmarks_--_The_obsolete_the_ubiquitous_the_unexplored
--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Takashi Menjo
Date:
Subject: Map WAL segment files on PMEM as WAL buffers
Next
From: Amit Kapila
Date:
Subject: Re: Forget close an open relation in ReorderBufferProcessTXN()