Re: MaxOffsetNumber for Table AMs - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: MaxOffsetNumber for Table AMs |
Date | |
Msg-id | CAH2-WzkR1jxNxse=_9xjh7hCw5+=OPSd5sACRQjjik13GMHSJw@mail.gmail.com Whole thread Raw |
In response to | Re: MaxOffsetNumber for Table AMs (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
List | pgsql-hackers |
On Thu, May 6, 2021 at 4:10 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > See below. I'm not saying we need it _right now_, but at the very > least I'd like to argue that we should not close the door on varlena > TIDs, because there _are_ reasons for those TID types. See also below. Perhaps I was a bit too strident. It's a question of trade-offs. > Although I agree that abstract definitions are tools, I disagree that > they're seldom useful as a starting point. Many have implemented b > (plus) trees based on the original paper exactly due to the guarantees > that are provided by the abstract definition as defined in the paper. I was unclear. I agree with this much. I don't think it applies in this situation, though. There are many competing considerations besides that. What I'm concerned about is avoiding a strategically ruinous direction that works against our strengths -- we should play to our strengths. For example, it's no coincidence that Postgres is the only system that has B-Tree deduplication and enables it by default, with next to no downside. This is more or less a consequence of the fact that indexes in Postgres don't need their own locks, unlike any relation DB system that more or less uses a design like ARIES -- indexes are just physical data structures that back the logical database. They are not part of the logical database per se. Consider a 2015 paper from Goetz Graefe about index locking, entitled "Orthogonal key-value locking": https://subs.emis.de/LNI/Proceedings/Proceedings241/237.pdf Some kind of locking in indexes (something like ARIES/IM or ARIES/KVL) is needed to make TIDs stable identifiers of logical rows -- even in a system like Oracle (but most obviously in a system like InnoDB or SQL Server). According to Graefe, "further improvements [to index concurrency control] are possible despite multiple available techniques and decades with little progress". This is why terms like "ruinously expensive" don't feel like hyperbole when talking about pursuing TID stability/clustered indexes/whatever -- it is a radically different design. And maybe even a basically inferior design these days, all things considered. I think that the simple approach to storage taken by Postgres has aged incredibly well -- there are many ways in which it suits modern hardware more than traditional designs. Modern hardware is generally much more sensitive to the cost of any kind of synchronization [1], but is otherwise very fast in the ways -- rather a reversal from the environment that ARIES was created in. While there are certainly downsides to the Postgres approach to storage, concurrency control and recovery, these downsides can be fixed directly. Including in the ways we've been discussing on other threads these past few months. I think that we can push the idea of teaching heapam (and maybe nbtree) just enough about the logical database to not make really dumb decisions in some cases. Why not see how far that can go first? Graefe's perspective in the key-value locking paper is roughly the Microsoft SQL Server perspective. It's not hard for me to imagine why he undertook to research index locking used for concurrency control (not to be confused with what they call latches and what we call LWLocks) while still working for Microsoft. The TPC-E benchmark is something that Microsoft alone targets with SQL Server (no other DB system has an official entry on the TPC website). Pity TPC-E isn't more successful, since it seems to have a lot more real world relevance than TPC-C. I bet we could learn a lot from it because it's actually realistic (TPC-C isn't that realistic, and TPC-B/pgbench is the furthest possible thing from reality). The best analysis I've been able to find about TPC-E is probably "From A to E: Analyzing TPC’s OLTP Benchmarks" [2]. It concludes that the big bottleneck for TPC-E is "logical lock contention", which is the overwhelming bottleneck at high client counts. Though I haven't run TPC-E properly myself (because of issues with bitrot), I'd speculate that this gives Postgres an edge in TPC-E. I've heard quite a few reports of Postgres having much faster writes compared to any of the big proprietary systems. Some of these reports are many years old. And yet we routinely seem to talk about our approach as if it's obviously inferior. The main weakness that Postgres storage has seems to me to be largely in the area of stability, perhaps only in theoretically rare or extreme cases that nevertheless cause real problems in the real world. Of course I cannot prove that adopting stable TIDs necessarily means accepting a similar burden from lock contention in an affected table AM, or in general. I cannot prove much of anything when we're considering such an abstract question. But perhaps this gives you a better idea about where my skepticism comes from. > When trying to create completely new things I would agree that > starting with the abstract is probably not the right idea, but we're > not in a green field. There are many examples for storage engines / > table AMs outside PostgreSQL, and I believe that we can take learnings > from those for the potential extendability of PostgreSQL. I agree, up to a point. Perhaps the reason I came up with bottom-up index deletion was that my reading of the DB research literature led me to consider the logical database in the context of index AMs. Turns out we can use a dirt cheap technique to give B-Tree indexes some basic ideas about the logical database, without having to go to the expense of making nbtree truly have authoritative information about it. A little appreciation of other designs can go a long way. We can afford to have weaknesses, because we clearly also have strengths. We only need to not have an Achilles' heel. > "Delete marking in indexes: This will allow inplace updates even when > index columns are updated and additionally with this we can avoid the > need for a dedicated vacuum process to perform retail deletes." > > If I read this correctly, this means asking the index AM to delete the > specific index tuple corresponding with the deleted item AKA retail > index tuple deletion. Dealing with this at the level of nbtree/the physical data structure is very easy. The problem is with concurrency control, especially within index AMs. This is about the logical contents of the database. Being able to think of indexes as basically just data structures is a huge luxury for us. > Then would you also agree that for persistently clustered tables to > work in any form of efficiency, they would need to support variable > length identifiers? Let me explain my reasoning: Yes, I definitely accept that. It must be true that clustered indexes (not to be confused with indirect indexes + a heap) have *some* merit -- that much has to be true. I just doubt that it's a promising area for us. The cost/risks are very high, and the benefits are probably quite low. We don't have to tackle the fundamental problem in order to add a column store table AM with logical-ish TID-like identifiers -- at least not according to Jeff. Why not consider a low cost solution like that for each table AM? > If the TableAM TID contains information about it's physical location, > we cannot implement persistent clustered tables due to instability of > this physical location of the base tuple (see also how we cannot > guarantee the physical location of an index tuple in the index, only > the logical location). > That being the case, the TID must be fully logical. I hope you agree > at least up to here. Absolutely. Agreed. > If the TID must share an ordering with the clustering of the table, > then it should also have a similar key space as the clustering: I > could cluster a table on a text column, but any fixed size TID cannot > ever dream to be used to correctly identify the position of a certain > text string within an arbitrary sorted list of text strings. You'll > have to use that text string to find it's location in the list. As > such, I believe that the tuple identifier (or also: the identifier > using which we can locate the tuple data corresponding to that > identifier) for clustered tables / index oriented tables is at the > very least the set of columns that that table is clustered on. I agree. I think that looking at MySQL/InnoDB here might be very useful. InnoDB secondary indexes use stable identifiers of the logical row in the table -- which is just the primary key value for the relevant row in all cases. Systems like Oracle and SQL Server can use TIDs/RIDs with a heap based table, which is stable in the same way (sometimes at great cost elsewhere). All three systems are more or less the same as each other as far as the fundamental nature of TID-like identifiers is concerned (and rather unlike Postgres) -- TIDs/RIDs/pointers in secondary indexes are all stable identifiers of logical rows. Andy Pavlo has said that InnoDB/Oracle TIDs are logical identifiers, while Postgres TIDs are physical identifiers. (I don't find that particular terminology useful because it confuses issues elsewhere, but I know what he meant.) No less an authority than Mark Callaghan (well known in the MySQL community for his work at Google and later Facebook) did a comparative benchmark between Postgres and MySQL (InnoDB) back in January. I'm a big fan of his analysis of LSM/storage economics stuff, so naturally I read his report on this with great interest. The high level outcome was that Postgres 13 was mostly faster than MySQL 8, and Postgres 11 was mostly slower than MySQL 5.6: https://smalldatum.blogspot.com/2021/01/sysbench-postgres-vs-mysql-and-impact.html But the details are what really interests me. One very notable performance gap was between MySQL 8 + InnoDB and Postgres 13 with updates that affect an indexed column. Apparently one system could be as much as 4x slower with sysbench index updates, which is notably the straggler among the tests performed, for one of the systems. Reportedly the dreaded write amplification from updates is to blame. Just like that tawdry Uber blog post from 2016 warned of! But guess what? *Postgres* was the system that was faster and had *significantly less* write amplification from updates to secondary indexes -- to the extent that it really stood out as an *advantage* (for this particular set of tests). The pertinent information about write amplification with index updates is under the "Write-heavy" section: "The largest difference is for the 2nd test (update-index) that requires index maintenance." ... "Write-amp (wKB/o) is worse for MySQL especially for the update-index test". So...what do you think of that? This analysis was based on Postgres 13, not Postgres 14, so presumably we'd do better now that we have bottom-up index deletion. I don't mean to suggest that clearly we're living in opposite land, and Postgres was actually the one that did better with index updates all along -- as Callaghan says, it's one benchmark that shouldn't be overinterpreted and is mostly useful to get the right high level intuitions about how each system performs. I believe that sysbench has 2 or 3 indexes here, including the PK, and I wouldn't be surprised if the final result flipped in favor of InnoDB with more indexes -- even with bottom-up index deletion. All I'm saying is that it seems extremely premature to assume that there are very significant advantages to generalizing TID-like identifiers to be stable, be it in a clustered index table AM design like InnoDB or some other kind of table AM. Maybe there are huge advantages, *and* maybe it's even worth the enormous trouble and risk (to say nothing of the opportunity cost of not improving what already seems to work quite well most of the time). But I tend to doubt it, and I'd prefer to work on simple incremental approaches that already seem to be working. [1] https://pathelland.substack.com/p/i-am-so-glad-im-uncoordinated [2] https://www.researchgate.net/publication/262275971_From_A_to_E_Analyzing_TPC's_OLTP_Benchmarks_--_The_obsolete_the_ubiquitous_the_unexplored -- Peter Geoghegan
pgsql-hackers by date: