"Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters) - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Date |
Msg-id | CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
Responses | Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
List | pgsql-hackers
On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'm starting this new thread to discuss the benefits of B-Tree suffix
> truncation, and to demonstrate how effective it can be at increasing
> fan-out and space utilization with a real-world example. I haven't
> explained why suffix truncation matters very well before now. Now that
> I have a patch that works, I can just show the benefit directly. (By
> the way, there are similar benefits to covering INCLUDE indexes, where
> suffix truncation occurs all the time, and not just
> opportunistically.)
>
> Explanation
> -----------
>
> Pivot tuples describe how the key space will be split up, which will
> *hopefully* be balanced for future insertions. So, apart from the
> obvious issue about the size of pivot tuples, there is a more
> important issue: why unnecessarily commit to certain exact boundaries
> early on, before it's clear how well that will balance values that get
> inserted later?

Actually, this particular example does not really show why general suffix truncation is important. My example did manage to make an index 15% smaller in a practical, workable way, but traditional suffix truncation deserves far less credit for that than I initially claimed. My explanation was about 99% wrong, but my example is still valid. The true explanation for why my patch (the pending v3 of my unique-key-heap-TID patch) avoided so much bloat is very interesting, because it is related to a broader problem. I'll show a simple example where the bloat that my patch can avoid is far greater still, involving a simple single-attribute secondary index. Before I get to that example, I'll give the real explanation.
The real reason for the marked decrease in the level of bloating is that my patch makes insertions into secondary indexes (non-unique indexes) jump immediately to the leaf page that the tuple unambiguously belongs on -- since it now has a unique key, there must be one exact page that the value is supposed to go on. We avoid trying to find a place among pages full of logical duplicates, and so we avoid the risk of "getting tired" [1] and giving up on finding free space that is actually available to us. "Getting tired" tends to produce really inefficient space utilization among leaf pages full of duplicates, at least past a certain tipping point. The whole "getting tired" thing is the root of the problem here, which is why the pending v3 of my patch will remove that code completely (_bt_findinsertloc() is streamlined). The "tipping point" nature of getting tired is of particular concern to me, since it sounds like something that could have been a factor in Uber's much-publicized blog post. :-(

I attach the test case I mentioned, which I show the output from below, with my commentary in parentheses (note that there are "\echo" psql statements whose output you'll also see):

$ psql -f testcase.sql
autovacuum should probably be disabled:
 autovacuum
────────────
 off
(1 row)

psql:testcase.sql:3: NOTICE:  table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE

(Created empty table named "sec_idx_bloat", with a single int4
attribute.)

CREATE INDEX

(Created empty index named "bloated" on that attribute.)
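(As an aside, not part of the attached test case: the inefficient leaf-page space utilization described here can be observed directly with the contrib pgstattuple extension, whose pgstatindex() function reports the average density of the index's leaf pages. This is just a suggested way of measuring the effect, assuming the "bloated" index from the test case exists:)

```sql
-- Observe leaf-page space utilization for the "bloated" index.
-- avg_leaf_density falls as "getting tired" scatters duplicates
-- across half-filled leaf pages.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('bloated');
```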
Initial insert of 1e7 sequential values:
INSERT 0 10000000
"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB

Initial insert of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB

Repeated insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

(That is, VACUUM hasn't made either index smaller yet, though it
probably gave back many more whole index pages to the FSM in the case
of the patch.)

Third insertion of the value 42 1e7 times:
INSERT 0 10000000

(This is where it gets really interesting, because things truly fall
apart for master, whereas the patch case manages to reuse existing
free space in all cases.)

"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB

Fourth insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB

(The patch can still reuse almost all of the space left behind by
VACUUM, though since the index ends up a bit bigger than the original
high watermark size of 645 MB it's not perfectly efficient. Master
flounders even further behind, though.)

To summarize: recognizing that bloat in indexes is correlated with bloat in tables allows us to recycle space in both structures cooperatively, which can be very important. While this example focuses on bloat, there are plenty of other things to be unhappy about here, around buffer lock contention, and around the number of pages that will be dirtied. Random choices about which page to dirty lead to increased random I/O when checkpointing.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtinsert.c;h=907cce072412adf88d41ce9317a795fb25820be2;hb=refs/heads/REL_11_STABLE#l694

--
Peter Geoghegan
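(The attached testcase.sql is not reproduced in this archive page. Based on the echoed steps and output above, it was roughly of the following shape. This is only a sketch, not the actual attachment: the exact DELETE predicate, the size-reporting queries, and the statement ordering are guesses, hedged in comments.)

```sql
\echo autovacuum should probably be disabled:
SHOW autovacuum;

DROP TABLE IF EXISTS sec_idx_bloat;
CREATE TABLE sec_idx_bloat (val int4);
CREATE INDEX bloated ON sec_idx_bloat (val);

\echo Initial insert of 1e7 sequential values:
INSERT INTO sec_idx_bloat SELECT i FROM generate_series(1, 10000000) i;
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Initial insert of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Repeated insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Delete + VACUUM away most of the dups:
-- Hypothetical predicate: the real script deleted 18001072 rows
-- ("most of the dups"), i.e. roughly 90% of the 2e7 duplicates.
DELETE FROM sec_idx_bloat WHERE val = 42 AND random() < 0.9;
VACUUM sec_idx_bloat;
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Third insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Fourth insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";
```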