"Write amplification" is made worse by "getting tired" whileinserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters) - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Date |
Msg-id | CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com
Responses | Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
List | pgsql-hackers
On Wed, Jul 4, 2018 at 9:43 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'm starting this new thread to discuss the benefits of B-Tree suffix
> truncation, and to demonstrate how effective it can be at increasing
> fan-out and space utilization with a real-world example. I haven't
> explained why suffix truncation matters very well before now. Now that
> I have a patch that works, I can just show the benefit directly. (By
> the way, there are similar benefits to covering INCLUDE indexes, where
> suffix truncation occurs all the time, and not just
> opportunistically.)
>
> Explanation
> -----------
>
> Pivot tuples describe how the key space will be split up, which will
> *hopefully* be balanced for future insertions. So, apart from the
> obvious issue about the size of pivot tuples, there is a more
> important issue: why unnecessarily commit to certain exact boundaries
> early on, before it's clear how well that will balance values that get
> inserted later?

Actually, this particular example does not really show why general suffix truncation is important. My example did manage to make an index 15% smaller in a practical, workable way, but traditional suffix truncation deserves far less credit for that than I initially claimed. My explanation was about 99% wrong, but my example is still valid. The true explanation for why my patch (the pending v3 of my unique-key-heap-TID patch) avoided so much bloat is very interesting, because it is related to a broader problem. I'll show a simple example where the bloat that my patch can avoid is far greater still, involving a simple single-attribute secondary index. Before I get to that example, I'll give the real explanation.
The real reason for the marked decrease in the level of bloating is that my patch makes insertions into secondary indexes (non-unique indexes) jump immediately to the leaf page that the tuple unambiguously belongs on -- since it now has a unique key, there must be one exact page that the value is supposed to go on. We avoid trying to find a place among pages full of logical duplicates, and so we avoid the risk of "getting tired" [1] and giving up on finding free space that is actually available to us. "Getting tired" tends to produce really inefficient space utilization among leaf pages full of duplicates, at least past a certain tipping point. The whole "getting tired" thing is the root of the problem here, which is why the pending v3 of my patch will remove that code completely (_bt_findinsertloc() is streamlined). The "tipping point" nature of getting tired is of particular concern to me, since it sounds like something that could have been a factor in Uber's much-publicized blog post. :-(

I attach the test case I mentioned, which I show the output from below, with my commentary in parentheses (note that there are "\echo" psql statements whose output you'll also see):

$ psql -f testcase.sql
autovacuum should probably be disabled:
 autovacuum
────────────
 off
(1 row)

psql:testcase.sql:3: NOTICE:  table "sec_idx_bloat" does not exist, skipping
DROP TABLE
CREATE TABLE

(Created empty table named "sec_idx_bloat", with a single int4
attribute.)

CREATE INDEX

(Created empty index named "bloated" on that attribute.)
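(As an aside, not part of the attached test case: the inefficient leaf-page space utilization described here can be observed directly with the contrib pgstattuple extension, whose pgstatindex() function reports the average density of the index's leaf pages. This is just a suggested way of measuring the effect, assuming the "bloated" index from the test case exists:)

```sql
-- Observe leaf-page space utilization for the "bloated" index.
-- avg_leaf_density falls as "getting tired" scatters duplicates
-- across half-filled leaf pages.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('bloated');
```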
Initial insert of 1e7 sequential values:
INSERT 0 10000000
"bloated" size on master: 214 MB
"bloated" size with patch: 214 MB

Initial insert of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 486 MB
"bloated" size with patch: 430 MB

Repeated insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

Delete + VACUUM away most of the dups:
DELETE 18001072
VACUUM
"bloated" size on master: 757 MB
"bloated" size with patch: 645 MB

(That is, VACUUM hasn't made either index smaller yet, though it
probably gave back many more whole index pages to the FSM in the case
of the patch.)

Third insertion of the value 42 1e7 times:
INSERT 0 10000000

(This is where it gets really interesting, because things truly fall
apart for master, whereas the patch case manages to reuse existing
free space in all cases.)

"bloated" size on master: 1029 MB
"bloated" size with patch: 645 MB

Fourth insertion of the value 42 1e7 times:
INSERT 0 10000000
"bloated" size on master: 1301 MB
"bloated" size with patch: 688 MB

(The patch can still reuse almost all of the space left behind by
VACUUM, though since the index ends up a bit bigger than the original
high watermark size of 645 MB it's not perfectly efficient. Master
flounders even further behind, though.)

To summarize: recognizing that bloat in indexes is correlated with bloat in tables allows us to recycle space in both structures cooperatively, which can be very important. While this example focuses on bloat, there are plenty of other things to be unhappy about here, around buffer lock contention, and around the number of pages that will be dirtied. Random choices about which page to dirty lead to increased random I/O when checkpointing.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtinsert.c;h=907cce072412adf88d41ce9317a795fb25820be2;hb=refs/heads/REL_11_STABLE#l694

--
Peter Geoghegan
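(The attached testcase.sql is not reproduced in this archive page. Based on the echoed steps and output above, it was roughly of the following shape. This is only a sketch, not the actual attachment: the exact DELETE predicate, the size-reporting queries, and the statement ordering are guesses, hedged in comments.)

```sql
\echo autovacuum should probably be disabled:
SHOW autovacuum;

DROP TABLE IF EXISTS sec_idx_bloat;
CREATE TABLE sec_idx_bloat (val int4);
CREATE INDEX bloated ON sec_idx_bloat (val);

\echo Initial insert of 1e7 sequential values:
INSERT INTO sec_idx_bloat SELECT i FROM generate_series(1, 10000000) i;
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Initial insert of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Repeated insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Delete + VACUUM away most of the dups:
-- Hypothetical predicate: the real script deleted 18001072 rows
-- ("most of the dups"), i.e. roughly 90% of the 2e7 duplicates.
DELETE FROM sec_idx_bloat WHERE val = 42 AND random() < 0.9;
VACUUM sec_idx_bloat;
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Third insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";

\echo Fourth insertion of the value 42 1e7 times:
INSERT INTO sec_idx_bloat SELECT 42 FROM generate_series(1, 10000000);
SELECT pg_size_pretty(pg_relation_size('bloated')) AS "bloated size";
```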