Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters)
Msg-id CAH2-Wzm9EQJdOsQRuus293QG64rHcC1hOFAZ5+_8JNm35m1c1w@mail.gmail.com
In response to Re: "Write amplification" is made worse by "getting tired" while inserting into nbtree secondary indexes (Was: Why B-Tree suffix truncation matters) (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

On Tue, Jul 17, 2018 at 1:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This seems like really interesting and important work.  I wouldn't
> have foreseen that the "getting tired" code would have led to this
> kind of bloat (even if I had known about it at all).

Thanks!

I'm glad that I can come up with concrete, motivating examples like
this, because it's really hard to talk about the big picture here.
With something like a pgbench workload, there are undoubtedly many
different factors in play, since temporal locality influences many
different things all at once. I don't think that I understand it all
just yet. Taking a holistic view of the problem seems very helpful,
but it's also very frustrating at times.

> I wonder,
> though, whether it's possible that the reverse could happen in some
> other scenario.  It seems to me that with the existing code, if you
> reinsert a value many copies of which have been deleted, you'll
> probably find partially-empty pages whose free space can be reused,
> but if there's one specific place where each tuple needs to go, you
> might end up having to split pages if the new TIDs are all larger or
> smaller than the old TIDs.

That's a legitimate concern. After all, what I've done boils down to
adding a restriction on space utilization that wasn't there before.
This clearly helps because it makes it practical to rip out the
"getting tired" thing, but that's not everything. There are good
reasons for that hack, but if latency magically didn't matter then we
could just tear the hack out without doing anything else. That would
make groveling through pages full of duplicates at least as discerning
about space utilization as my patch manages to be, without any of the
complexity.
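
To make that trade-off concrete, here is a minimal sketch in Python. It is
a toy model, not the nbtree code: the function names, the list-of-pages
representation, and the 1% give-up probability passed in as `tired_prob`
are all assumptions chosen for illustration. An insert walks right across
a run of pages full of duplicates, and either finds free space, reaches
the end of the run, or "gets tired" at random and splits where it stands.

```python
import random


def find_insert_page(free_slots, tired_prob, rng):
    """Walk right across a run of leaf pages holding one duplicate value.

    free_slots[i] is the number of free tuple slots on page i (hypothetical
    representation). Returns (page_index, must_split, pages_visited).
    """
    i = 0
    visited = 1
    while free_slots[i] == 0:
        last_page = i == len(free_slots) - 1
        # Stop at the end of the run, or at random ("getting tired").
        if last_page or rng.random() < tired_prob:
            return i, True, visited
        i += 1
        visited += 1
    return i, False, visited


def split_rate(n_full_pages, tired_prob, trials=2000, seed=0):
    """Fraction of inserts that split although the run's last page has room."""
    rng = random.Random(seed)
    free_slots = [0] * n_full_pages + [10]
    splits = sum(
        find_insert_page(free_slots, tired_prob, rng)[1] for _ in range(trials)
    )
    return splits / trials
```

With `tired_prob=0.0` (the "latency doesn't matter" case) every insert
walks the whole run and reuses the free space, at the cost of visiting
every page. With a small nonzero `tired_prob` and a long run, most inserts
give up and split before they ever reach the page that has room, which is
the space-utilization cost being discussed here.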

There is actually a flip side to that downside, though (i.e. the
downside is also an upside): While not filling up leaf pages that have
free space on them is bad, it's only bad when it doesn't leave the
pages completely empty. Leaving the pages completely empty is actually
good, because then VACUUM is in a position to delete entire pages,
removing their downlinks from parent pages. That's a variety of bloat
that we can reverse completely. I suspect that you'll see far more of
that favorable case in the real world with my patch. It's pretty much
impossible to do page deletions with pages full of duplicates today,
because the roughly-uniform distribution of still-live tuples among
leaf pages fails to exhibit any temporal locality. So, maybe my patch
would still come out ahead of simply ripping out "getting tired" in
this parallel universe where latency doesn't matter, and space
utilization is everything.
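
The page deletion argument can be illustrated with another small Python
sketch. Again this is a toy model under stated assumptions, not a
measurement: `count_empty_pages` and its parameters are hypothetical, the
"clustered" case assumes that duplicates sit in insertion order and that
the oldest tuples are the ones deleted, and the "scattered" case assumes
deleted tuples are spread uniformly at random across the pages.

```python
import random


def count_empty_pages(n_pages, per_page, delete_percent, clustered, seed=0):
    """Count leaf pages left with no live tuples after a bulk delete.

    Tuples are numbered 0..n_pages*per_page-1 in page order (hypothetical
    layout); tuple t lives on page t // per_page.
    """
    total = n_pages * per_page
    n_delete = total * delete_percent // 100
    if clustered:
        # The oldest tuples are adjacent, so deletes hit neighboring slots.
        deleted = range(n_delete)
    else:
        # Deleted tuples are scattered uniformly over all pages.
        deleted = random.Random(seed).sample(range(total), n_delete)
    live = [per_page] * n_pages
    for slot in deleted:
        live[slot // per_page] -= 1
    return sum(1 for n in live if n == 0)
```

For example, with 1000 pages of 200 tuples each and 90% of tuples
deleted, the clustered layout leaves 900 pages completely empty, which
VACUUM could then delete. In the scattered layout the chance that any
single page loses all 200 tuples is roughly 0.9^200, so in practice no
page empties out at all, even though the same number of tuples is gone.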

I made one small mistake with my test case: It actually *is* perfectly
efficient at recycling space even at the end, since I don't delete all
the duplicates (just 90% of them). Getting tired might have been a
contributing factor there, too.

-- 
Peter Geoghegan

