Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Date:
Msg-id: CAH2-WzmDkAP9hfin6GeTOktyeX2W28c3NqkQyHpYL=MonmxDfQ@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Victor Yegorov <vyegorov@gmail.com>)
Responses: Re: Deleting older versions in unique indexes to avoid page splits
List: pgsql-hackers
On Wed, Oct 28, 2020 at 4:05 PM Victor Yegorov <vyegorov@gmail.com> wrote:
> I've reviewed v5 of the patch and did some testing.

Thanks!

> I now see what you mean by saying that this patch is a natural and logical
> extension of the deduplication v13 work. I agree with this.

I tried the patch out with a long running transaction yesterday. I
think that the synergy with the v13 deduplication work really helped.
It took a really long time for an old snapshot to lead to pgbench page
splits (relative to the master branch, running a benchmark like the
one I talked about recently -- the fiver, tenner, score, etc. index
benchmark). When the page splits finally started, they seemed much
more gradual -- I don't think that I saw the familiar pattern of
distinct waves of page splits that are clearly all correlated. I think
that the indexes grew at a low steady rate, which looked like the rate
that heap relations usually grow at.

We see a kind of "tick tock" pattern with this new mechanism + v13
deduplication: even when we don't delete very many TIDs, we still free
a few, and then merge the remaining TIDs to buy more time. Very
possibly enough time that a long running transaction goes away by the
time the question of splitting the page comes up again. Maybe there is
another long running transaction by then, but having deleted just a few
of the TIDs the last time around is enough to avoid wasting time on that
block this time around, and therefore to actually succeed despite the
second, newer long running transaction (we can delete the older TIDs,
just not the now-oldest TIDs that the newer long running xact might
still need).
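
To make the "tick tock" idea concrete, here is a toy, self-contained C
sketch -- not the patch's code; the page capacity, struct fields, and
function names are all invented -- of what happens when an incoming
insert finds the leaf page full: first delete whatever dead TIDs the
heap check allows, then deduplicate what's left, and only split if
neither step made room.

    #include <stdbool.h>
    #include <stdio.h>

    #define TOY_PAGE_CAPACITY 100   /* items that fit on a toy "leaf page" */

    typedef struct ToyLeafPage
    {
        int     nitems;         /* items currently on the page */
        int     ndeletable;     /* items the heap check says are dead now */
        int     ndupes;         /* items that deduplication could merge away */
    } ToyLeafPage;

    /* "Tick": free whatever dead items we're allowed to delete right now */
    static void
    toy_bottomup_delete(ToyLeafPage *page)
    {
        page->nitems -= page->ndeletable;
        page->ndeletable = 0;
    }

    /* "Tock": merge remaining duplicates to buy still more time */
    static void
    toy_deduplicate(ToyLeafPage *page)
    {
        page->nitems -= page->ndupes;
        page->ndupes = 0;
    }

    /* Insert one item; returns true only if a page split was unavoidable */
    static bool
    toy_insert(ToyLeafPage *page)
    {
        if (page->nitems >= TOY_PAGE_CAPACITY)
        {
            toy_bottomup_delete(page);  /* even a few freed items may suffice */
            if (page->nitems >= TOY_PAGE_CAPACITY)
                toy_deduplicate(page);
            if (page->nitems >= TOY_PAGE_CAPACITY)
                return true;            /* nothing worked: split */
        }
        page->nitems++;
        return false;
    }

    int
    main(void)
    {
        /* Full page, but 3 items are deletable and 10 more are mergeable */
        ToyLeafPage page = {TOY_PAGE_CAPACITY, 3, 10};

        printf("split needed: %s\n", toy_insert(&page) ? "yes" : "no");
        return 0;
    }

Even freeing just 3 items avoids the split in this example, which is
all the mechanism needs in order to buy more time.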

If this scenario sounds unlikely, bear in mind that "unnecessary" page
splits (which are all we really care about here) are usually only
barely necessary today, if you think about it in a localized/page
level way. What the master branch shows is that most individual
"unnecessary" page splits are in a sense *barely* unnecessary (which
of course doesn't make the consequences any better). We could buy many
hours until the next time the question of splitting a page comes up by
just freeing a small number of tuples -- even on a very busy database.

I found that the "fiver" and "tenner" indexes in particular took a
very long time to have even one page split with a long running
transaction.  Another interesting effect was that all page splits
suddenly stopped when my one hour 30 minute long transaction/snapshot
finally went away -- the indexes stopped growing instantly when I
killed the psql session. But on the master branch the cascading
version driven page splits took at least several minutes to stop when
I killed the psql session/snapshot at that same point of the benchmark
(maybe longer). With the master branch, we can get behind on LP_DEAD
index tuple bit setting, and then have no chance of catching up.
Whereas the patch gives us a second chance for each page.

(I really have only started to think about long running transactions
this week, so my understanding is still very incomplete, and based on
guesses and intuitions.)

> I don't quite like the last sentence. Given that this code is committed,
> I would rather make it:
>
>   … cannot be applied. Therefore we opportunistically check for dead tuples
>     and reuse the space, delaying leaf page splits.
>
> I understand that "we" shouldn't be used here, but I fail to think of a
> proper way to express this.

Makes sense.

> 2. in _bt_dedup_delete_pass() and heap_index_batch_check() you're using some
> constants, like:
> - expected score of 25
> - nblocksaccessed checks for 1, 2 and 3 blocks
> - maybe more, but the ones above caught my attention.
>
> Perhaps, it'd be better to use #define-s here instead?

Yeah. It's still evolving, which is why it's still rough.

It's not easy to come up with a good interface here. Not because it's
very important and subtle. It's actually very *unimportant*, in a way.
nbtree cannot really expect too much from heapam here (though it might
get much more than expected too, when it happens to be easy for
heapam). The important thing is always what happens to be possible at
the local/page level -- the exact preferences of nbtree are not so
important. Beggars cannot be choosers.

It only makes sense to have a "score" like this because sometimes the
situation is so favorable (i.e. there are so many TIDs that can be
killed) that we want to avoid vastly exceeding what is likely to be
useful to nbtree. Actually, this situation isn't that rare (which
maybe means I was wrong to say the score thing was unimportant, but
hopefully you get the idea).

Easily hitting our target score of 25 on the first heap page probably
happens almost all the time when certain kinds of unique indexes use
the mechanism, for example. And when that happens it is nice to only
have to visit one heap block. We're almost sure that it isn't worth
visiting a second, regardless of how many TIDs we're likely to find
there.

> 3. Do we really need to touch another heap page, if all conditions are met?
>
> +           if (uniqueindex && nblocksaccessed == 1 && score == 0)
> +               break;
> +           if (!uniqueindex && nblocksaccessed == 2 && score == 0)
> +               break;
> +           if (nblocksaccessed == 3)
> +               break;
>
> I was really wondering why to look into 2 heap pages. By not doing it straight away,
> we just delay the work for the next occasion that'll work on the same page we're
> processing. I've modified this piece and included it in my tests (see below), I reduced
> 2nd condition to just 1 block and limited the 3rd case to 2 blocks (just a quick hack).

The benchmark that you ran involved indexes that are on a column whose
values are already unique, pgbench_accounts.aid (the extra indexes are
not actually unique indexes, but they could work as unique indexes).
If you actually made them unique indexes then you would have seen the
same behavior anyway.

The 2 heap pages thing is useful with low cardinality indexes. Maybe
that could be better targeted - not sure. Things are still moving
quite fast, and I'm still working on the patch by solving the biggest
problem I see on the horizon. So I will probably change this and then
change it again in the next week anyway.
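
For what it's worth, here is one way the constants from the quoted
snippet might be named, along the lines of your #define suggestion.
None of these names exist in the patch, and the comment about "score"
counting down from the target of 25 is my reading of the discussion
above rather than something taken from the code:

    #include <stdbool.h>

    /* Hypothetical names only -- not from the patch */
    #define BATCH_SCORE_TARGET              25  /* killable TIDs before we stop looking */
    #define UNIQUE_MAX_PROMISING_BLOCKS     1   /* unique index: first heap block usually suffices */
    #define NONUNIQUE_MAX_PROMISING_BLOCKS  2   /* low cardinality case may justify a second block */
    #define MAX_HEAP_BLOCKS_ACCESSED        3   /* hard cap, regardless of index type */

    /*
     * Restates the quoted stopping conditions.  "score" is assumed to
     * count down from BATCH_SCORE_TARGET as killable TIDs are found, so
     * score == 0 means the target has already been met.
     */
    bool
    batch_check_should_stop(bool uniqueindex, int nblocksaccessed, int score)
    {
        if (uniqueindex && nblocksaccessed == UNIQUE_MAX_PROMISING_BLOCKS && score == 0)
            return true;
        if (!uniqueindex && nblocksaccessed == NONUNIQUE_MAX_PROMISING_BLOCKS && score == 0)
            return true;
        if (nblocksaccessed == MAX_HEAP_BLOCKS_ACCESSED)
            return true;
        return false;
    }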

I've had further success micro-optimizing the sorts in heapam.c in the
past couple of days. I think that the regression that I reported can
be further shrunk. To recap, we saw a ~15% loss of throughput/TPS with
16 clients, extreme contention (no rate limiting), several low
cardinality indexes, with everything still fitting in shared_buffers.
It now looks like I can get that down to ~7%, which seems acceptable
to me given the extreme nature of the workload (and given the fact
that we still win on efficiency here -- no index growth).

> I've used scale factor 10 000, adjusted indexes (resulting in a 189GB size database)
> and run the following pgbench:
>
>     pgbench -f testcase.pgbench -r -c32 -j8 -T 3600 bench

> I can see a clear benefit from this patch *under specified conditions, YMMW*
> - 32% increase in TPS
> - 24% drop in average latency
> - most important — stable index size!

Nice. When I did a similar test on October 16th it was on a much
smaller database. I think that I saw a bigger improvement because the
initial DB size was close to shared_buffers. So not going over
shared_buffers makes a much bigger difference. Whereas here the DB
size is several times larger, so there is no question about
significantly exceeding shared_buffers -- it's going to happen for the
master branch as well as the patch. (This is kind of obvious, but
pointing it out just in case.)

> In my opinion, patch provides clear benefits from IO reduction and index size
> control perspective. I really like the stability of operations on patched
> version. I would rather stick to the "restricted" version of the patch though.

You're using EBS here, which probably has much higher latency than
what I have (an NVMe SSD). What you have is probably more
relevant to the real world, though.

> Hope this helps. I'm open to do more tests if necessary.

It's great, thanks!

--
Peter Geoghegan


