Re: Deleting older versions in unique indexes to avoid page splits - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Deleting older versions in unique indexes to avoid page splits
Msg-id: CAH2-WzmDkAP9hfin6GeTOktyeX2W28c3NqkQyHpYL=MonmxDfQ@mail.gmail.com
In response to: Re: Deleting older versions in unique indexes to avoid page splits (Victor Yegorov <vyegorov@gmail.com>)
Responses: Re: Deleting older versions in unique indexes to avoid page splits
List: pgsql-hackers
On Wed, Oct 28, 2020 at 4:05 PM Victor Yegorov <vyegorov@gmail.com> wrote:
> I've reviewed v5 of the patch and did some testing.

Thanks!

> I now see what you mean by saying that this patch is a natural and logical
> extension of the deduplication v13 work.

I agree with this.

I tried the patch out with a long running transaction yesterday. I think that the synergy with the v13 deduplication work really helped. It took a really long time for an old snapshot to lead to pgbench page splits (relative to the master branch, running a benchmark like the one I talked about recently -- the fiver, tenner, score, etc. index benchmark). When the page splits finally started, they seemed much more gradual -- I don't think that I saw the familiar pattern of distinct waves of page splits that are clearly all correlated. I think that the indexes grew at a low, steady rate, which looked like the rate that heap relations usually grow at.

We see a kind of "tick tock" pattern with this new mechanism + v13 deduplication: even when we don't delete very many TIDs, we still free a few, and then merge the remaining TIDs to buy more time. Very possibly enough time that a long running transaction goes away by the time the question of splitting the page comes up again. Maybe there is another long running transaction by then, but deleting just a few of the TIDs the last time around is enough to not waste time on that block this time around, and therefore to actually succeed despite the second, newer long running transaction (we can delete older TIDs, just not the now-oldest TIDs that the newer long running xact might still need).

If this scenario sounds unlikely, bear in mind that "unnecessary" page splits (which are all we really care about here) are usually only barely necessary today, if you think about it in a localized/page-level way. What the master branch shows is that most individual "unnecessary" page splits are in a sense *barely* unnecessary (which of course doesn't make the consequences any better). We could buy many hours until the next time the question of splitting a page comes up by just freeing a small number of tuples -- even on a very busy database.

I found that the "fiver" and "tenner" indexes in particular took a very long time to have even one page split with a long running transaction. Another interesting effect was that all page splits suddenly stopped when my one hour 30 minute long transaction/snapshot finally went away -- the indexes stopped growing instantly when I killed the psql session. But on the master branch the cascading version-driven page splits took at least several minutes to stop when I killed the psql session/snapshot at that same point of the benchmark (maybe longer). With the master branch, we can get behind on LP_DEAD index tuple bit setting, and then have no chance of catching up. Whereas the patch gives us a second chance for each page.

(I really have only started to think about long running transactions this week, so my understanding is still very incomplete, and based on guesses and intuitions.)
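To spell out the "tick tock" order of escalation in code-ish form, it looks roughly like the sketch below. This is only an illustration I'm writing out here -- apart from the _bt_dedup_delete_pass() name, every helper, signature and argument in it is a placeholder I've made up, not the actual patch code:

    static void
    handle_would_be_split(Relation rel, Buffer buf, Size newitemsz)
    {
        Page        page = BufferGetPage(buf);

        /*
         * "tick": delete whatever TIDs turn out to point to dead heap
         * tuples, even when that is only a handful
         */
        _bt_dedup_delete_pass(rel, buf);        /* real name, guessed arguments */

        /*
         * "tock": merge the TIDs that remain (v13 deduplication) to buy
         * more time before the question of splitting the page comes up again
         */
        if (PageGetFreeSpace(page) < newitemsz)
            run_deduplication_pass(rel, buf);   /* placeholder */

        /* only when both of the above fail does the page actually split */
        if (PageGetFreeSpace(page) < newitemsz)
            split_leaf_page(rel, buf, newitemsz);   /* placeholder */
    }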
> I don't quite like the last sentence. Given that this code is committed,
> I would rather make it:
>
> … cannot be applied. Therefore we opportunistically check for dead tuples
> and reuse the space, delaying leaf page splits.
>
> I understand that "we" shouldn't be used here, but I fail to think of a
> proper way to express this.

Makes sense.

> 2. in _bt_dedup_delete_pass() and heap_index_batch_check() you're using some
> constants, like:
> - expected score of 25
> - nblocksaccessed checks for 1, 2 and 3 blocks
> - maybe more, but the ones above caught my attention.
>
> Perhaps, it'd be better to use #define-s here instead?

Yeah. It's still evolving, which is why it's still rough. It's not easy to come up with a good interface here. Not because it's very important and subtle. It's actually very *unimportant*, in a way. nbtree cannot really expect too much from heapam here (though it might get much more than expected too, when it happens to be easy for heapam). The important thing is always what happens to be possible at the local/page level -- the exact preferences of nbtree are not so important. Beggars cannot be choosers.

It only makes sense to have a "score" like this because sometimes the situation is so favorable (i.e. there are so many TIDs that can be killed) that we want to avoid vastly exceeding what is likely to be useful to nbtree. Actually, this situation isn't that rare (which maybe means I was wrong to say the score thing was unimportant, but hopefully you get the idea). Easily hitting our target score of 25 on the first heap page probably happens almost all the time when certain kinds of unique indexes use the mechanism, for example. And when that happens it is nice to only have to visit one heap block. We're almost sure that it isn't worth visiting a second, regardless of how many TIDs we're likely to find there.

> 3. Do we really need to touch another heap page, if all conditions are met?
>
> +        if (uniqueindex && nblocksaccessed == 1 && score == 0)
> +            break;
> +        if (!uniqueindex && nblocksaccessed == 2 && score == 0)
> +            break;
> +        if (nblocksaccessed == 3)
> +            break;
>
> I was really wondering why to look into 2 heap pages. By not doing it
> straight away, we just delay the work for the next occasion that'll work
> on the same page we're processing. I've modified this piece and included
> it in my tests (see below), I reduced 2nd condition to just 1 block and
> limited the 3rd case to 2 blocks (just a quick hack).

The benchmark that you ran involved indexes that are on a column whose values are already unique, pgbench_accounts.aid (the extra indexes are not actually unique indexes, but they could work as unique indexes). If you actually made them unique indexes then you would have seen the same behavior anyway. The 2 heap pages thing is useful with low cardinality indexes. Maybe that could be better targeted -- not sure. Things are still moving quite fast, and I'm still working on the patch by solving the biggest problem I see on the horizon. So I will probably change this and then change it again in the next week anyway.
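Coming back to your point 2: if and when this settles down, the #defines could look something like the sketch below. The constant names are ones I've just invented for illustration, and the limits simply mirror the magic numbers quoted above -- all of it is likely to change before anything is committable:

    /* purely illustrative names -- not from the patch */
    #define BATCH_TARGET_SCORE            25    /* the "expected score" mentioned above */
    #define BATCH_MAX_BLOCKS_UNIQUE        1    /* unique index: one heap block is usually enough */
    #define BATCH_MAX_BLOCKS_NONUNIQUE     2    /* low cardinality index: allow a second block */
    #define BATCH_MAX_BLOCKS_ABSOLUTE      3    /* hard cap in any case */

    /* ...and the quoted checks become: */
    if (uniqueindex && nblocksaccessed == BATCH_MAX_BLOCKS_UNIQUE && score == 0)
        break;
    if (!uniqueindex && nblocksaccessed == BATCH_MAX_BLOCKS_NONUNIQUE && score == 0)
        break;
    if (nblocksaccessed == BATCH_MAX_BLOCKS_ABSOLUTE)
        break;

A nice side effect is that the kind of experiment you describe under point 3 (cutting the non-unique case down to a single block, and the hard cap down to two) becomes a matter of adjusting a constant or two rather than editing the loop itself.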
I've had further success micro-optimizing the sorts in heapam.c in the past couple of days. I think that the regression that I reported can be further shrunk. To recap, we saw a ~15% loss of throughput/TPS with 16 clients, extreme contention (no rate limiting), several low cardinality indexes, and everything still fitting in shared_buffers. It now looks like I can get that down to ~7%, which seems acceptable to me given the extreme nature of the workload (and given the fact that we still win on efficiency here -- no index growth).

> I've used scale factor 10 000, adjusted indexes (resulting in a 189GB size
> database) and run the following pgbench:
>
>   pgbench -f testcase.pgbench -r -c32 -j8 -T 3600 bench
>
> I can see a clear benefit from this patch *under specified conditions, YMMW*
> - 32% increase in TPS
> - 24% drop in average latency
> - most important -- stable index size!

Nice. When I did a similar test on October 16th it was on a much smaller database. I think that I saw a bigger improvement there because the initial DB size was close to shared_buffers, so not going over shared_buffers made a much bigger difference. Whereas here the DB size is several times larger, so there is no question about significantly exceeding shared_buffers -- it's going to happen for the master branch as well as the patch. (This is kind of obvious, but pointing it out just in case.)

> In my opinion, patch provides clear benefits from IO reduction and index
> size control perspective. I really like the stability of operations on
> patched version. I would rather stick to the "restricted" version of the
> patch though.

You're using EBS here, which probably has much higher latency than what I have here (an NVMe SSD). What you have is probably more relevant to the real world, though.

> Hope this helps. I'm open to do more tests if necessary.

It's great, thanks!

--
Peter Geoghegan