Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Date | |
Msg-id | CAH2-Wzmww9kM99iK42nQmXONg564DPseuUDy_mSAi3OeaxU07A@mail.gmail.com |
In response to | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>) |
Responses | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.; Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
List | pgsql-hackers |
On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that it makes sense to enable deduplication by default -- even
> with unique indexes. It looks like deduplication can be very helpful
> with non-HOT updates.

Attached is v20, which adds a custom strategy for the checkingunique (unique index) case to _bt_dedup_one_page(). It also makes deduplication the default for both unique and non-unique indexes. I simply altered your new BtreeDefaultDoDedup() macro from v19 to make nbtree use deduplication wherever it is safe to do so. This default may not be the best one in the end, though deduplication in unique indexes now looks very compelling.

The new checkingunique heuristics added to _bt_dedup_one_page() were developed experimentally, based on pgbench tests. The general idea is to make deduplication *extremely* lazy in the checkingunique case. We want to avoid making _bt_vacuum_one_page() garbage collection less effective by being too aggressive with deduplication -- workloads with lots of non-HOT updates against unique indexes depend heavily on the LP_DEAD bit setting in _bt_check_unique(). At the same time, _bt_dedup_one_page() can be just as effective at delaying page splits as it is with non-unique indexes. (A rough sketch of this priority ordering appears below.)

I've found that my "regular pgbench, but with a low cardinality index on pgbench_accounts(abalance)" benchmark works best with the specific heuristics used in the patch, especially over many hours. I spent nearly 24 hours running the test at full speed (no throttling this time), at scale 500, and with very aggressive autovacuum settings (autovacuum_vacuum_cost_delay=0ms, autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with alternating runs of 4, 8, and 16 clients. Towards the end, the patch had about 5% greater throughput at lower client counts, and never seemed to be significantly slower (it was very slightly slower once or twice, but I think that was just noise).

More importantly, the indexes looked like this on master:

bloated_abalance: 3017 MB
pgbench_accounts_pkey: 2142 MB
pgbench_branches_pkey: 1352 kB
pgbench_tellers_pkey: 3416 kB

And like this with the patch:

bloated_abalance: 1015 MB
pgbench_accounts_pkey: 1745 MB
pgbench_branches_pkey: 296 kB
pgbench_tellers_pkey: 888 kB

* bloated_abalance is about 3x smaller here, as usual -- no surprises there.

* pgbench_accounts_pkey is the most interesting case. You might think that it isn't that great that pgbench_accounts_pkey is 1745 MB with the patch, since it starts out at only 1071 MB (and would go back down to 1071 MB again after a REINDEX). However, bear in mind that it takes a long time to get that big -- the growth over time is what matters here. Even after the first run with 16 clients, it had only reached 1160 MB -- an increase of ~8%. The master case had already reached 2142 MB (its "bloat saturation point") by then, though. I could easily have stopped the benchmark there, or used rate-limiting, or excluded the 16 client case -- that would have allowed me to claim that the growth was under 10% for a workload where the master branch's index doubles in size.

On the other hand, if autovacuum wasn't configured to run very frequently, then the patch wouldn't look nearly this good. Deduplication helped autovacuum by "soaking up" the "recently dead" index tuples that cannot be killed right away. In short, the patch ameliorates weaknesses of the existing garbage collection mechanisms without changing them.
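To make that priority ordering concrete, here is a minimal sketch in C. To be clear, nothing below is code from the patch: the struct, the function names, and the threshold are all invented for illustration, and the real heuristics were tuned experimentally as described above. The only point is the order of operations -- LP_DEAD cleanup first, then deduplication as a deliberately lazy fallback for unique indexes.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a leaf page; not the real nbtree Page
 * representation.
 */
typedef struct SketchPage
{
    int free_space;   /* bytes currently free on the leaf page */
    int ndead;        /* items with the LP_DEAD hint bit set */
} SketchPage;

/* Placeholder threshold -- the real laziness came out of pgbench tests */
#define SKETCH_UNIQUE_LAZY_FREE 512

static bool
sketch_should_dedup(const SketchPage *page, bool checkingunique)
{
    /*
     * 1. If LP_DEAD bits are present (e.g. set by _bt_check_unique()),
     * let simple deletion reclaim that space instead.  Deduplicating
     * now could merge away items that garbage collection would have
     * freed outright.
     */
    if (page->ndead > 0)
        return false;

    /*
     * 2. Unique indexes get an extremely lazy policy: skip dedup while
     * there is still comfortable free space, so that version churn
     * stays visible to LP_DEAD setting and to VACUUM.
     */
    if (checkingunique && page->free_space > SKETCH_UNIQUE_LAZY_FREE)
        return false;

    /* 3. Otherwise, deduplicate now to delay an imminent page split. */
    return true;
}

int
main(void)
{
    SketchPage almost_full = {.free_space = 100, .ndead = 0};
    SketchPage roomy = {.free_space = 2000, .ndead = 0};

    printf("almost full unique page: %d\n",
           sketch_should_dedup(&almost_full, true));   /* 1: dedup */
    printf("roomy unique page: %d\n",
           sketch_should_dedup(&roomy, true));         /* 0: stay lazy */
    return 0;
}
```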
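Separately, in case it helps to see why merging duplicates buys so much space in the first place, here is some back-of-the-envelope arithmetic. The overhead constants below are rough approximations that I'm using purely for illustration (alignment padding is ignored; this is not the actual on-disk tuple layout): storing n duplicates as one key plus n heap TIDs pays the per-tuple overhead and the key once, rather than n times.

```c
#include <stdio.h>

#define TID_SIZE        6   /* ItemPointerData is 6 bytes */
#define TUPLE_OVERHEAD 12   /* approx. 8-byte tuple header + 4-byte
                             * line pointer; alignment ignored */

int
main(void)
{
    int key_size = 8;       /* e.g. a single int8 key column */

    for (int n = 1; n <= 64; n *= 4)
    {
        /* n separate tuples: each repeats overhead, key, and TID */
        int plain = n * (TUPLE_OVERHEAD + key_size + TID_SIZE);

        /* one merged tuple: overhead and key once, then n TIDs */
        int merged = TUPLE_OVERHEAD + key_size + n * TID_SIZE;

        printf("%2d duplicates: %4d bytes plain, %4d bytes merged\n",
               n, plain, merged);
    }
    return 0;
}
```

The savings grow with the number of duplicates sharing a key, up to a bound set by the key size and the per-tuple overhead.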
The patch smoothed out the growth of pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB after the first 3 hours/first 16 client run. It was 1356 MB after the second 16 client run (i.e. after running another round of one hour 4/8/16 client runs), finally finishing up at 1745 MB. So both the total growth in the size of pgbench_accounts_pkey and the *rate* of growth over time were significantly improved with the patch.

The master branch showed terrible, jerky growth in the size of pgbench_accounts_pkey. It mostly kept up at first (i.e. the size of pgbench_accounts_pkey wasn't too different early on). But once we got to 16 clients for the first time, after a couple of hours, pgbench_accounts_pkey almost doubled in size over a period of only 10 or 20 minutes! The index size *exploded* in a very short window, starting only a few hours into the benchmark. (Maybe we don't see anything like this with the patch because backends are more concerned with helping VACUUM, and less concerned with creating a mess that VACUUM must clean up. Not sure.)

* We also manage to make the small pgbench indexes (pgbench_branches_pkey and pgbench_tellers_pkey) over 4x smaller here (without doing anything to force more non-HOT updates on the underlying tables). This result for the two small indexes looks good, but I should point out that we still only fit ~15 or so tuples on each leaf page with the patch when everything is over -- far less than the number that CREATE INDEX stores on each leaf page immediately (it leaves 366 items per leaf page). This is kind of an extreme case, because there is so much contention, but space utilization with the patch is actually very bad here. The master branch is very very very bad, though, so we're at least down to only a single "very" here. Progress.

Any thoughts on the approach taken for unique indexes within _bt_dedup_one_page() in v20? Obviously that stuff needs to be examined critically -- it's possible that it wouldn't do as well as it could or should with other workloads that I haven't thought about. Please take a look at the details.

--
Peter Geoghegan