Re: [PoC] Reducing planning time when tables have many partitions - Mailing list pgsql-hackers
From: Thom Brown
Subject: Re: [PoC] Reducing planning time when tables have many partitions
Date:
Msg-id: CAA-aLv4+9_Wqi1D7bDVzBoPv8=LnQLcqHEvEfUvYW196_4KrEA@mail.gmail.com
In response to: Re: [PoC] Reducing planning time when tables have many partitions (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: [PoC] Reducing planning time when tables have many partitions
List: pgsql-hackers
On Sun, 4 Dec 2022 at 00:35, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 29 Nov 2022 at 21:59, Yuya Watari <watari.yuya@gmail.com> wrote:
> > Thank you for testing the patch with an actual query. This speedup is
> > very impressive. When I used an original query with 1024 partitions,
> > its planning time was about 200ms. Given that each partition is also
> > partitioned in your workload, I think the result of 1415ms is
> > reasonable.
>
> I was looking again at the v9-0001 patch and I think we can do a
> little better when building the Bitmapset of matching EMs. For
> example, in the v9 patch, the code for get_ecmember_indexes_strict()
> is doing:
>
> +     if (!with_children)
> +         matching_ems = bms_copy(ec->ec_nonchild_indexes);
> +     else
> +         matching_ems = bms_copy(ec->ec_member_indexes);
> +
> +     i = -1;
> +     while ((i = bms_next_member(relids, i)) >= 0)
> +     {
> +         RelOptInfo *rel = root->simple_rel_array[i];
> +
> +         matching_ems = bms_int_members(matching_ems, rel->eclass_member_indexes);
> +     }
>
> It seems reasonable that if there are a large number of partitions
> then ec_member_indexes will have a large number of Bitmapwords. When
> we do bms_int_members() on that, we're probably going to end up with a
> bunch of trailing zero words in the set. In the v10 patch, I've
> changed this to become:
>
> +     int i = bms_next_member(relids, -1);
> +
> +     if (i >= 0)
> +     {
> +         RelOptInfo *rel = root->simple_rel_array[i];
> +
> +         /*
> +          * bms_intersect to the first relation to try to keep the resulting
> +          * Bitmapset as small as possible. This saves having to make a
> +          * complete bms_copy() of one of them. One may contain significantly
> +          * more words than the other.
> +          */
> +         if (!with_children)
> +             matching_ems = bms_intersect(rel->eclass_member_indexes,
> +                                          ec->ec_nonchild_indexes);
> +         else
> +             matching_ems = bms_intersect(rel->eclass_member_indexes,
> +                                          ec->ec_member_indexes);
> +
> +         while ((i = bms_next_member(relids, i)) >= 0)
> +         {
> +             rel = root->simple_rel_array[i];
> +             matching_ems = bms_int_members(matching_ems,
> +                                            rel->eclass_member_indexes);
> +         }
> +     }
>
> so, effectively we first bms_intersect to the first member of relids
> before masking out the bits for the remaining ones. This should mean
> we'll have a Bitmapset with fewer words in many complex planning
> problems. There's no longer the dilemma of having to decide if we
> should start with RelOptInfo's eclass_member_indexes or the
> EquivalenceClass's member indexes. When using bms_int_members, we
> really want to start with the smallest of those so we get the smallest
> resulting set. With bms_intersect(), it will always make a copy of
> the smallest set. v10 does that instead of bms_copy()ing the
> EquivalenceClass's member Bitmapset.
>
> I also wondered how much we're losing to the fact that
> bms_int_members() zeros the trailing words and does not trim the
> Bitmapset down.
>
> The problem there is 2-fold:
> 1) we have to zero the trailing words on the left input. That'll
> pollute the CPU cache a bit as it may have to fetch a bunch of extra
> cache lines, and;
> 2) subsequent bms_int_members() done afterwards may have to mask out
> additional words. If we can make the shortest input really short, then
> subsequent bms_int_members() are going to be very fast.
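To make the trade-off described above concrete, here is a minimal sketch, not
taken from the patches: it assumes a backend context where nodes/bitmapset.h
is available, relies on the Bitmapset behaviour of these late-2022 sources,
and the member numbers and the helper name are invented for illustration.

    #include "postgres.h"
    #include "nodes/bitmapset.h"

    static void
    intersection_order_demo(void)
    {
        Bitmapset  *wide = NULL;    /* stand-in for ec_member_indexes */
        Bitmapset  *narrow = NULL;  /* stand-in for one rel's eclass_member_indexes */
        Bitmapset  *masked;
        Bitmapset  *copied;

        /* A single high member forces 'wide' to allocate many words. */
        wide = bms_add_member(wide, 5);
        wide = bms_add_member(wide, 4000);
        narrow = bms_add_member(narrow, 5);

        /* v9 style: copy the wide set, then mask it down in place. */
        masked = bms_int_members(bms_copy(wide), narrow);

        /* v10 style: bms_intersect() copies the smaller input instead. */
        copied = bms_intersect(narrow, wide);

        /*
         * On these sources, 'masked' keeps the wide set's word count with
         * trailing zero words, while 'copied' stays one word long, so any
         * later bms_int_members() against it touches far fewer cache lines.
         */
        elog(NOTICE, "masked nwords = %d, copied nwords = %d",
             masked->nwords, copied->nwords);
    }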
>
> You might argue there that setting nwords to the shortest length may
> cause us to have to repalloc the Bitmapset if we need to later add
> more members again, but if you look at the repalloc() code, it's
> effectively a no-op when the allocated size >= the requested size, so
> repalloc() should be very fast in this case. So, worst case, there's
> an additional "no-op" repalloc() (which should be very fast) followed
> by maybe a bms_add_members() which has to zero the words instead of
> bms_int_members(). I changed this in the v10-0002 patch. I'm not sure
> if we should do this or not.
>
> I also changed v10-0001 so that we still store the EquivalenceClass's
> members list. There were a few places where the code just wanted to
> get the first member and having to look at the Bitmapset index and
> fetch the first match from PlannerInfo seemed convoluted. If the
> query is simple, it seems like it's not going to be very expensive to
> add a few EquivalenceMembers to this list. When planning more complex
> problems, there's probably enough other extra overhead that we're
> unlikely to notice the extra lappend()s. This also allows v10-0003 to
> work, see below.
>
> In v10-0003, I experimented with the iterator concept that I mentioned
> earlier. Since v10-0001 is now storing the EquivalenceMember list in
> EquivalenceClass again, it's now quite simple to have the iterator
> decide if it should be scanning the index or doing a loop over all
> members to find the ones matching the search. We can make this
> decision based on list_length(ec->ec_members). This should be a more
> reliable check than checking root->simple_rel_array_size as we could
> still have classes with just a few members even when there's a large
> number of rels in simple_rel_array. I was hoping that v10-0003 would
> allow us to maintain the same planner performance for simple queries.
> It just does not seem to change the performance much. Perhaps it's not
> worth the complexity if there are no performance benefits. It probably
> needs more performance testing than what I've done to know if it helps
> or hinders, however.
>
> Overall, I'm not quite sure if this is any faster than your v9 patch.
> I think more performance testing needs to be done. I think
> v10-0001 + v10-0002 is faster than v9-0001, but perhaps the changes
> you've made in v9-0002 and v9-0003 are worth redoing. I didn't test. I
> was hoping to keep the logic about which method to use to find the
> members in the iterator code and not litter it around the tree.
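The v10-0003 iterator itself is not shown in the mail, so the following is a
purely hypothetical sketch of the decision it is described as making: scan
ec_members directly when the class is small, and only fall back to the
Bitmapset index for large classes. The struct, field, and helper names and
the threshold are invented here, not taken from the patch.

    #include "postgres.h"
    #include "nodes/pathnodes.h"

    #define EC_MEMBER_INDEX_THRESHOLD 16    /* invented cutoff */

    /* Hypothetical iterator state; not the patch's actual definition. */
    typedef struct EquivalenceMemberIter
    {
        EquivalenceClass *ec;       /* class being scanned */
        bool        use_index;      /* walk the member-index Bitmapset? */
        ListCell   *list_cell;      /* next cell when walking ec_members */
        int         index_bit;      /* previous bit when walking the index */
    } EquivalenceMemberIter;

    static void
    eclass_member_iter_init(EquivalenceMemberIter *it, EquivalenceClass *ec)
    {
        it->ec = ec;
        it->list_cell = list_head(ec->ec_members);
        it->index_bit = -1;

        /*
         * For small classes a plain loop over ec_members is cheapest; only
         * use the Bitmapset index when the class has many members, per the
         * list_length(ec->ec_members) check discussed above.
         */
        it->use_index = (list_length(ec->ec_members) > EC_MEMBER_INDEX_THRESHOLD);
    }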
>
> I did run the test you mentioned in [1] and I got:
>
> $ echo Master @ 29452de73 && ./partbench.sh | grep -E "^(Testing|latency)"
> Master @ 29452de73
> Testing with 2 partitions...
> latency average = 0.231 ms
> Testing with 4 partitions...
> latency average = 0.303 ms
> Testing with 8 partitions...
> latency average = 0.454 ms
> Testing with 16 partitions...
> latency average = 0.777 ms
> Testing with 32 partitions...
> latency average = 1.576 ms
> Testing with 64 partitions...
> latency average = 3.574 ms
> Testing with 128 partitions...
> latency average = 9.504 ms
> Testing with 256 partitions...
> latency average = 37.321 ms
> Testing with 512 partitions...
> latency average = 171.660 ms
> Testing with 1024 partitions...
> latency average = 1021.990 ms
>
> $ echo Master + v10-0001 && ./partbench.sh | grep -E "^(Testing|latency)"
> Master + v10-0001
> Testing with 2 partitions...
> latency average = 0.239 ms
> Testing with 4 partitions...
> latency average = 0.315 ms
> Testing with 8 partitions...
> latency average = 0.463 ms
> Testing with 16 partitions...
> latency average = 0.757 ms
> Testing with 32 partitions...
> latency average = 1.481 ms
> Testing with 64 partitions...
> latency average = 2.563 ms
> Testing with 128 partitions...
> latency average = 5.618 ms
> Testing with 256 partitions...
> latency average = 16.229 ms
> Testing with 512 partitions...
> latency average = 38.855 ms
> Testing with 1024 partitions...
> latency average = 85.705 ms
>
> $ echo Master + v10-0001 + v10-0002 && ./partbench.sh | grep -E
> "^(Testing|latency)"
> Master + v10-0001 + v10-0002
> Testing with 2 partitions...
> latency average = 0.241 ms
> Testing with 4 partitions...
> latency average = 0.312 ms
> Testing with 8 partitions...
> latency average = 0.459 ms
> Testing with 16 partitions...
> latency average = 0.755 ms
> Testing with 32 partitions...
> latency average = 1.464 ms
> Testing with 64 partitions...
> latency average = 2.580 ms
> Testing with 128 partitions...
> latency average = 5.652 ms
> Testing with 256 partitions...
> latency average = 16.464 ms
> Testing with 512 partitions...
> latency average = 37.674 ms
> Testing with 1024 partitions...
> latency average = 84.094 ms
>
> $ echo Master + v10-0001 + v10-0002 + v10-0003 && ./partbench.sh |
> grep -E "^(Testing|latency)"
> Master + v10-0001 + v10-0002 + v10-0003
> Testing with 2 partitions...
> latency average = 0.240 ms
> Testing with 4 partitions...
> latency average = 0.318 ms
> Testing with 8 partitions...
> latency average = 0.465 ms
> Testing with 16 partitions...
> latency average = 0.763 ms
> Testing with 32 partitions...
> latency average = 1.486 ms
> Testing with 64 partitions...
> latency average = 2.858 ms
> Testing with 128 partitions...
> latency average = 5.764 ms
> Testing with 256 partitions...
> latency average = 16.995 ms
> Testing with 512 partitions...
> latency average = 38.012 ms
> Testing with 1024 partitions...
> latency average = 88.098 ms
>
> $ echo Master + v9-* && ./partbench.sh | grep -E "^(Testing|latency)"
> Master + v9-*
> Testing with 2 partitions...
> latency average = 0.237 ms
> Testing with 4 partitions...
> latency average = 0.313 ms
> Testing with 8 partitions...
> latency average = 0.460 ms
> Testing with 16 partitions...
> latency average = 0.780 ms
> Testing with 32 partitions...
> latency average = 1.468 ms
> Testing with 64 partitions...
> latency average = 2.701 ms
> Testing with 128 partitions...
> latency average = 5.275 ms
> Testing with 256 partitions...
> latency average = 17.208 ms
> Testing with 512 partitions...
> latency average = 37.183 ms
> Testing with 1024 partitions...
> latency average = 90.595 ms

Testing your patches with the same 1024 partitions, each with 64
sub-partitions, I get a planning time of 205.020 ms, which is now a
1,377x speedup. This has essentially reduced the planning time from a
catastrophe to a complete non-issue. Huge win!

--
Thom