Re: [PoC] Reducing planning time when tables have many partitions - Mailing list pgsql-hackers

From:           David Rowley
Subject:        Re: [PoC] Reducing planning time when tables have many partitions
Msg-id:         CAApHDvoM-6VukeCpwzO9ddcVEwn4-nc4Cyorgs4-SAJ83zOBcA@mail.gmail.com
In response to: Re: [PoC] Reducing planning time when tables have many partitions (Yuya Watari <watari.yuya@gmail.com>)
Responses:      Re: [PoC] Reducing planning time when tables have many partitions
List:           pgsql-hackers
On Tue, 29 Nov 2022 at 21:59, Yuya Watari <watari.yuya@gmail.com> wrote:
> Thank you for testing the patch with an actual query. This speedup is
> very impressive. When I used an original query with 1024 partitions,
> its planning time was about 200ms. Given that each partition is also
> partitioned in your workload, I think the result of 1415ms is
> reasonable.

I was looking again at the v9-0001 patch and I think we can do a little better when building the Bitmapset of matching EMs. For example, in the v9 patch, the code for get_ecmember_indexes_strict() is doing:

+	if (!with_children)
+		matching_ems = bms_copy(ec->ec_nonchild_indexes);
+	else
+		matching_ems = bms_copy(ec->ec_member_indexes);
+
+	i = -1;
+	while ((i = bms_next_member(relids, i)) >= 0)
+	{
+		RelOptInfo *rel = root->simple_rel_array[i];
+
+		matching_ems = bms_int_members(matching_ems, rel->eclass_member_indexes);
+	}

It seems reasonable that if there are a large number of partitions then ec_member_indexes will have a large number of Bitmapwords. When we do bms_int_members() on that, we're probably going to end up with a bunch of trailing zero words in the set. In the v10 patch, I've changed this to become:

+	int			i = bms_next_member(relids, -1);
+
+	if (i >= 0)
+	{
+		RelOptInfo *rel = root->simple_rel_array[i];
+
+		/*
+		 * bms_intersect to the first relation to try to keep the resulting
+		 * Bitmapset as small as possible. This saves having to make a
+		 * complete bms_copy() of one of them. One may contain significantly
+		 * more words than the other.
+		 */
+		if (!with_children)
+			matching_ems = bms_intersect(rel->eclass_member_indexes,
+										 ec->ec_nonchild_indexes);
+		else
+			matching_ems = bms_intersect(rel->eclass_member_indexes,
+										 ec->ec_member_indexes);
+
+		while ((i = bms_next_member(relids, i)) >= 0)
+		{
+			rel = root->simple_rel_array[i];
+			matching_ems = bms_int_members(matching_ems,
+										   rel->eclass_member_indexes);
+		}
+	}

So, effectively, we first bms_intersect() to the first member of relids before masking out the bits for the remaining ones. This should mean we'll have a Bitmapset with fewer words in many complex planning problems. There's also no longer the dilemma of having to decide whether to start with the RelOptInfo's eclass_member_indexes or the EquivalenceClass's member indexes. When using bms_int_members(), we really want to start with the smallest of those so that we get the smallest resulting set. bms_intersect() will always make a copy of the smallest set, so v10 does that instead of bms_copy()ing the EquivalenceClass's member Bitmapset.
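To make that word-count argument a little more concrete, here is a small standalone toy model of the two approaches. To be clear, this is not the real Bitmapset code; it's just a simplified illustration with made-up sizes (a 16-word set standing in for ec_member_indexes and a 1-word set standing in for a relation's eclass_member_indexes):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD 64

/* Toy model only: a deliberately simplified stand-in for Bitmapset */
typedef struct ToyBitmapset
{
	int			nwords;
	uint64_t	words[];		/* flexible array member */
} ToyBitmapset;

static ToyBitmapset *
toy_make(int nwords)
{
	ToyBitmapset *set = calloc(1, sizeof(ToyBitmapset) +
							   nwords * sizeof(uint64_t));

	set->nwords = nwords;
	return set;
}

static void
toy_set_bit(ToyBitmapset *set, int bit)
{
	set->words[bit / BITS_PER_WORD] |= UINT64_C(1) << (bit % BITS_PER_WORD);
}

/* bms_int_members()-style: mask 'a' in place, keeping a's word count */
static void
toy_int_members(ToyBitmapset *a, const ToyBitmapset *b)
{
	for (int i = 0; i < a->nwords; i++)
		a->words[i] &= (i < b->nwords) ? b->words[i] : 0;
	/* a->nwords is unchanged, so the trailing words may now just be zeros */
}

/* bms_intersect()-style: allocate the result sized to the smaller input */
static ToyBitmapset *
toy_intersect(const ToyBitmapset *a, const ToyBitmapset *b)
{
	const ToyBitmapset *shorter = (a->nwords <= b->nwords) ? a : b;
	const ToyBitmapset *longer = (shorter == a) ? b : a;
	ToyBitmapset *result = toy_make(shorter->nwords);

	for (int i = 0; i < shorter->nwords; i++)
		result->words[i] = shorter->words[i] & longer->words[i];
	return result;
}

int
main(void)
{
	/* stand-in for ec->ec_member_indexes: many members, so many words */
	ToyBitmapset *ec_members = toy_make(16);	/* 1024 bits */
	/* stand-in for rel->eclass_member_indexes: only a few members */
	ToyBitmapset *rel_members = toy_make(1);

	for (int bit = 0; bit < 1000; bit += 7)
		toy_set_bit(ec_members, bit);
	toy_set_bit(rel_members, 7);
	toy_set_bit(rel_members, 42);

	/* v9-style: copy the big set, then mask it down in place */
	ToyBitmapset *v9_result = toy_make(16);
	memcpy(v9_result->words, ec_members->words, 16 * sizeof(uint64_t));
	toy_int_members(v9_result, rel_members);

	/* v10-style: intersect, so the result is sized to the small input */
	ToyBitmapset *v10_result = toy_intersect(rel_members, ec_members);

	printf("v9-style result words:  %d\n", v9_result->nwords);		/* 16 */
	printf("v10-style result words: %d\n", v10_result->nwords);	/* 1 */
	return 0;
}

With these toy numbers, any further masking of the v9-style result still visits all 16 words, most of them trailing zeros, while the v10-style result only ever has a single word to look at.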
I also wondered how much we're losing to the fact that bms_int_members() zeros the trailing words and does not trim the Bitmapset down. The problem there is two-fold: 1) we have to zero the trailing words on the left input, which will pollute the CPU cache a bit as it may have to fetch a bunch of extra cache lines; and 2) subsequent bms_int_members() calls done afterwards may have to mask out additional words. If we can make the shortest input really short, then subsequent bms_int_members() calls are going to be very fast.

You might argue that setting nwords to the shortest length may cause us to have to repalloc the Bitmapset if we need to later add more members again, but if you look at the repalloc() code, it's effectively a no-op when the allocated size >= the requested size, so repalloc() should be very fast in this case. So, worst case, there's an additional "no-op" repalloc() (which should be very fast) followed by maybe a bms_add_members() which has to zero the words instead of bms_int_members(). I changed this in the v10-0002 patch. I'm not sure if we should do this or not.

I also changed v10-0001 so that we still store the EquivalenceClass's members list. There were a few places where the code just wanted to get the first member, and having to look at the Bitmapset index and fetch the first match from PlannerInfo seemed convoluted. If the query is simple, it seems like it's not going to be very expensive to add a few EquivalenceMembers to this list. When planning more complex problems, there's probably enough other extra overhead that we're unlikely to notice the extra lappend()s. This also allows v10-0003 to work, see below.

In v10-0003, I experimented with the iterator concept that I mentioned earlier. Since v10-0001 is now storing the EquivalenceMember list in EquivalenceClass again, it's now quite simple to have the iterator decide if it should be scanning the index or doing a loop over all members to find the ones matching the search. We can make this decision based on list_length(ec->ec_members). This should be a more reliable check than checking root->simple_rel_array_size, as we could still have classes with just a few members even when there's a large number of rels in simple_rel_array.

I was hoping that v10-0003 would allow us to maintain the same planner performance for simple queries. It just does not seem to change the performance much. Perhaps it's not worth the complexity if there are no performance benefits. It probably needs more performance testing than what I've done to know if it helps or hinders, however.

Overall, I'm not quite sure if this is any faster than your v9 patch. I think more performance testing needs to be done. I think v10-0001 + v10-0002 is faster than v9-0001, but perhaps the changes you've made in v9-0002 and v9-0003 are worth redoing; I didn't test that. I was hoping to keep the logic about which method to use to find the members in the iterator code and not litter it around the tree.
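To give an idea of the shape of that, here's a rough sketch of what such an iterator could look like. The struct, the threshold, and the get_matching_member_indexes() helper are made-up names for illustration only; the actual v10-0003 code may differ:

/* Sketch only: hypothetical struct and function names, not the patch code. */
typedef struct EquivalenceMemberIterator
{
	EquivalenceClass *ec;		/* class being searched */
	bool		use_index;		/* scan the member-index Bitmapset? */
	int			index_pos;		/* last bms_next_member() position */
	ListCell   *list_pos;		/* next cell when walking ec->ec_members */
	Bitmapset  *matching_ems;	/* matching member indexes, when indexed */
} EquivalenceMemberIterator;

/* hypothetical cutoff: small classes are cheaper to scan linearly */
#define EC_MEMBER_INDEX_THRESHOLD	32

static void
setup_eclass_member_iterator(EquivalenceMemberIterator *it,
							 PlannerInfo *root,
							 EquivalenceClass *ec,
							 Relids relids,
							 bool with_children)
{
	it->ec = ec;

	/*
	 * Decide the strategy once, based on how many members this class has,
	 * rather than on root->simple_rel_array_size.
	 */
	it->use_index = list_length(ec->ec_members) > EC_MEMBER_INDEX_THRESHOLD;

	if (it->use_index)
	{
		/*
		 * Large class: build the Bitmapset of matching member indexes
		 * (hypothetical helper, along the lines of the patch's
		 * get_ecmember_indexes_strict()).
		 */
		it->matching_ems = get_matching_member_indexes(root, ec, relids,
														with_children);
		it->index_pos = -1;
	}
	else
	{
		/* Small class: just walk the ec_members list and filter as we go */
		it->list_pos = list_head(ec->ec_members);
	}
}

/*
 * A corresponding next() function would then either bms_next_member() over
 * matching_ems and fetch that member, or advance list_pos and skip members
 * whose relids don't match, returning NULL once no members remain.
 */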
I did run the test you mentioned in [1] and I got:

$ echo Master @ 29452de73 && ./partbench.sh | grep -E "^(Testing|latency)"
Master @ 29452de73
Testing with 2 partitions...
latency average = 0.231 ms
Testing with 4 partitions...
latency average = 0.303 ms
Testing with 8 partitions...
latency average = 0.454 ms
Testing with 16 partitions...
latency average = 0.777 ms
Testing with 32 partitions...
latency average = 1.576 ms
Testing with 64 partitions...
latency average = 3.574 ms
Testing with 128 partitions...
latency average = 9.504 ms
Testing with 256 partitions...
latency average = 37.321 ms
Testing with 512 partitions...
latency average = 171.660 ms
Testing with 1024 partitions...
latency average = 1021.990 ms

$ echo Master + v10-0001 && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v10-0001
Testing with 2 partitions...
latency average = 0.239 ms
Testing with 4 partitions...
latency average = 0.315 ms
Testing with 8 partitions...
latency average = 0.463 ms
Testing with 16 partitions...
latency average = 0.757 ms
Testing with 32 partitions...
latency average = 1.481 ms
Testing with 64 partitions...
latency average = 2.563 ms
Testing with 128 partitions...
latency average = 5.618 ms
Testing with 256 partitions...
latency average = 16.229 ms
Testing with 512 partitions...
latency average = 38.855 ms
Testing with 1024 partitions...
latency average = 85.705 ms

$ echo Master + v10-0001 + v10-0002 && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v10-0001 + v10-0002
Testing with 2 partitions...
latency average = 0.241 ms
Testing with 4 partitions...
latency average = 0.312 ms
Testing with 8 partitions...
latency average = 0.459 ms
Testing with 16 partitions...
latency average = 0.755 ms
Testing with 32 partitions...
latency average = 1.464 ms
Testing with 64 partitions...
latency average = 2.580 ms
Testing with 128 partitions...
latency average = 5.652 ms
Testing with 256 partitions...
latency average = 16.464 ms
Testing with 512 partitions...
latency average = 37.674 ms
Testing with 1024 partitions...
latency average = 84.094 ms

$ echo Master + v10-0001 + v10-0002 + v10-0003 && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v10-0001 + v10-0002 + v10-0003
Testing with 2 partitions...
latency average = 0.240 ms
Testing with 4 partitions...
latency average = 0.318 ms
Testing with 8 partitions...
latency average = 0.465 ms
Testing with 16 partitions...
latency average = 0.763 ms
Testing with 32 partitions...
latency average = 1.486 ms
Testing with 64 partitions...
latency average = 2.858 ms
Testing with 128 partitions...
latency average = 5.764 ms
Testing with 256 partitions...
latency average = 16.995 ms
Testing with 512 partitions...
latency average = 38.012 ms
Testing with 1024 partitions...
latency average = 88.098 ms

$ echo Master + v9-* && ./partbench.sh | grep -E "^(Testing|latency)"
Master + v9-*
Testing with 2 partitions...
latency average = 0.237 ms
Testing with 4 partitions...
latency average = 0.313 ms
Testing with 8 partitions...
latency average = 0.460 ms
Testing with 16 partitions...
latency average = 0.780 ms
Testing with 32 partitions...
latency average = 1.468 ms
Testing with 64 partitions...
latency average = 2.701 ms
Testing with 128 partitions...
latency average = 5.275 ms
Testing with 256 partitions...
latency average = 17.208 ms
Testing with 512 partitions...
latency average = 37.183 ms
Testing with 1024 partitions...
latency average = 90.595 ms

David

[1] https://postgr.es/m/CAJ2pMkZNCgoUKSE%2B_5LthD%2BKbXKvq6h2hQN8Esxpxd%2Bcxmgomg%40mail.gmail.com