Re: [PoC] Reducing planning time when tables have many partitions - Mailing list pgsql-hackers

From: Thom Brown
Subject: Re: [PoC] Reducing planning time when tables have many partitions
Date:
Msg-id: CAA-aLv4+9_Wqi1D7bDVzBoPv8=LnQLcqHEvEfUvYW196_4KrEA@mail.gmail.com
In response to: Re: [PoC] Reducing planning time when tables have many partitions  (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: [PoC] Reducing planning time when tables have many partitions  (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers
On Sun, 4 Dec 2022 at 00:35, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 29 Nov 2022 at 21:59, Yuya Watari <watari.yuya@gmail.com> wrote:
> > Thank you for testing the patch with an actual query. This speedup is
> > very impressive. When I used an original query with 1024 partitions,
> > its planning time was about 200ms. Given that each partition is also
> > partitioned in your workload, I think the result of 1415ms is
> > reasonable.
>
> I was looking again at the v9-0001 patch and I think we can do a
> little better when building the Bitmapset of matching EMs.  For
> example, in the v9 patch, the code for get_ecmember_indexes_strict()
> is doing:
>
> + if (!with_children)
> +     matching_ems = bms_copy(ec->ec_nonchild_indexes);
> + else
> +     matching_ems = bms_copy(ec->ec_member_indexes);
> +
> + i = -1;
> + while ((i = bms_next_member(relids, i)) >= 0)
> + {
> +     RelOptInfo *rel = root->simple_rel_array[i];
> +
> +     matching_ems = bms_int_members(matching_ems, rel->eclass_member_indexes);
> + }
>
> It seems reasonable that if there are a large number of partitions
> then ec_member_indexes will have a large number of bitmapwords.  When
> we do bms_int_members() on that, we're probably going to end up with
> a bunch of trailing zero words in the set.  In the v10 patch, I've
> changed this to become:
>
> +    int            i = bms_next_member(relids, -1);
> +
> +    if (i >= 0)
> +    {
> +        RelOptInfo *rel = root->simple_rel_array[i];
> +
> +        /*
> +         * bms_intersect to the first relation to try to keep the resulting
> +         * Bitmapset as small as possible.  This saves having to make a
> +         * complete bms_copy() of one of them.  One may contain significantly
> +         * more words than the other.
> +         */
> +        if (!with_children)
> +            matching_ems = bms_intersect(rel->eclass_member_indexes,
> +                                         ec->ec_nonchild_indexes);
> +        else
> +            matching_ems = bms_intersect(rel->eclass_member_indexes,
> +                                         ec->ec_member_indexes);
> +
> +        while ((i = bms_next_member(relids, i)) >= 0)
> +        {
> +            rel = root->simple_rel_array[i];
> +            matching_ems = bms_int_members(matching_ems,
> +                                           rel->eclass_member_indexes);
> +        }
> +    }
>
> so, effectively, we first bms_intersect to the first member of relids
> before masking out the bits for the remaining ones.  This should mean
> we'll have a Bitmapset with fewer words in many complex planning
> problems.  There's no longer the dilemma of having to decide if we
> should start with the RelOptInfo's eclass_member_indexes or the
> EquivalenceClass's member indexes.  When using bms_int_members(), we
> really want to start with the smallest of those so we get the smallest
> resulting set.  bms_intersect() will always make a copy of the
> smallest set, so v10 does that instead of bms_copy()ing the
> EquivalenceClass's member indexes Bitmapset.
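>
> To illustrate that last point, bms_intersect()'s shape is roughly the
> following (a simplified sketch, not the actual bitmapset.c code; NULL
> inputs, empty-result normalisation and allocation failure are
> ignored).  The result is a copy of whichever input has fewer words,
> so no trailing zero words can appear:
>
>     #include <stddef.h>
>     #include <stdint.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     typedef uint64_t bitmapword;
>
>     typedef struct Bitmapset
>     {
>         int         nwords;     /* number of words in the array */
>         bitmapword  words[];    /* really [nwords] */
>     } Bitmapset;
>
>     static Bitmapset *
>     sketch_bms_intersect(const Bitmapset *a, const Bitmapset *b)
>     {
>         /* copy the shorter input, then AND the longer one into it */
>         const Bitmapset *shorter = (a->nwords <= b->nwords) ? a : b;
>         const Bitmapset *longer = (shorter == a) ? b : a;
>         size_t      sz = offsetof(Bitmapset, words) +
>                          shorter->nwords * sizeof(bitmapword);
>         Bitmapset  *result = memcpy(malloc(sz), shorter, sz);
>
>         for (int i = 0; i < result->nwords; i++)
>             result->words[i] &= longer->words[i];
>         return result;
>     }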
>
> I also wondered how much we're losing to the fact that
> bms_int_members() zeros the trailing words and does not trim the
> Bitmapset down.
>
> The problem there is two-fold:
> 1) we have to zero the trailing words on the left input, which
> pollutes the CPU cache a bit as it may have to fetch a bunch of extra
> cache lines; and
> 2) subsequent bms_int_members() calls may have to mask out additional
> words.  If we can make the shortest input really short, then
> subsequent bms_int_members() calls will be very fast.
>
> You might argue that setting nwords to the shortest length could
> force a repalloc() of the Bitmapset if we later need to add more
> members, but if you look at the repalloc() code, it's effectively a
> no-op when the allocated size >= the requested size, so repalloc()
> should be very fast in this case.  So, in the worst case, there's an
> additional no-op repalloc() followed by maybe a bms_add_members()
> which has to zero the words instead of bms_int_members() doing it.
> I changed this in the v10-0002 patch.  I'm not sure if we should do
> this or not.
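>
> In sketch form, the idea is something like this (simplified, not the
> exact patch; NULL inputs and freeing an empty result are ignored, and
> it reuses the Bitmapset/bitmapword definitions from the sketch
> above):
>
>     static Bitmapset *
>     sketch_bms_int_members_trimmed(Bitmapset *a, const Bitmapset *b)
>     {
>         int     shortlen = (a->nwords <= b->nwords) ? a->nwords : b->nwords;
>
>         for (int i = 0; i < shortlen; i++)
>             a->words[i] &= b->words[i];
>
>         /*
>          * Trim the set down to the shorter length instead of zeroing
>          * the trailing words of 'a'.  A later bms_add_member() beyond
>          * this length may need to repalloc(), but that's effectively
>          * a no-op while the underlying allocation is still big enough.
>          */
>         a->nwords = shortlen;
>         return a;
>     }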
>
> I also changed v10-0001 so that we still store the EquivalenceClass's
> members list.  There were a few places where the code just wanted to
> get the first member, and having to look at the Bitmapset index and
> fetch the first match from PlannerInfo seemed convoluted.  If the
> query is simple, it seems like it's not going to be very expensive to
> add a few EquivalenceMembers to this list.  When planning more complex
> problems, there's probably enough other extra overhead that we're
> unlikely to notice the extra lappend()s.  This also allows v10-0003 to
> work; see below.
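>
> For example, with the list kept around, a caller that just needs some
> representative member can do:
>
>     EquivalenceMember *em = (EquivalenceMember *) linitial(ec->ec_members);
>
> rather than pulling the first set bit out of ec_member_indexes and
> then looking that index up via PlannerInfo.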
>
> In v10-0003, I experimented with the iterator concept that I mentioned
> earlier.  Since v10-0001 stores the EquivalenceMember list in
> EquivalenceClass again, it's now quite simple to have the iterator
> decide whether it should scan the index or loop over all members to
> find the ones matching the search.  We can make this decision based
> on list_length(ec->ec_members).  This should be a more reliable check
> than checking root->simple_rel_array_size, as we could still have
> classes with just a few members even when there's a large number of
> rels in simple_rel_array.  I was hoping that v10-0003 would allow us
> to maintain the same planner performance for simple queries, but it
> just does not seem to change the performance much.  Perhaps it's not
> worth the complexity if there are no performance benefits.  It
> probably needs more performance testing than what I've done to know
> whether it helps or hinders, however.
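>
> Roughly, the shape I have in mind is something like this (a
> simplified sketch only; the field names, the threshold value and the
> setup function's signature here are illustrative, not lifted from the
> 0003 patch):
>
>     typedef struct EquivalenceMemberIterator
>     {
>         EquivalenceClass *ec;       /* class being scanned */
>         bool        use_index;      /* walk matching_ems rather than ec_members? */
>         Bitmapset  *matching_ems;   /* matching member indexes, when use_index */
>         int         cursor;         /* bms_next_member() position */
>         ListCell   *lc;             /* position in ec->ec_members otherwise */
>     } EquivalenceMemberIterator;
>
>     static void
>     setup_ec_member_iterator(EquivalenceMemberIterator *it,
>                              PlannerInfo *root, EquivalenceClass *ec,
>                              Relids relids, bool with_children)
>     {
>         it->ec = ec;
>         /* illustrative cut-over point; the real value needs benchmarking */
>         it->use_index = list_length(ec->ec_members) >= 32;
>
>         if (it->use_index)
>         {
>             /* build the matching-member set as in the 0001 hunk above */
>             it->matching_ems = get_ecmember_indexes_strict(root, ec, relids,
>                                                            with_children);
>             it->cursor = -1;
>         }
>         else
>             it->lc = list_head(ec->ec_members);
>     }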
>
> Overall, I'm not quite sure if this is any faster than your v9 patch;
> more performance testing needs to be done.  I think v10-0001 +
> v10-0002 is faster than v9-0001, but perhaps the changes you've made
> in v9-0002 and v9-0003 are worth redoing.  I didn't test that.  I was
> hoping to keep the logic about which method to use to find the
> members in the iterator code rather than litter it around the tree.
>
> I did run the test you mentioned in [1] and I got:
>
> $ echo Master @ 29452de73 && ./partbench.sh | grep -E "^(Testing|latency)"
> Master @ 29452de73
> Testing with 2 partitions...
> latency average = 0.231 ms
> Testing with 4 partitions...
> latency average = 0.303 ms
> Testing with 8 partitions...
> latency average = 0.454 ms
> Testing with 16 partitions...
> latency average = 0.777 ms
> Testing with 32 partitions...
> latency average = 1.576 ms
> Testing with 64 partitions...
> latency average = 3.574 ms
> Testing with 128 partitions...
> latency average = 9.504 ms
> Testing with 256 partitions...
> latency average = 37.321 ms
> Testing with 512 partitions...
> latency average = 171.660 ms
> Testing with 1024 partitions...
> latency average = 1021.990 ms
>
> $ echo Master + v10-0001 && ./partbench.sh | grep -E "^(Testing|latency)"
> Master + v10-0001
> Testing with 2 partitions...
> latency average = 0.239 ms
> Testing with 4 partitions...
> latency average = 0.315 ms
> Testing with 8 partitions...
> latency average = 0.463 ms
> Testing with 16 partitions...
> latency average = 0.757 ms
> Testing with 32 partitions...
> latency average = 1.481 ms
> Testing with 64 partitions...
> latency average = 2.563 ms
> Testing with 128 partitions...
> latency average = 5.618 ms
> Testing with 256 partitions...
> latency average = 16.229 ms
> Testing with 512 partitions...
> latency average = 38.855 ms
> Testing with 1024 partitions...
> latency average = 85.705 ms
>
> $ echo Master + v10-0001 + v10-0002 && ./partbench.sh | grep -E
> "^(Testing|latency)"
> Master + v10-0001 + v10-0002
> Testing with 2 partitions...
> latency average = 0.241 ms
> Testing with 4 partitions...
> latency average = 0.312 ms
> Testing with 8 partitions...
> latency average = 0.459 ms
> Testing with 16 partitions...
> latency average = 0.755 ms
> Testing with 32 partitions...
> latency average = 1.464 ms
> Testing with 64 partitions...
> latency average = 2.580 ms
> Testing with 128 partitions...
> latency average = 5.652 ms
> Testing with 256 partitions...
> latency average = 16.464 ms
> Testing with 512 partitions...
> latency average = 37.674 ms
> Testing with 1024 partitions...
> latency average = 84.094 ms
>
> $ echo Master + v10-0001 + v10-0002 + v10-0003 && ./partbench.sh |
> grep -E "^(Testing|latency)"
> Master + v10-0001 + v10-0002 + v10-0003
> Testing with 2 partitions...
> latency average = 0.240 ms
> Testing with 4 partitions...
> latency average = 0.318 ms
> Testing with 8 partitions...
> latency average = 0.465 ms
> Testing with 16 partitions...
> latency average = 0.763 ms
> Testing with 32 partitions...
> latency average = 1.486 ms
> Testing with 64 partitions...
> latency average = 2.858 ms
> Testing with 128 partitions...
> latency average = 5.764 ms
> Testing with 256 partitions...
> latency average = 16.995 ms
> Testing with 512 partitions...
> latency average = 38.012 ms
> Testing with 1024 partitions...
> latency average = 88.098 ms
>
> $ echo Master + v9-* && ./partbench.sh | grep -E "^(Testing|latency)"
> Master + v9-*
> Testing with 2 partitions...
> latency average = 0.237 ms
> Testing with 4 partitions...
> latency average = 0.313 ms
> Testing with 8 partitions...
> latency average = 0.460 ms
> Testing with 16 partitions...
> latency average = 0.780 ms
> Testing with 32 partitions...
> latency average = 1.468 ms
> Testing with 64 partitions...
> latency average = 2.701 ms
> Testing with 128 partitions...
> latency average = 5.275 ms
> Testing with 256 partitions...
> latency average = 17.208 ms
> Testing with 512 partitions...
> latency average = 37.183 ms
> Testing with 1024 partitions...
> latency average = 90.595 ms

Testing your patches with the same 1024 partitions, each with 64
sub-partitions, I get a planning time of 205.020 ms, which is now a
1,377x speedup.  This has essentially reduced the planning time from a
catastrophe to a complete non-issue.  Huge win!
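
(Taking those two figures at face value, the implied pre-patch baseline
is about 1,377 × 205.020 ms ≈ 282 s, i.e. a little under five minutes of
planning time for this workload.)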

-- 
Thom


