Thread: Record a Bitmapset of non-pruned partitions
Back when working on 959d00e9d, to allow ordered partition scans for LIST and RANGE partitioned tables, I mentioned [1] that if we had a field that recorded a Bitmapset of the non-pruned partitions, we could use that to do a more thorough check to see if ordered scans are possible. At the moment these ordered scans are possible if the order by matches the partition key and, for LIST partitioned tables we must also ensure that we don't have something like partition1 FOR VALUES IN(1,3) and partition2 FOR VALUES(2,4). Here we can't scan partition1 first before we'd get 3s before the 2s that are in partition2. Since we don't record the list of partitions that survived pruning, I just made that check to ensure all partitions only allow a single value to be stored. That means the optimisation is not done in cases where it's possible to do it. To make this work, as I mentioned in [1], we really need a live_parts field to track which partitions survived pruning. If we have that then we can ensure that just those partitions have no instances where a lower value could appear in a later partition. In the attached patch, I've only added the live_parts field and made use of it in a few locations where the knowledge is useful as an optimisation. Right now apply_scanjoin_target_to_paths() appears in profiles when planning a query to a partitioned table with many partitions and just 1 or a few survive pruning. The reason for this is simple; looping over a large almost entirely empty array is just slow and using the Bitmapset as a sort of index into the interesting elements of that array speeds it up, quite a bit. Since we've already put quite a bit of effort into making that fast, then I think it might be worth adding this field even if it was just for that purpose. With 10k empty hash partitions and 1 random one surviving partition pruning, I see: Master: 18.13% postgres postgres [.] apply_scanjoin_target_to_paths 3.66% postgres postgres [.] AllocSetAlloc 2.05% postgres postgres [.] hash_search_with_hash_value 1.95% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms 1.88% postgres postgres [.] SearchCatCacheInternal 1.55% postgres postgres [.] base_yyparse 0.74% postgres postgres [.] get_relation_info 0.68% postgres [kernel.kallsyms] [k] __d_lookup_rcu 0.61% postgres postgres [.] palloc Patched: 3.72% postgres postgres [.] AllocSetAlloc 2.30% postgres postgres [.] hash_search_with_hash_value 2.22% postgres postgres [.] SearchCatCacheInternal 2.02% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms 1.88% postgres postgres [.] base_yyparse 1.08% postgres postgres [.] palloc 0.92% postgres [kernel.kallsyms] [k] __d_lookup_rcu $ cat setup.sql create table hp (a int primary key, b int not null) partition by hash(a); select 'create table hp'||x|| ' partition of hp for values with (modulus 10000, remainder '||x||');' from generate_series(0,9999) x; \gexec $ cat select.sql \set p random(1,2000000) select * from hp where a = :p master $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres tps = 2608.704218 (without initial connection time) tps = 2607.641247 (without initial connection time) tps = 2583.017011 (without initial connection time) patched $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres tps = 2715.993495 (without initial connection time) tps = 2701.527640 (without initial connection time) tps = 2707.343009 (without initial connection time) Does anyone have any thoughts about if this is worth a new field in RelOptInfo? David [1] https://www.postgresql.org/message-id/CAKJS1f9W7sg1sb3SXiTQUovs%3DwDMrHATXv68F5dSbe5fuHH%2BiQ%40mail.gmail.com
Attachment
On Wed, Jun 30, 2021 at 10:59 PM David Rowley <dgrowleyml@gmail.com> wrote: > Back when working on 959d00e9d, to allow ordered partition scans for > LIST and RANGE partitioned tables, I mentioned [1] that if we had a > field that recorded a Bitmapset of the non-pruned partitions, we could > use that to do a more thorough check to see if ordered scans are > possible. > > At the moment these ordered scans are possible if the order by matches > the partition key and, for LIST partitioned tables we must also ensure > that we don't have something like partition1 FOR VALUES IN(1,3) and > partition2 FOR VALUES(2,4). Here we can't scan partition1 first before > we'd get 3s before the 2s that are in partition2. Since we don't > record the list of partitions that survived pruning, I just made that > check to ensure all partitions only allow a single value to be stored. > That means the optimisation is not done in cases where it's possible > to do it. > > To make this work, as I mentioned in [1], we really need a live_parts > field to track which partitions survived pruning. If we have that > then we can ensure that just those partitions have no instances where > a lower value could appear in a later partition. > > In the attached patch, I've only added the live_parts field and made > use of it in a few locations where the knowledge is useful as an > optimisation. Right now apply_scanjoin_target_to_paths() appears in > profiles when planning a query to a partitioned table with many > partitions and just 1 or a few survive pruning. The reason for this is > simple; looping over a large almost entirely empty array is just slow > and using the Bitmapset as a sort of index into the interesting > elements of that array speeds it up, quite a bit. > > Since we've already put quite a bit of effort into making that fast, > then I think it might be worth adding this field even if it was just > for that purpose. > > With 10k empty hash partitions and 1 random one surviving partition > pruning, I see: > > Master: > 18.13% postgres postgres [.] apply_scanjoin_target_to_paths > 3.66% postgres postgres [.] AllocSetAlloc > 2.05% postgres postgres [.] hash_search_with_hash_value > 1.95% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms > 1.88% postgres postgres [.] SearchCatCacheInternal > 1.55% postgres postgres [.] base_yyparse > 0.74% postgres postgres [.] get_relation_info > 0.68% postgres [kernel.kallsyms] [k] __d_lookup_rcu > 0.61% postgres postgres [.] palloc > > Patched: > 3.72% postgres postgres [.] AllocSetAlloc > 2.30% postgres postgres [.] hash_search_with_hash_value > 2.22% postgres postgres [.] SearchCatCacheInternal > 2.02% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms > 1.88% postgres postgres [.] base_yyparse > 1.08% postgres postgres [.] palloc > 0.92% postgres [kernel.kallsyms] [k] __d_lookup_rcu > > $ cat setup.sql > create table hp (a int primary key, b int not null) partition by hash(a); > select 'create table hp'||x|| ' partition of hp for values with > (modulus 10000, remainder '||x||');' from generate_series(0,9999) x; > \gexec > > $ cat select.sql > \set p random(1,2000000) > select * from hp where a = :p > > master > > $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres > tps = 2608.704218 (without initial connection time) > tps = 2607.641247 (without initial connection time) > tps = 2583.017011 (without initial connection time) > > patched > > $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres > tps = 2715.993495 (without initial connection time) > tps = 2701.527640 (without initial connection time) > tps = 2707.343009 (without initial connection time) > > Does anyone have any thoughts about if this is worth a new field in RelOptInfo? +1 from me. I had proposed adding a live_parts bitmapset back in [1] (v12 work on speeding up planning with partition) to address the apply_scanjoin_target_to_paths() inefficiency among other things, though Tom seemed to think that it wouldn't be worthwhile if only for that purpose. He'd suggested that we fix things elsewhere such that that function is not needed in the first place [2], something I keep thinking about in between doing other things, but never sit down to actually write a patch. Given that you're proposing more uses for live_parts, maybe he'd be open to the idea. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://www.postgresql.org/message-id/3f280722-46f2-c2a4-4c19-2cfa28c6c1cd%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us
On Thu, 1 Jul 2021 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote: > Given that you're proposing more uses for live_parts, maybe he'd be > open to the idea. Just to make sure the new field in the 0001 patch gets good enough use, I've attached the patch which includes more usages of the field. 0002 adds a new field named interleaved_parts to PartitionBoundInfo which is populated for LIST partitioned tables with any partitions which have interleaved values, e.g FOR VALUES IN(3,5) and another partition with FOR VALUES IN(4), the 3,5 partition is "interleaved" around the partition for 4. This combined with recording "live_parts" in the 0001 patch allows us to do ordered partition scans in many more cases for LIST partitioning and 1 more case with RANGE partitioning. create table mclparted (a int) partition by list(a); create table mclparted1 partition of mclparted for values in(1); create table mclparted2 partition of mclparted for values in(2); create table mclparted3_5 partition of mclparted for values in(3,5); create table mclparted4 partition of mclparted for values in(4); create index on mclparted (a); set enable_bitmapscan=0; set enable_sort=0; -- ordered scan using Append explain (costs off) select * from mclparted where a in(1,2) order by a; QUERY PLAN ------------------------------------------------------------------------ Append -> Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1 Index Cond: (a = ANY ('{1,2}'::integer[])) -> Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2 Index Cond: (a = ANY ('{1,2}'::integer[])) -- no ordered scan due to interleaved partition. Must use Merge Append explain (costs off) select * from mclparted where a in(3,4) order by a; QUERY PLAN ---------------------------------------------------------------------------- Merge Append Sort Key: mclparted.a -> Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1 Index Cond: (a = ANY ('{3,4}'::integer[])) -> Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2 Index Cond: (a = ANY ('{3,4}'::integer[])) Currently, this is a bit more strict than maybe it needs to be. I'm disabling the optimisation if any interleaved partitions remain after pruning, however, it would be ok to allow them providing their interleaved partner(s) were pruned. I think making that work might be a bit more costly as we'd need to track all partitions that were interleaved with each interleaved partition and ensure those were all pruned. As far as I can see that requires storing a Bitmapset per interleaved partition and makes the whole thing not so cheap. I'd really like to keep all this stuff cheap as possible. That's why I ended up calculating the interleaved partitions in create_list_bounds() rather than partitions_are_ordered(). The good news is that the code in partitions_are_ordered() became even more simple as a result of this change. We can do ordered scan simply when !bms_overlap(live_parts, boundinfo->interleaved_parts). The additional case we can now allow for RANGE partition is that we can now do ordered scan when there is a DEFAULT partition but it was pruned. Previously we had to disable the optimisation when there was a DEFAULT partition as we had no idea if it was pruned or not. David
Attachment
On Sat, 10 Jul 2021 at 03:24, David Rowley <dgrowleyml@gmail.com> wrote: > The good news is that the code in partitions_are_ordered() became even > more simple as a result of this change. We can do ordered scan simply > when !bms_overlap(live_parts, boundinfo->interleaved_parts). I've spent a bit more time revising the 0002 patch so that we're a bit more strict about when we mark a partition as interleaved. For example, if the DEFAULT partition happens to be the only partition, then I'm no longer classing that as interleaved as there's nothing for it to be interleaved with. This also fixes up the not-so-robust check that I had to check if the NULL partition allowed other Datums. David
Attachment
On Mon, Jul 12, 2021 at 11:47 AM David Rowley <dgrowleyml@gmail.com> wrote: > v3 patches 0001 looks mostly fine, except I thought the following could be worded to say that the bitmap members are offsets into the part_rels array. To avoid someone confusing them with RT indexes, for example. + Bitmapset *live_parts; /* Bitmap with members to indicate which + * partitions survived partition pruning. */ On 0002: interleaved_parts idea looks clever. I wonder if you decided that it's maybe not worth setting that field in the joinrel's PartitionBoundInfos? For example, adding the code that you added in create_list_bounds() also in merge_list_bounds(). ... The definition of interleaved + * is any partition that can contain multiple different values where exists at + * least one other partition which could contain a value which is between the + * multiple possible values in the other partition. The sentence sounds a bit off starting at "...where exists". How about: "A partition is considered interleaved if it contains multiple values such that there exists at least one other partition containing a value that lies between those values [ in terms of partitioning-defined ordering ]." Looks fine otherwise. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote: > > 0001 looks mostly fine, except I thought the following could be worded > to say that the bitmap members are offsets into the part_rels array. > To avoid someone confusing them with RT indexes, for example. > > + Bitmapset *live_parts; /* Bitmap with members to indicate which > + * partitions survived partition pruning. */ Yeah, agreed. I've adjusted that. > On 0002: > > interleaved_parts idea looks clever. I wonder if you decided that > it's maybe not worth setting that field in the joinrel's > PartitionBoundInfos? For example, adding the code that you added in > create_list_bounds() also in merge_list_bounds(). Currently generate_orderedappend_paths() only checks partitions_are_ordered() for base and other member rels, so setting the field for join rels would be a waste of effort given that it's not used for anything. I've not really looked into the possibility of enabling this optimization for partition-wise joined rels. I know that there's a bit more complexity now due to c8434d64c. I'm not really all that clear on which cases could be allowed here and which couldn't. It would require more analysis and I'd say that's a different patch. > ... The definition of interleaved > + * is any partition that can contain multiple different values where exists at > + * least one other partition which could contain a value which is between the > + * multiple possible values in the other partition. > > The sentence sounds a bit off starting at "...where exists". How about: I must have spent too long writing SQL queries. > "A partition is considered interleaved if it contains multiple values > such that there exists at least one other partition containing a value > that lies between those values [ in terms of partitioning-defined > ordering ]." That looks better. I took that with some small adjustments. > Looks fine otherwise. Thanks for the review. I had another self review of these and I'm pretty happy with them. I'm quite glad to see the performance of querying a single partition of a table with large numbers of partitions no longer tails off as much as it used to. David
Attachment
On Sun, Aug 1, 2021 at 5:31 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote:
>
> 0001 looks mostly fine, except I thought the following could be worded
> to say that the bitmap members are offsets into the part_rels array.
> To avoid someone confusing them with RT indexes, for example.
>
> + Bitmapset *live_parts; /* Bitmap with members to indicate which
> + * partitions survived partition pruning. */
Yeah, agreed. I've adjusted that.
> On 0002:
>
> interleaved_parts idea looks clever. I wonder if you decided that
> it's maybe not worth setting that field in the joinrel's
> PartitionBoundInfos? For example, adding the code that you added in
> create_list_bounds() also in merge_list_bounds().
Currently generate_orderedappend_paths() only checks
partitions_are_ordered() for base and other member rels, so setting
the field for join rels would be a waste of effort given that it's not
used for anything.
I've not really looked into the possibility of enabling this
optimization for partition-wise joined rels. I know that there's a bit
more complexity now due to c8434d64c. I'm not really all that clear on
which cases could be allowed here and which couldn't. It would require
more analysis and I'd say that's a different patch.
> ... The definition of interleaved
> + * is any partition that can contain multiple different values where exists at
> + * least one other partition which could contain a value which is between the
> + * multiple possible values in the other partition.
>
> The sentence sounds a bit off starting at "...where exists". How about:
I must have spent too long writing SQL queries.
> "A partition is considered interleaved if it contains multiple values
> such that there exists at least one other partition containing a value
> that lies between those values [ in terms of partitioning-defined
> ordering ]."
That looks better. I took that with some small adjustments.
> Looks fine otherwise.
Thanks for the review.
I had another self review of these and I'm pretty happy with them. I'm
quite glad to see the performance of querying a single partition of a
table with large numbers of partitions no longer tails off as much as
it used to.
David
Hi,
Some minor comment.
bq. Here we pass which partitioned
partitioned -> partitions
Here we look for partitions which
+ * might be interleaved with other partitions and set the
+ * interleaved_parts field with the partition indexes of any partitions
+ * which may be interleaved with another partition.
+ * might be interleaved with other partitions and set the
+ * interleaved_parts field with the partition indexes of any partitions
+ * which may be interleaved with another partition.
The above seems a little bit repetitive. It can be shortened to remove repetition.
Cheers
Thanks for having a look at this. On Mon, 2 Aug 2021 at 02:33, Zhihong Yu <zyu@yugabyte.com> wrote: > Here we look for partitions which > + * might be interleaved with other partitions and set the > + * interleaved_parts field with the partition indexes of any partitions > + * which may be interleaved with another partition. > > The above seems a little bit repetitive. It can be shortened to remove repetition. I agree that the word "partition" is mentioned quite a few times. The only one I can see that could be removed is the "partition indexes" one. Likely the details about which bit we set can be left up to the struct field comment in partbounds.h I've adjusted this to become: /* * Calculate interleaved partitions. Here we look for partitions which * might be interleaved with other partitions and set a bit in * interleaved_parts for any partitions which may be interleaved with * another partition. */ David
On Sun, Aug 1, 2021 at 9:31 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote: > > interleaved_parts idea looks clever. I wonder if you decided that > > it's maybe not worth setting that field in the joinrel's > > PartitionBoundInfos? For example, adding the code that you added in > > create_list_bounds() also in merge_list_bounds(). > > Currently generate_orderedappend_paths() only checks > partitions_are_ordered() for base and other member rels, so setting > the field for join rels would be a waste of effort given that it's not > used for anything. > > I've not really looked into the possibility of enabling this > optimization for partition-wise joined rels. I know that there's a bit > more complexity now due to c8434d64c. I'm not really all that clear on > which cases could be allowed here and which couldn't. It would require > more analysis and I'd say that's a different patch. Yeah, that makes sense. > > ... The definition of interleaved > > + * is any partition that can contain multiple different values where exists at > > + * least one other partition which could contain a value which is between the > > + * multiple possible values in the other partition. > > > > The sentence sounds a bit off starting at "...where exists". How about: > > I must have spent too long writing SQL queries. Hah. > I had another self review of these and I'm pretty happy with them. I'm > quite glad to see the performance of querying a single partition of a > table with large numbers of partitions no longer tails off as much as > it used to. Nice, glad to see the apply_scanjoin_target_to_paths() loop taken care of. Thank you. -- Amit Langote EDB: http://www.enterprisedb.com
On Mon, 2 Aug 2021 at 00:31, David Rowley <dgrowleyml@gmail.com> wrote: > I had another self review of these and I'm pretty happy with them. I'm > quite glad to see the performance of querying a single partition of a > table with large numbers of partitions no longer tails off as much as > it used to. I did some profiling and benchmarking on master and with the v4 patch. With a hash partitioned table containing 8192 partitions I see the following when running a query that selects a value from a single partition: 19.39% postgres [.] apply_scanjoin_target_to_paths 5.35% postgres [.] base_yyparse 4.71% postgres [.] AllocSetAlloc 2.86% libc-2.33.so [.] __memset_avx2_unaligned_erms 2.17% postgres [.] SearchCatCacheInternal With the patched version, I see: 5.89% postgres [.] AllocSetAlloc 3.97% postgres [.] base_yyparse 3.87% libc-2.33.so [.] __memset_avx2_unaligned_erms 2.44% postgres [.] SearchCatCacheInternal 1.29% postgres [.] hash_search_with_hash_value I'm getting: master: 16613 tps patched: 22078 tps So there's about 32% performance improvement with this number of partitions. These results are not the same as my original email here as I've only recently discovered that I really need to pin pgbench and the postgres backend to the same CPU core to get good and stable performance from a single threaded pgbench job. FWIW, the next thing there on the profile the following line in expand_partitioned_rtentry() relinfo->part_rels = (RelOptInfo **) palloc0(relinfo->nparts * sizeof(RelOptInfo *)); If anyone has any objections to both the v4 0001 and 0002 patch, can they let me know soon. I'm currently seeing no reason that they can't go in. David
On Mon, 2 Aug 2021 at 20:16, David Rowley <dgrowleyml@gmail.com> wrote: > If anyone has any objections to both the v4 0001 and 0002 patch, can > they let me know soon. I'm currently seeing no reason that they can't > go in. I've now pushed both of these. Thanks for the reviews. David
Hi David, On Mon, Aug 2, 2021 at 11:59 AM Amit Langote <amitlangote09@gmail.com> wrote: > On Sun, Aug 1, 2021 at 9:31 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote: > > > interleaved_parts idea looks clever. I wonder if you decided that > > > it's maybe not worth setting that field in the joinrel's > > > PartitionBoundInfos? For example, adding the code that you added in > > > create_list_bounds() also in merge_list_bounds(). > > > > Currently generate_orderedappend_paths() only checks > > partitions_are_ordered() for base and other member rels, so setting > > the field for join rels would be a waste of effort given that it's not > > used for anything. > > > > I've not really looked into the possibility of enabling this > > optimization for partition-wise joined rels. I know that there's a bit > > more complexity now due to c8434d64c. I'm not really all that clear on > > which cases could be allowed here and which couldn't. It would require > > more analysis and I'd say that's a different patch. > > Yeah, that makes sense. Related to the above, I noticed while looking at build_merged_partition_bounds() that db632fbca3 missed adding a line to that function to set interleaved_parts to NULL. Because the PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts of a "merged" bounds struct ends up pointing to garbage, so let's fix that. Attached a patch. -- Amit Langote EDB: http://www.enterprisedb.com
Attachment
On Thu, 30 Sept 2021 at 20:25, Amit Langote <amitlangote09@gmail.com> wrote: > Related to the above, I noticed while looking at > build_merged_partition_bounds() that db632fbca3 missed adding a line > to that function to set interleaved_parts to NULL. Because the > PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts > of a "merged" bounds struct ends up pointing to garbage, so let's fix > that. Attached a patch. Thanks for the patch. I think we also need to document that interleaved_parts is not set for join relations, otherwise someone may in the future try to use that field for an optimisation for join relations. At the moment, per generate_orderedappend_paths, we only handle IS_SIMPLE_REL type relations. I've attached a patch that updates the comments to mention this. David
Attachment
On Fri, Oct 1, 2021 at 7:07 AM David Rowley <dgrowleyml@gmail.com> wrote: > On Thu, 30 Sept 2021 at 20:25, Amit Langote <amitlangote09@gmail.com> wrote: > > Related to the above, I noticed while looking at > > build_merged_partition_bounds() that db632fbca3 missed adding a line > > to that function to set interleaved_parts to NULL. Because the > > PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts > > of a "merged" bounds struct ends up pointing to garbage, so let's fix > > that. Attached a patch. > > Thanks for the patch. > > I think we also need to document that interleaved_parts is not set for > join relations, otherwise someone may in the future try to use that > field for an optimisation for join relations. At the moment, per > generate_orderedappend_paths, we only handle IS_SIMPLE_REL type > relations. > > I've attached a patch that updates the comments to mention this. Looks good to me. Thanks. -- Amit Langote EDB: http://www.enterprisedb.com
On Fri, 1 Oct 2021 at 13:37, Amit Langote <amitlangote09@gmail.com> wrote: > > I've attached a patch that updates the comments to mention this. > > Looks good to me. Thanks. Thanks. Pushed. David