Re: Record a Bitmapset of non-pruned partitions - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Record a Bitmapset of non-pruned partitions
Date
Msg-id CA+HiwqHnu-yiRfjPiNm4cd+b+7edtXL9NJABxViEYoJbjg293w@mail.gmail.com
Whole thread Raw
In response to Record a Bitmapset of non-pruned partitions  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Record a Bitmapset of non-pruned partitions  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On Wed, Jun 30, 2021 at 10:59 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Back when working on 959d00e9d, to allow ordered partition scans for
> LIST and RANGE partitioned tables, I mentioned [1] that if we had a
> field that recorded a Bitmapset of the non-pruned partitions, we could
> use that to do a more thorough check to see if ordered scans are
> possible.
>
> At the moment these ordered scans are possible if the order by matches
> the partition key and, for LIST partitioned tables we must also ensure
> that we don't have something like partition1 FOR VALUES IN(1,3) and
> partition2 FOR VALUES(2,4). Here we can't scan partition1 first before
> we'd get 3s before the 2s that are in partition2. Since we don't
> record the list of partitions that survived pruning, I just made that
> check to ensure all partitions only allow a single value to be stored.
> That means the optimisation is not done in cases where it's possible
> to do it.
>
> To make this work, as I mentioned in [1], we really need a live_parts
> field to track which partitions survived pruning.  If we have that
> then we can ensure that just those partitions have no instances where
> a lower value could appear in a later partition.
>
> In the attached patch, I've only added the live_parts field and made
> use of it in a few locations where the knowledge is useful as an
> optimisation.  Right now apply_scanjoin_target_to_paths() appears in
> profiles when planning a query to a partitioned table with many
> partitions and just 1 or a few survive pruning. The reason for this is
> simple; looping over a large almost entirely empty array is just slow
> and using the Bitmapset as a sort of index into the interesting
> elements of that array speeds it up, quite a bit.
>
> Since we've already put quite a bit of effort into making that fast,
> then I think it might be worth adding this field even if it was just
> for that purpose.
>
> With 10k empty hash partitions and 1 random one surviving partition
> pruning, I see:
>
> Master:
>  18.13%  postgres  postgres            [.] apply_scanjoin_target_to_paths
>    3.66%  postgres  postgres            [.] AllocSetAlloc
>    2.05%  postgres  postgres            [.] hash_search_with_hash_value
>    1.95%  postgres  libc-2.33.so        [.] __memset_avx2_unaligned_erms
>    1.88%  postgres  postgres            [.] SearchCatCacheInternal
>    1.55%  postgres  postgres            [.] base_yyparse
>    0.74%  postgres  postgres            [.] get_relation_info
>    0.68%  postgres  [kernel.kallsyms]   [k] __d_lookup_rcu
>    0.61%  postgres  postgres            [.] palloc
>
> Patched:
>    3.72%  postgres  postgres            [.] AllocSetAlloc
>    2.30%  postgres  postgres            [.] hash_search_with_hash_value
>    2.22%  postgres  postgres            [.] SearchCatCacheInternal
>    2.02%  postgres  libc-2.33.so        [.] __memset_avx2_unaligned_erms
>    1.88%  postgres  postgres            [.] base_yyparse
>    1.08%  postgres  postgres            [.] palloc
>    0.92%  postgres  [kernel.kallsyms]   [k] __d_lookup_rcu
>
> $ cat setup.sql
> create table hp (a int primary key, b int not null) partition by hash(a);
> select 'create table hp'||x|| ' partition of hp for values with
> (modulus 10000, remainder '||x||');' from generate_series(0,9999) x;
> \gexec
>
> $ cat select.sql
> \set p random(1,2000000)
> select * from hp where a = :p
>
> master
>
> $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
> tps = 2608.704218 (without initial connection time)
> tps = 2607.641247 (without initial connection time)
> tps = 2583.017011 (without initial connection time)
>
> patched
>
> $ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
> tps = 2715.993495 (without initial connection time)
> tps = 2701.527640 (without initial connection time)
> tps = 2707.343009 (without initial connection time)
>
> Does anyone have any thoughts about if this is worth a new field in RelOptInfo?

+1 from me.

I had proposed adding a live_parts bitmapset back in [1] (v12 work on
speeding up planning with partition) to address the
apply_scanjoin_target_to_paths() inefficiency among other things,
though Tom seemed to think that it wouldn't be worthwhile if only for
that purpose.   He'd suggested that we fix things elsewhere such that
that function is not needed in the first place [2], something I keep
thinking about in between doing other things, but never sit down to
actually write a patch.

Given that you're proposing more uses for live_parts, maybe he'd be
open to the idea.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/3f280722-46f2-c2a4-4c19-2cfa28c6c1cd%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/3529.1554051867%40sss.pgh.pa.us



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Numeric multiplication overflow errors
Next
From: Thomas Munro
Date:
Subject: Re: pgbench logging broken by time logic changes