Record a Bitmapset of non-pruned partitions - Mailing list pgsql-hackers

From David Rowley
Subject Record a Bitmapset of non-pruned partitions
Date
Msg-id CAApHDvqnPx6JnUuPwaf5ao38zczrAb9mxt9gj4U1EKFfd4AqLA@mail.gmail.com
Whole thread Raw
Responses Re: Record a Bitmapset of non-pruned partitions
List pgsql-hackers
Back when working on 959d00e9d, to allow ordered partition scans for
LIST and RANGE partitioned tables, I mentioned [1] that if we had a
field that recorded a Bitmapset of the non-pruned partitions, we could
use that to do a more thorough check to see if ordered scans are
possible.

At the moment these ordered scans are possible if the order by matches
the partition key and, for LIST partitioned tables we must also ensure
that we don't have something like partition1 FOR VALUES IN(1,3) and
partition2 FOR VALUES(2,4). Here we can't scan partition1 first before
we'd get 3s before the 2s that are in partition2. Since we don't
record the list of partitions that survived pruning, I just made that
check to ensure all partitions only allow a single value to be stored.
That means the optimisation is not done in cases where it's possible
to do it.

To make this work, as I mentioned in [1], we really need a live_parts
field to track which partitions survived pruning.  If we have that
then we can ensure that just those partitions have no instances where
a lower value could appear in a later partition.

In the attached patch, I've only added the live_parts field and made
use of it in a few locations where the knowledge is useful as an
optimisation.  Right now apply_scanjoin_target_to_paths() appears in
profiles when planning a query to a partitioned table with many
partitions and just 1 or a few survive pruning. The reason for this is
simple; looping over a large almost entirely empty array is just slow
and using the Bitmapset as a sort of index into the interesting
elements of that array speeds it up, quite a bit.

Since we've already put quite a bit of effort into making that fast,
then I think it might be worth adding this field even if it was just
for that purpose.

With 10k empty hash partitions and 1 random one surviving partition
pruning, I see:

Master:
 18.13%  postgres  postgres            [.] apply_scanjoin_target_to_paths
   3.66%  postgres  postgres            [.] AllocSetAlloc
   2.05%  postgres  postgres            [.] hash_search_with_hash_value
   1.95%  postgres  libc-2.33.so        [.] __memset_avx2_unaligned_erms
   1.88%  postgres  postgres            [.] SearchCatCacheInternal
   1.55%  postgres  postgres            [.] base_yyparse
   0.74%  postgres  postgres            [.] get_relation_info
   0.68%  postgres  [kernel.kallsyms]   [k] __d_lookup_rcu
   0.61%  postgres  postgres            [.] palloc

Patched:
   3.72%  postgres  postgres            [.] AllocSetAlloc
   2.30%  postgres  postgres            [.] hash_search_with_hash_value
   2.22%  postgres  postgres            [.] SearchCatCacheInternal
   2.02%  postgres  libc-2.33.so        [.] __memset_avx2_unaligned_erms
   1.88%  postgres  postgres            [.] base_yyparse
   1.08%  postgres  postgres            [.] palloc
   0.92%  postgres  [kernel.kallsyms]   [k] __d_lookup_rcu

$ cat setup.sql
create table hp (a int primary key, b int not null) partition by hash(a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 10000, remainder '||x||');' from generate_series(0,9999) x;
\gexec

$ cat select.sql
\set p random(1,2000000)
select * from hp where a = :p

master

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2608.704218 (without initial connection time)
tps = 2607.641247 (without initial connection time)
tps = 2583.017011 (without initial connection time)

patched

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2715.993495 (without initial connection time)
tps = 2701.527640 (without initial connection time)
tps = 2707.343009 (without initial connection time)

Does anyone have any thoughts about if this is worth a new field in RelOptInfo?

David

[1] https://www.postgresql.org/message-id/CAKJS1f9W7sg1sb3SXiTQUovs%3DwDMrHATXv68F5dSbe5fuHH%2BiQ%40mail.gmail.com

Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Preventing abort() and exit() calls in libpq
Next
From: Bharath Rupireddy
Date:
Subject: Re: Refactor "mutually exclusive options" error reporting code in parse_subscription_options