Home > mailing lists

Re: [HACKERS] path toward faster partition pruning - Mailing list pgsql-hackers

From	Jesper Pedersen
Subject	Re: [HACKERS] path toward faster partition pruning
Date	September 26, 2017 17:57:15
Msg-id	08062340-3e9a-302a-8b63-c4c27aa8196b@redhat.com Whole thread Raw
In response to	Re: [HACKERS] path toward faster partition pruning (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: [HACKERS] path toward faster partition pruning
List	pgsql-hackers

Tree view

On 09/26/2017 10:33 AM, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Could you share your thoughts on the usage of PartitionAppendInfo's
>> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> This brings up something that I've kind of been thinking about.  There
> are sort of four cases when it comes to partition pruning:
> 
> 1. There is exactly one matching partition.  For example, this happens
> when there is an equality constraint on every partition column.
> 
> 2. There are multiple matching partitions which are consecutive.  For
> example, there is a single level of range partitioning with no default
> partition and the single partitioning column is constrained by < > <=
> or >=.
> 
> 3. There are multiple matching partitions which are not consecutive.
> This case is probably rare, but it can happen if there is a default
> partition, if there are list partitions with multiple bounds that are
> interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
> and the query allows values >= 4 and <= 5), if the query involves OR
> conditions, or if there are multiple levels of partitioning (e.g.
> partition by a, subpartition by b, put a range constraint on a and an
> equality constraint on b).
> 
> 4. There are no matching partitions.
> 
> One of the goals of this algorithm is to be fast.  The obvious way to
> cater to case (3) is to iterate through all partitions and test
> whether each one works, returning a Bitmapset, but that is O(n).
> Admittedly, it might be O(n) with a pretty small constant factor, but
> it still seems like exactly the sort of thing that we want to avoid
> given the desire to scale to higher partition counts.
> 
> I propose that we create a structure that looks like this:
> 
> struct foo {
>     int min_partition;
>     int max_partition;
>     Bitmapset *extra_partitions;
> };
> 
> This indicates that all partitions from min_partition to max_partition
> need to be scanned, and in addition any partitions in extra_partitions
> need to be scanned.  Assuming that we only consider cases where all
> partition keys or a leading subset of the partition keys are
> constrained, we'll generally be able to get by with just setting
> min_partition and max_partition, but extra_partitions can be used to
> handle default partitions and interleaved list bounds.  For equality
> on all partitioning columns, we can do a single bsearch of the bounds
> to identify the target partition at a given partitioning level, and
> the same thing works for a single range-bound.  If there are two
> range-bounds (< and > or <= and >= or whatever) we need to bsearch
> twice.  The default partition, if any and if matched, must also be
> included.  When there are multiple levels of partitioning things get a
> bit more complex -- if someone wants to knock out a partition that
> breaks up the range, we might need to shrink the main range to cover
> part of it and kick the other indexes out to extra_partitions.
> 
> But the good thing is that in common cases with only O(lg n) effort we
> can return O(1) data that describes what will be scanned.  In cases
> where that's not practical we expend more effort but still prune with
> maximal effectiveness.
> 

For OLTP style applications 1) would be the common case, and with hash 
partitions it would be one equality constraint.

So, changing the method signature to use a data type as you described 
above instead of the explicit min_datum_idx / max_datum_idx output 
parameters would be more clear.

One could advocate (*cough*) that the hash partition patch [1] should be 
merged first in order to find other instances of where other CommitFest 
entries doesn't account for hash partitions at the moment in their 
method signatures; Beena noted something similar in [2]. I know that you 
said otherwise [3], but this is CommitFest 1, so there is time for a 
revert later, and hash partitions are already useful in internal testing.

[1] https://commitfest.postgresql.org/14/1059/
[2] 
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com
[3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html

Best regards, Jesper


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

From: Craig Ringer
Date: 26 September 2017, 17:50:44
Subject: Re: [HACKERS] Built-in plugin for logical decoding output

From: Martín Marqués
Date: 26 September 2017, 18:13:25
Subject: Re: [HACKERS] Bug with pg_basebackup and 'shared' tablespace

Re: [HACKERS] path toward faster partition pruning - Mailing list pgsql-hackers

Previous

Next