Thread: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
I've been working on implementing a way to perform plan-time
partition-pruning that is hopefully faster than the current method of
using constraint exclusion to prune each of the potentially many
partitions one-by-one.  It's not fully cooked yet though.

Meanwhile, I thought I'd share a couple of patches that implement some
restructuring of the planner code related to partitioned table inheritance
planning that I think would be helpful.  They are to be applied on top of
the patches being discussed at [1].  Note that these patches themselves
don't implement the actual code that replaces constraint exclusion as a
method of performing partition pruning.  I will share that patch after
debugging it some more.

The main design goal of the patches I'm sharing here now is to defer the
locking and opening of leaf partitions in a given partition tree to a
point after set_append_rel_size() is called on the root partitioned
table.  Currently, AFAICS, we need to lock and open the child tables in
expand_inherited_rtentry() only to set the translated_vars field in the
AppendRelInfo that we create for each child.  ISTM, we can defer the
creation of a child's AppendRelInfo to the point when it (its
translated_vars field in particular) will actually be used, and so lock
and open the child tables only at that time.  Although we no longer lock
and open the partition child tables in expand_inherited_rtentry(), their
RT entries are still created and added to root->parse->rtable, so that
setup_simple_rel_arrays() knows the maximum number of entries
root->simple_rel_array will need to hold and can allocate the memory for
that array accordingly.  Slots in simple_rel_array[] corresponding to
partition child tables will remain empty until the corresponding
RelOptInfos are created, which happens when set_append_rel_size() is
called on the root parent table and determines the partitions that will
actually be scanned.

The patch augments the existing PartitionedChildRelInfo node, which
currently holds only the partitioned child rels' RT indexes, to carry
some more information about the partition tree, including the
information returned by RelationGetPartitionDispatchInfo() when it is
called from expand_inherited_rtentry() (per the proposed patch in [1],
we call it there to be able to add partitions to the query tree in bound
order).  Since PartitionedChildRelInfo now contains more information
about the partition tree than it used to, I thought the struct's name
was no longer apt, so I renamed it to PartitionRootInfo and renamed
root->pcinfo_list accordingly to prinfo_list.  That seems okay because
we only use that node internally.

Then, during the add_base_rels_to_query() step, when build_simple_rel()
builds a RelOptInfo for the root partitioned table, it also initializes
some newly introduced fields in RelOptInfo from the information
contained in the table's PartitionRootInfo.  These fields are only
initialized in the RelOptInfos of root partitioned tables.  Note that
the add_base_rels_to_query() step won't add the partition "otherrel"
RelOptInfos yet (unlike the regular inheritance case, where they are
added after looking them up in root->append_rel_list).

When set_append_rel_size() is called on the root partitioned table, it
will call find_partitions_for_query(), which, using the partition tree
information, determines the partitions that will need to be scanned for
the query.  This processing happens recursively: we first determine the
root parent's partitions, and then for each partition that is itself
partitioned, we determine its partitions, and so on.  As we determine
partitions in this per-partitioned-table manner, we maintain a pair
(parent_relid, list-of-partition-relids-to-scan) for each partitioned
table, and also a single list of all leaf partitions determined so far.
Once all partitions have been determined, we turn to locking the leaf
partitions.  The locking happens in the order of OIDs that
find_all_inheritors would have returned in expand_inherited_rtentry();
the list of OIDs in that original order is also stored in the table's
PartitionRootInfo node.  For each OID in that list, we check whether it
is in the set of leaf partition OIDs that was just computed, and if so,
lock it.  For each chosen partition that is a partitioned table
(including the root), we create a PartitionAppendInfo node which stores
the aforementioned pair (parent_relid, list-of-partition-relids-to-scan)
and append it to a list in the root table's RelOptInfo, with the root
table's PartitionAppendInfo at the head of the list.  Note that the list
of partitions in this pair contains only the immediate partitions, so
that the original parent-child relationships are reflected in the list
of PartitionAppendInfos thus collected.  The next patch, which will
implement the actual partition pruning, will add some more code that
runs under find_partitions_for_query().
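The locking step described above can be sketched as follows.  This is a
minimal Python model, not the actual C code: lock_fn stands in for the
actual lock-acquisition call, and the two arguments model the original
find_all_inheritors ordering and the set of surviving leaf partitions.

```python
def lock_chosen_partitions(oids_in_inheritor_order, chosen_leaf_oids, lock_fn):
    """Lock only the chosen (un-pruned) leaf partitions, but always in
    the order that find_all_inheritors() originally returned them, so
    that every backend acquires the locks in the same global order."""
    chosen = set(chosen_leaf_oids)
    locked = []
    for oid in oids_in_inheritor_order:
        if oid in chosen:
            lock_fn(oid)      # stand-in for the real lock call
            locked.append(oid)
    return locked
```

The point of filtering the original list, rather than iterating over the
chosen set, is that the lock order stays independent of the order in
which pruning happened to discover the partitions.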

set_append_rel_size() processing then continues for the root partitioned
table.  It is at this point that we create the RelOptInfos and
AppendRelInfos for the partitions: first for those of the root
partitioned table, and then for those of each partitioned table when
set_append_rel_size() is recursively called for the latter.


Note that this is still largely a WIP patch and the implementation details
might change per both the feedback here and the discussion at [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/befd7ec9-8f4c-6928-d330-ab05dbf860bf%40lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time
> partition-pruning that is hopefully faster than the current method of
> using constraint exclusion to prune each of the potentially many
> partitions one-by-one.  It's not fully cooked yet though.
>
> Meanwhile, I thought I'd share a couple of patches that implement some
> restructuring of the planner code related to partitioned table inheritance
> planning that I think would be helpful.  They are to be applied on top of
> the patches being discussed at [1].  Note that these patches themselves
> don't implement the actual code that replaces constraint exclusion as a
> method of performing partition pruning.  I will share that patch after
> debugging it some more.
>
> The main design goal of the patches I'm sharing here now is to defer the
> locking and  opening of leaf partitions in a given partition tree to a
> point after set_append_rel_size() is called on the root partitioned table.
>  Currently, AFAICS, we need to lock and open the child tables in
> expand_inherited_rtentry() only to set the translated_vars field in
> AppendRelInfo that we create for the child.  ISTM, we can defer the
> creation of a child AppendRelInfo to a point when it (its field
> translated_vars in particular) will actually be used and so lock and open
> the child tables only at such a time.  Although we don't lock and open the
> partition child tables in expand_inherited_rtentry(), their RT entries are
> still created and added to root->parse->rtable, so that
> setup_simple_rel_arrays() knows the maximum number of entries
> root->simple_rel_array will need to hold and allocate the memory for that
> array accordingly.   Slots in simple_rel_array[] corresponding to
> partition child tables will be empty until they are created when
> set_append_rel_size() is called on the root parent table and it determines
> the partitions that will be scanned after all.

Partition pruning can happen only after the quals have been distributed
to the Rels, i.e., after deconstruct_jointree(),
reconsider_outer_join_clauses(), and generate_base_implied_equalities()
have been called.  If the goal is to not heap_open() the partitions
which get pruned, we can't do that in expand_inherited_rtentry().  One
reason why I think we don't want to heap_open() partition relations is
to avoid relcache bloat from opened partition relations which are
ultimately pruned.  But please note that, according to your patches, we
still need to populate the catalog caches to get relkind, reltype, etc.

There are many functions that traverse simple_rel_array[] after it's
created.  Most of them assume that the empty entries in that array
correspond to non-simple range entries like join RTEs, but now we are
breaking that assumption.  Most of these functions also skip "other"
relations, so that may be OK now, but I am not sure it's really going to
be fine if we keep empty slots in place of partition relations.  There
may be three options here:
1. Add placeholder RelOptInfos for partition relations (maybe marked
specially) and mark the ones which get pruned as dummy later.
2. Prune the partitions before any function scans simple_rel_array[],
or postpone creating simple_rel_array until after pruning.
3. Examine all the current scanners, especially the ones which will be
called before pruning, to make sure that skipping "other" relations is
going to be kosher.

>
> Patch augments the existing PartitionedChildRelInfo node, which currently
> holds only the partitioned child rel RT indexes, to carry some more
> information about the partition tree, which includes the information
> returned by RelationGetPartitionDispatchInfo() when it is called from
> expand_inherited_rtentry() (per the proposed patch in [1], we call it to
> be able to add partitions to the query tree in the bound order).
> Actually, since PartitionedChildRelInfo now contains more information
> about the partition tree than it used to before, I thought the struct's
> name is no longer relevant, so renamed it to PartitionRootInfo and renamed
> root->pcinfo_list accordingly to prinfo_list.  That seems okay because we
> only use that node internally.
>
> Then during the add_base_rels_to_query() step, when build_simple_rel()
> builds a RelOptInfo for the root partitioned table, it also initializes
> some newly introduced fields in RelOptInfo from the information contained
> in PartitionRootInfo of the table.  The aforementioned fields are only
> initialized in RelOptInfos of root partitioned tables.  Note that the
> add_base_rels_to_query() step won't add the partition "otherrel"
> RelOptInfos yet (unlike the regular inheritance case, where they are,
> after looking them up in root->append_rel_list).

Partition-wise join requires the partition hierarchy to be expanded
level by level, keeping intact the parent-child relationship between a
partitioned table and its partitions.  Your patch doesn't do that; it
adds all the partitioning information to the root partitioned table's
RelOptInfo.  OTOH, the partition-wise join patch adds partition bounds,
key expressions, OIDs, and RelOptInfos of the immediate partitions
(including partitioned partitions) to the RelOptInfo of a partitioned
table (see patch 0002 in the latest set of patches at [1]).  I don't see
much point in having conflicting changes in both of our patches.  Maybe
you should review that patch from my set, and we can find a set of
members which help both partition pruning and partition-wise join.

>
> When set_append_rel_size() is called on the root partitioned table, it
> will call a find_partitions_for_query(), which using the partition tree
> information, determines the partitions that will need to be scanned for
> the query.  This processing happens recursively, that is, we first
> determine the root-parent's partitions and then for each partition that's
> partitioned, we will determine its partitions and so on.  As we determine
> partitions in this per-partitioned-table manner, we maintain a pair
> (parent_relid, list-of-partition-relids-to-scan) for each partitioned
> table and also a single list of all leaf partitions determined so far.
> Once all partitions have been determined, we turn to locking the leaf
> partitions.  The locking happens in the order of OIDs as
> find_all_inheritors would have returned in expand_inherited_rtentry(); the
> list of OIDs in that original order is also stored in the table's
> PartitionRootInfo node.  For each OID in that list, check if that OID is
> in the set of leaf partition OIDs that was just computed, and if so, lock
> it.  For all chosen partitions that are partitioned tables (including the
> root), we create a PartitionAppendInfo node which stores the
> aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
> append it to a list in the root table's RelOptInfo, with the root table's
> PartitionAppendInfo at the head of the list.  Note that the list of
> partitions in this pair contains only the immediate partitions, so that
> the original parent-child relationship is reflected in the list of
> PartitionAppendInfos thus collected.  The next patch that will implement
> actual partition-pruning will add some more code that will run under
> find_partitions_for_query().
>
> set_append_rel_size() processing then continues for the root partitioned
> table.  It is at this point that we will create the RelOptInfos and
> AppendRelInfos for partitions.  First for those of the root partitioned
> table and then for those of each partitioned table when
> set_append_rel_size() will be recursively called for the latter.

set_append_rel_size() and set_append_rel_pathlist() are already
recursive, so if we process expansion and pruning for one level in those
functions, the recursion will automatically take care of doing so for
every level.

>
>
> Note that this is still largely a WIP patch and the implementation details
> might change per both the feedback here and the discussion at [1].

The changes to the code which handles expansion in this patch set should
really be part of the expansion-in-bound-order thread, so that it's easy
to review all the changes together.  This thread can then concentrate
only on partition pruning.

[1] http://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Ashutosh,

Thanks for the comments and sorry that it took me a while to reply here.

On 2017/08/23 20:16, Ashutosh Bapat wrote:
> On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've been working on implementing a way to perform plan-time
>> partition-pruning that is hopefully faster than the current method of
>> using constraint exclusion to prune each of the potentially many
>> partitions one-by-one.  It's not fully cooked yet though.
>>
>> Meanwhile, I thought I'd share a couple of patches that implement some
>> restructuring of the planner code related to partitioned table inheritance
>> planning that I think would be helpful.  They are to be applied on top of
>> the patches being discussed at [1].  Note that these patches themselves
>> don't implement the actual code that replaces constraint exclusion as a
>> method of performing partition pruning.  I will share that patch after
>> debugging it some more.
>>
>> The main design goal of the patches I'm sharing here now is to defer the
>> locking and  opening of leaf partitions in a given partition tree to a
>> point after set_append_rel_size() is called on the root partitioned table.
>>  Currently, AFAICS, we need to lock and open the child tables in
>> expand_inherited_rtentry() only to set the translated_vars field in
>> AppendRelInfo that we create for the child.  ISTM, we can defer the
>> creation of a child AppendRelInfo to a point when it (its field
>> translated_vars in particular) will actually be used and so lock and open
>> the child tables only at such a time.  Although we don't lock and open the
>> partition child tables in expand_inherited_rtentry(), their RT entries are
>> still created and added to root->parse->rtable, so that
>> setup_simple_rel_arrays() knows the maximum number of entries
>> root->simple_rel_array will need to hold and allocate the memory for that
>> array accordingly.   Slots in simple_rel_array[] corresponding to
>> partition child tables will be empty until they are created when
>> set_append_rel_size() is called on the root parent table and it determines
>> the partitions that will be scanned after all.
> 
> The partition pruning can happen only after the quals have been
> distributed to Rels i.e. after deconstruct_jointree(),
> reconsider_outer_join_clauses() and generate_base_implied_equalities()
> have been called. If the goal is to not heap_open() the partitions
> which are pruned, we can't do that in expand_inherited_rtentry(). One
> reason why I think we don't want to heap_open() partition relations is
> to avoid relcache bloat because of opened partition relations, which
> are ultimately pruned. But please note that according to your patches,
> we still need to populate catalog caches to get relkind and reltype
> etc.

Yes, we still hit syscache for *all* partitions.  I haven't yet thought
very hard about avoiding that altogether.

> There are many functions that traverse simple_rel_array[] after it's
> created. Most of them assume that the empty entries in that array
> correspond to non-simple range entries like join RTEs. But now we are
> breaking that assumption. Most of these functions also skip "other"
> relations, so that may be OK now, but I am not sure if it's really
> going to be fine if we keep empty slots in place of partition
> relations. There may be three options here 1. add placeholder
> RelOptInfos for partition relations (may be mark those specially) and
> mark the ones which get pruned as dummy later. 2. Prune the partitions
> before any functions scans simple_rel_array[] or postpone creating
> simple_rel_array till pruning. 3. Examine all the current scanners
> esp. the ones which will be called before pruning to make sure that
> skipping "other" relations is going to be kosher.

Between the point when slots in simple_rel_array are allocated
(setup_simple_rel_arrays) and the point when partition RelOptInfos are
actually created after the partition-pruning step has occurred
(set_append_rel_size), it seems that most places that iterate over
simple_rel_array also know to skip slots containing NULL values.  We
might need to document that NULL can mean a partition in addition to its
current meaning of a non-baserel.

>> Patch augments the existing PartitionedChildRelInfo node, which currently
>> holds only the partitioned child rel RT indexes, to carry some more
>> information about the partition tree, which includes the information
>> returned by RelationGetPartitionDispatchInfo() when it is called from
>> expand_inherited_rtentry() (per the proposed patch in [1], we call it to
>> be able to add partitions to the query tree in the bound order).
>> Actually, since PartitionedChildRelInfo now contains more information
>> about the partition tree than it used to before, I thought the struct's
>> name is no longer relevant, so renamed it to PartitionRootInfo and renamed
>> root->pcinfo_list accordingly to prinfo_list.  That seems okay because we
>> only use that node internally.
>>
>> Then during the add_base_rels_to_query() step, when build_simple_rel()
>> builds a RelOptInfo for the root partitioned table, it also initializes
>> some newly introduced fields in RelOptInfo from the information contained
>> in PartitionRootInfo of the table.  The aforementioned fields are only
>> initialized in RelOptInfos of root partitioned tables.  Note that the
>> add_base_rels_to_query() step won't add the partition "otherrel"
>> RelOptInfos yet (unlike the regular inheritance case, where they are,
>> after looking them up in root->append_rel_list).
> 
> Partition-wise join requires the partition hierarchy to be expanded
> level-by-level keeping in-tact the parent-child relationship between
> partitioned table and its partitions. Your patch doesn't do that and
> adds all the partitioning information in the root partitioned table's
> RelOptInfo. OTOH, partition-wise join patch adds partition bounds, key
> expressions, OID and RelOptInfos of the immediate partitions
> (including partitioned partitions) to RelOptInfo of a partitioned
> table (see patch 0002 in the latest set of patches at [1]). I don't
> see much point in having conflicting changes in both of our patches.
> May be you should review that patch from my set  and we can find a set
> of members which help both partition pruning and partition-wise join.

Yes, I think it would be a good idea for the partition-pruning patch to
initialize those fields in the individual parents' RelOptInfos.  I will
review relevant patches in the partitionwise-join thread to see what can
be incorporated here.

>> When set_append_rel_size() is called on the root partitioned table, it
>> will call a find_partitions_for_query(), which using the partition tree
>> information, determines the partitions that will need to be scanned for
>> the query.  This processing happens recursively, that is, we first
>> determine the root-parent's partitions and then for each partition that's
>> partitioned, we will determine its partitions and so on.  As we determine
>> partitions in this per-partitioned-table manner, we maintain a pair
>> (parent_relid, list-of-partition-relids-to-scan) for each partitioned
>> table and also a single list of all leaf partitions determined so far.
>> Once all partitions have been determined, we turn to locking the leaf
>> partitions.  The locking happens in the order of OIDs as
>> find_all_inheritors would have returned in expand_inherited_rtentry(); the
>> list of OIDs in that original order is also stored in the table's
>> PartitionRootInfo node.  For each OID in that list, check if that OID is
>> in the set of leaf partition OIDs that was just computed, and if so, lock
>> it.  For all chosen partitions that are partitioned tables (including the
>> root), we create a PartitionAppendInfo node which stores the
>> aforementioned pair (parent_relid, list-of-partitions-relids-to-scan), and
>> append it to a list in the root table's RelOptInfo, with the root table's
>> PartitionAppendInfo at the head of the list.  Note that the list of
>> partitions in this pair contains only the immediate partitions, so that
>> the original parent-child relationship is reflected in the list of
>> PartitionAppendInfos thus collected.  The next patch that will implement
>> actual partition-pruning will add some more code that will run under
>> find_partitions_for_query().
>>
>> set_append_rel_size() processing then continues for the root partitioned
>> table.  It is at this point that we will create the RelOptInfos and
>> AppendRelInfos for partitions.  First for those of the root partitioned
>> table and then for those of each partitioned table when
>> set_append_rel_size() will be recursively called for the latter.
> 
> set_append_rel_size(), set_append_rel_pathlist() are already
> recursive, so if we process expansion and pruning for one level in
> those functions, the recursion will automatically take care of doing
> so for every level.

My only worry about that is that the locking order of leaf partitions
will differ across concurrent backends if we lock them in an order
dictated by traversing the partition tree one level at a time, because
such a traversal will presumably proceed in partition bound order.

The patch computes *all* leaf partitions that will need to be scanned by
the query (after pruning the needless ones) and locks the chosen
partitions in the original order (it keeps the original-order OID list
generated by find_all_inheritors around for this purpose).  While
computing the leaf partitions, it remembers the immediate parent-child
relationships in the process.  It does so by processing the partition
tree in a recursive depth-first manner and, in each recursive step,
creating a PartitionAppendInfo that maps a parent table to its immediate
children (only those that will satisfy the query).  Once the
PartitionAppendInfos for all the parents in the partition tree have been
computed, we resume the set_append_rel_size() processing, which takes
the root parent's PartitionAppendInfo and builds RelOptInfos and
AppendRelInfos for its immediate children.  Those children that are
partitioned tables themselves will recursively call
set_append_rel_size(), look up their own PartitionAppendInfos, and
create RelOptInfos and AppendRelInfos for their children, and so on.
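The depth-first walk just described can be modeled roughly as below.
This is a Python sketch, not planner code: children_of and satisfies are
hypothetical stand-ins for the partition descriptor and the pruning
test, and each emitted (parent, surviving-children) tuple plays the role
of one PartitionAppendInfo.  Because the parent's pair is recorded
before recursing, the root's pair naturally ends up at the head of the
list, as the patch describes.

```python
def find_partitions_for_query(parent, children_of, satisfies, out=None):
    """Depth-first walk of a partition tree.  For each (sub-)partitioned
    table, record a (parent, immediate-surviving-children) pair.
    children_of maps a partitioned table to its immediate partitions;
    satisfies is the pruning test applied to each immediate partition."""
    if out is None:
        out = []
    surviving = [c for c in children_of.get(parent, []) if satisfies(c)]
    out.append((parent, surviving))
    for child in surviving:
        if child in children_of:      # child is itself partitioned
            find_partitions_for_query(child, children_of, satisfies, out)
    return out
```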

>> Note that this is still largely a WIP patch and the implementation details
>> might change per both the feedback here and the discussion at [1].
> 
> The changes to code which handle expansion in this patch set should
> really be part of expansion in bound order thread so that it's easy to
> review all changes together. And this thread can then only concentrate
> on partition pruning.

I think I agree.  I'm posting today the patches that actually implement
partition pruning.  The previous patches do seem to belong on the EIBO
thread, but I will post them together here for the convenience of being
able to apply them to HEAD and try them out.  Now that Robert has posted
a patch to implement depth-first EIBO, I will have to find a way to
rebase the actual partition-pruning patches on this thread so that their
core logic works at all by finding the information it needs.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/08/21 15:37, Amit Langote wrote:
> Meanwhile, I thought I'd share a couple of patches that implement some
> restructuring of the planner code related to partitioned table inheritance
> planning that I think would be helpful.  They are to be applied on top of
> the patches being discussed at [1].  Note that these patches themselves
> don't implement the actual code that replaces constraint exclusion as a
> method of performing partition pruning.  I will share that patch after
> debugging it some more.
>
> The next patch that will implement
> actual partition-pruning will add some more code that will run under
> find_partitions_for_query().

Attached is now also the set of patches that implement the actual
partition-pruning logic, viz. the last three patches (0004, 0005, and
0006) of the attached set.

Because the patch helps avoid performing constraint exclusion on *all*
partitions for a given query, one might expect this to improve performance
for queries on partitioned tables and scale to a fairly large number of
partitions.  Here are some numbers for the partitioned table and the query
shown below:

\d+ ptab
Columns: (a date, b int, c text)
Partition key: RANGE (a, b)
Partitions:
ptab_00001 FOR VALUES FROM ('2017-08-31', 1) TO ('2017-08-31', 1000),
ptab_00002 FOR VALUES FROM ('2017-08-31', 1000) TO ('2017-08-31', 2000),
ptab_00003 FOR VALUES FROM ('2017-08-31', 2000) TO ('2017-08-31', 3000),
ptab_00004 FOR VALUES FROM ('2017-08-31', 3000) TO ('2017-08-31', 4000),
ptab_00005 FOR VALUES FROM ('2017-08-31', 4000) TO ('2017-08-31', 5000),

ptab_00006 FOR VALUES FROM ('2017-09-01', 1) TO ('2017-09-01', 1000),
ptab_00007 FOR VALUES FROM ('2017-09-01', 1000) TO ('2017-09-01', 2000),

...
ptab_NNNNN FOR VALUES FROM (..., 4000) TO (..., 5000),

A query that prunes all partitions (empty result!):

explain select * from ptab where a < '2017-08-31';

Comparison of the average response times (in milliseconds) after running
the same query 100 times using pgbench against the database:

  #: Number of partitions of ptab
c_e: Constraint exclusion
f_p: Fast pruning


    #      c_e      f_p
=====    =====     ====
   10      0.7      0.4
   50      1.8      0.6
  100      3.2      0.8
  500     16.8      2.7
 1000     36.2      5.0
 2000     79.7     10.2
 5000    214.7     27.0
10000    443.6     64.8

For each partitioned table in a given partition tree (provided it is not
pruned to begin with), the query's clauses are matched to its partition
key, and from the matched clauses a pair of bounding keys (Datum tuples
with at most key->partnatts values, possibly covering just a prefix of a
multi-column key) is generated.  They are passed to partition.c's
get_partitions_for_keys() as Datum *minkeys and Datum *maxkeys.  A list
of partitions covering that key range is returned.  When generating that
list, whether a particular scan key is inclusive or not is considered
along with the partitioning strategy.  It should be possible to support
hash partitioning with (hopefully) minimal changes to
get_partitions_for_keys().
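As a rough single-column model of what get_partitions_for_keys() is
described to do (an assumption-laden sketch: each range partition covers
a half-open [lower, upper) interval, both bounding keys are treated as
inclusive, and None means an open-ended key; the real code works on
Datum tuples, multi-column prefixes, and per-strategy inclusivity):

```python
def get_partitions_for_keys(bounds, minkey, maxkey):
    """bounds: sorted list of (lower, upper) pairs; partition i covers
    values v with lower <= v < upper.  minkey/maxkey delimit the query's
    inclusive key interval; None means unbounded on that side.  Returns
    the indexes of partitions that may hold matching rows."""
    result = []
    for i, (lo, hi) in enumerate(bounds):
        if minkey is not None and hi <= minkey:
            continue      # partition lies entirely below the interval
        if maxkey is not None and lo > maxkey:
            continue      # partition lies entirely above the interval
        result.append(i)
    return result
```

For example, with bounds [(0, 10), (10, 20), (20, 30)], the interval
[10, 15] selects only the middle partition.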

There are still certain limitations on the planner side of things:

1. OR clauses are not counted as contributing toward the bounding scan
   keys; currently only OpExprs and NullTests are recognized, so an OR
   clause gets skipped from consideration when generating the bounding
   keys to pass to partition.c.

2. Redundant clauses are not appropriately pre-processed; so if a query
   contains a = 10 AND a > 1, the latter clause will be matched, and
   partitions holding values with a > 1 and a < 10 will not be pruned,
   even if none of their rows can pass the query's condition.
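To illustrate limitation 2: the missing pre-processing would amount to
folding all matched clauses on a partition key into the tightest
(minkey, maxkey) interval, so that a = 10 AND a > 1 prunes as well as
a = 10 alone.  A hypothetical helper, not part of the patch, ignoring
inclusivity for brevity:

```python
def combine_bounds(clauses):
    """Fold (operator, value) clauses on one key into the tightest
    inclusive (minkey, maxkey) interval; None means unbounded."""
    minkey = maxkey = None
    for op, val in clauses:
        if op in ('>', '>=', '='):        # tightens the lower bound
            if minkey is None or val > minkey:
                minkey = val
        if op in ('<', '<=', '='):        # tightens the upper bound
            if maxkey is None or val < maxkey:
                maxkey = val
    return minkey, maxkey
```

With such folding, a = 10 AND a > 1 yields the interval (10, 10) rather
than (1, unbounded).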

Fixing these limitations, adding more regression tests and implementing
some of the things suggested by Ashutosh Bapat [1] to prevent conflicting
changes with some preparatory patches in the partitionwise-join patch
series [2] are TODOs.

Adding this to CF-201709 as "faster partition pruning in planner".

To try out the attached patches: apply the patches posted at [3] on HEAD
and then apply these on top.

Thanks,
Amit

[1]
https://postgr.es/m/CAFjFpRdb_fkmJHFjvAbB%2BLn0t45fWjekLd5pY%3Dsv%2BeAhBAKXPQ%40mail.gmail.com

[2]
https://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

[3]
https://www.postgresql.org/message-id/2124e99f-9528-0f71-4e10-ac7974dd7077%40lab.ntt.co.jp


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is now also the set of patches that implement the actual
> partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of
> the attached.

It strikes me that this patch set is doing two things but maybe in the
opposite order that I would have chosen to attack them.  First,
there's getting partition pruning to use something other than
constraint exclusion.  Second, there's deferring work that is
currently done at an early stage of the process until later, so that
we waste less effort on partitions that are ultimately going to be
pruned.

The second one is certainly a worthwhile goal, but there are fairly
firm interdependencies between the first one and some other things
that are in progress.  For example, the first one probably ought to be
done before hash partitioning gets committed, because
constraint-exclusion based partitioning pruning won't work with
partitioning pruning, but some mechanism based on asking the
partitioning code which partitions might match will.  Such a mechanism
is more efficient for list and range partitions, but it's the only
thing that will work for hash partitions.  Also, Beena Emerson is
working on run-time partition pruning, and the more I think about it,
the more I think that overlaps with this first part.  Both patches
need a mechanism to identify, given a btree-indexable comparison
operator (< > <= >= =) and a set of values, which partitions might
contain matching values.  Run-time partition pruning will call that at
execution time, and this patch will call it at plan time, but it's the
same logic; it's just a question of the point at which the values are
known.  And of course we don't want to end up with two copies of the
logic.
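The shared mechanism Robert describes -- given a btree-indexable
comparison operator and a value, identify which partitions might
contain matching rows -- can be sketched for range partitioning as a
binary search over an ordered list of bounds.  The following is a
minimal illustration with integer bounds; all names here are
hypothetical stand-ins, not the actual partition.c interface:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: range partitioning where partition i holds
 * values in [bounds[i-1], bounds[i]).  Given a btree strategy and a
 * constant, compute the contiguous range of partition indexes that
 * could contain matching rows.  The result is conservative: it may
 * include one extra partition at a boundary, which is safe for
 * pruning (a superset is never wrong, only less tight).
 */
typedef enum { STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT } Strategy;

/* Index of the partition whose range contains 'value'. */
static int
partition_for_value(const int *bounds, int nparts, int value)
{
    int     lo = 0,
            hi = nparts - 1;

    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (value < bounds[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

/* Fill [*first, *last] with the partitions surviving pruning. */
void
partitions_for_key(const int *bounds, int nparts, Strategy strat,
                   int value, int *first, int *last)
{
    int     idx = partition_for_value(bounds, nparts, value);

    switch (strat)
    {
        case STRAT_EQ:
            *first = *last = idx;
            break;
        case STRAT_LT:
        case STRAT_LE:
            *first = 0;
            *last = idx;
            break;
        case STRAT_GT:
        case STRAT_GE:
            *first = idx;
            *last = nparts - 1;
            break;
    }
}
```

The point of the thread is that both plan-time and run-time pruning
would funnel through one such routine; only the moment at which
'value' becomes known differs.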

Therefore, IMHO, it would be best to focus first on how we're going to
identify the partitions that survive pruning, and then afterwards work
on transposing that logic to happen before partitions are opened and
locked.  That way, we get some incremental benefit sooner, and also
unblock some other development work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the comments.

On 2017/09/02 2:52, Robert Haas wrote:
> On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is now also the set of patches that implement the actual
>> partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of
>> the attached.
> 
> It strikes me that this patch set is doing two things but maybe in the
> opposite order that I would have chosen to attack them.  First,
> there's getting partition pruning to use something other than
> constraint exclusion.  Second, there's deferring work that is
> currently done at an early stage of the process until later, so that
> we waste less effort on partitions that are ultimately going to be
> pruned.

OK.

> 
> The second one is certainly a worthwhile goal, but there are fairly
> firm interdependencies between the first one and some other things
> that are in progress.  For example, the first one probably ought to be
> done before hash partitioning gets committed, because
> constraint-exclusion based partition pruning won't work with
> hash partitioning, but some mechanism based on asking the
> partitioning code which partitions might match will.

Yeah.

> Such a mechanism
> is more efficient for list and range partitions, but it's the only
> thing that will work for hash partitions.  Also, Beena Emerson is
> working on run-time partition pruning, and the more I think about it,
> the more I think that overlaps with this first part.  Both patches
> need a mechanism to identify, given a btree-indexable comparison
> operator (< > <= >= =) and a set of values, which partitions might
> contain matching values.  Run-time partition pruning will call that at
> execution time, and this patch will call it at plan time, but it's the
> same logic; it's just a question of the point at which the values are
> known.  And of course we don't want to end up with two copies of the
> logic.

Agreed here too.

I agree that spending effort first on the deferment of locking, etc.
within the planner does not much benefit either the hash partitioning
or run-time pruning patches.

> Therefore, IMHO, it would be best to focus first on how we're going to
> identify the partitions that survive pruning, and then afterwards work
> on transposing that logic to happen before partitions are opened and
> locked.  That way, we get some incremental benefit sooner, and also
> unblock some other development work.

Alright, I will try to do it that way.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/04 10:10, Amit Langote wrote:
> On 2017/09/02 2:52, Robert Haas wrote:
>> It strikes me that this patch set is doing two things but maybe in the
>> opposite order that I would have chosen to attack them.  First,
>> there's getting partition pruning to use something other than
>> constraint exclusion.  Second, there's deferring work that is
>> currently done at an early stage of the process until later, so that
>> we waste less effort on partitions that are ultimately going to be
>> pruned.
> 
> OK.
> 
>> Therefore, IMHO, it would be best to focus first on how we're going to
>> identify the partitions that survive pruning, and then afterwards work
>> on transposing that logic to happen before partitions are opened and
>> locked.  That way, we get some incremental benefit sooner, and also
>> unblock some other development work.
> 
> Alright, I will try to do it that way.

Attached set of patches that does things that way.  Individual patches
described below:

[PATCH 1/5] Expand partitioned inheritance in a non-flattened manner

This will allow us to perform scan and join planning in a per partition
sub-tree manner, with each sub-tree's root getting its own RelOptInfo.
Previously, only the root of the whole partition tree would get a
RelOptInfo, along with the leaf partitions, with each leaf partition's
AppendRelInfo pointing to the former as its parent.

This is essential, because we will be doing partition-pruning for every
partitioned table in the tree by matching the query's scan keys with its
partition key.  We won't be able to do that if the intermediate
partitioned tables don't have a RelOptInfo.

[PATCH 2/5] WIP: planner-side changes for partition-pruning

This patch adds a stub get_partitions_for_keys in partition.c with a
suitable interface for the caller to pass bounding keys extracted from the
query and other related information.

Importantly, it implements the logic within the planner to match the query's
scan keys to a parent table's partition key and form the bounding keys
that will be passed to partition.c to compute the list of partitions that
satisfy those bounds.

Also, it adds certain fields to RelOptInfos of the partitioned tables that
reflect its partitioning properties.

[PATCH 3/5] WIP: Interface changes for partition_bound_{cmp/bsearch}

This guy modifies the partition bound comparison function so that the
caller can pass an incomplete partition key tuple that is potentially a
prefix of a multi-column partition key.  Currently, the input tuple must
contain all key->partnatts values, but that may not be true for queries,
which may not have restrictions on all the partition key columns.
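The essential idea of such a prefix comparison can be sketched as
follows -- comparing only the columns the query actually constrains,
and treating an exhausted prefix as "equal" so the partition cannot be
ruled out.  This is an illustration of the concept only, not the real
partition_bound_cmp signature:

```c
#include <assert.h>

/*
 * Illustrative sketch (not the actual partition_bound_cmp): compare a
 * possibly-incomplete key tuple against a stored bound.  Only the
 * first nvalues columns of the query's key are compared; if they all
 * match, return 0, meaning the bound is indistinguishable from the
 * key on the basis of the prefix alone, so the caller must keep the
 * corresponding partitions as potential matches.
 */
int
partition_prefix_cmp(const int *bound, const int *key, int nvalues)
{
    for (int i = 0; i < nvalues; i++)
    {
        if (key[i] < bound[i])
            return -1;
        if (key[i] > bound[i])
            return 1;
    }
    return 0;               /* prefix exhausted: treat as equal */
}
```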

[PATCH 4/5] WIP: Implement get_partitions_for_keys()

This one fills the get_partitions_for_keys stub with the actual logic that
searches the partdesc->boundinfo for partition bounds that match the
bounding keys specified by the caller.

[PATCH 5/5] Add more tests for the new partitioning-related planning

More tests.


Some TODO items still remain to be done:

* Process OR clauses to use for partition-pruning
* Process redundant clauses (such as a = 10 and a > 1) more smartly
* Other tricks that are missing
* Fix bugs
* More tests

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Forgot to mention a couple of important points about the relation of some
of the patches here to the patches and discussion at the
partitionwise-join thread [1].

On 2017/09/06 19:38, Amit Langote wrote:
> [PATCH 1/5] Expand partitioned inheritance in a non-flattened manner
> 
> This will allow us to perform scan and join planning in a per partition
> sub-tree manner, with each sub-tree's root getting its own RelOptInfo.
> Previously, only the root of the whole partition tree would get a
> RelOptInfo, along with the leaf partitions, with each leaf partition's
> AppendRelInfo pointing to the former as its parent.
> 
> This is essential, because we will be doing partition-pruning for every
> partitioned table in the tree by matching query's scan keys with its
> partition key.  We won't be able to do that if the intermediate
> partitioned tables didn't have a RelOptInfo.

There is a patch in Ashutosh's posted series of patches which does
more or less the same thing that this patch does.  He included it in his
series because, IIUC, the main partitionwise-join planning logic that
one of the later patches implements depends on being able to consider
applying that new planning technique individually for every partition
sub-tree, instead of just at the whole tree root.

One notable difference from his patch is that while his patch will
expand in a non-flattened manner even when the parent is the result
relation of a query, my patch doesn't in that case, because the new
partition-pruning technique cannot be applied to an inheritance parent
that is a result relation, for example,

update partitioned_table set ...

And AFAICS, partitionwise-join cannot be applied to such a parent either.

Note however that if there are other instances of the same partitioned
table (in the FROM list of an update statement) or other partitioned
tables in the query, they will be expanded in a non-flattened manner,
because they are themselves not the result relations of the query.  So,
the new partition-pruning and (supposedly) partitionwise-join can be
applied for those other partitioned tables.

> [PATCH 2/5] WIP: planner-side changes for partition-pruni[...]
> 
> Also, it adds certain fields to RelOptInfos of the partitioned tables that
> reflect its partitioning properties.

There is something called PartitionScheme, a notion that one of
Ashutosh's patches invented, which this patch incorporates as one of the
new RelOptInfo fields that I mentioned above (along with a list of
PartitionSchemes in the planner-global PlannerInfo).  Although
PartitionScheme is not significant for the task of partition-pruning
itself, it's still useful.  On Ashutosh's suggestion, I adopted the same
in my patch series, so that the partition-wise join patch doesn't end up
conflicting with the partition-pruning patch while trying to implement
the same and can get straight to the task of implementing partition-wise
joins.

The same patch in the partition-wise join series that introduces
PartitionScheme also introduces a RelOptInfo field called partexprs,
which records the partition key expressions.  Since partition-pruning
has use for it too, I incorporated it here, in a format that Ashutosh's
partition-wise join patch can use directly, instead of the latter having
to rework it to make it suitable to store partition key expressions of
joinrels.  Again, that's to minimize conflicts and let his patch find
the field ready to use as is, instead of implementing it first.

Lastly, this patch introduces a PartitionAppendInfo in a partitioned
table's RelOptInfo that stores AppendRelInfos of the partitions (child
relations) that survive partition-pruning, which serves to identify those
partitions' RelOptInfos.  Along with the identities of surviving
partitions, it also stores the partitioning configuration of the
partitioned table after partitions are pruned.  That includes
partdesc->boundinfo (which is simply a pointer into the table's relcache)
and a few other fields that are set by the partition-pruning code, such as
min_datum_index, max_datum_index, and null_partition_chosen, that describe the
result after pruning.  So, for two partitioned tables being joined, if the
boundinfos match per partition_bounds_equal() and these other fields
match, they can be safely partition-wise joined.
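The joinability test described in the paragraph above can be sketched
as a straight comparison of the post-pruning descriptors.  The struct
and field names below are illustrative stand-ins for
partdesc->boundinfo / PartitionAppendInfo and the fields just
mentioned; real code would go through partition_bounds_equal():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical post-pruning state of one partitioned rel. */
typedef struct
{
    const int *bound_datums;        /* stand-in for boundinfo datums */
    int        n_datums;
    int        min_datum_index;     /* first surviving datum */
    int        max_datum_index;     /* last surviving datum */
    bool       null_partition_chosen;
} PrunedPartitioning;

/* Stand-in for partition_bounds_equal(). */
static bool
bounds_equal(const PrunedPartitioning *a, const PrunedPartitioning *b)
{
    if (a->n_datums != b->n_datums)
        return false;
    for (int i = 0; i < a->n_datums; i++)
        if (a->bound_datums[i] != b->bound_datums[i])
            return false;
    return true;
}

/*
 * Two pruned partitioned rels can be partition-wise joined only if
 * their bounds match and the pruning results line up pairwise.
 */
bool
can_partitionwise_join(const PrunedPartitioning *a,
                       const PrunedPartitioning *b)
{
    return bounds_equal(a, b) &&
           a->min_datum_index == b->min_datum_index &&
           a->max_datum_index == b->max_datum_index &&
           a->null_partition_chosen == b->null_partition_chosen;
}
```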

[1]
https://www.postgresql.org/message-id/CAFjFpRfRDhWp%3DoguNjyzN%3DNMoOD%2BRCC3wS%2Bb%2BxbGKwKUk0dRKg%40mail.gmail.com




Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Thu, Sep 7, 2017 at 7:16 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> There is a patch in the Ashutosh's posted series of patches, which does
> more or less the same thing that this patch does.  He included it in his
> series of patches, because, IIUC, the main partitionwise-join planning
> logic that one of the later patch implements depends on being able to
> consider applying that new planning technique individually for every
> partition sub-tree, instead of just at the whole tree root.
>
> One notable difference from his patch is that while his patch will expand
> in non-flattened manner even in the case where the parent is the result
> relation of a query, my patch doesn't in that case, because the new
> partition-pruning technique cannot be applied to inheritance parent that
> is a result relation, for example,
>
> update partitioned_table set ...
>
> And AFAICS, partitionwise-join cannot be applied to such a parent either.
>
> Note however that if there are other instances of the same partitioned
> table (in the FROM list of an update statement) or other partitioned
> tables in the query, they will be expanded in a non-flattened manner,
> because they are themselves not the result relations of the query.  So,
> the new partition-pruning and (supposedly) partitionwise-join can be
> applied for those other partitioned tables.

It seems to me that it would be better not to write new patches for
things that already have patches without a really clear explanation
of what's wrong with the already-existing patch; I don't see any
such explanation here.  Instead of writing your own patch for this to
duel with his, why not review his and help him correct any
deficiencies which you can spot?  Then we have one patch with more
review instead of two patches with less review both of which I have to
read and try to decide which is better.

In this case, I think Ashutosh has the right idea.  I think that
handling the result-relation and non-result-relation differently
creates an unpleasant asymmetry.  With your patch, we have to deal
with three cases: (a) partitioned tables that were expanded
level-by-level because they are not result relations, (b) partitioned
tables that were expanded "flattened" because they are result
relations, and (c) non-partitioned tables that were expanded
"flattened".  With Ashutosh's approach, we only have two cases to
worry about in the future rather than three, and I like that better.

Your patch also appears to change things so that the table actually
referenced in the query ends up with an AppendRelInfo for the parent,
which seems pointless.  And it has no tests.

There are a couple of hunks from your patch that we might want or need
to incorporate into Ashutosh's patch.  The change to
relation_excluded_by_constraints() looks like it might be useful,
although it needs a better comment and some tests.  Also, Ashutosh's
patch has no equivalent of your change to add_paths_to_append_rel().
I'm not clear what the code you've introduced there is supposed to be
doing, and I'm suspicious that it is confusing "partition root" with
"table named in the query", which will often be the same but not
always; the user could have named an intermediate partition.  Can you
expand on what this is doing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/08 4:41, Robert Haas wrote:
> On Thu, Sep 7, 2017 at 7:16 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> There is a patch in the Ashutosh's posted series of patches, which does
>> more or less the same thing that this patch does.  He included it in his
>> series of patches, because, IIUC, the main partitionwise-join planning
>> logic that one of the later patch implements depends on being able to
>> consider applying that new planning technique individually for every
>> partition sub-tree, instead of just at the whole tree root.
>>
>> One notable difference from his patch is that while his patch will expand
>> in non-flattened manner even in the case where the parent is the result
>> relation of a query, my patch doesn't in that case, because the new
>> partition-pruning technique cannot be applied to inheritance parent that
>> is a result relation, for example,
>>
>> update partitioned_table set ...
>>
>> And AFAICS, partitionwise-join cannot be applied to such a parent either.
>>
>> Note however that if there are other instances of the same partitioned
>> table (in the FROM list of an update statement) or other partitioned
>> tables in the query, they will be expanded in a non-flattened manner,
>> because they are themselves not the result relations of the query.  So,
>> the new partition-pruning and (supposedly) partitionwise-join can be
>> applied for those other partitioned tables.
> 
> It seems to me that it would be better not to write new patches for
> things that already have patches without a really clear explanation
> of what's wrong with the already-existing patch; I don't see any
> such explanation here.  Instead of writing your own patch for this to
> duel with his, why not review his and help him correct any
> deficiencies which you can spot?  Then we have one patch with more
> review instead of two patches with less review both of which I have to
> read and try to decide which is better.

Sorry, I think I should have just used the Ashutosh's patch.

> In this case, I think Ashutosh has the right idea.  I think that
> handling the result-relation and non-result-relation differently
> creates an unpleasant asymmetry.  With your patch, we have to deal
> with three cases: (a) partitioned tables that were expanded
> level-by-level because they are not result relations, (b) partitioned
> tables that were expanded "flattened" because they are result
> relations, and (c) non-partitioned tables that were expanded
> "flattened".  With Ashutosh's approach, we only have two cases to
> worry about in the future rather than three, and I like that better.

I tend to agree with this now.

> Your patch also appears to change things so that the table actually
> referenced in the query ends up with an AppendRelInfo for the parent,
> which seems pointless.

Actually, it wouldn't, because my patch also got rid of the notion of
adding the duplicate RTE for the original parent, because I thought the
duplicate RTE was pointless in the partitioning case.

> There are a couple of hunks from your patch that we might want or need
> to incorporate into Ashutosh's patch.  The change to
> relation_excluded_by_constraints() looks like it might be useful,
> although it needs a better comment and some tests.

I think we could just drop that part from this patch.  It also looks like
Ashutosh has a patch elsewhere concerning this.

https://commitfest.postgresql.org/14/1108/

Maybe we could discuss what to do about this on that thread.  Now that not
only the root partitioned table, but also other partitioned tables in the
tree get an RTE with inh = true, I think it would be interesting to
consider his patch.

> Also, Ashutosh's
> patch has no equivalent of your change to add_paths_to_append_rel().
> I'm not clear what the code you've introduced there is supposed to be
> doing, and I'm suspicious that it is confusing "partition root" with
> "table named in the query", which will often be the same but not
> always; the user could have named an intermediate partition.  Can you
> expand on what this is doing?

I've replied on the partition-wise thread explaining why changes in the
add_paths_to_append_rel() are necessary.

Anyway, I'm dropping my patch in favor of the patch on the other thread.
Sorry for the duplicated effort involved in having to look at both the
patches.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZEUonD9dUZH1FBEyq%3DPEv_KvE3wC%3DA%3D0zm-_tRz_917A%40mail.gmail.com




Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 August 2017 at 18:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time
> partition-pruning that is hopefully faster than the current method of
> using constraint exclusion to prune each of the potentially many
> partitions one-by-one.  It's not fully cooked yet though.

I'm interested in seeing improvements in this area, so I've put my
name down to review this.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/15 10:55, David Rowley wrote:
> On 21 August 2017 at 18:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've been working on implementing a way to perform plan-time
>> partition-pruning that is hopefully faster than the current method of
>> using constraint exclusion to prune each of the potentially many
>> partitions one-by-one.  It's not fully cooked yet though.
> 
> I'm interested in seeing improvements in this area, so I've put my
> name down to review this.

Great, thanks!

I will post rebased patches later today, although I think the overall
design of the patch on the planner side of things is not quite there yet.
Of course, your and others' feedback is greatly welcome.

Also, I must inform all of those who're looking at this thread that I
won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23
(Sat), due to some personal business.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/04 10:10, Amit Langote wrote:
>> On 2017/09/02 2:52, Robert Haas wrote:

>
> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>
> This patch adds a stub get_partitions_for_keys in partition.c with a
> suitable interface for the caller to pass bounding keys extracted from the
> query and other related information.
>
> Importantly, it implements the logic within the planner to match query's
> scan keys to a parent table's partition key and form the bounding keys
> that will be passed to partition.c to compute the list of partitions that
> satisfy those bounds.
>

+ Node   *leftop = get_leftop(clause);
+
+ if (IsA(leftop, RelabelType))
+ leftop = (Node *) ((RelabelType *) leftop)->arg;
+
+ if (equal(leftop, partkey))

It appears that this patch always assumes that the clause will be of the
form "var op const", but it can also be "const op var".

That's the reason that in the example below, where the condition in both
queries is the same, it can only prune in the first case but not in the
second.

postgres=# explain select * from t where t.a < 2;
                       QUERY PLAN
--------------------------------------------------------
 Append  (cost=0.00..2.24 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
         Filter: (a < 2)
(3 rows)

postgres=# explain select * from t where 2>t.a;
                       QUERY PLAN
--------------------------------------------------------
 Append  (cost=0.00..4.49 rows=2 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
         Filter: (2 > a)
   ->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
         Filter: (2 > a)
(5 rows)
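The fix for this shape of clause is to commute it: rewrite
"const op var" as "var op' const", where op' is the commutator of op,
so the pruning code only ever has to handle the "var op const" form.
The sketch below illustrates the idea with a hard-coded commutator
table; in PostgreSQL proper the commutator would come from the operator
catalog (via get_commutator()), and the struct here is a hypothetical
simplification, not a planner node:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for an operator and a binary opclause. */
typedef enum { OP_LT, OP_LE, OP_EQ, OP_GE, OP_GT } OpCode;

typedef struct
{
    bool    var_on_left;    /* true for "var op const" */
    OpCode  op;
    int     constval;
} SimpleClause;

/* Hard-coded commutator table: "c op v" <=> "v commutator(op) c". */
static OpCode
commutator(OpCode op)
{
    switch (op)
    {
        case OP_LT: return OP_GT;
        case OP_LE: return OP_GE;
        case OP_EQ: return OP_EQ;
        case OP_GE: return OP_LE;
        case OP_GT: return OP_LT;
    }
    return op;
}

/* Normalize so the partition key is always on the left. */
void
commute_clause(SimpleClause *c)
{
    if (!c->var_on_left)
    {
        c->op = commutator(c->op);
        c->var_on_left = true;
    }
}
```

With this normalization, Dilip's second query's "2 > a" is seen by the
pruning code as "a < 2", and both queries prune identically.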

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Dilip,

Thanks for looking at the patch.

On 2017/09/15 13:43, Dilip Kumar wrote:
> On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote
>> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>>
>> This patch adds a stub get_partitions_for_keys in partition.c with a
>> suitable interface for the caller to pass bounding keys extracted from the
>> query and other related information.
>>
>> Importantly, it implements the logic within the planner to match query's
>> scan keys to a parent table's partition key and form the bounding keys
>> that will be passed to partition.c to compute the list of partitions that
>> satisfy those bounds.
> 
> + Node   *leftop = get_leftop(clause);
> +
> + if (IsA(leftop, RelabelType))
> + leftop = (Node *) ((RelabelType *) leftop)->arg;
> +
> + if (equal(leftop, partkey))
> 
> It appears that this patch always assume that clause will be of form
> "var op const", but it can also be "const op var"
> 
> That's the reason in below example where in both the queries condition
> is same it can only prune in the first case but not in the second.
> 
> postgres=# explain select * from t where t.a < 2;
>                        QUERY PLAN
> --------------------------------------------------------
>  Append  (cost=0.00..2.24 rows=1 width=8)
>    ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>          Filter: (a < 2)
> (3 rows)
> 
> postgres=# explain select * from t where 2>t.a;
>                        QUERY PLAN
> --------------------------------------------------------
>  Append  (cost=0.00..4.49 rows=2 width=8)
>    ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>          Filter: (2 > a)
>    ->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
>          Filter: (2 > a)
> (5 rows)

Yeah, there are a bunch of smarts still missing in that patch as it is.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/15 11:16, Amit Langote wrote:
> I will post rebased patches later today, although I think the overall
> design of the patch on the planner side of things is not quite there yet.
> Of course, your and others' feedback is greatly welcome.

Rebased patches attached.  Because Dilip complained earlier today about
clauses of the form (const op var) not causing partition-pruning, I've
added code to commute the clause where it is required.  Some other
previously mentioned limitations remain -- no handling of OR clauses, no
elimination of redundant clauses for a given partitioning column, etc.

A note about 0001: this patch overlaps with
0003-Canonical-partition-scheme.patch from the partitionwise-join patch
series that Ashutosh Bapat posted yesterday [1].  Because I implemented
the planner-portion of this patch based on what 0001 builds, I'm posting
it here.  It might actually turn out that we will review and commit
0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
0001 if you want to play with the later patches.  I would certainly like
to review  0003-Canonical-partition-scheme.patch myself, but won't be able
to immediately (see below).

> Also, I must inform all of those who're looking at this thread that I
> won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23,
> Sat, due to some personal business.

To remind.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-skmaqeCVaoAHCBqe2DyfO3f6sgdtEjHWrUgi0kV1yPLQ%40mail.gmail.com



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Rebased patches attached.  Because Dilip complained earlier today about
> clauses of the form (const op var) not causing partition-pruning, I've
> added code to commute the clause where it is required.  Some other
> previously mentioned limitations remain -- no handling of OR clauses, no
> elimination of redundant clauses for given partitioning column, etc.
>
> A note about 0001: this patch overlaps with
> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
> series that Ashutosh Bapat posted yesterday [1].

It doesn't merely overlap; it's obviously a derivative work, and the
commit message in your version should credit all the authors.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Sat, Sep 16, 2017 at 4:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].
>
> It doesn't merely overlap; it's obviously a derivative work,

Yes it is.  I noted upthread [1] that most of these are derived
from Ashutosh's patch, on his suggestion.  I guess I should have
repeated that in this message too, sorry.

> and the
> commit message in your version should credit all the authors.

That was a mistake on my part, too.  Will be careful hereon.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/0e829199-a43c-2a66-b966-89a0020a6cd4%40lab.ntt.co.jp



Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/15 11:16, Amit Langote wrote:

Thanks for the updated patch.  I was going through the logic of
get_rel_partitions in 0002 as almost similar functionality will be
required by runtime partition pruning on which Beena is working.  The
only difference is that here we are processing the
"rel->baserestrictinfo" and in case of runtime pruning, we also need
to process join clauses which are pushed down to appendrel.

So can we make some generic logic which can be used by both the patches?

So basically, we need to do two changes

1. In get_rel_partitions, instead of processing
"rel->baserestrictinfo", we can take a clause list as input; that way
we can pass any clause list to this function.

2. Don't call "get_partitions_for_keys" routine from the
"get_rel_partitions", instead, get_rel_partitions can just prepare
minkey, maxkey and the caller of the get_rel_partitions can call
get_partitions_for_keys, because for runtime pruning we need to call
get_partitions_for_keys at runtime.

Even after these changes there will be one remaining problem:
get_partitions_for_keys directly fetches the "rightop->constvalue",
whereas for runtime pruning we need to store rightop itself and
calculate the value at runtime by param evaluation.  I haven't yet
thought about how we can make this last part generic.
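The split Dilip proposes can be sketched in two steps: step 1 folds a
clause list into minkey/maxkey bounding keys; step 2 (the existing
get_partitions_for_keys, not shown) turns those keys into a set of
partitions.  The planner would run both at plan time, while runtime
pruning would rerun only step 2 once Param values are known.  All names
below are illustrative, not the actual get_rel_partitions interface,
and for brevity the strictness of < vs. <= is ignored, which only
widens the resulting range conservatively:

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-in for a matched clause on one partition column. */
typedef enum { OP_LT, OP_LE, OP_EQ, OP_GE, OP_GT } OpCode;

typedef struct
{
    OpCode  op;
    int     constval;
} KeyClause;

/*
 * Step 1: collapse all clauses on one partitioning column into a
 * single bounding interval [*minkey, *maxkey].  Redundant clauses
 * (e.g. a > 1 together with a >= 10) fold away naturally because we
 * keep only the tightest bound on each side.
 */
void
extract_bounding_keys(const KeyClause *clauses, int nclauses,
                      int *minkey, int *maxkey)
{
    *minkey = INT_MIN;
    *maxkey = INT_MAX;

    for (int i = 0; i < nclauses; i++)
    {
        const KeyClause *c = &clauses[i];

        switch (c->op)
        {
            case OP_EQ:         /* tightens both sides */
                if (c->constval > *minkey) *minkey = c->constval;
                if (c->constval < *maxkey) *maxkey = c->constval;
                break;
            case OP_GT:
            case OP_GE:         /* tightens the lower bound */
                if (c->constval > *minkey) *minkey = c->constval;
                break;
            case OP_LT:
            case OP_LE:         /* tightens the upper bound */
                if (c->constval < *maxkey) *maxkey = c->constval;
                break;
        }
    }
}
```

Because step 1 needs only a clause list, not rel->baserestrictinfo
specifically, the same routine could serve both plan-time pruning and
the pushed-down join clauses that runtime pruning must handle.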

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
I have done some refactoring of the code, moving the logic that collects
the matching clauses into a separate function so that it can fetch the
matching clauses from any given restriction list.

It can be applied on top of 0002-WIP:
planner-side-changes-for-partition-pruning.patch

On Sat, Sep 16, 2017 at 3:13 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.
>
> So can we make some generic logic which can be used for both the patches.
>
> So basically, we need to do two changes
>
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
>
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.
>
> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Dilip.

Thanks for looking at the patches and the comments.

On 2017/09/16 18:43, Dilip Kumar wrote:
> On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
> 
> Thanks for the updated patch.  I was going through the logic of
> get_rel_partitions in 0002 as almost similar functionality will be
> required by runtime partition pruning on which Beena is working.  The
> only difference is that here we are processing the
> "rel->baserestrictinfo" and in case of runtime pruning, we also need
> to process join clauses which are pushed down to appendrel.

Yeah, I agree with the point you seem to be making that
get_rel_partitions() covers a lot of functionality, which it would be nice
to break down into reusable function(s) with suitable signature(s) that
the executor will also be able to use.

Your proposed refactoring patch down-thread seems to be a good step in
that direction.  Thanks for working on it.

> So can we make some generic logic which can be used for both the patches.
> 
> So basically, we need to do two changes
> 
> 1. In get_rel_partitions instead of processing the
> "rel->baserestrictinfo" we can take clause list as input that way we
> can pass any clause list to this function.
> 
> 2. Don't call "get_partitions_for_keys" routine from the
> "get_rel_partitions", instead, get_rel_partitions can just prepare
> minkey, maxkey and the caller of the get_rel_partitions can call
> get_partitions_for_keys, because for runtime pruning we need to call
> get_partitions_for_keys at runtime.

It's not clear to me whether get_rel_partitions() itself, as it is, is
callable from outside the planner, because its signature contains
RelOptInfo.  We have the RelOptInfo in the signature because we want to
mark certain fields in it so that later planning steps can use them.  So,
get_rel_partitions()'s job is not just to match clauses and find
partitions, but also to perform certain planner-specific tasks of
generating information that the later planning steps will want to use.
That may turn out to be unnecessary, but until we know that, let's not try
to export get_rel_partitions() itself out of the planner.

OTOH, the function that your refactoring patch separates out to match
clauses to partition keys and extract bounding values seems reusable
outside the planner and we should export it in such a way that it can be
used in the executor.  Then, the hypothetical executor function that does
the pruning will first call the planner's clause-matching function,
followed by calling get_partitions_for_keys() in partition.c to get the
selected partitions.

We should be careful when designing the interface of the exported function
to make sure it's not bound to the planner.  Your patch still maintains
the RelOptInfo in the signature of the clause-matching function, which the
executor pruning function won't have access to.

> After these changes also there will be one problem that the
> get_partitions_for_keys is directly fetching the "rightop->constvalue"
> whereas, for runtime pruning, we need to store rightop itself and
> calculate the value at runtime by param evaluation,  I haven't yet
> thought how can we make this last part generic.

I don't think any code introduced by the patch in partition.c itself looks
inside OpExpr (or any type of clause for that matter).  That is, I don't
see where get_partitions_for_keys() is looking at rightop->constvalue.
All it receives to work with are arrays of Datums and some other relevant
information like inclusivity, nullness, etc.

By the way, I'm now rebasing these patches on top of [1] and will try to
merge your refactoring patch in some appropriate way.  Will post more
tomorrow.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826




Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Thanks for looking at the patches and the comments.

> It's not clear to me whether get_rel_partitions() itself, as it is, is
> callable from outside the planner, because its signature contains
> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
> mark certain fields in it so that latter planning steps can use them.  So,
> get_rel_partitions()'s job is not just to match clauses and find
> partitions, but also to perform certain planner-specific tasks of
> generating information that the later planning steps will want to use.
> That may turn out to be unnecessary, but until we know that, let's not try
> to export get_rel_partitions() itself out of the planner.
>
> OTOH, the function that your refactoring patch separates out to match
> clauses to partition keys and extract bounding values seems reusable
> outside the planner and we should export it in such a way that it can be
> used in the executor.  Then, the hypothetical executor function that does
> the pruning will first call the planner's clause-matching function,
> followed by calling get_partitions_for_keys() in partition.c to get the
> selected partitions.

Thanks for your reply.

Actually, we are still planning to call get_matching_clause only at
optimizer time.  We cannot use the get_rel_partitions function directly
for runtime pruning because it does all the work (finding the matching
clauses, creating minkey and maxkey, and calling
get_partitions_for_keys) at planning time itself.

For runtime pruning, we plan to first call the get_matching_clause
function at optimizer time to identify the clauses that match the
partition keys; but for the PARAM_EXEC case we cannot depend upon
baserestrictinfo and will instead get them from the join clauses.
That's the reason I separated out get_matching_clause, but it will
still be used at planning time.

After separating out the matching clauses, we will do processing
somewhat similar to what "get_rel_partitions" does.  But at optimizer
time we will not have Datum values for the rightop in the PARAM case,
so we will keep track of the PARAM itself.

And, finally, at runtime, when we get the PARAM value, we can prepare
minkey and maxkey and call the get_partitions_for_keys function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/25 20:21, Dilip Kumar wrote:
> On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
>> Thanks for looking at the patches and the comments.
> 
>> It's not clear to me whether get_rel_partitions() itself, as it is, is
>> callable from outside the planner, because its signature contains
>> RelOptInfo.  We have the RelOptInfo in the signature, because we want to
>> mark certain fields in it so that latter planning steps can use them.  So,
>> get_rel_partitions()'s job is not just to match clauses and find
>> partitions, but also to perform certain planner-specific tasks of
>> generating information that the later planning steps will want to use.
>> That may turn out to be unnecessary, but until we know that, let's not try
>> to export get_rel_partitions() itself out of the planner.
>>
>> OTOH, the function that your refactoring patch separates out to match
>> clauses to partition keys and extract bounding values seems reusable
>> outside the planner and we should export it in such a way that it can be
>> used in the executor.  Then, the hypothetical executor function that does
>> the pruning will first call the planner's clause-matching function,
>> followed by calling get_partitions_for_keys() in partition.c to get the
>> selected partitions.
> 
> Thanks for your reply.
> 
> Actually, we are still planning to call get_matching_clause at the
> optimizer time only.  Since we can not use get_rel_partitions function
> directly for runtime pruning because it does all the work (find
> matching clause, create minkey and maxkey and call
> get_partitions_for_keys) during planning time itself.
>
> For runtime pruning, we are planning to first get_matching_clause
> function during optimizer time to identify the clause which is
> matching with partition keys, but for PARAM_EXEC case we can not
> depend upon baserelrestriction instead we will get the from join
> clause, that's the reason I have separated out get_matching_clause.
> But it will still be used during planning time.

I see.  So, in the run-time pruning case, only the work of extracting
bounding values is deferred to execution time.  Matching clauses against
the partition key still occurs at planning time; it's just that the
clauses that require run-time pruning are not those in
rel->baserestrictinfo.

> After separating out the matching clause we will do somewhat similar
> processing what "get_rel_partitions" is doing. But, at optimizer time
> for PARAM we will not have Datum values for rightop, so we will keep
> track of the PARAM itself.

I guess information about which PARAMs map to which partition keys will be
kept in the plan somehow.

> And, finally at runtime when we get the PARAM value we can prepare
> minkey and maxkey and call get_partitions_for_keys function.

Note that get_partitions_for_keys() is not planner code, nor is it bound
with any other planning code.  It's callable from executor without much
change.  Maybe you already know that though.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Tue, Sep 26, 2017 at 2:45 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/25 20:21, Dilip Kumar wrote:

> I see.  So, in the run-time pruning case, only the work of extracting
> bounding values is deferred to execution time.  Matching clauses with the
> partition key still occurs during planning time.  Only that the clauses
> that require run-time pruning are not those in rel->baserestrictinfo.

Right.
>
>> After separating out the matching clause we will do somewhat similar
>> processing what "get_rel_partitions" is doing. But, at optimizer time
>> for PARAM we will not have Datum values for rightop, so we will keep
>> track of the PARAM itself.
>
> I guess information about which PARAMs map to which partition keys will be
> kept in the plan somehow.

Yes.
>
>> And, finally at runtime when we get the PARAM value we can prepare
>> minkey and maxkey and call get_partitions_for_keys function.
>
> Note that get_partitions_for_keys() is not planner code, nor is it bound
> with any other planning code.  It's callable from executor without much
> change.  Maybe you already know that though.

Yes, Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 09/15/2017 04:50 AM, Amit Langote wrote:
> On 2017/09/15 11:16, Amit Langote wrote:
>> I will post rebased patches later today, although I think the overall
>> design of the patch on the planner side of things is not quite there yet.
>> Of course, your and others' feedback is greatly welcome.
> 
> Rebased patches attached.  Because Dilip complained earlier today about
> clauses of the form (const op var) not causing partition-pruning, I've
> added code to commute the clause where it is required.  Some other
> previously mentioned limitations remain -- no handling of OR clauses, no
> elimination of redundant clauses for given partitioning column, etc.
> 
> A note about 0001: this patch overlaps with
> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
> the planner-portion of this patch based on what 0001 builds, I'm posting
> it here.  It might actually turn out that we will review and commit
> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
> 0001 if you want to play with the later patches.  I would certainly like
> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
> to immediately (see below).
> 

Could you share your thoughts on the usage of PartitionAppendInfo's 
min_datum_idx / max_datum_idx? Especially in relation to hash partitions.

I'm looking at get_partitions_for_keys.

Best regards, Jesper



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.

This brings up something that I've kind of been thinking about.  There
are sort of four cases when it comes to partition pruning:

1. There is exactly one matching partition.  For example, this happens
when there is an equality constraint on every partition column.

2. There are multiple matching partitions which are consecutive.  For
example, there is a single level of range partitioning with no default
partition and the single partitioning column is constrained by < > <=
or >=.

3. There are multiple matching partitions which are not consecutive.
This case is probably rare, but it can happen if there is a default
partition, if there are list partitions with multiple bounds that are
interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
and the query allows values >= 4 and <= 5), if the query involves OR
conditions, or if there are multiple levels of partitioning (e.g.
partition by a, subpartition by b, put a range constraint on a and an
equality constraint on b).

4. There are no matching partitions.

One of the goals of this algorithm is to be fast.  The obvious way to
cater to case (3) is to iterate through all partitions and test
whether each one works, returning a Bitmapset, but that is O(n).
Admittedly, it might be O(n) with a pretty small constant factor, but
it still seems like exactly the sort of thing that we want to avoid
given the desire to scale to higher partition counts.

I propose that we create a structure that looks like this:

struct foo {
    int min_partition;
    int max_partition;
    Bitmapset *extra_partitions;
};

This indicates that all partitions from min_partition to max_partition
need to be scanned, and in addition any partitions in extra_partitions
need to be scanned.  Assuming that we only consider cases where all
partition keys or a leading subset of the partition keys are
constrained, we'll generally be able to get by with just setting
min_partition and max_partition, but extra_partitions can be used to
handle default partitions and interleaved list bounds.  For equality
on all partitioning columns, we can do a single bsearch of the bounds
to identify the target partition at a given partitioning level, and
the same thing works for a single range-bound.  If there are two
range-bounds (< and > or <= and >= or whatever) we need to bsearch
twice.  The default partition, if any and if matched, must also be
included.  When there are multiple levels of partitioning things get a
bit more complex -- if someone wants to knock out a partition that
breaks up the range, we might need to shrink the main range to cover
part of it and kick the other indexes out to extra_partitions.

But the good thing is that in common cases with only O(lg n) effort we
can return O(1) data that describes what will be scanned.  In cases
where that's not practical we expend more effort but still prune with
maximal effectiveness.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
On 09/26/2017 10:33 AM, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Could you share your thoughts on the usage of PartitionAppendInfo's
>> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> This brings up something that I've kind of been thinking about.  There
> are sort of four cases when it comes to partition pruning:
> 
> 1. There is exactly one matching partition.  For example, this happens
> when there is an equality constraint on every partition column.
> 
> 2. There are multiple matching partitions which are consecutive.  For
> example, there is a single level of range partitioning with no default
> partition and the single partitioning column is constrained by < > <=
> or >=.
> 
> 3. There are multiple matching partitions which are not consecutive.
> This case is probably rare, but it can happen if there is a default
> partition, if there are list partitions with multiple bounds that are
> interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5),
> and the query allows values >= 4 and <= 5), if the query involves OR
> conditions, or if there are multiple levels of partitioning (e.g.
> partition by a, subpartition by b, put a range constraint on a and an
> equality constraint on b).
> 
> 4. There are no matching partitions.
> 
> One of the goals of this algorithm is to be fast.  The obvious way to
> cater to case (3) is to iterate through all partitions and test
> whether each one works, returning a Bitmapset, but that is O(n).
> Admittedly, it might be O(n) with a pretty small constant factor, but
> it still seems like exactly the sort of thing that we want to avoid
> given the desire to scale to higher partition counts.
> 
> I propose that we create a structure that looks like this:
> 
> struct foo {
>     int min_partition;
>     int max_partition;
>     Bitmapset *extra_partitions;
> };
> 
> This indicates that all partitions from min_partition to max_partition
> need to be scanned, and in addition any partitions in extra_partitions
> need to be scanned.  Assuming that we only consider cases where all
> partition keys or a leading subset of the partition keys are
> constrained, we'll generally be able to get by with just setting
> min_partition and max_partition, but extra_partitions can be used to
> handle default partitions and interleaved list bounds.  For equality
> on all partitioning columns, we can do a single bsearch of the bounds
> to identify the target partition at a given partitioning level, and
> the same thing works for a single range-bound.  If there are two
> range-bounds (< and > or <= and >= or whatever) we need to bsearch
> twice.  The default partition, if any and if matched, must also be
> included.  When there are multiple levels of partitioning things get a
> bit more complex -- if someone wants to knock out a partition that
> breaks up the range, we might need to shrink the main range to cover
> part of it and kick the other indexes out to extra_partitions.
> 
> But the good thing is that in common cases with only O(lg n) effort we
> can return O(1) data that describes what will be scanned.  In cases
> where that's not practical we expend more effort but still prune with
> maximal effectiveness.
> 

For OLTP style applications 1) would be the common case, and with hash 
partitions it would be one equality constraint.

So, changing the method signature to use a data type as you described 
above instead of the explicit min_datum_idx / max_datum_idx output 
parameters would be clearer.

One could advocate (*cough*) that the hash partition patch [1] should be 
merged first in order to find instances where other CommitFest entries 
don't account for hash partitions in their method signatures; Beena 
noted something similar in [2]. I know that you said otherwise [3], but 
this is CommitFest 1, so there is time for a revert later, and hash 
partitions are already useful in internal testing.

[1] https://commitfest.postgresql.org/14/1059/
[2] 
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com
[3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html

Best regards, Jesper



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> One could advocate (*cough*) that the hash partition patch [1] should be
> merged first in order to find other instances of where other CommitFest
> entries doesn't account for hash partitions at the moment in their method
> signatures; Beena noted something similar in [2]. I know that you said
> otherwise [3], but this is CommitFest 1, so there is time for a revert
> later, and hash partitions are already useful in internal testing.

Well, that's a fair point.  I was assuming that committing things in
that order would cause me to win the "least popular committer" award
at least for that day, but maybe not.  It's certainly not ideal to
have to juggle that patch along and keep rebasing it over other
changes when it's basically done, and just waiting on other
improvements to land.  Anybody else wish to express an opinion?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 25 September 2017 at 23:04, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> By the way, I'm now rebasing these patches on top of [1] and will try to
> merge your refactoring patch in some appropriate way.  Will post more
> tomorrow.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826

Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
on author while you're busy rebasing this.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/27 1:51, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> One could advocate (*cough*) that the hash partition patch [1] should be
>> merged first in order to find other instances of where other CommitFest
>> entries doesn't account for hash partitions at the moment in their method
>> signatures; Beena noted something similar in [2]. I know that you said
>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>> later, and hash partitions are already useful in internal testing.
> 
> Well, that's a fair point.  I was assuming that committing things in
> that order would cause me to win the "least popular committer" award
> at least for that day, but maybe not.  It's certainly not ideal to
> have to juggle that patch along and keep rebasing it over other
> changes when it's basically done, and just waiting on other
> improvements to land.  Anybody else wish to express an opinion?

FWIW, I tend to agree that it would be nice to get the hash partitioning
patch in, even with old constraint exclusion based partition-pruning not
working for hash partitions.  That way, it might be more clear what we
need to do in the partition-pruning patches to account for hash partitions.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David,

On 2017/09/27 6:04, David Rowley wrote:
> On 25 September 2017 at 23:04, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> By the way, I'm now rebasing these patches on top of [1] and will try to
>> merge your refactoring patch in some appropriate way.  Will post more
>> tomorrow.
>>
>> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826
> 
> Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
> on author while you're busy rebasing this.

Thanks for the reminder.  Just thought I'd say that while the rebasing
itself is actually done (attaching the rebased patches so you can try
them out), I'm now considering Robert's comments and will be busy for a
bit revising things based on them.

Some notes about the attached patches:

- 0001 includes refactoring that Dilip proposed upthread [1] (added him as
  an author).  I slightly tweaked his patch -- renamed the function
  get_matching_clause to match_clauses_to_partkey, similar to
  match_clauses_to_index.

- Code to set AppendPath's partitioned_rels in add_paths_to_append_rel()
  revised by 0a480502b09 (was originally introduced in d3cc37f1d80) is
  still revised to get partitioned_rels from a source that is not
  PlannerInfo.pcinfo_list.  With the new code, partitioned_rels won't
  contain RT indexes of the partitioned child tables that were pruned.


Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com


Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Jesper.

Firstly, thanks for looking at the patch.

On 2017/09/26 22:00, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 09/15/2017 04:50 AM, Amit Langote wrote:
>> On 2017/09/15 11:16, Amit Langote wrote:
>>> I will post rebased patches later today, although I think the overall
>>> design of the patch on the planner side of things is not quite there yet.
>>> Of course, your and others' feedback is greatly welcome.
>>
>> Rebased patches attached.  Because Dilip complained earlier today about
>> clauses of the form (const op var) not causing partition-pruning, I've
>> added code to commute the clause where it is required.  Some other
>> previously mentioned limitations remain -- no handling of OR clauses, no
>> elimination of redundant clauses for given partitioning column, etc.
>>
>> A note about 0001: this patch overlaps with
>> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch
>> series that Ashutosh Bapat posted yesterday [1].  Because I implemented
>> the planner-portion of this patch based on what 0001 builds, I'm posting
>> it here.  It might actually turn out that we will review and commit
>> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply
>> 0001 if you want to play with the later patches.  I would certainly like
>> to review  0003-Canonical-partition-scheme.patch myself, but won't be able
>> to immediately (see below).
>>
> 
> Could you share your thoughts on the usage of PartitionAppendInfo's
> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions.
> 
> I'm looking at get_partitions_for_keys.

Sure.  You may have noticed that min_datum_idx and max_datum_idx in
PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums
array, of datums that lie within the query-specified range (that is, using
=, >, >=, <, <= btree operators in the query). That array contains bounds
of all partitions, sorted in the order defined by the btree operator class,
at least for list and range partitioning.  I haven't (yet) closely looked at
the composition of hash partition datums in PartitionBoundInfo, which
perhaps have different ordering properties (or maybe none) than list and
range partitioning datums.

Now, since they are offsets of datums in PartitionBoundInfo, not indexes
of partitions themselves, their utility outside partition.c might be
questionable.  But partition-wise join patch, for example, to determine if
two partitioned tables can be joined partition-wise, is going to check if
PartitionBoundInfos in RelOptInfos of two partitioned tables are
bound-to-bound equal [1].  Partition-pruning may select only a subset of
partitions of each of the joining partitioned tables.  The equi-join
requirement for partition-wise join means that the subset of partitions
will be the same on both sides of the join.  My intent in having
min_datum_idx, max_datum_idx, along with contains_null_partition, and
contains_default_partition in PartitionAppendInfo is to have sort of a
cross-check that those values end up being the same on both sides of the join
after equi-join requirement has been satisfied.  That is,
get_partitions_for_keys will have chosen the same set of partitions for
both partitioned tables and hence will have set the same values for those
fields.

As mentioned above, that may be enough for list and range partitioning,
but since hash partition datums do not appear to have the same properties
as list and range datums [2], we may need an additional field(s) to
describe the hash partition selected by get_partitions_for_keys.  I guess
only one field will be enough: the offset in the datums array of the hash
partition chosen for the query, or -1 if the query quals couldn't
conclusively determine one (maybe not all partition keys were specified
to be hashed, or some or all used a non-equality operator).

Hope that answers your question at least to some degree.  Now, there are
some points Robert mentioned in his reply that I will need to also
consider in the patch, which I'll go do now. :)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com

[2] It appears that in the hash partitioning case, unlike list and range
    partitioning, PartitionBoundInfo doesn't contain values that are
    directly comparable to query-specified constants, but a pair (modulus,
    remainder) for each partition.  We'll first hash *all* the key values
    (mentioned in the query) using the partitioning hash machinery and
    then determine the hash partition index by using
    (hash % largest_modulus) as an offset into the
    PartitionBoundInfo.indexes array.
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] path toward faster partition pruning

From
amul sul
Date:
On Wed, Sep 27, 2017 at 6:09 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2017/09/27 1:51, Robert Haas wrote:
> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> One could advocate (*cough*) that the hash partition patch [1] should be
>> merged first in order to find other instances of where other CommitFest
>> entries doesn't account for hash partitions at the moment in their method
>> signatures; Beena noted something similar in [2]. I know that you said
>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>> later, and hash partitions are already useful in internal testing.
>
> Well, that's a fair point.  I was assuming that committing things in
> that order would cause me to win the "least popular committer" award
> at least for that day, but maybe not.  It's certainly not ideal to
> have to juggle that patch along and keep rebasing it over other
> changes when it's basically done, and just waiting on other
> improvements to land.  Anybody else wish to express an opinion?

FWIW, I tend to agree that it would be nice to get the hash partitioning
patch in, even with old constraint exclusion based partition-pruning not
working for hash partitions.  That way, it might be more clear what we
need to do in the partition-pruning patches to account for hash partitions.
+1

regards,
Amul

Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

I was looking into the latest patch set, seems like we can reuse some
more code between this path and runtime pruning[1]

+ foreach(lc1, matchedclauses[i])
+ {
+ Expr   *clause = lfirst(lc1);
+ Const  *rightop = (Const *) get_rightop(clause);
+ Oid opno = ((OpExpr *) clause)->opno,
+ opfamily = rel->part_scheme->partopfamily[i];
+ StrategyNumber strategy;
+
+ strategy = get_op_opfamily_strategy(opno, opfamily);
+ switch (strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (need_next_max)
+ {
+ maxkeys[i] = rightop->constvalue;
+ if (!maxkey_set[i])
+ n_maxkeys++;
+ maxkey_set[i] = true;
+ max_incl = (strategy == BTLessEqualStrategyNumber);
+ }
+ if (strategy == BTLessStrategyNumber)
+ need_next_max = false;

I think the above logic is common between this patch and the runtime
pruning.  I think we can make
a reusable function.  Here we are preparing minkey and maxkey of Datum
because we can directly fetch rightop->constvalue whereas for runtime
pruning we are making minkeys and maxkeys of Expr because during
planning time we don't have the values for the Param.  I think we can
always make these minkey, maxkey array of Expr and later those can be
processed in whatever way we want it.  So this path will fetch the
constval out of Expr and runtime pruning will Eval that expression at
runtime.

Does this make sense, or will it cause one level of extra processing
for this path, i.e., converting the Expr array to a Const array?
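A rough sketch of what such a shared representation might look like, with
ints standing in for Datum/Expr and every name invented for illustration:
each bound is an "expression" that either already carries a constant
(plan time) or is evaluated once parameter values are available (run
time).

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for an Expr-based min/max key entry. */
typedef struct BoundExpr
{
    bool  is_const;             /* value known at plan time? */
    int   constvalue;           /* valid when is_const */
    int (*eval)(void *ctx);     /* run-time evaluator otherwise */
    void *ctx;                  /* state for eval, e.g. a Param slot */
} BoundExpr;

/* Example run-time evaluator: reads the parameter value from ctx. */
static int
eval_param(void *ctx)
{
    return *(int *) ctx;
}

/*
 * Plan-time pruning fetches the constant directly; run-time pruning
 * calls eval.  Either way, the caller sees a plain value.
 */
static int
bound_value(const BoundExpr *e)
{
    return e->is_const ? e->constvalue : e->eval(e->ctx);
}
```

The point is that the code that builds the minkey/maxkey arrays does not
need to care which of the two cases it is in.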

[1] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/28 13:58, Dilip Kumar wrote:
> On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> 
> I was looking into the latest patch set, seems like we can reuse some
> more code between this path and runtime pruning[1]
> 
> + foreach(lc1, matchedclauses[i])
> + {
> + Expr   *clause = lfirst(lc1);
> + Const  *rightop = (Const *) get_rightop(clause);
> + Oid opno = ((OpExpr *) clause)->opno,
> + opfamily = rel->part_scheme->partopfamily[i];
> + StrategyNumber strategy;
> +
> + strategy = get_op_opfamily_strategy(opno, opfamily);
> + switch (strategy)
> + {
> + case BTLessStrategyNumber:
> + case BTLessEqualStrategyNumber:
> + if (need_next_max)
> + {
> + maxkeys[i] = rightop->constvalue;
> + if (!maxkey_set[i])
> + n_maxkeys++;
> + maxkey_set[i] = true;
> + max_incl = (strategy == BTLessEqualStrategyNumber);
> + }
> + if (strategy == BTLessStrategyNumber)
> + need_next_max = false;
> 
> I think the above logic is common between this patch and the runtime
> pruning.  I think we can make
> a reusable function.  Here we are preparing minkey and maxkey of Datum
> because we can directly fetch rightop->constvalue whereas for runtime
> pruning we are making minkeys and maxkeys of Expr because during
> planning time we don't have the values for the Param.  I think we can
> always make these minkey, maxkey array of Expr and later those can be
> processed in whatever way we want it.  So this path will fetch the
> constval out of Expr and runtime pruning will Eval that expression at
> runtime.

I think that makes sense.  In fact we could even move the minkey/maxkey
collection code to match_clauses_to_partkey() itself.  No need for a
different function and worrying about defining a separate interface for
the same.  We match clauses exactly because we want to extract the
constant or param values out of them.  No need to do the two activities
independently and in different places.

> Does this make sense, or will it cause one level of extra processing
> for this path, i.e., converting the Expr array to a Const array?

Hm, it's not such a big cost to pay I'd think.

I will update the planner patch accordingly.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Thu, Sep 28, 2017 at 1:44 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/28 13:58, Dilip Kumar wrote:

>> I think the above logic is common between this patch and the runtime
>> pruning.  I think we can make
>> a reusable function.  Here we are preparing minkey and maxkey of Datum
>> because we can directly fetch rightop->constvalue whereas for runtime
>> pruning we are making minkeys and maxkeys of Expr because during
>> planning time we don't have the values for the Param.  I think we can
>> always make these minkey, maxkey array of Expr and later those can be
>> processed in whatever way we want it.  So this path will fetch the
>> constval out of Expr and runtime pruning will Eval that expression at
>> runtime.
>
> I think that makes sense.  In fact we could even move the minkey/maxkey
> collection code to match_clauses_to_partkey() itself.  No need for a
> different function and worrying about defining a separate interface for
> the same.  We match clauses exactly because we want to extract the
> constant or param values out of them.  No need to do the two activities
> independently and in different places.
>

+1


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 27 September 2017 at 14:22, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>   an author).  I slightly tweaked his patch -- renamed the function
>   get_matching_clause to match_clauses_to_partkey, similar to
>   match_clauses_to_index.

Hi Amit,

I've made a pass over the 0001 patch while trying to get myself up to
speed with the various developments that are going on in partitioning
right now.

These are just my thoughts from reading over the patch. I understand
that there's quite a bit up in the air right now about how all this is
going to work, but I have some thoughts about that too, which I
wouldn't mind some feedback on as my line of thought may be off.

Anyway, I will start with some small typos that I noticed while reading:

partition.c:

+ * telling what kind of NullTest has been applies to the corresponding

"applies" should be "applied"

plancat.c:

* might've occurred to satisfy the query.  Rest of the fields are set in

"Rest of the" should be "The remaining"

And onto more serious stuff:

allpath.c:

get_rel_partitions()

I wonder if this function does not belong in partition.c. I'd have
expected a function to exist per partition type: RANGE and LIST, I'd
imagine, should each have their own function in partition.c to eliminate
partitions which cannot possibly match, and return the remainder. I see
people
speaking of HASH partitioning, but we might one day also want
something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
routing records to be processed into foreign tables where you never
need to query them again). It would be good if we could easily expand
this list and not have to touch files all over the optimizer to do
that. Of course, there would be other work to do in the executor to
implement any new partitioning method too.

I know there's mention of it somewhere about get_rel_partitions() not
being so smart about WHERE partkey > 20 AND partkey > 10, the code
does not understand what's more restrictive. I think you could
probably follow the same rules here that are done in
eval_const_expressions(). Over there I see that evaluate_function()
will call anything that's not marked as volatile. I'd imagine, for
each partition key, you'd want to store a Datum with the minimum and
maximum possible value based on the quals processed. If either the
minimum or maximum is still set to NULL, then it's unbounded at that
end. If you encounter partkey = Const, then minimum and maximum can be
set to the value of that Const. From there you can likely ignore any
other quals for that partition key, as if the query did contain
another qual with partkey = SomeOtherConst, then that would have
become a gating qual during the constant folding process. This way if
the user had written WHERE partkey >= 1 AND partkey <= 1 the
evaluation would end up the same as if they'd written WHERE partkey =
1 as the minimum and maximum would be the same value in both cases,
and when those two values are the same then it would mean just one
theoretical binary search on a partition range to find the correct
partition instead of two.
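The accumulation rule sketched above might look like this (illustrative
only: ints stand in for Datums, single-character strategies for the btree
strategy numbers, and the inclusive/exclusive distinction is glossed
over):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-partition-key bounds gathered from the quals; an unset flag
 * plays the role of the NULL Datum, i.e. unbounded at that end. */
typedef struct KeyBounds
{
    bool min_set, max_set;
    int  min, max;
} KeyBounds;

/* Tighten the bounds with one qual: strategy is '<', '>' or '='. */
static void
apply_qual(KeyBounds *kb, char strategy, int val)
{
    if (strategy == '=')
    {
        /* Equality pins both ends; a conflicting partkey = Const
         * would already have become a gating qual during constant
         * folding, so later quals on this key can be ignored. */
        kb->min = kb->max = val;
        kb->min_set = kb->max_set = true;
    }
    else if (strategy == '>')
    {
        if (!kb->min_set || val > kb->min)
            kb->min = val;
        kb->min_set = true;
    }
    else if (strategy == '<')
    {
        if (!kb->max_set || val < kb->max)
            kb->max = val;
        kb->max_set = true;
    }
}
```

With this shape, WHERE partkey > 20 AND partkey > 10 collapses to a
single lower bound of 20, and min == max signals that one binary search
suffices.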

I see in get_partitions_for_keys you've crafted the function to return
something to identify which partitions need to be scanned. I think it
would be nice to see a special element in the partition array for the
NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
would be defined constants somewhere. The signature of that function
is not so pretty and that would likely tidy it up a bit. The matching
partition indexes could be returned as a Bitmapset, yet, I don't
really see any code which handles adding the NULL and DEFAULT
partition in get_rel_partitions() either, maybe I've just not looked
hard enough yet...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Thu, Sep 28, 2017 at 5:16 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const. From there you can likely ignore any
> other quals for that partition key, as if the query did contain
> another qual with partkey = SomeOtherConst, then that would have
> become a gating qual during the constant folding process. This way if
> the user had written WHERE partkey >= 1 AND partkey <= 1 the
> evaluation would end up the same as if they'd written WHERE partkey =
> 1 as the minimum and maximum would be the same value in both cases,
> and when those two values are the same then it would mean just one
> theoretical binary search on a partition range to find the correct
> partition instead of two.

I have not looked at the code submitted here in detail yet but I do
think we should try to avoid wasting cycles in the
presumably-quite-common case where equality is being tested.  The
whole idea of thinking of this as minimum/maximum seems like it might
be off precisely for that reason.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/30 1:28, Robert Haas wrote:
> On Thu, Sep 28, 2017 at 5:16 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> I'd imagine, for
>> each partition key, you'd want to store a Datum with the minimum and
>> maximum possible value based on the quals processed. If either the
>> minimum or maximum is still set to NULL, then it's unbounded at that
>> end. If you encounter partkey = Const, then minimum and maximum can be
>> set to the value of that Const. From there you can likely ignore any
>> other quals for that partition key, as if the query did contain
>> another qual with partkey = SomeOtherConst, then that would have
>> become a gating qual during the constant folding process. This way if
>> the user had written WHERE partkey >= 1 AND partkey <= 1 the
>> evaluation would end up the same as if they'd written WHERE partkey =
>> 1 as the minimum and maximum would be the same value in both cases,
>> and when those two values are the same then it would mean just one
>> theoretical binary search on a partition range to find the correct
>> partition instead of two.
> 
> I have not looked at the code submitted here in detail yet but I do
> think we should try to avoid wasting cycles in the
> presumably-quite-common case where equality is being tested.  The
> whole idea of thinking of this as minimum/maximum seems like it might
> be off precisely for that reason.

I agree.  Equality checks are going to be common enough to warrant them to
be handled specially, instead of implementing equality-pruning on top of
min/max framework.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Sun, Oct 1, 2017 at 9:13 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I agree.  Equality checks are going to be common enough to warrant them to
> be handled specially, instead of implementing equality-pruning on top of
> min/max framework.

What you might do is pass <btree-strategy-number, bounds> and
optionally allow a second <btree-strategy-number, bounds>.  Then for
the common case of equality you can pass BTEqualStrategyNumber and for
a range bounded at both ends you can pass BTGreaterStrategyNumber or
BTGreaterEqualStrategyNumber for one bound and BTLessStrategyNumber or
BTLessEqualStrategyNumber for the other.

Not sure if this is exactly the right idea but it's what pops to mind.
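For illustration, the <btree-strategy-number, bounds> pairing could be
sketched like this (ints in place of Datums and a toy enum in place of
the real btree strategy numbers; not a proposed interface):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy equivalents of the btree strategy numbers. */
typedef enum
{
    STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT
} Strategy;

/* One <strategy, bound> pair; equality needs a single pair, while a
 * range bounded at both ends passes two of them. */
typedef struct BoundSpec
{
    Strategy strategy;
    int      bound;
} BoundSpec;

/* Does a value satisfy one bound spec? */
static bool
satisfies(const BoundSpec *s, int value)
{
    switch (s->strategy)
    {
        case STRAT_LT: return value < s->bound;
        case STRAT_LE: return value <= s->bound;
        case STRAT_EQ: return value == s->bound;
        case STRAT_GE: return value >= s->bound;
        case STRAT_GT: return value > s->bound;
    }
    return false;
}
```

The equality case then never pays for range handling: one pair, one
lookup.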

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks a lot for your review comments and sorry it took me a while to reply.

On 2017/09/28 18:16, David Rowley wrote:
> On 27 September 2017 at 14:22, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>>   an author).  I slightly tweaked his patch -- renamed the function
>>   get_matching_clause to match_clauses_to_partkey, similar to
>>   match_clauses_to_index.
> 
> Hi Amit,
> 
> I've made a pass over the 0001 patch while trying to get myself up to
> speed with the various developments that are going on in partitioning
> right now.
> 
> These are just my thoughts from reading over the patch. I understand
> that there's quite a bit up in the air right now about how all this is
> going to work, but I have some thoughts about that too, which I
> wouldn't mind some feedback on as my line of thought may be off.
> 
> Anyway, I will start with some small typos that I noticed while reading:
> 
> partition.c:
> 
> + * telling what kind of NullTest has been applies to the corresponding
> 
> "applies" should be "applied"

Will fix.

> plancat.c:
> 
> * might've occurred to satisfy the query.  Rest of the fields are set in
> 
> "Rest of the" should be "The remaining"

Will fix.

> And onto more serious stuff:
> 
> allpath.c:
> 
> get_rel_partitions()
> 
> I wonder if this function does not belong in partition.c. I'd have
> expected a function to exist per partition type: RANGE and LIST, I'd
> imagine, should each have their own function in partition.c to eliminate
> partitions which cannot possibly match, and return the remainder. I see
> people
> speaking of HASH partitioning, but we might one day also want
> something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
> routing records to be processed into foreign tables where you never
> need to query them again). It would be good if we could easily expand
> this list and not have to touch files all over the optimizer to do
> that. Of course, there would be other work to do in the executor to
> implement any new partitioning method too.

I think there will have to be at least some code in the optimizer.  That
is, the code to match the query to the table's partition keys and using
the constant values therein to then look up the table's partitions.  More
importantly, once the partitions (viz. their offsets in the table's
partition descriptor) have been looked up by partition.c, we must be able
to then map them to their planner data structures viz. their
AppendRelInfo, and subsequently RelOptInfo.  This last part will have to
be in the optimizer.  In fact, that was the role of get_rel_partitions in
the initial versions of the patch, when neither of the code for matching
keys and for pruning using constants was implemented.

One may argue that the first part, that is, matching clauses to the
partition key, and the subsequent lookup of partitions using constants
could both be implemented in partition.c, but it seems better to me to
keep at least the part of matching clauses to the partition keys within
the planner (just like matching clauses to indexes is done by the
planner).  Looking up partitions using constants cannot be done outside
partition.c anyway, so that's something we have to implement there.

> I know there's mention of it somewhere about get_rel_partitions() not
> being so smart about WHERE partkey > 20 AND partkey > 10, the code
> does not understand what's more restrictive. I think you could
> probably follow the same rules here that are done in
> eval_const_expressions(). Over there I see that evaluate_function()
> will call anything that's not marked as volatile.

Hmm, unless I've missed it, I don't see in evaluate_function() anything
about determining which clause is more restrictive.  AFAIK, such
determination depends on the btree operator class semantics (at least in
the most common cases where, say, ">" means greater than in a sense that
btree code uses it), so I was planning to handle it the way btree code
handles it in _bt_preprocess_keys().  In fact, the existing code in
predtest.c, which makes decisions of a similar vein, also relies on btree
semantics.  It's OK to do so, because partitioning methods that exist
today and for which we'd like to have such smarts use btree semantics to
partition the data.  Also, we won't need to optimize such cases for hash
partitioning anyway.

> I'd imagine, for
> each partition key, you'd want to store a Datum with the minimum and
> maximum possible value based on the quals processed. If either the
> minimum or maximum is still set to NULL, then it's unbounded at that
> end. If you encounter partkey = Const, then minimum and maximum can be
> set to the value of that Const. From there you can likely ignore any
> other quals for that partition key, as if the query did contain
> another qual with partkey = SomeOtherConst, then that would have
> become a gating qual during the constant folding process. This way if
> the user had written WHERE partkey >= 1 AND partkey <= 1 the
> evaluation would end up the same as if they'd written WHERE partkey =
> 1 as the minimum and maximum would be the same value in both cases,
> and when those two values are the same then it would mean just one
> theoretical binary search on a partition range to find the correct
> partition instead of two.

Given the way the patch recognizes a given qual as contributing to the
equal key or minimum key or maximum key, it will not conclude that the
above is in fact an equal key, because that presumably would require
comparing clauses among each other to make such a discovery.  I'm
slightly hesitant
to add code to do that in the first version of the patch.  That is, for
time being let WHERE partkey >= 1 and partkey <= 1 be handled by passing 1
as both minimum and maximum key, which requires two binary searches.
Whereas, WHERE partkey = 1 would require only one.  Planner code to get
rid of the extra binary search lookup could come later, IMHO.
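The one-versus-two binary searches can be seen with a toy search over
sorted range bounds (ints instead of Datums; not the real partition.c
bsearch code):

```c
#include <assert.h>

/*
 * bounds[] holds the upper bounds of n range partitions in ascending
 * order; partition i covers values below bounds[i] and at or above
 * bounds[i-1].  Return the index of the partition a value falls in
 * (n means "beyond the last bound").
 */
static int
partition_for_value(const int *bounds, int n, int value)
{
    int lo = 0, hi = n;

    while (lo < hi)                 /* find first bound > value */
    {
        int mid = (lo + hi) / 2;

        if (bounds[mid] > value)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

WHERE partkey = 1 needs one such call; passing 1 as both minimum and
maximum key means calling it once for each end of the range, i.e. twice,
even though both calls land on the same partition.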

> I see in get_partitions_for_keys you've crafted the function to return
> something to identify which partitions need to be scanned. I think it
> would be nice to see a special element in the partition array for the
> NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
> would be defined constants somewhere. The signature of that function
> is not so pretty and that would likely tidy it up a bit. The matching
> partition indexes could be returned as a Bitmapset, yet, I don't
> really see any code which handles adding the NULL and DEFAULT
> partition in get_rel_partitions() either, maybe I've just not looked
> hard enough yet...

New version of the patch I will post soon cleans up the interface of
get_partitions_for_keys quite a bit; particularly the way selected
partitions are returned, for which I adopted an idea that Robert Haas
mentioned [2].  When it recognizes that a sequence of consecutive
partitions are to be scanned, it will return the starting and ending
offsets as *min_part_idx and *max_part_idx.  Those that don't fit this
pattern (of which there should be only a few in many cases) are returned
in a Bitmapset, as a supposedly unordered set of partitions.  Since NULL
and DEFAULT partitions are partitions of this latter category, they would
be included in the bitmapset if it turns out that the query will need to
scan them after all.

Thanks again.

Regards,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf8269

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/09/27 10:22, Amit Langote wrote:
> Thanks for the reminder.  Just thought I'd say that while I'm actually
> done rebasing itself (attaching rebased patches to try 'em out), I'm now
> considering Robert's comments and will be busy for a bit revising things
> based on those comments.

And here is the updated, significantly re-designed patch set.  I'm
dropping the WIP- from the patches' names and marking this patch set as
the v1 set (Jesper Pedersen pointed out to me offline that the patches
didn't have the vNumber before).

Significant points of revision:

* After thinking a bit more about the applicability/re-usability of the
code being discussed here in the run-time pruning case, I came to the
conclusion that my previous approach was a bit wrongheaded (as perhaps
also what David Rowley was thinking when he commented on some aspects of
the old patch's design at [1], so thanks to him for prodding me in what I
ended up thinking to be a good direction after all).

With the previous approach, a bit too much work would be done by the
planner with no possibility of the code doing that work being useful in
the executor (interface involved passing RelOptInfo *).  So, if some
optimization trick that would lead to better pruning decision depended on
the constant values in all the clauses being available, we'd have to skip
that optimization for clauses that would otherwise be chosen as run-time
pruning clauses, because by definition, they would not have constant
values available.  In the new design, planner code limits itself to only
matching the clauses to partition key (checking things like whether the
operator of a clause matched to a partition column is compatible with
partitioning, etc.) and adding every matched clause to a list partclauses.
 That should work unchanged for both the plan-time pruning case and the
run-time pruning case.  We don't look at the supposedly-constant operands
of clauses in the aforementioned planner code at all.

Now if the clauses in partclauses are all known to contain the constant
operand values (the plan-time pruning case), it can immediately pass them
to partition.c to analyze those clauses, extract bounding keys in a form
suitable to do lookup in PartitionBoundInfo and prune partitions that
won't satisfy those bounding keys (and hence the clauses).

If partclauses contains clauses that don't have the constant operand (the
run-time pruning case), don't go to partition.c just yet, instead stuff
the list into the plan (Append) node and go to partition.c only when all
the constant values are available (the patch at [2] will implement that).

* Unlike the previous approach where partition.c would return a list of
integer indexes, where the individual indexes would be those of the bound
datums and not those of partitions themselves, in the new design, indexes
of the selected partitions (their offsets in the PartitionDesc.oids array)
are returned in a way that Robert suggested [3] -- as a range of
contiguous partitions whenever possible (in the form of *min_part_idx and
*max_part_idx) and/or as a set of partitions appearing in no particular
order (in the form of a Bitmapset).  The second form is required not only
for default/null-only/list partitions (which do not have any notion of
ordering among each other), but also when individual arms of an OR clause
select partitions scattered all over the place.
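That result shape can be pictured with a toy structure (a 64-bit mask
stands in for the real Bitmapset, and all names are invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Selected partitions: a contiguous run [min_part_idx, max_part_idx]
 * plus a set of "scattered" partitions (default, null-only, list
 * partitions, OR-arm results, and so on). */
typedef struct SelectedParts
{
    int      min_part_idx;      /* -1 if there is no contiguous run */
    int      max_part_idx;
    uint64_t scattered;         /* bit i set => partition i selected */
} SelectedParts;

static bool
partition_is_selected(const SelectedParts *sp, int i)
{
    if (sp->min_part_idx >= 0 &&
        i >= sp->min_part_idx && i <= sp->max_part_idx)
        return true;
    return (sp->scattered & ((uint64_t) 1 << i)) != 0;
}
```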

* In add_paths_to_append_rel(), the partitioned_rels list passed to copy
into the Append path is no longer same as the one found in
PlannerInfo.pcinfo_list, because the latter contains *all* partitioned
child relations in a partition tree.  Instead, the patch teaches it to
only include *live* partitioned child relations.

* Since the partitionwise join code looks at *all* entries in
RelOptInfo.part_rels of both the joining partitioned relations, there is
some new code there to make dead partitions' RelOptInfos look valid with a
dummy path *after-the-fact*.  That's because set_append_rel_size(), whose
job it is to initialize certain fields of child RelOptInfos, will do it
only for *live* partitions with the new arrangement, where only live
partitions of a partitioned table are processed by the main loop of
set_append_rel_size().

Some notes about the regression tests:

Patch 0001 adds new tests for partition-pruning.  Constraint exclusion
using internal partition constraints is not disabled in the code until the
last patch, which implements the last piece needed for the new partition
pruning to do any real work.  With that patch, we see some differences in
the plan generated using the new partition-pruning code which appear in
the patch as the diffs to expected/inherit.out and expected/partition.out
(latter is the new output file added by 0001).  I've almost convinced
myself that those diffs are simply artifacts of the difference in
implementation between constraint exclusion and the new partition-pruning
code and do not change the output that the plans produce.  The difference
stems from the fact that either the old or the new method, in some cases,
fails to prune away a partition that should have been pruned.  OTOH, in
neither case do we prune away a partition that shouldn't have been. :)

Description of the attached patches:

0001: add new tests for partition-pruning

0002: patch that makes all the changes needed in the planner (adds a stub
      function in partition.c)

0003: patch that implements the aforementioned stub (significant amount of
      code to analyze partition clauses and gin up bounding keys to
      compare with the values in PartitionBoundInfo; the actual function
      that will do the comparison is just a stub as of this patch)

0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
      able to pass incomplete partition keys (aka, prefix of a multi-
      column key) for comparison with the values in PartitionBoundInfo
     (just a refactoring patch)

0005: implements the stub mentioned in 0003 and finally gets the new
      partition-pruning working (also disables constraint exclusion using
      internal partition constraints by teaching get_relation_constraints
      to not include those).

Feedback greatly welcome.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8Jzix8cs7tCDS_qNPd0CetHjB8x9fmLG4OTbCfthgo1w%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com

[3]
https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com


Attachment

Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:

On Thu, Oct 19, 2017 at 12:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Description of the attached patches:

0001: add new tests for partition-pruning

0002: patch that makes all the changes needed in the planner (adds a stub
      function in partition.c)

0003: patch that implements the aforementioned stub (significant amount of
      code to analyze partition clauses and gin up bounding keys to
      compare with the values in PartitionBoundInfo; the actual function
      that will do the comparison is just a stub as of this patch)

0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
      able to pass incomplete partition keys (aka, prefix of a multi-
      column key) for comparison with the values in PartitionBoundInfo
     (just a refactoring patch)

0005: implements the stub mentioned in 0003 and finally gets the new
      partition-pruning working (also disables constraint exclusion using
      internal partition constraints by teaching get_relation_constraints
      to not include those).

Feedback greatly welcome.
Hi Amit,

I have tried to apply the attached patch. The patch applied cleanly on commit id
bf54c0f05c0a58db17627724a83e1b6d4ec2712c, but make failed with the below error.
 
./../../../src/include/nodes/relation.h:2126: error: redefinition of typedef ‘AppendRelInfo’
../../../../src/include/nodes/relation.h:584: note: previous declaration of ‘AppendRelInfo’ was here
make[4]: *** [gistbuild.o] Error 1

 

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Rajkumar.

On 2017/10/23 15:58, Rajkumar Raghuwanshi wrote:
> I have tried to apply attached patch. patch applied cleanly on commit id -
> bf54c0f05c0a58db17627724a83e1b6d4ec2712c
> but make failed with below error.
> 
> ./../../../src/include/nodes/relation.h:2126: error: redefinition of
> typedef ‘AppendRelInfo’
> ../../../../src/include/nodes/relation.h:584: note: previous declaration of
> ‘AppendRelInfo’ was here
> make[4]: *** [gistbuild.o] Error 1

The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
this typedef redefinition introduced by the 0002 patch, but it seems that
it's not needed anyway, so got rid of that line in the attached updated patch.

Fixed one more useless diff in 0002, but no changes in any other patch.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:

On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
this typedef redefinition introduced by the 0002 patch, but it seems that
it's not needed anyway, so got rid of that line in the attached updated patch.

Fixed one more useless diff in 0002, but no changes in any other patch.

Thanks for the updated patches, I am able to compile them on HEAD.

While testing this, I observed that pruning is not scanning the default partition, leading to wrong output. Below is a test to reproduce this.

create table rp (a int, b varchar) partition by range (a);
create table rp_p1 partition of rp default;
create table rp_p2 partition of rp for values from (1) to (10);
create table rp_p3 partition of rp for values from (10) to (maxvalue);

insert into rp values (-1,'p1');
insert into rp values (1,'p2');
insert into rp values (11,'p3');

postgres=# select tableoid::regclass,* from rp;
 tableoid | a  | b 
----------+----+----
 rp_p2    |  1 | p2
 rp_p3    | 11 | p3
 rp_p1    | -1 | p1
(3 rows)

--with pruning
postgres=# explain (costs off) select * from rp where a <= 1;
        QUERY PLAN       
--------------------------
 Append
   ->  Seq Scan on rp_p2
         Filter: (a <= 1)
(3 rows)

postgres=# select * from rp where a <= 1;
 a | b 
---+----
 1 | p2
(1 row)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
 

Re: [HACKERS] path toward faster partition pruning

From
Beena Emerson
Date:
On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
>> this typedef redefinition introduced by the 0002 patch, but it seems that
>> it's not needed anyway, so got rid of that line in the attached updated
>> patch.
>>
>> Fixed one more useless diff in 0002, but no changes in any other patch
>
>
> Thanks for updated patches, I am able to compile it on head.
>
> While testing this, I got an observation, pruning is not scanning default
> partition leading to wrong output. below is test to reproduce this.
>
> create table rp (a int, b varchar) partition by range (a);
> create table rp_p1 partition of rp default;
> create table rp_p2 partition of rp for values from (1) to (10);
> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>
> insert into rp values (-1,'p1');
> insert into rp values (1,'p2');
> insert into rp values (11,'p3');
>
> postgres=# select tableoid::regclass,* from rp;
>  tableoid | a  | b
> ----------+----+----
>  rp_p2    |  1 | p2
>  rp_p3    | 11 | p3
>  rp_p1    | -1 | p1
> (3 rows)
>
> --with pruning
> postgres=# explain (costs off) select * from rp where a <= 1;
>         QUERY PLAN
> --------------------------
>  Append
>    ->  Seq Scan on rp_p2
>          Filter: (a <= 1)
> (3 rows)
>
> postgres=# select * from rp where a <= 1;
>  a | b
> ---+----
>  1 | p2
> (1 row)
>

I had noticed this, and also this crash:

tprt PARTITION BY RANGE(Col1)
       tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
              tprt_11 FOR VALUES FROM (1) TO (10000),
              tprt_1d DEFAULT
       tprt_2 FOR VALUES FROM (50001) TO (100001)

EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks a lot Rajkumar and Beena for the tests.

On 2017/10/24 1:38, Beena Emerson wrote:
> On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi wrote:
>> Thanks for updated patches, I am able to compile it on head.
>>
>> While testing this, I got an observation, pruning is not scanning default
>> partition leading to wrong output. below is test to reproduce this.
>>
>> create table rp (a int, b varchar) partition by range (a);
>> create table rp_p1 partition of rp default;
>> create table rp_p2 partition of rp for values from (1) to (10);
>> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>>
>> insert into rp values (-1,'p1');
>> insert into rp values (1,'p2');
>> insert into rp values (11,'p3');
>>
>> postgres=# select tableoid::regclass,* from rp;
>>  tableoid | a  | b
>> ----------+----+----
>>  rp_p2    |  1 | p2
>>  rp_p3    | 11 | p3
>>  rp_p1    | -1 | p1
>> (3 rows)
>>
>> --with pruning
>> postgres=# explain (costs off) select * from rp where a <= 1;
>>         QUERY PLAN
>> --------------------------
>>  Append
>>    ->  Seq Scan on rp_p2
>>          Filter: (a <= 1)
>> (3 rows)
>>
>> postgres=# select * from rp where a <= 1;
>>  a | b
>> ---+----
>>  1 | p2
>> (1 row)

Both this (wrong output)...

> I had noticed this and also that this crash:
> 
> tprt PARTITION BY RANGE(Col1)
>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>               tprt_11 FOR VALUES FROM (1) TO (10000),
>               tprt_1d DEFAULT
>        tprt_2 FOR VALUES FROM (50001) TO (100001)
> 
> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

...and this (crash) were due to bugs in the 0005 patch.

Output with the updated patch for Rajkumar's test:

explain (costs off ) select * from rp where a <= 1;
        QUERY PLAN
--------------------------
 Append
   ->  Seq Scan on rp_p2
         Filter: (a <= 1)
   ->  Seq Scan on rp_p1
         Filter: (a <= 1)
(5 rows)

select tableoid::regclass, * from rp where a <= 1;
 tableoid | a  | b
----------+----+----
 rp_p2    |  1 | p2
 rp_p1    | -1 | p1
(2 rows)

-- moreover
select tableoid::regclass, * from rp where a < 1;
 tableoid | a  | b
----------+----+----
 rp_d     | -1 | p1
(1 row)


Should be fixed in the attached updated version.  While fixing the bugs, I
made some significant revisions to the code introduced by 0005.

No significant changes to any of the patches 0001-0004.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/10/25 15:47, Amit Langote wrote:
> On 2017/10/24 1:38, Beena Emerson wrote:
>> I had noticed this and also that this crash:
>>
>> tprt PARTITION BY RANGE(Col1)
>>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>               tprt_11 FOR VALUES FROM (1) TO (10000),
>>               tprt_1d DEFAULT
>>        tprt_2 FOR VALUES FROM (50001) TO (100001)
>>
>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
> 
> ...and this (crash) were due to bugs in the 0005 patch.

[ .... ]

> Should be fixed in the attached updated version.

Oops, not quite.  The crash that Beena reported wasn't fixed (or rather was
reintroduced by some unrelated change after I had once confirmed it was fixed).

Really fixed this time.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Beena Emerson
Date:
Hello Amit,

Thanks for the updated patches

On Wed, Oct 25, 2017 at 1:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/25 15:47, Amit Langote wrote:
>> On 2017/10/24 1:38, Beena Emerson wrote:
>>> I had noticed this and also that this crash:
>>>
>>> tprt PARTITION BY RANGE(Col1)
>>>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>>               tprt_11 FOR VALUES FROM (1) TO (10000),
>>>               tprt_1d DEFAULT
>>>        tprt_2 FOR VALUES FROM (50001) TO (100001)
>>>
>>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>>> server closed the connection unexpectedly
>>> This probably means the server terminated abnormally
>>> before or while processing the request.
>>> The connection to the server was lost. Attempting reset: Failed.
>>> !>
>>
>> ...and this (crash) were due to bugs in the 0005 patch.
>
> [ .... ]
>
>> Should be fixed in the attached updated version.
>
> Oops, not quite.  The crash that Beena reported wasn't fixed (or rather was
> reintroduced by some unrelated change after I had once confirmed it was fixed).
>
> Really fixed this time.
>

The crashes are fixed. However, the handling of the DEFAULT partition in
various queries is not correct.

Case 1: In this case default should be selected.

DROP TABLE tprt;
CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
PARTITION BY list(col1);
CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES IN (20000, 25000);
CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES IN (50000, 35000);
CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES IN (10000);
CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;

postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)


Case 2: In this case DEFAULT need not be selected.

DROP TABLE tprt;
CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
PARTITION BY range(col1);
CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES FROM (1) TO (10000);
CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES FROM (10000) TO (20000);
CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES FROM (20000) TO (30000);
CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
INSERT INTO tprt SELECT generate_series(1,50000), generate_series(1,50000);

postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
           QUERY PLAN
--------------------------------
 Append
   ->  Seq Scan on tprt_11
         Filter: (col1 < 10000)
   ->  Seq Scan on tprt_1d
         Filter: (col1 < 10000)
(5 rows)


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Beena Emerson
Date:
Hello,

On Wed, Oct 25, 2017 at 1:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/25 15:47, Amit Langote wrote:
>> On 2017/10/24 1:38, Beena Emerson wrote:
>>> I had noticed this and also that this crash:
>>>
>>> tprt PARTITION BY RANGE(Col1)
>>>        tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>>               tprt_11 FOR VALUES FROM (1) TO (10000),
>>>               tprt_1d DEFAULT
>>>        tprt_2 FOR VALUES FROM (50001) TO (100001)
>>>
>>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>>> server closed the connection unexpectedly
>>> This probably means the server terminated abnormally
>>> before or while processing the request.
>>> The connection to the server was lost. Attempting reset: Failed.
>>> !>
>>
>> ...and this (crash) were due to bugs in the 0005 patch.
>
> [ .... ]
>
>> Should be fixed in the attached updated version.
>
> Oops, not quite.  The crash that Beena reported wasn't fixed (or rather was
> reintroduced by some unrelated change after I had once confirmed it was fixed).
>
> Really fixed this time.

Some minor comments:

1. wrong function name (0003)

The comment on function get_partitions_from_clauses_guts uses wrong name:
instead of "_from_", "_using_" is written.
/*
+ * get_partitions_using_clauses_guts
+ *     Determine relation's partitions that satisfy *all* of the clauses
+ *     in the list (return value describes the set of such partitions)
+ *

2. typo information (0003)

+/*
+ * classify_partition_bounding_keys
+ *     Classify partition clauses into equal, min, max keys, along with any
+ *     Nullness constraints and return that informatin in the output argument

3. misspell admissible (0003)
+    * columns, whereas a prefix of all partition key columns is addmissible
+    * as min and max keys.

4. double and? (0002)
+                * as part of the operator family, check if its negator
+                * exists and and that the latter is compatible with

5. typo inequality (0002)
+                * (key < val OR key > val), if the partitioning method
+                * supports such notion of inequlity.


6. typo output (0005)
+    * return it as the only scannable partition, that means the query
+    * doesn't want null values in its outout.

7. typo provide (0005)
+   /* Valid keys->eqkeys must provoide all partition keys. */
+   Assert(keys->n_eqkeys == 0 || keys->n_eqkeys == partkey->partnatts);

8. comment of struct PartClause (0003)
+/*
+ * Information about a clause matched with a partition key column kept to
+ * avoid repeated recomputation in remove_redundant_clauses().
+ */

Instead of repeated recomputation, we can use just  " repeated
computation" or just " recomputation"



-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Beena,

Thanks for the tests.

On 2017/10/25 18:18, Beena Emerson wrote:
> The crashes are fixed. However, the handling of the DEFAULT partition in
> various queries is not correct.
> 
> Case 1: In this case default should be selected.
>  DROP TABLE tprt;
>   CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
>   CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY list(col1);
>   CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES IN (20000, 25000);
>   CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES IN (50000, 35000);
>   CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES IN (10000);
>   CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
> 
> 
> postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>         QUERY PLAN
> --------------------------
>  Result
>    One-Time Filter: false
> (2 rows)

Hmm, this clearly looks wrong.  Fixed in the attached.

> Case 2: In this case DEFAULT need not be selected.
> 
> DROP TABLE  tprt;
>   CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
>   CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY range(col1);
>   CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES FROM (1) TO (10000);
>   CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES FROM (10000) TO (20000);
>   CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES FROM (20000) TO (30000);
>   CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
>   INSERT INTO tprt SELECT generate_series(1,50000), generate_series(1,50000);
> 
> postgres=#  EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>            QUERY PLAN
> --------------------------------
>  Append
>    ->  Seq Scan on tprt_11
>          Filter: (col1 < 10000)
>    ->  Seq Scan on tprt_1d
>          Filter: (col1 < 10000)
> (5 rows)

Yeah, ideally.  But it's kind of hard for the new partition-pruning
algorithm to be *that* correct in this particular case involving default
partitions.  Let me try to explain why I think it may be a bit hard to
implement.

I perhaps have mentioned before that the new partition-pruning algorithm
runs for every partitioned table in the tree separately.  In this example,
it will first determine for the root table tprt that only the partition
tprt_1 needs to be scanned.  Since tprt_1 is itself partitioned, the
algorithm will run again, but the fact that tprt_1 (iow, any of its
partitions) is
itself constrained to range [1, 50001) is, for the most part, lost on the
algorithm.  Note that non-default partitions (tprt_11, tprt_12, ...) have
bound datums in PartitionBoundInfo describing the range of data they
contain, which the algorithm uses to determine the set of partitions
satisfying given set of clauses.  The default partition has no datums.
The only thing describing what it contains is its partition constraint.
From the clause col1 < 10000, the algorithm will conclude that the default
partition might contain some data satisfying the same, because it knows
for sure that there is no non-default partition for keys < 1.

It can perhaps be taught to not make that conclusion by taking into account
the default partition's partition constraint, which includes constraint
inherited from the parent, viz. 1 <= col1 < 50001.  To do that, it might
be possible to summon up predtest.c's powers to conclude from the default
partition's partition constraint that it cannot contain any keys < 1, but
then we'll have to frame up a clause expression describing the latter.
Generating such a clause expression can be a bit daunting for a
multi-column key.   So, I haven't yet tried really hard to implement this.
 Any thoughts on that?


Meanwhile, attached updated set of patches including fixes for the typos
you reported in the other message.  Updated 0005 fixes the first bug (the
Case 1 in your email), while other patches 0002-0004 are updated mostly to
fix the reported typos.  A couple of tests are added in 0001 to test the
default partition case a bit more.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It can perhaps be taught to not make that conclusion by taking into account
> the default partition's partition constraint, which includes constraint
> inherited from the parent, viz. 1 <= col1 < 50001.  To do that, it might
> be possible to summon up predtest.c's powers to conclude from the default
> partition's partition constraint that it cannot contain any keys < 1, but
> then we'll have to frame up a clause expression describing the latter.
> Generating such a clause expression can be a bit daunting for a
> multi-column key.   So, I haven't yet tried really hard to implement this.
>  Any thoughts on that?

I don't think we really want to get into theorem-proving here, because
it's slow.  Whatever we're going to do we should be able to do without
that - keeping it in the form of btree-strategy + value.  It doesn't
seem that hard.  Suppose we're asked to select partitions from tprt
subject to (<, 10000).  Well, we determine that some of the tprt_1
partitions may be relevant, so we tell tprt_1 to select partitions
subject to (>=, 1, <, 10000).  We know to do that because we know that
10000 < 50000 and we know to include >= 1 because we haven't got any
lower bound currently at all.  What's the problem?
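
For what it's worth, the recursion Robert describes can be sketched in a few
lines of C.  This is an illustrative toy only, not patch code; Range and
push_down_bound are invented names, and INT_MIN/INT_MAX stand in for the
absence of a bound:

```c
#include <limits.h>

/* Toy one-column range constraint [lo, hi); not PostgreSQL code. */
typedef struct Range
{
    int         lo;             /* inclusive lower bound */
    int         hi;             /* exclusive upper bound */
} Range;

/*
 * Combine the query's constraint with the bound that a subpartitioned
 * child inherits from its parent, yielding the constraint to use when
 * selecting the child's own partitions.  For the example above, query
 * (<, 10000) combined with parent bound [1, 50001) yields
 * (>=, 1, <, 10000).
 */
static Range
push_down_bound(Range query, Range parent)
{
    Range       r;

    r.lo = (query.lo > parent.lo) ? query.lo : parent.lo;
    r.hi = (query.hi < parent.hi) ? query.hi : parent.hi;
    return r;
}
```

Each level of the partition tree would apply this once before recursing, so
no theorem-proving is needed anywhere along the way.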

In some sense it's tempting to say that this case just doesn't matter
very much; after all, subpartitioning on the same column used to
partition at the top level is arguably lame.  But if we can get it
right in a relatively straightforward manner then let's do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/10/26 20:34, Robert Haas wrote:
> On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It can perhaps be taught to not make that conclusion by taking into account
>> the default partition's partition constraint, which includes constraint
>> inherited from the parent, viz. 1 <= col1 < 50001.  To do that, it might
>> be possible to summon up predtest.c's powers to conclude from the default
>> partition's partition constraint that it cannot contain any keys < 1, but
>> then we'll have to frame up a clause expression describing the latter.
>> Generating such a clause expression can be a bit daunting for a
>> multi-column key.   So, I haven't yet tried really hard to implement this.
>>  Any thoughts on that?
> 
> I don't think we really want to get into theorem-proving here, because
> it's slow.

Just to be clear, I'm saying we could use theorem-proving (if at all) just
for the default partition.

> Whatever we're going to do we should be able to do without
> that - keeping it in the form of btree-strategy + value.  It doesn't
> seem that hard.  Suppose we're asked to select partitions from tprt
> subject to (<, 10000).  Well, we determine that some of the tprt_1
> partitions may be relevant, so we tell tprt_1 to select partitions
> subject to (>=, 1, <, 10000).  We know to do that because we know that
> 10000 < 50000 and we know to include >= 1 because we haven't got any
> lower bound currently at all.  What's the problem?

Hmm, that's interesting.  With the approach that the patch currently
takes, (>= 1) wouldn't be passed down when selecting the partitions of
tprt_1.  The source of values (+ btree strategy) to use to select
partitions is the same original set of clauses for all partitioned tables
in the tree, as the patch currently implements it.  Nothing would get
added to that set (>= 1, as in this example), nor subtracted (such as
clauses that are trivially true).

I will think about this approach in general and to solve this problem in
particular.

> In some sense it's tempting to say that this case just doesn't matter
> very much; after all, subpartitioning on the same column used to
> partition at the top level is arguably lame.  But if we can get it
> right in a relatively straightforward manner then let's do it.

Yeah, I tend to agree.

Thanks for the input.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I don't think we really want to get into theorem-proving here, because
>> it's slow.
>
> Just to be clear, I'm saying we could use theorem-proving (if at all) just
> for the default partition.

I don't really see why it should be needed there either.  We've got
all the bounds in order, so we should know where there are any gaps
that are covered by the default partition in the range we care about.
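
That gap-based check can be sketched as follows (hypothetical C, not from
the patch; it assumes sorted, non-overlapping one-column range bounds, as
kept in PartitionBoundInfo):

```c
#include <stdbool.h>

/* Toy one-column range partition bound [lo, hi); not PostgreSQL code. */
typedef struct Bound
{
    int         lo;
    int         hi;
} Bound;

/*
 * Return true if the query range [qlo, qhi) is entirely covered by the
 * sorted non-default bounds, i.e. it touches no gap, in which case the
 * default partition can be pruned.
 */
static bool
default_prunable(const Bound *bounds, int nbounds, int qlo, int qhi)
{
    int         pos = qlo;
    int         i;

    for (i = 0; i < nbounds && pos < qhi; i++)
    {
        if (bounds[i].hi <= pos)
            continue;           /* bound lies entirely below pos */
        if (bounds[i].lo > pos)
            return false;       /* gap at [pos, bounds[i].lo) */
        pos = bounds[i].hi;     /* covered up to here */
    }
    return pos >= qhi;
}
```

With bounds {[1, 10000), [10000, 50001)}, a query range that dips below 1
touches the uncovered gap, so the default partition must be kept; a range
lying wholly inside the bounds lets it be pruned.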

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/10/27 13:57, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I don't think we really want to get into theorem-proving here, because
>>> it's slow.
>>
>> Just to be clear, I'm saying we could use theorem-proving (if at all) just
>> for the default partition.
> 
> I don't really see why it should be needed there either.  We've got
> all the bounds in order, so we should know where there are any gaps
> that are covered by the default partition in the range we care about.

Sorry, I forgot to add: "...just for the default partition, for cases like
the one in Beena's example."

In normal cases, default partition selection doesn't require any
theorem-proving.  It proceeds in a straightforward manner more or less
like what you said it should.

After thinking more about it, I think there is a rather straightforward
trick to implement the idea you mentioned and get this working for the case
presented in Beena's example.  It works as follows:

For any non-root partitioned tables, we add the list of its partition
constraint clauses to the query-provided list of clauses and use the whole
list to drive the partition-pruning algorithm.  So, when partition-pruning
runs for tprt_1, along with (< 10000) which the original query provides,
we also have (>= 1) which comes from the partition constraint of tprt_1
(which is >= 1 and < 50000).  Note that there exists a trick in the new
code for the (< 50000) coming from the constraint to be overridden by the
more restrictive (< 10000) coming from the original query.
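
A minimal sketch of that overriding step (illustrative C only; Strategy,
Clause, and reduce_clauses are invented names, not identifiers from the
patch):

```c
#include <limits.h>

/* Toy representation of a btree-strategy + value clause on one key. */
typedef enum Strategy
{
    STRAT_GE,                   /* key >= value */
    STRAT_LT                    /* key < value */
} Strategy;

typedef struct Clause
{
    Strategy    strat;
    int         value;
} Clause;

/*
 * Collapse a mixed list of query-provided and constraint-derived clauses
 * to the most restrictive bound in each direction, so that (< 50000)
 * coming from the partition constraint is overridden by the query's
 * (< 10000).
 */
static void
reduce_clauses(const Clause *clauses, int nclauses,
               int *lo_incl, int *hi_excl)
{
    int         i;

    *lo_incl = INT_MIN;
    *hi_excl = INT_MAX;
    for (i = 0; i < nclauses; i++)
    {
        if (clauses[i].strat == STRAT_GE && clauses[i].value > *lo_incl)
            *lo_incl = clauses[i].value;
        else if (clauses[i].strat == STRAT_LT && clauses[i].value < *hi_excl)
            *hi_excl = clauses[i].value;
    }
}
```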

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Thu, Oct 26, 2017 at 4:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Meanwhile, attached updated set of patches including fixes for the typos
you reported in the other message.  Updated 0005 fixes the first bug (the
Case 1 in your email), while other patches 0002-0004 are updated mostly to
fix the reported typos.  A couple of tests are added in 0001 to test the
default partition case a bit more.

Hi Amit,

While testing this feature further, I found a bug with partitions that are foreign tables. Test case given below; take a look.

CREATE EXTENSION postgres_fdw;
CREATE SERVER fp_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER fp_server;

CREATE TABLE fplt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN ('0000', '0001', '0002', '0003') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p1');
CREATE TABLE fplt1_p2 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN ('0004', '0005', '0006', '0007') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p2');

INSERT INTO fplt1_p1 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 198, 2) i;
INSERT INTO fplt1_p2 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(200, 398, 2) i;

--PG-HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
                    QUERY PLAN                   
--------------------------------------------------
 Nested Loop
   ->  Append
         ->  Foreign Scan on ftplt1_p1 t1
         ->  Foreign Scan on ftplt1_p2 t1_1
   ->  Unique
         ->  Append
               ->  Foreign Scan on ftplt1_p1 t2
               ->  Foreign Scan on ftplt1_p2 t2_1
(8 rows)

--PG-HEAD +v5 patches
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
ERROR:  invalid expression for partition key

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

 

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Rajkumar for the test case.

On 2017/10/27 17:05, Rajkumar Raghuwanshi wrote:
> While testing this feature further, I found a bug with partitions that are
> foreign tables. Test case given below; take a look.

[ ... ]

> 
> --PG-HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
>                     QUERY PLAN
> --------------------------------------------------
>  Nested Loop
>    ->  Append
>          ->  Foreign Scan on ftplt1_p1 t1
>          ->  Foreign Scan on ftplt1_p2 t1_1
>    ->  Unique
>          ->  Append
>                ->  Foreign Scan on ftplt1_p1 t2
>                ->  Foreign Scan on ftplt1_p2 t2_1
> (8 rows)
> 
> --PG-HEAD +v5 patches
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2  WHERE  t2.c = t1.c ) q;
> 
> *ERROR:  invalid expression for partition key*

I looked at this and it seems the error occurs not because the partitions
are foreign tables, but because the new code is wrong to assume that
Param nodes can never appear in the clauses coming from baserestrictinfo.
When trying to do the plan-time pruning for the partitioned table
appearing inside the lateral subquery, there are Params in the clauses in
baserestrictinfo that the new pruning code was unprepared to handle.
Fixed the code to instead give up on plan-time pruning in such a case.
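
The shape of that fix can be illustrated with a toy expression walker
(hypothetical C; the real patch would operate on PostgreSQL's Node types,
not these invented ones):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy expression tree standing in for planner clause trees. */
typedef enum NodeTag
{
    NODE_CONST,
    NODE_PARAM,
    NODE_OPEXPR
} NodeTag;

typedef struct Node
{
    NodeTag     tag;
    struct Node *left;
    struct Node *right;
} Node;

/*
 * Return true if any Param appears anywhere in the clause tree.  A
 * Param's value is unknown at plan time, so plan-time pruning must be
 * skipped for any clause containing one.
 */
static bool
contains_param(const Node *node)
{
    if (node == NULL)
        return false;
    if (node->tag == NODE_PARAM)
        return true;
    return contains_param(node->left) || contains_param(node->right);
}
```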

Attached is an updated set of patches.  In addition to fixing the above
bug, it also fixes one of the cases reported by Beena regarding default
partition pruning that I had given up on yesterday as too difficult to
implement [1], but found out today is not that difficult to do [2].
Change summary:

0001: added some new tests
0002: no change
0003: fixed issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
      issue reported by Beena)

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0d6096e8-7c7b-afed-71d3-dca151306626%40lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
0001: added some new tests
0002: no change
0003: fixed issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
      issue reported by Beena)
 
Thanks for the updated patch.  I am getting a server crash with the below query.

CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY RANGE(c2);
CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY RANGE(c2);
CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);

INSERT INTO mp VALUES(10, 100, 10);
INSERT INTO mp VALUES(20, 200, 20);
INSERT INTO mp VALUES(21, 150, 30);
INSERT INTO mp VALUES(30, 200, 40);
INSERT INTO mp VALUES(31, 300, 30);
INSERT INTO mp VALUES(40, 400, 40);

EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND c2 < 300;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

 

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Rajkumar.

On 2017/10/27 19:29, Rajkumar Raghuwanshi wrote:
> On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote wrote:
>> 0001: added some new tests
>> 0002: no change
>> 0003: fixed issue that Rajkumar reported (cope with Params properly)
>> 0004: no change
>> 0005: fix the case to prune the default partition when warranted (the
>>       issue reported by Beena)
>>
> 
> Thanks for the updated patch.  I am getting a server crash with the below query.
> 
> CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
> CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
> CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
> CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
> CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);
> 
> INSERT INTO mp VALUES(10, 100, 10);
> INSERT INTO mp VALUES(20, 200, 20);
> INSERT INTO mp VALUES(21, 150, 30);
> INSERT INTO mp VALUES(30, 200, 40);
> INSERT INTO mp VALUES(31, 300, 30);
> INSERT INTO mp VALUES(40, 400, 40);
> 
> EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND
> c2 < 300;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Looks like an update I'd included in the last posted patch (viz. adding
the non-root partitioned tables' partition constraint clauses to the list
of clauses used for pruning) exposed a bug in how ScalarArrayOpExpr
clauses are handled by the new pruning code.  The internal partition
constraint of a sub-partitioned list partition contains an ArrayExpr as
the ScalarArrayOpExpr's right-hand operand, whereas the pruning code
assumed there could only ever be a Const holding an ArrayType value.

Fixed in the attached updated patch, along with a new test in 0001 to
cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
code from the former to the latter) around the handling of ScalarArrayOpExprs.

Thanks,
Amit


Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/10/30 14:55, Amit Langote wrote:
> Fixed in the attached updated patch, along with a new test in 0001 to
> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
> code from the former to the latter) around the handling of ScalarArrayOpExprs.

Sorry, I'd forgotten to include some changes.

In the previous versions, RT index of the table needed to be passed to
partition.c, which I realized is no longer needed, so I removed that
requirement from the interface.  As a result, patches 0002 and 0003 have
changed in this version.

Thanks,
Amit


Attachment

Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the previous versions, RT index of the table needed to be passed to
partition.c, which I realized is no longer needed, so I removed that
requirement from the interface.  As a result, patches 0002 and 0003 have
changed in this version.
 
Thanks for the fix.

I am getting wrong output when the default partition is sub-partitioned further; below is a test case.

CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
INSERT INTO lpd VALUES (null,null,null);

--on HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
                 QUERY PLAN                 
---------------------------------------------
 Sort
   Sort Key: ((lpd_p1.tableoid)::regclass)
   ->  Result
         ->  Append
               ->  Seq Scan on lpd_p1
                     Filter: (a IS NOT NULL)
               ->  Seq Scan on lpd_p2
                     Filter: (a IS NOT NULL)
               ->  Seq Scan on lpd_d3
                     Filter: (a IS NOT NULL)
               ->  Seq Scan on lpd_d1
                     Filter: (a IS NOT NULL)
               ->  Seq Scan on lpd_d2
                     Filter: (a IS NOT NULL)
(14 rows)

postgres=#
postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
 tableoid | a  | b  | c 
----------+----+----+----
 lpd_p1   |  1 | 1  |  1
 lpd_p1   |  2 | 2  |  2
 lpd_p1   |  3 | 3  |  3
 lpd_p2   |  4 | 4  |  4
 lpd_p2   |  5 | 5  |  5
 lpd_d1   |  7 | 7  |  7
 lpd_d1   |  8 | 8  |  8
 lpd_d1   |  9 | 9  |  9
 lpd_d2   | 12 | 12 | 12
 lpd_d2   | 10 | 10 | 10
 lpd_d2   | 11 | 11 | 11
 lpd_d3   |  6 | 6  |  6
(12 rows)


--on HEAD + v8 patches

postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
                 QUERY PLAN                 
---------------------------------------------
 Sort
   Sort Key: ((lpd_p1.tableoid)::regclass)
   ->  Result
         ->  Append
               ->  Seq Scan on lpd_p1
                     Filter: (a IS NOT NULL)
               ->  Seq Scan on lpd_p2
                     Filter: (a IS NOT NULL)
(8 rows)

postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
 tableoid | a | b | c
----------+---+---+---
 lpd_p1   | 1 | 1 | 1
 lpd_p1   | 2 | 2 | 2
 lpd_p1   | 3 | 3 | 3
 lpd_p2   | 4 | 4 | 4
 lpd_p2   | 5 | 5 | 5
(5 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

 

Re: [HACKERS] path toward faster partition pruning

From
Dilip Kumar
Date:
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/30 14:55, Amit Langote wrote:
>> Fixed in the attached updated patch, along with a new test in 0001 to
>> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
>> code from the former to the latter) around the handling of ScalarArrayOpExprs.
>
> Sorry, I'd forgotten to include some changes.
>
> In the previous versions, RT index of the table needed to be passed to
> partition.c, which I realized is no longer needed, so I removed that
> requirement from the interface.  As a result, patches 0002 and 0003 have
> changed in this version.

Some Minor comments:

+ * get_rel_partitions
+ * Return the list of partitions of rel that pass the clauses mentioned
+ * rel->baserestrictinfo
+ *
+ * Returned list contains the AppendRelInfos of chosen partitions.
+ */
+static List *
+get_append_rel_partitions(PlannerInfo *root,

Function name in function header is not correct.

+ !DatumGetBool(((Const *) clause)->constvalue))
+ {
+ *constfalse = true;
+ continue;

DatumGetBool will return true if the first byte of constvalue is
nonzero, and false otherwise.  IIUC, this is not the intention here.
Or am I missing something?

+ * clauses in this function ourselves, for example, having both a > 1 and
+ * a = 0 the list

a = 0 the list -> a = 0 in the list

+static bool
+partkey_datum_from_expr(const Expr *expr, Datum *value)
+{
+ /*
+ * Add more expression types here as needed to support higher-level
+ * code.
+ */
+ switch (nodeTag(expr))
+ {
+ case T_Const:
+ *value = ((Const *) expr)->constvalue;
+ return true;

I think for evaluating other expressions (e.g. T_Param) we will have
to pass an ExprContext to this function.  Or we can do something
cleaner, because if we want to access the ExprContext inside
partkey_datum_from_expr then we may need to pass it to
"get_partitions_from_clauses", which is a function common to the
optimizer and executor.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the test case.

On 2017/10/30 17:09, Rajkumar Raghuwanshi wrote:
> I am getting wrong output when the default partition is sub-partitioned
> further; below is a test case.
> 
> CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
> CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
> CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
> CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
> CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
> CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
> CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
> INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
> INSERT INTO lpd VALUES (null,null,null);
> 
> --on HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d3
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d2
>                      Filter: (a IS NOT NULL)
> (14 rows)
> 
> postgres=#
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a  | b  | c
> ----------+----+----+----
>  lpd_p1   |  1 | 1  |  1
>  lpd_p1   |  2 | 2  |  2
>  lpd_p1   |  3 | 3  |  3
>  lpd_p2   |  4 | 4  |  4
>  lpd_p2   |  5 | 5  |  5
>  lpd_d1   |  7 | 7  |  7
>  lpd_d1   |  8 | 8  |  8
>  lpd_d1   |  9 | 9  |  9
>  lpd_d2   | 12 | 12 | 12
>  lpd_d2   | 10 | 10 | 10
>  lpd_d2   | 11 | 11 | 11
>  lpd_d3   |  6 | 6  |  6
> (12 rows)
> 
> 
> --on HEAD + v8 patches
> 
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
> (8 rows)
> 
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a | b | c
> ----------+---+---+---
>  lpd_p1   | 1 | 1 | 1
>  lpd_p1   | 2 | 2 | 2
>  lpd_p1   | 3 | 3 | 3
>  lpd_p2   | 4 | 4 | 4
>  lpd_p2   | 5 | 5 | 5
> (5 rows)

I found bugs in 0003 and 0005 that caused this.  Will post the patches
containing the fix in reply to Dilip's email, which contains some code
review comments [1].

Also, I noticed that the new pruning code was having a hard time dealing
with the fact that the default "range" partition doesn't explicitly say in
its partition constraint that it might contain null values.  More
precisely, the default range partition's constraint appears to imply that
it can contain only non-null values, which confuses the new pruning code.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFiTN-thYsobXxPS6bwOA_9erpax_S=iztSn3RtUxKKMKG4V4A@mail.gmail.com




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Dilip for reviewing.

On 2017/10/31 1:50, Dilip Kumar wrote:
> On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/10/30 14:55, Amit Langote wrote:
>>> Fixed in the attached updated patch, along with a new test in 0001 to
>>> cover this case.  Also, made a few tweaks to 0003 and 0005 (moved some
>>> code from the former to the latter) around the handling of ScalarArrayOpExprs.
>>
>> Sorry, I'd forgotten to include some changes.
>>
>> In the previous versions, RT index of the table needed to be passed to
>> partition.c, which I realized is no longer needed, so I removed that
>> requirement from the interface.  As a result, patches 0002 and 0003 have
>> changed in this version.
> 
> Some Minor comments:
> 
> + * get_rel_partitions
> + * Return the list of partitions of rel that pass the clauses mentioned
> + * rel->baserestrictinfo
> + *
> + * Returned list contains the AppendRelInfos of chosen partitions.
> + */
> +static List *
> +get_append_rel_partitions(PlannerInfo *root,
> 
> Function name in function header is not correct.

Fixed.

> + !DatumGetBool(((Const *) clause)->constvalue))
> + {
> + *constfalse = true;
> + continue;
> 
> DatumGetBool will return true if the first byte of constvalue is
> nonzero otherwise
> false.  IIUC, this is not the intention here. Or I am missing something?

This coding pattern is in use in quite a few places; see for example
restriction_is_constant_false() and many others like
relation_excluded_by_constraints(), negate_clause(), etc.

If a RestrictInfo is marked pseudoconstant=true, then the clause therein
must be a boolean Const whose constvalue is 0 if the clause is constant
false (so that DatumGetBool(constvalue) returns false), and non-zero
otherwise.

> + * clauses in this function ourselves, for example, having both a > 1 and
> + * a = 0 the list
> 
> a = 0 the list -> a = 0 in the list

Right, fixed.

> 
> +static bool
> +partkey_datum_from_expr(const Expr *expr, Datum *value)
> +{
> + /*
> + * Add more expression types here as needed to support higher-level
> + * code.
> + */
> + switch (nodeTag(expr))
> + {
> + case T_Const:
> + *value = ((Const *) expr)->constvalue;
> + return true;
> 
> I think for evaluating other expressions (e.g. T_Param) we will have
> to pass ExprContext to this function.

That's right.

> Or we can do something cleaner
> because if we want to access the ExprContext inside
> partkey_datum_from_expr then we may need to pass it to
> "get_partitions_from_clauses" which is a common function for optimizer
> and executor.

Yeah, I've thought about that a little.  Since nothing but the planner
calls it now, and the planner doesn't itself have its hands on the
ExprContext that would be necessary for computing something like Params, I
left it out of the interface for now.  That said, I *am* actually thinking
about some interface changes that would be necessary for some other,
unrelated functionality/optimizations that callers like the run-time
pruning code would expect of get_partitions_from_clauses().  We can design
the interface extension such that the aforementioned ExprContext is passed
along with it.


Attached is an updated version of the patches addressing some of your
comments above and fixing a bug that Rajkumar reported [1].  As mentioned
there, I'm including here a patch (the 0005 of the attached) to tweak the
default range partition constraint to be explicit about null values that
it might contain.  So, there are 6 patches now, and what used to be patch
0005 in the previous set is patch 0006 in this version of the set.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/cd5a2d2e-0957-042c-40c2-06033fe0abf2@lab.ntt.co.jp


Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated version of the patches addressing some of your comments
> above and fixing a bug that Rajkumar reported [1].  As mentioned there,
> I'm including here a patch (the 0005 of the attached) to tweak the default
> range partition constraint to be explicit about null values that it might
> contain.  So, there are 6 patches now and what used to be patch 0005 in
> the previous set is patch 0006 in this version of the set.

Hi Amit,

I've been looking over this. I see the latest patches conflict with
cf7ab13bf. Can you send patches rebased on current master?

Thanks

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated version of the patches addressing some of your comments

I've spent a bit of time looking at these but I'm out of time for now.

So far I have noted down the following;

1. This comment seems wrong.

/*
* Since the clauses in rel->baserestrictinfo should all contain Const
* operands, it should be possible to prune partitions right away.
*/

How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
baserestrictinfo in this case will contain a single RestrictInfo with
an OpExpr containing two Var args and it'll come right through that
function too.

2. This code is way more complex than it needs to be.

if (num_parts > 0)
{
int j;

all_indexes = (int *) palloc(num_parts * sizeof(int));
j = 0;
if (min_part_idx >= 0 && max_part_idx >= 0)
{
for (i = min_part_idx; i <= max_part_idx; i++)
all_indexes[j++] = i;
}
if (!bms_is_empty(other_parts))
while ((i = bms_first_member(other_parts)) >= 0)
all_indexes[j++] = i;
if (j > 1)
qsort((void *) all_indexes, j, sizeof(int), intcmp);
}

It looks like the min/max partition stuff is just complicating things
here. If you need to build this array of all_indexes[] anyway, I don't
quite understand the point of the min/max. It seems like min/max would
probably work a bit nicer if you didn't need the other_parts
Bitmapset, so I recommend just getting rid of min/max completely and
just having a Bitmapset with a bit set for each partition index you
need; you'd not need to go to the trouble of performing a qsort on an
array, and you could get rid of quite a chunk of code too.

The entire function would then not be much more complex than:

partindexes = get_partitions_from_clauses(parent, partclauses);

while ((i = bms_first_member(partindexes)) >= 0)
{
    AppendRelInfo *appinfo = rel->part_appinfos[i];
    result = lappend(result, appinfo);
}

Then you can also get rid of your intcmp() function too.

3. Following code has the wrong size calculation:

memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *));

should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have
worked on your machine if you're compiling as 32 bit.
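For what it's worth, the size difference is easy to demonstrate with a
stand-in enum (NullTestType itself is a PostgreSQL enum; the names below
are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's NullTestType enum. */
typedef enum StandInNullTestType
{
	STANDIN_IS_NULL,
	STANDIN_IS_NOT_NULL
} StandInNullTestType;

#define MAX_KEYS 32

/* sizeof(enum) is typically 4, while sizeof(pointer) is 8 on 64-bit
 * platforms, so using the pointer size would memset past the end of an
 * array of MAX_KEYS enum values. */
static size_t
correct_memset_size(void)
{
	return MAX_KEYS * sizeof(StandInNullTestType);
}

static size_t
buggy_memset_size(void)
{
	return MAX_KEYS * sizeof(StandInNullTestType *);
}
```

On a 32-bit build the two sizes happen to coincide, which is presumably
why the bug went unnoticed.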

I'll continue on with the review in the next few days.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 3 November 2017 at 17:32, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 2. This code is way more complex than it needs to be.
>
> if (num_parts > 0)
> {
> int j;
>
> all_indexes = (int *) palloc(num_parts * sizeof(int));
> j = 0;
> if (min_part_idx >= 0 && max_part_idx >= 0)
> {
> for (i = min_part_idx; i <= max_part_idx; i++)
> all_indexes[j++] = i;
> }
> if (!bms_is_empty(other_parts))
> while ((i = bms_first_member(other_parts)) >= 0)
> all_indexes[j++] = i;
> if (j > 1)
> qsort((void *) all_indexes, j, sizeof(int), intcmp);
> }
>
> It looks like the min/max partition stuff is just complicating things
> here. If you need to build this array of all_indexes[] anyway, I don't
> quite understand the point of the min/max. It seems like min/max would
> probably work a bit nicer if you didn't need the other_parts
> BitmapSet, so I recommend just getting rid of min/max completely and
> just have a BitmapSet with bit set for each partition's index you
> need, you'd not need to go to the trouble of performing a qsort on an
> array and you could get rid of quite a chunk of code too.
>
> The entire function would then not be much more complex than:
>
> partindexes = get_partitions_from_clauses(parent, partclauses);
>
> while ((i = bms_first_member(partindexes)) >= 0)
> {
>     AppendRelInfo *appinfo = rel->part_appinfos[i];
>     result = lappend(result, appinfo);
> }
>
> Then you can also get rid of your intcmp() function too.

I've read a bit more of the patch and I think even more now that the
min/max stuff should be removed.

I understand that you'll be bsearching for a lower and upper bound for
cases like:

SELECT * FROM pt WHERE key BETWEEN 10 and 20;

but it looks like the min and max range stuff is thrown away if the
query is written as:

SELECT * FROM pt WHERE key BETWEEN 10 and 20 OR key BETWEEN 30 AND 40;

from reading the code, it seems like partset_union() would be called
in this case, and if the min/max of each were consecutive then the
min/max ranges would get merged, but there's really a lot of code to
support this. I think it would be much better to invent
bms_add_range() and just use a Bitmapset to store the partition
indexes to scan. You could simply use bms_union() for OR cases and
bms_intersect() for AND cases. It seems this would allow removal of
this complex code. It looks like this would allow you to remove all
the partset_range_* macros too.

I've attached a patch which implements bms_add_range(), which will save
you from having to write the tight loops calling bms_add_member() such
as the ones in partset_union(). Those would not be so great if there
were a huge number of partitions, as the Bitmapset->words[] array could
be expanded many more times than required. bms_add_range() will handle
that much more efficiently, with a maximum of 1 repalloc() for the
whole operation. It would also likely be faster since it works at the
bitmapword level rather than the bit level.
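To illustrate the idea (a simplified, self-contained sketch, not the
attached Bitmapset patch itself; the BitSet type and function names below
are invented), a range add can fill interior words whole and grow the
allocation at most once:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_WORD 64

typedef struct
{
	int			nwords;
	uint64_t   *words;
} BitSet;

/* Grow the set (at most one realloc) so it can hold bit 'upper'. */
static void
bitset_extend(BitSet *set, int upper)
{
	int			wordnum = upper / BITS_PER_WORD;

	if (wordnum >= set->nwords)
	{
		set->words = realloc(set->words, (wordnum + 1) * sizeof(uint64_t));
		memset(set->words + set->nwords, 0,
			   (wordnum + 1 - set->nwords) * sizeof(uint64_t));
		set->nwords = wordnum + 1;
	}
}

/* Set all bits in [lower, upper], filling whole words where possible. */
static void
bitset_add_range(BitSet *set, int lower, int upper)
{
	int			lword = lower / BITS_PER_WORD;
	int			uword = upper / BITS_PER_WORD;
	int			lbit = lower % BITS_PER_WORD;
	int			ubit = upper % BITS_PER_WORD;

	bitset_extend(set, upper);
	if (lword == uword)
	{
		/* Range fits in one word: set bits lbit..ubit only. */
		set->words[lword] |= (~UINT64_C(0) << lbit) &
			(~UINT64_C(0) >> (BITS_PER_WORD - 1 - ubit));
	}
	else
	{
		int			w;

		set->words[lword] |= ~UINT64_C(0) << lbit;
		for (w = lword + 1; w < uword; w++)
			set->words[w] = ~UINT64_C(0);	/* whole word at once */
		set->words[uword] |= ~UINT64_C(0) >> (BITS_PER_WORD - 1 - ubit);
	}
}

static int
bitset_is_member(const BitSet *set, int x)
{
	int			wordnum = x / BITS_PER_WORD;

	if (wordnum >= set->nwords)
		return 0;
	return (set->words[wordnum] >> (x % BITS_PER_WORD)) & 1;
}
```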

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated version of the patches

match_clauses_to_partkey() needs to allow for the way quals on Bool
columns are represented.

create table pt (a bool not null) partition by list (a);
create table pt_true partition of pt for values in('t');
create table pt_false partition of pt for values in('f');
explain select * from pt where a = true;
                           QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..76.20 rows=2810 width=1)
   ->  Seq Scan on pt_false  (cost=0.00..38.10 rows=1405 width=1)
         Filter: a
   ->  Seq Scan on pt_true  (cost=0.00..38.10 rows=1405 width=1)
         Filter: a
(5 rows)

match_clause_to_indexcol() shows an example of how to handle this.

explain select * from pt where a = false;

will need to be allowed too. This works slightly differently.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David,

On 2017/11/03 13:32, David Rowley wrote:
> On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated version of the patches addressing some of your comments
> 
> I've spent a bit of time looking at these but I'm out of time for now.

Thanks a lot for the review and sorry for the delay in sending rebased
patches.

> So far I have noted down the following;
> 
> 1. This comment seem wrong.
> 
> /*
> * Since the clauses in rel->baserestrictinfo should all contain Const
> * operands, it should be possible to prune partitions right away.
> */

Yes.  I used to think it was true, then realized it isn't and updated the
code to get rid of that assumption, but I forgot to update this comment.
Fixed.

> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
> baserestrictinfo in this case will contain a single RestrictInfo with
> an OpExpr containing two Var args and it'll come right through that
> function too.

As it should, I think.  Quite similarly, you will be able to see that an
index path won't be considered for such a clause:

create table foo (a int, b int);
create index fooi on foo (a);
insert into foo select generate_series(1, 100000);

explain select * from foo where a = 1;
                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=12.17..482.50 rows=500 width=8)
   Recheck Cond: (a = 1)
   ->  Bitmap Index Scan on fooi  (cost=0.00..12.04 rows=500 width=0)
         Index Cond: (a = 1)
(4 rows)

explain select * from foo where a <= b;
                       QUERY PLAN
--------------------------------------------------------
 Seq Scan on foo  (cost=0.00..1693.00 rows=500 width=8)
   Filter: (a <= b)
(2 rows)

We won't be able to use such a clause for pruning at all; neither
planning-time pruning nor execution-time pruning.  Am I missing something?

> 2. This code is way more complex than it needs to be.
> 
> if (num_parts > 0)
> {
> int j;
> 
> all_indexes = (int *) palloc(num_parts * sizeof(int));
> j = 0;
> if (min_part_idx >= 0 && max_part_idx >= 0)
> {
> for (i = min_part_idx; i <= max_part_idx; i++)
> all_indexes[j++] = i;
> }
> if (!bms_is_empty(other_parts))
> while ((i = bms_first_member(other_parts)) >= 0)
> all_indexes[j++] = i;
> if (j > 1)
> qsort((void *) all_indexes, j, sizeof(int), intcmp);
> }
> 
> It looks like the min/max partition stuff is just complicating things
> here. If you need to build this array of all_indexes[] anyway, I don't
> quite understand the point of the min/max. It seems like min/max would
> probably work a bit nicer if you didn't need the other_parts
> BitmapSet, so I recommend just getting rid of min/max completely and
> just have a BitmapSet with bit set for each partition's index you
> need, you'd not need to go to the trouble of performing a qsort on an
> array and you could get rid of quite a chunk of code too.
>
> The entire function would then not be much more complex than:
>
> partindexes = get_partitions_from_clauses(parent, partclauses);
>
> while ((i = bms_first_member(partindexes)) >= 0)
> {
>     AppendRelInfo *appinfo = rel->part_appinfos[i];
>     result = lappend(result, appinfo);
> }
>
> Then you can also get rid of your intcmp() function too.

The design with the min/max partition index interface to partition.c's new
partition-pruning facility is intentional.  You can find hints about how
such a design came about in the following email from Robert:

https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

For range queries, it is desirable for the partitioning module to return
the set of qualifying partitions, when they are contiguous, in a compact
(O(1)) min/max representation, rather than having to enumerate all those
indexes in the set.  It's also nice to avoid iterating over that set twice
-- once when constructing the set in the partitioning module and then
again in the caller (in this case, the planner) to perform the
planning-related tasks per selected partition.

We need the other_parts Bitmapset too, because selected partitions may not
always be contiguous, even in the case of range partitioning, if there are
OR clauses and the possibility that the default partition is also
selected.  While computing the selected partition set from a given set of
clauses, partitioning code tries to keep the min/max representation as
long as it makes sense to and once the selected partitions no longer
appear to be contiguous it switches to the Bitmapset mode.
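The switch-over behaviour described above can be sketched roughly like
this (the PartSelection type and function names are invented for
illustration; the actual patch differs in detail):

```c
#include <assert.h>
#include <stdbool.h>

/* Keep a compact [min_idx, max_idx] range while the selected partitions
 * stay contiguous; once a disjoint range appears, signal the caller to
 * fall back to enumerating indexes in a bitmap set. */
typedef struct
{
	int			min_idx;		/* -1 if nothing selected yet */
	int			max_idx;
	bool		contiguous;
} PartSelection;

static void
selection_init(PartSelection *sel)
{
	sel->min_idx = sel->max_idx = -1;
	sel->contiguous = true;
}

/* Extend the selection with partition index range [lo, hi]; returns
 * whether the compact representation could be kept. */
static bool
selection_add_range(PartSelection *sel, int lo, int hi)
{
	if (sel->min_idx < 0)
	{
		sel->min_idx = lo;
		sel->max_idx = hi;
		return true;
	}
	/* Merge only if the new range touches or overlaps the old one. */
	if (lo <= sel->max_idx + 1 && hi >= sel->min_idx - 1)
	{
		if (lo < sel->min_idx)
			sel->min_idx = lo;
		if (hi > sel->max_idx)
			sel->max_idx = hi;
		return true;
	}
	sel->contiguous = false;	/* caller must switch to a bitmap set */
	return false;
}
```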

> 3. Following code has the wrong size calculation:
> 
> memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *));
> 
> should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have
> worked on your machine if you're compiling as 32 bit.

Oops, you're right.  Fixed.

> I'll continue on with the review in the next few days.

Thanks again.

Attached is the updated set of patches.

Regards,
Amit


Attachment

Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/06 12:53, David Rowley wrote:
> On 3 November 2017 at 17:32, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> 2. This code is way more complex than it needs to be.
>>
>> if (num_parts > 0)
>> {
>> int j;
>>
>> all_indexes = (int *) palloc(num_parts * sizeof(int));
>> j = 0;
>> if (min_part_idx >= 0 && max_part_idx >= 0)
>> {
>> for (i = min_part_idx; i <= max_part_idx; i++)
>> all_indexes[j++] = i;
>> }
>> if (!bms_is_empty(other_parts))
>> while ((i = bms_first_member(other_parts)) >= 0)
>> all_indexes[j++] = i;
>> if (j > 1)
>> qsort((void *) all_indexes, j, sizeof(int), intcmp);
>> }
>>
>> It looks like the min/max partition stuff is just complicating things
>> here. If you need to build this array of all_indexes[] anyway, I don't
>> quite understand the point of the min/max. It seems like min/max would
>> probably work a bit nicer if you didn't need the other_parts
>> BitmapSet, so I recommend just getting rid of min/max completely and
>> just have a BitmapSet with bit set for each partition's index you
>> need, you'd not need to go to the trouble of performing a qsort on an
>> array and you could get rid of quite a chunk of code too.
>>
>> The entire function would then not be much more complex than:
>>
>> partindexes = get_partitions_from_clauses(parent, partclauses);
>>
>> while ((i = bms_first_member(partindexes)) >= 0)
>> {
>>     AppendRelInfo *appinfo = rel->part_appinfos[i];
>>     result = lappend(result, appinfo);
>> }
>>
>> Then you can also get rid of your intcmp() function too.
> 
> I've read a bit more of the patch and I think even more now that the
> min/max stuff should be removed.

Oops, I didn't catch this email before sending my earlier reply.  Thanks
for the bms range patch.  Will reply to this shortly after reading your
patch and thinking on it a bit.

Thanks,
Amit





Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 6 November 2017 at 17:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/03 13:32, David Rowley wrote:
>> On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 1. This comment seem wrong.
>>
>> /*
>> * Since the clauses in rel->baserestrictinfo should all contain Const
>> * operands, it should be possible to prune partitions right away.
>> */
>
> Yes.  I used to think it was true, then realized it isn't and updated the
> code to get rid of that assumption, but I forgot updating this comment.
> Fixed.
>
>> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
>> baserestrictinfo in this case will contain a single RestrictInfo with
>> an OpExpr containing two Var args and it'll come right through that
>> function too.

...

> We won't be able to use such a clause for pruning at all; neither
> planning-time pruning nor execution-time pruning.  Am I missing something?

I just meant the comment was wrong.

>
> The design with min/max partition index interface to the partition.c's new
> partition-pruning facility is intentional.  You can find hints about how
> such a design came about in the following Robert's email:
>
> https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

Yeah, I remember reading that before I had looked at the code. I
disagree with Robert on this. The fact that the min/max range gets
turned into a list of everything in that range in
get_append_rel_partitions means all the advantages of storing the
partitions as a range are voided. If you could have kept it a range the
entire time, then that might be different, but it seems you need to
materialize the entire range in order to sort the partitions into
order. I've included Robert just in case he wants to take a look at
the code that resulted from that design. Maybe something is not
following what he had in mind, or maybe he'll change his mind based on
the code that resulted.


> For range queries, it is desirable for the partitioning module to return
> the set of qualifying partitions that are contiguous in a compact (O(1))
> min/max representation than having to enumerate all those indexes in the
> set.  It's nice to avoid iterating over that set twice -- once when
> constructing the set in the partitioning module and then again in the
> caller (in this case, planner) to perform the planning-related tasks per
> selected partition.

The idea is that you still get the min and max from the bsearch, but
then use bms_add_range() to populate a bitmapset of the matching
partitions. The big-O notation of the search shouldn't change.
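For illustration, the bms_add_range() idea can be sketched standalone like this (a hedged sketch: the fixed-size word array, 64-bit word size, and the name add_range are simplified stand-ins for PostgreSQL's variable-length Bitmapset, not the actual API):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_WORD 64
#define WORDNUM(x) ((x) / BITS_PER_WORD)
#define BITNUM(x)  ((x) % BITS_PER_WORD)

/*
 * Set all bits in [lower, upper] of a fixed-size bit array in O(words)
 * time, one full-word OR per word, rather than looping bit by bit.  The
 * real bms_add_range() additionally repallocs the Bitmapset when upper
 * falls beyond the currently allocated words; that is omitted here.
 */
static void
add_range(uint64_t *words, int lower, int upper)
{
	int			lw = WORDNUM(lower);
	int			uw = WORDNUM(upper);

	for (int w = lw; w <= uw; w++)
	{
		uint64_t	mask = ~UINT64_C(0);

		if (w == lw)			/* zero out bits below 'lower' */
			mask &= ~UINT64_C(0) << BITNUM(lower);
		if (w == uw)			/* zero out bits above 'upper' */
			mask &= ~UINT64_C(0) >> (BITS_PER_WORD - 1 - BITNUM(upper));
		words[w] |= mask;
	}
}
```

So the min/max pair from the bsearch turns into a populated set in a handful of word operations, which is why materializing the range up front costs little.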

> We need the other_parts Bitmapset too, because selected partitions may not
> always be contiguous, even in the case of range partitioning, if there are
> OR clauses and the possibility that the default partition is also
> selected.  While computing the selected partition set from a given set of
> clauses, partitioning code tries to keep the min/max representation as
> long as it makes sense to and once the selected partitions no longer
> appear to be contiguous it switches to the Bitmapset mode.

Yip. I understand that. I just think there's no benefit to having
min/max since it needs to be materialized into a list of the entire
range at some point, it might as well be done as soon as possible
using a bitmapset, which would save having all the partset_union,
partset_intersect, partset_range_empty, partset_range_overlap,
partset_range_adjacent code. You'd end up just using bms_union and
bms_intersect then bms_add_range to handle the min/max bound you get
from the bsearch.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/06 13:15, David Rowley wrote:
> On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated version of the patches
> 
> match_clauses_to_partkey() needs to allow for the way quals on Bool
> columns are represented.
> 
> create table pt (a bool not null) partition by list (a);
> create table pt_true partition of pt for values in('t');
> create table pt_false partition of pt for values in('f');
> explain select * from pt where a = true;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Append  (cost=0.00..76.20 rows=2810 width=1)
>    ->  Seq Scan on pt_false  (cost=0.00..38.10 rows=1405 width=1)
>          Filter: a
>    ->  Seq Scan on pt_true  (cost=0.00..38.10 rows=1405 width=1)
>          Filter: a
> (5 rows)
> 
> match_clause_to_indexcol() shows an example of how to handle this.
> 
> explain select * from pt where a = false;
> 
> will need to be allowed too. This works slightly differently.

You're right.  I've fixed things to handle Boolean partitioning in the
updated set of patches I will post shortly.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/06 14:32, David Rowley wrote:
> On 6 November 2017 at 17:30, Amit Langote wrote:
>> On 2017/11/03 13:32, David Rowley wrote:
>>> On 31 October 2017 at 21:43, Amit Langote wrote:
>>> 1. This comment seem wrong.
>>>
>>> /*
>>> * Since the clauses in rel->baserestrictinfo should all contain Const
>>> * operands, it should be possible to prune partitions right away.
>>> */
>>
>> Yes.  I used to think it was true, then realized it isn't and updated the
>> code to get rid of that assumption, but I forgot updating this comment.
>> Fixed.
>>
>>> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
>>> baserestrictinfo in this case will contain a single RestrictInfo with
>>> an OpExpr containing two Var args and it'll come right through that
>>> function too.
> 
> ...
> 
>> We won't be able to use such a clause for pruning at all; neither
>> planning-time pruning nor execution-time pruning.  Am I missing something?
> 
> I just meant the comment was wrong.

Ah, gotcha.

>> The design with min/max partition index interface to the partition.c's new
>> partition-pruning facility is intentional.  You can find hints about how
>> such a design came about in the following Robert's email:
>>
>> https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com
> 
> Yeah, I remember reading that before I had looked at the code. I
> disagree with Robert on this. The fact that the min/max range gets
> turned into a list of everything in that range in
> get_append_rel_partitions means all the advantages of storing the
> partitions as a range are voided. If you could have kept it a range the
> entire time, then that might be different, but it seems you need to
> materialize the entire range in order to sort the partitions into
> order. I've included Robert just in case he wants to take a look at
> the code that resulted from that design. Maybe something is not
> following what he had in mind, or maybe he'll change his mind based on
> the code that resulted.
> 
>> For range queries, it is desirable for the partitioning module to return
>> the set of qualifying partitions that are contiguous in a compact (O(1))
>> min/max representation than having to enumerate all those indexes in the
>> set.  It's nice to avoid iterating over that set twice -- once when
>> constructing the set in the partitioning module and then again in the
>> caller (in this case, planner) to perform the planning-related tasks per
>> selected partition.
> 
> The idea is that you still get the min and max from the bsearch, but
> then use bms_add_range() to populate a bitmapset of the matching
> partitions. The big-O notation of the search shouldn't change.
> 
>> We need the other_parts Bitmapset too, because selected partitions may not
>> always be contiguous, even in the case of range partitioning, if there are
>> OR clauses and the possibility that the default partition is also
>> selected.  While computing the selected partition set from a given set of
>> clauses, partitioning code tries to keep the min/max representation as
>> long as it makes sense to and once the selected partitions no longer
>> appear to be contiguous it switches to the Bitmapset mode.
> 
> Yip. I understand that. I just think there's no benefit to having
> min/max since it needs to be materialized into a list of the entire
> range at some point, it might as well be done as soon as possible
> using a bitmapset, which would save having all the partset_union,
> partset_intersect, partset_range_empty, partset_range_overlap,
> partset_range_adjacent code. You'd end up just using bms_union and
> bms_intersect then bms_add_range to handle the min/max bound you get
> from the bsearch.

OK, I have gotten rid of the min/max partition index interface and instead
adopted the bms_add_range() approach by including your patch to add the
same in the patch set (which is now 0002 in the whole set).  I have to
admit that it's simpler to understand the new code with just Bitmapsets to
look at, but I'm still a bit concerned about materializing the whole set
right within partition.c, although we can perhaps optimize it later.

Attached updated set of patches, including the fix to make the new pruning
code handle Boolean partitioning.

Thanks,
Amit


Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> OK, I have gotten rid of the min/max partition index interface and instead
> adopted the bms_add_range() approach by including your patch to add the
> same in the patch set (which is now 0002 in the whole set).  I have to
> admit that it's simpler to understand the new code with just Bitmapsets to
> look at, but I'm still a bit concerned about materializing the whole set
> right within partition.c, although we can perhaps optimize it later.

Thanks for making that change. The code looks much more simple now.

For performance, if you're worried about a very large number of
partitions, then I think you're better off using bms_next_member()
rather than bms_first_member() (likely this applies globally, but you
don't need to worry about those).

The problem with bms_first_member is that it must always loop over the
0 words before it finds any bits set for each call, whereas
bms_next_member will start on the word it was last called for. There
will likely be a pretty big performance difference between the two
when processing a large Bitmapset.
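To make the cost difference concrete, here is a hedged standalone sketch of the two iteration styles (a fixed-size word array stands in for the variable-length Bitmapset; first_member/next_member are simplified analogues of bms_first_member()/bms_next_member(), and __builtin_ctzll assumes GCC/Clang):

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS 4
#define BITS_PER_WORD 64

/*
 * Mimics bms_first_member(): always rescans from word 0 and destructively
 * clears the bit it returns, so draining a set whose members sit in the
 * high words costs O(nwords) per call, O(nwords * nmembers) overall.
 */
static int
first_member(uint64_t *words)
{
	for (int w = 0; w < NWORDS; w++)	/* always starts over at word 0 */
	{
		if (words[w] != 0)
		{
			int			b = __builtin_ctzll(words[w]);

			words[w] &= words[w] - 1;	/* clear the bit just returned */
			return w * BITS_PER_WORD + b;
		}
	}
	return -1;
}

/*
 * Mimics bms_next_member(): the caller passes the previous result (-1 to
 * start), so the scan resumes where it left off and the set is unchanged.
 */
static int
next_member(const uint64_t *words, int prevbit)
{
	int			nextbit = prevbit + 1;

	for (int w = nextbit / BITS_PER_WORD; w < NWORDS; w++)
	{
		uint64_t	word = words[w];

		if (w == nextbit / BITS_PER_WORD)	/* mask bits already visited */
			word &= ~UINT64_C(0) << (nextbit % BITS_PER_WORD);
		if (word != 0)
			return w * BITS_PER_WORD + __builtin_ctzll(word);
	}
	return -1;
}
```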

> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.

Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)



-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

I have a little more review to share:

1. Missing "in" in comment. Should be "mentioned in"

 * get_append_rel_partitions
 * Return the list of partitions of rel that pass the clauses mentioned
 * rel->baserestrictinfo

2. Variable should be declared in inner scope with the following fragment:

void
set_basic_child_rel_properties(PlannerInfo *root,
   RelOptInfo *rel,
   RelOptInfo *childrel,
   AppendRelInfo *appinfo)
{
AttrNumber attno;

if (rel->part_scheme)
{

which makes the code the same as where you moved it from.

3. Normally lfirst() is assigned to a variable at the start of a
foreach() loop. You have code which does not follow this.

foreach(lc, clauses)
{
Expr   *clause;
int i;

if (IsA(lfirst(lc), RestrictInfo))
{
RestrictInfo *rinfo = lfirst(lc);

You could assign this to a Node * since the type is unknown to you at
the start of the loop.

4.
/*
* Useless if what we're thinking of as a constant is actually
* a Var coming from this relation.
*/
if (bms_is_member(rel->relid, constrelids))
continue;

should this be moved to just above the op_strict() test? This one seems cheaper.

5. Typo "paritions": /* No clauses to prune paritions, so scan all
partitions. */

But thinking about it more, the comment should say something more along
the lines of /* No useful clauses for partition pruning. Scan all
partitions. */

The key difference is that there might be clauses, just without Consts.

Actually, the more I look at get_append_rel_partitions() I think it
would be better if you re-shaped that if/else if test so that it only
performs the loop over the partindexes if it's been set.

I ended up with the attached version of the function after moving
things around a little bit.

I'm still reviewing but thought I'd share this part so far.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/06 21:52, David Rowley wrote:
> On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> OK, I have gotten rid of the min/max partition index interface and instead
>> adopted the bms_add_range() approach by including your patch to add the
>> same in the patch set (which is now 0002 in the whole set).  I have to
>> admit that it's simpler to understand the new code with just Bitmapsets to
>> look at, but I'm still a bit concerned about materializing the whole set
>> right within partition.c, although we can perhaps optimize it later.
> 
> Thanks for making that change. The code looks much more simple now.
> 
> For performance, if you're worried about a very large number of
> partitions, then I think you're better off using bms_next_member()
> rather than bms_first_member(), (likely this applies globally, but you
> don't need to worry about those).
> 
> The problem with bms_first_member is that it must always loop over the
> 0 words before it finds any bits set for each call, whereas
> bms_next_member will start on the word it was last called for. There
> will likely be a pretty big performance difference between the two
> when processing a large Bitmapset.

Ah, thanks for the explanation.  I will change it to bms_next_member() in
the next version.

>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
> 
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Thank you.

Regards,
Amit




Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Hi Amit,

I had another look over this today. Apologies if any of the review seems petty.

Here goes:

1. The if test seems to be checking for a child that's a partitioned
table, but it does so by testing for a non-NULL part_scheme.

/*
* If childrel is itself partitioned, add it and its partitioned
* children to the list being propagated up to the root rel.
*/
if (childrel->part_scheme && rel->part_scheme)

Should this code use IS_PARTITIONED_REL() instead? It seems a bit strange
to test for a NULL part_scheme.

2. There's a couple of mistakes in my bms_add_range() code. I've
attached bms_add_range_fix.patch. Can you apply this to your tree?

3. This assert seems to be Asserting the same thing twice:

Assert(rel->live_partitioned_rels != NIL &&
   list_length(rel->live_partitioned_rels) > 0);

A List with length == 0 is always NIL.

4. get_partitions_from_clauses(), can you comment why you perform the
list_concat() there.

I believe this is there so that the partition bound from the parent is
passed down to the child so that we can properly eliminate all child
partitions when the 2nd level of partitioning is using the same
partition key as the 1st level. I think this deserves a paragraph of
comment to explain this.

5. Please add a comment to explain what's going on here in
classify_partition_bounding_keys()

if (partattno == 0)
{
partexpr = lfirst(partexprs_item);
partexprs_item = lnext(partexprs_item);
}

Looks like, similar to index expressions, partition expressions use
attno 0 to mark that the next expression should be consumed from the list.

Does this need validation that there are enough partexprs_item items
like what is done in get_range_key_properties()? Or is this validated
somewhere else?
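The consume-next-expression pattern, with the validation asked about above, might be sketched standalone like this (ExprCell and consume_key_expr are hypothetical stand-ins for illustration, not PostgreSQL's pg_list.h types):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a List cell of partition key expressions. */
typedef struct ExprCell
{
	const char *expr;
	struct ExprCell *next;
} ExprCell;

/*
 * For a partition key column, attno == 0 means "expression column":
 * consume the next element from the expression list, flagging an error
 * if the list runs out -- the validation get_range_key_properties()
 * performs with its "wrong number of partition key expressions" check.
 */
static const char *
consume_key_expr(int partattno, ExprCell **item, int *error)
{
	if (partattno != 0)
		return NULL;			/* plain column reference, nothing to consume */
	if (*item == NULL)
	{
		*error = 1;				/* list exhausted: too few expressions */
		return NULL;
	}

	const char *e = (*item)->expr;

	*item = (*item)->next;		/* advance, like lnext() */
	return e;
}
```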

6. Comment claims the if test will test something which it does not
seem to test for:

/*
* Redundant key elimination using btree-semantics based tricks.
*
* Only list and range partitioning use btree operator semantics, so
* skip otherwise.   Also, if there are expressions whose value is yet
* unknown, skip this step, because we need to compare actual values
* below.
*/
memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
if (partkey->strategy == PARTITION_STRATEGY_LIST ||
partkey->strategy == PARTITION_STRATEGY_RANGE)

I was expecting this to be skipped when the clauses contained a
non-const, but it does not seem to.

7. Should be "compare them"

/*
* If the leftarg and rightarg clauses' constants are both of the type
* expected by "op" clause's operator, then compare then using the
* latter's comparison function.
*/

But if I look at the code, "compare then using the latter's comparison
function" is not true; it seems to use op's comparison function, not
rightarg's. With most of the calls op and rightarg are the same, but
not all of them. The function shouldn't make that assumption even if
op were always the same as rightarg.

8. remove_redundant_clauses() needs an overview comment of what the
function does.

9. The comment should explain what we would do in the case of key < 3
AND key <= 2 using some examples.

/* try to keep only one of <, <= */

10. Wondering why this loop runs backward?

for (s = BTMaxStrategyNumber; --s >= 0;)

Why not just:

for (s = 0; s < BTMaxStrategyNumber; s++)

I can't see a special reason for it to run backward. It seems unusual,
but if there's a good reason that I've failed to realise then it's
maybe worth a comment.

11. Please comment on why *constfalse = true is set here:

if (!chk || s == (BTEqualStrategyNumber - 1))
continue;

if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
   &test_result))
{
if (!test_result)
{
*constfalse = true;
return;
}
/* discard the redundant key. */
xform[s] = NULL;
}

Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Also, please add a comment when discarding the redundant key; maybe
explain that equality is more useful than the other strategies when
there's an overlap.
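As a standalone illustration of the check being discussed (plain int comparison stands in for partition_cmp_args()'s opfamily-based comparison, and the names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum Strategy {LT, LE, EQ, GE, GT} Strategy;

static bool
satisfies(int val, Strategy s, int bound)
{
	switch (s)
	{
		case LT: return val < bound;
		case LE: return val <= bound;
		case EQ: return val == bound;
		case GE: return val >= bound;
		case GT: return val > bound;
	}
	return false;
}

/*
 * Given "key = eqconst AND key <s> chkconst": if the equality constant
 * satisfies the inequality, the inequality is implied and can be
 * discarded (equality pins down the value exactly, so it subsumes any
 * overlapping strategy).  Otherwise the two clauses contradict each
 * other and no partition can match, e.g. WHERE key = 1 AND key > 1.
 */
static bool
check_redundancy(int eqconst, Strategy s, int chkconst, bool *constfalse)
{
	if (satisfies(eqconst, s, chkconst))
		return true;			/* inequality redundant: drop it */
	*constfalse = true;			/* clauses can never hold together */
	return false;
}
```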

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks for the review.

(..also looking at the comments you sent earlier today.)

On 2017/11/07 11:14, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
> 
> I have a little more review to share:
> 
> 1. Missing "in" in comment. Should be "mentioned in"
> 
>  * get_append_rel_partitions
>  * Return the list of partitions of rel that pass the clauses mentioned
>  * rel->baserestrictinfo
> 
> 2. Variable should be declared in inner scope with the following fragment:
> 
> void
> set_basic_child_rel_properties(PlannerInfo *root,
>    RelOptInfo *rel,
>    RelOptInfo *childrel,
>    AppendRelInfo *appinfo)
> {
> AttrNumber attno;
> 
> if (rel->part_scheme)
> {
> 
> which makes the code the same as where you moved it from.

It seems like you included the above changes in your attached C file,
which I will incorporate into my repository.

> 3. Normally lfirst() is assigned to a variable at the start of a
> foreach() loop. You have code which does not follow this.
> 
> foreach(lc, clauses)
> {
> Expr   *clause;
> int i;
> 
> if (IsA(lfirst(lc), RestrictInfo))
> {
> RestrictInfo *rinfo = lfirst(lc);
> 
> You could assign this to a Node * since the type is unknown to you at
> the start of the loop.

Will make the suggested changes to match_clauses_to_partkey().

> 4.
> /*
> * Useless if what we're thinking of as a constant is actually
> * a Var coming from this relation.
> */
> if (bms_is_member(rel->relid, constrelids))
> continue;
> 
> should this be moved to just above the op_strict() test? This one seems cheaper.

Agreed, will do.  Also makes sense to move the PartCollMatchesExprColl()
test together.

> 5. Typo "paritions": /* No clauses to prune paritions, so scan all
> partitions. */
> 
> But thinking about it more, the comment should say something more along
> the lines of /* No useful clauses for partition pruning. Scan all
> partitions. */

You fixed it. :)

> The key difference is that there might be clauses, just without Consts.
> 
> Actually, the more I look at get_append_rel_partitions() I think it
> would be better if you re-shaped that if/else if test so that it only
> performs the loop over the partindexes if it's been set.
> 
> I ended up with the attached version of the function after moving
> things around a little bit.

Thanks a lot for that.  Looks much better now.

> I'm still reviewing but thought I'd share this part so far.

As mentioned at the top, I'm looking at your latest comments and they all
seem to be good points to me, so will address those in the next version.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated set of patches, including the fix to make the new pruning
code handle Boolean partitioning.

Hi Amit,

I have tried pruning with different values of the constraint_exclusion GUC. I am not sure exactly how it should behave, but I can see that with a delete statement pruning does not happen when constraint_exclusion is off, while select works as expected. Is this expected behaviour?

create table lp (c1 int, c2 text) partition by list(c1);
create table lp1 partition of lp for values in (1,2);
create table lp2 partition of lp for values in (3,4);
create table lp3 partition of lp for values in (5,6);
insert into lp values (1,'p1'),(2,'p1'),(3,'p2'),(4,'p2'),(5,'p3');

show constraint_exclusion ;
 constraint_exclusion
----------------------
 partition
(1 row)

explain select c1 from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN                       
----------------------------------------------------------
 Append  (cost=0.00..29.05 rows=6 width=4)
   ->  Seq Scan on lp1  (cost=0.00..29.05 rows=6 width=4)
         Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)

explain delete from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN                       
----------------------------------------------------------
 Delete on lp  (cost=0.00..29.05 rows=6 width=6)
   Delete on lp1
   ->  Seq Scan on lp1  (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
(4 rows)

set constraint_exclusion = off;

explain select c1 from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN                       
----------------------------------------------------------
 Append  (cost=0.00..29.05 rows=6 width=4)
   ->  Seq Scan on lp1  (cost=0.00..29.05 rows=6 width=4)
         Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)

explain delete from lp where c1 >= 1 and c1 < 2;
                        QUERY PLAN                       
----------------------------------------------------------
 Delete on lp  (cost=0.00..87.15 rows=18 width=6)
   Delete on lp1
   Delete on lp2
   Delete on lp3
   ->  Seq Scan on lp1  (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
   ->  Seq Scan on lp2  (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
   ->  Seq Scan on lp3  (cost=0.00..29.05 rows=6 width=6)
         Filter: ((c1 >= 1) AND (c1 < 2))
(10 rows)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Rajkumar,

Thanks for testing.

On 2017/11/08 15:52, Rajkumar Raghuwanshi wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
> 
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> Hi Amit,
> 
> I have tried pruning with different values of the constraint_exclusion
> GUC. I am not sure exactly how it should behave, but I can see that with a
> delete statement pruning does not happen when constraint_exclusion is off,
> while select works as expected. Is this expected behaviour?

Hmm, the new pruning only works for selects, not DML.  The patch also
changes get_relation_constraints() to not include the internal partition
constraints, but mistakenly does so for all query types, not just select.
Will look into it.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
amul sul
Date:
On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/06 14:32, David Rowley wrote:
>> On 6 November 2017 at 17:30, Amit Langote wrote:
>>> On 2017/11/03 13:32, David Rowley wrote:
>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
[....]
>
> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.
>

I am getting the following warnings on Mac OS X:

partition.c:1800:24: warning: comparison of constant -1 with
expression of type 'NullTestType' (aka 'enum NullTestType') is always false
[-Wtautological-constant-out-of-range-compare]
                                if (keynullness[i] == -1)
                                    ~~~~~~~~~~~~~~ ^  ~~

partition.c:1932:25: warning: comparison of constant -1 with
expression of type 'NullTestType' (aka 'enum NullTestType') is always false
[-Wtautological-constant-out-of-range-compare]
                                        if (keynullness[i] == -1)
                                            ~~~~~~~~~~~~~~ ^  ~~

2 warnings generated.


Comment for 0004 patch:
270 +   /* -1 represents an invalid value of NullTestType. */
271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));

I think we should not use memset to set a value other than 0 or true/false.
This will work for -1 on systems where values are stored in two's
complement, but I am afraid of other architectures.
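One way to avoid both the warning and the memset portability question is to track validity separately rather than storing an out-of-range sentinel in the enum array; a hedged sketch (the KeyNullness struct and helper names here are hypothetical, not the patch's actual layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PARTITION_MAX_KEYS 32

typedef enum NullTestType {IS_NULL, IS_NOT_NULL} NullTestType;

/*
 * Keep a parallel bool array of validity flags: memset to zero is
 * well-defined for bool, and the code never compares an enum against a
 * value outside its declared range (which is what triggered the
 * -Wtautological-constant-out-of-range-compare warning).
 */
typedef struct KeyNullness
{
	NullTestType value[PARTITION_MAX_KEYS];
	bool		valid[PARTITION_MAX_KEYS];
} KeyNullness;

static void
keynullness_init(KeyNullness *kn)
{
	memset(kn->valid, 0, sizeof(kn->valid));	/* mark all slots unset */
}

static void
keynullness_assign(KeyNullness *kn, int i, NullTestType t)
{
	kn->value[i] = t;
	kn->valid[i] = true;
}
```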

Regards,
Amul



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Amul.

On 2017/11/09 20:05, amul sul wrote:
> On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/11/06 14:32, David Rowley wrote:
>>> On 6 November 2017 at 17:30, Amit Langote wrote:
>>>> On 2017/11/03 13:32, David Rowley wrote:
>>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
> [....]
>>
>> Attached updated set of patches, including the fix to make the new pruning
>> code handle Boolean partitioning.
>>
> 
> I am getting following warning on mac os x:

Thanks for the review.

> partition.c:1800:24: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>       (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
>                                 if (keynullness[i] == -1)
>                                     ~~~~~~~~~~~~~~ ^  ~~
> partition.c:1932:25: warning: comparison of constant -1 with
> expression of type 'NullTestType'
>       (aka 'enum NullTestType') is always false
> [-Wtautological-constant-out-of-range-compare]
>                                         if (keynullness[i] == -1)
>                                             ~~~~~~~~~~~~~~ ^  ~~
> 2 warnings generated.
> 
> 
> Comment for 0004 patch:
> 270 +   /* -1 represents an invalid value of NullTestType. */
>  271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));
> 
> I think we should not use memset to set a value other than 0 or true/false.
> This will work for -1 on the system where values are stored in the 2's
> complement but I am afraid of other architecture.

OK, I will remove all instances of comparing and setting variables of type
NullTestType to a value of -1.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 10 Nov 2017 09:34:57 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<5fcb1a9f-b4ad-119d-14c7-282c30c7f8d1@lab.ntt.co.jp>
> Hi Amul.
> 
> On 2017/11/09 20:05, amul sul wrote:
> > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote
> > <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On 2017/11/06 14:32, David Rowley wrote:
> >>> On 6 November 2017 at 17:30, Amit Langote wrote:
> >>>> On 2017/11/03 13:32, David Rowley wrote:
> >>>>> On 31 October 2017 at 21:43, Amit Langote wrote:
> > [....]
> >>
> >> Attached updated set of patches, including the fix to make the new pruning
> >> code handle Boolean partitioning.
> >>
> > 
> > I am getting following warning on mac os x:
> 
> Thanks for the review.
> 
> > partition.c:1800:24: warning: comparison of constant -1 with
> > expression of type 'NullTestType'
> >       (aka 'enum NullTestType') is always false
> > [-Wtautological-constant-out-of-range-compare]
> >                                 if (keynullness[i] == -1)
> >                                     ~~~~~~~~~~~~~~ ^  ~~
> > partition.c:1932:25: warning: comparison of constant -1 with
> > expression of type 'NullTestType'
> >       (aka 'enum NullTestType') is always false
> > [-Wtautological-constant-out-of-range-compare]
> >                                         if (keynullness[i] == -1)
> >                                             ~~~~~~~~~~~~~~ ^  ~~
> > 2 warnings generated.
> > 
> > 
> > Comment for 0004 patch:
> > 270 +   /* -1 represents an invalid value of NullTestType. */
> >  271 +   memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType));
> > 
> > I think we should not use memset to set a value other than 0 or true/false.
> > This will work for -1 on the system where values are stored in the 2's
> > complement but I am afraid of other architecture.
> 
> OK, I will remove all instances of comparing and setting variables of type
> NullTestType to a value of -1.

In 0002, bms_add_range has a bit naive-looking loop

+    while (wordnum <= uwordnum)
+    {
+        bitmapword mask = (bitmapword) ~0;
+
+        /* If working on the lower word, zero out bits below 'lower'. */
+        if (wordnum == lwordnum)
+        {
+            int lbitnum = BITNUM(lower);
+            mask >>= lbitnum;
+            mask <<= lbitnum;
+        }
+
+        /* Likewise, if working on the upper word, zero bits above 'upper' */
+        if (wordnum == uwordnum)
+        {
+            int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
+            mask <<= ushiftbits;
+            mask >>= ushiftbits;
+        }
+
+        a->words[wordnum++] |= mask;
+    }

Without aggressive optimization, the loop spends most of its time
checking and jumping for nothing, especially with many
partitions, and it is somewhat unintuitive.

The following uses slightly tricky bitmap operations but
is straightforward as a whole.

=====
/* fill the bits upper from BITNUM(lower) (0-based) of the first word */
a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);

/* fill up intermediate words */
while (wordnum < uwordnum)  a->words[wordnum++] = ~(bitmapword) 0;

/* fill up to BITNUM(upper) bit (0-based) of the last word */
a->workds[wordnum++] |=    (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
=====


In 0003, 

+match_clauses_to_partkey(RelOptInfo *rel,
...
+      if (rinfo->pseudoconstant &&
+        (IsA(clause, Const) &&
+         ((((Const *) clause)->constisnull) ||
+          !DatumGetBool(((Const *) clause)->constvalue))))
+      {
+        *constfalse = true;
+        continue;

We call this function in both conjunction and disjunction
contexts (the latter only in the recursive case). constfalse ==
true means no clauses are needed at all in the former case.

Since (I think) a plain list of RestrictInfos is expected to be
treated as a conjunction (that's quite dubious, though..), we
might do better to call this for each subnode of a disjunction. Or,
like match_clauses_to_index, we might do better to provide
match_clause_to_partkey(rel, rinfo, contains_const), which
returns NULL if constfalse. (I'm not confident about this..)


+          /*
+           * If no commutator exists, cannot flip the qual's args,
+           * so give up.
+           */
+          if (!OidIsValid(expr_op))
+            continue;

I suppose it's better to keep leftop and rightop as they are
rather than flipping them so that the Var is placed on the left
side. Would that make things so complex?

+     * It the operator happens to be '<>', which is never listed
If?


+        if (!op_in_opfamily(expr_op, partopfamily))
+        {
+          Oid    negator = get_negator(expr_op);
+
+          if (!OidIsValid(negator) ||
+            !op_in_opfamily(negator, partopfamily))
+            continue;

classify_partition_bounding_keys() checks the same thing by
checking whether the negator's strategy is
BTEqualStrategyNumber. (I'm not sure the operator is guaranteed
to be of btree, though..) Don't they need to be handled in a
similar way?

# In the function, "partkey strategy" and "operator strategy" are
# confusing..

+  AttrNumber  attno;

This declaration might be better in a narrower scope.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Hello, this is the second part of the review.

At Fri, 10 Nov 2017 12:30:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.123000.151902771.horiguchi.kyotaro@lab.ntt.co.jp>
> In 0002, bms_add_range has a bit naive-looking loop
> In 0003, 

In 0004,

About the name get_partitions_from_clauses_guts(): it seems to me
that we usually use _internal for the recursive part of a
function. (I have the same comment as David about the comment for
get_partitions_from_clauses())

About the same function:

Couldn't we get out in the fast path when clauses == NIL?

+    /* No constraints on the keys, so, return *all* partitions. */
+    result = bms_add_range(result, 0, partdesc->nparts - 1);

This allows us to return immediately here. And just above this,

+    if (nkeys > 0 && !constfalse)
+        result = get_partitions_for_keys(relation, &keys);
+    else if (!constfalse)

Those two conditions are not orthogonal. Maybe something like the
following seems more understandable.

> if (!constfalse)
> {
>   /* No constraints on the keys, so, return *all* partitions. */
>   if (nkeys == 0)
>     return bms_add_range(result, 0, partdesc->nparts - 1);
> 
>   result = get_partitions_for_keys(relation, &keys);
> }

I'm not sure what is (formally) meant to be done here, but it
seems to me that OrExpr is assumed to appear only at the top
level of the clauses. So the following expression (just an
example, but meaningful expressions of this shape must exist)
is perhaps wrongly processed here.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);

SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);

get_partitions_for_keys() returns both c1 and c2, and or_clauses
here still holds (a = 15 OR a = 5), so the function tries to
*add* partitions for a = 15 and a = 5 separately.


I'd like to pause here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Ooops! The following comment is wrong. Please ignore it.

At Fri, 10 Nov 2017 14:38:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.143811.97616847.horiguchi.kyotaro@lab.ntt.co.jp>
> Those two conditions are not orthogonal. Maybe something like
> following seems more understantable.
> 
> > if (!constfalse)
> > {
> >   /* No constraints on the keys, so, return *all* partitions. */
> >   if (nkeys == 0)
> >     return bms_add_range(result, 0, partdesc->nparts - 1);
> > 
> >   result = get_partitions_for_keys(relation, &keys);
> > }
> 
> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to be only at the top level of
> the caluses. So the following (just an example, but meaningful
> expression in this shpape must exists.) expression is perhaps
> wrongly processed here.
> 
> CREATE TABLE p (a int) PARITION BY (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
> 
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
> 
> get_partitions_for_keys() returns both c1 and c2 and still
> or_clauses here holds (a = 15 OR a = 5) and the function tries to
> *add* partitions for a = 15 and a = 5 separetely.

This is working fine. Sorry for the bogus comment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 10 Nov 2017 14:44:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171110.144455.117208639.horiguchi.kyotaro@lab.ntt.co.jp>

> > Those two conditions are not orthogonal. Maybe something like
> > following seems more understantable.
> > 
> > > if (!constfalse)
> > > {
> > >   /* No constraints on the keys, so, return *all* partitions. */
> > >   if (nkeys == 0)
> > >     return bms_add_range(result, 0, partdesc->nparts - 1);
> > > 
> > >   result = get_partitions_for_keys(relation, &keys);
> > > }

So, the condition (!constfalse && nkeys == 0) cannot return
there. I'm badly confused by the variable name.

I couldn't find another reasonable structure using the current
classify_p_b_keys(), but could you add a comment like the
following as an example?

+ /*
+  * This function processes clauses other than OR expressions and
+  * returns the excluded OR expressions in or_clauses.
+  */
>   nkeys = classify_partition_bounding_keys(relation, clauses,
>                        &keys, &constfalse,
>                        &or_clauses);
>   /*
>    * Only look up the partition descriptor if the query provides
>    * constraints on the keys at all.
>    */
>   if (!constfalse)
>   {
>     if (nkey > 0)
>       result = get_partitions_for_keys(relation, &keys);
>     else
-+       /* No constraints on the keys, so, all partitions are passed. */
>       result = bms_add_range(result, 0, partdesc->nparts - 1);
>   }
> 
+   /*
+    * Here we have a partition set for the clauses not returned in
+    * or_clauses. Conjunct in the result of each OR clause.
+    */
>   foreach(lc, or_clauses)
>   {
>     BoolExpr *or = (BoolExpr *) lfirst(lc);
>     ListCell *lc1;
>     Bitmapset *or_partset = NULL;
> 
+     Assert(or_clause(or));

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 10 November 2017 at 16:30, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> In 0002, bms_add_range has a bit naive-looking loop
>
> +       while (wordnum <= uwordnum)
> +       {
> +               bitmapword mask = (bitmapword) ~0;
> +
> +               /* If working on the lower word, zero out bits below 'lower'. */
> +               if (wordnum == lwordnum)
> +               {
> +                       int lbitnum = BITNUM(lower);
> +                       mask >>= lbitnum;
> +                       mask <<= lbitnum;
> +               }
> +
> +               /* Likewise, if working on the upper word, zero bits above 'upper' */
> +               if (wordnum == uwordnum)
> +               {
> +                       int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +                       mask <<= ushiftbits;
> +                       mask >>= ushiftbits;
> +               }
> +
> +               a->words[wordnum++] |= mask;
> +       }
>
> Without some aggressive optimization, the loop takes most of the
> time to check-and-jump for nothing especially with many
> partitions and somewhat unintuitive.
>
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
>
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>    a->words[wordnum++] = ~(bitmapword) 0;
>
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->workds[wordnum++] |=
>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

No objections here for making bms_add_range() perform better, but this
is not going to work when lwordnum == uwordnum. You'd need to special
case that. I didn't think it was worth the trouble, but maybe it is...

I assume the += should be |=.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Horiguchi-san,

Thanks for taking a look.  Replying to all your emails here.

On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> In 0002, bms_add_range has a bit naive-looking loop
> 
> +    while (wordnum <= uwordnum)
> +    {
> +        bitmapword mask = (bitmapword) ~0;
> +
> +        /* If working on the lower word, zero out bits below 'lower'. */
> +        if (wordnum == lwordnum)
> +        {
> +            int lbitnum = BITNUM(lower);
> +            mask >>= lbitnum;
> +            mask <<= lbitnum;
> +        }
> +
> +        /* Likewise, if working on the upper word, zero bits above 'upper' */
> +        if (wordnum == uwordnum)
> +        {
> +            int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +            mask <<= ushiftbits;
> +            mask >>= ushiftbits;
> +        }
> +
> +        a->words[wordnum++] |= mask;
> +    }
> 
> Without some aggressive optimization, the loop takes most of the
> time to check-and-jump for nothing especially with many
> partitions and somewhat unintuitive.
> 
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
> 
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
> 
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>    a->words[wordnum++] = ~(bitmapword) 0;
> 
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->workds[wordnum++] |=
>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

Considering also David's comment downthread, I will try to incorporate
this into bms_add_range().

> In 0003, 
> 
> +match_clauses_to_partkey(RelOptInfo *rel,
> ...
> +      if (rinfo->pseudoconstant &&
> +        (IsA(clause, Const) &&
> +         ((((Const *) clause)->constisnull) ||
> +          !DatumGetBool(((Const *) clause)->constvalue))))
> +      {
> +        *constfalse = true;
> +        continue;
> 
> If we call this function in both conjunction and disjunction
> context (the latter is only in recursive case). constfalse ==
> true means no need of any clauses for the former case.
> 
> Since (I think) just a list of RestrictInfo is expected to be
> treated as a conjunction (it's quite doubious, though..),

I think it makes sense to consider a list of RestrictInfo's, such as
baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
to be mutually conjunctive for our purpose here.

> we
> might be better call this for each subnodes of a disjunction. Or,
> like match_clauses_to_index, we might be better provide
> match_clause_to_partkey(rel, rinfo, contains_const), which
> returns NULL if constfalse. (I'm not self-confident on this..)

After reading your comment, I realized that it was wrong that the
recursive call to match_clauses_to_partkey() passed the arguments of an OR
clause all at once.  That broke the assumption mentioned above that all of
the clauses in the list passed to match_clauses_to_partkey() are mutually
conjunctive.  Instead, we must make a single-member list for each of the
OR clause's arguments and pass the same.

Then, if we get constfalse for all of the OR's arguments, we return
constfalse=true to the original caller.

> 
> +          /*
> +           * If no commutator exists, cannot flip the qual's args,
> +           * so give up.
> +           */
> +          if (!OidIsValid(expr_op))
> +            continue;
> 
> I suppose it's better to leftop and rightop as is rather than
> flipping over so that var is placed left-side. Does that make
> things so complex?

The reason to do it that way is that code placed far away (the code in
partition.c that extracts constant values to use for pruning from
matched clauses) can then always assume that the clauses determined to
be useful for partition pruning come in the 'partkey op constant' form.

> +     * It the operator happens to be '<>', which is never listed
> If?

Right, will fix.

> +        if (!op_in_opfamily(expr_op, partopfamily))
> +        {
> +          Oid    negator = get_negator(expr_op);
> +
> +          if (!OidIsValid(negator) ||
> +            !op_in_opfamily(negator, partopfamily))
> +            continue;
> 
> classify_partition_bounding_keys() checks the same thing by
> checking whether the negator's strategy is
> BTEquealStrategyNumber. (I'm not sure the operator is guaranteed
> to be of btreee, though..) Aren't they needed to be in similar
> way?

You're right.  The <>'s negator may not always be a btree operator.  So,
we should add a check in match_clauses_to_partkey() that list or range
partitioning is in use, because only those require a btree operator
family.  We now have hash partitioning, so need to be careful not to make
the assumption that all partitioning operators are from btree operator
families.

If match_clauses_to_partkey() accepts such a clause with a <> operator,
then classify_partition_bounding_keys() can rely that it will get the
desired btree operators to implement pruning for the same.

> # In the function, "partkey strategy" and "operator strategy" are
> # confusing..

I agree it would be better to make that clear using comments.

Partitioning strategy and operator strategies are intimately related.
List and range partitioning related optimizations will only work if the
clause operators are of valid btree strategies, while hash partitioning
optimizations will only work if the operator in the matched clauses is
a valid hash equality operator.


> +  AttrNumber  attno;
> 
> This declaration might be better in a narrower scope.

Agreed, will move.

On 2017/11/10 14:38, Kyotaro HORIGUCHI wrote:
> Hello, this is the second part of the review.
>
> At Fri, 10 Nov 2017 12:30:00 +0900 , Kyotaro HORIGUCHI wrote:
>> In 0003,
>
> In 0004,
>
> The name get_partitions_from_clauses_guts(), it seems to me that
> we usually use _internal for recursive part of some function. (I
> have the same comment as David about the comment for
> get_partition_from_clause())

OK, will replace _guts with _internal.  Looking at David's comments too.

> About the same function:
>
> Couldn't we get out in the fast path when clauses == NIL?

Actually, get_partitions_for_clauses() won't be called at all if that
is the case.  I think I should add an Assert that clauses is not NIL.

>
> +    /* No constraints on the keys, so, return *all* partitions. */
> +    result = bms_add_range(result, 0, partdesc->nparts - 1);
>
> This allows us to return immediately here. And just above this,
>
> +    if (nkeys > 0 && !constfalse)
> +        result = get_partitions_for_keys(relation, &keys);
> +    else if (!constfalse)
>
> Those two conditions are not orthogonal. Maybe something like
> following seems more understantable.
>
>> if (!constfalse)
>> {
>>   /* No constraints on the keys, so, return *all* partitions. */
>>   if (nkeys == 0)
>>     return bms_add_range(result, 0, partdesc->nparts - 1);
>>
>>   result = get_partitions_for_keys(relation, &keys);
>> }

Agreed that your suggested rewrite of that portion of the code is easy to
read, but we cannot return yet in the nkeys == 0 case, as you also said
you found out.  I quote your other reply further below.

> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to be only at the top level of
> the caluses. So the following (just an example, but meaningful
> expression in this shpape must exists.) expression is perhaps
> wrongly processed here.
>
> CREATE TABLE p (a int) PARITION BY (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
>
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
>
> get_partitions_for_keys() returns both c1 and c2 and still
> or_clauses here holds (a = 15 OR a = 5) and the function tries to
> *add* partitions for a = 15 and a = 5 separetely.
>
> I'd like to pause here.

[ ... ]

On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
>
> This is working fine. Sorry for the bogus comment.

I'd almost started looking around if something might be wrong after all. :)

On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
>
>>> Those two conditions are not orthogonal. Maybe something like
>>> following seems more understantable.
>>>
>>>> if (!constfalse)
>>>> {
>>>>   /* No constraints on the keys, so, return *all* partitions. */
>>>>   if (nkeys == 0)
>>>>     return bms_add_range(result, 0, partdesc->nparts - 1);
>>>>
>>>>   result = get_partitions_for_keys(relation, &keys);
>>>> }
>
> So, the condition (!constfalse && nkeys == 0) cannot return
> there. I'm badly confused by the variable name.

Do you mean the name 'constfalse'?

> I couldn't find another reasonable structure using the current
> classify_p_b_keys(), but could you add a comment like the
> following as an example?

OK, will add comments explaining what's going on.


Will post the updated patches after also taking care of David's and Amul's
review comments upthread.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 13 November 2017 at 22:46, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
>> In 0002, bms_add_range has a bit naive-looking loop
>>
>> +     while (wordnum <= uwordnum)
>> +     {
>> +             bitmapword mask = (bitmapword) ~0;
>> +
>> +             /* If working on the lower word, zero out bits below 'lower'. */
>> +             if (wordnum == lwordnum)
>> +             {
>> +                     int lbitnum = BITNUM(lower);
>> +                     mask >>= lbitnum;
>> +                     mask <<= lbitnum;
>> +             }
>> +
>> +             /* Likewise, if working on the upper word, zero bits above 'upper' */
>> +             if (wordnum == uwordnum)
>> +             {
>> +                     int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
>> +                     mask <<= ushiftbits;
>> +                     mask >>= ushiftbits;
>> +             }
>> +
>> +             a->words[wordnum++] |= mask;
>> +     }
>>
>> Without some aggressive optimization, the loop takes most of the
>> time to check-and-jump for nothing especially with many
>> partitions and somewhat unintuitive.
>>
>> The following uses a bit tricky bitmap operation but
>> is straightforward as a whole.
>>
>> =====
>> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
>> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>>
>> /* fill up intermediate words */
>> while (wordnum < uwordnum)
>>    a->words[wordnum++] = ~(bitmapword) 0;
>>
>> /* fill up to BITNUM(upper) bit (0-based) of the last word */
>> a->workds[wordnum++] |=
>>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
>> =====
>
> Considering also the David's comment downthread, I will try to incorporate
> this into bms_add_range().

I've attached an implementation of the patch using this method.

I've also attached bitset.c which runs each through their paces. I'd
have expected Kyotaro's method to be faster, but gcc 7.2 with -O2
generates very slightly slower code. I didn't really check why. clang
seems to do a better job with it.

$ gcc -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.694254 (6.94254 ns per loop)
bms_add_range2 in 0.726643 (7.26643 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.866554 (8.66554 ns per loop)
bms_add_range2 in 0.467138 (4.67138 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
$ clang --version
clang version 4.0.1-6 (tags/RELEASE_401/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Probably just go with Kyotaro's idea (v2). I don't think this is worth
debating, I just wanted to show it's not that clear-cut.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/11/14 13:00, David Rowley wrote:
> On 13 November 2017 at 22:46, Amit Langote wrote:
>> On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
>>> The following uses a bit tricky bitmap operation but
>>> is straightforward as a whole.
>>>
>>> =====
>>> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
>>> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>>>
>>> /* fill up intermediate words */
>>> while (wordnum < uwordnum)
>>>    a->words[wordnum++] = ~(bitmapword) 0;
>>>
>>> /* fill up to BITNUM(upper) bit (0-based) of the last word */
>>> a->workds[wordnum++] |=
>>>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
>>> =====
>>
>> Considering also the David's comment downthread, I will try to incorporate
>> this into bms_add_range().
> 
> I've attached an implementation of the patch using this method.

[ ... ]

> Probably just go with Kyotaro's idea (v2). I don't think this is worth
> debating, I just wanted to show it's not that clear-cut.

Thanks.  I have incorporated the v2 patch in my local repository.  I'm
still working through some of the review comments and will be able to
post a new version no later than tomorrow, including support for the new
hash partitioning.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Thanks for the interesting test, David.

At Tue, 14 Nov 2017 17:00:12 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in
<CAKJS1f-1Lc_b=Y1iicPQzvUgSn1keHSgmRqLuOGq_VR6M==zbw@mail.gmail.com>
> On 13 November 2017 at 22:46, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> >> Without some aggressive optimization, the loop takes most of the
> >> time to check-and-jump for nothing especially with many
> >> partitions and somewhat unintuitive.
> >>
> >> The following uses a bit tricky bitmap operation but
> >> is straightforward as a whole.
> >>
> >> =====
> >> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> >> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);

'+='.. ^^;

> >> /* fill up intermediate words */
> >> while (wordnum < uwordnum)
> >>    a->words[wordnum++] = ~(bitmapword) 0;
> >>
> >> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> >> a->workds[wordnum++] |=
> >>      (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> >> =====
> >
> > Considering also the David's comment downthread, I will try to incorporate
> > this into bms_add_range().
> 
> I've attached an implementation of the patch using this method.
> 
> I've also attached bitset.c which runs each through their paces. I'd
> have expected Kyotaro's method to be faster, but gcc 7.2 with -O2
> generates very slightly slower code. I didn't really check why. clang
> seems to do a better job with it.
..
> $ gcc -O2 bitset.c -o bitset && ./bitset
> bms_add_range in 0.694254 (6.94254 ns per loop)
> bms_add_range2 in 0.726643 (7.26643 ns per loop)
..
> $ gcc --version
> gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Hmm, bms_add_range doesn't seem to get such aggressive optimization,
but I had a similar result.

Looking at the output of gcc -S, I found that bms_add_range() is
inlined into main() (gcc 7.1.0). It's not surprising after finding
that, but.. Anyway, I added __attribute((noinline)) to the two
functions and got the following result.

> bms_add_range in 1.24 (12.4 ns per loop)
> bms_add_range2 in 0.8 (8 ns per loop)

It seems reasonable.

> $ clang -O2 bitset.c -o bitset && ./bitset
> bms_add_range in 0.866554 (8.66554 ns per loop)
> bms_add_range2 in 0.467138 (4.67138 ns per loop)
..
> $ clang --version
> clang version 4.0.1-6 (tags/RELEASE_401/final)
> Target: x86_64-pc-linux-gnu
> Thread model: posix
> InstalledDir: /usr/bin
> 
> Probably just go with Kyotaro's idea (v2). I don't think this is worth
> debating, I just wanted to show it's not that clear-cut.

I agree that it's not so clear-cut.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 16 November 2017 at 15:54, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Anyway I added __attribute((noinline)) to the two
> functions and got the following result.
>
>> bms_add_range in 1.24 (12.4 ns per loop)
>> bms_add_range2 in 0.8 (8 ns per loop)

I see similar here with __attribute((noinline)). Thanks for
investigating that. Your way is clearly better. Thanks for suggesting
it.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/11/08 13:44, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
> 
> Hi Amit,
> 
> I had another look over this today. Apologies if any of the review seems petty.

Thanks a lot for the review.

> Here goes:
> 
> 1. If test seems to be testing for a child that's a partitioned table,
> but is testing for a non-NULL part_scheme.
> 
> /*
> * If childrel is itself partitioned, add it and its partitioned
> * children to the list being propagated up to the root rel.
> */
> if (childrel->part_scheme && rel->part_scheme)
> 
> Should this code use IS_PARTITIONED_REL() instead? Seems a bit strange
> to test for a NULL part_scheme

I guess that makes sense, done.

> 
> 2. There's a couple of mistakes in my bms_add_range() code. I've
> attached bms_add_range_fix.patch. Can you apply this to your tree?

Thanks.  I have used your bms_add_range_v2.patch that you sent earlier
today and listed both your and Horiguchi-san's names as author.

> 3. This assert seems to be Asserting the same thing twice:
> 
> Assert(rel->live_partitioned_rels != NIL &&
>    list_length(rel->live_partitioned_rels) > 0);
> 
> A List with length == 0 is always NIL.

You're right.  I changed it to:

  Assert(list_length(rel->live_partitioned_rels) >= 1);

> 4. get_partitions_from_clauses(), can you comment why you perform the
> list_concat() there.
> 
> I believe this is there so that the partition bound from the parent is
> passed down to the child so that we can properly eliminate all child
> partitions when the 2nd level of partitioning is using the same
> partition key as the 1st level. I think this deserves a paragraph of
> comment to explain this.

Yes, that's the intent.  I implemented it as a solution to fix a problem
that was reported upthread, whereby the default partition pruning didn't
work as desired.  I tried to explain it in the following email:

https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp

But, since I devised it as a solution to get the desired behavior for the
default partition, I modified the code to include partition constraint
clauses to do it only when the table has a default partition in the first
place.  Doing it always is an overkill.  Please see the comment added
nearby if it now helps make sense of what's going on.

> 5. Please add a comment to explain what's going on here in
> classify_partition_bounding_keys()
> 
> if (partattno == 0)
> {
> partexpr = lfirst(partexprs_item);
> partexprs_item = lnext(partexprs_item);
> }
> 
> Looks like, similar to index expressions, that partition expressions
> are attno 0 to mark to consume the next expression from the list.

Right.

> Does this need validation that there are enough partexprs_item items
> like what is done in get_range_key_properties()? Or is this validated
> somewhere else?

Yeah, I added the check as follows:

+            /* Set partexpr if needed. */
             if (partattno == 0)
             {
+                if (partexprs_item == NULL)
+                    elog(ERROR, "wrong number of partition key expressions");
                 partexpr = lfirst(partexprs_item);
                 partexprs_item = lnext(partexprs_item);

> 
> 6. Comment claims the if test will test something which it does not
> seem to test for:
> 
> /*
> * Redundant key elimination using btree-semantics based tricks.
> *
> * Only list and range partitioning use btree operator semantics, so
> * skip otherwise.   Also, if there are expressions whose value is yet
> * unknown, skip this step, because we need to compare actual values
> * below.
> */
> memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
> if (partkey->strategy == PARTITION_STRATEGY_LIST ||
> partkey->strategy == PARTITION_STRATEGY_RANGE)
> 
> I was expecting this to be skipped when the clauses contained a
> non-const, but it does not seem to.

Fixed the comment.  Actually, we might end up with non-consts here if the
executor invokes it, so the downstream code is in a position to handle
them, skipping any optimizations that depend on constant values being
available.  There are even cases where the planner wouldn't mind calling
here with matched clauses containing non-const operands, as long as at
least some constants are available.

> 7. Should be "compare them"
> 
> /*
> * If the leftarg and rightarg clauses' constants are both of the type
> * expected by "op" clause's operator, then compare then using the
> * latter's comparison function.
> */
> 
> But if I look at the code "compare then using the latter's comparison
> function." is not true, it seems to use op's comparison function not
> rightarg's. With most of the calls op and rightarg are the same, but
> not all of them. The function shouldn't make that assumption even if
> the args op was always the same as rightarg.

Rearranged the code in partition_cmp_args() a bit and added more
clarifying comments.

> 8. remove_redundant_clauses() needs an overview comment of what the
> function does.

Done.

> 9. The comment should explain what we would do in the case of key < 3
> AND key <= 2 using some examples.
> 
> /* try to keep only one of <, <= */

Done.

> 10. Wondering why this loop runs backward?
> 
> for (s = BTMaxStrategyNumber; --s >= 0;)
> 
> Why not just:
> 
> for (s = 0; s < BTMaxStrategyNumber; s++)
> 
> I can't see a special reason for it to run backward. It seems unusual,
> but if there's a good reason that I've failed to realise then it's
> maybe worth a comment.

Hmm, no special reason.  So, done the other way.  I actually brought this
redundant key elimination logic over from nbtutils.c:
_bt_preprocess_keys() and the loop runs that way over there.

> 11. Please comment on why *constfalse = true is set here:
> 
> if (!chk || s == (BTEqualStrategyNumber - 1))
> continue;
> 
> if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
>    &test_result))
> {
> if (!test_result)
> {
> *constfalse = true;
> return;
> }
> /* discard the redundant key. */
> xform[s] = NULL;
> }
> 
> Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Right.  Added a comment with example.

> Also please add a comment when discarding the redundant key maybe
> explain that equality is more useful than the other strategies when
> there's an overlap.

Done, too.


Please find attached the updated patch set.  There are significant changes
in this version, as described below, including support for hash
partitioned tables.

Earlier today, I reported [1] what looks to me like a bug in how default
partition's constraint gets generated and how that sometimes makes
constraint exclusion mistakenly prune a default partition.  I have
included the patches I posted there in this series too.  They are ahead in
the list.

So attached patches are now as follows:

0001-Add-default-partition-case-in-inheritance-testing.patch
0002-Tweak-default-range-partition-s-constraint-a-little.patch

Patches at [1].

0003-Add-new-tests-for-partition-pruning.patch

Tests.  Mostly unchanged from the last version.

0004-Add-a-bms_add_range.patch

Uses the bms_add_range_v2 patch.

0005-Planner-side-changes-for-partition-pruning.patch

Fixed some issues with how OR clauses were matched and the code for
checking if the operator is partitioning compatible is no longer in the
planner code, instead it's now only in partition.c.  Other cosmetic
improvements including those that resulted from the review comments.

0006-Implement-get_partitions_from_clauses.patch

Several bug fixes and many cosmetic improvements including improved
commentary.  Per Amul's comment upthread, the patch no longer depends on
using value -1 to denote an invalid value of NulltestType enum.  In fact,
it doesn't use NulltestType type variables at all.

0007-Some-interface-changes-for-partition_bound_-cmp-bsea.patch

No changes.

0008-Implement-get_partitions_for_keys.patch

Support for hash partitioning and tests for the same.  Also, since
update/delete on partitioned tables still depend on constraint exclusion
for pruning, fix things such that get_relation_constraints includes
partition constraints in its result only for non-select queries (for
selects we have the new pruning code).  Other bug fixes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/ba7aaeb1-4399-220e-70b4-62eade1522d0%40lab.ntt.co.jp

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Please find attached updated patch set.   There are significant changes in
this version as described below, including the support for hash
partitioned tables.

Hi Amit,

Thanks for making those changes and adding the HASH partition support.

There's a good chance that I'm not going to get time to look at this maybe until the last day of the month. I hope someone else can look over it in the meantime.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/11/17 21:44, David Rowley wrote:
> On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
> wrote:
> 
>> Please find attached updated patch set.   There are significant changes in
>> this version as described below, including the support for hash
>> partitioned tables.
>>
> 
> Hi Amit,
> 
> Thanks for making those changes and adding the HASH partition support.
> 
> There's a good chance that I'm not going to get time to look at this maybe
> until the last day of the month. I hope someone else can look over it in
> the meantime.

No problem, thanks for the review so far and look forward to the next time
you'll be able to comment.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Thank you, and sorry for the confusing comments.

At Mon, 13 Nov 2017 18:46:28 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<8460a6c3-68c5-b78a-7d18-d253180f2188@lab.ntt.co.jp>
> Horiguchi-san,
> 
> Thanks for taking a look.  Replying to all your emails here.

> > In 0003, 
> > 
> > +match_clauses_to_partkey(RelOptInfo *rel,
> > ...
> > +      if (rinfo->pseudoconstant &&
> > +        (IsA(clause, Const) &&
> > +         ((((Const *) clause)->constisnull) ||
> > +          !DatumGetBool(((Const *) clause)->constvalue))))
> > +      {
> > +        *constfalse = true;
> > +        continue;
> > 
> > If we call this function in both conjunction and disjunction
> > context (the latter is only in recursive case). constfalse ==
> > true means no need of any clauses for the former case.
> > 
> > Since (I think) just a list of RestrictInfo is expected to be
> > treated as a conjunction (it's quite doubious, though..),
> 
> I think it makes sense to consider a list of RestrictInfo's, such as
> baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
> to be mutually conjunctive for our purpose here.

You're right, and I know it. I'm OK with leaving it, since I recalled
that clause_selectivity has similar code.

> On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
> >
> > This is working fine. Sorry for the bogus comment.
> 
> I'd almost started looking around if something might be wrong after all. :)

Very sorry for the wrong comment:(

> On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
> >
> >>> Those two conditions are not orthogonal. Maybe something like
> >>> the following seems more understandable.
> >>>
> >>>> if (!constfalse)
> >>>> {
> >>>>   /* No constraints on the keys, so, return *all* partitions. */
> >>>>   if (nkeys == 0)
> >>>>     return bms_add_range(result, 0, partdesc->nparts - 1);
> >>>>
> >>>>   result = get_partitions_for_keys(relation, &keys);
> >>>> }
> >
> > So, the condition (!constfalse && nkeys == 0) cannot return
> > there. I'm badly confused by the variable name.
> 
> Do you mean by 'constfalse'?

Perhaps. The name "constfalse" suggests (to me) that the
clauses evaluate to false constantly. But actually it just means that
the clauses not included in the return result in false. Anyway, I'll
take a look at v12 and will comment at that time.

> > I couldn't find another reasonable structure using the current
> > classify_p_b_keys(), but could you add a comment like the
> > following as an example?
> 
> OK, will add comments explaining what's going on.
> 
> 
> Will post the updated patches after also taking care of David's and Amul's
> review comments upthread.
> 
> Thanks,
> Amit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Fri, Nov 17, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Support for hash partitioning and tests for the same.  Also, since
update/delete on partitioned tables still depend on constraint exclusion
for pruning, fix things such that get_relation_constraints includes
partition constraints in its result only for non-select queries (for
selects we have the new pruning code).  Other bug fixes.

Hi Amit,

I have applied the attached patch set on commit 11e264517dff7a911d9e6494de86049cab42cde3 and tried to test hash partitioning. I got a server crash with the test below.

CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (c);
CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 1) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 2);
insert into  hp_tbl select i,i,i from generate_series(0,10)i where i not in(2,4,6,7,10);

explain select * from hp_tbl where a = 2;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Rajkumar for the test.

On 2017/11/21 19:06, Rajkumar Raghuwanshi wrote:
> CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
> CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
> remainder 0) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus
> 4, remainder 0) PARTITION BY HASH (c);
> CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH
> (modulus 4, remainder 0);
> CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
> remainder 1) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
> 4, remainder 1);
> CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
> 4, remainder 2);
> insert into  hp_tbl select i,i,i from generate_series(0,10)i where i not
> in(2,4,6,7,10);
> 
> explain select * from hp_tbl where a = 2;
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

It seems I wrote an Assert in the code to support hash partitioning that
wasn't based on a valid assumption.  I was wrongly assuming that all hash
partitions for a given modulus (largest modulus) must exist at any given
time, but that isn't the case.

Fixed in the attached.  No other changes beside that.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 11/22/2017 03:59 AM, Amit Langote wrote:
> Fixed in the attached.  No other changes beside that.
> 

I have been using the following script to look at the patch

-- test.sql --

CREATE TABLE t1 (
    a integer NOT NULL,
    b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 3);

CREATE INDEX idx_t1_b_a_p00 ON t1_p00 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p01 ON t1_p01 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p02 ON t1_p02 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p03 ON t1_p03 USING btree (b, a);

CREATE TABLE t2 (
    c integer NOT NULL,
    d integer NOT NULL
) PARTITION BY HASH (d);

CREATE TABLE t2_p00 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t2_p01 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t2_p02 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t2_p03 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 3);

CREATE INDEX idx_t2_c_p00 ON t2_p00 USING btree (c);
CREATE INDEX idx_t2_c_p01 ON t2_p01 USING btree (c);
CREATE INDEX idx_t2_c_p02 ON t2_p02 USING btree (c);
CREATE INDEX idx_t2_c_p03 ON t2_p03 USING btree (c);

CREATE INDEX idx_t2_d_p00 ON t2_p00 USING btree (d);
CREATE INDEX idx_t2_d_p01 ON t2_p01 USING btree (d);
CREATE INDEX idx_t2_d_p02 ON t2_p02 USING btree (d);
CREATE INDEX idx_t2_d_p03 ON t2_p03 USING btree (d);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 10000) AS i);
INSERT INTO t2 (SELECT i, i FROM generate_series(1, 10000) AS i);

ANALYZE;

EXPLAIN (ANALYZE) SELECT t1.a, t1.b FROM t1 WHERE t1.b = 1;

EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON 
t2.c = t1.b WHERE t2.d = 1;

BEGIN;
EXPLAIN (ANALYZE) UPDATE t1 SET a = 1 WHERE b = 1;
ROLLBACK;

BEGIN;
EXPLAIN (ANALYZE) DELETE FROM t1 WHERE b = 1;
ROLLBACK;

-- test.sql --

I just wanted to highlight that the "JOIN ON" partition isn't pruned - 
the "WHERE" one is.

Should pruning of partitions for UPDATEs (where the partition key isn't 
updated) and DELETEs be added to the TODO list ?

Thanks for working on this !

Best regards, Jesper


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Jesper.

On 2017/11/23 3:56, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 11/22/2017 03:59 AM, Amit Langote wrote:
>> Fixed in the attached.  No other changes beside that.
>>
> 
> I have been using the following script to look at the patch
> 
> -- test.sql --

[ ... ]

> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
> t2.c = t1.b WHERE t2.d = 1;
> 
> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
> "WHERE" one is.

Did you mean to write ON t2.d = t1.b?  If so, equivalence class mechanism
will give rise to a t1.b = 1 and hence help prune t1's partition as well:

EXPLAIN (COSTS OFF)
SELECT t1.a, t1.b, t2.c, t2.d
FROM t1 INNER JOIN t2 ON t2.d = t1.b
WHERE t2.d = 1;
                        QUERY PLAN
-----------------------------------------------------------
 Nested Loop
   ->  Append
         ->  Bitmap Heap Scan on t1_p00
               Recheck Cond: (b = 1)
               ->  Bitmap Index Scan on idx_t1_b_a_p00
                     Index Cond: (b = 1)
   ->  Materialize
         ->  Append
               ->  Bitmap Heap Scan on t2_p00
                     Recheck Cond: (d = 1)
                     ->  Bitmap Index Scan on idx_t2_d_p00
                           Index Cond: (d = 1)

In your original query, you use ON t2.c = t1.b, whereby there is no
"constant" value to perform partition pruning with.  t2.c is unknown until
the join actually executes.

> BEGIN;
> EXPLAIN (ANALYZE) UPDATE t1 SET a = 1 WHERE b = 1;
> ROLLBACK;
>
> BEGIN;
> EXPLAIN (ANALYZE) DELETE FROM t1 WHERE b = 1;
> ROLLBACK;
> 
> Should pruning of partitions for UPDATEs (where the partition key isn't
> updated) and DELETEs be added to the TODO list?

Note that partition pruning *does* work for UPDATE and DELETE, but only if
you use list/range partitioning.  The reason it doesn't work in this case
(t1 is hash partitioned) is that the pruning is still based on constraint
exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
hash partitioning.

See example below that uses list partitioning:

drop table t1, t2;
create table t1 (a int, b int) partition by list (b);
create table t1_p0 partition of t1 for values in (0);
create table t1_p1 partition of t1 for values in (1);
create table t1_p2 partition of t1 for values in (2);
create table t1_p3 partition of t1 for values in (3);

create table t2 (c int, d int) partition by list (d);
create table t2_p0 partition of t2 for values in (0);
create table t2_p1 partition of t2 for values in (1);
create table t2_p2 partition of t2 for values in (2);
create table t2_p3 partition of t2 for values in (3);

explain (costs off) update t1 set a = 1 where b = 1;
        QUERY PLAN
-------------------------
 Update on t1
   Update on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)

explain (costs off) delete from t1 where b = 1;
        QUERY PLAN
-------------------------
 Delete on t1
   Delete on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)


I can see how that seems a bit odd.  If you use hash partitioning,
UPDATE/DELETE do not benefit from partition-pruning, even though SELECT
does.  That's because SELECT uses the new partition-pruning method (this
patch set) which supports hash partitioning, whereas UPDATE and DELETE use
constraint exclusion which doesn't.  It would be a good idea to make even
UPDATE and DELETE use the new method thus bringing everyone on the same
page, but that requires us to make some pretty non-trivial changes to how
UPDATE/DELETE planning works for inheritance/partitioned tables, which we
should undertake separately, imho.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 11/24/2017 12:00 AM, Amit Langote wrote:
>> On 2017/11/23 3:56, Jesper Pedersen wrote:
>> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
>> t2.c = t1.b WHERE t2.d = 1;
>>
>> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
>> "WHERE" one is.
> 
> Did you mean to write ON t2.d = t1.b?  If so, equivalence class mechanism
> will give rise to a t1.b = 1 and hence help prune t1's partition as well:
> 

No, I meant 't2.c = t1.b'. If you take the same example, but don't
partition, you will get the following plan:

test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER
JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;
                  QUERY PLAN
----------------------------------------------
 Nested Loop
   ->  Index Scan using idx_t2_d on t2
         Index Cond: (d = 1)
   ->  Index Only Scan using idx_t1_b_a on t1
         Index Cond: (b = t2.c)
(5 rows)

Maybe "5.10.2. Declarative Partitioning" could be expanded to include 
some general "guidelines" of where partition based plans should be 
checked against their non-partition counterparts (at least the first 
bullet in 5.10.1 says ".. in certain situations .."). Probably a 
separate patch from this.

[snip]

>> Should pruning of partitions for UPDATEs (where the partition key isn't
>> updated) and DELETEs be added to the TODO list?
> 
> Note that partition pruning *does* work for UPDATE and DELETE, but only if
> you use list/range partitioning.  The reason it doesn't work in this case
> (t1 is hash partitioned) is that the pruning is still based on constraint
> exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
> hash partitioning.
> 

Thanks for your description.

> 
> I can see how that seems a bit odd.  If you use hash partitioning,
> UPDATE/DELETE do not benefit from partition-pruning, even though SELECT
> does.  That's because SELECT uses the new partition-pruning method (this
> patch set) which supports hash partitioning, whereas UPDATE and DELETE use
> constraint exclusion which doesn't.  It would be a good idea to make even
> UPDATE and DELETE use the new method thus bringing everyone on the same
> page, but that requires us to make some pretty non-trivial changes to how
> UPDATE/DELETE planning works for inheritance/partitioned tables, which we
> should undertake separately, imho.
> 

Agreed.

Best regards, Jesper


Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 22 Nov 2017 17:59:48 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<df609168-b7fd-4c0b-e9b2-6e398d411e27@lab.ntt.co.jp>
> Thanks Rajkumar for the test.
> 
> On 2017/11/21 19:06, Rajkumar Raghuwanshi wrote:
> > explain select * from hp_tbl where a = 2;
> > server closed the connection unexpectedly
> >     This probably means the server terminated abnormally
> >     before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
> > !>
> 
> It seems I wrote an Assert in the code to support hash partitioning that
> wasn't based on a valid assumption.  I was wrongly assuming that all hash
> partitions for a given modulus (largest modulus) must exist at any given
> time, but that isn't the case.
> 
> Fixed in the attached.  No other changes beside that.

0001 and 0002 are under discussion with Robert in another thread.

I don't have a comment on 0003, 0004.

0005:
get_partitions_from_clauses is written as _using_ in its comment. (Also,
get_partitions_from_clauses_recurse is _guts in its comment.)

get_append_rel_partitions just returns NIL if constfalse. I suppose we'd
better reduce the indentation level here.
get_partitions_from_clauses_recurse in 0006 does the same thing.

In the same function, there's an else clause separated from the then
clause by a multiline comment. It seems better that the else clause has
braces and the comment is inside the braces, like the following:

>  else
>  {
>    /*
>     * Else there are no clauses....
>     */
>    partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1);
>  }
 



0006:
In get_partitions_from_clauses_recurse, the following comment seems confusing.

+    /*
+     * The analysis of the matched clauses done by
+     * classify_partition_bounding_keys may have found mutually contradictory
+     * clauses.
+     */
constfalse = true also when the clause was just one false pseudoconstant restrictinfo.

+    if (!constfalse)
+    {
+        /*
+         * If all clauses in the list were OR clauses,
+         * classify_partition_bounding_keys() wouldn't have formed keys
+         * yet.  They will be handled below by recursively calling this
+         * function for each of OR clauses' arguments and combining the
+         * resulting partition sets appropriately.
+         */
+        if (nkeys > 0)
classify_p_b_keys() can return zero not only for all-OR clauses
(all AND clauses with a volatile function also return zero).

+            /* Set partexpr if needed. */
+            if (partattno == 0)
Could you add a description about the meaning of 0 to the comment of
PartitionKeyData, something like this?

 | AttrNumber *partattrs;    /* attribute numbers of columns in the
 |                            * partition key. 0 means partexpression */
 

+ #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
+         ((IsA((expr), Var) &&\
+          ((Var *) (expr))->varattno == (partattno)) ||\
+          equal((expr), (partexpr)))
...
+        if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
partattno = 0 has a different meaning than ordinary attnos.  I believe
that the leftop cannot be a whole-row var, but I suppose we should make
it clear.  Likewise, even if it doesn't actually happen, it formally has
a chance to make a false match, since partexpr is not cleared when
partattno > 0.  EXPR_MATCHES_PARTKEY might better be something like the
following:

| #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
|     ((partattno) != 0 ? \
|      (IsA((expr), Var) && ((Var *) (expr))->varattno == (partattno)) :\
|      equal((expr), (partexpr)))


+        if (!op_in_opfamily(opclause->opno, partopfamily))
+        {
...
+          negator = get_negator(opclause->opno);
+          if (OidIsValid(negator) &&
+            op_in_opfamily(negator, partopfamily))
+          {
+            get_op_opfamily_properties(negator, partopfamily,
+                           false,
+                           &strategy,
+                           &lefttype, &righttype);
+            if (strategy == BTEqualStrategyNumber)
This seems to me to rely a bit too much on the specific relationship of
the access methods' properties.  Isn't it reasonable to add a check that
partkey->strategy != 'h' before getting the negator?
 

+          commuted->opno = get_commutator(opclause->opno);
I'm afraid that get_commutator can return InvalidOid for user-defined
types or user-defined operator classes, or perhaps for other reasons
uncertain to me.  match_clauses_to_partkey is checking that.
 

+      else if (IsA(clause, ScalarArrayOpExpr))
I'm not sure what to do with a multidimensional ArrayExpr, but ->multidims is checked in some places.

+        ParseState *pstate = make_parsestate(NULL);
make_parsestate mandates that the caller free it with free_parsestate().
It doesn't seem to leak anything in this context, and I saw the same
thing at other places, but it would be better to follow that if possible,
or add some apology as a comment.. (or update the comment of
make_parsestate?)
 

+  * If the leftarg_const and rightarg_consr are both of the type expected
rightarg_consr -> rightarg_const



+        if (partition_cmp_args(partkey, partattoff,
+                               le, lt, le,
+                               &test_result))

+        if (partition_cmp_args(partkey, partattoff, ge, gt, ge,
+                               &test_result))
Please unify the style.


+       * Boolean conditions have a special shape, which would've been
+       * accepted if the partitioning opfamily accepts Boolean
+       * conditions.
I noticed that bare true and false are not accepted in the values list
of the CREATE TABLE syntax. This is not a comment on this patch, but is
that intentional?
 

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Jesper.

On 2017/11/28 3:30, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 11/24/2017 12:00 AM, Amit Langote wrote:
>>> On 2017/11/23 3:56, Jesper Pedersen wrote:
>>> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
>>> t2.c = t1.b WHERE t2.d = 1;
>>>
>>> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
>>> "WHERE" one is.
>>
>> Did you mean to write ON t2.d = t1.b?  If so, equivalence class mechanism
>> will give rise to a t1.b = 1 and hence help prune t1's partition as well:
>>
> 
> No, I meant 't2.c = t1.b'. If you take the same example, but don't
> partition you will get the following plan:
> 
> test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER
> JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;
>                   QUERY PLAN
> ----------------------------------------------
>  Nested Loop
>    ->  Index Scan using idx_t2_d on t2
>          Index Cond: (d = 1)
>    ->  Index Only Scan using idx_t1_b_a on t1
>          Index Cond: (b = t2.c)
> (5 rows)

So it appears to me that you're pointing out the inner Index Only Scan on
t1, which is a lot better than scanning all of t1 on every loop iteration.

As you might know, we can't exactly have an index scan on the partitioned
table (that is, the parent table itself), because there wouldn't be any
index on it.  However, the planner is smart enough to push the clause down
to the partitions (leaf tables), which may have the index, and hence an
index scan could be chosen for them.  But note that the planner will have
chosen *all* partitions, because there is no constant value to prune
partitions with at that point.

If we get run-time pruning [1], we get close to what happens in the
non-partitioned case.  In this case, since t1.b of t2.c = t1.b is the
partition key of t1, we will make an Append node with run-time pruning
enabled.  On every loop iteration, t2.c's value will be used to prune
useless partitions, which in most cases will leave us scanning just one
partition, possibly with an Index Only Scan using the partition's index.

> Maybe "5.10.2. Declarative Partitioning" could be expanded to include some
> general "guidelines" of where partition based plans should be checked
> against their non-partition counterparts (at least the first bullet in
> 5.10.1 says ".. in certain situations .."). Probably a separate patch from
> this.

I agree about shedding more light on that in the documentation.  I will
try to write up a patch someday.

Thanks,
Amit

[1] https://commitfest.postgresql.org/15/1330/



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems I wrote an Assert in the code to support hash partitioning that
> wasn't based on a valid assumption.  I was wrongly assuming that all hash
> partitions for a given modulus (largest modulus) must exist at any given
> time, but that isn't the case.

Committed 0003 with some adjustments:

* Renamed the new test to partition_prune.
* Moved the test to what I thought was a better place in the schedule
file, and made it consistent between serial_schedule and
parallel_schedule.
* commutates -> commuted
* removed wrong /* empty */ comment
* Updated expected output.  It surprised me a bit that the tests
weren't passing as you had them, but the differences I got - all
related to mc3p_default - seemed correct to me

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Nov 29, 2017 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems I wrote an Assert in the code to support hash partitioning that
>> wasn't based on a valid assumption.  I was wrongly assuming that all hash
>> partitions for a given modulus (largest modulus) must exist at any given
>> time, but that isn't the case.
>
> Committed 0003 with some adjustments:
>
> * Renamed the new test to partition_prune.
> * Moved the test to what I thought was a better place in the schedule
> file, and made it consistent between serial_schedule and
> parallel_schedule.
> * commutates -> commuted
> * removed wrong /* empty */ comment
> * Updated expected output.  It surprised me a bit that the tests
> weren't passing as you had them, but the differences I got - all
> related to mc3p_default - seemed correct to me

Committed 0004 after reviewing the code and testing that it seems to
work as advertised.

0005 looks like it might need to be split into smaller patches.  More
broadly, the commit messages you wrote for 0005, 0006, and 0008
don't seem to me to do a great job explaining the motivation for the
changes which they make.  They tell me what the patches do, but not
why they are doing it.  If there's an email in this thread that
explains that stuff, please point me to it and I'll go back and reread
it more carefully; if not, I think I definitely need some more
explanation both of the mission of each patch and the reason why the
patch set is divided up in the way that it is.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/30 5:28, Robert Haas wrote:
> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems I wrote an Assert in the code to support hash partitioning that
>> wasn't based on a valid assumption.  I was wrongly assuming that all hash
>> partitions for a given modulus (largest modulus) must exist at any given
>> time, but that isn't the case.
> 
> Committed 0003 with some adjustments:
> 
> * Renamed the new test to partition_prune.
> * Moved the test to what I thought was a better place in the schedule
> file, and made it consistent between serial_schedule and
> parallel_schedule.
> * commutates -> commuted
> * removed wrong /* empty */ comment

Thanks a lot.

> * Updated expected output.  It surprised me a bit that the tests
> weren't passing as you had them, but the differences I got - all
> related to mc3p_default - seemed correct to me

Yeah, that one I too noticed yesterday while rebasing.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/30 7:15, Robert Haas wrote:
> On Wed, Nov 29, 2017 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> It seems I wrote an Assert in the code to support hash partitioning that
>>> wasn't based on a valid assumption.  I was wrongly assuming that all hash
>>> partitions for a given modulus (largest modulus) must exist at any given
>>> time, but that isn't the case.
>>
>> Committed 0003 with some adjustments:
>>
>> * Renamed the new test to partition_prune.
>> * Moved the test to what I thought was a better place in the schedule
>> file, and made it consistent between serial_schedule and
>> parallel_schedule.
>> * commutates -> commuted
>> * removed wrong /* empty */ comment
>> * Updated expected output.  It surprised me a bit that the tests
>> weren't passing as you had them, but the differences I got - all
>> related to mc3p_default - seemed correct to me
> 
> Committed 0004 after reviewing the code and testing that it seems to
> work as advertised.

Thank you.

> 0005 looks like it might need to be split into smaller patches.  More
> broadly, the commit messages you wrote for 0005, 0006, and 0008
> don't seem to me to do a great job explaining the motivation for the
> changes which they make.  They tell me what the patches do, but not
> why they are doing it.  If there's an email in this thread that
> explains that stuff, please point me to it and I'll go back and reread
> it more carefully; if not, I think I definitely need some more
> explanation both of the mission of each patch and the reason why the
> patch set is divided up in the way that it is.

I'm working on a revised version of these patches to address recent
comments by Horiguchi-san.  I will also consider the points above before
sending the new version.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Thu, Nov 30, 2017 at 10:43 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I'm working on a revised version of these patches to address recent
> comments by Horiguchi-san.  I will also consider the points above before
> sending the new version.

Ok, this is fresh news, so I am moving this entry to next CF with
waiting on author as status.
-- 
Michael


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/11/30 11:18, Michael Paquier wrote:
> On Thu, Nov 30, 2017 at 10:43 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I'm working on a revised version of these patches to address recent
>> comments by Horiguchi-san.  I will also consider the points above before
>> sending the new version.
> 
> Ok, this is fresh news, so I am moving this entry to next CF with
> waiting on author as status.

That's correct, thanks.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 30 November 2017 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote:
> Committed 0004 after reviewing the code and testing that it seems to
> work as advertised.
>
> 0005 looks like it might need to be split into smaller patches.  More
> broadly, the commit messages you wrote for 0005, 0006, and 0008
> don't seem to me to do a great job explaining the motivation for the
> changes which they make.  They tell me what the patches do, but not
> why they are doing it.  If there's an email in this thread that
> explains that stuff, please point me to it and I'll go back and reread
> it more carefully; if not, I think I definitely need some more
> explanation both of the mission of each patch and the reason why the
> patch set is divided up in the way that it is.

Hi Amit,

It looks like just 0005 to 0008 remain of this and I see that the v13
0005 patch no longer applies to current master.

Are you working on splitting this up as requested by Robert above?

I can continue reviewing this once patches are available that apply to
current master.

Many thanks

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/12/07 19:48, David Rowley wrote:
> On 30 November 2017 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed 0004 after reviewing the code and testing that it seems to
>> work as advertised.
>>
>> 0005 looks like it might need to be split into smaller patches.  More
>> broadly, the commit messages you wrote for 0005, 0006, and 0008
>> don't seem to me to do a great job explaining the motivation for the
>> changes which they make.  They tell me what the patches do, but not
>> why they are doing it.  If there's an email in this thread that
>> explains that stuff, please point me to it and I'll go back and reread
>> it more carefully; if not, I think I definitely need some more
>> explanation both of the mission of each patch and the reason why the
>> patch set is divided up in the way that it is.
> 
> Hi Amit,
> 
> It looks like just 0005 to 0008 remain of this and I see that the v13
> 0005 patch no longer applies to current master.
> 
> Are you working on splitting this up as requested by Robert above?
> 
> I can continue reviewing this once patches are available that apply to
> current master.

I'm still working on that.  I will be able to submit a new version
sometime early in the next week, that is, if I don't manage to do it by
today evening (Japan time).  Sorry that it's taking a bit longer.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Horiguchi-san,

Thanks a lot for the review and sorry it took me a while to reply.

On 2017/11/28 20:39, Kyotaro HORIGUCHI wrote:
> At Wed, 22 Nov 2017 17:59:48 +0900, Amit Langote wrote:
> 0001 and 0002 are under discussion with Robert in another thread.

And now committed [1].

> I don't have a comment on 0003, 0004.
> 
> 0005:
> 
>  get_partitions_from_clauses is written as _using_ in it's
>  comment. (also get_partitions_from_clauses_recurse is _guts in
>  its comment.)

Fixed both.

>  get_append_rel_partitions just returns NIL if constfalse. I
>  suppose we'd better reducing indentation level
>  here. get_partitions_from_clauses_recurse in 0006 does the same
>  thing.

Less indentation sounds good to me too, so fixed.

>  In the same function, there's a else clause separated from then
>  clause by a multiline comment. It seems better that the else
>  clause has braces and the comment is in the braces like the
>  following.
> 
>  >  else
>  >  {
>  >    /*
>  >     * Else there are no clauses....
>  >     */
>  >    partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1);
>  >  }

Done that way.

> 0006:
> 
>  In get_partitions_from_clauses_recurse, the following comment
>  seems confusing.
> 
> +    /*
> +     * The analysis of the matched clauses done by
> +     * classify_partition_bounding_keys may have found mutually contradictory
> +     * clauses.
> +     */
> 
>  constfalse = true also when the cluase was just one false pseudo
>  constant restrictinfo.

Updated the comment like this:

  /*
   * classify_partition_bounding_keys() may have found clauses that are
   * pseudo-constant false which the planner didn't detect, or it may
   * itself have found contradictions among the clauses.
   */

> +    if (!constfalse)
> +    {
> +        /*
> +         * If all clauses in the list were OR clauses,
> +         * classify_partition_bounding_keys() wouldn't have formed keys
> +         * yet.  They will be handled below by recursively calling this
> +         * function for each of OR clauses' arguments and combining the
> +         * resulting partition sets appropriately.
> +         */
> +        if (nkeys > 0)
> 
>  classify_p_b_keys() to return zero also when is not only all-OR
>  clauses(all AND clauses with volatile function also returns
>  zero).

Hmm, if all AND clauses contained volatile functions, then the planner
wouldn't have called this function at all.  Also, any clauses it passes to
get_partitions_from_clauses() are known to be OK to use for pruning.

> +            /* Set partexpr if needed. */
> +            if (partattno == 0)
> 
>  Could you add a description about the meaning of 0 to the
>  comment of PartitionKeyData something like this?

Sure, done.

> 
>   | AttrNumber *partattrs;    /* attribute numbers of columns in the
>   |                            * partition key. 0 means partexpression */
> 
> + #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
> +         ((IsA((expr), Var) &&\
> +          ((Var *) (expr))->varattno == (partattno)) ||\
> +          equal((expr), (partexpr)))
> ...
> +        if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
> 
>  partattno = 0 has a different meaning than ordinary attnos.     
>  I belive that the leftop cannot be a whole row var, but I
>  suppose we should make it clear. Likewise, even it doesn't
>  actually happen but it formally has a chance to make a false
>  match since partexpr is not cleared when partattno > 0.
>  EXPR_MATCHES_PARTKEY might be better be something like follows.
> 
> | #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
> |     ((partattno) != 0 ? \
> |      (IsA((expr), Var) && ((Var *) (expr))->varattno == (partattno)) :\
> |      equal((expr), (partexpr)))

That's better, fixed.
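
To illustrate the difference, here is a standalone sketch (with simplified
stand-ins for the planner's Var nodes and equal(), not the real ones)
showing why gating on partattno avoids a false match against a stale
partexpr:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for planner nodes; illustration only. */
typedef enum { T_Var, T_OpExpr } NodeTag;
typedef struct { NodeTag tag; int varattno; } Node;

/* equal() stand-in: pointer identity is enough for this demo. */
static bool node_equal(const Node *a, const Node *b) { return a == b; }

/* The guarded form suggested in the review: partattno == 0 means the key
 * is an expression, and only then do we compare against partexpr. */
#define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
	((partattno) != 0 ? \
	 ((expr)->tag == T_Var && (expr)->varattno == (partattno)) : \
	 node_equal((expr), (partexpr)))

/* Column key: matched by attribute number alone, even if a stale
 * partexpr from a previous key column happens to be lying around. */
bool demo_column_key(void)
{
	Node var = { T_Var, 1 };
	Node stale_expr = { T_OpExpr, 0 };

	return EXPR_MATCHES_PARTKEY(&var, 1, &stale_expr) &&
		   !EXPR_MATCHES_PARTKEY(&var, 2, &stale_expr);
}

/* Expression key (partattno == 0): matched only via equal(). */
bool demo_expression_key(void)
{
	Node expr = { T_OpExpr, 0 };
	Node other = { T_OpExpr, 0 };

	return EXPR_MATCHES_PARTKEY(&expr, 0, &expr) &&
		   !EXPR_MATCHES_PARTKEY(&other, 0, &expr);
}
```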

> +        if (!op_in_opfamily(opclause->opno, partopfamily))
> +        {
> ...
> +          negator = get_negator(opclause->opno);
> +          if (OidIsValid(negator) &&
> +            op_in_opfamily(negator, partopfamily))
> +          {
> +            get_op_opfamily_properties(negator, partopfamily,
> +                           false,
> +                           &strategy,
> +                           &lefttype, &righttype);
> +            if (strategy == BTEqualStrategyNumber)
> 
>  This seems to me to be a bit too much relying on the specific
>  relationship of the access methods' property. Isn't it
>  reasonable that add checking that partkey->strategy != 'h'
>  before getting negator?
> 
> +          commuted->opno = get_commutator(opclause->opno);
> 
>  I'm afraid that get_commutator can return InvalidOid for
>  user-defined types or by user-defined operator class or perhaps
>  other reasons uncertain to me. match_clauses_to_partkey is
>  checking that.
> 
> +      else if (IsA(clause, ScalarArrayOpExpr))
> 
>  I'm not sure what to do with a multidimentional ArrayExpr but
>  ->multidims is checked some places.

In another thread that I recently started [2], Tom seemed to point out
that ScalarArrayOpExpr cannot really be used for comparing an array on LHS
with, say, an array of arrays on RHS.  The LHS expression must yield a
scalar value which is compared with individual scalar values in the array
on RHS.  The syntax allows an arbitrarily multi-dimensional array to be
specified on RHS, but ultimately only the scalars therein are looked at.
For example:

select 1 = any (array[array[array[1]]]);
 ?column?
----------
 t
(1 row)

select 1 = any ('{{{1}}}');
 ?column?
----------
 t
(1 row)

select 1 = any ('{{{2}}}');
 ?column?
----------
 f
(1 row)

But for the code in question on which you seemed to have commented, it
only works if the array is presented as a Const that can be deconstructed
to a list of scalar values using deconstruct_array().  If the array was
presented as an ArrayExpr, and its multidims is true, the elements list
cannot simply be assumed to contain Const nodes holding scalar values.
So, we should prohibit that case from proceeding, although I haven't
been able to frame a test query that results in such a case.

> +        ParseState *pstate = make_parsestate(NULL);
> 
>  make_parsestate mandates for the caller to free it by
>  free_parsestate(). It doesn't seem to leak anything in the
>  context and I saw the same thing at other places but it would be
>  better to follow it if possible, or add some apology as a
>  comment.. (or update the comment of make_parsestate?)

Added a free_parsestate() call.

> 
> +  * If the leftarg_const and rightarg_consr are both of the type expected
> 
>  rightarg_consr -> const

Fixed.

> 
> +        if (partition_cmp_args(partkey, partattoff,
> +                               le, lt, le,
> +                               &test_result))
> 
> +        if (partition_cmp_args(partkey, partattoff, ge, gt, ge,
> +                               &test_result))
> 
>  Please unify the style.

Fixed.

> +       * Boolean conditions have a special shape, which would've been
> +       * accepted if the partitioning opfamily accepts Boolean
> +       * conditions.
> 
>  I noticed that bare true and false are not accepted by the
>  values list of create table syntax. This is not a comment on
>  this patch but is that intentional?

Hmm, I guess the original partitioning patch forgot to accept TRUE_P,
FALSE_P as valid partition bound datums along with Sconst, NumericOnly,
and NULL_P.  Sent a patch for that in a separate thread [3].


Attached updated patches.  As Robert commented [4], I tried to re-arrange
the patches after breaking down the planner patch ("0005 looks like it
might need to be split into smaller patches").  Brief description of each
follows:

[PATCH 1/5] Some interface changes for partition_bound_{cmp/bsearch}

As the name says, it's a preparatory patch to enable the next patch to use
the partition bound binary search function using partial keys.  Until now,
callers needed to specify the whole key (specify values for all partition
columns), but a query may not have clauses on all partition columns, so we
should be able to compare such incomplete keys against partition bounds.
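
As a rough illustration of the interface change (a standalone sketch, not
the actual partition_bound_bsearch code; names and shapes are assumptions),
comparing only the supplied prefix of key columns might look like:

```c
#include <assert.h>

#define NPARTCOLS 2

/* Compare a bound against a possibly partial key: only the first nkeys
 * columns are considered, so a matching prefix compares as equal. */
static int
partial_key_cmp(const int *bound, const int *key, int nkeys)
{
	for (int i = 0; i < nkeys; i++)
	{
		if (bound[i] < key[i])
			return -1;
		if (bound[i] > key[i])
			return 1;
	}
	return 0;
}

/* Find the greatest bound <= key (or -1 if none), reporting whether it
 * compared equal on the supplied columns; analogous in spirit to a
 * partition bound binary search that accepts partial keys. */
static int
bound_bsearch(const int bounds[][NPARTCOLS], int nbounds,
			  const int *key, int nkeys, int *is_equal)
{
	int lo = -1, hi = nbounds - 1;

	while (lo < hi)
	{
		int mid = (lo + hi + 1) / 2;

		if (partial_key_cmp(bounds[mid], key, nkeys) <= 0)
			lo = mid;
		else
			hi = mid - 1;
	}
	*is_equal = (lo >= 0 && partial_key_cmp(bounds[lo], key, nkeys) == 0);
	return lo;
}

/* Bounds sorted on (col1, col2); search with a 1-column partial key. */
int demo_partial(void)
{
	static const int bounds[4][NPARTCOLS] = {{1, 1}, {1, 5}, {2, 3}, {3, 0}};
	int eq;
	int idx = bound_bsearch(bounds, 4, (const int[]){2}, 1, &eq);

	return idx * 10 + eq;		/* index 2, equal prefix -> 21 */
}

/* Same bounds, but with the whole key supplied. */
int demo_full(void)
{
	static const int bounds[4][NPARTCOLS] = {{1, 1}, {1, 5}, {2, 3}, {3, 0}};
	int eq;
	int idx = bound_bsearch(bounds, 4, (const int[]){1, 6}, 2, &eq);

	return idx * 10 + eq;		/* {1,5} is greatest bound <= {1,6} -> 10 */
}
```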

[PATCH 2/5] Introduce a get_partitions_from_clauses()

This used to be a last patch in the previous versions.  But I moved it
ahead in the list as basic functionality that needs to be in place before
starting to modify the planner to use the same for faster pruning.

To summarize, just like the existing get_partition_for_tuple() that
receives a tuple from ExecFindPartition() and returns the index of the
partition that should contain that tuple, get_partitions_from_clauses()
will receive a set of clauses that are all known to match some partition
key and derive from it and return the set of indexes of partitions that
satisfy all those clauses.

[PATCH 3/5] Move some code of set_append_rel_size to separate
 function

First in the series that modifies the planner.  Just a preparatory patch
that moves some code.

[PATCH 4/5] More refactoring around partitioned table AppendPath
 creation

Another refactoring patch that changes how we manipulate the partitions
when generating AppendPath for partitioned tables.

Actually, there is one behavior change here - partitioned_rels in Append
today carries the RT indexes even of partitioned child tables that have
been pruned.  The patch modifies things so that no longer happens.

[PATCH 5/5] Teach planner to use get_partitions_from_clauses()

With this patch, we finally hit the new faster pruning functionality.

Partitions will be pruned even before looking at the individual partition
RelOptInfos.  In fact, set_append_rel_size() now only ever looks at
non-pruned partitions.

Since we can call get_partitions_from_clauses() only from the SELECT
planning code path, UPDATE/DELETE cases still rely on constraint
exclusion.  So we retrieve the partition constraint only in those cases.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b88d63a9122646ead60c1afffc248a31d4e457d

[2] https://www.postgresql.org/message-id/7677.1512743642%40sss.pgh.pa.us

[3]
https://www.postgresql.org/message-id/e05c5162-1103-7e37-d1ab-6de3e0afaf70%40lab.ntt.co.jp

[4]
https://www.postgresql.org/message-id/CA%2BTgmoYYrPA21e0y5w2NW2-sbANFR4n2nbrSWEWjzvaa_GNi0g%40mail.gmail.com

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 12 December 2017 at 22:13, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

Thanks for sending the updated patches.

I don't have a complete review at the moment, but the following code
in set_append_rel_pathlist() should be removed.

/* append_rel_list contains all append rels; ignore others */
if (appinfo->parent_relid != parentRTindex)
continue;

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/12/13 18:48, David Rowley wrote:
> On 12 December 2017 at 22:13, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patches.
> 
> Thanks for sending the updated patches.
> 
> I don't have a complete review at the moment, but the following code
> in set_append_rel_pathlist() should be removed.
> 
> /* append_rel_list contains all append rels; ignore others */
> if (appinfo->parent_relid != parentRTindex)
> continue;

Will fix that right away, thanks.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 12 December 2017 at 22:13, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

Hi Amit,

I'm sorry to say this is another micro review per code I'm stumbling
over when looking at the run-time partition pruning stuff.

1. In get_partitions_from_clauses_recurse(), since you're assigning
the result to the first input, the following should use
bms_add_members and not bms_union. The logical end result is the same,
but using bms_union means a wasted palloc and a small memory leak
within the memory context.

/*
* Partition sets obtained from mutually-disjunctive clauses are
* combined using set union.
*/
or_partset = bms_union(or_partset, arg_partset);


2. Also in get_partitions_from_clauses_recurse(), it might also be
worth putting in a bms_free(or_partset) after:

/*
* Partition sets obtained from mutually-conjunctive clauses are
* combined using set intersection.
*/
result = bms_intersect(result, or_partset);

Also, instead of using bms_intersect here, would it be better to do:

result = bms_del_members(result, or_partset); ?

That way you don't do a bms_copy and leak memory for each OR branch,
since bms_intersect also does a bms_copy()

The resulting set could end up with a few more trailing 0 words than
what you have now, but it seems a better idea not to allocate a new set
each time.
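
To make the allocation point concrete, a toy single-word stand-in for
Bitmapset (the real one is multi-word and palloc'd; the API shapes here
are assumed for illustration) shows which calls allocate and which
recycle their first input:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy single-word stand-in for PostgreSQL's Bitmapset, only to
 * illustrate allocation behaviour; not the real implementation. */
typedef struct { uint64_t words; } Bitmapset;

static Bitmapset *
bms_make(uint64_t w)
{
	Bitmapset *b = malloc(sizeof(Bitmapset));

	b->words = w;
	return b;
}

/* bms_union: always returns a freshly allocated set. */
static Bitmapset *
bms_union(const Bitmapset *a, const Bitmapset *b)
{
	return bms_make((a ? a->words : 0) | (b ? b->words : 0));
}

/* bms_add_members: recycles its first input, so no per-iteration leak. */
static Bitmapset *
bms_add_members(Bitmapset *a, const Bitmapset *b)
{
	if (a == NULL)
		return bms_union(NULL, b);
	a->words |= b->words;
	return a;
}

/* bms_int_members: in-place intersection, for the conjunctive case. */
static Bitmapset *
bms_int_members(Bitmapset *a, const Bitmapset *b)
{
	if (a != NULL)
		a->words &= (b ? b->words : 0);
	return a;
}

/* OR arms unioned in place; AND groups intersected in place. */
int demo_combine(void)
{
	Bitmapset *or_partset = bms_make(0x3);	/* partitions {0,1} */
	Bitmapset *arg_partset = bms_make(0x6);	/* partitions {1,2} */
	Bitmapset *before = or_partset;
	Bitmapset *result = bms_make(0x5);		/* partitions {0,2} */

	or_partset = bms_add_members(or_partset, arg_partset);
	result = bms_int_members(result, or_partset);
	return (or_partset == before) &&		/* same allocation reused */
		   or_partset->words == 0x7 &&
		   result->words == 0x5;
}
```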

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 19 December 2017 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Also, instead of using bms_intersect here, would it be better to do:
>
> result = bms_del_members(result, or_partset); ?

I should have said bms_int_members() rather than bms_del_members()

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 12 December 2017 at 22:13, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

Hi Amit,

I was also wondering about your thoughts on the design of
get_partitions_for_keys() and more generally how there are many
functions which have some special treatment doing something based on
->strategy == PARTITION_STRATEGY_XXX.

If I do:

git grep PARTITION_STRATEGY -- src/backend/catalog/partition.c | wc -l

I get 62 matches, most of which are case statements, and most of the
remainder are things like if (key->strategy ==
PARTITION_STRATEGY_HASH).

git grep --show-function PARTITION_STRATEGY -- src/backend/catalog/partition.c

shows that get_partitions_for_keys() is probably the most guilty of
having the most strategy condition tests.

Also, if we look at get_partitions_for_keys() there's an unconditional:

memset(hash_isnull, false, sizeof(hash_isnull));

which is only used for PARTITION_STRATEGY_HASH, but LIST and RANGE
must pay the price of that memset. Perhaps it's not expensive enough
to warrant only doing that when partkey->strategy ==
PARTITION_STRATEGY_HASH, but it does make me question if we should
have 3 separate functions for this and just have a case statement to
call the correct one.

I think if we were to put this off as something we'll fix later, then
the job would just become harder and harder as time goes on.

It might have been fine when we just had RANGE and LIST partitioning,
but I think HASH really tips the scales over to this being needed.

What do you think?
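
One possible shape (a sketch only; the worker bodies below are stand-ins
for code that would be moved out of today's get_partitions_for_keys, and
the return values are placeholders) is a thin dispatcher:

```c
#include <assert.h>

typedef enum
{
	PARTITION_STRATEGY_HASH = 'h',
	PARTITION_STRATEGY_LIST = 'l',
	PARTITION_STRATEGY_RANGE = 'r'
} PartitionStrategy;

/* Hypothetical per-strategy workers; each keeps only the state it needs,
 * e.g. hash_isnull would live (and be memset) in the hash worker alone. */
static int get_partitions_for_keys_hash(void)  { return 'h'; }
static int get_partitions_for_keys_list(void)  { return 'l'; }
static int get_partitions_for_keys_range(void) { return 'r'; }

/* The single remaining strategy test: a case statement that dispatches. */
int
get_partitions_for_keys(PartitionStrategy strategy)
{
	switch (strategy)
	{
		case PARTITION_STRATEGY_HASH:
			return get_partitions_for_keys_hash();
		case PARTITION_STRATEGY_LIST:
			return get_partitions_for_keys_list();
		case PARTITION_STRATEGY_RANGE:
			return get_partitions_for_keys_range();
	}
	return 0;					/* unreachable for valid strategies */
}
```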

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 19 December 2017 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I'm sorry to say this is another micro review per code I'm stumbling
> over when looking at the run-time partition pruning stuff.

Again, another micro review. I apologise for the slow trickle of
review. Again, these are just things I'm noticing while reading
through while thinking of the run-time pruning patch.


1. The following Assert appears to be testing for the presence of
cosmic rays :-)

/*
* Determine set of partitions using provided keys, which proceeds in a
* manner determined by the partitioning method.
*/
if (keys->n_eqkeys == partkey->partnatts)
{
Assert(keys->n_eqkeys == partkey->partnatts);

Perhaps it was misplaced during a rewrite? Should be safe enough to
remove it, I'd say.

2. The following code in classify_partition_bounding_keys() misses
looking under the RelabelType for rightarg:

leftop = (Expr *) get_leftop(clause);
if (IsA(leftop, RelabelType))
leftop = ((RelabelType *) leftop)->arg;
rightop = (Expr *) get_rightop(clause);
if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
constexpr = rightop;
else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
constexpr = leftop;

This breaks the following case:

create table thisthat (a varchar not null) partition by list (a);
create table this partition of thisthat for values in('this');
create table that partition of thisthat for values in('that');
explain select * from thisthat where 'this' = a; -- does not work
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..54.00 rows=14 width=32)
   ->  Seq Scan on that  (cost=0.00..27.00 rows=7 width=32)
         Filter: ('this'::text = (a)::text)
   ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
         Filter: ('this'::text = (a)::text)
(5 rows)


explain select * from thisthat where a = 'this'; -- works as we look
through the RelabelType on left arg.
                         QUERY PLAN
------------------------------------------------------------
 Append  (cost=0.00..27.00 rows=7 width=32)
   ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
         Filter: ((a)::text = 'this'::text)
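
A standalone sketch of the fix (simplified node stand-ins, not the
planner's real Var/RelabelType structures) is to strip RelabelType on
*both* operands before matching against the partition key:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for planner nodes, for illustration only. */
typedef enum { T_Var, T_Const, T_RelabelType } NodeTag;
typedef struct Node
{
	NodeTag		tag;
	struct Node *arg;			/* wrapped node, for RelabelType */
	int			varattno;		/* for Var */
} Node;

/* Strip any RelabelType wrapper; the point of the review comment is
 * that this must be applied to both operands, not just the left one. */
static const Node *
strip_relabel(const Node *n)
{
	while (n != NULL && n->tag == T_RelabelType)
		n = n->arg;
	return n;
}

static bool
is_partkey_var(const Node *n, int partattno)
{
	n = strip_relabel(n);
	return n != NULL && n->tag == T_Var && n->varattno == partattno;
}

/* Model of  WHERE 'this' = a :  Const on the left, RelabelType-wrapped
 * Var on the right, as in the thisthat example above. */
bool demo_const_on_left(void)
{
	Node a = { T_Var, NULL, 1 };
	Node relabeled_a = { T_RelabelType, &a, 0 };
	Node c = { T_Const, NULL, 0 };

	/* With symmetric stripping, the right operand matches the key. */
	return !is_partkey_var(&c, 1) && is_partkey_var(&relabeled_a, 1);
}
```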

3. The follow code assumes there will be a commutator for the operator:

if (constexpr == rightop)
pc->op = opclause;
else
{
OpExpr   *commuted;

commuted = (OpExpr *) copyObject(opclause);
commuted->opno = get_commutator(opclause->opno);
commuted->opfuncid = get_opcode(commuted->opno);
commuted->args = list_make2(rightop, leftop);
pc->op = commuted;
}

I had to hunt for it, but it appears that you're pre-filtering clauses
with the Consts on the left and no valid commutator in
match_clauses_to_partkey. I think it's likely worth putting a comment
to mention that reversed clauses with no commutator should have been
filtered out beforehand. I'd say it's also worthy of an Assert().

4. The spelling of "arbitrary" is incorrect in:

* partattno == 0 refers to arbirtary expressions, which get the

5. I've noticed that partition pruning varies slightly from constraint
exclusion in the following case:

create table ta (a int not null) partition by list (a);
create table ta1 partition of ta for values in(1,2);
create table ta2 partition of ta for values in(3,4);

explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
not eliminated.
                         QUERY PLAN
-------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=5050 width=4)
   ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 2))
   ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 2))
(5 rows)


alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
check constraint to see if can be removed.
explain select * from ta where a <> 1 and a <> 2; -- it can.
                         QUERY PLAN
-------------------------------------------------------------
 Append  (cost=0.00..48.25 rows=2525 width=4)
   ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 2))
(3 rows)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hello David.

Thanks for the reviews.  Replying to all your emails here.

On 2017/12/19 13:36, David Rowley wrote:
> On 12 December 2017 at 22:13, Amit Langote wrote:
>> Attached updated patches.
> 
> Hi Amit,
> 
> I'm sorry to say this is another micro review per code I'm stumbling
> over when looking at the run-time partition pruning stuff.
> 
> 1. In get_partitions_from_clauses_recurse(), since you're assigning
> the result to the first input, the following should use
> bms_add_members and not bms_union. The logical end result is the same,
> but using bms_union means a wasted palloc and a small memory leak
> within the memory context.
>
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> or_partset = bms_union(or_partset, arg_partset);

Done.  Replaced with bms_add_members().

> 2. Also in get_partitions_from_clauses_recurse(), it might also be
> worth putting in a bms_free(or_partset) after:
> 
> /*
> * Partition sets obtained from mutually-conjunctive clauses are
> * combined using set intersection.
> */
> result = bms_intersect(result, or_partset);

Done, too.

> Also, instead of using bms_intersect here, would it be better to do:
> 
> result = bms_del_members(result, or_partset); ?
>
> That way you don't do a bms_copy and leak member for each OR branch
> since bms_intersect also does a bms_copy()
>
> The resulting set could end up with a few more trailing 0 words than
> what you have now, but it seems better not to allocate a new set
> each time.

You meant bms_int_members(), as you also said in your other email.

On 2017/12/19 14:42, David Rowley wrote:
> I was also wondering about your thoughts on the design of
> get_partitions_for_keys() and more generally how there are many
> functions which have some special treatment doing something based on
> ->strategy == PARTITION_STRATEGY_XXX.
>
> If I do:
>
> git grep PARTITION_STRATEGY -- src/backend/catalog/partition.c | wc -l
>
> I get 62 matches, most of which are case statements, and most of the
> remainder are things like if (key->strategy ==
> PARTITION_STRATEGY_HASH).
>
> git grep --show-function PARTITION_STRATEGY -- src/backend/catalog/
> partition.c
>
> shows that get_partitions_for_keys() is probably the most guilty of
> having the most strategy condition tests.

I notice that too now that you mention it.

>
> Also, if we look at get_partitions_for_keys() there's an unconditional:
>
> memset(hash_isnull, false, sizeof(hash_isnull));
>
> which is only used for PARTITION_STRATEGY_HASH, but LIST and RANGE
> must pay the price of that memset.

Although I know you're talking about something else here (about which I
say below), it turns out that hash_isnull was completely unnecessary, so I
got rid of it.

> Perhaps it's not expensive enough
> to warrant only doing that when partkey->strategy ==
> PARTITION_STRATEGY_HASH, but it does make me question if we should
> have 3 separate functions for this and just have a case statement to
> call the correct one.
>
> I think if we were to put this off as something we'll fix later, then
> the job would just become harder and harder as time goes on.
>
> It might have been fine when we just had RANGE and LIST partitioning,
> but I think HASH really tips the scales over to this being needed.
>
> What do you think?

I think I somewhat understand your concern with regard to future additions
and maintenance and also now tend to agree.

I tried dividing up get_partitions_for_keys() into one function each for
hash, list, and range, as in the attached, and I like the result.  Each
function has to deal only with the query keys and bounds for a given
partitioning method, which adds to the overall clarity of the code.

On 2017/12/19 22:44, David Rowley wrote:
> Again, another micro review. I apologise for the slow trickle of
> review. Again, these are just things I'm noticing while reading
> through while thinking of the run-time pruning patch.
>
>
> 1. The following Assert appears to be testing for the presence of
> cosmic rays :-)
>
> /*
> * Determine set of partitions using provided keys, which proceeds in a
> * manner determined by the partitioning method.
> */
> if (keys->n_eqkeys == partkey->partnatts)
> {
> Assert(keys->n_eqkeys == partkey->partnatts);
>
> Perhaps it's misplaced during a rewrite? Should be safe enough to
> remove it, I'd say.

I noticed that too and took care of it. :)

> 2. The following code in classify_partition_bounding_keys() misses
> looking under the RelabelType for rightarg:
>
> leftop = (Expr *) get_leftop(clause);
> if (IsA(leftop, RelabelType))
> leftop = ((RelabelType *) leftop)->arg;
> rightop = (Expr *) get_rightop(clause);
> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
> constexpr = rightop;
> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
> constexpr = leftop;
>
> This breaks the following case:
>
> create table thisthat (a varchar not null) partition by list (a);
> create table this partition of thisthat for values in('this');
> create table that partition of thisthat for values in('that');
> explain select * from thisthat where 'this' = a; -- does not work
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..54.00 rows=14 width=32)
>    ->  Seq Scan on that  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ('this'::text = (a)::text)
>    ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ('this'::text = (a)::text)
> (5 rows)
>
> explain select * from thisthat where a = 'this'; -- works as we look
> through the RelabelType on left arg.
>                          QUERY PLAN
> ------------------------------------------------------------
>  Append  (cost=0.00..27.00 rows=7 width=32)
>    ->  Seq Scan on this  (cost=0.00..27.00 rows=7 width=32)
>          Filter: ((a)::text = 'this'::text)

Thanks for pointing it out, fixed.

> 3. The follow code assumes there will be a commutator for the operator:
>
> if (constexpr == rightop)
> pc->op = opclause;
> else
> {
> OpExpr   *commuted;
>
> commuted = (OpExpr *) copyObject(opclause);
> commuted->opno = get_commutator(opclause->opno);
> commuted->opfuncid = get_opcode(commuted->opno);
> commuted->args = list_make2(rightop, leftop);
> pc->op = commuted;
> }
>
> I had to hunt for it, but it appears that you're pre-filtering clauses
> with the Consts on the left and no valid commutator in
> match_clauses_to_partkey. I think it's likely worth putting a comment
> to mention that reversed clauses with no commutator should have been
> filtered out beforehand. I'd say it's also worthy of an Assert().

Yeah, added a comment and an Assert.

>
> 4. The spelling of "arbitrary" is incorrect in:
>
> * partattno == 0 refers to arbirtary expressions, which get the

Fixed.

> 5. I've noticed that partition pruning varies slightly from constraint
> exclusion in the following case:
>
> create table ta (a int not null) partition by list (a);
> create table ta1 partition of ta for values in(1,2);
> create table ta2 partition of ta for values in(3,4);
>
> explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
> not eliminated.
>                          QUERY PLAN
> -------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=5050 width=4)
>    ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
> (5 rows)
>
>
> alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
> check constraint to see if can be removed.
> explain select * from ta where a <> 1 and a <> 2; -- it can.
>                          QUERY PLAN
> -------------------------------------------------------------
>  Append  (cost=0.00..48.25 rows=2525 width=4)
>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
> (3 rows)

I see.  It seems that the current approach of handling <> operators by
turning clauses containing the same into (key > const OR key < const)
doesn't always work.  I think I had noticed that for list partitioning at
least.  I will work on an alternative way of handling that in the next
version of the patch.

Meanwhile, please find attached patches (v15) that take care of the rest
of the comments.  Most of the updates are to patch 0002, compared to the
last (v14) version.

Thanks again for your thoughtful review comments.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/12/20 17:27, Amit Langote wrote:
> On 2017/12/19 13:36, David Rowley wrote:
>> 5. I've noticed that partition pruning varies slightly from constraint
>> exclusion in the following case:
>>
>> create table ta (a int not null) partition by list (a);
>> create table ta1 partition of ta for values in(1,2);
>> create table ta2 partition of ta for values in(3,4);
>>
>> explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is
>> not eliminated.
>>                          QUERY PLAN
>> -------------------------------------------------------------
>>  Append  (cost=0.00..96.50 rows=5050 width=4)
>>    ->  Seq Scan on ta1  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>> (5 rows)
>>
>>
>> alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a
>> check constraint to see if can be removed.
>> explain select * from ta where a <> 1 and a <> 2; -- it can.
>>                          QUERY PLAN
>> -------------------------------------------------------------
>>  Append  (cost=0.00..48.25 rows=2525 width=4)
>>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>>          Filter: ((a <> 1) AND (a <> 2))
>> (3 rows)
> 
> I see.  It seems that the current approach of handling <> operators by
> turning clauses containing the same into (key > const OR key < const)
> doesn't always work.  I think I had noticed that for list partitioning at
> least.  I will work on an alternative way of handling that in the next
> version of the patch.

I think I was able to make this work and in the process of making it work,
also came to the conclusion that this could be made to work sensibly
*only* for list partitioned tables.  That's because one cannot prune a
given partition using a set of <> operator clauses, if we cannot be sure
that those clauses exclude *all* values of the partition key allowed by
that partition.  It's only possible to do that for a list partitioned
table, because by definition one is required to spell out every value that
a given partition of such table allows.

There is a new function in the updated patch that does the pruning using
<> operator clauses and it's implemented by assuming it's only ever called
for a list partitioned table.  So, sorry range and hash partitioned tables.

Attached updated set of patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 December 2017 at 23:38, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/12/20 17:27, Amit Langote wrote:
> I think I was able to make this work and in the process of making it work,
> also came to the conclusion that this could be made to work sensibly
> *only* for list partitioned tables.  That's because one cannot prune a
> given partition using a set of <> operator clauses, if we cannot be sure
> that those clauses exclude *all* values of the partition key allowed by
> that partition.  It's only possible to do that for a list partitioned
> table, because by definition one is required to spell out every value that
> a given partition of such table allows.

Makes sense. Thanks for fixing LIST partitioning to work with that.

We have no way to know that there's no value between 1::int and
2::int, so it's completely understandable why this can't work for
RANGE. HASH is also understandable since we don't have a complete
picture of all the values that can be contained within the partition.

I'll try to do another complete review of v16 soon.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 December 2017 at 23:38, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated set of patches.

Looks like the new not equals code does not properly take into account
a missing NULL partition.

create table ta (a int not null) partition by list (a);
create table ta1 partition of ta for values in(1,2);
create table ta2 partition of ta for values in(3,4);
explain select * from ta where a <> 1 and a <> 2;
ERROR:  negative bitmapset member not allowed

-- Add null partition
create table ta_null partition of ta for values in(null);
explain select * from ta where a <> 1 and a <> 2; -- works now.
                         QUERY PLAN
-------------------------------------------------------------
 Append  (cost=0.00..48.25 rows=2525 width=4)
   ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 2))
(3 rows)

This code appears to be at fault:

/*
* Also, exclude the "null-only" partition, because strict clauses in
* ne_clauses will not select any rows from it.
*/
if (count_partition_datums(relation, boundinfo->null_index) == 0)
excluded_parts = bms_add_member(excluded_parts,
boundinfo->null_index);


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Just trying to understand the code here; some very minor comments as I
go along.

partition_op_strategy returning int32 looks pretty ugly, and the calling
code is not super-intelligible either.  How about returning a value from
a new enum?

typedef PartClause is missing a struct name, as is our tradition.


> +                 * We don't a <> operator clause into a key right away.

Missing a word there.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/12/21 21:04, David Rowley wrote:
> On 21 December 2017 at 23:38, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated set of patches.
> 
> Looks like the new not equals code does not properly take into account
> a missing NULL partition.
> 
> create table ta (a int not null) partition by list (a);
> create table ta1 partition of ta for values in(1,2);
> create table ta2 partition of ta for values in(3,4);
> explain select * from ta where a <> 1 and a <> 2;
> ERROR:  negative bitmapset member not allowed
> 
> -- Add null partition
> create table ta_null partition of ta for values in(null);
> explain select * from ta where a <> 1 and a <> 2; -- works now.
>                          QUERY PLAN
> -------------------------------------------------------------
>  Append  (cost=0.00..48.25 rows=2525 width=4)
>    ->  Seq Scan on ta2  (cost=0.00..48.25 rows=2525 width=4)
>          Filter: ((a <> 1) AND (a <> 2))
> (3 rows)
> 
> This code appears to be at fault:
> 
> /*
> * Also, exclude the "null-only" partition, because strict clauses in
> * ne_clauses will not select any rows from it.
> */
> if (count_partition_datums(relation, boundinfo->null_index) == 0)
> excluded_parts = bms_add_member(excluded_parts,
> boundinfo->null_index);

Oops, I must check that a null partition exists at all before counting
its datums.  Will post the fixed version shortly, thanks.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 22 December 2017 at 13:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Will post the fixed version shortly, thanks.

I've made another partial pass on the patch and have a few more
things. #3 and #4 are just comments rather than requests to change
something. I think we should change those before PG11 though.

1. If I look at the elog(ERROR) messages in partition.c, there are a
number of variations of reporting an invalid partition strategy.

There seem to be 3 variations of the same thing. Probably the
"unexpected" one would suit most, but I've not looked too closely.

elog(ERROR, "invalid partitioning strategy");
elog(ERROR, "unexpected partition strategy: %d", (int) key->strategy);
elog(ERROR, "invalid partition strategy %c",
RelationGetPartitionKey(rel)->strategy);

2. In get_relation_constraints(). Can you add a comment to why you've added:

/* Append partition predicates, if any */
if (root->parse->commandType != CMD_SELECT)
{

I guess it must be because we use the new partition pruning code for
SELECT, but not for anything else.

3. It's a shame that RelOptInfo->live_partitioned_rels is a List and
not a RelIds. I guess if you were to change that you'd need to also
change AppendPath->partitioned_rels too, so probably we can just fix
that later.

4. live_part_appinfos I think could be a Relids type too, but probably
we can change that after this patch. Append subpaths are sorted in
create_append_path() for parallel append, so the order of the subpaths
seems non-critical.

5. Small memory leaks in get_partitions_from_clauses_recurse().

if (ne_clauses)
result = bms_int_members(result,
get_partitions_from_ne_clauses(relation,
ne_clauses));

Can you assign the result of get_partitions_from_ne_clauses() and
bms_free() it after the bms_int_members() ?

Same for:

result = bms_int_members(result,
get_partitions_from_or_clause_args(relation,
rt_index,
or->args));

The reason I'm being particular about this is that for the run-time
pruning patch we'll call this from ExecReScanAppend() which will
allocate into the ExecutorState which lives as long as the query does.
So any leaks will last the entire length of the query.
ExecReScanAppend() could be called millions or billions of times, so
we need to be sure that's not going to be a problem.

6. Similar to #5, memory leaks in get_partitions_from_or_clause_args()

arg_partset = get_partitions_from_clauses_recurse(relation, rt_index,
  arg_clauses);

/*
* Partition sets obtained from mutually-disjunctive clauses are
* combined using set union.
*/
result = bms_add_members(result, arg_partset);

Need to bms_free(arg_partset)

Running out of time for today, but will look again in about 4 days.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2017/12/22 1:06, Alvaro Herrera wrote:
> Just trying to understand the code here; some very minor comments as I
> go along.
> 
> partition_op_strategy returning int32 looks pretty ugly, and the calling
> code is not super-intelligible either.  How about returning a value from
> a new enum?

OK, I made it the following enum:

typedef enum PartOpStrategy
{
    PART_OP_EQUAL,
    PART_OP_LESS,
    PART_OP_GREATER,
} PartOpStrategy;

> typedef PartClause is missing a struct name, as is our tradition.

Will fix.

>> +                 * We don't a <> operator clause into a key right away.
> 
> Missing a word there.

Oops, right.  I meant "We don't turn a <> ...".  Will fix.

Will post a new version after taking care of David's comments.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2017/12/22 10:35, David Rowley wrote:
> On 22 December 2017 at 13:57, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Will post the fixed version shortly, thanks.
> 
> I've made another partial pass on the patch and have a few more
> things. #3 and #4 are just comments rather than requests to change
> something. I think we should change those before PG11 though.

Thank you.

> 1. If I look at the elog(ERROR) messages in partition.c, there are a
> number of variations of reporting an invalid partition strategy.
> 
> There seem to be 3 variations of the same thing. Probably the
> "unexpected" one would suit most, but I've not looked too closely.
> 
> elog(ERROR, "invalid partitioning strategy");
> elog(ERROR, "unexpected partition strategy: %d", (int) key->strategy);
> elog(ERROR, "invalid partition strategy %c",
> RelationGetPartitionKey(rel)->strategy);

I should have used the "unexpected ..." wording in yesterday's update.
Fixed.

> 2. In get_relation_constraints(). Can you add a comment to why you've added:
> 
> /* Append partition predicates, if any */
> if (root->parse->commandType != CMD_SELECT)
> {
> 
> I guess it must be because we use the new partition pruning code for
> SELECT, but not for anything else.

Yeah, I explained that a couple of times in email (maybe also in the
commit message), but not there.  Done.

> 3. It's a shame that RelOptInfo->live_partitioned_rels is a List and
> not a RelIds. I guess if you were to change that you'd need to also
> change AppendPath->partitioned_rels too, so probably we can just fix
> that later.

I agree.

> 4. live_part_appinfos I think could be a Relids type too, but probably
> we can change that after this patch. Append subpaths are sorted in
> create_append_path() for parallel append, so the order of the subpaths
> seems non-critical.

Hmm, perhaps.

> 5. Small memory leaks in get_partitions_from_clauses_recurse().
> 
> if (ne_clauses)
> result = bms_int_members(result,
> get_partitions_from_ne_clauses(relation,
> ne_clauses));
> 
> Can you assign the result of get_partitions_from_ne_clauses() and
> bms_free() it after the bms_int_members() ?
> 
> Same for:
> 
> result = bms_int_members(result,
> get_partitions_from_or_clause_args(relation,
> rt_index,
> or->args));
> 
> The reason I'm being particular about this is that for the run-time
> pruning patch we'll call this from ExecReScanAppend() which will
> allocate into the ExecutorState which lives as long as the query does.
> So any leaks will last the entire length of the query.
> ExecReScanAppend() could be called millions or billions of times, so
> we need to be sure that's not going to be a problem.

That's a very important point to stress.  Thanks.

> 6. Similar to #5, memory leaks in get_partitions_from_or_clause_args()
> 
> arg_partset = get_partitions_from_clauses_recurse(relation, rt_index,
>   arg_clauses);
> 
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> result = bms_add_members(result, arg_partset);
> 
> Need to bms_free(arg_partset)

Fixed all these instances of leaks.

> Running out of time for today, but will look again in about 4 days.

Thanks again.

Please find attached updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 22 December 2017 at 17:25, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached updated patches.

Hi Amit,

I've just completed a pass over the v17 patch set. I've found a number
of things that need to be addressed. Some might seem a bit nit-picky,
sorry about that. However, many of the others genuinely need to be
addressed.

1. The following code calls RelationGetPartitionQual, but the output
of that is only needed when partition_bound_has_default(boundinfo) is
true.

Can you change this to call RelationGetPartitionQual only when it's required?

Bitmapset *
get_partitions_from_clauses(Relation relation, int rt_index,
ParamListInfo prmlist, ExprContext *econtext,
List *partclauses)
{
Bitmapset    *result;
List    *partconstr = RelationGetPartitionQual(relation);

2. The header comment in match_clauses_to_partkey() does not give any
warning that 'clauses' is modified within the function.

The function should create a copy of the clauses before modifying
them. This will save you having to do any list_copy calls when you're
calling the function.

The header comment is also not very clear about what the return value
of the function is.

3. "method" I think should be "strategy". We've pretty much
standardised on that term everywhere else, so let's keep to standard.

/*
* Does the query specify a key to be null or not null?  Partitioning
* handles null partition keys specially depending on the partitioning
* method in use, we store this information.
*/

4. "relation" should be in single quotes, since you're talking about
the parameter named "relation". Likewise with "partclauses", otherwise
it just seems like bad English.

 * Determine the set of partitions of relation that will satisfy all
 * the clauses contained in partclauses

5. partdesc's assignment can be delayed until it's needed. This will
save generating it when constfalse == true

static Bitmapset *
get_partitions_from_clauses_recurse(Relation relation, int rt_index,
List *clauses)
{
PartitionDesc partdesc = RelationGetPartitionDesc(relation);

6. In the following comment, I'd have expected almost identical code,
just with some other List, but the code probably differs a bit too
much to use "Ditto".

/*
* Ditto, but this time or_clauses.
*/

7. Comment claims we use "set union", but we're really just collecting
the members from the other set in:

/*
* Partition sets obtained from mutually-disjunctive clauses are
* combined using set union.
*/
result = bms_add_members(result, arg_partset);

8. These arrays could just be initialized up to partkey->partnatts.
I'd imagine most of the time this will be just 1, so would save
needlessly setting the 31 other elements, although, perhaps it's a bad
idea to optimize this.

memset(keyclauses_all, 0, sizeof(keyclauses_all));
/* false means we don't know if a given key is null */
memset(keyisnull, false, sizeof(keyisnull));
/* false means we don't know if a given key is not null */
memset(keyisnotnull, false, sizeof(keyisnull));

The last two of these could just be Bitmapsets and you'd not need any
memset at all. PARTITION_MAX_KEYS just so happens to be the same as
BITS_PER_BITMAPWORD in a standard build, so you'd always be able to
mark everything with a single bitmap word. This would help a bit in
the various places that you're counting the true elements, for
example, the following code:

for (i = 0; i < partkey->partnatts; i++)
{
if (!keys->keyisnotnull[i])
{
include_def = true;
break;
}
}

could become:

include_def = !bms_is_empty(keys->keyisnotnull);

if you converted all these to Bitmapsets.

9. The following comment would be better read as: /* clause does not
match this partition key */

/* Clause not meant for this column. */

10. The following comment talks about handling less than operators for
hash opfamilies, but the code only handles <> for btree and list
partitioning.

* Handle some cases wherein the clause's operator may not
* belong to the partitioning operator family.  For example,
* operators named '<>' are not listed in any operator
* family whatsoever.  Also, ordering opertors like '<' are
* not listed in the hash operator family.

"opertors" should be spelled "operators"

11. In the following comment "operator" should be "operators":

 * are constrained by clauses containing equality operator, unless hash

Likewise for:

 * IS NULL clauses while remaining have clauses with equality operator.

12. The following code in classify_partition_bounding_keys probably
should use EXPR_MATCHES_PARTKEY.

/* Does leftop match with this partition key column? */
if ((IsA(arg, Var) &&
((Var *) arg)->varattno == partattno) ||
equal(arg, partexpr))

13. The comment for classify_partition_bounding_keys does not seem to
define well what the meaning of the return value is:

 * classify_partition_bounding_keys
 * Classify partition clauses into equal, min, and max keys, along with
 * any Nullness constraints and return that information in the output
 * argument keys (number of keys is the return value)

I had thought all along that this must mean the number of distinct
partition keys that we've found useful clauses for, but it's possible
to fool this with duplicate IS NULL checks.

create table hp (a int, b text) partition by hash (a, b);
create table hp0 partition of hp for values with (modulus 4, remainder 0);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp1 partition of hp for values with (modulus 4, remainder 1);
create table hp2 partition of hp for values with (modulus 4, remainder 2);

explain select * from hp where a is null and a is null;

This causes classify_partition_bounding_keys to return 2. I can't see
any bad side effects of this, but I think the comment needs to be
improved. Perhaps it's better just to make the function return bool,
returning true if there are any useful keys?

14. The switch statement in partition_op_strategy could be simplified
a bit and be written more as:

case BTLessEqualStrategyNumber:
  *incl = true;
  /* fall through */
case BTLessStrategyNumber:
  result = PART_OP_LESS;
  break;

15. No default case statements in partition_op_strategy() could result
in "result" not being set.

16. I'm unsure what the following comment means:

* To set eqkeys, we must have found the same for partition key columns.

The word "for" seems wrong, or I'm not sure what it wants to ensure it
finds the same for.

17. Header comment for partition_op_strategy is out-of-date.

/*
 * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=,
 * =, and >/>= operator, respectively.  Sets *incl to true if equality is
 * implied.
 */

It'll never return -1.

18. The following comment has some grammar issues:

/*
* If couldn't coerce to the partition key type, that is, the type of
* datums stored in PartitionBoundInfo, no hope of using this
* expression for anything partitioning-related.
*/

Would be better with:

/*
* If we couldn't coerce the partition key type, that is, the type
* of datums stored in PartitionBoundInfo, then there's no hope of
* using this expression for anything partitioning-related.
*/

19. In partkey_datum_from_expr() the Assert(false) at the end seems
overkill. You could just get rid of the default in the switch
statement and have it fall through to the final return. This would
save 4 lines of code.

20. The word "and" I think needs removed from the following comment:

* Couldn't compare; keep hash_clause set to the previous value and
* so add this one directly to the result.  Caller would

Probably also needs a comment after "value"

21. The following comment seems to indicate 'cur' is an operator, but
it's a PartClause:

/* The code below is for btree operators, which cur is not. */

It might be better to write

/* The code below handles Btree operators which are not relevant to a
hash-partitioned table. */

22. "Stuff" should be "The code" in:

* Stuff that follows closely mimics similar processing done by

23. In remove_redundant_clauses() and various other places, you have a
variable named partattoff. What's the meaning of this name? I'd
imagine it is short for "partition attribute offset", but it's a
"partition key index", so why not "partkeyidx"?

Of course, you might claim that an array index is just an offset, but
it is a little confusing if you think of attribute offsets in a
TupleDesc.

24. get_partitions_for_keys() you're using "return" and "break" in the
switch statement. You can remove the breaks;

25. The Assert(false) in get_partitions_for_keys seems overkill. I
think it's fine to just do:

return NULL; /* keep compiler quiet */

26. The following code comment in get_partitions_for_keys_hash() does
not seem very well written:

* Hash partitioning handles puts nulls into a normal partition and
* doesn't require to define a special null-accpting partition.
* Caller didn't count nulls as a valid key; do so ourselves.

Maybe "puts" should be "storing"?

Also, does hash partitioning actually support a NULL partition? This
seems to say it doesn't require, but as far as I can see it does not
*support* a NULL partition. The comment is a bit misleading.
"accpting" should be "accepting".

27. In get_partitions_for_keys_hash() why is it possible to get a
result_index below 0? In that case, the code will end up triggering
the Assert(false), but if you want to Assert something here then maybe
it should be Assert(result_index >= 0)?

if (result_index >= 0)
return bms_make_singleton(result_index);

28. Is there any point in the following loop in get_partitions_for_keys_list()?

/*
* We might be able to get the answer sooner based on the nullness of
* keys, so get that out of the way.
*/
for (i = 0; i < partkey->partnatts; i++)
{

Won't partkey->partnatts always be 1? Maybe you can Assert that
instead, just in case someone forgets to change the code if LIST is to
support multiple partition keys.

29. In get_partitions_for_keys_list and get_partitions_for_keys_range
partdesc is only used for an Assert. Maybe you can just:

Assert(RelationGetPartitionDesc(rel)->nparts > 0);

30. The Assert(false) in get_partitions_for_keys_list() looks
suspiciously easy to hit...

create table lp (a int not null) partition by list(a);
create table lp1 partition of lp for values in(1);
explain select * from lp where a > 11; -- Assert fail!

31. ranget?

 * get_partitions_for_keys_range
 * Return partitions of a ranget partitioned table for requested keys


32. I'm not quite sure what this comment is saying:

/*
* eqoff is gives us the bound that is known to be <= eqkeys,
* given how partition_bound_bsearch works.  The bound at eqoff+1,
* then, would be the upper bound of the only partition that needs
* to be scanned.
*/

33. "one" -> "the one"

* Find the leftmost bound that satisfies the query, i.e., one that
* satisfies minkeys.

34. Why "will" there be multiple partitions sharing the same prefix?

* If only a prefix of the whole partition key is provided, there will
* be multiple partitions whose bound share the same prefix.  If minkey

Perhaps "will" should be "may"?

35. I think "would've" in the following comment is not correct. This
seems to indicate that this has already happened, which it has not.
"will have to be" might be more correct?

* satisfy the query, but don't have a valid partition assigned.  The
* default partition would've been included to cover those values.

36. "will" -> "with"?

* Since partition keys will nulls are mapped to default range

37. "If no" -> "If there's no"

* If no tuple datum to compare with the bound, consider
* the latter to be greater.

38. I don't see anything about setting keynullness here:

/*
* Get the clauses that match the partition key, including information
* about any nullness tests against partition keys.  Set keynullness to
* a invalid value of NullTestType, which 0 is not.
*/
partclauses = match_clauses_to_partkey(root, rel,
   list_copy(rel->baserestrictinfo),
   &contains_const,
   &constfalse);

39. "paritions" -> "partitions"

* Else there are no clauses that are useful to prune any paritions,

40. In match_clauses_to_partkey, the following code pulls the varnos
from each operand of the expression. It would be better to just pull
the side that's needed (if any) a bit later rather than always doing
both.

left_relids = pull_varnos((Node *) leftop);
right_relids = pull_varnos((Node *) rightop);

41. I think a list_copy is missing here:

/*
* Accumulate the live partitioned children of this child, if it's
* itself partitioned rel.
*/
if (childrel->part_scheme)
partitioned_rels = list_concat(partitioned_rels,
   childrel->live_partitioned_rels);

Surely if we don't perform a list_copy of
childrel->live_partitioned_rels then subsequent additions to the
resulting list will inadvertently add new items to the
childrel->live_partitioned_rels?
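For illustration, the aliasing being described can be reproduced with a minimal cons-style list (a stand-in for PostgreSQL's cell-based List type, not the real API — all names here are hypothetical): splicing without a copy shares cells, so a later append to the result list also grows the source list.

```c
#include <stdlib.h>

/* Minimal singly linked list, standing in for PostgreSQL's cell-based
 * List type; names here are illustrative, not the real API. */
typedef struct Cell { int value; struct Cell *next; } Cell;
typedef struct List { Cell *head; Cell *tail; } List;

static void append(List *l, int v)
{
    Cell *c = malloc(sizeof(Cell));
    c->value = v;
    c->next = NULL;
    if (l->tail)
        l->tail->next = c;
    else
        l->head = c;
    l->tail = c;
}

/* concat without copying: splice b's cells directly onto a's tail */
static void concat(List *a, List *b)
{
    if (b->head == NULL)
        return;
    if (a->tail)
        a->tail->next = b->head;
    else
        a->head = b->head;
    a->tail = b->tail;
}

static int list_length_by_walk(const List *l)
{
    int n = 0;
    for (const Cell *c = l->head; c; c = c->next)
        n++;
    return n;
}

/*
 * Mimics the reviewed pattern: concatenating childrel's list into the
 * result without a copy, then appending to the result.  Afterwards the
 * appended cell is also reachable from the child's list head.
 */
static int demo_shared_tail(void)
{
    List result = { NULL, NULL };
    List child = { NULL, NULL };   /* childrel->live_partitioned_rels */

    append(&child, 1);
    concat(&result, &child);       /* cells now shared */
    append(&result, 2);            /* leaks into child's list too */
    return list_length_by_walk(&child);
}
```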

42. Is this because of an existing bug?

@@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted
where abs(b) = 5; -- scans all parti
          Filter: (abs(b) = 5)
    ->  Seq Scan on mcrparted3
          Filter: (abs(b) = 5)
+   ->  Seq Scan on mcrparted4
+         Filter: (abs(b) = 5)
    ->  Seq Scan on mcrparted5
          Filter: (abs(b) = 5)

43. In partition_prune.sql you have a mix of /* */ and -- comments.
Please just use --

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 28 December 2017 at 15:07, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 43. In partition_prune.sql you have a mix of /* */ and -- comments.
> Please just use --

Just a few extras that I found:

44. In match_clauses_to_partkey you're making use of
estimate_expression_value(), I don't think this is safe.

if (IsA(estimate_expression_value(root, rightop), Const))
*contains_const = true;

The only other places I see using this in the planner are for costing
purposes. Also, the header comment for that function says it's not
safe. Particularly "This effectively means that we plan using the
first supplied value of the Param.". If that's the case, then if we're
planning a generic plan, wouldn't it be possible that the planner
chooses the currently supplied parameter value and prunes away partitions
based on that value? That would make the plan invalid for any other
parameter, but it's meant to be a generic plan, so we can't do that.

45. Why use a list_copy() here?

/*
* For a nested ArrayExpr, we don't know how to get the
* actual scalar values out into a flat list, so we give
* up doing anything with this ScalarArrayOpExpr.
*/
if (arrexpr->multidims)
continue;

elem_exprs = list_copy(arrexpr->elements);

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
I happened to notice that Ashutosh's patch series at
https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com
has a 0001 patch that modifies the partition_bound_cmp stuff too.
Are those conflicting?

Ashutosh's commit message:
    Modify bound comparision functions to accept members of PartitionKey

    Functions partition_bound_cmp(), partition_rbound_cmp() and
    partition_rbound_datum_cmp() are required to merge partition bounds
    from joining relations. While doing so, we do not have access to the
    PartitionKey of either relations. So, modify these functions to accept
    only required members of PartitionKey so that the functions can be
    reused for merging bounds.

Amit's:
    Some interface changes for partition_bound_{cmp/bsearch}

    Introduces a notion of PartitionBoundCmpArg, which replaces the set
    of arguments void *probe and bool probe_is_bound of both
    partition_bound_cmp and partition_bound_bsearch.  It wasn't possible
    before to specify the number of datums when a non-bound type of
    probe is passed.  Slightly tweaking the existing interface to allow
    specifying the same seems awkward.  So, instead encapsulate that
    into PartitionBoundCmpArg.  Also, modify partition_rbound_datum_cmp
    to compare caller-specifed number of datums, instead of
    key->partnatts datums.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Fri, Dec 29, 2017 at 6:32 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I happened to notice that Ashutosh's patch series at
> https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com
> has a 0001 patch that modifies the partition_bound_cmp stuff too.
> Are those conflicting?
>
> Ashutosh's commit message:
>         Modify bound comparision functions to accept members of PartitionKey
>
>         Functions partition_bound_cmp(), partition_rbound_cmp() and
>         partition_rbound_datum_cmp() are required to merge partition bounds
>         from joining relations. While doing so, we do not have access to the
>         PartitionKey of either relations. So, modify these functions to accept
>         only required members of PartitionKey so that the functions can be
>         reused for merging bounds.
>
> Amit's:
>         Some interface changes for partition_bound_{cmp/bsearch}
>
>         Introduces a notion of PartitionBoundCmpArg, which replaces the set
>         of arguments void *probe and bool probe_is_bound of both
>         partition_bound_cmp and partition_bound_bsearch.  It wasn't possible
>         before to specify the number of datums when a non-bound type of
>         probe is passed.  Slightly tweaking the existing interface to allow
>         specifying the same seems awkward.  So, instead encapsulate that
>         into PartitionBoundCmpArg.  Also, modify partition_rbound_datum_cmp
>         to compare caller-specifed number of datums, instead of
>         key->partnatts datums.
>

I haven't looked at Amit's changes, but we need a more flexible way to
pass the information required for datum comparison than using
PartitionKey, since that's not available in the optimizer and cannot
be associated with join or aggregate relations. If we pass that
information through a structure, there are two ways:
1. It will need to be part of PartitionScheme; I am not sure if we can
have a substructure in PartitionKey. But if we can do it that way, we
can pass that structure to the functions.
2. We will need to construct the structure, filling it with comparison
information, and pass it to the comparison functions. I think what we
achieve out of this isn't worth the code we will need to add.

I would prefer the first approach over the other.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001
> From: amit <amitlangote09@gmail.com>
> Date: Tue, 22 Aug 2017 13:48:13 +0900
> Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses()

This one fails to apply.  Please rebase.

Did you know you can use "git format-patch -v6" to generate
appropriately named patch files without having to rename them yourself?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 12/21/2017 11:25 PM, Amit Langote wrote:
> Thanks again.
> 
> Please find attached updated patches.
> 

I have been looking at this patch from a simple hash partition point of 
view.

-- ddl.sql --
CREATE TABLE t1 (
     a integer NOT NULL,
     b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 3);

CREATE INDEX idx_t1_b_a_p00 ON t1_p00 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p01 ON t1_p01 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p02 ON t1_p02 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p03 ON t1_p03 USING btree (b, a);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000000) AS i);

ANALYZE;
-- ddl.sql --

w/

-- select.sql --
\set b random(1, 1000000)
BEGIN;
SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b;
COMMIT;
-- select.sql --

using pgbench -c X -j X -M prepared -T X -f select.sql part-hash

On master we have generic_cost planning cost of 33.75, and an 
avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS 
of 8893.

Using v17 we have generic_cost planning cost of 33.75, and an 
avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS 
of 7129 - of course due to the generation of a custom plan for each 
invocation.

Comparing master with a non-partitioned scenario, we have a TPS of 
12968, since there is no overhead of ExecInitAppend (PortalStart) and 
ExecAppend (PortalRun).

Could you share your thoughts on

1) if the generic plan mechanics should know about the pruning and hence 
give a lower planner cost

2) if the patch should be more aggressive in removing planning nodes 
that aren't necessary, e.g. going from Append -> IndexOnly to just 
IndexOnly.

I have tested with both [1] and [2], but would like to know about your 
thoughts on the above first.

Thanks in advance !

[1] https://commitfest.postgresql.org/16/1330/
[2] https://commitfest.postgresql.org/16/1353/

Best regards,
  Jesper


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> 1) if the patch should be more aggressive in removing planning nodes that
> aren't necessary, e.g. going from Append -> IndexOnly to just IndexOnly.

That's not for this patch. There's another patch [1] to do that already.

[1] https://commitfest.postgresql.org/16/1353/


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
> \set b random(1, 1000000)
> BEGIN;
> SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b;
> COMMIT;
> -- select.sql --
>
> using pgbench -c X -j X -M prepared -T X -f select.sql part-hash
>
> On master we have generic_cost planning cost of 33.75, and an
> avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS of
> 8893.
>
> Using v17 we have generic_cost planning cost of 33.75, and an
> avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS of
> 7129 - of course due to the generation of a custom plan for each invocation.
>
> Comparing master with an non-partitioned scenario; we have a TPS of 12968,
> since there is no overhead of ExecInitAppend (PortalStart) and ExecAppend
> (PortalRun).
>
> Could you share your thoughts on
>
> 1) if the generic plan mechanics should know about the pruning and hence
> give a lower planner cost

I think the problem here is that cached_plan_cost() is costing the
planning of the query too low. If this were costed higher, then it's
more likely the generic plan would have been chosen, instead of
generating a custom plan each time.

How well does it perform if you change cpu_operator_cost = 0.01?

I think cached_plan_cost() does need an overhaul, but I think it's not
anything that should be done as part of this patch. You've picked HASH
partitioning here just because the current master does not perform any
partition pruning for that partitioning strategy.

There also might be a tiny argument here to have some method of
disabling the planner's partition pruning as we could before with SET
constraint_exclusion = 'off', but I think that's about the limit of
the interest this patch should have in that problem.

(The problem gets more complex again when doing run-time pruning, but
that's not a topic for this thread)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi David,

On 01/04/2018 09:21 PM, David Rowley wrote:
> On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> Could you share your thoughts on
>>
>> 1) if the generic plan mechanics should know about the pruning and hence
>> give a lower planner cost
> 
> I think the problem here is that cached_plan_cost() is costing the
> planning cost of the query too low. If this was costed higher then its
> more likely the generic plan would have been chosen, instead of
> generating a custom plan each time.
> 
> How well does it perform if you change cpu_operator_cost = 0.01?
> 

It gives 38.82 for generic_cost, and 108.82 for avg_custom_cost on 
master (8249 TPS). And, 38.82 for generic_cost, and 79.705 for 
avg_custom_cost with v17 (7891 TPS). Non-partitioned is 11722 TPS.

> I think cached_plan_cost() does need an overhaul, but I think it's not
> anything that should be done as part of this patch. You've picked HASH
> partitioning here just because the current master does not perform any
> partition pruning for that partitioning strategy.
>

Well, I mainly picked HASH because that is my use-case :)

For a range based setup it gives 39.84 for generic_cost, and 89.705 for 
avg_custom_cost (7862 TPS).

Best regards,
  Jesper


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/01/05 1:28, Alvaro Herrera wrote:
>> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001
>> From: amit <amitlangote09@gmail.com>
>> Date: Tue, 22 Aug 2017 13:48:13 +0900
>> Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses()
> 
> This one fails to apply.  Please rebase.

Sorry about the absence in the last few days.  I will post a new version
addressing various review comments by the end of this week.

> Did you know you can use "git format-patch -v6" to generate
> appropriately named patch files without having to rename them yourself?

Oh, didn't know that trick.  Will try, thanks.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

Good to have you back.

There's a small problem with get_partitions_for_keys_list(), the
following case hits the Assert(false) at the bottom of that function.

create table ab_c (a int not null, b char) partition by list(a);
create table abc_a2 (b char, a int not null) partition by list(b);
create table abc_a2_b3 partition of abc_a2 for values in ('3');
alter table ab_c attach partition abc_a2 for values in (2);

select * from ab_c where a between 1 and 2 and b <= '2';

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

One more thing I discovered while troubleshooting a bug Beena reported
in the run-time partition pruning patch is that
classify_partition_bounding_keys properly does:

if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
    constexpr = rightop;
else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
    constexpr = leftop;
else
    /* Clause not meant for this column. */
    continue;

for OpExpr clauses, but does not do the same for leftop for the
ScalarArrayOpExpr test.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Sorry about the absence in the last few days.  I will post a new version
> addressing various review comments by the end of this week.

I'm sorry for the flood of emails today.

Beena's tests on the run-time partition pruning patch also indirectly
exposed a problem with this patch.

Basically, the changes to add_paths_to_append_rel() are causing
duplication in partition_rels.

A test case is:

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in(1) partition by list (b);
create table part2 partition of part1 for values in(1);

select * from part;

partition_rels ends up with 3 items in the list, but there's only 2
partitions here. The reason for this is that, since planning here is
recursively calling add_paths_to_append_rel, the list for part ends up
with itself and part1 in it, then since part1's list already contains
itself, per set_append_rel_size's "rel->live_partitioned_rels =
list_make1_int(rti);", then part1 ends up in the list twice.

It would be nicer if you could use a Relids for this, but you'd also
need some way to store the target partition relation, since
nodeModifyTable.c does:

/* The root table RT index is at the head of the partitioned_rels list */
if (node->partitioned_rels)
{
    Index root_rti;
    Oid root_oid;

    root_rti = linitial_int(node->partitioned_rels);
    root_oid = getrelid(root_rti, estate->es_range_table);
    rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
}

You could also fix it by instead of doing:

/*
* Accumulate the live partitioned children of this child, if it's
* itself partitioned rel.
*/
if (childrel->part_scheme)
    partitioned_rels = list_concat(partitioned_rels,
       childrel->live_partitioned_rels);

do something along the lines of:

if (childrel->part_scheme)
{
    ListCell *lc;
    ListCell *start = lnext(list_head(childrel->live_partitioned_rels));

    for_each_cell(lc, start)
       partitioned_rels = lappend_int(partitioned_rels,
       lfirst_int(lc));
}

Although it seems pretty fragile. It would probably be better to find
a nicer way of handling all this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks a lot for the review.  I'm replying to your multiple emails here.

On 2017/12/28 11:07, David Rowley wrote:
> I've just completed a pass over the v17 patch set. I've found a number
> of things that need to be addressed. Some might seem a bit nit-picky,
> sorry about that. However, many of the others genuinely need to be
> addressed.
> 
> 1. The following code calls RelationGetPartitionQual, but the output
> of that is only needed when partition_bound_has_default(boundinfo) is
> true.
> 
> Can you change this to only get the RelationGetPartitionQual when it's required?

OK, done that way.

> 2. The header comment in match_clauses_to_partkey() does not give any
> warning that 'clauses' is modified within the function.
> 
> The function should create a copy of the clauses before modifying
> them. This will save you having to do any list_copy calls when you're
> calling the function.

OK, added a list_copy on clauses at the beginning of the function.

> 
> The header comment is also not very clear about what the return value
> of the function is.

Fixed the comment to describe return values.

> 3. "method" I think should be "strategy". We've pretty much
> standardised on that term everywhere else, so let's keep to standard.
> 
> /*
> * Does the query specify a key to be null or not null?  Partitioning
> * handles null partition keys specially depending on the partitioning
> * method in use, we store this information.
> */

Fixed.

> 4. "relation" should be in single quotes, since you're talking about
> the parameter named "relation". Likewise with "partclauses", otherwise
> it just seems like bad English.
> 
>  * Determine the set of partitions of relation that will satisfy all
>  * the clauses contained in partclauses

Fixed.

> 5. partdesc's assignment can be delayed until it's needed. This will
> save generating it when constfalse == true
> 
> static Bitmapset *
> get_partitions_from_clauses_recurse(Relation relation, int rt_index,
> List *clauses)
> {
> PartitionDesc partdesc = RelationGetPartitionDesc(relation);

Moved the whole line to a tiny block where partdesc is used.

> 6. In the following comment, I'd have expected almost identical code,
> just with some other List, but the code probably differs a bit too
> much to use "Ditto".
> 
> /*
> * Ditto, but this time or_clauses.
> */

Hmm, OK. I tried rewriting the comment a bit.

> 
> 7. Comment claims we use "set union", but we're really just collecting
> the members from the other set in:
> 
> /*
> * Partition sets obtained from mutually-disjunctive clauses are
> * combined using set union.
> */
> result = bms_add_members(result, arg_partset);

Guess the comment doesn't really add much in this case, so just deleted it.

> 8. These arrays could just be initialized up to partkey->partnatts.
> I'd imagine most of the time this will be just 1, so would save
> needlessly setting the 31 other elements, although, perhaps it's a bad
> idea to optimize this.
> 
> memset(keyclauses_all, 0, sizeof(keyclauses_all));
> /* false means we don't know if a given key is null */
> memset(keyisnull, false, sizeof(keyisnull));
> /* false means we don't know if a given key is not null */
> memset(keyisnotnull, false, sizeof(keyisnull));
> 
> The last two of these could just be Bitmapsets and you'd not need any
> memset at all. PARTITION_MAX_KEYS just so happens to be the same as
> BITS_PER_BITMAPWORD in a standard build, so you'd always be able to
> mark everything with a single bitmap word. This would help a bit in
> the various places that you're counting the true elements, for
> example, the following code:
> 
> for (i = 0; i < partkey->partnatts; i++)
> {
> if (!keys->keyisnotnull[i])
> {
> include_def = true;
> break;
> }
> }
> 
> could become:
> 
> include_def = !bms_is_empty(keys->keyisnotnull);
> 
> if you converted all these to Bitmapsets.

I liked the idea of using Bitmapset for keyisnull and keyisnotnull, so
I implemented it.
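For illustration, the single-word trick the review describes (relying on
PARTITION_MAX_KEYS not exceeding BITS_PER_BITMAPWORD) can be sketched as
below; the names are hypothetical, and this is not PostgreSQL's actual
Bitmapset API, just the underlying idea.

```c
#include <stdint.h>

/* Hypothetical sketch: track per-key nullness in one 32-bit word
 * instead of a bool[PARTITION_MAX_KEYS] array plus memset. */
typedef uint32_t KeyNullBitmap;

static void mark_key(KeyNullBitmap *bm, int keyidx)
{
    *bm |= (uint32_t) 1 << keyidx;
}

static int key_is_marked(KeyNullBitmap bm, int keyidx)
{
    return (bm >> keyidx) & 1;
}

/* The reviewed loop "for each key: if (!keyisnotnull[i]) include_def =
 * true" collapses into a single comparison against the full mask. */
static int any_key_unmarked(KeyNullBitmap bm, int partnatts)
{
    uint32_t all = (partnatts >= 32) ? ~(uint32_t) 0
                                     : (((uint32_t) 1 << partnatts) - 1);
    return (bm & all) != all;
}
```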

> 9. The following comment would be better read as: /* clause does not
> match this partition key */
> 
> /* Clause not meant for this column. */
> 
> 10. The following comment talks about handling less than operators for
> hash opfamilies, but the code only handles <> for btree and list
> partitioning.
> 
> * Handle some cases wherein the clause's operator may not
> * belong to the partitioning operator family.  For example,
> * operators named '<>' are not listed in any operator
> * family whatsoever.  Also, ordering opertors like '<' are
> * not listed in the hash operator family.
> 
> "opertors" should be spelled "operators"

Rewrote the comments here a bit.

> 11. In the following comment "operator" should be "operators":
> 
>  * are constrained by clauses containing equality operator, unless hash
> 
> Likewise for:
> 
>  * IS NULL clauses while remaining have clauses with equality operator.

Fixed.

> 12. The following code in classify_partition_bounding_keys probably
> should use EXPR_MATCHES_PARTKEY.
> 
> /* Does leftop match with this partition key column? */
> if ((IsA(arg, Var) &&
> ((Var *) arg)->varattno == partattno) ||
> equal(arg, partexpr))

Done.

> 13. The comment for classify_partition_bounding_keys does not seem to
> define well what the meaning of the return value is:
> 
>  * classify_partition_bounding_keys
>  * Classify partition clauses into equal, min, and max keys, along with
>  * any Nullness constraints and return that information in the output
>  * argument keys (number of keys is the return value)
> 
> I had thought all along that this must mean the number of distinct
> partition keys that we've found useful clauses for, but it's possible
> to fool this with duplicate IS NULL checks.
> 
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
> 
> explain select * from hp where a is null and a is null;
> 
> This causes classify_partition_bounding_keys to return 2. I can't see
> any bad side effects of this, but I think the comment needs to be
> improved. Perhaps it's better just to make the function return bool,
> returning true if there are any useful keys?

OK, I made classify_partition_bounding_keys() return a bool instead.

> 14. The switch statement in partition_op_strategy could be simplified
> a bit and be written more as:
> 
> case BTLessEqualStrategyNumber:
>   *incl = true;
>   /* fall through */
> case BTLessStrategyNumber:
>   result = PART_OP_LESS;
>   break;

That's better, done.

> 15. No default case statements in partition_op_strategy() could result
> in "result" not being set.

I added a default case, but it's simply an elog(ERROR, ...).

> 16. I'm unsure what the following comment means:
> 
> * To set eqkeys, we must have found the same for partition key columns.
> 
> The word "for" seems wrong, or I'm not sure what it wants to ensure it
> finds the same for.

I rewrote this comment a bit, since, like you, I'm no longer able to
make sense of it.  What I wanted to say here is that to set keys->eqkeys
at all, we must have found matching clauses containing equality operators
for all partition key columns.

> 17. Header comment for partition_op_strategy is out-of-date.
> 
> /*
>  * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=,
>  * =, and >/>= operator, respectively.  Sets *incl to true if equality is
>  * implied.
>  */
> 
> It'll never return -1.

Oops, fixed.

> 18. The following comment has some grammar issues:
> 
> /*
> * If couldn't coerce to the partition key type, that is, the type of
> * datums stored in PartitionBoundInfo, no hope of using this
> * expression for anything partitioning-related.
> */
> 
> Would be better with:
> 
> /*
> * If we couldn't coerce the partition key type, that is, the type
> * of datums stored in PartitionBoundInfo, then there's no hope of
> * using this expression for anything partitioning-related.
> */

OK, done.

> 19. In partkey_datum_from_expr() the Assert(false) at the end seems
> overkill. You could just get rid of the default in the switch
> statement and have it fall through to the final return. This would
> save 4 lines of code.

Actually, let's just get rid of the switch.  If I remove the default case
like you suggest, we get warnings about unhandled NodeTag enum values.

> 20. The word "and" I think needs removed from the following comment:
> 
> * Couldn't compare; keep hash_clause set to the previous value and
> * so add this one directly to the result.  Caller would

Actually, it seems better to keep the "and" and remove the "so".

> Probably also needs a comment after "value"

Done.

> 21. The following comment seems to indicate 'cur' is an operator, but
> it's a PartClause:
> 
> /* The code below is for btree operators, which cur is not. */
> 
> It might be better to write
> 
> /* The code below handles Btree operators which are not relevant to a
> hash-partitioned table. */

Agreed, done.

> 22. "Stuff" should be "The code" in:
> 
> * Stuff that follows closely mimics similar processing done by

Done.

> 23. In remove_redundant_clauses() and various other places, you have a
> variable named partattoff. What's the meaning of this name? I'd
> imagine it is short for "partition attribute offset", but it's a
> "partition key index", so why not "partkeyidx"?
>
> Of course, you might claim that an array index is just an offset, but
> it is a little confusing if you think of attribute offsets in a
> TupleDesc.

Hmm, OK.  Replaced partattoff with partkeyidx.

> 24. get_partitions_for_keys() you're using "return" and "break" in the
> switch statement. You can remove the breaks;

Oops, fixed.

> 25. The Assert(false) in get_partitions_for_keys seems overkill. I
> think it's fine to just do:
> 
> return NULL; /* keep compiler quiet */

Done, too.

> 26. The following code comment in get_partitions_for_keys_hash() does
> not seem very well written:
> 
> * Hash partitioning handles puts nulls into a normal partition and
> * doesn't require to define a special null-accpting partition.
> * Caller didn't count nulls as a valid key; do so ourselves.
> 
> Maybe "puts" should be "storing"?
> 
> Also, does hash partitioning actually support a NULL partition? This
> seems to say it doesn't require, but as far as I can see it does not
> *support* a NULL partition. The comment is a bit misleading.
> "accpting" should be "accepting".

I rewrote that comment.  I wanted to say that, unlike range and list
partitioning which have special handling for null values (there is only
one list/range partition at any given time that could contain nulls in the
partition key), hash partitioning does not.  All hash partitions could
contain nulls in one or more of the partition keys and hence we must
consider nulls as regular equality keys.

> 27. In get_partitions_for_keys_hash() why is it possible to get a
> result_index below 0? In that case, the code will end up triggering
> the Assert(false), but if you want to Assert something here then maybe
> it should be Assert(result_index >= 0)?
> 
> if (result_index >= 0)
> return bms_make_singleton(result_index);

I too used to think that it's impossible to get result_index < 0, but it's
actually possible, because not all required hash partitions may have been
defined yet:

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

If the user defines only the one partition shown above, then there is no
partition for the cases where the remainder for a given partition key
turns out to be 1, 2, or 3.

create table hashp (a int) partition by hash (a);
create table hashp0 partition of hashp for values with (modulus 4,
remainder 0);

insert into hashp values (1);
INSERT 0 1
explain select * from hashp where a = 1;
                          QUERY PLAN
--------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on hashp0  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 1)
(3 rows)

insert into hashp values (2);
ERROR:  no partition of relation "hashp" found for row
DETAIL:  Partition key of the failing row contains (a) = (2).

explain select * from hashp where a = 2;
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I guess I forgot to remove that Assert(false) when fixing the code after
realizing that it's indeed possible to not get a hash partition for some keys.

Removed the Assert.
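To illustrate the failure mode with a toy sketch (hypothetical array
layout and function name, not the actual PartitionBoundInfo code): hash
bound info can be thought of as an array indexed by remainder, with -1 in
slots whose partition hasn't been created, so a lookup must be allowed to
come back negative.

```c
/*
 * Hypothetical simplification of hash-partition lookup: indexes[] maps
 * each remainder (0 .. modulus-1) to a partition index, or -1 if no
 * partition with that remainder has been defined.
 */
static int
find_hash_partition(const int *indexes, int modulus, unsigned int hashvalue)
{
	return indexes[hashvalue % modulus];
}
```

With only the remainder-0 partition defined, indexes would be
{0, -1, -1, -1}: a hash value with remainder 0 finds partition 0, while
remainders 1 through 3 come back as -1, which the caller must treat as
"no partition" rather than assert against.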

> 28. Is there any point in the following loop in get_partitions_for_keys_list()?
> 
> /*
> * We might be able to get the answer sooner based on the nullness of
> * keys, so get that out of the way.
> */
> for (i = 0; i < partkey->partnatts; i++)
> {
> 
> Won't partkey->partnatts always be 1? Maybe you can Assert that
> instead, just in case someone forgets to change the code if LIST is to
> support multiple partition keys.

I guess that's a leftover from when all partition strategies were handled
in one function.  Got rid of the loop and added the Assert.

> 
> 29. In get_partitions_for_keys_list and get_partitions_for_keys_range
> partdesc is only used for an Assert. Maybe you can just:
> 
> Assert(RelationGetPartitionDesc(rel)->nparts > 0);

OK, done.

> 30. The Assert(false) in get_partitions_for_keys_list() looks
> suspiciously easy to hit...
> 
> create table lp (a int not null) partition by list(a);
> create table lp1 partition of lp for values in(1);
> explain select * from lp where a > 11; -- Assert fail!
> 
> 31. ranget?
> 
>  * get_partitions_for_keys_range
>  * Return partitions of a ranget partitioned table for requested keys

The Assert indeed was easy to hit.  Removed.

> 32. I'm not quite sure what this comment is saying:
> 
> /*
> * eqoff is gives us the bound that is known to be <= eqkeys,
> * given how partition_bound_bsearch works.  The bound at eqoff+1,
> * then, would be the upper bound of the only partition that needs
> * to be scanned.
> */

Rewrote the comment as:

      /*
       * The bound at eqoff is known to be <= eqkeys, given the way
       * partition_bound_bsearch works.  Considering the same as the lower
       * bound of the partition that eqkeys falls into, the bound at
       * eqoff + 1 would be its upper bound, so use eqoff + 1 to get the
       * desired partition's index.
       */

Any better?
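The invariant may be easier to see in a stripped-down sketch (hypothetical
code, not partition_bound_bsearch itself): the search returns the offset
of the greatest bound <= the probe, i.e. the lower bound of the range the
probe falls into, so offset + 1 addresses that range's upper bound.

```c
/*
 * Return the offset of the greatest element of bounds[] that is <= v,
 * or -1 if v precedes them all.  bounds[] must be sorted ascending.
 */
static int
bound_bsearch(const int *bounds, int nbounds, int v)
{
	int			lo = -1,
				hi = nbounds - 1;

	while (lo < hi)
	{
		int			mid = (lo + hi + 1) / 2;

		if (bounds[mid] <= v)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}
```

For bounds {0, 10, 20} describing ranges [0, 10) and [10, 20), a probe of
5 returns offset 0: bound 0 is the lower bound of the containing range,
and the bound at offset 0 + 1 (i.e. 10) is its upper bound, which is why
offset + 1 identifies the partition to scan.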

> 33. "one" -> "the one"
> 
> * Find the leftmost bound that satisfies the query, i.e., one that
> * satisfies minkeys.

Fixed.

> 34. Why "will" there be multiple partitions sharing the same prefix?
> 
> * If only a prefix of the whole partition key is provided, there will
> * be multiple partitions whose bound share the same prefix.  If minkey
> 
> Perhaps "will" should be "may"?

I guess you're right.

> 35. I think "would've" in the following comment is not correct. This
> seems to indicate that this has already happened, which it has not.
> "will have to be" might be more correct?
>
> * satisfy the query, but don't have a valid partition assigned.  The
> * default partition would've been included to cover those values.

"will have to be included" sounds correct to me too.

> 36. "will" -> "with"?
> 
> * Since partition keys will nulls are mapped to default range

Fixed.

> 37. "If no" -> "If there's no"
> 
> * If no tuple datum to compare with the bound, consider
> * the latter to be greater.

Fixed.

> 
> 38. I don't see anything about setting keynullness here:
> 
> /*
> * Get the clauses that match the partition key, including information
> * about any nullness tests against partition keys.  Set keynullness to
> * a invalid value of NullTestType, which 0 is not.
> */
> partclauses = match_clauses_to_partkey(root, rel,
>    list_copy(rel->baserestrictinfo),
>    &contains_const,
>    &constfalse);

I guess the comment is too ancient.  Fixed.

> 
> 39. "paritions" -> "partitions"
> 
> * Else there are no clauses that are useful to prune any paritions,

Fixed.

> 40. In match_clauses_to_partkey, the following code pulls the varnos
> from each operand of the expression. It would be better to just pull
> the side that's needed (if any) a bit later rather than always doing
> both.
> 
> left_relids = pull_varnos((Node *) leftop);
> right_relids = pull_varnos((Node *) rightop);

Ah, done.

> 41. I think a list_copy is missing here:
> 
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
> partitioned_rels = list_concat(partitioned_rels,
>    childrel->live_partitioned_rels);
> 
> Surely if we don't perform a list_copy of
> childrel->live_partitioned_rels then subsequent additions to the
> resulting list will inadvertently add new items to the
> childrel->live_partitioned_rels?

Yeah, I think what's in the patch now is clearly hazardous.  Added a
list_copy on childrel->live_partitioned_rels.
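The hazard generalizes to any linked list that is concatenated by
pointer-splicing.  A generic sketch (plain C, not PostgreSQL's actual
List implementation) of how appending to the concatenated result grows
the source list too:

```c
#include <stdlib.h>

typedef struct Node
{
	int			v;
	struct Node *next;
} Node;

/* append a value, returning the (possibly new) head */
static Node *
push_back(Node *head, int v)
{
	Node	   *n = malloc(sizeof(Node));
	Node	  **p = &head;

	n->v = v;
	n->next = NULL;
	while (*p)
		p = &(*p)->next;
	*p = n;
	return head;
}

/* concatenate by splicing: the result shares cells with 'b' */
static Node *
concat(Node *a, Node *b)
{
	Node	  **p = &a;

	while (*p)
		p = &(*p)->next;
	*p = b;
	return a;
}

static int
length(const Node *n)
{
	int			len = 0;

	for (; n; n = n->next)
		len++;
	return len;
}
```

If result = concat(other, child) is later extended with push_back, the
new cell hangs off child's chain as well, so child silently grows;
copying child's cells first (the analogue of list_copy) avoids that.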

> 42. Is this because of an existing bug?
> 
> @@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted
> where abs(b) = 5; -- scans all parti
>           Filter: (abs(b) = 5)
>     ->  Seq Scan on mcrparted3
>           Filter: (abs(b) = 5)
> +   ->  Seq Scan on mcrparted4
> +         Filter: (abs(b) = 5)
>     ->  Seq Scan on mcrparted5
>           Filter: (abs(b) = 5)

Actually, in constraint exclusion's eyes, mcrparted4's partition
constraint is refuted by abs(b) = 5.

With the new patch, we simply do not perform any partition pruning for
that clause, because we do not have a constraint on an earlier partition
key column.
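A sketch of that rule (hypothetical helper, not the patch's actual code):
with a multi-column range key, only clauses covering a leading,
contiguous prefix of the key columns can drive pruning, so if the first
column is unconstrained nothing can be pruned.

```c
#include <stdbool.h>

/*
 * usable[i] says whether some clause constrains partition key column i.
 * Only the leading run of constrained columns is usable for range
 * pruning; a length of 0 means every partition must be scanned.
 */
static int
prunable_prefix_len(const bool *usable, int natts)
{
	int			n = 0;

	while (n < natts && usable[n])
		n++;
	return n;
}
```

For mcrparted's key (a, abs(b)), the clause abs(b) = 5 constrains only
the second column, so the usable prefix length is 0 and all partitions,
mcrparted4 included, end up being scanned.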

> 43. In partition_prune.sql you have a mix of /* */ and -- comments.
> Please just use --

Oops, fixed.

On 2017/12/29 15:47, David Rowley wrote:
> Just a few extras that I found:
>
> 44. In match_clauses_to_partkey you're making use of
> estimate_expression_value(), I don't think this is safe.
>
> if (IsA(estimate_expression_value(root, rightop), Const))
> *contains_const = true;
>
> The only other places I see using this in the planner are for costing
> purposes. Also, the header comment for that function says it's not
> safe. Particularly "This effectively means that we plan using the
> first supplied value of the Param.". If that's the case, then if we're
> planning a generic plan, then wouldn't it be possible that the planner
> chooses the current supplied parameter value and prune away partitions
> based on that value. That would make the plan invalid for any other
> parameter, but it's meant to be a generic plan, so we can't do that.

You might be right.  Perhaps, I was thinking of eval_const_expressions()
there, which I guess should be enough for this purpose.

>
> 45. Why use a list_copy() here?
>
> /*
> * For a nested ArrayExpr, we don't know how to get the
> * actual scalar values out into a flat list, so we give
> * up doing anything with this ScalarArrayOpExpr.
> */
> if (arrexpr->multidims)
> continue;
>
> elem_exprs = list_copy(arrexpr->elements);

I guess I was just being paranoid there.  No need, so removed.

On 2018/01/10 7:35, David Rowley wrote:
> There's a small problem with get_partitions_for_keys_list(), the
> following case hits the Assert(false) at the bottom of that function.
>
> create table ab_c (a int not null, b char) partition by list(a);
> create table abc_a2 (b char, a int not null) partition by list(b);
> create table abc_a2_b3 partition of abc_a2 for values in ('3');
> alter table ab_c attach partition abc_a2 for values in (2);
>
> select * from ab_c where a between 1 and 2 and b <= '2';

I removed that Assert per your comment above (in an earlier email of yours).

On 2018/01/10 10:55, David Rowley wrote:
> One more thing I discovered while troubleshooting a bug Beena reported
> in the run-time partition pruning patch is that
> classify_partition_bounding_keys properly does;
>
> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
>     constexpr = rightop;
> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
>     constexpr = leftop;
> else
>     /* Clause not meant for this column. */
>     continue;
>
> for OpExpr clauses, but does not do the same for leftop for the
> ScalarArrayOpExpr test.

I'm not sure why we'd need to do that?  Does the syntax of clauses that
use a ScalarArrayOpExpr() allow them to have the partition key on RHS?
Can you point me to the email where Beena reported the problem in question?

On 2018/01/10 13:18, David Rowley wrote:
> Beena's tests on the run-time partition pruning patch also indirectly
> exposed a problem with this patch.
>
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there's only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it, then since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", then part1 ends up in the list twice.

It seems that I found the problem.  Currently, the set_append_rel_size()
step already accumulates the full list in the root partitioned table's
rel->live_partitioned_rels, that is, the list of RT indexes of *all*
partitioned relations in the tree.  Then, when add_paths_to_append_rel()
tries to accumulate child rel's live_partitioned_rels into the parent's,
duplication occurs, because the latter already contains all the entries as
compiled by the earlier step.  I think having only the latter do the
accumulation is better, because even the partition-wise join code needs
this facility, and it only ever calls add_paths_to_append_rel().


Attached updated version of the patch set containing fixes for almost all
the things mentioned above.


Thanks again for your thoughtful review comments and sorry that I couldn't
reply sooner.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/01/04 23:29, Ashutosh Bapat wrote:
> On Fri, Dec 29, 2017 at 6:32 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> I happened to notice that Ashutosh's patch series at
>> https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com
>> has a 0001 patch that modifies the partition_bound_cmp stuff too.
>> Are those conflicting?
>>
>> Ashutosh's commit message:
>>         Modify bound comparision functions to accept members of PartitionKey
>>
>>         Functions partition_bound_cmp(), partition_rbound_cmp() and
>>         partition_rbound_datum_cmp() are required to merge partition bounds
>>         from joining relations. While doing so, we do not have access to the
>>         PartitionKey of either relations. So, modify these functions to accept
>>         only required members of PartitionKey so that the functions can be
>>         reused for merging bounds.
>>
>> Amit's:
>>         Some interface changes for partition_bound_{cmp/bsearch}
>>
>>         Introduces a notion of PartitionBoundCmpArg, which replaces the set
>>         of arguments void *probe and bool probe_is_bound of both
>>         partition_bound_cmp and partition_bound_bsearch.  It wasn't possible
>>         before to specify the number of datums when a non-bound type of
>>         probe is passed.  Slightly tweaking the existing interface to allow
>>         specifying the same seems awkward.  So, instead encapsulate that
>>         into PartitionBoundCmpArg.  Also, modify partition_rbound_datum_cmp
>>         to compare caller-specifed number of datums, instead of
>>         key->partnatts datums.
>>
> 
> I haven't looked at Amit's changes, but we need a more flexible way to
> pass information required for datum comparison than using
> PartitionKey, since that's not available in the optimizer and can not
> be associated with join, aggregate relations. If we pass that
> information through a structure, there are two ways
> 1. it will need to be part of PartitionScheme; I am not sure if we can
> have a substructure in PartitionKey. But if we can do it that way, we
> can pass that structure to the functions.
> 2. we will need to construct the structure filling it with comparison
> information and pass it to the comparison functions. I think what we
> achieve out of this isn't worth the code we will need to add.
> 
> I would prefer first approach over the other.

ISTM that they're non-conflicting for the most part.  My patch is about
modifying the way to bring "datums" into partition_bound_cmp(), whereas
Ashutosh's is about modifying the way we bring the partition key
information.  Changes seem orthogonal to me, although, the patches
definitely won't like each other when applying to the tree.

Thanks,
Amit



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
Thanks for addressing that list.

Just one thing to reply on before I look at the updated version:

On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/10 10:55, David Rowley wrote:
>> One more thing I discovered while troubleshooting a bug Beena reported
>> in the run-time partition pruning patch is that
>> classify_partition_bounding_keys properly does;
>>
>> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
>>     constexpr = rightop;
>> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
>>     constexpr = leftop;
>> else
>>     /* Clause not meant for this column. */
>>     continue;
>>
>> for OpExpr clauses, but does not do the same for leftop for the
>> ScalarArrayOpExpr test.
>
> I'm not sure why we'd need to do that?  Does the syntax of clauses that
> use a ScalarArrayOpExpr() allow them to have the partition key on RHS?

No, but there's no test to ensure the leftop matches the partition key.

There's just:

ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
Oid saop_op = saop->opno;
Oid saop_opfuncid = saop->opfuncid;
Oid saop_coll = saop->inputcollid;
Node   *leftop = (Node *) linitial(saop->args),
   *rightop = (Node *) lsecond(saop->args);
List   *elem_exprs,
   *elem_clauses;
ListCell *lc1;
bool negated = false;

/*
* In case of NOT IN (..), we get a '<>', which while not
* listed as part of any operator family, we are able to
* handle the same if its negator is indeed a part of the
* partitioning operator family.
*/
if (!op_in_opfamily(saop_op, partopfamily))
{
Oid negator = get_negator(saop_op);
int strategy;
Oid lefttype,
righttype;


if (!OidIsValid(negator))
continue;
get_op_opfamily_properties(negator, partopfamily, false,
   &strategy,
   &lefttype, &righttype);
if (strategy == BTEqualStrategyNumber)
negated = true;
}

Since there's nothing to reject the clause that does not match the
partition key, the IN's left operand might be of any random type, and
may well not be in partopfamily, so when it comes to looking up
get_op_opfamily_properties() you'll hit: elog(ERROR, "operator %u is
not a member of opfamily %u", opno, opfamily);

Still looking at the v17 patch here, but I also don't see a test to
see if the IsBooleanOpfamily(partopfamily) is checking it matches the
partition key.

> Can you point me to the email where Beena reported the problem in question?

https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
> On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Can you point me to the email where Beena reported the problem in question?
>
> https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com

To save you from having to look at the run-time prune patch, here's a
case that breaks in v18.

create table xy (a int, b text) partition by range (a,b);
create table xy1 partition of xy for values from (0,'a') to (10, 'b');
select * from xy where a = 1 and b in('x','y');
ERROR:  operator 531 is not a member of opfamily 1976

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 2018/01/11 19:23, David Rowley wrote:
> On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/01/10 10:55, David Rowley wrote:
>>> One more thing I discovered while troubleshooting a bug Beena reported
>>> in the run-time partition pruning patch is that
>>> classify_partition_bounding_keys properly does;
>>>
>>> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
>>>     constexpr = rightop;
>>> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
>>>     constexpr = leftop;
>>> else
>>>     /* Clause not meant for this column. */
>>>     continue;
>>>
>>> for OpExpr clauses, but does not do the same for leftop for the
>>> ScalarArrayOpExpr test.
>>
>> I'm not sure why we'd need to do that?  Does the syntax of clauses that
>> use a ScalarArrayOpExpr() allow them to have the partition key on RHS?
> 
> No, but there's no test to ensure the leftop matches the partition key.
> 
> There's just:
> 
> ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
> Oid saop_op = saop->opno;
> Oid saop_opfuncid = saop->opfuncid;
> Oid saop_coll = saop->inputcollid;
> Node   *leftop = (Node *) linitial(saop->args),
>    *rightop = (Node *) lsecond(saop->args);
> List   *elem_exprs,
>    *elem_clauses;
> ListCell *lc1;
> bool negated = false;
> 
> /*
> * In case of NOT IN (..), we get a '<>', which while not
> * listed as part of any operator family, we are able to
> * handle the same if its negator is indeed a part of the
> * partitioning operator family.
> */
> if (!op_in_opfamily(saop_op, partopfamily))
> {
> Oid negator = get_negator(saop_op);
> int strategy;
> Oid lefttype,
> righttype;
> 
> 
> if (!OidIsValid(negator))
> continue;
> get_op_opfamily_properties(negator, partopfamily, false,
>    &strategy,
>    &lefttype, &righttype);
> if (strategy == BTEqualStrategyNumber)
> negated = true;
> }
> 
> Since there's nothing to reject the clause that does not match the
> partition key, the IN's left operand might be of any random type, and
> may well not be in partopfamily, so when it comes to looking up
> get_op_opfamily_properties() you'll hit: elog(ERROR, "operator %u is
> not a member of opfamily %u", opno, opfamily);

Ah, I completely missed that.  So we need something like the following in
this IsA(clause, ScalarArrayOpExpr) block:


+                /* Clause does not match this partition key. */
+                if (!EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))
+                    continue;
+

> Still looking at the v17 patch here, but I also don't see a test to
> see if the IsBooleanOpfamily(partopfamily) is checking it matches the
> partition key.

You're right.  Added checks there as well.

>> Can you point me to the email where Beena reported the problem in question?
> 
> https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com
>
> To save you from having to look at the run-time prune patch, here's
> case that break in v18.
>
> create table xy (a int, b text) partition by range (a,b);
> create table xy1 partition of xy for values from (0,'a') to (10, 'b');
> select * from xy where a = 1 and b in('x','y');
> ERROR:  operator 531 is not a member of opfamily 1976

You'll be able to see that the error no longer appears with the attached
updated set of patches, but I'm now seeing that the resulting plan with
the patches applied differs for this particular query from what master
(constraint exclusion) produces.  Master produces a plan with no
partitions (as one would think is the correct plan), whereas the patched
version produces a plan including the xy1 partition.  I will think about
that a bit and post something later.

Thanks,
Amit

Attachment
On 12 January 2018 at 15:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/11 19:23, David Rowley wrote:

>> ERROR:  operator 531 is not a member of opfamily 1976
>
> You'll be able to see that the error no longer appears with the attached
> updated set of patches, but I'm now seeing that the resulting plan with
> the patches applied differs for this particular query from what master
> (constraint exclusion) produces.  Master produces a plan with no
> partitions (as one would think is the correct plan), whereas the patched
> version produces a plan including the xy1 partition.  I will think about
> that a bit and post something later.

Thanks for looking at that.

I've got a few more things for you. I'm only partway through another
pass, but it makes sense to post what I have now if you're working on
a new version.

1. partitioing -> partitioning

 * Strategy of a partition clause operator per the partitioing operator class

2. get_partitions_from_clauses() modifies partclauses without
mentioning it in the header. I think you need to either:

a) warn about this in the header comment; or
b) do a list_copy() before list_concat()
c) do list_truncate back to the original length after you're done with the list.

3. get_partitions_from_clauses_recurse(), with:

  result = bms_add_range(result, 0, partdesc->nparts - 1);

You could change that to bms_add_range(NULL, ...) and ditch the
assignment of result to NULL at the start of the function.

4. classify_partition_bounding_keys() now returns bool, but the return
statement is still:

return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness;

my compiler didn't warn about that, but I'd imagine some might.

Instead, can you make it:

if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
    keys->n_maxkeys > 0 || n_keynullness > 0)
    return true;

return false;

Probably equal keys are the most likely case, so it'll be good to
short-circuit instead of performing addition on a bunch of values we
no longer care about.

5. In classify_partition_bounding_keys, why do we "continue" here?

clause = rinfo->clause;
if (rinfo->pseudoconstant &&
!DatumGetBool(((Const *) clause)->constvalue))
{
*constfalse = true;
continue;
}

Is there any point in searching further?

Also, if you were consistent with the return value of
classify_partition_bounding_keys when you've set *constfalse = true,
you wouldn't need to handle the case twice as you do in
get_partitions_from_clauses_recurse().

6. I think it would be nicer if get_partitions_from_ne_clauses returns
a set of partitions that could be excluded.

So instead of:

 * get_partitions_from_ne_clauses
 *
 * Return partitions of relation that satisfy all <> operator clauses in
 * ne_clauses.  Only ever called if relation is a list partitioned table.

Have:

 * get_partitions_from_ne_clauses
 *
 * Returns a Bitmapset of partitions that can be safely excluded due to
 * not-equal clauses existing for all possible partition values. It is only
 * valid to call this for LIST partitioned tables.

and instead of:

result = bms_add_range(NULL, 0, partdesc->nparts - 1);
result = bms_del_members(result, excluded_parts);
bms_free(excluded_parts);

return result;

Just do:

return excluded_parts;

and in get_partitions_from_clauses_recurse(), do bms_del_members
instead of bms_int_members.

There's less bit shuffling and it seems cleaner.  Perhaps the function
name would need to be changed if we're inverting the meaning too.

(I've attached a patch which makes this change along with an idea in #8 below)

7. The following comment claims the function sets *datum, but there's
no param by that name:

/*
 * partkey_datum_from_expr
 * Extract constant value from expr and set *datum to that value
 */
static bool
partkey_datum_from_expr(PartitionKey key, int partkeyidx,
Expr *expr, Datum *value)

8. The code in get_partitions_from_ne_clauses() performs quite a few
nested loops. I think a simpler way would be to track the offsets
you've seen in a Bitmapset. This would save you having to check for
duplicates, as an offset can only contain a single datum.
You'd just need to build a couple of arrays after that, one to sum up
the offsets found per partition, and one for the total datums allowed
in the partition. If the numbers match then you can remove the
partition.

I've written this and attached it to this email. It saves about 50
lines of code and should perform much better for complex cases, for
example, a large NOT IN list. This also implements #6.

9. "the same" -> "it"

/*
* In case of NOT IN (..), we get a '<>', which while not
* listed as part of any operator family, we are able to
* handle the same if its negator is indeed a part of the
* partitioning operator family.
*/

10. in classify_partition_bounding_keys: "0" -> "false"

/* Return if no work to do below. */
if (!will_compute_keys || *constfalse)
return 0;

Likewise for:

if (*constfalse)
return 0;

11. I don't see partition_bound_bsearch used anywhere below the
following comment:

* Generate bounding tuple(s).
*
* We look up partitions in the partition bound descriptor using, say,
* partition_bound_bsearch(), which expects a Datum (or Datums if multi-
* column key).  So, extract the same out of the constant argument of
* each clause.

I also don't know what the comment is trying to say.

12.

 * operator and sets *incl if equality is implied

should be:

 * operator and set *incl to true if the operator's strategy is inclusive.

13. What does "the same" mean in:

* and add this one directly to the result.  Caller would
* arbitrarily choose one of the many and perform
* partition-pruning with the same.  It's possible that mutual

I think you quite often use "the same" to mean "it". Can you change that?

14. Not sure what parameter you're talking about here.

 * Evaluate 'leftarg op rightarg' and set *result to its value.
 *
 * leftarg and rightarg referred to above actually refer to the constant
 * operand (Datum) of the clause contained in the parameters leftarg and
 * rightarg below, respectively.  And op refers to the operator of the
 * clause contained in the parameter op below.

15. "the latter" is normally used when you're referring to the last
thing in a list which was just mentioned. In this case, leftarg_const
and rightarg_const is the list, so "the latter" should mean
rightarg_const, but I think you mean to compare them using the
operator.

* If the leftarg_const and rightarg_const are both of the type expected
* by op's operator, then compare them using the latter.

16. There are a few things to improve with the following comment:

/*
* Hash partitioning stores partition keys containing nulls in regular
* partitions.  That is, the code that determines the hash partition for
* a given row admits nulls in the partition key when computing the key's
* hash.  So, here we treat any IS NULL clauses on partition key columns as
* equality keys, along with any other non-null values coming from equality
* operator clauses.
*/

"admits" is not the correct word here, and "hash" should be "correct",
but there are more mistakes, so it might be easier just to rewrite it to:

/*
 * Since tuples with NULL values in the partition key columns are
 * stored in regular partitions, we'll treat any IS NULL clauses
 * here as regular equality clauses.
 */

17. The following example will cause get_partitions_for_keys_hash to misbehave:

create table hashp (a int, b int) partition by hash (a, b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
explain select * from hashp where a = 1 and a is null;

The following code assumes that you'll never get a NULL test for a key
that has an equality test, and ends up trying to prune partitions
thinking we got compatible clauses for both partition keys.

  memset(keyisnull, false, sizeof(keyisnull));
  for (i = 0; i < partkey->partnatts; i++)
  {
    if (bms_is_member(i, keys->keyisnull))
    {
      keys->n_eqkeys++;
      keyisnull[i] = true;
    }
  }

  /*
   * Can only do pruning if we know all the keys and they're all equality
   * keys including the nulls that we just counted above.
   */
  if (keys->n_eqkeys == partkey->partnatts)

The above code will need to be made smarter. It'll likely crash if you
change "b" to a pass-by-ref type.
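One way to make it smarter (a sketch under the assumption that the
constrained key columns are tracked as bitmasks; this is not code from
the patch): a column that has both an equality clause and an IS NULL test
is self-contradictory, so the clause set should be detected as
constant-false rather than double-counted toward n_eqkeys.

```c
#include <stdbool.h>

/*
 * eq_mask has bit i set if key column i has an equality clause;
 * null_mask has bit i set if it has an IS NULL test.  Any overlap
 * means "col = const AND col IS NULL", which no row can satisfy.
 */
static bool
keys_contradict(unsigned int eq_mask, unsigned int null_mask)
{
	return (eq_mask & null_mask) != 0;
}
```

For the example query, a = 1 and a is null both touch key column 0, so
the masks overlap, the whole restriction is constant-false, and no
partitions need scanning at all.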

18. The following code:

int other_idx = -1;

/*
* Only a designated partition accepts nulls, which if there
* exists one, return the same.
*/
if (partition_bound_accepts_nulls(boundinfo) ||
partition_bound_has_default(boundinfo))
other_idx = partition_bound_accepts_nulls(boundinfo)
? boundinfo->null_index
: boundinfo->default_index;
if (other_idx >= 0)
return bms_make_singleton(other_idx);
else
return NULL;

 should be simplified to:

 /*
  * NULLs may only exist in the NULL partition, or in the
  * default, if there's no NULL partition.
  */
 if (partition_bound_accepts_nulls(boundinfo))
return bms_make_singleton(boundinfo->null_index);
 else if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
 return NULL;

19. "exists" -> "are"

* If there are no datums to compare keys with, but there exist
* partitions, it must be the default partition.

Also, instead of writing "it must be the default partition.", it would
be better to say "just return the default partition."

20. I don't think the return NULL can ever be hit; is it worth putting
in a comment to say /* shouldn't happen */?

if (boundinfo->ndatums == 0)
{
if (partition_bound_has_default(boundinfo))
return bms_make_singleton(boundinfo->default_index);
else
return NULL;
}

21. The following comment does not explain the situation well:

/*
* boundinfo->ndatums - 1 is the last valid list partition datums
* index.
*/

There's really no possible non-default partition for this case, so
perhaps we should just return the default, if one exists. We do go on
to check the n_maxkeys needlessly for this case. At the very least the
comment should be changed to:

 /*
  * minkeys values are greater than any non-default partition.
  * We'll check that for case below.
  */

but I think it's worth just doing the default partition check there
and returning it, or NULL. It should help reduce confusion.


Can you also perform a self-review of the patch? Some of the things
I'm picking up are leftovers from a previous version of the patch. We
might never get through this review if you keep leaving those around!

I won't continue reviewing again until next week, so don't rush.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
David,

On 2018/01/12 12:30, David Rowley wrote:
> Can you also perform a self-review of the patch? Some of the things
> I'm picking up are leftovers from a previous version of the patch. We
> might never get through this review if you keep leaving those around!

Sorry, I will look more closely before posting the next version.  I guess
I may have rushed a bit too much when posting the v18/v19 patches, partly
because it's been 3 weeks since v17 and I felt I needed to catch up
quickly given the activity on the run-time pruning thread which depends on
the patches here.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Basically, the changes to add_paths_to_append_rel() are causing
> duplication in partition_rels.
>
> A test case is:
>
> create table part (a int, b int) partition by list(a);
> create table part1 partition of part for values in(1) partition by list (b);
> create table part2 partition of part1 for values in(1);
>
> select * from part;
>
> partition_rels ends up with 3 items in the list, but there are only 2
> partitions here. The reason for this is that, since planning here is
> recursively calling add_paths_to_append_rel, the list for part ends up
> with itself and part1 in it; then, since part1's list already contains
> itself, per set_append_rel_size's "rel->live_partitioned_rels =
> list_make1_int(rti);", part1 ends up in the list twice.
>
> It would be nicer if you could use a Relids for this, but you'd also
> need some way to store the target partition relation since
> nodeModifyTable.c does:
>
> /* The root table RT index is at the head of the partitioned_rels list */
> if (node->partitioned_rels)
> {
>     Index root_rti;
>     Oid root_oid;
>
>     root_rti = linitial_int(node->partitioned_rels);
>     root_oid = getrelid(root_rti, estate->es_range_table);
>     rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
> }
>
> You could also fix it by instead of doing:
>
> /*
> * Accumulate the live partitioned children of this child, if it's
> * itself partitioned rel.
> */
> if (childrel->part_scheme)
>     partitioned_rels = list_concat(partitioned_rels,
>        childrel->live_partitioned_rels);
>
> do something along the lines of:
>
> if (childrel->part_scheme)
> {
>     ListCell *lc;
>     ListCell *start = lnext(list_head(childrel->live_partitioned_rels));
>
>     for_each_cell(lc, start)
>        partitioned_rels = lappend_int(partitioned_rels,
>        lfirst_int(lc));
> }
>
> Although it seems pretty fragile. It would probably be better to find
> a nicer way of handling all this.

Hi Amit,

I also noticed earlier that this is still broken in v19.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi.

On 2018/01/12 18:09, David Rowley wrote:
> On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Basically, the changes to add_paths_to_append_rel() are causing
>> duplication in partition_rels.

[ ... ]

> I also noticed earlier that this is still broken in v19.

I cannot see the duplication here (with v19 + some local changes per your
latest review, although I had fixed the issue in v18).

create table part (a int, b int) partition by list(a);
create table part1 partition of part for values in (1) partition by list (b);
create table part2 partition of part1 for values in (1);

select * from part;

For the above query, I set a breakpoint all the way in ExecInitAppend() to
see what partitioned_rels list it ends up with and I see no duplication:

   :partitioned_rels (i 1 3)

where 1 and 3 are RT indexes of part and part1, respectively.

With v17, you'd be able to see the duplication:

   :partitioned_rels (i 1 3 3)

Let me confirm: were you complaining about exactly this duplication, that
the RT index of part1 appears twice, due to the bug I claim I fixed in
v18?  Or something else?

Thanks,
Amit



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 12 January 2018 at 22:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/12 18:09, David Rowley wrote:
>> On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> Basically, the changes to add_paths_to_append_rel() are causing
>>> duplication in partition_rels.
>
>> I also noticed earlier that this is still broken in v19.
>
> I cannot see the duplication here (with v19 + some local changes per your
> latest review, although I had fixed the issue in v18).

I may have made a mistake there. The code I expected to change didn't.
I meant to test the case again, but I got distracted just before I did
and came back a while later and forgot that I hadn't tested.

If you've tested my case and it works, then please don't look any
further. I will look when v20 is ready.

Sorry for the false alarm... I must've been trying to do too many
things at once :-(

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On Thu, Jan 11, 2018 at 10:30 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Instead, can you make it:
>
> if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
>     keys->n_maxkeys > 0 || n_keynullness > 0)
>     return true;
>
> return false;

Or better yet:

return (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
    keys->n_maxkeys > 0 || n_keynullness > 0);

It's not really necessary to write if (some condition is true) return
true; return false when you can just write return (boolean-valued
condition).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Hi David.

Thanks for the review.

On 2018/01/12 12:30, David Rowley wrote:
> I've got a few more things for you. I'm only partway through another
> pass, but it makes sense to post what I have now if you're working on
> a new version.
> 
> 1. partitioing -> partitioning
> 
>  * Strategy of a partition clause operator per the partitioing operator class

Fixed.

> 2. get_partitions_from_clauses() modifies partclauses without
> mentioning it in the header. I think you need to either:
> 
> a) warn about this in the header comment; or
> b) do a list_copy() before list_concat()
> c) do list_truncate back to the original length after you're done with the list.

Went with (b).

> 3. get_partitions_from_clauses_recurse(), with:
> 
>   result = bms_add_range(result, 0, partdesc->nparts - 1);
> 
> You could change that to bms_add_range(NULL, ...) and ditch the
> assignment of result to NULL at the start of the function.

Done.

> 4. classify_partition_bounding_keys() now returns bool, but the return
> statement is still:
> 
> return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness;
> 
> my compiler didn't warn about that, but I'd imagine some might.

Oops, my bad.

> 
> Instead, can you make it:
> 
> if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 ||
>     keys->n_maxkeys > 0 || n_keynullness > 0)
>     return true;
> 
> return false;
>
> probably equal keys are the most likely case, so it'll be good to
> short circuit instead of performing addition on a bunch of stuff we
> don't care about anymore.

Changed it to what Robert suggested downthread.

> 5. In classify_partition_bounding_keys, why do we "continue" here?
> 
> clause = rinfo->clause;
> if (rinfo->pseudoconstant &&
> !DatumGetBool(((Const *) clause)->constvalue))
> {
> *constfalse = true;
> continue;
> }
> 
> Is there any point in searching further?
> 
> Also, if you were consistent with the return value for
> classify_partition_bounding_keys when you've set *constfalse = true;
> you wouldn't need to handle the case twice like you are in
> get_partitions_from_clauses_recurse().

OK, I made classify_partition_bounding_keys() return true whenever it sets
*constfalse to true.

> 6. I think it would be nicer if get_partitions_from_ne_clauses returns
> a set of partitions that could be excluded.
> 
> So instead of:
> 
>  * get_partitions_from_ne_clauses
>  *
>  * Return partitions of relation that satisfy all <> operator clauses in
>  * ne_clauses.  Only ever called if relation is a list partitioned table.
> 
> Have:
> 
>  * get_partitions_from_ne_clauses
>  *
>  * Returns a Bitmapset of partitions that can be safely excluded due to
>  * not-equal clauses existing for all possible partition values. It is only
>  * valid to call this for LIST partitioned tables.
> 
> and instead of:
> 
> result = bms_add_range(NULL, 0, partdesc->nparts - 1);
> result = bms_del_members(result, excluded_parts);
> bms_free(excluded_parts);
> 
> return result;
> 
> Just do:
> 
> return excluded_parts;
> 
> and in get_partitions_from_clauses_recurse(), do bms_del_members
> instead of bms_int_members.
> 
> there's less bit shuffling and it seems cleaner. Perhaps the function
> name would need to be changed if we're inverting the meaning too.
> 
> (I've attached a patch which makes this change along with an idea in #8 below)

Thanks for the suggestions... (comment continues below)
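To illustrate the difference in item #6 with plain unsigned masks standing in
for Bitmapsets (a toy sketch, not the real bms_* API):

```c
#include <assert.h>

/*
 * Toy illustration of item #6: rather than have the helper build the set
 * of surviving partitions and intersect it into the result, have it
 * return the excluded partitions and subtract them from the result.
 * Plain unsigned masks stand in for Bitmapsets here; the real code would
 * use bms_int_members and bms_del_members respectively.
 */
static unsigned
prune_by_intersect(unsigned result, unsigned surviving)
{
    return result & surviving;      /* bms_int_members analogue */
}

static unsigned
prune_by_delete(unsigned result, unsigned excluded)
{
    return result & ~excluded;      /* bms_del_members analogue */
}
```

The delete form avoids building the full-range set and complementing it first,
which is the "less bit shuffling" being suggested.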

> 7. The following comment claims the function sets *datum, but there's
> no param by that name:
> 
> /*
>  * partkey_datum_from_expr
>  * Extract constant value from expr and set *datum to that value
>  */
> static bool
> partkey_datum_from_expr(PartitionKey key, int partkeyidx,
> Expr *expr, Datum *value)

Fixed.

> 8. The code in get_partitions_from_ne_clauses() does perform quite a
> few nested loops. I think a simpler way would be to track the
> offsets you've seen in a Bitmapset. This would save you having to
> check for duplicates, as an offset can only contain a single datum.
> You'd just need to build a couple of arrays after that, one to sum up
> the offsets found per partition, and one for the total datums allowed
> in the partition. If the numbers match then you can remove the
> partition.
> 
> I've written this and attached it to this email. It saves about 50
> lines of code and should perform much better for complex cases, for
> example, a large NOT IN list. This also implements #6.

I liked your patch, so I incorporated it, except that I feel slightly
uncomfortable about the new name that you chose for the function, because
it sounds a bit generic.  I mean, the function solves a very specific
problem and has very strict requirements for calling it.  It's not like
we could pass it just any partitioned relation and/or just any set of
clauses.  It has to be a list-partitioned table, and the list of clauses
must contain only clauses containing compatible <> operators.  Checks
for those requirements are carried out in yet another place, that is,
classify_partition_bounding_keys().

Perhaps we can live with that though, because it's not a publicly
available function, but someone might get confused in the future.
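To make the offset-tracking scheme from item #8 concrete, here is a toy,
self-contained sketch (plain arrays and an unsigned bitmask stand in for the
real PartitionBoundInfo and Bitmapset; this is not the patch's actual code):

```c
#include <assert.h>

#define MAX_DATUMS 16
#define MAX_PARTS  8

/* Toy stand-in for the list partition bound descriptor: datum_vals[i]
 * is the i'th list bound value and datum_part[i] the partition index it
 * routes to. */
typedef struct
{
    int ndatums;
    int datum_vals[MAX_DATUMS];
    int datum_part[MAX_DATUMS];
    int nparts;
} ToyBoundInfo;

/*
 * Return a bitmask of partitions all of whose accepted values appear
 * among the <> constants, i.e. partitions that can be excluded.  Seen
 * bound offsets are tracked in a bitmask, so duplicate <> constants are
 * counted only once; an offset can only hold a single datum.
 */
static unsigned
excluded_by_ne_consts(const ToyBoundInfo *bi, const int *ne_consts, int n_ne)
{
    unsigned seen_offsets = 0;
    int datums_in_part[MAX_PARTS] = {0};
    int excluded_in_part[MAX_PARTS] = {0};
    unsigned result = 0;

    /* total datums allowed per partition */
    for (int i = 0; i < bi->ndatums; i++)
        datums_in_part[bi->datum_part[i]]++;

    /* mark the offset of each <> constant; duplicates collapse */
    for (int k = 0; k < n_ne; k++)
        for (int i = 0; i < bi->ndatums; i++)
            if (bi->datum_vals[i] == ne_consts[k])
                seen_offsets |= 1u << i;

    /* sum up excluded offsets per partition */
    for (int i = 0; i < bi->ndatums; i++)
        if (seen_offsets & (1u << i))
            excluded_in_part[bi->datum_part[i]]++;

    /* a partition is excludable if every one of its datums was seen */
    for (int p = 0; p < bi->nparts; p++)
        if (datums_in_part[p] > 0 && excluded_in_part[p] == datums_in_part[p])
            result |= 1u << p;

    return result;
}
```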

> 9. "the same" -> "it"
> 
> /*
> * In case of NOT IN (..), we get a '<>', which while not
> * listed as part of any operator family, we are able to
> * handle the same if its negator is indeed a part of the
> * partitioning operator family.
> */

Done.

> 10. in classify_partition_bounding_keys: "0" -> "false"
> 
> /* Return if no work to do below. */
> if (!will_compute_keys || *constfalse)
> return 0;
> 
> Likewise for:
> 
> if (*constfalse)
> return 0;

Have fixed these per an earlier comment in this email.

> 11. I don't see partition_bound_bsearch used anywhere below the
> following comment:
> 
> * Generate bounding tuple(s).
> *
> * We look up partitions in the partition bound descriptor using, say,
> * partition_bound_bsearch(), which expects a Datum (or Datums if multi-
> * column key).  So, extract the same out of the constant argument of
> * each clause.
> 
> I also don't know what the comment is trying to say.

The comment is no longer very intelligible to me either.  I just wanted to
say here that, *elsewhere*, we will use a function like
partition_bound_bsearch() to look up partitions from the clauses we
matched against individual partition key columns.  That function expects
the lookup key to be in a Datum array form, not a list-of-clauses form.
So, we must construct the lookup key(s) by extracting constant values out
of the clauses.

I tried to rewrite it that way.  Hope that's a bit clearer.

> 12.
> 
>  * operator and sets *incl if equality is implied
> 
> should be:
> 
>  * operator and set *incl to true if the operator's strategy is inclusive.

Done that way.

> 13. What does "the same" mean in:
> 
> * and add this one directly to the result.  Caller would
> * arbitrarily choose one of the many and perform
> * partition-pruning with the same.  It's possible that mutual

It means "the one that the caller would arbitrarily choose from the many
that this function will return to it".  Anyway, I changed "the same" to "it".

> I think you quite often use "the same" to mean "it". Can you change that?

I guess that's just one of my many odd habits when writing English, all of
which I'm trying to get rid of, but apparently with limited success.  Must
try harder. :)

> 14. Not sure what parameter you're talking about here.
> 
>  * Evaluate 'leftarg op rightarg' and set *result to its value.
>  *
>  * leftarg and rightarg referred to above actually refer to the constant
>  * operand (Datum) of the clause contained in the parameters leftarg and
>  * rightarg below, respectively.  And op refers to the operator of the
>  * clause contained in the parameter op below.

I rewrote the above comment block as:

 * Try to compare the constant arguments of 'leftarg' and 'rightarg', in that
 * order, using the operator of 'op' and set *result to the result of this
 * comparison.

Is that any better?

> 15. "the latter" is normally used when you're referring to the last
> thing in a list which was just mentioned. In this case, leftarg_const
> and rightarg_const is the list, so "the latter" should mean
> rightarg_const, but I think you mean to compare them using the
> operator.
> 
> * If the leftarg_const and rightarg_const are both of the type expected
> * by op's operator, then compare them using the latter.

Rewrote it as:

     * We can compare leftarg_const and rightarg_const using op's operator
     * only if both are of the type expected by it.

> 16. There are a few things to improve with the following comment:
> 
> /*
> * Hash partitioning stores partition keys containing nulls in regular
> * partitions.  That is, the code that determines the hash partition for
> * a given row admits nulls in the partition key when computing the key's
> * hash.  So, here we treat any IS NULL clauses on partition key columns as
> * equality keys, along with any other non-null values coming from equality
> * operator clauses.
> */
> 
> "admits" is not the correct word here, and "hash" should be "correct",
> but there are more mistakes, so might be easier just to rewrite to:
> 
> /*
> * Since tuples with NULL values in the partition key columns are
> stored in regular partitions,
> * we'll treat any IS NULL clauses here as regular equality clauses.
> */

Agreed that your version is better, so went with it.

> 17. The following example will cause get_partitions_for_keys_hash to misbehave:
> 
> create table hashp (a int, b int) partition by hash (a, b);
> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
> create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
> create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
> explain select * from hashp where a = 1 and a is null;
> 
> The following code assumes that you'll never get a NULL test for a key
> that has an equality test, and ends up trying to prune partitions
> thinking we got compatible clauses for both partition keys.

Yeah, I think this example helps spot a problem.  I thought we'd never get
to get_partitions_for_keys_hash() for the above query, because we
should've been able to prove much earlier that the particular clause
combination is always false (a cannot be both non-null 1 and null).
Now, because the planner itself doesn't substitute a constant-false for
that, I taught classify_partition_bounding_keys() to do so.  It would now
return constfalse=true if it turns out that the clauses in the input list
lead to a contradictory nullness condition for a given partition column.
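In essence, the new check reduces to something like the following
self-contained sketch (the struct and flags are hypothetical simplifications,
not the fields the patch actually uses):

```c
#include <assert.h>
#include <stdbool.h>

#define NKEYS 4

/* Hypothetical simplified bookkeeping: for each partition key column,
 * record whether the clauses include an equality comparison against a
 * non-null constant, an IS NULL test, or an IS NOT NULL test. */
typedef struct
{
    bool has_eq[NKEYS];
    bool is_null[NKEYS];
    bool is_not_null[NKEYS];
} KeyNullness;

/*
 * Return true (i.e. report "constfalse") if the clauses imply a
 * contradictory nullness condition for some key column, e.g.
 * "a = 1 AND a IS NULL".
 */
static bool
nullness_contradiction(const KeyNullness *k)
{
    for (int i = 0; i < NKEYS; i++)
    {
        if (k->is_null[i] && (k->has_eq[i] || k->is_not_null[i]))
            return true;
    }
    return false;
}
```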

>   memset(keyisnull, false, sizeof(keyisnull));
>   for (i = 0; i < partkey->partnatts; i++)
>   {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>       keys->n_eqkeys++;
>       keyisnull[i] = true;
>     }
>   }
> 
>   /*
>    * Can only do pruning if we know all the keys and they're all equality
>    * keys including the nulls that we just counted above.
>    */
>   if (keys->n_eqkeys == partkey->partnatts)
> 
> The above code will need to be made smarter. It'll likely crash if you
> change "b" to a pass-by-ref type.

Hmm, not sure why.  It seems to work:

create table hp (a int, b text) partition by hash (a, b);
create table hp1 partition of hp for values with (modulus 4, remainder 0);
create table hp2 partition of hp for values with (modulus 4, remainder 1);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp4 partition of hp for values with (modulus 4, remainder 2);

insert into hp values (1, 'xxx');
INSERT 0 1

select tableoid::regclass, * from hp;
 tableoid | a |  b
----------+---+-----
 hp1      | 1 | xxx
(1 row)

insert into hp (a) values (1);
INSERT 0 1

insert into hp (b) values ('xxx');
INSERT 0 1

select tableoid::regclass, * from hp where a is null;
 tableoid | a |  b
----------+---+-----
 hp2      |   | xxx
(1 row)

select tableoid::regclass, * from hp where b is null;
 tableoid | a | b
----------+---+---
 hp1      | 1 |
(1 row)

select tableoid::regclass, * from hp where a = 1 and b is null;
 tableoid | a | b
----------+---+---
 hp1      | 1 |
(1 row)

select tableoid::regclass, * from hp where a is null and b = 'xxx';
 tableoid | a |  b
----------+---+-----
 hp2      |   | xxx
(1 row)

> 18. The following code:
> 
> int other_idx = -1;
> 
> /*
> * Only a designated partition accepts nulls, which if there
> * exists one, return the same.
> */
> if (partition_bound_accepts_nulls(boundinfo) ||
> partition_bound_has_default(boundinfo))
> other_idx = partition_bound_accepts_nulls(boundinfo)
> ? boundinfo->null_index
> : boundinfo->default_index;
> if (other_idx >= 0)
> return bms_make_singleton(other_idx);
> else
> return NULL;
> 
>  should be simplified to:
> 
>  /*
>   * NULLs may only exist in the NULL partition, or in the
>   * default, if there's no NULL partition.
>   */
>  if (partition_bound_accepts_nulls(boundinfo))
> return bms_make_singleton(boundinfo->null_index);
>  else if (partition_bound_has_default(boundinfo))
> return bms_make_singleton(boundinfo->default_index);
>  return NULL;

Agreed, done that way.

> 19. "exists" -> "are"
> 
> * If there are no datums to compare keys with, but there exist
> * partitions, it must be the default partition.
> 
> also, instead of writing "it must be the default partition." it would
> be better to say "just return the default partition."

OK, done.

> 20. I don't think the return NULL should ever be hit; is it worth putting
> in a comment to say /* shouldn't happen */?
> 
> if (boundinfo->ndatums == 0)
> {
> if (partition_bound_has_default(boundinfo))
> return bms_make_singleton(boundinfo->default_index);
> else
> return NULL;
> }

I added a /* shouldn't happen */ comment next to return NULL.

> 21. The following comment does not explain the situation well:
> 
> /*
> * boundinfo->ndatums - 1 is the last valid list partition datums
> * index.
> */
> 
> There's really no possible non-default partition for this case, so
> perhaps we should just return the default, if one exists. We do go on
> to check the n_maxkeys needlessly for this case. At the very least the
> comment should be changed to:
> 
>  /*
>   * minkeys values are greater than any non-default partition.
>   * We'll check for that case below.
>   */
> 
> but I think it's worth just doing the default partition check there
> and returning it, or NULL. It should help reduce confusion.

Yep, done.


Attached v20.  Thanks again.

Regards,
Amit

Attachment
On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/12 12:30, David Rowley wrote:
>> 8. The code in get_partitions_from_ne_clauses() does perform quite a
>> few nested loops. I think a simpler way would be to track the
>> offsets you've seen in a Bitmapset. This would save you having to
>> check for duplicates, as an offset can only contain a single datum.
>> You'd just need to build a couple of arrays after that, one to sum up
>> the offsets found per partition, and one for the total datums allowed
>> in the partition. If the numbers match then you can remove the
>> partition.
>>
>> I've written this and attached it to this email. It saves about 50
>> lines of code and should perform much better for complex cases, for
>> example, a large NOT IN list. This also implements #6.
>
> I liked your patch, so I incorporated it, except that I feel slightly
> uncomfortable about the new name that you chose for the function, because
> it sounds a bit generic.

You're right. I only renamed it because I inverted the meaning of the
function in the patch. It no longer did
"get_partitions_from_ne_clauses"; it did the opposite and gave the
partitions which can't match. Please feel free to think of a new
better name. Is "get_partitions_excluded_by_ne_clauses" too long?

>> I think you quite often use "the same" to mean "it". Can you change that?
>
> I guess that's just one of my many odd habits when writing English, all of
> which I'm trying to get rid of, but apparently with limited success.  Must
> try harder. :)

Oops, on re-reading that it sounded as though I was asking you to
change some habit, but I just meant the comments. I understand there
will be places that use English where that's normal. It's just I don't
recall seeing that in PostgreSQL code before. American English is
pretty much the standard for the project, despite that not always
being strictly applied (e.g. we have a command called ANALYSE which is
an alias for ANALYZE). I always try and do my best to spell words in
American English (which is not where I'm from), which for me stretches
about as far as putting 'z' in the place of some of my 's'es.

> I rewrote the above comment block as:
>
>  * Try to compare the constant arguments of 'leftarg' and 'rightarg', in that
>  * order, using the operator of 'op' and set *result to the result of this
>  * comparison.
>
> Is that any better?

Sounds good.

>
>> 15. "the latter" is normally used when you're referring to the last
>> thing in a list which was just mentioned. In this case, leftarg_const
>> and rightarg_const is the list, so "the latter" should mean
>> rightarg_const, but I think you mean to compare them using the
>> operator.
>>
>> * If the leftarg_const and rightarg_const are both of the type expected
>> * by op's operator, then compare them using the latter.
>
> Rewrote it as:
>
>      * We can compare leftarg_const and rightarg_const using op's operator
>      * only if both are of the type expected by it.

I'd probably write "expected type." instead of "type expected by it."

>> 17. The following example will cause get_partitions_for_keys_hash to misbehave:
>>
>> create table hashp (a int, b int) partition by hash (a, b);
>> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
>> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
>> create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
>> create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
>> explain select * from hashp where a = 1 and a is null;
>>
>> The following code assumes that you'll never get a NULL test for a key
>> that has an equality test, and ends up trying to prune partitions
>> thinking we got compatible clauses for both partition keys.
>
> Yeah, I think this example helps spot a problem.  I thought we'd never get
> to get_partitions_for_keys_hash() for the above query, because we
> should've been able to prove much earlier that the particular clause
> combination is always false (a cannot be both non-null 1 and null).
> Now, because the planner itself doesn't substitute a constant-false for
> that, I taught classify_partition_bounding_keys() to do so.  It would now
> return constfalse=true if it turns out that the clauses in the input list
> lead to a contradictory nullness condition for a given partition column.
>
>>   memset(keyisnull, false, sizeof(keyisnull));
>>   for (i = 0; i < partkey->partnatts; i++)
>>   {
>>     if (bms_is_member(i, keys->keyisnull))
>>     {
>>       keys->n_eqkeys++;
>>       keyisnull[i] = true;
>>     }
>>   }
>>
>>   /*
>>    * Can only do pruning if we know all the keys and they're all equality
>>    * keys including the nulls that we just counted above.
>>    */
>>   if (keys->n_eqkeys == partkey->partnatts)
>>
>> The above code will need to be made smarter. It'll likely crash if you
>> change "b" to a pass-by-ref type.
>
> Hmm, not sure why.  It seems to work:

Yeah, works now because you've added new code to test for
contradictions in the quals, e.g. a = 1 and a is null is now rejected
as constfalse.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached v20.  Thanks again.

Thanks for working on v20. I've had a look over part of it and I've
written down the following:

1. The following comment is not correct

/*
* Equality look up key.  Values in the following array appear in no
* particular order (unlike minkeys and maxkeys below which must appear in
* the same order as the partition key columns).

These must be in partition key order, just like the others.

This part is not true either:

* the same order as the partition key columns).  n_eqkeys must be equal to
* the number of partition keys to be valid (except in the case of hash
* partitioning where that's not required).  When set, minkeys and maxkeys
* are ignored.

rangep2 is pruned just fine from the following:

create table rangep (a int, b int) partition by range (a,b);
create table rangep1 partition of rangep for values from (1,10) to (1,20);
create table rangep2 partition of rangep for values from (2,10) to (2,20);

explain select * from rangep where a = 1;
                          QUERY PLAN
---------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=11 width=8)
   ->  Seq Scan on rangep1  (cost=0.00..38.25 rows=11 width=8)
         Filter: (a = 1)
(3 rows)

2. You've added a list_copy() to get_partitions_from_clauses so as not
to modify the input list, but this function calls
get_partitions_from_clauses_recurse which calls
classify_partition_bounding_keys() which modifies that list. Would it
not just be better to make a list copy in
get_partitions_from_clauses() without any conditions?

If we get new users of that function, e.g Run-time pruning, then they
might be surprised to see new items magically added to their input
list without mention of that behaviour in the function comment.

3. The following case causes an Assert failure:

drop table listp;
CREATE TABLE listp (a int, b int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);

prepare q1 (int) as select * from listp where a = 1 and a in($1,10);

explain execute q1 (1);
explain execute q1 (1);
explain execute q1 (1);
explain execute q1 (1);
explain execute q1 (1);
explain execute q1 (1); -- <--- Assert failure!

In match_clauses_to_partkey you always add the ScalarArrayOpExpr to
the result regardless of whether it is a complete set of Consts; however,
the code in classify_partition_bounding_keys() that deals with
ScalarArrayOpExpr can't handle non-Consts:

/* OK to add to the result. */
result = lappend(result, clause);
if (IsA(eval_const_expressions(root, rightop), Const))
    *contains_const = true;
else
    *contains_const = false;

*contains_const is reset to true again by the a = 1 qual, so
get_partitions_from_clauses() gets called from
get_append_rel_partitions. Later, when classify_partition_bounding_keys()
processes the ScalarArrayOpExpr, the following code assumes the
array exprs are all Consts:

foreach(lc1, elem_exprs)
{
    Const  *rightop = castNode(Const, lfirst(lc1));


Setting *contains_const = false; in match_clauses_to_partkey() is not
correct either. If I understand the intent here correctly, you want
this to be set to true if the clause list contains quals with any
consts that are useful for partition pruning during planning. If
that's the case then you should set it to true if you find a suitable
clause, otherwise leave it set to false as you set it to at the start
of the function. What you have now will have varying results based on
the order of the clauses in the list, which is certainly not correct.
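In other words, the flag should only ever be latched to true once a suitable
clause is found, which makes the result order-independent. As a toy,
self-contained sketch (not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative only: scan a list of clauses and report whether ANY of
 * them carries a constant usable for plan-time pruning.  The flag is
 * only ever set, never cleared, so the result does not depend on the
 * order of the clauses; the bug being pointed out is that overwriting
 * the flag on every clause makes the last clause win.
 */
static bool
any_clause_has_const(const bool *clause_is_const, int nclauses)
{
    bool contains_const = false;    /* start false ... */

    for (int i = 0; i < nclauses; i++)
    {
        if (clause_is_const[i])
            contains_const = true;  /* ... latch true, never reset */
    }
    return contains_const;
}
```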

4. The following code can be rearranged to avoid calling pull_varnos
if there's no commutator op.

constexpr = leftop;
constrelids = pull_varnos((Node *) leftop);
expr_op = get_commutator(expr_op);

/*
* If no commutator exists, cannot flip the qual's args,
* so give up.
*/
if (!OidIsValid(expr_op))
continue;

5. Header comment for match_clauses_to_partkey() says only clauses in
the pattern of "partkey op const" and "const op partkey" are handled.
NULL tests are also mentioned but nothing is mentioned about
ScalarArrayOpExpr. It might be better to be less verbose about what
the function handles, but if you're listing what is handled then you
should not make false claims.

 * For an individual clause to match with a partition key column, the clause
 * must be an operator clause of the form (partkey op const) or (const op
 * partkey); the latter only if a suitable commutator exists.  Furthermore,

6. Which brings me to: why do we need match_clauses_to_partkey at all?
classify_partition_bounding_keys seems to do all the work
match_clauses_to_partkey does, plus more. Item #3 above is caused by
an inconsistency between these functions. What benefit does
match_clauses_to_partkey give? I might understand if you were creating
a list of clauses matching each partition key, but you're just dumping
everything in one big list, which causes
classify_partition_bounding_keys() to have to match each clause to a
partition key again, and classify_partition_bounding_keys is even
coded to ignore clauses that don't match any key, so it makes me
wonder what match_clauses_to_partkey is actually for.

I'm going to stop reviewing there, as removing
match_clauses_to_partkey is going to cause churn that'll need to be
reviewed again.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi David.

On Wed, Jan 17, 2018 at 12:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/01/12 12:30, David Rowley wrote:
>>> 8. The code in get_partitions_from_ne_clauses() does perform quite a
>>> few nested loops. I think a simpler way would be to track the
>>> offsets you've seen in a Bitmapset. This would save you having to
>>> check for duplicates, as an offset can only contain a single datum.
>>> You'd just need to build a couple of arrays after that, one to sum up
>>> the offsets found per partition, and one for the total datums allowed
>>> in the partition. If the numbers match then you can remove the
>>> partition.
>>>
>>> I've written this and attached it to this email. It saves about 50
>>> lines of code and should perform much better for complex cases, for
>>> example, a large NOT IN list. This also implements #6.
>>
>> I liked your patch, so incorporated it, except, I feel slightly
>> uncomfortable about the new name that you chose for the function because
>> it sounds a bit generic.
>
> You're right. I only renamed it because I inverted the meaning of the
> function in the patch. It no longer did
> "get_partitions_from_ne_clauses", it did the opposite and give the
> partitions which can't match. Please feel free to think of a new
> better name. Is "get_partitions_excluded_by_ne_clauses" too long?
>
>>> I think you quite often use "the same" to mean "it". Can you change that?
>>
>> I guess that's just one of my many odd habits when writing English, all of
>> which I'm trying to get rid of, but apparently with limited success.  Must
>> try harder. :)
>
> Oops, on re-reading that it sounded as though I was asking you to
> change some habit, but I just meant the comments. I understand there
> will be places that use English where that's normal. It's just I don't
> recall seeing that in PostgreSQL code before.

No worries, I too slightly misread what you'd said.

When I double-checked, I too couldn't find "the same" used the way I
did in the patch.  So, in the latest v20 patch, I actually ended up
finding and replacing more "the same"s with "it" than you had pointed
out in your review.

> American English is
> pretty much the standard for the project, despite that not always
> being strictly applied (e.g we have a command called ANALYSE which is
> an alias for ANALYZE). I always try and do my best to spell words in
> American English (which is not where I'm from), which for me stretches
> about as far as putting 'z' in the place of some of my 's'es.

I see.

>>> 15. "the latter" is normally used when you're referring to the last
>>> thing in a list which was just mentioned. In this case, leftarg_const
>>> and rightarg_const is the list, so "the latter" should mean
>>> rightarg_const, but I think you mean to compare them using the
>>> operator.
>>>
>>> * If the leftarg_const and rightarg_const are both of the type expected
>>> * by op's operator, then compare them using the latter.
>>
>> Rewrote it as:
>>
>>      * We can compare leftarg_const and rightarg_const using op's operator
>>      * only if both are of the type expected by it.
>
> I'd probably write "expected type." instead of "type expected by it."

OK, will do.

>>> 17. The following example will cause get_partitions_for_keys_hash to misbehave:
>>>
>>> create table hashp (a int, b int) partition by hash (a, b);
>>> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
>>> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
>>> create table hashp3 partition of hashp for values with (modulus 4, remainder 3);
>>> create table hashp4 partition of hashp for values with (modulus 4, remainder 2);
>>> explain select * from hashp where a = 1 and a is null;

[ ... ]

>>> The above code will need to be made smarter. It'll likely crash if you
>>> change "b" to a pass-by-ref type.
>>
>> Hmm, not sure why.  It seems to work:
>
> Yeah, works now because you've added new code to test for
> contradictions in the quals, e.g a = 1 and a is null is now rejected
> as constfalse.

Oh, I see.  I thought you were talking of it as an independent issue.

Thanks,
Amit


On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
> 6. Which brings me to; why do we need match_clauses_to_partkey at all?
> classify_partition_bounding_keys seems to do all the work
> match_clauses_to_partkey does, plus more. Item #3 above is caused by
> an inconsistency between these functions. What benefit does
> match_clauses_to_partkey give? I might understand if you were creating
> list of clauses matching each partition key, but you're just dumping
> everything in one big list which causes
> classify_partition_bounding_keys() to have to match each clause to a
> partition key again, and classify_partition_bounding_keys is even
> coded to ignore clauses that don't' match any key, so it makes me
> wonder what is match_clauses_to_partkey actually for?

I started to look at this and ended up shuffling the patch around a
bit to completely remove the match_clauses_to_partkey function.

I also cleaned up some of the comments and shuffled some fields around
in some of the structs to shrink them down a bit.

All up, this has saved 268 lines of code in the patch.

src/backend/catalog/partition.c       | 296 ++++++++++++++++-----------
src/backend/optimizer/path/allpaths.c | 368 ++--------------------------------
2 files changed, 198 insertions(+), 466 deletions(-)

It's had very minimal testing. Really I've only tested that the
regression tests pass.

I also fixed up the bad assumption that IN lists will contain Consts
only, which hopefully fixes the crash I reported earlier.

I saw you'd added a check to look for contradicting IS NOT NULL
clauses when processing an IS NULL clause, but didn't do anything for
the opposite case. I added code for this so it behaves the same
regardless of the clause order.

Can you look at my changes and see if I've completely broken anything?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Hi David.

On Wed, Jan 17, 2018 at 6:19 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> 6. Which brings me to; why do we need match_clauses_to_partkey at all?
>> classify_partition_bounding_keys seems to do all the work
>> match_clauses_to_partkey does, plus more. Item #3 above is caused by
>> an inconsistency between these functions. What benefit does
>> match_clauses_to_partkey give? I might understand if you were creating
>> list of clauses matching each partition key, but you're just dumping
>> everything in one big list which causes
>> classify_partition_bounding_keys() to have to match each clause to a
>> partition key again, and classify_partition_bounding_keys is even
>> coded to ignore clauses that don't' match any key, so it makes me
>> wonder what is match_clauses_to_partkey actually for?
>
> I started to look at this and ended up shuffling the patch around a
> bit to completely remove the match_clauses_to_partkey function.
>
> I also cleaned up some of the comments and shuffled some fields around
> in some of the structs to shrink them down a bit.
>
> All up, this has saved 268 lines of code in the patch.
>
> src/backend/catalog/partition.c       | 296 ++++++++++++++++-----------
> src/backend/optimizer/path/allpaths.c | 368 ++--------------------------------
> 2 files changed, 198 insertions(+), 466 deletions(-)
>
> It's had very minimal testing. Really I've only tested that the
> regression tests pass.
>
> I also fixed up the bad assumption that IN lists will contain Consts
> only which hopefully fixes the crash I reported earlier.
>
> I saw you'd added a check to look for contradicting IS NOT NULL
> clauses when processing an IS NULL clause, but didn't do anything for
> the opposite case. I added code for this so it behaves the same
> regardless of the clause order.
>
> Can you look at my changes and see if I've completely broken anything?

Thanks for the patch.  I applied the patch and see that it didn't
break any tests, although haven't closely reviewed the code yet.

I'm concerned that after your patch to remove
match_clauses_to_partkey(), we'd be doing more work than necessary in
some cases.  For example, consider the case of using run-time pruning
for nested loop where the inner relation is a partitioned table.  With
the old approach, get_partitions_from_clauses() would only be handed
the clauses that are known to match the partition keys (which most
likely is fewer than all of the query's clauses), so
get_partitions_from_clauses() doesn't have to do the work of filtering
non-partition clauses every time (that is, for every outer row).
That's why I had decided to keep that part in the planner.

Thanks,
Amit


On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
> I'm concerned that after your patch to remove
> match_clauses_to_partkey(), we'd be doing more work than necessary in
> some cases.  For example, consider the case of using run-time pruning
> for nested loop where the inner relation is a partitioned table.  With
> the old approach, get_partitions_from_clauses() would only be handed
> the clauses that are known to match the partition keys (which most
> likely is fewer than all of the query's clauses), so
> get_partitions_from_clauses() doesn't have to do the work of filtering
> non-partition clauses every time (that is, for every outer row).
> That's why I had decided to keep that part in the planner.

That might be better served by splitting
classify_partition_bounding_keys() into separate functions, the first
function would be in charge of building keyclauses_all. That way the
remaining work during the executor would never need to match clauses
to a partition key as they'd be in lists dedicated to each key.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
>> I'm concerned that after your patch to remove
>> match_clauses_to_partkey(), we'd be doing more work than necessary in
>> some cases.  For example, consider the case of using run-time pruning
>> for nested loop where the inner relation is a partitioned table.  With
>> the old approach, get_partitions_from_clauses() would only be handed
>> the clauses that are known to match the partition keys (which most
>> likely is fewer than all of the query's clauses), so
>> get_partitions_from_clauses() doesn't have to do the work of filtering
>> non-partition clauses every time (that is, for every outer row).
>> That's why I had decided to keep that part in the planner.
>
> That might be better served by splitting
> classify_partition_bounding_keys() into separate functions, the first
> function would be in charge of building keyclauses_all. That way the
> remaining work during the executor would never need to match clauses
> to a partition key as they'd be in lists dedicated to each key.

I've attached another delta against your v20 patch which does this.
It's very rough for now and I've only checked that it passes the
regression test so far.

It will need some cleanup work, but I'd be keen to know what you think
of the general idea. I've not fully worked out how run-time pruning
will use this as it'll need another version of
get_partitions_from_clauses but passes in a PartScanClauseInfo
instead, and does not call extract_partition_key_clauses. That area
probably  needs some shuffling around so that does not end up a big
copy and paste of all that new logic.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Hi David.

On 2018/01/18 12:14, David Rowley wrote:
> On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote:
>>> I'm concerned that after your patch to remove
>>> match_clauses_to_partkey(), we'd be doing more work than necessary in
>>> some cases.  For example, consider the case of using run-time pruning
>>> for nested loop where the inner relation is a partitioned table.  With
>>> the old approach, get_partitions_from_clauses() would only be handed
>>> the clauses that are known to match the partition keys (which most
>>> likely is fewer than all of the query's clauses), so
>>> get_partitions_from_clauses() doesn't have to do the work of filtering
>>> non-partition clauses every time (that is, for every outer row).
>>> That's why I had decided to keep that part in the planner.
>>
>> That might be better served by splitting
>> classify_partition_bounding_keys() into separate functions, the first
>> function would be in charge of building keyclauses_all. That way the
>> remaining work during the executor would never need to match clauses
>> to a partition key as they'd be in lists dedicated to each key.
> 
> I've attached another delta against your v20 patch which does this.
> It's very rough for now and I've only checked that it passes the
> regression test so far.

Thanks!

> It will need some cleanup work, but I'd be keen to know what you think
> of the general idea.

This one looks in a much better shape.

> I've not fully worked out how run-time pruning
> will use this as it'll need another version of
> get_partitions_from_clauses but passes in a PartScanClauseInfo
> instead, and does not call extract_partition_key_clauses. That area
> probably  needs some shuffling around so that does not end up a big
> copy and paste of all that new logic.

So, I've been assuming that the planner changes in the run-time pruning
patch have to do with selecting clauses (restriction clauses not
containing Consts and/or join clauses) to be passed to the executor by
recording them in the Append node.  Will they be selected by the planner
calling into partition.c?

Meanwhile, here is a revised version (v21) that incorporates your changes.
 I added you as the author in 0002 and 0005 patches.  I guess a v22 will
have to follow very soon...

Thanks,
Amit

Attachment
On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've not fully worked out how run-time pruning
>> will use this as it'll need another version of
>> get_partitions_from_clauses but passes in a PartScanClauseInfo
>> instead, and does not call extract_partition_key_clauses. That area
>> probably  needs some shuffling around so that does not end up a big
>> copy and paste of all that new logic.
>>
> So, I've been assuming that the planner changes in the run-time pruning
> patch have to do with selecting clauses (restriction clauses not
> containing Consts and/or join clauses) to be passed to the executor by
> recording them in the Append node.  Will they be selected by the planner
> calling into partition.c?

I had thought so. I only have a rough idea in my head, and that's that
the PartitionPruneInfo struct that I wrote for the run-time pruning
patch would have the clause List replaced with this new
PartScanClauseInfo struct (which likely means it needs to go into
primnodes.h). This struct would contain all the partition pruning
clauses in a more structured form so that no matching of quals to the
partition key would be required during execution. The idea is that
we'd just need to call remove_redundant_clauses,
extract_bounding_datums and get_partitions_for_keys. I think this is
the bare minimum of work that can be done during execution, since we
can't remove the redundant clauses until we know the values of the Params.

Likely this means there will need to be 2 functions externally
accessible for this in partition.c. I'm not sure exactly how best to
do this. Maybe it's fine just to have allpaths.c call
extract_partition_key_clauses to generate the PartScanClauseInfo, then
call some version of get_partitions_from_clauses that skips the
extract_partition_key_clauses duties. This is made more complex by the
code that handles adding the default partition bound to the quals. I'm
not yet sure where that should live.

I've also been thinking of having some sort of PartitionPruneContext
struct that we can pass around the functions. Runtime pruning had to
pass structs which store the Param values to various functions, which I
didn't like. It would be good to just add those to the context and
have them passed down without having to bloat the parameters in the
functions. I might try and do that tomorrow too. This should make the
footprint of the runtime pruning patch a bit smaller.

> Meanwhile, here is a revised version (v21) that incorporates your changes.
>  I added you as the author in 0002 and 0005 patches.  I guess a v22 will
> have to follow very soon...

Thanks for merging that in.

I'll have a try at making this work tomorrow, although I'm not sure
yet how much time I'll have to dedicate to it as I have a few other
things to do too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi Amit,

It seems your mail system continually adds "[Sender Address Forgery]"
prefixes to messages. E.g. this mail now has
Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
as its subject, whereas the mail you're replying to only had
Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
two of them.

I think the two previous occurrences of this also are from you.

This is somewhat annoying, could you try to figure out a) what the
problem is b) how to prevent the subject being edited like that?

Regards,

Andres


Hello,

At Thu, 18 Jan 2018 11:41:00 -0800, Andres Freund <andres@anarazel.de> wrote in
<20180118194100.dy3kxdtktsbvm4eq@alap3.anarazel.de>
> Hi Amit,
> 
> It seems your mail system continually adds "[Sender Address Forgery]"
> prefixes to messages. E.g. this mail now has
> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
> as its subject, whereas the mail you're replying to only had
> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
> two of them.
> 
> I think the two previous occurances of this also are from you.
> 
> This is somewhat annoying, could you try to figure out a) what the
> problem is b) how to prevent the subject being edited like that?

Our mail server is failing to fetch the SPF record for David's mails
that are received directly (not via the -hackers ML), and the server
adds the subject prefix.  It is failing to fetch the SPF record for
2ndquadrant.com.  The reason might be that the envelope-from of
his mails is not consistent with his server's IP address.

Anyway, mails via the -hackers ML don't suffer from this, so what Amit
(and I) can do on our side is one of the following.

- Be careful to reply to the mails coming via the ML.
- Remove the added header by hand.


And I'd like to ask David to check out his mail environment so
that the SPF record is available for his messages.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thank you Horiguchi-san!

On 2018/01/19 12:00, Kyotaro HORIGUCHI wrote:
> At Thu, 18 Jan 2018 11:41:00 -0800, Andres Freund <andres@anarazel.de> wrote:
>> Hi Amit,
>>
>> It seems your mail system continually adds "[Sender Address Forgery]"
>> prefixes to messages. E.g. this mail now has
>> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
>> as its subject, whereas the mail you're replying to only had
>> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
>> two of them.
>>
>> I think the two previous occurances of this also are from you.
>>
>> This is somewhat annoying, could you try to figure out a) what the
>> problem is b) how to prevent the subject being edited like that?

Sorry about that.  I had noticed it and had manually edited the subject
line once or twice before, but failed to do it every time.

> Our mail server is failing to fetch SPF record for David's mails
> that received directly (not via -hakders ML) and the server adds
> the subject header.  It is failing to fetch SPF record for
> 2ndquadrant.com. The reason might be that the envelope-from of
> his mails is not consistent with his server's IP address.

I was able to notice that too.  It seems that the Received-SPF: PermError
and X-SPF-Status = fail/warn headers started appearing in the emails only
a few months ago, so it appears our mail server was changed to notice
these discrepancies only recently.  The first email in this thread that
got such a subject line was the following, which is my reply to David's
email:

https://www.postgresql.org/message-id/b8094e71-2c73-ed8e-d8c3-53f232c8c049%40lab.ntt.co.jp

There are emails from David before that one, and I couldn't see such
headers in any of those emails, so no alterations of the subject line.

> Anyway, mails via -hackers ML doesn't suffer so, what Amit (and
> I) side can do by myself is one of the following.
> 
> - Being careful to reply to the mails comming via the ML.
> - Remove the added header by hand..

Yeah, will make sure to do that.

> And I'd like to ask David to check out his mail environment so
> that SPF record is available for his message.

That'd be nice too.

Thanks,
Amit



On 19 January 2018 at 16:00, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> And I'd like to ask David to check out his mail environment so
> that SPF record is available for his message.

Will investigate

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Hi Amit,
On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> So, I've been assuming that the planner changes in the run-time pruning
>> patch have to do with selecting clauses (restriction clauses not
>> containing Consts and/or join clauses) to be passed to the executor by
>> recording them in the Append node.  Will they be selected by the planner
>> calling into partition.c?
>
> I had thought so. I only have a rough idea in my head and that's that
> the PartitionPruneInfo struct that I wrote for the run-time pruning
> patch would have the clause List replaced with this new
> PartScanClauseInfo struct (which likely means it needs to go into
> primnodes.h), this struct would contain all the partition pruning
> clauses in a more structured form so that no matching of quals to the
> partition key wouldn't be required during execution. The idea is that
> we'd just need to call; remove_redundant_clauses,
> extract_bounding_datums and get_partitions_for_keys. I think this is
> the bare minimum of work that can be done in execution since we can't
> remove the redundant clauses until we know the values of the Params.
>
> Likely this means there will need to be 2 functions externally
> accessible for this in partition.c. I'm not sure exactly how best to
> do this. Maybe it's fine just to have allpaths.c call
> extract_partition_key_clauses to generate the PartScanClauseInfo then
> call some version of get_partitions_from_clauses which does do the
> extract_partition_key_clauses duties. This is made more complex by the
> code that handles adding the default partition bound to the quals. I'm
> not yet sure where that should live.
>
> I've also been thinking of having some sort of PartitionPruneContext
> struct that we can pass around the functions. Runtime pruning had to
> add structs which store the Param values to various functions which I
> didn't like. It would be good to just add those to the context and
> have them passed down without having to bloat the parameters in the
> functions. I might try and do that tomorrow too. This should make the
> footprint of the runtime pruning patch is a bit smaller.

Attached is what I had in mind about how to do this. Only the planner
will need to call populate_partition_clause_info. The planner and
executor can call get_partitions_from_clauseinfo. I'll just need to
change the run-time prune patch to pass the PartitionClauseInfo into
the executor in the Append node.

I've also added the context struct that I mentioned above. It's
currently not carrying much weight, but the idea is that this context
will be passed around a bit more with the run-time pruning patch and
it will also carry the details about parameter values. I'll need to
modify a few signatures of functions like partkey_datum_from_expr to
pass the context there too. I didn't do that here because currently,
those functions have no use for the context with the fields that they
currently have.

I've also fixed a bug where, when you built the commutator OpExpr in
what's now called extract_partition_key_clauses, the inputcollid was
not being properly set. The changes I made there go much further than
just that: I've completely removed the OpExpr from the PartClause
struct, as only two members were ever used. I thought it was better
just to add those to PartClause instead.

I also did some renaming of variables that always assumed that the
Expr being compared to the partition key was a constant. This is true
now, but with run-time pruning patch, it won't be, so I thought it was
better to do this here rather than having to rename them in the
run-time pruning patch.

One thing I don't yet understand about the patch is the use of
predicate_refuted_by() in get_partitions_from_or_clause_args(). I did
adjust the comment above that code, but I'm still not certain I fully
understand why that function has to be used instead of building the
clauses for the OR being processed along with the remaining clauses.
Was it that this was too hard to extract, so you ended up using
predicate_refuted_by()?

I've also removed the makeBoolExpr call that you were encapsulating
the or_clauses in. I didn't really see the need for this since you
just removed it again when looping over the or_clauses.

The only other changes are just streamlining code and comment changes.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Hi David.

On 2018/01/23 15:44, David Rowley wrote:
> On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> So, I've been assuming that the planner changes in the run-time pruning
>>> patch have to do with selecting clauses (restriction clauses not
>>> containing Consts and/or join clauses) to be passed to the executor by
>>> recording them in the Append node.  Will they be selected by the planner
>>> calling into partition.c?
>>
>> I had thought so. I only have a rough idea in my head and that's that
>> the PartitionPruneInfo struct that I wrote for the run-time pruning
>> patch would have the clause List replaced with this new
>> PartScanClauseInfo struct (which likely means it needs to go into
>> primnodes.h), this struct would contain all the partition pruning
>> clauses in a more structured form so that no matching of quals to the
>> partition key wouldn't be required during execution. The idea is that
>> we'd just need to call; remove_redundant_clauses,
>> extract_bounding_datums and get_partitions_for_keys. I think this is
>> the bare minimum of work that can be done in execution since we can't
>> remove the redundant clauses until we know the values of the Params.
>>
>> Likely this means there will need to be 2 functions externally
>> accessible for this in partition.c. I'm not sure exactly how best to
>> do this. Maybe it's fine just to have allpaths.c call
>> extract_partition_key_clauses to generate the PartScanClauseInfo then
>> call some version of get_partitions_from_clauses which does do the
>> extract_partition_key_clauses duties. This is made more complex by the
>> code that handles adding the default partition bound to the quals. I'm
>> not yet sure where that should live.
>>
>> I've also been thinking of having some sort of PartitionPruneContext
>> struct that we can pass around the functions. Runtime pruning had to
>> add structs which store the Param values to various functions which I
>> didn't like. It would be good to just add those to the context and
>> have them passed down without having to bloat the parameters in the
>> functions. I might try and do that tomorrow too. This should make the
>> footprint of the runtime pruning patch is a bit smaller.
> 
> Attached is what I had in mind about how to do this.

Thanks for the delta patch.  I will start looking at it tomorrow.

Regards,
Amit



On 23 January 2018 at 23:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/01/23 15:44, David Rowley wrote:
>> Attached is what I had in mind about how to do this.
>
> Thanks for the delta patch.  I will start looking at it tomorrow.

Thanks. I've been looking more at this and I've made a few more
adjustments in the attached.

This delta patch should be applied against the
faster_partition_prune_v21_delta_drowley_v1.patch one I sent
yesterday. This changes a few comments, also now correctly passes the
context to get_partitions_excluded_by_ne_clauses and fixes a small
error where the patch was failing to record a notnull for the
partition key when it saw a strict <> clause. It was only doing this
for the opposite case, but both seem to be perfectly applicable. I
also made a small adjustment to the regression tests to ensure this is
covered.

I'm now going to start work on basing the partition pruning patch on
top of this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

I looked at both of your faster_partition_prune_v21_delta_drowley_v1.patch
and faster_partition_prune_v21_delta_drowley_v1_delta.patch and have
incorporated them into the main patch series to get us the attached v22
set.  I like the result very much.  Thank you!

On 2018/01/23 15:44, David Rowley wrote:
> Attached is what I had in mind about how to do this. Only the planner
> will need to call populate_partition_clause_info. The planner and
> executor can call get_partitions_from_clauseinfo. I'll just need to
> change the run-time prune patch to pass the PartitionClauseInfo into
> the executor in the Append node.

I liked the division into those two functions, although not quite the
embedding of "info" into the function names.  I think it's good to just
call them populate_partition_clauses and get_partitions_from_clauses.  It
seems OK though for their arguments to contain "info" in their names.

> I've also added the context struct that I mentioned above. It's
> currently not carrying much weight, but the idea is that this context
> will be passed around a bit more with the run-time pruning patch and
> it will also carry the details about parameter values. I'll need to
> modify a few signatures of functions like partkey_datum_from_expr to
> pass the context there too. I didn't do that here because currently,
> those functions have no use for the context with the fields that they
> currently have.

The context struct seems good to me too.

> I've also fixed a bug where when you built the commutator OpExpr in
> what's now called extract_partition_key_clauses the inputcollid was
> not being properly set. The changes I made there go much further than
> just that, I've completely removed the OpExpr from the PartClause
> struct as only two members were ever used. I thought it was better
> just to add those to PartClause instead.

I wondered if we should rename that to something like
PartClauseProperties, but maybe that's too long.

> I also did some renaming of variables that always assumed that the
> Expr being compared to the partition key was a constant. This is true
> now, but with run-time pruning patch, it won't be, so I thought it was
> better to do this here rather than having to rename them in the
> run-time pruning patch.

OK, seems fine.

> One thing I don't yet understand about the patch is the use of
> predicate_refuted_by() in get_partitions_from_or_clause_args(). I did
> adjust the comment above that code, but I'm still not certain I fully
> understand why that function has to be used instead of building the
> clauses for the OR being processed along with the remaining clauses.
> Is it that this was too hard to extract that you ended up using
> predicate_refuted_by()?

I have tried to explain that better in the updated comment in the new
patch, along with some code rearrangement to better make sense of what's
going on.  Let me just copy-paste the new comment I wrote.  I have tried
to rethink the solution a number of times but never came up with a
sensible alternative.

/*
 * When matching an OR expression, we only check whether at least one of
 * its args matches the partition key, not all of them.  For arguments
 * that don't match, we cannot eliminate any of its partitions using
 * get_partitions_from_clauses().  However, if the table is itself a
 * partition, we may be able to prove using constraint exclusion that the
 * clause refutes its partition constraint, that is, we can eliminate all
 * of its partitions.
 */
foreach(lc, or_clause_args)
{
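Schematically, the logic that comment describes looks like the following
standalone C sketch (a bitmask stands in for the partition Bitmapset, and
the per-arm match/refutation results are precomputed booleans; none of
these names appear in the patch):

```c
#include <stdbool.h>

typedef unsigned int PartSet;           /* stand-in: one bit per partition */

#define ALL_PARTS(nparts) ((PartSet) ((1u << (nparts)) - 1))

/* Precomputed outcome for one arm of the OR clause. */
typedef struct OrArg
{
    bool    matches_key;    /* does this arm match the partition key? */
    bool    refuted;        /* else: refuted by the partition constraint? */
    PartSet pruned;         /* partitions surviving this arm, if it matches */
} OrArg;

/*
 * Union the partitions surviving each OR arm.  An arm that doesn't match
 * the partition key normally forces "all partitions", unless constraint
 * exclusion proves the arm refutes the table's own partition constraint,
 * in which case the arm selects no partitions at all.
 */
static PartSet
prune_or_args(const OrArg *args, int nargs, int nparts)
{
    PartSet result = 0;

    for (int i = 0; i < nargs; i++)
    {
        if (args[i].matches_key)
            result |= args[i].pruned;
        else if (!args[i].refuted)
            return ALL_PARTS(nparts);   /* can't prune anything for this arm */
        /* a refuted arm contributes no partitions */
    }
    return result;
}
```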

> I've also removed the makeBoolExpr call that you were encapsulating
> the or_clauses in. I didn't really see the need for this since you
> just removed it again when looping over the or_clauses.

Ah, OK.  I first became concerned when you said you were adding the
arguments of different OR expressions into a single list called
or_clauses, but calmed down after checking that that's not the case. :)

> The only other changes are just streamlining code and comment changes.

I made a few of those myself in the updated patches.

Thanks a lot again for your work on this.

Regards,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 01/26/2018 04:17 AM, Amit Langote wrote:
> I made a few of those myself in the updated patches.
> 
> Thanks a lot again for your work on this.
> 

This needs a rebase.

Best regards,
  Jesper




Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Jesper.

On 2018/01/29 22:13, Jesper Pedersen wrote:
> Hi Amit,
> 
> On 01/26/2018 04:17 AM, Amit Langote wrote:
>> I made a few of those myself in the updated patches.
>>
>> Thanks a lot again for your work on this.
>>
> 
> This needs a rebase.

AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make
check passes.

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c12693d8f3



Re: [HACKERS] path toward faster partition pruning

From
Kyotaro HORIGUCHI
Date:
Hello, let me make some comments.

At Tue, 30 Jan 2018 09:57:44 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<4a7dda08-b883-6e5e-b0bf-f5ce95584e9e@lab.ntt.co.jp>
> Hi Jesper.
> 
> On 2018/01/29 22:13, Jesper Pedersen wrote:
> > Hi Amit,
> > 
> > On 01/26/2018 04:17 AM, Amit Langote wrote:
> >> I made a few of those myself in the updated patches.
> >>
> >> Thanks a lot again for your work on this.
> >>
> > 
> > This needs a rebase.
> 
> AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make
> check passes.

Yes, it cleanly applies to HEAD and seems working.

0001 seems fine.

I have some random comments on 0002.

-extract_partition_key_clauses implicitly assumes that the
 commutator of an inequality operator is the same as the original
 operator (commutation is an identity function for such operators).

 I believe that is always true for any sane operator definition, but
 wouldn't it be better to use the commutator, rather than
 opclause->opno, as the source of the negator, if any? (Attached diff 1)

 
-get_partitions_from_or_clause_args abandons arg_partset with all
 bits set when constraint exclusion fails to refute the whole
 partition constraint.  get_partitions_from_clauses then ends up
 doing the same work, which wastes cycles and looks weird.  It seems
 the code intended to return immediately there.

  >       /* Couldn't eliminate any of the partitions. */
  >       partdesc = RelationGetPartitionDesc(context->relation);
  > -     arg_partset = bms_add_range(NULL, 0, partdesc->nparts - 1);
  > +     return bms_add_range(NULL, 0, partdesc->nparts - 1);
  >     }
  >  
  >     subcontext.rt_index = context->rt_index;
  >     subcontext.relation = context->relation;
  >     subcontext.clauseinfo = &partclauseinfo;
 !>     arg_partset = get_partitions_from_clauses(&subcontext);

-get_partitions_from_or_clause_args converts IN (null) into a
 nulltest, but the nulltest doesn't exclude a child partition whose
 partition key column can contain null.

 drop table if exists p;
 create table p (a int, b int) partition by list (a);
 create table c1 partition of p for values in (1, 5, 7);
 create table c2 partition of p for values in (4, 6, null);
 insert into p values (1, 0), (null, 0);
 
 explain select tableoid::regclass, * from p where a in (1, null);
 >                            QUERY PLAN                            
 > -----------------------------------------------------------------
 >  Result  (cost=0.00..76.72 rows=22 width=12)
 >    ->  Append  (cost=0.00..76.50 rows=22 width=12)
 >          ->  Seq Scan on c1  (cost=0.00..38.25 rows=11 width=12)
 >                Filter: (a = ANY ('{1,NULL}'::integer[]))
 >          ->  Seq Scan on c2  (cost=0.00..38.25 rows=11 width=12)
 >                Filter: (a = ANY ('{1,NULL}'::integer[]))

 Although the clause "a in (null)" doesn't match the (null, 0)
 row, so it does no harm in the end, I don't think this is the
 right behavior.  A null in an SAOP rightop should simply be
 ignored during partition pruning.  Or is there a purpose to this
 behavior?
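For what it's worth, the reason null elements are ignorable follows from
SQL's three-valued logic: equality against NULL is never true, so a NULL
element in the array can never select a row.  A minimal standalone sketch
(plain C with ints standing in for Datums; hypothetical names, not patch
code):

```c
#include <stdbool.h>

/*
 * Model of "x IN (elems)" for a non-null x: a NULL element yields
 * unknown, never true, so it can be skipped; only non-null elements
 * can ever select the row, and only those need to drive pruning.
 */
static bool
saop_selects(int x, const int *elems, const bool *elem_isnull, int nelems)
{
    for (int i = 0; i < nelems; i++)
    {
        if (elem_isnull[i])
            continue;           /* x = NULL is unknown, never true */
        if (x == elems[i])
            return true;
    }
    return false;               /* false or unknown: row not selected */
}
```

Whether x itself is NULL is a separate question answered by IS NULL
tests, which is why synthesizing a nulltest from a null array element is
the wrong translation.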


- In extract_bounding_datums, clauseinfo is set twice to the same
  value.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index ab17524..a2488ab 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2111,7 +2111,6 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
             PartClause *pc;
             Oid            partopfamily = partkey->partopfamily[i];
             Oid            partcoll = partkey->partcollation[i];
-            Oid            commutator = InvalidOid;
             AttrNumber    partattno = partkey->partattrs[i];
             Expr       *partexpr = NULL;
 
@@ -2144,6 +2143,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
             if (IsA(clause, OpExpr))
             {
                 OpExpr       *opclause = (OpExpr *) clause;
+                Oid            comparator = opclause->opno;
                 Expr       *leftop,
                            *rightop,
                            *valueexpr;
@@ -2161,13 +2161,14 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
                     valueexpr = rightop;
                 else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
                 {
-                    valueexpr = leftop;
-
-                    commutator = get_commutator(opclause->opno);
+                    Oid commutator = get_commutator(opclause->opno);
 
                     /* nothing we can do unless we can swap the operands */
                     if (!OidIsValid(commutator))
                         continue;
+
+                    valueexpr = leftop;
+                    comparator = commutator;
                 }
                 else
                     /* Clause does not match this partition key. */
@@ -2212,7 +2213,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
                      * equality operator *and* this is a list partitioned
                      * table, we can use it prune partitions.
                      */
-                    negator = get_negator(opclause->opno);
+                    negator = get_negator(comparator);
                     if (OidIsValid(negator) &&
                         op_in_opfamily(negator, partopfamily))
                     {
@@ -2236,7 +2237,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
                 }
 
                 pc = (PartClause *) palloc0(sizeof(PartClause));
-                pc->opno = OidIsValid(commutator) ? commutator : opclause->opno;
+                pc->opno = comparator;
                 pc->inputcollid = opclause->inputcollid;
                 pc->value = valueexpr;


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 01/29/2018 07:57 PM, Amit Langote wrote:
>> This needs a rebase.
> 
> AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make
> check passes.
> 

It was a rebase error; I should have checked against a clean master.

Sorry for the noise.

Best regards,
  Jesper



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Horiguchi-san,

Thanks for the review.

On 2018/01/30 20:43, Kyotaro HORIGUCHI wrote:
> At Tue, 30 Jan 2018 09:57:44 +0900, Amit Langote wrote:
>> AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make
> I have some random comments on 0002.
> 
> -extract_partition_key_clauses implicitly assumes that the
>  commutator of an inequality operator is the same as the original
>  operator (commutation is an identity function for such operators).

Yeah, it seems so.

>  I believe that is always true for any sane operator definition, but
>  wouldn't it be better to use the commutator, rather than
>  opclause->opno, as the source of the negator, if any? (Attached diff 1)

ISTM, the same thing happens with or without the patch.

- pc->opno = OidIsValid(commutator) ? commutator : opclause->opno;
+ pc->opno = comparator;

comparator as added by the patch is effectively equal to the RHS
expression in the deleted line.

> -get_partitions_from_or_clause_args abandons arg_partset with all
>  bits set when constraint exclusion fails to refute the whole
>  partition constraint.  get_partitions_from_clauses then ends up
>  doing the same work, which wastes cycles and looks weird.  It seems
>  the code intended to return immediately there.
> 
>   >       /* Couldn't eliminate any of the partitions. */
>   >       partdesc = RelationGetPartitionDesc(context->relation);
>   > -     arg_partset = bms_add_range(NULL, 0, partdesc->nparts - 1);
>   > +     return bms_add_range(NULL, 0, partdesc->nparts - 1);
>   >     }
>   >  
>   >     subcontext.rt_index = context->rt_index;
>   >     subcontext.relation = context->relation;
>   >     subcontext.clauseinfo = &partclauseinfo;
>  !>     arg_partset = get_partitions_from_clauses(&subcontext);

You're right, fixed.

> -get_partitions_from_or_clause_args converts IN (null) into a
>  nulltest, but the nulltest doesn't exclude a child partition whose
>  partition key column can contain null.
> 
>  drop table if exists p;
>  create table p (a int, b int) partition by list (a);
>  create table c1 partition of p for values in (1, 5, 7);
>  create table c2 partition of p for values in (4, 6, null);
>  insert into p values (1, 0), (null, 0);
>  
>  explain select tableoid::regclass, * from p where a in (1, null);
>  >                            QUERY PLAN                            
>  > -----------------------------------------------------------------
>  >  Result  (cost=0.00..76.72 rows=22 width=12)
>  >    ->  Append  (cost=0.00..76.50 rows=22 width=12)
>  >          ->  Seq Scan on c1  (cost=0.00..38.25 rows=11 width=12)
>  >                Filter: (a = ANY ('{1,NULL}'::integer[]))
>  >          ->  Seq Scan on c2  (cost=0.00..38.25 rows=11 width=12)
>  >                Filter: (a = ANY ('{1,NULL}'::integer[]))
> 
>  Although the clause "a in (null)" doesn't match the (null, 0)
>  row, so it does no harm in the end, I don't think this is the
>  right behavior.  A null in an SAOP rightop should simply be
>  ignored during partition pruning.  Or is there a purpose to this
>  behavior?

Yeah, it seems that we're better off ignoring null values appearing in
the IN-list.  Framing an IS NULL or IS NOT NULL expression corresponding
to a null value in the SAOP rightop array doesn't seem semantically
correct, as you also pointed out.  In ExecEvalScalarArrayOpExpr(), I see
that a null in the rightop array (IN-list) doesn't lead to selecting rows
containing null in the corresponding column.

> - In extract_bounding_datums, clauseinfo is set twice to the same
>   value.

Oops, my bad when merging in David's patch.

Update patch set attached.  Thanks again.

Regards,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 January 2018 at 21:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Update patch set attached.  Thanks again.

(My apologies for being slow to respond here. I've been on leave this
week and I'm off again next week. I now have a little time to reply)

Hi Amit,

Thanks for incorporating my changes into the patchset. A while ago I
was rebasing the run-time pruning patch on top of this but ran into a
few problems which are all results of my changes.

1. remove_redundant_clauses collapses the PartClause list into the
most restrictive set of clauses. This disallows multiple evaluations
of the PartitionClauseInfo during runtime pruning. I've written a
proposed fix for this and attached it.

2. PartitionClauseInfo->keyclauses is a list of PartClause which is
not a node type. This will cause _copyPartitionClauseInfo() to fail.

I'm still not quite sure the best way to fix #2 since PartClause
contains a FmgrInfo. I do have a local fix which moves PartClause to
primnodes.h and makes it a proper node type. I also added a copy
function which does not copy any of the cache fields in PartClause. It
just sets valid_cache to false. I didn't particularly think this was
the correct fix. I just couldn't think of how exactly this should be
done at the time.

The attached patch also adds the missing nodetag to
PartitionClauseInfo and also fixes up other code so that we don't memset
the node memory to zero, as that would destroy the tag.  I ended up
just having extract_partition_key_clauses do the makeNode call.  This
also resulted in populate_partition_clauses being renamed to
generate_partition_clauses.
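As an aside on point 1 above, the collapsing that
remove_redundant_clauses performs is only valid when the compared values
are known at plan time, which is exactly what run-time parameters break.
A hedged sketch of the idea with stand-in types (not the patch's code):

```c
/*
 * Collapse several ">" clauses on one partition key, e.g.
 * "a > 1 AND a > 5 AND a > 3", into the single most restrictive one.
 * Doing this requires comparing the constants, so it cannot happen
 * once at plan time if the values are run-time parameters.
 */
static int
tightest_lower_bound(const int *consts, int nclauses)
{
    int best = consts[0];

    for (int i = 1; i < nclauses; i++)
        if (consts[i] > best)
            best = consts[i];   /* larger constant is tighter for ">" */
    return best;
}
```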

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
+/*
+ * PartitionBoundCmpArg - Caller-defined argument to be passed to
+ *                                               partition_bound_cmp()
+ *
+ * The first (fixed) argument involved in a comparison is the partition bound
+ * found in the catalog, while an instance of the following struct describes
+ * either a new partition bound being compared against existing bounds
+ * (caller should set is_bound to true and set bound), or a new tuple's
+ * partition key specified in datums (caller should set ndatums to the number
+ * of valid datums that are passed in the array).
+ */
+typedef struct PartitionBoundCmpArg
+{
+       bool    is_bound;
+       union
+       {
+               PartitionListValue         *lbound;
+               PartitionRangeBound        *rbound;
+               PartitionHashBound         *hbound;
+       }       bound;
+
+       Datum  *datums;
+       int             ndatums;
+} PartitionBoundCmpArg;

This is a pretty strange definition.  datums/ndatums are never valid
at the same time as any of lbound/rbound/hbound, but are not included
in the union.  Also, is_bound doesn't tell you which of
rbound/lbound/hbound is actually valid.  Granted, the current calling
convention looks like a mess, too.  Apparently, the argument to
partition_bound_cmp is a PartitionBoundSpec if using hash
partitioning, a Datum if list partitioning, and either a
PartitionRangeBound or a Datum * if range partitioning depending on
the value of probe_is_bound, and I use the word "apparently" because
there are zero words of comments explaining what the argument to
partition_bound_cmp of type "void *" is supposed to mean.  I really
should have noticed that and insisted that it be fixed before
partitioning got committed.

Looking a bit further, there are exactly two calls to
partition_bound_cmp().  One is in partition_bound_bsearch() and the
other is in check_new_partition_bound(). Now, looking at this, both
the call to partition_bound_cmp() and every single call to
partition_bound_bsearch() are inside a switch branch where we've
dispatched on the partitioning type, which means that from code that
is already specialized by partitioning type we are calling generic
code which then turns back around and calls code that is specialized
by partitioning type.  Now, that could make sense if the generic code
is pretty complex, but here's it's basically just the logic to do a
bsearch.  It seems to me that a cleaner solution here would be to
duplicate that logic.  Then we could have...

static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo
boundinfo,
                        Datum value, bool *is_equal);
static int partition_range_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
                        PartitionRangeBound *probe);
static int partition_range_datum_bsearch(PartitionKey key,
PartitionBoundInfo boundinfo,
                        int nvalues, Datum *values);
static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo
boundinfo,
                        int modulus, int remainder, bool *is_equal);

...which would involve fewer branches at runtime and more type-safety
at compile time.  partition_hash_bsearch could directly call
partition_hbound_cmp, partition_list_bsearch could directly invoke
FunctionCall2Coll, partition_range_bsearch could directly call
partition_rbound_cmp, and partition_range_datum_bsearch could directly
call partition_rbound_datum_cmp.
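The shape of one such specialized search, sketched as standalone C with
plain ints standing in for Datums and a direct comparison standing in for
the FunctionCall2Coll invocation (an illustration, not the proposed
code), would be roughly:

```c
#include <stdbool.h>

/*
 * Return the index of the greatest bound <= probe, or -1 if probe
 * precedes all bounds; *is_equal reports an exact match.  Bounds are
 * sorted ascending and, as in list partitioning, distinct.
 */
static int
partition_list_bsearch_sketch(const int *bounds, int nbounds,
                              int probe, bool *is_equal)
{
    int lo = -1;
    int hi = nbounds - 1;

    *is_equal = false;
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;

        if (bounds[mid] <= probe)
        {
            lo = mid;
            if (bounds[mid] == probe)
            {
                *is_equal = true;
                break;
            }
        }
        else
            hi = mid - 1;
    }
    return lo;
}
```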

All-in-all that seems a lot nicer to me than what we have here now.
IIUC, the purpose of this patch is to let you search on a prefix of
the partition keys, but I think that's really only possible for range
partitioning, and it seems like the proposed nvalues argument to
partition_range_datum_bsearch would give you what you need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the review.

On 2018/02/02 7:38, Robert Haas wrote:
> +/*
> + * PartitionBoundCmpArg - Caller-defined argument to be passed to
> + *                                               partition_bound_cmp()
> + *
> + * The first (fixed) argument involved in a comparison is the partition bound
> + * found in the catalog, while an instance of the following struct describes
> + * either a new partition bound being compared against existing bounds
> + * (caller should set is_bound to true and set bound), or a new tuple's
> + * partition key specified in datums (caller should set ndatums to the number
> + * of valid datums that are passed in the array).
> + */
> +typedef struct PartitionBoundCmpArg
> +{
> +       bool    is_bound;
> +       union
> +       {
> +               PartitionListValue         *lbound;
> +               PartitionRangeBound        *rbound;
> +               PartitionHashBound         *hbound;
> +       }       bound;
> +
> +       Datum  *datums;
> +       int             ndatums;
> +} PartitionBoundCmpArg;
> 
> This is a pretty strange definition.  datums/ndatums are never valid
> at the same time as any of lbound/rbound/hbound, but are not included
> in the union.  Also, is_bound doesn't tell you which of
> rbound/lbound/hbound is actually valid.  Granted, the current calling
> convention looks like a mess, too.  Apparently, the argument to
> partition_bound_cmp is a PartitionBoundSpec if using hash
> partitioning, a Datum if list partitioning, and either a
> PartitionRangeBound or a Datum * if range partitioning depending on
> the value of probe_is_bound, and I use the word "apparently" because
> there are zero words of comments explaining what the argument to
> partition_bound_cmp of type "void *" is supposed to mean.  I really
> should have noticed that and insisted that it be fixed before
> partitioning got committed.

Yeah, I was trying to fix the status quo by introducing that new struct,
but I agree it's much better to modify the functions around a bit like the
way you describe below.

> Looking a bit further, there are exactly two calls to
> partition_bound_cmp().  One is in partition_bound_bsearch() and the
> other is in check_new_partition_bound(). Now, looking at this, both
> the call to partition_bound_cmp() and every single call to
> partition_bound_bsearch() are inside a switch branch where we've
> dispatched on the partitioning type, which means that from code that
> is already specialized by partitioning type we are calling generic
> code which then turns back around and calls code that is specialized
> by partitioning type.  Now, that could make sense if the generic code
> is pretty complex, but here it's basically just the logic to do a
> bsearch.  It seems to me that a cleaner solution here would be to
> duplicate that logic.  Then we could have...
> 
> static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo
> boundinfo,
>                         Datum value, bool *is_equal);
> static int partition_range_bsearch(PartitionKey key,
> PartitionBoundInfo boundinfo,
>                         PartitionRangeBound *probe);
> static int partition_range_datum_bsearch(PartitionKey key,
> PartitionBoundInfo boundinfo,
>                         int nvalues, Datum *values);
> static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo
> boundinfo,
>                         int modulus, int remainder, bool *is_equal);
> 
> ...which would involve fewer branches at runtime and more type-safety
> at compile time.  partition_hash_bsearch could directly call
> partition_hbound_cmp, partition_list_bsearch could directly invoke
> FunctionCall2Coll, partition_range_bsearch could directly call
> partition_rbound_cmp, and partition_range_datum_bsearch could directly
> call partition_rbound_datum_cmp.
> 
> All-in-all that seems a lot nicer to me than what we have here now.
> IIUC, the purpose of this patch is to let you search on a prefix of
> the partition keys, but I think that's really only possible for range
> partitioning, and it seems like the proposed nvalues argument to
> partition_range_datum_bsearch would give you what you need.

Your proposed cleanup sounds much better, so I implemented it in the
attached updated 0001, while dropping the previously proposed
PartitionBoundCmpArg approach.

Updated set of patches attached (patches 0002 onward mostly unchanged,
except I incorporated the delta patch posted by David upthread).

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/02/01 8:57, David Rowley wrote:
> On 31 January 2018 at 21:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Update patch set attached.  Thanks again.
> 
> (My apologies for being slow to respond here. I've been on leave this
> week and I'm off again next week. I now have a little time to reply)

No worries.

> Thanks for incorporating my changes into the patchset. A while ago I
> was rebasing the run-time pruning patch on top of this but ran into a
> few problems which are all results of my changes.
> 
> 1. remove_redundant_clauses collapses the PartClause list into the
> most restrictive set of clauses. This disallows multiple evaluations
> of the PartitionClauseInfo during runtime pruning. I've written a
> proposed fix for this and attached it.

I've incorporated it in the latest patch I posted today.

> 2. PartitionClauseInfo->keyclauses is a list of PartClause which is
> not a node type. This will cause _copyPartitionClauseInfo() to fail.
> 
> I'm still not quite sure the best way to fix #2 since PartClause
> contains a FmgrInfo. I do have a local fix which moves PartClause to
> primnodes.h and makes it a proper node type. I also added a copy
> function which does not copy any of the cache fields in PartClause. It
> just sets valid_cache to false. I didn't particularly think this was
> the correct fix. I just couldn't think of how exactly this should be
> done at the time.
> 
> The attached patch also adds the missing nodetag to
> PartitionClauseInfo and also fixes up other code so that we don't memset
> the node memory to zero, as that would destroy the tag.  I ended up
> just having extract_partition_key_clauses do the makeNode call.  This
> also resulted in populate_partition_clauses being renamed to
> generate_partition_clauses.

I've started wondering whether it's really a good idea to make
PartitionClauseInfo a Node at all.  I went back to your earlier message
[1] where you said that it's put into the Append node for run-time pruning
to use, but it doesn't seem right that we'd be putting into the plan
something that looks more like a scratchpad for the partition.c code.  I
think we should try to keep PartitionClauseInfo in partition.h and put
only the list of matched bare clauses into Append.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8T_efuAgPWtyGdfwD1kBLR-giFvoez7raYsQ4P1i2OYw%40mail.gmail.com



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 1:04 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Your proposed cleanup sounds much better, so I implemented it in the
> attached updated 0001, while dropping the previously proposed
> PartitionBoundCmpArg approach.
>
> Updated set of patches attached (patches 0002 onward mostly unchanged,
> except I incorporated the delta patch posted by David upthread).

Committed 0001.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Feb 2, 2018 at 9:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Updated set of patches attached (patches 0002 onward mostly unchanged,
>> except I incorporated the delta patch posted by David upthread).
>
> Committed 0001.  Thanks.

Some preliminary thoughts...

Regarding 0002, I can't help noticing that this adds a LOT of new code
to partition.c.  With 0002 applied, it climbs into the top 2% of all
".c" files in terms of lines of code.  It seems to me, though, that
maybe it would make sense to instead add all of this code to some
new .c file, e.g. src/backend/optimizer/util/partprune.c.  I realize
that's a little awkward in this case because we're hoping to use this
code at runtime and not just in the optimizer, but I don't have a
better idea.  Using src/backend/catalog as a dumping-ground for code
that doesn't have a clear-cut place to live is not a superior
alternative, for sure.  And it seems to me that the code you're adding
here is really quite similar to what we've already got in that
directory -- for example, predtest.c, which currently does partition
pruning, lives there; so does clauses.c, whose evaluate_expr facility
this patch wants to use; so does relnode.c, which the other patches
modify; and in general this looks an awful lot like other optimizer
logic that decomposes clauses.  I'm open to other suggestions but I
don't think adding all of this directly into partition.c is a good
plan.

If we do add a new file for this code, the header comment for that
file would be a good place to write an overall explanation of this new
facility.  The individual bits and pieces seem to have good comments
but an overall explanation of what's going on here seems to be
lacking.

It doesn't seem good that get_partitions_from_clauses requires us to
reopen the relation.  I'm going to give my standard review feedback
any time someone injects additional relation_open or heap_open calls
into a patch: please look for a way to piggyback on one of the places
that already has the relation open instead of adding code to re-open
it elsewhere.  Reopening it is not entirely free, and, especially when
NoLock is used, makes it hard to tell whether we're doing the locking
correctly.  Note that we've already got things like
set_relation_partition_info (which copies the bounds) and
set_baserel_partition_key_exprs (which copies, with some preprocessing,
the partitioning expressions) and also find_partition_scheme, but
instead of using that existing data from the RelOptInfo, this patch is
digging it directly out of the relcache entry, which doesn't seem
great.

The changes to set_append_rel_pathlist probably make it slower in the
case where rte->relkind != RELKIND_PARTITIONED_TABLE.  We build a
whole new list that we don't really need.  How about just keeping the
(appinfo->parent_relid != parentRTindex) test in the loop and setting
rel_appinfos to either root->append_rel_list or
rel->live_part_appinfos as appropriate?

I understand why COLLATION_MATCH thinks that a collation OID match is
OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
test cases, too?

How fast is this patch these days, compared with the current approach?
It would be good to test both when nearly all of the partitions are
pruned and when almost none of the partitions are pruned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/03 6:05, Robert Haas wrote:
> On Fri, Feb 2, 2018 at 9:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Updated set of patches attached (patches 0002 onward mostly unchanged,
>>> except I incorporated the delta patch posted by David upthread).
>>
>> Committed 0001.  Thanks.
> 
> Some preliminary thoughts...

Thanks for the review.

> Regarding 0002, I can't help noticing that this adds a LOT of new code
> to partition.c.  With 0002 applied, it climbs into the top 2% of all
> ".c" files in terms of lines of code.  It seems to me, though, that
> maybe it would make sense to instead add all of this code to some
> new .c file, e.g. src/backend/optimizer/util/partprune.c.  I realize
> that's a little awkward in this case because we're hoping to use this
> code at runtime and not just in the optimizer, but I don't have a
> better idea.  Using src/backend/catalog as a dumping-ground for code
> that doesn't have a clear-cut place to live is not a superior
> alternative, for sure.

Agreed.  partition.c has gotten quite big and actually more than half of
the code that this patch adds really seems to belong outside of it.

> And it seems to me that the code you're adding
> here is really quite similar to what we've already got in that
> directory -- for example, predtest.c, which currently does partition
> pruning, lives there; so does clauses.c, whose evaluate_expr facility
> this patch wants to use; so does relnode.c, which the other patches
> modify; and in general this looks an awful lot like other optimizer
> logic that decomposes clauses.  I'm open to other suggestions but I
> don't think adding all of this directly into partition.c is a good
> plan.

Agreed.

A partprune.c in the optimizer's util directory seems like a good place.

> If we do add a new file for this code, the header comment for that
> file would be a good place to write an overall explanation of this new
> facility.  The individual bits and pieces seem to have good comments
> but an overall explanation of what's going on here seems to be
> lacking.

OK, I will add such a comment.

> It doesn't seem good that get_partitions_from_clauses requires us to
> reopen the relation.  I'm going to give my standard review feedback
> any time someone injects additional relation_open or heap_open calls
> into a patch: please look for a way to piggyback on one of the places
> that already has the relation open instead of adding code to re-open
> it elsewhere.  Reopening it is not entirely free, and, especially when
> NoLock is used, makes it hard to tell whether we're doing the locking
> correctly.  Note that we've already got things like
> set_relation_partition_info (which copies the bounds) and
> set_baserel_partition_key_exprs (which copies, with some preprocessing,
> the partitioning expressions) and also find_partition_scheme, but
> instead of using that existing data from the RelOptInfo, this patch is
> digging it directly out of the relcache entry, which doesn't seem
> great.

OK, I have to admit that the quoted heap_open wasn't quite well thought
out and I'm almost sure that everything should be fine with the
information that set_relation_partition_info() fills in the RelOptInfo.
I'm now going through the patch to try to figure out how to make that work.

> The changes to set_append_rel_pathlist probably make it slower in the
> case where rte->relkind != RELKIND_PARTITIONED_TABLE.  We build a
> whole new list that we don't really need.  How about just keeping the
> (appinfo->parent_relid != parentRTindex) test in the loop and setting
> rel_appinfos to either root->append_rel_list or
> rel->live_part_appinfos as appropriate?

That's certainly better.  Also in set_append_rel_size.

> I understand why COLLATION_MATCH thinks that a collation OID match is
> OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
> test cases, too?

partcollid == InvalidOid means the partition key is of uncollatable type,
so further checking the collation is unnecessary.

There is a test in partition_prune.sql that covers the failure to prune
when collations don't match for a text partition key.

> How fast is this patch these days, compared with the current approach?
>  It would be good to test both when nearly all of the partitions are
> pruned and when almost none of the partitions are pruned.

I will include some performance numbers in my next email, which hopefully
should not be later than Friday this week.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Feb 6, 2018 at 4:55 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I understand why COLLATION_MATCH thinks that a collation OID match is
>> OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
>> test cases, too?
>
> partcollid == InvalidOid means the partition key is of uncollatable type,
> so further checking the collation is unnecessary.

Yeah, but in that case wouldn't BOTH OIDs be InvalidOid, and thus the
equality test would match anyway?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Tue, Feb 6, 2018 at 3:25 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Agreed.  partition.c has gotten quite big and actually more than half of
> the code that this patch adds really seems to belong outside of it.
>
>> And it seems to me that the code you're adding
>> here is really quite similar to what we've already got in that
>> directory -- for example, predtest.c, which currently does partition
>> pruning, lives there; so does clauses.c, whose evaluate_expr facility
>> this patch wants to use; so does relnode.c, which the other patches
>> modify; and in general this looks an awful lot like other optimizer
>> logic that decomposes clauses.  I'm open to other suggestions but I
>> don't think adding all of this directly into partition.c is a good
>> plan.
>
> Agreed.
>
> A partprune.c in the optimizer's util directory seems like a good place.

partprune.c looks too tightly tied to one feature. I am sure that the
functions used for partition pruning can be used by other
optimizations as well.

partition.c seems to have two kinds of functions: 1. those that build and
manage the relcache, create quals from bounds, etc., which are of the
metadata-management kind; 2. partition bound comparison functions and
other optimizer-related functions. Maybe we should divide the file that
way: the first category of code remains in catalog/ as it is today, and
the second category of functions moves to optimizer/.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> partprune.c looks too tightly tied to one feature. I am sure that the
> functions used for partition pruning can be used by other
> optimizations as well.

Uh, I don't know about that, this code looks like it does partition
pruning specifically, and nothing else.  How else do you think it
could be used?

> partition.c seems to have two kinds of functions: 1. those that build and
> manage the relcache, create quals from bounds, etc., which are of the
> metadata-management kind; 2. partition bound comparison functions and
> other optimizer-related functions. Maybe we should divide the file that
> way: the first category of code remains in catalog/ as it is today, and
> the second category of functions moves to optimizer/.

It would be sensible to separate functions that build and manage data
in the relcache from other functions.  I think we should consider
moving the existing functions of that type from partition.c to
src/backend/utils/cache/partcache.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Wed, Feb 7, 2018 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> partprune.c looks too tightly tied to one feature. I am sure that the
>> functions used for partition pruning can be used by other
>> optimizations as well.
>
> Uh, I don't know about that, this code looks like it does partition
> pruning specifically, and nothing else.  How else do you think it
> could be used?

I didn't have any specific thing in mind, and the more I think about
it, the less likely it looks that it will be used for something else.

While looking at the changes in partition.c I happened to look at the
changes in try_partition_wise_join(). They mark partitions deemed
dummy by pruning as dummy relations. If we accept those changes, we
could very well change the way we handle dummy partitioned tables,
which would mean that we also revert the recent commit
f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
haven't been reviewed yet and so not final.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> While looking at the changes in partition.c I happened to look at the
> changes in try_partition_wise_join(). They mark partitions deemed
> dummy by pruning as dummy relations. If we accept those changes, we
> could very well change the way we handle dummy partitioned tables,
> which would mean that we also revert the recent commit
> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
> haven't been reviewed yet and so not final.

Well, if you have an opinion on those proposed changes, I'd like to hear it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/07 1:36, Robert Haas wrote:
> On Tue, Feb 6, 2018 at 4:55 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I understand why COLLATION_MATCH thinks that a collation OID match is
>>> OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
>>> test cases, too?
>>
>> partcollid == InvalidOid means the partition key is of uncollatable type,
>> so further checking the collation is unnecessary.
> 
> Yeah, but in that case wouldn't BOTH OIDs be InvalidOid, and thus the
> equality test would match anyway?

It seems that that's not necessarily true.  I remember having copied that
logic from the following macro in indxpath.c:

#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
    ((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))

which was added by the following commit:

commit cb37c291060dd13b1a8ff61fceee09efcfbc34e1
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Sep 29 00:43:42 2011 -0400

Fix index matching for operators with mixed collatable/noncollatable inputs.

If an indexable operator for a non-collatable indexed datatype has a
collatable right-hand input type, any OpExpr for it will be marked with a
nonzero inputcollid (since having one collatable input is sufficient to
make that happen).  However, an index on a non-collatable column certainly
doesn't have any collation.  This caused us to fail to match such
operators to their indexes, because indxpath.c required an exact match of
index collation and clause collation.  It seems correct to allow a match
when the index is collation-less regardless of the clause's inputcollid:
an operator with both noncollatable and collatable inputs could perhaps
depend on the collation of the collatable input, but it could hardly
expect the index for the noncollatable input to have that same collation.

[ ... ]

+ *    If the index has a collation, the clause must have the same collation.
+ *    For collation-less indexes, we assume it doesn't matter; this is
+ *    necessary for cases like "hstore ? text", wherein hstore's operators
+ *    don't care about collation but the clause will get marked with a
+ *    collation anyway because of the text argument.  (This logic is
+ *    embodied in the macro IndexCollMatchesExprColl.)
+ *


Discussion leading to the above commit occurred here:

https://www.postgresql.org/message-id/flat/201109282050.p8SKoA4O084649%40wwwmaster.postgresql.org

It seems that we can think similarly for partitioning and let
partition pruning proceed with a clause even if the partition key is
non-collatable while the clause's other argument is collatable.  Even
though it seems we don't yet allow the kind of partitioning that would
lead to such a situation.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:

> > partition.c seems to have two kinds of functions: 1. those that build and
> > manage the relcache, create quals from bounds, etc., which are of the
> > metadata-management kind; 2. partition bound comparison functions and
> > other optimizer-related functions. Maybe we should divide the file that
> > way: the first category of code remains in catalog/ as it is today, and
> > the second category of functions moves to optimizer/.
> 
> It would be sensible to separate functions that build and manage data
> in the relcache from other functions.  I think we should consider
> moving the existing functions of that type from partition.c to
> src/backend/utils/cache/partcache.c.

FWIW I've been thinking that perhaps we need some other separation of
code better than the status quo.  The current partition.c file includes stuff
for several modules and ISTM all these new patches are making more and
more of a mess.  So +1 to the general idea of splitting things up.
Maybe partcache.c is not ambitious enough, but it seems a good first
step.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Wed, Feb 7, 2018 at 7:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> While looking at the changes in partition.c I happened to look at the
>> changes in try_partition_wise_join(). They mark partitions deemed
>> dummy by pruning as dummy relations. If we accept those changes, we
>> could very well change the way we handle dummy partitioned tables,
>> which would mean that we also revert the recent commit
>> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
>> haven't been reviewed yet and so not final.
>
> Well, if you have an opinion on those proposed changes, I'd like to hear it.

I am talking about changes after this comment
         /*
+         * If either child_rel1 or child_rel2 is not a live partition, they'd
+         * not have been touched by set_append_rel_size.  So, its RelOptInfo
+         * would be missing some information that set_append_rel_size sets for
+         * live partitions, such as the target list, child EQ members, etc.
+         * We need to make the RelOptInfo of even the dead partitions look
+         * minimally valid and as having a valid dummy path attached to it.
+         */

There are a couple of problems with this change:
1. An N-way join may call try_partition_wise_join() with the same base
relation on one side N times. The condition will be tried that many
times.

2. We will have to adjust or make similar changes in
try_partition_wise_aggregate() proposed in the partition-wise
aggregate patch. Right now it checks if the relation is dummy but it
will have to check whether the pathlist is also NULL. Any
partition-wise operation that we try in future will need this
adjustment.

AFAIU, for pruned partitions, we don't set necessary properties of the
corresponding RelOptInfo when it is pruned. If we were sure that we
will not use that RelOptInfo anywhere in the rest of the planning,
this would work. But that's not the case. AFAIU, current planner
assumes that a relation which has not been eliminated before planning
(DEAD relation), but later proved to not contribute any rows in the
result, is marked dummy. Partition pruning breaks that assumption and
thus may have other side effects, that we do not see right now. We
have similar problem with dummy partitioned tables, but we have code
in place to avoid looking at the pathlists of their children by not
considering such a partitioned table as partitioned. Maybe we want to
change that too.

Either we add refactoring patches to change the planner so that it
doesn't assume something like that, OR we make sure that the pruned
partition's RelOptInfo has the necessary properties and a dummy pathlist
set. I will vote for the second. We waste CPU cycles marking pruned
partitions as dummy if the dummy pathlist is never used. Maybe we can
avoid setting the dummy pathlist if we can detect that a pruned partition
is guaranteed not to be used, e.g. when the corresponding partitioned
relation does not participate in any join or other upper planning.


Apart from that another change that caught my eye is

Instead of going through root->append_rel_list to pick up the child
appinfos, store them in an array called part_appinfos that stores
partition appinfos in the same order as RelOptInfos are stored in
part_rels, right when the latter are created.

Further, instead of going through root->pcinfo_list to get the list
of partitioned child rels, which ends up including even the rels
that are pruned by set_append_rel_size(), build up a list of "live"
partitioned child rels and use the same to initialize partitioned_rels
field of AppendPath.

That was voted down by Robert during partition-wise join
implementation. And I agree with him. Any change in that area
should change the way we handle AppendRelInfos for all relations, not
just (declarative) partitioned relations.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi Ashutosh.

On 2018/02/09 14:09, Ashutosh Bapat wrote:
> On Wed, Feb 7, 2018 at 7:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> While looking at the changes in partition.c I happened to look at the
>>> changes in try_partition_wise_join(). They mark partitions deemed
>>> dummy by pruning as dummy relations. If we accept those changes, we
>>> could very well change the way we handle dummy partitioned tables,
>>> which would mean that we also revert the recent commit
>>> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
>>> haven't been reviewed yet and so not final.
>>
>> Well, if you have an opinion on those proposed changes, I'd like to hear it.
> 
> I am talking about changes after this comment
>          /*
> +         * If either child_rel1 or child_rel2 is not a live partition, they'd
> +         * not have been touched by set_append_rel_size.  So, its RelOptInfo
> +         * would be missing some information that set_append_rel_size sets for
> +         * live partitions, such as the target list, child EQ members, etc.
> +         * We need to make the RelOptInfo of even the dead partitions look
> +         * minimally valid and as having a valid dummy path attached to it.
> +         */
> 
> There are a couple of problems with this change:
> 1. An N-way join may call try_partition_wise_join() with the same base
> relation on one side N times. The condition will be tried that many
> times.
> 
> 2. We will have to adjust or make similar changes in
> try_partition_wise_aggregate() proposed in the partition-wise
> aggregate patch. Right now it checks if the relation is dummy but it
> will have to check whether the pathlist is also NULL. Any
> partition-wise operation that we try in future will need this
> adjustment.
> 
> AFAIU, for pruned partitions, we don't set necessary properties of the
> corresponding RelOptInfo when it is pruned. If we were sure that we
> will not use that RelOptInfo anywhere in the rest of the planning,
> this would work. But that's not the case. AFAIU, the current planner
> assumes that a relation which has not been eliminated before planning
> (as a DEAD relation), but is later proved not to contribute any rows to
> the result, is marked dummy. Partition pruning breaks that assumption and
> thus may have other side effects, that we do not see right now. We
> have similar problem with dummy partitioned tables, but we have code
> in place to avoid looking at the pathlists of their children by not
> considering such a partitioned table as partitioned. May be we want to
> change that too.
> 
> Either we add refactoring patches to change the planner so that it
> doesn't assume something like that, OR we make sure that the pruned
> partition's RelOptInfo has the necessary properties and a dummy pathlist
> set. I will vote for the second. We waste CPU cycles marking pruned
> partitions as dummy if the dummy pathlist is never used. Maybe we can
> avoid setting the dummy pathlist if we can detect that a pruned partition
> is guaranteed not to be used, e.g. when the corresponding partitioned
> relation does not participate in any join or other upper planning.

Thanks for the analysis.  I agree with all the points of concern.  So for
now, I have dropped all the changes from my patch that give rise to the
concerns.  With the new patch, changes to the existing optimizer code
beside introducing partprune.c in the util directory are pretty thin:

git diff master --stat src/backend/optimizer/
 src/backend/optimizer/path/allpaths.c  |   16 ++
 src/backend/optimizer/util/Makefile    |    2 +-
 src/backend/optimizer/util/clauses.c   |    4 +-
 src/backend/optimizer/util/partprune.c | 1421 +++++++++++
 src/backend/optimizer/util/plancat.c   |   83 ++++---
 src/backend/optimizer/util/relnode.c   |    8 +
 6 files changed, 1504 insertions(+), 30 deletions(-)

So, no refactoring the existing optimizer code, just replacing the
partition pruning mechanism with partprune.c functions.

> Apart from that another change that caught my eye is
> 
> Instead of going through root->append_rel_list to pick up the child
> appinfos, store them in an array called part_appinfos that stores
> partition appinfos in the same order as RelOptInfos are stored in
> part_rels, right when the latter are created.
> 
> Further, instead of going through root->pcinfo_list to get the list
> of partitioned child rels, which ends up including even the rels
> that are pruned by set_append_rel_size(), build up a list of "live"
> partitioned child rels and use the same to initialize partitioned_rels
> field of AppendPath.
> 
> That was voted down by Robert during partition-wise join
> implementation. And I agree with him. Any change in that area
> should change the way we handle AppendRelInfos for all relations, not
> just (declarative) partitioned relations.

I removed part_appinfos from the patch.  Also, I have made the changes
introducing live_partitioned_rels a separate patch, which we can discuss
independently of the pruning changes.

Will post the latest patch set later in the evening.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/06 18:55, Amit Langote wrote:
>> How fast is this patch these days, compared with the current approach?
>>  It would be good to test both when nearly all of the partitions are
>> pruned and when almost none of the partitions are pruned.
> 
> I will include some performance numbers in my next email, which hopefully
> should not be later than Friday this week.

Here is the latest set of patches.  I can see about 2x speedup in planning
time for various partition counts, although it grows linearly as the
partition count grows (same as with HEAD).  Detailed performance figures
follow.

* Partitioned table schema:

H:
create table ht (a int, b int) partition by hash (b);
create table ht_* partition of ht for values with (modulus N, ...)

L:
create table lt (a int, b int) partition by list (b);
create table lt_1 partition of lt for values in (1)
..
create table lt_N partition of lt for values in (N)

R:
create table rt (a int, b int) partition by range (b);
create table rt_1 partition of rt for values from (1) to (<step>)
..
create table rt_N partition of rt for values from ((N-1) * <step>) to (N * <step>)

* Queries

Prunes every partition but 1: select * from table_name where b = 1
Prunes none: select * from table_name where b >= 1


* Planning time in milliseconds (average of 5 runs).

On HEAD:

parts H-prune   H-noprune    L-prune   L-noprune   R-prune    R-noprune
   8     1.50        1.42       1.60        1.55      1.77         1.75
  16     2.49        2.37       2.32        2.65      3.29         3.07
  32     3.96        4.49       3.83        4.14      5.06         5.70
  64     8.02        7.51       7.14        7.34      9.37        10.02
 128    14.47       14.19      13.31       13.99     18.09        18.86
 256    24.76       27.63      25.59       27.87     34.15        37.19
 512    50.36       55.92      52.56       54.76     69.34        72.55
1024   102.94      110.59     104.97      110.41    136.89       146.54

Patched:

parts H-prune    H-noprune   L-prune   L-noprune   R-prune   R-noprune
   8     1.49         0.90      0.87        0.74      0.84        1.09
  16     2.01         1.50      1.42        1.68      1.42        1.41
  32     2.63         2.47      2.08        2.69      2.73        2.81
  64     5.62         4.66      4.45        4.96      4.92        5.08
 128    11.28         9.65      9.00        9.60      8.68        9.91
 256    18.36        18.49     17.11       18.39     17.47       18.43
 512    33.88        36.89     34.06       36.52     34.01       37.26
1024    66.40        72.75     66.37       73.40     67.06       67.06


Attached v25 patches.

0001-Modify-bound-comparision-functions-to-accept-mem.patch

This is Ashutosh's patch that he posted on the "advanced partition
matching algorithm for partition-wise join" thread.

0002-Refactor-partition-bound-search-functions.patch

This is similar to 0001.  Whereas 0001 modifies just the comparison
functions, this one modifies the partition bound search functions, because
the main pruning patch uses the search functions.

0003-Add-parttypid-partcollation-partsupfunc-to-Parti.patch

This adds some of the fields to PartitionScheme that were needed by the
main pruning patch.

The above 3 patches do what they do because we'd like the main pruning
patch to add its functionality by relying on whatever information is made
available in the partitioned table's RelOptInfo.

0004-Faster-partition-pruning.patch

The main patch that adds src/backend/optimizer/util/partprune.c, a module
to provide the functionality that will replace the current approach of
calling relation_excluded_by_constraints() for each partition.

Sorry, but there is still this big TODO here, which I'll try to fix early
next week.

+ * partprune.c
+ *      Provides functions to prune partitions of a partitioned table by
+ *      comparing provided set of clauses with the table's partitions'
+ *      boundaries
+ *
+ * TODO: write a longer description of things in this file


0005-Add-only-unpruned-partitioned-child-rels-to-part.patch

This one teaches the planner to put *only* the un-pruned partitioned child
tables into the partitioned_rels list of certain plan nodes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRctst136uN2BvbWLAkS7w278XmKY4_PUB%2Bxk-%2BNezNq8g%40mail.gmail.com

Attachment
On 2018/02/09 2:58, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
> 
>>> partition.c seems to have two kinds of functions 1. that build and
>>> manage relcache, creates quals from bounds etc. which are metadata
>>> management kind 2. partition bound comparison functions, and other
>>> optimizer related functions. May be we should divide the file that
>>> way. The first category code remains in catalog/ as it is today. The
>>> second catagory functions move to optimizer/.
>>
>> It would be sensible to separate functions that build and manage data
>> in the relcache from other functions.  I think we should consider
>> moving the existing functions of that type from partition.c to
>> src/backend/utils/cache/partcache.c.
> 
> FWIW I've been thinking that perhaps we need some other separation of
> code better than the status quo.  The current partition.c file includes stuff
> for several modules and ISTM all these new patches are making more and
> more of a mess.  So +1 to the general idea of splitting things up.
> Maybe partcache.c is not ambitious enough, but it seems a good first
> step.

Agree with the proposed reorganizing and adding a partcache.c, which I
tried to do in the attached patch.

* The new src/backend/utils/cache/partcache.c contains functions that
initialize relcache's partitioning related fields.  Various partition
bound comparison and search functions (and then some) that work off of the
cached information are moved.  Also, since we cache partition qual,
interface functions RelationGetPartitionQual(Relation) and
get_partition_qual_relid(Oid) are moved too.

* The new src/include/utils/partcache.h contains various struct
definitions that are moved from backend/catalog/partition.c,
include/catalog/partition.h, and include/utils/rel.h.  Also, declarations
of interface functions of partcache.c.

Thoughts?

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/09 21:36, Amit Langote wrote:
> 0004-Faster-partition-pruning.patch
> 
> The main patch that adds src/backend/optimizer/util/partprune.c, a module
> to provide the functionality that will replace the current approach of
> calling relation_excluded_by_constraints() for each partition.
> 
> Sorry, but there is still this big TODO here, which I'll try to fix early
> next week.
> 
> + * partprune.c
> + *      Provides functions to prune partitions of a partitioned table by
> + *      comparing provided set of clauses with the table's partitions'
> + *      boundaries
> + *
> + * TODO: write a longer description of things in this file

And I tried to fix that to some degree in the attached updated version.

Thanks,
Amit

Attachment
Thanks for working on this.  May I suggest opening a completely new
thread?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Tue, Feb 13, 2018 at 2:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> Agree with the proposed reorganizing and adding a partcache.c, which I
> tried to do in the attached patch.
>
> * The new src/backend/utils/cache/partcache.c contains functions that
> initialize relcache's partitioning related fields.  Various partition
> bound comparison and search functions (and then some) that work off of the
> cached information are moved.

Are you moving partition bound comparison functions to partcache.c?
They will also be used by the optimizer, so maybe leave them out of
partcache.c?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


David Rowley wrote:
> On 19 January 2018 at 16:00, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > And I'd like to ask David to check out his mail environment so
> > that SPF record is available for his message.
> 
> Will investigate

This should be fixed now.  Please let us know if you still see problems.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: reorganizing partitioning code

From
Amit Langote
Date:
On 2018/02/13 22:23, Alvaro Herrera wrote:
> Thanks for working on this.  May I suggest to open a completely new
> thread?

Done.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/13 20:08, Amit Langote wrote:
> On 2018/02/09 21:36, Amit Langote wrote:
>> 0004-Faster-partition-pruning.patch
>>
>> The main patch that adds src/backend/optimizer/util/partprune.c, a module
>> to provide the functionality that will replace the current approach of
>> calling relation_excluded_by_constraints() for each partition.
>>
>> Sorry, but there is still this big TODO here, which I'll try to fix early
>> next week.
>>
>> + * partprune.c
>> + *      Provides functions to prune partitions of a partitioned table by
>> + *      comparing provided set of clauses with the table's partitions'
>> + *      boundaries
>> + *
>> + * TODO: write a longer description of things in this file
> 
> And I tried to fix that to some degree in the attached updated version.

Here is an updated version.

I realized that 0005 (Add only unpruned partitioned child rels to
partitioned_rels) did that only for (Merge)Append.  That is, it didn't
handle ModifyTable.  I fixed that by teaching inheritance_planner() to do
it.  In the process, I found out that we don't need the
PartitionedChildRelInfo node and related code anymore, so the patch ended
up removing more code than adding it.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 2. PartitionClauseInfo->keyclauses is a list of PartClause which is
>> not a node type. This will cause _copyPartitionClauseInfo() to fail.
>>
>> I'm still not quite sure the best way to fix #2 since PartClause
>> contains a FmgrInfo. I do have a local fix which moves PartClause to
>> primnodes.h and makes it a proper node type. I also added a copy
>> function which does not copy any of the cache fields in PartClause. It
>> just sets valid_cache to false. I didn't particularly think this was
>> the correct fix. I just couldn't think of how exactly this should be
>> done at the time.
>>
>> The attached patch also adds the missing nodetag from
>> PartitionClauseInfo and also fixes up other code so as we don't memset
>> the node memory to zero, as that would destroy the tag. I ended up
>> just having extract_partition_key_clauses do the makeNode call. This
>> also resulted in populate_partition_clauses being renamed to
>> generate_partition_clauses
>
> I started wondering if it's not such a good idea to make
> PartitionClauseInfo a Node at all?  I went back to your earlier message
> [1] where you said that it's put into the Append node for run-time pruning
> to use, but it doesn't sound nice that we'd be putting into the plan
> something that looks more like a scratchpad for the partition.c code.  I
> think we should try to keep PartitionClauseInfo in partition.h and put
> only the list of matched bare clauses into Append.

That sounds like a good idea.

A patch which puts this back is attached.

I've changed the run-time prune patch to process the clause lists
during execution instead.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 15 February 2018 at 18:57, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Here is an updated version.

Thanks for sending v27. I've had a quick look over it while I was
working on the run-time prune patch; however, I've not quite managed a
complete pass of this version yet.

A couple of things so far:

1. Following loop;

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        /* Only the default partition accepts nulls. */
        if (partition_bound_has_default(boundinfo))
            return bms_make_singleton(boundinfo->default_index);
        else
            return NULL;
    }
}

could become:

if (partition_bound_has_default(boundinfo) &&
   !bms_is_empty(keys->keyisnull))
    return bms_make_singleton(boundinfo->default_index);
else
    return NULL;

2. Is the following form of loop necessary?

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        keys->n_eqkeys++;
        keyisnull[i] = true;
    }
}

Can't this just be:

i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
    keys->n_eqkeys++;
    keyisnull[i] = true;
}

Perhaps you can just Assert(i < partnatts), if you're worried about that.

Similar code exists in get_partitions_for_keys_range

3. Several comments mention partition_bound_bsearch, but there is now
no such function.

4. "us" should be "is"

* not be any unassigned range to speak of, because the range us unbounded

5. The following code is more complex than it needs to be:

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* could be null.
*/
if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
{
    for (i = 0; i < partnatts; i++)
    {
        if (!bms_is_member(i, keys->keyisnotnull))
        {
            include_def = true;
            break;
        }
    }
}


Instead of the for loop, couldn't you just write:

include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

6. The following comment is not well written:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of indexes of partitions that can safely be removed
 * due to each such partition's every allowable non-null datum appearing in
 * a <> opeartor clause.

Maybe it would be better to write:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of partition indexes that can safely be removed due to
 * the discovery of <> clauses for each datum value allowed in the partition.

if not, then "opeartor" needs the spelling fixed.

7. "The following"

 * Followig entry points exist to this module.

Are there any other .c files where we comment on all the extern
functions in this way? I don't recall seeing it before.

8. The following may as well just: context.partnatts = partnatts;

context.partnatts = rel->part_scheme->partnatts;


9. Why palloc0? Wouldn't palloc be ok?

context.partkeys = (Expr **) palloc0(sizeof(Expr *) *
context.partnatts);

Also, no need for context.partnatts, just partnatts should be fine.

10. I'd rather see bms_next_member() used here:

/* Add selected partitions' RT indexes to result. */
while ((i = bms_first_member(partindexes)) >= 0)
    result = bms_add_member(result, rel->part_rels[i]->relid);

There's not really much use for bms_first_member these days. It can be
slower due to having to traverse the unset lower significant bits each
loop. bms_next_member starts where the previous loop left off.
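[For illustration, here is a standalone sketch of the two iteration styles. It uses a single machine word rather than PostgreSQL's real multi-word Bitmapset, so the function bodies are invented stand-ins; only the calling conventions (-1 from an empty pop, -2 when a resumable scan is exhausted) are modelled on the real bms_first_member/bms_next_member.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy single-word stand-ins for PostgreSQL's Bitmapset iteration
 * primitives (the real ones live in src/backend/nodes/bitmapset.c and
 * handle multi-word sets).
 */

/* bms_next_member style: non-destructive, resumes just past 'prev'.
 * Returns -2 when no members remain, matching the real convention. */
static int
next_member(uint64_t set, int prev)
{
	for (int i = prev + 1; i < 64; i++)
		if (set & (UINT64_C(1) << i))
			return i;
	return -2;
}

/* bms_first_member style: destructively pops the lowest member, but must
 * rescan from bit 0 on every call, re-traversing bits already cleared --
 * the inefficiency noted above.  Returns -1 when the set is empty. */
static int
first_member(uint64_t *set)
{
	for (int i = 0; i < 64; i++)
	{
		uint64_t	bit = UINT64_C(1) << i;

		if (*set & bit)
		{
			*set &= ~bit;
			return i;
		}
	}
	return -1;
}
```

With members {3, 17, 40}, both loops visit 3, 17, 40, but next_member() restarts each scan where the previous call stopped instead of from bit 0.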

Will try to review more tomorrow.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 17 February 2018 at 22:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?
> That sounds like a good idea.
>
> A patch which puts this back is attached.

Please find attached an updated patch. The previous one must've got a
bit mangled in a bad merge.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 17 February 2018 at 22:39, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> 10. I'd rather see bms_next_member() used here:
>
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
>     result = bms_add_member(result, rel->part_rels[i]->relid);
>
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.
>
> Will try to review more tomorrow.

As I mentioned yesterday, here's the remainder of the review:

11. The following comment is misleading. It says: "We currently handle
two such cases:", then it goes on to say the 2nd case is not handled.

/*
 * Handle cases where the clause's operator does not belong to
 * the partitioning operator family.  We currently handle two
 * such cases: 1. Operators named '<>' are not listed in any
 * operator family whatsoever, 2.  Ordering operators like '<'
 * are not listed in the hash operator families.  For 1, check
 * if list partitioning is in use and if so, proceed to pass
 * the clause to the caller without doing any more processing
 * ourselves.  2 cannot be handled at all, so the clause is
 * simply skipped.
 */

12. The following code should test for LIST partitioning before doing
anything else:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
    Oid negator;

    /*
     * To confirm if the operator is really '<>', check if its
>      * negator is an equality operator.  If it's a btree
>      * equality operator *and* this is a list partitioned
>      * table, we can use it to prune partitions.
     */
    negator = get_negator(opclause->opno);
    if (OidIsValid(negator) &&
        op_in_opfamily(negator, partopfamily))
    {
        Oid lefttype;
        Oid righttype;
        int strategy;

        get_op_opfamily_properties(negator, partopfamily, false,
                                   &strategy, &lefttype, &righttype);

        if (strategy == BTEqualStrategyNumber &&
            context->strategy == PARTITION_STRATEGY_LIST)
            is_ne_listp = true;
    }

    /* Cannot handle this clause. */
    if (!is_ne_listp)
        continue;
}

The code should probably be in the form of:

if (!op_in_opfamily(opclause->opno, partopfamily))
{
    if (context->strategy != PARTITION_STRATEGY_LIST)
        continue;

    ...

    if (strategy == BTEqualStrategyNumber)
        is_ne_listp = true;
}

that way we'll save 3 syscache lookups when a <> clause appears in a
RANGE or HASH partitioned table.
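[The saving can be modelled in isolation. The names below are invented stand-ins for the real calls (get_negator, op_in_opfamily, get_op_opfamily_properties); the sketch only demonstrates the cheap-test-first ordering being suggested.]

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { STRATEGY_LIST, STRATEGY_RANGE, STRATEGY_HASH } PartStrategy;

static int	syscache_lookups = 0;	/* counts simulated catalog probes */

/* Stand-in for the negator/opfamily probes; each call models one
 * syscache lookup. */
static bool
negator_is_btree_equality(void)
{
	syscache_lookups++;
	return true;
}

/* Checking the cheap in-memory strategy field first lets RANGE and HASH
 * tables reject a <> clause without spending any syscache lookups. */
static bool
is_ne_listp(PartStrategy strategy)
{
	if (strategy != STRATEGY_LIST)
		return false;			/* no lookup spent */
	return negator_is_btree_equality();
}
```

Calling is_ne_listp(STRATEGY_RANGE) leaves the lookup counter untouched, whereas the original ordering would have paid for the lookups before noticing the strategy mismatch.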

13. The following code makes assumptions that the partitioning op
family is btree:

/*
 * In case of NOT IN (..), we get a '<>', which while not
 * listed as part of any operator family, we are able to
 * handle it if its negator is an equality operator that
 * is in turn part of the operator family.
 */
if (!op_in_opfamily(saop_op, partopfamily))
{
    Oid negator = get_negator(saop_op);
    int strategy;
    Oid lefttype,
    righttype;

    if (!OidIsValid(negator))
        continue;
    get_op_opfamily_properties(negator, partopfamily, false,
                               &strategy, &lefttype, &righttype);
    if (strategy != BTEqualStrategyNumber)
        continue;
}

this might not be breakable today, but it could well break in the
future, for example, if hash op family managed to grow two more
strategies, then we could get a false match on the matching strategy
numbers (both 3).

14. The following code assumes there will be a right op:

if (IsA(clause, OpExpr))
{
    OpExpr    *opclause = (OpExpr *) clause;
    Expr    *leftop,
      *rightop,
      *valueexpr;
    bool is_ne_listp = false;

    leftop = (Expr *) get_leftop(clause);
    if (IsA(leftop, RelabelType))
        leftop = ((RelabelType *) leftop)->arg;
    rightop = (Expr *) get_rightop(clause);

This'll crash with the following:

create function nonzero(p int) returns bool as $$ begin return (p <>
0); end;$$ language plpgsql;
create operator # (procedure = nonzero, leftarg = int);
create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);
select * from listp where a#;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

You need to ensure that there are 2 args.
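[A minimal standalone model of the failure mode; the struct and helper names here are invented for illustration, while the real fix checks list_length(opclause->args) before calling get_rightop().]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-in for an OpExpr: a unary operator clause carries one
 * argument, a binary one carries two. */
typedef struct ToyOpExpr
{
	int			nargs;
	const char *args[2];
} ToyOpExpr;

/* Like get_rightop(): yields NULL for a unary clause, which the original
 * code then dereferenced, causing the crash shown above. */
static const char *
toy_rightop(const ToyOpExpr *op)
{
	return op->nargs == 2 ? op->args[1] : NULL;
}

/* The suggested guard: verify there are two args before touching the
 * right operand at all. */
static bool
clause_has_two_args(const ToyOpExpr *op)
{
	return op->nargs == 2 && toy_rightop(op) != NULL;
}
```

With the guard in place, a unary clause such as "a #" is simply skipped rather than dereferenced.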

15. Various if tests in extract_partition_clauses result in a
`continue` when they should perhaps be a `break` instead.

For example:

if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
    continue;

This clause does not depend on the partition key, so there's no point
in trying to match this again for the next partition key.

This item is perhaps not so important, as it's only a small
inefficiency, but I just wanted to point it out. Also, note that plain
Var partition keys cannot be duplicated, but expressions can, so there
may be cases that you don't want to change to `break`

Other conditions which possibly should change to `break` instead of
`continue` include:

/* Only IS [NOT] TRUE/FALSE are any good to us */
if (btest->booltesttype == IS_UNKNOWN ||
    btest->booltesttype == IS_NOT_UNKNOWN)
    continue;

16. The following code looks a bit fragile:

leftop = IsA(clause, Var)
    ? (Expr *) clause
    : (Expr *) get_notclausearg((Expr *) clause);

This appears to assume that the partition key will be a plain Var and
not an expression. I tried to break this with:

create table bp (a bool, b bool) partition by ((a < b));
create table bp_true partition of bp for values in('t');
explain select * from bp where (a < b);

however, naturally, the parser builds an OpExpr instead of a
BooleanTest for this case. If it had built a BooleanTest, then the
above code would mistakenly call get_notclausearg on the (a < b) Expr.

Do you have reason to believe that the code is safe and a good idea?

17. Which relation is the comment talking about?

/*
 * get_partitions_from_args
 *
 * Returns the set of partitions of relation, each of which satisfies some
 * clause in or_args.
 */
static Bitmapset *
get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

18. "sets a field", would it not be better to mention constfalse?:

 * returns right after finding such a clause and before returning, sets a field
 * in context->clauseinfo to inform the caller that we found such clause.

19. "clauses"

* partitioning, we don't require all of eqkeys to be operator clausses.

20. There does not seem to be a need to palloc0 here. palloc seems fine.

keys->ne_datums = (Datum *)
    palloc0(list_length(clauseinfo->ne_clauses) * sizeof(Datum));

This, of course, may leave unset memory in any unused items, but you
never iterate beyond what n_ne_datums gets set to anyway, so I don't
see the need to zero any extra elements.

21. A code comment should be added to the following code to mention
that these are not arrays indexed by partition key as they're only
ever used for LIST partitioning, which only supports a single key.

/* Datum values from clauses containing <> operator */
Datum  *ne_datums;
int n_ne_datums;


22. Can you include: "'keyisnotnull' may also be set for the given
partition key when a strict OpExpr is encountered" in the following
comment?

 * Based on a IS NULL or IS NOT NULL clause that was matched to a partition
 * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/02/17 18:24, David Rowley wrote:
> On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I started wondering if it's not such a good idea to make
>> PartitionClauseInfo a Node at all?  I went back to your earlier message
>> [1] where you said that it's put into the Append node for run-time pruning
>> to use, but it doesn't sound nice that we'd be putting into the plan
> something that looks more like a scratchpad for the partition.c code.  I
>> think we should try to keep PartitionClauseInfo in partition.h and put
>> only the list of matched bare clauses into Append.
> 
> That sounds like a good idea.
> 
> A patch which puts this back is attached.
> 
> I've changed the run-time prune patch to process the clause lists
> during execution instead.

Thank you.  I'll incorporate it in the version I'll send next.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks a lot for the review comments.  Replying to all of your comments.

On 2018/02/17 18:39, David Rowley wrote:
> On 15 February 2018 at 18:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Here is an updated version.
> 
> Thanks for sending v27. I've had a quick look over it while I was
> working on the run-time prune patch. However, I've not quite managed a
> complete pass of this version yet
> 
> A couple of things so far:
> 
> 1. Following loop;
> 
> for (i = 0; i < partnatts; i++)
> {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>         /* Only the default partition accepts nulls. */
>         if (partition_bound_has_default(boundinfo))
>             return bms_make_singleton(boundinfo->default_index);
>         else
>             return NULL;
>     }
> }
> 
> could become:
> 
> if (partition_bound_has_default(boundinfo) &&
>    !bms_is_empty(keys->keyisnull))
>     return bms_make_singleton(boundinfo->default_index);
> else
>     return NULL;
> 
> 2. Is the following form of loop necessary?
> 
> for (i = 0; i < partnatts; i++)
> {
>     if (bms_is_member(i, keys->keyisnull))
>     {
>         keys->n_eqkeys++;
>         keyisnull[i] = true;
>     }
> }
> 
> Can't this just be:
> 
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
>     keys->n_eqkeys++;
>     keyisnull[i] = true;
> }
> 
> Perhaps you can just Assert(i < partnatts), if you're worried about that.
> 
> Similar code exists in get_partitions_for_keys_range

Both 1 and 2 are good suggestions, so done that way.

> 
> 3. Several comments mention partition_bound_bsearch, but there is now
> no such function.
> 
> 4. "us" should be "is"
> 
> * not be any unassigned range to speak of, because the range us unbounded

Fixed.

> 5. The following code is more complex than it needs to be:
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * could be null.
> */
> if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
> {
>     for (i = 0; i < partnatts; i++)
>     {
>         if (!bms_is_member(i, keys->keyisnotnull))
>         {
>             include_def = true;
>             break;
>         }
>     }
> }
> 
> 
> Instead of the for loop, couldn't you just write:
> 
> include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

Indeed, it's much simpler, though I wrote it as:

+    if (bms_num_members(keys->keyisnotnull) < partnatts)
+        include_def = true;

> 6. The following comment is not well written:
> 
>  * get_partitions_excluded_by_ne_datums
>  *
>  * Returns a Bitmapset of indexes of partitions that can safely be removed
>  * due to each such partition's every allowable non-null datum appearing in
>  * a <> opeartor clause.
> 
> Maybe it would be better to write:
> 
>  * get_partitions_excluded_by_ne_datums
>  *
>  * Returns a Bitmapset of partition indexes that can safely be removed due to
>  * the discovery of <> clauses for each datum value allowed in the partition.
> 
> if not, then "opeartor" needs the spelling fixed.

Sure, your rewrite sounds much better.

> 7. "The following"
> 
>  * Followig entry points exist to this module.

Fixed.

> 
> Are there any other .c files where we comment on all the extern
> functions in this way? I don't recall seeing it before.

Hmm, not in the way this patch does, but some files in the executor do
have such introductory comments describing the functions exported by the
module.  See, for example, nodeSeqscan.c, execMain.c, etc.

Not saying that this is the best way to introduce the module, but this is
the one I went with for now.  If this format is not very informative, I'm
willing to rewrite it some other way.

> 8. The following may as well just: context.partnatts = partnatts;
> 
> context.partnatts = rel->part_scheme->partnatts;
>
> 9. Why palloc0? Wouldn't palloc be ok?
> 
> context.partkeys = (Expr **) palloc0(sizeof(Expr *) *
> context.partnatts);
> 
> Also, no need for context.partnatts, just partnatts should be fine.

Fixed both.  Yes, palloc suffices.

> 10. I'd rather see bms_next_member() used here:
> 
> /* Add selected partitions' RT indexes to result. */
> while ((i = bms_first_member(partindexes)) >= 0)
>     result = bms_add_member(result, rel->part_rels[i]->relid);
> 
> There's not really much use for bms_first_member these days. It can be
> slower due to having to traverse the unset lower significant bits each
> loop. bms_next_member starts where the previous loop left off.

Thanks for clarifying.  Used bms_next_member().

> Will try to review more tomorrow.
>
On 2018/02/18 11:25, David Rowley wrote:
> As I mentioned yesterday, here's the remainder of the review:
> 
> 11. The following comment is misleading. It says: "We currently handle
> two such cases:", then it goes on to say the 2nd case is not handled.
> 
> /*
>  * Handle cases where the clause's operator does not belong to
>  * the partitioning operator family.  We currently handle two
>  * such cases: 1. Operators named '<>' are not listed in any
>  * operator family whatsoever, 2.  Ordering operators like '<'
>  * are not listed in the hash operator families.  For 1, check
>  * if list partitioning is in use and if so, proceed to pass
>  * the clause to the caller without doing any more processing
>  * ourselves.  2 cannot be handled at all, so the clause is
>  * simply skipped.
>  */

You're right.  We don't really "handle" 2.  Rewrote the comment like this:

+             * Normally we only bother with operators that are listed as
+             * being part of the partitioning operator family.  But we
+             * make an exception in one case -- operators named '<>' are
+             * not listed in any operator family whatsoever, in which
+             * case, we try to perform partition pruning with it only if
+             * list partitioning is in use.

> 
> 12. The following code should test for LIST partitioning before doing
> anything else:
> 
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
>     Oid negator;
> 
>     /*
>      * To confirm if the operator is really '<>', check if its
>      * negator is an equality operator.  If it's a btree
>      * equality operator *and* this is a list partitioned
>      * table, we can use it to prune partitions.
>      */
>     negator = get_negator(opclause->opno);
>     if (OidIsValid(negator) &&
>         op_in_opfamily(negator, partopfamily))
>     {
>         Oid lefttype;
>         Oid righttype;
>         int strategy;
> 
>         get_op_opfamily_properties(negator, partopfamily,
>             false,
>            &strategy,
>            &lefttype, &righttype);
> 
>         if (strategy == BTEqualStrategyNumber &&
>             context->strategy == PARTITION_STRATEGY_LIST)
>             is_ne_listp = true;
>     }
> 
>     /* Cannot handle this clause. */
>     if (!is_ne_listp)
>         continue;
> }
> 
> The code should probably be in the form of:
> 
> if (!op_in_opfamily(opclause->opno, partopfamily))
> {
>     if (context->strategy != PARTITION_STRATEGY_LIST)
>         continue;
> 
>     ...
> 
>     if (strategy == BTEqualStrategyNumber)
>         is_ne_listp = true;
> }
> 
> that way we'll save 3 syscache lookups when a <> clause appears in a
> RANGE or HASH partitioned table.

Good idea, done.

> 13. The following code makes assumptions that the partitioning op
> family is btree:
> 
> /*
>  * In case of NOT IN (..), we get a '<>', which while not
>  * listed as part of any operator family, we are able to
>  * handle it if its negator is an equality operator that
>  * is in turn part of the operator family.
>  */
> if (!op_in_opfamily(saop_op, partopfamily))
> {
>     Oid negator = get_negator(saop_op);
>     int strategy;
>     Oid lefttype,
>     righttype;
> 
>     if (!OidIsValid(negator))
>         continue;
>     get_op_opfamily_properties(negator, partopfamily, false,
>       &strategy,
>        &lefttype, &righttype);
>     if (strategy != BTEqualStrategyNumber)
>         continue;
> }
> 
> this might not be breakable today, but it could well break in the
> future, for example, if hash op family managed to grow two more
> strategies, then we could get a false match on the matching strategy
> numbers (both 3).

I guess we'd only be able to do anything useful with a NOT IN (..) under
list partitioning, so I changed the code and surrounding comments as I
did per your comment above.  Since we process it only if list
partitioning is used, the existing code can safely assume that the
partitioning operator family is btree.

> 14. The following code assumes there will be a right op:
> 
> if (IsA(clause, OpExpr))
> {
>     OpExpr    *opclause = (OpExpr *) clause;
>     Expr    *leftop,
>       *rightop,
>       *valueexpr;
>     bool is_ne_listp = false;
> 
>     leftop = (Expr *) get_leftop(clause);
>     if (IsA(leftop, RelabelType))
>         leftop = ((RelabelType *) leftop)->arg;
>     rightop = (Expr *) get_rightop(clause);
> 
> This'll crash with the following:
> 
> create function nonzero(p int) returns bool as $$ begin return (p <>
> 0); end;$$ language plpgsql;
> create operator # (procedure = nonzero, leftarg = int);
> create table listp (a int) partition by list (a);
> create table listp1 partition of listp for values in(1);
> select * from listp where a#;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 
> You need to ensure that there are 2 args.

Oops, so we can't prune with a unary operator clause.  Added the check on
the number of args.

> 15. Various if tests in extract_partition_clauses result in a
> `continue` when they should perhaps be a `break` instead.
> 
> For example:
> 
> if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid))
>     continue;
> 
> This clause does not depend on the partition key, so there's no point
> in trying to match this again for the next partition key.
> 
> This item is perhaps not so important, as it's only a small
> inefficiency, but I just wanted to point it out. Also, note that plain
> Var partition keys cannot be duplicated, but expressions can, so there
> may be cases that you don't want to change to `break`
> 
> Other conditions which possibly should change to `break` instead of
> `continue` include:
> 
> /* Only IS [NOT] TRUE/FALSE are any good to us */
> if (btest->booltesttype == IS_UNKNOWN ||
>     btest->booltesttype == IS_NOT_UNKNOWN)
>     continue;

You're quite right.  Replaced all those continue's with break.

> 16. The following code looks a bit fragile:
> 
> leftop = IsA(clause, Var)
> ? (Expr *) clause
> : (Expr *) get_notclausearg((Expr *) clause);
> 
> This appears to assume that the partition key will be a plain Var and
> not an expression. I tried to break this with:
> 
> create table bp (a bool, b bool) partition by ((a < b));
> create table bp_true partition of bp for values in('t');
> explain select * from bp where (a < b);
> 
> however, naturally, the parser builds an OpExpr instead of a
> BooleanTest for this case. If it had built a BooleanTest, then the
> above code would mistakenly call get_notclausearg on the (a < b) Expr.
> 
> Do you have reason to believe that the code is safe and a good idea?

Thanks for the example.  I think there were problems here.  The way the
current code matched these specially shaped clauses with a Boolean
partition key wouldn't have resulted in correct pruning for your example,
and perhaps other cases as well.

I looked at match_boolean_index_clause() and decided that we needed
something similar here.  So, in the updated patch, I've added a
match_boolean_partition_clause(), which will be called at the beginning of
the loop to recognize such specially shaped clauses as applicable to a
Boolean partition key and add them to the set of pruning clauses.  With
the new code:

create table bp (a int, b int) partition by list ((a < b));
create table bp_true partition of bp for values in ('true');
create table bp_false partition of bp for values in ('false');

explain select * from bp where (a < b) is true;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: ((a < b) IS TRUE)
(3 rows)

explain select * from bp where (a < b) is false;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=1507 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=1507 width=8)
         Filter: ((a < b) IS FALSE)
(3 rows)

explain select * from bp where (a < b) is not false;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: ((a < b) IS NOT FALSE)
(3 rows)

explain select * from bp where (a < b) = true;
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a < b)
(3 rows)

explain select * from bp where (a < b) = false;
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a >= b)
(3 rows)

explain select * from bp where (a < b);
                           QUERY PLAN
----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_true  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a < b)
(3 rows)

explain select * from bp where not (a < b);
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..38.25 rows=753 width=8)
   ->  Seq Scan on bp_false  (cost=0.00..38.25 rows=753 width=8)
         Filter: (a >= b)
(3 rows)

> 17. Which relation is the comment talking about?
> 
> /*
>  * get_partitions_from_args
>  *
>  * Returns the set of partitions of relation, each of which satisfies some
>  * clause in or_args.
>  */
> static Bitmapset *
> get_partitions_from_or_args(PartitionPruneContext *context, List *or_args)

Fixed the comment.

> 18. "sets a field", would it not be better to mention constfalse?:
> 
>  * returns right after finding such a clause and before returning, sets a field
>  * in context->clauseinfo to inform the caller that we found such clause.

You're right.

> 19. "clauses"
> 
> * partitioning, we don't require all of eqkeys to be operator clausses.

Fixed.

> 20. There does not seem to be a need to palloc0 here. palloc seems fine.
> 
> keys->ne_datums = (Datum *)
> palloc0(list_length(clauseinfo->ne_clauses) *
> sizeof(Datum));
>
> This, of course, may leave unset memory in any unused items, but you
> never iterate beyond what n_ne_datums gets set to anyway, so I don't
> see the need to zero any extra elements.

Agreed about using palloc.

> 21. A code comment should be added to the following code to mention
> that these are not arrays indexed by partition key as they're only
> ever used for LIST partitioning, which only supports a single key.
> 
> /* Datum values from clauses containing <> operator */
> Datum  *ne_datums;
> int n_ne_datums;

OK, done.  Also adjusted the comment above the struct's definition.

> 22. Can you include: "'keyisnotnull' may also be set for the given
> partition key when a strict OpExpr is encountered" in the following
> comment?
> 
>  * Based on a IS NULL or IS NOT NULL clause that was matched to a partition
>  * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set.

Done.

Attached updated patches.  Thanks again!

Regards,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 19 February 2018 at 22:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.  Thanks again!

Thanks for making those changes.  I've made another pass of v28 and
have a few more comments.

The patch is starting to look good, but there are some new changes in
recent versions which still don't look quite right.

1. This does not fully make sense:

/*
 * Remove the indexes of partitions excluded due to each of
 * those partitions' *all* of allowed datums appearing in
 * keys->ne_datums, that is compared to the partition key
 * using <> operator.
 */

 "each of those partitions' *all* of allowed" is not correct.

Maybe better to write:

/*
 * Remove the indexes of any partitions which cannot possibly
 * contain rows matching the clauses due to key->ne_datums containing
 * all datum values which are allowed in the given partition.  This
 * is only possible to do in LIST partitioning as it's the only
 * partitioning strategy which allows the specification of exact values.
 */

2. Mine does not, but some compilers may complain about the result
variable in get_partitions_for_keys() being uninitialised.  Probably the
easiest fix would be to just assign NULL to it in the default case.

3. Did you mean to put this Assert() inside the loop?

memset(keyisnull, false, sizeof(keyisnull));
i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
    keys->n_eqkeys++;
    keyisnull[i] = true;
}
Assert(i < partnatts);

i will always be -2 at the end of the loop. Seems like a useless
Assert in its current location.

4. Can you add a comment here to say: "Note: LIST partitioning only
supports a single partition key, therefore this function requires no
looping over the partition keys"

/*
 * get_partitions_for_keys_list
 * Return partitions of a list partitioned table for requested keys
 *
 * This interprets the keys and looks up partitions in the partition bound
 * descriptor using the list partitioning semantics.
 */

5. The following comment contains a bit of duplication to the comment
which comes after it. Maybe the following:

/*
* If the query is looking for null keys, there can only be one such
* partition.  Return the same if one exists.
*/

can be changed to:

/* Handle clauses requesting a NULL valued partition key */

6. The following comment does not quite make sense:

/* Exactly matching datum exists. */

Probably better to write:

/* An exact matching datum exists. */

7. "If it's not equal (<)" I think you mean (>), not (<), in:

 * The bound at minoff is <= minkeys, given the way
 * partition_list_bsearch() works.  If it's not equal (<), then
 * increment minoff to make it point to the datum on the right
 * that necessarily satisfies minkeys.  Also do the same if it is
 * equal but minkeys is exclusive.

However, the comment is a bit clumsy. Maybe the following is better?

/*
 * partition_list_bsearch returning a positive number means that
 * minkeys[0] must be greater than or equal to the smallest datum.
 * If we didn't find an exact matching datum (!is_equal) or if the
 * operator used was non-inclusive (>), then in both of these
 * cases we're not interested in the datum pointed to by minoff,
 * but we may start getting matches in the partition which the
 * next datum belongs to, so point to that one instead.  (This may
 * be beyond the last datum in the array, but we'll detect that
 * later.)
 */

8. The following comment could be improved:

* minkeys is greater than the datums of all non-default partitions,
* meaning there isn't one to return.  Return the default partition if
* one exists.

how about:

* The value of minkeys[0] is greater than all of the datums we have
* partitions for.  The only possible partition that could contain a
* match is the default partition.  Return that, if it exists.

9. The following could also be improved:

* The bound at maxoff is <= maxkeys, given the way
* partition_list_bsearch works.  If the bound at maxoff exactly
* matches maxkey (is_equal), but the maxkey is exclusive, then
* decrement maxoff to point to the bound on the left.

how about:

* partition_list_bsearch returning a positive number means that
* maxkeys[0] must be greater than or equal to the smallest datum.
* If the match found is an equal match, but the operator used is
* non-inclusive of that value (<), then the partition belonging
* to maxoff cannot match, so we'll decrement maxoff to point to
* the partition belonging to the previous datum.  We might end up
* decrementing maxoff down to -1, but we'll handle that later.

10. Can you append "  This may not technically be true for some data
types (e.g. integer types), however, we currently lack any sort of
infrastructure to provide us with proofs that would allow us to do
anything smarter here." to:

* For range queries, always include the default list partition,
* because list partitions divide the key space in a discontinuous
* manner, not all values in the given range will have a partition
* assigned.

11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
but get_partitions_for_keys_list likes to "minoff--", can this be made
the same? Personally, I like -- over -= 1 as it's shorter. Although I
do remember having an argument with my university professor about
this. He claimed -= 1 was clearer... I'm still unsure what he found so
confusing about -- ...

12. The following code could be optimised a little for the case when
there's no default:

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
if (boundinfo->indexes[i] < 0)
{
include_def = true;
break;
}
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
include_def = true;

if (include_def && partition_bound_has_default(boundinfo))
result = bms_add_member(result, boundinfo->default_index);

return result;

Maybe something more like:

if (!partition_bound_has_default(boundinfo))
    return result;

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
    if (boundinfo->indexes[i] < 0)
        return bms_add_member(result, boundinfo->default_index);
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
    return bms_add_member(result, boundinfo->default_index);

Which saves a bit of needless work when there's no default to add, and
also saves a few lines, including the line where you declare the
include_def variable.

13. Variable name:

Bitmapset  *partitioned_rels_bms = NULL;

This should likely be called partitioned_relids, and be of type Relids
instead of Bitmapset.

14. This line removal seems surplus. It should probably be fixed
independently of this patch.

  parent_roots[appinfo->child_relid] = subroot;
-
  continue;

15. I'm unsure how safe the following code is:

while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
    partitioned_rels = lappend_int(partitioned_rels, parent_rti);

You're now putting this list in ascending order of relid, but some
code in set_plan_refs assumes the root partition is the first element:

root->glob->rootResultRelations =
lappend_int(root->glob->rootResultRelations,
linitial_int(splan->partitioned_rels));

By luck, the first element might today be the root due to the way we
expand the inheritance hierarchy, but I think this code is wrong to
rely on that.

I'm not really a fan of having the root partition be the first element
in the List. I would much rather see a Relids type and a special Index
field for the root, but that might be more changes that you'd like to
make here. I just don't think what you have now is correct.

16. This should probably return Relids rather than Bitmapset *.

Bitmapset *
prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

Please update the mention of Bitmapset in the comment at the top of
the function too.

17. hmm, my patch did palloc(), not palloc0(). My original patch was
broken and missed this, but v2 got it.

context->clauseinfo = partclauseinfo =
palloc0(sizeof(PartitionClauseInfo));

There's no need to palloc0() here. You're setting all the fields to
zero just below. As far as I understand it, it's only Node types that
we have to go through the rigmarole of doing both.

18. I tentatively agree with you having changed the continue to break
in the following:

/* We can't use any volatile value to prune partitions. */
if (contain_volatile_functions((Node *) valueexpr))
break;

I believe it's not wrong to break here, but keep in mind you're
testing valueexpr rather than something with the OpExpr itself. The
reason I'm not saying this is wrong is that if the valueexpr is
volatile then it cannot possibly match another partition key anyway,
so there's likely no point in continuing to look for another match...
You should likely write a comment to explain this a bit. I think all
of the other places you've changed to break look fine. The
ScalarArrayOpExpr volatile function test is fine to break from since
the operands cannot be reversed in that case, so rightop certainly
can't match a partition key.

19. I might have caused this, but there's no such variable as 'cur'

/* cur is more restrictive, so replace the existing. */

20. Is there a difference between
partition_bound_has_default(context->boundinfo) and
context->has_default_part? Any reason for both? Your code uses both. I
don't yet understand why has_default_part was added.

That's all I have for now.

It's getting close. Good work!

I keep having to turn up my strictness level each time I review.  I'd
like to be able to turn it up all the way earlier, but I fear I may
still be reviewing v1 if I'd done that :-)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks for the review.

On 2018/02/19 22:40, David Rowley wrote:
> On 19 February 2018 at 22:19, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patches.  Thanks again!
> 
> Thanks for making those changes.  I've made another pass of v28 and
> have a few more comments.
> 
> The patch is starting to look good, but there are some new changes in
> recent versions which still don't look quite right.
> 
> 1. This does not fully make sense:
> 
> /*
>  * Remove the indexes of partitions excluded due to each of
>  * those partitions' *all* of allowed datums appearing in
>  * keys->ne_datums, that is compared to the partition key
>  * using <> operator.
>  */
> 
>  "each of those partitions' *all* of allowed" is not correct.
> 
> Maybe better to write:
> 
> /*
>  * Remove the indexes of any partitions which cannot possibly
>  * contain rows matching the clauses due to key->ne_datums containing
>  * all datum values which are allowed in the given partition.  This
>  * is only possible to do in LIST partitioning as it's the only
>  * partitioning strategy which allows the specification of exact values.
>  */

Ah, your rewrite sounds much better.

> 2. Mine does not, but some compilers may complain about the result
> variable in get_partitions_for_keys() being uninitialised. Probably the
> easiest fix would be to just assign NULL to it in the default case.

Done.

> 3. Did you mean to put this Assert() inside the loop?
> 
> memset(keyisnull, false, sizeof(keyisnull));
> i = -1;
> while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
> {
> keys->n_eqkeys++;
> keyisnull[i] = true;
> }
> Assert(i < partnatts);
>
> i will always be -2 at the end of the loop. Seems like a useless
> Assert in its current location.

Face palm!  Fixed.

> 4. Can you add a comment here to say: "Note: LIST partitioning only
> supports a single partition key, therefore this function requires no
> looping over the partition keys"
> 
> /*
>  * get_partitions_for_keys_list
>  * Return partitions of a list partitioned table for requested keys
>  *
>  * This interprets the keys and looks up partitions in the partition bound
>  * descriptor using the list partitioning semantics.
>  */

OK, done.

> 5. The following comment contains a bit of duplication to the comment
> which comes after it. Maybe the following:
> 
> /*
> * If the query is looking for null keys, there can only be one such
> * partition.  Return the same if one exists.
> */
> 
> can be changed to:
> 
> /* Handle clauses requesting a NULL valued partition key */

You're right.  Changed.

> 6. The following comment does not quite make sense:
> 
> /* Exactly matching datum exists. */
> 
> Probably better to write:
> 
> /* An exact matching datum exists. */

OK, done.

> 7. "If it's not equal (<)" I think you mean (>), not (<), in:
> 
>  * The bound at minoff is <= minkeys, given the way
>  * partition_list_bsearch() works.  If it's not equal (<), then
>  * increment minoff to make it point to the datum on the right
>  * that necessarily satisfies minkeys.  Also do the same if it is
>  * equal but minkeys is exclusive.
> 
> However, the comment is a bit clumsy. Maybe the following is better?
> 
> /*
>  * partition_list_bsearch returning a positive number means that
>  * minkeys[0] must be greater than or equal to the smallest datum.
>  * If we didn't find an exact matching datum (!is_equal) or if the
>  * operator used was non-inclusive (>), then in both of these
>  * cases we're not interested in the datum pointed to by minoff,
>  * but we may start getting matches in the partition which the
>  * next datum belongs to, so point to that one instead.  (This may
>  * be beyond the last datum in the array, but we'll detect that
>  * later.)
>  */

Your rewrite is much better.

> 8. The following comment could be improved:
> 
> * minkeys is greater than the datums of all non-default partitions,
> * meaning there isn't one to return.  Return the default partition if
> * one exists.
> 
> how about:
> 
> * The value of minkeys[0] is greater than all of the datums we have
> * partitions for.  The only possible partition that could contain a
> * match is the default partition.  Return that, if it exists.

OK, adopted your rewrite.

> 9. The following could also be improved:
> 
> * The bound at maxoff is <= maxkeys, given the way
> * partition_list_bsearch works.  If the bound at maxoff exactly
> * matches maxkey (is_equal), but the maxkey is exclusive, then
> * decrement maxoff to point to the bound on the left.
> 
> how about:
> 
> * partition_list_bsearch returning a positive number means that
> * maxkeys[0] must be greater than or equal to the smallest datum.
> * If the match found is an equal match, but the operator used is
> * non-inclusive of that value (<), then the partition belonging
> * to maxoff cannot match, so we'll decrement maxoff to point to
> * the partition belonging to the previous datum.  We might end up
> * decrementing maxoff down to -1, but we'll handle that later.

OK, done.

> 10. Can you append "  This may not technically be true for some data
> types (e.g. integer types), however, we currently lack any sort of
> infrastructure to provide us with proofs that would allow us to do
> anything smarter here." to:
> 
> * For range queries, always include the default list partition,
> * because list partitions divide the key space in a discontinuous
> * manner, not all values in the given range will have a partition
> * assigned.

Hmm, that seems to make sense, so added the text.

I guess you know it already, but I'm trying to say in the comment that
list partitioning, by definition, does not force you to enumerate all of
the values that may exist for a data type.

> 11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
> but get_partitions_for_keys_list likes to "minoff--", can this be made
> the same? Personally, I like -- over -= 1 as it's shorter. Although I
> do remember having an argument with my university professor about
> this. He claimed -= 1 was clearer... I'm still unsure what he found so
> confusing about -- ...

I will go with minoff-- for consistency as you say.

> 12. The following code could be optimised a little for the case when
> there's no default:
> 
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
> if (boundinfo->indexes[i] < 0)
> {
> include_def = true;
> break;
> }
> }
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
> include_def = true;
> 
> if (include_def && partition_bound_has_default(boundinfo))
> result = bms_add_member(result, boundinfo->default_index);
> 
> return result;
> 
> Maybe something more like:
> 
> if (!partition_bound_has_default(boundinfo))
>     return result;
> 
> /*
> * There may exist a range of values unassigned to any non-default
> * partition between the datums at minoff and maxoff.
> */
> for (i = minoff; i <= maxoff; i++)
> {
>     if (boundinfo->indexes[i] < 0)
>         return bms_add_member(result, boundinfo->default_index);
> }
> 
> /*
> * Since partition keys with nulls are mapped to the default range
> * partition, we must include the default partition if some keys
> * *could* be null.
> */
> if (bms_num_members(keys->keyisnotnull) < partnatts)
>     return bms_add_member(result, boundinfo->default_index);
> 
> Which saves a bit of needless work when there's no default to add, and
> also saves a few lines, including the line where you declare the
> include_def variable.

Ah, that's neat.  Done that way.  I also changed other code that sets
include_def to instead add the default partition index to result right
away, instead of just setting include_def.  So, include_def is now gone.

> 13. Variable name:
> 
> Bitmapset  *partitioned_rels_bms = NULL;
> 
> This should likely be called partitioned_relids, and be of type Relids
> instead of Bitmapset.

Makes sense, done.

> 14. This line removal seems surplus. It should probably be fixed
> independently of this patch.
> 
>   parent_roots[appinfo->child_relid] = subroot;
> -
>   continue;

Hadn't noticed that.  Fixed.

> 15. I'm unsure how safe the following code is:
> 
> while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
>     partitioned_rels = lappend_int(partitioned_rels, parent_rti);
> 
> You're now putting this list in ascending order of relid, but some
> code in set_plan_refs assumes the root partition is the first element:
> 
> root->glob->rootResultRelations =
> lappend_int(root->glob->rootResultRelations,
> linitial_int(splan->partitioned_rels));
> 
> By luck, the first element might today be the root due to the way we
> expand the inheritance hierarchy, but I think this code is wrong to
> rely on that.
> 
> I'm not really a fan of having the root partition be the first element
> in the List. I would much rather see a Relids type and a special Index
> field for the root, but that might be more changes that you'd like to
> make here. I just don't think what you have now is correct.

Hmm, so you're saying that it's not future-proof for the code here to
assume that the root table will always get the smallest RT index of the
tables in a given partition tree.  I guess it is a concern that exists
independently of the changes this patch makes, but I agree with the
concern.  We can write another patch to break that assumption by using the
method you suggest of adding an Index field to the ModifyTable node to
store the root table's index.

> 16. This should probably return Relids rather than Bitmapset *.
> 
> Bitmapset *
> prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)
> 
> Please update the mention of Bitmapset in the comment at the top of
> the function too.

OK, done.

> 17. hmm, my patch did palloc(), not palloc0(). My original patch was
> broken and missed this, but v2 got it.
> 
> context->clauseinfo = partclauseinfo =
> palloc0(sizeof(PartitionClauseInfo));
> 
> There's no need to palloc0() here. You're setting all the fields to
> zero just below. As far as I understand it, it's only Node types that
> we have to go through the rigmarole of doing both.

Ah, you're right.  Used palloc.

> 18. I tentatively agree with you having changed the continue to break
> in the following:
> 
> /* We can't use any volatile value to prune partitions. */
> if (contain_volatile_functions((Node *) valueexpr))
> break;
> 
> I believe it's not wrong to break here, but keep in mind you're
> testing valueexpr rather than something with the OpExpr itself. The
> reason I'm not saying this is wrong is that if the valueexpr is
> volatile then it cannot possibly match another partition key anyway,
> so there's likely no point in continuing to look for another match...
> You should likely write a comment to explain this a bit. I think all
> of the other places you've changed to break look fine. The
> ScalarArrayOpExpr volatile function test is fine to break from since
> the operands cannot be reversed in that case, so rightop certainly
> can't match a partition key.

When working on this comment, I realized we shouldn't really "break" on
a collation mismatch.  Multiple keys can have the same expression but
different collations, and "break"ing in that case would mean we'd miss
matching the clause to another key that has the matching collation.  IOW,
the order in which clauses appear in the input list would determine
whether they're matched to partition keys correctly or not.

See this somewhat made up example:

create table rp (a text) partition by range (substr(a, 1) collate "en_GB",
substr(a, 1) collate "en_US");
create table rp1 partition of rp for values from ('a', 'a') to ('a', 'e');
create table rp2 partition of rp for values from ('a', 'e') to ('a', 'z');
create table rp3 partition of rp for values from ('b', 'a') to ('b', 'e');

For the following query:

select * from rp where substr(a, 1) = 'e' collate "en_US" and substr(a, 1)
= 'a' collate "en_GB";

With the current code, we'll end up discarding (via break) the 1st clause
after its collation (en_US) doesn't match the 1st key's collation (en_GB)
and thus we'll end with only one clause, that is the 2nd one, being added
to matched clauses.  That would result in only p3 being pruned, whereas
both rp1 and rp3 should be pruned.

I've fixed that.  With the new code, both the expression and the collation
should match before we conclude that the clause matched the partition key
and then check other properties of the clause like operator strictness,
valueexpr volatility, etc.  If those other properties are not satisfied,
we can "break", because with those properties they won't be useful for
pruning, even if it matched some other key.
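
The distinction can be sketched with a toy, self-contained matching loop
(hypothetical names, not the actual patch code): an expression or
collation mismatch moves on to the next key (continue), whereas a defect
of the clause itself, such as a volatile value expression, abandons it
for all keys (break).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PARTNATTS 2

/* Toy partition key descriptors mirroring the example table: the same
 * expression under two different collations. */
static const char *key_expr[PARTNATTS] = {"substr(a, 1)", "substr(a, 1)"};
static const char *key_coll[PARTNATTS] = {"en_GB", "en_US"};

/* Returns the index of the partition key a clause matches, or -1.
 * Expression/collation mismatch => try the next key (continue);
 * a volatile value expression => useless against every key (break). */
static int
match_clause_to_key(const char *expr, const char *coll, bool value_is_volatile)
{
    for (int i = 0; i < PARTNATTS; i++)
    {
        if (strcmp(expr, key_expr[i]) != 0)
            continue;
        if (strcmp(coll, key_coll[i]) != 0)
            continue;           /* may still match a later key */
        if (value_is_volatile)
            break;              /* can't help pruning on any key */
        return i;
    }
    return -1;
}
```

With the old "break" on collation mismatch, the en_US clause in the
example above would never have been tried against the second key.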

> 19. I might have caused this, but there's no such variable as 'cur'
> 
> /* cur is more restrictive, so replace the existing. */

Fixed this and a few other instances.

> 20. Is there a difference between
> partition_bound_has_default(context->boundinfo) and
> context->has_default_part? Any reason for both? Your code uses both. I
> don't yet understand why has_default_part was added.

One cannot use partition_bound_has_default(context->boundinfo) within
partprune.c, because PartitionBoundInfo's definition is private to
partition.c.  Maybe, if someday we expose it (along with the related
partition-bound manipulating functions) in, say, a partbound.h, we won't
have to resort to storing the same value in two different places like
this.

> That's all I have for now.
> 
> It's getting close. Good work!
> 
> I keep having to turn up my strictness level each time I review.  I'd
> like to be able to turn it up all the way earlier, but I fear I may
> still be reviewing v1 if I'd done that :-)

It's really great that you've looked into these patches in such great
detail, suggesting various performance improvements (big and small) and
also pointing out various edge cases.  Thank you very much! :)

Attached updated version.

Regards,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Rajkumar Raghuwanshi
Date:
On Tue, Feb 20, 2018 at 12:34 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Attached updated version.

Hi Amit,

I have applied the v29 patch-set on head and got "ERROR:  operator 1209 is not a member of opfamily 1994" with the below test case. Please take a look.

CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');

INSERT INTO part VALUES (100,'ABC',10);
INSERT INTO part VALUES (110,'DEF',20);
INSERT INTO part VALUES (120,'GHI',10);
INSERT INTO part VALUES (130,'JKL',100);

explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
ERROR:  operator 1209 is not a member of opfamily 1994

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
 

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Rajkumar for the test.

On 2018/02/20 16:56, Rajkumar Raghuwanshi wrote:
> I have applied v29 patch-set on head and  got "ERROR:  operator 1209 is not
> a member of opfamily 1994" with below test case. Please take a look.
> 
> CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
> CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
> CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
> CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
> CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');
> 
> INSERT INTO part VALUES (100,'ABC',10);
> INSERT INTO part VALUES (110,'DEF',20);
> INSERT INTO part VALUES (120,'GHI',10);
> INSERT INTO part VALUES (130,'JKL',100);
> 
> explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
> *ERROR:  operator 1209 is not a member of opfamily 1994*

An oversight in the v28 patch seems to have caused this.  Fixed in the
attached.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
v30-0004-Faster-partition-pruning.patch contains:

+create table coll_pruning_multi (a text) partition by range
(substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");

This'll likely work okay on Linux. Other collate tests seem to use
COLLATE "POSIX" or "C" so that they work cross-platform.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/21 10:19, David Rowley wrote:
> v30-0004-Faster-partition-pruning.patch contains:
> 
> +create table coll_pruning_multi (a text) partition by range
> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
> 
> This'll likely work okay on Linux. Other collate tests seem to use
> COLLATE "POSIX" or "C" so that they work cross-platform.

Thanks.  I completely forgot about that.  I've rewritten those tests to
use "POSIX" and "C" in the attached.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 February 2018 at 14:53, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/21 10:19, David Rowley wrote:
>> v30-0004-Faster-partition-pruning.patch contains:
>>
>> +create table coll_pruning_multi (a text) partition by range
>> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
>>
>> This'll likely work okay on Linux. Other collate tests seem to use
>> COLLATE "POSIX" or "C" so that they work cross-platform.
>
> Thanks.  I completely forgot about that.  I've rewritten those tests to
> use "POSIX" and "C" in the attached.

Thanks for fixing. I made a pass over v31 and only see a few small things:

1. In get_partitions_for_keys() why is the
get_partitions_excluded_by_ne_datums call not part of
get_partitions_for_keys_list?

2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1;
would be nice to see the style unified here.

4. "other other"

 * that is, each of its fields other other than clauseinfo must be valid before

5. "a IS NULL" -> "an IS NULL":

 * Based on a IS NULL or IS NOT NULL clause that was matched to a partition

6. Can you add a warning in the header comment for
extract_partition_clauses() to explain "Note: the 'clauses' List may
be modified inside this function. Callers may like to make a copy of
important lists before passing them to this function.", or something
like that...

7. "null" -> "nulls"

* Only allow strict operators.  This will guarantee null are

8. "dicard" -> "discard"

* contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks for the review.

On 2018/02/21 19:15, David Rowley wrote:
> Thanks for fixing. I made a pass over v31 and only see a few small things:
> 
> 1. In get_partitions_for_keys() why is the
> get_partitions_excluded_by_ne_datums call not part of
> get_partitions_for_keys_list?

Hmm, there is a question of where exactly to put the call within
get_partitions_for_keys_list().  At the end would sound like an obvious
answer, but we tend to short-circuit return from that function at various
points, which it seems undesirable to change.  So, I left things as is here.

> 2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

I actually found a few and changed them to ++ or --, as applicable.

> 
> 3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1;
> would be nice to see the style unified here.

Fixed all as mentioned above.

> 4. "other other"
> 
>  * that is, each of its fields other other than clauseinfo must be valid before

Fixed.

> 5. "a IS NULL" -> "an IS NULL":
> 
>  * Based on a IS NULL or IS NOT NULL clause that was matched to a partition

Fixed.

> 6. Can you add a warning in the header comment for
> extract_partition_clauses() to explain "Note: the 'clauses' List may
> be modified inside this function. Callers may like to make a copy of
> important lists before passing them to this function.", or something
> like that...

At least in my patch, extract_partition_clauses() is a local function with
just one caller, but I still don't see any problem with warning the
reader.  So, done.

> 7. "null" -> "nulls"
> 
> * Only allow strict operators.  This will guarantee null are
> 
> 8. "dicard" -> "discard"
> 
> * contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

Fixed.


Please find attached updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 February 2018 at 23:44, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached updated patches.

Thanks for updating the code.

The question I have now is around NULL handling in
partkey_datum_from_expr(). I've not managed to find a way to get a
NULL Const in there as it seems all the clauses I try get removed
somewhere earlier in planning. Do you know for a fact that a NULL
Const is impossible to get there?

I'm having to add some NULL handling there for the run-time pruning
patch but wondered if it was also required for your patch.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/22 17:41, David Rowley wrote:
> On 21 February 2018 at 23:44, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached updated patches.
> 
> Thanks for updating the code.
> 
> The question I have now is around NULL handling in
> partkey_datum_from_expr(). I've not managed to find a way to get a
> NULL Const in there as it seems all the clauses I try get removed
> somewhere earlier in planning. Do you know for a fact that a NULL
> Const is impossible to get there?

We only ever call partkey_datum_from_expr() for an OpExpr's argument, and
if you have a NULL Const in there, eval_const_expressions() would've
folded the OpExpr (and subsequently any AND'd OpExprs) into a
constant-false qual.


create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);

explain select * from p where a = null and a = 1;
                QUERY PLAN
-------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=40)
   One-Time Filter: false

explain select * from p where (a = null and a = 1) or a = 2;
                        QUERY PLAN
----------------------------------------------------------
 Append  (cost=0.00..41.94 rows=13 width=4)
   ->  Seq Scan on p2  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 2)
(3 rows)
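
Why the first query folds to constant-false can be seen with a toy
three-valued-logic evaluator (a self-contained sketch, not PostgreSQL
code): a strict operator given a NULL input yields unknown, and an AND
whose inputs include unknown but no false can never be true, so the whole
qual is provably never satisfied.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TVL_FALSE, TVL_TRUE, TVL_UNKNOWN } tvl;

/* A strict equality operator: any NULL input yields unknown. */
static tvl
strict_eq(bool a_isnull, int a, bool b_isnull, int b)
{
    if (a_isnull || b_isnull)
        return TVL_UNKNOWN;
    return (a == b) ? TVL_TRUE : TVL_FALSE;
}

/* SQL AND over three-valued logic. */
static tvl
tvl_and(tvl x, tvl y)
{
    if (x == TVL_FALSE || y == TVL_FALSE)
        return TVL_FALSE;
    if (x == TVL_UNKNOWN || y == TVL_UNKNOWN)
        return TVL_UNKNOWN;
    return TVL_TRUE;
}
```

Since "a = null" is unknown for every row, "a = null and a = 1" can never
be TRUE, which is what lets the planner replace the qual with a one-time
false filter.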

> I'm having to add some NULL handling there for the run-time pruning
> patch but wondered if it was also required for your patch.

Hmm, not sure why.  Can you explain a bit more?

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 22 February 2018 at 22:48, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I'm having to add some NULL handling there for the run-time pruning
>> patch but wondered if it was also required for your patch.
>
> Hmm, not sure why.  Can you explain a bit more?

hmm, yeah, but perhaps we should be discussing on the other thread...

With a prepared statement the Param will be unavailable until
execution, in which case we don't do the const folding.

A simple case is:

create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);
prepare q1 (int) as  select * from listp where a = $1;
explain analyze execute q1(1); -- repeat 5 times.
explain analyze execute q1(null); -- partkey_datum_from_expr() gets a
NULL param via the call from nodeAppend.c


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/22 20:28, David Rowley wrote:
> On 22 February 2018 at 22:48, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I'm having to add some NULL handling there for the run-time pruning
>>> patch but wondered if it was also required for your patch.
>>
>> Hmm, not sure why.  Can you explain a bit more?
> 
> hmm, yeah, but perhaps we should be discussing on the other thread...
> 
> With a prepared statement the Param will be unavailable until
> execution, in which case we don't do the const folding.

Ah right.

> A simple case is:
> 
> create table listp (a int) partition by list (a);
> create table listp1 partition of listp for values in(1);
> prepare q1 (int) as  select * from listp where a = $1;
> explain analyze execute q1(1); -- repeat 5 times.
> explain analyze execute q1(null); -- partkey_datum_from_expr() gets a
> NULL param via the call from nodeAppend.c

I wonder if NULLs should somehow be handled at a higher level, producing
the same behavior as the const-folding done in the optimizer?  In any
case, I suppose that would be something for the run-time pruning patch to
handle.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Feb 21, 2018 at 5:44 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached updated patches.

Committed 0001 and 0002.

I'm having some difficulty wrapping my head around 0003 because it has
minimal comments and no useful commit message.  I think, though, that
it's actually broken.  Pre-patch, the logic in find_partition_scheme
compares partopfamily, partopcintype, and parttypcoll and then asserts
equality for parttyplen and parttypbyval; not coincidentally,
PartitionSchemeData describes the latter two fields only as "cached
data", so that the segregation of fields in PartitionSchemeData into
two groups exactly matches what find_partition_scheme is actually
doing.  However, with the patch, it turns into a sort of hodgepodge.
parttypid is added into the "cached" section of PartitionSchemeData
and partcollation to the primary section, but both values are
compared, not asserted; parttypcoll moves from the "compared" section
to the "asserted" section but the declaration in PartitionSchemeData
stays where it was.

Moreover, there's no explanation of why this is getting changed.
There's an existing comment that explains the motivation for what the
code does today, which the patch does not modify:

  * We store the opclass-declared input data types instead of the partition key
  * datatypes since the former rather than the latter are used to compare
  * partition bounds. Since partition key data types and the opclass declared
  * input data types are expected to be binary compatible (per ResolveOpClass),
  * both of those should have same byval and length properties.

Obviously, this raises the issue of whether changing this is really
the right thing to do in the first place, but at any rate it's
certainly necessary for the comments to match what the code actually
does.

Also, I find this not very helpful:

+ * The collation of the partition key can differ from the collation of the
+ * underlying column, so we must store this separately.

If the comments about parttypcoll and partcollation were clear enough
(and I think they could use some work to distinguish them better),
then this would be pretty much unnecessary -- clearly the only reason
to store two things is if they might be different from each other.

It might be more useful to somehow explain how parttypid and
partsupfunc are intended to work/be used, but actually I don't
think any satisfactory explanation is possible.  Either we have one
partition scheme per partopcintype -- in which case parttypid is
ill-defined because it could vary among relations with the same
PartitionScheme -- or we have one per parttypid -- in which case,
without some other change, partition-wise join will stop working
between relations with different parttypids but the same
partopcintype.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/23 23:46, Robert Haas wrote:
> On Wed, Feb 21, 2018 at 5:44 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached updated patches.
> 
> Committed 0001 and 0002.

Thank you for committing and for the review.

> I'm having some difficulty wrapping my head around 0003 because it has
> minimal comments and no useful commit message.  I think, though, that
> it's actually broken.  Pre-patch, the logic in find_partition_scheme
> compares partopfamily, partopcintype, and parttypcoll and then asserts
> equality for parttyplen and parttypbyval; not coincidentally,
> PartitionSchemeData describes the latter two fields only as "cached
> data", so that the segregation of fields in PartitionSchemeData into
> two groups exactly matches what find_partition_scheme is actually
> doing.  However, with the patch, it turns into a sort of hodgepodge.
> parttypid is added into the "cached" section of PartitionSchemeData
> and partcollation to the primary section, but both values are
> compared, not asserted; parttypcoll moves from the "compared" section
> to the "asserted" section but the declaration in PartitionSchemeData
> stays where it was.
> 
> Moreover, there's no explanation of why this is getting changed.
> There's an existing comment that explains the motivation for what the
> code does today, which the patch does not modify:
> 
>   * We store the opclass-declared input data types instead of the partition key
>   * datatypes since the former rather than the latter are used to compare
>   * partition bounds. Since partition key data types and the opclass declared
>   * input data types are expected to be binary compatible (per ResolveOpClass),
>   * both of those should have same byval and length properties.
> 
> Obviously, this raises the issue of whether changing this is really
> the right thing to do in the first place, but at any rate it's
> certainly necessary for the comments to match what the code actually
> does.
> 
> Also, I find this not very helpful:
> 
> + * The collation of the partition key can differ from the collation of the
> + * underlying column, so we must store this separately.
> 
> If the comments about parttypcoll and partcollation were clear enough
> (and I think they could use some work to distinguish them better),
> then this would be pretty much unnecessary -- clearly the only reason
> to store two things is if they might be different from each other.
> 
> It might be more useful to somehow explain how parttypid and
> partsupfunc are intended to work/be used, but actually I don't
> think any satisfactory explanation is possible.  Either we have one
> partition scheme per partopcintype -- in which case parttypid is
> ill-defined because it could vary among relations with the same
> PartitionScheme -- or we have one per parttypid -- in which case,
> without some other change, partition-wise join will stop working
> between relations with different parttypids but the same
> partopcintype.

I think I'm convinced that partopcintype OIDs can be used where I thought
parttypid ones were necessary.  The pruning patch uses the respective OID
from this array when extracting the datum from an OpExpr to be compared
with the partition bound datums.  It's sensible, I now think, to require
the extracted datum to be of partition opclass declared input type, rather
than the type of the partition key involved.  So, I removed the parttypid
that I'd added to PartitionSchemeData.

Updated the comments to make clear the distinction between and purpose of
having both parttypcoll and partcollation.  Also expanded the comment
about partsupfunc a bit.

Attached updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think I'm convinced that partopcintype OIDs can be used where I thought
> parttypid ones were necessary.  The pruning patch uses the respective OID
> from this array when extracting the datum from an OpExpr to be compared
> with the partition bound datums.  It's sensible, I now think, to require
> the extracted datum to be of partition opclass declared input type, rather
> than the type of the partition key involved.  So, I removed the parttypid
> that I'd added to PartitionSchemeData.
>
> Updated the comments to make clear the distinction between and purpose of
> having both parttypcoll and partcollation.  Also expanded the comment
> about partsupfunc a bit.

I don't think this fundamentally fixes the problem, although it does
narrow it.  By requiring partcollation to match across every relation
with the same PartitionScheme, you're making partition-wise join fail
to work in some cases where it previously did.  Construct a test case
where parttypcoll matches and partcollation doesn't; then, without the
patch, the two relations will have the same PartitionScheme and thus
be eligible for a partition-wise join, but with the patch, they will
have different PartitionSchemes and thus won't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/27 3:27, Robert Haas wrote:
> On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think I'm convinced that partopcintype OIDs can be used where I thought
>> parttypid ones were necessary.  The pruning patch uses the respective OID
>> from this array when extracting the datum from an OpExpr to be compared
>> with the partition bound datums.  It's sensible, I now think, to require
>> the extracted datum to be of partition opclass declared input type, rather
>> than the type of the partition key involved.  So, I removed the parttypid
>> that I'd added to PartitionSchemeData.
>>
>> Updated the comments to make clear the distinction between and purpose of
>> having both parttypcoll and partcollation.  Also expanded the comment
>> about partsupfunc a bit.
> 
> I don't think this fundamentally fixes the problem, although it does
> narrow it.  By requiring partcollation to match across every relation
> with the same PartitionScheme, you're making partition-wise join fail
> to work in some cases where it previously did.  Construct a test case
> where parttypcoll matches and partcollation doesn't; then, without the
> patch, the two relations will have the same PartitionScheme and thus
> be eligible for a partition-wise join, but with the patch, they will
> have different PartitionSchemes and thus won't.

I may be confused, but shouldn't two tables partitioned on the same column
(of the same collatable type), but using different collations for
partitioning, end up with different PartitionSchemes?  Different
partitioning collations would mean that the same data may end up in
different partitions of the respective tables.

create table p (a text) partition by range (a collate "en_US");
create table p1 partition of p for values from ('a') to ('m');
create table p2 partition of p for values from ('m') to ('z ');

create table q (a text) partition by range (a collate "C");
create table q1 partition of q for values from ('a') to ('m');
create table q2 partition of q for values from ('m') to ('z ');

insert into p values ('A');
INSERT 0 1

insert into q values ('A');
ERROR:  no partition of relation "q" found for row
DETAIL:  Partition key of the failing row contains (a) = (A).

You may say that partition bounds might have to be different too in this
case and hence partition-wise join won't occur anyway, but I'm wondering
if the mismatch of partcollation itself isn't enough to conclude that?

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Attached an updated version in which I incorporated some of the revisions
that David Rowley suggested to OR clauses handling (in partprune.c) that
he posted as a separate patch on the run-time pruning thread [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8%2Bp-mXfFUiwR4xZ37STvgJeYF44yAjo5Rfxf92cFJyYQ%40mail.gmail.com

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> You may say that partition bounds might have to be different too in this
> case and hence partition-wise join won't occur anyway, but I'm wondering
> if the mismatch of partcollation itself isn't enough to conclude that?

Yeah, you're right.  I think that this is just a bug in partition-wise
join, and that the partition scheme should just be using partcollation
instead of parttypcoll, as in the attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/28 1:05, Robert Haas wrote:
> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> You may say that partition bounds might have to be different too in this
>> case and hence partition-wise join won't occur anyway, but I'm wondering
>> if the mismatch of partcollation itself isn't enough to conclude that?
> 
> Yeah, you're right.  I think that this is just a bug in partition-wise
> join, and that the partition scheme should just be using partcollation
> instead of parttypcoll, as in the attached.

Ah, OK. I was missing that there is no need to have both parttypcoll and
partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
built from a bare PartitionKey (not PartitionSchemeData), and after that
point, parttypcoll no longer needs to be kept around.

I noticed that there is a typo in the patch.

+    memcpy(part_scheme->partcollation, partkey->parttypcoll,

s/parttypcoll/partcollation/g

BTW, should there be a relevant test in partition_join.sql?  If yes,
attached a patch (partitionwise-join-collation-test-1.patch) to add one.

Also attached updated version of your patch (fixed the typo).

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/28 1:05, Robert Haas wrote:
>> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> You may say that partition bounds might have to be different too in this
>>> case and hence partition-wise join won't occur anyway, but I'm wondering
>>> if the mismatch of partcollation itself isn't enough to conclude that?
>>
>> Yeah, you're right.  I think that this is just a bug in partition-wise
>> join, and that the partition scheme should just be using partcollation
>> instead of parttypcoll, as in the attached.
>
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.

Yes. That's right.

>
> I noticed that there is a typo in the patch.
>
> +       memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g
>
> BTW, should there be a relevant test in partition_join.sql?  If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

A partition-wise join path will be created but discarded because of
higher cost. This test won't see it in that case. So, please add some
data like the other tests do, and add a command to analyze the partitioned
tables. That guards against that scenario.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Some comments on 0001.

             partnatts != part_scheme->partnatts)
             continue;

-        /* Match the partition key types. */
+        /*
+         * Match the partition key types and partitioning-specific collations.
+         */

We are comparing opfamily and opclass input type as well, but this comment
doesn't explicitly mention those like it mentions collation. Instead, I think
we should just say, "Match partition key type properties"

You had commented on "advanced partition matching code" about asserting that
partsupfuncs also match. Robert too has expressed a similar opinion upthread.
Maybe we should at least try to assert that the function OIDs match.

-    Oid           *parttypcoll;    /* OIDs of collations of partition keys. */
+
+    /*
+     * We store both the collation implied by the partition key's type and the
+     * one specified for partitioning.  Values in the former are used as
+     * varcollid in the Vars corresponding to simple column partition keys so
+     * as to make them match corresponding Vars appearing elsewhere in the
+     * query tree.  Whereas, the latter is used when actually comparing values
+     * against partition bounds datums, such as, when doing partition pruning.
+     */
+    Oid           *parttypcoll;
+    Oid           *partcollation;

As you have already mentioned upthread only partcollation is needed, not
parttypcoll.

     /* Cached information about partition key data types. */
     int16       *parttyplen;
     bool       *parttypbyval;
+
+    /*
+     * Cached array of partitioning comparison functions' fmgr structs.  We
+     * don't compare these when trying to match two partition schemes.
+     */

I think this comment should go away. The second sentence doesn't explain why,
and if it did, it should do so in find_partition_scheme(), not here.
partsupfunc is another property of partition keys that is cached, like
parttyplen and parttypbyval. Why does it deserve a separate comment when the
others don't?



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 27 February 2018 at 22:33, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

Thanks for fixing that up and including it.

Micro review of v34:

1. Looks like you've renamed the parttypid parameter in the definition
of partkey_datum_from_expr and partition_cmp_args, but not updated the
declaration too.

+static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value);

+static bool
+partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)

+static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
+    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+    bool *result);

+static bool
+partition_cmp_args(Oid partopcintype, Oid partopfamily,
+    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+    bool *result)

2. In prune_append_rel_partitions(), it's not exactly illegal, but int
i is declared twice in different scopes. Looks like there's no need
for the inner one.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.
>
> I noticed that there is a typo in the patch.
>
> +       memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g

Committed your version.

> BTW, should there be a relevant test in partition_join.sql?  If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

I don't feel strongly about it, but I'm not going to try to prevent
you from adding one, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/02/28 19:14, Ashutosh Bapat wrote:
> On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote:
>> BTW, should there be a relevant test in partition_join.sql?  If yes,
>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
> 
> A partition-wise join path will be created but discarded because of
> higher cost. This test won't see it in that case. So, please add some
> data like the other tests do, and add a command to analyze the partitioned
> tables. That guards against that scenario.

Thanks for the review.

Hmm, the added test is such that the partition collations won't match, so
partition-wise join won't be considered at all due to differing
PartitionSchemes, unless I'm missing something.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/01 2:23, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Ah, OK. I was missing that there is no need to have both parttypcoll and
>> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
>> built from a bare PartitionKey (not PartitionSchemeData), and after that
>> point, parttypcoll no longer needs to be kept around.
>>
>> I noticed that there is a typo in the patch.
>>
>> +       memcpy(part_scheme->partcollation, partkey->parttypcoll,
>>
>> s/parttypcoll/partcollation/g
> 
> Committed your version.

Thank you.

>> BTW, should there be a relevant test in partition_join.sql?  If yes,
>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
> 
> I don't feel strongly about it, but I'm not going to try to prevent
> you from adding one, either.

OK.  Attached is a revised version of that patch in case you consider
committing it, addressing Ashutosh's comment that the tables used in the
test should contain some data.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Ashutosh and David for reviewing.  Replying to both.

On 2018/02/28 20:25, Ashutosh Bapat wrote:
> On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached an updated version in which I incorporated some of the revisions
>> that David Rowley suggested to OR clauses handling (in partprune.c) that
>> he posted as a separate patch on the run-time pruning thread [1].
> 
> Some comments on 0001.
> 
>              partnatts != part_scheme->partnatts)
>              continue;
> 
> -        /* Match the partition key types. */
> +        /*
> +         * Match the partition key types and partitioning-specific collations.
> +         */
> 
> We are comparing opfamily and opclass input type as well, but this comment
> doesn't explicitly mention those like it mentions collation. Instead, I think
> we should just say, "Match partition key type properties"

Sounds good, done.

> You had commented on "advanced partition matching code" about asserting that
> partsupfuncs also match. Robert too has expressed a similar opinion
> upthread. Maybe we should at least try to assert that the function OIDs match.

I guess you're referring to this email of mine:

https://www.postgresql.org/message-id/e681c283-5fc6-b1c6-1bb9-7102c37e2d55%40lab.ntt.co.jp

Note that I didn't say that we should Assert the equality of the partsupfunc
members themselves, but rather asked whether we could add a comment explaining
why we don't.  Although, like you say, we could Assert the equality of the
function OIDs, so I added a block like this:

+  /*
+   * If partopfamily and partopcintype matched, must have the same
+   * partition comparison functions.  Note that we cannot reliably
+   * Assert the equality of function structs themselves for they might
+   * be different across PartitionKey's, so just Assert for the function
+   * OIDs.
+   */
+#ifdef USE_ASSERT_CHECKING
+   {
+      int     i;
+
+      for (i = 0; i < partkey->partnatts; i++)
+          Assert(partkey->partsupfunc[i].fn_oid ==
+                 part_scheme->partsupfunc[i].fn_oid);
+   }
+#endif

> -    Oid           *parttypcoll;    /* OIDs of collations of partition keys. */
> +
> +    /*
> +     * We store both the collation implied by the partition key's type and the
> +     * one specified for partitioning.  Values in the former are used as
> +     * varcollid in the Vars corresponding to simple column partition keys so
> +     * as to make them match corresponding Vars appearing elsewhere in the
> +     * query tree.  Whereas, the latter is used when actually comparing values
> +     * against partition bounds datums, such as, when doing partition pruning.
> +     */
> +    Oid           *parttypcoll;
> +    Oid           *partcollation;
> 
> As you have already mentioned upthread only partcollation is needed, not
> parttypcoll.

This hunk is gone after rebasing over 2af28e603319 (For partitionwise
join, match on partcollation, not parttypcoll) that was committed earlier
today.

>      /* Cached information about partition key data types. */
>      int16       *parttyplen;
>      bool       *parttypbyval;
> +
> +    /*
> +     * Cached array of partitioning comparison functions' fmgr structs.  We
> +     * don't compare these when trying to match two partition schemes.
> +     */
> 
> I think this comment should go away. The second sentence doesn't explain why
> and if it did so it should do that in find_partition_scheme() not here.
> partsupfunc is another property of partition keys that is cached like
> parttyplen, parttypbyval. Why does it deserve a separate comment when others
> don't?

Replaced that comment with:

+    /* Cached information about partition comparison functions. */

As mentioned above, I already added a comment in find_partition_scheme().

On 2018/02/28 20:35, David Rowley wrote:
> Micro review of v34:
>
> 1. Looks like you've renamed the parttypid parameter in the definition
> of partkey_datum_from_expr and partition_cmp_args, but not updated the
> declaration too.
>
> +static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value);
>
> +static bool
> +partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)
>
> +static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
> +    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
> +    bool *result);
>
> +static bool
> +partition_cmp_args(Oid partopcintype, Oid partopfamily,
> +    PartClause *pc, PartClause *leftarg, PartClause *rightarg,
> +    bool *result)

Oops, forgot about the declarations.  Fixed.

> 2. In prune_append_rel_partitions(), it's not exactly illegal, but int
> i is declared twice in different scopes. Looks like there's no need
> for the inner one.

Removed the inner one.

Attached updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Thu, Mar 1, 2018 at 6:57 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/28 19:14, Ashutosh Bapat wrote:
>> On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote:
>>> BTW, should there be a relevant test in partition_join.sql?  If yes,
>>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
>>
>> A partition-wise join path will be created but discarded because of
>> higher cost. This test won't see it in that case. So, please add some
>> data like the other tests do, and add a command to analyze the partitioned
>> tables. That guards against that scenario.
>
> Thanks for the review.
>
> Hmm, the added test is such that the partition collations won't match, so
> partition-wise join won't be considered at all due to differing
> PartitionSchemes, unless I'm missing something.
>

The point is we wouldn't know whether PWJ was not selected because of a
PartitionScheme mismatch OR because the PWJ paths were more expensive than
the non-PWJ ones, as happens with empty tables. In both cases we will see a
non-PWJ plan, although in one case PWJ was not possible and in the other it
was. I think what we want to test is that PWJ is not possible with a
collation mismatch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Feb 27, 2018 at 4:33 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested to OR clauses handling (in partprune.c) that
> he posted as a separate patch on the run-time pruning thread [1].

I'm very skeptical about this patch's desire to remove the static
qualifier from evaluate_expr().  Why does this patch need that and
constraint exclusion not need it?  Why should this patch not instead
be using eval_const_expressions()?  partkey_datum_from_expr() is
prepared to give up if evaluate_expr() doesn't return a Const, but
there's nothing in evaluate_expr() to make it give up if, for example,
the input is -- or contains -- a volatile function, e.g. random().

+       if (OidIsValid(get_default_oid_from_partdesc(partdesc)))
+               rel->has_default_part = true;
+       else
+               rel->has_default_part = false;

This can be written a lot more compactly as rel->has_default_part =
OidIsValid(get_default_oid_from_partdesc(partdesc));

PartitionPruneContext has no comment explaining its general purpose; I
think it should.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

+       memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
+                  sizeof(FmgrInfo) * partnatts);

You can't copy an FmgrInfo by just applying memcpy() to it.  Use fmgr_info_copy.
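The hazard is that FmgrInfo carries lazily built cached state (fn_extra) owned by a memory context (fn_mcxt), so a bitwise copy leaves two owners sharing one cache. A minimal stand-alone C sketch of the same pattern (the struct and helper here are illustrative stand-ins, not PostgreSQL's actual definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for FmgrInfo: a function handle that lazily caches
 * per-call scratch state (fn_extra) owned by a memory context (fn_mcxt). */
typedef struct FuncInfo
{
    int   fn_oid;    /* identity of the function: safe to copy bitwise */
    void *fn_extra;  /* lazily built cache, owned by fn_mcxt */
    void *fn_mcxt;   /* arena the cache was allocated in */
} FuncInfo;

/* Naive bitwise copy: the clone shares fn_extra with the source, so it
 * dangles as soon as the source's context is reset or freed. */
void shallow_copy(FuncInfo *dst, const FuncInfo *src)
{
    *dst = *src;
}

/* Deep copy in the spirit of fmgr_info_copy: carry over the stable
 * identity, but drop the cached state and adopt the destination context,
 * forcing the cache to be rebuilt in memory the clone actually owns. */
void func_info_copy(FuncInfo *dst, const FuncInfo *src, void *dst_mcxt)
{
    dst->fn_oid = src->fn_oid;
    dst->fn_extra = NULL;
    dst->fn_mcxt = dst_mcxt;
}
```

The same reasoning is why the patch should call fmgr_info_copy() per key column rather than memcpy()'ing the whole partsupfunc array.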

I don't like the comments at the top of partprune.c very much.  It
seems strange to document individual functions here; those functions
can (and should) be documented in their individual header comments.
What we should have at the top of the file is a discussion of the
overall theory of operation of this module, something that currently
seems not to exist anywhere in the patch.  I tried to figure that out
by looking at the new data structures the patch introduces:
PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
PartClause.  It looks like the general idea is that code that
wants to use these facilities does the following:

Step 1. Generate a PartitionPruneContext.  In this patch, this seems
to consist entirely of copying information from the RelOptInfo or its
PartitionScheme.
Step 2. Call generate_partition_clauses() to extract relevant clauses
from rel->baserestrictinfo and generate a PartClauseInfo.
Step 3. Call get_partitions_from_clauses() to generate a set of
unpruned partition indexes.  Internally, that function will first
populate a PartScanKeyInfo from the PartClauseInfo by calling
extract_bounding_datums().  Then it calls get_partitions_for_keys()
which generates the set of unpruned partition indexes from the
PartitionPruneContext and PartScanKeyInfo.

I guess there are two main things about this that seem odd to me:

1. I don't see why the partition pruning code should ever be
responsible for evaluating anything itself, as it does currently via
evaluate_expr().  For plan-time partition pruning, we are already
using eval_const_expressions() to perform as much Const-simplification
as possible.  If we see an OpExpr with a partitioning column on one
side, then the other side is either a Const, in which case we can
perform pruning, or it's something whose value can't be known until
runtime, in which case we cannot.  There should be no case in which
eval_const_expressions() fails to simplify something to a constant yet
we know the value at plan time; if such a case exists, then it's an
opportunity to improve eval_const_expressions(), not a reason to be
inconsistent with the rules it applies.  For run-time pruning, it is
probably semantically wrong and certainly undesirable from a
performance perspective to spin up a new ExecutorState and a new
ExprState every time we need to reassess the decision about which
partitions need to be scanned.  Instead, the expressions whose values
are inputs to the pruning process should be computed in the
ExecutorState/ExprState for the main query and the results should be
passed to the partition-pruning code as inputs to the decision-making
process.

2. All processing of clauses should happen at plan time, not run time,
so that we're not repeating work.  If, for example, a prepared query
is executed repeatedly, the clauses should get fully processed when
it's planned, and then the only thing that should happen at execution
time is that we take the values to which we now have access and use
them to decide what to prune.  With this design, we're bound to repeat
at least the portion of the work done by get_partitions_from_clauses()
at runtime, and as things stand, it looks like the current version of
the run-time partition pruning patch repeats the
generate_partition_clauses() work at runtime as well.

It seems to me as though what we ought to be doing is extracting
clauses that are of the correct general form and producing a list of
<partition-column-index>, <partition-operator-strategy>, <expression>
triplets sorted by increasing partition column index and operator
strategy number and in a form that can be represented as a Node tree.
This can be attached to the plan for use at runtime or, if there are
at least some constants among the expressions, used at plan time for
partition exclusion.  So the main interfaces would be something like
this:

extern List *extract_partition_pruning_clauses(RelOptInfo *rel);
extern PartitionPruneContext *CreatePartitionPruneContext(List *);
extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *);
extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext
*, Datum *values, bool *isnull);

In PerformStaticPartitionPruning(), we'd just ignore anything for
which the expression was non-constant; to call
PerformDynamicPartitionPruning(), the caller would need to evaluate
the expressions from the PartitionPruneContext using the appropriate
EState and then pass the results into this function.
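
As a concrete illustration of that split, here is a minimal Python sketch
(the planner would of course do this in C; the triplet layout, function
names, and the one-value-per-list-partition rule are all hypothetical,
only meant to mirror the interface sketched above):

```python
# Illustrative sketch only -- not PostgreSQL code.  A "compiled" pruning
# spec is a list of (partkey_index, strategy, expr) triplets; an expr is
# either ("const", value) or ("param", n).  List partitioning with one
# accepted value per partition keeps the pruning rule trivial.
PART_BOUNDS = [10, 20, 30, 40]          # partition i accepts value PART_BOUNDS[i]

def prune_one(strategy, value):
    """Return the set of partition indexes that might match one clause."""
    if strategy == "=":
        return {i for i, b in enumerate(PART_BOUNDS) if b == value}
    if strategy == "<":
        return {i for i, b in enumerate(PART_BOUNDS) if b < value}
    raise NotImplementedError(strategy)

def perform_static_pruning(triplets):
    """Plan time: use constant expressions only, ignore the rest."""
    result = set(range(len(PART_BOUNDS)))
    for _key, strategy, (kind, val) in triplets:
        if kind == "const":
            result &= prune_one(strategy, val)
    return result

def perform_dynamic_pruning(triplets, param_values):
    """Run time: params were already evaluated by the main query's ExprState."""
    result = set(range(len(PART_BOUNDS)))
    for _key, strategy, (kind, val) in triplets:
        value = val if kind == "const" else param_values[val]
        result &= prune_one(strategy, value)
    return result

# WHERE a < 40 AND a = $1: plan time can use only "a < 40".
spec = [(0, "<", ("const", 40)), (0, "=", ("param", 1))]
static = perform_static_pruning(spec)              # {0, 1, 2}
dynamic = perform_dynamic_pruning(spec, {1: 20})   # {1}
```

The point of the shape is that the dynamic variant receives
already-evaluated parameter values; it never spins up evaluation
machinery of its own.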

I realize that I'm hand-waving a bit here, or maybe more than a bit,
but in general I think it's right to imagine that the work here needs
to be segregated into two very well-separated phases.  In the first
phase, we do all possible work to assess which clauses are relevant
and put them in a form where the actual pruning work can be done as
cheaply as possible once we know the values.  This phase always
happens at plan time.  In the second phase, we get the values, either
by extracting them from Const nodes at plan time or evaluating
expressions at runtime, and then decide which partitions to eliminate.
This phase happens at plan time if we have enough constants available
to do something useful and at runtime if we have enough non-constants
to do something useful.  Right now there doesn't seem to be a clean
separation between these two phases and I think that's not good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't like the comments at the top of partprune.c very much.  It
> seems strange to document individual functions here; those functions
> can (and should) be documented in their individual header comments.
> What we should have at the top of the file is a discussion of the
> overall theory of operation of this module, something that currently
> seems not to exist anywhere in the patch.  I tried to figure that out
> by looking at the new data structures the patch introduces:
> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
> PartClause.  It looks like the general idea is that code that
> wants to use these facilities does the following:
>
> Step 1. Generate a PartitionPruneContext.  In this patch, this seems
> to consist entirely of copying information from the RelOptInfo or its
> PartitionScheme.
> Step 2. Call generate_partition_clauses() to extract relevant clauses
> from rel->baserestrictinfo and generate a PartClauseInfo.
> Step 3. Call get_partitions_from_clauses() to generate a set of
> unpruned partition indexes.  Internally, that function will first
> populate a PartScanKeyInfo from the PartClauseInfo by calling
> extract_bounding_datums().  Then it calls get_partitions_for_keys()
> which generates the set of unpruned partition indexes from the
> PartitionPruneContext and PartScanKeyInfo.

Hi Robert,

I feel I should step in here and answer this part as it was me who
first came up with the idea of the context struct. I've typed up
something below which is my first cut at what I'd have imagined the
header comment of partprune.c should look like. Some parts are only
relevant after run-time pruning is also using this stuff.  I've tried to
highlight those areas; I'm not sure how much or if there should be any
mention of that at all as part of this patch.

Here goes:

partprune.c

Allows efficient identification of the minimal set of partitions which match a
given set of clauses.  Thus allowing useful things such as ignoring unneeded
partitions which cannot possibly contain tuples matching the given set of
clauses.

This module breaks the process of determining the matching partitions into
two distinct steps, each of which has its own function which is externally
visible outside of this module.  The reason for not performing everything
in one step is down to the fact that there are times where we may wish to
perform the 2nd step multiple times over.  The steps could be thought of as a
compilation step followed by an execution step.

Step 1 (compilation):

Pre-process the given list of clauses and attempt to match individual clauses
up to a partition key.

The end result of this process is a PartitionClauseInfo containing details of
each clause found to match the partition key.  This output is required as
input for the 2nd step.

Step 2 (execution):

Step 2 outputs the minimal set of matching partitions based on the input from
step 1.

Internally, this step is broken down into smaller sub-steps, each of which
is explained in detail in the comments in the corresponding function.

Step 2 can be executed multiple times for its input values.  The inputs to this
step are not modified by the processing done within.  It is expected that this
step is executed multiple times in cases where the matching partitions must be
determined during query execution.  A subsequent evaluation of this step will
be required whenever a parameter which was found in a clause matching the
partition key changes its value.

PartitionPruneContext:

Each of the steps described above also requires an input of a
PartitionPruneContext.  This stores all of the remaining required inputs to
each step.  The context will vary slightly depending on where the step is
being called from, i.e. the planner or executor.  For example,
during query planning, we're unable to determine the value of a Param found
matching the partition key.  When this step is called from the executor the
PlanState can be set in the context which allows evaluation of these Params
into Datum values. *** Only after run-time pruning ***

The PartitionPruneContext is also required since many of the query planner
node types are unavailable to the executor, which means that the source
information used to populate the context will vary depending on if it's being
called from the query planner or executor.

*** Only after run-time pruning ***

The context is also modified during step 1 to record all of the Param IDs
which were found to match the partition key.

-------------

Hopefully that also helps explain the intentions with the current code structure.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/01 21:56, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 4:33 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached an updated version in which I incorporated some of the revisions
>> that David Rowley suggested to OR clauses handling (in partprune.c) that
>> he posted as a separate patch on the run-time pruning thread [1].
> 
> I'm very skeptical about this patch's desire to remove the static
> qualifier from evaluate_expr().  Why does this patch need that and
> constraint exclusion not need it?  Why should this patch not instead
> be using eval_const_expressions()?  partkey_datum_from_expr() is
> prepared to give up if evaluate_expr() doesn't return a Const, but
> there's nothing in evaluate_expr() to make it give up if, for example,
> the input is -- or contains -- a volatile function, e.g. random().

Thinking on this a bit, I have removed the evaluate_expr() business from
partkey_datum_from_expr() and thus switched evaluate_expr() back to static.

Let me explain why I'd added there in the first place -- if the constant
expression received in partkey_datum_from_expr() was not of the same type
as that of the partition key, it'd try to coerce_to_target_type() the
input expression to the partition key type which may result in a non-Const
expression.  We'd turn it back into a Const by calling evaluate_expr().  I
thought the coercion was needed because we'd be comparing the resulting
datum with the partition bound datums using a partition comparison
function that would require its arguments to be of given types.

But I realized we don't need the coercion.  Earlier steps would have
determined that the clause from which the expression originated contains
an operator that is compatible with the partitioning operator family.  If
so, the type of the expression in question, even though different from the
partition key type, would be binary coercible with it.  So, it'd be okay
to pass the datum extracted from such expression to the partition
comparison function to compare it with datums in PartitionBoundInfo,
without performing any coercion.

> +       if (OidIsValid(get_default_oid_from_partdesc(partdesc)))
> +               rel->has_default_part = true;
> +       else
> +               rel->has_default_part = false;
>
> This can be written a lot more compactly as rel->has_default_part =
> OidIsValid(get_default_oid_from_partdesc(partdesc));

Indeed, will fix.

> PartitionPruneContext has no comment explaining its general purpose; I
> think it should.

Will fix.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for your comments.

On 2018/03/02 4:13, Robert Haas wrote:
> On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patches.
> 
> +       memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
> +                  sizeof(FmgrInfo) * partnatts);
> 
> You can't copy an FmgrInfo by just applying memcpy() to it.  Use fmgr_info_copy.

Oops, will fix.

> I don't like the comments at the top of partprune.c very much.  It
> seems strange to document individual functions here; those functions
> can (and should) be documented in their individual header comments.
> What we should have at the top of the file is a discussion of the
> overall theory of operation of this module, something that currently
> seems not to exist anywhere in the patch.

Sorry about that.  Thanks to your comments here and David's reply, I will
try to address that in the next version.

> I tried to figure that out
> by looking at the new data structures the patch introduces:
> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
> PartClause.  It looks like the general idea is that code that
> wants to use these facilities does the following:
> 
> Step 1. Generate a PartitionPruneContext.  In this patch, this seems
> to consist entirely of copying information from the RelOptInfo or its
> PartitionScheme.
> Step 2. Call generate_partition_clauses() to extract relevant clauses
> from rel->baserestrictinfo and generate a PartClauseInfo.
> Step 3. Call get_partitions_from_clauses() to generate a set of
> unpruned partition indexes.  Internally, that function will first
> populate a PartScanKeyInfo from the PartClauseInfo by calling
> extract_bounding_datums().  Then it calls get_partitions_for_keys()
> which generates the set of unpruned partition indexes from the
> PartitionPruneContext and PartScanKeyInfo.
>
> I guess there are two main things about this that seem odd to me:
> 
> 1. I don't see why the partition pruning code should ever be
> responsible for evaluating anything itself, as it does currently via
> evaluate_expr().  For plan-time partition pruning, we are already
> using eval_const_expressions() to perform as much Const-simplification
> as possible.  If we see an OpExpr with a partitioning column on one
> side, then the other side is either a Const, in which case we can
> perform pruning, or it's something whose value can't be known until
> runtime, in which case we cannot.  There should be no case in which
> eval_const_expressions() fails to simplify something to a constant yet
> we know the value at plan time; if such a case exists, then it's an
> opportunity to improve eval_const_expressions(), not a reason to be
> inconsistent with the rules it applies.  For run-time pruning, it is
> probably semantically wrong and certainly undesirable from a
> performance perspective to spin up a new ExecutorState and a new
> ExprState every time we need to reassess the decision about which
> partitions need to be scanned.  Instead, the expressions whose values
> are inputs to the pruning process should be computed in the
> ExecutorState/ExprState for the main query and the results should be
> passed to the partition-pruning code as inputs to the decision-making
> process.

As I said in my earlier reply, I have removed the part that involved the
pruning code calling evaluate_expr().

> 2. All processing of clauses should happen at plan time, not run time,
> so that we're not repeating work.  If, for example, a prepared query
> is executed repeatedly, the clauses should get fully processed when
> it's planned, and then the only thing that should happen at execution
> time is that we take the values to which we now have access and use
> them to decide what to prune.  With this design, we're bound to repeat
> at least the portion of the work done by get_partitions_from_clauses()
> at runtime, and as things stand, it looks like the current version of
> the run-time partition pruning patch repeats the
> generate_partition_clauses() work at runtime as well.
> 
> It seems to me as though what we ought to be doing is extracting
> clauses that are of the correct general form and producing a list of
> <partition-column-index>, <partition-operator-strategy>, <expression>
> triplets sorted by increasing partition column index and operator
> strategy number and in a form that can be represented as a Node tree.
> This can be attached to the plan for use at runtime or, if there are
> at least some constants among the expressions, used at plan time for
> partition exclusion.  So the main interfaces would be something like
> this:
> 
> extern List *extract_partition_pruning_clauses(RelOptInfo *rel);
> extern PartitionPruneContext *CreatePartitionPruneContext(List *);
> extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *);
> extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext
> *, Datum *values, bool *isnull);
> 
> In PerformStaticPartitionPruning(), we'd just ignore anything for
> which the expression was non-constant; to call
> PerformDynamicPartitionPruning(), the caller would need to evaluate
> the expressions from the PartitionPruneContext using the appropriate
> EState and then pass the results into this function.
> 
> I realize that I'm hand-waving a bit here, or maybe more than a bit,
> but in general I think it's right to imagine that the work here needs
> to be segregated into two very well-separated phases.  In the first
> phase, we do all possible work to assess which clauses are relevant
> and put them in a form where the actual pruning work can be done as
> cheaply as possible once we know the values.  This phase always
> happens at plan time.  In the second phase, we get the values, either
> by extracting them from Const nodes at plan time or evaluating
> expressions at runtime, and then decide which partitions to eliminate.
> This phase happens at plan time if we have enough constants available
> to do something useful and at runtime if we have enough non-constants
> to do something useful.  Right now there doesn't seem to be a clean
> separation between these two phases and I think that's not good.

Hmm, I see that things can be improved and various points you've mentioned
here to improve the high-level interfaces seem useful.  I'll try to rework
the patch based on those and submit one early next week.


Looking at the rough interface sketch in your message, it seems that the
product of whatever steps we end up grouping into phase 1 should be
something that can be put into a Node tree (PartitionPruneContext?),
because we may need to attach it to the plan if some of the needed values
will only be made available during execution.

Given the patch's implementation, we'll have to make the structure of that
Node tree a bit more complex than a simple List.  For one thing, the patch
handles OR clauses by performing pruning separately for each arm and then
combining partitions selected across OR arms using set union.  By
"performing pruning" in the last sentence I meant following steps similar
to ones you wrote in your message:

1. Segregating pruning clauses into per-partition-key Lists, that is,
   generate_partition_clauses() producing a PartitionClauseInfo,

2. Removing redundant clauses from each list, that is,
   remove_redundant_clauses() to produce lists with just one member per
   operator strategy for each partition key,

3. Extracting Datum values from the clauses to form equal/min/max tuples
   and setting null or not null bits for individual keys, that is,
   extract_bounding_datums() producing a PartScanKeyInfo, and

4. Finally pruning with those Datum tuples and null/not null info, that
   is, get_partitions_for_keys().

Steps 2-4 are dependent on clauses providing Datums, which all the clauses
may or may not do.  Depending on whether they do, we'll have to defer those
steps to run time.
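
Step 2 of the list above can be sketched like so (Python rather than the
patch's C, and a deliberately simplified clause representation; the real
remove_redundant_clauses() operates on PartClause structs and must also
handle equality and non-constant clauses):

```python
# Keep, per (partition key, strategy), only the tightest constant clause:
# the largest constant for ">"/">=", the smallest for "<"/"<=".
def remove_redundant_clauses(clauses):
    """clauses: list of (key_index, strategy, const_value) tuples."""
    best = {}
    for key, strategy, value in clauses:
        slot = (key, strategy)
        if slot not in best:
            best[slot] = value
        elif strategy in (">", ">="):
            best[slot] = max(best[slot], value)
        elif strategy in ("<", "<="):
            best[slot] = min(best[slot], value)
    return [(k, s, v) for (k, s), v in best.items()]

# x > 5 AND x > 10 AND x < 100  ->  x > 10 AND x < 100
result = remove_redundant_clauses([(0, ">", 5), (0, ">", 10), (0, "<", 100)])
```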

So,

* What do we encode into the Node tree attached to the plan?  Clauses that
  haven't gone through steps 2 and 3 (something like PartitionClauseInfo)
  or the product of step 3 (something like PartScanKeyInfo)?

* How do we account for OR clauses?  Perhaps by having the aforementioned
  Node trees nested inside the top-level one, wherein there will be one
  nested node per arm of an OR clause.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/02 11:12, David Rowley wrote:
> On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't like the comments at the top of partprune.c very much.  It
>> seems strange to document individual functions here; those functions
>> can (and should) be documented in their individual header comments.
>> What we should have at the top of the file is a discussion of the
>> overall theory of operation of this module, something that currently
>> seems not to exist anywhere in the patch.  I tried to figure that out
>> by looking at the new data structures the patch introduces:
>> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
>> PartClause.  It looks like the general idea is that code that
>> wants to use these facilities does the following:
>>
>> Step 1. Generate a PartitionPruneContext.  In this patch, this seems
>> to consist entirely of copying information from the RelOptInfo or its
>> PartitionScheme.
>> Step 2. Call generate_partition_clauses() to extract relevant clauses
>> from rel->baserestrictinfo and generate a PartClauseInfo.
>> Step 3. Call get_partitions_from_clauses() to generate a set of
>> unpruned partition indexes.  Internally, that function will first
>> populate a PartScanKeyInfo from the PartClauseInfo by calling
>> extract_bounding_datums().  Then it calls get_partitions_for_keys()
>> which generates the set of unpruned partition indexes from the
>> PartitionPruneContext and PartScanKeyInfo.
> 
> Hi Robert,
> 
> I feel I should step in here and answer this part as it was me who
> first came up with the idea of the context struct. I've typed up
> something below which is my first cut at what I'd have imagined the
> header comment of partprune.c should look like. Some parts are only
> relevant after run-time pruning is also using this stuff.  I've tried to
> highlight those areas; I'm not sure how much or if there should be any
> mention of that at all as part of this patch.
> 
> Here goes:
> 
> partprune.c
> 
> Allows efficient identification of the minimal set of partitions which match a
> given set of clauses.  Thus allowing useful things such as ignoring unneeded
> partitions which cannot possibly contain tuples matching the given set of
> clauses.
> 
> This module breaks the process of determining the matching partitions into
> two distinct steps, each of which has its own function which is externally
> visible outside of this module.  The reason for not performing everything
> in one step is down to the fact that there are times where we may wish to
> perform the 2nd step multiple times over.  The steps could be thought of as a
> compilation step followed by an execution step.
> 
> Step 1 (compilation):
> 
> Pre-process the given list of clauses and attempt to match individual clauses
> up to a partition key.
> 
> The end result of this process is a PartitionClauseInfo containing details of
> each clause found to match the partition key.  This output is required as
> input for the 2nd step.
> 
> Step 2 (execution):
> 
> Step 2 outputs the minimal set of matching partitions based on the input from
> step 1.
> 
> Internally, this step is broken down into smaller sub-steps, each of which
> is explained in detail in the comments in the corresponding function.
> 
> Step 2 can be executed multiple times for its input values.  The inputs to this
> step are not modified by the processing done within.  It is expected that this
> step is executed multiple times in cases where the matching partitions must be
> determined during query execution.  A subsequent evaluation of this step will
> be required whenever a parameter which was found in a clause matching the
> partition key changes its value.
> 
> PartitionPruneContext:
> 
> Each of the steps described above also requires an input of a
> PartitionPruneContext.  This stores all of the remaining required inputs to
> each step.  The context will vary slightly depending on where the step is
> being called from, i.e. the planner or executor.  For example,
> during query planning, we're unable to determine the value of a Param found
> matching the partition key.  When this step is called from the executor the
> PlanState can be set in the context which allows evaluation of these Params
> into Datum values. *** Only after run-time pruning ***
> 
> The PartitionPruneContext is also required since many of the query planner
> node types are unavailable to the executor, which means that the source
> information used to populate the context will vary depending on if it's being
> called from the query planner or executor.
> 
> *** Only after run-time pruning ***
> 
> The context is also modified during step 1 to record all of the Param IDs
> which were found to match the partition key.
> 
> -------------
> 
> Hopefully that also helps explain the intentions with the current code structure.

Thanks David for writing this down.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Mar 2, 2018 at 1:22 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> But I realized we don't need the coercion.  Earlier steps would have
> determined that the clause from which the expression originated contains
> an operator that is compatible with the partitioning operator family.  If
> so, the type of the expression in question, even though different from the
> partition key type, would be binary coercible with it.

That doesn't follow.  Knowing that two types are in the same operator
family doesn't guarantee that the types are binary coercible. For
example, int8 is not binary-coercible to int2.  Moreover, you'd better
be pretty careful about trying to cast int8 to int2 because it might
turn a query that would have returned no rows into one that fails
outright; that's not OK.  Imagine that the user types:

SELECT * FROM partitioned_by_int2 WHERE a = 1000000000000;

I think what needs to happen with cross-type situations is that you
look in the opfamily for a comparator that takes the types you want as
input; if you can't find one, you have to give up on pruning.  If you
do find one, then you use it.  For example in the above query, once
you find btint28cmp, you can use that to compare the user-provided
constant against the range bounds for the various partitions to see
which one might contain it.  You'll end up selecting the partition
with upper bound MAXVALUE if there is one, or no partition at all if
every partition has a finite upper bound.  That's as well as we can do
with current infrastructure, I think.
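
To illustrate the suggestion with a toy model (Python; the bound layout
and helper names are made up, and Python's unbounded integers stand in
for the opfamily's cross-type comparator, which, like btint28cmp,
compares the two types directly without casting either side):

```python
# Range bounds of a hypothetical int2-partitioned table; None stands in
# for MAXVALUE.  Partition i holds lo <= a < hi.
BOUNDS = [(-32768, 0), (0, 32767), (32767, None)]

def cross_type_cmp(a, b):
    """Stand-in for a cross-type comparator such as btint28cmp: it
    compares an int2 bound with an int8 value without casting, so no
    overflow error can occur."""
    return (a > b) - (a < b)

def partition_for(value):
    for i, (lo, hi) in enumerate(BOUNDS):
        if cross_type_cmp(lo, value) <= 0 and (hi is None or cross_type_cmp(hi, value) > 0):
            return i
    return None  # no partition can hold it

# a = 1000000000000 would fail outright if cast to int2, but comparing it
# against the bounds is safe and selects the MAXVALUE partition:
chosen = partition_for(1000000000000)
```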

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Looking at the rough interface sketch in your message, it seems that the
> product of whatever steps we end up grouping into phase 1 should be
> something that can be put into a Node tree (PartitionPruneContext?),
> because we may need to attach it to the plan if some of the needed values
> will only be made available during execution.

Right.  You might also end up with two representations: a
Node-tree-style representation that contains all of the information we
need, and another, faster form into which it gets converted before
use.

> Given the patch's implementation, we'll have to make the structure of that
> Node tree a bit more complex than a simple List.  For one thing, the patch
> handles OR clauses by performing pruning separately for each arm and then
> combining partitions selected across OR arms using set union.  By
> "performing pruning" in the last sentence I meant following steps similar
> to ones you wrote in your message:
>
> 1. Segregating pruning clauses into per-partition-key Lists, that is,
>    generate_partition_clauses() producing a PartitionClauseInfo,
>
> 2. Removing redundant clauses from each list, that is,
>    remove_redundant_clauses() to produce lists with just one member per
>    operator strategy for each partition key,
>
> 3. Extracting Datum values from the clauses to form equal/min/max tuples
>    and setting null or not null bits for individual keys, that is,
>    extract_bounding_datums() producing a PartScanKeyInfo, and
>
> 4. Finally pruning with those Datum tuples and null/not null info, that
>    is, get_partitions_for_keys().
>
> Steps 2-4 are dependent on clauses providing Datums, which all the clauses
> may or may not do.  Depending on whether they do, we'll have to defer those
> steps to run time.

I don't see that there's a real need to perform step 2 at all.  I
mean, if you have x > $1 and x > $2 in the query, you can just compute
the set of partitions for the first clause, compute the set of
partitions for the second clause, and then intersect.  That doesn't
seem obviously worse than deciding which of $1 and $2 is greater and
then pruning only based on whichever one is greater in this case.
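
A small sketch of why the intersection is equivalent (Python, with
hypothetical range bounds; not code from the patch):

```python
# Range partitions on x: partition i holds BOUNDS[i] <= x < BOUNDS[i + 1].
BOUNDS = [0, 10, 20, 30, 40]

def parts_for_gt(v):
    """Partitions that can still contain rows with x > v."""
    return {i for i in range(len(BOUNDS) - 1) if BOUNDS[i + 1] > v}

p1, p2 = 15, 27                       # run-time values of $1 and $2
# Intersecting the per-clause results...
combined = parts_for_gt(p1) & parts_for_gt(p2)
# ...gives the same partitions as pruning once with the larger value,
# which is what redundant-clause removal would have computed:
tightest = parts_for_gt(max(p1, p2))
```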

> * What do we encode into the Node tree attached to the plan?  Clauses that
>   haven't gone through steps 2 and 3 (something like PartitionClauseInfo)
>   or the product of step 3 (something like PartScanKeyInfo)?
>
> * How do we account for OR clauses?  Perhaps by having the aforementioned
>   Node trees nested inside the top-level one, wherein there will be one
>   nested node per arm of an OR clause.

Suppose we define the notion of a pruning program.  A pruning program
can use any number of registers, which have integer numbers starting
with 0 and counting upward as high as necessary.  Each register holds
a Bitmapset.  The result of a pruning program is the value of register
0 when the program completes.  A pruning program consists of a list of
steps, each of which is either a PruningBaseStep or a
PruningCombineStep.  A PruningCombineStep modifies the contents of the
target register based on the contents of a source register in one of
the following three ways: (1) UNION -- all bits set in source become
set in target; (2) INTERSECT -- all bits clear in source become clear
in target; (3) DIFFERENCE -- all bits set in source become clear in
target.  A PruningBaseStep consists of a strategy (equality,
less-than, etc.), an output register, and list of expressions --
either as many as there are partition keys, or for range partitioning
perhaps fewer; it prunes based on the strategy and the expressions and
overwrites the output register with the partitions that would be
selected.

Example #1.  Table is hash-partitioned on a and b.  Given a query like
SELECT * FROM tab WHERE a = 100 AND b = 233, we create a single-step
program:

1. base-step (strategy =, register 0, expressions 100, 233)

If there were an equality constraint on only one of the two columns, we
would not create a pruning program at all, because no pruning is
possible.

Example #2. Table is list-partitioned on a.  Given a query like SELECT
* FROM tab WHERE (a = $1 OR a = $2) AND a != $3, we create this
program:

1. base-step (strategy =, register 0, expressions $1)
2. base-step (strategy =, register 1, expressions $2)
3. base-step (strategy !=, register 2, expressions $3)
4. combine-step (target-register 0, source-register 1, strategy union)
5. combine-step (target-register 0, source-register 2, strategy difference)

(This is unoptimized -- one could do better by reversing steps 3 and 4
and reusing register 1 instead of needing register 2, but that
kind of optimization is probably not too important.)

Example #3. Table is range-partitioned on a and b.  Given a query like
SELECT * FROM tab WHERE (a = 40 AND b > $1) OR (a = $2 AND b = $3), we
do this:

1. base-step (strategy >, register 0, expressions 40, $1)
2. base-step (strategy =, register 1, expressions $2, $3)
3. combine-step (target-register 0, source-register 1, strategy union)

You might need a few extra gadgets here to make all of this work --
e.g. another base-step strategy to handle ScalarArrayOpExpr; I'm just
trying to convey the basic idea here.  It's pretty easy to see how to
store a program like this as a node tree: just create PruningBaseStep
and PruningCombineStep nodes and stick them into a List.  At execution
time transform the List into an array and loop over it.
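The execute-as-a-loop idea can be seen in a toy interpreter (illustrative Python, not proposed code; Python sets stand in for Bitmapsets, and the base-step results are invented for the example), here running the program from Example #3:

```python
# Toy "pruning program" interpreter.  Registers hold sets of partition
# indexes (standing in for Bitmapsets); register 0 is the result.
def run_pruning_program(steps, prune_base):
    regs = {}
    for step in steps:
        if step[0] == "base":
            _, reg, strategy, exprs = step
            # prune_base() stands in for the real bound search: it returns
            # the set of partitions selected for this strategy/expressions.
            regs[reg] = prune_base(strategy, exprs)
        else:
            _, target, source, op = step
            src = regs[source]
            if op == "union":          # bits set in source become set
                regs[target] |= src
            elif op == "intersect":    # bits clear in source become clear
                regs[target] &= src
            elif op == "difference":   # bits set in source become clear
                regs[target] -= src
    return regs[0]

# Example #3: range partitioning on (a, b), with $1 = 5, $2 = 55, $3 = 7.
# The bound-search results below are invented for illustration.
def fake_prune(strategy, exprs):
    table = {(">", (40, 5)): {1, 2},   # a = 40 AND b > 5
             ("=", (55, 7)): {3}}      # a = 55 AND b = 7
    return set(table[(strategy, tuple(exprs))])

program = [
    ("base", 0, ">", [40, 5]),     # 1. base-step into register 0
    ("base", 1, "=", [55, 7]),     # 2. base-step into register 1
    ("combine", 0, 1, "union"),    # 3. combine-step: reg 0 |= reg 1
]
```

Running the program yields {1, 2, 3}, the union of the partitions selected by the two OR arms.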

Or possibly it would be better to have two lists, one of base steps
without explicit register numbers, where step N always outputs to
register N, and then a second list of combine steps.  Then at
execution time you could have an array of PruningBaseStep * and an
array of PruningCombineStep * instead of a combined array of Node *,
which might be quicker to process.

But regardless of what you do exactly, I think you should try to come
up with some kind of representation that is basically uniform,
handling all the things you support in a similar fashion.  The current
patch has basically separate and somewhat ad-hoc representations for
the regular case, the <> case, and the OR case, which I think is not
ideal because you end up with more code and a certain amount of
repeated logic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote:
> 2. All processing of clauses should happen at plan time, not run time,
> so that we're not repeating work.  If, for example, a prepared query
> is executed repeatedly, the clauses should get fully processed when
> it's planned, and then the only thing that should happen at execution
> time is that we take the values to which we now have access and use
> them to decide what to prune.  With this design, we're bound to repeat
> at least the portion of the work done by get_partitions_from_clauses()
> at runtime, and as things stand, it looks like the current version of
> the run-time partition pruning patch repeats the
> generate_partition_clauses() work at runtime as well.
>
> It seems to me as though what we ought to be doing is extracting
> clauses that are of the correct general form and producing a list of
> <partition-column-index>, <partition-operator-strategy>, <expression>
> triplets sorted by increasing partition column index and operator
> strategy number and in a form that can be represented as a Node tree.
> This can be attached to the plan for use at runtime or, if there are
> at least some constants among the expressions, used at plan time for
> partition exclusion.  So the main interfaces would be something like
> this:

We did try this already, as I also thought the same a while back and
even wrote a broken patch to do it.  The thing is,
PartitionClauseInfo is designed to be optimal for having
get_partitions_from_clauses() called on it multiple times over, as
will likely happen when run-time pruning is pruning away unneeded
partitions during a parameterized nested loop join.  To make it a Node
type, it's not quite as simple as changing arrays to Lists, as the
keyclauses List contains PartClause structs, which are also not a Node
type, and that struct in turn contains an FmgrInfo, which is not a
Node type either, so we'd need to try to make all of those Node types
(I doubt that's going to happen) ... but ...

It's probably not impossible to come up with some intermediate,
partially processed type that can be a Node type. Ideally, this could
be quickly converted into a PartitionClauseInfo during execution. The
problem is that this adds a rather silly additional step during query
planning to convert the intermediate processed list into the fully
processed PartitionClauseInfo that get_partitions_from_clauses would
still need.

I don't think building it is going to cost a huge amount. Presumably,
there are not many partitioned tables with only 10 rows, so having
get_partitions_from_clauses work as quickly as possible is probably
better than saving 100 nanoseconds in executor startup.

That being said, there's still a small issue with the run-time pruning
patch which is caused by me not pre-processing the clauses during
planning. Ideally, I'd be able to pre-process at least enough to
determine if any Params match the partition key so that I know if
run-time pruning can be used or not. As of now, I'm not doing that as
it seems wasteful to pre-process during planning just to get the Param
Ids out, then not be able to carry the pre-processed ones over to the
executor.  We also can't really reuse the pre-processed state that was
generated during the planner's calls to generate_partition_clauses()
since we'll additionally be passing in the parameterized path
clauses as well as the baserestrictinfo clauses.

Given the above, I'm happy to listen to ideas on this, as I'm really
not sure what's best here.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 3 March 2018 at 04:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Looking at the rough interface sketch in your message, it seems that the
>> product of whatever steps we end up grouping into phase 1 should be
>> something that can be put into a Node tree (PartitionPruneContext?),
>> because we may need to attach it to the plan if some of the needed values
>> will only be made available during execution.
>
> Right.  You might also end up with two representations: a
> Node-tree-style representation that contains all of the information we
> need, and another, faster form into which it gets converted before
> use.

Hmm, I thought Amit typed PartitionPruneContext instead of
PartitionClauseInfo by mistake here.

PartitionPruneContext can't be a Node type. The run-time pruning patch
needs to set the PlanState in this context so that the code can look up
the Param values when being called from the executor, and you can't
make PlanState a Node type too!

Perhaps you could make some primnode type to store all the stuff from
RelOptInfo that's currently being stored in PartitionPruneContext.
That would go along with the plan to save the executor having to look
that stuff up. We could then make that Node type a field in
PartitionPruneContext. You could even reuse that Node from when it is
first generated during plan-time pruning and keep it around to pass to
the executor for run-time pruning, but you'd probably need to stuff it
somewhere like the partition's RelOptInfo so it could be reused again
later. I'm unsure whether that's worth the trouble or not.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Mar 2, 2018 at 10:54 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> I don't think building it is going to cost a huge amount. Presumably,
> there are not many partitioned tables with only 10 rows, so having
> get_partitions_from_clauses work as quickly as possible is probably
> better than saving 100 nanoseconds in executor startup.

I agree that one could go overboard with trying to push work from
executor time to planner time, but I don't think the current patch is
very close to the point of diminishing returns.  It's doing nearly
everything at execution time.

> That being said, there's still a small issue with the run-time pruning
> patch which is caused by me not pre-processing the clauses during
> planning. Ideally, I'd be able to pre-process at least enough to
> determine if any Params match the partition key so that I know if
> run-time pruning can be used or not. As of now, I'm not doing that as
> it seems wasteful to pre-process during planning just to get the Param
> Ids out, then not be able to carry the pre-processed ones over to the
> executor.  We also can't really reuse the pre-processed state that was
> generated during the planner's calls to generate_partition_clauses()
> since we'll additionally be passing in the parameterized path
> clauses as well as the baserestrictinfo clauses.

I think it should be possible to have a structure where all the work
of classifying clauses happens in the planner.  By the time we get to
execution time, we should be able to know for sure which clauses are
relevant.  For example, if the user says WHERE a = $1 + 3 AND b =
(random() * 100)::int, and the partition key is (a, b), we should be
able to figure out at plan time that the clause containing b is
useless (because it's volatile) and the clause containing a is useful
only if this is range-partitioning (because with hash-partitioning we
must have an equality clause for every partition key column to do anything).  I
think it should also be possible to know which expressions need to be
computed at runtime -- in this case, $1 + 3 -- and to which columns of
the partition key they correspond -- in this case, the first.  I just
proposed a data representation which could track all that stuff and
I'm sure there are other ways to do it, too.
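A plan-time classification along these lines might be sketched as follows (illustrative Python; the clause representation and function name are hypothetical, not the patch's actual interface):

```python
# Illustrative sketch of plan-time clause classification: discard
# volatile clauses, match the rest to partition key columns, and for
# hash partitioning require equality on every key column.
def classify_clauses(clauses, part_key, strategy):
    """Return (key index, op, expr) triplets usable for pruning,
    or [] if no pruning is possible."""
    usable = []
    for col, op, expr, volatile in clauses:
        if volatile:
            continue                  # e.g. b = (random() * 100)::int
        if col in part_key:
            usable.append((part_key.index(col), op, expr))
    if strategy == "hash":
        # Hash pruning needs an equality clause on every key column.
        eq_cols = {idx for idx, op, _ in usable if op == "="}
        if eq_cols != set(range(len(part_key))):
            return []
    return usable

# WHERE a = $1 + 3 AND b = (random() * 100)::int, partition key (a, b):
clauses = [("a", "=", "$1 + 3", False),
           ("b", "=", "random()*100", True)]
# Under hash partitioning nothing survives (the b clause is volatile, so
# not every key column has an equality clause); under range partitioning
# the a clause alone is usable.
```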

I think that things like PartClause that include both an opno and
various bits of cached information, including FmgrInfo, are not a very
good idea.  A lot of work has been done to maintain the separation of
immutable information -- like Plans or Exprs -- from the run-time
state they use -- PlanState or ExprState.  I think we would do well to
follow that distinction here, too, even if it seems to introduce some
"silly" overhead at execution time.  I think it will pay for itself in
future code maintenance and in the ability to apply optimizations such
as JIT, which benefit from clean divisions of this kind.  It is not
crazy to imagine that the "pruning program" idea I floated in a
previous email could be folded into the JIT work Andres is doing,
where something with a less clean separation of concerns would run
into problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 3 March 2018 at 04:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 2. Removing redundant clauses from each list, that is,
>>    remove_redundant_clauses() to produce lists with just one member per
>>    operator strategy for each partition key,
>
> I don't see that there's a real need to perform step 2 at all.  I
> mean, if you have x > $1 and x > $2 in the query, you can just compute
> the set of partitions for the first clause, compute the set of
> partitions for the second clause, and then intersect.  That doesn't
> seem obviously worse than deciding which of $1 and $2 is greater and
> then pruning only based on whichever one is greater in this case.

This is an interesting idea. It may simplify the code quite a bit as
the clause reduction code is quite bulky. However, unless I'm
mistaken, for this to work, certain inputs will need significantly
more processing to determine the minimum set of matching partitions.

Let's look at the following perhaps unlikely case. (I picked an
extreme case to demonstrate why this may be an inferior method)

Given the table abc (...) partition by range (a,b,c), with the query:

select * from abc where a >= 1 and a >= 2 and a >= 3 and b >= 1 and
b >= 2 and b >= 3 and c >= 1 and c >= 2 and c >= 3;

We would likely still be parsing those clauses into some struct like
PartitionClauseInfo and would end up with some arrays or Lists with
the clauses segmented by partition key.

It appears to me, for your method to work we'd need to try every
combination of the clauses matching each partition key, which in this
case is 3 * 3 * 3 searches. Amit's current method is 1 search, after
the clause reduction which is 3 + 3 + 3 (O(N) per partition key)

I've tried to think of a more genuine poorly performing case for this
with IN or NOT IN lists, but I can't quite see it, since NOT IN will
only be supported by LIST partitioning, which can only have a single
partition key, and IN would be OR conditions, each of which would be
evaluated in a different round of looping. Although I'm not ruling out
that my imagination is just not good enough.

With that considered, is it still a good idea to do it this way?

Or maybe I've misunderstood the idea completely?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/02 21:43, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 1:22 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> But I realized we don't need the coercion.  Earlier steps would have
>> determined that the clause from which the expression originated contains
>> an operator that is compatible with the partitioning operator family.  If
>> so, the type of the expression in question, even though different from the
>> partition key type, would be binary coercible with it.
> 
> That doesn't follow.  Knowing that two types are in the same operator
> family doesn't guarantee that the types are binary coercible. For
> example, int8 is not binary-coercible to int2.  Moreover, you'd better
> be pretty careful about trying to cast int8 to int2 because it might
> turn a query that would have returned no rows into one that fails
> outright; that's not OK.  Imagine that the user types:
> 
> SELECT * FROM partitioned_by_int2 WHERE a = 1000000000000;
> 
> I think what needs to happen with cross-type situations is that you
> look in the opfamily for a comparator that takes the types you want as
> input; if you can't find one, you have to give up on pruning.  If you
> do find one, then you use it.  For example in the above query, once
> you find btint28cmp, you can use that to compare the user-provided
> constant against the range bounds for the various partitions to see
> which one might contain it.  You'll end up selecting the partition
> with upper bound MAXVALUE if there is one, or no partition at all if
> every partition has a finite upper bound.  That's as well as we can do
> with current infrastructure, I think.

Hmm, yes.

So while the patch's previous approach to convert the query's constant
value to the desired type was wrong, this is wronger. :-(

I guess I'll need to change the patch so that the function used to
compare partition bounds with a query-specified constant is no longer
the default one from the PartitionKey, but instead one from the
opfamily that accepts the constant's type.
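The cross-type lookup being described might be sketched like this (illustrative Python; the opfamily contents shown are a hypothetical fragment, though btint28cmp is the real int2-vs-int8 btree comparison function mentioned above):

```python
# Sketch of the lookup described above: search the opfamily for a
# comparison function that takes (partition key type, constant type);
# if none exists, give up on pruning rather than casting the constant,
# which might turn a no-rows query into one that fails outright.
int2_opfamily_cmp = {
    ("int2", "int2"): "btint2cmp",
    ("int2", "int8"): "btint28cmp",
}

def lookup_comparator(opfamily, lefttype, righttype):
    # None means: no usable comparator, so no pruning for this clause.
    return opfamily.get((lefttype, righttype))

# partitioned_by_int2 compared against an int8 constant: use btint28cmp
# to compare the constant against the partition bounds.
# A type pair with no entry (say, int2 vs. some unrelated type) simply
# yields None, and pruning is skipped for that clause.
```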

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/03 0:47, Robert Haas wrote:
> On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote wrote:
>> Given the patch's implementation, we'll have to make the structure of that
>> Node tree a bit more complex than a simple List.  For one thing, the patch
>> handles OR clauses by performing pruning separately for each arm and then
>> combining partitions selected across OR arms using set union.  By
>> "performing pruning" in the last sentence I meant following steps similar
>> to ones you wrote in your message:
>>
>> 1. Segregating pruning clauses into per-partition-key Lists, that is,
>>    generate_partition_clauses() producing a PartitionClauseInfo,
>>
>> 2. Removing redundant clauses from each list, that is,
>>    remove_redundant_clauses() to produce lists with just one member per
>>    operator strategy for each partition key,
>>
>> 3. Extracting Datum values from the clauses to form equal/min/max tuples
>>    and setting null or not null bits for individual keys, that is,
>>    extract_bounding_datums() producing a PartScanKeyInfo, and
>>
>> 4. Finally pruning with those Datum tuples and null/not null info, that
>>    is, get_partitions_for_keys().
>>
>> Steps 2-4 are dependent on clauses providing Datums, which all the clauses
>> may or may not do.  Depending on whether or not, we'll have to defer those
>> steps to run time.
> 
> I don't see that there's a real need to perform step 2 at all.  I
> mean, if you have x > $1 and x > $2 in the query, you can just compute
> the set of partitions for the first clause, compute the set of
> partitions for the second clause, and then intersect.  That doesn't
> seem obviously worse than deciding which of $1 and $2 is greater and
> then pruning only based on whichever one is greater in this case.

If we can accommodate this step with your "pruning program" idea below, I
think we should still try to keep the remove_redundant_clauses() step.

If we can keep this step, x > $1 and x > $2 will require 1 comparison + 1
bsearch over bounds, whereas without it, it'd be 2 bsearches over bounds +
1 Bitmapset operation.  I think David expressed a similar concern about
the performance in his reply down-thread.

That is, as long as we implement this step such that having it in the
pipeline doesn't restrict the representation we need for the "pruning
steps".  Let's see.

>> * What do we encode into the Node tree attached to the plan?  Clauses that
>>   haven't gone through steps 2 and 3 (something like PartitionClauseInfo)
>>   or the product of step 3 (something like PartScanKeyInfo)?
>>
>> * How do we account for OR clauses?  Perhaps by having the aforementioned
>>   Node trees nested inside the top-level one, wherein there will be one
>>   nested node per arm of an OR clause.
> 
> Suppose we define the notion of a pruning program.  A pruning program
> can use any number of registers, which have integer numbers starting
> with 0 and counting upward as high as necessary.  Each register holds
> a Bitmapset.  The result of a pruning program is the value of register
> 0 when the program completes.  A pruning program consists of a list of
> steps, each of which is either a PruningBaseStep or a
> PruningCombineStep.  A PruningCombineStep modifies the contents of the
> target register based on the contents of a source register in one of
> the following three ways: (1) UNION -- all bits set in source become
> set in target; (2) INTERSECT -- all bits clear in source become clear
> in target; (3) DIFFERENCE -- all bits set in source become clear in
> target.  A PruningBaseStep consists of a strategy (equality,
> less-than, etc.), an output register, and list of expressions --
> either as many as there are partition keys, or for range partitioning
> perhaps fewer; it prunes based on the strategy and the expressions and
> overwrites the output register with the partitions that would be
> selected.
> 
> Example #1.  Table is hash-partitioned on a and b.  Given a query like
> SELECT * FROM tab WHERE a = 100 AND b = 233, we create a single-step
> program:
> 
> 1. base-step (strategy =, register 0, expressions 100, 233)
> 
> If there were an equality constraint on only one of the two columns, we
> would not create a pruning program at all, because no pruning is
> possible.
> 
> Example #2. Table is list-partitioned on a.  Given a query like SELECT
> * FROM tab WHERE (a = $1 OR a = $2) AND a != $3, we create this
> program:
> 
> 1. base-step (strategy =, register 0, expressions $1)
> 2. base-step (strategy =, register 1, expressions $2)
> 3. base-step (strategy !=, register 2, expressions $3)
> 4. combine-step (target-register 0, source-register 1, strategy union)
> 5. combine-step (target-register 0, source-register 2, strategy difference)
> 
> (This is unoptimized -- one could do better by reversing steps 3 and 4
> and reusing register 1 instead of needing register 2, but that
> kind of optimization is probably not too important.)
> 
> Example #3. Table is range-partitioned on a and b.  Given a query like
> SELECT * FROM tab WHERE (a = 40 AND b > $1) OR (a = $2 AND b = $3), we
> do this:
> 
> 1. base-step (strategy >, register 0, expressions 40, $1)
> 2. base-step (strategy =, register 1, expressions $2, $3)
> 3. combine-step (target-register 0, source-register 1, strategy union)
> 
> You might need a few extra gadgets here to make all of this work --
> e.g. another base-step strategy to handle ScalarArrayOpExpr; I'm just
> trying to convey the basic idea here.  It's pretty easy to see how to
> store a program like this as a node tree: just create PruningBaseStep
> and PruningCombineStep nodes and stick them into a List.  At execution
> time transform the List into an array and loop over it.
> 
> Or possibly it would be better to have two lists, one of base steps
> without explicit register numbers, where step N always outputs to
> register N, and then a second list of combine steps.  Then at
> execution time you could have an array of PruningBaseStep * and an
> array of PruningCombineStep * instead of a combined array of Node *,
> which might be quicker to process.
> 
> But regardless of what you do exactly, I think you should try to come
> up with some kind of representation that is basically uniform,
> handling all the things you support in a similar fashion.  The current
> patch has basically separate and somewhat ad-hoc representations for
> the regular case, the <> case, and the OR case, which I think is not
> ideal because you end up with more code and a certain amount of
> repeated logic.

Thanks for outlining this idea.

Given that the most important part of your outline seems to be the clean
Node structure that carries the information from one stage to another,
which will result from adopting the "partition pruning primitives"
described above, I will first try to implement them as a shell around the
code that David and I have debugged till now.  Then, once everything
seems to work, I will start dropping the structures and code that seem
unneeded or duplicative, of which I'm beginning to suspect there will be
plenty.  I'll post an update in a couple of days to report on how that
works out.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Mar 2, 2018 at 7:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Let's look at the following perhaps unlikely case. (I picked an
> extreme case to demonstrate why this may be an inferior method)
>
> Given the table abc (...) partition by range (a,b,c), with the query:
>
> select * from abc where a >= 1 and a >= 2 and a >= 3 and b >= 1 and
> b >= 2 and b >= 3 and c >= 1 and c >= 2 and c >= 3;
>
> We would likely still be parsing those clauses into some struct like
> PartitionClauseInfo and would end up with some arrays or Lists with
> the clauses segmented by partition key.
>
> It appears to me, for your method to work we'd need to try every
> combination of the clauses matching each partition key, which in this
> case is 3 * 3 * 3 searches. Amit's current method is 1 search, after
> the clause reduction which is 3 + 3 + 3 (O(N) per partition key)
[...]
> With that considered, is it still a good idea to do it this way?

I dunno.  What do you think?

That case is indeed pretty unfortunate, but it's also pretty
artificial.  It's not obvious to me that we shouldn't care about it,
but it's also not obvious to me that we should.  If we have some
bizarre cases that slip through the cracks or don't perform terribly
well, maybe nobody would ever notice or care.  On the other hand,
maybe they would.

I suppose in my ideal world, this could be handled by building a
GREATEST or LEAST expression.  In other words, if someone says foo >=
1 AND foo >= 2, instead of doing separate pruning steps, we'd just
prune once based on foo >= GREATEST(1,2).  But that doesn't really
work, because there's no provision to tell MinMaxExpr from which
opfamily we wish to draw the operator used to compare 1 and 2 and no
guarantee that such an operator exists for the actual data types of 1
and 2.  (Imagine that 1 and 2 are of different data types; the relevant
opfamily might have an operator that can compare a value of the same
type as foo to 1 and similarly for 2, but no operator that can compare
1 and 2 to each other.)

One thing that we could do is just only accept one clause for each
column-strategy pairing, presumably either the first one or the last
one.  So in your example either a >= 1 or a >= 3 would be accepted and
the others would be discarded for purposes of partition pruning.  If a
user complains, we could tell THEM to manually do the rewrite
suggested in the previous step, and just write a >= GREATEST(1,2,3).
(And of course if it's that simple, they might want to then
pre-simplify to a >= 3!)
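When the constants do all share a comparable type (which, as discussed, the opfamily cannot guarantee in general), the fold itself is trivial; a sketch (illustrative Python, hypothetical function name):

```python
# Fold multiple ">=" quals on the same column down to one qual on the
# greatest constant -- the GREATEST(1,2,3) rewrite suggested above.
# This assumes all constants for a column are mutually comparable,
# which cross-type situations may not provide.
def fold_ge_quals(quals):
    """quals: list of (column, const) pairs, all using the >= strategy."""
    best = {}
    for col, const in quals:
        best[col] = max(best.get(col, const), const)
    return [(col, c) for col, c in best.items()]

# a >= 1 AND a >= 2 AND a >= 3  folds to  a >= 3
```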

Another alternative is to include some kind of additional type of step
in the "pruning program" which can do this GREATEST/LEAST operation
... but that's adding quite a bit of complexity for what seems like
it's pretty much a corner case, and as noted above, there's no
guarantee that we even have the correct operator available.  It should
be fine if the partitioning column/expression and all of the constants
being compared are of the same type, and in practice *most* of the
time even when they're not, but we're going to have to have some
handling for the strange cases -- and I think the only real choices
are "try every combination and maybe be slow", "try 1 combination and
maybe fail to prune everything that could have been pruned", and some
intermediate possibilities along the same lines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 March 2018 at 10:15, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 7:32 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> It appears to me, for your method to work we'd need to try every
>> combination of the clauses matching each partition key, which in this
>> case is 3 * 3 * 3 searches. Amit's current method is 1 search, after
>> the clause reduction which is 3 + 3 + 3 (O(N) per partition key)
> [...]
>> With that considered, is it still a good idea to do it this way?
>
> I dunno.  What do you think?
>
> That case is indeed pretty unfortunate, but it's also pretty
> artificial.  It's not obvious to me that we shouldn't care about it,
> but it's also not obvious to me that we should.  If we have some
> bizarre cases that slip through the cracks or don't perform terribly
> well, maybe nobody would ever notice or care.  On the other hand,
> maybe they would.

One thing I've learned in my time working with PostgreSQL is that, if
there's a known hole, someone's probably going to fall down it
eventually.  I like working with PostgreSQL because we're pretty
careful to not make holes that people can fall down, or if there is
some hole that cannot be filled in, we try to put a fence around it
with a sign (e.g. renaming pg_xlog to pg_wal).  I'm not strongly opposed
to your ideas; I probably don't have a complete understanding of the
idea anyway. But from what I understand, it looks like you want to take
something that works quite well and make it work less well, and there
appears not to be a good reason provided for why you want to do that.

Is it because you want to simplify the patch due to concerns about it
being too much logic to get right for PG11?

> One thing that we could do is just only accept one clause for each
> column-strategy pairing, presumably either the first one or the last
> one.

The problem with that is it can cause surprising behaviour. We reorder
clauses and clauses get pushed down from upper parts of the query.

Let's say there was some view like:

CREATE VIEW vw_ledger_2018 AS SELECT * FROM ledger WHERE postdate
BETWEEN '2018-01-01' AND '2018-12-31';

And a user comes along and does:

SELECT * FROM vw_ledger_2018 WHERE postdate BETWEEN '2018-03-01' AND
'2018-03-31'

We're going to end up with base quals something like: postdate >=
'2018-01-01' AND postdate <= '2018-12-31' AND postdate >= '2018-03-01'
AND postdate <= '2018-03-31'

If we just take the first from each op strategy then we'll not have
managed to narrow the case down to just the March partition. You might
argue that this should be resolved at some higher level in the
planner, but that does nothing for the run-time pruning case.

I don't really want to do or say anything that jeopardises this patch
from getting into PG11, so if the path of least resistance is to go
with the option you've proposed then I'd much rather that than this
getting pushed out to PG12.  I really just want to try to make sure
we've thought of everything before we create too many surprises for
users.

Perhaps a compromise would be to check all quals from the first
partition key and only the first or last one from the remaining keys.
I imagine most cases will have just 1 key anyway. This would
significantly reduce the number of possible combinations of quals to
try, but unfortunately, it still does have that element of surprise.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Mar 6, 2018 at 8:34 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> One thing I've learned in my time working with PostgreSQL is that, if
> there's a known hole, someone's probably going to fall down it
> eventually.  I like working with PostgreSQL because we're pretty
> careful to not make holes that people can fall down, or if there is
> some hole that cannot be filled in, we try to put a fence around it
> with a sign (e.g. renaming pg_xlog to pg_wal).  I'm not strongly opposed
> to your ideas; I probably don't have a complete understanding of the
> idea anyway. But from what I understand, it looks like you want to take
> something that works quite well and make it work less well, and there
> appears not to be a good reason provided for why you want to do that.
>
> Is it because you want to simplify the patch due to concerns about it
> being too much logic to get right for PG11?

My understanding is that the patch as submitted is fundamentally
broken in multiple ways.

As Amit said a few emails upthread, "So while the patch's previous
approach to convert the query's constant value to the desired type was
wrong, this is wronger. :-("  I agree with that analysis.  As I tried
to explain in my last email, if you've got something like foo >
'blarfle'::type1 and foo > 'hoge'::type2, there may actually be no way
at all to determine which of those clauses is more restrictive.  The
fact that > was used in the query to compare foo with a value of type1
and, separately, with a value of type2 means that those operators
exist, but it does not follow that the opfamily provides an operator
which can compare type1 to type2.  As far as I can see, what this
means is that, in general, the approach the patch takes to eliminating
redundant clauses just doesn't work; and in the general case I don't
think there's much hope of saving it.  The question of whether the
patch does too much work at execution time or not is maybe arguable --
my position is that it does -- but working properly in the face of
cross-type comparisons is non-negotiable.

The use of evaluate_expr() is also completely wrong and has got to be
fixed.  I already wrote about that upthread so I won't repeat it here.
I'm pretty sure that the current design, if allowed to stand, would
have lots of bad consequences.

As I understand it, Amit is currently hacking on the patch to try to
fix these issues.  If he comes up with something that works properly
with cross-type comparisons and doesn't abuse evaluate_expr() but
still does more work than I'd ideally prefer at execution time, I'll
consider holding my nose and consider it anyway.  But considering the
amount of rework that I think is needed, I don't really see why we
wouldn't adopt a design that minimizes execution time work, too.

In short, I don't think I'm trying to make something that works quite
well work less well, because I don't think the patch as it stands can
be correctly described as working quite well.

> Let's say there was some view like:
>
> CREATE VIEW vw_ledger_2018 AS SELECT * FROM ledger WHERE postdate
> BETWEEN '2018-01-01' AND '2018-12-13';
>
> And a user comes along and does:
>
> SELECT * FROM vw_ledger_2018 WHERE postdate BETWEEN '2018-03-01' AND
> '2018-03-31'
>
> If we just take the first from each op strategy then we'll not have
> managed to narrow the case down to just the March partition. You might
> argue that this should be resolved at some higher level in the
> planner, but that does nothing for the run-time pruning case.

Yeah, that's a good example of how this could happen in real life.

So far I see three ways forward here:

1. If we've got multiple redundant quals, ignore all but one of them
for purposes of partition pruning.  Hope users don't get mad.

2. If we've got multiple redundant quals, do multiple checks.  Hope
this isn't annoyingly slow (or too much ugly code).

3. If we've got multiple redundant quals but no cross-type operators
are in use, evaluate all of the expressions and pick the highest or
lowest value as appropriate.  Otherwise fall back to #1 or #2.  For
bonus points, do this when cross-type operators ARE in use and the
additional cross-type operators that we need to figure out the highest
or lowest value, as appropriate, are also available.
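
Option 3 might look roughly like the following toy sketch (the type tags,
the opfamily "lookup", and the qual representation are all invented for
illustration and are not the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fold several redundant "x > value" quals down to the single largest
 * value when a comparator exists for every pair of types involved;
 * otherwise fall back to keeping (and later checking) each qual. */

typedef enum { T_INT2, T_INT4 } TypeTag;

typedef struct { TypeTag type; long value; } Qual;

typedef int (*cmp_fn)(long, long);

static int cmp_long(long a, long b) { return (a > b) - (a < b); }

/* opfamily "lookup": NULL means no cross-type comparator exists */
static cmp_fn lookup_cmp(TypeTag a, TypeTag b, bool opfamily_complete)
{
    if (a == b || opfamily_complete)
        return cmp_long;
    return NULL;
}

/* Returns how many quals remain: 1 if folding succeeded, nquals if we
 * had to fall back to option 2 (do multiple checks). */
static int fold_greater_quals(const Qual *quals, int nquals,
                              bool opfamily_complete, Qual *out)
{
    Qual best = quals[0];

    for (int i = 1; i < nquals; i++)
    {
        cmp_fn cmp = lookup_cmp(best.type, quals[i].type, opfamily_complete);

        if (cmp == NULL)
        {
            /* no way to rank the values: keep every qual */
            for (int j = 0; j < nquals; j++)
                out[j] = quals[j];
            return nquals;
        }
        if (cmp(quals[i].value, best.value) > 0)
            best = quals[i];
    }
    out[0] = best;
    return 1;
}
```

Note that the sketch only ever compares values; it never casts a value of
one type to another.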

I'm OK with any of those approaches; that is, I will not seek to block
a patch merely on the basis of which of those options it chooses.  I
think they are all defensible.  Options that are not OK include (a)
trying to cast a value of one type to another type, because that could
turn a query that would have simply returned no rows into an error
case or (b) supposing that all types in an opfamily are binary
coercible to each other, because that's just plain wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi.

On 2018/03/05 17:38, Amit Langote wrote:
> I'll
> post an update in a couple of days to report on how that works out.

I'm still working on this and getting most of the tests to pass with the
new code, but not all of them yet.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/07 20:58, Amit Langote wrote:
> Hi.
> 
> On 2018/03/05 17:38, Amit Langote wrote:
>> I'll
>> post an update in a couple of days to report on how that works out.
> 
> I'm still working on this and getting most of the tests to pass with the
> new code, but not all of them yet.

Sorry about the delay.

Attached is a significantly revised version of the patch, although I admit
it could still use some work with regard to comments and other cleanup.

The rewrite introduces a notion of PartitionPruneStep nodes based on the
ideas described in [1].  So, instead of aggregating *all* of the pruning
clauses into a PartitionClauseInfo which was hard to serialize into a node
tree and then a PartScanKeyInfo (both of which no longer exist), this
generates a list of nodes.  Each node inherits from the base
PartitionPruneStep node type and contains enough information to perform
partition pruning by directly comparing the information with partition
bounds or contains sub-nodes that do.  For example, a PartitionPruneStepOp
step contains an integer telling the partitioning operator strategy (such
as various btree operator strategies) and a tuple to compare against
partition bounds stored in the relcache.  A PartitionPruneStepCombine step
contains arguments that are in turn pruning steps themselves, which are
separately executed and partition sets obtained thereby are combined using
the specified combineOp.
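
As a very rough sketch of the idea (a toy model, not the patch's code: the
step structs, strategies, and bound layout below are simplified stand-ins,
and partition sets are plain bitmasks rather than Bitmapsets):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPARTS 3

typedef enum { STEP_OP, STEP_COMBINE } StepType;
typedef enum { COMBINE_UNION, COMBINE_INTERSECT } CombineOp;
typedef enum { STRAT_LT, STRAT_EQ, STRAT_GT } OpStrategy;   /* btree-like */

typedef struct PruneStep
{
    StepType    type;
    /* fields for STEP_OP */
    OpStrategy  opstrategy;
    int         value;          /* stands in for the comparison tuple */
    /* fields for STEP_COMBINE */
    CombineOp   combineOp;
    const struct PruneStep *args[2];
    int         nargs;
} PruneStep;

/* range-partition bounds: partition i covers [lower[i], lower[i+1]) */
static const int lower[NPARTS + 1] = {0, 10, 20, 30};

/* Execute one step, returning the set of possibly-matching partitions. */
static uint32_t perform_step(const PruneStep *step)
{
    uint32_t parts;

    if (step->type == STEP_OP)
    {
        parts = 0;
        for (int i = 0; i < NPARTS; i++)
        {
            int     lo = lower[i],
                    hi = lower[i + 1];
            bool    match;

            match = (step->opstrategy == STRAT_EQ &&
                     step->value >= lo && step->value < hi) ||
                    (step->opstrategy == STRAT_LT && lo < step->value) ||
                    (step->opstrategy == STRAT_GT && hi - 1 > step->value);
            if (match)
                parts |= 1u << i;
        }
    }
    else
    {
        /* combine the partition sets produced by the argument steps */
        parts = (step->combineOp == COMBINE_INTERSECT) ? (1u << NPARTS) - 1 : 0;
        for (int i = 0; i < step->nargs; i++)
        {
            uint32_t argparts = perform_step(step->args[i]);

            parts = (step->combineOp == COMBINE_INTERSECT) ? (parts & argparts)
                                                           : (parts | argparts);
        }
    }
    return parts;
}
```

With these bounds, an EQ step for 15 matches only the middle partition,
while an INTERSECT combine of "> 12" and "< 25" matches the last two.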

This also fixes a bug in the previous design, as detailed in [2].  So, with the
patch:

create table lparted (a smallint) partition by list (a);
create table lparted_1 partition of lparted for values in (1);
create table lparted_16384 partition of lparted for values in (16384);

-- all partitions pruned (lparted_16384 wouldn't be pruned by previous
-- patches due to comparison using a bogus partsupfunc)

explain (costs off) select * from lparted where a = 100000000000000;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)

Also,

create table rparted (a smallint) partition by range (a);
create table rparted_1 partition of rparted for values from (1) to (10);
create table rparted_16384 partition of rparted for values from (10) to
(16384);
create table rparted_maxvalue partition of rparted for values from (16384)
to (maxvalue);

-- all partitions except rparted_maxvalue pruned
explain (costs off) select * from rparted where a > 100000000000000;
                   QUERY PLAN
-------------------------------------------------
 Append
   ->  Seq Scan on rparted_maxvalue
         Filter: (a > '100000000000000'::bigint)
(3 rows)

I will continue working on improving the comments / cleaning things up and
post a revised version soon, but until then please look at the attached.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/CA%2BTgmoYtKitwsFtA4%2B6cdeYGEfnS1%2BOY%2BG%3DUe26fgSzJZx%3DeJg%40mail.gmail.com

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 03/13/2018 07:37 AM, Amit Langote wrote:
> I will continue working on improving the comments / cleaning things up and
> post a revised version soon, but until then please look at the attached.
> 

Passes check-world.

Some minor comments:

0001: Ok

0002: Ok

0003:
* Trailing white space
* pruning.c
   - partkey_datum_from_expr
     * "Add more expression types..." -- Are you planning to add more of 
these? Otherwise change the comment
   - get_partitions_for_null_keys
     * Documentation for method
     * 'break;' missing for _HASH and default case
   - get_partitions_for_keys
     * 'break;'s are outside of the 'case' blocks
     * The 'switch(opstrategy)'s could use some {} blocks
       * 'break;' missing from default
   - perform_pruning_combine_step
     * Documentation for method
* nodeFuncs.c
   - Missing 'break;'s to follow style

0004: Ok

Best regards,
  Jesper


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> I will continue working on improving the comments / cleaning things up and
> post a revised version soon, but until then please look at the attached.

I tried to give this a read.  It looks like pretty neat stuff -- as far as I
can tell, it follows Robert's sketch for how this should work.  The fact
that it's under-commented makes me unable to follow it too closely
though (I felt like adding a few "wtf?" comments here and there), so
it's taking me a bit to follow things in detail.  Please do submit
improved versions as you have them.

I think you're using an old version of pg_bsd_indent.

In particular need of commentary:
 * match_clause_to_partition_key() should indicate which params are
 outputs and what they get

 * get_steps_using_prefix already has a comment, but it doesn't really
 explain much.  (I'm not sure why you use the term "tuple" here.  I mean,
 mathematically it probably makes sense, but in the overall context it
 seems just confusing.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
By the way, I checked whether patch 0002 (additional tests) had an
effect on coverage, and couldn't detect any changes in terms of
lines/functions.  Were you able to find any bugs in your code thanks to
the new tests that would not have been covered by existing tests?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 14 March 2018 at 06:54, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>     * "Add more expression types..." -- Are you planning to add more of
> these ? Otherwise change the comment

Run-time pruning will deal with Param types here. The comment there
might be just to remind us all that the function must remain generic
enough so we can support more node types later. I don't particularly
need the comment for that patch, and I'm not quite sure if I should be
removing it in that patch. My imagination is not stretching far enough
today to think what we could use beyond Const and Param.

I don't feel strongly about the comment either way. This is just to
let you know that there are more up and coming things to do in that
spot.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/14 8:26, Alvaro Herrera wrote:
> By the way, I checked whether patch 0002 (additional tests) had an
> effect on coverage, and couldn't detect any changes in terms of
> lines/functions.  Were you able to find any bugs in your code thanks to
> the new tests that would not have been covered by existing tests?

All tests except those for hash partitioning got added as bugs were found
in the patch and fixed.  As you may know, constraint exclusion doesn't
help with pruning hash partitions, so those tests don't exercise any
existing functionality but are there for the *new* code.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is a significantly revised version of the patch, although I admit
> it could still use some work with regard to comments and other cleanup.

Thanks for making all those changes. There's been quite a bit of churn!

I've looked over the patch and agree that there need to be more
comments to explain things. There were certainly times during my
review that I just skipped ahead having not understood what I'd been
looking at.

Here are the notes from my review:

1. I think get_unpruned_partitions is not a great name for this
function. get_matching_partitions would be better.

2. HASH partitioning can no longer prune non-matching partitions in
cases like: SELECT * FROM hashp WHERE partkey1 IS NULL and partkey2 IS
NULL; This might mean you need to process IS NULL and the key clauses
in the same step.

I see you have a test which checks the plan for

explain (costs off) select * from hp where a = 1 and b is null;

which ensures all partitions are included. Why?

3. Header comment for get_partitions_for_keys references 'keys' which
is not a parameter to that function.

4. If you wrote a function to process an individual step and called
that in a foreach loop inside get_unpruned_partitions, then you
wouldn't need to list_make1() here. Instead just call the function to
perform one step.

argparts =  get_unpruned_partitions(context,
list_make1(step));

5. I see you're performing AND step combination in two different ways.
In perform_pruning_combine_step you do;

andparts = andparts == NULL
? argparts
: bms_int_members(andparts, argparts);

but in get_unpruned_partitions, you add the entire range to the set
using bms_add_range, then intersect on that.

The code seems a bit fragile and relies on get_unpruned_partitions
returning an allocated Bitmapset all the time, even if bms_is_empty()
is true. There should be no distinction between NULL and a Bitmapset
that returns true on bms_is_empty().

What you've got here probably works for now, but only due to the fact
that get_unpruned_partitions allocates the entire range with
bms_add_range and that bms_int_members does not return NULL with no
matching members if both input sets are non-NULL. If that were to
change then your code would misbehave in cases like:

WHERE <matches no partition> AND <matches a partition>;

When processing the <matches no partition> clause, no partitions would
match, then if that resulted in an empty set you'd then surprisingly
match partitions again despite the AND clause not actually making it
possible for any partitions to have matched.

Probably you just need to bms_add_range in
perform_pruning_combine_step too and perform bms_int_members
unconditionally, just like you're doing in get_unpruned_partitions.
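
The failure mode described here can be sketched with a toy model, where a
partition set is a plain bitmask and, mimicking Bitmapset, the empty set
may also come back as a "NULL" sentinel (all names invented for
illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPARTS 4
#define ALL_PARTS ((1u << NPARTS) - 1)

/* Mimic a Bitmapset-style API where an empty set may be represented by
 * "NULL" (here a sentinel value) instead of a mask with no bits set. */
#define BMS_NULL UINT32_MAX

static bool is_null(uint32_t s) { return s == BMS_NULL; }

static uint32_t int_members(uint32_t a, uint32_t b)
{
    if (is_null(a) || is_null(b))
        return BMS_NULL;
    return a & b;               /* may be 0: a distinct "empty" value */
}

/* Fragile AND-combination: uses NULL-ness of the accumulator to detect
 * the first step, so an empty (NULL) result from an earlier step is
 * indistinguishable from "not started yet". */
static uint32_t combine_and_fragile(const uint32_t *steps, int n)
{
    uint32_t acc = BMS_NULL;

    for (int i = 0; i < n; i++)
        acc = is_null(acc) ? steps[i] : int_members(acc, steps[i]);
    return acc;
}

/* Robust AND-combination: start from the full partition range and
 * intersect unconditionally. */
static uint32_t combine_and_robust(const uint32_t *steps, int n)
{
    uint32_t acc = ALL_PARTS;

    for (int i = 0; i < n; i++)
        acc &= is_null(steps[i]) ? 0 : steps[i];
    return acc;
}
```

With steps <matches nothing> AND <matches partition 2>, the fragile
version "resurrects" partition 2, while the robust version correctly
returns the empty set.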

6. The switch (last->op_strategy) in
generate_partition_pruning_steps_internal is missing a default ERROR
for unknown strategy

7. The switch: switch (context->strategy) in
generate_partition_pruning_steps_internal should ERROR rather than
break when it sees an unknown partition strategy.

8. Instead of copying the opclause here, wouldn't it be better to have
this code come after you palloc the PartClause, and then just set up the
PartClause with the negator directly?

if (*is_neop_listp)
{
Assert(OidIsValid(negator));
opclause = copyObject(opclause);
opclause->opno = negator;
}

8. PartitionedChildRelInfo is still mentioned in typedefs.list

9. I don't quite understand PartitionPruneStepNoop. Why can't you just
skip adding anything to the list in
generate_partition_pruning_steps_internal?

10. The following test does not make sense:

explain (costs off) select * from hp where (a = 10 and b = 'yyy') or
(a = 10 and b = 'xxx') or (a is null and b is null);
                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Append
   ->  Seq Scan on hp0
         Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
= 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
   ->  Seq Scan on hp1
         Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
= 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
   ->  Seq Scan on hp2
         Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
= 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
   ->  Seq Scan on hp3
         Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
= 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
(9 rows)

Why do 4 partitions match when there are only 3 sets of clauses and
each one can only match a single partition?

11. What does "root" mean here?

-- case for list partitioned table that's not root
explain (costs off) select * from rlp where a = 15 and b <> 'ab' and b
<> 'cd' and b <> 'xy' and b is not null;

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/14 20:50, David Rowley wrote:
> On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is a significantly revised version of the patch, although I admit
>> it could still use some work with regard to comments and other cleanup.
> 
> Thanks for making all those changes. There's been quite a bit of churn!

Thank you David and Jesper for the reviews.  I'm halfway done addressing
the comments and will submit an updated patch by tomorrow.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks for the review.

On 2018/03/14 20:50, David Rowley wrote:
> On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is a significantly revised version of the patch, although I admit
>> it could still use some work with regard to comments and other cleanup.
> 
> Thanks for making all those changes. There's been quite a bit of churn!
> 
> I've looked over the patch and agree that there need to be more
> comments to explain things. There were certainly times during my
> review that I just skipped ahead having not understood what I'd been
> looking at.

Hope the attached version is easier to understand.

> Here are the notes from my review:
> 
> 1. I think get_unpruned_partitions is not a great name for this
> function. get_matching_partitions would be better.

OK, I changed it to get_matching_partitions.

> 2. HASH partitioning can no longer prune non-matching partitions in
> cases like: SELECT * FROM hashp WHERE partkey1 IS NULL and partkey2 IS
> NULL; This might mean you need to process IS NULL and the key clauses
> in the same step.

OK, I made PartitionPruneStepOp contain a bitmap of null key
indexes.  It is referred to along with "values" for the other columns only
in the hash partitioning case, whereas for list and range partitioning,
actual value(s) and null key information are passed via separate
PartitionPruneStepOp nodes, because in the latter case we cannot mix
nulls and "values".

So, now there is no need for PartitionPruneStepNullness and
get_partitions_for_null_keys().
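
To illustrate why null keys and non-null key values have to travel in one
step for hash partitioning: the target partition is a function of the whole
key tuple, so every column must be pinned down, by an equality clause or by
an IS NULL clause, before the hash can be computed.  A toy sketch (the
hashing below is invented and is not the real partition support function
logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NKEYS 2

static uint64_t hash_combine64(uint64_t a, uint64_t b)
{
    return a ^ (b + 0x9e3779b97f4a7c15ULL + (a << 6) + (a >> 2));
}

/* The partition index depends on all NKEYS columns at once: a NULL
 * contributes a fixed value, a non-NULL column contributes a hash of
 * its datum.  If even one column's value/null-ness is unknown, no
 * single partition can be computed and no pruning is possible. */
static int hash_partition_index(const int64_t *values, const bool *isnull,
                                int nparts)
{
    uint64_t h = 0;

    for (int i = 0; i < NKEYS; i++)
    {
        uint64_t col = isnull[i] ? 0
                                 : (uint64_t) values[i] * 11400714819323198485ULL;

        h = hash_combine64(h, col);
    }
    return (int) (h % (uint64_t) nparts);
}
```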

> I see you have a test which checks the plan for
> 
> explain (costs off) select * from hp where a = 1 and b is null;
> 
> which ensures all partitions are included. Why?

I had intended to remove the support for hash partition pruning with some
keys being null that existed in previous versions of the patch, but after
reading your comment I thought that was too pessimistic.  I've added it back
and it works like it used to in the previous versions.

> 3. Header comment for get_partitions_for_keys references 'keys' which
> is not a parameter to that function.

Oops, fixed.

> 4. If you wrote a function to process an individual step and called
> that in a foreach loop inside get_unpruned_partitions, then you
> wouldn't need to list_make1() here. Instead just call the function to
> perform one step.
> 
> argparts =  get_unpruned_partitions(context,
> list_make1(step));

Hmm, yes.  I've introduced a perform_pruning_step() that switches on the
pruning step node type; it is called either by get_matching_partitions
while iterating over the list it receives, or by
perform_pruning_combine_step() on the arguments of a
PartitionPruneStepCombine.

> 5. I see you're performing AND step combination in two different ways.
> In perform_pruning_combine_step you do;
> 
> andparts = andparts == NULL
> ? argparts
> : bms_int_members(andparts, argparts);
> 
> but in get_unpruned_partitions, you add the entire range to the set
> using bms_add_range, then intersect on that.
> 
> The code seems a bit fragile and relies on get_unpruned_partitions
> returning an allocated Bitmapset all the time, even if bms_is_empty()
> is true. There should be no distinction between NULL and a Bitmapset
> that returns true on bms_is_empty().
> 
> What you've got here probably works for now, but only due to the fact
> that get_unpruned_partitions allocate the entire range with
> bms_add_range and that bms_int_members does not return NULL with no
> matching members if both input sets are non-NULL. If that were to
> change then your code would misbehave in cases like:
> 
> WHERE <matches no partition> AND <matches a partition>;
> 
> When processing the <matches no partition> clause, no partitions would
> match, then if that resulted in an empty set you'd then surprisingly
> match partitions again despite the AND clause not actually making it
> possible for any partitions to have matched.
> 
> Probably you just need to bms_add_range in
> perform_pruning_combine_step too and perform bms_int_members
> unconditionally, just like you're doing in get_unpruned_partitions

I noticed a number of things I could improve about this, also considering
your points above.  Please check if the new structure is an improvement.

> 6. The switch (last->op_strategy) in
> generate_partition_pruning_steps_internal is missing a default ERROR
> for unknown strategy

I've fixed that.

> 7. The switch: switch (context->strategy) in
> generate_partition_pruning_steps_internal should ERROR rather than
> break when it sees an unknown partition strategy.

This one too.

> 
> 8. Instead of copying the opclause here, wouldn't it be better to have
> this code come after you palloc the PartClause, and then just set up the
> PartClause with the negator directly?
> 
> if (*is_neop_listp)
> {
> Assert(OidIsValid(negator));
> opclause = copyObject(opclause);
> opclause->opno = negator;
> }

Agreed, done.

> 8. PartitionedChildRelInfo is still mentioned in typedefs.list

Removed.

> 9. I don't quite understand PartitionPruneStepNoop. Why can't you just
> skip adding anything to the list in
> generate_partition_pruning_steps_internal?
> 
> 10. The following test does not make sense:
> 
> explain (costs off) select * from hp where (a = 10 and b = 'yyy') or
> (a = 10 and b = 'xxx') or (a is null and b is null);
>                                                        QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------
>  Append
>    ->  Seq Scan on hp0
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp1
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp2
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp3
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b
> = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> (9 rows)
> 
> Why do 4 partitions match when there are only 3 sets of clauses when
> each one can only match a single partition?

After bringing the support for hash partition pruning even with IS NULL
clauses back, this works as you'd expect.

> 11. What does "root" mean here?
> 
> -- case for list partitioned table that's not root
> explain (costs off) select * from rlp where a = 15 and b <> 'ab' and b
> <> 'cd' and b <> 'xy' and b is not null;

It means a partitioned table that is not the root partitioned table.
Those <> clauses do prune, but not at the root level; they only prune
after recursing into a list partitioned child of rlp.

Attached updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hope the attached version is easier to understand.

Hi Amit,

Thanks for making the updates. I'll look at them soon.

I've been thinking about how we're making these improvements for
SELECT only.  When planning an UPDATE or DELETE of a partitioned
table, since the inheritance planner plans each partition
individually, we gain next to nothing from this patch.

Generally, it seems the aim of this patch is to improve the usability
of partitions in an OLTP type workload; OLAP most likely does not
matter as much, since planner overhead in that case is generally less
of a concern.

I experimented with the attached small patch to see if the situation
could be improved if we first plan the entire query with all
partitions then ignore dummy rels when planning for each individual
partition.

I used something along the lines of:

# create table listp (a int, b int) partition by list(a);
# select 'create table listp'||x||' partition of listp for values
in('||x||');' from generate_series(1, <number of tables>)x;
$ echo explain update listp set b = 1 where a = 1; > bench.sql
$ pgbench -f bench.sql -n -T 30 postgres

where <number of tables> started at 1 and went up in powers of 2 until 1024.

Unpatched = your v35 patch
Patched = your v35 + the attached.

The TPS result from a 30-second pgbench run of the above query showed:

Partitions = 1
Unpatched: 7323.3
Patched: 6573.2 (-10.24%)

Partitions = 2
Unpatched: 6784.8
Patched: 6377.1 (-6.01%)

Partitions = 4
Unpatched: 5903.0
Patched: 6106.8 (3.45%)

Partitions = 8
Unpatched: 4582.0
Patched: 5579.9 (21.78%)

Partitions = 16
Unpatched: 3131.5
Patched: 4521.2 (44.38%)

Partitions = 32
Unpatched: 1779.8
Patched: 3387.8 (90.35%)

Partitions = 64
Unpatched: 821.9
Patched: 2245.4 (173.18%)

Partitions = 128
Unpatched: 322.2
Patched: 1319.6 (309.56%)

Partitions = 256
Unpatched: 84.3
Patched: 731.7 (768.27%)

Partitions = 512
Unpatched: 22.5
Patched: 382.8 (1597.74%)

Partitions = 1024
Unpatched: 5.5
Patched: 150.1 (2607.83%)

Which puts the crossover point at just 4 partitions, and just a small
overhead for 1, 2 and probably 3 partitions. The planner generated a
plan 26 times faster (!) with 1024 partitions.

Likely there's more that could be squeezed out of this if we could get
the grouping_planner() to somehow skip creating paths and performing
the join search. But that patch is not nearly as simple as the
attached.

Probably grouping_planner could also be called with inheritance_update
= false, for this one case too, which might save a small amount of
effort.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/19 16:18, David Rowley wrote:
> On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Hope the attached version is easier to understand.
> 
> Hi Amit,
> 
> Thanks for making the updates. I'll look at them soon.
> 
> I've been thinking about how we're making these improvements for
> SELECT only.  If planning for an UPDATE or DELETE of a partitioned
> table then since the inheritance planner is planning each partition
> individually we gain next to nothing from this patch.

Unfortunately, yes. :-(

Just recently, I replied to a pgsql-bugs report by someone who had OOM
kill a backend running `delete from
partitioned_table_with_7202_partitions` on their test system [1].  That'd
be because running inheritance_planner on a partitioned table doesn't cope
very well beyond a few hundred partitions, as we've also written in our
partitioning/inheritance documentation.

> Generally, it seems the aim of this patch is to improve the usability
> of partitions in an OLTP type workload, most likely OLAP does not
> matter as much since planner overhead, in that case, is generally less
> of a concern.

Yes, that makes sense.

> I experimented with the attached small patch to see if the situation
> could be improved if we first plan the entire query with all
> partitions then ignore dummy rels when planning for each individual
> partition.
> 
> I used something along the lines of:
> 
> # create table listp (a int, b int) partition by list(a);
> # select 'create table listp'||x||' partition of listp for values
> in('||x||');' from generate_series(1, <number of tables>)x;
> $ echo explain update listp set b = 1 where a = 1; > bench.sql
> $ pgbench -f bench.sql -n -T 30 postgres
> 
> where <number of tables> started at 1 and went up in powers of 2 until 1024.
> 
> Unpatched = your v35 patch
> Patched = your v35 + the attached.
> 
> The TPS result from a 30-second pgbench run of the above query showed:
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)
> 
> Partitions = 16
> Unpatched: 3131.5
> Patched: 4521.2 (44.38%)
> 
> Partitions = 32
> Unpatched: 1779.8
> Patched: 3387.8 (90.35%)
> 
> Partitions = 64
> Unpatched: 821.9
> Patched: 2245.4 (173.18%)
> 
> Partitions = 128
> Unpatched: 322.2
> Patched: 1319.6 (309.56%)
> 
> Partitions = 256
> Unpatched: 84.3
> Patched: 731.7 (768.27%)
> 
> Partitions = 512
> Unpatched: 22.5
> Patched: 382.8 (1597.74%)
> 
> Partitions = 1024
> Unpatched: 5.5
> Patched: 150.1 (2607.83%)
>
> Which puts the crossover point at just 4 partitions, and just a small
> overhead for 1, 2 and probably 3 partitions. The planner generated a
> plan 26 times faster (!) with 1024 partitions.

Nice!

>> Likely there's more that could be squeezed out of this if we could get
> the grouping_planner() to somehow skip creating paths and performing
> the join search. But that patch is not nearly as simple as the
> attached.

Yeah, that'd be nice.  Do you think that we cannot fix update/delete on
partitioned tables until we have such a patch though?  IOW, did you intend
the patch you posted to just be a PoC to demonstrate that we can save tons
just by not doing grouping_planner() on pruned partitions?

BTW, maybe you know, but if we want this to prune same partitions as are
pruned during select (due to the new pruning facility), we'd need to teach
get_relation_constraints() to not fetch the partition constraint
(RelationGetPartitionQual) at all.  My patch currently teaches it to avoid
fetching the partition constraint only for select.  If we include the
partition constraint in the list of constraints returned by
get_relation_constraints, we'd still be redundantly executing the
constraint exclusion logic for the selected partitions via the
grouping_planner() call on those partitions.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/fecdef72-8c2a-0794-8e0a-2ad76db82c68@lab.ntt.co.jp



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 19 March 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Just recently, I replied to a pgsql-bugs report by someone who had OOM
> kill a backend running `delete from
> partitioned_table_with_7202_partitions` on their test system [1].  That'd
> be because running inheritance_planner on a partitioned table doesn't cope
> very well beyond a few hundred partitions, as we've also written in our
> partitioning/inheritance documentation.

hmm, yeah that's unfortunate. I'd not done much study of the
inheritance planner before, but I see how that could happen now that I
understand a bit more about it.  nparts^2 RelOptInfos will be created
for such queries. My patch should help with that, provided that some
pruning actually takes place, but it would make the problem very slightly
worse if none can be pruned.

> On 2018/03/19 16:18, David Rowley wrote:
>> Likely there's more than could be squeezed out of this if we could get
>> the grouping_planner() to somehow skip creating paths and performing
>> the join search. But that patch is not nearly as simple as the
>> attached.
>
> Yeah, that'd be nice.  Do you think that we cannot fix update/delete on
> partitioned tables until we have such a patch though?  IOW, did you intend
> the patch you posted to just be a PoC to demonstrate that we can save tons
> just by not doing grouping_planner() on pruned partitions?

The patch was meant as a PoC. I think the performance of the patch is
acceptable without any additional optimisation work. It would be nice,
but any more code that's added would need more reviewer and committer
time, both of which are finite, especially so before the PG11 code
cutoff.

I think it would be a shame to tell people partitioning is usable now for
a decent number of partitions, provided you don't need to perform any
OLTP UPDATE/DELETE operations on the partitions. I think for the few
lines of code that the proposed patch takes it's worth considering for
PG11, but only once your work has gone in. I certainly wouldn't want
this to hold your work back.

> BTW, maybe you know, but if we want this to prune same partitions as are
> pruned during select (due to the new pruning facility), we'd need to teach
> get_relation_constraints() to not fetch the partition constraint
> (RelationGetPartitionQual) at all.  My patch currently teaches it to avoid
> fetching the partition constraint only for select.  If we include the
> partition constraint in the list of constraints returned by
> get_relation_constraints, we'd still be redundantly executing the
> constraint exclusion logic for the selected partitions via the
> grouping_planner() call on those partitions.

I'd not thought of that. It seems more like a performance optimisation
than something that's required for correctness.  Removing that would
probably make constraint_exclusion = 'partition' pretty useless.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/16 21:55, Amit Langote wrote:
> Attached updated patches.

Attached is further revised version.

Of note is getting rid of PartitionPruneContext usage in the static
functions of partprune.c.  Most of the code there ought to only run during
planning, so it can access the necessary information from RelOptInfo
directly instead of copying it to PartitionPruneContext and then passing
it around.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is further revised version.

In the 0004 patch I see:

@@ -1439,6 +1441,10 @@ inheritance_planner(PlannerInfo *root)
  if (IS_DUMMY_PATH(subpath))
  continue;

+ /* Add the current parent's RT index to the partitioned rels set. */
+ partitioned_relids = bms_add_member(partitioned_relids,
+ appinfo->parent_relid);

This seems to execute regardless of if the target relation is a
partitioned table or an inheritance parent. I think there needs to be
a condition so you only do this when planning for partitioned tables.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Tue, Mar 20, 2018 at 7:07 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/03/16 21:55, Amit Langote wrote:
>> Attached updated patches.
>
> Attached is further revised version.
>
> Of note is getting rid of PartitionPruneContext usage in the static
> functions of partprune.c.  Most of the code there ought to only run during
> planning, so it can access the necessary information from RelOptInfo
> directly instead of copying it to PartitionPruneContext and then passing
> it around.

+    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+    {
+        if (rel->baserestrictinfo != NIL)
+        {
+            live_children = prune_append_rel_partitions(root, rel);
+            did_pruning = true;
+        }
+    }

Use &&

+        case COMBINE_OR:
+        {

Won't survive pgindent, which currently produces a *massive* diff for
these patches.

+            /*
+             * XXX- The following ad-hoc method of pruning only works for list
+             * partitioning.  It checks for each partition if all of its
+             * accepted values appear in ne_datums[].
+             */

So why are we doing it this way?  How about doing something not
ad-hoc?  I tried to propose that before.

+ *      Set *value to the constant value obtained by evaluating 'expr'
+ *
+ * Note that we may not be able to evaluate the input expression, in which
+ * case, the function returns false to indicate that *value has not been
+ * set.  True is returned otherwise.

These comments need updating, since this function (laudably) no longer
does any evaluating.  I wonder how this will work for run-time
pruning, though.

+    if (context->partopcintype[partkeyidx] != exprTyp)
+    {
+        Oid     new_supfuncid;
+        int16   procnum;
+
+
+        procnum = (context->strategy == PARTITION_STRATEGY_HASH)
+                        ? HASHEXTENDED_PROC
+                        : BTORDER_PROC;
+        new_supfuncid = get_opfamily_proc(context->partopfamily[partkeyidx],
+                                          context->partopcintype[partkeyidx],
+                                          exprTyp, procnum);
+        fmgr_info(new_supfuncid, &context->partsupfunc[partkeyidx]);
+    }

What's the point of this, exactly?  Leftover dead code, maybe?

+ * Input:
+ *  See the comments above the definition of PartScanKeyInfo to see what
+ *  kind of information is contained in 'keys'.

There's no such thing as PartScanKeyInfo any more and the function has
no argument called 'keys'.  None of the function's actual arguments are
explained.

+    /*
+     * If there are multiple pruning steps, we perform them one after another,
+     * passing the result of one step as input to another.  Based on the type
+     * of pruning step, perform_pruning_step may add or remove partitions from
+     * the set of partitions it receives as the input.
+     */

The comment sounds great, but the code doesn't work that way; it
always calls bms_int_members to intersect the new result with any
previous result.  I'm baffled as to how this manages to DTRT if
COMBINE_OR is used.  In general I had hoped that the list of pruning
steps was something over which we were only going to iterate, not
recurse.  This definitely recurses for the combine steps, but it's
still (sorta) got the idea of a list of iterable steps.  That's a
weird mix.

+            if (nvalues == context->partnatts)
+            {
+                greatest_modulus = get_greatest_modulus(boundinfo);
+                rowHash = compute_hash_value(partnatts, partsupfunc, values,
+                                             isnull);
+                result_index = partindices[rowHash % greatest_modulus];
+                if (result_index >= 0)
+                    return bms_make_singleton(result_index);
+            }
+            else
+                /* Can't do pruning otherwise, so return all partitions. */
+                return bms_add_range(NULL, 0, context->nparts - 1);

Wouldn't we want to (1) arrange things so that this function is never
called if nvalues < context->partnatts && context->strategy ==
PARTITION_STRATEGY_HASH or at least (2) avoid constructing isnull from
nullkeys if we're not going to use it?

Also, shouldn't we be sanity-checking the strategy number here?

I'm out of time for right now but it looks to me like this patch still
needs quite a bit of fine-tuning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/20 21:41, David Rowley wrote:
> On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is further revised version.
> 
> In the 0004 patch I see:
> 
> @@ -1439,6 +1441,10 @@ inheritance_planner(PlannerInfo *root)
>   if (IS_DUMMY_PATH(subpath))
>   continue;
> 
> + /* Add the current parent's RT index to the partitioned rels set. */
> + partitioned_relids = bms_add_member(partitioned_relids,
> + appinfo->parent_relid);
> 
> This seems to execute regardless of if the target relation is a
> partitioned table or an inheritance parent. I think there needs to be
> a condition so you only do this when planning for partitioned tables.

Oops, that's quite wrong.  Will fix, thanks.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is further revised version.

Hi Amit,

Thanks for sending the v38 patch.

I started reviewing this, but I just ended up starting to hack at the
patch instead. It still has quite a bit of work to be done, I think;
unfortunately, the cross-type stuff is still pretty broken.
There's not really any sort of checking done to make sure you've found
a valid cross-type hash or comparison function, which results
in errors like:

create table hashp (a int, b numeric) partition by hash(a,b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 2);
create table hashp4 partition of hashp for values with (modulus 4, remainder 3);
explain select * from hashp where a = 3::smallint and b = 1.0;
ERROR:  cache lookup failed for function 0

I'm not really sure if this should be a matter of doing an if
(!OidIsValid(new_supfuncid)) return false; I think the
context->partsupfunc must be pretty broken in cases like:

create table listp (a bigint) partition by list(a);
create table listp1 partition of listp for values in(1);
select * from listp where a <> 1::smallint and a <> 1::bigint;

The current patch simply remembers the last comparison function
for comparing int8 to int4 and uses that one for the int8-to-int2
comparison too.

Probably we need to cache the comparison function's Oid in with the
Expr in the step and use the correct one each time. I'm unsure of how
the fmgr info should be cached, but looks like it certainly cannot be
cached in the context in an array per partition key. I've so far only
thought some sort of hash table, but I'm sure there must be a much
better way to do this.

I started hacking at partition.c and ended up changing quite a few
things. I changed get_partitions_for_keys into 3 separate functions,
one for hash, list and range and tidied a few things up in that area.
There were a few bugs, for example passing the wrong value for the
size of the array into get_partitions_excluded_by_ne_datums.

I also changed how the Bitmapsets are handled in the step functions
and got rid of the Noop step type completely. I also got rid of the
passing of the srcparts into these functions. I think Robert's idea is
to process the steps in isolation and just combine the partitions
matching each step.

It would be great if we could coordinate our efforts here. I'm posting
this patch now just in case you're working or about to work on this.

In the meantime, I'll continue to drip feed cleanup patches. I'll try
to start writing some comments too, once I figure a few things out...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/23 16:38, David Rowley wrote:
> On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is further revised version.
> 
> Hi Amit,
> 
> Thanks for sending the v38 patch.
> 
> I started reviewing this, but I just ended up starting to hack at the
> patch instead. It's still got quite a bit of work to be done as I
> think, unfortunately,  the cross type stuff is still pretty broken.
> There's not really any sort of checking done to make sure you've found
> a valid cross type hash or compare function in the code which results
> in errors like:
> 
> create table hashp (a int, b numeric) partition by hash(a,b);
> create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
> create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
> create table hashp3 partition of hashp for values with (modulus 4, remainder 2);
> create table hashp4 partition of hashp for values with (modulus 4, remainder 3);
> explain select * from hashp where a = 3::smallint and b = 1.0;
> ERROR:  cache lookup failed for function 0

Hmm yes.  I had realized that while addressing Robert's related comment.

> I'm not really sure if this should be a matter of doing an if
> (!OidIsValid(new_supfuncid)) return false; I think the
> context->partsupfunc must be pretty broken in cases like:
> 
> create table listp (a bigint) partition by list(a);
> create table listp1 partition of listp for values in(1);
> select * from listp where a <> 1::smallint and a <> 1::bigint;
> 
> The current patch simply just remembers the last comparison function
> for comparing int8 to int4 and uses that one for the int8 to int2
> comparison too.
> 
> Probably we need to cache the comparison function's Oid in with the
> Expr in the step and use the correct one each time. I'm unsure of how
> the fmgr info should be cached, but looks like it certainly cannot be
> cached in the context in an array per partition key. I've so far only
> thought some sort of hash table, but I'm sure there must be a much
> better way to do this.

Yeah, I realized that simply replacing the context->partsupfunc member is
not a solution.

In the updated patch (that is, after incorporating your changes), I have
moved this partsupfunc switching to the caller of partkey_datum_from_expr
instead of doing it there.  New patch also checks that returned function
OID is valid, which if not we don't use the expression's value for pruning.

So now, we statically allocate a partsupfunc array on every invocation of
perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums().
Considering run-time pruning, we may have to find some other place to
cache that.

> I started hacking at partition.c and ended up changing quite a few
> things. I changed get_partitions_for_keys into 3 separate functions,
> one for hash, list and range and tidied a few things up in that area.
> There were a few bugs, for example passing the wrong value for the
> size of the array into get_partitions_excluded_by_ne_datums.
> 
> I also changed how the Bitmapsets are handled in the step functions
> and got rid of the Noop step type completely. I also got rid of the
> passing of the srcparts into these functions. I think Robert's idea is
> to process the steps in isolation and just combine the partitions
> matching each step.
>
> It would be great if we could coordinate our efforts here. I'm posting
> this patch now just in case you're working or about to work on this.

Thanks a lot for making all those changes and sharing the patch.  I've
incorporated in the attached latest version.

> In the meantime, I'll continue to drip feed cleanup patches. I'll try
> to start writing some comments too, once I figure a few things out...

Here is the updated version.

I'm still thinking about what to do about avoiding recursion when
performing combine steps [1] as Robert mentioned in his email.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the review.

On 2018/03/21 6:29, Robert Haas wrote:
> On Tue, Mar 20, 2018 at 7:07 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/03/16 21:55, Amit Langote wrote:
>>> Attached updated patches.
>>
>> Attached is further revised version.
>>
>> Of note is getting rid of PartitionPruneContext usage in the static
>> functions of partprune.c.  Most of the code there ought to only run during
>> planning, so it can access the necessary information from RelOptInfo
>> directly instead of copying it to PartitionPruneContext and then passing
>> it around.
> 
> +    if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> +    {
> +        if (rel->baserestrictinfo != NIL)
> +        {
> +            live_children = prune_append_rel_partitions(root, rel);
> +            did_pruning = true;
> +        }
> +    }
> 
> Use &&

Fixed in the latest version.

> 
> +        case COMBINE_OR:
> +        {
> 
> Won't survive pgindent, which currently produces a *massive* diff for
> these patches.

That's gone in the latest patch.

Things should have improved overall in the latest version.

> +            /*
> +             * XXX- The following ad-hoc method of pruning only works for list
> +             * partitioning.  It checks for each partition if all of its
> +             * accepted values appear in ne_datums[].
> +             */
> 
> So why are we doing it this way?  How about doing something not
> ad-hoc?  I tried to propose that before.

Hmm, perhaps I should have written a better comment.

What I really meant to say is that pruning using <> operators can be
implemented sanely only for list partitions.  We can prune a given
partition with <> clauses only if we can find a <> clause for *all*
values that the partition accepts.  Doing so is feasible only for list
partitioning, where defining a partition requires enumerating all values
that it may contain.

> + *      Set *value to the constant value obtained by evaluating 'expr'
> + *
> + * Note that we may not be able to evaluate the input expression, in which
> + * case, the function returns false to indicate that *value has not been
> + * set.  True is returned otherwise.
> 
> These comments need updating, since this function (laudably) no longer
> does any evaluating.  I wonder how this will work for run-time
> pruning, though.

Fixed the comment.

The run-time pruning patch adds some code to this function to be able to
return values for Params, for which, AFAIK, it also adds some state
information to the PartitionPruneContext argument.

> +    if (context->partopcintype[partkeyidx] != exprTyp)
> +    {
> +        Oid     new_supfuncid;
> +        int16   procnum;
> +
> +
> +        procnum = (context->strategy == PARTITION_STRATEGY_HASH)
> +                        ? HASHEXTENDED_PROC
> +                        : BTORDER_PROC;
> +        new_supfuncid = get_opfamily_proc(context->partopfamily[partkeyidx],
> +                                          context->partopcintype[partkeyidx],
> +                                          exprTyp, procnum);
> +        fmgr_info(new_supfuncid, &context->partsupfunc[partkeyidx]);
> +    }
> 
> What's the point of this, exactly?  Leftover dead code, maybe?

Actually, this *was* an effort to teach the patch to use the correct
comparison function for comparison against partition bounds in case of
clause value being of different type.

After reading David's comment about this, I concluded that it's placed in
the wrong place, which is fixed in the latest patch.  The comparison
functions are changed (if needed) in the function that calls
partkey_datum_from_expr, not in partkey_datum_from_expr itself.

> + * Input:
> + *  See the comments above the definition of PartScanKeyInfo to see what
> + *  kind of information is contained in 'keys'.
> 
> There's no such thing as PartScanKeyInfo any more and the function has
> no argument called 'keys'.  None of the function's actual arguments are
> explained.

Sorry, this should be gone in the latest patches.

> +    /*
> +     * If there are multiple pruning steps, we perform them one after another,
> +     * passing the result of one step as input to another.  Based on the type
> +     * of pruning step, perform_pruning_step may add or remove partitions from
> +     * the set of partitions it receives as the input.
> +     */
> 
> The comment sounds great, but the code doesn't work that way; it
> always calls bms_int_members to intersect the new result with any
> previous result.  I'm baffled as to how this manages to DTRT if
> COMBINE_OR is used.  In general I had hoped that the list of pruning
> steps was something over which we were only going to iterate, not
> recurse.  This definitely recurses for the combine steps, but it's
> still (sorta) got the idea of a list of iterable steps.  That's a
> weird mix.

At the top level (in get_matching_partitions), it is assumed that the
steps in the input list come from implicitly AND'd clauses, so we take the
intersection of the partition sets that we get for each of them.

Anyway, after David's rewrite of this portion of the patch incorporated in
the latest patch, things look a bit different here, although there is
still recursion for combine steps.  I'm still considering how to make the
recursion go away.

> +            if (nvalues == context->partnatts)
> +            {
> +                greatest_modulus = get_greatest_modulus(boundinfo);
> +                rowHash = compute_hash_value(partnatts, partsupfunc, values,
> +                                             isnull);
> +                result_index = partindices[rowHash % greatest_modulus];
> +                if (result_index >= 0)
> +                    return bms_make_singleton(result_index);
> +            }
> +            else
> +                /* Can't do pruning otherwise, so return all partitions. */
> +                return bms_add_range(NULL, 0, context->nparts - 1);
> 
> Wouldn't we want to (1) arrange things so that this function is never
> called if nvalues < context->partnatts && context->strategy ==
> PARTITION_STRATEGY_HASH or at least (2) avoid constructing isnull from
> nullkeys if we're not going to use it?

We call this function even if nvalues < context->partnatts, as long as we
found IS NULL clauses for *all* the remaining columns.

Although, given the checks that the planner (partprune.c) performs, we
should get here only if pruning is possible, so the code to handle cases
where pruning couldn't occur is redundant.

> Also, shouldn't we be sanity-checking the strategy number here?

That's right, fixed.

> 
> I'm out of time for right now but it looks to me like this patch still
> needs quite a bit of fine-tuning.

I have posted an updated patch in reply to David's review.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 24 March 2018 at 01:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> In the updated patch (that is, after incorporating your changes), I have
> moved this partsupfunc switching to the caller of partkey_datum_from_expr
> instead of doing it there.  New patch also checks that returned function
> OID is valid, which if not we don't use the expression's value for pruning.

Thanks for accepting those changes.

> So now. we statically allocate a partsupfunc array on every invocation of
> perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums().
> Considering run-time pruning, we may have to find some other place to
> cache that.

hmm yeah, it's not perfect, but I don't have any better ideas for now,
apart from that this could probably be done when creating the steps rather
than executing them. That would save having to look up the correct
function Oid during execution, and save bothering to create step
values that we simply can't compare to the partition key.

I've done this in the attached patch against v39.

I also renamed argvalues to argexprs, since they're not values.  The
PartClauseInfo could probably do with the same change too, but I
didn't touch it.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 24 March 2018 at 16:42, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've done this in the attached patch against v39.
>
> I also renamed argvalues to argexprs, since they're not values.  The
> PartClauseInfo could probably do with the same change too, but I
> didn't touch it.

The attached goes a little further and does a bit more renaming. I
don't think "values" is a good name for a list of Exprs. I'd expect
that might be a better-suited name for an array of Datums.

I've also added and modified a few comments.  More comments are still
required. The Step structs are mostly undocumented still, but I'm
still trying to understand how all this fits together, at least well
enough to write about it.

The attached delta applies on top of v39 plus delta1.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
> The attached delta applies on top of v39 plus delta1.

Sorry, the attached should do this. Ignore the last attachment.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/25 14:32, David Rowley wrote:
> On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> The attached delta applies on top of v39 plus delta1.
> 
> Sorry, the attached should do this. Ignore the last attachment.

I have incorporated both of your delta1 and delta2_1 patches.

Your proposed change to determine the cross-type comparison function OID
during planning is a good one, although I wasn't sure why it was done
only for the <> operators.  I have also implemented it for
PartitionPruneStepOp steps.

Also, I started thinking that implementing pruning using <> operators with
a PartitionPruneCombineOp was not such a great idea.  That needed us to
add argexprs and argcmpfns to that struct, which seemed a bit odd.  I
defined a new pruning node type called PartitionPruneStepOpNe, which still
seems a bit odd, but given that our support for pruning using <> is quite
specialized, that may be fine.

I added a bunch of hopefully informative comments in partprune.c and for
the struct definitions of pruning step nodes.

Please find attached a new version.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/23 21:19, Amit Langote wrote:
> On 2018/03/21 6:29, Robert Haas wrote:
>> +    /*
>> +     * If there are multiple pruning steps, we perform them one after another,
>> +     * passing the result of one step as input to another.  Based on the type
>> +     * of pruning step, perform_pruning_step may add or remove partitions from
>> +     * the set of partitions it receives as the input.
>> +     */
>>
>> The comment sounds great, but the code doesn't work that way; it
>> always calls bms_int_members to intersect the new result with any
>> previous result.  I'm baffled as to how this manages to DTRT if
>> COMBINE_OR is used.  In general I had hoped that the list of pruning
>> steps was something over which we were only going to iterate, not
>> recurse.  This definitely recurses for the combine steps, but it's
>> still (sorta) got the idea of a list of iterable steps.  That's a
>> weird mix.
> 
> At the top-level (in get_matching_partitions), it is assumed that the
> steps in the input list come from implicitly AND'd clauses, so the
> intersection between partition sets that we get for each.
> 
> Anyway, after David's rewrite of this portion of the patch incorporated in
> the latest patch, things look a bit different here, although there is
> still recursion for combine steps.  I'm still considering how to make the
> recursion go away.

I have managed to make the recursion go away in the attached updated
version.  I guess that's the result of employing the idea of an "output
register" for individual pruning steps, as mentioned in Robert's email
upthread where he detailed the "pruning steps" approach [1].

With the new patch, pruning steps for arguments of, say, an OR clause are
not performed recursively.  Instead, each pruning step is performed
independently and its output is stored in a slot dedicated to it.  Combine
steps are always executed after all of the steps corresponding to their
arguments have been executed.  That's ensured by the way steps are allocated.


Jesper, off-list, reported an unused variable which has been removed in
the updated patch.  Thanks Jesper!  He also pointed out a case with a
list-partitioned table where pruning doesn't produce the result that one
would expect, and that constraint exclusion would produce.

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;
explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_ad
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(7 rows)

One would expect that lp_ad is not scanned.

With the implementation where a > 'a' and a < 'd' are used to prune
separately, this cannot be avoided given the way
get_partitions_for_keys_list() added by the patch works.  What happens is
that we prune first with a step generated for a > 'a', which returns
partitions for all datums in the table's boundinfo greater than 'a' that
have a partition assigned, which means we'll include the partition that
accepts 'd'.  Then, when pruning with a < 'd', we select partitions for
all datums less than 'd' that have a partition assigned, which means we
end up including the partition that accepts 'a'.  We intersect the results
of running these independent steps, but lp_ad is present in both result
sets, so it ends up in the final result.  Maybe there is a way to fix
that, but I haven't done anything about it yet.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 27 March 2018 at 00:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Also, I started thinking that implementing pruning using <> operators with
> a PartitionPruneCombineOp was not such a great idea.  That needed us to
> add argexprs and argcmpfns to that struct, which seemed a bit odd.  I
> defined a new pruning node type called PartitionPruneStepOpNe, which still
> seems a bit odd, but given that our support for pruning using <> is quite
> specialized, that may be fine.

Seems better

> I added a bunch of hopefully informative comments in partprune.c and for
> the struct definitions of pruning step nodes.

Yes. That looks better.

> Please find attached find a new version.

Thanks. I've made a pass over this and I only have the attached set of
fixes and the following to show for it.

1. Please add more comments in the switch statement in
get_partitions_for_keys_range

2. More an observation than anything else. I see we've lost the
ability to prune range queries on LIST partitions in some cases.

For example:

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);

EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;

This is just down to the new pruning step design: we first prune
on "a > 1", which matches listp1_3 due to 3, then binary-AND that with
the result of "a < 3", which matches listp1_3 due to 1. This is a
shame, but probably not the end of the world. Fixing it would likely
mean moving back towards the previous design.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Amit,

On 03/27/2018 06:42 AM, Amit Langote wrote:
> I have managed to make the recursion go away in the attached updated
> version.  I guess that's the result of employing the idea of a "output
> register" for individual pruning steps as mentioned in Robert's email
> upthread where he detailed the "pruning steps" approach [1].
> 
> With the new patch, pruning steps for arguments of, say, an OR clause are
> not performed recursively.  Instead, each pruning step is performed
> independently and its output is stored in a slot dedicated to it.  Combine
> steps are always executed after all of the steps corresponding to its
> arguments have been executed.  That's ensured by the way steps are allocated.
> 

Running v41 with "partition_prune" under valgrind gives the attached report.

Best regards,
  Jesper

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> [Jesper] also pointed out a case with a
> list-partitioned table where pruning doesn't produce the result one
> would expect, and that constraint exclusion would produce.
> 
> create table lp (a char) partition by list (a);
> create table lp_ad partition of lp for values in ('a', 'd');
> create table lp_bc partition of lp for values in ('b', 'c');
> create table lp_default partition of lp default;
> explain (costs off) select * from lp where a > 'a' and a < 'd';
>                         QUERY PLAN
> -----------------------------------------------------------
>  Append
>    ->  Seq Scan on lp_ad
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>    ->  Seq Scan on lp_bc
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>    ->  Seq Scan on lp_default
>          Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
> (7 rows)
> 
> One would expect that lp_ad is not scanned.

One would?  I, for one, wouldn't particularly sweat over this case TBH.
It seems a pretty silly case.  If this works for "a <> 'a' and a <> 'd'"
(I mean, lp_ad is pruned for that qual), that sounds sufficient to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Amit Langote wrote:
>> [Jesper] also pointed out a case with a
>> list-partitioned table where pruning doesn't produce the result one
>> would expect, and that constraint exclusion would produce.
>> 
>> create table lp (a char) partition by list (a);
>> create table lp_ad partition of lp for values in ('a', 'd');
>> create table lp_bc partition of lp for values in ('b', 'c');
>> create table lp_default partition of lp default;
>> explain (costs off) select * from lp where a > 'a' and a < 'd';
>> QUERY PLAN
>> -----------------------------------------------------------
>> Append
>> ->  Seq Scan on lp_ad
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> ->  Seq Scan on lp_bc
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> ->  Seq Scan on lp_default
>> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
>> (7 rows)
>> 
>> One would expect that lp_ad is not scanned.

> One would?  I, for one, wouldn't particularly sweat over this case TBH.

That example works in HEAD, so if somebody is proposing a patch that
breaks it, seems like that needs investigation.

            regards, tom lane


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi,

On 03/27/2018 01:46 PM, Jesper Pedersen wrote:
> Running v41 with "partition_prune" under valgrind gives the attached 
> report.
> 

The reports mostly involve interaction with catcache.c and dynahash.c, 
so something for a separate thread.

Best regards,
  Jesper



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I have managed to make the recursion go away in the attached updated
> version.  I guess that's the result of employing the idea of an "output
> register" for individual pruning steps as mentioned in Robert's email
> upthread where he detailed the "pruning steps" approach [1].

Thanks for making that work. I've only glanced at the patch, and not
taken enough time to understand how the new parts work yet.

In the meantime, I've attached some fixes for v41 which I previously
submitted for v40.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/28 12:58, David Rowley wrote:
> On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I have managed to make the recursion go away in the attached updated
>> version.  I guess that's the result of employing the idea of an "output
>> register" for individual pruning steps as mentioned in Robert's email
>> upthread where he detailed the "pruning steps" approach [1].
> 
> Thanks for making that work. I've only glanced at the patch, and not
> taken enough time to understand how the new parts work yet.
> 
> In the meantime, I've attached some fixes for v41 which I previously
> submitted for v40.

Thank you.  I've merged it.

Also, I have redesigned how we derive partition indexes after running
pruning steps.  Previously, for each step we'd determine the indexes of
the "partitions" that are not pruned, which sometimes led to a list
partition not being pruned, as shown in the two recent examples.  Instead,
in the new approach, we only keep track of the indexes of the "datums"
that satisfy individual pruning steps (both base pruning steps and combine
steps) and only figure out the partition indexes after we've determined
the set of datums that survive all pruning steps, that is, after we're
done executing all pruning steps.  Whether we need to scan special
partitions, like the null-only and default partitions, is tracked along
with the datum indexes for each step.  With this change, pruning works as
expected in both examples:

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;
explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(5 rows)

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN (1, 3);
EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

Moreover, with pruning now working at a high level with datum indexes
instead of partition indexes, pruning for PartitionPruneStepOpNe is
greatly simplified.  We start with a bitmapset containing the indexes of
all datums in boundinfo and simply delete the indexes of those that
appear in the query.  So:

explain (costs off) select * from lp where a <> 'a' and a <> 'd';
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
(5 rows)

where we delete indexes of 'a' and 'd' from the bitmapset initially
containing indexes of all datums, leaving us with only those of 'b' and
'c'.  Also, the default partition is scanned as it would always be for a
PartitionPruneStepOpNe step.

Attached is the updated set of patches, which contains other miscellaneous
changes such as updated comments, beside the main changes described above.

Regards,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/28 18:29, Amit Langote wrote:
> Attached is the updated set of patches, which contains other miscellaneous
> changes such as updated comments, beside the main changes described above.

Sorry, one of those miscellaneous changes was a typo that would cause
compilation to fail... Sigh.   Fixed in the updated version.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 28 March 2018 at 22:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Also, I have redesigned how we derive partition indexes after running
> pruning steps.  Previously, for each step we'd determine the indexes of
> the "partitions" that are not pruned, which sometimes led to a list
> partition not being pruned, as shown in the two recent examples.  Instead,
> in the new approach, we only keep track of the indexes of the "datums"
> that satisfy individual pruning steps (both base pruning steps and combine
> steps) and only figure out the partition indexes after we've determined
> the set of datums that survive all pruning steps, that is, after we're
> done executing all pruning steps.  Whether we need to scan special
> partitions, like the null-only and default partitions, is tracked along
> with the datum indexes for each step.  With this change, pruning works as
> expected in both examples:

Smart thinking! Good to see that solved.

I'll try to look at v43 during my working day tomorrow, in around 9 hours time.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi,

On 03/28/2018 06:30 AM, Amit Langote wrote:
> On 2018/03/28 18:29, Amit Langote wrote:
>> Attached is the updated set of patches, which contains other miscellaneous
>> changes such as updated comments, beside the main changes described above.
> 
> Sorry, one of those miscellaneous changes was a typo that would cause
> compilation to fail... Sigh.   Fixed in the updated version.
>

Just some trivial changes.

However,

explain (costs off) select * from mc2p where a = 2 and b < 1;

is picking up

    ->  Seq Scan on mc2p2
          Filter: ((b < 1) AND (a = 2))

which doesn't seem right, as its definition is

create table mc2p2 partition of mc2p for values from (1, 1) to (2, minvalue);

Best regards,
  Jesper

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/29 1:41, Jesper Pedersen wrote:
> Just some trivial changes.

Thanks Jesper.  Merged.

> However,
> 
> explain (costs off) select * from mc2p where a = 2 and b < 1;
> 
> is picking up
> 
>    ->  Seq Scan on mc2p2
>          Filter: ((b < 1) AND (a = 2))
> 
> which doesn't seem right, as its definition is
> 
> create table mc2p2 partition of mc2p for values from (1, 1) to (2, minvalue);

Yeah, that wasn't right.  It boiled down to how some code in the range
partition pruning function considered a tuple containing a = 2 to fall in
this partition, which is wrong because the minvalue in its upper bound
makes the partition exclusive of any tuples with a = 2.  Fixed that.

Besides fixing that, I have decided to get rid of
PartitionPruneStepOpNe (a special kind of base pruning step that was
being used to prune list partitions using a set of <> operator clauses)
and related functions.  Instead, pruning for <> operator clauses is now
implemented using a combination of PartitionPruneStepOp and
PartitionPruneStepCombine, after adding a new combine op COMBINE_INVERT (I
also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and
COMBINE_INTERSECT, respectively).  I decided to do so because the previous
arrangement looked like a "hack" to support a special case and touched
quite a few places.

Please find attached the updated version of the patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 29 March 2018 at 21:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Besides fixing that, I have decided to get rid of
> PartitionPruneStepOpNe (a special kind of base pruning step that was
> being used to prune list partitions using a set of <> operator clauses)
> and related functions.  Instead, pruning for <> operator clauses is now
> implemented using a combination of PartitionPruneStepOp and
> PartitionPruneStepCombine, after adding a new combine op COMBINE_INVERT (I
> also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and
> COMBINE_INTERSECT, respectively).  I decided to do so because the previous
> arrangement looked like a "hack" to support a special case and touched
> quite a few places.

Hi Amit,

I've looked at the v44 patch. Thanks for making those changes.

The new not-equal handling code is not quite right.

DROP TABLE listp;
CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
CREATE TABLE listp_default PARTITION OF listp DEFAULT;

EXPLAIN SELECT * FROM listp WHERE a <> 1;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..54.56 rows=2537 width=4)
   ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
         Filter: (a <> 1)
(3 rows)

The default should be included here.

INSERT INTO listp VALUES(1),(2),(3);
SELECT * FROM listp WHERE a <> 1;
 a
---
 3
(1 row)

This code assumes it's fine to just reverse the setting for default:

result->scan_default = !source->scan_default;

More complex handling is needed here.

I've attached a diff for a small set of other things I noticed while reviewing.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Tomas Vondra
Date:
Hi,

I think there's a bug in generate_pruning_steps_from_opexprs, which does
this for PARTITION_STRATEGY_HASH:


    for_each_cell(lc1, lc)
    {
        pc = lfirst(lc1);

        /*
         * Note that we pass nullkeys for step_nullkeys,
         * because we need to tell hash partition bound search
         * function which of the keys are NULL.
         */
        Assert(pc->op_strategy == HTEqualStrategyNumber);
        pc_steps =
                get_steps_using_prefix(context,
                                       HTEqualStrategyNumber,
                                       pc->expr,
                                       pc->cmpfn,
                                       pc->keyno,
                                       nullkeys,
                                       prefix);
    }

    opsteps = list_concat(opsteps, list_copy(pc_steps));


Notice that the list_concat() is outside the for_each_cell loop. Doesn't
that mean we fail to consider some of the clauses (all except the very
last clause) for pruning? I haven't managed to come up with an example,
but I haven't tried very hard.

FWIW I've noticed this because gcc complains that pc_steps might be used
uninitialized:

partprune.c: In function ‘generate_partition_pruning_steps_internal’:
partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
      opsteps = list_concat(opsteps, list_copy(pc_steps));
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
partprune.c:936:14: note: ‘pc_steps’ was declared here
      List   *pc_steps;
              ^~~~~~~~
All of PostgreSQL successfully made. Ready to install.


So even if it's not a bug, we probably need to fix the code somehow.
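
If every clause's steps are meant to be collected, the obvious fix is to move the list_concat() inside the loop so that each iteration's pc_steps is consumed before it is overwritten, roughly like this (a sketch against the snippet above; it uses PostgreSQL's List API and is not standalone-compilable):

```c
    for_each_cell(lc1, lc)
    {
        pc = lfirst(lc1);

        Assert(pc->op_strategy == HTEqualStrategyNumber);
        pc_steps = get_steps_using_prefix(context,
                                          HTEqualStrategyNumber,
                                          pc->expr,
                                          pc->cmpfn,
                                          pc->keyno,
                                          nullkeys,
                                          prefix);

        /* Accumulate this clause's steps before pc_steps is clobbered. */
        opsteps = list_concat(opsteps, list_copy(pc_steps));
    }
```

This also silences the -Wmaybe-uninitialized warning, since pc_steps is then assigned on every path before it is read.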


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks Tomas for the review.

On 2018/03/30 1:55, Tomas Vondra wrote:
> Hi,
> 
> I think there's a bug in generate_pruning_steps_from_opexprs, which does
> this for PARTITION_STRATEGY_HASH:
> 
> 
>     for_each_cell(lc1, lc)
>     {
>         pc = lfirst(lc1);
> 
>         /*
>          * Note that we pass nullkeys for step_nullkeys,
>          * because we need to tell hash partition bound search
>          * function which of the keys are NULL.
>          */
>         Assert(pc->op_strategy == HTEqualStrategyNumber);
>         pc_steps =
>                 get_steps_using_prefix(context,
>                                        HTEqualStrategyNumber,
>                                        pc->expr,
>                                        pc->cmpfn,
>                                        pc->keyno,
>                                        nullkeys,
>                                        prefix);
>     }
> 
>     opsteps = list_concat(opsteps, list_copy(pc_steps));
> 
> 
> Notice that the list_concat() is outside the for_each_cell loop. Doesn't
> that mean we fail to consider some of the clauses (all except the very
> last clause) for pruning? I haven't managed to come up with an example,
> but I haven't tried very hard.
> 
> FWIW I've noticed this because gcc complains that pc_steps might be used
> uninitialized:
> 
> partprune.c: In function ‘generate_partition_pruning_steps_internal’:
> partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in
> this function [-Wmaybe-uninitialized]
>       opsteps = list_concat(opsteps, list_copy(pc_steps));
>                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> partprune.c:936:14: note: ‘pc_steps’ was declared here
>       List   *pc_steps;
>               ^~~~~~~~
> All of PostgreSQL successfully made. Ready to install.
> 
> 
> So even if it's not a bug, we probably need to fix the code somehow.

Yeah, the code needs to be fixed.  Although, it seems to me that in the
hash partitioning case, the loop would iterate at most once, at least if
the query didn't contain any Params.  That's because, at that point, there
cannot be multiple mutually AND'd equality clauses referring to the same
key.  For example, if there were in the original query and they contained
different values, we wouldn't get this far anyway as they would be reduced
to constant-false at an earlier planning stage.  If they all contained the
same value (e.g. key = 1 and key = 1::smallint and key = 1::int and key =
1::bigint), then only one of them will be left in rel->baserestrictinfo
anyway.  But we still need to have the loop because all of what I said
wouldn't happen if the clauses contained Params.  In that case, the result
would be determined at execution time.

I have fixed the code as you suggested and will post the fixed version
shortly after fixing the issues David reported.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/29 20:08, David Rowley wrote:
> I've looked at the v44 patch.

Thank you.

> Thanks for making those changes.
> 
> The new not-equal handling code is not quite right.
> 
> DROP TABLE listp;
> CREATE TABLE listp (a INT) PARTITION BY LIST(a);
> CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
> CREATE TABLE listp_default PARTITION OF listp DEFAULT;
> 
> EXPLAIN SELECT * FROM listp WHERE a <> 1;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Append  (cost=0.00..54.56 rows=2537 width=4)
>    ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
>          Filter: (a <> 1)
> (3 rows)
> 
> The default should be included here.
> 
> INSERT INTO listp VALUES(1),(2),(3);
> SELECT * FROM listp WHERE a <> 1;
>  a
> ---
>  3
> (1 row)

Good catch!  Indeed, the default partition should not have been pruned
away in this case.

> This code assumes it's fine to just reverse the setting for default:
> 
> result->scan_default = !source->scan_default;
> 
> More complex handling is needed here.

Hmm, I thought about this and came to the conclusion that we should
*always* scan the default partition in this case.  An inversion step
removes all the datums selected by its source step from the set of *all*
datums that the currently defined set of partitions allows.  If there's a
default partition in the mix, that set effectively covers every datum of
the partition key's data type.  So, irrespective of whether or not the
source step selected the default partition, there will be datums in the
set after inversion that belong in the default partition, if not in some
non-default partition that would have been selected.  I have written a
comment there trying to explain this, but I may not have been able to
articulate it properly.  Please check.  Or does this sound just wrong?

> I've attached a diff for a small set of other things I noticed while
> reviewing.

Thanks, merged.

Please find attached the updated patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached the updated patches.

Thanks.

I've noticed that there are no outfuncs or readfuncs for all the new
Step types you've added.

Also, the copy func does not properly copy the step_id in the base
node type. This will remain at 0 after a copyObject()


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
> I've noticed that there are no outfuncs or readfuncs for all the new
> Step types you've added.
>
> Also, the copy func does not properly copy the step_id in the base
> node type. This will remain at 0 after a copyObject()

I've attached a quickly put together fix for this. I'm not quite sure
what the done thing is to copy/read/write nodes which inherit fields
from other nodes, so what I've done in the attached might not be
correct. However, I see _copyValue() does what I've done, so perhaps
it's fine, or that may just be more of a special case.

I also manually removed some hunks from the diff, so hopefully it
still works correctly.

Attaching it as it may save you some time from doing it yourself.
Please check it though.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> Also, the copy func does not properly copy the step_id in the base
> node type. This will remain at 0 after a copyObject()

As I recall, you can find these mistakes by compiling with
-DCOPY_PARSE_PLAN_TREES.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've noticed that there are no outfuncs or readfuncs for all the new
>> Step types you've added.
>>
>> Also, the copy func does not properly copy the step_id in the base
>> node type. This will remain at 0 after a copyObject()
>
> Attaching it as it may save you some time from doing it yourself.
> Please check it though.

The attached might be slightly easier to apply. The previous version
was based on top of some other changes I'd been making.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached the updated patches.

There's a bit of a strange case with v45 around prepared statements.
I've not debugged this yet, but in case you get there first, here's
the case:

create table listp (a int, b int) partition by list (a);
create table listp_1 partition of listp for values in(1) partition by list (b);
create table listp_1_1 partition of listp_1 for values in(1);
create table listp_2 partition of listp for values in(2) partition by list (b);
create table listp_2_1 partition of listp_2 for values in(2);

explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this
one looks fine.
                                 QUERY PLAN
----------------------------------------------------------------------------
 Append  (cost=0.00..49.66 rows=22 width=8)
   ->  Seq Scan on listp_1_1  (cost=0.00..49.55 rows=22 width=8)
         Filter: ((b = ANY ('{1,2}'::integer[])) AND (2 <> b) AND (0 <> b))
(3 rows)

prepare q1 (int,int,int,int) as select * from listp where b in($1,$2)
and $3 <> b and $4 <> b;
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
           QUERY PLAN
--------------------------------
 Result (actual rows=0 loops=1)
   One-Time Filter: false
(2 rows)

My best guess is that something ate the bits out of a Bitmapset of the
matching partitions somewhere.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/03/30 22:41, David Rowley wrote:
> On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> I've noticed that there are no outfuncs or readfuncs for all the new
>>> Step types you've added.
>>>
>>> Also, the copy func does not properly copy the step_id in the base
>>> node type. This will remain at 0 after a copyObject()
>>
>> Attaching it as it may save you some time from doing it yourself.
>> Please check it though.
> 
> The attached might be slightly easier to apply. The previous version
> was based on top of some other changes I'd been making.

Thanks David.  I have merged this.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/03/31 0:55, David Rowley wrote:
> On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached the updated patches.
> 
> There's a bit of a strange case with v45 around prepared statements.
> I've not debugged this yet, but in case you get there first, here's
> the case:
> 
> create table listp (a int, b int) partition by list (a);
> create table listp_1 partition of listp for values in(1) partition by list (b);
> create table listp_1_1 partition of listp_1 for values in(1);
> create table listp_2 partition of listp for values in(2) partition by list (b);
> create table listp_2_1 partition of listp_2 for values in(2);
> 
> explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this
> one looks fine.
>                                  QUERY PLAN
> ----------------------------------------------------------------------------
>  Append  (cost=0.00..49.66 rows=22 width=8)
>    ->  Seq Scan on listp_1_1  (cost=0.00..49.55 rows=22 width=8)
>          Filter: ((b = ANY ('{1,2}'::integer[])) AND (2 <> b) AND (0 <> b))
> (3 rows)
> 
> prepare q1 (int,int,int,int) as select * from listp where b in($1,$2)
> and $3 <> b and $4 <> b;
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> execute q1 (1,2,3,4);
> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>            QUERY PLAN
> --------------------------------
>  Result (actual rows=0 loops=1)
>    One-Time Filter: false
> (2 rows)
> 
> My best guess is that something ate the bits out of a Bitmapset of the
> matching partitions somewhere.

Hmm.  It is the newly added inversion step that's causing this.  When
creating a generic plan (that is when the planning happens via
BuildCachedPlan called with boundParams set to NULL), the presence of
Params will cause an inversion step's source step to produce
scan-all-partitions sort of result, which the inversion step dutifully
inverts to a scan-no-partitions result.

I have tried to attack that problem by handling the
no-values-to-prune-with case using a side-channel to propagate the
scan-all-partitions result through possibly multiple steps.  That is, a
base pruning step will set datum_offsets in a PruneStepResult only if
pruning is carried out by actually comparing values with the partition
bounds.  If no values were provided (like in the generic plan case), it
will set a scan_all_nonnull flag instead and return without setting
datum_offsets.  Combine steps perform their combining duty only if
datum_offsets contains a valid value, that is, only if scan_all_nonnull is
not set.

Attached updated version of the patches.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 2 April 2018 at 17:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/03/31 0:55, David Rowley wrote:
>> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>>            QUERY PLAN
>> --------------------------------
>>  Result (actual rows=0 loops=1)
>>    One-Time Filter: false
>> (2 rows)
>
> Hmm.  It is the newly added inversion step that's causing this.  When
> creating a generic plan (that is when the planning happens via
> BuildCachedPlan called with boundParams set to NULL), the presence of
> Params will cause an inversion step's source step to produce
> scan-all-partitions sort of result, which the inversion step dutifully
> inverts to a scan-no-partitions result.
>
> I have tried to attack that problem by handling the
> no-values-to-prune-with case using a side-channel to propagate the
> scan-all-partitions result through possibly multiple steps.  That is, a
> base pruning step will set datum_offsets in a PruneStepResult only if
> pruning is carried out by actually comparing values with the partition
> bounds.  If no values were provided (like in the generic plan case), it
> will set a scan_all_nonnull flag instead and return without setting
> datum_offsets.  Combine steps perform their combining duty only if
> datum_offsets contains a valid value, that is, only if scan_all_nonnull is not set.

I'm afraid this is still not correct :-(

The following code is not doing the right thing:

+ case COMBINE_UNION:
+ foreach(lc1, cstep->source_stepids)
+ {
+ int step_id = lfirst_int(lc1);
+ PruneStepResult *step_result;
+
+ /*
+ * step_results[step_id] must contain a valid result,
+ * which is confirmed by the fact that cstep's step_id is
+ * greater than step_id and the fact that results of the
+ * individual steps are evaluated in sequence of their
+ * step_ids.
+ */
+ if (step_id >= cstep->step.step_id)
+ elog(ERROR, "invalid pruning combine step argument");
+ step_result = step_results[step_id];
+ Assert(step_result != NULL);
+
+ result->scan_all_nonnull = step_result->scan_all_nonnull;

The last line there is not properly performing a union; it just sets
result->scan_all_nonnull to whatever the last step's value was.

At the very least it should be |=, but I don't really like this new code.

Why did you move away from just storing the matching partitions in a
Bitmapset? If you want to store all non-null partitions, then why not
just set the bits for all non-null partitions? That would cut down on
bugs like this since the combining of step results would just be
simple unions or intersects.

Also, the following code could be made a bit nicer

+ result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
+
+ switch (context->strategy)
+ {
+ case PARTITION_STRATEGY_HASH:
+ result->bound_offsets = get_matching_hash_bound(context,
+ opstep->opstrategy,
+ values, nvalues,
+ partsupfunc,
+ opstep->nullkeys,
+ &result->scan_all_nonnull);

Why not allocate the PruneStepResult inside the get_matching_*_bound,
that way you wouldn't need all those out parameters to set the bool
fields.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/02 21:03, David Rowley wrote:
> On 2 April 2018 at 17:18, Amit Langote wrote:
>> On 2018/03/31 0:55, David Rowley wrote:
>>> explain (analyze, costs off, summary off, timing off)  execute q1 (1,2,2,0);
>>>            QUERY PLAN
>>> --------------------------------
>>>  Result (actual rows=0 loops=1)
>>>    One-Time Filter: false
>>> (2 rows)
>>
>> Hmm.  It is the newly added inversion step that's causing this.  When
>> creating a generic plan (that is when the planning happens via
>> BuildCachedPlan called with boundParams set to NULL), the presence of
>> Params will cause an inversion step's source step to produce
>> scan-all-partitions sort of result, which the inversion step dutifully
>> inverts to a scan-no-partitions result.
>>
>> I have tried to attack that problem by handling the
>> no-values-to-prune-with case using a side-channel to propagate the
>> scan-all-partitions result through possibly multiple steps.  That is, a
>> base pruning step will set datum_offsets in a PruneStepResult only if
>> pruning is carried out by actually comparing values with the partition
>> bounds.  If no values were provided (like in the generic plan case), it
>> will set a scan_all_nonnull flag instead and return without setting
>> datum_offsets.  Combine steps perform their combining duty only if
>> datum_offset contains a valid value, that is, if scan_all_nonnulls is not set.
> 
> I'm afraid this is still not correct :-(
> 
> The following code is not doing the right thing:
> 
> + case COMBINE_UNION:
> + foreach(lc1, cstep->source_stepids)
> + {
> + int step_id = lfirst_int(lc1);
> + PruneStepResult *step_result;
> +
> + /*
> + * step_results[step_id] must contain a valid result,
> + * which is confirmed by the fact that cstep's step_id is
> + * greater than step_id and the fact that results of the
> + * individual steps are evaluated in sequence of their
> + * step_ids.
> + */
> + if (step_id >= cstep->step.step_id)
> + elog(ERROR, "invalid pruning combine step argument");
> + step_result = step_results[step_id];
> + Assert(step_result != NULL);
> +
> + result->scan_all_nonnull = step_result->scan_all_nonnull;
> 
> The last line there is not properly performing a union; it just sets
> the result->scan_all_nonnull to whatever the last step's value was.
>
> At the very least it should be |= but I don't really like this new code.
> 
> Why did you move away from just storing the matching partitions in a
> Bitmapset? If you want to store all non-null partitions, then why not
> just set the bits for all non-null partitions? That would cut down on
> bugs like this since the combining of step results would just be
> simple unions or intersects.

As I mentioned in my previous email, I had to find a side-channel (that
is, scan_all_nonnull) to store this information instead of doing it the
regular way, to differentiate the case where we must scan all partitions
because the values in the base prune steps are not available from the case
where carrying out a step with actual values happens to select all
partitions.  When creating a generic plan, none of the values of the
Params added to base prune steps are available, so the actual pruning
functions (get_matching_hash/list/range_bounds) are reached without any
values, which causes each of them to return all partitions containing
non-null data.

But actually, the presence of only Params in the pruning steps should
result in pruning not being invoked at all (at least in the static pruning
case), thus selecting all partitions containing non-null data.  It is
better to implement that instead of a workaround like the scan_all_nonnull
side-channel I was talking about.

Fixed the patch to implement it that way.

> Also, the following code could be made a bit nicer
> 
> + result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
> +
> + switch (context->strategy)
> + {
> + case PARTITION_STRATEGY_HASH:
> + result->bound_offsets = get_matching_hash_bound(context,
> + opstep->opstrategy,
> + values, nvalues,
> + partsupfunc,
> + opstep->nullkeys,
> + &result->scan_all_nonnull);
> 
> Why not allocate the PruneStepResult inside the get_matching_*_bound,
> that way you wouldn't need all those out parameters to set the bool
> fields.

I thought it'd be nice to have perform_pruning_base_step generate the
actual PruneStepResult instead of the functions for the individual
partitioning strategies, which in a way minimizes the places where it is
manipulated.  Since we've divided bound searching into 3 separate
functions anyway, it also seemed better to me to have their signatures be
relevant to the partition strategy they cater to.  The function for hash
partitioning, for example, never has to set the result for the null or the
default partition, and the range partitioning function doesn't have to do
anything about the null partition.  Also, the overall footprint of those 3
functions is reduced because they don't have to create the PruneStepResult
themselves.

Attached v47.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 00:02, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> But actually, the presence of only Params in the pruning steps should
> result in the pruning not being invoked at all (at least for the static
> pruning case), thus selecting all partitions containing non-null data.  It
> is better to implement that instead of a workaround like scan_all_nonnulls
> side-channel I was talking about.

I don't think this is quite true. Since we're only using strict
clauses, a list of quals with just Params still means that NULLs can't
match. If you skip the step altogether then won't you have lost the
chance at pruning away any NULL-only partition?

I think it would be better to just have special handling in
get_matching_list_bound so that it knows it's performing <>
elimination. I'd thought about passing some other opstrategy but the
only safe one I thought to use was InvalidStrategy, which is already
used by NULL handling.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 00:02, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> But actually, the presence of only Params in the pruning steps should
>> result in the pruning not being invoked at all (at least for the static
>> pruning case), thus selecting all partitions containing non-null data.  It
>> is better to implement that instead of a workaround like scan_all_nonnulls
>> side-channel I was talking about.
>
> I don't think this is quite true. Since we're only using strict
> clauses, a list of quals with just Params still means that NULLs can't
> match. If you skip the step altogether then won't you have lost
> the chance at pruning away any NULL-only partition?
>
> I think it would be better to just have special handling in
> get_matching_list_bound so that it knows it's performing <>
> elimination. I'd thought about passing some other opstrategy but the
> only safe one I thought to use was InvalidStrategy, which is already
> used by NULL handling.

I'm currently working up a patch to do this the way I think is best.

I'll submit it soon and we can review and get your thoughts on it.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I think it would be better to just have special handling in
>> get_matching_list_bound so that it knows it's performing <>
>> elimination. I'd thought about passing some other opstrategy but the
>> only safe one I thought to use was InvalidStrategy, which is already
>> used by NULL handling.
>
> I'm currently working up a patch to do this the way I think is best.
>
> I'll submit it soon and we can review and get your thoughts on it.

I've attached a rough cut version of what I think is a good solution
for this. It's based on v46, not your latest v47, sorry.

This makes get_matching_list_bounds() aware that it's performing the
not-equal pruning via the opstrategy which allows it to not return all
partitions when there are no values in this case. Instead, we return
the NULL partition, so that we later invert that and return everything
apart from the NULL partition. A strict clause will allow us that
much, even if we can't get the actual value being compared to, at the
time.

There's also a bunch of other changes in there:

1. Adding missing step_id in copyfuncs.c
2. Simplified including the default partition in a bunch of cases.
3. Made it so scan_default and scan_null are only ever set to true if
there's a partition for that.
4. Changed get_matching_*_bounds to return the entire result struct
instead of the Bitmapset and pass the remaining bool values back
through params. I didn't really like how you'd change this to pass all
the bool flags back as params. There's a perfectly good struct there
to provide the entire result in a single return value. I know you've
disagreed with this already, so would be nice to get a 3rd opinion.
5. Rename get_matching_hash_bound to get_matching_hash_bounds. The
LIST and RANGE version of this function both had a plural name. I
didn't see any reason for the hash case to be different.

Let me know what you think.

I've patched the run-time pruning v18 against this and it now passes regression.

I need to do a bit more testing on this to ensure it works for all
cases, but thought I'd send now as I suspect you're currently around
to look.

There might be another issue with the patch too, but I'll send a
separate email about that.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 13:13, David Rowley <david.rowley@2ndquadrant.com> wrote:
> There might be another issue with the patch too, but I'll send a
> separate email about that.

In the current version of the patch the following comment exists:

/*
* Fall-through for a NOT clause, which is handled in
* match_clause_to_partition_key().
*/

The only real handling of NOT clauses is in
match_boolean_partition_clause() which just handles NOT(true) or
NOT(false).

It's true that the const simplification code will generally rewrite
most NOT(clause) to use the negator operator, but if the operator does
not have a negator it can't do this.

We probably don't have any built-in operators which are members of a
btree opclass which have no negator, but it's simple enough to modify
the citext extension by commenting out the NEGATOR lines in
citext--1.4--1.5.sql.

create extension citext;
create table listp(a citext) partition by list(a citext_pattern_ops);
create table listp_1 partition of listp for values in('1');
explain select * from listp where not (a ~>~ '0' and  a ~<~ '2');
                                QUERY PLAN
--------------------------------------------------------------------------
 Append  (cost=0.00..36.45 rows=1209 width=32)
   ->  Seq Scan on listp_1  (cost=0.00..30.40 rows=1209 width=32)
         Filter: ((NOT (a ~>~ '0'::citext)) OR (NOT (a ~<~ '2'::citext)))
(3 rows)

At the moment pruning does not work for this case at all. Perhaps it should?

I imagine it might be possible to re-work the COMBINE_INVERT code so
that it becomes a flag of the combine step rather than a step operator
type. It should then be possible to invert COMBINE_UNION for
NOT(clause1 OR clause2) and COMBINE_INTERSECT on NOT(clause1 AND
clause2).

IOW, it might not take too many lines of code to put this right.
Probably the bulk of the work would be writing a test with a btree
opclass that will allow us to have the planner not invert the clause
during const folding.
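That flag-based inversion idea can be sketched in the same toy bitmask form (hypothetical names, not the patch's code): by De Morgan, NOT(c1 OR c2) is the complement of the union and NOT(c1 AND c2) the complement of the intersection, taken against the set of all partitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned PartSet;   /* toy stand-in for a Bitmapset of partition indexes */

/*
 * Apply a combine step's hypothetical invert flag: complement the
 * combined result against the full set of partitions.
 */
PartSet
apply_invert(PartSet combined, int npartitions, bool invert)
{
	PartSet		all = (1u << npartitions) - 1;

	return invert ? (all & ~combined) : combined;
}
```

So a union step that matched partitions {0,1} out of three, run under NOT, would yield {2}.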

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> It's true that the const simplification code will generally rewrite
> most NOT(clause) to use the negator operator, but if the operator does
> not have a negator it can't do this.
> ...
> At the moment pruning does not work for this case at all. Perhaps it should?

It's hard to see why we'd expend extra effort to optimize such situations.
The right answer would invariably be to fix the inadequate operator
definition, because missing the negator link would hobble many other
cases besides this.

Now if you can show a case where the extra smarts would be useful
without presuming a badly-written opclass, it's a different matter.

            regards, tom lane


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 16:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> It's true that the const simplification code will generally rewrite
>> most NOT(clause) to use the negator operator, but if the operator does
>> not have a negator it can't do this.
>> ...
>> At the moment pruning does not work for this case at all. Perhaps it should?
>
> It's hard to see why we'd expend extra effort to optimize such situations.
> The right answer would invariably be to fix the inadequate operator
> definition, because missing the negator link would hobble many other
> cases besides this.
>
> Now if you can show a case where the extra smarts would be useful
> without presuming a badly-written opclass, it's a different matter.

Okay, well that certainly sounds like less work.

In that case, the comment which claims we handle the NOT clauses needs
to be updated to mention that we only handle boolean NOT clauses and
don't optimize the remainder.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/04 10:13, David Rowley wrote:
> On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> I think it would be better to just have special handling in
>>> get_matching_list_bound so that it knows it's performing <>
>>> elimination. I'd thought about passing some other opstrategy but the
>>> only safe one I thought to use was InvalidStrategy, which is already
>>> used by NULL handling.

Thanks for this suggestion.

Having the special case handling for steps corresponding to <> operator
clauses in get_matching_list_bounds() seems like the best way and that
should have been the way all along.  It occurred to me that after I
changed the patch to store datum offsets in the result, there wasn't any
need for special handling of <> operators at a higher level -- like the
special pruning function (get_partitions_excluded_by_ne_datums) that we
used to have or the COMBINE_INVERT I recently proposed.

For each datum coming from a <> operator clause (signaled to
get_matching_list_bounds by passing InvalidStrategy for opstrategy), we
return all datums minus the one that was passed (if the latter is indeed
found in boundinfo).  Bounds for individual <> operator clauses will be
combined using INTERSECT at a higher level to give the desired result.  No
need for the invert step and for the planner to set things up very
carefully for invert step to do the right thing.
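As a rough sketch of that behavior (toy code, not the patch's actual get_matching_list_bounds(); bound offsets are modeled as a bitmask and the names are made up):

```c
#include <assert.h>

/*
 * Toy model of <> handling: given the sorted list-bound values and the
 * comparison value from a <> clause, return a bitmask containing every
 * bound offset except the one matching the value (if it is present).
 */
unsigned
matching_bounds_for_ne(const int *bounds, int nbounds, int value)
{
	unsigned	result = (1u << nbounds) - 1;	/* start with all offsets */
	int			off;

	for (off = 0; off < nbounds; off++)
	{
		if (bounds[off] == value)
		{
			result &= ~(1u << off);				/* drop the matched bound */
			break;
		}
	}
	return result;
}
```

Each <> clause thus yields "all bounds minus one", and intersecting those per-clause results at the combine step does the rest.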

create table lp (a int) partition by list (a);
create table lp1 partition of lp for values in (1);
create table lp2 partition of lp for values in (2);
create table lp3 partition of lp for values in (3);
create table lp_null partition of lp for values in (null);
create table lp_default partition of lp default;

For

explain select * from lp where a <> 1

get_matching_list_bounds will return the set of offsets of {2, 3} and
will set scan_default to true, while setting scan_null to false;

and for

explain select * from lp where a <> 1 and a <> 3

it will return the sets of offsets of {2, 3} and {1, 2} for the individual
base steps, in each case setting scan_default to true and scan_null to
false; the INTERSECT combination step then combines those to give the
offset of 2, with scan_default set to true and scan_null set to false.

                             QUERY PLAN
--------------------------------------------------------------------
 Append  (cost=0.00..121.75 rows=5050 width=4)
   ->  Seq Scan on lp2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))
   ->  Seq Scan on lp_default  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))

If there are no values then the offsets of all the bounds will be returned
and unlike the previous setup with the INVERT step in the mix, they will
all survive.
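Using the lp example above, the INTERSECT combination of per-step results can be sketched like this (a toy struct mirroring the shape of PruneStepResult; the offsets follow the example, everything else is made up):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyStepResult
{
	unsigned	bound_offsets;	/* bitmask of matching bound offsets */
	bool		scan_null;
	bool		scan_default;
} ToyStepResult;

/*
 * COMBINE_INTERSECT: a bound offset survives only if every source step
 * selected it, and likewise for the null and default partitions.
 */
ToyStepResult
combine_intersect(ToyStepResult a, ToyStepResult b)
{
	ToyStepResult r;

	r.bound_offsets = a.bound_offsets & b.bound_offsets;
	r.scan_null = a.scan_null && b.scan_null;
	r.scan_default = a.scan_default && b.scan_default;
	return r;
}
```

With bounds (1, 2, 3) at offsets (0, 1, 2), the steps for a <> 1 and a <> 3 select offsets {1,2} and {0,1}; intersecting leaves offset 1 (the bound for 2) with scan_default still true, matching the plan shown above.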

I know we've been back and forth quite a bit on this, but this solution
seems like the one with the least amount of hackery.  Hope you find it to
be the same way.

>> I'm currently working up a patch to do this the way I think is best.
>>
>> I'll submit it soon and we can review and get your thoughts on it.
> 
> I've attached a rough cut version of what I think is a good solution
> for this. It's based on v46, not your latest v47, sorry.
> 
> This makes get_matching_list_bounds() aware that it's performing the
> not-equal pruning via the opstrategy which allows it to not return all
> partitions when there are no values in this case. Instead, we return
> the NULL partition, so that we later invert that and return everything
> apart from the NULL partition. A strict clause will allow us that
> much, even if we can't get the actual value being compared to, at the
> time.

As I explained above, I considered your general idea of teaching
get_matching_list_bounds to deal with being passed InvalidStrategy for
opstrategy to signal special handling of a <> clause datum.

By implementing that, I was able to get rid of a bunch of code in
partprune.c and remove the COMBINE_INVERT related code.  We can add
COMBINE_INVERT later if and when we need it (for some legitimate purpose).

> There's also a bunch of other changes in there:

Thanks.

> 1. Adding missing step_id in copyfuncs.c

Merged.

> 2. Simplified including the default partition in a bunch of cases.
> 3. Made it so scan_default and scan_null are only ever set to true if
> there's a partition for that.

I have merged these too.

> 4. Changed get_matching_*_bounds to return the entire result struct
> instead of the Bitmapset and pass the remaining bool values back
> through params. I didn't really like how you'd change this to pass all
> the bool flags back as params. There's a perfectly good struct there
> to provide the entire result in a single return value. I know you've
> disagreed with this already, so would be nice to get a 3rd opinion.

I went ahead with them returning PruneStepResult struct.

> 5. Rename get_matching_hash_bound to get_matching_hash_bounds. The
> LIST and RANGE version of this function both had a plural name. I
> didn't see any reason for the hash case to be different.

Agreed, merged.

> Let me know what you think.

I'm not sure about the following change in your patch:

-        if (!result->scan_null)
-            result->scan_null = step_result->scan_null;
-        if (!result->scan_default)
-            result->scan_default = step_result->scan_default;
+        result->scan_null |= step_result->scan_null;
+        result->scan_default |= step_result->scan_default;

AFAIK, |= does a bitwise OR, which, even if it gives the result we want,
is not a logical operation.  I had written the original code using the
following definition of logical OR:

  a OR b = if a then true else b

Also, since things work normally even if we pass no values to
get_matching_list_bounds, including via the dummy step generated for IS
NOT NULL clause(s), I don't see the need to store notnullkeys in the prune
step.  Especially, it's redundant to set notnullkeys in the pruning step
containing non-empty exprs since, by definition, they will select
partitions containing non-null datums.

> I've patched the run-time pruning v18 against this and it now passes regression.
> 
> I need to do a bit more testing on this to ensure it works for all
> cases, but thought I'd send now as I suspect you're currently around
> to look.

See if attached works for you.

> There might be another issue with the patch too, but I'll send a
> separate email about that.

I suppose this is the email about support for pruning using NOT clauses in
partition.c.  It might be possible to do that by tweaking things somehow
by re-introducing the COMBINE_INVERT step (legitimately needed in that
case) and modifying partprune.c to capture NOT clauses in more cases than
it does currently.

Anyway, I modified the comment as you suggested, to say that we only
support Boolean NOT clauses in special cases, like when using a Boolean
partitioning opfamily.

Attached v48.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/04 14:42, Amit Langote wrote:
> Attached v48.

I had forgotten to remove the static_pruning parameter I had added in
v47, because it is no longer used.  Static pruning now occurs even if a
step contains all Params, in which case each of
get_matching_hash/list/range_bounds() functions returns offsets of all
non-null datums, because the Params cannot be resolved to actual values
during static pruning.

Also merged a few changes to get_matching_partitions that David had
proposed in his delta patch but that I had failed to include in v48.

Attached v49.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 17:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I'm not sure about the following change in your patch:
>
> -               if (!result->scan_null)
> -                       result->scan_null = step_result->scan_null;
> -               if (!result->scan_default)
> -                       result->scan_default = step_result->scan_default;
> +               result->scan_null |= step_result->scan_null;
> +               result->scan_default |= step_result->scan_default;
>
> Afaik, |= does bitwise OR, which even if it might give the result we want,
> is not a logical operation.  I had written the original code using the
> following definition of logical OR.
>
>   a OR b = if a then true else b

Ok, no problem. I only changed that to make it more compact.

For the record we do the same in plenty of over places over the code base:

E.g.

parse->hasSubLinks |= subquery->hasSubLinks;

/* If subquery had any RLS conditions, now main query does too */
parse->hasRowSecurity |= subquery->hasRowSecurity;


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 19:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached v49.

Thank for including the changes. I'll look now.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 4 April 2018 at 19:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/04 14:42, Amit Langote wrote:
>> Attached v48.
>
> I had forgotten to remove the static_pruning parameter I had added in the
> v47, because it is no longer used.  Static pruning now occurs even if a
> step contains all Params, in which case each of
> get_matching_hash/list/range_bounds() functions returns offsets of all
> non-null datums, because the Params cannot be resolved to actual values
> during static pruning.

Thanks for updating. I've made a pass over v49 and I didn't find very
much wrong with it.

The only real bug I found was a missing IsA(rinfo->clause, Const) in
the pseudoconstant check inside
generate_partition_pruning_steps_internal.

Most of the changes are comment fixes with a few stylistic changes
thrown which are pretty much all there just to try to shrink the code
a line or two or reduce indentation.

I feel pretty familiar with this code now and assuming the attached is
included I'm happy for someone else, hopefully, a committer to take a
look at it.

I'll leave the following notes:

1. Still not sure about RelOptInfo->has_default_part. This flag is
only looked at in generate_partition_pruning_steps. The RelOptInfo and
the boundinfo is available to look at, it's just that the
partition_bound_has_default macro is defined in partition.c rather
than partition.h.

2. Don't really like the new isopne variable name. It's not very
simple to decode, perhaps something like is_not_eq is better?

3. The part of the code I'm least familiar with is
get_steps_using_prefix_recurse(). I admit to not having had time to
fully understand that and consider ways to break it.

Marking as ready for committer.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi,

On 04/04/2018 09:29 AM, David Rowley wrote:
> Thanks for updating. I've made a pass over v49 and I didn't find very
> much wrong with it.
> 
> The only real bug I found was a missing IsA(rinfo->clause, Const) in
> the pseudoconstant check inside
> generate_partition_pruning_steps_internal.
> 
> Most of the changes are comment fixes with a few stylistic changes
> thrown which are pretty much all there just to try to shrink the code
> a line or two or reduce indentation.
> 
> I feel pretty familiar with this code now and assuming the attached is
> included I'm happy for someone else, hopefully, a committer to take a
> look at it.
> 
> I'll leave the following notes:
> 
> 1. Still not sure about RelOptInfo->has_default_part. This flag is
> only looked at in generate_partition_pruning_steps. The RelOptInfo and
> the boundinfo is available to look at, it's just that the
> partition_bound_has_default macro is defined in partition.c rather
> than partition.h.
> 
> 2. Don't really like the new isopne variable name. It's not very
> simple to decode, perhaps something like is_not_eq is better?
> 
> 3. The part of the code I'm least familiar with is
> get_steps_using_prefix_recurse(). I admit to not having had time to
> fully understand that and consider ways to break it.
> 
> Marking as ready for committer.
> 

Passes check-world, and CommitFest app has been updated to reflect the 
current patch set. Trivial changes attached.

Best regards,
  Jesper

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi.

On 2018/04/05 0:45, Jesper Pedersen wrote:
> Hi,
> 
> On 04/04/2018 09:29 AM, David Rowley wrote:
>> Thanks for updating. I've made a pass over v49 and I didn't find very
>> much wrong with it.
>>
>> The only real bug I found was a missing IsA(rinfo->clause, Const) in
>> the pseudoconstant check inside
>> generate_partition_pruning_steps_internal.

Fixed.

>> Most of the changes are comment fixes with a few stylistic changes
>> thrown which are pretty much all there just to try to shrink the code
>> a line or two or reduce indentation.
>>
>> I feel pretty familiar with this code now and assuming the attached is
>> included I'm happy for someone else, hopefully, a committer to take a
>> look at it.

Thank you, your changes look good to me.

>> I'll leave the following notes:
>>
>> 1. Still not sure about RelOptInfo->has_default_part. This flag is
>> only looked at in generate_partition_pruning_steps. The RelOptInfo and
>> the boundinfo is available to look at, it's just that the
>> partition_bound_has_default macro is defined in partition.c rather
>> than partition.h.

Hmm, it might not be such a bad idea to bring out the
PartitionBoundInfoData into partition.h.  If we do that, we won't need the
has_default_part that the patch adds to RelOptInfo.

In the Attached v50 set, 0002 does that.

>> 2. Don't really like the new isopne variable name. It's not very
>> simple to decode, perhaps something like is_not_eq is better?

isopne does sound a bit unintelligible.  I propose op_is_ne so that it
sounds consistent with the preceding member of the struct that's called
opno.  I want to keep "ne" and not start calling it not_eq, as a few other
places use the string "ne" to refer to a similar thing, like:

/* inequality */
Datum
range_ne(PG_FUNCTION_ARGS)

Datum
timestamptz_ne_date(PG_FUNCTION_ARGS)

Since the field is local to partprune.c, I guess it's fine, as the
comment where it's defined explains what it is.

>> 3. The part of the code I'm least familiar with is
>> get_steps_using_prefix_recurse(). I admit to not having had time to
>> fully understand that and consider ways to break it.

The purpose of that code is to generate *all* needed steps to be combined
using COMBINE_INTERSECT such that the pruning will occur using the most
restrictive set of clauses in cases where the same key is referenced in
multiple restriction clauses containing non-equality operators.  So, for a
range partitioned table on (a, b):

For a query like

explain select * from foo where a <= 1 and a <= 3 and b < 5 and b <= 10

Pruning steps generated to be combined with an enclosing INTERSECT step
will be as follows:

<= (1, 10)
<  (1, 5)
<= (3, 10)
<  (3, 5)
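The cross-product generation that get_steps_using_prefix_recurse() performs can be sketched roughly like this (a simplified stand-in: the real code builds pruning-step nodes and tracks the operator strategy for each clause; here each combination is just emitted as a tuple of values):

```c
#include <assert.h>

#define MAXKEYS 4

/*
 * Toy sketch: for each partition key there is a list of candidate clause
 * values; recursively emit every combination, one per pruning step, to be
 * combined by an enclosing INTERSECT step.  Returns the number of
 * combinations written into out[].
 */
int
emit_steps(int nkeys, const int *nvals, int vals[][MAXKEYS],
		   int key, int *prefix, int out[][MAXKEYS], int nout)
{
	int			i;

	if (key == nkeys)
	{
		/* one complete combination -> one pruning step */
		for (i = 0; i < nkeys; i++)
			out[nout][i] = prefix[i];
		return nout + 1;
	}
	for (i = 0; i < nvals[key]; i++)
	{
		prefix[key] = vals[key][i];
		nout = emit_steps(nkeys, nvals, vals, key + 1, prefix, out, nout);
	}
	return nout;
}
```

For the example above, with a-values {1, 3} and b-values {10, 5}, this emits the four combinations (1, 10), (1, 5), (3, 10), (3, 5), matching the listed steps.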

>> Marking as ready for committer.

Thank you!

> Passes check-world, and CommitFest app has been updated to reflect the
> current patch set. Trivial changes attached.

Merged these changes.  Thanks again Jesper.

Attached v50.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> >> 1. Still not sure about RelOptInfo->has_default_part. This flag is
> >> only looked at in generate_partition_pruning_steps. The RelOptInfo and
> >> the boundinfo is available to look at, it's just that the
> >> partition_bound_has_default macro is defined in partition.c rather
> >> than partition.h.
> 
> Hmm, it might not be such a bad idea to bring out the
> PartitionBoundInfoData into partition.h.  If we do that, we won't need the
> has_default_part that the patch adds to RelOptInfo.
> 
> In the Attached v50 set, 0002 does that.

After looking at this for a moment, I again come to the conclusion that
the overall layout of partitioning code and definitions is terrible.
But we already know that, and there's a patch in commitfest to improve
things.  So my intention right now is to hold my nose and get this
pushed; we'll fix it afterwards.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
It seems pretty clear that putting get_matching_partitions() in
catalog/partition.c is totally the wrong thing; it belongs wholly in
partprune. I think the reason you put it there is that it requires
access to a lot of internals that are static in partition.c.  In the
attached not yet cleaned version of the patch, I have moved a whole lot
of what you added to partition.c to partprune.c; and for the functions
and struct declarations that were required to make it work, I created
catalog/partition_internal.h.

I changed a lot of code also, but cosmetic changes only.

I'll clean this up a bit more now, and try to commit shortly (or early
tomorrow); wanted to share current status now in case I have to rush
out.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>       * parentrte already has the root partrel's updatedCols translated to match
>       * the attribute ordering of parentrel.
>       */
> -    if (!*part_cols_updated)
> -        *part_cols_updated =
> +    if (!root->partColsUpdated)
> +        root->partColsUpdated =
>              has_partition_attrs(parentrel, parentrte->updatedCols, NULL);

Hmm, surely this should be |= to avoid resetting a value set in a
previous call to this function?  In the previous coding it wasn't
necessary because it was a local variable ...  (though, isn't it a bit
odd to have this in PlannerInfo?  seems like it should be in
resultRelInfo, but then you already have it there so I suppose this one
does *more*)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I changed a lot of code also, but cosmetic changes only.
>
> I'll clean this up a bit more now, and try to commit shortly (or early
> tomorrow); wanted to share current status now in case I have to rush
> out.

I made a complete pass over the patch you sent. I only noted down the
following few things:

1.

+ * off < 0, meaning the look-up value is smaller that all bounds,

that -> than

2. I guess this will be removed before commit.

+#if 0
<large section of comments>
+#endif

3. This comment seems like a strange thing to write just before
testing if the clause matches the partition key.

+ /* Clause does not match this partition key. */
+ if (equal(leftop, partkey))
+ *rightop = not_clause((Node *) clause)
+ ? (Expr *) makeBoolConst(false, false)
+ : (Expr *) makeBoolConst(true, false);


4. Comment needs to be removed.

+ * has_default_part - Whether the table has a default partition


The only other thing I noted on this pass is that we could get rid of:

+ /* go check the next clause. */
+ if (unsupported_clause)
+ break;

and just "continue" instead of "break" in all cases apart from case
PARTCLAUSE_UNSUPPORTED:

it would save a few lines and a single condition. What's there works,
but thought this might be better...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 6 April 2018 at 12:02, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The only other thing I noted on this pass is that we could get rid of:
>
> + /* go check the next clause. */
> + if (unsupported_clause)
> + break;
>
> and just "continue" instead of "break" in all cases apart from case
> PARTCLAUSE_UNSUPPORTED:

I should have said remove:

+ if (unsupported_clause)

The "break" would still be required.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi.

On 2018/04/06 7:35, Alvaro Herrera wrote:
> It seems pretty clear that putting get_matching_partitions() in
> catalog/partition.c is totally the wrong thing; it belongs wholly in
> partprune. I think the reason you put it there is that it requires
> access to a lot of internals that are static in partition.c.  In the
> attached not yet cleaned version of the patch, I have moved a whole lot
> of what you added to partition.c to partprune.c; and for the functions
> and struct declarations that were required to make it work, I created
> catalog/partition_internal.h.

Yes, I really wanted most of the new code that this patch adds to land
in the planner, especially after Robert's comments here:

https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

It would've been nice if we'd gotten the "reorganizing partitioning code"
thread resolved sooner.

> I changed a lot of code also, but cosmetic changes only.
> 
> I'll clean this up a bit more now, and try to commit shortly (or early
> tomorrow); wanted to share current status now in case I have to rush
> out.

Some comments on the code reorganizing part of the patch:

* Did you intentionally not put PartitionBoundInfoData and its accessor
macros in partition_internal.h?  partprune.c would not need to include
partition.h if we did that.

* Also, I wonder why you left PartitionPruneContext in partition.h.  Isn't
it better taken out to partprune.h?

I have done that in the attached.

* Why isn't gen_partprune_steps() in partprune.h?  I see only
prune_append_rel_partitions() exported out of partprune.c, but the runtime
patch needs gen_partprune_steps() to be called from createplan.c.

* I don't see get_matching_partitions() exported either.  Runtime pruning
patch needs that too.

Maybe you've thought something about these two items though.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/06 8:33, Alvaro Herrera wrote:
>> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>>       * parentrte already has the root partrel's updatedCols translated to match
>>       * the attribute ordering of parentrel.
>>       */
>> -    if (!*part_cols_updated)
>> -        *part_cols_updated =
>> +    if (!root->partColsUpdated)
>> +        root->partColsUpdated =
>>              has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
> 
> Hmm, surely this should be |= to avoid resetting a value set in a
> previous call to this function?

It won't be, no?  We set it only if it hasn't been already.  Note that
there is one PlannerInfo per sub-query, so we determine this information
independently for each sub-query.

> In the previous coding it wasn't
> necessary because it was a local variable ...  (though, isn't it a bit
> odd to have this in PlannerInfo?  seems like it should be in
> resultRelInfo, but then you already have it there so I suppose this one
> does *more*)

Hmm, you'd think that we could figure this out in the executor itself, and
hence wouldn't need to have this in PlannerInfo or in ModifyTable.  But IIRC,
during the discussion of the update tuple routing patch, it became clear
that it's best to do that here, given the way things are now wrt the timing
of partition/inheritance tree expansion.  An update query may modify the
partition key of a table at any arbitrary level and we have to look at all
the tables in the partition tree in this planning phase anyway, so it's
also the best time to see if the query's modifiedCols overlaps with the
partition key of some table in the tree.  Once we've found that it does
for some table (most likely the root), we're done, that is, we know we got
some "partColsUpdated".


I realized that I had gotten rid of has_default_part from RelOptInfo but
hadn't deleted a comment about it; attached patch to fix that.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:
> Hi.
> 
> On 2018/04/06 7:35, Alvaro Herrera wrote:
> > It seems pretty clear that putting get_matching_partitions() in
> > catalog/partition.c is totally the wrong thing; it belongs wholly in
> > partprune. I think the reason you put it there is that it requires
> > access to a lot of internals that are static in partition.c.  In the
> > attached not yet cleaned version of the patch, I have moved a whole lot
> > of what you added to partition.c to partprune.c; and for the functions
> > and struct declarations that were required to make it work, I created
> > catalog/partition_internal.h.
> 
> Yes, I really wanted most of the new code that this patch adds to land
> in the planner, especially after Robert's comments here:
> 
> https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com
> 
> It would've been nice if we'd gotten the "reorganizing partitioning code"
> thread resolved sooner.

Grumble.

I don't actually like very much the idea of putting all this code in
optimizer/util.  This morning it occurred to me that we should create a new
src/backend/partitioning/ (and a src/include/partitioning/ to go with
it) and drop a bunch of files there.  Even your proposed new partcache.c
will seem misplaced *anywhere*, since it contains support code to be
used by both planner and executor; in src/{backend,include}/partitioning
it will be able to serve both without it being a modularity wart.

BTW including partition_internal.h in partition.h would defeat the point
of having partition_internal.h in the first place -- at that point you'd
be better off just putting it all in partition.h and save the hassle of
a separate file.  But given the liberty with which catalog/partition.h
has been included everywhere else, IMO that would be pretty disastrous.

I propose to work on reorganizing this code after the commitfest is
over, as part of release stabilization.  I'd rather not have us
supporting a messy system for only five years, if we restructure during
pg12 (which would mean a lot of backpatching pain and pg11-specific
bugs); or worse, forever, if we keep the current proposed layout.

One thing I don't want to do is create a new file that we'll later have
to rename or move, so choosing the best locations is a necessity.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
David Rowley wrote:

> 2. I guess this will be removed before commit.
> 
> +#if 0
> <large section of comments>
> +#endif

Yeah, there is one sentence there I didn't quite understand and would
like to add it to the rewritten version of the comment before I remove
the whole ifdeffed-out comment.

 *    PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning
 *    step(s)" generated for the clause due to it being a BoolExpr or a
 *    ScalarArrayOpExpr that's turned into one

Exactly what does "ScalarArrayOpExpr that's turned into one" mean?
Does it mean we turn SAOP into BoolExpr?

(Yes, I know "#if 0" inside a comment doesn't do anything.  It's only
documentation for myself.)

If you look at the rest of the rewritten comment, you'll notice some
things probably need more explaining.  Wording suggestions welcome.

> 3. This comment seems like a strange thing to write just before
> testing if the clause matches the partition key.
> 
> + /* Clause does not match this partition key. */
> + if (equal(leftop, partkey))
> + *rightop = not_clause((Node *) clause)
> + ? (Expr *) makeBoolConst(false, false)
> + : (Expr *) makeBoolConst(true, false);

Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
falls back to checking "not_clause" and a few equals.  Does it make
sense if the clause is a SAOP?  I added this assert:
    Assert(IsA(clause, BooleanTest) ||
           IsA(clause, BoolExpr) ||
           IsA(clause, RelabelType));

and it failed:
#3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, 
    clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
    at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
2159        Assert(IsA(clause, BooleanTest) ||
(gdb) print *clause
$1 = {type = T_ScalarArrayOpExpr}

I'm not sure whether or not this function can trust that what's incoming
must absolutely be only those node types.

> 4. Comment needs to be removed.
> 
> + * has_default_part - Whether the table has a default partition

Done.

> The only other thing I noted on this pass is that we could get rid of:
> 
> + /* go check the next clause. */
> + if (unsupported_clause)
> + break;
> 
> and just "continue" instead of "break" in all cases apart from case
> PARTCLAUSE_UNSUPPORTED:
> 
> it would save a few lines and a single condition. What's there works,
> but thought this might be better...

Makes sense -- looking.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
BTW, having both key_is_null and key_is_not_null output args to convey a
single bit of info is a bit lame, so I'm replacing them with a single
boolean: the return value already indicates it's a matching IS [NOT] NULL
clause, so we only need to indicate whether the NOT is present.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
> falls back to checking "not_clause" and a few equals.  Does it make
> sense if the clause is a SAOP?  I added this assert:
>     Assert(IsA(clause, BooleanTest) ||
>            IsA(clause, BoolExpr) ||
>            IsA(clause, RelabelType));
> 
> and it failed:
> #3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, 
>     clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
>     at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
> 2159        Assert(IsA(clause, BooleanTest) ||
> (gdb) print *clause
> $1 = {type = T_ScalarArrayOpExpr}
> 
> I'm not sure whether or not this function can trust that what's incoming
> must absolutely be only those node types.

So this is what I need for current regression tests not to crash
anymore:

    Assert(IsA(clause, BooleanTest) ||
           IsA(clause, BoolExpr) ||
           IsA(clause, RelabelType) ||
           IsA(clause, ScalarArrayOpExpr) ||
           IsA(clause, OpExpr) ||
           IsA(clause, Var));

I'm not confident in my ability to write code to handle all possible
cases right now (obviously there must be more cases that are not covered
by current regression tests), so I'll leave it without the assert since
it handles a couple of the useful cases, but I suspect it could stand
some more improvement.

I guess the question is, how interesting is boolean partitioning?  I bet
it has its uses.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Fri, Apr 6, 2018 at 11:54 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Alvaro Herrera wrote:
>
>> Yeah.  Looking at this function, I noticed it tests for BooleanTest, and
>> falls back to checking "not_clause" and a few equals.  Does it make
>> sense if the clause is a SAOP?  I added this assert:
>>       Assert(IsA(clause, BooleanTest) ||
>>                  IsA(clause, BoolExpr) ||
>>                  IsA(clause, RelabelType));
>>
>> and it failed:
>> #3  0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424,
>>     clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068)
>>     at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159
>> 2159          Assert(IsA(clause, BooleanTest) ||
>> (gdb) print *clause
>> $1 = {type = T_ScalarArrayOpExpr}
>>
>> I'm not sure whether or not this function can trust that what's incoming
>> must absolutely be only those node types.
>
> So this is what I need for current regression tests not to crash
> anymore:
>
>         Assert(IsA(clause, BooleanTest) ||
>                    IsA(clause, BoolExpr) ||
>                    IsA(clause, RelabelType) ||
>                    IsA(clause, ScalarArrayOpExpr) ||
>                    IsA(clause, OpExpr) ||
>                    IsA(clause, Var));
>
> I'm not confident in my ability to write code to handle all possible
> cases right now (obviously there must be more cases that are not covered
> by current regression tests), so I'll leave it without the assert since
> it handles a couple of the useful cases, but I suspect it could stand
> some more improvement.
>
> I guess the question is, how interesting is boolean partitioning?  I bet
> it has its uses.

match_boolean_partition_clauses() exists to capture cases where an
OpExpr (or, for that matter, any expression that returns a Boolean)
is itself the partition key:

create table boolpart (a int) partition by list ((a = 1));
create table boolpart_t partition of boolpart for values in ('true');
create table boolpart_f partition of boolpart for values in ('false');
explain select * from boolpart where a = 1;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on boolpart_t  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 1)
(3 rows)

explain select * from boolpart where a = 2;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..41.88 rows=13 width=4)
   ->  Seq Scan on boolpart_f  (cost=0.00..41.88 rows=13 width=4)
         Filter: (a = 2)
(3 rows)

So, it's not that we're only in a position to accept certain node types
in match_boolean_partition_clauses().  Before it existed, the pruning
didn't work because such a clause wasn't matched to the partition key in
the special way that match_boolean_partition_clauses() now does; it would
instead end up in the block in match_clause_to_partition_key() where
OpExprs are analyzed for normal (non-Boolean) situations, where we extract
either the leftop or the rightop and try to match it with the partition key.

It might as well be:

create table boolpart (a int) partition by list ((a in (1, 2)));

Requiring us to be in a position to match a ScalarArrayOpExpr with the
partition key.

This resembles match_boolean_index_clause(), by the way.

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Fri, Apr 6, 2018 at 11:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Yeah, there is one sentence there I didn't quite understand and would
> like to add it to the rewritten version of the comment before I remove
> the whole ifdeffed-out comment.
>
>  *      PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning
>  *      step(s)" generated for the clause due to it being a BoolExpr or a
>  *      ScalarArrayOpExpr that's turned into one
>
> Exactly what does "ScalarArrayOpExpr that's turned into one" means?
> Does it mean we turn SAOP into BoolExpr?

Yes, we turn a ScalarArrayOpExpr into a BoolExpr and generate a prune
step for the latter.  Maybe someday we'll have a base pruning step that
can process a ScalarArrayOpExpr directly; for now, we create base steps
only for OpExprs.

> If you look at the rest of the rewritten comment, you'll notice some
> things probably need more explaining.  Wording suggestions welcome.

When I looked at it earlier today, I thought your rewrite looked much better.

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> Some comments on the code reorganizing part of the patch:
> 
> * Did you intentionally not put PartitionBoundInfoData and its accessor
> macros in partition_internal.h.  partprune.c would not need to include
> partition.h if we do that.

Not really.

After pondering this some more, I decided to call the new file
src/include/partition/partbounds.h; and the other new file will become
src/include/partition/partprune.h.  This leads naturally to the idea
that PartitionBoundInfoData will be in partbounds.h.  However, the
typedef struct PartitionBoundInfoData *PartitionBoundInfo will have to
remain in catalog/partition.h, at least for the time being.

> * Also, I wonder why you left PartitionPruneContext in partition.h.  Isn't
> it better taken out to partprune.h?

Yes.

> * Why isn't gen_partprune_steps() in partprune.h?  I see only
> prune_append_rel_partitions() exported out of partprune.c, but the runtime
> patch needs gen_partprune_steps() to be called from createplan.c.
> * I don't see get_matching_partitions() exported either.  Runtime pruning
> patch needs that too.

True -- both exported.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Here's my proposed patch.

Idle thought: how about renaming the "constfalse" argument and variables
to "contradictory" or maybe just "contradict"?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:41 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's my proposed patch.
>
> Idle thought: how about renaming the "constfalse" argument and variables
> to "contradictory" or maybe just "contradict"?

Sounds fine to me.

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
Jesper Pedersen
Date:
Hi Alvaro,

On 04/06/2018 12:41 PM, Alvaro Herrera wrote:
> Here's my proposed patch.
> 
> Idle thought: how about renaming the "constfalse" argument and variables
> to "contradictory" or maybe just "contradict"?
> 

Passes check-world.

The new directories and the variable rename both seem like good ideas; either name is ok.

Best regards,
  Jesper


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
So I pushed this 25 minutes ago, and there are already a couple of
buildfarm members complaining:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07

Both show exactly the same diff in test partition_prune:

*** /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out    Fri Apr  6 15:55:08 2018
--- /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out    Fri Apr  6 16:01:40 2018
***************
*** 1348,1357 ****
  ----------+----+-----
   hp0      |    | 
   hp0      |  1 | 
!  hp0      |  1 | xxx
   hp3      | 10 | yyy
!  hp1      |    | xxx
!  hp2      | 10 | xxx
  (6 rows)
  
  -- partial keys won't prune, nor would non-equality conditions
--- 1348,1357 ----
  ----------+----+-----
   hp0      |    | 
   hp0      |  1 | 
!  hp0      | 10 | xxx
!  hp3      |    | xxx
   hp3      | 10 | yyy
!  hp2      |  1 | xxx
  (6 rows)
  
  -- partial keys won't prune, nor would non-equality conditions
***************
*** 1460,1466 ****
                     QUERY PLAN                    
  -------------------------------------------------
   Append
!    ->  Seq Scan on hp0
           Filter: ((a = 1) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1460,1466 ----
                     QUERY PLAN                    
  -------------------------------------------------
   Append
!    ->  Seq Scan on hp2
           Filter: ((a = 1) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1468,1474 ****
                       QUERY PLAN                      
  -----------------------------------------------------
   Append
!    ->  Seq Scan on hp1
           Filter: ((a IS NULL) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1468,1474 ----
                       QUERY PLAN                      
  -----------------------------------------------------
   Append
!    ->  Seq Scan on hp3
           Filter: ((a IS NULL) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1476,1482 ****
                      QUERY PLAN                    
  --------------------------------------------------
   Append
!    ->  Seq Scan on hp2
           Filter: ((a = 10) AND (b = 'xxx'::text))
  (3 rows)
  
--- 1476,1482 ----
                      QUERY PLAN                    
  --------------------------------------------------
   Append
!    ->  Seq Scan on hp0
           Filter: ((a = 10) AND (b = 'xxx'::text))
  (3 rows)
  
***************
*** 1494,1504 ****
   Append
     ->  Seq Scan on hp0
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
-    ->  Seq Scan on hp2
-          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
     ->  Seq Scan on hp3
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
! (7 rows)
  
  -- hash partitiong pruning doesn't occur with <> operator clauses
  explain (costs off) select * from hp where a <> 1 and b <> 'xxx';
--- 1494,1502 ----
   Append
     ->  Seq Scan on hp0
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
     ->  Seq Scan on hp3
           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS
NULL)))
! (5 rows)
  
  -- hash partitiong pruning doesn't occur with <> operator clauses
  explain (costs off) select * from hp where a <> 1 and b <> 'xxx';


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
On 2018-04-06 17:28:00 -0300, Alvaro Herrera wrote:
> So I pushed this 25 minutes ago, and already there's a couple of
> buildfarm members complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07
> 
> Both show exactly the same diff in test partition_prune:

There's also
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01
*** /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/expected/misc.out    2018-02-20 18:45:02.068665297 -0800
--- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/results/misc.out    2018-04-06 13:55:50.718253850 -0700
***************
*** 32,40 ****
  (6 rows)
  
  SELECT * FROM t1p WHERE o > 50 AND p like '%64%';
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
- LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_table name="public.t1p"
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column o"
  LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column p"
--- 32,37 ----

seems you just need to remove those rows from the expected file.

- Andres


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I don't actually like very much the idea of putting all this code in
> optimizer/util.  This morning it occurred to me that we should create a new
> src/backend/partitioning/ (and a src/include/partitioning/ to go with
> it) and drop a bunch of files there.  Even your proposed new partcache.c
> will seem misplaced *anywhere*, since it contains support code to be
> used by both planner and executor; in src/{backend,include}/partitioning
> it will be able to serve both without it being a modularity wart.

Uh, what?

Surely partcache.c is correctly placed next to relcache.c and
syscache.c and everything else in src/backend/utils/cache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > I don't actually like very much the idea of putting all this code in
> > optimizer/util.  This morning it occurred to me that we should create a new
> > src/backend/partitioning/ (and a src/include/partitioning/ to go with
> > it) and drop a bunch of files there.  Even your proposed new partcache.c
> > will seem misplaced *anywhere*, since it contains support code to be
> > used by both planner and executor; in src/{backend,include}/partitioning
> > it will be able to serve both without it being a modularity wart.
> 
> Uh, what?
> 
> Surely partcache.c is correctly placed next to relcache.c and
> syscache.c and everything else in src/backend/utils/cache.

Frankly, I'm not really sure about partcache.c yet.  Are you?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thank you Alvaro for rest of the cleanup and committing.

On Sat, Apr 7, 2018 at 5:28 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So I pushed this 25 minutes ago, and already there's a couple of
> buildfarm members complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07
>
> Both show exactly the same diff in test partition_prune:
>
> *** /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out Fri Apr  6 15:55:08 2018
> --- /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out   Fri Apr  6 16:01:40 2018
> ***************
> *** 1348,1357 ****
>   ----------+----+-----
>    hp0      |    |
>    hp0      |  1 |
> !  hp0      |  1 | xxx
>    hp3      | 10 | yyy
> !  hp1      |    | xxx
> !  hp2      | 10 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> --- 1348,1357 ----
>   ----------+----+-----
>    hp0      |    |
>    hp0      |  1 |
> !  hp0      | 10 | xxx
> !  hp3      |    | xxx
>    hp3      | 10 | yyy
> !  hp2      |  1 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> ***************
> *** 1460,1466 ****
>                      QUERY PLAN
>   -------------------------------------------------
>    Append
> !    ->  Seq Scan on hp0
>            Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1460,1466 ----
>                      QUERY PLAN
>   -------------------------------------------------
>    Append
> !    ->  Seq Scan on hp2
>            Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1468,1474 ****
>                        QUERY PLAN
>   -----------------------------------------------------
>    Append
> !    ->  Seq Scan on hp1
>            Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1468,1474 ----
>                        QUERY PLAN
>   -----------------------------------------------------
>    Append
> !    ->  Seq Scan on hp3
>            Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1476,1482 ****
>                       QUERY PLAN
>   --------------------------------------------------
>    Append
> !    ->  Seq Scan on hp2
>            Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1476,1482 ----
>                       QUERY PLAN
>   --------------------------------------------------
>    Append
> !    ->  Seq Scan on hp0
>            Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***************
> *** 1494,1504 ****
>    Append
>      ->  Seq Scan on hp0
>           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> -    ->  Seq Scan on hp2
> -          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>      ->  Seq Scan on hp3
>           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> ! (7 rows)
>
>   -- hash partitiong pruning doesn't occur with <> operator clauses
>   explain (costs off) select * from hp where a <> 1 and b <> 'xxx';
> --- 1494,1502 ----
>    Append
>      ->  Seq Scan on hp0
>           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>      ->  Seq Scan on hp3
>           Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> ! (5 rows)

So this same failure occurs on (noting the architecture):

ppc64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52

ia64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07

Seems to be because the hashing function used in partitioning gives a
different answer for a given set of partition key values on these
machines than it does on others.
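To illustrate the effect (a toy Python sketch only, not PostgreSQL's actual partition-hashing code): a hash that consumes a value's raw in-memory bytes is byte-order sensitive, so the same key can be routed to a different partition on a big-endian machine than on a little-endian one.

```python
import hashlib

# Toy model only -- NOT PostgreSQL's hashing code.  toy_hash consumes the
# raw byte representation of an integer, so its result depends on the
# byte order used to store the value.
def toy_hash(value: int, byteorder: str) -> int:
    raw = value.to_bytes(4, byteorder)  # simulate the in-memory bytes
    return int.from_bytes(hashlib.sha256(raw).digest()[:8], "little")

def partition_for(value: int, nparts: int, byteorder: str) -> int:
    return toy_hash(value, byteorder) % nparts

# Only the bytes matter: the integer 1 stored big-endian has the same
# four bytes (00 00 00 01) as 16777216 (0x01000000) stored little-endian.
assert toy_hash(1, "big") == toy_hash(16777216, "little")

for v in (1, 10):
    print(v, partition_for(v, 4, "little"), partition_for(v, 4, "big"))
```

Swap the byte order and the partition assignment for the same key may change, which is exactly the shape of the buildfarm diffs above.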

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
> Thank you, Alvaro, for the rest of the cleanup and for committing.

+10!

> So this same failure occurs on (noting the architecture):
>
> ppc64:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
>
> ia64:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24
>
> ppc64 (POWER7):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13
>
> ppc64 (POWER7):
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13
>
> powerpc:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08
>
> powerpc:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23
>
> powerpc 32-bit userspace on ppc64 host:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07
>
> Seems to be because the hashing function used in partitioning gives a
> different answer for a given set of partition key values on these
> machines than it does on others.

They all look like bigendian CPUs.

https://en.wikipedia.org/wiki/Comparison_of_instruction_set_architectures#Endianness

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
>> So this same failure occurs on (noting the architecture):
>>
>> Seems to be because the hashing function used in partitioning gives a
>> different answer for a given set of partition key values on these
>> machines than it does on others.
>
> They all look like bigendian CPUs.

I looked at all the regression test diffs for each of the servers you
mentioned and I verified that the diffs match on each of the 7
servers.

Maybe the best solution is to pull those tests out of
partition_prune.sql then create partition_prune_hash and just have an
alternative .out file with the partitions which match on bigendian
machines.

We could also keep them in the same file, but that's a much bigger
alternative file to maintain and more likely to get broken if someone
forgets to update it.

What do you think?


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote:
>>> So this same failure occurs on (noting the architecture):
>>>
>>> Seems to be because the hashing function used in partitioning gives a
>>> different answer for a given set of partition key values on these
>>> machines than it does on others.
>>
>> They all look like bigendian CPUs.
>
> I looked at all the regression test diffs for each of the servers you
> mentioned and I verified that the diffs match on each of the 7
> servers.
>
> Maybe the best solution is to pull those tests out of
> partition_prune.sql then create partition_prune_hash and just have an
> alternative .out file with the partitions which match on bigendian
> machines.
>
> We could also keep them in the same file, but that's a much bigger
> alternative file to maintain and more likely to get broken if someone
> forgets to update it.
>
> What do you think?

Yeah, that's an idea.

Is it alright though that the same data may end up in different hash
partitions depending on the architecture?  IIRC, that's the way we
decided to go with hash partitioning, but it would've been clearer if
there were already some evidence in the regression tests that that's
what we've chosen, such as some existing tests for tuple routing.

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
>> I looked at all the regression test diffs for each of the servers you
>> mentioned and I verified that the diffs match on each of the 7
>> servers.
>>
>> Maybe the best solution is to pull those tests out of
>> partition_prune.sql then create partition_prune_hash and just have an
>> alternative .out file with the partitions which match on bigendian
>> machines.
>>
>> We could also keep them in the same file, but that's a much bigger
>> alternative file to maintain and more likely to get broken if someone
>> forgets to update it.
>>
>> What do you think?
>
> Yeah, that's an idea.
>
> Is it alright though that same data may end up in different hash
> partitions depending on the architecture?  IIRC, that's the way we
> decided to go when using hash partitioning, but it would've been
> clearer if there was already some evidence in regression tests that
> that's what we've chosen, such as, some existing tests for tuple
> routing.

The only alternative would be to change all the hash functions so that
they normalise their endianness. It does not sound like something that
will perform very well. Plus it would break everyone's hash indexes on
a pg_upgrade.
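Concretely, such normalisation would mean swapping every key into one canonical byte order before hashing. A sketch (with FNV-1a standing in for the real hash function) of why it is both a per-tuple cost and an on-disk compatibility break on big-endian hosts:

```python
# Sketch only: normalise the key to a canonical (little-endian) byte
# order before hashing, so all architectures compute the same value.
# FNV-1a here is just a stand-in for the real hash function.
def canonical_hash(value: int) -> int:
    raw = value.to_bytes(8, "little", signed=True)  # byte-swap on BE hosts
    h = 14695981039346656037
    for b in raw:
        h = ((h ^ b) * 1099511628211) % 2**64
    return h

# Portable: the result no longer depends on host byte order...
assert canonical_hash(42) == canonical_hash(42)
# ...but on a big-endian machine it differs from what hashing the native
# representation produced, so existing hash indexes and partition
# assignments there would no longer match after a pg_upgrade.
```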

pg_basebackups can't be transferred over to other architectures
anyway, so I'm not so worried about tuples being routed to other
partitions.

Maybe someone else can see a reason why this is bad?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Maybe the best solution is to pull those tests out of
> partition_prune.sql then create partition_prune_hash and just have an
> alternative .out file with the partitions which match on bigendian
> machines.

Here's 1 of 2. I thought it was best to get the buildfarm green again
as soon as possible.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
On 2018-04-07 14:42:53 +1200, David Rowley wrote:
> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> > Maybe the best solution is to pull those tests out of
> > partition_prune.sql then create partition_prune_hash and just have an
> > alternative .out file with the partitions which match on bigendian
> > machines.
> 
> Here's 1 of 2. I thought it was best to get the buildfarm green again
> as soon as possible.

Do you have an estimate how long it'll take you to produce patch 2? It'd
be cool to get this covered again soon.  If you don't have access to a
big endian machine, we can construct the output from the buildfarm... So
pulling the tests out would be the only "urgent" thing, I can go on from
there.

- Andres


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote:
>> On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
>>> I looked at all the regression test diffs for each of the servers you
>>> mentioned and I verified that the diffs match on each of the 7
>>> servers.
>>>
>>> Maybe the best solution is to pull those tests out of
>>> partition_prune.sql then create partition_prune_hash and just have an
>>> alternative .out file with the partitions which match on bigendian
>>> machines.
>>>
>>> We could also keep them in the same file, but that's a much bigger
>>> alternative file to maintain and more likely to get broken if someone
>>> forgets to update it.
>>>
>>> What do you think?
>>
>> Yeah, that's an idea.
>>
>> Is it alright though that same data may end up in different hash
>> partitions depending on the architecture?  IIRC, that's the way we
>> decided to go when using hash partitioning, but it would've been
>> clearer if there was already some evidence in regression tests that
>> that's what we've chosen, such as, some existing tests for tuple
>> routing.
>
> The only alternative would be to change all the hash functions so that
> they normalise their endianness. It does not sound like something that
> will perform very well. Plus it would break everyone's hash indexes on
> a pg_upgrade.
>
> pg_basebackups can't be transferred over to other architectures
> anyway, so I'm not so worried about tuples being routed to other
> partitions.
>
> Maybe someone else can see a reason why this is bad?

I don't think the concept is bad by itself.  That's expected; in fact,
we have added an option to pg_dump (dump through parent or some such)
to handle exactly this case.  What Amit seems to be complaining about,
though, is the regression test.  We need to write regression tests so
that they produce the same plans, pruning the same partitions by name,
on all architectures.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-07 14:42:53 +1200, David Rowley wrote:
>> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> > Maybe the best solution is to pull those tests out of
>> > partition_prune.sql then create partition_prune_hash and just have an
>> > alternative .out file with the partitions which match on bigendian
>> > machines.
>>
>> Here's 1 of 2. I thought it was best to get the buildfarm green again
>> as soon as possible.
>
> Do you have an estimate how long it'll take you to produce patch 2? It'd
> be cool to get this covered again soon.  If you don't have access to a
> big endian machine, we can construct the output from the buildfarm... So
> pulling the tests out would be the only "urgent" thing, I can go on from
> there.

Attached.

I've not tested on a bigendian machine, but the diff -c between the
two output files match the diff on the failing buildfarm members.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:03, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
>> The only alternative would be to change all the hash functions so that
>> they normalise their endianness. It does not sound like something that
>> will perform very well. Plus it would break everyone's hash indexes on
>> a pg_upgrade.
>>
>> pg_basebackups can't be transferred over to other architectures
>> anyway, so I'm not so worried about tuples being routed to other
>> partitions.
>>
>> Maybe someone else can see a reason why this is bad?
>
> I don't think the concept is bad by itself. That's expected, in fact,
> we have added an option to pg_dump (dump through parent or some such)
> to handle exactly this case. What Amit seems to be complaining though
> is the regression test. We need to write regression tests so that they
> produce the same plans, pruning same partitions by name, on all
> architectures.

Why is writing tests that produce the same output required?

We have many tests with alternative outputs. Look in
src/tests/regress/expected for files matching _1.out

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 15:03, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Sat, Apr 7, 2018 at 7:25 AM, David Rowley
>>> The only alternative would be to change all the hash functions so that
>>> they normalise their endianness. It does not sound like something that
>>> will perform very well. Plus it would break everyone's hash indexes on
>>> a pg_upgrade.
>>>
>>> pg_basebackups can't be transferred over to other architectures
>>> anyway, so I'm not so worried about tuples being routed to other
>>> partitions.
>>>
>>> Maybe someone else can see a reason why this is bad?
>>
>> I don't think the concept is bad by itself. That's expected, in fact,
>> we have added an option to pg_dump (dump through parent or some such)
>> to handle exactly this case. What Amit seems to be complaining though
>> is the regression test. We need to write regression tests so that they
>> produce the same plans, pruning same partitions by name, on all
>> architectures.
>
> Why is writing tests that produce the same output required?
>
> We have many tests with alternative outputs. Look in
> src/tests/regress/expected for files matching _1.out
>

That's true, but we usually add such alternative outputs when we know
all the possible variants, and only as long as "all the variants" don't
amount to every conceivable output.  AFAIU, that's not true here.
Also, on a given machine a particular row is guaranteed to fall in a
given partition; on a different machine it may fall in some other
partition, but always the same partition on that machine.  We don't
have a way to select an alternative output based on the architecture.
Maybe a better idea is to use a .source file, creating the .out on the
fly based on the machine's architecture: test the hash output for a
given value to decide which partition it will fall into, and then
craft the .out with that partition's name.
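The generation step could look something like this (hypothetical sketch; `my_hash` is a stand-in for computing the partitioning hash on the build machine, e.g. via a query against a scratch database):

```python
# Hypothetical sketch of the .source idea: at test-build time, compute
# which partition each key falls into ON THIS MACHINE and substitute the
# partition's name into a template expected file.
def my_hash(key):                       # stand-in for the real hash
    return key * 2654435761 % 2**32

def expected_partition(key, nparts):
    return "hp%d" % (my_hash(key) % nparts)

template = "   ->  Seq Scan on @PART_A_EQ_1@\n"
out = template.replace("@PART_A_EQ_1@", expected_partition(1, 4))
print(out, end="")
```

The resulting .out is then guaranteed to name whatever partition the tuple actually routes to on that architecture.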

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
On 2018-04-07 15:04:37 +1200, David Rowley wrote:
> On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-04-07 14:42:53 +1200, David Rowley wrote:
> >> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote:
> >> > Maybe the best solution is to pull those tests out of
> >> > partition_prune.sql then create partition_prune_hash and just have an
> >> > alternative .out file with the partitions which match on bigendian
> >> > machines.
> >>
> >> Here's 1 of 2. I thought it was best to get the buildfarm green again
> >> as soon as possible.
> >
> > Do you have an estimate how long it'll take you to produce patch 2? It'd
> > be cool to get this covered again soon.  If you don't have access to a
> > big endian machine, we can construct the output from the buildfarm... So
> > pulling the tests out would be the only "urgent" thing, I can go on from
> > there.
> 
> Attached.
> 
> I've not tested on a bigendian machine, but the diff -c between the
> two output files match the diff on the failing buildfarm members.

I've pushed the two patches (collapsed). Trying to get the BF green-ish
again...

- Andres


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote:
> I've pushed the two patches (collapsed). Trying to get the BF green-ish
> again...

Thanks!


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:14, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
>> Why is writing tests that produce the same output required?
>>
>> We have many tests with alternative outputs. Look in
>> src/tests/regress/expected for files matching _1.out
>>
>
> That's true, but we usually add such alternative output when we know
> all the variants possible as long as "all the variants" do not cover
> everything possible. AFAIU, that's not true here. Also, on a given
> machine a particular row is guaranteed to fall in a given partition.
> On a different machine it will fall in some other partition, but
> always that partition on that machine. We don't have a way to select
> alternate output based on the architecture. May be a better idea is to
> use .source file, creating .out on the fly based on the architecture
> of machine like testing the hash output for a given value to decide
> which partition it will fall into and then crafting .out with that
> partition's name.

Sounds like you're saying that if we have too many alternative files
then there's a chance that one could pass by luck.

Maybe we can just back up what's just been committed by actually
executing the queries and ensuring that all rows that made it into the
table make it back out again.
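A sketch of that round-trip invariant (hypothetical Python; the lambdas stand in for architecture-dependent hash functions): whichever hash is in use, pruning must look in exactly the partition routing chose, so every inserted row is found again.

```python
# Hypothetical sketch: the invariant worth testing is not WHICH partition
# a row lands in (that varies with the platform's hash), but that pruning
# consults the same partition that tuple routing chose.
def route(key, nparts, hashfn):
    return hashfn(key) % nparts

def prune_and_lookup(table, key, nparts, hashfn):
    # "Pruning": scan only the single partition routing would pick.
    return key in table[route(key, nparts, hashfn)]

nparts = 4
keys = list(range(20))
# Two stand-ins for architecture-dependent hash functions.
for hashfn in (lambda k: k * 2654435761 % 2**32,
               lambda k: (k ^ 0xBEEF) * 40503 % 2**16):
    table = {p: [] for p in range(nparts)}
    for k in keys:
        table[route(k, nparts, hashfn)].append(k)   # tuple routing
    # Every row that went in comes back out, whatever the hash was.
    assert all(prune_and_lookup(table, k, nparts, hashfn) for k in keys)
```

Such a test passes on every architecture without any alternative output files, because it never names a specific partition.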

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Sat, Apr 7, 2018 at 8:55 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 April 2018 at 15:14, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Sat, Apr 7, 2018 at 8:37 AM, David Rowley
>>> Why is writing tests that produce the same output required?
>>>
>>> We have many tests with alternative outputs. Look in
>>> src/tests/regress/expected for files matching _1.out
>>>
>>
>> That's true, but we usually add such alternative output when we know
>> all the variants possible as long as "all the variants" do not cover
>> everything possible. AFAIU, that's not true here. Also, on a given
>> machine a particular row is guaranteed to fall in a given partition.
>> On a different machine it will fall in some other partition, but
>> always that partition on that machine. We don't have a way to select
>> alternate output based on the architecture. May be a better idea is to
>> use .source file, creating .out on the fly based on the architecture
>> of machine like testing the hash output for a given value to decide
>> which partition it will fall into and then crafting .out with that
>> partition's name.
>
> Sounds like you're saying that if we have too many alternative files
> then there's a chance that one could pass by luck.

Yes.

>
> Maybe we can just back up what's just been committed by actually
> executing the queries and ensuring that all rows that made it into the
> table make it back out again.

That's one way.  But how would we make sure that they landed in the
proper partition?  Actually, we do not know what the proper partition
is for a given architecture.  And how would we make sure that all rows
with the same partition key land in the same partition?  That's why I
am suggesting calculating the hash value, taking the modulo, and
crafting the name of the partition where the corresponding row will
land on a given architecture.  That way, we are sure that the tuple
routing logic is correct and also the partition pruning logic.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> Sounds like you're saying that if we have too many alternative files
> then there's a chance that one could pass by luck.

Yeah, exactly: it passed, but did it pass for the right reason?

If there's just two expected-files, it's likely not a big problem,
but if you have a bunch it's something to worry about.

I'm also wondering how come we had hash partitioning before and
did not have this sort of problem.  Is it just that we added a
new test that's more sensitive to the details of the hashing
(if so, could it be made less so)?  Or is there actually more
platform dependence now than before (and if so, why is that)?

            regards, tom lane


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

We didn't prune HASH partitions before today. They were just all returned.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
Hi,

On 2018-04-06 23:41:22 -0400, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > Sounds like you're saying that if we have too many alternative files
> > then there's a chance that one could pass by luck.
> 
> Yeah, exactly: it passed, but did it pass for the right reason?
> 
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.

There should be only two alternatives, given our current hashing
implementation, right?

Greetings,

Andres Freund


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.

Right, I suggest we wait and see if all members go green again as a
result of 40e42e1024c, and if they're happy then we could maybe leave
it as is with the 2 alternatives output files.

If there are some other variations that crop up, then we can think
harder about what we can do to improve the coverage.

I don't particularly think it matters which hash partition a tuple
goes into, as long as the hash function spreads the values out enough
and most importantly, the pruning code looks for the tuple in the
partition that it was actually inserted into in the first place.
Obviously, we also want to ensure we never do anything which would
change the matching partition in either minor or major version
upgrades too.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
Hi,

On 2018-04-07 15:49:54 +1200, David Rowley wrote:
> Right, I suggest we wait and see if all members go green again as a
> result of 40e42e1024c, and if they're happy then we could maybe leave
> it as is with the 2 alternatives output files.

At least the first previously borked animal came back green (termite).


> I don't particularly think it matters which hash partition a tuple
> goes into, as long as the hash function spreads the values out enough
> and most importantly, the pruning code looks for the tuple in the
> partition that it was actually inserted into in the first place.
> Obviously, we also want to ensure we never do anything which would
> change the matching partition in either minor or major version
> upgrades too.

+1


I've also attempted to fix rhinoceros's failure I remarked upon a couple
hours ago in
https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Greetings,

Andres Freund


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 09:03, Andres Freund <andres@anarazel.de> wrote:
> There's also
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01
> *** /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/expected/misc.out       2018-02-20 18:45:02.068665297 -0800
> --- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/results/misc.out        2018-04-06 13:55:50.718253850 -0700
> ***************
> *** 32,40 ****
>   (6 rows)
>
>   SELECT * FROM t1p WHERE o > 50 AND p like '%64%';
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
> - LOG:  SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)"
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_table name="public.t1p"
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column o"
>   LOG:  SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column p"
> --- 32,37 ----
>
> seems you just need to remove those rows from the expected file.

Agreed.

Patch attached.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 16:09, Andres Freund <andres@anarazel.de> wrote:
> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Oh, thanks!

I had just been looking at that too...

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote:
> I've pushed the two patches (collapsed). Trying to get the BF green-ish
> again...

termite has now gone green.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:09 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-04-07 15:49:54 +1200, David Rowley wrote:
>> Right, I suggest we wait and see if all members go green again as a
>> result of 40e42e1024c, and if they're happy then we could maybe leave
>> it as is with the 2 alternatives output files.
>
> At least the first previously borked animal came back green (termite).
>
>
>> I don't particularly think it matters which hash partition a tuple
>> goes into, as long as the hash function spreads the values out enough
>> and most importantly, the pruning code looks for the tuple in the
>> partition that it was actually inserted into in the first place.
>> Obviously, we also want to ensure we never do anything which would
>> change the matching partition in either minor or major version
>> upgrades too.
>
> +1

+1

Given that the difference only appeared on animals that David pointed
out have big-endian architecture, it seems we'd only need two output
files.  It does seem true that the extended hashing functions that
were added to support partitioning are affected by endianness.

Thank you David for creating the patches and Andres for committing them.
Buildfarm seems to be turning green where it had gone red due to the
hashing differences.

> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

Thanks Andres.

Regards,
Amit


Re: [HACKERS] path toward faster partition pruning

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> Given that the difference only appeared on animals that David pointed
> out have big-endian architecture, it seems we'd only need two output
> files.

Dunno, I'm wondering whether 32 vs 64 bit will make a difference.

            regards, tom lane


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On Sat, Apr 7, 2018 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
>> Given that the difference only appeared on animals that David pointed
>> out have big-endian architecture, it seems we'd only need two output
>> files.
>
> Dunno, I'm wondering whether 32 vs 64 bit will make a difference.

There was one 32-bit animal in the failing set (termite), and it
apparently produces the same hashes as the other failing, big-endian
animals, which suggests endianness rather than word size is the
deciding factor.

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07

...and it has turned green since the alternative outputs fix went in.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-07%2004%3A06%3A09

Thanks,
Amit


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Andres Freund wrote:
> Hi,
> 
> On 2018-04-07 15:49:54 +1200, David Rowley wrote:
> > Right, I suggest we wait and see if all members go green again as a
> > result of 40e42e1024c, and if they're happy then we could maybe leave
> > it as is with the 2 alternatives output files.
> 
> At least the first previously borked animal came back green (termite).

Thanks everyone for addressing this.

> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de

And this, too.  I was unsure if this was because we were missing calling
some object access hook from the new code, or the additional pruning.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Andres Freund
Date:
On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > I've also attempted to fix rhinoceros's failure I remarked upon a couple
> > hours ago in
> > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de
> 
> And this, too.  I was unsure if this was because we were missing calling
> some object access hook from the new code, or the additional pruning.

That's possible.  I did attempt to skim the code; that's where my
complaint about the docs originated.  There certainly isn't an
InvokeFunctionExecuteHook() present.  It's not clear to me whether
that's an issue - we don't invoke the hooks in a significant number of
places either, and as far as I can discern there's not much rhyme or
reason about where we invoke it.

Greetings,

Andres Freund


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > I've also attempted to fix rhinoceros's failure I remarked upon a couple
> > > hours ago in
> > > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de
> > 
> > And this, too.  I was unsure if this was because we were missing calling
> > some object access hook from the new code, or the additional pruning.
> 
> That's possible.  I did attempt to skim the code, that's where my
> complain about the docs originated. There certainly isn't an
> InvokeFunctionExecuteHook() present.  It's not clear to me whether
> that's an issue - we don't invoke the hooks in a significant number of
> places either, and as far as I can discern there's not much rule or
> reason about where we invoke it.

I managed to convince myself that it's not higher-level code's
responsibility to invoke the execute hooks; the likelihood of bugs of
omission seems just too large.  But maybe I'm wrong.

There's a small number of InvokeFunctionExecuteHook calls in the
executor, but I really doubt that it exhaustively covers everyplace
where catalogued functions are called in the executor.

CC'ing KaiGai and Stephen Frost; they may want to chip in here.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
While looking at the docs in [1], I saw that we still mention:

4. Ensure that the constraint_exclusion configuration parameter is not
disabled in postgresql.conf. If it is, queries will not be optimized
as desired.

This is no longer true. The attached patch removes it.

[1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

On 2018/04/09 12:48, David Rowley wrote:
> While looking at the docs in [1], I saw that we still mention:
> 
> 4. Ensure that the constraint_exclusion configuration parameter is not
> disabled in postgresql.conf. If it is, queries will not be optimized
> as desired.
> 
> This is no longer true. The attached patch removed it.
> 
> [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html

Thanks.  I was aware of the changes that would need to be made, but you
beat me to writing the patch itself.

About the patch:

While users no longer need to enable constraint_exclusion for select
queries, one would still need it for update/delete queries, because the
new pruning logic is only invoked for the former.  Alas...

Also, further down on that page, there is a 5.10.4 Partitioning and
Constraint Exclusion sub-section.  I think it would also need some tweaks
due to new developments.

I updated your patch to fix that.  Please take a look.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> Sounds like you're saying that if we have too many alternative files
>> then there's a chance that one could pass by luck.
>
> Yeah, exactly: it passed, but did it pass for the right reason?
>
> If there's just two expected-files, it's likely not a big problem,
> but if you have a bunch it's something to worry about.
>
> I'm also wondering how come we had hash partitioning before and
> did not have this sort of problem.  Is it just that we added a
> new test that's more sensitive to the details of the hashing
> (if so, could it be made less so)?  Or is there actually more
> platform dependence now than before (and if so, why is that)?

The existing hash partitioning tests did have some dependencies on the
hash function, but they took care not to use the built-in hash
functions.  Instead they did stuff like this:

CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
CREATE TABLE mchash (a int, b text, c jsonb)
  PARTITION BY HASH (a test_int4_ops, b test_text_ops);

I think that this approach should also be used for the new tests.
Variant expected output files are a pain to maintain, and you
basically just have to take whatever output you get as the right
answer, because nobody knows what output a certain built-in hash
function should produce for a given input except by running the code.
If you do the kind of thing shown above, though, then you can easily
see by inspection that you're getting the right answer.
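
Robert's point can be sketched with a simplified, hypothetical model of
hash partition routing (helper names invented here; real PostgreSQL
routing also folds in a seed and combines per-column hashes): with a
noop hash, the target partition of any value is obvious by inspection.

```python
def identity_hash(value, seed=0):
    # Stand-in for the hashint4_noop SQL function quoted above:
    # coalesce the value to 0 and ignore the seed entirely.
    return 0 if value is None else int(value)

def route_row(value, nparts):
    # Simplified single-column routing: the row lands in partition
    # (hash mod nparts).  With an identity hash, value v goes to
    # partition v % nparts, so the expected output of a pruning test
    # can be verified by eye instead of by trusting a built-in hash.
    return identity_hash(value) % nparts

# Predictable by inspection, on any platform:
assert route_row(0, 4) == 0
assert route_row(5, 4) == 1
assert route_row(None, 4) == 0
```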

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> Sounds like you're saying that if we have too many alternative files
>>> then there's a chance that one could pass by luck.
>>
>> Yeah, exactly: it passed, but did it pass for the right reason?
>>
>> If there's just two expected-files, it's likely not a big problem,
>> but if you have a bunch it's something to worry about.
>>
>> I'm also wondering how come we had hash partitioning before and
>> did not have this sort of problem.  Is it just that we added a
>> new test that's more sensitive to the details of the hashing
>> (if so, could it be made less so)?  Or is there actually more
>> platform dependence now than before (and if so, why is that)?
>
> The existing hash partitioning tests did have some dependencies on the
> hash function, but they took care not to use the built-in hash
> functions.  Instead they did stuff like this:
>
> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
> CREATE TABLE mchash (a int, b text, c jsonb)
>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>
> I think that this approach should also be used for the new tests.
> Variant expected output files are a pain to maintain, and you
> basically just have to take whatever output you get as the right
> answer, because nobody knows what output a certain built-in hash
> function should produce for a given input except by running the code.
> If you do the kind of thing shown above, though, then you can easily
> see by inspection that you're getting the right answer.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/10 13:27, Ashutosh Bapat wrote:
> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>>> Sounds like you're saying that if we have too many alternative files
>>>> then there's a chance that one could pass by luck.
>>>
>>> Yeah, exactly: it passed, but did it pass for the right reason?
>>>
>>> If there's just two expected-files, it's likely not a big problem,
>>> but if you have a bunch it's something to worry about.
>>>
>>> I'm also wondering how come we had hash partitioning before and
>>> did not have this sort of problem.  Is it just that we added a
>>> new test that's more sensitive to the details of the hashing
>>> (if so, could it be made less so)?  Or is there actually more
>>> platform dependence now than before (and if so, why is that)?
>>
>> The existing hash partitioning tests did have some dependencies on the
>> hash function, but they took care not to use the built-in hash
>> functions.  Instead they did stuff like this:
>>
>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>> CREATE TABLE mchash (a int, b text, c jsonb)
>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> I think that this approach should also be used for the new tests.
>> Variant expected output files are a pain to maintain, and you
>> basically just have to take whatever output you get as the right
>> answer, because nobody knows what output a certain built-in hash
>> function should produce for a given input except by running the code.
>> If you do the kind of thing shown above, though, then you can easily
>> see by inspection that you're getting the right answer.

Thanks for the idea.  I think it makes sense and also agree that alternate
outputs approach is not perfectly reliable and maintainable.

> +1.

Attached find a patch that rewrites the hash partition pruning tests that
way.  It creates two hash operator classes, one for int4 and another for
the text type, and uses them to create the hash-partitioned table used in
the tests, as done in the existing tests in hash_part.sql.  Since that
(hopefully) makes the tests reliably return the same result every time, I
no longer see the need to keep them in a separate
partition_prune_hash.sql.  The reasoning behind having the separate file
was to keep the alternative output file small, as David explained in [1].

However, I noticed that there is a bug in RelationBuildPartitionKey that
causes a crash when using a SQL function as the partition support
function, as the revised tests do, so please refer to and apply the
patches I posted here before running the revised tests:

https://www.postgresql.org/message-id/3041e853-b1dd-a0c6-ff21-7cc5633bffd0%40lab.ntt.co.jp

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>
> Thanks for the idea.  I think it makes sense and also agree that alternate
> outputs approach is not perfectly reliable and maintainable.
>
>> +1.
>
> Attached find a patch that rewrites hash partition pruning tests that
> away.  It creates two hash operator classes, one for int4 and another for
> text type and uses them to create hash partitioned table to be used in the
> tests, like done in the existing tests in hash_part.sql.  Since that makes
> tests (hopefully) reliably return the same result always, I no longer see
> the need to keep them in a separate partition_prune_hash.sql.  The
> reasoning behind having the separate file was to keep the alternative
> output file small as David explained in [1].
> [1]
> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com

I had a quick look, but I'm still confused about why a function like
hash_uint32_extended() is susceptible to varying results depending on
CPU endianness but hash_combine64 is not.

Apart from that confusion, looking at the patch:

+CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
+$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
+CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
+CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
+$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;


Why coalesce here? Maybe I've not thought of something, but coalesce
only seems useful to me if there's > 1 argument. Plus the function is
strict, so I'm not sure it's really doing anything even if you added a
default.

I know this one was there before, but I only just noticed it:

+-- pruning should work if non-null values are provided for all the keys
+explain (costs off) select * from hp where a is null and b is null;

The comment is a bit misleading given the first test below it is
testing for nulls. Maybe it can be changed to

+-- pruning should work if values or is null clauses are provided for
all partition keys.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> Thanks for the idea.  I think it makes sense and also agree that alternate
>> outputs approach is not perfectly reliable and maintainable.
>>
>>> +1.
>>
>> Attached find a patch that rewrites hash partition pruning tests that
>> away.  It creates two hash operator classes, one for int4 and another for
>> text type and uses them to create hash partitioned table to be used in the
>> tests, like done in the existing tests in hash_part.sql.  Since that makes
>> tests (hopefully) reliably return the same result always, I no longer see
>> the need to keep them in a separate partition_prune_hash.sql.  The
>> reasoning behind having the separate file was to keep the alternative
>> output file small as David explained in [1].
>> [1]
>> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
>
> I had a quick look, but I'm still confused about why a function like
> hash_uint32_extended() is susceptible to varying results depending on
> CPU endianness but hash_combine64 is not.
>
> Apart from that confusion, looking at the patch:
>
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>
>
> Why coalesce here? Maybe I've not thought of something, but coalesce
> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing even if you added a default.

I think Amit Langote wanted to write coalesce($1, $2), $2 being the
seed for the hash function.  See how the hash operator class functions
are defined in sql/insert.sql.  Maybe we should just use the same
functions or even the same tables.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the comment.

On 2018/04/10 21:11, Ashutosh Bapat wrote:
> On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Apart from that confusion, looking at the patch:
>>
>> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
>> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
>> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
>> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
>> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>>
>>
>> Why coalesce here? Maybe I've not thought of something, but coalesce
>> only seems useful to me if there's > 1 argument. Plus the function is
>> strict, so not sure it's really doing even if you added a default.
> 
> I think Amit Langote wanted to write coalesce($1, $2), $2 being the
> seed for hash function. See how hash operator class functions are
> defined in sql/insert.sql.

Actually, I referenced functions and operator classes defined in
hash_part.sql, not insert.sql.  Although as you point out, I didn't think
very hard about the significance of $2 passed to coalesce in those
functions.  I will fix that and add it back, along with some other changes
that makes them almost identical with definitions in hash_part.sql.

> May be we should just use the same
> functions or even the same tables.

Because hash_part.sql and partition_prune.sql tests run in parallel, I've
decided to rename the functions, operator classes, and the tables in
partition_prune.sql.  It seems like a good idea in any case.  Also, since
the existing pruning tests were written with that table, I decided not to
change that.

Will post an updated patch after addressing David's comment.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thanks for the review.

On 2018/04/10 21:02, David Rowley wrote:
> On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> Thanks for the idea.  I think it makes sense and also agree that alternate
>> outputs approach is not perfectly reliable and maintainable.
>>
>>> +1.
>>
>> Attached find a patch that rewrites hash partition pruning tests that
>> away.  It creates two hash operator classes, one for int4 and another for
>> text type and uses them to create hash partitioned table to be used in the
>> tests, like done in the existing tests in hash_part.sql.  Since that makes
>> tests (hopefully) reliably return the same result always, I no longer see
>> the need to keep them in a separate partition_prune_hash.sql.  The
>> reasoning behind having the separate file was to keep the alternative
>> output file small as David explained in [1].
>> [1]
>> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
> 
> I had a quick look, but I'm still confused about why a function like
> hash_uint32_extended() is susceptible to varying results depending on
> CPU endianness but hash_combine64 is not.

It might as well be the combination of the two that's sensitive to
endianness.  I too am not sure exactly which part.  They're both used
in succession in compute_hash_value:

            /*
             * Compute hash for each datum value by calling respective
             * datatype-specific hash functions of each partition key
             * attribute.
             */
            hash = FunctionCall2(&partsupfunc[i], values[i], seed);

            /* Form a single 64-bit hash value */
            rowHash = hash_combine64(rowHash, DatumGetUInt64(hash));
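
For what it's worth, hash_combine64 is a boost-style bit-mixing
combiner.  The sketch below models it in plain integer arithmetic (the
magic constant is quoted from memory and should be treated as an
assumption, not as a verbatim copy of the source).  Notably, the
combiner itself is pure 64-bit arithmetic and therefore
endianness-independent; any platform dependence has to originate in the
datatype hash functions that feed it.

```python
MASK64 = (1 << 64) - 1  # keep everything in 64-bit unsigned range

def hash_combine64(a: int, b: int) -> int:
    # Modeled on PostgreSQL's hash_combine64; mixes the running row
    # hash `a` with the next column's hash `b`.  The constant here is
    # an assumption from memory.
    a ^= (b + 0x49a0f4dd15e5a8e3 + ((a << 54) & MASK64) + (a >> 7)) & MASK64
    return a & MASK64
```

The combiner is deterministic and order-sensitive, so per-column hashes
fold into a single 64-bit row hash in a fixed, platform-independent way.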

> Apart from that confusion, looking at the patch:
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> 
> 
> Why coalesce here? Maybe I've not thought of something, but coalesce
> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing even if you added a default.

After reading Ashutosh's comment, I realized I didn't really mean to add
the STRICT to those function definitions.  As these are not operators, but
support (hash) procedures, it's insignificant to the pruning code whether
they are STRICT or not, unlike clause operators where it is.

Also, I've adopted the coalesce-based hashing function from hash_part.sql,
albeit with some unnecessary tweaks.  I've not read anywhere about why the
coalesce was used in the first place, but it's insignificant for our
purpose here anyway.

> I know this one was there before, but I only just noticed it:
> 
> +-- pruning should work if non-null values are provided for all the keys
> +explain (costs off) select * from hp where a is null and b is null;
> 
> The comment is a bit misleading given the first test below it is
> testing for nulls. Maybe it can be changed to
> 
> +-- pruning should work if values or is null clauses are provided for
> all partition keys.

I have adjusted the comments.

Updated patch attached.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Updated patch attached.

Thanks for the updated patch.

The only thing I'm not sure about is the changes you've made to the
COALESCE function.

+CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
+$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
+CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
+CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
+$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;

Why does one default to the seed and the other to an empty string?
Shouldn't they both do the same thing? If you were to copy
hash_part.sql you'd just use coalesce($1, 0) and coalesce($1, ''); any
special reason not to do that?

Also just wondering if it's worth adding some verification that we've
actually eliminated the correct partitions by backing the tests up
with a call to satisfies_hash_partition.

I've attached a delta patch that applies to your v2 which does this.
Do you think it's worth doing?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi David.

Thanks for the review.

On 2018/04/11 17:59, David Rowley wrote:
> On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Updated patch attached.
> 
> Thanks for the updated patch.
> 
> The only thing I'm not sure about is the chances you've made to the
> COALESCE function.
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;
> 
> Why does one default to the seed and the other to an empty string?
> Shouldn't they both do the same thing? If you were to copy the
> hash_part.sql you'd just coalesce($1, 0) and coalesce($1, ''), any
> special reason not to do that?

Oops, so I hadn't actually restored it to the way it is in hash_part.sql.

Also, Ashutosh was talking about the custom hashing function used in
insert.sql, not hash_part.sql, which I based my revision upon.

Fixed it now.

> Also just wondering if it's worth adding some verification that we've
> actually eliminated the correct partitions by backing the tests up
> with a call to satisfies_hash_partition.
> 
> I've attached a delta patch that applies to your v2 which does this.
> Do you think it's worth doing?

We can check by inspection that individual values are in the appropriate
partitions, which is the point of having the inserts and the select just
above the actual pruning-related tests.  So, I'm not sure that adding
satisfies_hash_partition to each pruning test adds much.

Attached revised patch.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Wed, Apr 11, 2018 at 2:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

>>
>> I've attached a delta patch that applies to your v2 which does this.
>> Do you think it's worth doing?
>
> We can see check by inspection that individual values are in appropriate
> partitions, which is the point of having the inserts and the select just
> above the actual pruning related tests.  So, I'm not sure if adding the
> satisfies_hash_partition against each pruning tests adds much.

+1.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 11 April 2018 at 21:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Also just wondering if it's worth adding some verification that we've
>> actually eliminated the correct partitions by backing the tests up
>> with a call to satisfies_hash_partition.
>>
>> I've attached a delta patch that applies to your v2 which does this.
>> Do you think it's worth doing?
>
> We can see check by inspection that individual values are in appropriate
> partitions, which is the point of having the inserts and the select just
> above the actual pruning related tests.  So, I'm not sure if adding the
> satisfies_hash_partition against each pruning tests adds much.

Right, that's true.

> Attached revised patch.

Thanks. It looks fine to me, with or without the
satisfies_hash_partition tests. I agree that they're probably
overkill, but I see you've added them now.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Here's an idea.  Why don't we move the function/opclass creation lines
to insert.sql, without the DROPs, and use the same functions/opclasses
in the three tests insert.sql, alter_table.sql, hash_part.sql and
partition_prune.sql, i.e. not recreate what are essentially the same
objects three times?  This also leaves them around for the pg_upgrade
test, which is not a bad thing.

(This would require a few updates to insert.sql because the definitions
there are different, but it shouldn't be a problem coverage-wise.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/11 21:35, Alvaro Herrera wrote:
> Here's an idea.  Why don't we move the function/opclass creation lines
> to insert.sql, without the DROPs, and use the same functions/opclasses
> in the three tests insert.sql, alter_table.sql, hash_part.sql and
> partition_prune.sql, i.e. not recreate what are essentially the same
> objects three times?  This also leaves them around for the pg_upgrade
> test, which is not a bad thing.
> 
> (This would require a few updates to insert.sql because the definitions
> there are different, but it shouldn't be a problem coverage-wise.)

OK, I've tried doing that.  It needed adjustments to hash_part.sql as
well.  The hash function for int4 was defined differently in insert.sql,
alter_table.sql, and hash_part.sql.  I went with the definition in
insert.sql, which, although slightly different from the one in
alter_table.sql, didn't affect the latter's output in any way.  Since the
definition in hash_part.sql was totally different, a couple of tests
needed adjusting after starting to use the hash opclasses defined in
insert.sql.

Attached updated patch.

PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
that there are hash partitioning related tests in create_table,
foreign_key, and partition_join files as well.  Do we want to use the
custom opclass in those files as well?

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's an idea.  Why don't we move the function/opclass creation lines
> to insert.sql, without the DROPs, and use the same functions/opclasses
> in the three tests insert.sql, alter_table.sql, hash_part.sql and
> partition_prune.sql, i.e. not recreate what are essentially the same
> objects three times?  This also leaves them around for the pg_upgrade
> test, which is not a bad thing.

That sounds good, but maybe we should go further and move the
partitioning tests out of generically-named things like insert.sql
altogether and have test names that actually mention partitioning.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/13 1:47, Robert Haas wrote:
> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Here's an idea.  Why don't we move the function/opclass creation lines
>> to insert.sql, without the DROPs, and use the same functions/opclasses
>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>> partition_prune.sql, i.e. not recreate what are essentially the same
>> objects three times?  This also leaves them around for the pg_upgrade
>> test, which is not a bad thing.
> 
> That sounds good, but maybe we should go further and move the
> partitioning tests out of generically-named things like insert.sql
> altogether and have test names that actually mention partitioning.

Do you mean to do that for *all* files that have tests exercising some
partitioning code, even if it's just one test?  I can see that tests in at
least some of them could be put into their own partition_ file as follows:

partition_insert (including tests in insert_conflict)
partition_update
partition_triggers
partition_indexing (indexing.sql added when partitioned indexes went in)
partition_ddl (for the tests in create_table and alter_table)

That leaves:

cluster
create_index (one test here could be moved to partition_indexing?)
foreign_data (could be moved to partition_ddl?)
foreign_key  (could be moved to partition_ddl?)
hash_part    (leave alone, because already contains 'part' in the name?)
identity
join
plancache
plpgsql
publication
rowsecurity
rules
stats_ext
tablesample
truncate
updatable_views
vacuum


What about the tests in inherit.sql that start with:

--
-- Check that constraint exclusion works correctly with partitions using
-- implicit constraints generated from the partition bound information.
--

Maybe, just move all of them to partition_prune.sql, because we no longer
use constraint exclusion for pruning?

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 13 April 2018 at 14:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:

Wouldn't it be best to just move hash partition tests into hash_part?
Leave all the other stuff where it is?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Ashutosh Bapat
Date:
On Fri, Apr 13, 2018 at 7:45 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/04/13 1:47, Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>>> Here's an idea.  Why don't we move the function/opclass creation lines
>>> to insert.sql, without the DROPs, and use the same functions/opclasses
>>> in the three tests insert.sql, alter_table.sql, hash_part.sql and
>>> partition_prune.sql, i.e. not recreate what are essentially the same
>>> objects three times?  This also leaves them around for the pg_upgrade
>>> test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> Do you mean to do that for *all* files that have tests exercising some
> partitioning code, even if it's just one test?  I can see that tests in at
> least some of them could be put into their own partition_ file as follows:
>
> partition_insert (including tests in insert_conflict)
> partition_update
> partition_triggers
> partition_indexing (indexing.sql added when partitioned indexes went in)
> partition_ddl (for the tests in create_table and alter_table)
>

We haven't generally created test files specific to a table type (for
example, temporary tables or unlogged tables); instead we have created
files per SQL command.  But then that's not true for indexes; we have
separate files for indexes, and we also have separate files for
materialized views and for various data types.  So our test file
organization seems to cut across both SQL commands and object types.
But partitioning seems an area large enough to have files of its own;
we already have partition_join and partition_aggregate.

Do we want to move to a directory-based organization for tests as well,
where sql/ and expected/ would have subdirectories for the various
types of objects (partitioned tables, indexes, regular tables, data
types, etc.), each of those containing files organized by SQL command?
An immediate question is where to add files that exercise a mixture of
objects; perhaps in the directory corresponding to the primary object,
e.g. materialized views over partitioned tables would fit in the
materialized view (or just views?) directory.

Whatever organization we use, it should make it easy to find test
cases for relevant functionality, e.g. all tests for partitioned
tables or all ALTER TABLE command tests.


> What about the tests in inherit.sql that start with:
>
> --
> -- Check that constraint exclusion works correctly with partitions using
> -- implicit constraints generated from the partition bound information.
> --
>
> Maybe, just move all of them to partition_prune.sql, because we no longer
> use constraint exclusion for pruning?

I think we need to keep some test cases somewhere that test constraint
exclusion on partitions and partitioned tables; those do not
necessarily fit in partition pruning.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Here's an idea.  Why don't we move the function/opclass creation lines
> > to insert.sql, without the DROPs, and use the same functions/opclasses
> > in the three tests insert.sql, alter_table.sql, hash_part.sql and
> > partition_prune.sql, i.e. not recreate what are essentially the same
> > objects three times?  This also leaves them around for the pg_upgrade
> > test, which is not a bad thing.
> 
> That sounds good, but maybe we should go further and move the
> partitioning tests out of generically-named things like insert.sql
> altogether and have test names that actually mention partitioning.

I don't think that's necessary to fix the problem that
partition_prune_hash.sql file has two expected output files.  If you
want to propose such a reorganization, feel free, but let's not hijack
the patch at hand.  For the record, I'm not a fan of the idea.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Robert Haas
Date:
On Fri, Apr 13, 2018 at 10:50 AM, Alvaro Herrera
<alvherre@alvh.no-ip.org> wrote:
> Robert Haas wrote:
>> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> > Here's an idea.  Why don't we move the function/opclass creation lines
>> > to insert.sql, without the DROPs, and use the same functions/opclasses
>> > in the three tests insert.sql, alter_table.sql, hash_part.sql and
>> > partition_prune.sql, i.e. not recreate what are essentially the same
>> > objects three times?  This also leaves them around for the pg_upgrade
>> > test, which is not a bad thing.
>>
>> That sounds good, but maybe we should go further and move the
>> partitioning tests out of generically-named things like insert.sql
>> altogether and have test names that actually mention partitioning.
>
> I don't think that's necessary to fix the problem that
> partition_prune_hash.sql file has two expected output files.  If you
> want to propose such a reorganization, feel free, but let's not hijack
> the patch at hand.  For the record, I'm not a fan of the idea.

Fair enough.  I don't think I'm hijacking the thread much more than it
was already hijacked; and it was just a thought.  I haven't really
studied the tests well enough to have a really clear idea what a
better organization would look like.  It was just that, for example,
the commit that added hash partitioning added tests to 5 different
files, and some things had to be duplicated as a result.  It sounds
like what you've already done improves that, but I was wondering if
there's a way to do better.  I don't feel super-strongly about it
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:

> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
> that there are hash partitioning related tests in create_table,
> foreign_key, and partition_join files as well.  Do we want to use the
> custom opclass in those files as well?

By the way, let me suggest 'git grep -i' instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/04/21 0:58, Alvaro Herrera wrote:
> Amit Langote wrote:
> 
>> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates
>> that there are hash partitioning related tests in create_table,
>> foreign_key, and partition_join files as well.  Do we want to use the
>> custom opclass in those files as well?
> 
> By the way, let me suggest 'git grep -i' instead.

Ah, thanks.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Marina Polyakova
Date:
Hello everyone in this thread!

I got a similar server crash as in [1] on the master branch since
commit 9fdb675fc5d2de825414e05939727de8b120ae81: the assertion fails
because the second argument of the ScalarArrayOpExpr is not a Const or an
ArrayExpr, but an ArrayCoerceExpr (see [2]):

=# create table list_parted (a varchar) partition by list (a);
=# create table part_ab_cd partition of list_parted for values in ('ab', 'cd');
=# CREATE OR REPLACE FUNCTION public.x_stl_text_integer()
RETURNS text STABLE AS
$body$
BEGIN
RAISE NOTICE 's text integer';
RETURN 1::text;
END;
$body$
LANGUAGE 'plpgsql';
=# explain (costs off) select * from list_parted
   where a in ('ab', 'cd', x_stl_text_integer());
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

[1] 
https://www.postgresql.org/message-id/CAKcux6nCsCmu9oUnnuKZkeBenYvUFbU2Lt4q2MFNEb7QErzn8w%40mail.gmail.com

[2] partprune.c, function match_clause_to_partition_key:
        if (IsA(rightop, Const))
        {
            ...
        }
        else
        {
            ArrayExpr  *arrexpr = castNode(ArrayExpr, rightop); /* fails here */
            ...
        }

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
> I got a similar server crash as in [1] on the master branch since the commit
> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> an ArrayCoerceExpr (see [2]):

Indeed, I can see the crash.  I have been playing with this stuff and I
am in the middle of writing the patch, but let's track this properly for
now.
--
Michael

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Marina Polyakova
Date:
On 07-05-2018 4:37, Michael Paquier wrote:
> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
>> I got a similar server crash as in [1] on the master branch since the 
>> commit
>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails 
>> because
>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, 
>> but is
>> an ArrayCoerceExpr (see [2]):
> 
> Indeed, I can see the crash.  I have been playing with this stuff and I
> am in the middle of writing the patch, but let's track this properly 
> for
> now.

Thank you very much!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote:
> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
> > I got a similar server crash as in [1] on the master branch since the commit
> > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> > an ArrayCoerceExpr (see [2]):
>
> Indeed, I can see the crash.  I have been playing with this stuff and I
> am in the middle of writing the patch, but let's track this properly for
> now.

So the problem appears when an expression needs to use
COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
which requires an execution state to be able to build the list of
elements.  The clause matching happens at planning time, so while there
are surely cases where this could be improved depending on the
expression type, I propose to just discard all clauses which do not
match OpExpr and ArrayExpr for now, as per the attached.  It would
definitely be good practice to have a default code path returning
PARTCLAUSE_UNSUPPORTED where the element list cannot be built.

Thoughts?
--
Michael

Attachment

Re: [Suspect SPAM] Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Thank you Marina for the report and Michael for following up.

On 2018/05/07 16:56, Michael Paquier wrote:
> On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote:
>> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote:
>>> I got a similar server crash as in [1] on the master branch since the commit
>>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
>>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
>>> an ArrayCoerceExpr (see [2]):
>>
>> Indeed, I can see the crash.  I have been playing with this stuff and I
>> am in the middle of writing the patch, but let's track this properly for
>> now.
> 
> So the problem appears when an expression needs to use
> COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
> which requires an execution state to be able to build the list of
> elements.  The clause matching happens at planning state, so while there
> are surely cases where this could be improved depending on the
> expression type, I propose to just discard all clauses which do not
> match OpExpr and ArrayExpr for now, as per the attached.  It would be
> definitely a good practice to have a default code path returning
> PARTCLAUSE_UNSUPPORTED where the element list cannot be built.
> 
> Thoughts?

I have to agree to go with this conservative approach for now.  Although
we might be able to evaluate the array elements by applying the coercion
specified by ArrayCoerceExpr, let's save that as an improvement to be
pursued later.

FWIW, constraint exclusion wouldn't prune in this case either (that is, if
you try this example with PG 10 or using HEAD as of the parent of
9fdb675fc5), but it doesn't crash like the new pruning code does.

Thanks again.

Regards,
Amit



Re: [Suspect SPAM] Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Tue, May 08, 2018 at 04:07:41PM +0900, Amit Langote wrote:
> I have to agree to go with this conservative approach for now.  Although
> we might be able to evaluate the array elements by applying the coercion
> specified by ArrayCoerceExpr, let's save that as an improvement to be
> pursued later.

Thanks for confirming.  Yes, non-volatile functions would actually be
safe, and we'd need to be careful about NULL handling as well, but
that's definitely out of scope for v11.

> FWIW, constraint exclusion wouldn't prune in this case either (that is, if
> you try this example with PG 10 or using HEAD as of the parent of
> 9fdb675fc5), but it doesn't crash like the new pruning code does.

Yeah, I have noticed that.
--
Michael

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Michael Paquier wrote:

> So the problem appears when an expression needs to use
> COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another,
> which requires an execution state to be able to build the list of
> elements.  The clause matching happens at planning state, so while there
> are surely cases where this could be improved depending on the
> expression type, I propose to just discard all clauses which do not
> match OpExpr and ArrayExpr for now, as per the attached.  It would be
> definitely a good practice to have a default code path returning
> PARTCLAUSE_UNSUPPORTED where the element list cannot be built.
> 
> Thoughts?

I found a related crash and I'm investigating it further.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
So I found that this query also crashed (using your rig),

create table coercepart (a varchar) partition by list (a);
create table coercepart_ab partition of coercepart for values in ('ab');
create table coercepart_bc partition of coercepart for values in ('bc');
create table coercepart_cd partition of coercepart for values in ('cd');
explain (costs off) select * from coercepart where a ~ any ('{ab}');

The reason for this crash is that gen_partprune_steps_internal() is
unable to generate any steps for the clause -- which is natural, since
the operator is not in a btree opclass.  There are various callers
of gen_partprune_steps_internal that are aware that it could return an
empty set of steps, but this one in match_clause_to_partition_key for
the ScalarArrayOpExpr case was being a bit too optimistic.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
Hi.

On 2018/05/09 7:05, Alvaro Herrera wrote:
> So I found that this query also crashed (using your rig),
> 
> create table coercepart (a varchar) partition by list (a);
> create table coercepart_ab partition of coercepart for values in ('ab');
> create table coercepart_bc partition of coercepart for values in ('bc');
> create table coercepart_cd partition of coercepart for values in ('cd');
> explain (costs off) select * from coercepart where a ~ any ('{ab}');
> 
> The reason for this crash is that gen_partprune_steps_internal() is
> unable to generate any steps for the clause -- which is natural, since
> the operator is not in a btree opclass.  There are various callers
> of gen_partprune_steps_internal that are aware that it could return an
> empty set of steps, but this one in match_clause_to_partition_key for
> the ScalarArrayOpExpr case was being a bit too optimistic.

Yeah, good catch!  That fixes the crash, but looking around that code a
bit, it seems that we shouldn't even have gotten up to the point you're
proposing to fix.  If saop_op is not in the partitioning opfamily, it
should have bailed out much sooner, that is, here:

    /*
     * In case of NOT IN (..), we get a '<>', which we handle if list
     * partitioning is in use and we're able to confirm that it's negator
     * is a btree equality operator belonging to the partitioning operator
     * family.
     */
    if (!op_in_opfamily(saop_op, partopfamily))
    {
        <snip>

        negator = get_negator(saop_op);
        if (OidIsValid(negator) && op_in_opfamily(negator, partopfamily))
        {
            <snip>
        }
+       else
+           /* otherwise, unsupported! */
+           return PARTCLAUSE_UNSUPPORTED;

Let me propose that we also have this along with the rest of the changes
you made in that part of the function.  So, attached is an updated patch.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Tue, May 08, 2018 at 07:05:46PM -0300, Alvaro Herrera wrote:
> The reason for this crash is that gen_partprune_steps_internal() is
> unable to generate any steps for the clause -- which is natural, since
> the operator is not in a btree opclass.  There are various callers
> of gen_partprune_steps_internal that are aware that it could return an
> empty set of steps, but this one in match_clause_to_partition_key for
> the ScalarArrayOpExpr case was being a bit too optimistic.

Indeed.

While looking at this code, is there any reason to not make
gen_partprune_steps static?  This is only used in partprune.c for now,
so the intention is to make it available for future patches?
--
Michael

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/05/09 11:20, Michael Paquier wrote:
> While looking at this code, is there any reason to not make
> gen_partprune_steps static?  This is only used in partprune.c for now,
> so the intention is to make it available for future patches?

Yeah, making it static might be a good idea.  I had made it externally
visible because I was under the impression that the runtime pruning
related code would want to call it from elsewhere within the planner.
But instead it introduced a make_partition_pruneinfo() which in turn
calls gen_partprune_steps.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

From
David Rowley
Date:
On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/05/09 11:20, Michael Paquier wrote:
>> While looking at this code, is there any reason to not make
>> gen_partprune_steps static?  This is only used in partprune.c for now,
>> so the intention is to make it available for future patches?
>
> Yeah, making it static might be a good idea.  I had made it externally
> visible, because I was under the impression that the runtime pruning
> related code would want to call it from elsewhere within the planner.
> But, instead it introduced a make_partition_pruneinfo() which in turn
> calls get_partprune_steps.

Yeah. Likely left over from when run-time pruning was generating the
steps during execution rather than during planning.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/05/09 11:31, David Rowley wrote:
> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/05/09 11:20, Michael Paquier wrote:
>>> While looking at this code, is there any reason to not make
>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>> so the intention is to make it available for future patches?
>>
>> Yeah, making it static might be a good idea.  I had made it externally
>> visible, because I was under the impression that the runtime pruning
>> related code would want to call it from elsewhere within the planner.
>> But, instead it introduced a make_partition_pruneinfo() which in turn
>> calls get_partprune_steps.
> 
> Yeah. Likely left over from when run-time pruning was generating the
> steps during execution rather than during planning.

Here is a patch that does that.

Thanks,
Amit

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Wed, May 09, 2018 at 02:01:26PM +0900, Amit Langote wrote:
> On 2018/05/09 11:31, David Rowley wrote:
>> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2018/05/09 11:20, Michael Paquier wrote:
>>>> While looking at this code, is there any reason to not make
>>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>>> so the intention is to make it available for future patches?
>>>
>>> Yeah, making it static might be a good idea.  I had made it externally
>>> visible, because I was under the impression that the runtime pruning
>>> related code would want to call it from elsewhere within the planner.
>>> But, instead it introduced a make_partition_pruneinfo() which in turn
>>> calls get_partprune_steps.
>>
>> Yeah. Likely left over from when run-time pruning was generating the
>> steps during execution rather than during planning.
>
> Here is a patch that does that.

Thanks, Amit.

Alvaro, could it be possible to consider as well the patch I posted
here?
https://www.postgresql.org/message-id/20180424012042.GD1570@paquier.xyz

This removes a useless default clause in partprune.c and it got
forgotten in the crowd.  Just attaching it again here, and it can just
be applied on top of the rest.
--
Michael

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Michael Paquier wrote:

> Alvaro, could it be possible to consider as well the patch I posted
> here?
> https://www.postgresql.org/message-id/20180424012042.GD1570@paquier.xyz
> 
> This removes a useless default clause in partprune.c and it got
> forgotten in the crowd.  Just attaching it again here, and it can just
> be applied on top of the rest.

Done, thanks for insisting.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Amit Langote wrote:
> On 2018/05/09 11:31, David Rowley wrote:
> > On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On 2018/05/09 11:20, Michael Paquier wrote:
> >>> While looking at this code, is there any reason to not make
> >>> gen_partprune_steps static?  This is only used in partprune.c for now,
> >>> so the intention is to make it available for future patches?

> Here is a patch that does that.

Pushed, thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Alvaro Herrera
Date:
Marina Polyakova wrote:
> Hello everyone in this thread!

> I got a similar server crash as in [1] on the master branch since the commit
> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> an ArrayCoerceExpr (see [2]):

Hello Marina, thanks for reporting this.  I have pushed all fixes
derived from this report -- thanks to Amit and Michaël for those.
I verified your test case no longer crashes.  If you have more elaborate
test cases, please do try these too.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] path toward faster partition pruning

From
Michael Paquier
Date:
On Wed, May 09, 2018 at 10:39:07AM -0300, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> This removes a useless default clause in partprune.c and it got
>> forgotten in the crowd.  Just attaching it again here, and it can just
>> be applied on top of the rest.
>
> Done, thanks for insisting.

Thanks!
--
Michael

Attachment

Re: [HACKERS] path toward faster partition pruning

From
Amit Langote
Date:
On 2018/05/09 22:43, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/05/09 11:31, David Rowley wrote:
>>> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> On 2018/05/09 11:20, Michael Paquier wrote:
>>>>> While looking at this code, is there any reason to not make
>>>>> gen_partprune_steps static?  This is only used in partprune.c for now,
>>>>> so the intention is to make it available for future patches?
> 
>> Here is a patch that does that.
> 
> Pushed, thanks.

Thank you.

Regards,
Amit



Re: [HACKERS] path toward faster partition pruning

From
Marina Polyakova
Date:
On 09-05-2018 17:30, Alvaro Herrera wrote:
> Marina Polyakova wrote:
>> Hello everyone in this thread!
> 
>> I got a similar server crash as in [1] on the master branch since the 
>> commit
>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails 
>> because
>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, 
>> but is
>> an ArrayCoerceExpr (see [2]):
> 
> Hello Marina, thanks for reporting this.  I have pushed all fixes
> derived from this report -- thanks to Amit and Michaël for those.
> I verified your test case no longer crashes.  If you have more 
> elaborate
> test cases, please do try these too.

Hello, thank you all very much! :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company