Thread: Partition-wise join for join between (declaratively) partitioned tables

Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Amit Langote is working on supporting declarative partitioning in PostgreSQL [1]. I have started working on supporting partition-wise join. This mail describes a very high-level design and gives some insight into the potential performance improvements.

An equi-join between two partitioned tables can be broken down into pair-wise joins between their partitions. This technique is called partition-wise join. Partition-wise joins between similarly partitioned tables with an equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join is smaller. All such joins collectively might be more efficient than the join between their parents.
2. Such joins are able to exploit properties of the partitions, like indexes, their storage, etc.
3. An N-way partition-wise join may have different efficient join orders compared to the efficient join order between the parent tables.
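
For concreteness, here is a minimal sketch of the decomposition. The table names are illustrative and the DDL roughly follows the declarative partitioning syntax being worked on in [1]; it may not match the in-progress patches exactly.

CREATE TABLE r (a int, v text) PARTITION BY RANGE (a);
CREATE TABLE r_p1 PARTITION OF r FOR VALUES FROM (0) TO (100);
CREATE TABLE r_p2 PARTITION OF r FOR VALUES FROM (100) TO (200);
CREATE TABLE s (b int, w text) PARTITION BY RANGE (b);
CREATE TABLE s_p1 PARTITION OF s FOR VALUES FROM (0) TO (100);
CREATE TABLE s_p2 PARTITION OF s FOR VALUES FROM (100) TO (200);

-- The equi-join on the partition keys ...
SELECT * FROM r JOIN s ON r.a = s.b;

-- ... is equivalent to the union of the joins between matching partitions:
SELECT * FROM r_p1 JOIN s_p1 ON r_p1.a = s_p1.b
UNION ALL
SELECT * FROM r_p2 JOIN s_p2 ON r_p2.a = s_p2.b;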

A partition-wise join is processed in the following stages [2], [3].
1. Applicability testing: This phase checks whether the join conditions match the partitioning scheme. A partition-wise join is efficient if there is an equi-join on the partition keys. E.g. a join between tables R and S partitioned by columns a and b resp. can be broken down into partition-wise joins if there exists a join condition R.a = S.b. In other words, the number of provably non-empty partition-wise joins is O(N) where N is the number of partitions.

2. Matching: This phase determines which joins between the partitions of R and S can potentially produce tuples in the join and prunes empty joins between partitions.

3. Clustering: This phase aims at reducing the number of partition-wise joins by clubbing together partitions from joining relations. E.g. clubbing multiple partitions from either of the partitioned relations which can join to a single partition from the other partitioned relation.

4. Path/plan creation: This phase creates multiple paths for each partition-wise join. It also creates Append path/s representing the union of partition-wise joins.

The work here focuses on a subset of the use-cases discussed in [2]. It only considers partition-wise join for joins between similarly partitioned tables with the same number of partitions having the same properties, thus producing at most as many partition-wise joins as there are partitions. It should be possible to apply the partition-wise join technique (with some special handling for OUTER joins) if both relations have some extra partitions with non-overlapping partition conditions, apart from the matching partitions. But I am not planning to implement this optimization in the first cut.

The attached patch is a POC implementation of partition-wise join. It is based on the set of patches posted on 23rd May 2016 by Amit Langote for declarative partitioning. The patch gives an idea about the approach used. It has several TODOs, which I am working on.

Attached is a script with output which measures the potential performance improvement because of partition-wise join. The script uses a GUC enable_partition_wise_join to enable/disable this feature for performance measurement. The script measures the performance improvement of a join between two tables partitioned by range on an integer column. Each table contains 50K rows and has an integer and a varchar column. It shows around 10-15% reduction in execution time when partition-wise join is used. Combined with parallel query and FDWs, it opens up avenues for further improvements for joins between partitioned tables.
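
For reference, the shape of the measurement is roughly as follows. This is a hand-written sketch rather than the attached script; the table names, partition bounds and data-generation queries are illustrative, and the DDL syntax assumes the declarative partitioning feature as proposed.

CREATE TABLE t1 (a int, b varchar) PARTITION BY RANGE (a);
CREATE TABLE t1_p1 PARTITION OF t1 FOR VALUES FROM (0) TO (25000);
CREATE TABLE t1_p2 PARTITION OF t1 FOR VALUES FROM (25000) TO (50000);
CREATE TABLE t2 (a int, b varchar) PARTITION BY RANGE (a);
CREATE TABLE t2_p1 PARTITION OF t2 FOR VALUES FROM (0) TO (25000);
CREATE TABLE t2_p2 PARTITION OF t2 FOR VALUES FROM (25000) TO (50000);

INSERT INTO t1 SELECT i, 'val_' || i FROM generate_series(0, 49999) i;
INSERT INTO t2 SELECT i, 'val_' || i FROM generate_series(0, 49999) i;
ANALYZE t1;
ANALYZE t2;

-- join executed between the parents
SET enable_partition_wise_join = off;
EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;

-- same join executed as an append of joins between matching partitions
SET enable_partition_wise_join = on;
EXPLAIN ANALYZE SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;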

[1]. https://www.postgresql.org/message-id/55D3093C.5010800@lab.ntt.co.jp
[2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
[3]. https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Amit Langote is working on supporting declarative partitioning in PostgreSQL
> [1]. I have started working on supporting partition-wise join. This mail
> describes very high level design and some insight into the performance
> improvements.
>
> An equi-join between two partitioned tables can be broken down into
> pair-wise join between their partitions. This technique is called
> partition-wise join. Partition-wise joins between similarly partitioned
> tables with equi-join condition can be efficient because [2]
> 1. Each provably non-empty partition-wise join smaller. All such joins
> collectively might be more efficient than the join between their parent.
> 2. Such joins are able to exploit properties of partitions like indexes,
> their storage etc.
> 3. An N-way partition-wise join may have different efficient join orders
> compared to the efficient join order between the parent tables.
>
> A partition-wise join is processed in following stages [2], [3].
> 1. Applicability testing: This phase checks if the join conditions match the
> partitioning scheme. A partition-wise join is efficient if there is an
> equi-join on the partition keys. E.g. join between tables R and S
> partitioned by columns a and b resp. can be broken down into partition-wise
> joins if there exists a join condition is R.a = S.b. Or in other words the
> number of provably non-empty partition-wise joins is O(N) where N is the
> number of partitions.
>
> 2. Matching: This phase determines which joins between the partitions of R
> and S can potentially produce tuples in the join and prunes empty joins
> between partitions.
>
> 3. Clustering: This phase aims at reducing the number of partition-wise
> joins by clubbing together partitions from joining relations. E.g. clubbing
> multiple partitions from either of the partitioned relations which can join
> to a single partition from the other partitioned relation.
>
> 4. Path/plan creation: This phase creates multiple paths for each
> partition-wise join. It also creates Append path/s representing the union of
> partition-wise joins.
>
> The work here focuses on a subset of use-cases discussed in [2]. It only
> considers partition-wise join for join between similarly partitioned tables
> with same number of partitions with same properties, thus producing at most
> as many partition-wise joins as there are partitions. It should be possible
> to apply partition-wise join technique (with some special handling for OUTER
> joins) if both relations have some extra partitions with non-overlapping
> partition conditions, apart from the matching partitions. But I am not
> planning to implement this optimization in the first cut.

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:


On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables.  Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition.  If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks.  Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly, without the need for a MergeAppend.
 
  And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization.  Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers.  If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables.  That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using the partition-wise join technique described in my
previous mail on this thread.

Two partitioned relations are considered for partition-wise join if the
following conditions are met (see build_joinrel_part_info() for details):
1. Both relations have the same number of partitions, with the same number of
partition keys, and are partitioned by the same strategy - range or list.
2. They have matching datatypes for the partition keys (partkey_types_match()).
3. For list partitioned relations, they have the same lists for each pair of
partitions, paired by the position in which they appear.
4. For range partitioned relations, they have the same bounds for each pair of
partitions, paired by their position when ordered in ascending order of the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys, paired
by the position in which they appear.
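
To illustrate these conditions, here is a hypothetical pair of list partitioned tables; the names, lists and DDL syntax are illustrative only and are not taken from the patch or its tests.

CREATE TABLE plt1 (c text, x int) PARTITION BY LIST (c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('a', 'b');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('c', 'd');
CREATE TABLE plt2 (c text, y int) PARTITION BY LIST (c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('a', 'b');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('c', 'd');

-- Considered for partition-wise join: same strategy, same number of partitions,
-- matching lists, matching key types, and an equi-join on the partition keys.
SELECT * FROM plt1 JOIN plt2 ON plt1.c = plt2.c;

-- Not considered: the join condition is not an equi-join on the partition keys
-- (condition 5 is not met).
SELECT * FROM plt1 JOIN plt2 ON plt1.x = plt2.y;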

The partition-wise join technique can be applied under more lenient constraints
[1], e.g. joins between tables with different numbers of partitions but having
the same bounds/lists for the common partitions. I am planning to defer that to
a later version of this feature.

A join executed using the partition-wise join technique is itself a relation
partitioned by the same partitioning scheme as the joining relations, with the
partition keys combined from the joining relations.

A PartitionOptInfo structure (named along the lines of RelOptInfo or
IndexOptInfo) is used to store the partitioning information for a given base or
join relation. In build_simple_rel(), we construct the PartitionOptInfo
structure for the given base relation by copying the relation's PartitionDesc
and PartitionKey (structures from Amit Langote's patch). While doing so, all the
partition keys are stored as expressions. The structure also holds the
RelOptInfos of the partition relations. For a join relation, most of the
PartitionOptInfo is copied from either of the joining relations, except the
partition keys and the RelOptInfos of the partition relations. Partition keys of
the join relation are created by combining partition keys from both the joining
relations. The logic to construct RelOptInfos for the partition-wise join
relations is yet to be implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with the error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now."
It also prints messages to show which of the joins can and cannot use the
partition-wise join technique, e.g.
"NOTICE:  join between relations (b 1) and (b 2) is considered for
partition-wise join."
or
"NOTICE:  join between relations (b 1) and (b 2) is NOT considered for
partition-wise join."
The relations are indicated by their relids in the query. These messages are
for debugging only, and will be removed once the path creation logic is
implemented.

The patch adds a test partition_join.sql, which has a number of positive and
negative test cases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Sorry, forgot to mention: this patch applies on top of the v7 patches posted by Amit Langote on 27th June (https://www.postgresql.org/message-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7%40lab.ntt.co.jp).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative partitioning support patches provided by Amit
Langote on 26th August 2016. The previous patch added support to assess whether
two tables can be joined using the partition-wise join technique, but did not
have complete support to create plans which use that technique. This patch
implements three important pieces for supporting partition-wise join:

1. Logic to assess whether a join between two partitioned tables can be
executed using the partition-wise join technique.
2. Construct RelOptInfos representing joins between matching partitions of the
joining relations and add join paths to those RelOptInfos.
3. Add append paths to the RelOptInfo representing the join between the
partitioned tables. The rest of the planner code chooses the optimal path for
the join.

make_join_rel() now calls try_partition_wise_join(), which executes all of the
steps listed above. If the joining partitioned relations are deemed fit for
partition-wise join, we create one RelOptInfo (if not already present)
representing a join between every pair of partitions to be joined. Since the
join between parents is deemed legal, the join between the partitions is also
legal, hence legality of the join is not checked again. RelOptInfo representing
the join between partitions is constructed by translating the relevant members
of RelOptInfo of the parent join relation. Similarly SpecialJoinInfo,
restrictlist (for given join order) are constructed by translating those for
the parent join.

make_join_rel() is split into two portions: a) constructing the restrictlist
and RelOptInfo for the join relation, and b) creating paths for the join. The
second portion is separated into a function populate_joinrel_with_paths(),
which is reused in try_partition_wise_join() to create paths for joins between
matching partitions.

set_append_rel_pathlist() generates paths for child relations, marks the empty
children as dummy relations and creates append paths by collecting paths with
similar properties (parameterization and pathkeys) from the non-empty children.
It then adds the append paths to the parent relation. This patch divides
set_append_rel_pathlist() into two parts: a) marking empty child relations as
dummy and generating paths for non-empty children, and b) collecting children's
paths into append paths for the parent. Part b is separated into a function
add_paths_to_append_rel(), which is reused for collecting paths from
partition-wise join child relations to construct append paths for the join
between partitioned tables.

For an N-way join between partitioned tables, make_join_rel() is called as many
times as there are valid join orders. For each such call, we add paths to the
joins between partitions for the corresponding join order between those
partitions. We can generate the append paths for the parent joinrel only after
all such join orders have been considered. Hence, before setting the cheapest
path for the parent join relation, we set the cheapest path for each join
relation between partitions, followed by creating append paths for the parent
joinrel. This method needs some readjustment for multi-level partitions (TODO
item 2 below).

A GUC enable_partition_wise_join is added to enable or disable the
partition-wise join technique. I think the GUC is useful, similar to other
join-related GUCs like enable_hashjoin.

Parameterized paths: While creating parameterized paths for child relations of
a partitioned table, we do not yet know whether we will be able to use the
partition-wise join technique. Nor do we know which child partition of the
other partitioned table a given partition would join to. Hence we do not create
paths parameterized by child partitions of other partitioned relations. But a
path for a child of a partitioned relation that is parameterized by the other
parent relation can be considered parameterized by any child of that other
partitioned relation, by replacing the parent parameters with the corresponding
child parameters. This property is used to eliminate parameterized paths while
creating merge and hash joins, to decide the resultant parameterization of a
join between child partitions, and to create nested loop paths with the inner
path parameterized by the outer relation where the inner and outer relations
are child partitions. While creating such nested loop join paths we translate
the path parameterized by the other parent partitioned relation into one
parameterized by the required child.
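
As an illustration of the kind of query this matters for (hypothetical names; the DDL syntax is illustrative, and it is assumed that each partition of the inner table has an index on the partition key):

CREATE TABLE fact (id int, k int) PARTITION BY RANGE (k);
CREATE TABLE fact_p1 PARTITION OF fact FOR VALUES FROM (0) TO (100);
CREATE TABLE fact_p2 PARTITION OF fact FOR VALUES FROM (100) TO (200);
CREATE TABLE dim (k int, descr text) PARTITION BY RANGE (k);
CREATE TABLE dim_p1 PARTITION OF dim FOR VALUES FROM (0) TO (100);
CREATE TABLE dim_p2 PARTITION OF dim FOR VALUES FROM (100) TO (200);
CREATE INDEX ON dim_p1 (k);
CREATE INDEX ON dim_p2 (k);

-- For each child join fact_pN/dim_pN, a nested loop can use an index scan on
-- dim_pN parameterized by the matching outer child fact_pN; the inner path
-- originally built with the parent "fact" as the parameter source is
-- translated to reference fact_pN instead.
SELECT * FROM fact JOIN dim ON fact.k = dim.k WHERE fact.id < 10;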

Functions like select_outer_pathkeys_for_merge(), make_sort_from_pathkeys() and
find_ec_member_for_tle(), which did not expect to be called for a child
relation, are now used for child partition relations for joins. These functions
are adjusted for that usage.

Testing:
I have added a partition_join.sql test case to test the partition-wise join
feature. That file has extensive tests for list, range and multi-level
partitioning schemes and various kinds of joins, including nested loop joins
with the inner relation parameterized by the outer relation.

make check passes clean.
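
For reference, the multi-level cases look roughly like the following hand-written sketch; the actual DDL and data are in partition_join.sql, and the names, bounds and syntax here are illustrative only.

CREATE TABLE pml1 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE pml1_p1 PARTITION OF pml1 FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE pml1_p1_1 PARTITION OF pml1_p1 FOR VALUES FROM (0) TO (50);
CREATE TABLE pml1_p1_2 PARTITION OF pml1_p1 FOR VALUES FROM (50) TO (100);
CREATE TABLE pml1_p2 PARTITION OF pml1 FOR VALUES FROM (100) TO (200);

CREATE TABLE pml2 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE pml2_p1 PARTITION OF pml2 FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE pml2_p1_1 PARTITION OF pml2_p1 FOR VALUES FROM (0) TO (50);
CREATE TABLE pml2_p1_2 PARTITION OF pml2_p1 FOR VALUES FROM (50) TO (100);
CREATE TABLE pml2_p2 PARTITION OF pml2 FOR VALUES FROM (100) TO (200);

-- With an equi-join on both partition keys, the join can be broken down
-- recursively: first per top-level partition, then per sub-partition within
-- pml1_p1/pml2_p1.
SELECT * FROM pml1 JOIN pml2 ON pml1.a = pml2.a AND pml1.b = pml2.b;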

TODOs:

1. Instead of storing partitioning information in the RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer to
the member of the list representing the partitioning scheme of the
corresponding relation. RelOptInfos of all similarly partitioned relations get
the same pointer, making it easy to match partitioning schemes by comparing
pointers. While we support only exact partition matching now, it's possible to
extend this method to match compatible partitioning schemes by maintaining a
list of compatible partitioning schemes.

Right now, I have moved some partition related structures from partition.c to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep that
structure and its final name.

2. Multi-level partitioned tables: For some reason the paths created for joins
between partitions are not being picked up as the cheapest paths. I think we
need to finalize the lower level paths before moving upwards in the partition
hierarchy, but I am yet to investigate the issue. RelOptInfo::parent_relid
should point to the top parent rather than the immediate parent.

3. Testing: need more tests for partition-wise join with foreign tables as
partitions, and more tests for parameterized joins involving multi-level
partitioned tables.

4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in the
debugger, which otherwise wasn't so easy. If others find it useful too, I can
create a separate patch to be considered for a separate commit.

5. add_paths_to_append_rel() finds the possible set of outer relations for
generating parameterized paths for a given join. This code needs to be adjusted
to eliminate the parent relations from the possible set of outer relations for
a join between child partitions.

6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path for
child partition.

7. The patch adds make_joinrel() (the name needs to be changed because of its
similarity with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide to
retain it.

8. A few small TODOs related to code reorganization, proper function and
variable naming, etc. remain in the patch. A pgindent run is also pending.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Amit Kapila
Date:
On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> 4. Remove bms_to_char(): I have added this function to print Relids in the
> debugger. I have found it very useful to quickly examine Relids in debugger,
> which otherwise wasn't so easy. If others find it useful too, I can create a
> separate patch to be considered for a separate commit.
>

+1 to have such a function.  I often need something like that whenever
I debug the optimizer code.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Amit Langote
Date:
On 2016/09/09 18:47, Ashutosh Bapat wrote:
> A related change is renaming RangeBound structure in Amit
> Langote's patches to PartitionRangeBound to avoid name conflict with
> rangetypes.h. That change too should vanish once we decide where to keep
> that structure and its final name.

This change has been incorporated into the latest patch I posted on Sep 9 [1].

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/28ee345c-1278-700e-39a7-36a71f9a3b43@lab.ntt.co.jp





Re: Partition-wise join for join between (declaratively) partitioned tables

From
Rajkumar Raghuwanshi
Date:

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The patch
is based on the declarative parition support patches provided by Amit Langote
on 26th August 2016.

I have applied the declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then the partition-wise join patch, and am getting the below error during make install.

../../../../src/include/nodes/relation.h:706: error: redefinition of typedef ‘PartitionOptInfo’
../../../../src/include/nodes/relation.h:490: note: previous declaration of ‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Attached is a patch to fix the above error.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> Hi All,
>>
>> PFA the patch to support partition-wise joins for partitioned tables. The
>> patch
>> is based on the declarative parition support patches provided by Amit
>> Langote
>> on 26th August 2016.
>
>
> I have applied declarative partitioning patches posted by Amit Langote on 26
> Aug 2016 and then partition-wise-join patch,  getting below error while make
> install.
>
> ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
> ‘PartitionOptInfo’
> ../../../../src/include/nodes/relation.h:490: note: previous declaration of
> ‘PartitionOptInfo’ was here
> make[4]: *** [gistbuild.o] Error 1
> make[4]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
> make[3]: *** [gist-recursive] Error 2
> make[3]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
> make[1]: *** [all-backend-recurse] Error 2
> make[1]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
> make: *** [all-src-recurse] Error 2
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
>
> Attached the patch for the fix of above error.

Thanks for the report. I will fix this in the next patch.



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
PFA a patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning posted by Amit Langote on 26th August.

>
> TODOs:
>
> 1. Instead of storing partitioning information in RelOptInfo of each of the
> partitioned relations (base and join relations), we can keep a list of
> canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer
> to
> the member of list representing the partitioning scheme of corresponding
> relation. RelOptInfo's of all similarly partitioned relations get the same
> pointer thus making it easy to match the partitioning schemes by comparing
> the
> pointers. While we are supporting only exact partition matching scheme now,
> it's possible to extend this method to match compatible partitioning schemes
> by
> maintaining a list of compatible partitioning schemes.
>
> Right now, I have moved some partition related structures from partition.c
> to
> partition.h. These structures are still being reviewed and might change when
> Amit Langote improves his patches. Having canonical partitioning scheme in
> PlannerInfo may not require moving those structures out. So, that code is
> still
> under development. A related change is renaming RangeBound structure in Amit
> Langote's patches to PartitionRangeBound to avoid name conflict with
> rangetypes.h. That change too should vanish once we decide where to keep
> that
> structure and its final name.

Done.

>
> 2. Multi-level partitioned tables: For some reason path created for joining
> partitions are not being picked up as the cheapest paths. I think, we need
> to
> finalize the lower level paths before moving upwards in the partition
> hierarchy. But I am yet to investigate the issue here.
> RelOptInfo::parent_relid
> should point to top parents rather than immediate parents.

Done

>
> 3. Testing: need more tests for testing partition-wise join with foreign
> tables
> as partitions. More tests for parameterized joins for multi-level
> partitioned
> joins.

Needs to be done.

>
> 4. Remove bms_to_char(): I have added this function to print Relids in the
> debugger. I have found it very useful to quickly examine Relids in debugger,
> which otherwise wasn't so easy. If others find it useful too, I can create a
> separate patch to be considered for a separate commit.

I will take care of this after rebasing the patch on the latest
sources and latest set of patches by Amit Langote.

>
> 5. In add_paths_to_append_rel() to find the possible set of outer relations
> for
> generating parameterized paths for a given join. This code needs to be
> adjusted
> to eliminate the parent relations possible set of outer relations for a join
> between child partitions.

Done.

>
> 6. Add support to reparameterize more types of paths for child relations. I
> will add this once we finalize the method to reparameterize a parent path
> for
> child partition.

Will wait for reviewer's opinion.

>
> 7. The patch adds make_joinrel() (name needs to be changed because of its
> similariy with make_join_rel()) to construct an empty RelOptInfo for a join
> between partitions. The function copies code doing the same from
> build_join_rel(). build_join_rel() too can use this function, if we decide
> to
> retain it.

This will be done as a separate cleanup patch.

>
> 8. Few small TODOs related to code reorganization, proper function,
> variable naming etc. are in the patch. pg_indent run.

I have taken care of most of the TODOs. But there are still some TODOs
remaining. I will take care of those in the next version of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Hi Rajkumar,


On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> Hi All,
>>
>> PFA the patch to support partition-wise joins for partitioned tables. The
>> patch
>> is based on the declarative parition support patches provided by Amit
>> Langote
>> on 26th August 2016.
>
>
> I have applied declarative partitioning patches posted by Amit Langote on 26
> Aug 2016 and then partition-wise-join patch,  getting below error while make
> install.
>
> ../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
> ‘PartitionOptInfo’
> ../../../../src/include/nodes/relation.h:490: note: previous declaration of
> ‘PartitionOptInfo’ was here
> make[4]: *** [gistbuild.o] Error 1
> make[4]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
> make[3]: *** [gist-recursive] Error 2
> make[3]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
> make[2]: *** [access-recursive] Error 2
> make[2]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
> make[1]: *** [all-backend-recurse] Error 2
> make[1]: Leaving directory
> `/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
> make: *** [all-src-recurse] Error 2
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
>

Thanks for the report and the patch.

This is fixed by the patch posted with
https://www.postgresql.org/message-id/CAFjFpRdRFWMc4zNjeJB6p1Ncpznc9DMdXfZJmVK5X_us5zeD9Q%40mail.gmail.com.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Rajkumar Raghuwanshi
Date:
On Tue, Sep 20, 2016 at 4:26 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.

I have applied the declarative partitioning patches posted by Amit Langote on 26 Aug 2016 and then the latest partition-wise join patch, and am getting the below error during make install.

../../../../src/include/catalog/partition.h:37: error: redefinition of typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of ‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

I have commented out the below statement in src/include/catalog/partition.h and then tried to install; it worked fine.

/* typedef struct PartitionSchemeData    *PartitionScheme; */

Thanks & Regards,
Rajkumar Raghuwanshi

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
> ../../../../src/include/catalog/partition.h:37: error: redefinition of
> typedef ‘PartitionScheme’
> ../../../../src/include/nodes/relation.h:492: note: previous declaration of
> ‘PartitionScheme’ was here
[...]
>
> PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Thanks for the report. For some reason, I am not getting these errors
with my compiler

[ashutosh@ubuntu regress]gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

Anyway, I have fixed it in the attached patch.

The patch is based on sources up to commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]

There are a few implementation details that still need to be worked out:
1. adjust_partitionrel_attrs() calls adjust_appendrel_attrs() as many
times as the number of base relations in the join, possibly producing
a new expression tree in every call. It can be optimized to call
adjust_appendrel_attrs() only once. I will work on that if reviewers
agree that adjust_partitionrel_attrs() is needed and should be
optimized.

2. As mentioned in earlier mails, the paths parameterized by parent
partitioned table are translated to be parameterized by child
partitions. That code needs to support more kinds of paths. I will
work on that, if reviewers agree that the approach of translating
paths is acceptable.

3. Because of an issue with the declarative partitioning patch [2],
multi-level partitioned table tests are failing in partition_join.sql.
They were not failing with an earlier set of patches supporting
declarative partitioning. These will be fixed based on the discussion in
that thread.

4. More tests for foreign tables as partitions and for multi-level
partitioned tables.

5. The tests use unpartitioned tables for verifying results. Those
tables and corresponding SQL statements will be removed once the tests
are finalised.

[1]. https://www.postgresql.org/message-id/e5c1c9cf-3f5a-c4d7-6047-7351147aaef9%40lab.ntt.co.jp
[2]. https://www.postgresql.org/message-id/CAFjFpRc%3DT%2BCjpGNkNSdOkHza8VAPb35bngaCdAzPgBkhijmJhg%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Rajkumar Raghuwanshi
Date:

On Thu, Sep 22, 2016 at 4:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
The patch is based on sources upto commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]

I have applied your patch on top of Amit's patches posted on 15th Sept. 2016, and tried to create some test cases for list and multi-level partitioning based on the test cases written for range partitioning.

I got some server crashes and errors, which I have noted as comments in the expected output file; these need to be updated once the issues are fixed. For these cases the expected output was generated by running the same query against non-partitioned tables with the same data.

The attached patch is created on top of Ashutosh's patch posted on 22 Sept 2016.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



 
Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> [ new patch ]

This should probably get updated since Rajkumar reported a crash.
Meanwhile, here are some comments from an initial read-through:

+ * Multiple relations may be partitioned in the same way. The relations
+ * resulting from joining such relations may be partitioned in the same way as
+ * the joining relations.  Similarly, relations derived from such relations by
+ * grouping, sorting be partitioned in the same as the underlying relations.

I think you should change "may be partitioned in the same way" to "are
partitioned in the same way" or "can be regarded as partitioned in the
same way". The sentence that begins with "Similarly," is not
grammatical; it should say something like: ...by grouping or sorting
are partitioned in the same way as the underlying relations.

@@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
                result->bounds->rangeinfo = rangeinfo;
                break;
            }
        }
    }

    MemoryContextSwitchTo(oldcxt);
    rel->rd_partdesc = result;
}

+/*
+ * Are two partition bound collections logically equal?
+ *
+ * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
+ * This is also useful when b1 and b2 are bound collections of two separate
+ * relations, respectively, because BoundCollection is a canonical
+ * representation of a set partition bounds (for given partitioning
+ * strategy).
+ */
+bool
+partition_bounds_equal(PartitionKey key,
 

Spurious hunk.

+ *     For an umpartitioned table, it returns NULL.

Spelling.

+             * two arguemnts and returns boolean. For types, it
suffices to match

Spelling.

+ * partition key expression is stored as a single member list to accomodate

Spelling.

+ * For a base relation, construct an array of partition key expressions. Each
+ * partition key expression is stored as a single member list to accomodate
+ * more partition keys when relations are joined.

How would joining relations result in more partitioning keys getting
added?  Especially given the comment for the preceding function, which
says that a new PartitionScheme gets created unless an exact match is
found.

+            if (!lc)

Test lc == NIL instead of !lc.

+extern int
+PartitionSchemeGetNumParts(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->nparts : 0;
+}

I'm not convinced it's a very good idea for this function to have
special handling for when part_scheme is NULL.  In
try_partition_wise_join() that checks is not needed because it's
already been done, and in generate_partition_wise_join_paths it is
needed but only because you are initializing nparts too early.  If you
move this initialization down below the IS_DUMMY_REL() check you won't
need the NULL guard.  I would ditch this function and let the callers
access the structure member directly.

+extern int
+PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->partnatts : 0;
+}

Similarly here.  have_partkey_equi_join should probably have a
quick-exit path when part_scheme is NULL, and then num_pks can be set
afterwards unconditionally.  Same for match_expr_to_partition_keys.
build_joinrel_partition_info already has it and doesn't need this
double-check.

+extern Oid *
+PartitionDescGetPartOids(PartitionDesc part_desc)
+{
+    Oid       *part_oids;
+    int        cnt_parts;
+
+    if (!part_desc || part_desc->nparts <= 0)
+        return NULL;
+
+    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
+    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
+        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
+
+    return part_oids;
+}

I may be missing something, but this looks like a bad idea in multiple
ways.  First, you've got checks for part_desc's validity here that
should be in the caller, as noted above.  Second, you're copying an
array by looping instead of using memcpy().  Third, the one and only
caller is set_append_rel_size, which doesn't seem to have any need to
copy this data in the first place.  If there is any possibility that
the PartitionDesc is going to change under us while that function is
running, something is deeply broken.  Nothing in the planner is going
to cope with the table structure changing under us, so it had better
not.

+    /*
+     * For a partitioned relation, we will save the child RelOptInfos in parent
+     * RelOptInfo in the same the order as corresponding bounds/lists are
+     * stored in the partition scheme.
+     */

This comment seems misplaced; shouldn't it be next to the code that is
actually doing this, rather than the code that is merely setting up
for it?  And, also, the comment implies that we're doing this instead
of what we'd normally do, whereas I think we are actually doing
something additional.

+        /*
+         * Save topmost parent's relid. If the parent itself is a child of some
+         * other relation, use parent's topmost parent relids.
+         */
+        if (rel->top_parent_relids)
+            childrel->top_parent_relids = rel->top_parent_relids;
+        else
+            childrel->top_parent_relids = bms_copy(rel->relids);

Comment should explain why we're doing it, not what we're doing.  The
comment as written just restates what anybody who's likely to be
looking at this can already see to be true from looking at the code
that follows.  The question is why do it.

+    /* Set only for "other" base or join relations. */
+    Relids        top_parent_relids;

Comment should say what it is, not just when it's set.

+    /* Should have found all the childrels of a partitioned relation. */
+    if (rel->part_scheme)
+    {
+        int        cnt_parts;
+        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+            Assert(rel->part_rels[cnt_parts]);
+    }

A block that does nothing but Assert() should be guarded by #ifdef
USE_ASSERT_CHECKING.  Although, actually, maybe this should be an
elog(), just in case?

+    }
+
+    add_paths_to_append_rel(root, rel, live_childrels);
+}
+
+static void
+add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
+                        List *live_childrels)

The new function should have a header comment, which should include an
explanation of why this is now separate from
set_append_rel_pathlist().

+    if (!live_childrels)

As before, I think live_childrels == NIL is better style.

+            generate_partition_wise_join_paths(root, rel);

Needs an update to the comment earlier in the hunk.  It's important to
explain why this has to be done here and not within
join_search_one_level.

+            /* Recursively collect the paths from child joinrel. */
+            generate_partition_wise_join_paths(root, child_rel);

Given the recursion, check_stack_depth() at top of function is
probably appropriate.  Same for try_partition_wise_join().

+    if (live_children)
+        pfree(live_children);

Given that none of the substructure, including ListCells, will be
freed, this seems utterly pointless.  If it's necessary to recover
memory here at all, we probably need to be more aggressive about it.
Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions?  If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

    /*
+     * An inner path parameterized by the parent relation of outer
+     * relation needs to be reparameterized by the outer relation to be used
+     * for parameterized nested loop join.
+     */

No doubt, but I think the comment is missing the bigger picture -- it
doesn't say anything about this being here to support partition-wise
joins, which seems like a key point.

+        /* If we could not translate the path, don't produce nest loop path. */
+        if (!inner_path)
+            return;

Why would that ever happen?

+/*
+ * If the join between the given two relations can be executed as
+ * partition-wise join create the join relations for partition-wise join,
+ * create paths for those and then create append paths to combine
+ * partition-wise join results.
+ */
+static void
+try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
+                        List *parent_restrictlist)

This comment doesn't accurately describe what the function does.  No
append paths are created here; that happens at a much later stage.  I
think this comment needs quite a bit more work, and maybe the function
should be renamed, too.  There are really two steps involved here:
first, we create paths for each child, attached to a new RelOptInfo
flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
paths for the parent RelOptInfo by appending a path for each child.

Broadly, I think there's a lack of adequate documentation of the
overall theory of operation of this patch.  I believe that an update
to the optimizer README would be appropriate, probably with a new
section but maybe incorporating the new material into an existing
section.  In addition, the comments for individual comments and chunks
of code need to do a better job explaining how each part of the patch
contributes to the overall picture.  I also think we need to do a
better join hammering out the terminology.  I don't particularly like
the term "partition-wise join" in the first place, although I don't
know what would be better, but we certainly need to avoid confusing a
partition-wise join -- which is a join performed by joining each
partition of one partitioned rel to the corresponding partition of a
similarly partitioned rel rather than by the usual execution strategy
of joining the parent rels -- with the concept of an other-join-rel,
which an other-member-rel analogue for joins.  I don't think the patch
is currently very clear about this right now, either in the code or in
the comments.  Maybe this function ought to be named something like
make_child_joins() or make_child_join_paths(), and we could use "child
joins" and/or "child join paths" as standard terminology throughout
the patch.

+    rel1_desc = makeStringInfo();
+    rel2_desc = makeStringInfo();
+
+    /* TODO: remove this notice when finalising the patch. */
+    outBitmapset(rel1_desc, rel1->relids);
+    outBitmapset(rel2_desc, rel2->relids);
+    elog(NOTICE, "join between relations %s and %s is considered for
partition-wise join.",
+         rel1_desc->data, rel2_desc->data);

Please remove your debugging cruft before submitting patches to
pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

+     * We allocate the array for child RelOptInfos till we find at least one
+     * join order which can use partition-wise join technique. If no join order
+     * can use partition-wise join technique, there are no child relations.

This comment has problems.  I think "till" is supposed to be "until",
and there's supposed to be a "don't" in there somewhere.  But really,
I think what you're going for is just /* Allocate when first needed */
which would be a lot shorter and also more clear.

+     * Create join relations for the partition relations, if they do not exist
+     * already. Add paths to those for the given pair of joining relations.

I think the comment could be a bit more explanatory here.  Something
like: "This joinrel is partitioned, so iterate over the partitions and
create paths for each one, allowing us to eventually build an
append-of-joins path for the parent.  Since this routine may be called
multiple times for various join orders, the RelOptInfo needed for each
child join may or may not already exist, but the paths for this join
order definitely do not.  Note that we don't create any actual
AppendPath at this stage; it only makes sense to do that at the end,
after each possible join order has been considered for each child
join.  The best join order may differ from child to child."

+         * partiticipating in the given partition relations. We need them

Spelling.

+/*
+ * Construct the SpecialJoinInfo for the partition-wise join using parents'
+ * special join info. Also, instead of
+ * constructing an sjinfo everytime, we should probably save it in
+ * root->join_info_list and search within it like join_is_legal?
+ */

The lines here are of very different lengths for no particularly good
reason, and it should end with a period, not a question mark.

On the substance of the issue, it seems like the way you're doing this
right now could allocate a very large number of SpecialJoinInfo
structures.  For every join relation, you'll create one
SpecialJoinInfo per legal join order per partition.  That seems like
it could get to be a big number.  I don't know if that's going to be a
problem from a memory-usage standpoint, but it seems like it might.
It's not just the SpecialJoinInfo itself; all of the substructure gets
duplicated, too.

+    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
+    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
+                                                   append_rel_infos1);

Missing a blank line here.

+        AppendRelInfo    *ari = lfirst(lc);

Standard naming convention for an AppendRelInfo variable seems to be
appinfo, not ari.  (I just did "git grep AppendRelInfo".)

+        /* Skip non-equi-join clauses. */
+        if (!rinfo->can_join ||
+            rinfo->hashjoinoperator == InvalidOid ||
+            !rinfo->mergeopfamilies)
+            continue;

There's definitely something ugly about this.  If rinfo->can_join is
false, then we're done.  But suppose one of mergeopfamilies == NIL and
rinfo->hashjoinoperator == InvalidOid is true and the other is false.  Are
we really precluded from doing a partition-wise join in that case, or
are we just prohibited from using certain join strategies?  In most
places where we make similar tests, we're careful not to require more
than we need.

I also think that these tests need to consider the partitioning
operator in use.  Suppose that the partition key is of a type T which
has two operator classes X and Y.  Both relations are partitioned
using an operator from opfamily X, but the join condition mentions
opfamily Y.  I'm pretty sure this precludes a partitionwise join.  If
the join condition used opfamily X, then we would have a guarantee
that two rows which compared as equal would be in the same partition,
but because it uses opfamily Y, that's not guaranteed.  For example,
if T is a text type, X might test for exact equality using "C"
collation rules, while Y might test for equality using some
case-insensitive set of rules.  If the partition boundaries are such
that "foo" and "FOO" are in different partitions, a partitionwise join
using the case-insensitive operator will produce wrong results.  You
can also imagine this happening with numeric, if you have one opclass
(like the default one) that considers 5.0 and 5.00 to be equal, but
another opclass that thinks they are different; if the latter is used
to set the partition bounds, 5.0 and 5.00 could end up in different
partitions - which will be fine if an operator from that opclass is
used for the join, but not if an operator from the regular opclass is
used.

After thinking this over a bit, I think the right way to think about this is:

1. Amit's patch currently only ever uses btree opfamilies for
partitioning.  It uses those for both range partitioning and list
partitioning.  If we ever support hash partitioning, we would
presumably use hash opfamilies for that purpose, but right now it's
all about btree opfamilies.

2. Therefore, if A and B are partitioned but the btree opfamilies
don't match, they don't have the same partitioning scheme and this
code should never be reached.  Similarly, if they use the same
opfamily but different collations, the partitioning schemes shouldn't
match and therefore this code should not be reached.

3. If A and B are partitioned and the partitioning opfamilies - which
are necessarily btree opfamilies - do match, then the operator which
appears in the query needs to be from the same opfamily and have
amopstrategy of BTEqualStrategyNumber within that opfamily.  If not,
then a partition-wise join is not possible.

4. Assuming the above conditions are met, have_partkey_equi_join
doesn't need to care whether the operator chosen has mergeopfamilies
or a valid hashjoinoperator.  Those factors will control which join
methods are legal, but not whether a partitionwise join is possible in
principle.

Let me know whether that seems right.
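
In code, I'm imagining a check of roughly this shape; the function below
is hypothetical and written only to illustrate the rule, but
op_in_opfamily() and get_op_opfamily_strategy() are the existing lsyscache
helpers:

    /*
     * Return true if the clause operator is the btree equality operator
     * of the partitioning opfamily, which is what a partition-wise join
     * would require.
     */
    static bool
    clause_op_matches_part_opfamily(Oid clause_op, Oid part_opfamily)
    {
        /* The operator must belong to the partitioning opfamily ... */
        if (!op_in_opfamily(clause_op, part_opfamily))
            return false;

        /* ... and must be its equality strategy. */
        return get_op_opfamily_strategy(clause_op, part_opfamily) ==
            BTEqualStrategyNumber;
    }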

+     * RelabelType node; eval_const_expressions() will have simplied if more

Spelling.

    /*
+     * Code below scores equivalence classes by how many equivalence members
+     * can produce join clauses for this join relation. Equivalence members
+     * which do not cover the parents of a partition-wise join relation, can
+     * produce join clauses for partition-wise join relation.
+     */

I don't know what that means.  The comma in the second sentence
doesn't belong there.

+    /*
+     * TODO: Instead of copying and mutating the trees one child relation at a
+     * time, we should be able to do this en-masse for all the partitions
+     * involved.
+     */

I don't see how that would be possible, but if it's a TODO, you'd
better do it (or decide not to do it and remove or change the
comment).
    /*
     * Create explicit sort nodes for the outer and inner paths if necessary.
     */
    if (best_path->outersortkeys)
    {
+        Relids        outer_relids = outer_path->parent->relids;
         Sort       *sort = make_sort_from_pathkeys(outer_plan,
-                                                   best_path->outersortkeys);
+                                                   best_path->outersortkeys,
+                                                   outer_relids);

The changes related to make_sort_from_pathkeys() are pretty opaque to
me.  Can you explain?

+     * Change parameterization of sub paths recursively. Also carry out any

"sub paths" should not be two words, here or anywhere.

+reparameterize_path_for_child(PlannerInfo *root, Path *path,
+                              RelOptInfo *child_rel)

This is suspiciously unlike reparameterize_path.  Why?

+    /* Computer information relevant to the foreign relations. */
+    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

Perhaps this refactoring could be split out into a preliminary patch,
which would then simplify this patch.  And same for add_join_rel().

+     * Produce partition-wise joinrel's targetlist by translating the parent
+     * joinrel's targetlist. This will also include the required placeholder

Again the confusion between a "child" join and a partition-wise join...

+    /*
+     * Nothing to do if
+     * a. partition-wise join is disabled.
+     * b. joining relations are not partitioned.
+     * c. partitioning schemes do not match.
+     */
+

I don't think that's going to survive pgindent.

+     * are not considered equal, an equi-join involing inner partition keys

Spelling.

+     * Collect the partition key expressions. An OUTER join will produce rows
+     * where the partition key columns of inner side are NULL and may not fit
+     * the partitioning scheme with inner partition keys. Since two NULL values
+     * are not considered equal, an equi-join involing inner partition keys
+     * still prohibits cross-partition joins while joining with another
+     * similarly partitioned relation.

I can't figure out what this comment is trying to tell me.  Possibly I
just need more caffeine.

+ * Adding these two join_rel_level list also means that top level list has more
+ * than one join relation, which is symantically incorrect.

I don't understand this, either; also, spelling.

As a general comment, the ratio of tests-to-code in this patch is way
out of line with PostgreSQL's normal practices.  The total patch file
is 10965 lines. The test cases begin at line 3047, meaning that in
round figures you've got about one-quarter code and about
three-quarters test cases.  I suspect that a large fraction of those
test cases aren't adding any meaningful code coverage and will just
take work to maintain.  That needs to be slimmed down substantially in
any version of this considered for commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Wed, Sep 28, 2016 at 2:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> [ new patch ]
>
> This should probably get updated since Rajkumar reported a crash.
> Meanwhile, here are some comments from an initial read-through:

Done. Fixed those crashes. Also fixed some crashes in foreign table
code and postgres_fdw. The tests were provided by Rajkumar. I am
working on including those in my patch. The attached patch is still
based on Amit's set of patches posted on 15th Sept. 2016. He is
addressing your comments on his patches, so I am expecting a more
stable version to arrive soon. I will rebase my patches then. Because of
a bug in those patches related to multi-level partitioned tables and
lateral joins and also a restriction on sharing partition keys across
levels of partitions, the testcase is still failing. I will work on
that while rebasing the patch.

>
> + * Multiple relations may be partitioned in the same way. The relations
> + * resulting from joining such relations may be partitioned in the same way as
> + * the joining relations.  Similarly, relations derived from such relations by
> + * grouping, sorting be partitioned in the same as the underlying relations.
>
> I think you should change "may be partitioned in the same way" to "are
> partitioned in the same way" or "can be regarded as partitioned in the
> same way".

The relations resulting from joining partitioned relations are
partitioned in the same way, if there exist equi-join condition/s
between their partition keys. If such equi-joins do not exist, the
join is *not* partitioned. Hence I did not use "are" or "can be" which
indicate a certainty. Instead I used "may" which indicates
"uncertainty". I am not sure whether that's a good place to explain
the conditions under which such relations are partitioned. Those
conditions will change as we implement more and more partition-wise
join strategies. But that comment conveys two things: 1. a partition
scheme makes sense for all kinds of relations, and 2. multiple relations
(of any kind) may share a partition scheme. I have slightly changed the
wording to make this point clear. Please let me know if it looks
better.

> The sentence that begins with "Similarly," is not
> grammatical; it should say something like: ...by grouping or sorting
> are partitioned in the same way as the underlying relations.

Done. Instead of "are" I have used "may" for the same reason as above.

>
> @@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
>                  result->bounds->rangeinfo = rangeinfo;
>                  break;
>              }
>          }
>      }
>
>      MemoryContextSwitchTo(oldcxt);
>      rel->rd_partdesc = result;
>  }
>
> +
>  /*
>   * Are two partition bound collections logically equal?
>   *
>   * Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
>   * This is also useful when b1 and b2 are bound collections of two separate
>   * relations, respectively, because BoundCollection is a canonical
>   * representation of a set partition bounds (for given partitioning strategy).
>   */
>  bool
>  partition_bounds_equal(PartitionKey key,
>
> Spurious hunk.
>

Thanks. Done.

> + *     For an umpartitioned table, it returns NULL.
>
> Spelling.

Done. Thanks.

>
> +             * two arguemnts and returns boolean. For types, it
> suffices to match
>
> Spelling.

Thanks. Done.

>
> + * partition key expression is stored as a single member list to accomodate
>
> Spelling.

Thanks. Done.

>
> + * For a base relation, construct an array of partition key expressions. Each
> + * partition key expression is stored as a single member list to accomodate
> + * more partition keys when relations are joined.
>
> How would joining relations result in more partitioning keys getting
> added?  Especially given the comment for the preceding function, which
> says that a new PartitionScheme gets created unless an exact match is
> found.

Let's assume that relations A and B are partitioned by columns a and b
respectively and have the same partitioning scheme. This means that the
datatypes of a and b, as well as the opclass used for comparing partition
key values of A and B, are the same. A join between A and B with condition
A.a = B.b is partitioned by both A.a and B.b. We need to keep track of
both keys in case AB joins with C, which is partitioned in the same
manner. I guess the confusion is with the term "partition keys", which is
being used to indicate both the class of a partition key and an instance
of a partition key. In the above example, the datatype of the partition
key and the opclass together indicate the partition key class, whereas A.a
and B.b are instances of that class. An increase in partition keys may
mean either an increase in the number of classes or an increase in the
number of instances; in the above comment I meant the number of instances.
Maybe we should use "partition key expressions" to indicate the partition
key instances and "partition key" to indicate the partition key class. I
have changed the comments to use partition keys and partition key
expressions appropriately. Please let me know if the comments are worded
correctly.

PartitionScheme does not hold the actual partition key expressions. It
holds the partition key type and opclass used for comparison, which
should be the same for all the relations sharing the partition scheme.
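
To illustrate with a sketch (the variable names here are only for
illustration, not the exact code in the patch):

    List  **partexprs;      /* one list of expressions per partition key */

    /* base relation A, partitioned by a: a single-member list */
    partexprs[0] = list_make1(a_expr);

    /* joinrel AB, after matching A.a = B.b: both expressions */
    partexprs[0] = lappend(partexprs[0], b_expr);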

>
> +            if (!lc)
>
> Test lc == NIL instead of !lc.

NIL is defined as (List *) NULL and lc is a ListCell *, so I changed the
test to lc == NULL instead of !lc.

>
> +extern int
> +PartitionSchemeGetNumParts(PartitionScheme part_scheme)
> +{
> +    return part_scheme ? part_scheme->nparts : 0;
> +}
>
> I'm not convinced it's a very good idea for this function to have
> special handling for when part_scheme is NULL.  In
> try_partition_wise_join() that checks is not needed because it's
> already been done, and in generate_partition_wise_join_paths it is
> needed but only because you are initializing nparts too early.  If you
> move this initialization down below the IS_DUMMY_REL() check you won't
> need the NULL guard.  I would ditch this function and let the callers
> access the structure member directly.
>
> +extern int
> +PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
> +{
> +    return part_scheme ? part_scheme->partnatts : 0;
> +}
>
> Similarly here.  have_partkey_equi_join should probably have a
> quick-exit path when part_scheme is NULL, and then num_pks can be set
> afterwards unconditionally.  Same for match_expr_to_partition_keys.
> build_joinrel_partition_info already has it and doesn't need this
> double-check.
>
> +extern Oid *
> +PartitionDescGetPartOids(PartitionDesc part_desc)
> +{
> +    Oid       *part_oids;
> +    int        cnt_parts;
> +
> +    if (!part_desc || part_desc->nparts <= 0)
> +        return NULL;
> +
> +    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
> +    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
> +        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
> +
> +    return part_oids;
> +}
>
> I may be missing something, but this looks like a bad idea in multiple
> ways.  First, you've got checks for part_desc's validity here that
> should be in the caller, as noted above.  Second, you're copying an
> array by looping instead of using memcpy().  Third, the one and only
> caller is set_append_rel_size, which doesn't seem to have any need to
> copy this data in the first place.  If there is any possibility that
> the PartitionDesc is going to change under us while that function is
> running, something is deeply broken.  Nothing in the planner is going
> to cope with the table structure changing under us, so it had better
> not.

These three functions were written based on Amit Langote's patches
which did not expose partition related structures outside partition.c.
Hence they required wrappers. I have moved PartitionSchemeData to
partition.h and removed these functions. Instead the members are
accessed directly.

>
> +    /*
> +     * For a partitioned relation, we will save the child RelOptInfos in parent
> +     * RelOptInfo in the same the order as corresponding bounds/lists are
> +     * stored in the partition scheme.
> +     */
>
> This comment seems misplaced; shouldn't it be next to the code that is
> actually doing this, rather than the code that is merely setting up
> for it?  And, also, the comment implies that we're doing this instead
> of what we'd normally do, whereas I think we are actually doing
> something additional.
>

Ok. I have moved the comment a few lines below, near the code which saves
the partition RelOptInfos.

> +        /*
> +         * Save topmost parent's relid. If the parent itself is a child of some
> +         * other relation, use parent's topmost parent relids.
> +         */
> +        if (rel->top_parent_relids)
> +            childrel->top_parent_relids = rel->top_parent_relids;
> +        else
> +            childrel->top_parent_relids = bms_copy(rel->relids);
>
> Comment should explain why we're doing it, not what we're doing.  The
> comment as written just restates what anybody who's likely to be
> looking at this can already see to be true from looking at the code
> that follows.  The question is why do it.
>

The point of that comment is to explain how it percolates down the
hierarchy, which is not so clear from the code. I have changed it to
read
/*
 * Recursively save topmost parent's relid in RelOptInfos of
 * partitions.
 */

Or were you expecting the comment to explain the purpose of
top_parent_relids? I don't think that's a good idea, since the purpose
will change over time and the comment will soon be out of sync with the
actual code, unless the developers expanding the usage remember to update
the comment. I have not seen comments explaining the purpose next to such
assignments; take RelOptInfo::relids, for example.

> +    /* Set only for "other" base or join relations. */
> +    Relids        top_parent_relids;
>
> Comment should say what it is, not just when it's set.

Done. Check if it looks good.

>
> +    /* Should have found all the childrels of a partitioned relation. */
> +    if (rel->part_scheme)
> +    {
> +        int        cnt_parts;
> +        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
> +            Assert(rel->part_rels[cnt_parts]);
> +    }
>
> A block that does nothing but Assert() should be guarded by #ifdef
> USE_ASSERT_CHECKING.  Although, actually, maybe this should be an
> elog(), just in case?

Changed it to elog().
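It now reads roughly like this (the exact message wording may differ in
the patch):

    if (rel->part_scheme)
    {
        int     cnt_parts;

        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
            if (rel->part_rels[cnt_parts] == NULL)
                elog(ERROR, "child relation not found for partitioned relation");
    }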

>
> +    }
> +
> +    add_paths_to_append_rel(root, rel, live_childrels);
> +}
> +
> +static void
> +add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
> +                        List *live_childrels)
>
> The new function should have a header comment, which should include an
> explanation of why this is now separate from
> set_append_rel_pathlist().

Sorry for missing it. Added the prologue. Let me know if it looks
good. I have made sure that all functions have a prologue and tried to
match the style with surrounding functions. Let me know if I have
still missed any or the styles do not match.

>
> +    if (!live_childrels)
>
> As before, I think live_childrels == NIL is better style.

Fixed.

>
> +            generate_partition_wise_join_paths(root, rel);
>
> Needs an update to the comment earlier in the hunk.  It's important to
> explain why this has to be done here and not within
> join_search_one_level.

Thanks for pointing that out. Similar to generate_gather_paths(), we
need to add explanation in standard_join_search() as well as in the
function prologue. Did that. Let me know if it looks good.

>
> +            /* Recursively collect the paths from child joinrel. */
> +            generate_partition_wise_join_paths(root, child_rel);
>
> Given the recursion, check_stack_depth() at top of function is
> probably appropriate.  Same for try_partition_wise_join().

Done. I wouldn't imagine a user creating that many levels of
partitions, but it's good to guard against some automated script that
has gone berserk.

>
> +    if (live_children)
> +        pfree(live_children);
>
> Given that none of the substructure, including ListCells, will be
> freed, this seems utterly pointless.  If it's necessary to recover
> memory here at all, we probably need to be more aggressive about it.

I intended to use list_free() instead of pfree(). Fixed that.

> Have you tested the effect of this patch on planner memory consumption
> with multi-way joins between tables with many partitions?  If you
> haven't, you probably should. (Testing runtime would be good, too.)
> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
> don't matter, but if we're generating too much garbage we'll have to
> make sure it gets cleaned up soon enough to prevent runaway memory
> usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

As a side question, do we have a function to free an expression tree?
I didn't find any.

>
>      /*
> +     * An inner path parameterized by the parent relation of outer
> +     * relation needs to be reparameterized by the outer relation to be used
> +     * for parameterized nested loop join.
> +     */
>
> No doubt, but I think the comment is missing the bigger picture -- it
> doesn't say anything about this being here to support partition-wise
> joins, which seems like a key point.

I have tried to explain the partition-wise join context. Let me know
if it looks good.

>
> +        /* If we could not translate the path, don't produce nest loop path. */
> +        if (!inner_path)
> +            return;
>
> Why would that ever happen?

Right now, reparameterize_path_for_child() does not support all kinds
of paths. So I have added that condition. I will add support for more
path types there once we agree that this is the right way to translate
the paths and that the path translation is required.
>
> +/*
> + * If the join between the given two relations can be executed as
> + * partition-wise join create the join relations for partition-wise join,
> + * create paths for those and then create append paths to combine
> + * partition-wise join results.
> + */
> +static void
> +try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
> +                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
> +                        List *parent_restrictlist)
>
> This comment doesn't accurately describe what the function does.  No
> append paths are created here; that happens at a much later stage.

Removed the reference to the append paths. Sorry for leaving it there
when I moved the append path creation to a later stage.

> I
> think this comment needs quite a bit more work, and maybe the function
> should be renamed, too.

Improved the comments in the prologue and inside the function. Please
let me know if they look good.

> There are really two steps involved here:
> first, we create paths for each child, attached to a new RelOptInfo
> flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
> paths for the parent RelOptInfo by appending a path for each child.
>

Right, the first one is done in try_partition_wise_join() and the
latter is done in generate_partition_wise_join_paths().

> Broadly, I think there's a lack of adequate documentation of the
> overall theory of operation of this patch.  I believe that an update
> to the optimizer README would be appropriate, probably with a new
> section but maybe incorporating the new material into an existing
> section.

Done. I have added a separate section to optimizer/README.

> In addition, the comments for individual functions and chunks
> of code need to do a better job explaining how each part of the patch
> contributes to the overall picture.


> I also think we need to do a
> better job hammering out the terminology.  I don't particularly like
> the term "partition-wise join" in the first place, although I don't
> know what would be better, but we certainly need to avoid confusing a
> partition-wise join -- which is a join performed by joining each
> partition of one partitioned rel to the corresponding partition of a
> similarly partitioned rel rather than by the usual execution strategy
> of joining the parent rels -- with the concept of an other-join-rel,
> which is an other-member-rel analogue for joins.  I don't think the patch
> is currently very clear about this right now, either in the code or in
> the comments.  Maybe this function ought to be named something like
> make_child_joins() or make_child_join_paths(), and we could use "child
> joins" and/or "child join paths" as standard terminology throughout
> the patch.

Partition-wise join is a widely used term in the literature, and other
DBMSes use the same term as well, so I think we should stick with
"partition-wise join". A partition-wise join, as you have described, is a
join performed by joining each partition of one partitioned rel to the
corresponding partition of a similarly partitioned rel rather than by
the usual execution strategy of joining the parent rels. I have
usually used the term "partition-wise join technique" to refer to this
method. I have changed the other usages of this term to use wording
like "child joins", "joins between partitions", or "joins between child
relations" as appropriate. Also, I have changed the names of functions
dealing with joins between partitions to use child_join instead of
partition_join or partition_wise_join.

Since partition-wise join is a method to join two relations just like
other methods, try_partition_wise_join() fits into the naming
convention try_<join technique> like try_nestloop_join.

>
> +    rel1_desc = makeStringInfo();
> +    rel2_desc = makeStringInfo();
> +
> +    /* TODO: remove this notice when finalising the patch. */
> +    outBitmapset(rel1_desc, rel1->relids);
> +    outBitmapset(rel2_desc, rel2->relids);
> +    elog(NOTICE, "join between relations %s and %s is considered for
> partition-wise join.",
> +         rel1_desc->data, rel2_desc->data);
>
> Please remove your debugging cruft before submitting patches to
> pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

I kept this one intentionally. But as the TODO comment says, I do
intend to remove it once testing is over. Those messages make it very
easy to know whether partition-wise join was considered for a given
join or not. Without those messages, one has to break into
try_partition_wise_join() to figure out whether partition-wise join
was used or not. The final plan may not come out to be partition-wise
join plan even if partition-wise join was considered. Although, I have
now used DEBUG3 instead of NOTICE and removed those lines from the
expected output.

>
> +     * We allocate the array for child RelOptInfos till we find at least one
> +     * join order which can use partition-wise join technique. If no join order
> +     * can use partition-wise join technique, there are no child relations.
>
> This comment has problems.  I think "till" is supposed to be "until",
> and there's supposed to be a "don't" in there somewhere.  But really,
> I think what you're going for is just /* Allocate when first needed */
> which would be a lot shorter and also more clear.

Sorry for those mistakes. Yes, the shorter version is better. Fixed the
comment as per your suggestion.

>
> +     * Create join relations for the partition relations, if they do not exist
> +     * already. Add paths to those for the given pair of joining relations.
>
> I think the comment could be a bit more explanatory here.  Something
> like: "This joinrel is partitioned, so iterate over the partitions and
> create paths for each one, allowing us to eventually build an
> append-of-joins path for the parent.  Since this routine may be called
> multiple times for various join orders, the RelOptInfo needed for each
> child join may or may not already exist, but the paths for this join
> order definitely do not.  Note that we don't create any actual
> AppendPath at this stage; it only makes sense to do that at the end,
> after each possible join order has been considered for each child
> join.  The best join order may differ from child to child."
>

Copied verbatim. Thanks for the detailed comment.


> +         * partiticipating in the given partition relations. We need them
>
> Spelling.
>

Done. Also fixed other grammatical mistakes and typos in that comment.

> +/*
> + * Construct the SpecialJoinInfo for the partition-wise join using parents'
> + * special join info. Also, instead of
> + * constructing an sjinfo everytime, we should probably save it in
> + * root->join_info_list and search within it like join_is_legal?
> + */
>
> The lines here are of very different lengths for no particularly good
> reason, and it should end with a period, not a question mark.

My bad. Sorry. Fixed.

>
> On the substance of the issue, it seems like the way you're doing this
> right now could allocate a very large number of SpecialJoinInfo
> structures.  For every join relation, you'll create one
> SpecialJoinInfo per legal join order per partition.  That seems like
> it could get to be a big number.  I don't know if that's going to be a
> problem from a memory-usage standpoint, but it seems like it might.
> It's not just the SpecialJoinInfo itself; all of the substructure gets
> duplicated, too.
>

Yes. We need the SpecialJoinInfo structures for the existing path
creation to work. The code will be complicated if we try to use the
parent SpecialJoinInfo instead of creating those for children. We may free
memory allocated in SpecialJoinInfo to save some memory.
SpecialJoinInfos are not needed once the paths are created. Still we
will waste some memory for semi_rhs_exprs, which are reused for unique
paths. But otherwise we will reclaim the rest of the memory. Memory
wastage in adjust_partition_relids() may be minimized by modifying
adjust_appendrel_attrs() to accept a list of AppendRelInfos and mutating
the tree only once rather than doing it N times for an N-way join.

> +    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
> +    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
> +                                                   append_rel_infos1);
>
> Missing a blank line here.

Done.

>
> +        AppendRelInfo    *ari = lfirst(lc);
>
> Standard naming convention for an AppendRelInfo variable seems to be
> appinfo, not ari.  (I just did "git grep AppendRelInfo".)

Done.

>
> +        /* Skip non-equi-join clauses. */
> +        if (!rinfo->can_join ||
> +            rinfo->hashjoinoperator == InvalidOid ||
> +            !rinfo->mergeopfamilies)
> +            continue;
>
> There's definitely something ugly about this.  If rinfo->can_join is
> false, then we're done.  But suppose one of mergeopfamilies == NIL and
> rinfo->hashjoinoperator == InvalidOid is true and the other is false.  Are
> we really precluded from doing a partition-wise join in that case, or
> are we just prohibited from using certain join strategies?  In most
> places where we make similar tests, we're careful not to require more
> than we need.

Right. That condition is flawed. Corrected it.

>
> I also think that these tests need to consider the partitioning
> operator in use.  Suppose that the partition key is of a type T which
> has two operator classes X and Y.  Both relations are partitioned
> using an operator from opfamily X, but the join condition mentions
> opfamily Y.  I'm pretty sure this precludes a partitionwise join.  If
> the join condition used opfamily X, then we would have a guarantee
> that two rows which compared as equal would be in the same partition,
> but because it uses opfamily Y, that's not guaranteed.  For example,
> if T is a text type, X might test for exact equality using "C"
> collation rules, while Y might test for equality using some
> case-insensitive set of rules.  If the partition boundaries are such
> that "foo" and "FOO" are in different partitions, a partitionwise join
> using the case-insensitive operator will produce wrong results.  You
> can also imagine this happening with numeric, if you have one opclass
> (like the default one) that considers 5.0 and 5.00 to be equal, but
> another opclass that thinks they are different; if the latter is used
> to set the partition bounds, 5.0 and 5.00 could end up in different
> partitions - which will be fine if an operator from that opclass is
> used for the join, but not if an operator from the regular opclass is
> used.

Your description above uses opfamily and opclass interchangeably: it
starts by saying X and Y are classes but then also refers to them as
families. But I got the point. I guess, similar to
relation_has_unique_index_for(), I have to check whether the operator
family specified in the partition scheme is present in the
mergeopfamilies of the RestrictInfo for the matching partition key. I have
added that check and restructured that portion of the code to be more
readable.

>
> After thinking this over a bit, I think the right way to think about this is:
>
> 1. Amit's patch currently only ever uses btree opfamilies for
> partitioning.  It uses those for both range partitioning and list
> partitioning.  If we ever support hash partitioning, we would
> presumably use hash opfamilies for that purpose, but right now it's
> all about btree opfamilies.
>
> 2. Therefore, if A and B are partitioned but the btree opfamilies
> don't match, they don't have the same partitioning scheme and this
> code should never be reached.  Similarly, if they use the same
> opfamily but different collations, the partitioning schemes shouldn't
> match and therefore this code should not be reached.

That's right.

>
> 3. If A and B are partitioned and the partitioning opfamilies - which
> are necessarily btree opfamilies - do match, then the operator which
> appears in the query needs to be from the same opfamily and have
> amopstrategy of BTEqualStrategyNumber within that opfamily.  If not,
> then a partition-wise join is not possible.

I think this is achieved by checking whether the opfamily for the given
partition key is present in the mergeopfamilies of the corresponding
RestrictInfo, as stated above.
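
In code, the check is essentially this (the exact field names may differ
in the patch):

    /* The partitioning opfamily must appear in the clause's mergeopfamilies. */
    if (!list_member_oid(rinfo->mergeopfamilies,
                         part_scheme->partopfamily[ipk]))
        continue;       /* this clause cannot match this partition key */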

>
> 4. Assuming the above conditions are met, have_partkey_equi_join
> doesn't need to care whether the operator chosen has mergeopfamilies
> or a valid hashjoinoperator.  Those factors will control which join
> methods are legal, but not whether a partitionwise join is possible in
> principle.

If mergeopfamilies is NIL, the above check will fail anyway. But skipping
a clause whose mergeopfamilies is NIL will save some cycles in matching
expressions.

There is something strange happening with Amit's patch. When we create
a table partitioned by range on a column of type int2vector, it
somehow gets a btree operator family, but the RestrictInfo of an equality
condition on that column doesn't have mergeopfamilies set; instead it has
a hashjoinoperator. In that case, if we ignore the hashjoinoperator, we
won't be able to apply partition-wise join. I guess in such a case we
want to play it safe and not apply partition-wise join, even though
applying it would give the correct result.

>
> +     * RelabelType node; eval_const_expressions() will have simplied if more
>
> Spelling.
>

Thanks. Done.

>
>      /*
> +     * Code below scores equivalence classes by how many equivalence members
> +     * can produce join clauses for this join relation. Equivalence members
> +     * which do not cover the parents of a partition-wise join relation, can
> +     * produce join clauses for partition-wise join relation.
> +     */
>
> I don't know what that means.  The comma in the second sentence
> doesn't belong there.

Sorry for that construction. I have changed the comment to be
something more meaningful.

>
> +    /*
> +     * TODO: Instead of copying and mutating the trees one child relation at a
> +     * time, we should be able to do this en-masse for all the partitions
> +     * involved.
> +     */
>
> I don't see how that would be possible, but if it's a TODO, you'd
> better do it (or decide not to do it and remove or change the
> comment).

That should be doable by passing a list of AppendRelInfo structures to
adjust_appendrel_attrs_mutator(). In the mutator, we would have to check
each appinfo instead of just one. But that's a lot of refactoring; it may
be done as a separate patch if we are consuming too much memory. I have
removed the TODO for now.
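
The mutator change would look roughly like this (only a sketch of the
idea, not the final code):

    /* Try each AppendRelInfo instead of assuming a single one. */
    foreach(lc, context->appinfos)
    {
        AppendRelInfo  *appinfo = (AppendRelInfo *) lfirst(lc);

        if (var->varno == appinfo->parent_relid)
        {
            /*
             * Translate the Var using this appinfo, exactly as the
             * single-appinfo case already does, and stop looking.
             */
            break;
        }
    }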

>
>      /*
>       * Create explicit sort nodes for the outer and inner paths if necessary.
>       */
>      if (best_path->outersortkeys)
>      {
> +        Relids        outer_relids = outer_path->parent->relids;
>          Sort       *sort = make_sort_from_pathkeys(outer_plan,
> -                                                   best_path->outersortkeys);
> +                                                   best_path->outersortkeys,
> +                                                   outer_relids);
>
> The changes related to make_sort_from_pathkeys() are pretty opaque to
> me.  Can you explain?

prepare_sort_from_pathkeys() accepts Relids as one of its arguments in
order to find equivalence members belonging to child relations; it does
not expect relids when searching equivalence members for parent
relations. Before this patch, make_sort_from_pathkeys() passed NULL to
that function, because it never had to deal with child relations.
Because of partition-wise joins, we now need to sort child relations for
merge joins or to create unique paths. So make_sort_from_pathkeys() is
required to pass relids to prepare_sort_from_pathkeys() when
processing child relations, so that the latter does not skip child
members.

>
> +     * Change parameterization of sub paths recursively. Also carry out any
>
> "sub paths" should not be two words, here or anywhere.

Fixed.

>
> +reparameterize_path_for_child(PlannerInfo *root, Path *path,
> +                              RelOptInfo *child_rel)
>
> This is suspiciously unlike reparameterize_path.  Why?

reparameterize_path() tries to create a path with a new parameterization
from an existing parameterized path, so it looks for additional
conditions to expand the parameterization. But this function translates
a path parameterized by the parent into one parameterized by its child.
That does not involve looking for any extra conditions, only translating
the existing ones so that they can be used with a child. A better name
would be translate_parampath_to_child() or something else which uses the
word "translate" instead of "reparameterize", but every name like that
gets too long. For now I have renamed it reparameterize_path_by_child().
I also added a comment in the function prologue about cost, rows, width
etc.

>
> +    /* Computer information relevant to the foreign relations. */
> +    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
>
> Perhaps this refactoring could be split out into a preliminary patch,
> which would then simplify this patch.  And same for add_join_rel().
>

Yes, that's better. I will split that code out into a separate patch.

There's code in build_join_rel() and build_partition_join_rel() (I
will change that name) which creates a joinrel RelOptInfo. Most of
that code simply sets fields to NULL or 0 and is duplicated in both
functions. Do you see any value in separating it out into its own
function?

Also, makeNode() uses palloc0(), so makeNode(RelOptInfo) already sets
most of the fields to 0 or NULL. Why do we then set those fields to NULL
or 0 again? Should I try to remove the unnecessary assignments?

> +     * Produce partition-wise joinrel's targetlist by translating the parent
> +     * joinrel's targetlist. This will also include the required placeholder
>
> Again the confusion between a "child" join and a partition-wise join...
>
> +    /*
> +     * Nothing to do if
> +     * a. partition-wise join is disabled.
> +     * b. joining relations are not partitioned.
> +     * c. partitioning schemes do not match.
> +     */
> +
>
> I don't think that's going to survive pgindent.

Changed this code a bit.

>
> +     * are not considered equal, an equi-join involing inner partition keys
>
> Spelling.
>
> +     * Collect the partition key expressions. An OUTER join will produce rows
> +     * where the partition key columns of inner side are NULL and may not fit
> +     * the partitioning scheme with inner partition keys. Since two NULL values
> +     * are not considered equal, an equi-join involing inner partition keys
> +     * still prohibits cross-partition joins while joining with another
> +     * similarly partitioned relation.
>
> I can't figure out what this comment is trying to tell me.  Possibly I
> just need more caffeine.

Re-wrote the comment with examples and a detailed explanation. The
comment discusses whether the inner side's partition key expressions
should be considered partition key expressions of the join, given that
for an OUTER join the inner partition key expressions may go to NULL,
and explains why it's safe to do so. If we don't do that, a FULL OUTER
join will have no partition key expressions at all, and thus the
partition-wise join technique would be useless for an N-way FULL OUTER
join even when it's safe to use it.

>
> + * Adding these two join_rel_level list also means that top level list has more
> + * than one join relation, which is symantically incorrect.
>
> I don't understand this, either; also, spelling.

I think that sentence is not required. Removed it.

>
> As a general comment, the ratio of tests-to-code in this patch is way
> out of line with PostgreSQL's normal practices.  The total patch file
> is 10965 lines. The test cases begin at line 3047, meaning that in
> round figures you've got about one-quarter code and about
> three-quarters test cases.  I suspect that a large fraction of those
> test cases aren't adding any meaningful code coverage and will just
> take work to maintain.  That needs to be slimmed down substantially in
> any version of this considered for commit.

I agree. We require two kinds of tests: 1. those which test partition
scheme matching, and 2. those which test the planner code that deals with
path creation. I have added both kinds of testcases for all kinds of
partitioning schemes (range, list, multi-level, and partition keys being
expressions or columns). That's not required. We need the 1st kind of
tests for all partitioning schemes and the 2nd kind of testcases only for
one of the partitioning schemes, so the number of tests will definitely
come down. A possible extreme would be to use a single multi-level
partitioned test which includes all kinds of partitioning schemes at
various partition levels, but that kind of testcase would be highly
unreadable and harder to maintain. Let me know what you think. I will
work on that in the next version of the patch. The test still fails
because of a bug in Amit's earlier set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> Have you tested the effect of this patch on planner memory consumption
>> with multi-way joins between tables with many partitions?  If you
>> haven't, you probably should. (Testing runtime would be good, too.)
>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>> don't matter, but if we're generating too much garbage we'll have to
>> make sure it gets cleaned up soon enough to prevent runaway memory
>> usage.
>
> I tried to check memory usage with various combinations of number of
> partitions and number of relations being joined. For higher number of
> relations being joined like 10 with 100 partitions, OOM killer kicked
> in during the planning phase. I am suspecting
> adjust_partitionrel_attrs() (changed that name to
> adjust_join_appendrel_attrs() to be in sync with
> adjust_appendrel_attrs()) to be the culprit. It copies expression
> trees every time for joining two children. That's an exponentially
> increasing number as the number of legal joins increases
> exponentially. I am still investigating this.

I think the root of this problem is that the existing paths share a
lot more substructure than the ones created by the new code.  Without
a partition-wise join, the incremental memory usage for a joinrel
isn't any different whether the underlying rel is partitioned or not.
If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
be pointing to some kind of Scan.  But the join itself creates exactly
the same amount of new stuff regardless of what's underneath it.  With
partitionwise join, that ceases to be true.  Every joinrel - and the
number of those grows exponentially in the number of baserels, IIUC -
needs its own list of paths for every member rel.  So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths.  That's a lot.

Although we might be able to save some memory by tightening things up
here and there - for example, right now the planner isn't real smart
about recycling paths that are evicted by add_path(), and there's
probably other wastage as well - I suspect that what this shows is
that the basic design of this patch is not going to be viable.
Intuitively, it's often going to be the case that we want the "same
plan" for every partition-set.  That is, if we have A JOIN B ON A.x =
B.x JOIN C ON A.y = C.y, and if A, B, and C are all compatibly
partitioned, then the result should be an Append plan with 100 join
plans under it, and all 100 of those plans should be basically mirror
images of each other.  Of course, that's not really right in general:
for example, it could be that A1 is big and A2 is small while B1 is
small and B2 is big, so that the right plan for (A1 JOIN B1) and for
(A2 JOIN B2) are totally different from each other.  But in many
practical cases we'll want to end up with a plan of precisely the same
shape for all children, and the current design ignores this, expending
both memory and CPU time to compute essentially-equivalent paths
across all children.

One way of attacking this problem is to gang together partitions which
are equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
and Babu.  However, it's not exactly clear how to do this: we could
gang together partitions that have the same index definitions, but the
sizes of the heaps, the sizes of their indexes, and the row counts
will vary from one partition to the next, and any of those things
could cause the plan choice to be different for one partition vs. the
next.  We could try to come up with heuristics for when those things
are likely to be true.  For example, suppose we compute the set of
partitions such that all joined relations have matching index
definitions on all tables; then, we take the biggest table in the set
and consider all tables more than half that size as part of one gang.
The biggest table becomes the leader and we compute partition-wise
paths for just that partition; the other members of the gang will
eventually get a plan that is of the same shape, but we don't actually
create that plan until after scan/join planning is concluded.

Another idea is to try to reduce peak memory usage by performing
planning separately for each partition-set.  For example, suppose we
decide to do a partition-wise join of A, B, and C.  Initially, this
gets represented as a PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
  -> AppendPath for B
  -> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory.  Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate.  Once
scan/join planning is concluded, if our final path contains a
PartitionJoinPath, we go back and loop over the partitions.  For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context.  In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.
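
In pseudo-code, roughly this (all of the function and field names below
are invented for illustration):

    MemoryContext child_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                    "child join planning",
                                                    ALLOCSET_DEFAULT_MINSIZE,
                                                    ALLOCSET_DEFAULT_INITSIZE,
                                                    ALLOCSET_DEFAULT_MAXSIZE);

    foreach(lc, partition_join_path->child_joinrels)
    {
        RelOptInfo     *child_joinrel = (RelOptInfo *) lfirst(lc);
        MemoryContext   oldcxt = MemoryContextSwitchTo(child_cxt);
        Path           *best;

        /* Plan just this child join; all of its paths go into child_cxt. */
        best = plan_child_join(root, child_joinrel);

        /* Copy the winner and its substructure back to the parent context. */
        MemoryContextSwitchTo(oldcxt);
        child_joinrel->cheapest_total_path = copy_path_recursively(best);

        MemoryContextReset(child_cxt);
    }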

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Have you tested the effect of this patch on planner memory consumption
>>> with multi-way joins between tables with many partitions?  If you
>>> haven't, you probably should. (Testing runtime would be good, too.)
>>> Does it grow linearly?  Quadratically?  Exponentially?  Minor leaks
>>> don't matter, but if we're generating too much garbage we'll have to
>>> make sure it gets cleaned up soon enough to prevent runaway memory
>>> usage.
>>
>> I tried to check memory usage with various combinations of number of
>> partitions and number of relations being joined. For higher number of
>> relations being joined like 10 with 100 partitions, OOM killer kicked
>> in during the planning phase. I am suspecting
>> adjust_partitionrel_attrs() (changed that name to
>> adjust_join_appendrel_attrs() to be in sync with
>> adjust_appendrel_attrs()) to be the culprit. It copies expression
>> trees every time for joining two children. That's an exponentially
>> increasing number as the number of legal joins increases
>> exponentially. I am still investigating this.
>
> I think the root of this problem is that the existing paths share a
> lot more substructure than the ones created by the new code.  Without
> a partition-wise join, the incremental memory usage for a joinrel
> isn't any different whether the underlying rel is partitioned or not.
> If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
> be pointing to some kind of Scan.  But the join itself creates exactly
> the same amount of new stuff regardless of what's underneath it.  With
> partitionwise join, that ceases to be true.  Every joinrel - and the
> number of those grows exponentially in the number of baserels, IIUC -
> needs its own list of paths for every member rel.  So if a
> non-partition-wise join created X paths, and there are K partitions, a
> partition-wise join creates X * K paths.  That's a lot.
>
> Although we might be able to save some memory by tightening things up
> here and there - for example, right now the planner isn't real smart
> about recycling paths that are evicted by add_path(), and there's
> probably other wastage as well - I suspect that what this shows is
> that the basic design of this patch is not going to be viable.
> Intuitively, it's often going to be the case that we want the "same
> plan" for every partition-set.  That is, if we have A JOIN B ON A.x =
> B.x JOIN C ON A.y = C.y, and if A, B, and C are all compatibly
> partitioned, then the result should be an Append plan with 100 join
> plans under it, and all 100 of those plans should be basically mirror
> images of each other.  Of course, that's not really right in general:
> for example, it could be that A1 is big and A2 is small while B1 is
> small and B2 is big, so that the right plan for (A1 JOIN B1) and for
> (A2 JOIN B2) are totally different from each other.  But in many
> practical cases we'll want to end up with a plan of precisely the same
> shape for all children, and the current design ignores this, expending
> both memory and CPU time to compute essentially-equivalent paths
> across all children.

I think there are going to be two kinds of partitioning use-cases.
First, partitioning carefully hand-crafted by DBAs, so that every
partition is different from the others and so is every join between two
partitions. There will be a smaller number of partitions, but creating
paths for each join between partitions will be crucial from a
performance point of view. Consider, for example, systems which use
partitions to consolidate results from different sources for analytical
purposes or for sharding. If we consider the various points you have
listed in [1] as to why a partition is equivalent to a table, each join
between partitions is going to have very different characteristics and
thus deserves a set of paths of its own. Add to that the possibility of
partition pruning or certain conditions affecting particular partitions,
and the need for detailed planning is evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate data by partition pruning. In such a case, all
partitions of a given table will have very similar properties, and there
is a large chance that we will end up having the same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or maybe a handful of child joins
and repeat that plan for the other child joins. But in such cases it
also makes sense to have a light-weight representation for partitions,
as compared to partitions being full-fledged tables. If we had such a
light-weight representation, we might not even create RelOptInfos
representing joins between partitions, or different paths for each join
between partitions.

>
> One way of attacking this problem is to gang together partitions which
> are equivalent for planning purposes, as discussed in the paper "Join
> Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
> and Babu.  However, it's not exactly clear how to do this: we could
> gang together partitions that have the same index definitions, but the
> sizes of the heaps, the sizes of their indexes, and the row counts
> will vary from one partition to the next, and any of those things
> could cause the plan choice to be different for one partition vs. the
> next.  We could try to come up with heuristics for when those things
> are likely to be true.  For example, suppose we compute the set of
> partitions such that all joined relations have matching index
> definitions on all tables; then, we take the biggest table in the set
> and consider all tables more than half that size as part of one gang.
> The biggest table becomes the leader and we compute partition-wise
> paths for just that partition; the other members of the gang will
> eventually get a plan that is of the same shape, but we don't actually
> create that plan until after scan/join planning is concluded.

Section 5 of that paper talks about clustering partitions together for
joining, but only when there is 1:m or n:1 partition matching for a join.
In such a case, it clusters all the partitions from one relation that
join with a single partition of the other relation. I think your idea of
ganging up partitions with similar properties may reduce the number of
paths we create, but, as you have mentioned, how to gang them up is not
very clear. There are just too many factors, like the availability of
indexes, the sizes of the tables, the sizes of intermediate results,
etc., which make it difficult to identify the properties to use for
ganging up. Even after we do that, in the worst case we will still end
up creating paths for all partitions of all joins, thus causing an
increase in paths proportional to the number of partitions.

In section 6.3, the paper mentions that the number of paths retained is
linear in the number of child joins per parent join. So it's clear that
the paper never considered a linear increase in paths to be a problem,
or at least not a problem that that work had to solve. Now, it's
surprising that their memory usage increased by only 7% to 10%. But
1. they might be measuring total memory and not just the memory used by
the planner, and 2. they experimented with PostgreSQL 8.3.7, which
probably tried a much smaller number of paths than the current optimizer.

>
> Another idea is to try to reduce peak memory usage by performing
> planning separately for each partition-set.  For example, suppose we
> decide to do a partition-wise join of A, B, and C.  Initially, this
> gets represented as a PartitionJoinPath tree, like this:
>
> PartitionJoinPath
> -> AppendPath for A
> -> PartitionJoinPath
>   -> AppendPath for B
>   -> AppendPath for C
>
> Because we haven't created individual join paths for the members, this
> doesn't use much memory.  Somehow, we come up with a cost for the
> PartitionJoinPath; it probably won't be entirely accurate.  Once
> scan/join planning is concluded, if our final path contains a
> PartitionJoinPath, we go back and loop over the partitions.

A typical join tree will be composite: some portion partitioned and
some portion unpartitioned, or different portions partitioned by
different partitioning schemes. In such a case, inaccurate costs for a
PartitionJoinPath can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will mostly be used for
large data sets, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

If we could come up with costs for a PartitionJoinPath using some
method of interpolation, say by sampling a few partitions and then
extrapolating their costs to the entire PartitionJoinPath, we could
use this method. But unless the partitions have very similar
characteristics, or characteristics from which the costs can be
guessed based on the differences between them, I do not see how that
can happen. For example, while costing a PartitionJoinPath with
pathkeys, the cost will change a lot based on whether the underlying
relations have indexes, or on which join methods are used, which in
turn depends on the properties of the partitions. The same is the case
for paths with parameterization. All such paths are important when a
partitioned join relation joins with an unpartitioned relation or with
a partitioned relation having a different partitioning scheme.

When each partition of a base relation being joined has different
properties, the cost of a join between one set of partitions can
differ from that of a join between another set of partitions. Not only
that, the costs for the various properties of the resultant paths,
like pathkeys and parameterization, can vary a lot depending upon the
available indexes and the row estimates for each join. So, we need to
come up with these cost estimates separately for each join between
partitions in order to come up with the cost of each
PartitionJoinPath. If we have to calculate those costs anyway in order
to create a PartitionJoinPath, we had better save them in paths rather
than recalculating them in a second round of planning for the joins
between partitions.

> For each
> partition, we switch to a new memory context, perform planning, copy
> the best path and its substructure back to the parent context, and
> then reset the context.

This could be rather tricky. It assumes that the code that creates
paths for joins never allocates memory which is linked to some object
in a context that lives longer than the path creation context. There
is some code, like create_join_clause() or make_canonical_pathkey(),
which carefully chooses which memory context to allocate memory in.
But can we always ensure that? postgres_fdw, for example, allocates
memory for PgFdwRelationInfo in the current memory context and
attaches it to the RelOptInfo, which lives in the planner's original
context. So, if we create a new memory context for each partition, the
fpinfos would be invalidated when those contexts are released. Not
that we cannot enforce some restriction on memory usage while
planning, but it's hard to enforce and bugs arising from it may go
unnoticed. The GEQO planner might have its own problems with this
approach. Third-party FDWs will pose a problem.

A possible solution would be to keep track of used paths using a
reference count. Once the paths for a given join tree are created,
free up the unused paths by traversing the pathlist in each of the
RelOptInfos. The attached patch has a prototype implementation of
this. There are some paths which are not linked to RelOptInfos and
which need slightly different treatment, but they can be handled too.
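
To make the freeing step concrete, here is a rough sketch (the
refcount field and the helper name are hypothetical, not what the
prototype actually uses):

/* Sketch only: "refcount" is a hypothetical field on Path. */
static void
free_unused_paths(RelOptInfo *rel)
{
    List       *kept = NIL;
    ListCell   *lc;

    foreach(lc, rel->pathlist)
    {
        Path   *path = (Path *) lfirst(lc);

        if (path->refcount > 0)
            kept = lappend(kept, path); /* still referenced by a surviving path */
        else
            pfree(path);    /* a complete version would also drop the
                             * references this path holds on its subpaths
                             * and free those recursively */
    }
    rel->pathlist = kept;
}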

> In that way, peak memory usage only grows by
> about a factor of 2 rather than a factor equal to the partition count,
> because we don't need to keep every possibly-useful path for every
> partition all at the same time, but rather every possibly-useful path
> for a single partition.
>
> Maybe there are other ideas but I have a feeling any way you slice it
> this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think users would
expect the planner to use the genuinely best plan and thus may be
willing to accommodate the increased memory usage. Any approach that
does not create the paths for joins between partitions is not
guaranteed to give the best plan. Users willing to provide increased
memory will be unhappy if we do not give them the best path.

A user who creates hundreds of partitions will likely be using a
pretty powerful server with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as it's producing the best plan.

Just joining partitioned tables with hundreds of partitions does not
increase the number of paths. The number of paths increases when two
partitioned tables with similar partitioning schemes are joined with
an equality condition on the partition key. Unless we consider
repartitioning, how many of the joining relations share the same
partitioning scheme? Section 8.6 mentions that "no TPC-H query plan,
regardless of the partitioning scheme, contains n-way child joins for
n >= 4". The maximum number of partitions that the paper mentions is
168 (Table 3). My VM, which has 8GB RAM and 4 cores, handled that case
pretty well. We may add logic to free up the space used by useless
paths after join planning, to free up some memory for the later stages
of query processing.

There will still be users for whom the increase in memory usage is
unexpected. Those will need to be educated, or for them we might take
the heuristic PartitionJoinPath-based approach discussed above. But I
don't think that heuristic approach should be the default. Maybe we
should supply a GUC which can switch between the approaches.

Some ideas for GUCs:
1. delay_partition_wise_join - when ON, uses the heuristic
PartitionJoinPath approach.
2. A GUC similar to join_collapse_limit may be used to limit the
number of partitioned relations being joined using the partition-wise
join technique. A value of 1 indicates enable_partition_wise_join =
false, so we might replace enable_partition_wise_join with this GUC.
3. A GUC max_joinable_partitions (open to suggestions for the name)
may specify the maximum number of partitions that two relations may
have to be eligible for partition-wise join.
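
Purely to illustrate how these might look in use (none of these GUCs
exist yet; the names and semantics above are only proposals):

SET delay_partition_wise_join = on;  -- plan partition-wise joins via
                                     -- heuristic PartitionJoinPaths
SET max_joinable_partitions = 256;   -- skip partition-wise join when a
                                     -- relation has more partitions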

I guess using these GUCs allows a user to handle the trade-off between
getting the best plan and memory usage consciously. I think users
would rather accept a suboptimal plan consciously than be handed one
without a choice.

[1] http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I think there are going to be two kinds of partitioning use-cases.
> First, carefully hand-crafted by DBAs so that every partition is
> different from other and so is every join between two partitions.
> There will be lesser number of partitions, but creating paths for each
> join between partitions will be crucial from performance point of
> view. Consider, for example, systems which use partitions to
> consolidate results from different sources for analytical purposes or
> sharding. If we consider various points you have listed in [1] as to
> why a partition is equivalent to a table, each join between partitions
> is going to have very different characteristics and thus deserves a
> set of paths for its own. Add to that possibility of partition pruning
> or certain conditions affecting particular partitions, the need for
> detailed planning evident.
>
> The other usage of partitioning is to distribute the data and/or
> quickly eliminate the data by partition pruning. In such case, all
> partitions of a given table will have very similar properties. There
> is a large chance that we will end up having same plans for every
> partition and for joins between partitions. In such cases, I think it
> suffices to create paths for just one or may be a handful partitions
> of join and repeat that plan for other partitions of join. But in such
> cases it also makes sense to have a light-weight representation for
> partitions as compared to partitions being a full-fledged tables. If
> we have such a light-weight representation, we may not even create
> RelOptInfos representing joins between partitions, and different paths
> for each join between partitions.

I'm not sure I see a real distinction between these two use cases.  I
think that the problem of differing data distribution between
partitions is almost always going to be an issue.  Take the simple
case of an "orders" table which is partitioned by month.  First, the
month that's currently in progress may be much smaller than a typical
completed month.  Second, many businesses are seasonal and may have
many more orders at certain times of year.  For example, in American
retail, many businesses have large spikes in December.  I think some
businesses may do four times as much business in December as any other
month, for example.  So you will have that sort of variation, at
least.

> A typical join tree will be composite: some portion partitioned and
> some portion unpartitioned or different portions partitioned by
> different partition schemes. In such case, inaccurate costs for
> PartitionJoinPath, can affect the plan heavily, causing a suboptimal
> path to be picked. Assuming that partitioning will be useful for large
> sets of data, choosing a suboptimal plan can be more dangerous than
> consuming memory for creating paths.

Well, sure.  But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans.  However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable.  Planning is always, at some level,
guesswork.

>> For each
>> partition, we switch to a new memory context, perform planning, copy
>> the best path and its substructure back to the parent context, and
>> then reset the context.
>
> This could be rather tricky. It assumes that all the code that creates
> paths for joins, should not allocate any memory which is linked to
> some object in a context that lives longer than the path creation
> context. There is some code like create_join_clause() or
> make_canonical_pathkey(), which carefully chooses which memory context
> to allocate memory in. But can we ensure it always? postgres_fdw for
> example allocates memory for PgFdwRelationInfo in current memory
> context and attaches it in RelOptInfo, which should be in the
> planner's original context. So, if we create a new memory context for
> each partition, fpinfos would be invalidated when those contexts are
> released. Not that, we can not enforce some restriction on the memory
> usage while planning, it's hard to enforce it and bugs arising from it
> may go unnoticed. GEQO planner might have its own problems with this
> approach. Third party FDWs will pose a problem.

Yep, there are problems.  :-)

> A possible solution would be to keep the track of used paths using a
> reference count. Once the paths for given join tree are created, free
> up the unused paths by traversing pathlist in each of the RelOptInfos.
> Attached patch has a prototype implementation for the same. There are
> some paths which are not linked to RelOptInfos, which need a bit
> different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?

>> In that way, peak memory usage only grows by
>> about a factor of 2 rather than a factor equal to the partition count,
>> because we don't need to keep every possibly-useful path for every
>> partition all at the same time, but rather every possibly-useful path
>> for a single partition.
>>
>> Maybe there are other ideas but I have a feeling any way you slice it
>> this is going to be a lot of work.
>
> For the case of carefully hand-crafted partitions, I think, users
> would expect the planner to use really the best plan and thus may be
> willing to accommodate for increased memory usage. Going by any
> approach that does not create the paths for joins between partitions
> is not guaranteed to give the best plan. Users willing to provide
> increased memory will be unhappy if we do not give them the best path.
>
> The user who creates hundreds of partitions, will ideally be using
> pretty powerful servers with a lot of memory. On such servers, the
> linear increase in memory for paths may not be as bad as you are
> portraying above, as long as its producing the best plan.

No, I don't agree.  We should be trying to build something that scales
well.  I've heard reports of customers with hundreds or even thousands
of partitions; I think it is quite reasonable to think that we need to
scale to 1000 partitions.  If we use 3MB of memory to plan a query
involving unpartitioned tables, using 3GB to plan a query where the main
tables have been partitioned 1000 ways does not seem reasonable to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Mon, Oct 31, 2016 at 6:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I think there are going to be two kinds of partitioning use-cases.
>> First, carefully hand-crafted by DBAs so that every partition is
>> different from other and so is every join between two partitions.
>> There will be lesser number of partitions, but creating paths for each
>> join between partitions will be crucial from performance point of
>> view. Consider, for example, systems which use partitions to
>> consolidate results from different sources for analytical purposes or
>> sharding. If we consider various points you have listed in [1] as to
>> why a partition is equivalent to a table, each join between partitions
>> is going to have very different characteristics and thus deserves a
>> set of paths for its own. Add to that possibility of partition pruning
>> or certain conditions affecting particular partitions, the need for
>> detailed planning evident.
>>
>> The other usage of partitioning is to distribute the data and/or
>> quickly eliminate the data by partition pruning. In such case, all
>> partitions of a given table will have very similar properties. There
>> is a large chance that we will end up having same plans for every
>> partition and for joins between partitions. In such cases, I think it
>> suffices to create paths for just one or may be a handful partitions
>> of join and repeat that plan for other partitions of join. But in such
>> cases it also makes sense to have a light-weight representation for
>> partitions as compared to partitions being a full-fledged tables. If
>> we have such a light-weight representation, we may not even create
>> RelOptInfos representing joins between partitions, and different paths
>> for each join between partitions.
>
> I'm not sure I see a real distinction between these two use cases.  I
> think that the problem of differing data distribution between
> partitions is almost always going to be an issue.  Take the simple
> case of an "orders" table which is partitioned by month.  First, the
> month that's currently in progress may be much smaller than a typical
> completed month.  Second, many businesses are seasonal and may have
> many more orders at certain times of year.  For example, in American
> retail, many businesses have large spikes in December.  I think some
> businesses may do four times as much business in December as any other
> month, for example.  So you will have that sort of variation, at
> least.
>
>> A typical join tree will be composite: some portion partitioned and
>> some portion unpartitioned or different portions partitioned by
>> different partition schemes. In such case, inaccurate costs for
>> PartitionJoinPath, can affect the plan heavily, causing a suboptimal
>> path to be picked. Assuming that partitioning will be useful for large
>> sets of data, choosing a suboptimal plan can be more dangerous than
>> consuming memory for creating paths.
>
> Well, sure.  But, I mean, every simplifying assumption which the
> planner makes to limit resource consumption could have that effect.
> join_collapse_limit, for example, can cause horrible plans.  However,
> we have it anyway, because the alternative of having planning take far
> too long is unpalatable.  Planning is always, at some level,
> guesswork.

My point is, this behaviour is configurable. Users who are ready to
spend time and resources to get the best plan are still able to do so
by choosing a higher join_collapse_limit. Those who cannot afford to
do so willingly accept inferior plans by setting join_collapse_limit
to a lower number.

>
>> A possible solution would be to keep the track of used paths using a
>> reference count. Once the paths for given join tree are created, free
>> up the unused paths by traversing pathlist in each of the RelOptInfos.
>> Attached patch has a prototype implementation for the same. There are
>> some paths which are not linked to RelOptInfos, which need a bit
>> different treatment, but they can be handled too.
>
> So, if you apply this with your previous patch, how much does it cut
> down memory consumption?

Answered this below:

>
>>> In that way, peak memory usage only grows by
>>> about a factor of 2 rather than a factor equal to the partition count,
>>> because we don't need to keep every possibly-useful path for every
>>> partition all at the same time, but rather every possibly-useful path
>>> for a single partition.
>>>
>>> Maybe there are other ideas but I have a feeling any way you slice it
>>> this is going to be a lot of work.
>>
>> For the case of carefully hand-crafted partitions, I think, users
>> would expect the planner to use really the best plan and thus may be
>> willing to accommodate for increased memory usage. Going by any
>> approach that does not create the paths for joins between partitions
>> is not guaranteed to give the best plan. Users willing to provide
>> increased memory will be unhappy if we do not give them the best path.
>>
>> The user who creates hundreds of partitions, will ideally be using
>> pretty powerful servers with a lot of memory. On such servers, the
>> linear increase in memory for paths may not be as bad as you are
>> portraying above, as long as its producing the best plan.
>
> No, I don't agree.  We should be trying to build something that scales
> well.  I've heard reports of customers with hundreds or even thousands
> of partitions; I think it is quite reasonable to think that we need to
> scale to 1000 partitions.  If we use 3MB of memory to plan a query
> involving unpartitioned, using 3GB to plan a query where the main
> tables have been partitioned 1000 ways does not seem reasonable to me.

Here are memory consumption numbers.

For a simple query "select * from v5_prt100", where v5_prt100 is a
view on a 5-way self-join of table prt100, a plain table with 100
partitions and no indexes:
postgres=# \d+ v5_prt100
               View "part_mem_usage.v5_prt100"
 Column |  Type  | Modifiers | Storage  | Description
--------+--------+-----------+----------+-------------
 t1     | prt100 |           | extended |
 t2     | prt100 |           | extended |
 t3     | prt100 |           | extended |
 t4     | prt100 |           | extended |
 t5     | prt100 |           | extended |
View definition:
 SELECT t1.*::prt100 AS t1,
    t2.*::prt100 AS t2,
    t3.*::prt100 AS t3,
    t4.*::prt100 AS t4,
    t5.*::prt100 AS t5
   FROM prt100 t1,
    prt100 t2,
    prt100 t3,
    prt100 t4,
    prt100 t5
  WHERE t1.a = t2.a AND t2.a = t3.a AND t3.a = t4.a AND t4.a = t5.a;

postgres=# \d prt100
     Table "part_mem_usage.prt100"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 a      | integer           |
 b      | integer           |
 c      | character varying |
Partition Key: RANGE (a)
Number of partitions: 100 (Use \d+ to list them.)
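
For reference, the test tables were created roughly along these lines
(the partition bounds below are made up; only the column list and the
partitioning key come from the \d output above):

CREATE TABLE prt100 (a integer, b integer, c character varying)
    PARTITION BY RANGE (a);
CREATE TABLE prt100_p1 PARTITION OF prt100 FOR VALUES FROM (0) TO (500);
CREATE TABLE prt100_p2 PARTITION OF prt100 FOR VALUES FROM (500) TO (1000);
-- ... and so on, up to 100 partitions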

Without partition-wise join, standard_planner() consumes 4311 kB of
memory, of which 150 kB is consumed in add_paths_to_joinrel().

With partition-wise join, standard_planner() consumes 65MB of memory,
which is 16 times more (not 100 times more as you suspected above). Of
this bloat, 16MB is consumed creating child-join paths whereas 651kB
is consumed creating append paths; that's a 100-fold bloat for path
creation. The rest of the memory bloat breaks down as 9MB to create
child-join RelOptInfos, 29MB to translate restrict clauses, 8MB to
translate target lists, 2MB to create special join infos for children,
and 2MB to create plans.

If we apply the logic to free unused paths, the memory consumption
reduces as follows.

Without partition-wise join, standard_planner() consumes 4268 kB
(against 4311 kB earlier), of which 123 kB (against 150 kB earlier) is
consumed in add_paths_to_joinrel().

With partition-wise join, standard_planner() consumes 63MB (against
65MB earlier). Child-join paths still consume 13MB (against 16MB
earlier), which is still 100 times that without partition-wise join.
We may shave off some memory consumption by using better methods than
translating expressions, but we will continue to have bloat introduced
by paths, RelOptInfos for child joins, etc.

So, I am thinking about your approach of creating PartitionJoinPaths
without actually creating child paths, and then actually planning the
child joins at a later stage. Here's a rough sketch of how that may be
done.

At the time of creating regular paths, we identify the join orders
which can use partition-wise join and save those in the RelOptInfo of
the parent table. If no such join order exists, we do not create
PartitionJoinPaths for that relation. Otherwise, once we have
considered all the join orders, i.e. in
generate_partition_wise_join_paths(), we create one PartitionJoinPath
for every path that has survived in the parent, or at least for every
path that has distinct properties like pathkeys or parameterization,
with those properties.

At the time of creating plans, if a PartitionJoinPath is chosen, we
actually create paths for every partition of that relation
recursively. The path creation logic is carried out in a separate
memory context. Amongst the paths that survive, we choose the best
path that has the same properties as the PartitionJoinPath. We would
expect all parameterized paths to be retained, and any unparameterized
path can be sorted to match the pathkeys of the reference
PartitionJoinPath. We then create the plan out of this path, copy it
into the outer memory context, and release the memory context used for
path creation. This is similar to how prepared statements save their
plans. Once we have the plan, the memory consumed by the paths won't
be referenced and hence cannot cause problems. At the end we create an
Append/MergeAppend plan with all the child plans and return it.
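
A minimal sketch of that per-child planning step, just to make it
concrete (plan_child_join() and the variable names are hypothetical;
only the memory-context routines are real):

/* Plan one child-join in a throw-away memory context. */
MemoryContext child_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                "child join planning",
                                                ALLOCSET_DEFAULT_SIZES);
MemoryContext old_cxt = MemoryContextSwitchTo(child_cxt);

/*
 * Build paths for this child-join and turn the one matching the
 * properties (pathkeys, parameterization) of the chosen
 * PartitionJoinPath into a plan; plan_child_join() is hypothetical.
 */
Plan *child_plan = plan_child_join(root, child_rel, reference_path);

/*
 * Copy the finished plan back into the planner's long-lived context,
 * then throw away everything else allocated while planning this child.
 */
MemoryContextSwitchTo(old_cxt);
child_plan = copyObject(child_plan);
MemoryContextDelete(child_cxt);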

Costing a PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better than unpartitioned ones because of the smaller sizes of the
partitions. If we think of a join as an O(MN) operation, where M and N
are the sizes of the unpartitioned tables being joined, a
partition-wise join computes P joins, each of average order
O(M/P * N/P), where P is the number of partitions; that is still
O(MN), with the constant factor reduced P times. I think we need to
apply similar logic to costing. Let's say the cost of a join is
J(M, N) = S(M, N) + R(M, N), where S and R are the setup cost and the
joining cost (for M, N rows) respectively. The cost of the
partition-wise join would then be P * J(M/P, N/P) = P * S(M/P, N/P) +
P * R(M/P, N/P). Each of the join methods has different S and R
functions, which may not be linear in the number of rows. So, the
PartitionJoinPath costs are obtained from the corresponding regular
path costs subjected to the above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.
Take the example of a join where the joining relations are very small,
so that a hash join on the full relations is optimal compared to a
hash join of each pair of partitions because of the setup cost. In
such a case, the function which calculates the cost of hash table
setup would produce almost the same cost for the full table as for
each of the partitions, thus inflating P * S(M/P, N/P) compared to
S(M, N).
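
For example, with made-up numbers purely for illustration: suppose a
hash join's setup cost is roughly S(M, N) = 50 + 0.1 * N and its
joining cost is R(M, N) = 0.01 * (M + N). For M = N = 10000,
J(M, N) = 1050 + 200 = 1250. With P = 100 partitions, each child join
costs J(100, 100) = 60 + 2 = 62, so the partition-wise total is
P * 62 = 6200. The repeated setup cost dominates, and the
transformation above would correctly steer the planner away from the
partition-wise plan for partitions this small.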

Let me know your comments.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
>
> So, I am thinking about your approach of creating PartitionJoinPaths
> without actually creating child paths and then at a later stage
> actually plan the child joins. Here's rough sketch of how that may be
> done.
>
> At the time of creating regular paths, we identify the join orders
> which can use partition-wise join and save those in the RelOptInfo of
> the parent table. If no such join order exists, we do not create
> PartitionJoinPaths for that relation. Otherwise, once we have
> considered all the join orders i.e. in
> generate_partition_wise_join_paths(), we create one PartitionJoinPath
> for every path that has survived in the parent or at least for every
> path that has distinct properties like pathkeys or parameterisation,
> with those properties.
>
> At the time of creating plans, if PartitionJoinPath is chosen, we
> actually create paths for every partition of that relation
> recursively. The path creation logic is carried out in a different
> memory context. Amongst the paths that survive, we choose the best
> path that has the same properties as PartitionJoinPath. We would
> expect all parameterized paths to be retained and any unparameterized
> path can be sorted to match the pathkeys of reference
> PartitionJoinPath. We then create the plan out of this path and copy
> it into the outer memory context and release the memory context used
> for path creation. This is similar to how prepared statements save
> their plans. Once we have the plan, the memory consumed by paths won't
> be referenced, and hence can not create problems. At the end we create
> an Append/MergeAppend plan with all the child plans and return it.
>
> Costing PartitionJoinPath needs more thought so that we don't end up
> with bad overall plans. Here's an idea. Partition-wise joins are
> better compared to the unpartitioned ones, because of the smaller
> sizes of partitions. If we think of join as O(MN) operation where M
> and N are sizes of unpartitioned tables being joined, partition-wise
> join computes P joins each with average O(M/P * N/P) order where P is
> the number of partitions, which is still O(MN) with constant factor
> reduced by P times. I think, we need to apply similar logic to
> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
> where S and R are setup cost and joining cost (for M, N rows) resp.
> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
> + P * R(M/P, N/P). Each of the join methods will have different S and
> R functions and may not be linear on the number of rows. So,
> PartitionJoinPath costs are obtained from corresponding regular path
> costs subjected to above transformation. This way, we will be
> protected from choosing a PartitionJoinPath when it's not optimal.
> Take example of a join where the joining relations are very small in
> size, thus hash join on full relation is optimal compared to hash join
> of each partition because of setup cost. In such a case, the function
> which calculates the cost of hash table setup, would result in almost
> same cost for full table as well as each of the partitions, thus
> increasing P * S(M/P, N/P) as compared to S(M, N).
>
> Let me know your comments.

I tried to measure the impact of having a memory context reset 1000
times (once for each partition) with the attached patch. Without this
patch, make check in regress/ takes about 24 seconds on my laptop, and
with this patch it takes 26 seconds. That is almost a 10% increase in
time. I hope that's fine.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Costing PartitionJoinPath needs more thought so that we don't end up
> with bad overall plans. Here's an idea. Partition-wise joins are
> better compared to the unpartitioned ones, because of the smaller
> sizes of partitions. If we think of join as O(MN) operation where M
> and N are sizes of unpartitioned tables being joined, partition-wise
> join computes P joins each with average O(M/P * N/P) order where P is
> the number of partitions, which is still O(MN) with constant factor
> reduced by P times. I think, we need to apply similar logic to
> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
> where S and R are setup cost and joining cost (for M, N rows) resp.
> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
> + P * R(M/P, N/P). Each of the join methods will have different S and
> R functions and may not be linear on the number of rows. So,
> PartitionJoinPath costs are obtained from corresponding regular path
> costs subjected to above transformation. This way, we will be
> protected from choosing a PartitionJoinPath when it's not optimal.

I'm not sure that I really understand the stuff with big-O notation
and M, N, and P.  But I think what you are saying is that we could
cost a PartitionJoinPath by costing some of the partitions (it might
be a good idea to choose the biggest ones) and assuming the cost for
the remaining ones will be roughly proportional.  That does seem like
a reasonable strategy to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Costing PartitionJoinPath needs more thought so that we don't end up
>> with bad overall plans. Here's an idea. Partition-wise joins are
>> better compared to the unpartitioned ones, because of the smaller
>> sizes of partitions. If we think of join as O(MN) operation where M
>> and N are sizes of unpartitioned tables being joined, partition-wise
>> join computes P joins each with average O(M/P * N/P) order where P is
>> the number of partitions, which is still O(MN) with constant factor
>> reduced by P times. I think, we need to apply similar logic to
>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
>> where S and R are setup cost and joining cost (for M, N rows) resp.
>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
>> + P * R(M/P, N/P). Each of the join methods will have different S and
>> R functions and may not be linear on the number of rows. So,
>> PartitionJoinPath costs are obtained from corresponding regular path
>> costs subjected to above transformation. This way, we will be
>> protected from choosing a PartitionJoinPath when it's not optimal.

> I'm not sure that I really understand the stuff with big-O notation
> and M, N, and P.  But I think what you are saying is that we could
> cost a PartitionJoinPath by costing some of the partitions (it might
> be a good idea to choose the biggest ones) and assuming the cost for
> the remaining ones will be roughly proportional.  That does seem like
> a reasonable strategy to me.

I'm not sure to what extent the above argument depends on the assumption
that join is O(MN), but I will point out that in no case of practical
interest for large tables is it actually O(MN).  That would be true
only for the stupidest possible nested-loop join method.  It would be
wise to convince ourselves that the argument holds for more realistic
big-O costs, eg hash join is more like O(M+N) if all goes well.
        regards, tom lane



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Mon, Nov 14, 2016 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Costing PartitionJoinPath needs more thought so that we don't end up
>>> with bad overall plans. Here's an idea. Partition-wise joins are
>>> better compared to the unpartitioned ones, because of the smaller
>>> sizes of partitions. If we think of join as O(MN) operation where M
>>> and N are sizes of unpartitioned tables being joined, partition-wise
>>> join computes P joins each with average O(M/P * N/P) order where P is
>>> the number of partitions, which is still O(MN) with constant factor
>>> reduced by P times. I think, we need to apply similar logic to
>>> costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
>>> where S and R are setup cost and joining cost (for M, N rows) resp.
>>> Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
>>> + P * R(M/P, N/P). Each of the join methods will have different S and
>>> R functions and may not be linear on the number of rows. So,
>>> PartitionJoinPath costs are obtained from corresponding regular path
>>> costs subjected to above transformation. This way, we will be
>>> protected from choosing a PartitionJoinPath when it's not optimal.
>
>> I'm not sure that I really understand the stuff with big-O notation
>> and M, N, and P.  But I think what you are saying is that we could
>> cost a PartitionJoinPath by costing some of the partitions (it might
>> be a good idea to choose the biggest ones) and assuming the cost for
>> the remaining ones will be roughly proportional.  That does seem like
>> a reasonable strategy to me.
>
> I'm not sure to what extent the above argument depends on the assumption
> that join is O(MN), but I will point out that in no case of practical
> interest for large tables is it actually O(MN).  That would be true
> only for the stupidest possible nested-loop join method.  It would be
> wise to convince ourselves that the argument holds for more realistic
> big-O costs, eg hash join is more like O(M+N) if all goes well.

Yeah, I agree.  To recap briefly, the problem we're trying to solve
here is how to build a path for a partitionwise join without an
explosion in the amount of memory the planner uses or the number of
paths created.  In the initial design, if there are N partitions per
relation, the total number of paths generated by the planner increases
by a factor of N+1, which gets ugly if, say, N = 1000, or even N =
100.  To rein that in, we want to do a rough cut at costing the
partitionwise join that will be good enough to let us throw away
obviously inferior paths, and then work out the exact paths we're
going to use only for partitionwise joins that are actually selected.
I think costing one or a few of the larger sub-joins and assuming
those costs are representative is probably a reasonable approach to
that problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Hi Robert,
Sorry for the delayed response.

The attached patch implements the following ideas:
1. At the time of creating paths - if the joining relations are both
partitioned and the join can use partition-wise join, we create paths
for a few child-joins. Similar to inheritance relations
(set_append_rel_pathlist()), we collect paths with similar properties
from all the sampled child-joins and create one PartitionJoinPath from
each set of paths. The cost of the PartitionJoinPath is obtained by
multiplying the sum of the costs of the paths in the given set by the
ratio (number of rows estimated for the parent-join / sum of rows in
the child-joins).

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with the
same properties as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans, and an
Append/MergeAppend plan is created combining these plans. The paths
and plans for a child-join are created in a temporary memory context.
The final plan for each child-join is copied into the planner's
context and the temporary memory context is reset.

Right now, we choose 1% of the child-joins, or 1 (whichever is
higher), to base the PartitionJoinPath costs on.
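
For example (numbers made up purely for illustration, and assuming the
ratio above is taken over the sampled child-joins): with 1000
partitions we would sample Max(1, 1000 / 100) = 10 child-joins. If the
paths chosen for those 10 child-joins cost 500 units in total and are
estimated to produce 50,000 rows in total, while the parent join is
estimated to produce 5,000,000 rows, the PartitionJoinPath would be
costed at 500 * (5,000,000 / 50,000) = 50,000 units.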

Memory consumption
-----------------------------
I tested a 5-way self-join of a table with 1000 partitions, each
partition having 1M rows. The memory consumed in standard_planner()
was measured with some granular tracking
(mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
consumed a total of 289MB of memory, which is approx 6.6 times more
than the non-partition-wise join, which consumed 44MB. That's much
better than the earlier 16-fold consumption for the 5-way join with
100 partitions.

The extra 245MB of memory was consumed by child-join RelOptInfos
(48MB), SpecialJoinInfos for child-joins (64MB), restrictlist
translation (92MB), paths for the sampled child-joins (1.5MB), and
building targetlists for child-joins (7MB).

In order to choose representative child-joins based on their sizes, we
need to create all the child-join RelOptInfos. In order to estimate
the sizes of the child-joins, we need to create SpecialJoinInfos and
restrictlists for at least one join order for every child-join. For
every representative child-join, we need to create SpecialJoinInfos
and restrictlists for all join orders for that child-join. We might be
able to save on restrictlist translation if we create restrict lists
from joininfo, similar to parent joins. I haven't tried that yet.

Choosing representative child-joins:
--------------------------------------------------
There's another angle to choosing representative child-joins. In a
partitioned N-way join, different joins covering different subsets of
the N relations will have different size distributions across the
partitions. This means that the child-joins costed for (N-k)-way joins
may be different from those required for (N-k+1)-way joins. With a 1%
sampling factor, if N is such that a child-join participates in 100
joins, we will end up creating paths for all partitions before
creating PartitionJoinPaths for the final N-way join. Hopefully that
will be a rare case, and usually we will end up using paths already
created. We cannot avoid creating PartitionJoinPaths for subset joins,
as there might be cases where partition-wise join is optimal for an
(N-k)-way join but not for the N-way join. We may avoid this if we
choose representative child-joins based on their positions, but in
that case we may end up with some or all of those being empty and thus
skewing the costs heavily.

Partial paths
-----------------
AFAIU, we create partial paths for an append relation when all the
children have partial paths. Unlike parameterized paths or paths with
pathkeys, there is no way to create a partial path from a normal path.
This means that unless we create paths for all child-joins, we cannot
create partial paths for the appendrel comprising the child-joins, and
thus cannot use parallel query right now. This may not be that bad,
since it would be more efficient to run each child-join in a separate
worker rather than using multiple workers for a single child-join.

regression tests
----------------------
I observed that for small relations (1000 rows in each partition and
100 partitions), the size estimates for the append relations and the
sums of those for the child relations are very different. As a result,
the extrapolated costs for PartitionJoinPaths, as described above, are
way higher than the costs of a join of appends (or even an append of
joins, if we were to create paths for all child-joins). Thus, with
this approach we choose partition-wise join only for a large number of
partitions with large data (e.g. 1000 partitions with 1M rows each).
These are certainly the cases where partition-wise join is a big win.
I have not tried to find the threshold above which partition-wise join
gets chosen with the above approach, but it's going to be a fairly
high threshold. That makes writing regression tests difficult, as they
will require large data. So, we have to find a way to test
partition-wise join with smaller data. There are a few possibilities:
1. convert the fraction of representative child-joins into a GUC;
setting it to 100% would start choosing partition-wise joins for
tables with a few hundred rows per partition, as the earlier approach
did; 2. provide a way to force partition-wise join whenever possible,
say by costing partition-wise joins much lower than non-partition-wise
joins when a GUC is set (e.g. enable_partition_wise_join with values
always, never, optimal or something like that).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
PFA patch rebased after partitioning code was committed.

On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Hi Robert,
> Sorry for delayed response.
>
> The attached patch implements following ideas:
> 1. At the time of creating paths - If the joining relations are both
> partitioned and join can use partition-wise join, we create paths for
> few child-joins. Similar to inheritance relations
> (set_append_rel_pathlist()), we collect paths with similar properties
> from all sampled child-joins and create one PartitionJoinPath with
> each set of paths. The cost of the PartitionJoinPath is obtained by
> multiplying the sum of costs of paths in the given set by the ratio of
> (number of rows estimated in the parent-join/sum of rows in
> child-joins).
>
> 2. If the PartitionJoinPath emerges as the best path, we create paths
> for each of the remaining child-joins. Then we collect paths with
> properties same as the given PartitionJoinPath, one from each
> child-join. These paths are converted into plans and a Merge/Append
> plan is created combing these plans. The paths and plans for
> child-join are created in a temporary memory context. The final plan
> for each child-join is copied into planner's context and the temporary
> memory context is reset.
>
> Right now, we choose 1% or 1 (whichever is higher) child-joins to base
> PartitionJoinPath costs on.
>
> Memory consumption
> -----------------------------
> I tested a 5-way self-join for a table with 1000 partitions, each
> partition having 1M rows. The memory consumed in standard_planner()
> was measured with some granular tracking
> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
> consumed total of 289MB memory which is approx 6.6 times more than
> non-partition-wise join which consumed 44MB. That's much better than
> the earlier 16 times consumption for 5-way join with 100 partitions.
>
> The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
> SpecialJoinInfos for child-joins (64MB), restrictlist translation
> (92MB), paths for sampled child-joins (1.5MB), building targetlists
> for child-joins (7MB).
>

In the earlier implementation, a given clause which was applicable to
multiple join orders was translated as many times as the number of
join orders it was applicable in. I changed the parent RestrictInfo to
store a list of the RestrictInfos applicable to the children, to avoid
multiple translations.

My earlier patch created the child-join plans in a temporary context
and then copied them into the planner's context, since the translated
clauses were then allocated in the temporary memory context. Now that
they are stored in the planner's context, we can directly create the
plan in the planner's context.

Third, I added code to free up child SpecialJoinInfos after using them.

As a result, the total memory consumption is now 192MB, which is
approx 4.4 times the memory consumed during planning in the
non-partition-wise join case.

>
> Choosing representative child-joins:
> --------------------------------------------------
> There's another angle to choosing representative child joins. In a
> partitioned N-way join, different joins covering different subsets of
> N relations, will have different size distributions across the
> partitions. This means that the child-joins costed for (N-k) joins,
> may be different for those required for (N-k+1) joins. With a factor
> of 1% sampling, N is such that a child-join participates in 100 joins,
> we will end up creating paths for all partitions before creating
> PartitionJoinPaths for the final N-way join. Hopefully that will be a
> rare case and usually we will end up using paths already created. We
> can not avoid creating PartitionJoinPaths for subset joins, as there
> might be cases when partition-wise join will be optimal for an N-k way
> join but not for N-way join. We may avoid this if we choose
> representative child-joins based on their positions, in which case, we
> may end up with some or all of those being empty and thus skewing the
> costs heavily.
>
> Partial paths
> -----------------
> AFAIU, we create partial paths for append relation, when all the
> children have partial paths. Unlike parameterized paths or path with
> pathkeys, there is no way to create a partial path for a normal path.
> This means that unless we create paths for all child-joins, we can not
> create partial paths for appendrel comprising of child-joins, and thus
> can not use parallel query right now. This may not be that bad, since
> it would be more efficient to run each child-join in a separate
> worker, rather than using multiple workers for a single child-join.

This still applies.

>
> regression tests
> ----------------------
> I observed that for small relations (1000 rows in each partition and
> 100 partitions), the size estimates in append relations and sum of
> those in child relations are very different. As a result, the
> extrapolated costs for PartitionJoinPaths as described above, are way
> higher than costs of join of appends (or even append of joins if we
> are to create paths for all child-joins). Thus with this approach, we
> choose partition-wise join for large number of partitions with large
> data (e.g. 1000 partitions with 1M rows each). These are certainly the
> cases when partition-wise join is a big win. I have not tried to find
> out a threshold above which partition-wise join gets chosen with above
> approach, but it's going to be a larger threshold. That makes writing
> regression tests difficult, as those will require large data.  So, we
> have to find a way so that we can test partition-wise join with
> smaller data. There are few possibilities like 1. convert the fraction
> of representative child-joins into GUC and setting it to 100% would
> start choosing partition-wise joins for tables with a few hundred rows
> per partition, like it did in earlier approach, 2. provide a way to
> force partition-wise join whenever possible, by say costing
> partition-wise joins much lesser than non-partition-wise join when a
> GUC is set (e.g. enable_partition_wise_join with values always, never,
> optimal or something like that).
>

For now I have added a float GUC, partition_wise_plan_weight. The
partition-wise join cost derived from the samples is multiplied by
this GUC and set as the cost of the PartitionJoinPath. A value of 1
means that the cost derived from the samples is used as is; a value
higher than 1 discourages partition-wise join, and a value lower than
1 encourages it. I am not very keen on keeping this GUC in this form,
but we need some way to run the regression tests with smaller data.
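
For instance, a regression test could do something along these lines
(the tables and the value 0.01 are only illustrative):

SET partition_wise_plan_weight = 0.01;  -- make partition-wise join look
                                        -- much cheaper on small test data
EXPLAIN (COSTS OFF)
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;  -- expect a partition-wise plan
RESET partition_wise_plan_weight;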

For now I have disabled partition-wise join for multi-level
partitions. I will post a patch soon with that enabled.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Tue, Dec 27, 2016 at 11:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> PFA patch rebased after partitioning code was committed.
>
> On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Hi Robert,
>> Sorry for delayed response.
>>
>> The attached patch implements following ideas:
>> 1. At the time of creating paths - If the joining relations are both
>> partitioned and join can use partition-wise join, we create paths for
>> few child-joins. Similar to inheritance relations
>> (set_append_rel_pathlist()), we collect paths with similar properties
>> from all sampled child-joins and create one PartitionJoinPath with
>> each set of paths. The cost of the PartitionJoinPath is obtained by
>> multiplying the sum of costs of paths in the given set by the ratio of
>> (number of rows estimated in the parent-join/sum of rows in
>> child-joins).
>>
>> 2. If the PartitionJoinPath emerges as the best path, we create paths
>> for each of the remaining child-joins. Then we collect paths with
>> properties same as the given PartitionJoinPath, one from each
>> child-join. These paths are converted into plans and a Merge/Append
>> plan is created combing these plans. The paths and plans for
>> child-join are created in a temporary memory context. The final plan
>> for each child-join is copied into planner's context and the temporary
>> memory context is reset.
>>
>> Right now, we choose 1% or 1 (whichever is higher) child-joins to base
>> PartitionJoinPath costs on.
>>
>> Memory consumption
>> -----------------------------
>> I tested a 5-way self-join for a table with 1000 partitions, each
>> partition having 1M rows. The memory consumed in standard_planner()
>> was measured with some granular tracking
>> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
>> consumed total of 289MB memory which is approx 6.6 times more than
>> non-partition-wise join which consumed 44MB. That's much better than
>> the earlier 16 times consumption for 5-way join with 100 partitions.
>>
>> The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
>> SpecialJoinInfos for child-joins (64MB), restrictlist translation
>> (92MB), paths for sampled child-joins (1.5MB), building targetlists
>> for child-joins (7MB).
>>
>
> In the earlier implementation, a given clause which was applicable to
> multiple join orders was getting translated as many times as the join
> orders it was applicable in. I changed RestrictInfo for parent to
> store a list of RestrictInfos applicable to children to avoid multiple
> translations.
>
> My earlier patch created the child-join plans in a temporary context
> and then copied them into planner context since the translated clauses
> were allocated memory in temporary memory context then. Now that they
> are stored in planner's context, we can directly create the plan in
> the planner's context.
>
> Third, I added code to free up child SpecialJoinInfos after using those.
>
> As a result the total memory consumption now is 192MB, which is approx
> 4.4 times the memory consumed during planning in case of
> non-partition-wise join.
>
>>
>> Choosing representative child-joins:
>> --------------------------------------------------
>> There's another angle to choosing representative child joins. In a
>> partitioned N-way join, different joins covering different subsets of
>> N relations, will have different size distributions across the
>> partitions. This means that the child-joins costed for (N-k) joins,
>> may be different for those required for (N-k+1) joins. With a factor
>> of 1% sampling, N is such that a child-join participates in 100 joins,
>> we will end up creating paths for all partitions before creating
>> PartitionJoinPaths for the final N-way join. Hopefully that will be a
>> rare case and usually we will end up using paths already created. We
>> can not avoid creating PartitionJoinPaths for subset joins, as there
>> might be cases when partition-wise join will be optimal for an N-k way
>> join but not for N-way join. We may avoid this if we choose
>> representative child-joins based on their positions, in which case, we
>> may end up with some or all of those being empty and thus skewing the
>> costs heavily.
>>
>> Partial paths
>> -----------------
>> AFAIU, we create partial paths for append relation, when all the
>> children have partial paths. Unlike parameterized paths or path with
>> pathkeys, there is no way to create a partial path for a normal path.
>> This means that unless we create paths for all child-joins, we can not
>> create partial paths for appendrel comprising of child-joins, and thus
>> can not use parallel query right now. This may not be that bad, since
>> it would be more efficient to run each child-join in a separate
>> worker, rather than using multiple workers for a single child-join.
>
> This still applies.
>
>>
>> regression tests
>> ----------------------
>> I observed that for small relations (1000 rows in each partition and
>> 100 partitions), the size estimates in append relations and sum of
>> those in child relations are very different. As a result, the
>> extrapolated costs for PartitionJoinPaths as described above, are way
>> higher than costs of join of appends (or even append of joins if we
>> are to create paths for all child-joins). Thus with this approach, we
>> choose partition-wise join for large number of partitions with large
>> data (e.g. 1000 partitions with 1M rows each). These are certainly the
>> cases when partition-wise join is a big win. I have not tried to find
>> out a threshold above which partition-wise join gets chosen with above
>> approach, but it's going to be a larger threshold. That makes writing
>> regression tests difficult, as those will require large data.  So, we
>> have to find a way so that we can test partition-wise join with
>> smaller data. There are few possibilities like 1. convert the fraction
>> of representative child-joins into GUC and setting it to 100% would
>> start choosing partition-wise joins for tables with a few hundred rows
>> per partition, like it did in earlier approach, 2. provide a way to
>> force partition-wise join whenever possible, by say costing
>> partition-wise joins much lesser than non-partition-wise join when a
>> GUC is set (e.g. enable_partition_wise_join with values always, never,
>> optimal or something like that).
>>
>
> For now I have added a float GUC partition_wise_plan_weight. The
> partition-wise join cost derived from the samples is multiplied by
> this GUC and set as the cost of ParitionJoinPath. A value of 1 means
> that the cost derived from the samples are used as is. A value higher
> than 1 discourages use of partition-wise join and that lower than 1
> encourages use of partition-wise join. I am not very keen on keeping
> this GUC, in this form. But we need some way to run regression with
> smaller data.
>
> For now I have disabled partition-wise join for multi-level
> partitions. I will post a patch soon with that enabled.

PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
on the latest code.

Also, PFA a patch to support partition-wise join between multi-level
partitioned tables. I copied Amit Langote's patch for translating the
partition hierarchy into an inheritance hierarchy and added code to
support partition-wise join. You had expressed some concerns about
Amit's approach in [1], but that discussion is still open. So, I
haven't merged those changes into the partition-wise join patch. We may
continue to work on it as a separate patch, or I can include it in the
main partition-wise join patch.

BTW, INSERT into multi-level partitioned tables is crashing with the
latest head. The issue was reported in [2]. Because of that, the
multi_level_partition_join test crashes with pg_dp_join_v6.patch.
Interestingly, the crash vanishes when we apply the patch supporting
multi-level partition-wise join.


[1] https://www.postgresql.org/message-id/CA%2BTgmoaEU10Kmdy44izcqJYLh1fkh58_6sbGGu0Q4b7PPE46eA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKcux6%3Dm1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
> on the latest code.

Maybe not surprisingly given how fast things are moving around here
these days, this needs a rebase.

Apart from that, my overall comment on this patch is that it's huge:
37 files changed, 7993 insertions(+), 287 deletions(-)

Now, more than half of that is regression test cases and their output,
which you will certainly be asked to pare down in any version of this
intended for commit. But even excluding those, it's still a fairly
large patch:
30 files changed, 2783 insertions(+), 272 deletions(-)

I think the reason this is so large is because there's a fair amount
of refactoring work that has been done as a precondition of the actual
meat of the patch, and no attempt has been made to separate the
refactoring work from the main body of the patch.  I think that's
something that needs to be done.  If you look at the way Amit Langote
submitted the partitioning patches and the follow-up bug fixes, he had
a series of patches 0001-blah, 0002-quux, etc. generated using
format-patch.  Each patch had its own commit message written by him
explaining the purpose of that patch, links to relevant discussion,
etc.  If you can separate this into more digestible chunks it will be
easier to get committed.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it?  Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

I think the PartitionScheme stuff should live in the optimizer rather
than in src/backend/catalog/partition.c.  Maybe plancat.c?  Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable.  You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

+                * For two partitioned tables with the same partitioning scheme, it is
+                * assumed that the Oids of matching partitions from both the tables
+                * are placed at the same position in the array of partition oids in

Rather than saying that we assume this, you should say why it has to
be true.  (If it doesn't have to be true, we shouldn't assume it.)

+                * join relations. Partition tables should have same layout as the
+                * parent table and hence should not need any translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged.  This comment seems to imply the contrary.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+               /*
+                * Add this relation to the list of samples ordered by the increasing
+                * number of rows at appropriate place.
+                */
+               foreach (lc, ordered_child_nos)
+               {
+                       int     child_no = lfirst_int(lc);
+                       RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+                       /*
+                        * Keep track of child with lowest number of rows but higher than
+                        * that of the child being inserted. Insert the child before a
+                        * child with highest number of rows lesser than it.
+                        */
+                       if (child_rel->rows <= other_childrel->rows)
+                               insert_after = lc;
+                       else
+                               break;
+               }

Can we use quicksort instead of a hand-coded insertion sort?

+               if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.
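
For illustration, with made-up variable names rather than the patch's,
that function both tests for and fetches the lone member in one call:

    int         outer_relid;

    /*
     * bms_get_singleton_member() returns true only when the set has
     * exactly one member, and stores that member in *outer_relid.
     */
    if (bms_get_singleton_member(outer_relids, &outer_relid))
    {
        /* exactly one outer relation; outer_relid holds its index */
    }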

+        * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse.  You can just add another case for the new
path type.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys?  Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL?  I assume most of the logic is common.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow.  It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work.  I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier.  Not sure exactly where.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes.  It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary.  My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that.  There might
be some work that needs to or could be done on the comments, too.  For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z".  They just say that we do
it.  To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Thu, Feb 2, 2017 at 2:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
>> on the latest code.
>
> Maybe not surprisingly given how fast things are moving around here
> these days, this needs a rebase.
>
> Apart from that, my overall comment on this patch is that it's huge:
>
>  37 files changed, 7993 insertions(+), 287 deletions(-)
>
> Now, more than half of that is regression test cases and their output,
> which you will certainly be asked to pare down in any version of this
> intended for commit.

Yes. I will work on that once the design and implementation are in an
acceptable state. I have already toned down the testcases compared to
the previous patch.

> But even excluding those, it's still a fairly
> patch:
>
>  30 files changed, 2783 insertions(+), 272 deletions(-)
>
> I think the reason this is so large is because there's a fair amount
> of refactoring work that has been done as a precondition of the actual
> meat of the patch, and no attempt has been made to separate the
> refactoring work from the main body of the patch.  I think that's
> something that needs to be done.  If you look at the way Amit Langote
> submitted the partitioning patches and the follow-up bug fixes, he had
> a series of patches 0001-blah, 0002-quux, etc. generated using
> format-patch.  Each patch had its own commit message written by him
> explaining the purpose of that patch, links to relevant discussion,
> etc.  If you can separate this into more digestible chunks it will be
> easier to get committed.

I will try to break down the patch into smaller, easy-to-review,
logically cohesive patches.

>
> Other questions/comments:
>
> Why does find_partition_scheme need to copy the partition bound
> information instead of just pointing to it?  Amit went to some trouble
> to make sure that this can't change under us while we hold a lock on
> the relation, and we'd better hold a lock on the relation if we're
> planning a query against it.

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't, and does not need to, point
to the partition bound information (which should all be the same) of
all those base relations. On the face of it, it looks weird that it
points to only one of them, mostly the one it encounters first. But
since it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point to any one of those. Do
you agree?

>
> I think the PartitionScheme stuff should live in the optimizer rather
> that src/backend/catalog/partition.c.  Maybe plancat.c?  Perhaps we
> eventually need a new file in the optimizer just for partitioning
> stuff, but I'm not sure about that yet.

I placed the PartitionScheme stuff in partition.c because most of the
functions and structures in partition.c are not visible outside that
file. But I will try again to relocate PartitionScheme to the optimizer.

>
> The fact that set_append_rel_size needs to reopen the relation to
> extract a few more bits of information is not desirable.  You need to
> fish this information through in some other way; for example, you
> could have get_relation_info() stash the needed bits in the
> RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions; e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relations and leave
them unused for non-base relations, just like other base-relation-specific
fields in RelOptInfo.

>
> +                * For two partitioned tables with the same
> partitioning scheme, it is
> +                * assumed that the Oids of matching partitions from
> both the tables
> +                * are placed at the same position in the array of
> partition oids in
>
> Rather than saying that we assume this, you should say why it has to
> be true.  (If it doesn't have to be true, we shouldn't assume it.)

Will take care of this.

>
> +                * join relations. Partition tables should have same
> layout as the
> +                * parent table and hence should not need any
> translation. But rest of
>
> The same attributes have to be present with the same types, but they
> can be rearranged.  This comment seems to imply the contrary.

Hmm, will take care of this.

>
> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

>
> +               /*
> +                * Add this relation to the list of samples ordered by
> the increasing
> +                * number of rows at appropriate place.
> +                */
> +               foreach (lc, ordered_child_nos)
> +               {
> +                       int     child_no = lfirst_int(lc);
> +                       RelOptInfo *other_childrel = rel->part_rels[child_no];
> +
> +                       /*
> +                        * Keep track of child with lowest number of
> rows but higher than the
> +                        * that of the child being inserted. Insert
> the child before a
> +                        * child with highest number of rows lesser than it.
> +                        */
> +                       if (child_rel->rows <= other_childrel->rows)
> +                               insert_after = lc;
> +                       else
> +                               break;
> +               }
>
> Can we use quicksort instead of a hand-coded insertion sort?

I guess so, if I write comparison functions, which shouldn't be a
problem. Will try that.
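
Roughly, something like this should do (just a sketch; the array and
variable names here are made up, not from the patch):

    /* order child rels by increasing estimated row count */
    static int
    compare_child_rels_by_rows(const void *a, const void *b)
    {
        const RelOptInfo *rel1 = *(RelOptInfo *const *) a;
        const RelOptInfo *rel2 = *(RelOptInfo *const *) b;

        if (rel1->rows < rel2->rows)
            return -1;
        if (rel1->rows > rel2->rows)
            return 1;
        return 0;
    }

    /* ...and then, instead of the insertion loop: */
    qsort(child_rels, nparts, sizeof(RelOptInfo *), compare_child_rels_by_rows);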

>
> +               if (bms_num_members(outer_relids) > 1)
>
> Seems like bms_get_singleton_member could be used.
>
> +        * Partitioning scheme in join relation indicates a possibilty that the
>
> Spelling.
>
> There seems to be no reason for create_partition_plan to be separated
> from create_plan_recurse.  You can just add another case for the new
> path type.
>
> Why does create_partition_join_path need to be separate from
> create_partition_join_path_with_pathkeys?  Couldn't that be combined
> into a single function with a pathkeys argument that might sometimes
> be NIL?  I assume most of the logic is common.
>
> From a sort of theoretical standpoint, the biggest danger of this
> patch seems to be that by deferring path creation until a later stage
> than normal, we could miss some important processing.
> subquery_planner() does a lot of stuff after
> expand_inherited_tables(); if any of those things, especially the ones
> that happen AFTER path generation, have an effect on the paths, then
> this code needs to compensate for those changes somehow.  It seems
> like having the planning of unsampled children get deferred until
> create_plan() time is awfully surprising; here we are creating the
> plan and suddenly what used to be a straightforward path->plan
> translation is running around doing major planning work.  I can't
> entirely justify it, but I somehow have a feeling that work ought to
> be moved earlier.  Not sure exactly where.
>
> This is not really a full review, mostly because I can't easily figure
> out the motivation for all of the changes the patch makes.  It makes a
> lot of changes in a lot of places, and it's not really very easy to
> understand why those changes are necessary.  My comments above about
> splitting the patch into a series of patches that can potentially be
> reviewed and applied independently, with the main patch being the last
> in the series, are a suggestion as to how to tackle that.  There might
> be some work that needs to or could be done on the comments, too.  For
> example, the patch splits out add_paths_to_append_rel from
> set_append_rel_pathlist, but the comments don't say anything helpful
> like "we need to do X after Y, because Z".  They just say that we do
> it.  To some extent I think the comments in the optimizer have that
> problem generally, so it's not entirely the fault of this patch;
> still, the lack of those explanations makes the code reorganization
> harder to follow, and might confuse future patch authors, too.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Sent the previous mail before completing my reply; sorry about that.
Here's the rest of the reply.

>>
>> +               if (bms_num_members(outer_relids) > 1)
>>
>> Seems like bms_get_singleton_member could be used.
>>
>> +        * Partitioning scheme in join relation indicates a possibilty that the
>>
>> Spelling.

Will take care of this.

>>
>> There seems to be no reason for create_partition_plan to be separated
>> from create_plan_recurse.  You can just add another case for the new
>> path type.

Will take care of this.

>>
>> Why does create_partition_join_path need to be separate from
>> create_partition_join_path_with_pathkeys?  Couldn't that be combined
>> into a single function with a pathkeys argument that might sometimes
>> be NIL?  I assume most of the logic is common.

Agreed. will take care of this.

>>
>> From a sort of theoretical standpoint, the biggest danger of this
>> patch seems to be that by deferring path creation until a later stage
>> than normal, we could miss some important processing.
>> subquery_planner() does a lot of stuff after
>> expand_inherited_tables(); if any of those things, especially the ones
>> that happen AFTER path generation, have an effect on the paths, then
>> this code needs to compensate for those changes somehow.  It seems
>> like having the planning of unsampled children get deferred until
>> create_plan() time is awfully surprising; here we are creating the
>> plan and suddenly what used to be a straightforward path->plan
>> translation is running around doing major planning work.  I can't
>> entirely justify it, but I somehow have a feeling that work ought to
>> be moved earlier.  Not sure exactly where.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator would create paths for child-joins
within a temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there: 1. We have to write
code to copy paths; most paths would be a flat copy, but custom scan
paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we would have to run this mutator after set_cheapest() in
subquery_planner(); but then nothing interesting happens between that
and create_plan(). Expanding PartitionPaths during create_plan() does
not need any path copying, and we expand only the PartitionPaths which
will be converted to plans. That saves a lot of memory, which is the
reason why we defer creating paths for child-joins.

>>
>> This is not really a full review, mostly because I can't easily figure
>> out the motivation for all of the changes the patch makes.  It makes a
>> lot of changes in a lot of places, and it's not really very easy to
>> understand why those changes are necessary.  My comments above about
>> splitting the patch into a series of patches that can potentially be
>> reviewed and applied independently, with the main patch being the last
>> in the series, are a suggestion as to how to tackle that.  There might
>> be some work that needs to or could be done on the comments, too.  For
>> example, the patch splits out add_paths_to_append_rel from
>> set_append_rel_pathlist, but the comments don't say anything helpful
>> like "we need to do X after Y, because Z".  They just say that we do
>> it.  To some extent I think the comments in the optimizer have that
>> problem generally, so it's not entirely the fault of this patch;
>> still, the lack of those explanations makes the code reorganization
>> harder to follow, and might confuse future patch authors, too.

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It seems obvious why we split that functionality into
a separate function; in fact, we don't usually explain in comments why
certain code resides in a separate function. I think that particular
comment (or, for that matter, other such comments in the optimizer)
can be removed altogether, since it just restates the function name as
an "English" sentence. I sometimes find such comments useful, because
I can read just the comments and skip the code, making comprehension
easier. With syntax highlighting on, the brain habitually ignores the
non-comment portions when required. I am open to suggestions.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Per your suggestion I have split the patch into many smaller patches.

0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patch

The first four of these refactor existing code.

0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patch -- just tests, they fail
0008-Partition-wise-join.patch -- actual patch implementing
partition-wise join, still some tests fail

0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch

The last three patches change existing code to expect child(-join)
relations where they were not expected earlier.

Each patch has a summary of the changes.

Partition-wise join for multi-level partitioned tables is not covered
by these patches. I will post those patches soon.

>
>>
>> Other questions/comments:
>>
>> Why does find_partition_scheme need to copy the partition bound
>> information instead of just pointing to it?  Amit went to some trouble
>> to make sure that this can't change under us while we hold a lock on
>> the relation, and we'd better hold a lock on the relation if we're
>> planning a query against it.
>
> PartitionScheme is shared across multiple relations, join or base,
> partitioned similarly. Obviously it can't and does not need to point
> partition bound informations (which should all be same) of all those
> base relations. O the the face of it, it looks weird that it points to
> only one of them, mostly the one which it encounters first. But, since
> it's going to be the same partition bound information, it doesn't
> matter which one. So, I think, we can point of any one of those. Do
> you agree?

Instead of copying PartitionBoundInfo, I have used a pointer to the
first one encountered.

>
>>
>> I think the PartitionScheme stuff should live in the optimizer rather
>> that src/backend/catalog/partition.c.  Maybe plancat.c?  Perhaps we
>> eventually need a new file in the optimizer just for partitioning
>> stuff, but I'm not sure about that yet.
>
> I placed PartitionScheme stuff in partition.c because most of the
> functions and structures in partition.c are not visible outside that
> file. But I will try again to locate PartitionScheme to optimizer.

Moved the code as per your suggestion.

>
>>
>> The fact that set_append_rel_size needs to reopen the relation to
>> extract a few more bits of information is not desirable.  You need to
>> fish this information through in some other way; for example, you
>> could have get_relation_info() stash the needed bits in the
>> RelOptInfo.
>
> I considered this option and discarded it, since not all partitioned
> relations will have OIDs for partitions e.g. partitioned joins will
> not have OIDs for their partitions. But now that I think of it, we
> should probably store those OIDs just for the base relation and leave
> them unused for non-base relations just like other base relation
> specific fields in RelOptInfo.

Changed as per your suggestions.

>
>>
>> +                * For two partitioned tables with the same
>> partitioning scheme, it is
>> +                * assumed that the Oids of matching partitions from
>> both the tables
>> +                * are placed at the same position in the array of
>> partition oids in
>>
>> Rather than saying that we assume this, you should say why it has to
>> be true.  (If it doesn't have to be true, we shouldn't assume it.)
>
> Will take care of this.

Done. Please check.

>
>>
>> +                * join relations. Partition tables should have same
>> layout as the
>> +                * parent table and hence should not need any
>> translation. But rest of
>>
>> The same attributes have to be present with the same types, but they
>> can be rearranged.  This comment seems to imply the contrary.
>
> Hmm, will take care of this.

Done.

>
>>
>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
>
> +1. Will take care of this. Does "representative_partitions_fraction"
> or "sample_partition_fraction" look like a good GUC name? Any other
> suggestions?

used "sample_partition_fraction" for now. Suggestions are welcome.

>
>>
>> +               /*
>> +                * Add this relation to the list of samples ordered by
>> the increasing
>> +                * number of rows at appropriate place.
>> +                */
>> +               foreach (lc, ordered_child_nos)
>> +               {
>> +                       int     child_no = lfirst_int(lc);
>> +                       RelOptInfo *other_childrel = rel->part_rels[child_no];
>> +
>> +                       /*
>> +                        * Keep track of child with lowest number of
>> rows but higher than the
>> +                        * that of the child being inserted. Insert
>> the child before a
>> +                        * child with highest number of rows lesser than it.
>> +                        */
>> +                       if (child_rel->rows <= other_childrel->rows)
>> +                               insert_after = lc;
>> +                       else
>> +                               break;
>> +               }
>>
>> Can we use quicksort instead of a hand-coded insertion sort?
>
> I guess so, if I write comparison functions, which shouldn't be a
> problem. Will try that.

Done.

>
>>
>> +               if (bms_num_members(outer_relids) > 1)
>>
>> Seems like bms_get_singleton_member could be used.

That code is not required any more.

>>
>> +        * Partitioning scheme in join relation indicates a possibilty that the
>>
>> Spelling.

Done.

>>
>> There seems to be no reason for create_partition_plan to be separated
>> from create_plan_recurse.  You can just add another case for the new
>> path type.

Done.

>>
>> Why does create_partition_join_path need to be separate from
>> create_partition_join_path_with_pathkeys?  Couldn't that be combined
>> into a single function with a pathkeys argument that might sometimes
>> be NIL?  I assume most of the logic is common.

Combined those into a single function.

>>
>> From a sort of theoretical standpoint, the biggest danger of this
>> patch seems to be that by deferring path creation until a later stage
>> than normal, we could miss some important processing.
>> subquery_planner() does a lot of stuff after
>> expand_inherited_tables(); if any of those things, especially the ones
>> that happen AFTER path generation, have an effect on the paths, then
>> this code needs to compensate for those changes somehow.  It seems
>> like having the planning of unsampled children get deferred until
>> create_plan() time is awfully surprising; here we are creating the
>> plan and suddenly what used to be a straightforward path->plan
>> translation is running around doing major planning work.  I can't
>> entirely justify it, but I somehow have a feeling that work ought to
>> be moved earlier.  Not sure exactly where.

Pasting my previous replies here to keep everything in one mail.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator would create paths for child-joins
within a temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there: 1. We have to write
code to copy paths; most paths would be a flat copy, but custom scan
paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we would have to run this mutator after set_cheapest() in
subquery_planner(); but then nothing interesting happens between that
and create_plan(). Expanding PartitionPaths during create_plan() does
not need any path copying, and we expand only the PartitionPaths which
will be converted to plans. That saves a lot of memory, which is the
reason why we defer creating paths for child-joins.

>>
>> This is not really a full review, mostly because I can't easily figure
>> out the motivation for all of the changes the patch makes.  It makes a
>> lot of changes in a lot of places, and it's not really very easy to
>> understand why those changes are necessary.  My comments above about
>> splitting the patch into a series of patches that can potentially be
>> reviewed and applied independently, with the main patch being the last
>> in the series, are a suggestion as to how to tackle that.  There might
>> be some work that needs to or could be done on the comments, too.  For
>> example, the patch splits out add_paths_to_append_rel from
>> set_append_rel_pathlist, but the comments don't say anything helpful
>> like "we need to do X after Y, because Z".  They just say that we do
>> it.  To some extent I think the comments in the optimizer have that
>> problem generally, so it's not entirely the fault of this patch;
>> still, the lack of those explanations makes the code reorganization
>> harder to follow, and might confuse future patch authors, too.
>>

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It seems obvious why we split that functionality into
a separate function; in fact, we don't usually explain in comments why
certain code resides in a separate function. I think that particular
comment (or, for that matter, other such comments in the optimizer)
can be removed altogether, since it just restates the function name as
an "English" sentence. I sometimes find such comments useful, because
I can read just the comments and skip the code, making comprehension
easier. With syntax highlighting on, the brain habitually ignores the
non-comment portions when required. I am open to suggestions.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Fixed a problem with the way qsort was being used in the earlier set
of patches. PFA the set of patches with that fixed.

On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Per your suggestion I have split the patch into many smaller patches.
>
> 0001-Refactor-set_append_rel_pathlist.patch
> 0002-Refactor-make_join_rel.patch
> 0003-Refactor-adjust_appendrel_attrs.patch
> 0004-Refactor-build_join_rel.patch
> 0005-Add-function-find_param_path_info.patch
>
> These four refactor existing code.
>
> 0006-Canonical-partition-scheme.patch
> 0007-Partition-wise-join-tests.patch -- just tests, they fail
> 0008-Partition-wise-join.patch -- actual patch implementing
> partition-wise join, still some tests fail\
>
> 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
> 0010-Parameterized-path-fixes.patch
> 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
>
> The last three patches change existing code to expect child(-join)
> relations where they were not expected earlier.
>
> Each patch has summary of the changes.
>
> Partition-wise join for multi-level partitioned tables is not covered
> by these patches. I will post those patches soon.
>
>>
>>>
>>> Other questions/comments:
>>>
>>> Why does find_partition_scheme need to copy the partition bound
>>> information instead of just pointing to it?  Amit went to some trouble
>>> to make sure that this can't change under us while we hold a lock on
>>> the relation, and we'd better hold a lock on the relation if we're
>>> planning a query against it.
>>
>> PartitionScheme is shared across multiple relations, join or base,
>> partitioned similarly. Obviously it can't and does not need to point
>> partition bound informations (which should all be same) of all those
>> base relations. O the the face of it, it looks weird that it points to
>> only one of them, mostly the one which it encounters first. But, since
>> it's going to be the same partition bound information, it doesn't
>> matter which one. So, I think, we can point of any one of those. Do
>> you agree?
>
> Instead of copying PartitionBoundInfo, used pointer of the first
> encountered one.
>
>>
>>>
>>> I think the PartitionScheme stuff should live in the optimizer rather
>>> that src/backend/catalog/partition.c.  Maybe plancat.c?  Perhaps we
>>> eventually need a new file in the optimizer just for partitioning
>>> stuff, but I'm not sure about that yet.
>>
>> I placed PartitionScheme stuff in partition.c because most of the
>> functions and structures in partition.c are not visible outside that
>> file. But I will try again to locate PartitionScheme to optimizer.
>
> Moved the code as per your suggestion.
>
>>
>>>
>>> The fact that set_append_rel_size needs to reopen the relation to
>>> extract a few more bits of information is not desirable.  You need to
>>> fish this information through in some other way; for example, you
>>> could have get_relation_info() stash the needed bits in the
>>> RelOptInfo.
>>
>> I considered this option and discarded it, since not all partitioned
>> relations will have OIDs for partitions e.g. partitioned joins will
>> not have OIDs for their partitions. But now that I think of it, we
>> should probably store those OIDs just for the base relation and leave
>> them unused for non-base relations just like other base relation
>> specific fields in RelOptInfo.
>
> Changed as per your suggestions.
>
>>
>>>
>>> +                * For two partitioned tables with the same
>>> partitioning scheme, it is
>>> +                * assumed that the Oids of matching partitions from
>>> both the tables
>>> +                * are placed at the same position in the array of
>>> partition oids in
>>>
>>> Rather than saying that we assume this, you should say why it has to
>>> be true.  (If it doesn't have to be true, we shouldn't assume it.)
>>
>> Will take care of this.
>
> Done. Please check.
>
>>
>>>
>>> +                * join relations. Partition tables should have same
>>> layout as the
>>> +                * parent table and hence should not need any
>>> translation. But rest of
>>>
>>> The same attributes have to be present with the same types, but they
>>> can be rearranged.  This comment seems to imply the contrary.
>>
>> Hmm, will take care of this.
>
> Done.
>
>>
>>>
>>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
>>
>> +1. Will take care of this. Does "representative_partitions_fraction"
>> or "sample_partition_fraction" look like a good GUC name? Any other
>> suggestions?
>
> used "sample_partition_fraction" for now. Suggestions are welcome.
>
>>
>>>
>>> +               /*
>>> +                * Add this relation to the list of samples ordered by
>>> the increasing
>>> +                * number of rows at appropriate place.
>>> +                */
>>> +               foreach (lc, ordered_child_nos)
>>> +               {
>>> +                       int     child_no = lfirst_int(lc);
>>> +                       RelOptInfo *other_childrel = rel->part_rels[child_no];
>>> +
>>> +                       /*
>>> +                        * Keep track of child with lowest number of
>>> rows but higher than the
>>> +                        * that of the child being inserted. Insert
>>> the child before a
>>> +                        * child with highest number of rows lesser than it.
>>> +                        */
>>> +                       if (child_rel->rows <= other_childrel->rows)
>>> +                               insert_after = lc;
>>> +                       else
>>> +                               break;
>>> +               }
>>>
>>> Can we use quicksort instead of a hand-coded insertion sort?
>>
>> I guess so, if I write comparison functions, which shouldn't be a
>> problem. Will try that.
>
> Done.
>
>>
>>>
>>> +               if (bms_num_members(outer_relids) > 1)
>>>
>>> Seems like bms_get_singleton_member could be used.
>
> That code is not required any more.
>
>>>
>>> +        * Partitioning scheme in join relation indicates a possibilty that the
>>>
>>> Spelling.
>
> Done.
>
>>>
>>> There seems to be no reason for create_partition_plan to be separated
>>> from create_plan_recurse.  You can just add another case for the new
>>> path type.
>
> Done.
>
>>>
>>> Why does create_partition_join_path need to be separate from
>>> create_partition_join_path_with_pathkeys?  Couldn't that be combined
>>> into a single function with a pathkeys argument that might sometimes
>>> be NIL?  I assume most of the logic is common.
>
> Combined those into a single function.
>
>>>
>>> From a sort of theoretical standpoint, the biggest danger of this
>>> patch seems to be that by deferring path creation until a later stage
>>> than normal, we could miss some important processing.
>>> subquery_planner() does a lot of stuff after
>>> expand_inherited_tables(); if any of those things, especially the ones
>>> that happen AFTER path generation, have an effect on the paths, then
>>> this code needs to compensate for those changes somehow.  It seems
>>> like having the planning of unsampled children get deferred until
>>> create_plan() time is awfully surprising; here we are creating the
>>> plan and suddenly what used to be a straightforward path->plan
>>> translation is running around doing major planning work.  I can't
>>> entirely justify it, but I somehow have a feeling that work ought to
>>> be moved earlier.  Not sure exactly where.
>
> Pasting my previous replies here to keep everything in one mail.
>
> I agree with this. Probably we should add a path tree mutator before
> SS_identify_outer_params() to replace any Partition*Paths with
> Merge/Append paths. The mutator will create paths for child-joins
> within temporary memory context, copy the relevant paths and create
> Merge/Append paths. There are two problems there 1. We have to write
> code to copy paths; most of the paths would be flat copy but custom
> scan paths might have some unexpected problems. 2. There will be many
> surviving PartitionPaths, and all the corresponding child paths would
> need copying and consume memory. In order to reduce that consumption,
> we have run this mutator after set_cheapest() in subquery_planner();
> but then nothing interesting happens between that and create_plan().
> Expanding PartitionPaths during create_plan() does not need any path
> copying and we expand only the PartitionPaths which will be converted
> to plans. That does save a lot of memory; the reason why we defer
> creating paths for child-joins.
>
>>>
>>> This is not really a full review, mostly because I can't easily figure
>>> out the motivation for all of the changes the patch makes.  It makes a
>>> lot of changes in a lot of places, and it's not really very easy to
>>> understand why those changes are necessary.  My comments above about
>>> splitting the patch into a series of patches that can potentially be
>>> reviewed and applied independently, with the main patch being the last
>>> in the series, are a suggestion as to how to tackle that.  There might
>>> be some work that needs to or could be done on the comments, too.  For
>>> example, the patch splits out add_paths_to_append_rel from
>>> set_append_rel_pathlist, but the comments don't say anything helpful
>>> like "we need to do X after Y, because Z".  They just say that we do
>>> it.  To some extent I think the comments in the optimizer have that
>>> problem generally, so it's not entirely the fault of this patch;
>>> still, the lack of those explanations makes the code reorganization
>>> harder to follow, and might confuse future patch authors, too.
>>>
>
> Specifically about add_paths_to_append_rel(), what do you expect the
> comment to say? It would be obvious why we split that functionality
> into a separate function: in fact, we don't necessarily explain why
> certain code resides in a separate function in the comments. I think,
> that particular comment (or for that matter other such comments in the
> optimizer) can be removed altogether, since it just writes the
> function names as an "English" sentence. I sometimes find those
> comments useful, because I can read just those comments and forget
> about the code, making comprehension easy. If highlighting is ON, your
> brain habitually ignores the non-comment portions when required. I am
> open to suggestions.
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
Here is the set of patches with support for partition-wise join
between multi-level partitioned tables.


On Fri, Feb 10, 2017 at 11:19 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Fixed a problem with the way qsort was being used in the earlier set
> of patches. Attached PFA the set of patches with that fixed.

This fix is included.

>
> On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Per your suggestion I have split the patch into many smaller patches.
>>
>> 0001-Refactor-set_append_rel_pathlist.patch
>> 0002-Refactor-make_join_rel.patch
>> 0003-Refactor-adjust_appendrel_attrs.patch
>> 0004-Refactor-build_join_rel.patch
>> 0005-Add-function-find_param_path_info.patch
>>
>> These four refactor existing code.
>>
>> 0006-Canonical-partition-scheme.patch
>> 0007-Partition-wise-join-tests.patch -- just tests, they fail
>> 0008-Partition-wise-join.patch -- actual patch implementing
>> partition-wise join, still some tests fail\
>>
>> 0009-Adjust-join-related-to-code-to-accept-child-relation.patch
>> 0010-Parameterized-path-fixes.patch
>> 0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
>>

Patch to translate the partition hierarchy into an inheritance
hierarchy without flattening:

0012-Multi-level-partitioned-table-expansion.patch

Patches for multi-level partition-wise join support:

0013-Multi-level-partition-wise-join-tests.patch
0014-Multi-level-partition-wise-join-support.patch

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Mon, Feb 6, 2017 at 3:34 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> PartitionScheme is shared across multiple relations, join or base,
> partitioned similarly. Obviously it can't and does not need to point
> partition bound informations (which should all be same) of all those
> base relations. O the the face of it, it looks weird that it points to
> only one of them, mostly the one which it encounters first. But, since
> it's going to be the same partition bound information, it doesn't
> matter which one. So, I think, we can point of any one of those. Do
> you agree?

Yes.

>> The fact that set_append_rel_size needs to reopen the relation to
>> extract a few more bits of information is not desirable.  You need to
>> fish this information through in some other way; for example, you
>> could have get_relation_info() stash the needed bits in the
>> RelOptInfo.
>
> I considered this option and discarded it, since not all partitioned
> relations will have OIDs for partitions e.g. partitioned joins will
> not have OIDs for their partitions. But now that I think of it, we
> should probably store those OIDs just for the base relation and leave
> them unused for non-base relations just like other base relation
> specific fields in RelOptInfo.

Right.

>> FRACTION_PARTS_TO_PLAN seems like it should be a GUC.
>
> +1. Will take care of this. Does "representative_partitions_fraction"
> or "sample_partition_fraction" look like a good GUC name? Any other
> suggestions?

I like the second one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



>
> 2. If the PartitionJoinPath emerges as the best path, we create paths
> for each of the remaining child-joins. Then we collect paths with
> properties same as the given PartitionJoinPath, one from each
> child-join. These paths are converted into plans and a Merge/Append
> plan is created combing these plans. The paths and plans for
> child-join are created in a temporary memory context. The final plan
> for each child-join is copied into planner's context and the temporary
> memory context is reset.
>

Robert and I discussed this in more detail. Path creation code may
allocate objects other than paths; postgres_fdw, for example,
allocates a character array to hold the name of the relation being
pushed down. When the temporary context gets zapped after creating
paths for a given child-join, those other objects also get thrown
away. The attached patch implements the idea that came out of the
discussion.

We create a memory context for holding paths at the time of creating
PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
macro makePathNode() which allocates the memory for a given type of
path from this context. Every create_*_path function has been changed
to use this macro instead of makeNode(). In standard_planner(), at the
end of planning we destroy the memory context, freeing all the paths
allocated. While creating a plan node, the planner copies everything
required by the plan from the path, so the path is not needed any
more. So, freeing the corresponding memory should not have any adverse
effects.

Most of the create_*_path() functions accept root as an argument, thus
the temporary path context is available through root->glob everywhere.
An exception is create_append_path(), which does not accept root as an
argument. The patch changes create_append_path() and its callers like
set_dummy_rel_pathlist() and mark_dummy_rel() to accept root as an
argument. Ideally paths are not required after creating the plan, so
we should be able to free the context right after the call to
create_plan(). But we need dummy paths while creating the flat rtable
in set_plan_references()->add_rtes_to_flat_rtable(). So we free the
path context at the end of the planning cycle. Now that we are
allocating all the paths in a different memory context, it doesn't
make sense to switch context in mark_dummy_rel().
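
To make the shape of this concrete, here is a rough sketch; the
context field name (pathcxt) and the exact macro definition are
assumptions rather than the patch's actual code:

    /* In PlannerGlobal (nodes/relation.h): */
    /*     MemoryContext pathcxt;    all Path nodes are allocated in this context */

    /* In standard_planner(), when PlannerGlobal is set up: */
    glob->pathcxt = AllocSetContextCreate(CurrentMemoryContext,
                                          "Path nodes",
                                          ALLOCSET_DEFAULT_SIZES);

    /* Allocate a zeroed path node of the given size and tag from that context. */
    static void *
    newPathNode(PlannerInfo *root, Size size, NodeTag tag)
    {
        Node       *node;

        node = (Node *) MemoryContextAllocZero(root->glob->pathcxt, size);
        node->type = tag;
        return node;
    }

    /* Drop-in replacement for makeNode() inside create_*_path() functions. */
    #define makePathNode(root, _type_) \
        ((_type_ *) newPathNode((root), sizeof(_type_), T_##_type_))

    /* At the end of standard_planner(), once set_plan_references() has run: */
    MemoryContextDelete(glob->pathcxt);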

Patch 0001 implements the idea described above.
Patch 0002 adds instrumentation to measure the memory consumed in a
standard_planner() call.
Patch 0003 adds a GUC zap_paths to enable/disable destroying the path context.
The last two patches are for testing only.

Also attached are the SQL script and its output showing the memory
saved. For a 5-way self-join of pg_class, the total memory consumed in
standard_planner() is 760K without the patch, and with the patch it
comes down to 713K, saving 47K of memory otherwise occupied by paths.
That looks useful even without partition-wise joins.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Updated the 0001 patch with some more comments. Attaching all the
patches for quick access.

On Wed, Mar 1, 2017 at 2:26 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>>
>> 2. If the PartitionJoinPath emerges as the best path, we create paths
>> for each of the remaining child-joins. Then we collect paths with
>> properties same as the given PartitionJoinPath, one from each
>> child-join. These paths are converted into plans and a Merge/Append
>> plan is created combing these plans. The paths and plans for
>> child-join are created in a temporary memory context. The final plan
>> for each child-join is copied into planner's context and the temporary
>> memory context is reset.
>>
>
> Robert and I discussed this in more detail. Path creation code may
> allocate objects other than paths. postgres_fdw, for example,
> allocates character array to hold the name of relation being
> pushed-down. When the temporary context gets zapped after creating
> paths for a given child-join, those other objects also gets thrown
> away. Attached patch has implemented the idea that came out of the
> discussion.
>
> We create a memory context for holding paths at the time of creating
> PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
> macro makePathNode() which allocates the memory for given type of path
> from this context. Every create_*_path function has been changed to
> use this macro instead of makeNode(). In standard_planner(), at the
> end of planning we destroy the memory context freeing all the paths
> allocated. While creating a plan node, planner copies everything
> required by the plan from the path, so the path is not needed any
> more. So, freeing corresponding memory should not have any adverse
> effects.
>
> Most of the create_*_path() functions accept root as an argument, thus
> the temporary path context is available through root->glob everywhere.
> An exception is create_append_path() which does not accept root as an
> argument. The patch changes create_append_path() and its callers like
> set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
> argument. Ideally paths are not required after creating plan, so we
> should be
> able to free the context right after the call to create_plan(). But we
> need dummy paths while creating flat rtable in
> set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
> the path context at the end of planning cycle. Now that we are
> allocating all the paths in a different memory context, it doesn't
> make sense to switch context in mark_dummy_rel().
>
> 0001 patch implements the idea described above.
> 0002 patch adds instrumentation to measure memory consumed in
> standard_planner() call.
> 0003 patch adds a GUC zap_paths to enable/disable destroying path context.
> The last two patches are for testing only.
>
> Attached also find the SQL script and its output showing the memory
> saved. For a 5 way self-join of pg_class, the total memory consumed in
> standard_planner() is 760K without patch and with patch it comes down
> to 713K, saving 47K memory otherwise occupied by paths. It looks like
> something useful even without partition-wise joins.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> 2. If the PartitionJoinPath emerges as the best path, we create paths
>> for each of the remaining child-joins. Then we collect paths with
>> properties same as the given PartitionJoinPath, one from each
>> child-join. These paths are converted into plans and a Merge/Append
>> plan is created combing these plans. The paths and plans for
>> child-join are created in a temporary memory context. The final plan
>> for each child-join is copied into planner's context and the temporary
>> memory context is reset.
>>
>
> Robert and I discussed this in more detail. Path creation code may
> allocate objects other than paths. postgres_fdw, for example,
> allocates character array to hold the name of relation being
> pushed-down. When the temporary context gets zapped after creating
> paths for a given child-join, those other objects also gets thrown
> away. Attached patch has implemented the idea that came out of the
> discussion.
>
> We create a memory context for holding paths at the time of creating
> PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
> macro makePathNode() which allocates the memory for given type of path
> from this context. Every create_*_path function has been changed to
> use this macro instead of makeNode(). In standard_planner(), at the
> end of planning we destroy the memory context freeing all the paths
> allocated. While creating a plan node, planner copies everything
> required by the plan from the path, so the path is not needed any
> more. So, freeing corresponding memory should not have any adverse
> effects.
>
> Most of the create_*_path() functions accept root as an argument, thus
> the temporary path context is available through root->glob everywhere.
> An exception is create_append_path() which does not accept root as an
> argument. The patch changes create_append_path() and its callers like
> set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
> argument. Ideally paths are not required after creating plan, so we
> should be
> able to free the context right after the call to create_plan(). But we
> need dummy paths while creating flat rtable in
> set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
> the path context at the end of planning cycle. Now that we are
> allocating all the paths in a different memory context, it doesn't
> make sense to switch context in mark_dummy_rel().
>
> 0001 patch implements the idea described above.
> 0002 patch adds instrumentation to measure memory consumed in
> standard_planner() call.
> 0003 patch adds a GUC zap_paths to enable/disable destroying path context.
> The last two patches are for testing only.
>
> Attached also find the SQL script and its output showing the memory
> saved. For a 5 way self-join of pg_class, the total memory consumed in
> standard_planner() is 760K without patch and with patch it comes down
> to 713K, saving 47K memory otherwise occupied by paths. It looks like
> something useful even without partition-wise joins.

Of course, that's not a lot, but the savings will be a lot better for
partition-wise joins.  Do you have a set of patches for that feature
that apply on top of 0001?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



PFA the zip containing all the patches rebased on
56018bf26eec1a0b4bf20303c98065a8eb1b0c5d, including the patch to free
memory consumed by paths using a separate path context.

There are some more changes with respect to the earlier set of patches:
1. Since we don't need a separate context for planning each
child-join, I changed the code in create_partition_join_plan() to not
do that. The function collects all child-join paths into a Merge/Append
path and calls create_plan_recurse() on that path instead of
converting each child-join path to a plan one at a time.

2. Changed optimizer/README and some comments referring to the
temporary memory context, since we do not use that anymore.

3. reparameterize_path_by_child() is fixed to translate the merge and
hash clauses in Hash/Merge paths.

On Thu, Mar 9, 2017 at 6:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> 2. If the PartitionJoinPath emerges as the best path, we create paths
>>> for each of the remaining child-joins. Then we collect paths with
>>> properties same as the given PartitionJoinPath, one from each
>>> child-join. These paths are converted into plans and a Merge/Append
>>> plan is created combing these plans. The paths and plans for
>>> child-join are created in a temporary memory context. The final plan
>>> for each child-join is copied into planner's context and the temporary
>>> memory context is reset.
>>>
>>
>> Robert and I discussed this in more detail. Path creation code may
>> allocate objects other than paths. postgres_fdw, for example,
>> allocates character array to hold the name of relation being
>> pushed-down. When the temporary context gets zapped after creating
>> paths for a given child-join, those other objects also gets thrown
>> away. Attached patch has implemented the idea that came out of the
>> discussion.
>>
>> We create a memory context for holding paths at the time of creating
>> PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
>> macro makePathNode() which allocates the memory for given type of path
>> from this context. Every create_*_path function has been changed to
>> use this macro instead of makeNode(). In standard_planner(), at the
>> end of planning we destroy the memory context freeing all the paths
>> allocated. While creating a plan node, planner copies everything
>> required by the plan from the path, so the path is not needed any
>> more. So, freeing corresponding memory should not have any adverse
>> effects.
>>
>> Most of the create_*_path() functions accept root as an argument, thus
>> the temporary path context is available through root->glob everywhere.
>> An exception is create_append_path() which does not accept root as an
>> argument. The patch changes create_append_path() and its callers like
>> set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
>> argument. Ideally paths are not required after creating plan, so we
>> should be
>> able to free the context right after the call to create_plan(). But we
>> need dummy paths while creating flat rtable in
>> set_plan_references()->add_rtes_to_flat_rtable(). So we free
>> the path context at the end of the planning cycle. Now that we are
>> allocating all the paths in a different memory context, it doesn't
>> make sense to switch context in mark_dummy_rel().
>>
>> 0001 patch implements the idea described above.
>> 0002 patch adds instrumentation to measure memory consumed in
>> standard_planner() call.
>> 0003 patch adds a GUC zap_paths to enable/disable destroying path context.
>> The last two patches are for testing only.
>>
>> Attached also find the SQL script and its output showing the memory
>> saved. For a 5 way self-join of pg_class, the total memory consumed in
>> standard_planner() is 760K without patch and with patch it comes down
>> to 713K, saving 47K memory otherwise occupied by paths. It looks like
>> something useful even without partition-wise joins.
>
> Of course, that's not a lot, but the savings will be a lot better for
> partition-wise joins.  Do you have a set of patches for that feature
> that apply on top of 0001?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Fri, Mar 10, 2017 at 5:43 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> PFA the zip containing all the patches, rebased on
> 56018bf26eec1a0b4bf20303c98065a8eb1b0c5d and containing the patch to free
> memory consumed by paths using a separate path context.

Some very high-level thoughts based on a look through these patches:

In 0001, you've removed a comment about how GEQO needs special
handling, but it doesn't look as if you've made any compensating
change elsewhere.  That seems unlikely to be correct.  If GEQO needs
some paths to survive longer than others, how can it be right for this
code to create them all in the same context?  Incidentally,
geqo_eval() seems to be an existing precedent for the idea of throwing
away paths and RelOptInfos, so we might want to use similar code for
partitionwise join.

0002 and 0003 look OK.

Probably 0004 is OK too, although that seems to be adding some
overhead to existing callers for the benefit of new ones.  Might be
insignificant, though.

0005 looks OK, except that add_join_rel's definition is missing a
"static" qualifier.  That's not just cosmetic; based on previous
experience, this will break the BF.

0006 seems to be unnecessary; the new function isn't used in later patches.

Haven't looked at 0007 yet.

0008 is, as previously mentioned, more than we probably want to commit.

Haven't looked at 0009 yet.

0010 - 0012 seem to be various fixes which would need to be done
before or along with 0009, rather than afterward, so I am confused
about the ordering of those patches in the patch series.

The commit message for 0013 is a bit unclear about what it's doing,
although I can guess, a bit, based on the commit message for 0007.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Haven't looked at 0007 yet.

+               if (rel->part_scheme)
+               {
+                       int             cnt_parts;
+
+                       for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+                       {
+                               if (rel->part_oids[cnt_parts] == childRTE->relid)
+                               {
+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;
+                               }
+                       }
+               }

It's not very appealing to use an O(n^2) algorithm here.  I wonder if
we could arrange things so that inheritance expansion expands
partitions in the right order, and then we could just match them up
one-to-one.  This would probably require an alternate version of
find_all_inheritors() that expand_inherited_rtentry() would call only
for partitioned tables.  Failing that, another idea would be to use
qsort() or qsort_arg() to put the partitions in the right order.
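
For what it's worth, if inheritance expansion emitted the partitions in
rd_partdesc order, the matching could collapse to a one-to-one
assignment along these lines (a rough sketch only; the next_part
counter is made up for illustration):

    /*
     * Hypothetical sketch: assumes expand_inherited_rtentry() produces the
     * partitioned children in the same order as rel->part_oids, so the
     * O(n^2) scan becomes a single assignment per child.
     */
    if (rel->part_scheme)
    {
        Assert(rel->part_oids[next_part] == childRTE->relid);
        Assert(rel->part_rels[next_part] == NULL);
        rel->part_rels[next_part++] = childrel;
    }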

+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))

Maybe just don't call this function in the first place in the
!inhparent case, instead of passing down an argument that must always
be true.

+               /* Match the partition key types. */
+               for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++)
+               {
+                       /*
+                        * For types, it suffices to match the type id, mod and collation;
+                        * len, byval and align are depedent on the first two.
+                        */
+                       if (part_key->partopfamily[cnt_pks] != part_scheme->partopfamily[cnt_pks] ||
+                               part_key->partopcintype[cnt_pks] != part_scheme->partopcintype[cnt_pks] ||
+                               part_key->parttypid[cnt_pks] != part_scheme->key_types[cnt_pks] ||
+                               part_key->parttypmod[cnt_pks] != part_scheme->key_typmods[cnt_pks] ||
+                               part_key->parttypcoll[cnt_pks] != part_scheme->key_collations[cnt_pks])
+                               break;
+               }

I think memcmp() might be better than a for-loop.
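
Roughly, assuming the PartitionScheme arrays are laid out exactly as in
the hunk above, the comparison could collapse to something like this
(sketch only):

    /* Compare whole key arrays at once instead of element by element. */
    if (memcmp(part_key->partopfamily, part_scheme->partopfamily,
               sizeof(Oid) * partnatts) == 0 &&
        memcmp(part_key->partopcintype, part_scheme->partopcintype,
               sizeof(Oid) * partnatts) == 0 &&
        memcmp(part_key->parttypid, part_scheme->key_types,
               sizeof(Oid) * partnatts) == 0 &&
        memcmp(part_key->parttypmod, part_scheme->key_typmods,
               sizeof(int32) * partnatts) == 0 &&
        memcmp(part_key->parttypcoll, part_scheme->key_collations,
               sizeof(Oid) * partnatts) == 0)
        return part_scheme;     /* found a matching scheme */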

Overall this one looks pretty good and straightforward.  Of course, I
haven't looked at the main act (0009) yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2017/03/14 9:17, Robert Haas wrote:
> On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Haven't looked at 0007 yet.
> 
> Overall this one looks pretty good and straightforward.

In the following code of find_partition_scheme():

+    /* Did not find matching partition scheme. Create one. */
+    part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData));
+
+    /* Copy partition bounds/lists. */
+    part_scheme->nparts = part_desc->nparts;
+    part_scheme->strategy = part_key->strategy;
+    part_scheme->boundinfo = part_desc->boundinfo;
+
+    /* Store partition key information. */
+    part_scheme->partnatts = part_key->partnatts;
+
+    part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts);
+    memcpy(part_scheme->partopfamily, part_key->partopfamily,
+           sizeof(Oid) * partnatts);
+
+    part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts);
+    memcpy(part_scheme->partopcintype, part_key->partopcintype,
+           sizeof(Oid) * partnatts);
+
+    part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts);
+    memcpy(part_scheme->key_types, part_key->parttypid,
+           sizeof(Oid) * partnatts);
+
+    part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts);
+    memcpy(part_scheme->key_typmods, part_key->parttypmod,
+           sizeof(int32) * partnatts);
+
+    part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts);
+    memcpy(part_scheme->key_collations, part_key->parttypcoll,
+           sizeof(Oid) * partnatts);

Couldn't we avoid the memcpy() on individual members of part_key?  After
all, RelationData.rd_partkey is guarded just like rd_partdesc by
relcache.c in face of invalidations (see keep_partkey logic in
RelationClearRelation).
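
Concretely, I am imagining something like the following (just a sketch;
it relies on rd_partkey surviving for as long as the PartitionScheme
does):

    /* Point at the relcache arrays instead of copying them. */
    part_scheme->partnatts = part_key->partnatts;
    part_scheme->partopfamily = part_key->partopfamily;
    part_scheme->partopcintype = part_key->partopcintype;
    part_scheme->key_types = part_key->parttypid;
    part_scheme->key_typmods = part_key->parttypmod;
    part_scheme->key_collations = part_key->parttypcoll;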

Thanks,
Amit





Thanks for the review.

>
> Some very high-level thoughts based on a look through these patches:
>
> In 0001, you've removed a comment about how GEQO needs special
> handling, but it doesn't look as if you've made any compensating
> change elsewhere.  That seems unlikely to be correct.  If GEQO needs
> some paths to survive longer than others, how can it be right for this
> code to create them all in the same context?

Thanks for pointing that out. I have put the code and the comments
back. There was another issue: the temporary paths created by GEQO
would not be freed when GEQO moves from one genetic string to the next
(or what it calls a tour: a list of relations to be joined in a given
order). To fix this, we need to set the path context to GEQO's
temporary context inside geqo_eval() before calling gimme_tree() and
reset it later. That way the temporary paths are also created in
GEQO's temporary memory context. Fixed in the patch.
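
The change inside geqo_eval() looks roughly like this (a sketch;
path_cxt is the field the 0001 patch adds to PlannerGlobal, and
mycontext is geqo_eval()'s existing temporary context):

    MemoryContext oldpathcxt = root->glob->path_cxt;

    /* Make gimme_tree() allocate its paths in GEQO's temporary context. */
    root->glob->path_cxt = mycontext;
    joinrel = gimme_tree(root, tour, num_gene);
    root->glob->path_cxt = oldpathcxt;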

> Incidentally,
> geqo_eval() seems to be an existing precedent for the idea of throwing
> away paths and RelOptInfos, so we might want to use similar code for
> partitionwise join.

There are some differences in what geqo does and what partition-wise
needs to do. geqo tries many joining orders each one in a separate
temporary context. The way geqo slices the work, every slice produces
a full plan. For partition-wise join I do not see a way to slice the
work such that the whole path and corresponding RelOptInfos come from
the same slice. So, we can't use the same method as GEQO.

It's worth noticing that paths are created twice for the cheapest
join order that it finds: once in the trial phase and a second time
when the final plan is created. The second time, the paths,
RelOptInfos and expressions used by the final plan are in the same
context as the plan.

>
> 0002 and 0003 look OK.
>
> Probably 0004 is OK too, although that seems to be adding some
> overhead to existing callers for the benefit of new ones.  Might be
> insignificant, though.

Yes, the overhead is to add the appinfo to a list and extract it again
when there is only one appinfo. We could optimize that by passing the
appinfo directly when there's only one and a list when there are more,
but that complicates the code unnecessarily. The small overhead seems
worth keeping the code simpler.

>
> 0005 looks OK, except that add_join_rel's definition is missing a
> "static" qualifier.  That's not just cosmetic; based on previous
> experience, this will break the BF.

Thanks for pointing it out. Done.

>
> 0006 seems to be unnecessary; the new function isn't used in later patches.

It's required by 0011 - reparameterize_path_by_child(). BTW, I need to
know whether reparameterize_path_by_child() looks good, so that I can
complete it by adding support for all kinds of paths in that function.

>
> Haven't looked at 0007 yet.
>
> 0008 is, as previously mentioned, more than we probably want to commit.

I agree, and I will work on that.

>
> Haven't looked at 0009 yet.
>
> 0010 - 0012 seem to be various fixes which would need to be done
> before or along with 0009, rather than afterward, so I am confused
> about the ordering of those patches in the patch series.

They are needed only when we have 0009. But when those fixes are
clubbed with 0009, it makes 0009 harder to review because their code
mixes with the code for partition-wise join support. So, I have
separated them out into patches categorized by functionality. A
reviewer may then apply 0009 and see which failures each of the
changes in 0010-0012 fixes, if required. They need to be committed
along with 0009.

>
> The commit message for 0013 is a bit unclear about what it's doing,
> although I can guess, a bit, based on the commit message for 0007.
>

This is a preparatory patch for 0015, which supports partition-wise
join for multi-level partitioned tables. We have discussed
partition-wise join support for multi-level partitioned tables in [1].
We may decide to postpone patches 0013-0015 to v11 if this gets to be
too much for v10.

[1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Mar 14, 2017 at 5:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Haven't looked at 0007 yet.
>
> +               if (rel->part_scheme)
> +               {
> +                       int             cnt_parts;
> +
> +                       for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
> +                       {
> +                               if (rel->part_oids[cnt_parts] ==
> childRTE->relid)
> +                               {
> +                                       Assert(!rel->part_rels[cnt_parts]);
> +                                       rel->part_rels[cnt_parts] = childrel;
> +                               }
> +                       }
> +               }
>
> It's not very appealing to use an O(n^2) algorithm here.  I wonder if
> we could arrange things so that inheritance expansion expands
> partitions in the right order, and then we could just match them up
> one-to-one.  This would probably require an alternate version of
> find_all_inheritors() that expand_inherited_rtentry() would call only
> for partitioned tables.

That seems a much better solution, but
1. Right now when we expand a multi-level partitioned table, we
include indirect partitions as direct children in the inheritance
hierarchy. The part_rels array OTOH should correspond to the
partitioning scheme and should hold RelOptInfos of direct partitions.
The 0013 patch fixes that to include only direct partitions as direct
children, preserving the partitioning hierarchy in the inheritance
hierarchy. That patch right now uses find_inheritance_children() to
get the OIDs of direct partitions, but instead it could return
rd_partdesc->oids in the form of a list, with the OIDs ordered the
same as the array. Once we do that, we should expect the appinfos to
appear in the same order as rd_partdesc->oids and hence
RelOptInfo::part_oids. We just need to make sure that the order is
preserved and assign part_rels as they appear in that loop.

One could argue that we preserve the OIDs only for single-level
partitioned tables, but in expand_inherited_rtentry(), if we want to
detect whether a relation is single-level or multi-level partitioned,
we need to look up its direct partitions to see if they are further
partitioned. That will look a bit ugly and will not be necessary once
we have 0013. In case we decide to defer the multi-level partitioned
table changes to v11, and based on the progress in [1], I will work on
fixing the order in which appinfos are created for single-level
partitioned tables.

> Failing that, another idea would be to use
> qsort() or qsort_arg() to put the partitions in the right order.

I didn't get this. I could not find documentation for qsort_arg(). Can
you please elaborate? I guess, if we fix expand_inherited_rtentry(),
we don't need this. It looks like we will change
expand_inherited_rtentry() anyway.

>
> +       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
> +               !inhparent ||
> +               !(rel->part_scheme = find_partition_scheme(root, relation)))
>
> Maybe just don't call this function in the first place in the
> !inhparent case, instead of passing down an argument that must always
> be true.

The function serves as a single place to re/set partitioning
information. It sets the partitioning information if the above three
conditions are met; otherwise it nullifies that information. If we
decide not to call this function when !inhparent, we will need to
nullify the partitioning information outside of this function as well
as inside it, duplicating the code.

>
> +               /* Match the partition key types. */
> +               for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++)
> +               {
> +                       /*
> +                        * For types, it suffices to match the type
> id, mod and collation;
> +                        * len, byval and align are depedent on the first two.
> +                        */
> +                       if (part_key->partopfamily[cnt_pks] !=
> part_scheme->partopfamily[cnt_pks] ||
> +                               part_key->partopcintype[cnt_pks] !=
> part_scheme->partopcintype[cnt_pks] ||
> +                               part_key->parttypid[cnt_pks] !=
> part_scheme->key_types[cnt_pks] ||
> +                               part_key->parttypmod[cnt_pks] !=
> part_scheme->key_typmods[cnt_pks] ||
> +                               part_key->parttypcoll[cnt_pks] !=
> part_scheme->key_collations[cnt_pks])
> +                               break;
> +               }
>
> I think memcmp() might be better than a for-loop.

Done.

PFA patches.

[1] https://www.postgresql.org/message-id/2b0d42f2-3a53-763b-c9c2-47139e4b1c2e@lab.ntt.co.jp
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Tue, Mar 14, 2017 at 6:28 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/03/14 9:17, Robert Haas wrote:
>> On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Haven't looked at 0007 yet.
>>
>> Overall this one looks pretty good and straightforward.
>
> In the following code of find_partition_scheme():
>
> +       /* Did not find matching partition scheme. Create one. */
> +       part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData));
> +
> +       /* Copy partition bounds/lists. */
> +       part_scheme->nparts = part_desc->nparts;
> +       part_scheme->strategy = part_key->strategy;
> +       part_scheme->boundinfo = part_desc->boundinfo;
> +
> +       /* Store partition key information. */
> +       part_scheme->partnatts = part_key->partnatts;
> +
> +       part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts);
> +       memcpy(part_scheme->partopfamily, part_key->partopfamily,
> +                  sizeof(Oid) * partnatts);
> +
> +       part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts);
> +       memcpy(part_scheme->partopcintype, part_key->partopcintype,
> +                  sizeof(Oid) * partnatts);
> +
> +       part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts);
> +       memcpy(part_scheme->key_types, part_key->parttypid,
> +                  sizeof(Oid) * partnatts);
> +
> +       part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts);
> +       memcpy(part_scheme->key_typmods, part_key->parttypmod,
> +                  sizeof(int32) * partnatts);
> +
> +       part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts);
> +       memcpy(part_scheme->key_collations, part_key->parttypcoll,
> +                  sizeof(Oid) * partnatts);
>
> Couldn't we avoid the memcpy() on individual members of part_key?  After
> all, RelationData.rd_partkey is guarded just like rd_partdesc by
> relcache.c in face of invalidations (see keep_partkey logic in
> RelationClearRelation).

This suggestion looks good to me. Incorporated in the latest set of patches.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Mar 14, 2017 at 8:04 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> In 0001, you've removed a comment about how GEQO needs special
>> handling, but it doesn't look as if you've made any compensating
>> change elsewhere.  That seems unlikely to be correct.  If GEQO needs
>> some paths to survive longer than others, how can it be right for this
>> code to create them all in the same context?
>
> Thanks for pointing that out. I have put the code and the comments
> back. There was another issue: the temporary paths created by GEQO
> would not be freed when GEQO moves from one genetic string to the next
> (or what it calls a tour: a list of relations to be joined in a given
> order). To fix this, we need to set the path context to GEQO's
> temporary context inside geqo_eval() before calling gimme_tree() and
> reset it later. That way the temporary paths are also created in
> GEQO's temporary memory context. Fixed in the patch.

Yeah, that looks better.

>> Incidentally,
>> geqo_eval() seems to be an existing precedent for the idea of throwing
>> away paths and RelOptInfos, so we might want to use similar code for
>> partitionwise join.
>
> There are some differences in what geqo does and what partition-wise
> needs to do. geqo tries many joining orders each one in a separate
> temporary context. The way geqo slices the work, every slice produces
> a full plan. For partition-wise join I do not see a way to slice the
> work such that the whole path and corresponding RelOptInfos come from
> the same slice. So, we can't use the same method as GEQO.

What I was thinking about was the use of this technique for getting
rid of joinrels:
    root->join_rel_list = list_truncate(root->join_rel_list,
                                        savelength);
    root->join_rel_hash = savehash;

makePathNode() serves to segregate paths into a separate memory
context that can then be destroyed, but as you point out, the path
lists are still hanging around, and so are the RelOptInfo nodes.  It
seems to me we could do a lot better using this technique.  Suppose we
jigger things so that the List objects created by add_path go into
path_cxt, and so that RelOptInfo nodes also go into path_cxt.  Then
when we blow up path_cxt we won't have dangling pointers in the
RelOptInfo objects any more because the RelOptInfos themselves will be
gone.  The only problem is that the join_rel_list (and join_rel_hash
if it exists) will be corrupt, but we can fix that using the technique
demonstrated above.
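
Concretely, for each top-level child-join the cleanup might look
something like this (a sketch only, reusing geqo_eval()'s save/restore
dance; child_paths_cxt stands in for whatever context the child paths
and RelOptInfos would end up in):

    int         savelength;
    struct HTAB *savehash;

    /* Remember the joinrel bookkeeping before planning one child-join. */
    savelength = list_length(root->join_rel_list);
    savehash = root->join_rel_hash;
    root->join_rel_hash = NULL;

    /* ... build paths and RelOptInfos for this child-join in child_paths_cxt ... */

    /* Forget the child joinrels and free their paths and RelOptInfos. */
    root->join_rel_list = list_truncate(root->join_rel_list, savelength);
    root->join_rel_hash = savehash;
    MemoryContextReset(child_paths_cxt);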

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't.  In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything.  That seems like a fairly major omission.

Incidentally, I committed 0002, 0003, and 0005 as a single commit with
a few tweaks; I think you may need to do a bit of rebasing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Of course, that supposes that 0009 can manage to postpone creating
> non-sampled child joinrels until create_partition_join_plan(), which
> it currently doesn't.  In fact, unless I'm missing something, 0009
> hasn't been even slightly adapted to take advantage of the
> infrastructure in 0001; it doesn't seem to reset the path_cxt or
> anything.  That seems like a fairly major omission.

Some other comments on 0009:

Documentation changes for the new GUCs are missing.

+between the partition keys of the joining tables. The equi-join between
+partition keys implies that for a given row in a given partition of a given
+partitioned table, its joining row, if exists, should exist only in the
+matching partition of the other partitioned table; no row from non-matching

There could be more than one.   I'd write: The equi-join between
partition keys implies that all join partners for a given row in one
partitioned table must be in the corresponding partition of the other
partitioned table.

+#include "miscadmin.h"
 #include <limits.h>
 #include <math.h>

Added in wrong place.

+                                * System attributes do not need translation. In such a case,
+                                * the attribute numbers of the parent and the child should
+                                * start from the same minimum attribute.

I would delete the second sentence and add an Assert() to that effect instead.

+               /* Pass top parent's relids down the inheritance hierarchy. */

Why?

+                       for (attno = rel->min_attr; attno <= rel->max_attr; attno++)

Add a comment explaining why we need to do this.

-       add_paths_to_append_rel(root, rel, live_childrels);
+       add_paths_to_append_rel(root, rel, live_childrels, false);
 }

-

No need to remove blank line.

+ * When called on partitioned join relation with partition_join_path = true, it
+ * adds PartitionJoinPath instead of Merge/Append path. This path is costed
+ * based on the costs of sampled child-join and is expanded later into
+ * Merge/Append plan.

I'm not a big fan of the Merge/Append terminology here.  If somebody
adds another kind of append-path someday, then all of these comments
will have to be updated.  I think this can be phrased more
generically.

       /*
+        * While creating PartitionJoinPath, we sample paths from only a few child
+        * relations. Even if all of sampled children have partial paths, it's not
+        * guaranteed that all the unsampled children will have partial paths.
+        * Hence we do not create partial PartitionJoinPaths.
+        */

Very sad.  I guess if we had parallel append available, we could maybe
dodge this problem, but for now I suppose we're stuck with it.

+       /*
+        * Partitioning scheme in join relation indicates a possibility that the
+        * join may be partitioned, but it's not necessary that every pair of
+        * joining relations can use partition-wise join technique. If one of
+        * joining relations turns out to be unpartitioned, this pair of joining
+        * relations can not use partition-wise join technique.
+        */
+       if (!rel1->part_scheme || !rel2->part_scheme)
+               return;

How can this happen?  If rel->part_scheme != NULL, doesn't that imply
that every rel covered by the joinrel is partitioned that way, and
therefore this condition must necessarily hold?

In general, I think it's better style to write explicit tests against
NULL or NIL than to just write if (blahptr).

+       partitioned_join->sjinfo = copyObject(parent_sjinfo);

Why do we need to copy it?

+       /*
+        * Remove the relabel decoration. We can assume that there is at most one
+        * RelabelType node; eval_const_expressions() simplifies multiple
+        * RelabelType nodes into one.
+        */
+       if (IsA(expr, RelabelType))
+               expr = (Expr *) ((RelabelType *) expr)->arg;

Still, instead of assuming this, you could just s/if/while/, and then
you wouldn't need the assumption any more.  Also, consider castNode().
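
I.e. something along the lines of (sketch):

    /* Strip any number of RelabelType wrappers, not just one. */
    while (IsA(expr, RelabelType))
        expr = castNode(RelabelType, expr)->arg;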

partition_wise_plan_weight may be useful for testing, but I don't
think it should be present in the final patch.

This is not a full review; I ran out of mental energy before I got to
the end.  (Sorry.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



>>
>> There are some differences in what geqo does and what partition-wise
>> needs to do. geqo tries many joining orders each one in a separate
>> temporary context. The way geqo slices the work, every slice produces
>> a full plan. For partition-wise join I do not see a way to slice the
>> work such that the whole path and corresponding RelOptInfos come from
>> the same slice. So, we can't use the same method as GEQO.
>
> What I was thinking about was the use of this technique for getting
> rid of joinrels:
>
>     root->join_rel_list = list_truncate(root->join_rel_list,
>                                         savelength);
>     root->join_rel_hash = savehash;
>
> makePathNode() serves to segregate paths into a separate memory
> context that can then be destroyed, but as you point out, the path
> lists are still hanging around, and so are the RelOptInfo nodes.  It
> seems to me we could do a lot better using this technique.  Suppose we
> jigger things so that the List objects created by add_path go into
> path_cxt, and so that RelOptInfo nodes also go into path_cxt.  Then
> when we blow up path_cxt we won't have dangling pointers in the
> RelOptInfo objects any more because the RelOptInfos themselves will be
> gone.  The only problem is that the join_rel_list (and join_rel_hash
> if it exists) will be corrupt, but we can fix that using the technique
> demonstrated above.
>
> Of course, that supposes that 0009 can manage to postpone creating
> non-sampled child joinrels until create_partition_join_plan(), which
> it currently doesn't.

Right. We need the child-join's RelOptInfos to estimate sizes, so that
we could sample the largest ones. So postponing it looks difficult.

> In fact, unless I'm missing something, 0009
> hasn't been even slightly adapted to take advantage of the
> infrastructure in 0001; it doesn't seem to reset the path_cxt or
> anything.  That seems like a fairly major omission.

The path_cxt reset introduced by 0001 recycles memory used by all the
paths, including paths created for the children. But that happens only
after all the planning has completed. I thought that's what we
discussed to be done. We could create a separate path context for
every top-level child-join. That will require either copying the
cheapest path tree into the root->glob->path_cxt memory context, OR
converting it to a plan immediately. The first will spend CPU cycles
and memory copying the path tree. The latter requires almost all of
the create_*_append_plan() code to be duplicated in
create_partition_join_plan(), which is ugly. In an earlier version of
this patch I had that code, which I got rid of in the latest set of
patches. Between those two, the first looks better.

>
> Incidentally, I committed 0002, 0003, and 0005 as a single commit with
> a few tweaks; I think you may need to do a bit of rebasing.

Thanks. I will have fewer patches to rebase now :).

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Wed, Mar 15, 2017 at 6:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Of course, that supposes that 0009 can manage to postpone creating
>> non-sampled child joinrels until create_partition_join_plan(), which
>> it currently doesn't.  In fact, unless I'm missing something, 0009
>> hasn't been even slightly adapted to take advantage of the
>> infrastructure in 0001; it doesn't seem to reset the path_cxt or
>> anything.  That seems like a fairly major omission.
>
> Some other comments on 0009:
>
> Documentation changes for the new GUCs are missing.

Done. The description might need more massaging, but I will work on
that once we have fixed their names and usage. I think
sample_partition_fraction and partition_wise_plan_weight, if retained,
will be applicable to other partition-wise planning like
partition-wise aggregates. So we will need a more generic description
there.

>
> +between the partition keys of the joining tables. The equi-join between
> +partition keys implies that for a given row in a given partition of a given
> +partitioned table, its joining row, if exists, should exist only in the
> +matching partition of the other partitioned table; no row from non-matching
>
> There could be more than one.   I'd write: The equi-join between
> partition keys implies that all join partners for a given row in one
> partitioned table must be in the corresponding partition of the other
> partitioned table.

Done. I think it's important to emphasize that the joining partners
cannot be in other partitions, so I added that sentence after your
suggested sentence.

>
> +#include "miscadmin.h"
>  #include <limits.h>
>  #include <math.h>
>
> Added in wrong place.

Done.

>
> +                                * System attributes do not need
> translation. In such a case,
> +                                * the attribute numbers of the parent
> and the child should
> +                                * start from the same minimum attribute.
>
> I would delete the second sentence and add an Assert() to that effect instead.

The assertion is there just a few lines down. Please let me know if
that suffices. Deleted the second sentence.

>
> +               /* Pass top parent's relids down the inheritance hierarchy. */
>
> Why?

That is required for a multi-level partitioned table.
top_parent_relids is used for translating expressions of the top
parent to those of a child table.

>
> +                       for (attno = rel->min_attr; attno <=
> rel->max_attr; attno++)
>
> Add a comment explaining why we need to do this.

The comment is there just a few lines above. I have moved it to just
above this for loop.

>
> -       add_paths_to_append_rel(root, rel, live_childrels);
> +       add_paths_to_append_rel(root, rel, live_childrels, false);
>  }
>
> -
>
> No need to remove blank line.

Sorry. That was added by my patch to refactor
set_append_rel_pathlist(). I have added a patch in the series to
remove that line.

>
> + * When called on partitioned join relation with partition_join_path = true, it
> + * adds PartitionJoinPath instead of Merge/Append path. This path is costed
> + * based on the costs of sampled child-join and is expanded later into
> + * Merge/Append plan.
>
> I'm not a big fan of the Merge/Append terminology here.  If somebody
> adds another kind of append-path someday, then all of these comments
> will have to be updated.  I think this can be phrased more
> generically.

Reworded as
+ * When partition_join_path is true, the caller intends to add a
+ * PartitionJoinPath costed based on the sampled child-joins passed as
+ * live_childrels.

Also added an assertion to make sure the partition_join_path is true
only for join relations.

>
>         /*
> +        * While creating PartitionJoinPath, we sample paths from only
> a few child
> +        * relations. Even if all of sampled children have partial
> paths, it's not
> +        * guaranteed that all the unsampled children will have partial paths.
> +        * Hence we do not create partial PartitionJoinPaths.
> +        */
>
> Very sad.  I guess if we had parallel append available, we could maybe
> dodge this problem, but for now I suppose we're stuck with it.

Really sad. Is there a way to look at the relation (without any
partial paths yet) and see whether the relation will have partial
paths or not. Even if we don't have actual partial paths but know that
there will be at least one added in the future, we will be able to fix
this problem.

>
> +       /*
> +        * Partitioning scheme in join relation indicates a possibility that the
> +        * join may be partitioned, but it's not necessary that every pair of
> +        * joining relations can use partition-wise join technique. If one of
> +        * joining relations turns out to be unpartitioned, this pair of joining
> +        * relations can not use partition-wise join technique.
> +        */
> +       if (!rel1->part_scheme || !rel2->part_scheme)
> +               return;
>
> How can this happen?  If rel->part_scheme != NULL, doesn't that imply
> that every rel covered by the joinrel is partitioned that way, and
> therefore this condition must necessarily hold?

I don't remember exactly, but this was added with a more generic
partition-wise join in mind. We will have more changes anyway when we
support that, so I have turned this into an assertion.

>
> In general, I think it's better style to write explicit tests against
> NULL or NIL than to just write if (blahptr).

PG code uses both styles; take for example
src/backend/rewrite/rewriteManip.c, where both appear. I find this
style useful when I want to express, say, "if this relation does not
have a partitioning scheme" rather than "if this relation has a NULL
partitioning scheme". Although I don't have objections to changing it
as per your suggestion.

>
> +       partitioned_join->sjinfo = copyObject(parent_sjinfo);
>
> Why do we need to copy it?
>

The sjinfo in make_join_rel() may come from root->join_info_list, or
it may be one made up locally in that function. The made-up one goes
away with that function, whereas we need it much later to create paths
for child-joins. So I thought it's better to copy it. But now I have
changed the code to pass NULL for a made-up sjinfo; in that case the
child-join's sjinfo is also made up. This required some refactoring to
separate out the making-up code, so there's a new refactoring patch.

> +       /*
> +        * Remove the relabel decoration. We can assume that there is
> at most one
> +        * RelabelType node; eval_const_expressions() simplifies multiple
> +        * RelabelType nodes into one.
> +        */
> +       if (IsA(expr, RelabelType))
> +               expr = (Expr *) ((RelabelType *) expr)->arg;
>
> Still, instead of assuming this, you could just s/if/while/, and then
> you wouldn't need the assumption any more.  Also, consider castNode().

Done.

>
> partition_wise_plan_weight may be useful for testing, but I don't
> think it should be present in the final patch.

partition_join test needs it so that it can work with smaller dataset
and complete faster. For smaller data sets the partition-wise join
paths come out to be costlier than other kinds and are never chosen.
By setting partition_wise_plan_weight I can force partition-wise join
to be chosen. An alternate solution would be to use
sample_partition_fraction = 1.0, but then we will never test delayed
planning for unsampled child-joins. I also think that users will find
partition_wise_plan_weight useful when estimates based on samples are
unrealistic. Obviously, in a longer run we should be able to provide
better estimates.

Apart from this, I have also removed the recursive calls to
try_partition_wise_join() and generate_partition_wise_join_paths()
from 0009 and placed them in the 0014 patch. Those are required for
multi-level partitioned tables, which are not supported in 0009.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> Of course, that supposes that 0009 can manage to postpone creating
>> non-sampled child joinrels until create_partition_join_plan(), which
>> it currently doesn't.
>
> Right. We need the child-join's RelOptInfos to estimate sizes, so that
> we could sample the largest ones. So postponing it looks difficult.

You have a point.

>> In fact, unless I'm missing something, 0009
>> hasn't been even slightly adapted to take advantage of the
>> infrastructure in 0001; it doesn't seem to reset the path_cxt or
>> anything.  That seems like a fairly major omission.
>
> The path_cxt reset introduced by 0001 recycles memory used by all the
> paths, including paths created for the children. But that happens only
> after all the planning has completed. I thought that's what we
> discussed to be done. We could create a separate path context for
> every top-level child-join.

I don't think we need to create a new context for each top-level
child-join, but I think we should create a context to be used across
all top-level child-joins and then reset it after planning each one.
I thought the whole point here was that NOT doing that caused the
memory usage for partitionwise join to get out of control.  Am I
confused?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Sorry. That was added by my patch to refactor
> set_append_rel_pathlist(). I have added a patch in the series to
> remove that line.

It's not worth an extra commit just to change what isn't broken.
Let's just leave it alone.

>> Very sad.  I guess if we had parallel append available, we could maybe
>> dodge this problem, but for now I suppose we're stuck with it.
>
> Really sad. Is there a way to look at the relation (without any
> partial paths yet) and see whether the relation will have partial
> paths or not. Even if we don't have actual partial paths but know that
> there will be at least one added in the future, we will be able to fix
> this problem.

I don't think so.  If we know that rel->consider_parallel will end up
true for a plain table, we should always get a parallel sequential
scan path at least, but if there are foreign tables involved, then
nothing is guaranteed.

>> partition_wise_plan_weight may be useful for testing, but I don't
>> think it should be present in the final patch.
>
> partition_join test needs it so that it can work with smaller dataset
> and complete faster. For smaller data sets the partition-wise join
> paths come out to be costlier than other kinds and are never chosen.
> By setting partition_wise_plan_weight I can force partition-wise join
> to be chosen. An alternate solution would be to use
> sample_partition_fraction = 1.0, but then we will never test delayed
> planning for unsampled child-joins. I also think that users will find
> partition_wise_plan_weight useful when estimates based on samples are
> unrealistic. Obviously, in a longer run we should be able to provide
> better estimates.

I still don't like it -- we have no other similar knob.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



So I am looking at this part of 0008:

+       /*
+        * Do not copy parent_rinfo and child_rinfos because 1. they create a
+        * circular dependency between child and parent RestrictInfo 2. dropping
+        * those links just means that we loose some memory optimizations. 3. There
+        * is a possibility that the child and parent RestrictInfots themselves may
+        * have got copied and thus the old links may no longer be valid. The
+        * caller may set up those links itself, if needed.
+        */

I don't think that it's very clear whether or not this is safe.  I
experimented with making _copyRestrictInfo PANIC, which,
interestingly, does not affect the core regression tests at all, but
does trip on this bit from the postgres_fdw tests:

-- subquery using stable function (can't be sent to remote)
PREPARE st2(int, int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN
(SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) =
'1970-01-17'::date) ORDER BY c1;
EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20);

I'm not sure why this particular case is affected when so many others
are not, and the comment doesn't help me very much in figuring it out.

Why do we need this cache in the RestrictInfo, anyway?  Aside from the
comment above, I looked at the comment in the RestrictInfo struct, and
I looked at the comment in build_child_restrictinfo, and I looked at
the comment in build_child_clauses, and I looked at the place where
build_child_clauses is called in set_append_rel_size, and none of
those places explain why we need this cache.  I would assume we'd need
a separate translation of the RestrictInfo for every separate
child-join, so how does the cache help?

Maybe the answer is that build_child_clauses() is also called from
try_partition_wise_join() and add_paths_to_child_joinrel(), and those
three call sights all end up producing the same set of translated
RestrictInfos.  But if that's the case, somehow it seems like we ought
to be producing these in one place where we can get convenient access
to them from each child join, rather than having to search through
this cache to find it.  It's a pretty inefficient cache: it takes O(n)
time to search it, I think, where n is the number of partitions.  And
you do O(n) searches.  So it's an O(n^2) algorithm, which is a little
unfortunate.  Can't we affix the translated RestrictInfos someplace
where they can be found more efficiently?

Yet another thing that the comments don't explain is why the existing
adjust_appendrel_attrs call needs to be replaced with
build_child_clauses.

So I feel, overall, that the point of all of this is not explained well at all.

...Robert



On Thu, Mar 16, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Of course, that supposes that 0009 can manage to postpone creating
>>> non-sampled child joinrels until create_partition_join_plan(), which
>>> it currently doesn't.
>>
>> Right. We need the child-join's RelOptInfos to estimate sizes, so that
>> we could sample the largest ones. So postponing it looks difficult.
>
> You have a point.
>
>>> In fact, unless I'm missing something, 0009
>>> hasn't been even slightly adapted to take advantage of the
>>> infrastructure in 0001; it doesn't seem to reset the path_cxt or
>>> anything.  That seems like a fairly major omission.
>>
>> The path_cxt reset introduced by 0001 recycles memory used by all the
>> paths, including paths created for the children. But that happens only
>> after all the planning has completed. I thought that's what we
>> discussed to be done. We could create a separate path context for
>> every top-level child-join.
>
> I don't think we need to create a new context for each top-level
> child-join, but I think we should create a context to be used across
> all top-level child-joins and then reset it after planning each one.

Sorry, that's what I meant by creating a new context for each
top-level child-join. So we need to copy the required path tree before
resetting the context. I am fine doing that, but read on.

> I thought the whole point here was that NOT doing that caused the
> memory usage for partitionwise join to get out of control.  Am I
> confused?

We took a few steps to reduce the memory footprint of partition-wise
join in [1] and [2]. According to the numbers reported in [1] and then
in [2], if the total memory consumed by the planner is 44MB (memory
consumed by paths 150K) for a 5-way non-partition-wise join between
tables with 1000 partitions, partition-wise join consumed 192MB, which
is 4.4 times the non-partition-wise case. The earlier implementation
of blowing away a memory context after each top-level child-join just
got rid of the paths created for that child-join. The total memory
consumed by paths created for all the child-joins was about 150MB.
Remember that we can not get rid of the memory consumed by expressions,
RelOptInfos, RestrictInfos etc. since their pointers will be copied
into the plan nodes.

With the changes in 0001, what happens is that we accumulate 150MB
till the end of planning and get rid of it after we have created a
plan. So, till the plan is created, we are consuming approx. 192MB +
150MB = 342MB of memory and are getting rid of 150MB of it after we
have created the plan. I am not sure whether consuming an extra 150MB,
or for that matter 342MB, in a setup with a thousand partitions is
"going out of control". (342MB is approx. 7.8 times 44MB; not 1000
times, and not even 10 times.) But if you think that we should throw
away unused paths after planning every top-level child-join, I am fine
with it.

[1] https://www.postgresql.org/message-id/CAFjFpRcZ_M3-JxoiDkdoPS%2B-9Cok4ux9Si%2B4drcRL-62af%3DjWw@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFjFpRe66z%2Bw9%2BdnAkWGiaB1CU2CUQsLGsqzHzYBoA%3DKJFf%2BPQ%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Mar 16, 2017 at 12:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Sorry. That was added by my patch to refactor
>> set_append_rel_pathlist(). I have added a patch in the series to
>> remove that line.
>
> It's not worth an extra commit just to change what isn't broken.
> Let's just leave it alone.

Ok. Removed that patch from the set of patches.

>
>>> Very sad.  I guess if we had parallel append available, we could maybe
>>> dodge this problem, but for now I suppose we're stuck with it.
>>
>> Really sad. Is there a way to look at the relation (without any
>> partial paths yet) and see whether the relation will have partial
>> paths or not. Even if we don't have actual partial paths but know that
>> there will be at least one added in the future, we will be able to fix
>> this problem.
>
> I don't think so.  If we know that rel->consider_parallel will end up
> true for a plain table, we should always get a parallel sequential
> scan path at least, but if there are foreign tables involved, then
> nothing is guaranteed.

Ok.

>
>>> partition_wise_plan_weight may be useful for testing, but I don't
>>> think it should be present in the final patch.
>>
>> partition_join test needs it so that it can work with smaller dataset
>> and complete faster. For smaller data sets the partition-wise join
>> paths come out to be costlier than other kinds and are never chosen.
>> By setting partition_wise_plan_weight I can force partition-wise join
>> to be chosen. An alternate solution would be to use
>> sample_partition_fraction = 1.0, but then we will never test delayed
>> planning for unsampled child-joins. I also think that users will find
>> partition_wise_plan_weight useful when estimates based on samples are
>> unrealistic. Obviously, in a longer run we should be able to provide
>> better estimates.
>
> I still don't like it -- we have no other similar knob.

We have another cost-skewing device, disable_cost, which adds a huge
cost to anything that needs to be disabled. This one is different in
the sense that it multiplies the cost.

Well, in that case, we can replace it with force_partition_wise_plan
(on/off) for the sake of the regression tests, to test with smaller
data. Even then we will need to adjust the costs so that the
partition-wise join plan comes out to be the cheapest. Probably we
will need to set the partition-wise join plan costs to very low or
even 0 when force_partition_wise_plan is set to on. Does that look
good? Any other ideas?
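
If we go that route, the guc.c entry would presumably be just another
boolean planner knob, something like the following (illustrative only;
the GUC, its category and the backing variable are all part of the
proposal, not existing code):

    {
        {"force_partition_wise_plan", PGC_USERSET, QUERY_TUNING_METHOD,
            gettext_noop("Forces the planner to prefer partition-wise join plans."),
            NULL
        },
        &force_partition_wise_plan,     /* hypothetical bool in costsize.c */
        false,
        NULL, NULL, NULL
    },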

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So I am looking at this part of 0008:
>
> +       /*
> +        * Do not copy parent_rinfo and child_rinfos because 1. they create a
> +        * circular dependency between child and parent RestrictInfo 2. dropping
> +        * those links just means that we loose some memory
> optimizations. 3. There
> +        * is a possibility that the child and parent RestrictInfots
> themselves may
> +        * have got copied and thus the old links may no longer be valid. The
> +        * caller may set up those links itself, if needed.
> +        */
>
> I don't think that it's very clear whether or not this is safe.  I
> experimented with making _copyRestrictInfo PANIC,

I am not able to understand how to make _copyRestrictInfo PANIC. Can
you please share the patch or compiler flags or settings? I will look
at the case below once I have that.

> which,
> interestingly, does not affect the core regression tests at all, but
> does trip on this bit from the postgres_fdw tests:
>
> -- subquery using stable function (can't be sent to remote)
> PREPARE st2(int, int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN
> (SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) =
> '1970-01-17'::date) ORDER BY c1;
> EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20);
>
> I'm not sure why this particular case is affected when so many others
> are not, and the comment doesn't help me very much in figuring it out.

>
> Why do we need this cache in the RestrictInfo, anyway?  Aside from the
> comment above, I looked at the comment in the RestrictInfo struct, and
> I looked at the comment in build_child_restrictinfo, and I looked at
> the comment in build_child_clauses, and I looked at the place where
> build_child_clauses is called in set_append_rel_size, and none of
> those places explain why we need this cache.  I would assume we'd need
> a separate translation of the RestrictInfo for every separate
> child-join, so how does the cache help?
>
> Maybe the answer is that build_child_clauses() is also called from
> try_partition_wise_join() and add_paths_to_child_joinrel(), and those
> three call sights all end up producing the same set of translated
> RestrictInfos.  But if that's the case, somehow it seems like we ought
> to be producing these in one place where we can get convenient access
> to them from each child join, rather than having to search through
> this cache to find it.

I had explained this briefly in [1], but forgot to add it as comments.

There are multiple ways in which a RestrictInfo gets translated
multiple times for the same child.

1. Consider a join A J (B J C ON B.b = C.c) ON (A.a = B.b). The clause
A.a = B.b is part of the restrictlist for the join orders (AB)C and
A(BC) (and (AC)B, depending upon the type of join). So the clause gets
translated twice, once for each of those join orders.

2. In the above example, A.a = B.b is part of the joininfo list (if it
happens to be an outer join) of A, B and BC. So it should be part of
the joininfo lists of the children of A, B and BC. But the
RestrictInfo which is part of the joininfo of B and BC looks exactly
the same.

Similarly, param_info->clauses get translated multiple times, each
time with a different set of required_outer.

In order to avoid multiple translations and the memory spent on each
translation, it's better to cache the result and retrieve it.

Updated prologue of build_child_restrictinfo with this explanation.

> It's a pretty inefficient cache: it takes O(n)
> time to search it, I think, where n is the number of partitions.

The above explanation shows that it's worse than that.

>  And
> you do O(n) searches.  So it's an O(n^2) algorithm, which is a little
> unfortunate.  Can't we affix the translated RestrictInfos someplace
> where they can be found more efficiently?

Would a hash table similar to root->join_rel_hash help? That would
reduce the searches to O(1). I have added a separate patch (0008) that
uses a hash table to store child RestrictInfos. If that patch looks
good to you, I will merge it with the main patch supporting
partition-wise join.
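
For reference, the hash table I have in mind is set up much like
build_join_rel_hash(), keyed by the child relids (a sketch; the struct
and function names here are illustrative, not the exact code in 0008;
it needs utils/hsearch.h and nodes/bitmapset.h):

    typedef struct ChildRinfoHashEntry
    {
        Relids        child_relids;    /* hash key */
        RestrictInfo *child_rinfo;     /* translated clause for that child */
    } ChildRinfoHashEntry;

    static HTAB *
    make_child_rinfo_hash(void)
    {
        HASHCTL     hash_ctl;

        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
        hash_ctl.keysize = sizeof(Relids);
        hash_ctl.entrysize = sizeof(ChildRinfoHashEntry);
        hash_ctl.hash = bitmap_hash;
        hash_ctl.match = bitmap_match;
        hash_ctl.hcxt = CurrentMemoryContext;

        return hash_create("child RestrictInfo hash", 32, &hash_ctl,
                           HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
    }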

>
> Yet another thing that the comments don't explain is why the existing
> adjust_appendrel_attrs call needs to be replaced with
> build_child_clauses.

The call to adjust_appendrel_attrs() that used to translate a child's
joininfo has been replaced by build_child_clauses() to take advantage
of the RestrictInfo cache. As explained above, a clause which is part
of a child's joininfo is also part of the joininfo of the child-joins
in which that child participates (except the child-joins covering the
clause), so a cached copy of that RestrictInfo helps. I have added a
patch (0010) to use build_child_clauses() only for partitioned tables
and use adjust_appendrel_attrs() for the non-partitioned case. If this
change looks good, I will merge it with the main patch.
>
> So I feel, overall, that the point of all of this is not explained well at all.

Sorry for that. I should have added the explanation in the comments.
Corrected this in this set of patches.

[1] https://www.postgresql.org/message-id/CAFjFpRe66z%2Bw9%2BdnAkWGiaB1CU2CUQsLGsqzHzYBoA%3DKJFf%2BPQ%40mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> I thought the whole point here was that NOT doing that caused the
>> memory usage for partitionwise join to get out of control.  Am I
>> confused?
>
> We took a few steps to reduce the memory footprint of partition-wise
> join in [1] and [2]. According to the numbers reported in [1] and then
> in [2], if the total memory consumed by the planner is 44MB (memory
> consumed by paths 150K) for a 5-way non-partition-wise join between
> tables with 1000 partitions, partition-wise join consumed 192MB which
> is 4.4 times the non-partition-wise case. The earlier implementation
> of blowing away a memory context after each top-level child-join, just
> got rid of the paths created for that child-join. The total memory
> consumed by paths created for all the child-joins was about 150MB.
> Remember that we can not get rid of memory consumed by expressions,
> RelOptInfos, RestrictInfos etc. since their pointers will be copied
> into the plan nodes.

All right, I propose that we revise our plan for attacking this
problem.  The code in this patch that proposes to reduce memory
utilization is very complicated and it's likely to cause us to miss
this release altogether if we keep hacking on it.  So, I propose that
you refactor this patch series so that the first big patch is
partition-wise join without any of the optimizations that save memory
- essentially the sample_partition_fraction = 1 case with all
memory-saving optimizations removed.  If it's only there to save
memory, rip it out.  Also, change the default value of
enable_partition_wise_join to false and document that turning it on
may cause a large increase in planner memory utilization, and that's
why it's not enabled by default.

If we get that committed, then we can have follow-on patches that add
the incremental path creation stuff and other memory-saving features,
and then at the end we can flip the default from "off" to "on".
Probably that last part will slip beyond v10 since we're only two
weeks from the end of the release cycle, but I think that's still
better than having everything slip.  Let's also put the multi-level
partition-wise join stuff ahead of the memory-saving stuff, because
being able to do only a single-level of partition-wise join is a
fairly unimpressive feature; I'm not sure this is really even
committable without that.

I realize in some sense that I'm telling you to go and undo all of the
work that you just did based on what I told you before, but I think
we've actually made some pretty good progress here: it's now clear
that there are viable strategies for getting the memory usage down to
an acceptable level, and we've got draft patches for those strategies.
So committing the core feature without immediately including that work
can't be regarded as breaking everything hopelessly; rather, it now
looks (I think, anyway) like a reasonable intermediate step towards
the eventual goal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Mar 16, 2017 at 7:19 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So I am looking at this part of 0008:
>>
>> +       /*
>> +        * Do not copy parent_rinfo and child_rinfos because 1. they create a
>> +        * circular dependency between child and parent RestrictInfo 2. dropping
>> +        * those links just means that we lose some memory optimizations. 3. There
>> +        * is a possibility that the child and parent RestrictInfos themselves may
>> +        * have got copied and thus the old links may no longer be valid. The
>> +        * caller may set up those links itself, if needed.
>> +        */
>>
>> I don't think that it's very clear whether or not this is safe.  I
>> experimented with making _copyRestrictInfo PANIC,
>
> I am not able to understand how to make _copyRestrictInfo PANIC. Can
> you please share the patch or compiler flags or settings? I will look
> at the case below once I have that.

I just put elog(PANIC, "_copyRestrictInfo") into the function.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Mar 16, 2017 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> I thought the whole point here was that NOT doing that caused the
>>> memory usage for partitionwise join to get out of control.  Am I
>>> confused?
>>
>> We took a few steps to reduce the memory footprint of partition-wise
>> join in [1] and [2]. According to the numbers reported in [1] and then
>> in [2], if the total memory consumed by the planner is 44MB (memory
>> consumed by paths 150K) for a 5-way non-partition-wise join between
>> tables with 1000 partitions, partition-wise join consumed 192MB which
>> is 4.4 times the non-partition-wise case. The earlier implementation
>> of blowing away a memory context after each top-level child-join, just
>> got rid of the paths created for that child-join. The total memory
>> consumed by paths created for all the child-joins was about 150MB.
>> Remember that we can not get rid of memory consumed by expressions,
>> RelOptInfos, RestrictInfos etc. since their pointers will be copied
>> into the plan nodes.
>
> All right, I propose that we revise our plan for attacking this
> problem.  The code in this patch that proposes to reduce memory
> utilization is very complicated and it's likely to cause us to miss
> this release altogether if we keep hacking on it.  So, I propose that
> you refactor this patch series so that the first big patch is
> partition-wise join without any of the optimizations that save memory
> - essentially the sample_partition_fraction = 1 case with all
> memory-saving optimizations removed.  If it's only there to save
> memory, rip it out.  Also, change the default value of
> enable_partition_wise_join to false and document that turning it on
> may cause a large increase in planner memory utilization, and that's
> why it's not enabled by default.
>
> If we get that committed, then we can have follow-on patches that add
> the incremental path creation stuff and other memory-saving features,
> and then at the end we can flip the default from "off" to "on".
> Probably that last part will slip beyond v10 since we're only two
> weeks from the end of the release cycle, but I think that's still
> better than having everything slip.  Let's also put the multi-level
> partition-wise join stuff ahead of the memory-saving stuff, because
> being able to do only a single-level of partition-wise join is a
> fairly unimpressive feature; I'm not sure this is really even
> committable without that.
>
> I realize in some sense that I'm telling you to go and undo all of the
> work that you just did based on what I told you before, but I think
> we've actually made some pretty good progress here: it's now clear
> that there are viable strategies for getting the memory usage down to
> an acceptable level, and we've got draft patches for those strategies.
> So committing the core feature without immediately including that work
> can't be regarded as breaking everything hopelessly; rather, it now
> looks (I think, anyway) like a reasonable intermediate step towards
> the eventual goal.

Here's the set of patches with all the memory saving stuff removed.
It's now bare partition-wise joins. I have tried to eliminate all the
memory saving stuff carefully, except for a few bms_free() and
list_free() calls which fit the functions they are part of and were
mostly present in my earlier versions of the patches. But I might have
missed some. Also, I have corrected any indentation/whitespace mistakes
introduced by editing patches with +/-, but I might have missed some.
Please let me know.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
>
> Here's the set of patches with all the memory saving stuff removed.
> It's now bare partition-wise joins. I have tried to eliminate all
> memory saving stuff carefully, except few bms_free() and list_free()
> which fit the functions they were part of and mostly were present in
> my earlier versions of patches. But I might have missed some. Also, I
> have corrected any indentation/white space mistakes introduced by
> editing patches with +/-, but I might have missed some. Please let me
> know.
>

Rajkumar offlist reported two issues with the earlier set of patches.
1. 0008 conflicted with the latest changes in postgres_fdw/deparse.c.

2. In the earlier set of patches the part_scheme of a join relation was
being set when the joining relations had the same part_scheme, even if
there was no equi-join between the partition keys. The idea being that
rel->part_scheme and rel->part_rels together tell whether a relation is
partitioned or not. At a later stage, if none of the joining pairs
resulted in a partitioned join, part_rels would be NULL and then we
would reset part_scheme as well. But this logic is not required. For
the exact partition scheme matching that we are using, if one pair of
joining relations has an equi-join on the partition keys and both of
those have exactly the same partitioning scheme, all other pairs of
joining relations will have an equi-join on the partition keys and
exactly the same partitioning scheme as well. So, we can set
part_scheme by looking only at the first pair of joining relations
while building the child-join.

This set of patches fixes both of those things.
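For reference, a minimal SQL sketch of the situation point 2 relies on:
two tables partitioned in exactly the same way and joined with an
equi-join on the partition key (the table names and bounds below are
made up for illustration; the GUC name is the one used in this patch
set):

-- Identical range partitioning on column a for both tables.
CREATE TABLE prt1 (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (100) TO (200);

CREATE TABLE prt2 (a int, c text) PARTITION BY RANGE (a);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (100) TO (200);

-- With an equi-join on the partition key, every joining pair shares
-- the same partition scheme, so part_scheme can be taken from the
-- first pair.
SET enable_partition_wise_join = on;
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a = t2.a;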

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> This set of patches fixes both of those things.

0001 changes the purpose of a function and then 0007 renames it.  It
would be better to include the renaming in 0001 so that you're not
taking multiple whacks at the same function in the same patch series.
I believe it would also be best to include 0011's changes to
adjust_appendrel_attrs_multilevel in 0001.

0002 should either add find_param_path_info() to the relevant header
file as extern from the beginning, or it should declare and define it
as static and then 0007 can remove those markings.  It makes no sense
to declare it as extern but put the prototype in the .c file.

0004 still needs to be pared down.  If you want to get something
committed this release cycle, you have to get these details taken care
of, uh, more or less immediately.  Actually, preferably, several weeks
ago.  You're welcome to maintain your own test suite locally but what
you submit should be what you are proposing for commit -- or if not,
then you should separate the part proposed for commit and the part
included for dev testing into two different patches.

In 0005's README, the part about planning partition-wise joins in two
phases needs to be removed.  This patch also contains a small change
to partition_join.sql that belongs in 0004.

0008 removes direct tests against RELOPT_JOINREL almost everywhere,
but it overlooks the new ones added to postgres_fdw.c by
b30fb56b07a885f3476fe05920249f4832ca8da5.  It should be updated to
cover those as well, I suspect.  The commit message claims that it
will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
IS_OTHER_REL() where we want to test for child relations of all kinds,"
but in fact it makes exactly zero such substitutions.

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths.  I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that.  Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so.  Still, you might want
to investigate a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> This set of patches fixes both of those things.
>
> 0001 changes the purpose of a function and then 0007 renames it.  It
> would be better to include the renaming in 0001 so that you're not
> taking multiple whacks at the same function in the same patch series.
> I believe it would also be best to include 0011's changes to
> adjust_appendrel_attrs_multilevel in 0001.
>
> 0002 should either add find_param_path_info() to the relevant header
> file as extern from the beginning, or it should declare and define it
> as static and then 0007 can remove those markings.  It makes no sense
> to declare it as extern but put the prototype in the .c file.
>
> 0004 still needs to be pared down.  If you want to get something
> committed this release cycle, you have to get these details taken care
> of, uh, more or less immediately.  Actually, preferably, several weeks
> ago.  You're welcome to maintain your own test suite locally but what
> you submit should be what you are proposing for commit -- or if not,
> then you should separate the part proposed for commit and the part
> included for dev testing into two different patches.
>
> In 0005's README, the part about planning partition-wise joins in two
> phases needs to be removed.  This patch also contains a small change
> to partition_join.sql that belongs in 0004.
>
> 0008 removes direct tests against RELOPT_JOINREL almost everywhere,
> but it overlooks the new ones added to postgres_fdw.c by
> b30fb56b07a885f3476fe05920249f4832ca8da5.  It should be updated to
> cover those as well, I suspect.  The commit message claims that it
> will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
> IS_OTHER_REL() where we want to test for child relations of all kinds,
> but in fact it makes exactly zero such substitutions.
>
> While I was studying what you did with reparameterize_path_by_child(),
> I started to wonder whether reparameterize_path() doesn't need to
> start handling join paths.  I think it only handles scan paths right
> now because that's the only thing that can appear under an appendrel
> created by inheritance expansion, but you're changing that.  Maybe
> it's not critical -- I think the worst consequences of missing some
> handling there is that we won't consider a parameterized path in some
> case where it would be advantageous to do so.  Still, you might want
> to investigate a bit.
>
I was trying to play around with this patch and came across the
following case, where the query completes in 9 secs without the patch
and in 15 secs with it. I tried to construct a case where each
partition produces a good number of rows and each has to build its own
hash; there, the cost of building so many hashes turns out to be higher
than doing an append and then a single join. Thought it might be
helpful to consider this case while refining the design of the
algorithm. Please feel free to point out if I missed something.

Test details:
commit: b4ff8609dbad541d287b332846442b076a25a6df
Please find the attached .sql file for the complete schema and data,
and the .out file for the results of EXPLAIN ANALYZE with and without
the patch.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Sun, Mar 19, 2017 at 12:15 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> I was trying to play around with this patch and came across following
> case when without the patch query completes in 9 secs and with it in
> 15 secs. Theoretically, I tried to capture the case when each
> partition is having good amount of rows in output and each has to
> build their own hash, in that case the cost of building so many hashes
> comes to be more costly than having an append and then join. Thought
> it might be helpful to consider this case in better designing of the
> algorithm. Please feel free to point out if I missed something.

In the non-partitionwise plan, the query planner correctly chooses to
hash the smaller table (prt2) and probe from the large table (prt).  In
the partition-wise plan, it generally does the opposite.  There is a
mix of merge joins and hash joins, but of the 15 children that picked
hash joins, 14 of them hashed the larger partition (in each case,
from prt) and probed from the smaller one (in each case, from prt2),
which seems like an odd strategy.  So I think the problem is not that
building lots of hash tables is slower than building just one, but
rather that for some reason it's choosing the wrong table to hash.
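One way to check this (a sketch only; the child table names and the
join column below are my assumptions about Rafia's schema, not taken
from it) is to EXPLAIN the join between a single pair of partitions
directly and see which relation ends up under the Hash node:

-- If the standalone per-partition join also hashes the larger side,
-- the problem is in the ordinary join costing rather than in the
-- partition-wise join machinery itself.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT *
FROM prt_p1 t1
JOIN prt2_p1 t2 ON t1.a = t2.a;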

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> While I was studying what you did with reparameterize_path_by_child(),
> I started to wonder whether reparameterize_path() doesn't need to
> start handling join paths.  I think it only handles scan paths right
> now because that's the only thing that can appear under an appendrel
> created by inheritance expansion, but you're changing that.  Maybe
> it's not critical -- I think the worst consequences of missing some
> handling there is that we won't consider a parameterized path in some
> case where it would be advantageous to do so.  Still, you might want
> to investigate a bit.

I spent a fair amount of time this weekend musing over
reparameterize_path_by_child().  I think a key question for this patch
- as you already pointed out - is whether we're happy with that
approach.  When we discover that we want to perform a partitionwise
parameterized nestloop, and therefore that we need the paths for each
inner appendrel to get their input values from the corresponding outer
appendrel members rather than from the outer parent, we've got two
choices.  The first is to do what the patch actually does, which is to
build a new path tree for the nestloop inner path parameterized by the
appropriate childrel.  The second is to use the existing paths, which
are parameterized by the parent rel, and then somehow make that
work.  For example, you can imagine that create_plan_recurse() could
pass down a list of parameterized nestloops above the current point in
the path tree, and a parent-child mapping for each, and then we could
try to substitute everything while actually generating the plan
instead of creating paths sooner.  Which is better?

It would be nice to hear opinions from anyone else who cares, but
after some thought I think the approach you've picked is probably
better, because it's more like what we do already.  We have existing
precedent for reparameterizing a path, but none for allowing a Var for
one relation (the parent) to in effect refer to another relation (the
child).

That having been said, having try_nestloop_path() perform the
reparameterization at the very top of the function seems quite
undesirable.  You're creating a new path there before you know whether
it's going to be rejected by the invalid-parameterization test and
also before you know whether initial_cost_nestloop is going to reject
it.  It would be much better if you could find a way to postpone the
reparameterization until after those steps, and only do it if you're
going to try add_path().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> This set of patches fixes both of those things.
>
> 0001 changes the purpose of a function and then 0007 renames it.  It
> would be better to include the renaming in 0001 so that you're not
> taking multiple whacks at the same function in the same patch series.

adjust_relid_set was renamed to adjust_child_relids() after being made
extern. I think this comment is about that function. Done.

> I believe it would also be best to include 0011's changes to
> adjust_appendrel_attrs_multilevel in 0001.
>

The function needs to repeat the "adjustment" process for every
"other" relation (join or base) that it encounters, by testing with
OTHER_BASE_REL or OTHER_JOINREL, in short IS_OTHER_REL(). Those macros
are added by the partition-wise join implementation patch 0005. It
doesn't make sense to add that macro in 0001, or to modify that
function twice, once in 0001 and then again after 0005. So, I will
leave it to be part of 0011, where the changes are actually needed.

> 0002 should either add find_param_path_info() to the relevant header
> file as extern from the beginning, or it should declare and define it
> as static and then 0007 can remove those markings.  It makes no sense
> to declare it as extern but put the prototype in the .c file.

Done, added find_param_path_info() as an extern definition to start
with. I have also squashed 0001 and 0002 together, since they are both
refactoring patches and from your next mail about
reparameterize_path_by_child(), it seems that we are going to accept
the approach in that patch.

>
> 0004 still needs to be pared down.  If you want to get something
> committed this release cycle, you have to get these details taken care
> of, uh, more or less immediately.  Actually, preferably, several weeks
> ago.  You're welcome to maintain your own test suite locally but what
> you submit should be what you are proposing for commit -- or if not,
> then you should separate the part proposed for commit and the part
> included for dev testing into two different patches.
>

Done. Now the SQL file has 325 lines and the output has 1697 lines, as
against 515 and 4085 lines respectively earlier.

> In 0005's README, the part about planning partition-wise joins in two
> phases needs to be removed.

Done.

> This patch also contains a small change
> to partition_join.sql that belongs in 0004.

The reason I added the test patch prior to the implementation was 1.
to make sure that the queries run without the optimization and to
check the results they produce, so as to catch any issues with the
partitioning implementation; that would help someone looking at those
patches as well, and 2. once the partitioning implementation patch was
applied, one could see the purpose of the changes in the two follow-on
patches. Now that that purpose has been served, I have reordered the
patches so that the test patch comes after the implementation and the
follow-on fixes. If you still want to run the tests before or after
any of those patches, you can apply the test patch separately.

>
> 0008 removes direct tests against RELOPT_JOINREL almost everywhere,
> but it overlooks the new ones added to postgres_fdw.c by
> b30fb56b07a885f3476fe05920249f4832ca8da5.  It should be updated to
> cover those as well, I suspect.

Done.

deparseSubqueryTargetList() and some other functions are excluding
"other" base relations from their assertions. I guess that's a problem.
I will submit a separate patch to fix this.

> The commit message claims that it
> will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
> IS_OTHER_REL() where we want to test for child relations of all kinds,
> but in fact it makes exactly zero such substitutions.

The relevant changes have been covered by other commits. Removed this
line from the commit message.

>
> While I was studying what you did with reparameterize_path_by_child(),
> I started to wonder whether reparameterize_path() doesn't need to
> start handling join paths.  I think it only handles scan paths right
> now because that's the only thing that can appear under an appendrel
> created by inheritance expansion, but you're changing that.  Maybe
> it's not critical -- I think the worst consequences of missing some
> handling there is that we won't consider a parameterized path in some
> case where it would be advantageous to do so.  Still, you might want
> to investigate a bit.

Yes, we need to update reparameterize_path() for child-joins. A path
for a child base relation gets reparameterized if there exists a path
with that parameterization in at least one other child. The
parameterization bubbles up the join tree from the base relations. So,
if a child needs to be reparameterized, probably all its joins require
reparameterization as well, since that parameterization would bubble up
the child-join tree in which some other child participates. But as you
said, it's an optimization and not a correctness issue. The function
get_cheapest_parameterized_child_path() returns NULL if it can not find
or create a path (by reparameterization) with the required
parameterization. Its caller add_paths_to_append_rel() is capable of
handling NULL values by not creating append paths with that
parameterization. If the "append" relation requires the minimum
parameterization, all its children will create that minimum
parameterization and hence their paths do not need to be
reparameterized. So, there isn't any correctness issue there.

There are two ways to fix it:

1. When we create a reparameterized path, add it to the list of paths,
so that the parameterization bubbles up the join tree. But then we
would be changing the path list after set_cheapest() has been called,
or possibly throwing out paths which other paths refer to. That's not
desirable. Maybe we can save this path in another list and create
join paths using this path instead of reparameterizing existing join
paths.
2. Add code to reparameterize_path() to handle join paths, and I think
all kinds of paths, since we might have to trickle the parameterization
down the joining paths, which could be almost anything including sort
paths, unique paths etc. That looks like a significant effort. I think
we should attack it separately after the stock partition-wise join has
been committed.
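To make the scenario concrete, here is the kind of query (schema and
index names are made up, and this isn't claimed to show the exact plans
the patch produces) for which parameterized paths over the appendrel
matter, i.e. where a nestloop with a parameterized index scan on each
partition can beat hashing or sorting the whole partitioned table:

-- Made-up schema: a small driving table joined to a partitioned table
-- that has an index on the join column in every partition.
CREATE TABLE small (a int);
CREATE TABLE big (a int, payload text) PARTITION BY RANGE (a);
CREATE TABLE big_p1 PARTITION OF big FOR VALUES FROM (0) TO (1000);
CREATE TABLE big_p2 PARTITION OF big FOR VALUES FROM (1000) TO (2000);
CREATE INDEX ON big_p1 (a);
CREATE INDEX ON big_p2 (a);

-- A parameterized nestloop would probe each partition's index using
-- small.a as the parameter; every child of the appendrel must offer a
-- path with that parameterization before the Append path can be built.
EXPLAIN (COSTS OFF)
SELECT * FROM small s JOIN big b ON s.a = b.a;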

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While I was studying what you did with reparameterize_path_by_child(),
>> I started to wonder whether reparameterize_path() doesn't need to
>> start handling join paths.  I think it only handles scan paths right
>> now because that's the only thing that can appear under an appendrel
>> created by inheritance expansion, but you're changing that.  Maybe
>> it's not critical -- I think the worst consequences of missing some
>> handling there is that we won't consider a parameterized path in some
>> case where it would be advantageous to do so.  Still, you might want
>> to investigate a bit.
>
> I spent a fair amount of time this weekend musing over
> reparameterize_path_by_child().  I think a key question for this patch
> - as you already pointed out - is whether we're happy with that
> approach.  When we discover that we want to perform a partitionwise
> parameterized nestloop, and therefore that we need the paths for each
> inner appendrel to get their input values from the corresponding outer
> appendrel members rather than from the outer parent, we've got two
> choices.  The first is to do what the patch actually does, which is to
> build a new path tree for the nestloop inner path parameterized by the
> appropriate childrel.  The second is to use the existing paths, which
> are parameterized by the parent rel, and then somehow make that
> work.  For example, you can imagine that create_plan_recurse() could
> pass down a list of parameterized nestloops above the current point in
> the path tree, and a parent-child mapping for each, and then we could
> try to substitute everything while actually generating the plan
> instead of creating paths sooner.  Which is better?
>
> It would be nice to hear opinions from anyone else who cares, but
> after some thought I think the approach you've picked is probably
> better, because it's more like what we do already.  We have existing
> precedent for reparameterizing a path, but none for allowing a Var for
> one relation (the parent) to in effect refer to another relation (the
> child).

Right. If we could use parent Vars to mean either the parent Var or
the child Var depending upon the context, a lot of the memory issues
would be solved; we wouldn't need to translate a single expression. But
I think that's not straightforward. I have been thinking about some
kind of polymorphic Var node, but that seems like a much more invasive
change. Although, if we could get something like that, we would save a
huge amount of memory. :)

>
> That having been said, having try_nestloop_path() perform the
> reparameterization at the very top of the function seems quite
> undesirable.  You're creating a new path there before you know whether
> it's going to be rejected by the invalid-parameterization test and
> also before you know whether initial_cost_nestloop is going to reject
> it.  It would be much better if you could find a way to postpone the
> reparameterization until after those steps, and only do it if you're
> going to try add_path().

Hmm. I think we can do that by refactoring
calc_nestloop_required_outer(), allow_star_schema_join() and
have_dangerous_phv() to use relids instead of paths. If the checks
pass for a join between the parents, they should pass for joins
between the children. Done in the attached set of patches.

try_nestloop_path() has a few new variables. Among those, innerrelids
and outerrelids indicate the relids to be used by the parameterization
checks (see the patch for details). They are not the relids of the
inner and outer relations respectively, but rather a kind of effective
relids to be used. I couldn't come up with better names which convey
the proper meaning and are still short enough;
effective_innerrelids is a mouthful.

I am wondering whether we need to change
calc_non_nestloop_required_outer() similar to
calc_nestloop_required_outer() just to keep their signatures in sync.

Should I work on completing reparameterize_path_by_child() to support
all kinds of paths?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> While I was studying what you did with reparameterize_path_by_child(),
>> I started to wonder whether reparameterize_path() doesn't need to
>> start handling join paths.  I think it only handles scan paths right
>> now because that's the only thing that can appear under an appendrel
>> created by inheritance expansion, but you're changing that.  Maybe
>> it's not critical -- I think the worst consequences of missing some
>> handling there is that we won't consider a parameterized path in some
>> case where it would be advantageous to do so.  Still, you might want
>> to investigate a bit.
>
> I spent a fair amount of time this weekend musing over
> reparameterize_path_by_child().  I think a key question for this patch
> - as you already pointed out - is whether we're happy with that
> approach.  When we discover that we want to perform a partitionwise
> parameterized nestloop, and therefore that we need the paths for each
> inner appendrel to get their input values from the corresponding outer
> appendrel members rather than from the outer parent, we've got two
> choices.  The first is to do what the patch actually does, which is to
> build a new path tree for the nestloop inner path parameterized by the
> appropriate childrel.  The second is to use the existing paths, which
> are parameterized by the parent rel, and then somehow make that
> work.  For example, you can imagine that create_plan_recurse() could
> pass down a list of parameterized nestloops above the current point in
> the path tree, and a parent-child mapping for each, and then we could
> try to substitute everything while actually generating the plan
> instead of creating paths sooner.  Which is better?
>
> It would be nice to hear opinions from anyone else who cares, but
> after some thought I think the approach you've picked is probably
> better, because it's more like what we do already.  We have existing
> precedent for reparameterizing a path, but none for allowing a Var for
> one relation (the parent) to in effect refer to another relation (the
> child).
>
> That having been said, having try_nestloop_path() perform the
> reparameterization at the very top of the function seems quite
> undesirable.  You're creating a new path there before you know whether
> it's going to be rejected by the invalid-parameterization test and
> also before you know whether initial_cost_nestloop is going to reject
> it.  It would be much better if you could find a way to postpone the
> reparameterization until after those steps, and only do it if you're
> going to try add_path().

On further testing of this patch I found another case where it shows a
regression: the time taken with the patch is around 160 secs and
without it is 125 secs.
Another minor thing to note is that the planning time is almost double
with this patch. I understand that this is for scenarios with really
big 'big data', so it may not be a serious issue in such cases, but
it'd be good to keep an eye on it so that it doesn't exceed reasonable
computational bounds for a really large number of tables.
Please find the attached .out file for the output I witnessed, and let
me know if any more information is required.
Schema and data were similar to the previously shared schema, with the
addition of more data for this case. Parameter settings used were:
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> I believe it would also be best to include 0011's changes to
>> adjust_appendrel_attrs_multilevel in 0001.
>
> The function needs to repeat the "adjustment" process for every
> "other" relation (join or base) that it encounters, by testing using
> OTHER_BASE_REL or OTHER_JOINREL in short IS_OTHER_REL(). The last
> macros are added by the partition-wise join implementation patch 0005.
> It doesn't make sense to add that macro in 0001 OR modify that
> function twice, once in 0001 and then after 0005. So, I will leave it
> to be part of 0011, where the changes are actually needed.

Hmm.  I would kind of like to move the IS_JOIN_REL() and
IS_OTHER_REL() stuff to the front of the series.  In other words, I
propose that we add those macros first, each testing for only the one
kind of RelOptInfo that exists today, and change all the code to use
them.  Then, when we add child joinrels, we can modify the macros at
the same time.  The problem with doing it the way you have it is that
those changes will have to be squashed into the main partitionwise
join commit, because otherwise stuff will be broken.  Doing it the
other way around lets us commit that bit separately.

> Done. Now SQL file has 325 lines and output has 1697 lines as against
> 515 and 4085 lines resp. earlier.

Sounds reasonable.

> Now that that purpose has served, I have reordered the
> patches so that test patch comes after the implementation and follow
> on fixes.

Sounds good.

> There are two ways to fix it,
>
> 1. when we create a reparameterized path add it to the list of paths,
> thus the parameterization bubbles up the join tree. But then we will
> be changing the path list after set_cheapest() has been called OR may
> be throwing out paths which other paths refer to. That's not
> desirable. May be we can save this path in another list and create
> join paths using this path instead of reparameterizing existing join
> paths.
> 2. Add code to reparameterize_path() to handle join paths, and I think
> all kinds of paths since we might have trickle the parameterization
> down the joining paths which could be almost anything including
> sort_paths, unique_paths etc. That looks like a significant effort. I
> think, we should attack it separately after the stock partition-wise
> join has been committed.

I don't understand #1.  #2 sounds like what I was expecting.  I agree
it can be postponed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Right. If we could use parent Vars to indicate parent Var or child Var
> depending upon the context, a lot of memory issues would be solved; we
> wouldn't need to translate a single expression. But I think that's not
> straight forward. I have been thinking about some kind of polymorphic
> Var node, but it seems a lot more invasive change. Although, if we
> could get something like that, we would save a huge memory. :)

Yes, that's why I'm interested in exploring that approach once the
basic framework is in place here.

> I am wondering whether we need to change
> calc_non_nestloop_required_outer() similar to
> calc_nestloop_required_outer() just to keep their signatures in sync.

I haven't looked at the patch, but I don't think you need to worry about that.

> Should I work on completing reparameterize_path_by_child() to support
> all kinds of paths?

Yes, or at the very least all scans, like reparameterize_path() already does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> On a further testing of this patch I find another case when it is
> showing regression, the time taken with patch is around 160 secs and
> without it is 125 secs.

This is basically the same problem as before; the partitionwise case
is doing the hash joins with the sides flipped from the optimal
strategy.  I bet that's a bug in the code rather than a problem with
the concept.

> Another minor thing to note that is planning time is almost twice with
> this patch, though I understand that this is for scenarios with really
> big 'big data' so this may not be a serious issue in such cases, but
> it'd be good if we can keep an eye on this that it doesn't exceed the
> computational bounds for a really large number of tables..

Yes, this is definitely going to use significant additional planning
time and memory.  There are several possible strategies for improving
that situation, but I think we need to get the basics in place first.
That's why the proposal is now to have this turned off by default.
People joining really big tables that happen to be equipartitioned are
likely to want to turn it on, though, even before those optimizations
are done.
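For anyone following along, turning the feature on for a session is
just a GUC change (the name below is the one used throughout this
thread; per the proposal above it will default to off):

-- Enable partition-wise join for the current session.
SET enable_partition_wise_join = on;

-- Or enable it for a single statement and have it reset automatically.
BEGIN;
SET LOCAL enable_partition_wise_join = on;
-- run the big equipartitioned join here
COMMIT;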

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



>
> On a further testing of this patch I find another case when it is
> showing regression, the time taken with patch is around 160 secs and
> without it is 125 secs.
> Another minor thing to note that is planning time is almost twice with
> this patch, though I understand that this is for scenarios with really
> big 'big data' so this may not be a serious issue in such cases, but
> it'd be good if we can keep an eye on this that it doesn't exceed the
> computational bounds for a really large number of tables.

Right, planning time would be proportional to the number of partitions
at least in the first version. We may improve upon it later.

> Please find the attached .out file to check the output I witnessed and
> let me know if anymore information is required
> Schema and data was similar to the previously shared schema with the
> addition of more data for this case, parameter settings used were:
> work_mem = 1GB
> random_page_cost = seq_page_cost = 0.1

The patch does not introduce any new costing model. It costs the
partition-wise join as the sum of the costs of the joins between
partitions. The method used to create the paths for joins between
partitions is the same as the one used for joins between regular
tables, and the method used to collect paths across partition-wise
joins is the same as the one used to collect paths across child base
relations. So, there is a good chance that the costing for joins
between partitions has a problem which is showing up here. There may
be some special handling for regular tables versus child tables that
is the root cause, but I have not seen that kind of code so far.

Can you please provide the outputs of the individual partition-joins?
If the plans for the joins between partitions are the same as the ones
chosen within the partition-wise join, we may need to fix the existing
join cost models.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



>
> Hmm.  I would kind of like to move the IS_JOIN_REL() and
> IS_OTHER_REL() stuff to the front of the series.  In other words, I
> propose that we add those macros first, each testing for only the one
> kind of RelOptInfo that exists today, and change all the code to use
> them.  Then, when we add child joinrels, we can modify the macros at
> the same time.  The problem with doing it the way you have it is that
> those changes will have to be squashed into the main partitionwise
> join commit, because otherwise stuff will be broken.  Doing it the
> other way around lets us commit that bit separately.
>

I can provide a patch with adjust_appendrel_attrs_multilevel() changed
to handle child-joins, which can be applied before the multi-level
partition-wise support patch but after the partition-wise
implementation patch. You may consider applying that patch separately
before multi-level partition-wise support, in case we see that
multi-level partition-wise join support can be committed. Does that
sound good? That way we avoid changing those macros twice.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> Hmm.  I would kind of like to move the IS_JOIN_REL() and
>> IS_OTHER_REL() stuff to the front of the series.  In other words, I
>> propose that we add those macros first, each testing for only the one
>> kind of RelOptInfo that exists today, and change all the code to use
>> them.  Then, when we add child joinrels, we can modify the macros at
>> the same time.  The problem with doing it the way you have it is that
>> those changes will have to be squashed into the main partitionwise
>> join commit, because otherwise stuff will be broken.  Doing it the
>> other way around lets us commit that bit separately.
>
> I can provide a patch with adjust_appendrel_attrs_multilevel() changed
> to child-joins, which can be applied before multi-level
> partition-wise support patch but after partition-wise implementation
> patch. You may consider applying that patch separately before
> multi-level partition-wise support, in case we see that multi-level
> partition-wise join support can be committed. Does that sound good?
> That way we save changing those macros twice.

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Mar 20, 2017 at 10:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Hmm.  I would kind of like to move the IS_JOIN_REL() and
>>> IS_OTHER_REL() stuff to the front of the series.  In other words, I
>>> propose that we add those macros first, each testing for only the one
>>> kind of RelOptInfo that exists today, and change all the code to use
>>> them.  Then, when we add child joinrels, we can modify the macros at
>>> the same time.  The problem with doing it the way you have it is that
>>> those changes will have to be squashed into the main partitionwise
>>> join commit, because otherwise stuff will be broken.  Doing it the
>>> other way around lets us commit that bit separately.
>>
>> I can provide a patch with adjust_appendrel_attrs_multilevel() changed
>> to child-joins, which can be applied before multi-level
>> partition-wise support patch but after partition-wise implementation
>> patch. You may consider applying that patch separately before
>> multi-level partition-wise support, in case we see that multi-level
>> partition-wise join support can be committed. Does that sound good?
>> That way we save changing those macros twice.
>
> That seems different than what I suggested and I'm not sure what the
> reason is for the difference?
>

The patch adding the macros IS_JOIN_REL() and IS_OTHER_REL() and
changing the code to use them will look quite odd by itself. We are
not changing all the instances of RELOPT_JOINREL or
RELOPT_OTHER_MEMBER_REL to use those; there is code which needs to
check exactly those kinds, rather than "all join rels" or "all other
rels" respectively. So the patch would add those macros but change
only the few places which are intended to be changed while adding
partition-wise join support for single-level partitioned tables.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> That seems different than what I suggested and I'm not sure what the
>> reason is for the difference?
>
> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
> the code to use it will look quite odd by itself. We are not changing
> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
> those. There is code which needs to check those kinds, instead of "all
> join rels" or "all other rels" resp. So the patch will add those
> macros, change only few places to use those macros, which are intended
> to be changed while applying partition-wise join support for single
> level partitioned table.

Hmm.  You might be right, but I'm not convinced.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

From
Rajkumar Raghuwanshi
Date:
> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:

I have created some tests to cover partition-wise joins with
postgres_fdw, and also verified make check.
Patch attached.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On a further testing of this patch I find another case when it is
>> showing regression, the time taken with patch is around 160 secs and
>> without it is 125 secs.
>> Another minor thing to note that is planning time is almost twice with
>> this patch, though I understand that this is for scenarios with really
>> big 'big data' so this may not be a serious issue in such cases, but
>> it'd be good if we can keep an eye on this that it doesn't exceed the
>> computational bounds for a really large number of tables.
>
> Right, planning time would be proportional to the number of partitions
> at least in the first version. We may improve upon it later.
>
>> Please find the attached .out file to check the output I witnessed and
>> let me know if anymore information is required
>> Schema and data was similar to the previously shared schema with the
>> addition of more data for this case, parameter settings used were:
>> work_mem = 1GB
>> random_page_cost = seq_page_cost = 0.1

this doesn't look good. Why do you set both these costs to the same value?

>
> The patch does not introduce any new costing model. It costs the
> partition-wise join as sum of costs of joins between partitions. The
> method to create the paths for joins between partitions is same as
> creating the paths for joins between regular tables and then the
> method to collect paths across partition-wise joins is same as
> collecting paths across child base relations. So, there is a large
> chance that the costing for joins between partitions might have a
> problem which is showing up here. There may be some special handling
> for regular tables versus child tables that may be the root cause. But
> I have not seen that kind of code till now.
>
> Can you please provide the outputs of individual partition-joins? If
> the plans for joins between partitions are same as the ones chosen for
> partition-wise joins, we may need to fix the existing join cost
> models.

Offlist, Rafia shared the outputs of the joins between individual
partitions and of the join between the partitioned tables. The plans
for the joins between partitions look similar to those picked up by
the partition-wise join. So, it looks like some costing error in
regular joins is resulting in a costing error in the partition-wise
join, as suspected. Attached are the SQL and the output.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> That seems different than what I suggested and I'm not sure what the
>>> reason is for the difference?
>>
>> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
>> the code to use it will look quite odd by itself. We are not changing
>> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
>> those. There is code which needs to check those kinds, instead of "all
>> join rels" or "all other rels" resp. So the patch will add those
>> macros, change only few places to use those macros, which are intended
>> to be changed while applying partition-wise join support for single
>> level partitioned table.
>
> Hmm.  You might be right, but I'm not convinced.

Ok. changed as per your request in the latest set of patches.

There are some more changes, as follows:
1. In the earlier patch set the changes related to
calc_nestloop_required_outer() and related functions were spread
across multiple patches. That was unintentional. This patch set has
all those changes in a single patch.

2. Rajkumar reported a crash offlist (a rough reproduction sketch is
given after point 3 below). When one of the joining multi-level
partitioned relations is empty, the assertion
Assert(rel1->part_rels && rel2->part_rels) in try_partition_wise_join()
failed, since it didn't find part_rels for a subpartition. The problem
here is that set_append_rel_size() does not call set_rel_size(), and
hence set_append_rel_size(), for a child that is found to be empty, a
scenario described in [1]. It's the latter which sets part_rels for a
partitioned relation, and hence the subpartitions do not get part_rels
since set_append_rel_size() is never called for them. Generally, if a
partitioned relation is found to be empty before we set part_rels, we
may not want to spend time creating/collecting child RelOptInfos, since
they will be empty anyway. If part_rels isn't present, part_scheme
doesn't make sense, so an empty partitioned table without any
partitions can be treated as unpartitioned. So, I have fixed
set_dummy_rel_pathlist() and mark_dummy_rel(), the functions that mark
a relation empty, to reset the partition scheme when those conditions
are met. This fix is included as a separate patch. Let me know if this
looks good to you.

3. I am in the process of completing reparameterize_path_by_child() by
adding support for all possible path types. I have restructured the
function to look better and to have one switch statement instead of
two. I have also added more path types, including ForeignPath, for
which I have added an FDW hook, with documentation, for handling
fdw_private. Please let me know if this looks good to you. I am
thinking of a similar hook for CustomPath. I will continue to add more
path types to reparameterize_path_by_child().

I am wondering whether we should move 0007, the patch adjusting code
to work with child-joins, before 0006, the partition-wise join patch.
0006 needs it, but 0007 doesn't depend upon 0006. Would that be any
better?

[1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com,
subject "Asymmetry between parent and child wrt "false" quals"
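As promised in point 2 above, here is a rough reproduction sketch. All
names and bounds are made up, and I haven't verified that this hits
exactly the same assertion; it just sets up the shape of the problem:
identically partitioned multi-level tables where one side is proven
empty before its part_rels are populated.

-- Made-up two-level range-partitioned tables.
CREATE TABLE ml1 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE ml1_p1 PARTITION OF ml1
    FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE ml1_p1_p1 PARTITION OF ml1_p1 FOR VALUES FROM (0) TO (100);

CREATE TABLE ml2 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE ml2_p1 PARTITION OF ml2
    FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE ml2_p1_p1 PARTITION OF ml2_p1 FOR VALUES FROM (0) TO (100);

SET enable_partition_wise_join = on;

-- The contradictory quals are only meant to get one side marked dummy
-- early; whether that happens at plan time depends on settings such as
-- constraint_exclusion.
EXPLAIN SELECT *
FROM ml1 JOIN ml2 ON ml1.a = ml2.a AND ml1.b = ml2.b
WHERE ml1.a = 1 AND ml1.a = 2;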

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Thanks Rajkumar. Added those in the latest set of patches.

On Tue, Mar 21, 2017 at 3:52 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>
> I have created some test to cover partition wise joins with
> postgres_fdw, also verified make check.
> patch attached.
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Mar 21, 2017 at 7:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>>
>>> On a further testing of this patch I find another case when it is
>>> showing regression, the time taken with patch is around 160 secs and
>>> without it is 125 secs.
>>> Another minor thing to note that is planning time is almost twice with
>>> this patch, though I understand that this is for scenarios with really
>>> big 'big data' so this may not be a serious issue in such cases, but
>>> it'd be good if we can keep an eye on this that it doesn't exceed the
>>> computational bounds for a really large number of tables.
>>
>> Right, planning time would be proportional to the number of partitions
>> at least in the first version. We may improve upon it later.
>>
>>> Please find the attached .out file to check the output I witnessed and
>>> let me know if anymore information is required
>>> Schema and data was similar to the previously shared schema with the
>>> addition of more data for this case, parameter settings used were:
>>> work_mem = 1GB
>>> random_page_cost = seq_page_cost = 0.1
>
> this doesn't look good. Why do you set both these costs to the same value?

That's a perfectly reasonable configuration if the data is in memory
or on a medium with fast random access, like an SSD.
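For anyone wanting to reproduce Rafia's runs, the settings she listed
amount to the following (values copied from her mail; set them in the
test session only):

SET work_mem = '1GB';
SET random_page_cost = 0.1;
SET seq_page_cost = 0.1;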

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Here's the set of patches rebased on the latest head, which now also
has a commit to eliminate scans on partitioned tables. This change has
caused problems with multi-level partitioned tables, which I have not
fixed in this patch set. Also, a couple of partition-wise join plans
for single-level partitioned tables have changed to non-partition-wise
joins. I haven't fixed those either.

I have added a separate patch to fix add_paths_to_append_rel() to
collect the partitioned_rels list for join relations. Please let me
know if this looks good. I think it needs to be merged into some other
patch, but I am not sure which. Probably we should just treat it as
another refactoring patch.

On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>> That seems different than what I suggested and I'm not sure what the
>>>> reason is for the difference?
>>>
>>> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
>>> the code to use them will look quite odd by itself. We are not changing
>>> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
>>> those; there is code which needs to check those specific kinds, rather
>>> than "all join rels" or "all other rels" respectively. So the patch
>>> will add those macros and change only a few places to use them, namely
>>> the places which need to change for partition-wise join support for
>>> single-level partitioned tables.
>>
>> Hmm.  You might be right, but I'm not convinced.
>
> Ok. changed as per your request in the latest set of patches.
>
> There are some more changes as follows
> 1. In the earlier patch set the changes related to
> calc_nestloop_required_outer() and related functions were spread
> across multiple patches. That was unintentional. This patch set has
> all those changes in a single patch.
>
> 2. Rajkumar reported a crash offlist. When one of the joining
> multi-level partitioned relations is empty, the assertion
> Assert(rel1->part_rels && rel2->part_rels); in
> try_partition_wise_join() failed since it didn't find part_rels for a
> subpartition. The problem here is that set_append_rel_size() does not
> call set_rel_size(), and hence set_append_rel_size() recursively, if a
> child is found to be empty, a scenario described in [1]. It's the
> latter which sets part_rels for a partitioned relation, and hence the
> subpartitions do not get part_rels since set_append_rel_size() is
> never called for them. Generally, if a partitioned relation is found
> to be empty before we set part_rels, we may not want to spend time
> creating/collecting child RelOptInfos, since they will be empty
> anyway. If part_rels isn't present, part_scheme doesn't make sense,
> so an empty partitioned table without any partitions can be treated
> as unpartitioned. So, I have fixed set_dummy_rel_pathlist() and
> mark_dummy_rel(), the functions marking a relation empty, to reset
> the partition scheme when those conditions are met. This fix is
> included as a separate patch. Let me know if this looks good to you.
>
> 3. I am in the process of completing reparameterize_path_by_child()
> by adding support for all possible paths. I have restructured the
> function to look better, with one switch statement instead of two. I
> have also added more path types, including ForeignPath, for which I
> have added an FDW hook, with documentation, for handling fdw_private.
> Please let me know if this looks good to you. I am thinking of a
> similar hook for CustomPath. I will continue to add more path types
> to reparameterize_path_by_child().
>
> I am wondering whether we should move 0007, the patch adjusting code
> to work with child-joins, before 0006, the partition-wise join patch.
> 0006 needs it, but 0007 doesn't depend upon 0006. Would that be any
> better?
>
> [1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com,
> subject "Asymmetry between parent and child wrt "false" quals"
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Tue, Mar 21, 2017 at 10:40 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Here's the set of patches rebased on latest head, which also has a
> commit to eliminate scans on partitioned tables. This change has
> caused problems with multi-level partitioned tables, that I have not
> fixed in this patch set. Also a couple of partition-wise join plans
> for single-level partitioned tables have changed to non-partition-wise
> joins. I haven't fixed those as well.
>
> I have added a separate patch to fix add_paths_to_append_rel() to
> collect partitioned_rels list for join relations. Please let me know
> if this looks good. I think it needs to be merged into some other
> patch, but I am not sure which. Probably we should just treat it as
> another refactoring patch.
>
> On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
>>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>>> That seems different than what I suggested and I'm not sure what the
>>>>> reason is for the difference?
>>>>
>>>> The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
>>>> the code to use it will look quite odd by itself. We are not changing
>>>> all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
>>>> those. There is code which needs to check those kinds, instead of "all
>>>> join rels" or "all other rels" resp. So the patch will add those
>>>> macros, change only few places to use those macros, which are intended
>>>> to be changed while applying partition-wise join support for single
>>>> level partitioned table.
>>>
>>> Hmm.  You might be right, but I'm not convinced.
>>
>> Ok. changed as per your request in the latest set of patches.
>>
>> There are some more changes as follows
>> 1. In the earlier patch set the changes related to
>> calc_nestloop_required_outer() and related functions were spread
>> across multiple patches. That was unintentional. This patch set has
>> all those changes in a single patch.
>>
>> 2. Rajkumar reported a crash offlist. When one of the joining
>> multi-level partitioned relations is empty, an assertion in
>> try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels);
>> failed since it didn't find part_rels for a subpartition. The problem
>> here is set_append_rel_size() does not call set_rel_size() and hence
>> set_append_rel_size() if a child is found to be empty, a scenario
>> described in [1]. It's the later one which sets the part_rels for a
>> partitioned relation and hence the subpartitions do not get part_rels
>> since set_append_rel_size() is never called for those. Generally, if a
>> partitioned relation is found to be empty before we set part_rels, we
>> may not want to spend time in creating/collecting child RelOptInfos,
>> since they will be empty anyway. If part_rels isn't present,
>> part_scheme doesn't make sense. So an empty partitioned table without
>> any partitions can be treated as unpartitioned. So, I have fixed
>> set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a
>> relation empty, to reset partition scheme when those conditions are
>> met. This fix is included as a separate patch. Let me know if this
>> looks good to you.
>>
>> 3. I am in the process of completing reparameterize_paths_by_child()
>> by adding all possible paths. I have restructured the function to look
>> better and have one switch case instead of two. Also added more path
>> types including ForeignPath, for which I have added a FDW hook, with
>> documentation, for handling fdw_private. Please let me know if this
>> looks good to you. I am thinking of similar hook for CustomPath. I
>> will continue to add more path types to
>> reparameterize_path_by_child().
>>
>> I am wondering whether we should bring 0007 he patche adjusting code
>> to work with child-joins before 0006, partition-wise join. 0006 needs
>> it, but 0007 doesn't depend upon 0006. Will that be any better?
>>
>> [1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com,
>> subject "Asymmetry between parent and child wrt "false" quals"
>>
In an attempt to test the GEQO side of this patch, I reduced
geqo_threshold to 6, set enable_partition_wise_join to true, and
tried the following query, which crashed:

explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
= prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
order by prt.a desc;

Stack-trace for the crash is as follows,

Program received signal SIGSEGV, Segmentation fault.
0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30,
required_outer=0x2ff6d30) at relnode.c:1534
1534 if (bms_equal(ppi->ppi_req_outer, required_outer))
(gdb) bt
#0  0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30,
required_outer=0x2ff6d30) at relnode.c:1534
#1  0x000000000079b8bb in reparameterize_path_by_child
(root=0x2df7550, path=0x2f6dec0, child_rel=0x2d4a860) at
pathnode.c:3455
#2  0x000000000075be30 in try_nestloop_path (root=0x2df7550,
joinrel=0x2ff51b0, outer_path=0x2f96540, inner_path=0x2f6dec0,
pathkeys=0x0,   jointype=JOIN_INNER, extra=0x7fffe6b4e130) at joinpath.c:344
#3  0x000000000075d55b in match_unsorted_outer (root=0x2df7550,
joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30,
jointype=JOIN_INNER,   extra=0x7fffe6b4e130) at joinpath.c:1389
#4  0x000000000075bc5f in add_paths_to_joinrel (root=0x2df7550,
joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30,
jointype=JOIN_INNER,   sjinfo=0x3076bc8, restrictlist=0x3077168) at joinpath.c:234
#5  0x000000000075f1d5 in populate_joinrel_with_paths (root=0x2df7550,
rel1=0x2d3fe30, rel2=0x2d4a860, joinrel=0x2ff51b0, sjinfo=0x3076bc8,   restrictlist=0x3077168) at joinrels.c:793
#6  0x0000000000760107 in try_partition_wise_join (root=0x2df7550,
rel1=0x2d3f6d8, rel2=0x2d4a1a0, joinrel=0x30752f0,
parent_sjinfo=0x7fffe6b4e2d0,   parent_restrictlist=0x3075768) at joinrels.c:1401
#7  0x000000000075f0e6 in make_join_rel (root=0x2df7550,
rel1=0x2d3f6d8, rel2=0x2d4a1a0) at joinrels.c:744
#8  0x0000000000742053 in merge_clump (root=0x2df7550,
clumps=0x3075270, new_clump=0x30752a8, force=0 '\000') at
geqo_eval.c:260
#9  0x0000000000741f1c in gimme_tree (root=0x2df7550, tour=0x2ff2430,
num_gene=6) at geqo_eval.c:199
#10 0x0000000000741df5 in geqo_eval (root=0x2df7550, tour=0x2ff2430,
num_gene=6) at geqo_eval.c:102
#11 0x000000000074288a in random_init_pool (root=0x2df7550,
pool=0x2ff23d0) at geqo_pool.c:109
#12 0x00000000007422a6 in geqo (root=0x2df7550, number_of_rels=6,
initial_rels=0x2ff22d0) at geqo_main.c:114
#13 0x0000000000747f19 in make_rel_from_joinlist (root=0x2df7550,
joinlist=0x2dce940) at allpaths.c:2333
#14 0x0000000000744e7e in make_one_rel (root=0x2df7550,
joinlist=0x2dce940) at allpaths.c:182
#15 0x0000000000772df9 in query_planner (root=0x2df7550,
tlist=0x2dec2c0, qp_callback=0x777ce1 <standard_qp_callback>,
qp_extra=0x7fffe6b4e700)   at planmain.c:254

Please let me know if any more information is required on this.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



>>>
> In an attempt to test the geqo side of this patch, I reduced
> geqo_threshold to 6 and set enable_partitionwise_join to to true and
> tried following query, which crashed,
>
> explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
> = prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
> order by prt.a desc;
>
> Stack-trace for the crash is as follows,
>
Nice catch. reparameterize_path_by_child() may be running in a
temporary memory context when running in GEQO mode. It may then add a
new PPI to a base relation in that temporary context. In the next GEQO
cycle, the ppilist will be clobbered since the temporary context is
reset for each GEQO cycle. The fix is to allocate the PPI in the same
memory context as the RelOptInfo, similar to mark_dummy_rel().
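
For reference, the fix is roughly to wrap the PPI allocation like the
sketch below, mirroring mark_dummy_rel(); 'rel', 'required_outer' and
'rows' stand in for the variables of whatever function builds the PPI:

    ParamPathInfo *ppi;
    MemoryContext  oldcontext;

    /*
     * Allocate in the same context as the RelOptInfo, not GEQO's
     * per-cycle context, so that rel->ppilist survives the cycle reset.
     */
    oldcontext = MemoryContextSwitchTo(GetMemoryChunkContext(rel));

    ppi = makeNode(ParamPathInfo);
    ppi->ppi_req_outer = bms_copy(required_outer);
    ppi->ppi_rows = rows;
    ppi->ppi_clauses = NIL;
    rel->ppilist = lappend(rel->ppilist, ppi);

    MemoryContextSwitchTo(oldcontext);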

I also found another problem. In GEQO, we never call
generate_partition_wise_join_paths(), which sets the cheapest paths
for each child-join. Because of this, cheapest_*_path is never set for
those rels, leading to segfaults in functions like
sort_inner_and_outer() which use them.

Here's a patch fixing both issues. Please let me know if it fixes
the issues you are seeing.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Wed, Mar 22, 2017 at 3:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>>>>
>> In an attempt to test the geqo side of this patch, I reduced
>> geqo_threshold to 6 and set enable_partitionwise_join to to true and
>> tried following query, which crashed,
>>
>> explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
>> = prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
>> order by prt.a desc;
>>
>> Stack-trace for the crash is as follows,
>>
> Nice catch. When reparameterize_path_by_child() may be running in a
> temporary memory context while running in GEQO mode. It may add a new
> PPI to base relation all in the temporary context. In the next GEQO
> cycle, the ppilist will be clobbered since the temporary context is
> reset for each geqo cycle. The fix is to allocate PPI in the same
> memory context as the RelOptInfo similar to mark_dummy_rel().
>
> I also found another problem. In geqo, we never call
> generate_partition_wise_join_paths() which set cheapest paths for each
> child-join. Because of this cheapest_*_paths are never set for those
> rels, thus segfaulting in functions like sort_inner_and_outer() which
> use those.
>
> Here's patch fixing both the issues. Please let me know if it fixes
> the issues you are seeing.

I tested with the patch applied; it fixes the reported issue.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



Here's a set of updated patches rebased on
1148e22a82edc96172fc78855da392b6f0015c88.

I have fixed all the issues reported till now.

I have also completed reparameterize_path_by_child() for all the
required paths. There's no TODO there now. :) The function has grown
quite long and might take some time to review. Given the size, I am
wondering whether we should separate that fix from the main
partition-wise join patch; that would make reviewing the function
easier, allowing a careful review. Here's the idea of how that can be
done. As explained in the commit message of 0009, the function is
required for lateral joins between partitioned relations. For
A LATERAL JOIN B, B is the minimum required parameterization for A.
Hence the children of A, i.e. A1, A2, ..., all require their paths to
be parameterized by B. When it comes to partition-wise joins, A1
requires its paths to be parameterized by B1 (the matching partition
from B); otherwise we can not create paths for A1B1. This means that
we need to reparameterize all of A1's paths by B1 using
reparameterize_path_by_child(). So the function needs to support
reparameterization of all path types; we do not know which of those
have survived add_path(). But if we disable partition-wise join for
lateral joins, i.e. when the direct_lateral_relids of one relation
contain any of the relids of the other relation, we do not need
reparameterize_path_by_child(). Please let me know if this strategy
will help to make review and commit easier.
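
To be concrete, the lateral-join check I have in mind is a couple of
lines near the top of try_partition_wise_join(), roughly like the
sketch below (exact placement and form may differ in the patch):

    /*
     * Give up on partition-wise join if either input has a lateral
     * reference into the other; then no child path ever needs to be
     * reparameterized by a child of the other side.
     */
    if (bms_overlap(rel1->direct_lateral_relids, rel2->relids) ||
        bms_overlap(rel2->direct_lateral_relids, rel1->relids))
        return;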

After the commit,
commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50
Author: Robert Haas <rhaas@postgresql.org>
Date:   Tue Mar 21 09:48:04 2017 -0400

    Don't scan partitioned tables.

We do not create any AppendRelInfos, and hence no RelOptInfos, for
partitioned child tables. My approach for multi-level partition-wise
join was to store the RelOptInfos of the immediate partitions in the
part_rels of the RelOptInfo of a partitioned table, thus maintaining a
tree of RelOptInfos reflecting the partitioning tree. This allows
adding append paths to intermediate RelOptInfos, flattening them as we
go up the partitioning hierarchy. With no RelOptInfos for intermediate
partitions, we can support multi-level partition-wise join only in the
limited cases where the partitioning hierarchies of the joining tables
match exactly. Please refer to [1] for some more discussion.

I think we need the RelOptInfos for the partitions which are
themselves partitioned, to hold the "append" paths containing paths
from their children and to match the partitions in partition-wise
join. A similar hierarchy will be created for partitioned joins, with
child-joins as its partitions. So, I have not changed the multi-level
partition-wise join support patches. After applying 0011-0013 the
multi-level partitioning tests fail with the error "could not find the
RelOptInfo of a partition with oid", since the planner does not find
the RelOptInfos of partitions which are themselves partitioned.
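
As a rough illustration of the shape I have in mind, a hypothetical
helper walking such a tree would look like the sketch below
(part_rels/nparts are the fields proposed in these patches; this
helper is not part of any posted patch):

/*
 * Sketch: walk a RelOptInfo tree linked through part_rels and collect
 * the cheapest path of every leaf partition, flattening the hierarchy
 * on the way up.
 */
static List *
collect_leaf_paths(RelOptInfo *rel)
{
    List   *result = NIL;
    int     i;

    if (rel->part_rels == NULL)
        return list_make1(rel->cheapest_total_path);

    for (i = 0; i < rel->nparts; i++)
        result = list_concat(result, collect_leaf_paths(rel->part_rels[i]));

    return result;
}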

[1] https://www.postgresql.org/message-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV%3DkB8Akkqg@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
>
> I tested the applied patch, it is fixing the reported issue.

Thanks for the confirmation, Rafia. I have included the fix in the
latest set of patches.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I have also completed reparameterize_path_by_child() for all the
> required paths. There's no TODO there now. :) The function has grown
> quite long now and might take some time to review. Given the size, I
> am wondering whether we should separate that fix from the main
> partition-wise join fix. That will make reviewing that function
> easier, allowing a careful review. Here's the idea how that can be
> done. As explained in the commit of 0009, the function is required in
> case of lateral joins between partitioned relations. For a A LATERAL
> JOIN B, B is the minimum required parameterization by A. Hence
> children of A i.e. A1, A2 ... all require their paths to be
> parameterized by B. When that comes to partition-wise joins, A1
> requires its paths to be parameterized by B1 (matching partition from
> B). Otherwise we can not create paths for A1B1. This means that we
> require to reparameterize all A1's paths to be reparameterized by B1
> using function reparameterize_paths_by_child(). So the function needs
> to support reparameterization of all the paths; we do not know which
> of those have survived add_path(). But if we disable partition-wise
> join for lateral joins i.e. when direct_lateral_relids of one relation
> contains the any subset of the relids in the other relation, we do not
> need reparameterize_path_by_child(). Please let me know if this
> strategy will help to make review and commit easier.

In my testing last week, reparameterize_path_by_child() was essential
for nested loops to work properly, even without LATERAL.  Without it,
the parameterized path ends up containing vars that reference the
parent varno instead of the child varno.  That confused later planner
stages so that those Vars did not get replaced with Param during
replace_nestloop_params(), eventually resulting in a crash at
execution time.  Based on that experiment, I think we could consider
having reparameterize_path_by_child() handle only scan paths as
reparameterize_path() does, and just give up on plans like this:

Append
-> Left Join
   -> Scan on a
   -> Inner Join
      -> Index Scan on b
      -> Index Scan on c
[repeat for each partition]

But I doubt we can get by without it altogether.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Mar 22, 2017 at 6:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I have also completed reparameterize_path_by_child() for all the
>> required paths. There's no TODO there now. :) The function has grown
>> quite long now and might take some time to review. Given the size, I
>> am wondering whether we should separate that fix from the main
>> partition-wise join fix. That will make reviewing that function
>> easier, allowing a careful review. Here's the idea how that can be
>> done. As explained in the commit of 0009, the function is required in
>> case of lateral joins between partitioned relations. For a A LATERAL
>> JOIN B, B is the minimum required parameterization by A. Hence
>> children of A i.e. A1, A2 ... all require their paths to be
>> parameterized by B. When that comes to partition-wise joins, A1
>> requires its paths to be parameterized by B1 (matching partition from
>> B). Otherwise we can not create paths for A1B1. This means that we
>> require to reparameterize all A1's paths to be reparameterized by B1
>> using function reparameterize_paths_by_child(). So the function needs
>> to support reparameterization of all the paths; we do not know which
>> of those have survived add_path(). But if we disable partition-wise
>> join for lateral joins i.e. when direct_lateral_relids of one relation
>> contains the any subset of the relids in the other relation, we do not
>> need reparameterize_path_by_child(). Please let me know if this
>> strategy will help to make review and commit easier.
>
> In my testing last week, reparameterize_path_by_child() was essential
> for nested loops to work properly, even without LATERAL.  Without it,
> the parameterized path ends up containing vars that reference the
> parent varno instead of the child varno.  That confused later planner
> stages so that those Vars did not get replaced with Param during
> replace_nestloop_params(), eventually resulting in a crash at
> execution time.

I half-described the solution. Sorry. Along with disabling
partition-wise lateral joins, we have to disable nested loop
child-joins where the inner child is parameterized by the parent of
the outer one. We will still have a nestloop join between the parents
where the inner relation is parameterized by the outer, and every
child of the inner is parameterized by the outer. But we won't create
nestloop joins where the inner child is parameterized by the outer
child, which is where we require reparameterize_path_by_child(). We
will lose this optimization only until we get
reparameterize_path_by_child() committed. Basically, in
try_nestloop_path() (in patch 0009), if
PATH_PARAM_BY_PARENT(inner_path, outer_path->parent), give up on
creating the nestloop path. That shouldn't create any problems.
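
To spell that out, the early exit would be roughly the following
sketch (PATH_PARAM_BY_PARENT() being the macro from patch 0009):

    /*
     * Give up: building this nestloop would need the inner path to be
     * reparameterized by the outer child, i.e. it would need
     * reparameterize_path_by_child().
     */
    if (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent))
        return;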

Did you experiment with this change in try_nestloop_path()? Can you
please share the testcase? I will take a look at it.

> Based on that experiment, I think we could consider
> having reparameterize_path_by_child() handle only scan paths as
> reparameterize_path() does, and just give up on plans like this:
>
> Append
> -> Left Join
>    -> Scan on a
>    -> Inner Join
>       -> Index Scan on b
>       -> Index Scan on c
> [repeat for each partition]
>

I am assuming that a, b and c are partitions of A, B and C resp.,
which are being joined, and that one or both of the scans on b and c
are parameterized by a, or the scan of c is parameterized by b.

I don't think we will get away with supporting just scan paths, since
the inner side of a lateral join can be any path, not just a scan
path. Or are you suggesting that we disable partition-wise lateral
join and support reparameterization of only scan paths?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Wed, Mar 22, 2017 at 9:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> In my testing last week, reparameterize_path_by_child() was essential
>> for nested loops to work properly, even without LATERAL.  Without it,
>> the parameterized path ends up containing vars that reference the
>> parent varno instead of the child varno.  That confused later planner
>> stages so that those Vars did not get replaced with Param during
>> replace_nestloop_params(), eventually resulting in a crash at
>> execution time.
>
> I half-described the solution. Sorry. Along-with disabling
> partition-wise lateral joins, we have to disable nested loop
> child-joins where inner child is parameterized by the parent of the
> outer one. We will still have nestloop join between parents where
> inner relation is parameterized by the outer and every child of inner
> is parameterized by the outer. But we won't create nest loop joins
> where inner child is parameterized by the outer child, where we
> require reparameterize_path_by_child. We will loose this optimization
> only till we get reparameterize_path_by_child() committed. Basically,
> in try_nestloop_path() (in the patch 0009), if
> (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up
> creating nest loop path. That shouldn't create any problems.
>
> Did you experiment with this change in try_nestloop_path()? Can you
> please share the testcase? I will take a look at it.

I didn't save the test case.  It was basically just forcing a
partitionwise nestloop join between two equipartitioned tables, with
the calls to adjust_appendrel_attrs() ripped out of
reparameterize_path_by_child(), just to see what would break.

>> Based on that experiment, I think we could consider
>> having reparameterize_path_by_child() handle only scan paths as
>> reparameterize_path() does, and just give up on plans like this:
>>
>> Append
>> -> Left Join
>>    -> Scan on a
>>    -> Inner Join
>>       -> Index Scan on b
>>       -> Index Scan on c
>> [repeat for each partition]
>>
>
> I am assuming that a, b and c are partitions of A, B and C resp. which
> are being joined and both or one of the scans on b and c are
> parameteried by a or scan of c is parameterized by b.

Yes.

> I don't think we will get away by supporting just scan paths, since
> the inner side of lateral join can be any paths not just scan path. Or
> you are suggesting that we disable partition-wise lateral join and
> support reparameterization of only scan paths?

I think if you can do a straight-up partitionwise nested loop between
two tables A and B, that's pretty bad.  But if there are more complex
cases that involve parameterizing entire join trees which aren't
covered, that's less bad.  Parallel query almost entirely punts on
LATERAL right now, and nobody's complained yet.  I'm sure that'll need
to get fixed someday, but not today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Here's set of updated patches rebased on
> 1148e22a82edc96172fc78855da392b6f0015c88.
>
> I have fixed all the issues reported till now.

I don't understand why patch 0001 ends up changing every existing test
for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(),
yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up
getting changed to use IS_OTHER_REL().  That's very surprising.  Some
of those tests are essentially checking for something that is going to
have a scan plan rather than a join or upper plan, and those tests
probably don't need to be modified; for example, the test in
set_rel_consider_parallel() is obviously of this type. But others are
testing whether we've got some kind of child rel, and those seem like
they might need work.  Going through a few specific examples:

- generate_join_implied_equalities_for_ecs() assumes that any child
rel is an other member rel.
- generate_join_implied_equalities_broken() assumes that any child rel
is an other member rel.
- generate_implied_equalities_for_column() sets is_child_rel on the
assumption that only an other member rel can be a child rel.
- eclass_useful_for_merging() assumes that the only kind of child rel
is an other member rel.
- find_childrel_appendrelinfo() assumes that any child rel is an other
member rel.
- find_childrel_top_parent() and find_childrel_parents() assume that
children must be other member rels and their parents must be baserels.
- adjust_appendrel_attrs_multilevel() assumes that children must be
other member rels and their parents must be baserels.

It's possible that, for various reasons, none of these code paths
would ever be reachable by a child join, but it doesn't look likely to
me.  And even if that's true, some comment updates are probably
needed, and maybe some renaming of functions too.

In postgres_fdw, get_useful_ecs_for_relation() assumes that any child
rel is an other member rel.  I'm not sure if we're hoping that
partitionwise join will work with postgres_fdw's join pushdown out of
the chute, but clearly this would need to be adjusted to have any
chance of being right.

Some that seem OK:

- set_rel_consider_parallel() is fine.
- set_append_rel_size() is only going to be called for baserels or
their children, so it's fine.
- relation_excluded_by_constraints() is only intended to be called on
baserels or their children, so it's fine.
- check_index_predicates() is only intended to be called on baserels
or their children, so it's fine.
- query_planner() loops over baserels and their children, so it's fine.

Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be
used in some of these places, just for symmetry.   The point is that
there are really three questions here: (1) is it some kind of baserel
(parent or child)? (2) is it some kind of joinrel (parent or child)?
and (3) is it some kind of child (baserel or join)?  Right now, both
#2 and #3 are tested by just comparing against
RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we
add child joinrels.  The goal of 0001, IMV, ought to be to try to
figure out which of #1, #2, and #3 is being checked in each case and
make that clear via use of an appropriate macro.  (If is-other-baserel
is the real test, then fine, but I bet that's a rare case.)
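
Just to spell that out, the three tests might end up looking roughly
like this (a sketch only; RELOPT_OTHER_JOINREL stands for whatever new
RelOptKind the child-join patch introduces):

/* (1) some kind of baserel, parent or child */
#define IS_BASEREL_OR_CHILD(rel) \
    ((rel)->reloptkind == RELOPT_BASEREL || \
     (rel)->reloptkind == RELOPT_OTHER_MEMBER_REL)

/* (2) some kind of joinrel, parent or child */
#define IS_JOIN_REL(rel) \
    ((rel)->reloptkind == RELOPT_JOINREL || \
     (rel)->reloptkind == RELOPT_OTHER_JOINREL)

/* (3) some kind of child rel, base or join */
#define IS_OTHER_REL(rel) \
    ((rel)->reloptkind == RELOPT_OTHER_MEMBER_REL || \
     (rel)->reloptkind == RELOPT_OTHER_JOINREL)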

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



>
>> I don't think we will get away by supporting just scan paths, since
>> the inner side of lateral join can be any paths not just scan path. Or
>> you are suggesting that we disable partition-wise lateral join and
>> support reparameterization of only scan paths?
>
> I think if you can do a straight-up partitionwise nested loop between
> two tables A and B, that's pretty bad.

Ok.

> But if there are more complex
> cases that involve parameterizing entire join trees which aren't
> covered, that's less bad.  Parallel query almost entirely punts on
> LATERAL right now, and nobody's complained yet.  I'm sure that'll need
> to get fixed someday, but not today.
>
Ok.

I am suggesting this possibility in case we run out of time to review
and commit reparameterize_path_by_child() entirely. If we can do that,
I will be happy.

In case we have to include a stripped-down version of
reparameterize_path_by_child(), which I am also fine with, we will
need to disable partition-wise lateral joins, so that we don't end up
with the error "could not devise a query plan for the given query".

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Mar 23, 2017 at 1:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Here's set of updated patches rebased on
>> 1148e22a82edc96172fc78855da392b6f0015c88.
>>
>> I have fixed all the issues reported till now.
>
> I don't understand why patch 0001 ends up changing every existing test
> for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(),
> yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up
> getting changed to use IS_OTHER_REL().  That's very surprising.  Some
> of those tests are essentially checking for something that is going to
> have a scan plan rather than a join or upper plan, and those tests
> probably don't need to be modified; for example, the test in
> set_rel_consider_parallel() is obviously of this type. But others are
> testing whether we've got some kind of child rel, and those seem like
> they might need work.  Going through a few specific examples:
>
> - generate_join_implied_equalities_for_ecs() assumes that any child
> rel is an other member rel.
> - generate_join_implied_equalities_broken() assumes that any child rel
> is an other member rel.

Fixed those.

> - generate_implied_equalities_for_column() set is_child_rel on the
> assumption that only an other member rel can be a child rel.

This function is called for indexes, which are not defined on join
relations, so we shouldn't need to worry about child-joins here. I
have added an assertion there to make sure that the function gets
called only for base and "other" member rels.

> - eclass_useful_for_merging() assumes that the only kind of child rel
> is an other member rel.

This was being fixed in a later patch which had many small fixes for
handling child-joins. But now I have moved that fix into 0001.

> - find_childrel_appendrelinfo() assumes that any child rel is an other
> member rel.

The function is called for "other" member relations only. For joins we
use find_appinfos_by_relids(). We could replace
find_childrel_appendrelinfo() with find_appinfos_by_relids(), which
does the same thing as find_childrel_appendrelinfo() for a relids set.
But find_appinfos_by_relids() returns a list of AppendRelInfos, hence
using it instead of find_childrel_appendrelinfo() would spend some
memory and CPU cycles creating a one-element list and then extracting
that element out of the list. So, I have not replaced usages of
find_childrel_appendrelinfo() with find_appinfos_by_relids(). This
also simplifies the changes to get_useful_ecs_for_relation().

> - find_childrel_top_parent() and find_childrel_parents() assume that
> children must be other member rels and their parents must be baserels.

For the partition-wise join implementation we save the relids of the
topmost parent in the child's RelOptInfo. We can use that directly
instead of calling find_childrel_top_parent(). So, in 0001 I am adding
top_parent_relids to RelOptInfo and getting rid of
find_childrel_top_parent(). This also fixes
get_useful_ecs_for_relation() in a better way. find_childrel_parents()
is called only for simple relations, not joins, since its callers are
called only for simple relations. I have added an assertion to that
effect.
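
For illustration, the simplification this enables is roughly the
following sketch ('childrel' and 'parent_relids' are placeholder names
for an arbitrary caller; the "before" form uses the existing
find_childrel_top_parent()):

    /* before: walk AppendRelInfos to find the topmost parent's relids */
    parent_relids = find_childrel_top_parent(root, childrel)->relids;

    /* after: use the relids stored in the child's RelOptInfo by 0001 */
    parent_relids = childrel->top_parent_relids;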

> - adjust_appendrel_attrs_multilevel() assumes that children must be
> other member rels and their parents must be baserels.

It was being fixed in a later patch. In the attached patch set 0001
changes it to use IS_OTHER_REL().

>
> It's possible that, for various reasons, none of these code paths
> would ever be reachable by a child join, but it doesn't look likely to
> me.  And even if that's true, some comment updates are probably
> needed, and maybe some renaming of functions too.

Now commit messages of 0001 explains which instances of
RELOPT_OTHER_MEMBER_REL and RELOPT_BASEREL have been changed, and
which have been retained and why. Also, added assertions wherever
necessary.

>
> In postgres_fdw, get_useful_ecs_for_relation() assumes that any child
> rel is an other member rel.  I'm not sure if we're hoping that
> partitionwise join will work with postgres_fdw's join pushdown out of
> the chute, but clearly this would need to be adjusted to have any
> chance of being right.

Fixed this as explained above.

>
> Some that seem OK:
>
> - set_rel_consider_parallel() is fine.
> - set_append_rel_size() is only going to be called for baserels or
> their children, so it's fine.
> - relation_excluded_by_constraints() is only intended to be called on
> baserels or their children, so it's fine.
> - check_index_predicates() is only intended to be called on baserels
> or their children, so it's fine.
> - query_planner() loops over baserels and their children, so it's fine.
>

Right.

> Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be
> used in some of these places, just for symmetry.

I was wondering about this as well. Although, I though it better not
to touch base relations in partition-wise join. But now, I have added
that macro and adjusted corresponding tests in the code. See 0001.

You may actually want to squash 0001 and 0002 into a single patch. But
for now, I have left those as separate.

> The point is that
> there are really three questions here: (1) is it some kind of baserel
> (parent or child)? (2) is it some kind of joinrel (parent or child)?
> and (3) is it some kind of child (baserel or join)?  Right now, both
> #2 and #3 are tested by just comparing against
> RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we
> add child joinrels.  The goal of 0001, IMV, ought to be to try to
> figure out which of #1, #2, and #3 is being checked in each case and
> make that clear via use of an appropriate macro.  (If is-other-baserel
> is the real test, then fine, but I bet that's a rare case.)

Agreed. I have gone through all the cases, and fixed the necessary
ones as explained above and in the commit messages of 0001.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:
>>> I have fixed all the issues reported till now.

In patch 0007, the following code in have_partkey_equi_join() looks
potentially unsafe:
        /*
         * If the clause refers to different partition keys from
         * both relations, it can not be used for partition-wise join.
         */
        if (ipk1 != ipk2)
            continue;

        /*
         * The clause allows partition-wise join if only it uses the same
         * operator family as that specified by the partition key.
         */
        if (!list_member_oid(rinfo->mergeopfamilies,
                             part_scheme->partopfamily[ipk1]))
            continue;

What if ipk1 and ipk2 both turn out to be -1? Accessing
part_scheme->partopfamily[ipk1] would be incorrect, no?

Thanks,
Amit





Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:
>>> I have fixed all the issues reported till now.

I've tried to fix your 0012 patch (Multi-level partitioned table
expansion) considering your message earlier on this thread [1],
especially the fact that no AppendRelInfo and RelOptInfo are allocated
for partitioned child tables as of commit d3cc37f1d [2].  I've fixed
expand_inherited_rtentry() such that an AppendRelInfo *is* allocated
for partitioned child RTEs whose rte->inh is set to true.  Such an RTE
is recursively expanded with that RTE as the parent.

Also, as I mentioned elsewhere [3], the multi-level inheritance
expansion of partitioned tables will break update/delete for
partitioned tables, because inheritance_planner() is not ready to
handle inheritance sets structured that way.  I tried to refactor
inheritance_planner() such that its core logic can be recursively
invoked for partitioned child RTEs.  The resulting child paths and
other auxiliary information related to planning across the hierarchy
are maintained in one place, using a struct that holds them in a few
flat lists.  The refactoring didn't break any existing tests, and a
couple of new tests are added to check that it indeed works for
multi-level partitioned tables expanded using the new multi-level
structure.

There are some test failures in 0014 (Multi-level partition-wise join
tests), probably because of the changes I made to 0012; I didn't get
time to check why.  I've verified with an example that multi-level
join planning still works, though, so it's not completely broken
either.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT%3DuQ%40mail.gmail.com

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d

[3]
https://www.postgresql.org/message-id/744d20fe-fc7b-f89e-8d06-6496ec537b86%40lab.ntt.co.jp


Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Fri, Mar 24, 2017 at 1:57 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Ashutosh,
>
> On 2017/03/23 21:48, Ashutosh Bapat wrote:
>>>> I have fixed all the issues reported till now.
>
> In patch 0007, the following code in have_partkey_equi_join() looks
> potentially unsafe:
>
>         /*
>          * If the clause refers to different partition keys from
>          * both relations, it can not be used for partition-wise join.
>          */
>         if (ipk1 != ipk2)
>             continue;
>
>         /*
>          * The clause allows partition-wise join if only it uses the same
>          * operator family as that specified by the partition key.
>          */
>         if (!list_member_oid(rinfo->mergeopfamilies,
>                              part_scheme->partopfamily[ipk1]))
>             continue;
>
> What if ipk1 and ipk2 both turn out to be -1? Accessing
> part_schem->partopfamily[ipk1] would be incorrect, no?

Thanks for the report. Surprisingly, this should have crashed at some
point, but never did. Nor did it produce wrong output for queries
where partition keys were not part of equi-joins. The reason is that
partopfamily[-1] happened to contain 0, which, when tested against
list_member_oid(rinfo->mergeopfamilies, ..), returned false. The
attached patches fix this code.
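
For the archives, the guard amounts to something along these lines (a
sketch only; the actual patch may differ in detail):

        /*
         * Skip this clause if it doesn't reference a partition key of
         * both relations; ipk1/ipk2 remain -1 in that case and must not
         * be used to index partopfamily[].
         */
        if (ipk1 < 0 || ipk2 < 0)
            continue;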

Also, I have fixed a few grammar mistakes and typos, and renamed
variables in PartitionSchemeData to match those in PartitionKey. I
have squashed the patches introducing IS_JOIN_REL, IS_OTHER_REL and
IS_SIMPLE_REL into one.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Fri, Mar 24, 2017 at 4:18 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi Ashutosh,
>
> On 2017/03/23 21:48, Ashutosh Bapat wrote:
>>>> I have fixed all the issues reported till now.
>
> I've tried to fix your 0012 patch (Multi-level partitioned table
> expansion) considering your message earlier on this thread [1].
> Especially the fact that no AppendRelInfo and RelOptInfo are allocated for
> partitioned child tables as of commit d3cc37f1d [2].  I've fixed
> expand_inherited_rtentry() such that AppendRelInfo *is* allocated for a
> partitioned child RTEs whose rte->inh is set to true.  Such an RTE is
> recursively expanded with that RTE the parent.
>
> Also as I mentioned elsewhere [3], the multi-level inheritance expansion
> of partitioned table will break update/delete for partitioned table, which
> is because inheritance_planner() is not ready to handle inheritance sets
> structured that way.  I tried to refactor inheritance_planner() such that
> its core logic can be recursively invoked for partitioned child RTEs.  The
> resulting child paths and other auxiliary information related to planning
> across the hierarchy are maintained in one place using a struct to hold
> the same in a few flat lists.  The refactoring didn't break any existing
> tests and a couple of new tests are added to check that it indeed works
> for multi-level partitioned tables expanded using new multi-level structure.
>
> There is some test failure in 0014 (Multi-level partition-wise join
> tests), probably because of the changes I made to 0012, which I didn't get
> time to check why, although I've checked using an example that multi-level
> join planning still works, so it's not completely broken either.
>

I have gone through the patch, and it looks good to me. Here's the set
of patches with this patch included. Fixed the testcase failures.
Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I have gone through the patch, and it looks good to me. Here's the set
> of patches with this patch included. Fixed the testcase failures.
> Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.

This version of 0001 looks much better to me, but I still have some concerns.

I think we should also introduce IS_UPPER_REL() at the same time, for
symmetry and because partitionwise aggregate will need it, and use it
in place of direct tests against RELOPT_UPPER_REL.

I think it would make sense to change the test in deparseFromExpr() to
check for IS_JOIN_REL() || IS_SIMPLE_REL().  There's no obvious reason
why that shouldn't be OK, and it would remove the last direct test
against RELOPT_JOINREL in the tree, and it will probably need to be
changed for partitionwise aggregate anyway.

Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))?  I notice that
you did this in some other places such as
generate_implied_equalities_for_column(), and I like that.  If for
some reason that's not going to work, then it's doubtful whether
Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to
survive either.

Similarly, I think relation_excluded_by_constraints() would also
benefit from Assert(IS_SIMPLE_REL(rel)).

Why not set top_parent_relids earlier, when actually creating the
RelOptInfo?  I think you could just change build_simple_rel() so that
instead of passing RelOptKind reloptkind, you instead pass RelOptInfo
*parent.  I think postponing that work until set_append_rel_size()
just introduces possible bugs resulting from it not being set early
enough.
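
Roughly like this, say (just a sketch of the build_simple_rel() change,
with 'parent' as the new parameter):

    if (parent)
    {
        rel->reloptkind = RELOPT_OTHER_MEMBER_REL;
        rel->top_parent_relids = parent->top_parent_relids ?
            parent->top_parent_relids : parent->relids;
    }
    else
    {
        rel->reloptkind = RELOPT_BASEREL;
        rel->top_parent_relids = NULL;
    }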

Apart from the above, I think 0001 is in good shape.

Regarding 0002, I think the parts that involve factoring out
find_param_path_info() are uncontroversial.  Regarding the changes to
adjust_appendrel_attrs(), my main question is whether we wouldn't be
better off using an array representation rather than a List
representation.  In other words, this function could take PlannerInfo
*root, Node *node, int nappinfos, AppendRelInfo **appinfos.  Existing
callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
needing to allocate.  To make this work, adjust_child_relids() and
find_appinfos_by_relids() would need to be adjusted to use a similar
argument-passing convention.  I suspect this makes iterating over the
AppendRelInfos mildly faster, too, apart from the memory savings.
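
In other words, roughly this (a sketch; childquals/parentquals are just
placeholder names for an arbitrary caller):

    extern Node *adjust_appendrel_attrs(PlannerInfo *root, Node *node,
                                        int nappinfos,
                                        AppendRelInfo **appinfos);

    /* an existing caller with a single AppendRelInfo needs no list: */
    childquals = (List *)
        adjust_appendrel_attrs(root, (Node *) parentquals, 1, &appinfo);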

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Robert Haas
Date:
On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Regarding 0002, I think the parts that involve factoring out
> find_param_path_info() are uncontroversial.  Regarding the changes to
> adjust_appendrel_attrs(), my main question is whether we wouldn't be
> better off using an array representation rather than a List
> representation.  In other words, this function could take PlannerInfo
> *root, Node *node, int nappinfos, AppendRelInfo **appinfos.  Existing
> callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
> just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
> needing to allocate.  To make this work, adjust_child_relids() and
> find_appinfos_by_relids() would need to be adjusted to use a similar
> argument-passing convention.  I suspect this makes iterating over the
> AppendRelInfos mildly faster, too, apart from the memory savings.

Still regarding 0002, looking at adjust_appendrel_attrs_multilevel,
could we have a common code path for the baserel and joinrel cases?
It seems like perhaps you could just loop over root->append_rel_list.
For each appinfo, if (bms_is_member(appinfo->child_relid,
child_rel->relids)) bms_add_member(parent_relids,
appinfo->parent_relid).

This implementation would have some loss of efficiency in the
single-rel case because we'd scan all of the AppendRelInfos in the
list even if there's only one relid.  But you could fix that by
writing it like this:

foreach (lc, root->append_rel_list)
{
    AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);

    if (bms_is_member(appinfo->child_relid, child_rel->relids))
    {
        parent_relids = bms_add_member(parent_relids, appinfo->parent_relid);
        if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
            break;    /* only one relid to find, and we've found it */
    }
}
Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids));

That seems pretty slick.  It is just as fast as the current
implementation for the single-rel case.  It allocates no memory
(unlike what you've got now).  And it handles the joinrel case using
essentially the same code as the simple rel case.

In 0003, it seems that it would be more consistent with what you did
elsewhere if the last argument to allow_star_schema_join were named
inner_paramrels rather than innerparams.  Other than that, I don't see
anything to complain about.

In 0004:

+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;

break here?

+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation
relation, bool inhparent)
+{
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||

I still think the inhparent check should be moved to the caller.

In 0005:

+ *             Returns a list of the RT indexes of the partitioned child relations
+ *             with any of joining relations' rti as the root parent RT index.

I found this wording confusing.  Maybe: Build and return a list
containing the RTI of every partitioned relation which is a child of
some rel included in the join.

+ * Note: Only call this function on joins between partitioned tables.

Or what, the boogeyman will come and get you?

(In other words, I don't think that's a very informative comment.)

I don't think 0011 is likely to be acceptable in current form.  I
can't imagine that we just went to the trouble of getting rid of
AppendRelInfos for child partitioned rels only to turn around and put
them back again.  If you just need the parent-child mappings, you can
get that from the PartitionedChildRelInfo list.

Unfortunately, I don't think we're likely to be able to get this whole
patch series into a committable form in the next few days, but I'd
like to keep reviewing it and working with you on it; there's always
next cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Tue, Mar 28, 2017 at 10:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I have gone through the patch, and it looks good to me. Here's the set
>> of patches with this patch included. Fixed the testcase failures.
>> Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.
>
> This version of 0001 looks much better to me, but I still have some concerns.
>
> I think we should also introduce IS_UPPER_REL() at the same time, for
> symmetry and because partitionwise aggregate will need it, and use it
> in place of direct tests against RELOPT_UPPER_REL.

Ok. Done. I introduced IS_JOIN_REL and IS_OTHER_REL only to simplify
the tests for child-joins, but now the patch has grown to include
IS_SIMPLE_REL() and IS_UPPER_REL(). That has introduced changes
unrelated to partition-wise join. Still, I am happy with the way the
code looks now with all the IS_*_REL() macros. If we delay this
commit, some more usages of bare RELOPT_* would creep into the code.
To avoid that, we may want to commit these changes in v10.

>
> I think it would make sense to change the test in deparseFromExpr() to
> check for IS_JOIN_REL() || IS_SIMPLE_REL().  There's no obvious reason
> why that shouldn't be OK, and it would remove the last direct test
> against RELOPT_JOINREL in the tree, and it will probably need to be
> changed for partitionwise aggregate anyway.

Done. However, we need another assertion to make sure that an "other"
upper rel has an "other" rel as its scanrel. That can be added when
partition-wise aggregate, which would introduce "other" upper rels, is
implemented.

>
> Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))?  I notice that
> you did this in some other places such as
> generate_implied_equalities_for_column(), and I like that.  If for
> some reason that's not going to work, then it's doubtful whether
> Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to
> survive either.

Done. I also modified the prologue of that function to explicitly say
simple "append relation", since we can have join "append relations"
and upper "append relations" with partition-wise operations.

>
> Similarly, I think relation_excluded_by_constraints() would also
> benefit from Assert(IS_SIMPLE_REL(rel)).

Done.

>
> Why not set top_parent_relids earlier, when actually creating the
> RelOptInfo?  I think you could just change build_simple_rel() so that
> instead of passing RelOptKind reloptkind, you instead pass RelOptInfo
> *parent.  I think postponing that work until set_append_rel_size()
> just introduces possible bugs resulting from it not being set early
> enough.

Done.

>
> Apart from the above, I think 0001 is in good shape.
>
> Regarding 0002, I think the parts that involve factoring out
> find_param_path_info() are uncontroversial.  Regarding the changes to
> adjust_appendrel_attrs(), my main question is whether we wouldn't be
> better off using an array representation rather than a List
> representation.  In other words, this function could take PlannerInfo
> *root, Node *node, int nappinfos, AppendRelInfo **appinfos.  Existing
> callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
> just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
> needing to allocate.  To make this work, adjust_child_relids() and
> find_appinfos_by_relids() would need to be adjusted to use a similar
> argument-passing convention.  I suspect this makes iterating over the
> AppendRelInfos mildly faster, too, apart from the memory savings.

Done.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively) partitioned tables

From
Ashutosh Bapat
Date:
On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Regarding 0002, I think the parts that involve factoring out
>> find_param_path_info() are uncontroversial.  Regarding the changes to
>> adjust_appendrel_attrs(), my main question is whether we wouldn't be
>> better off using an array representation rather than a List
>> representation.  In other words, this function could take PlannerInfo
>> *root, Node *node, int nappinfos, AppendRelInfo **appinfos.  Existing
>> callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
>> just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
>> needing to allocate.  To make this work, adjust_child_relids() and
>> find_appinfos_by_relids() would need to be adjusted to use a similar
>> argument-passing convention.  I suspect this makes iterating over the
>> AppendRelInfos mildly faster, too, apart from the memory savings.
>
> Still regarding 0002, looking at adjust_appendrel_attrs_multilevel,
> could we have a common code path for the baserel and joinrel cases?
> It seems like perhaps you could just loop over root->append_rel_list.
> For each appinfo, if (bms_is_member(appinfo->child_relid,
> child_rel->relids)) bms_add_member(parent_relids,
> appinfo->parent_relid).
>
> This implementation would have some loss of efficiency in the
> single-rel case because we'd scan all of the AppendRelInfos in the
> list even if there's only one relid.  But you could fix that by
> writing it like this:
>
> foreach (lc, root->append_rel_list)
> {
>     if (bms_is_member(appinfo->child_relid, child_rel->relids))
>     {
>         bms_add_member(parent_relids, appinfo->parent_relid);
>         if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
>             break;    /* only one relid to find, and we've found it */
>     }
> }
> Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids));
>
> That seems pretty slick.  It is just as fast as the current
> implementation for the single-rel case.  It allocates no memory
> (unlike what you've got now).  And it handles the joinrel case using
> essentially the same code as the simple rel case.

I got rid of those differences completely by using a trick similar to
adjust_child_relids_multilevel(), which uses top_parent_relids instead
of rel->reloptkind to decide whether we have reached the top parent.
The top parent relids can trickle down from the topmost caller to any
depth of recursion. This also avoids any call to find_*_rel(), which
was the main reason we had different code paths for base and join
relations.
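
The resulting recursion looks roughly like this (a simplified sketch of
the idea, not necessarily the exact code in the patch):

static Node *
adjust_appendrel_attrs_multilevel(PlannerInfo *root, Node *node,
                                  Relids child_relids,
                                  Relids top_parent_relids)
{
    AppendRelInfo **appinfos;
    Relids          parent_relids = NULL;
    int             nappinfos;
    int             cnt;

    appinfos = find_appinfos_by_relids(root, child_relids, &nappinfos);

    /* Collect the relids of the immediate parent(s) of the given child. */
    for (cnt = 0; cnt < nappinfos; cnt++)
        parent_relids = bms_add_member(parent_relids,
                                       appinfos[cnt]->parent_relid);

    /* If the immediate parent is not the top parent, translate to it first. */
    if (!bms_equal(parent_relids, top_parent_relids))
        node = adjust_appendrel_attrs_multilevel(root, node, parent_relids,
                                                 top_parent_relids);

    /* Finally translate from the immediate parent to the given child. */
    return adjust_appendrel_attrs(root, node, nappinfos, appinfos);
}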

>
> In 0003, it seems that it would be more consistent with what you did
> elsewhere if the last argument to allow_star_schema_join were named
> inner_paramrels rather than innerparams.  Other than that, I don't see
> anything to complain about.

I had used the same name as the local variable declared with the same
purpose. But this change looks good. Done.

>
> In 0004:
>
> +                                       Assert(!rel->part_rels[cnt_parts]);
> +                                       rel->part_rels[cnt_parts] = childrel;
>
> break here?

Right, done.

>
> +static void
> +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
> +                                                       Relation
> relation, bool inhparent)
> +{
> +       /* No partitioning information for an unpartitioned relation. */
> +       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
> +               !inhparent ||
>
> I still think the inhparent check should be moved to the caller.

Done.

>
> In 0005:
>
> + *             Returns a list of the RT indexes of the partitioned
> child relations
> + *             with any of joining relations' rti as the root parent RT index.
>
> I found this wording confusing.  Maybe: Build and return a list
> containing the RTI of every partitioned relation which is a child of
> some rel included in the join.

This is better. Thanks. Done.

>
> + * Note: Only call this function on joins between partitioned tables.
>
> Or what, the boogeyman will come and get you?
>
> (In other words, I don't think that's a very informative comment.)

I mimicked the prologue of the earlier function. I guess the similar
comment in that function's prologue is there because, if we use
something other than a partitioned table there, the assertion at the
end of that function would trip. Similarly, for this function, the
assertion at the end will trip if we use it for something other than a
join relation.

PFA patches rebased on f90d23d0c51895e0d7db7910538e85d3d38691f0.

>
> I don't think 0011 is likely to be acceptable in current form.  I
> can't imagine that we just went to the trouble of getting rid of
> AppendRelInfos for child partitioned rels only to turn around and put
> them back again.  If you just need the parent-child mappings, you can
> get that from the PartitionedChildRelInfo list.

I will reply to this separately.

>
> Unfortunately, I don't think we're likely to be able to get this whole
> patch series into a committable form in the next few days, but I'd
> like to keep reviewing it and working with you on it; there's always
> next cycle.

Thanks for all your efforts in reviewing the patches and for your
excellent suggestions to improve the patches.

As I have stated earlier, it will help if we can get 0001 committed,
and maybe 0002. 0004 introduces the concept of a partitioning scheme,
which seems to be vital for partition-wise aggregation, partition
pruning and maybe the sorting optimization discussed in [1]. If we are
able to commit it in this commitfest, the patches for those
optimizations can take advantage of the partitioning scheme. I
understand it's really close to the end of this commitfest and we may
not be able to commit even those patches.

[1] https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg308742.html

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively)partitioned tables

From
Ashutosh Bapat
Date:
> On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think 0011 is likely to be acceptable in current form.  I
> can't imagine that we just went to the trouble of getting rid of
> AppendRelInfos for child partitioned rels only to turn around and put
> them back again.  If you just need the parent-child mappings, you can
> get that from the PartitionedChildRelInfo list.
>

Please refer to my earlier mails on this subject [1], [2]. For
multi-level partition-wise join, we need the RelOptInfo of a
partitioned table to contain the RelOptInfos of its immediate
partitions. I have not seen any counter-argument against creating
RelOptInfos for intermediate partitioned tables. We create child
RelOptInfos only for entries in root->append_rel_list, i.e. only for
those relations which have an AppendRelInfo. Since we are not creating
AppendRelInfos for partitioned partitions, we do not create
RelOptInfos for those. So, to me it looks like we either have to have
AppendRelInfos for partitioned partitions or create RelOptInfos by
traversing some other list, like the PartitionedChildRelInfo list. It
looks odd to walk root->append_rel_list as well as this new list for
creating RelOptInfos. But assume for a moment that we do walk this
other list. That other list is also lossy: it stores only the topmost
parent of any of the partitioned partitions, not the immediate parent
that is required to add RelOptInfos of immediate children to the
RelOptInfo of a parent.

Coming back to the point of using the PartitionedChildRelInfo list as
a way to maintain the parent-child relationship: all the code assumes
that the parent-child relationship is stored in AppendRelInfos linked
into root->append_rel_list, and walks that list to find the children
of a given parent or the parent(s) of a given child. We would have to
modify all those places to traverse two lists instead of one. Some of
those even return an AppendRelInfo structure, and they would now
sometimes return an AppendRelInfo and sometimes a
PartitionedChildRelInfo. That looks ugly.

Consider a case where P has partitions p1 and p2, which in turn have
partitions p11, p12 and p21, p22 resp. Another partitioned table Q has
partitions q1, q2. q1 is further partitioned into q11, q12 but q2 is
not partitioned. The partitioning scheme of P and Q matches. Also,
partitioning scheme of p1 and q1 matches. So, a partition-wise join
between P and Q would look like P J Q = append (p11 J q11, p12 J q12,
p2 J q2), p2 J q2 being append(p21, p22) J q2. When constructing the
restrictlist (and other clauses) for p2 J q2 we need to translate the
restrictlist applicable for P J Q. This translation requires
the AppendRelInfo of p2, which does not exist today. We cannot use
PartitionedChildRelInfo because it doesn't have information about how
to translate Vars of P to those of p2.

I don't see a way to avoid creating AppendRelInfos for partitioned partitions.

[1] https://www.postgresql.org/message-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT%3DuQ%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively)partitioned tables

From
Amit Langote
Date:
On 2017/03/30 18:35, Ashutosh Bapat wrote:
>> On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think 0011 is likely to be acceptable in current form.  I
>> can't imagine that we just went to the trouble of getting rid of
>> AppendRelInfos for child partitioned rels only to turn around and put
>> them back again.  If you just need the parent-child mappings, you can
>> get that from the PartitionedChildRelInfo list.
>>
> 
> Please refer to my earlier mails on this subject [1], [2]. For
> multi-level partition-wise join, we need RelOptInfo of a partitioned
> table to contain RelOptInfo of its immediate partitions. I have not
> seen any counter arguments not to create RelOptInfos for intermediate
> partitioned tables. We create child RelOptInfos only for entries in
> root->append_rel_list i.e. only for those relations which have an
> AppendRelInfo. Since we are not creating AppendRelInfos for
> partitioned partitions, we do not create RelOptInfos for those. So, to
> me it looks like we have to either have AppendRelInfos for partitioned
> partitions or create RelOptInfos by traversing some other list like
> PartitionedChildRelInfo list. It looks odd to walk
> root->append_rel_list as well as this new list for creating
> RelOptInfos. But for a moment, we assume that we have to walk this
> other list. But then that other list is also lossy. It stores only the
> topmost parent of any of the partitioned partitions and not the
> immediate parent as required to add RelOptInfos of immediate children
> to the RelOptInfo of a parent.

So, because we want to create an Append path for each partitioned table in
a tree separately, we'll need RelOptInfo for each one, which in turn
requires an AppendRelInfo.  Note that we do that only for those
partitioned child RTEs that have inh set to true, so that all the later
stages will treat it as the parent rel to create an Append path for.
There would still be partitioned child RTEs with inh set to false for
which, just like before, no AppendRelInfos and RelOptInfos are created;
they get added as the only member of partitioned_rels in the
PartitionedChildRelInfo of each partitioned table.  Finally, when the
Append path for the root parent is created, its subpaths list will contain
paths of leaf partitions of all levels and its partitioned_rels list
should contain the RT indexes of partitioned tables of all levels.

If we have the following partition tree:
       A
     / | \
    B  C  D
      / \
     E   F

The following RTEs will be created, in that order.  RTEs with inh=true are
shown with suffix _i.  RTEs that get an AppendRelInfo (& a RelOptInfo) are
shown with suffix _a.

A_i_a
A
B_a
C_i_a
C
E_a
F_a
D_a

Thanks,
Amit





Re: Partition-wise join for join between (declaratively)partitioned tables

From
Robert Haas
Date:
On Thu, Mar 30, 2017 at 6:32 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> So, because we want to create an Append path for each partitioned table in
> a tree separately, we'll need RelOptInfo for each one, which in turn
> requires an AppendRelInfo.

Hmm.  It's no more desirable to have an Append inside of another
Append with partitionwise join than it is in general.  If we've got A
partitioned into A1, A2, A3 and similarly B partitioned into B1, B2,
and B3, and then A1 and B1 are further partitioned into A1a, A1b, B1a,
B1b, then a partitionwise join between the tables ought to end up
looking like this:

Append
-> Join (A1a, B1a)
-> Join (A1b, B1b)
-> Join (A2, B2)
-> Join (A3, B3)

So here we don't actually end up with an append-path for A1-B1
anywhere.  But you might need that in more complex cases, I guess,
because suppose you now join this to C with partitions C1, C2, C3; but
C1 is not sub-partitioned.  Then you might end up with a plan like:

Append
-> Join
   -> Append
      -> Join (A1a, B1a)
      -> Join (A1b, B1b)
   -> Scan C1
-> Join ((A2, B2), C2)
-> Join ((A3, B3), C3)

So maybe you're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively)partitioned tables

From
Robert Haas
Date:
On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Done.

Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds.  It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.

What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme.  For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5.  With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C).  That seems pretty unsatisfying.  If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front.  Then, each RelOptInfo can have a
separate list of bounds, like this:

A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5

Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join can't produce any rows and doesn't
need to be scanned.  In that case, the RelOptInfos for (A B), (A C),
(B C), and (A B C) will all end up with a bound list of 1, 2.
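
Treating those bound lists as sets of matching partition indexes, the
merge step could be sketched like this (names hypothetical, just to
illustrate the union-vs-intersection point):

static Bitmapset *
merge_matching_partitions(Bitmapset *outer_parts, Bitmapset *inner_parts,
                          JoinType jointype)
{
    /*
     * For an inner join, a partition unmatched on either side cannot
     * produce any rows, so intersect.  Otherwise take the union, as in
     * the bound lists above (a one-sided outer join could be narrowed
     * further, but that is beside the point here).
     */
    if (jointype == JOIN_INNER)
        return bms_intersect(outer_parts, inner_parts);
    else
        return bms_union(outer_parts, inner_parts);
}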

A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins.  For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join?  We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match.  But it doesn't seem
necessary to require an exact type or typmod match.  In many ways this
seems a whole lot like what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively)partitioned tables

From
Ashutosh Bapat
Date:


On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Done.

Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds.  It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.

What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme.  For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5.  With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C).  That seems pretty unsatisfying.  If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front.  Then, each RelOptInfo can have a
separate list of bounds, like this:

A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5

Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join, then that partition can't produce
any rows and doesn't need to be scanned.  In that case, the
RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
with a bound list of 1, 2.

I have separated partition bounds from the partition scheme. The patch adds build_joinrel_partition_bounds() to calculate the bounds of the join relation and the pairs of matching partitions from the joining relations. For now the function just checks whether both relations have the same bounds and returns the bounds of the first one. But in future, we will expand this function to merge the partition bounds of the joining relations and return pairs of matching partitions which, when joined, form the partitions of the join according to the merged partition bounds.
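
In rough pseudo-code, the behaviour described above is (a sketch only; the field names, the actual signature and the bound-comparison helper in the patch may differ):

static PartitionBoundInfo
build_joinrel_partition_bounds(RelOptInfo *outer_rel, RelOptInfo *inner_rel)
{
    /*
     * For now only the trivial case is handled: both inputs must have
     * exactly the same bounds ("bounds_equal" stands in for whatever
     * comparison the patch actually performs).
     */
    if (!bounds_equal(outer_rel->boundinfo, inner_rel->boundinfo))
        return NULL;            /* bounds differ: no partition-wise join */

    /* Same bounds on both sides, so simply reuse the outer side's. */
    return outer_rel->boundinfo;
}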

Also, moved the code to collect partition RelOptInfos from set_append_rel_size() to build_simple_rel(), so everything related to partitioning gets set in that function for a base relation.

I think we should rename the partition scheme to PartitionKeyOptInfo and club partition bounds, nparts and part_rels together as PartitionDescOptInfo. But I haven't done that in this patch yet.
 

A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins.  For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join?  We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match.  But it doesn't seem
necessary to require an exact type or typmod match.  In many ways this
seems a whole lot like the what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.


Yes, I agree. For an inner join, the partition key types need to "shrink" and for an outer join they need to be "widened". I don't know if there is a way to determine the "wider" or "shorter" of two given types. We might have to implement a method to merge partition keys to produce the partition key of the join, which may be different from either of the input partition keys. So, after all, we may have to abandon the idea of a canonical partition scheme. I haven't included this change in the attached set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment

Re: Partition-wise join for join between (declaratively)partitioned tables

From
Ashutosh Bapat
Date:
Somehow I sent the old patch set again. Here's the real v17.

On Tue, Apr 4, 2017 at 7:52 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
>
> On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>> > Done.
>>
>> Ashutosh and I spent several hours discussing this patch set today.
>> I'm starting to become concerned about the fact that 0004 makes the
>> partition bounds part of the PartitionScheme, because that means you
>> can't do a partition-wise join between two tables that have any
>> difference at all in the partition bounds.  It might be possible in
>> the future to introduce a notion of a compatible partition scheme, so
>> that you could say, OK, well, these two partition schemes are not
>> quite the same, but they are compatible, and we'll make a new
>> partition scheme for whatever results from reconciling them.
>>
>> What I think *may* be better is to consider the partition bound
>> information as a property of the RelOptInfo rather than the
>> PartitionScheme.  For example, suppose we're joining A with partitions
>> A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
>> partitions C1, C2, and C5.  With the current approach, we end up with
>> a PartitionScheme for each baserel and, not in this patch set but
>> maybe eventually, a separate PartitionScheme for each of (A B), (A C),
>> (B C), and (A B C).  That seems pretty unsatisfying.  If we consider
>> the PartitionScheme to only include the question of whether we're
>> doing a join on the partition keys, then if the join includes WHERE
>> a.key = b.key AND b.key = c.key, we can say that they all have the
>> same PartitionScheme up front.  Then, each RelOptInfo can have a
>> separate list of bounds, like this:
>>
>> A: 1, 2, 4
>> B: 1, 2, 3
>> C: 1, 2, 5
>> A B: 1, 2, 3, 4
>> A C: 1, 2, 4, 5
>> B C: 1, 2, 3, 5
>> A B C: 1, 2, 3, 4, 5
>>
>> Or if it's an inner join, then instead of taking the union at each
>> level, we can take the intersection, because any partition without a
>> match on the other side of the join, then that partition can't produce
>> any rows and doesn't need to be scanned.  In that case, the
>> RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
>> with a bound list of 1, 2.
>
>
> I have separated partition bounds from partition scheme. The patch adds
> build_joinrel_partition_bounds() to calculate the bounds of the join
> relation and the pairs of matching partitions from the joining relation. For
> now the function just check whether both the relations have same bounds and
> returns the bounds of the first one. But in future, we will expand this
> function to merge partition bounds from the joining relation and return
> pairs of matching partitions which when joined form the partitions of the
> join according to the merged partition bounds.
>
> Also, moved the code to collect partition RelOptInfos from
> set_append_rel_size() to build_simple_rel(), so everything related to
> partitioning gets set in that function for a base relation.
>
> I think, we should rename partition scheme as PartitionKeyOptInfo and club
> partition bounds, nparts and part_rels as PartitionDescOptInfo. But I
> haven't done that in this patch yet.
>
>>
>>
>> A related question (that I did not discuss with Ashutosh, but occurred
>> to me later) is whether the PartitionScheme ought to worry about
>> cross-type joins.  For instance, if A is partitioned on an int4 column
>> and B is partitioned on an int8 column, and they are joined on their
>> respective partitioning columns, can't we still do a partition-wise
>> join?  We do need to require that the operator family of the operator
>> actually used in the query, the operator family used to partition the
>> inner table, and the operator family used to partition the other table
>> all match; and the collation used for the comparison in the query, the
>> collation used to partition the outer table, and the collation used to
>> partition the inner table must all match.  But it doesn't seem
>> necessary to require an exact type or typmod match.  In many ways this
>> seems a whole lot like the what we test when building equivalence
>> classes (cf. process_equivalence) although I'm not sure that we can
>> leverage that in any useful way.
>>
>
> Yes, I agree. For an inner join, the partition key types need to "shrink"
> and for outer join they need to be "widened". I don't know if there is a way
> to know "wider" or "shorter" of two given types. We might have to implement
> a method to merge partition keys to produce partition key of the join, which
> may be different from either of the partition keys. So, after-all we may
> have to abandon the idea of canonical partition scheme. I haven't included
> this change in the attached set of patches.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Partition-wise join for join between (declaratively)partitioned tables

From
Robert Haas
Date:
On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Yes, I agree. For an inner join, the partition key types need to "shrink"
> and for outer join they need to be "widened". I don't know if there is a way
> to know "wider" or "shorter" of two given types. We might have to implement
> a method to merge partition keys to produce partition key of the join, which
> may be different from either of the partition keys. So, after-all we may
> have to abandon the idea of canonical partition scheme. I haven't included
> this change in the attached set of patches.

I think this is why you need to regard the partitioning scheme as
something more like an equivalence class - possibly the partitioning
scheme should actually contain (or be?) an equivalence class.  Suppose
this is the query:

SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;

...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
partitioning key.  It's meaningless to ask whether the result of the
join is partitioned by int4 or int8.  It's partitioned by the
equivalence class that contains both i4.x and i8.x.  If the result of
this join were joined to another table on either of those two
columns, a second partition-wise join would be theoretically possible.
If you insist on knowing the type of the partitioning scheme, rather
than just the opfamily, you've boxed yourself into a corner from which
there's no good escape.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Partition-wise join for join between (declaratively)partitioned tables

From
Ashutosh Bapat
Date:
On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Yes, I agree. For an inner join, the partition key types need to "shrink"
>> and for outer join they need to be "widened". I don't know if there is a way
>> to know "wider" or "shorter" of two given types. We might have to implement
>> a method to merge partition keys to produce partition key of the join, which
>> may be different from either of the partition keys. So, after-all we may
>> have to abandon the idea of canonical partition scheme. I haven't included
>> this change in the attached set of patches.
>
> I think this is why you need to regard the partitioning scheme as
> something more like an equivalence class - possibly the partitioning
> scheme should actually contain (or be?) an equivalence class.  Suppose
> this is the query:
>
> SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;
>
> ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
> partitioning key.  It's meaningless to ask whether the result of the
> join is partitioned by int4 or int8.  It's partitioned by the
> equivalence class that contains both i4.x and i8.x.  If the result of
> this join where joined to another table on either of those two
> columns, a second partition-wise join would be theoretically possible.
> If you insist on knowing the type of the partitioning scheme, rather
> than just the opfamily, you've boxed yourself into a corner from which
> there's no good escape.

Only inner join conditions have equivalence classes associated with
them. Outer join conditions create single-element equivalence classes.
So, we cannot associate equivalence classes as they are with the
partition scheme. If we could do that, it would make life much easier,
since checking whether an equi-join between all partition keys exists
would simply be a matter of looking up the equivalence classes that
cover the joining relations and finding the em_member corresponding to
each partition key.

It looks like we should only keep strategy, partnatts, partopfamily
and parttypcoll in PartitionScheme. A partition-wise join between two
relations would be possible if all of those match. When matching
partition bounds of joining relations, we should rely on partopfamily
to give us a comparison function based on the types of the partition
keys being joined. In that context it looks like the partition bound
comparison functions which accept a partition key were not written
keeping this use case in mind. They will need to be rewritten to
accept strategy, partnatts, partopfamily and parttypcoll.
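
That is, something along these lines (a sketch; the field names follow
the list above, the member types are my assumption):

typedef struct PartitionSchemeData
{
    char    strategy;        /* partitioning strategy */
    int16   partnatts;       /* number of partition key columns */
    Oid    *partopfamily;    /* operator family OID per key column */
    Oid    *parttypcoll;     /* collation OID per key column */
} PartitionSchemeData;

typedef struct PartitionSchemeData *PartitionScheme;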

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on the design):
+   /*
+    * Construct partition keys for the join.
+    *
+    * An INNER join between two partitioned relations is partition by key
+    * expressions from both the relations. For tables A and B
partitioned by a and b
+    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+    * and B.b.
+    *
+    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+    * B.b NULL. These rows may not fit the partitioning conditions imposed on
+    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+    * Strictly speaking, partition keys of an OUTER join should include
+    * partition key expressions from the OUTER side only. Consider a join like
+    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+    * include B.b as partition key expression for (AB), it prohibits us from
+    * using partition-wise join when joining (AB) with C as there is no
+    * equi-join between partition keys of joining relations. But two NULL
+    * values are never equal and no two rows from mis-matching partitions can
+    * join. Hence it's safe to include B.b as partition key expression for
+    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+    */

I think that also needs to be reviewed carefully. Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, especially when pushing complete
aggregation down to partitions. In that case, rows with a NULL
partition key, which fall on the nullable side of the join, will be
spread across multiple partitions. Probably, we should separate
nullable and non-nullable partition key expressions.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Partition-wise join for join between (declaratively)partitioned tables

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Only inner join conditions have equivalence classes associated with
> those. Outer join conditions create single element equivalence
> classes. So, we can not associate equivalence classes as they are with
> partition scheme. If we could do that, it makes life much easier since
> checking whether equi-join between all partition keys exist, is simply
> looking up equivalence classes that cover joining relations and find
> em_member corresponding to partition keys.

OK.

> It looks like we should only keep strategy, partnatts, partopfamily
> and parttypcoll in PartitionScheme. A partition-wise join between two
> relations would be possible if all those match.

Yes, I think so. Conceivably you could even exclude partnatts and
strategy, since there's nothing preventing a partitionwise join
between a list-partitioned table and a range-partitioned table, or
between a table range-partitioned on (a) and another range-partitioned
on (a, b), but there is probably not much benefit in trying to cover
such cases.  I think it's reasonable to tell users that this is only
going to work when the partitioning strategy is the same and the join
conditions include all of the partitioning columns on both sides.

> There's a relevant comment in 0006, build_joinrel_partition_info()
> (probably that name needs to change, but I will do that once we have
> settled on design)
> +   /*
> +    * Construct partition keys for the join.
> +    *
> +    * An INNER join between two partitioned relations is partition by key
> +    * expressions from both the relations. For tables A and B
> partitioned by a and b
> +    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
> +    * and B.b.
> +    *
> +    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
> +    * B.b NULL. These rows may not fit the partitioning conditions imposed on
> +    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
> +    * Strictly speaking, partition keys of an OUTER join should include
> +    * partition key expressions from the OUTER side only. Consider a join like
> +    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
> +    * include B.b as partition key expression for (AB), it prohibits us from
> +    * using partition-wise join when joining (AB) with C as there is no
> +    * equi-join between partition keys of joining relations. But two NULL
> +    * values are never equal and no two rows from mis-matching partitions can
> +    * join. Hence it's safe to include B.b as partition key expression for
> +    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
> +    */
>
> I think that also needs to be reviewed carefully.

The following passage from src/backend/optimizer/README seems highly relevant:

===
The planner's treatment of outer join reordering is based on the following
identities:

1.      (A leftjoin B on (Pab)) innerjoin C on (Pac)
        = (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2.      (A leftjoin B on (Pab)) leftjoin C on (Pac)
        = (A leftjoin C on (Pac)) leftjoin B on (Pab)

3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
        = A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B).  If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.
===

In other words, I think your statement that null is never equal to
null is a bit imprecise.  Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator).  The
argument needs to be made in terms of the formal properties of the
operator.  The relevant logic is in have_partkey_equi_join:

+               /* Skip clauses which are not equality conditions. */
+               if (rinfo->hashjoinoperator == InvalidOid &&
!rinfo->mergeopfamilies)
+                       continue;

Actually, I think the hashjoinoperator test is formally and
practically unnecessary here; lower down there is a test to see
whether the partitioning scheme's operator family is a member of
rinfo->mergeopfamilies, which will certainly fail if we got through
this test with rinfo->mergeopfamilies == NIL just on the strength of
rinfo->hashjoinoperator != InvalidOid.  So you can just bail out if
rinfo->mergeopfamilies == NIL.  But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.
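
For reference, with that simplification the clause screening could look
roughly like this (a sketch; the surrounding loop and the variable
names are placeholders):

foreach(lc, joininfo)
{
    RestrictInfo *rinfo = lfirst(lc);

    /* Skip clauses which are not mergejoinable equality conditions. */
    if (rinfo->mergeopfamilies == NIL)
        continue;

    /*
     * The partitioning opfamily must be among the clause's
     * mergeopfamilies.  A clause that survived the old test only
     * because of hashjoinoperator (with mergeopfamilies == NIL) would
     * fail here anyway, which is why that separate test is redundant.
     */
    if (!list_member_oid(rinfo->mergeopfamilies,
                         part_scheme->partopfamily[ipk]))
        continue;

    /* ... remaining checks on the clause's operands go here ... */
}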

> Partition-wise joins
> may be happy including partition keys from all sides, but
> partition-wise aggregates may not be, esp. when pushing complete
> aggregation down to partitions. In that case, rows with NULL partition
> key, which falls on nullable side of join, will be spread across
> multiple partitions. Proabably, we should separate nullable and
> non-nullable partition key expressions.

I don't think I understand quite what you're getting at here.  Can you
spell this out in more detail?  To push an aggregate down to
partitions, you need the grouping key to match the applicable
partition key, and the partition key shouldn't allow nulls in more
than one place.  Now I think your point may be that outer join
semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
good test case for partitionwise aggregate.  I'd be inclined to think
that we should just give up on partitionwise aggregate in such cases;
it's not worth trying to optimize such a weird query, at least IMHO.
(Does this sort of case ever happen with joins?  I think not, as long
as the join operator is strict.)

I spent some time thinking about this patch set today and I don't see
that there's much point in committing any more of this to v10.  I
think that 0001 and 0002 are probably committable or very close at
this point.  However, 0001 is adding more complexity than I think is
warranted until we're actually ready to commit the feature that uses
it, and 0002 is so small that committing isn't really going to smooth
future development much.  0003-0009 are essentially all one big patch
that will have to be committed together.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> ... But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.

FWIW, I do not think that follows.  If you want to check that the
function is strict, check that explicitly.

It's very likely that in practice, all such functions are indeed strict,
but we don't have an assumption about that wired into the planner.
        regards, tom lane



On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Only inner join conditions have equivalence classes associated with
>> those. Outer join conditions create single element equivalence
>> classes. So, we can not associate equivalence classes as they are with
>> partition scheme. If we could do that, it makes life much easier since
>> checking whether equi-join between all partition keys exist, is simply
>> looking up equivalence classes that cover joining relations and find
>> em_member corresponding to partition keys.
>
> OK.
>
>> It looks like we should only keep strategy, partnatts, partopfamily
>> and parttypcoll in PartitionScheme. A partition-wise join between two
>> relations would be possible if all those match.
>
> Yes, I think so. Conceivably you could even exclude partnatts and
> strategy, since there's nothing preventing a partitionwise join
> between a list-partitioned table and a range-partitioned table, or
> between a table range-partitioned on (a) and another range-partitioned
> on (a, b), but there is probably not much benefit in trying to cover
> such cases.  I think it's reasonable to tell users that this is only
> going to work when the partitioning strategy is the same and the join
> conditions include all of the partitioning columns on both sides.
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> +   /*
>> +    * Construct partition keys for the join.
>> +    *
>> +    * An INNER join between two partitioned relations is partition by key
>> +    * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> +    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> +    * and B.b.
>> +    *
>> +    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> +    * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> +    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> +    * Strictly speaking, partition keys of an OUTER join should include
>> +    * partition key expressions from the OUTER side only. Consider a join like
>> +    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> +    * include B.b as partition key expression for (AB), it prohibits us from
>> +    * using partition-wise join when joining (AB) with C as there is no
>> +    * equi-join between partition keys of joining relations. But two NULL
>> +    * values are never equal and no two rows from mis-matching partitions can
>> +    * join. Hence it's safe to include B.b as partition key expression for
>> +    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> +    */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1.      (A leftjoin B on (Pab)) innerjoin C on (Pac)
>         = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2.      (A leftjoin B on (Pab)) leftjoin C on (Pac)
>         = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
>         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B).  If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.  The relevant logic is in have_partkey_equi_join:
>
> +               /* Skip clauses which are not equality conditions. */
> +               if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> +                       continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid.  So you can just bail out if
> rinfo->mergeopfamilies == NIL.  But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.

I need more time to think about this. Will get back to this soon.

>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here.  Can you
> spell this out in more detail?  To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place.  Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate.  I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins?  I think not, as long
> as the join operator is strict.)

Yes, this is the case I am thinking about. No, it doesn't happen with joins.

>
> I spent some time thinking about this patch set today and I don't see
> that there's much point in committing any more of this to v10.  I
> think that 0001 and 0002 are probably committable or very close at
> this point.  However, 0001 is adding more complexity than I think is
> warranted until we're actually ready to commit the feature that uses
> it, and 0002 is so small that committing isn't really going to smooth
> future development much.  0003-0009 are essentially all one big patch
> that will have to be committed together.

Ok. Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Yes, I agree. For an inner join, the partition key types need to "shrink"
>> and for outer join they need to be "widened". I don't know if there is a way
>> to know "wider" or "shorter" of two given types. We might have to implement
>> a method to merge partition keys to produce partition key of the join, which
>> may be different from either of the partition keys. So, after-all we may
>> have to abandon the idea of canonical partition scheme. I haven't included
>> this change in the attached set of patches.
>
> I think this is why you need to regard the partitioning scheme as
> something more like an equivalence class - possibly the partitioning
> scheme should actually contain (or be?) an equivalence class.  Suppose
> this is the query:
>
> SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;
>
> ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
> partitioning key.  It's meaningless to ask whether the result of the
> join is partitioned by int4 or int8.  It's partitioned by the
> equivalence class that contains both i4.x and i8.x.  If the result of
> this join where joined to another table on either of those two
> columns, a second partition-wise join would be theoretically possible.
> If you insist on knowing the type of the partitioning scheme, rather
> than just the opfamily, you've boxed yourself into a corner from which
> there's no good escape.

When we merge partition bounds from two relations with different
partition key types, the merged partition bounds need to carry some
information about what those constants look like, e.g. their length,
structure etc. That's the reason we need to store the partition key
types of the merged partitioning scheme. Consider a three-way join (i4
JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare the
partition bounds of i4 and i8, we use operators for int4 and int8. The
join i4 JOIN i8 will get partition bounds by merging those of i4 and
i8. When we come to the join with i2, we need to know which operators
to use for comparing the partition bounds of that join with those of
i2.

So, if the partition key types of the joining relations differ (but
they have matching partitioning schemes per strategy, natts and
operator family), the partition bounds of the join are converted to
the wider type among the partition key types of the joining tree.
Actually, as I explained earlier, we could choose the wider type for
an OUTER join and the "shorter" type for an inner join. This type is
used as the partition key type of the join. In the above case, the
join between i4 and i8 has its partition bounds converted to i8 (or
i4), and then when it is joined with i2, the partition bounds of the
join are converted to i8 (or i2).

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> When we merge partition bounds from two relations with different
> partition key types, the merged partition bounds need to have some
> information abound the way those constants look like e.g. their
> length, structure etc. That's the reason we need to store partition
> key types of merged partitioning scheme. Consider a three way join (i4
> JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare
> partition bounds of i4 and i8, we use operators for int4 and int8. The
> join i4 JOIN i8 will get partition bounds by merging those of i4 and
> i8. When we come to join with i2, we need to know which operators to
> use for comparing the partition bounds of the join with those of i2.
>
> So, if the partition key types of the joining relations differ (but
> they have matching partitioning schemes per strategy, natts and
> operator family) the partition bounds of the join are converted to the
> wider type among the partition key types of the joining tree.
> Actually, as I am explained earlier we could choose a wider outer type
> for an OUTER join and shorter type for inner join. This type is used
> as partition key type of the join. In the above case join between i4
> and i8 have its partition bounds converted to i8 (or i4) and then when
> it is joined with i2 the partition bounds of the join are converted to
> i8 (or i2).

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine.  If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

I think the choice of operator depends only on the column types, and
that the "width" of those types has nothing to do with it.  For
example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
operator for an A/B join or a B/C join will be the one that appears in
the query; parse analysis will have identified which specific operator
is meant based on the types of the columns.  If the optimizer
subsequently decides to reorder the joins and perform the A/C join
first, it will go hunt down the operator with the same strategy number
in the same operator family that takes the type of A.x on one side and
the type of C.x on the other side.  No problem.  A partition-wise join
between A and C will use that same operator; again, no problem.

Your example involves joining the output of a join between i4 and i8
against i2, so it seems there is some ambiguity about what the input
type should be.  But, again, the planner already copes with this
problem.  In fact, the join is performed either using i4.x or i8.x --
I don't know what happens, or whether it depends on other details of
the query or the plan -- and the operator which can accept that value
on one side and i2.x on the other side is the one that gets used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> I don't understand why you think that partition-wise join needs any
> new logic here; if this were a non-partitionwise join, we'd similarly
> need to use the correct operator, but the existing code handles that
> just fine.  If the join is performed partition-wise, it should use the
> same operators that would have been used by a non-partitionwise join
> between the same tables.

More to the point, the appropriate operator was chosen by parse analysis.
The planner has *zero* flexibility as to which operator is involved.

BTW, I remain totally mystified as to what people think the semantics of
partitioning ought to be.  Child columns can have a different type from
parent columns?  Really?  Why is this even under discussion?  We don't
allow that in old-school inheritance, and I cannot imagine a rational
argument why partitioning should allow it.
        regards, tom lane



On Thu, Apr 20, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>
> BTW, I remain totally mystified as to what people think the semantics of
> partitioning ought to be.  Child columns can have a different type from
> parent columns?  Really?  Why is this even under discussion?  We don't
> allow that in old-school inheritance, and I cannot imagine a rational
> argument why partitioning should allow it.
>

No, we aren't doing that. We are discussing here how to represent the
partition bounds of the top-level join and all the intermediate joins
between A, B and C, which are partitioned tables with different
partition key types. We are not discussing the column types of the
children, whether of a join or of a simple relation.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> When we merge partition bounds from two relations with different
>> partition key types, the merged partition bounds need to have some
>> information abound the way those constants look like e.g. their
>> length, structure etc. That's the reason we need to store partition
>> key types of merged partitioning scheme. Consider a three way join (i4
>> JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare
>> partition bounds of i4 and i8, we use operators for int4 and int8. The
>> join i4 JOIN i8 will get partition bounds by merging those of i4 and
>> i8. When we come to join with i2, we need to know which operators to
>> use for comparing the partition bounds of the join with those of i2.
>>
>> So, if the partition key types of the joining relations differ (but
>> they have matching partitioning schemes per strategy, natts and
>> operator family) the partition bounds of the join are converted to the
>> wider type among the partition key types of the joining tree.
>> Actually, as I am explained earlier we could choose a wider outer type
>> for an OUTER join and shorter type for inner join. This type is used
>> as partition key type of the join. In the above case join between i4
>> and i8 have its partition bounds converted to i8 (or i4) and then when
>> it is joined with i2 the partition bounds of the join are converted to
>> i8 (or i2).
>
> I don't understand why you think that partition-wise join needs any
> new logic here; if this were a non-partitionwise join, we'd similarly
> need to use the correct operator, but the existing code handles that
> just fine.  If the join is performed partition-wise, it should use the
> same operators that would have been used by a non-partitionwise join
> between the same tables.
>
> I think the choice of operator depends only on the column types, and
> that the "width" of those types has nothing to do with it.  For
> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
> operator for an A/B join or a B/C join will be the one that appears in
> the query; parse analysis will have identified which specific operator
> is meant based on the types of the columns.  If the optimizer
> subsequently decides to reorder the joins and perform the A/C join
> first, it will go hunt down the operator with the same strategy number
> in the same operator family that takes the type of A.x on one side and
> the type of C.x on the other side.  No problem.  A partition-wise join
> between A and C will use that same operator; again, no problem.
>
> Your example involves joining the output of a join between i4 and i8
> against i2, so it seems there is some ambiguity about what the input
> type should be.  But, again, the planner already copes with this
> problem.  In fact, the join is performed either using i4.x or i8.x --
> I don't know what happens, or whether it depends on other details of
> the query or the plan -- and the operator which can accept that value
> on one side and i2.x on the other side is the one that gets used.

I think you are confusing join condition application and partition
bounds of a join relation. What you have described above is how
operators are chosen to apply join conditions - it picks up the
correct operator from the operator family based on the column types
being used in join condition. That it can do because the columns being
joined are both present the relations being joined, irrespective of
which pair of relations is being joined. In your example, A.x, B.x and
C.x are all present on one of the sides of join irrespective of
whether the join is executed as (AB)C, A(BC) or (AC)B.

But the problem we are trying to solve here is about the partition
bounds of the join relation: what should be the partition bounds of
AB, BC or AC? When we compare the partition bounds of an intermediate
join with those of another relation (e.g. AB with those of C), what
operator should be used? You seem to be suggesting that we keep as
many sets of partition bounds as there are base relations
participating in the join and then use the appropriate partition
bounds based on the columns in the join conditions, so that we can use
the same operator as used in the join condition. That doesn't seem to
be a good option, since the partition bounds will all have the same
values, only differing in their binary representation because of
differences in data types. I am of the opinion that we save a single
set of partition bounds. We then have to associate a data type with
those bounds to know the binary representation of the partition bound
datums. That datatype would be one of the partition key types of the
joining relations. I may be wrong in using the term "wider", since
it's associated with the length of the binary representation. But we
need some logic to coalesce the two data types based on the type of
join and the key type on the outer side.
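
For instance, the kind of setup I have in mind looks something like
this (table and column names are just illustrative):

create table i2 (x int2, payload text) partition by range (x);
create table i2_p1 partition of i2 for values from (0) to (100);
create table i2_p2 partition of i2 for values from (100) to (200);

create table i4 (x int4, payload text) partition by range (x);
create table i4_p1 partition of i4 for values from (0) to (100);
create table i4_p2 partition of i4 for values from (100) to (200);

create table i8 (x int8, payload text) partition by range (x);
create table i8_p1 partition of i8 for values from (0) to (100);
create table i8_p2 partition of i8 for values from (100) to (200);

-- bound values 0, 100, 200 are equal across all three tables, but they
-- are stored as int2, int4 and int8 datums respectively, so the bounds
-- of the (i4 JOIN i8) joinrel need an associated type before they can
-- be compared with the bounds of i2
select *
from (i4 join i8 on i4.x = i8.x)
join i2 on i2.x = i4.x;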

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On 2017/04/20 15:45, Ashutosh Bapat wrote:
> On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't understand why you think that partition-wise join needs any
>> new logic here; if this were a non-partitionwise join, we'd similarly
>> need to use the correct operator, but the existing code handles that
>> just fine.  If the join is performed partition-wise, it should use the
>> same operators that would have been used by a non-partitionwise join
>> between the same tables.
>>
>> I think the choice of operator depends only on the column types, and
>> that the "width" of those types has nothing to do with it.  For
>> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
>> operator for an A/B join or a B/C join will be the one that appears in
>> the query; parse analysis will have identified which specific operator
>> is meant based on the types of the columns.  If the optimizer
>> subsequently decides to reorder the joins and perform the A/C join
>> first, it will go hunt down the operator with the same strategy number
>> in the same operator family that takes the type of A.x on one side and
>> the type of C.x on the other side.  No problem.  A partition-wise join
>> between A and C will use that same operator; again, no problem.
>>
>> Your example involves joining the output of a join between i4 and i8
>> against i2, so it seems there is some ambiguity about what the input
>> type should be.  But, again, the planner already copes with this
>> problem.  In fact, the join is performed either using i4.x or i8.x --
>> I don't know what happens, or whether it depends on other details of
>> the query or the plan -- and the operator which can accept that value
>> on one side and i2.x on the other side is the one that gets used.
> 
> I think you are confusing join condition application and partition
> bounds of a join relation. What you have described above is how
> operators are chosen to apply join conditions - it picks up the
> correct operator from the operator family based on the column types
> being used in join condition. That it can do because the columns being
> joined are both present the relations being joined, irrespective of
> which pair of relations is being joined. In your example, A.x, B.x and
> C.x are all present on one of the sides of join irrespective of
> whether the join is executed as (AB)C, A(BC) or (AC)B.
> 
> But the problem we are trying to solve here about partition bounds of
> the join relation: what should be the partition bounds of AB, BC or
> AC? When we compare partition bounds of and intermediate join with
> other intermediate join (e.g. AB with those of C) what operator should
> be used? You seem to be suggesting that we keep as many sets of
> partition bounds as there are base relations participating in the join
> and then use appropriate partition bounds based on the columns in the
> join conditions, so that we can use the same operator as used in the
> join condition. That doesn't seem to be a good option since the
> partition bounds will all have same values, only differing in their
> binary representation because of differences in data types. I am of
> the opinion that we save a single set of partition bounds. We have to
> then associate a data type with bounds to know binary representation
> of partition bound datums. That datatype would be one of the partition
> key types of joining relations. I may be wrong in using term "wider"
> since its associated with the length of binary reprentation. But we
> need some logic to coalesce the two data types based on the type of
> join and key type on the outer side.

FWIW, I think that using any one of the partition bounds of the baserels
being partitionwise-joined should suffice as the partition bound of any
combination of joins involving two or more of those baserels, as long as
the partitioning operator of each of the baserels is in the same operator
family (I guess that *is* checked somewhere in the partitionwise join
consideration flow).  IOW, partopfamily[] of all of the baserels should
match and then the join clause operators involved should belong to the
same respective operator families.

ISTM, the question here is about how to derive the partitioning properties
of joinrels from those of the baserels involved.  Even if the join
conditions refer to columns of different types on two sides, as long as
the partitioning and joining is known to occur using operators of
compatible semantics, I don't understand what more needs to be considered
or done.  Although, I haven't studied things in enough detail to say
anything confidently about whether join being INNER or OUTER has any
bearing on the semantics of the partitioning of the joinrels in question.
IIUC, using partitioning properties to apply partitionwise join technique
at successive join levels will be affected by the OUTER considerations
similar to how they affect at what levels a given EquivalenceClass clause
could be applied without causing any semantics violations.  As already
mentioned upthread, it would be a good idea to have some integration of
the partitioning considerations with the equivalence class mechanism (how
ForeignKeyOptInfo contains links to ECs comes to mind).

By the way, looking at match_expr_to_partition_keys() in your latest
patch, I wonder why not use an approach similar to calling
is_indexable_operator() that is used in match_clause_to_indexcol()?  Note
that is_indexable_operator() simply checks if clause->opno is in the index
key's operator family, as returned by op_in_opfamily().  Instead I see the
following:
       /*
        * The clause allows partition-wise join if only it uses the same
        * operator family as that specified by the partition key.
        */
       if (!list_member_oid(rinfo->mergeopfamilies,
                            part_scheme->partopfamily[ipk1]))
           continue;

But maybe I am missing something.

Thanks,
Amit




On Thu, Apr 20, 2017 at 3:35 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/04/20 15:45, Ashutosh Bapat wrote:
>> On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I don't understand why you think that partition-wise join needs any
>>> new logic here; if this were a non-partitionwise join, we'd similarly
>>> need to use the correct operator, but the existing code handles that
>>> just fine.  If the join is performed partition-wise, it should use the
>>> same operators that would have been used by a non-partitionwise join
>>> between the same tables.
>>>
>>> I think the choice of operator depends only on the column types, and
>>> that the "width" of those types has nothing to do with it.  For
>>> example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
>>> operator for an A/B join or a B/C join will be the one that appears in
>>> the query; parse analysis will have identified which specific operator
>>> is meant based on the types of the columns.  If the optimizer
>>> subsequently decides to reorder the joins and perform the A/C join
>>> first, it will go hunt down the operator with the same strategy number
>>> in the same operator family that takes the type of A.x on one side and
>>> the type of C.x on the other side.  No problem.  A partition-wise join
>>> between A and C will use that same operator; again, no problem.
>>>
>>> Your example involves joining the output of a join between i4 and i8
>>> against i2, so it seems there is some ambiguity about what the input
>>> type should be.  But, again, the planner already copes with this
>>> problem.  In fact, the join is performed either using i4.x or i8.x --
>>> I don't know what happens, or whether it depends on other details of
>>> the query or the plan -- and the operator which can accept that value
>>> on one side and i2.x on the other side is the one that gets used.
>>
>> I think you are confusing join condition application and partition
>> bounds of a join relation. What you have described above is how
>> operators are chosen to apply join conditions - it picks up the
>> correct operator from the operator family based on the column types
>> being used in join condition. That it can do because the columns being
>> joined are both present the relations being joined, irrespective of
>> which pair of relations is being joined. In your example, A.x, B.x and
>> C.x are all present on one of the sides of join irrespective of
>> whether the join is executed as (AB)C, A(BC) or (AC)B.
>>
>> But the problem we are trying to solve here about partition bounds of
>> the join relation: what should be the partition bounds of AB, BC or
>> AC? When we compare partition bounds of and intermediate join with
>> other intermediate join (e.g. AB with those of C) what operator should
>> be used? You seem to be suggesting that we keep as many sets of
>> partition bounds as there are base relations participating in the join
>> and then use appropriate partition bounds based on the columns in the
>> join conditions, so that we can use the same operator as used in the
>> join condition. That doesn't seem to be a good option since the
>> partition bounds will all have same values, only differing in their
>> binary representation because of differences in data types. I am of
>> the opinion that we save a single set of partition bounds. We have to
>> then associate a data type with bounds to know binary representation
>> of partition bound datums. That datatype would be one of the partition
>> key types of joining relations. I may be wrong in using term "wider"
>> since its associated with the length of binary reprentation. But we
>> need some logic to coalesce the two data types based on the type of
>> join and key type on the outer side.
>
> FWIW, I think that using any one of the partition bounds of the baserels
> being partitionwise-joined should suffice as the partition bound of any
> combination of joins involving two or more of those baserels, as long as
> the partitioning operator of each of the baserels is in the same operator
> family (I guess that *is* checked somewhere in the partitionwise join
> consideration flow).  IOW, partopfamily[] of all of the baserels should
> match and then the join clause operators involved should belong to the
> same respective operator families.

The partition bounds of different base rels may be different and we
have to compare them. Even if we say that we join two tables with the
same partition bounds using partition-wise join, we need to make sure
that those partition bounds are indeed the same, which requires
comparing them. And to compare any datum we need to know its type.

>
> ISTM, the question here is about how to derive the partitioning properties
> of joinrels from those of the baserels involved.  Even if the join
> conditions refer to columns of different types on two sides, as long as
> the partitioning and joining is known to occur using operators of
> compatible semantics, I don't understand what more needs to be considered
> or done.  Although, I haven't studied things in enough detail to say
> anything confidently about whether join being INNER or OUTER has any
> bearing on the semantics of the partitioning of the joinrels in question.
> IIUC, using partitioning properties to apply partitionwise join technique
> at successive join levels will be affected by the OUTER considerations
> similar to how they affect what levels a give EquivalenceClass clause
> could be applied without causing any semantics violations.  As already
> mentioned upthread, it would be a good idea to have some integration of
> the partitioning considerations with the equivalence class mechanism (how
> ForeignKeyOptInfo contains links to ECs comes to mind).

This has already been discussed. I have shown earlier why equivalence
classes are not useful in this case.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Apr 20, 2017 at 8:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I think you are confusing join condition application and partition
> bounds of a join relation.

You're right, I misunderstood what you were talking about.

> But the problem we are trying to solve here about partition bounds of
> the join relation: what should be the partition bounds of AB, BC or
> AC? When we compare partition bounds of and intermediate join with
> other intermediate join (e.g. AB with those of C) what operator should
> be used? You seem to be suggesting that we keep as many sets of
> partition bounds as there are base relations participating in the join
> and then use appropriate partition bounds based on the columns in the
> join conditions, so that we can use the same operator as used in the
> join condition. That doesn't seem to be a good option since the
> partition bounds will all have same values, only differing in their
> binary representation because of differences in data types.

Well, actually, I think it is a good option, as I wrote in
http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

In that email, my principal concern was allowing partition-wise join
to succeed even with slightly different sets of partition boundaries
on the two sides of the join; in particular, if we've got A with A1 ..
A10 and B with B1 .. B10 and the DBA adds A11, I don't want
performance to tank until the DBA gets around to adding B11.  Removing
the partition bounds from the PartitionScheme and storing them
per-RelOptInfo fixes that problem; the fact that it also solves this
problem of what happens when we have different data types on the two
sides looks to me like a second reason to go that way.
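
To make that concrete (a hypothetical sketch; a and b stand for the two
partitioned tables above):

-- a and b are range-partitioned on the same key with matching partitions
-- a1 .. a10 and b1 .. b10; the DBA then extends only a:
create table a11 partition of a for values from (1000) to (1100);
-- with bounds stored per-RelOptInfo, a partition-wise join of a and b
-- can still pair a1 .. a10 with b1 .. b10, instead of being disabled
-- until a matching b11 is created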

And there's a third reason, too, which is that the opfamily mechanism
doesn't currently provide any mechanism for reasoning about which data
types are "wider" or "narrower" in the way that you want.  In general,
there's not even a reason why such a relationship has to exist;
consider two data types t1 and t2 with opclasses t1_ops and t2_ops
that are part of the same opfamily t_ops, and suppose that t1 can
represent any positive integer and t2 can represent any even integer,
or in general that each data type can represent some but not all of
the values that can be represented by the other data type.  In such a
case, neither would be "wider" than the other in the sense that you
need; you essentially want to find a data type within the opfamily to
which all values of any of the types involved in the query can be cast
without error, but there is nothing today which requires such a data
type to exist, and no way to identify which one it is.  In practice,
for all of the built-in opfamilies that have more than one opclass,
such a data type always exists but is not always unique -- in
particular, datetime_ops contains date_ops, timestamptz_ops, and
timestamp_ops, and either of the latter two is a plausible choice for
the "widest" data type of the three.  But there's no way to figure
that out from the opfamily or opclass information we have today.
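
(For instance, a quick catalog query along these lines shows the
opclasses grouped under the btree datetime_ops opfamily:

select c.opcname, c.opcintype::regtype as input_type
from pg_opclass c
join pg_opfamily f on c.opcfamily = f.oid
join pg_am a on f.opfmethod = a.oid
where a.amname = 'btree' and f.opfname = 'datetime_ops';

It lists date_ops, timestamp_ops and timestamptz_ops, with nothing to
say which of the timestamp types ought to be treated as "widest".)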

In theory, it would be possible to modify the opfamily machinery so
that every opfamily designates an optional ordering of types from
"narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
guarantee that every value of type t1 can be cast without error to a
value of type t2.  But I think that's a bad plan.  It means that every
opfamily created by either the core code or some extension now needs
to worry about annotating the opclass with this new information, and
we have to add to core the SQL syntax and supporting code to make that
work.  If it were implementing a valuable feature which could not
practically be implemented without extending the opfamily machinery,
then I guess we'd have to suck it up and incur that
complexity, but in this case it does not appear necessary.  Storing
the partition bounds per-RelOptInfo makes this problem -- and a few
others -- go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You seem to be suggesting that we keep as many sets of
>> partition bounds as there are base relations participating in the join
>> and then use appropriate partition bounds based on the columns in the
>> join conditions, so that we can use the same operator as used in the
>> join condition. That doesn't seem to be a good option since the
>> partition bounds will all have same values, only differing in their
>> binary representation because of differences in data types.
>
> Well, actually, I think it is a good option, as I wrote in
> http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

I guess you are now confusing partition bounds for a join relation
with partition bounds of a base relation. The above paragraph is about
partition bounds of a join relation. I have already agreed that we
need to store partition bounds in RelOptInfo. For a base relation this
is trivial; its RelOptInfo has to store the partition bounds as stored
in the partition descriptor of the corresponding partitioned table. I
am talking about partition bounds of a join relation. See below for
more explanation.

>
> In that email, my principal concern was allowing partition-wise join
> to succeed even with slightly different sets of partition boundaries
> on the two sides of the join; in particular, if we've got A with A1 ..
> A10 and B with B1 .. B10 and the DBA adds A11, I don't want
> performance to tank until the DBA gets around to adding B11.  Removing
> the partition bounds from the PartitionScheme and storing them
> per-RelOptInfo fixes that problem;

We have an agreement on this.

> the fact that it also solves this
> problem of what happens when we have different data types on the two
> sides looks to me like a second reason to go that way.

I don't see how that is fixed. For a join relation we need to come up
with one set of partition bounds by merging the partition bounds of
the joining relations, and in order to understand how to interpret the
datums in those partition bounds, we need to associate a data type
with them. The question is which data type we should use if the
relations being joined have different data types associated with their
respective partition bounds.

Or are you saying that we don't need to associate a data type with the
merged partition bounds? In that case, I don't know how we would
compare the partition bounds of two relations.

In your example, A has a partition key of type int8 and bound datums
X1 .. X10. B has a partition key of type int4 and bound datums X1 ..
X11. C has a partition key of type int2 and bound datums X1 .. X12.
The binary representation of the X's is going to differ between A, B
and C, although each Xk is equal across A, B and C wherever it exists.
The join between A and B will have merged bound datums X1 .. X10 (and
X11, depending upon the join type). In order to match the bounds of AB
with those of C, we need to know the data type of the bounds of AB, so
that we can choose the appropriate equality operator. The question is
what we should choose as the data type of the partition bounds of AB:
int8 or int4. This is different from applying join conditions between
AB and C, which can choose the right opfamily operator based on the
join conditions.

>
> And there's a third reason, too, which is that the opfamily mechanism
> doesn't currently provide any mechanism for reasoning about which data
> types are "wider" or "narrower" in the way that you want.  In general,
> there's not even a reason why such a relationship has to exist;
> consider two data types t1 and t2 with opclasses t1_ops and t2_ops
> that are part of the same opfamily t_ops, and suppose that t1 can
> represent any positive integer and t2 can represent any even integer,
> or in general that each data type can represent some but not all of
> the values that can be represented by the other data type.  In such a
> case, neither would be "wider" than the other in the sense that you
> need; you essentially want to find a data type within the opfamily to
> which all values of any of the types involved in the query can be cast
> without error, but there is nothing today which requires such a data
> type to exist, and no way to identify which one it is.  In practice,
> for all of the built-in opfamilies that have more than one opclass,
> such a data type always exists but is not always unique -- in
> particular, datetime_ops contains date_ops, timestamptz_ops, and
> timestamp_ops, and either of the latter two is a plausible choice for
> the "widest" data type of the three.  But there's no way to figure
> that out from the opfamily or opclass information we have today.
>
> In theory, it would be possible to modify the opfamily machinery so
> that every opfamily designates an optional ordering of types from
> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
> guarantee that every value of type t1 can be cast without error to a
> value of type t2.  But I think that's a bad plan.  It means that every
> opfamily created by either the core code or some extension now needs
> to worry about annotating the opclass with this new information, and
> we have to add to core the SQL syntax and supporting code to make that
> work.  If it were implementing a valuable feature which could not
> practically be implemented without extending the opfamily machinery,
> then I guess that's what we'd have to suck it up and incur that
> complexity, but in this case it does not appear necessary.  Storing
> the partition bounds per-RelOptInfo makes this problem -- and a few
> others -- go away.

This seems to suggest that we cannot come up with merged bounds for a
join if the partition key types of the joining relations differ.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Here's an updated patch set.

0001-Refactor-adjust_appendrel_attrs-adjust_appendrel_att.patch
0002-Refactor-calc_nestloop_required_outer-and-allow_star.patch
These are the same as in the earlier patch set.

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch
0004-Modify-bound-comparision-functions-to-accept-members.patch
These are new patches that refactor the partition bound comparison
functions so that they can be used without passing the partition key
directly. Changes in the first patch are used in this patch set, but
the second patch will be useful for more generic bound matching.

0005-Multi-level-partitioned-table-expansion.patch
This is the same as in the old set, with minor changes. I have moved
it ahead of the other patches, as we discussed offline.

0006-Canonical-partition-scheme.patch
Partition bounds are no longer part of the partition scheme; they
appear in RelOptInfo. We are still discussing the data type handling
of partition bounds for a join relation, so I have kept partopcintype
in the partition scheme. There is one change though: from
ComputePartitionAttrs()->GetDefaultOpClass()/ResolveOpClass(), I
gather that partopcintype, rather than parttypid, is the type used for
comparison of partition bounds; when they differ they are binary
compatible. So I have saved partopcintype in PartitionScheme instead
of parttypid and parttypmod.

0007-Canonical-partitioning-scheme-for-multi-level-partit.patch
What was earlier the multi-level partition-wise join support patch is
now broken into a set of patches, each going with the corresponding
single-level partition-wise join patch. The idea is that if we agree
on the changes for multi-level partitioning support and want to commit
them before partition-wise join support, we can squash those pairs
into one. This also associates the multi-level support changes with
the corresponding changes for single-level support, which might be
easier to review.

0008-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch
No changes in this patch.

0009-Partition-wise-join-implementation.patch
The patch adds build_joinrel_partition_bounds() to match the partition
bounds of the relations being joined. This function is called from
try_partition_wise_join(). This function is also responsible for
creating the pairs of matching partitions. When we come to support
partition-wise joins for an unequal number of partitions, this function
would change without changing the rest of the code.

0010-Multi-level-partition-wise-join-implementation.patch
multi-level support

0011-Adjust-join-related-to-code-to-accept-child-relation.patch
No changes to this patch.

0012-Fix-ConvertRowtypeExpr-refs-in-join-targetlist-and-q.patch
Fixes a crash with multi-level partitioning reported by Rajkumar. Fixes
set_plan_refs code for nested ConvertRowtypeExprs corresponding to
multiple levels of partitions.

0013-Parameterized-path-fixes.patch
No changes to this patch.

0014-Reparameterize-path-across-multiple-levels-of-partit.patch
Multi-level support changes for 0013

0015-Partition-wise-join-tests.patch
0016-Multi-level-partition-wise-join-tests.patch
Added the test cases reported by Rajkumar.

On Fri, Apr 21, 2017 at 12:11 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> You seem to be suggesting that we keep as many sets of
>>> partition bounds as there are base relations participating in the join
>>> and then use appropriate partition bounds based on the columns in the
>>> join conditions, so that we can use the same operator as used in the
>>> join condition. That doesn't seem to be a good option since the
>>> partition bounds will all have same values, only differing in their
>>> binary representation because of differences in data types.
>>
>> Well, actually, I think it is a good option, as I wrote in
>> http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com
>
> I guess, you are now confusing between partition bounds for a join
> relation and partition bounds of base relation. Above paragraph is
> about partition bounds of a join relation. I have already agreed that
> we need to store partition bounds in RelOptInfo. For base relation
> this is trivial; its RelOptInfo has to store partition bounds as
> stored in the partition descriptor of corresponding partitioned table.
> I am talking about partition bounds of a join relation. See below for
> more explanation.
>
>>
>> In that email, my principal concern was allowing partition-wise join
>> to succeed even with slightly different sets of partition boundaries
>> on the two sides of the join; in particular, if we've got A with A1 ..
>> A10 and B with B1 .. B10 and the DBA adds A11, I don't want
>> performance to tank until the DBA gets around to adding B11.  Removing
>> the partition bounds from the PartitionScheme and storing them
>> per-RelOptInfo fixes that problem;
>
> We have an agreement on this.
>
>> the fact that it also solves this
>> problem of what happens when we have different data types on the two
>> sides looks to me like a second reason to go that way.
>
> I don't see how is that fixed. For a join relation we need to come up
> with one set of partition bounds by merging partition bounds of the
> joining relation and in order to understand how to interpret the
> datums in the partition bounds, we need to associate data types. The
> question is which data type we should use if the relations being
> joined have different data types associated with their respective
> partition bounds.
>
> Or are you saying that we don't need to associate data type with
> merged partition bounds? In that case, I don't know how do we compare
> the partition bounds of two relations?
>
> In your example, A has partition key of type int8, has bound datums
> X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
> X11. C has partition key type int2 and bound datums X1 .. X12. The
> binary representation of X's is going to differ between A, B and C
> although each Xk for A, B and C is equal, wherever exists. Join
> between A and B will have merged bound datums X1 .. X10 (and X11
> depending upon the join type). In order to match bounds of AB with C,
> we need to know the data type of bounds of AB, so that we can choose
> appropriate equality operator. The question is what should we choose
> as data type of partition bounds of AB, int8 or int4. This is
> different from applying join conditions between AB and C, which can
> choose the right opfamily operator based on the join conditions.
>
>>
>> And there's a third reason, too, which is that the opfamily mechanism
>> doesn't currently provide any mechanism for reasoning about which data
>> types are "wider" or "narrower" in the way that you want.  In general,
>> there's not even a reason why such a relationship has to exist;
>> consider two data types t1 and t2 with opclasses t1_ops and t2_ops
>> that are part of the same opfamily t_ops, and suppose that t1 can
>> represent any positive integer and t2 can represent any even integer,
>> or in general that each data type can represent some but not all of
>> the values that can be represented by the other data type.  In such a
>> case, neither would be "wider" than the other in the sense that you
>> need; you essentially want to find a data type within the opfamily to
>> which all values of any of the types involved in the query can be cast
>> without error, but there is nothing today which requires such a data
>> type to exist, and no way to identify which one it is.  In practice,
>> for all of the built-in opfamilies that have more than one opclass,
>> such a data type always exists but is not always unique -- in
>> particular, datetime_ops contains date_ops, timestamptz_ops, and
>> timestamp_ops, and either of the latter two is a plausible choice for
>> the "widest" data type of the three.  But there's no way to figure
>> that out from the opfamily or opclass information we have today.
>>
>> In theory, it would be possible to modify the opfamily machinery so
>> that every opfamily designates an optional ordering of types from
>> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
>> guarantee that every value of type t1 can be cast without error to a
>> value of type t2.  But I think that's a bad plan.  It means that every
>> opfamily created by either the core code or some extension now needs
>> to worry about annotating the opclass with this new information, and
>> we have to add to core the SQL syntax and supporting code to make that
>> work.  If it were implementing a valuable feature which could not
>> practically be implemented without extending the opfamily machinery,
>> then I guess that's what we'd have to suck it up and incur that
>> complexity, but in this case it does not appear necessary.  Storing
>> the partition bounds per-RelOptInfo makes this problem -- and a few
>> others -- go away.
>
> This seems to suggest that we can not come up with merged bounds for
> join if the partition key types of joining relations differ.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> I don't see how is that fixed. For a join relation we need to come up
> with one set of partition bounds by merging partition bounds of the
> joining relation and in order to understand how to interpret the
> datums in the partition bounds, we need to associate data types. The
> question is which data type we should use if the relations being
> joined have different data types associated with their respective
> partition bounds.
>
> Or are you saying that we don't need to associate data type with
> merged partition bounds? In that case, I don't know how do we compare
> the partition bounds of two relations?

Well, since there is no guarantee that a datatype exists which can be
used to "merge" the partition bounds in the sense that you are
describing, and even if there is one we have no opfamily
infrastructure to find out which one it is, I think it would be smart
to try to set things up so that we don't need to do that.  I believe
that's probably possible.

> In your example, A has partition key of type int8, has bound datums
> X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
> X11. C has partition key type int2 and bound datums X1 .. X12.

OK, sure.

> The binary representation of X's is going to differ between A, B and C
> although each Xk for A, B and C is equal, wherever exists.

Agreed.

> Join
> between A and B will have merged bound datums X1 .. X10 (and X11
> depending upon the join type). In order to match bounds of AB with C,
> we need to know the data type of bounds of AB, so that we can choose
> appropriate equality operator. The question is what should we choose
> as data type of partition bounds of AB, int8 or int4. This is
> different from applying join conditions between AB and C, which can
> choose the right opfamily operator based on the join conditions.

Well, the join is actually being performed either on A.keycol =
C.keycol or on B.keycol = C.keycol, right?  It has to be one or the
other; there's no "merged" join column in any relation's targetlist,
but only columns derived from the various baserels.  So let's use that
set of bounds for the matching.  It makes sense to use the set of
bounds for the matching that corresponds to the column actually being
joined, I think.

It's late here and I'm tired, but it seems like it should be possible
to relate the child joinrels of the AB join back to the child joinrels
of either A or B.  (AB)1 .. (AB)10 relate back to A1 .. A10 and B1 ..
B10.  (AB)11 relates back to B11 but, of course not to A11, which
doesn't exist.  If the join is INNER, (AB)11 is a dummy rel anyway and
actually we should probably see whether we can omit it altogether.  If
the join is an outer join of some kind, there's an interesting case
where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not
on the nullable side of the join; in that case, too, (AB)11 is dummy
or nonexistent.  Otherwise, assuming A is nullable, (AB)11 maps only
to B11 and not to A11.  But that's absolutely right: if the join to C
uses A.keycol, either the join operator is strict and (AB)11 won't
match anything anyway, or it's not and partition-wise join is illegal
because A.keycol in (AB)11 can include not only values from X11 but
also nulls.

So, it seems to me that what you can do is loop over the childrels on
the outer side of the join.  For each one, you've got a join clause
that relates the outer rel to the inner rel, and that join clause
mentions some baserel which is contained in the joinrel.  So drill
down through the childrel to the corresponding partition of the
baserel and get those bounds.  Then if you do the same thing for the
inner childrels, you've now got two lists of bounds, and the type on
the left matches the outer side of the join and the type on the right
matches the inner side of the join and the opfamily of the operator in
the join clause gives you a comparison operator that relates those two
types, and now you can match them up.

(We should also keep in mind the case where there are multiple columns
in the partition key.)

> This seems to suggest that we can not come up with merged bounds for
> join if the partition key types of joining relations differ.

Yes, I think that would be difficult.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Apr 22, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> I don't see how is that fixed. For a join relation we need to come up
>> with one set of partition bounds by merging partition bounds of the
>> joining relation and in order to understand how to interpret the
>> datums in the partition bounds, we need to associate data types. The
>> question is which data type we should use if the relations being
>> joined have different data types associated with their respective
>> partition bounds.
>>
>> Or are you saying that we don't need to associate data type with
>> merged partition bounds? In that case, I don't know how do we compare
>> the partition bounds of two relations?
>
> Well, since there is no guarantee that a datatype exists which can be
> used to "merge" the partition bounds in the sense that you are
> describing, and even if there is one we have no opfamily
> infrastructure to find out which one it is, I think it would be smart
> to try to set things up so that we don't need to do that.  I believe
> that's probably possible.
>
>> In your example, A has partition key of type int8, has bound datums
>> X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
>> X11. C has partition key type int2 and bound datums X1 .. X12.
>
> OK, sure.
>
>> The binary representation of X's is going to differ between A, B and C
>> although each Xk for A, B and C is equal, wherever exists.
>
> Agreed.
>
>> Join
>> between A and B will have merged bound datums X1 .. X10 (and X11
>> depending upon the join type). In order to match bounds of AB with C,
>> we need to know the data type of bounds of AB, so that we can choose
>> appropriate equality operator. The question is what should we choose
>> as data type of partition bounds of AB, int8 or int4. This is
>> different from applying join conditions between AB and C, which can
>> choose the right opfamily operator based on the join conditions.
>
> Well, the join is actually being performed either on A.keycol =
> C.keycol or on B.keycol = C.keycol, right?  It has to be one or the
> other; there's no "merged" join column in any relation's targetlist,
> but only columns derived from the various baserels.  So let's use that
> set of bounds for the matching.  It makes sense to use the set of
> bounds for the matching that corresponds to the column actually being
> joined, I think.
>
> It's late here and I'm tired, but it seems like it should be possible
> to relate the child joinrels of the AB join back to the child joinrels
> of either A or B.  (AB)1 .. (AB)10 related back to A1 .. A10 and B1 ..
> B10.  (AB)11 relates back to B11 but, of course not to A11, which
> doesn't exist.  If the join is INNER, (AB)11 is a dummy rel anyway and
> actually we should probably see whether we can omit it altogether.  If
> the join is an outer join of some kind, there's an interesting case
> where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not
> on the nullable side of the join; in that case, too, (AB)11 is dummy
> or nonexistent.  Otherwise, assuming A is nullable, (AB)11 maps only
> to B11 and not to A11.  But that's absolutely right: if the join to C
> uses A.keycol, either the join operator is strict and (AB)11 won't
> match anything anyway, or it's not and partition-wise join is illegal
> because A.keycol in (AB)11 can include not only values from X11 but
> also nulls.
>
> So, it seems to me that what you can do is loop over the childrels on
> the outer side of the join.  For each one, you've got a join clause
> that relates the outer rel to the inner rel, and that join clause
> mentions some baserel which is contained in the joinrel.  So drill
> down through the childrel to the corresponding partition of the
> baserel and get those bounds.  Then if you do the same thing for the
> inner childrels, you've now got two lists of bounds, and the type on
> the left matches the outer side of the join and the type on the right
> matches the inner side of the join and the opfamily of the operator in
> the join clause gives you a comparison operator that relates those two
> types, and now you can match them up.

This assumes that the datums in partition bounds have a one-to-one
mapping with the partitions, which isn't true for list partitions. For
list partitions we have multiple datums, corresponding to the listed
items, associated with a given partition. So simply looping over the
partitions of the outer relation doesn't work; in fact there are two
outer relations for a full outer join, so we have to loop over both of
them together in a merge-join fashion.

Consider A join B where A has partitions A1 (a, b, c), A2 (e, f), A3
(g, h) and B has partitions B1 (a, b), B2 (c, d, e), B3 (f, g, h). If
we just look at the partitions, we won't recognize that list item c is
spread across the A1B1 and A2B2 pairings. That can be recognized only
when we loop over the datums of A and B trying to match the
partitions. We will see that for a and b, A1 and B1 match, but for c,
A1 and B1 do not match; instead A1 and B2 match. With one-to-one
partition matching we will bail out here.
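
In DDL terms the layout above is roughly (made-up list values, as
above):

create table a (k text, v int) partition by list (k);
create table a1 partition of a for values in ('a', 'b', 'c');
create table a2 partition of a for values in ('e', 'f');
create table a3 partition of a for values in ('g', 'h');

create table b (k text, v int) partition by list (k);
create table b1 partition of b for values in ('a', 'b');
create table b2 partition of b for values in ('c', 'd', 'e');
create table b3 partition of b for values in ('f', 'g', 'h');

-- pairing partitions positionally (a1-b1, a2-b2, a3-b3) hides the fact
-- that value 'c' falls in a1 on one side but in b2 on the other; only
-- walking the list datums of both sides together exposes that, and for
-- a full join both sides have to be walked as outer
select * from a full join b on a.k = b.k;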

I think we have to find the base relations whose partition bounds
should be used for comparison by looking at the equi-join conditions,
and then compare those partition bounds to come up with the partition
bounds of the join relation. That won't work straightforwardly either
when there are partitions missing on either side of the join, I guess.
Needs careful thought.

>
> (We should also keep in mind the case where there are multiple columns
> in the partition key.)

Yes. This is tricky. Consider A partitioned by (a1, a2), B partitioned
by (b1, b2) and C partitioned by (c1, c2). If the query is A join B on
(A.a1 = B.b1 and A.a2 = B.b2) join C on (C.c1 = A.a1 and C.c2 = B.b2),
we need to fetch partition bound values for the first key column from
A's partition bounds (a1) and those for the second from B's (b2),
create combined partition bounds from those, and then compare the
combined bounds with those of C.
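
As a query that case looks something like:

select *
from a
join b on a.a1 = b.b1 and a.a2 = b.b2
join c on c.c1 = a.a1 and c.c2 = b.b2;
-- the combined bounds for the (A JOIN B) relation have to take the
-- values for the first key column from A's bounds (a1) and those for
-- the second from B's bounds (b2) before they can be compared with
-- C's (c1, c2) bounds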

After saying all that, I think we have a precedent for merged join
columns with merged data types. Consider:
create table t1(a int2, b int);
create table t2 (a int4, b int);
explain verbose select * from t1 join t2 using(a);
                              QUERY PLAN
-------------------------------------------------------------------------
 Merge Join  (cost=327.25..745.35 rows=27120 width=12)
   Output: t2.a, t1.b, t2.b
   Merge Cond: (t2.a = t1.a)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
         Output: t2.a, t2.b
         Sort Key: t2.a
         ->  Seq Scan on public.t2  (cost=0.00..32.60 rows=2260 width=8)
               Output: t2.a, t2.b
   ->  Sort  (cost=168.75..174.75 rows=2400 width=6)
         Output: t1.b, t1.a
         Sort Key: t1.a
         ->  Seq Scan on public.t1  (cost=0.00..34.00 rows=2400 width=6)
               Output: t1.b, t1.a
(13 rows)

When a USING clause is used, the columns specified by the USING clause
from the joining relations are merged into a single column. Here it
has used the "wider" type column t2.a as the merged column for t1.a
and t2.a. The logic is in buildMergedJoinVar().

Probably we want to build merged partition bounds for a join relation
whose joining relations have different partition key types, using a
single data type chosen by the same logic as buildMergedJoinVar(), and
attach those bounds to the join relation.

[1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312629.html

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> This assumes that datums in partition bounds have one to one mapping
> with the partitions, which isn't true for list partitions. For list
> partitions we have multiple datums corresponding to the items listed
> associated with a given partition. So, simply looping over the
> partitions of outer relations doesn't work; in fact there are two
> outer relations for a full outer join, so we have to loop over both of
> them together in a merge join fashion.

Maybe so, but my point is that it can be done with the original types,
without converting anything to a different type.

> When using clause is used the columns specified by using clause from
> the joining relations are merged into a single column. Here it has
> used a "wider" type column t2.a as the merged column for t1.a and
> t2.a. The logic is in buildMergedJoinVar().

That relies on select_common_type(), which can error out if it can't
find a common type.  That's OK for the current uses of that function,
because if it fails it means that the query is invalid.  But it's not
OK for what you want here, because it's not OK to error out due to
inability to do a partition-wise join when a non-partition-wise join
would have worked.  Also, note that all select_common_type() is really
doing is looking for the type within the type category that is marked
typispreferred, or else checking which direction has an implicit cast.
Neither of those things guarantee the property you want here, namely
that the "common" type is in the same opfamily and can store every
value of any of the input types without loss of precision.  So I don't
think you can rely on that.

I'm going to say this one more time: I really, really, really think
you need to avoid trying to convert the partition bounds to a common
type.  I said before that the infrastructure to do that is not present
in our type system, and I'm pretty sure that statement is 100%
correct.  The fact that you can find other cases where we do something
sorta like that but in a different case with different requirements
doesn't make that false.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> I'm going to say this one more time: I really, really, really think
> you need to avoid trying to convert the partition bounds to a common
> type.  I said before that the infrastructure to do that is not present
> in our type system, and I'm pretty sure that statement is 100%
> correct.  The fact that you can find other cases where we do something
> sorta like that but in a different case with different requirements
> doesn't make that false.

It's not just a matter of lack of infrastructure: the very attempt is
flawed, because in some cases there simply isn't a supertype that can
hold all values of both types.  An easy counterexample is float8 vs
numeric: you can't convert float8 'Infinity' to numeric, but also there
are values of numeric that can't be converted to float8 without overflow
and/or loss of precision.

The whole business of precision loss makes things very touchy for almost
anything involving float and a non-float type, actually.

What I'm going to ask one more time, though, is why we are even discussing
this.  Surely the partition bounds of a partitioned table must all be of
the same type already.  If there is a case where they are not, that is
a bug we had better close off before v10 ships, not a feature that we
need to write a lot of code to accommodate.
        regards, tom lane



On Wed, Apr 26, 2017 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What I'm going to ask one more time, though, is why we are even discussing
> this.  Surely the partition bounds of a partitioned table must all be of
> the same type already.  If there is a case where they are not, that is
> a bug we had better close off before v10 ships, not a feature that we
> need to write a lot of code to accommodate.

This question was answered before, by Ashutosh.

http://postgr.es/m/CAFjFpRfaKSO4YZjVv7jkcMEMVgDcnqc4yhqVWhO5gczB5mW8eQ@mail.gmail.com

Since you either didn't read his answer, or else didn't understand it
and didn't bother asking for clarification, I'll try to be more blunt:
of course all of the partition bounds of a single partitioned table
have to be of the same type.  We're not talking about that, because no
kidding.  This thread is about the possibility -- in a future release
-- of implementing a join between two different partitioned tables by
joining each pair of matching partitions.  To do that, you need the
tables to be compatibly partitioned, which requires that each
partitioning column use the same opfamily on both sides but not
necessarily that the types be the same.  Making
partition-wise join work in the case where the partitioning columns
are of different types within an opfamily (like int4 vs. int8) is
giving Ashutosh a bit of trouble.  So this is about a cross-type join,
not multiple types within a single partitioning hierarchy, as you
might also gather from the subject line of this thread.
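
(For reference, the opfamily already carries the cross-type operators
such a join needs; for example:

select amoplefttype::regtype as lefttype,
       amoprighttype::regtype as righttype,
       amopopr::regoperator as operator
from pg_amop ao
join pg_opfamily f on ao.amopfamily = f.oid
join pg_am a on f.opfmethod = a.oid
where a.amname = 'btree'
  and f.opfname = 'integer_ops'
  and ao.amopstrategy = 3;   -- btree equality strategy

lists equality operators for every combination of int2, int4 and int8.)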

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> So this is about a cross-type join,
> not multiple types within a single partitioning hierarchy, as you
> might also gather from the subject line of this thread.

OK, but I still don't understand why any type conversion is needed
in such a case.  The existing join estimators don't try to do that,
for the good and sufficient reasons you and I have already mentioned.
They just apply the given cross-type join operator, and whatever
cross-type selectivity estimator might be associated with it, and
possibly other cross-type operators obtained from the same btree
opfamily.

The minute you get into trying to do any type conversion that is not
mandated by the semantics of the query as written, you're going to
have problems.
        regards, tom lane



On Wed, Apr 26, 2017 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So this is about a cross-type join,
>> not multiple types within a single partitioning hierarchy, as you
>> might also gather from the subject line of this thread.
>
> OK, but I still don't understand why any type conversion is needed
> in such a case.  The existing join estimators don't try to do that,
> for the good and sufficient reasons you and I have already mentioned.
> They just apply the given cross-type join operator, and whatever
> cross-type selectivity estimator might be associated with it, and
> possibly other cross-type operators obtained from the same btree
> opfamily.
>
> The minute you get into trying to do any type conversion that is not
> mandated by the semantics of the query as written, you're going to
> have problems.

There is no daylight whatsoever between us on this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Apr 26, 2017 at 9:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> This assumes that datums in partition bounds have one to one mapping
>> with the partitions, which isn't true for list partitions. For list
>> partitions we have multiple datums corresponding to the items listed
>> associated with a given partition. So, simply looping over the
>> partitions of outer relations doesn't work; in fact there are two
>> outer relations for a full outer join, so we have to loop over both of
>> them together in a merge join fashion.
>
> Maybe so, but my point is that it can be done with the original types,
> without converting anything to a different type.
>

Theoretically, I agree with this. But practically the implementation
is a lot more complex than what you have described in the earlier
mails. I am afraid that a patch with those changes will be a lot
harder to review and commit. Later in this mail, I will try to explain
some of the complexities.

>
> I'm going to say this one more time: I really, really, really think
> you need to avoid trying to convert the partition bounds to a common
> type.  I said before that the infrastructure to do that is not present
> in our type system, and I'm pretty sure that statement is 100%
> correct.  The fact that you can find other cases where we do something
> sorta like that but in a different case with different requirements
> doesn't make that false.

Ok. Thanks for the explanation.

The current design and implementation is for a restricted case where
the partition bounds, partition key types and numbers match exactly.
We want to commit an implementation which is reasonably extensible and
doesn't require a lot of changes when we add more capabilities. Some
of the extensions we discussed are as follows:
1. Partition-wise join when the existing partitions have matching
bounds/lists but there can be extra partitions on either side of the
join (between base relations or join relations) without a matching
partition on the other side.
2. Partition-wise join when the partition bounds/lists do not match
exactly but there is 1:1 or 1:0 or 0:1 mapping between the partitions
which can contribute to the final result. E.g. A (0-100, 100 - 150,
200-300), B (0-50, 125-200, 300-400)
3. Partition-wise join when the partition key types do not match, but
there's a single opfamily being used for partitioning.
4. Partition-wise join where 1:m or m:n mapping exists between
partitions of the joining relations.


The first one is clearly something that we will need. We may add it in
the first commit or the next, but it will be needed pretty soon
(v11?). To me the 2nd is more important than the 3rd one; you may have
a different view. We will expect the 3rd optimization to work with all
the prior optimizations. I am restraining myself from thinking about
the 4th one, since that requires ganging together multiple RelOptInfos
as a single RelOptInfo while joining, something we don't have
infrastructure for.

In the case of the first goal, supporting INNER joins and OUTER joins
where partitions are missing on the OUTER side but not the INNER side
is easier: in those cases we just drop those partitions and the
corresponding bounds/lists from the join. For a FULL OUTER join, where
both sides act as OUTER as well as INNER, we will need an exact
mapping between the partitions. For supporting OUTER joins where
partitions on the INNER side can be missing, we need to create some
"dummy" relations representing the missing partitions so that we have
OUTER rows with a NULL inner side. This requires giving those dummy
relations some relids, and thus in the case of base relations we may
need to inject some dummy children. This may mean that we have to
expand simple_rel_array as part of the outer join, and it may or may
not require adding new AppendRelInfos and so on. We are basically
breaking the assumption that base relations cannot be introduced while
planning joins, and that might require some rework in the existing
infrastructure. There might be other ways to introduce dummy relations
during join planning, but I haven't really thought the problem
through.

The third goal requires that the partition bounds be compared based on
the partition keys present in the equi-join. While matching the
partitions to be joined, the partition bounds corresponding to the
base relation whose partition keys appear in the equi-join are used
for comparison, using the support function corresponding to the data
types of the partition keys. This requires us to associate the
partitions of a join with the bounds of a base relation. E.g. A (A1,
A2) with bounds (X1, X3) (notice the missing X2), B (B1, B2, B3) with
bounds (X1, X2, X3), C (C1, C2, C3) with bounds (X1, X2, X3), and the
join is A LJ B on A.a = B.b LJ C on B.b = C.c; assuming strict
operators this can be executed as (AB)C or A(BC). AB will have
partitions A1B1 and A2B3, since there is no matching bound of A for B2
and A is the outer relation. A1B1 is associated with bound X1 of both
A and C. A2B3 is associated with bound X3, which happens to be the 2nd
bound of A but the third of B. When we join (AB) with C, we should
notice that C1 goes with A1B1, C2 doesn't have any matching partition
in AB, and C3 goes with A2B3. If we compare the bounds of B with those
of C without any transformation, we will see that C2 matches B2, but
we need to look at the children of AB to realize that B2 isn't present
in any of the children and thus C2 should not be joined with any
partition of AB. That looks like a quadratic-order operation on the
number of partitions. The complexity can be reduced by maintaining as
many sets of partition bounds as the number of base relations
participating in the join (an idea I have floated earlier [1]); I
don't elaborate it here to avoid digression. There's also the
complexity of an N-way join with multiple partition keys and joins on
partition keys from different relations, as discussed in [1]. There
may be more involved cases that I haven't thought about. In short,
implementing the 1st and 3rd optimizations together looks fairly
complex.

Add to this the 2nd optimization and it becomes still more complex.

In order to keep the patches manageable to implement, review and
commit, I am proposing the following approach.

1. Implement the first optimization on top of the current patches, which
enforce that the partition key datatypes of the joining relations
match. I am right now working on that patch. Do this for INNER joins
and OUTER joins where partitions are missing on the OUTER side and not
the INNER side.
As a side note, the existing partition bound comparison functions are
tied to the PartitionKey structure and require the complete set of bounds
from a partitioned relation. Neither of those is applicable anymore: a
PartitionKey structure is not available for a join, and for a join we have
to compare individual bounds, as against one probe with a complete set.
This refactoring did eat some time.

2. Implement support for OUTER join where partitions can be missing
from either side.

3. Implement support for partition-wise join with different partition key types.

All those implementations will be separate patches on top of the v18 patches.

Given the complexities involved in 2 and 3, I am not sure in which order
I should attack them. I don't have any estimates as to how much time
each of those is going to require. Maybe a couple of months, but I
am not sure.

Obviously we have to wait till the first commitfest to commit the
first version of the patch. So, based on the status at that time, we can
decide what goes into the first commit of this feature and adjust the
patch set accordingly.

Thoughts/comments?

[1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The third goal requires that the partition bounds be compared based on
> the partition keys present in the equi-join. While matching the
> partitions to be joined, the partition bounds corresponding the base
> relation whose partition keys appear in the equi-join are used for
> comparison using support function corresponding to the data types of
> partition keys. This requires us to associate the partitions of a join
> with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
> (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
> (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
> assuming strict operators this can be executed as (AB)C or A(BC). AB
> will have partitions A1B1, A2B3 since there is no matching bound of A
> for B2 and A is outer relation. A1B1 is associated with bound X1 of A
> and C both. A2B3 is associated with bound of X3, which happens to be
> 2nd bound of A but third of B. When we join (AB) with C, we should
> notice that C1 goes with A1B1, C2 doesn't have any matching partition
> in AB and C3 goes with A2B3. If we compare bounds of B with C without
> any transformation we will know C2 matches B2, but we need to look at
> the children of AB to realize that B2 isn't present in any of the
> children and thus C2 should not be joined with any partition of AB.

Sure.

> That usually looks a quadratic order operation on the number of
> partitions.

Now that I don't buy.  Certainly, for range partitions, given a list
of ranges of length M and another of length N, this can be done in
O(M+N) time by merge-joining the lists of bounds.  You pointed out
upthread that for list partitions, things are a bit complicated
because a single list partition can contain multiple values which are
not necessarily contiguous, but I think that this can still be done in
O(M+N) time.  Sort all of the bounds, associating each one to a
partition, and do a merge pass; whenever two bounds match, match the
two corresponding partitions, but if one of those partitions is
already matched to some other partition, then fail.

For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN
(2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4).  The sorted
bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6.  The first value in both
lists is a 1, so the corresponding partitions A1 and B1 are matched.
The second value in both lists is a 2, so the corresponding partitions
A2 and B2 are matched.  Then we hit a 3 on the A side that has no
match on the B side, but that's fine; we don't need to do anything.
If the partition on the A side never got a mapping at any point during
this merge pass, we'd eventually need to match it to a dummy partition
(unless this is an inner join) but it's already mapped to B1 so no
problem.  Then we hit a 4 which says that A2 must match B2, which is
consistent with what we already determined; no problem.  Then we hit
another value that only exists on the A side, which is fine just as
before.  Finally we hit a 6 on each side, which means that A2 must
match B1, which is inconsistent with the existing mappings so we give
up; no partitionwise join is possible here.
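
To illustrate, here is a minimal standalone C sketch of that merge pass (the Bound representation and all names here are purely illustrative, nothing from the actual patch). Run on the A1/A2 vs. B1/B2 example above, it fails at the final pair of 6s, which would force A2 to match B1:

#include <stdio.h>
#include <stdbool.h>

/* One list-partition bound value, tagged with the partition it belongs to. */
typedef struct
{
    int     value;
    int     partition;
} Bound;

/*
 * Merge-match two individually sorted bound arrays in O(M+N) time.
 * On success map_a[i] is the B-side partition matched to A's partition i
 * (-1 if unmatched), and likewise for map_b.  Returns false as soon as
 * some partition would have to match two different partitions.
 */
static bool
match_list_bounds(const Bound *a, int na, const Bound *b, int nb,
                  int *map_a, int *map_b)
{
    int     i = 0,
            j = 0;

    while (i < na && j < nb)
    {
        if (a[i].value < b[j].value)
            i++;                /* value present only on the A side */
        else if (a[i].value > b[j].value)
            j++;                /* value present only on the B side */
        else
        {
            int     pa = a[i].partition;
            int     pb = b[j].partition;

            /* conflicting mapping => no partitionwise join */
            if ((map_a[pa] >= 0 && map_a[pa] != pb) ||
                (map_b[pb] >= 0 && map_b[pb] != pa))
                return false;
            map_a[pa] = pb;
            map_b[pb] = pa;
            i++;
            j++;
        }
    }
    return true;
}

int
main(void)
{
    /* A1 IN (1,3,5), A2 IN (2,4,6); B1 IN (1,6), B2 IN (2,4) */
    Bound   a[] = {{1, 0}, {2, 1}, {3, 0}, {4, 1}, {5, 0}, {6, 1}};
    Bound   b[] = {{1, 0}, {2, 1}, {4, 1}, {6, 0}};
    int     map_a[] = {-1, -1};
    int     map_b[] = {-1, -1};

    printf("partitions match: %s\n",
           match_list_bounds(a, 6, b, 4, map_a, map_b) ? "yes" : "no");
    return 0;
}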

> The complexity can be reduced by maintaining as many
> partition bounds as the number of base relations participating in the
> join (an idea, I have floated earlier [1]) I don't elaborate it here
> to avoid digression. There's also the complexity of an N-way join with
> multiple partition keys and joins on partition keys from different
> relations as discussed in [1]. There may be more involved cases, that
> I haven't thought about. In short, implementation for 1st and 3rd
> optimization together looks fairly complex.

I spent some time thinking about this today and I think I see how we
could make it work: keep a single set of bounds for each join
relation, but record the type of each bound.  For example, suppose we
are full joining relation i2, with an int2 partition column, which has
partitions i2a from 0 to 10000 and i2b from 20000 to 30000, to
relation i4, with an int4 partition column, which has partitions i4a
from 5000 to 15000 and i4b from 25000 to 35000.   We end up with a
joinrel with 2 partitions.  The first goes from 0 (stored as an int2)
to 15000 (stored as an int4) and the second goes from 20000 (stored as
an int2) to 35000 (stored as an int4).  If we subsequently need to
merge these bounds with yet another relation at a higher join level,
we can use the opfamily (which is common) to dig out the right
cross-type operator for each comparison we may need to perform, based
on the precise types of the datums being compared.  Of course, we
might not find an appropriate cross-type operator in some cases,
because an opfamily isn't required to provide that, so then we'd have
to fail gracefully somehow, but that could be done.
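
Just to sketch the idea (this is illustrative only, not code from any patch; it leans on the opfamily's btree comparison support function rather than an operator, and assumes the usual backend headers), the per-comparison lookup and the graceful failure could look roughly like this:

#include "postgres.h"

#include "access/nbtree.h"      /* BTORDER_PROC */
#include "fmgr.h"
#include "utils/lsyscache.h"    /* get_opfamily_proc() */

/*
 * Illustrative helper: compare two partition bound datums whose types may
 * differ, using the btree opfamily shared by the partitioned relations.
 * Returns false if the opfamily has no cross-type comparison proc, in
 * which case the caller would give up on a partitionwise join.
 */
static bool
cross_type_bound_cmp(Oid opfamily, Oid collation,
                     Datum bound1, Oid type1,
                     Datum bound2, Oid type2,
                     int32 *cmpval)
{
    Oid     cmpproc = get_opfamily_proc(opfamily, type1, type2,
                                        BTORDER_PROC);

    if (!OidIsValid(cmpproc))
        return false;           /* fail gracefully, as described above */

    *cmpval = DatumGetInt32(OidFunctionCall2Coll(cmpproc, collation,
                                                 bound1, bound2));
    return true;
}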

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables.  So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4).  I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, Apr 28, 2017 at 1:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> The third goal requires that the partition bounds be compared based on
>> the partition keys present in the equi-join. While matching the
>> partitions to be joined, the partition bounds corresponding the base
>> relation whose partition keys appear in the equi-join are used for
>> comparison using support function corresponding to the data types of
>> partition keys. This requires us to associate the partitions of a join
>> with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
>> (notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
>> (X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
>> assuming strict operators this can be executed as (AB)C or A(BC). AB
>> will have partitions A1B1, A2B3 since there is no matching bound of A
>> for B2 and A is outer relation. A1B1 is associated with bound X1 of A
>> and C both. A2B3 is associated with bound of X3, which happens to be
>> 2nd bound of A but third of B. When we join (AB) with C, we should
>> notice that C1 goes with A1B1, C2 doesn't have any matching partition
>> in AB and C3 goes with A2B3. If we compare bounds of B with C without
>> any transformation we will know C2 matches B2, but we need to look at
>> the children of AB to realize that B2 isn't present in any of the
>> children and thus C2 should not be joined with any partition of AB.
>
> Sure.
>
>> That usually looks a quadratic order operation on the number of
>> partitions.
>
> Now that I don't buy.  Certainly, for range partitions, given a list
> of ranges of length M and another of length N, this can be done in
> O(M+N) time by merge-joining the lists of bounds.  You pointed out
> upthread that for list partitions, things are a bit complicated
> because a single list partition can contain multiple values which are
> not necessarily contiguous, but I think that this can still be done in
> O(M+N) time.  Sort all of the bounds, associating each one to a
> partition, and do a merge pass; whenever two bounds match, match the
> two corresponding partitions, but if one of those partitions is
> already matched to some other partition, then fail.
>
> For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN
> (2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4).  The sorted
> bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6.  The first value in both
> lists is a 1, so the corresponding partitions A1 and B1 are matched.
> The second value in both lists is a 2, so the corresponding partitions
> A2 and B2 are matched.  Then we hit a 3 on the A side that has no
> match on the B side, but that's fine; we don't need to do anything.
> If the partition on the A side never got a mapping at any point during
> this merge pass, we'd eventually need to match it to a dummy partition
> (unless this is an inner join) but it's already mapped to B1 so no
> problem.  Then we hit a 4 which says that A2 must match B2, which is
> consistent with what we already determine; no problem.  Then we hit
> another value that only exists on the A side, which is fine just as
> before.  Finally we hit a 6 on each side, which means that A2 must
> match B1, which is inconsistent with the existing mappings so we give
> up; no partitionwise join is possible here.

For a two-way join this works and is fairly straightforward. I am
assuming that A and B are base relations and not joins. But making it
work for an N-way join is the challenge. I don't see your example
describing that. But I think, given your revised position below, we
don't need to get this right at this point. Remember that the
paragraph was about the 3rd goal, which according to your revised position
is now deferred.

>
> Having said that I think we could make this work, I'm starting to
> agree with you that it will add more complexity than it's worth.
> Needing to keep track of the type of every partition bound
> individually seems like a real nuisance, and it's not likely to win
> very often because, realistically, people should and generally will
> use the same type for the partitioning column in all of the relevant
> tables.  So I'm going to revise my position and say it's fine to just
> give up on partitionwise join unless the types match exactly, but I
> still think we should try to cover the cases where the bounds don't
> match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
> optimizations 1 and 2 from your list of 4).  I agree that ganging
> partitions (optimization 4 from your list) is not something to tackle
> right now.

Good. I will have a more enjoyable vacation now.

Do you still want the partition key type to be out of the partition scheme?
Keeping it there means we match it only once and save it only at a
single place. Otherwise, it will have to be stored in the RelOptInfo of
the partitioned table and matched for every pair of joining
relations.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> For two-way join this works and is fairly straight-forward. I am
> assuming that A an B are base relations and not joins. But making it
> work for N-way join is the challenge.

I don't think it's much different, is it?  Anyway, I'm going to
protest if your algorithm for merging bounds takes any more than
linear time, regardless of what else we decide.

>> Having said that I think we could make this work, I'm starting to
>> agree with you that it will add more complexity than it's worth.
>> Needing to keep track of the type of every partition bound
>> individually seems like a real nuisance, and it's not likely to win
>> very often because, realistically, people should and generally will
>> use the same type for the partitioning column in all of the relevant
>> tables.  So I'm going to revise my position and say it's fine to just
>> give up on partitionwise join unless the types match exactly, but I
>> still think we should try to cover the cases where the bounds don't
>> match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
>> optimizations 1 and 2 from your list of 4).  I agree that ganging
>> partitions (optimization 4 from your list) is not something to tackle
>> right now.
>
> Good. I will have a more enjoyable vacation now.

Phew, what a relief.  :-)

> Do you still want the patition key type to be out of partition scheme?
> Keeping it there means we match it only once and save it only at a
> single place. Otherwise, it will have to be stored in RelOptInfo of
> the partitioned table and match it for every pair of joining
> relations.

The only reason for removing things from the PartitionScheme was if
they didn't need to be consistent across all tables.  Deciding that
the type is one of the things that has to match means deciding it
should be in the PartitionScheme, not the RelOptInfo.
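
Roughly, that implies a shape like the following (field names are purely illustrative, not necessarily what the patch uses): everything that must match exactly for two relations to be joined partitionwise lives in the shared PartitionScheme, while per-relation details stay in the RelOptInfo.

#include "postgres.h"

/* Illustrative sketch only, not the patch's definition. */
typedef struct PartitionSchemeData
{
    char        strategy;       /* list or range partitioning */
    int16       partnatts;      /* number of partition key columns */
    Oid        *partopfamily;   /* operator family per key column */
    Oid        *partopcintype;  /* partition key type per column; now
                                 * required to match exactly across the
                                 * joining relations */
    Oid        *partcollation;  /* collation per key column */
} PartitionSchemeData;

typedef PartitionSchemeData *PartitionScheme;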

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> +   /*
>> +    * Construct partition keys for the join.
>> +    *
>> +    * An INNER join between two partitioned relations is partition by key
>> +    * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> +    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> +    * and B.b.
>> +    *
>> +    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> +    * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> +    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> +    * Strictly speaking, partition keys of an OUTER join should include
>> +    * partition key expressions from the OUTER side only. Consider a join like
>> +    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> +    * include B.b as partition key expression for (AB), it prohibits us from
>> +    * using partition-wise join when joining (AB) with C as there is no
>> +    * equi-join between partition keys of joining relations. But two NULL
>> +    * values are never equal and no two rows from mis-matching partitions can
>> +    * join. Hence it's safe to include B.b as partition key expression for
>> +    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> +    */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1.      (A leftjoin B on (Pab)) innerjoin C on (Pac)
>         = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2.      (A leftjoin B on (Pab)) leftjoin C on (Pac)
>         = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3.      (A leftjoin B on (Pab)) leftjoin C on (Pbc)
>         = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B).  If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.

[.. some portion clipped .. ]

> The relevant logic is in have_partkey_equi_join:
>
> +               /* Skip clauses which are not equality conditions. */
> +               if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> +                       continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid.  So you can just bail out if
> rinfo->mergeopfamilies == NIL.  But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.
>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here.  Can you
> spell this out in more detail?  To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place.  Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate.  I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins?  I think not, as long
> as the join operator is strict.)
>

I am revisiting NULL equality in the context of merging partition
bounds. In [1], the paragraphs following

--
Do not write expression = NULL because NULL is not “equal to” NULL.
(The null value represents an unknown value, and it is not known
whether two unknown values are equal.)

--

seem to indicate that an equality operator should never return true
for two NULL values since it would never know whether two NULL
(unknown) values are the same or not. In a paragraph above, Robert stated
that

> In other words, I think your statement that null is never equal to
> null is a bit imprecise.  Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator).  The
> argument needs to be made in terms of the formal properties of the
> operator.

But in case a user has written an = operator which returns true for
two NULL values, per the description in [1], that comparison operator is
flawed, and using that operator is going to result in
SQL-standard-incompliant behaviour. I have tried to preserve all the
relevant portions of the discussion in this mail. Am I missing something?

[1] https://www.postgresql.org/docs/devel/static/functions-comparison.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, May 18, 2017 at 4:38 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> But in case a user has written an = operator which returns true for
> two NULL values, per description in [1], that comparison operator is
> flawed and
> using that operator is going to result in SQL-standard-incompliant
> behaviour. I have tried to preserve all the relevant portions of
> discussion in this mail. Am I missing something?

Yes.  You're confusing friendly advice about how to write good SQL
with internals documentation about how the system actually works.  The
documentation we have about how operator classes and index methods and
so forth actually work under the hood is in
https://www.postgresql.org/docs/devel/static/xindex.html -- as a
developer, that's what you should be looking at.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Apr 29, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> For two-way join this works and is fairly straight-forward. I am
>> assuming that A an B are base relations and not joins. But making it
>> work for N-way join is the challenge.
>
> I don't think it's much different, is it?  Anyway, I'm going to
> protest if your algorithm for merging bounds takes any more than
> linear time, regardless of what else we decide.
>
>>> Having said that I think we could make this work, I'm starting to
>>> agree with you that it will add more complexity than it's worth.
>>> Needing to keep track of the type of every partition bound
>>> individually seems like a real nuisance, and it's not likely to win
>>> very often because, realistically, people should and generally will
>>> use the same type for the partitioning column in all of the relevant
>>> tables.  So I'm going to revise my position and say it's fine to just
>>> give up on partitionwise join unless the types match exactly, but I
>>> still think we should try to cover the cases where the bounds don't
>>> match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
>>> optimizations 1 and 2 from your list of 4).  I agree that ganging
>>> partitions (optimization 4 from your list) is not something to tackle
>>> right now.
>>
>> Good. I will have a more enjoyable vacation now.
>
> Phew, what a relief.  :-)
>
>> Do you still want the patition key type to be out of partition scheme?
>> Keeping it there means we match it only once and save it only at a
>> single place. Otherwise, it will have to be stored in RelOptInfo of
>> the partitioned table and match it for every pair of joining
>> relations.
>
> The only reason for removing things from the PartitionScheme was if
> they didn't need to be consistent across all tables.  Deciding that
> the type is one of the things that has to match means deciding it
> should be in the PartitionScheme, not the RelOptInfo.
>

Here's a set of patches rebased on the latest head.

I spent some time trying to implement partition-wise join when
partition bounds do not match exactly but there's a 1:1, 1:0 or 0:1
mapping between partitions. A WIP patch 0017 is included in the set
for the same. The patch is not complete: it doesn't support range
partitions and needs some bugs to be fixed for list partitions. Also,
because of the way it crafts partition bounds for a join, it leaks the
memory consumed by partition bounds for every pair of joining
relations. I will work on fixing those issues. That patch is pretty
large now, so I think we will have to commit it separately on top of
the basic partition-wise join implementation. But you will see that it
needs minimal changes to the basic partition-wise join code.

I rewrote the code handling partition keys on the nullable side of the
join. Now we store partition keys from nullable and non-nullable
relations separately. The partition keys from nullable relations are
matched only when the equality operator is strict. This is explained
in detail in the comments in match_expr_to_partition_keys() and
build_joinrel_partition_info().
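
As a rough illustration of that split (the struct and field names below are only a sketch, not the actual definitions in the patches):

#include "postgres.h"
#include "nodes/pg_list.h"

/*
 * Illustrative sketch only: a join relation's partition information keeps
 * key expressions from the non-nullable and nullable sides apart, so that
 * expressions coming from the nullable side of an outer join are matched
 * only through strict equality operators.
 */
typedef struct JoinRelPartKeys
{
    int         partnatts;              /* number of partition key columns */
    List      **partexprs;              /* per-column exprs, non-nullable side */
    List      **nullable_partexprs;     /* per-column exprs, nullable side */
} JoinRelPartKeys;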

Also please note that since the last patch set I have paired the
multi-level partition-wise join support patches with the single-level
partition-wise join patches providing the corresponding functionality.

[1] https://www.postgresql.org/message-id/CAFjFpRd9ebX225KhuvYXQRBuk9NrVJfPzHqGPGqpze%2BqvH0xmw%40mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment


On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

Here's set of patches rebased on latest head.

In an attempt to test this set of patches, I found that not all of the patches could be applied on the latest head -- commit 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7.
It might be in need of rebasing.

--
Regards,
Rafia Sabih
On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
>
> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>>
>> Here's set of patches rebased on latest head.
>
>
> In an attempt to test this set of patches, I found that not all of the
> patches could be applied on latest head-- commit
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
> Might be in need of rebasing.

Thanks Rafia for your interest. I have started rebasing the patches on
the latest head. I am expecting it to take some time. Will update the
thread with the patches once I am done rebasing them.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>>
>>
>> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>
>>>
>>> Here's set of patches rebased on latest head.
>>
>>
>> In an attempt to test this set of patches, I found that not all of the
>> patches could be applied on latest head-- commit
>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
>> Might be in need of rebasing.
>
> Thanks Rafia for your interest. I have started rebasing the patches on
> the latest head. I am expecting it to take some time. Will update the
> thread with the patches once I am done rebasing them.
>

Here are the rebased patches.

As mentioned in my previous mail [1], the last two patches are not
complete but are included so that the reviewer can see the changes we
will have to make when we go towards a more general partition-wise join.
Please use the patches up to 0015, which implement 1:1 partition mapping,
for benchmarking and testing.

[1] https://www.postgresql.org/message-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Mon, Jul 10, 2017 at 3:57 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>>
>>>
>>> On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
>>> <ashutosh.bapat@enterprisedb.com> wrote:
>>>>
>>>>
>>>> Here's set of patches rebased on latest head.
>>>
>>>
>>> In an attempt to test this set of patches, I found that not all of the
>>> patches could be applied on latest head-- commit
>>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
>>> Might be in need of rebasing.
>>
>> Thanks Rafia for your interest. I have started rebasing the patches on
>> the latest head. I am expecting it to take some time. Will update the
>> thread with the patches once I am done rebasing them.
>>
>
> Here are patches rebased.
>
> As mentioned in my previous mail [1], the last two patches are not
> complete but are included, so that the reviewer can see the changes we
> will have to make when we go towards more general partition-wise join.
> Please use patches upto 0015, which implement 1:1 partition mapping
> for benchmarking and testing.
>
> [1] https://www.postgresql.org/message-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com
>

Here's a revised patch set with only 0004 revised. That patch deals with
creating a multi-level inheritance hierarchy from a multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead, the revised version keeps the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once, and is still easy to
read and maintain.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment


On Fri, Jul 14, 2017 at 12:32 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
>
> Here's revised patch set with only 0004 revised. That patch deals with
> creating multi-level inheritance hierarchy from multi-level partition
> hierarchy. The original logic of recursively calling
> inheritance_planner()'s guts over the inheritance hierarchy required
> that for every such recursion we flatten many lists created by that
> code. Recursion also meant that root->append_rel_list is traversed as
> many times as the number of partitioned partitions in the hierarchy.
> Instead the revised version keep the iterative shape of
> inheritance_planner() intact, thus naturally creating flat lists,
> iterates over root->append_rel_list only once and is still easy to
> read and maintain.
>
On testing this patch with the TPC-H benchmark (scale factor 20), I found a regression for Q21: on head it was taking some 600 seconds, and with this patch it is taking 3200 seconds. This comparison is on the same partitioned database, one run using the partition-wise join patch and the other without it. The execution time of Q21 on unpartitioned head is some 300 seconds. The explain analyse output for each of these cases is attached.

This suggests that partitioning is not a suitable strategy for this query, but then maybe partition-wise join should not be picked in such a case, so as not to aggravate the performance issue.

The details of the setup are as follows:

Server parameter settings,
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

Commit id - 42171e2cd23c8307bbe0ec64e901f58e297db1c3

I chose orderkey as the partition key since it is the primary key of orders and, along with l_linenumber, it forms the primary key for lineitem.
For the above mentioned settings, there was no other query that used partition-wise join.

Please let me know if any more information is required regarding this experimentation.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/
Attachment
On Wed, Jul 19, 2017 at 12:24 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> On testing this patch for TPC-H (for scale factor 20) benchmark I found a
> regression for Q21, on head it was taking some 600 seconds and with this
> patch it is taking 3200 seconds. This comparison is on the same partitioned
> database, one using the partition wise join patch and other is without it.
> The execution time of Q21 on unpartitioned head is some 300 seconds. The
> explain analyse output for each of these cases is attached.

Interesting.

> This suggests that partitioning is not a suitable strategy for this query,
> but then may be partition wise should not be picked for such a case to
> aggravate the performance issue.

In the unpartitioned case, and in the partitioned case on head, the
join order is l1-(nation-supplier)-l2-orders-l3.  In the patched case,
the join order changes to l1-l2-supplier-orders-nation-l3.  If the
planner used the former join order, it wouldn't be able to do a
partition-wise join at all, so it must think that the l1-l2 join gets
much cheaper when done partitionwise, thus justifying a change in the
overall join order to be able to use partition-wise join.  But it
doesn't work out.

I think the problem is that the row count estimates for the child
joins seem to be totally bogus:

->  Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)
    (actual time=10484.422..15945.851 rows=1523493 loops=3)
      Hash Cond: (l1.l_orderkey = l2.l_orderkey)
      Join Filter: (l2.l_suppkey <> l1.l_suppkey)
      Rows Removed by Join Filter: 395116

That's clearly wrong.  In the un-partitioned plan, the join to l2
produces about as many rows of output as the number of rows that were
input (998433 vs. 962909); but here, a child join with a million rows
as input is estimated to produce only 1 row of output.  I bet the
problem is that the child-join's row count estimate isn't getting
initialized at all, but then something is clamping it to 1 row instead
of 0.
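
(For reference, here is a rough paraphrase of clamp_row_est() in costsize.c -- not the exact source -- which is where a zero or never-initialized estimate would end up being reported as rows=1.)

#include <math.h>

/* Rough paraphrase of clamp_row_est() in costsize.c: estimates are forced
 * to at least one row and rounded to an integer, so an estimate that never
 * got set past zero surfaces as rows=1 in EXPLAIN output. */
double
clamp_row_est(double nrows)
{
    if (nrows <= 1.0)
        nrows = 1.0;
    else
        nrows = rint(nrows);

    return nrows;
}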

So this looks like a bug in Ashutosh's patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the problem is that the row count estimates for the child
> joins seem to be totally bogus:
>
> ->  Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)
> (actual time=10484.422..15945.851 rows=1523493 loops=3)
>   Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>   Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>   Rows Removed by Join Filter: 395116
>
> That's clearly wrong.  In the un-partitioned plan, the join to l2
> produces about as many rows of output as the number of rows that were
> input (998433 vs. 962909); but here, a child join with a million rows
> as input is estimated to produce only 1 row of output.  I bet the
> problem is that the child-join's row count estimate isn't getting
> initialized at all, but then something is clamping it to 1 row instead
> of 0.
>
> So this looks like a bug in Ashutosh's patch.

Isn't this the same as the issue reported here?

https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think the problem is that the row count estimates for the child
>> joins seem to be totally bogus:
>>
>> ->  Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)
>> (actual time=10484.422..15945.851 rows=1523493 loops=3)
>>   Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>>   Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>>   Rows Removed by Join Filter: 395116
>>
>> That's clearly wrong.  In the un-partitioned plan, the join to l2
>> produces about as many rows of output as the number of rows that were
>> input (998433 vs. 962909); but here, a child join with a million rows
>> as input is estimated to produce only 1 row of output.  I bet the
>> problem is that the child-join's row count estimate isn't getting
>> initialized at all, but then something is clamping it to 1 row instead
>> of 0.
>>
>> So this looks like a bug in Ashutosh's patch.
>
> Isn't this the same as the issue reported here?
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

Hmm, possibly.  But why would that affect the partition-wise join case only?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Isn't this the same as the issue reported here?
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com
>
> Hmm, possibly.  But why would that affect the partition-wise join case only?

It doesn't.  From Rafia's part_reg.zip we see a bunch of rows=1 that
turn out to be wrong by several orders of magnitude:

21_nopart_head.out:  Hash Semi Join  (cost=5720107.25..9442574.55
rows=1 width=50)
21_part_head.out:    Hash Semi Join  (cost=5423094.06..8847638.36
rows=1 width=38)
21_part_patched.out: Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)

My guess is that the consequences of that bad estimate are sensitive
to arbitrary other parameters moving around, as you can see from the
big jump in execution time I showed in that message, measured on
unpatched master of the day:
  4 workers = 9.5s
  3 workers = 39.7s

That's why both parallel hash join and partition-wise join are
showing regressions on Q21: it's just flip-flopping between various
badly costed plans.  Note that even without parallelism, the fix that
Tom Lane suggested gives a much better plan:

https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
> Partition information:
> Type of partitioning - single column range partition
> Tables partitioned - Lineitem and orders
>
> Lineitem -
> Partition key = l_orderkey
> No of partitions = 18
>
> Orders -
> Partition key = o_orderkey
> No of partitions = 11
>

The patch set up to 0015 would refuse to join two partitioned relations
using a partition-wise join if they have different numbers of
partitions. The next patches implement a more advanced partition matching
algorithm only for list partitions. Those next patches would refuse to
apply partition-wise join for range partitioned tables. So, I am
confused as to how partition-wise join is being chosen even when
the numbers of partitions differ.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On 2017/07/20 15:05, Ashutosh Bapat wrote:
> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>>
>> Partition information:
>> Type of partitioning - single column range partition
>> Tables partitioned - Lineitem and orders
>>
>> Lineitem -
>> Partition key = l_orderkey
>> No of partitions = 18
>>
>> Orders -
>> Partition key = o_orderkey
>> No of partitions = 11
>>
> 
> The patch set upto 0015 would refuse to join two partitioned relations
> using a partition-wise join if they have different number of
> partitions. Next patches implement a more advanced partition matching
> algorithm only for list partitions. Those next patches would refuse to
> apply partition-wise join for range partitioned tables. So, I am
> confused as to how come partition-wise join is being chosen even when
> the number of partitions differ.

In 21_part_patched.out, I see that lineitem is partitionwise-joined with
itself.
>  Append
   ->  Hash Semi Join
       Hash Cond: (l1.l_orderkey = l2.l_orderkey)
       Join Filter: (l2.l_suppkey <> l1.l_suppkey)
       Rows Removed by Join Filter: 395116
       ->  Parallel Seq Scan on lineitem_001 l1
           Filter: (l_receiptdate > l_commitdate)
           Rows Removed by Filter: 919654
       ->  Hash
           Buckets: 8388608  Batches: 1  Memory Usage: 358464kB
           ->  Seq Scan on lineitem_001 l2


Thanks,
Amit




On Thu, Jul 20, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> This suggests that partitioning is not a suitable strategy for this query,
>> but then may be partition wise should not be picked for such a case to
>> aggravate the performance issue.
>
> In the unpartitioned case, and in the partitioned case on head, the
> join order is l1-(nation-supplier)-l2-orders-l3.  In the patched case,
> the join order changes to l1-l2-supplier-orders-nation-l3.  If the
> planner used the former join order, it wouldn't be able to do a
> partition-wise join at all, so it must think that the l1-l2 join gets
> much cheaper when done partitionwise, thus justifying a change in the
> overall join order to be able to use partion-wise join.  But it
> doesn't work out.
>
> I think the problem is that the row count estimates for the child
> joins seem to be totally bogus:
>
> ->  Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)
> (actual time=10484.422..15945.851 rows=1523493 loops=3)
>   Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>   Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>   Rows Removed by Join Filter: 395116
>
> That's clearly wrong.  In the un-partitioned plan, the join to l2
> produces about as many rows of output as the number of rows that were
> input (998433 vs. 962909); but here, a child join with a million rows
> as input is estimated to produce only 1 row of output.  I bet the
> problem is that the child-join's row count estimate isn't getting
> initialized at all, but then something is clamping it to 1 row instead
> of 0.
>
> So this looks like a bug in Ashutosh's patch.

The patch does not have any changes to the selectivity estimation. It
might happen that some correction in selectivity estimation for
child-joins is required, but I have not spotted any code in
selectivity estimation that differentiates explicitly between child
and parent Vars and estimates. So, I am more inclined to believe
Thomas's theory. I will try Tom's suggested approach.

I am investigating this case with the setup that Rafia provided.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/07/20 15:05, Ashutosh Bapat wrote:
>> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>>
>>> Partition information:
>>> Type of partitioning - single column range partition
>>> Tables partitioned - Lineitem and orders
>>>
>>> Lineitem -
>>> Partition key = l_orderkey
>>> No of partitions = 18
>>>
>>> Orders -
>>> Partition key = o_orderkey
>>> No of partitions = 11
>>>
>>
>> The patch set upto 0015 would refuse to join two partitioned relations
>> using a partition-wise join if they have different number of
>> partitions. Next patches implement a more advanced partition matching
>> algorithm only for list partitions. Those next patches would refuse to
>> apply partition-wise join for range partitioned tables. So, I am
>> confused as to how come partition-wise join is being chosen even when
>> the number of partitions differ.
>
> In 21_part_patched.out, I see that lineitem is partitionwise-joined with
> itself.
>
>  >  Append
>
>    ->  Hash Semi Join
>        Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>        Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>        Rows Removed by Join Filter: 395116
>
>        ->  Parallel Seq Scan on lineitem_001 l1
>            Filter: (l_receiptdate > l_commitdate)
>            Rows Removed by Filter: 919654
>
>        ->  Hash
>            Buckets: 8388608  Batches: 1  Memory Usage: 358464kB
>            ->  Seq Scan on lineitem_001 l2
>
Ah, I see now.

We need the same number of partitions in all the partitioned tables for
the joins to pick up partition-wise join.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




On Thu, Jul 20, 2017 at 8:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Isn't this the same as the issue reported here?
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com
>
> Hmm, possibly.  But why would that affect the partition-wise join case only?

It doesn't.  From Rafia's part_reg.zip we see a bunch of rows=1 that
turn out to be wrong by several orders of magnitude:

21_nopart_head.out:  Hash Semi Join  (cost=5720107.25..9442574.55
rows=1 width=50)
21_part_head.out:    Hash Semi Join  (cost=5423094.06..8847638.36
rows=1 width=38)
21_part_patched.out: Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)

My guess is that the consequences of that bad estimate are sensitive
to arbitrary other parameters moving around, as you can see from the
big jump in execution time I showed in the that message, measured on
unpatched master of the day:

  4 workers = 9.5s
  3 workers = 39.7s

That's why why both parallel hash join and partition-wise join are
showing regressions on Q21: it's just flip-flopping between various
badly costed plans.  Note that even without parallelism, the fix that
Tom Lane suggested gives a much better plan:

https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com


Following the discussion at [1], with the patch Thomas posted there, Q21 now completes in some 160 seconds. The plan has changed for the better but does not use partition-wise join. The output of explain analyse is attached.

Not just the join order but the join strategy itself changed: with the patch no hash semi join is picked, which was consuming most of the time there; rather, a nested loop semi join is in the picture now. Though the estimates are still way off, the change in join order made them terrible rather than horrible. It appears this query is performing efficiently now particularly because the under-estimated hash join was worse than the under-estimated nested loop join.

For the hash-semi-join:
->  Hash  (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual time=180858.448..180858.448 rows=119994608 loops=3)
                                                   Buckets: 33554432  Batches: 8  Memory Usage: 847911kB

Overall, this doesn't look like a problem of the partition-wise join patch itself.


--
Regards,
Rafia Sabih
Attachment
On Thu, Jul 20, 2017 at 2:44 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > On 2017/07/20 15:05, Ashutosh Bapat wrote:
> >> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
> >> <rafia.sabih@enterprisedb.com> wrote:
> >>>
> >>> Partition information:
> >>> Type of partitioning - single column range partition
> >>> Tables partitioned - Lineitem and orders
> >>>
> >>> Lineitem -
> >>> Partition key = l_orderkey
> >>> No of partitions = 18
> >>>
> >>> Orders -
> >>> Partition key = o_orderkey
> >>> No of partitions = 11
> >>>
> >>
> >> The patch set upto 0015 would refuse to join two partitioned relations
> >> using a partition-wise join if they have different number of
> >> partitions. Next patches implement a more advanced partition matching
> >> algorithm only for list partitions. Those next patches would refuse to
> >> apply partition-wise join for range partitioned tables. So, I am
> >> confused as to how come partition-wise join is being chosen even when
> >> the number of partitions differ.
> >
> > In 21_part_patched.out, I see that lineitem is partitionwise-joined with
> > itself.
> >
> >  >  Append
> >
> >    ->  Hash Semi Join
> >        Hash Cond: (l1.l_orderkey = l2.l_orderkey)
> >        Join Filter: (l2.l_suppkey <> l1.l_suppkey)
> >        Rows Removed by Join Filter: 395116
> >
> >        ->  Parallel Seq Scan on lineitem_001 l1
> >            Filter: (l_receiptdate > l_commitdate)
> >            Rows Removed by Filter: 919654
> >
> >        ->  Hash
> >            Buckets: 8388608  Batches: 1  Memory Usage: 358464kB
> >            ->  Seq Scan on lineitem_001 l2
> >
> Ah, I see now.
>
> We need the same number of partitions in all partitioned tables, for
> joins to pick up partition-wise join.
>
Oh, I missed this limitation; I will modify my setup to have the same number
of partitions in the partitioned tables with the same ranges. So, does this
also mean that a partitioned table will not join with an unpartitioned
table without an append of its partitions?

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



On Fri, Jul 21, 2017 at 11:42 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
> Following the discussion at [1], with the patch Thomas posted there, now Q21
> completes in some 160 seconds.

Your earlier reports mentioned the unpartitioned case taking 300 seconds,
the partitioned case without partition-wise join taking 600 seconds, and
with partition-wise join taking 3200 seconds. My experiments
showed that those have changed to 70s, 160s and 160s respectively. This is
with Thomas's patch. Can you please confirm?

> The plan is changed for the good but does not
> use partition-wise join.

As explained earlier, this is because the tables are not partitioned
similarly. Please try with lineitem and orders partitioned similarly,
i.e. the same number of partitions and exactly the same ranges.


> Not just the join orders but the join strategy itself changed, with the
> patch no hash semi join is picked which was consuming most time there,
> rather nested loop semi join is in picture now, though the estimates are
> still way-off, but the change in join-order made them terrible from
> horrible. It appears like this query is performing efficient now
> particularly because of worse under-estimated hash-join as compared to
> under-estimated nested loop join.

Earlier it was using partition-wise join between the lineitem instances (l1,
l2, l3) since it's the same table. Now for some reason the planner doesn't
find joining them to each other a better strategy; instead they are
joined indirectly, so we don't see partition-wise join being picked. We
should experiment with orders and lineitem being partitioned
similarly. Can you please provide that result?

>
> For the hash-semi-join:
> ->  Hash  (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual
> time=180858.448..180858.448 rows=119994608 loops=3)
>                                                    Buckets: 33554432
> Batches: 8  Memory Usage: 847911kB
>
> Overall, this doesn't look like a problem of partition-wise join patch
> itself.
>

Thanks for confirming it.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> So, does this
> also mean that a partitioned table will not join with an unpartitioned
> table without append of partitions?
>

Yes. When you join an unpartitioned table with a partitioned table,
the planner will choose to append all the partitions of the
partitioned table and then join with the unpartitioned table.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company





On Fri, Jul 21, 2017 at 12:11 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> So, does this
>> also mean that a partitioned table will not join with an unpartitioned
>> table without append of partitions?
>>
>
> Yes. When you join an unpartitioned table with a partitioned table,
> the planner will choose to append all the partitions of the
> partitioned table and then join with the unpartitioned table.
>

I tested this set of patches with the TPC-H benchmark and came across the following results:
- in total, 7 queries were using partition-wise join,

- Q4 attains a speedup of around 80% compared to the partitioned setup without partition-wise join, the main reason being the poor plan choice at head for the partitioned database.
When I tried this query with a forced nested-loop join it completed in some 45 seconds at head. So, basically, when no partition-wise join is present, the optimiser picks a hash join plan because of terrible selectivity estimation, which performs poorly as the estimated number of rows is two orders of magnitude lower than actual.
Note that this is not the effect of [1]; I tried this without that patch as well.

- other queries show a good 20-30% improvement in performance. Performance numbers are as follows,

Query | un_part_head (seconds) | part_head (seconds) | part_patch (seconds)
    3 |                      76 |                 127 |                   88
    4 |                      17 |                 244 |                   41
    5 |                      52 |                 123 |                   84
    7 |                      73 |                 134 |                  103
   10 |                      67 |                 111 |                   89
   12 |                      53 |                 114 |                   99
   18 |                     447 |                 709 |                  551

The experimental settings used were,

Partitioning: Range partitioning on lineitem and orders on l_orderkey and o_orderkey respectively. The number and range of partitions were kept the same for both tables.

Server parameters:
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on

TPC-H setup:
scale-factor - 20

Commit id - 42171e2cd23c8307bbe0ec64e901f58e297db1c3, also, the patch at [1] was applied in all the cases.
Query plans for the above mentioned queries are attached.

Attachment
On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
> - other queries show a good 20-30% improvement in performance. Performance
> numbers are as follows,
>
> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
> 3 | 76 |127 | 88 |
> 4 |17 | 244 | 41 |
> 5 | 52 | 123 | 84 |
> 7 | 73 | 134 | 103 |
> 10 | 67 | 111 | 89 |
> 12 | 53 | 114 | 99 |
> 18 | 447 | 709 | 551 |

Hmm.  This certainly shows the benefit of the patch, although it's
rather sad that we're still slower than if we hadn't partitioned the
data in the first place.  Why does partitioning hurt performance so
much?

Maybe things would be better at a higher scale factor.

When reporting results of this sort, it would be good to make a habit
of reporting the number of partitions along with the other details you
included.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
> <rafia.sabih@enterprisedb.com> wrote:
>> - other queries show a good 20-30% improvement in performance. Performance
>> numbers are as follows,
>>
>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
>> 3 | 76 |127 | 88 |
>> 4 |17 | 244 | 41 |
>> 5 | 52 | 123 | 84 |
>> 7 | 73 | 134 | 103 |
>> 10 | 67 | 111 | 89 |
>> 12 | 53 | 114 | 99 |
>> 18 | 447 | 709 | 551 |
>
> Hmm.  This certainly shows that benefit of the patch, although it's
> rather sad that we're still slower than if we hadn't partitioned the
> data in the first place.  Why does partitioning hurt performance so
> much?

I was analysing some of the plans (without partition and with
partition). It seems like one of the reasons for performing worse with
the partitioned table is that we cannot use an index on the partitioned
table.

Q4 is taking 17s without partition whereas it's taking 244s with partition.

Now if we analyze the plan

Without partitioning, it can use a parameterized index scan on the
lineitem table, which is really big. But with partitioning, it has to
scan this table completely.
                         ->  Nested Loop Semi Join
                               ->  Parallel Bitmap Heap Scan on orders
                                     ->  Bitmap Index Scan on idx_orders_orderdate  (cost=0.00..24378.88 r
                               ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.57..29.29 rows=105 width=8) (actual time=0.031..0.031 rows=1 loops=1122364)
                                     Index Cond: (l_orderkey = orders.o_orderkey)
                                     Filter: (l_commitdate < l_receiptdate)
                                     Rows Removed by Filter: 1
 

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>> - other queries show a good 20-30% improvement in performance. Performance
>>> numbers are as follows,
>>>
>>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
>>> 3 | 76 |127 | 88 |
>>> 4 |17 | 244 | 41 |
>>> 5 | 52 | 123 | 84 |
>>> 7 | 73 | 134 | 103 |
>>> 10 | 67 | 111 | 89 |
>>> 12 | 53 | 114 | 99 |
>>> 18 | 447 | 709 | 551 |
>>
>> Hmm.  This certainly shows that benefit of the patch, although it's
>> rather sad that we're still slower than if we hadn't partitioned the
>> data in the first place.  Why does partitioning hurt performance so
>> much?
>
> I was analysing some of the plans (without partition and with
> partition), Seems like one of the reasons of performing worse with the
> partitioned table is that we can not use an index on the partitioned
> table.
>
> Q4 is taking 17s without partition whereas it's taking 244s with partition.
>
> Now if we analyze the plan
>
> Without partition, it can use parameterize index scan on lineitem
> table which is really big in size. But with partition, it has to scan
> this table completely.
>
>                           ->  Nested Loop Semi Join
>                                  ->  Parallel Bitmap Heap Scan on orders
>                                        ->  Bitmap Index Scan on
> idx_orders_orderdate  (cost=0.00..24378.88 r
>                        ->  Index Scan using idx_lineitem_orderkey on
> lineitem  (cost=0.57..29.29 rows=105 width=8) (actual
> time=0.031..0.031 rows=1 loops=1122364)
>                                        Index Cond: (l_orderkey =
> orders.o_orderkey)
>                                        Filter: (l_commitdate < l_receiptdate)
>                                        Rows Removed by Filter: 1
>

If the partitions have the same indexes as the unpartitioned table,
the planner manages to create a parameterized plan for each partition
and thus a parameterized plan for the whole partitioned table. Do we
have the same indexes on the unpartitioned table and on each of the
partitions? The difference between the two cases is that the
parameterized path on an unpartitioned table scans only one index,
whereas that on the partitioned table scans the indexes on all the
partitions. My guess is the planner thinks that many scans are
costlier than a hash/merge join and chooses those strategies over a
parameterized nested loop join. In case of partition-wise join, only
one index on the inner partition is involved, and thus partition-wise
join picks up a parameterized nested loop join. Notice that this index
is much smaller than the index on the partitioned table, so the index
scan will be a bit faster. But only a bit, since the depth of an index
doesn't increase linearly with its size.

Run-time partition pruning will improve performance even without
partition-wise join, since partition pruning will be able to eliminate
all but one partition and only one index needs to be scanned. If the
planner is smart enough to cost that effectively, it will choose a
parameterized nested loop join for the partitioned table, bringing the
performance close to the unpartitioned case.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

> Query plans for the above mentioned queries is attached.
>

Can you please share plans for all the queries, even those that
haven't chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in the
explain analyze output by passing the -a or -e flag to psql. That way
we will have the query and its plan in the same file for ready
reference.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company





On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

> Query plans for the above mentioned queries is attached.
>

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

I didn't run the query not using partition-wise join, for now.


--
Regards,
Rafia Sabih
On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
>
> On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>
>> > Query plans for the above mentioned queries is attached.
>> >
>>
>> Can you please share plans for all the queries, even if they haven't
>> chosen partition-wise join when run on partitioned tables with
>> enable_partition_wise_join ON? Also, please include the query in
>> explain analyze output using -a or -e flats to psql. That way we will
>> have query and its plan in the same file for ready reference.
>>
> I didn't run the query not using partition-wise join, for now.

parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company





On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
>
> On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>>
>> > Query plans for the above mentioned queries is attached.
>> >
>>
>> Can you please share plans for all the queries, even if they haven't
>> chosen partition-wise join when run on partitioned tables with
>> enable_partition_wise_join ON? Also, please include the query in
>> explain analyze output using -a or -e flats to psql. That way we will
>> have query and its plan in the same file for ready reference.
>>
> I didn't run the query not using partition-wise join, for now.

parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?

Yes, that's what I mean.
--
Regards,
Rafia Sabih
On Wed, Jul 26, 2017 at 11:08 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>
>
> On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>
>> On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
>> <rafia.sabih@enterprisedb.com> wrote:
>> >
>> >
>> > On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
>> > <ashutosh.bapat@enterprisedb.com> wrote:
>> >>
>> >> On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
>> >> <rafia.sabih@enterprisedb.com> wrote:
>> >>
>> >> > Query plans for the above mentioned queries is attached.
>> >> >
>> >>
>> >> Can you please share plans for all the queries, even if they haven't
>> >> chosen partition-wise join when run on partitioned tables with
>> >> enable_partition_wise_join ON? Also, please include the query in
>> >> explain analyze output using -a or -e flats to psql. That way we will
>> >> have query and its plan in the same file for ready reference.
>> >>
>> > I didn't run the query not using partition-wise join, for now.
>>
>> parse-parse error, sorry. Do you mean, you haven't run the queries
>> which do not use partition-wise join?
>>
> Yes, that's what I mean.

Ok. If those queries have an equi-join between partitioned tables and
are not picking up partition-wise join, that case needs to be
investigated. Q21, for example, has a join between three lineitem
instances. Those joins can be executed by partition-wise join. But it
may so happen that the optimal join order doesn't join partitioned
tables with each other, thus interleaving partitioned tables with
unpartitioned or differently partitioned tables in the join order.
Partition-wise join is not possible then. A different partitioning
scheme may be required there.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Wed, Jul 26, 2017 at 10:38 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
> >> <rafia.sabih@enterprisedb.com> wrote:
> >>> - other queries show a good 20-30% improvement in performance. Performance
> >>> numbers are as follows,
> >>>
> >>> Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
> >>> 3 | 76 |127 | 88 |
> >>> 4 |17 | 244 | 41 |
> >>> 5 | 52 | 123 | 84 |
> >>> 7 | 73 | 134 | 103 |
> >>> 10 | 67 | 111 | 89 |
> >>> 12 | 53 | 114 | 99 |
> >>> 18 | 447 | 709 | 551 |
> >>
> >> Hmm.  This certainly shows that benefit of the patch, although it's
> >> rather sad that we're still slower than if we hadn't partitioned the
> >> data in the first place.  Why does partitioning hurt performance so
> >> much?
> >
> > I was analysing some of the plans (without partition and with
> > partition), Seems like one of the reasons of performing worse with the
> > partitioned table is that we can not use an index on the partitioned
> > table.
> >
> > Q4 is taking 17s without partition whereas it's taking 244s with partition.
> >
> > Now if we analyze the plan
> >
> > Without partition, it can use parameterize index scan on lineitem
> > table which is really big in size. But with partition, it has to scan
> > this table completely.
> >
> >                           ->  Nested Loop Semi Join
> >                                  ->  Parallel Bitmap Heap Scan on orders
> >                                        ->  Bitmap Index Scan on
> > idx_orders_orderdate  (cost=0.00..24378.88 r
> >                        ->  Index Scan using idx_lineitem_orderkey on
> > lineitem  (cost=0.57..29.29 rows=105 width=8) (actual
> > time=0.031..0.031 rows=1 loops=1122364)
> >                                        Index Cond: (l_orderkey =
> > orders.o_orderkey)
> >                                        Filter: (l_commitdate < l_receiptdate)
> >                                        Rows Removed by Filter: 1
> >
>
> If the partitions have the same indexes as the unpartitioned table,
> planner manages to create parameterized plans for each partition and
> thus parameterized plan for the whole partitioned table. Do we have
> same indexes on unpartitioned table and each of the partitions? The

Yes, both lineitem and orders have the same number of partitions,
viz. 17, on the same partitioning key (*_orderkey) and with the same
ranges for each partition. However, I missed creating the index on
o_orderdate for the partitions. But even after creating it, the plan
with the bitmap heap scan is still used and the query still completes
in some 200 seconds; check the attached file for the query plan.

> difference between the two cases is the parameterized path on an
> unpartitioned table scans only one index whereas that on the
> partitioned table scans the indexes on all the partitions. My guess is
> the planner thinks those many scans are costlier than hash/merge join
> and chooses those strategies over parameterized nest loop join. In
> case of partition-wise join, only one index on the inner partition is
> involved and thus partition-wise join picks up parameterized nest loop
> join. Notice, that this index is much smaller than the index on the
> partitioned table, so the index scan will be a bit faster. But only a
> bit, since the depth of the index doesn't increase linearly with the
> size of index.
>
As I have observed, the thing with this query is that the selectivity
estimate is much higher than the actual row count. When an index scan
is chosen for lineitem on the inner side of the nested loop join, the
query completes quickly, since the number of rows actually returned is
very low. However, if we pick a seq scan, or lineitem ends up on the
outer side, the query is going to take a really long time. Now, when a
hash join is picked in the partitioned case and no partition-wise join
is available, a seq scan is preferred over an index scan, hence the
elongated query execution time.

I tried this query with random_page_cost = 0 and a forced nested loop
join, and the chosen plan completes the query in 45 seconds; check the
attached file for the explain analyse output.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Attachment
On Wed, Jul 26, 2017 at 12:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> Ok. If those queries have equi-join between partitioned tables and are
> not picking up partition-wise join, that case needs to be
> investigated. Q21 for example has join between three lineitem
> instances. Those joins can be executed by partition-wise join. But it
> may so happen that optimal join order doesn't join partitioned tables
> with each other, thus interleaving partitioned tables with
> unpartitioned or differently partitioned tables in join order.
> Partition-wise join is not possible then. A different partitioning
> scheme may be required there.
>
Good point, will look into this direction as well.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/



On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Here's revised patch set with only 0004 revised. That patch deals with
> creating multi-level inheritance hierarchy from multi-level partition
> hierarchy. The original logic of recursively calling
> inheritance_planner()'s guts over the inheritance hierarchy required
> that for every such recursion we flatten many lists created by that
> code. Recursion also meant that root->append_rel_list is traversed as
> many times as the number of partitioned partitions in the hierarchy.
> Instead the revised version keep the iterative shape of
> inheritance_planner() intact, thus naturally creating flat lists,
> iterates over root->append_rel_list only once and is still easy to
> read and maintain.

0001-0003 look basically OK to me, modulo some cosmetic stuff.  Regarding 0004:

+        if (brel->reloptkind != RELOPT_BASEREL &&
+            brte->relkind != RELKIND_PARTITIONED_TABLE)

I spent a lot of time staring at this code before I figured out what
was going on here.  We're iterating over simple_rel_array, so the
reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
But does that guarantee that rtekind is RTE_RELATION such that
brte->relkind will be initialized to a value?  I'm not 100% sure.  I
think it would be clearer to write this test like this:

Assert(IS_SIMPLE_REL(brel));
if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
    (brte->rtekind != RELOPT_BASEREL ||
     brte->relkind != RELKIND_PARTITIONED_TABLE))
    continue;
 

Note that the way you wrote the comment, it says if it *is* another
REL, not if it's *not* a baserel; it's good if those kinds of little
details match between the code and the comments.

It is not clear to me what the motivation is for the API changes in
expand_inherited_rtentry.  They don't appear to be necessary.  If
they are necessary, you need to do a more thorough job updating the
comments.  This one, in particular:
 *      If so, add entries for all the child tables to the query's
 *      rangetable, and build AppendRelInfo nodes for all the child tables
 *      and add them to root->append_rel_list.  If not, clear the entry's
 

And the comments could maybe say something like "We return the list of
appinfos rather than directly appending it to append_rel_list because
$REASON."

-         * is a partitioned table.
+         * RTE simply duplicates the parent *partitioned* table.
          */
-        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)

This is another case where it's hard to understand the test from the comments.

+     * In case of multi-level inheritance hierarchy, for every child we require
+     * PlannerInfo of its immediate parent. Hence we save those in a an array

Say why.  Also, need to fix "a an".

I'm a little bit surprised that this patch doesn't make any changes to
allpaths.c or relnode.c.  It looks to me like we'll generate paths for
the new RTEs that are being added.  Are we going to end up with
multiple levels of Append nodes, then?  Does the way
consider_parallel is propagated up and down in set_append_rel_size()
and set_append_rel_pathlist() really work with multiple levels?  Maybe
this is all fine; I haven't tried to verify it in depth.

Overall I think this is a reasonable direction to go but I'm worried
that there may be bugs lurking -- other code that needs adjusting that
hasn't been found, really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Here's revised patch set with only 0004 revised. That patch deals with
>> creating multi-level inheritance hierarchy from multi-level partition
>> hierarchy. The original logic of recursively calling
>> inheritance_planner()'s guts over the inheritance hierarchy required
>> that for every such recursion we flatten many lists created by that
>> code. Recursion also meant that root->append_rel_list is traversed as
>> many times as the number of partitioned partitions in the hierarchy.
>> Instead the revised version keep the iterative shape of
>> inheritance_planner() intact, thus naturally creating flat lists,
>> iterates over root->append_rel_list only once and is still easy to
>> read and maintain.
>
> 0001-0003 look basically OK to me, modulo some cosmetic stuff.  Regarding 0004:
>
> +        if (brel->reloptkind != RELOPT_BASEREL &&
> +            brte->relkind != RELKIND_PARTITIONED_TABLE)
>
> I spent a lot of time staring at this code before I figured out what
> was going on here.  We're iterating over simple_rel_array, so the
> reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
> But does that guarantee that rtekind is RTE_RELATION such that
> brte->relkind will be initialized to a value?  I'm not 100% sure.

Comment in RangeTblEntry says
 952     /*
 953      * Fields valid for a plain relation RTE (else zero):
 954      *
... clipped portion for RTE_NAMEDTUPLESTORE related comment
 960     Oid         relid;          /* OID of the relation */
 961     char        relkind;        /* relation kind (see pg_class.relkind) */
 

This means that relkind will be 0 when rtekind != RTE_RELATION. So,
the condition holds. But code creating an RTE somewhere which is not
in sync with this comment would create a problem. So your suggestion
makes sense.

> I
> think it would be clearer to write this test like this:
>
> Assert(IS_SIMPLE_REL(brel));
> if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
>     (brte->rtekind != RELOPT_BASEREL ||

Do you mean (brte->rtekind != RTE_RELATION)?

>     brte->relkind != RELKIND_PARTITIONED_TABLE))
>     continue;
>
> Note that the way you wrote the comment is says if it *is* another
> REL, not if it's *not* a baserel; it's good if those kinds of little
> details match between the code and the comments.

I find that the existing comment and code in this part of the function
differ. The comment just above the loop on simple_rel_array[] talks
about changing something in the child, but the very next line skips
child relations, and later a loop on append_rel_list makes changes to
the appropriate children. I guess it's done that way to keep the code
working even if we introduce some RELOPTKIND other than BASEREL or
OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
sense. Changed it according to your suggestion.
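
For reference, with the RTE_RELATION correction applied, the test
presumably ends up looking roughly like this (illustrative sketch, not
a verbatim excerpt from the patch):

    Assert(IS_SIMPLE_REL(brel));
    if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
        (brte->rtekind != RTE_RELATION ||
         brte->relkind != RELKIND_PARTITIONED_TABLE))
        continue;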

>
> It is not clear to me what the motivation is for the API changes in
> expanded_inherited_rtentry.  They don't appear to be necessary.

expand_inherited_rtentry() creates AppendRelInfos for all the children
of a given parent and collects them in a list. The list is appended to
root->append_rel_list at the end of the function. Now that function
needs to do this recursively. This means that for a partitioned
partition table, its children's AppendRelInfos would be added to
root->append_rel_list before the AppendRelInfo of that partitioned
partition table itself. inheritance_planner() assumes that the parent's
AppendRelInfo comes before its children's in append_rel_list. This
assumption allows it to be simplified greatly, retaining its iterative
form. My earlier patches had a recursive version of
inheritance_planner(), which is complex. I have comments in this patch
explaining this.

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of the children.
But that function throws away the AppendRelInfo it created when there
are no real children, i.e. in a partitioned table's case when it has no
leaf partitions. So, we can't do that. Hence, I chose to change the API
to return the list of AppendRelInfos when the given RTE has real
children.

> If
> they are necessary, you need to do a more thorough job updating the
> comments.  This one, in particular:
>
>   *      If so, add entries for all the child tables to the query's
>   *      rangetable, and build AppendRelInfo nodes for all the child tables
>   *      and add them to root->append_rel_list.  If not, clear the entry's

Done.

>
> And the comments could maybe say something like "We return the list of
> appinfos rather than directly appending it to append_rel_list because
> $REASON."

Done. Please check the attached version.

>
> -         * is a partitioned table.
> +         * RTE simply duplicates the parent *partitioned* table.
>           */
> -        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
> +        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)
>
> This is another case where it's hard to understand the test from the comments.

The current comment says it all, but in a very cryptic manner.
1526         /*
1527          * Build an AppendRelInfo for this parent and child,
unless the child
1528          * RTE simply duplicates the parent *partitioned* table.
1529          */

The comment makes sense in the context of this paragraph in the prologue
1364  * Note that the original RTE is considered to represent the whole
1365  * inheritance set.  The first of the generated RTEs is an RTE for the same
1366  * table, but with inh = false, to represent the parent table in its role
1367  * as a simple member of the inheritance set.
1368  *

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

I have reworded it as
1526         /*
1527          * Build an AppendRelInfo for this parent and child,
unless the child
1528          * RTE represents the parent as a simple member of inheritance set.
1529          */

>
> +     * In case of multi-level inheritance hierarchy, for every child we require
> +     * PlannerInfo of its immediate parent. Hence we save those in a an array
>
> Say why.  Also, need to fix "a an".

Done.

>
> I'm a little bit surprised that this patch doesn't make any changes to
> allpaths.c or relnode.c.

> It looks to me like we'll generate paths for
> the new RTEs that are being added.  Are we going to end up with
> multiple levels of Append nodes, then?  Does the consider the way
> consider_parallel is propagated up and down in set_append_rel_size()
> and set_append_rel_pathlist() really work with multiple levels?  Maybe
> this is all fine; I haven't tried to verify it in depth.

This has been discussed before, but I cannot locate the mail
answering these questions. accumulate_append_subpath(), called from
add_paths_to_append_rel(), takes care of flattening Merge/Append paths.
The planner code already deals with the multi-level inheritance
hierarchy created for subqueries with set operations. There is code in
relnode.c to build the RelOptInfos for such subqueries recursively
using the RangeTblEntry::inh flag. So there are no changes in
allpaths.c and relnode.c. Are you looking for some other changes?

>
> Overall I think this is a reasonable direction to go but I'm worried
> that there may be bugs lurking -- other code that needs adjusting that
> hasn't been found, really.
>

The planner code is already aware of such hierarchies except for DMLs,
which this patch adjusts. We have fixed the issues revealed by my and
Rajkumar's testing.
What kinds of things do you suspect?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Forgot the patch set. Here it is.

On Mon, Jul 31, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Here's revised patch set with only 0004 revised. That patch deals with
>>> creating multi-level inheritance hierarchy from multi-level partition
>>> hierarchy. The original logic of recursively calling
>>> inheritance_planner()'s guts over the inheritance hierarchy required
>>> that for every such recursion we flatten many lists created by that
>>> code. Recursion also meant that root->append_rel_list is traversed as
>>> many times as the number of partitioned partitions in the hierarchy.
>>> Instead the revised version keep the iterative shape of
>>> inheritance_planner() intact, thus naturally creating flat lists,
>>> iterates over root->append_rel_list only once and is still easy to
>>> read and maintain.
>>
>> 0001-0003 look basically OK to me, modulo some cosmetic stuff.  Regarding 0004:
>>
>> +        if (brel->reloptkind != RELOPT_BASEREL &&
>> +            brte->relkind != RELKIND_PARTITIONED_TABLE)
>>
>> I spent a lot of time staring at this code before I figured out what
>> was going on here.  We're iterating over simple_rel_array, so the
>> reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
>> But does that guarantee that rtekind is RTE_RELATION such that
>> brte->relkind will be initialized to a value?  I'm not 100% sure.
>
> Comment in RangeTblEntry says
>  952     /*
>  953      * Fields valid for a plain relation RTE (else zero):
>  954      *
> ... clipped portion for RTE_NAMEDTUPLESTORE related comment
>
>  960     Oid         relid;          /* OID of the relation */
>  961     char        relkind;        /* relation kind (see pg_class.relkind) */
>
> This means that relkind will be 0 when rtekind != RTE_RELATION. So,
> the condition holds. But code creating an RTE somewhere which is not
> in sync with this comment would create a problem. So your suggestion
> makes sense.
>
>> I
>> think it would be clearer to write this test like this:
>>
>> Assert(IS_SIMPLE_REL(brel));
>> if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
>>     (brte->rtekind != RELOPT_BASEREL ||
>
> Do you mean (brte_>rtekind != RTE_RELATION)?
>
>>     brte->relkind != RELKIND_PARTITIONED_TABLE))
>>     continue;
>>
>> Note that the way you wrote the comment is says if it *is* another
>> REL, not if it's *not* a baserel; it's good if those kinds of little
>> details match between the code and the comments.
>
> I find the existing comment and code in this part of the function
> differ. The comment just above the loop on simple_rel_array[], talks
> about changing something in the child, but the very next line skips
> child relations and later a loop on append_rel_list makes changes to
> appropriate children. I guess, it's done that way to keep the code
> working even after we introduce some RELOPTKIND other than BASEREL or
> OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
> sense. Changed it according to your suggestion.
>
>>
>> It is not clear to me what the motivation is for the API changes in
>> expanded_inherited_rtentry.  They don't appear to be necessary.
>
> expand_inherited_rtentry() creates AppendRelInfos for all the children
> of a given parent and collects them in a list. The list is appended to
> root->append_rel_list at the end of the function. Now that function
> needs to do this recursively. This means that for a partitioned
> partition table its children's AppendRelInfos will be added to
> root->append_rel_list before AppendRelInfo of that partitioned
> partition table. inheritance_planner() assumes that the parent's
> AppendRelInfo comes before its children in append_rel_list.This
> assumption allows it to be simplified greately, retaining its
> iterative form. My earlier patches had recursive version of
> inheritance_planner(), which is complex. I have comments in this patch
> explaining this.
>
> Adding AppendRelInfos to root->append_rel_list as and when they are
> created would keep parent AppendRelInfos before those of children. But
> that function throws away the AppendRelInfo it created when their are
> no real children i.e. in partitioned table's case when has no leaf
> partitions. So, we can't do that. Hence, I chose to change the API to
> return the list of AppendRelInfos when the given RTE has real
> children.
>
>> If
>> they are necessary, you need to do a more thorough job updating the
>> comments.  This one, in particular:
>>
>>   *      If so, add entries for all the child tables to the query's
>>   *      rangetable, and build AppendRelInfo nodes for all the child tables
>>   *      and add them to root->append_rel_list.  If not, clear the entry's
>
> Done.
>
>>
>> And the comments could maybe say something like "We return the list of
>> appinfos rather than directly appending it to append_rel_list because
>> $REASON."
>
> Done. Please check the attached version.
>
>>
>> -         * is a partitioned table.
>> +         * RTE simply duplicates the parent *partitioned* table.
>>           */
>> -        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
>> +        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)
>>
>> This is another case where it's hard to understand the test from the comments.
>
> The current comment says it all, but it very cryptic manner.
> 1526         /*
> 1527          * Build an AppendRelInfo for this parent and child,
> unless the child
> 1528          * RTE simply duplicates the parent *partitioned* table.
> 1529          */
>
> The comment makes sense in the context of this paragraph in the prologue
> 1364  * Note that the original RTE is considered to represent the whole
> 1365  * inheritance set.  The first of the generated RTEs is an RTE for the same
> 1366  * table, but with inh = false, to represent the parent table in its role
> 1367  * as a simple member of the inheritance set.
> 1368  *
>
> The code avoids creating AppendRelInfos for a child which represents
> the parent in its role as a simple member of inheritance set.
>
> I have reworded it as
> 1526         /*
> 1527          * Build an AppendRelInfo for this parent and child,
> unless the child
> 1528          * RTE represents the parent as a simple member of inheritance set.
> 1529          */
>
>>
>> +     * In case of multi-level inheritance hierarchy, for every child we require
>> +     * PlannerInfo of its immediate parent. Hence we save those in a an array
>>
>> Say why.  Also, need to fix "a an".
>
> Done.
>
>>
>> I'm a little bit surprised that this patch doesn't make any changes to
>> allpaths.c or relnode.c.
>
>> It looks to me like we'll generate paths for
>> the new RTEs that are being added.  Are we going to end up with
>> multiple levels of Append nodes, then?  Does the consider the way
>> consider_parallel is propagated up and down in set_append_rel_size()
>> and set_append_rel_pathlist() really work with multiple levels?  Maybe
>> this is all fine; I haven't tried to verify it in depth.
>
> This has been discussed before, but I can not locate the mail
> answering these questions. accumulate_append_subpath() called from
> add_paths_to_append_rel() takes care of flattening Merge/Append paths.
> The planner code deals with the multi-level inheritance hierarchy
> created for subqueries with set operations. There is code in relnode.c
> to build the RelOptInfos for such subqueries recursively through using
> RangeTblEntry::inh flag. So there are no changes in allpaths.c and
> relnode.c. Are you looking for some other changes?
>
>>
>> Overall I think this is a reasonable direction to go but I'm worried
>> that there may be bugs lurking -- other code that needs adjusting that
>> hasn't been found, really.
>>
>
> Planner code is already aware of such hierarchies except DMLs, which
> this patch adjusts. We have fixed issues revealed by mine and
> Rajkumar's testing.
> What kinds of things you suspect?
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Adding AppendRelInfos to root->append_rel_list as and when they are
> created would keep parent AppendRelInfos before those of children. But
> that function throws away the AppendRelInfo it created when their are
> no real children i.e. in partitioned table's case when has no leaf
> partitions. So, we can't do that. Hence, I chose to change the API to
> return the list of AppendRelInfos when the given RTE has real
> children.

So, IIUC, the case you're concerned about is when you have a hierarchy
of only partitioned tables, with no plain tables.  For example, if B
is a partitioned table and a partition of A, and that's all there is,
A will recurse to B and B will return NIL.

Is it necessary to get rid of the extra AppendRelInfos, or are they
harmless like the duplicate RTE and PlanRowMark nodes?
    /*
     * If all the children were temp tables or a partitioned parent did not
     * have any leaf partitions, pretend it's a non-inheritance situation; we
     * don't need Append node in that case.  The duplicate RTE we added for
     * the parent table is harmless, so we don't bother to get rid of it;
     * ditto for the useless PlanRowMark node.
     */
    if (!need_append)
    {
        /* Clear flag before returning */
        rte->inh = false;
        return;
    }
 

If we do need to get rid of the extra AppendRelInfos, maybe a less
invasive solution would be to change the if (!need_append) case to do
root->append_rel_list = list_truncate(root->append_rel_list,
original_append_rel_length).
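
For illustration (a sketch only, assuming original_append_rel_length is
simply list_length(root->append_rel_list) captured near the top of the
function):

    if (!need_append)
    {
        /*
         * Discard any AppendRelInfos added for this parent's children and
         * pretend it's a non-inheritance situation.
         */
        root->append_rel_list = list_truncate(root->append_rel_list,
                                              original_append_rel_length);
        /* Clear flag before returning */
        rte->inh = false;
        return;
    }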

> The code avoids creating AppendRelInfos for a child which represents
> the parent in its role as a simple member of inheritance set.

OK, I suggest we rewrite the whole comment like this: "We need an
AppendRelInfo if paths will be built for the child RTE.  If
childrte->inh is true, then we'll always need to generate append paths
for it.  If childrte->inh is false, we must scan it if it's not a
partitioned table; but if it is a partitioned table, then it never has
any data of its own and need not be scanned.  It does, however, need
to be locked, so note the OID for inclusion in the
PartitionedChildRelInfo we're going to build."

It looks like you also need to update the header comment for
AppendRelInfo itself, in nodes/relation.h.

+     * PlannerInfo for every child is obtained by translating relevant members

Insert "The" at the start of the sentence.

-        subroot->parse = (Query *)
-            adjust_appendrel_attrs(root,
-                                   (Node *) parse,
-                                   appinfo);
+        subroot->parse = (Query *) adjust_appendrel_attrs(parent_root,
+                                                          (Node *)
parent_parse,
+                                                          1, &appinfo);

I suggest that you don't remove the line break after the cast.

+         * If the child is further partitioned, remember it as a parent. Since
+         * partitioned tables do not have any data, we don't need to create
+         * plan for it. We just need its PlannerInfo set up to be used while
+         * planning its children.

Most of this comment is in the singular, but the first half of the
second sentence is plural.  Should be "Since a partitioned table does
not have any data...".  I might replace the last sentence by "We do,
however, need to remember the PlannerInfo for use when planning its
children."

+-- Check UPDATE with *multi-level partitioned* inherited target

Asterisks seem like overkill.

Since expand_inherited_rtentry() and set_append_rel_size() can now
recurse down to as many levels as there are levels in the inheritance
hierarchy, they should probably have a check_stack_depth() check.
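
For example, something like this near the top of each of those
functions (just a sketch; the exact placement would follow the
surrounding declarations):

    /*
     * This function can now recurse once per level of the partition
     * hierarchy, so make sure a pathological hierarchy can't overflow
     * the stack.
     */
    check_stack_depth();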

>> Overall I think this is a reasonable direction to go but I'm worried
>> that there may be bugs lurking -- other code that needs adjusting that
>> hasn't been found, really.
>>
> Planner code is already aware of such hierarchies except DMLs, which
> this patch adjusts. We have fixed issues revealed by mine and
> Rajkumar's testing.
> What kinds of things you suspect?

I'm not sure exactly.  It's just hard with this kind of patch to make
sure you've caught everything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Forgot the patch set. Here it is.

The commit message for 0005 isn't really accurate given that it
follows 0004.  I think you could just flatten 0005 and 0006 into one
patch.

Reviewing those together:

- Existing code does partdesc = RelationGetPartitionDesc(relation) but
this has got it as part_desc.  Seems better to be consistent.
Likewise existing variables for PartitionKey are key or partkey, not
part_key.

- get_relation_partition_info has a useless trailing return.

- Instead of adding nparts, boundinfo, and part_oids to RelOptInfo,
how about just adding partdesc?  Seems cleaner.

- pkexprs seems like a potentially confusing name, since PK is widely
used to mean "primary key" but here you mean "partition key".  Maybe
partkeyexprs.

- build_simple_rel's matching algorithm is O(n^2).  We may have talked
about this problem before...

- This patch introduces some bits that are not yet used, like
nullable_pkexprs, or even the code to set the partition scheme for
joinrels.  I think perhaps some of that logic should be moved from
0008 to here - e.g. the initial portion of
build_joinrel_partition_info.

There may be more, but I've run out of energy for tonight.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Adding AppendRelInfos to root->append_rel_list as and when they are
>> created would keep parent AppendRelInfos before those of children. But
>> that function throws away the AppendRelInfo it created when their are
>> no real children i.e. in partitioned table's case when has no leaf
>> partitions. So, we can't do that. Hence, I chose to change the API to
>> return the list of AppendRelInfos when the given RTE has real
>> children.
>
> So, IIUC, the case you're concerned about is when you have a hierarchy
> of only partitioned tables, with no plain tables.  For example, if B
> is a partitioned table and a partition of A, and that's all there is,
> A will recurse to B and B will return NIL.
>
> Is it necessary to get rid of the extra AppendRelInfos, or are they
> harmless like the duplicate RTE and PlanRowMark nodes?
>

Actually there are two sides to this:

If there are no leaf partitions, without the patch two things happen:
1. rte->inh is cleared, and 2. no appinfo is added to
root->append_rel_list, even though harmless RTE and PlanRowMark nodes
are created. The first avoids treating the relation as an inheritance
parent and thus avoids creating any child relations and paths, saving
a lot of work. Ultimately set_rel_size() marks such a relation as
dummy:
 352                 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
 353                 {
 354                     /*
 355                      * A partitioned table without leaf partitions is marked
 356                      * as a dummy rel.
 357                      */
 358                     set_dummy_rel_pathlist(rel);
 359                 }
 

Since root->append_rel_list is traversed for every inheritance parent,
not adding needless AppendRelInfos improves performance and saves
memory (consider, for example, a case where there are thousands of
partitioned partitions without any leaf partition).

My initial thought was to keep both these properties intact. But then
removing such AppendRelInfos would have a problem when such a table is
on the inner side of the join as described in [1]. So I wrote the
patch not to do either of those things when there are partitioned
partitions without leaf partitions. So, it looks like you are correct,
we could just go ahead and add those AppendRelInfos directly to
root->append_rel_list.

>     /*
>      * If all the children were temp tables or a partitioned parent did not
>      * have any leaf partitions, pretend it's a non-inheritance situation; we
>      * don't need Append node in that case.  The duplicate RTE we added for
>      * the parent table is harmless, so we don't bother to get rid of it;
>      * ditto for the useless PlanRowMark node.
>      */
>     if (!need_append)
>     {
>         /* Clear flag before returning */
>         rte->inh = false;
>         return;
>     }
>
> If we do need to get rid of the extra AppendRelInfos, maybe a less
> invasive solution would be to change the if (!need_append) case to do
> root->append_rel_list = list_truncate(root->append_rel_list,
> original_append_rel_length).

We might require this for non-partitioned tables. I will try to
implement it this way in the next set of patches.

>
>> The code avoids creating AppendRelInfos for a child which represents
>> the parent in its role as a simple member of inheritance set.
>
> OK, I suggest we rewrite the whole comment like this: "We need an
> AppendRelInfo if paths will be built for the child RTE.  If
> childrte->inh is true, then we'll always need to generate append paths
> for it.  If childrte->inh is false, we must scan it if it's not a
> partitioned table; but if it is a partitioned table, then it never has
> any data of its own and need not be scanned.  It does, however, need
> to be locked, so note the OID for inclusion in the
> PartitionedChildRelInfo we're going to build."

Done.

>
> It looks like you also need to update the header comment for
> AppendRelInfo itself, in nodes/relation.h.

Done. Thanks for pointing it out.

>
> +     * PlannerInfo for every child is obtained by translating relevant members
>
> Insert "The" at the start of the sentence.

Done.

>
> -        subroot->parse = (Query *)
> -            adjust_appendrel_attrs(root,
> -                                   (Node *) parse,
> -                                   appinfo);
> +        subroot->parse = (Query *) adjust_appendrel_attrs(parent_root,
> +                                                          (Node *)
> parent_parse,
> +                                                          1, &appinfo);
>
> I suggest that you don't remove the line break after the cast.

This is part of 0001 patch, fixed there.

>
> +         * If the child is further partitioned, remember it as a parent. Since
> +         * partitioned tables do not have any data, we don't need to create
> +         * plan for it. We just need its PlannerInfo set up to be used while
> +         * planning its children.
>
> Most of this comment is in the singular, but the first half of the
> second sentence is plural.  Should be "Since a partitioned table does
> not have any data...".  I might replace the last sentence by "We do,
> however, need to remember the PlannerInfo for use when planning its
> children."

Done.

>
> +-- Check UPDATE with *multi-level partitioned* inherited target
>
> Asterisks seem like overkill.

Done.

This style was copied from an existing comment in that file:
-- Check UPDATE with *partitioned* inherited target

>
> Since expand_inherited_rtentry() and set_append_rel_size() can now
> recurse down to as many levels as there are levels in the inheritance
> hierarchy, they should probably have a check_stack_depth() check.

Done. Even without this patch set_append_rel_size() could recurse down
many levels of inheritance hierarchy (created by set operation
queries) through
set_append_rel_size()->set_rel_size()->set_append_rel_size(). And so
would set_rel_size(). But now it's more prone to that problem.

I will provide updated patches after taking care of your comments
about 0005 and 0006.

[1] https://www.postgresql.org/message-id/CAFjFpRd5+zroxY7UMGTR2M=rjBV4aBOCxQg3+1rBmTPLK5mpDg@mail.gmail.com
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Aug 3, 2017 at 9:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Adding AppendRelInfos to root->append_rel_list as and when they are
>>> created would keep parent AppendRelInfos before those of children. But
>>> that function throws away the AppendRelInfo it created when their are
>>> no real children i.e. in partitioned table's case when has no leaf
>>> partitions. So, we can't do that. Hence, I chose to change the API to
>>> return the list of AppendRelInfos when the given RTE has real
>>> children.
>>
>> So, IIUC, the case you're concerned about is when you have a hierarchy
>> of only partitioned tables, with no plain tables.  For example, if B
>> is a partitioned table and a partition of A, and that's all there is,
>> A will recurse to B and B will return NIL.
>>
>> Is it necessary to get rid of the extra AppendRelInfos, or are they
>> harmless like the duplicate RTE and PlanRowMark nodes?
>>
>
> Actually there are two sides to this:
>
> If there are no leaf partitions, without the patch two things happen
> 1. rte->inh is cleared and 2 no appinfo is added to the
> root->append_rel_list, even though harmless RTE and PlanRowMark nodes
> are created. The first avoids treating the relation as the inheritance
> parent and thus avoids creating any child relations and paths, saving
> a lot of work. Ultimately set_rel_size() marks such a relation as
> dummy
>  352                 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
>  353                 {
>  354                     /*
>  355                      * A partitioned table without leaf
> partitions is marked
>  356                      * as a dummy rel.
>  357                      */
>  358                     set_dummy_rel_pathlist(rel);
>  359                 }
>
> Since root->append_rel_list is traversed for every inheritance parent,
> not adding needless AppendRelInfos improves performance and saves
> memory, (FWIW or consider a case where there are thousands of
> partitioned partitions without any leaf partition.).

With some testing, I found that this was true once, but not after
declarative partition support. Please check [1].

>
> My initial thought was to keep both these properties intact. But then
> removing such AppendRelInfos would have a problem when such a table is
> on the inner side of the join as described in [1]. So I wrote the
> patch not to do either of those things when there are partitioned
> partitions without leaf partitions. So, it looks like you are correct,
> we could just go ahead and add those AppendRelInfos directly to
> root->append_rel_list.

Irrespective of [1], I have implemented your idea of not changing the
signature of expand_inherited_rtentry(), using the suggestion below.

>>
>> If we do need to get rid of the extra AppendRelInfos, maybe a less
>> invasive solution would be to change the if (!need_append) case to do
>> root->append_rel_list = list_truncate(root->append_rel_list,
>> original_append_rel_length).

[1] https://www.postgresql.org/message-id/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Thu, Aug 3, 2017 at 7:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Forgot the patch set. Here it is.
>
> The commit message for 0005 isn't really accurate given that it
> follows 0004. I think you could just flatten 0005 and 0006 into one
> patch.

Earlier, there was some doubt about the approach for expanding a
multi-level partitioned table's inheritance hierarchy. So, I had
separated all multi-level partition-related changes into patches by
themselves, collocating them with their respective single-level
partition peers. I thought that would make the reviews easier while
leaving the possibility of committing single-level partition-wise
support before multi-level partition-wise join support. From your
previous replies, it seems that you are fine with the multi-level
partitioned hierarchy expansion, so it may be committed along with the
other patches. So, I have squashed those two patches together.
Similarly, I have squashed the pairs 0008-0009 and 0012-0013. Those
dealt with similar issues for single-level partitioned and multi-level
partitioned tables.

>
> Reviewing those together:
>
> - Existing code does partdesc = RelationGetPartitionDesc(relation) but
> this has got it as part_desc.  Seems better to be consistent.
> Likewise existing variables for PartitionKey are key or partkey, not
> part_key.

Done.

>
> - get_relation_partition_info has a useless trailing return.
>

Done.

> - Instead of adding nparts, boundinfo, and part_oids to RelOptInfo,
> how about just adding partdesc?  Seems cleaner.

nparts and boundinfo apply to any kind of relation (simple, join or
upper), but part_oids applies only to simple relations. So, I have
split those members and added them to the respective sections. Do you
still think that we should add PartitionDesc as a single member?

Similar to your suggestion of changing the name part_key to partkey,
should we rename part_scheme to partscheme, part_rels to partrels and
part_oids to partoids?

>
> - pkexprs seems like a potentially confusing name, since PK is widely
> used to mean "primary key" but here you mean "partition key".  Maybe
> partkeyexprs.

Agreed. Done. The PartitionKey structure has a member partexprs for
partition keys which are expressions. I have used the same name
instead of pkexprs.

>
> - build_simple_rel's matching algorithm is O(n^2).  We may have talked
> about this problem before...

If root->append_rel_list has AppendRelInfos in the same order as the
partition bounds, we could reduce this to O(n). That expansion option
is being discussed in [1]. Once we commit it, I will change the code
to make it O(n). Right now, we can not rely on the order of
AppendRelInfos in root->append_rel_list.

>
> - This patch introduces some bits that are not yet used, like
> nullable_pkexprs,

We could fix that by adding that member in 0008. IIRC, earlier you had
complained about declaring a structure in one patch and adding members
to it in subsequent patches, so I just added all the members in the
same patch. BTW, I have renamed that member to nullable_partexprs to
be consistent with the change to pkexprs.

> or even the code to set the partition scheme for
> joinrels. I think perhaps some of that logic should be moved from
> 0008 to here - e.g. the initial portion of
> build_joinrel_partition_info.

Setting part_scheme for a joinrel should really be part of the patch
which actually implements partition-wise join. That will keep all of
the partition-wise join implementation together. 0005 and 0006 really
just introduce PartitionScheme for base relations. I think
PartitionScheme and the other partitioning properties for base
relations are useful for something else, like partition-wise
aggregation on a base relation. So, we may want to commit those two
patches separately. If you want, we could squash the partition scheme
and partition-wise join implementation together.

[1] https://www.postgresql.org/message-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp

Updated patches attached.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Attachment
On Tue, Aug 8, 2017 at 8:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Updated patches attached.

Hi,

I started reviewing this.  It's nicely commented, but it's also very
complicated, and it's going to take me several rounds to understand
what all the parts do, but here's some assorted feedback after reading
some parts of the patches, some tinkering and quite a bit of time
spent trying to break it (so far unsuccessfully).

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join.  I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset.  The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

One micro-optimisation opportunity I noticed was in those
bms_is_subset and bms_overlap calls.  The Bitmapsets don't tend to
have trailing words but often have hundreds of empty leading words.
If I hack bitmapset.{c,h} so that it tracks first_non_empty_wordnum
and then adjust bms_is_subset and bms_overlap so that they start their
searches at Min(a->first_non_empty_wordnum,
b->first_non_empty_wordnum) then the planning time improves
measurably:

1000 partitions: ~1.5s -> 1.3s
2000 partitions: ~7.1s -> 5.8s
4000 partitions: ~50s -> ~44s
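
The hack looks roughly along these lines (an illustrative sketch, not
the attached experimental patch verbatim; first_non_empty_wordnum is an
invented field that every function adding or removing members would
have to maintain, which is omitted here, and bms_is_subset would get an
analogous change):

bool
bms_overlap(const Bitmapset *a, const Bitmapset *b)
{
	int			shortlen;
	int			i;

	if (a == NULL || b == NULL)
		return false;

	shortlen = Min(a->nwords, b->nwords);

	/* words below both inputs' first non-empty word are known to be zero */
	for (i = Min(a->first_non_empty_wordnum, b->first_non_empty_wordnum);
		 i < shortlen;
		 i++)
	{
		if (a->words[i] & b->words[i])
			return true;
	}
	return false;
}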

When using list-based partitions, it must be possible to omit the part
of a join key that is implied by the partition because the partition
has only one list value.  For example, if I create a two level
hierarchy with one partition per US state and then time-based range
partitions under that, the state part of this merge condition is
redundant:
        Merge Cond: ((sales_wy_2017_10.state = purchases_wy_2017_10.state) AND
                     (sales_wy_2017_10.created = purchases_wy_2017_10.created))

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch

-partition_bounds_equal(PartitionKey key,
+partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
                        PartitionBoundInfo b1,
                        PartitionBoundInfo b2)

I wonder whether there is any value in creating a struct to represent the
common part of PartitionKey and PartitionScheme that functions like
this and others need?  Maybe not.  Perhaps you didn't want to make
PartitionKey contain a PartitionScheme because then you'd lose the
property that every pointer to PartitionScheme in the system must be a
pointer to an interned (canonical) PartitionScheme, so it's always
safe to compare pointers to test for equality?
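
Something along these lines is what I had in mind (purely hypothetical;
the name and exact field list are invented for discussion only):

typedef struct PartitionKeyShape
{
	int			partnatts;		/* number of partition key columns */
	int16	   *parttyplen;		/* typlen of each key column */
	bool	   *parttypbyval;	/* pass-by-value status of each key column */
	/* ... whatever else partition_bounds_equal() and friends need ... */
} PartitionKeyShape;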

0005-Canonical-partition-scheme.patch:

+/*
+ * get_relation_partition_info
+ *
+ * Retrieves partitioning information for a given relation.
+ *
+ * For a partitioned table it sets partitioning scheme, partition key
+ * expressions, number of partitions and OIDs of partitions in the given
+ * RelOptInfo.
+ */
+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation relation)

Would this be better called "set_relation_partition_info"?  It doesn't
really "retrieve" the information, it "installs" it.

+{
+       PartitionDesc part_desc;
+
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))
+               return;

Here and elsewhere you use the idiom !(foo = bar), which is perfectly
good C in my book, but I understand the project convention is to avoid
implicit pointer->boolean conversion and to prefer expressions like
(foo = bar) != NULL; there is certainly a lot more code like that.
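
Concretely, the quoted test could be spelled like this (same behaviour,
just with the NULL test made explicit):

+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               (rel->part_scheme = find_partition_scheme(root, relation)) == NULL)
+               return;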

0007-Partition-wise-join-implementation.patch

+       {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,

This GUC should appear in postgresql.conf.sample.

I'm chewing on 0007.  More soon.

-- 
Thomas Munro
http://www.enterprisedb.com



On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On my computer it took ~1.5 seconds to plan a 1000 partition join,
> ~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
> 4000 partition join.  I poked around in a profiler a bit and saw that
> for the 2000 partition case I spent almost half the time in
> create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
> and about half of that was in bms_is_subset.  The other half the time
> was in query_planner->make_one_rel which spent 2/3 of its time in
> set_rel_size->add_child_rel_equivalences->bms_overlap and the other
> 1/3 in standard_join_search.

Ashutosh asked me how I did that.  Please see attached. I was
EXPLAINing simple joins like SELECT * FROM foofoo JOIN barbar USING
(a, b).  Here also is the experimental hack I tried when I saw
bitmapset.c eating my CPU.

-- 
Thomas Munro
http://www.enterprisedb.com


Attachment