Thread: [HACKERS] path toward faster partition pruning
I've been working on implementing a way to perform plan-time partition-pruning that is hopefully faster than the current method of using constraint exclusion to prune each of the potentially many partitions one-by-one. It's not fully cooked yet, though.

Meanwhile, I thought I'd share a couple of patches that implement some restructuring of the planner code related to partitioned table inheritance planning that I think would be helpful. They are to be applied on top of the patches being discussed at [1]. Note that these patches themselves don't implement the actual code that replaces constraint exclusion as a method of performing partition pruning. I will share that patch after debugging it some more.

The main design goal of the patches I'm sharing here now is to defer the locking and opening of leaf partitions in a given partition tree to a point after set_append_rel_size() is called on the root partitioned table. Currently, AFAICS, we need to lock and open the child tables in expand_inherited_rtentry() only to set the translated_vars field in the AppendRelInfo that we create for each child. ISTM, we can defer the creation of a child's AppendRelInfo to the point where it (its translated_vars field in particular) will actually be used, and so lock and open the child tables only at such a time. Although we don't lock and open the partition child tables in expand_inherited_rtentry(), their RT entries are still created and added to root->parse->rtable, so that setup_simple_rel_arrays() knows the maximum number of entries root->simple_rel_array will need to hold and can allocate the memory for that array accordingly. Slots in simple_rel_array[] corresponding to partition child tables remain empty until the corresponding RelOptInfos are created, which happens when set_append_rel_size() is called on the root parent table and it determines the partitions that will be scanned after all.

The patch augments the existing PartitionedChildRelInfo node, which currently holds only the partitioned child rels' RT indexes, to carry some more information about the partition tree, including the information returned by RelationGetPartitionDispatchInfo() when it is called from expand_inherited_rtentry() (per the proposed patch in [1], we call it to be able to add partitions to the query tree in the bound order). Actually, since PartitionedChildRelInfo now contains more information about the partition tree than it used to, I thought the struct's name was no longer apt, so I renamed it to PartitionRootInfo and renamed root->pcinfo_list accordingly to prinfo_list. That seems okay because we only use that node internally.

Then, during the add_base_rels_to_query() step, when build_simple_rel() builds a RelOptInfo for the root partitioned table, it also initializes some newly introduced fields in RelOptInfo from the information contained in the table's PartitionRootInfo. The aforementioned fields are only initialized in RelOptInfos of root partitioned tables. Note that the add_base_rels_to_query() step won't add the partition "otherrel" RelOptInfos yet (unlike the regular inheritance case, where they are added after looking them up in root->append_rel_list).

When set_append_rel_size() is called on the root partitioned table, it will call find_partitions_for_query(), which, using the partition tree information, determines the partitions that will need to be scanned for the query. This processing happens recursively; that is, we first determine the root parent's partitions, and then for each partition that is itself partitioned, we determine its partitions, and so on.
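To illustrate the shape of that recursion, here is a minimal sketch. The function name find_partitions_for_query comes from the patch, but the body below is purely illustrative; in particular, prune_immediate_children() is a hypothetical stand-in for the logic that matches the query's scan keys against a table's partition key.

    #include "postgres.h"
    #include "catalog/pg_class.h"       /* RELKIND_PARTITIONED_TABLE */
    #include "nodes/pg_list.h"
    #include "nodes/relation.h"         /* PlannerInfo (PG 10 header) */
    #include "utils/lsyscache.h"        /* get_rel_relkind() */

    /* hypothetical: returns the immediate children that survive pruning */
    static List *prune_immediate_children(PlannerInfo *root, Oid parent_relid);

    /* illustrative sketch only, not the patch's actual code */
    static void
    find_partitions_for_query(PlannerInfo *root, Oid parent_relid,
                              List **leaf_part_oids)
    {
        List       *chosen = prune_immediate_children(root, parent_relid);
        ListCell   *lc;

        foreach(lc, chosen)
        {
            Oid         childoid = lfirst_oid(lc);

            /* recurse into partitioned children; collect the leaves */
            if (get_rel_relkind(childoid) == RELKIND_PARTITIONED_TABLE)
                find_partitions_for_query(root, childoid, leaf_part_oids);
            else
                *leaf_part_oids = lappend_oid(*leaf_part_oids, childoid);
        }
    }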
As we determine partitions in this per-partitioned-table manner, we maintain a pair (parent_relid, list-of-partition-relids-to-scan) for each partitioned table, and also a single list of all leaf partitions determined so far.

Once all partitions have been determined, we turn to locking the leaf partitions. The locking happens in the order of OIDs that find_all_inheritors would have returned in expand_inherited_rtentry(); the list of OIDs in that original order is also stored in the table's PartitionRootInfo node. For each OID in that list, we check whether it is in the set of leaf partition OIDs that was just computed and, if so, lock it (a short sketch of this loop follows this message).

For all chosen partitions that are partitioned tables (including the root), we create a PartitionAppendInfo node which stores the aforementioned pair (parent_relid, list-of-partition-relids-to-scan) and append it to a list in the root table's RelOptInfo, with the root table's PartitionAppendInfo at the head of the list. Note that the list of partitions in this pair contains only the immediate partitions, so that the original parent-child relationship is reflected in the list of PartitionAppendInfos thus collected. The next patch, which will implement the actual partition-pruning, will add some more code that runs under find_partitions_for_query().

set_append_rel_size() processing then continues for the root partitioned table. It is at this point that we will create the RelOptInfos and AppendRelInfos for partitions: first for those of the root partitioned table, and then for those of each partitioned table when set_append_rel_size() is recursively called for the latter.

Note that this is still largely a WIP patch and the implementation details might change per both the feedback here and the discussion at [1].

Thanks,
Amit

[1] https://www.postgresql.org/message-id/befd7ec9-8f4c-6928-d330-ab05dbf860bf%40lab.ntt.co.jp
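A minimal sketch of that locking loop, with hypothetical variable and function names; it only illustrates the order-preserving filtering described above, not the patch's actual code.

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "storage/lmgr.h"

    /*
     * Sketch only: lock the pruning survivors, but in the original
     * find_all_inheritors order, so the lock ordering stays consistent
     * with other backends.  all_part_oids is the saved original-order
     * list from PartitionRootInfo; chosen_leaves is the set of leaf
     * partitions just computed by pruning.
     */
    static void
    lock_chosen_leaf_partitions(List *all_part_oids, List *chosen_leaves,
                                LOCKMODE lockmode)
    {
        ListCell   *lc;

        foreach(lc, all_part_oids)
        {
            Oid         partoid = lfirst_oid(lc);

            if (list_member_oid(chosen_leaves, partoid))
                LockRelationOid(partoid, lockmode);
        }
    }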
On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time partition-pruning that is hopefully faster than the current method of using constraint exclusion to prune each of the potentially many partitions one-by-one. It's not fully cooked yet, though.
>
> Meanwhile, I thought I'd share a couple of patches that implement some restructuring of the planner code related to partitioned table inheritance planning that I think would be helpful. They are to be applied on top of the patches being discussed at [1]. Note that these patches themselves don't implement the actual code that replaces constraint exclusion as a method of performing partition pruning. I will share that patch after debugging it some more.
>
> The main design goal of the patches I'm sharing here now is to defer the locking and opening of leaf partitions in a given partition tree to a point after set_append_rel_size() is called on the root partitioned table. Currently, AFAICS, we need to lock and open the child tables in expand_inherited_rtentry() only to set the translated_vars field in the AppendRelInfo that we create for each child. ISTM, we can defer the creation of a child's AppendRelInfo to the point where it (its translated_vars field in particular) will actually be used, and so lock and open the child tables only at such a time. Although we don't lock and open the partition child tables in expand_inherited_rtentry(), their RT entries are still created and added to root->parse->rtable, so that setup_simple_rel_arrays() knows the maximum number of entries root->simple_rel_array will need to hold and can allocate the memory for that array accordingly. Slots in simple_rel_array[] corresponding to partition child tables remain empty until the corresponding RelOptInfos are created, which happens when set_append_rel_size() is called on the root parent table and it determines the partitions that will be scanned after all.

Partition pruning can happen only after the quals have been distributed to the rels, i.e., after deconstruct_jointree(), reconsider_outer_join_clauses() and generate_base_implied_equalities() have been called. If the goal is to not heap_open() the partitions which are pruned, we can't do that in expand_inherited_rtentry(). One reason why I think we don't want to heap_open() partition relations is to avoid relcache bloat because of opened partition relations which are ultimately pruned. But please note that, according to your patches, we still need to populate the catalog caches to get relkind, reltype, etc.

There are many functions that traverse simple_rel_array[] after it's created. Most of them assume that the empty entries in that array correspond to non-simple range entries like join RTEs, but now we are breaking that assumption. Most of these functions also skip "other" relations, so that may be OK now, but I am not sure it's really going to be fine if we keep empty slots in place of partition relations (a sketch of the typical scanner shape follows this list). There may be three options here:

1. Add placeholder RelOptInfos for partition relations (maybe mark those specially) and mark the ones which get pruned as dummy later.
2. Prune the partitions before any function scans simple_rel_array[], or postpone creating simple_rel_array till pruning.
3. Examine all the current scanners, especially the ones which will be called before pruning, to make sure that skipping "other" relations is going to be kosher.
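For reference, the shape of a typical simple_rel_array[] scanner today is roughly the following minimal sketch; the comment notes the additional meaning an empty slot takes on under the proposed patch.

    #include "postgres.h"
    #include "nodes/relation.h"

    /* illustrative sketch of the idiom used by existing scanners */
    static void
    scan_baserels(PlannerInfo *root)
    {
        int         rti;

        for (rti = 1; rti < root->simple_rel_array_size; rti++)
        {
            RelOptInfo *brel = root->simple_rel_array[rti];

            /*
             * Today a NULL slot means a non-baserel RTE such as a join;
             * under the proposed patch it can also mean a partition whose
             * RelOptInfo has not been built yet.
             */
            if (brel == NULL)
                continue;

            /* most scanners also skip "other" (child) relations */
            if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL)
                continue;

            /* ... work on the baserel here ... */
        }
    }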
> The patch augments the existing PartitionedChildRelInfo node, which currently holds only the partitioned child rels' RT indexes, to carry some more information about the partition tree, including the information returned by RelationGetPartitionDispatchInfo() when it is called from expand_inherited_rtentry() (per the proposed patch in [1], we call it to be able to add partitions to the query tree in the bound order). Actually, since PartitionedChildRelInfo now contains more information about the partition tree than it used to, I thought the struct's name was no longer apt, so I renamed it to PartitionRootInfo and renamed root->pcinfo_list accordingly to prinfo_list. That seems okay because we only use that node internally.
>
> Then, during the add_base_rels_to_query() step, when build_simple_rel() builds a RelOptInfo for the root partitioned table, it also initializes some newly introduced fields in RelOptInfo from the information contained in the table's PartitionRootInfo. The aforementioned fields are only initialized in RelOptInfos of root partitioned tables. Note that the add_base_rels_to_query() step won't add the partition "otherrel" RelOptInfos yet (unlike the regular inheritance case, where they are added after looking them up in root->append_rel_list).

Partition-wise join requires the partition hierarchy to be expanded level-by-level, keeping intact the parent-child relationship between a partitioned table and its partitions. Your patch doesn't do that and adds all the partitioning information to the root partitioned table's RelOptInfo. OTOH, the partition-wise join patch adds partition bounds, key expressions, OIDs and RelOptInfos of the immediate partitions (including partitioned partitions) to the RelOptInfo of a partitioned table (see patch 0002 in the latest set of patches at [1]). I don't see much point in having conflicting changes in both of our patches. Maybe you should review that patch from my set, and we can find a set of members which help both partition pruning and partition-wise join.

> When set_append_rel_size() is called on the root partitioned table, it will call find_partitions_for_query(), which, using the partition tree information, determines the partitions that will need to be scanned for the query. This processing happens recursively; that is, we first determine the root parent's partitions, and then for each partition that is itself partitioned, we determine its partitions, and so on. As we determine partitions in this per-partitioned-table manner, we maintain a pair (parent_relid, list-of-partition-relids-to-scan) for each partitioned table, and also a single list of all leaf partitions determined so far. Once all partitions have been determined, we turn to locking the leaf partitions. The locking happens in the order of OIDs that find_all_inheritors would have returned in expand_inherited_rtentry(); the list of OIDs in that original order is also stored in the table's PartitionRootInfo node. For each OID in that list, we check whether it is in the set of leaf partition OIDs that was just computed and, if so, lock it. For all chosen partitions that are partitioned tables (including the root), we create a PartitionAppendInfo node which stores the aforementioned pair (parent_relid, list-of-partition-relids-to-scan) and append it to a list in the root table's RelOptInfo, with the root table's PartitionAppendInfo at the head of the list.
> Note that the list of partitions in this pair contains only the immediate partitions, so that the original parent-child relationship is reflected in the list of PartitionAppendInfos thus collected. The next patch, which will implement the actual partition-pruning, will add some more code that runs under find_partitions_for_query().
>
> set_append_rel_size() processing then continues for the root partitioned table. It is at this point that we will create the RelOptInfos and AppendRelInfos for partitions: first for those of the root partitioned table, and then for those of each partitioned table when set_append_rel_size() is recursively called for the latter.

set_append_rel_size() and set_append_rel_pathlist() are already recursive, so if we process expansion and pruning for one level in those functions, the recursion will automatically take care of doing so for every level.

> Note that this is still largely a WIP patch and the implementation details might change per both the feedback here and the discussion at [1].

The changes to code which handle expansion in this patch set should really be part of the expansion-in-bound-order thread, so that it's easy to review all the changes together. This thread can then concentrate only on partition pruning.

[1] http://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Hi Ashutosh,

Thanks for the comments, and sorry that it took me a while to reply here.

On 2017/08/23 20:16, Ashutosh Bapat wrote:
> On Mon, Aug 21, 2017 at 12:07 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've been working on implementing a way to perform plan-time partition-pruning that is hopefully faster than the current method of using constraint exclusion to prune each of the potentially many partitions one-by-one. It's not fully cooked yet, though.
>>
>> Meanwhile, I thought I'd share a couple of patches that implement some restructuring of the planner code related to partitioned table inheritance planning that I think would be helpful. [...]
>>
>> The main design goal of the patches I'm sharing here now is to defer the locking and opening of leaf partitions in a given partition tree to a point after set_append_rel_size() is called on the root partitioned table. [...] Slots in simple_rel_array[] corresponding to partition child tables remain empty until the corresponding RelOptInfos are created, which happens when set_append_rel_size() is called on the root parent table and it determines the partitions that will be scanned after all.
>
> Partition pruning can happen only after the quals have been distributed to the rels, i.e., after deconstruct_jointree(), reconsider_outer_join_clauses() and generate_base_implied_equalities() have been called. If the goal is to not heap_open() the partitions which are pruned, we can't do that in expand_inherited_rtentry(). One reason why I think we don't want to heap_open() partition relations is to avoid relcache bloat because of opened partition relations which are ultimately pruned. But please note that, according to your patches, we still need to populate the catalog caches to get relkind, reltype, etc.

Yes, we still hit the syscache for *all* partitions. I haven't yet thought very hard about avoiding that altogether.

> There are many functions that traverse simple_rel_array[] after it's created. Most of them assume that the empty entries in that array correspond to non-simple range entries like join RTEs, but now we are breaking that assumption. Most of these functions also skip "other" relations, so that may be OK now, but I am not sure it's really going to be fine if we keep empty slots in place of partition relations. There may be three options here:
>
> 1. Add placeholder RelOptInfos for partition relations (maybe mark those specially) and mark the ones which get pruned as dummy later.
> 2. Prune the partitions before any function scans simple_rel_array[], or postpone creating simple_rel_array till pruning.
> 3. Examine all the current scanners, especially the ones which will be called before pruning, to make sure that skipping "other" relations is going to be kosher.

Between the point when slots in simple_rel_array are allocated (setup_simple_rel_arrays) and the point when partition RelOptInfos are actually created after the partition-pruning step has occurred (set_append_rel_size), it seems that most places that iterate over simple_rel_array also know to skip slots containing NULL values. We might need to document that NULL can mean partitions, in addition to its current meaning of non-baserels.

>> The patch augments the existing PartitionedChildRelInfo node, which currently holds only the partitioned child rels' RT indexes, to carry some more information about the partition tree [...] That seems okay because we only use that node internally.
>>
>> Then, during the add_base_rels_to_query() step, when build_simple_rel() builds a RelOptInfo for the root partitioned table, it also initializes some newly introduced fields in RelOptInfo from the information contained in the table's PartitionRootInfo. The aforementioned fields are only initialized in RelOptInfos of root partitioned tables. Note that the add_base_rels_to_query() step won't add the partition "otherrel" RelOptInfos yet (unlike the regular inheritance case, where they are added after looking them up in root->append_rel_list).
>
> Partition-wise join requires the partition hierarchy to be expanded level-by-level, keeping intact the parent-child relationship between a partitioned table and its partitions. Your patch doesn't do that and adds all the partitioning information to the root partitioned table's RelOptInfo. OTOH, the partition-wise join patch adds partition bounds, key expressions, OIDs and RelOptInfos of the immediate partitions (including partitioned partitions) to the RelOptInfo of a partitioned table (see patch 0002 in the latest set of patches at [1]). I don't see much point in having conflicting changes in both of our patches. Maybe you should review that patch from my set, and we can find a set of members which help both partition pruning and partition-wise join.

Yes, I think it would be a good idea for the partition-pruning patch to initialize those fields in the individual parents' RelOptInfos. I will review the relevant patches in the partitionwise-join thread to see what can be incorporated here.

>> When set_append_rel_size() is called on the root partitioned table, it will call find_partitions_for_query(), which, using the partition tree information, determines the partitions that will need to be scanned for the query. This processing happens recursively; that is, we first determine the root parent's partitions, and then for each partition that is itself partitioned, we determine its partitions, and so on.
>> As we determine partitions in this per-partitioned-table manner, we maintain a pair (parent_relid, list-of-partition-relids-to-scan) for each partitioned table, and also a single list of all leaf partitions determined so far. Once all partitions have been determined, we turn to locking the leaf partitions. The locking happens in the order of OIDs that find_all_inheritors would have returned in expand_inherited_rtentry() [...]
>>
>> set_append_rel_size() processing then continues for the root partitioned table. It is at this point that we will create the RelOptInfos and AppendRelInfos for partitions: first for those of the root partitioned table, and then for those of each partitioned table when set_append_rel_size() is recursively called for the latter.
>
> set_append_rel_size() and set_append_rel_pathlist() are already recursive, so if we process expansion and pruning for one level in those functions, the recursion will automatically take care of doing so for every level.

My only worry about that is that the locking order of leaf partitions will differ across concurrent backends if we lock them in an order dictated by traversing the partition tree one level at a time, because such a traversal will presumably proceed in partition bound order. The patch computes *all* leaf partitions that will need to be scanned by the query (after pruning the needless ones) and locks the chosen partitions in the original order (it keeps the original-order OID list generated by find_all_inheritors around for this purpose).

While computing the leaf partitions, it remembers the immediate parent-child relationships in the process. It does this by processing the partition tree in a recursive depth-first manner, and in each recursive step creating a PartitionAppendInfo that maps a parent table to its immediate children (only those that will satisfy the query). Once the PartitionAppendInfos for all the parents in the partition tree have been computed, we resume the set_append_rel_size() processing, which takes the root parent's PartitionAppendInfo and builds RelOptInfos and AppendRelInfos for its immediate children. Children that are themselves partitioned tables will recursively call set_append_rel_size(), look up their own PartitionAppendInfo, and create RelOptInfos and AppendRelInfos for their children, and so on.

>> Note that this is still largely a WIP patch and the implementation details might change per both the feedback here and the discussion at [1].
> The changes to code which handle expansion in this patch set should really be part of the expansion-in-bound-order thread, so that it's easy to review all the changes together. This thread can then concentrate only on partition pruning.

I think I agree. I'm posting today the patches that actually implement partition-pruning. The previous patches do seem to belong on the EIBO thread, but I will post them together here for the convenience of being able to apply them to HEAD and try them out. Now that Robert has posted a patch to implement depth-first EIBO, I will have to find a way to rebase the actual partition-pruning patches on this thread so that their core logic can find the information it needs.

Thanks,
Amit
On 2017/08/21 15:37, Amit Langote wrote:
> Meanwhile, I thought I'd share a couple of patches that implement some restructuring of the planner code related to partitioned table inheritance planning that I think would be helpful. They are to be applied on top of the patches being discussed at [1]. Note that these patches themselves don't implement the actual code that replaces constraint exclusion as a method of performing partition pruning. I will share that patch after debugging it some more.
>
> The next patch, which will implement the actual partition-pruning, will add some more code that runs under find_partitions_for_query().

Attached is now also the set of patches that implement the actual partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of the attached.

Because the patch helps avoid performing constraint exclusion on *all* partitions for a given query, one might expect this to improve performance for queries on partitioned tables and to scale to a fairly large number of partitions. Here are some numbers for the partitioned table and the query shown below:

\d+ ptab
Columns: (a date, b int, c text)
Partition key: RANGE (a, b)
Partitions:
  ptab_00001 FOR VALUES FROM ('2017-08-31', 1) TO ('2017-08-31', 1000),
  ptab_00002 FOR VALUES FROM ('2017-08-31', 1000) TO ('2017-08-31', 2000),
  ptab_00003 FOR VALUES FROM ('2017-08-31', 2000) TO ('2017-08-31', 3000),
  ptab_00004 FOR VALUES FROM ('2017-08-31', 3000) TO ('2017-08-31', 4000),
  ptab_00005 FOR VALUES FROM ('2017-08-31', 4000) TO ('2017-08-31', 5000),
  ptab_00006 FOR VALUES FROM ('2017-09-01', 1) TO ('2017-09-01', 1000),
  ptab_00007 FOR VALUES FROM ('2017-09-01', 1000) TO ('2017-09-01', 2000),
  ...
  ptab_NNNNN FOR VALUES FROM (..., 4000) TO (..., 5000)

A query that prunes all partitions (empty result!):

explain select * from ptab where a < '2017-08-31';

Comparison of the average response times (in milliseconds) after running the same query 100 times using pgbench against the database:

#: number of partitions of ptab
c_e: constraint exclusion
f_p: fast pruning

    #     c_e    f_p
=====   =====   ====
   10     0.7    0.4
   50     1.8    0.6
  100     3.2    0.8
  500    16.8    2.7
 1000    36.2    5.0
 2000    79.7   10.2
 5000   214.7   27.0
10000   443.6   64.8

For each partitioned table in a given partition tree (provided it is not pruned to begin with), the query's clauses are matched to its partition key, and from the matched clauses a pair of bounding keys (Datum tuples with <= key->partnatts values, possibly a prefix of a multi-column key) is generated. They are passed to partition.c's get_partitions_for_keys() as Datum *minkeys and Datum *maxkeys (a sketch of this interface, as described here, follows this message). A list of partitions covering that key range is returned. When generating that list, whether a particular scan key is inclusive or not is considered along with the partitioning strategy. It should be possible to support hash partitioning with (hopefully) minimal changes to get_partitions_for_keys().

There are still certain limitations on the planner side of things:

1. OR clauses are not counted as contributing toward bounding scan keys; currently only OpExprs and NullTests are recognized, so an OR clause would get skipped from consideration when generating the bounding keys to pass to partition.c.
2. Redundant clauses are not appropriately pre-processed; so if a query contains a = 10 and a > 1, the latter clause will be matched and partitions holding values with a > 1 and a < 10 will not be pruned, even though none of their rows will pass the query's condition.

Fixing these limitations, adding more regression tests, and implementing some of the things suggested by Ashutosh Bapat [1] to prevent conflicting changes with some preparatory patches in the partitionwise-join patch series [2] are TODOs.

Adding this to CF-201709 as "faster partition pruning in planner".

To try out the attached patches: apply the patches posted at [3] on HEAD and then apply these.

Thanks,
Amit

[1] https://postgr.es/m/CAFjFpRdb_fkmJHFjvAbB%2BLn0t45fWjekLd5pY%3Dsv%2BeAhBAKXPQ%40mail.gmail.com
[2] https://postgr.es/m/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=P1jukYBrw@mail.gmail.com
[3] https://www.postgresql.org/message-id/2124e99f-9528-0f71-4e10-ac7974dd7077%40lab.ntt.co.jp
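Based only on the description above, the interface being exercised might be declared roughly as follows; the argument names and the exact signature are assumptions, not the WIP patch's actual declaration. Only the Datum *minkeys / Datum *maxkeys shape is taken from the message.

    #include "postgres.h"
    #include "nodes/pg_list.h"
    #include "utils/relcache.h"

    /* hypothetical sketch of the described interface */
    extern List *get_partitions_for_keys(Relation rel,
                                         int nkeys,      /* may be < key->partnatts */
                                         Datum *minkeys, /* lower bounding key, or NULL */
                                         bool min_inclusive,
                                         Datum *maxkeys, /* upper bounding key, or NULL */
                                         bool max_inclusive);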
Attachments:
- 0001-Teach-pg_inherits.c-a-bit-about-partitioning.patch
- 0002-Allow-locking-only-partitioned-children-in-partition.patch
- 0003-WIP-Defer-opening-and-locking-partitions-to-set_appe.patch
- 0004-WIP-Interface-changes-for-partition_bound_-cmp-bsear.patch
- 0005-WIP-partition.c-interface-additions-for-partition-pr.patch
- 0006-WIP-planner-side-changes-for-partition-pruning.patch
On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached is now also the set of patches that implement the actual partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of the attached.

It strikes me that this patch set is doing two things, but maybe in the opposite order from the one in which I would have chosen to attack them. First, there's getting partition pruning to use something other than constraint exclusion. Second, there's deferring work that is currently done at an early stage of the process until later, so that we waste less effort on partitions that are ultimately going to be pruned.

The second one is certainly a worthwhile goal, but there are fairly firm interdependencies between the first one and some other things that are in progress. For example, the first one probably ought to be done before hash partitioning gets committed, because constraint-exclusion-based partition pruning won't work with hash partitioning, but some mechanism based on asking the partitioning code which partitions might match will. Such a mechanism is more efficient for list and range partitions, but it's the only thing that will work for hash partitions. Also, Beena Emerson is working on run-time partition pruning, and the more I think about it, the more I think that overlaps with this first part. Both patches need a mechanism to identify, given a btree-indexable comparison operator (< > <= >= =) and a set of values, which partitions might contain matching values. Run-time partition pruning will call that at execution time, and this patch will call it at plan time, but it's the same logic; it's just a question of the point at which the values are known. And of course we don't want to end up with two copies of the logic.

Therefore, IMHO, it would be best to focus first on how we're going to identify the partitions that survive pruning, and then afterwards work on transposing that logic to happen before partitions are opened and locked. That way, we get some incremental benefit sooner, and also unblock some other development work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the comments.

On 2017/09/02 2:52, Robert Haas wrote:
> On Thu, Aug 31, 2017 at 2:02 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached is now also the set of patches that implement the actual partition-pruning logic, viz. the last 3 patches (0004, 0005, and 0006) of the attached.
>
> It strikes me that this patch set is doing two things, but maybe in the opposite order from the one in which I would have chosen to attack them. First, there's getting partition pruning to use something other than constraint exclusion. Second, there's deferring work that is currently done at an early stage of the process until later, so that we waste less effort on partitions that are ultimately going to be pruned.

OK.

> The second one is certainly a worthwhile goal, but there are fairly firm interdependencies between the first one and some other things that are in progress. For example, the first one probably ought to be done before hash partitioning gets committed, because constraint-exclusion-based partition pruning won't work with hash partitioning, but some mechanism based on asking the partitioning code which partitions might match will.

Yeah.

> Such a mechanism is more efficient for list and range partitions, but it's the only thing that will work for hash partitions. Also, Beena Emerson is working on run-time partition pruning, and the more I think about it, the more I think that overlaps with this first part. Both patches need a mechanism to identify, given a btree-indexable comparison operator (< > <= >= =) and a set of values, which partitions might contain matching values. Run-time partition pruning will call that at execution time, and this patch will call it at plan time, but it's the same logic; it's just a question of the point at which the values are known. And of course we don't want to end up with two copies of the logic.

Agreed here too. I agree that spending effort first on the deferment of locking, etc., within the planner does not much benefit either the hash-partitioning or run-time pruning patches.

> Therefore, IMHO, it would be best to focus first on how we're going to identify the partitions that survive pruning, and then afterwards work on transposing that logic to happen before partitions are opened and locked. That way, we get some incremental benefit sooner, and also unblock some other development work.

Alright, I will try to do it that way.

Thanks,
Amit
On 2017/09/04 10:10, Amit Langote wrote:
> On 2017/09/02 2:52, Robert Haas wrote:
>> It strikes me that this patch set is doing two things, but maybe in the opposite order from the one in which I would have chosen to attack them. First, there's getting partition pruning to use something other than constraint exclusion. Second, there's deferring work that is currently done at an early stage of the process until later, so that we waste less effort on partitions that are ultimately going to be pruned.
>
> OK.
>
>> Therefore, IMHO, it would be best to focus first on how we're going to identify the partitions that survive pruning, and then afterwards work on transposing that logic to happen before partitions are opened and locked. That way, we get some incremental benefit sooner, and also unblock some other development work.
>
> Alright, I will try to do it that way.

Attached is a set of patches that does things that way. The individual patches are described below:

[PATCH 1/5] Expand partitioned inheritance in a non-flattened manner

This will allow us to perform scan and join planning in a per-partition-sub-tree manner, with each sub-tree's root getting its own RelOptInfo. Previously, only the root of the whole partition tree would get a RelOptInfo, along with the leaf partitions, with each leaf partition's AppendRelInfo pointing to the former as its parent.

This is essential, because we will be doing partition-pruning for every partitioned table in the tree by matching the query's scan keys with its partition key. We won't be able to do that if the intermediate partitioned tables don't have a RelOptInfo.

[PATCH 2/5] WIP: planner-side changes for partition-pruning

This patch adds a stub get_partitions_for_keys() in partition.c with a suitable interface for the caller to pass bounding keys extracted from the query and other related information.

Importantly, it implements the logic within the planner to match the query's scan keys to a parent table's partition key and form the bounding keys that will be passed to partition.c to compute the list of partitions that satisfy those bounds.

Also, it adds certain fields to the RelOptInfos of partitioned tables that reflect their partitioning properties.

[PATCH 3/5] WIP: Interface changes for partition_bound_{cmp/bsearch}

This one modifies the partition bound comparison functions so that the caller can pass an incomplete partition key tuple that is potentially a prefix of a multi-column partition key. Currently, the input tuple must contain all of key->partnatts values, but that may not be true for queries, which may not have restrictions on all the partition key columns (a sketch of what such a relaxed comparison might look like follows this message).

[PATCH 4/5] WIP: Implement get_partitions_for_keys()

This one fills the get_partitions_for_keys() stub with the actual logic that searches partdesc->boundinfo for partition bounds that match the bounding keys specified by the caller.

[PATCH 5/5] Add more tests for the new partitioning-related planning

More tests.

Some TODO items still remain to be done:

* Process OR clauses to use for partition-pruning
* Process redundant clauses (such as a = 10 and a > 1) more smartly
* Other tricks that are missing
* Fix bugs
* More tests

Thanks,
Amit
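Based only on that description, the relaxed comparison for patch 3/5 might look like the sketch below. The function name, arguments, and internals are assumptions (PartitionBoundInfoData is private to partition.c, so this could only live there), shown just to illustrate comparing a prefix of the key.

    #include "postgres.h"
    #include "catalog/partition.h"
    #include "fmgr.h"
    #include "utils/rel.h"

    /*
     * Hypothetical sketch: nkeys may be smaller than key->partnatts, in
     * which case only the given prefix of the partition key is compared.
     */
    static int32
    partition_bound_cmp_prefix(PartitionKey key, PartitionBoundInfo boundinfo,
                               int bound_offset, int nkeys, Datum *values)
    {
        int32       cmpval = 0;
        int         i;

        for (i = 0; i < nkeys; i++)
        {
            cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i],
                                                     key->partcollation[i],
                                                     boundinfo->datums[bound_offset][i],
                                                     values[i]));
            if (cmpval != 0)
                break;          /* a prefix column decides the result */
        }

        /*
         * Zero here means "equal on the supplied prefix"; the caller
         * decides what such a tie means for its search.
         */
        return cmpval;
    }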
Forgot to mention a couple of important points about the relation of some of the patches here to the patches and discussion at the partitionwise-join thread [1].

On 2017/09/06 19:38, Amit Langote wrote:
> [PATCH 1/5] Expand partitioned inheritance in a non-flattened manner
>
> This will allow us to perform scan and join planning in a per-partition-sub-tree manner, with each sub-tree's root getting its own RelOptInfo. Previously, only the root of the whole partition tree would get a RelOptInfo, along with the leaf partitions, with each leaf partition's AppendRelInfo pointing to the former as its parent.
>
> This is essential, because we will be doing partition-pruning for every partitioned table in the tree by matching the query's scan keys with its partition key. We won't be able to do that if the intermediate partitioned tables don't have a RelOptInfo.

There is a patch in Ashutosh's posted series of patches which does more or less the same thing that this patch does. He included it in his series of patches because, IIUC, the main partitionwise-join planning logic that one of the later patches implements depends on being able to consider applying that new planning technique individually for every partition sub-tree, instead of just at the whole tree's root.

One notable difference from his patch is that while his patch will expand in a non-flattened manner even in the case where the parent is the result relation of a query, my patch doesn't in that case, because the new partition-pruning technique cannot be applied to an inheritance parent that is a result relation, for example:

update partitioned_table set ...

And AFAICS, partitionwise join cannot be applied to such a parent either.

Note however that if there are other instances of the same partitioned table (in the FROM list of an update statement) or other partitioned tables in the query, they will be expanded in a non-flattened manner, because they are themselves not the result relations of the query. So, the new partition-pruning and (supposedly) partitionwise join can be applied for those other partitioned tables.

> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>
> Also, it adds certain fields to the RelOptInfos of partitioned tables that reflect their partitioning properties.

There is something called PartitionScheme, a notion one of Ashutosh's patches invented, which this patch incorporates as one of the new fields in RelOptInfo that I mentioned above (along with a list of PartitionSchemes in the planner-global PlannerInfo). Although PartitionScheme is not significant for the task of partition-pruning itself, it's still useful. On Ashutosh's suggestion, I adopted the same in my patch series, so that the partition-wise join patch doesn't end up conflicting with the partition-pruning patch while trying to implement the same, and can get straight to the task of implementing partition-wise joins.

The same patch in the partition-wise join patch series that introduces PartitionScheme also introduces a field in the RelOptInfo called partexprs, which records the partition key expressions. Since partition-pruning has a use for the same, I incorporated it here too, in a format that Ashutosh's partition-wise join patch can use directly, instead of the latter having to hack it again to make it suitable for storing the partition key expressions of joinrels. Again, that's to minimize conflicts and let his patch just find the field to use as is, instead of implementing it first.
Lastly, this patch introduces a PartitionAppendInfo in a partitioned table's RelOptInfo that stores the AppendRelInfos of the partitions (child relations) that survive partition-pruning, which serves to identify those partitions' RelOptInfos. Along with the identities of the surviving partitions, it also stores the partitioning configuration of the partitioned table after partitions are pruned. That includes partdesc->boundinfo (which is simply a pointer into the table's relcache) and a few other fields that are set by the partition-pruning code, such as min_datum_index, max_datum_index, and null_partition_chosen, that describe the result after pruning (a sketch of what such a node might look like follows this message). So, for two partitioned tables being joined, if the boundinfos match per partition_bounds_equal() and these other fields match, they can be safely partition-wise joined.

[1] https://www.postgresql.org/message-id/CAFjFpRfRDhWp%3DoguNjyzN%3DNMoOD%2BRCC3wS%2Bb%2BxbGKwKUk0dRKg%40mail.gmail.com
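For illustration only, such a node might look like the following sketch; the field names other than those mentioned above are assumptions, not the WIP patch's actual definition.

    #include "postgres.h"
    #include "catalog/partition.h"      /* PartitionBoundInfo */
    #include "nodes/nodes.h"
    #include "nodes/pg_list.h"

    /* hypothetical sketch of the node described above */
    typedef struct PartitionAppendInfo
    {
        NodeTag     type;

        /* AppendRelInfos of the immediate partitions surviving pruning */
        List       *child_appinfos;

        /* partitioning configuration after pruning */
        PartitionBoundInfo boundinfo;   /* points into the relcache */
        int         min_datum_index;
        int         max_datum_index;
        bool        null_partition_chosen;
    } PartitionAppendInfo;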
On Thu, Sep 7, 2017 at 7:16 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> There is a patch in Ashutosh's posted series of patches which does more or less the same thing that this patch does. He included it in his series of patches because, IIUC, the main partitionwise-join planning logic that one of the later patches implements depends on being able to consider applying that new planning technique individually for every partition sub-tree, instead of just at the whole tree's root.
>
> One notable difference from his patch is that while his patch will expand in a non-flattened manner even in the case where the parent is the result relation of a query, my patch doesn't in that case, because the new partition-pruning technique cannot be applied to an inheritance parent that is a result relation, for example:
>
> update partitioned_table set ...
>
> And AFAICS, partitionwise join cannot be applied to such a parent either.
>
> Note however that if there are other instances of the same partitioned table (in the FROM list of an update statement) or other partitioned tables in the query, they will be expanded in a non-flattened manner, because they are themselves not the result relations of the query. So, the new partition-pruning and (supposedly) partitionwise join can be applied for those other partitioned tables.

It seems to me that it would be better not to write new patches for things that already have patches, without a really clear explanation of what's wrong with the already-existing patch; I don't see any such explanation here. Instead of writing your own patch for this to duel with his, why not review his and help him correct any deficiencies you can spot? Then we have one patch with more review instead of two patches with less review, both of which I have to read while trying to decide which is better.

In this case, I think Ashutosh has the right idea. I think that handling the result relation and non-result relation differently creates an unpleasant asymmetry. With your patch, we have to deal with three cases: (a) partitioned tables that were expanded level-by-level because they are not result relations, (b) partitioned tables that were expanded "flattened" because they are result relations, and (c) non-partitioned tables that were expanded "flattened". With Ashutosh's approach, we only have two cases to worry about in the future rather than three, and I like that better.

Your patch also appears to change things so that the table actually referenced in the query ends up with an AppendRelInfo for the parent, which seems pointless. And it has no tests.

There are a couple of hunks from your patch that we might want or need to incorporate into Ashutosh's patch. The change to relation_excluded_by_constraints() looks like it might be useful, although it needs a better comment and some tests. Also, Ashutosh's patch has no equivalent of your change to add_paths_to_append_rel(). I'm not clear what the code you've introduced there is supposed to be doing, and I'm suspicious that it is confusing "partition root" with "table named in the query", which will often be the same but not always; the user could have named an intermediate partition. Can you expand on what this is doing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/09/08 4:41, Robert Haas wrote:
> On Thu, Sep 7, 2017 at 7:16 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> There is a patch in Ashutosh's posted series of patches which does more or less the same thing that this patch does. [...]
>
> It seems to me that it would be better not to write new patches for things that already have patches, without a really clear explanation of what's wrong with the already-existing patch; I don't see any such explanation here. Instead of writing your own patch for this to duel with his, why not review his and help him correct any deficiencies you can spot? Then we have one patch with more review instead of two patches with less review, both of which I have to read while trying to decide which is better.

Sorry, I think I should have just used Ashutosh's patch.

> In this case, I think Ashutosh has the right idea. I think that handling the result relation and non-result relation differently creates an unpleasant asymmetry. [...] With Ashutosh's approach, we only have two cases to worry about in the future rather than three, and I like that better.

I tend to agree with this now.

> Your patch also appears to change things so that the table actually referenced in the query ends up with an AppendRelInfo for the parent, which seems pointless.

Actually, it wouldn't, because my patch also got rid of the notion of adding the duplicate RTE for the original parent, since I thought the duplicate RTE was pointless in the partitioning case.

> There are a couple of hunks from your patch that we might want or need to incorporate into Ashutosh's patch. The change to relation_excluded_by_constraints() looks like it might be useful, although it needs a better comment and some tests.

I think we could just drop that part from this patch. It also looks like Ashutosh has a patch elsewhere concerning this: https://commitfest.postgresql.org/14/1108/. Maybe we could discuss what to do about this on that thread.
Now that not only the root partitioned table but also the other partitioned tables in the tree get an RTE with inh = true, I think it would be interesting to consider his patch.

> Also, Ashutosh's patch has no equivalent of your change to add_paths_to_append_rel(). I'm not clear what the code you've introduced there is supposed to be doing, and I'm suspicious that it is confusing "partition root" with "table named in the query", which will often be the same but not always; the user could have named an intermediate partition. Can you expand on what this is doing?

I've replied on the partition-wise join thread [1] explaining why the changes in add_paths_to_append_rel() are necessary. Anyway, I'm dropping my patch in favor of the patch on the other thread. Sorry for the duplicated effort involved in having to look at both patches.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoZEUonD9dUZH1FBEyq%3DPEv_KvE3wC%3DA%3D0zm-_tRz_917A%40mail.gmail.com
On 21 August 2017 at 18:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time partition-pruning that is hopefully faster than the current method of using constraint exclusion to prune each of the potentially many partitions one-by-one. It's not fully cooked yet, though.

I'm interested in seeing improvements in this area, so I've put my name down to review this.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2017/09/15 10:55, David Rowley wrote:
> On 21 August 2017 at 18:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I've been working on implementing a way to perform plan-time partition-pruning that is hopefully faster than the current method of using constraint exclusion to prune each of the potentially many partitions one-by-one. It's not fully cooked yet, though.
>
> I'm interested in seeing improvements in this area, so I've put my name down to review this.

Great, thanks! I will post rebased patches later today, although I think the overall design of the patch on the planner side of things is not quite there yet. Of course, your and others' feedback is greatly welcome.

Also, I must inform all of those who are looking at this thread that I won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23, Sat, due to some personal business.

Thanks,
Amit
On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/04 10:10, Amit Langote wrote:
>> On 2017/09/02 2:52, Robert Haas wrote:
>
> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>
> This patch adds a stub get_partitions_for_keys() in partition.c with a suitable interface for the caller to pass bounding keys extracted from the query and other related information.
>
> Importantly, it implements the logic within the planner to match the query's scan keys to a parent table's partition key and form the bounding keys that will be passed to partition.c to compute the list of partitions that satisfy those bounds.

+ Node *leftop = get_leftop(clause);
+
+ if (IsA(leftop, RelabelType))
+ leftop = (Node *) ((RelabelType *) leftop)->arg;
+
+ if (equal(leftop, partkey))

It appears that this patch always assumes that the clause will be of the form "var op const", but it can also be "const op var". That's the reason why, in the below example where the condition is the same in both queries, it can only prune in the first case but not in the second.

postgres=# explain select * from t where t.a < 2;
                       QUERY PLAN
--------------------------------------------------------
 Append  (cost=0.00..2.24 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
         Filter: (a < 2)
(3 rows)

postgres=# explain select * from t where 2>t.a;
                       QUERY PLAN
--------------------------------------------------------
 Append  (cost=0.00..4.49 rows=2 width=8)
   ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
         Filter: (2 > a)
   ->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
         Filter: (2 > a)
(5 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi Dilip,

Thanks for looking at the patch.

On 2017/09/15 13:43, Dilip Kumar wrote:
> On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote wrote:
>> [PATCH 2/5] WIP: planner-side changes for partition-pruning
>>
>> This patch adds a stub get_partitions_for_keys() in partition.c with a suitable interface for the caller to pass bounding keys extracted from the query and other related information.
>>
>> Importantly, it implements the logic within the planner to match the query's scan keys to a parent table's partition key and form the bounding keys that will be passed to partition.c to compute the list of partitions that satisfy those bounds.
>
> + Node *leftop = get_leftop(clause);
> +
> + if (IsA(leftop, RelabelType))
> + leftop = (Node *) ((RelabelType *) leftop)->arg;
> +
> + if (equal(leftop, partkey))
>
> It appears that this patch always assumes that the clause will be of the form "var op const", but it can also be "const op var". That's the reason why, in the below example where the condition is the same in both queries, it can only prune in the first case but not in the second.
>
> postgres=# explain select * from t where t.a < 2;
>                        QUERY PLAN
> --------------------------------------------------------
>  Append  (cost=0.00..2.24 rows=1 width=8)
>    ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>          Filter: (a < 2)
> (3 rows)
>
> postgres=# explain select * from t where 2>t.a;
>                        QUERY PLAN
> --------------------------------------------------------
>  Append  (cost=0.00..4.49 rows=2 width=8)
>    ->  Seq Scan on t1  (cost=0.00..2.24 rows=1 width=8)
>          Filter: (2 > a)
>    ->  Seq Scan on t2  (cost=0.00..2.25 rows=1 width=8)
>          Filter: (2 > a)
> (5 rows)

Yeah, there are a bunch of smarts still missing in that patch as it is.

Thanks,
Amit
On 2017/09/15 11:16, Amit Langote wrote:
> I will post rebased patches later today, although I think the overall design of the patch on the planner side of things is not quite there yet. Of course, your and others' feedback is greatly welcome.

Rebased patches attached. Because Dilip complained earlier today about clauses of the form (const op var) not causing partition-pruning, I've added code to commute the clause where it is required (a sketch of the idea follows this message). Some other previously mentioned limitations remain: no handling of OR clauses, no elimination of redundant clauses for a given partitioning column, etc.

A note about 0001: this patch overlaps with 0003-Canonical-partition-scheme.patch from the partitionwise-join patch series that Ashutosh Bapat posted yesterday [1]. Because I implemented the planner portion of this patch based on what 0001 builds, I'm posting it here. It might actually turn out that we will review and commit 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply 0001 if you want to play with the later patches. I would certainly like to review 0003-Canonical-partition-scheme.patch myself, but won't be able to immediately (see below).

> Also, I must inform all of those who are looking at this thread that I won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23, Sat, due to some personal business.

To remind.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAFiTN-skmaqeCVaoAHCBqe2DyfO3f6sgdtEjHWrUgi0kV1yPLQ%40mail.gmail.com
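A minimal sketch of the commuting idea, assuming the usual lsyscache helpers; the function name and signature here are hypothetical, not the rebased patch's actual code.

    #include "postgres.h"
    #include "optimizer/clauses.h"      /* get_leftop(), get_rightop() */
    #include "utils/lsyscache.h"        /* get_commutator() */

    /*
     * Hypothetical sketch: match "var op const" directly, and accept
     * "const op var" by looking up the operator's commutator so it can
     * be treated as "var op' const" when matching the partition key.
     */
    static bool
    match_clause_to_partkey(OpExpr *opclause, Node *partkey,
                            Oid *opno_out, Node **constarg_out)
    {
        Node   *leftop = get_leftop((Expr *) opclause);
        Node   *rightop = get_rightop((Expr *) opclause);

        if (IsA(leftop, RelabelType))
            leftop = (Node *) ((RelabelType *) leftop)->arg;
        if (IsA(rightop, RelabelType))
            rightop = (Node *) ((RelabelType *) rightop)->arg;

        if (equal(leftop, partkey))
        {
            /* var op const: use the operator as-is */
            *opno_out = opclause->opno;
            *constarg_out = rightop;
            return true;
        }

        if (equal(rightop, partkey))
        {
            /* const op var: usable only if the operator commutes */
            Oid     commutator = get_commutator(opclause->opno);

            if (OidIsValid(commutator))
            {
                *opno_out = commutator;
                *constarg_out = leftop;
                return true;
            }
        }

        return false;
    }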
Attachment
On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Rebased patches attached. Because Dilip complained earlier today about > clauses of the form (const op var) not causing partition-pruning, I've > added code to commute the clause where it is required. Some other > previously mentioned limitations remain -- no handling of OR clauses, no > elimination of redundant clauses for given partitioning column, etc. > > A note about 0001: this patch overlaps with > 0003-Canonical-partition-scheme.patch from the partitionwise-join patch > series that Ashutosh Bapat posted yesterday [1]. It doesn't merely overlap; it's obviously a derivative work, and the commit message in your version should credit all the authors. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Sep 16, 2017 at 4:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Rebased patches attached. Because Dilip complained earlier today about >> clauses of the form (const op var) not causing partition-pruning, I've >> added code to commute the clause where it is required. Some other >> previously mentioned limitations remain -- no handling of OR clauses, no >> elimination of redundant clauses for given partitioning column, etc. >> >> A note about 0001: this patch overlaps with >> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch >> series that Ashutosh Bapat posted yesterday [1]. > > It doesn't merely overlap; it's obviously a derivative work, Yes, it is. I noted upthread [1] that most of these are derived from Ashutosh's patch, on his suggestion. I guess I should have repeated that in this message too, sorry. > and the > commit message in your version should credit all the authors. That was a mistake on my part, too. Will be more careful from here on. Thanks, Amit [1] https://www.postgresql.org/message-id/0e829199-a43c-2a66-b966-89a0020a6cd4%40lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/15 11:16, Amit Langote wrote: Thanks for the updated patch. I was going through the logic of get_rel_partitions in 0002, as almost the same functionality will be required by runtime partition pruning, on which Beena is working. The only difference is that here we are processing the "rel->baserestrictinfo", whereas for runtime pruning we also need to process join clauses that are pushed down to the appendrel. So can we make some generic logic which can be used for both the patches? Basically, we need to make two changes: 1. Instead of processing "rel->baserestrictinfo" inside get_rel_partitions, we can take the clause list as an input; that way we can pass any clause list to this function. 2. Don't call the "get_partitions_for_keys" routine from "get_rel_partitions"; instead, get_rel_partitions can just prepare minkey and maxkey, and the caller of get_rel_partitions can call get_partitions_for_keys, because for runtime pruning we need to call get_partitions_for_keys at runtime. Even after these changes there will be one problem: get_partitions_for_keys directly fetches "rightop->constvalue", whereas for runtime pruning we need to store rightop itself and compute the value at runtime by param evaluation. I haven't yet thought about how we can make this last part generic. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
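In rough outline, the division of labor being proposed would look like this (all signatures here are hypothetical, purely for illustration; none of these exact names are from the posted patches):

  /* Step 1 (planner, reusable): collect clauses matching the partition key
   * from *any* clause list -- baserestrictinfo or pushed-down join clauses. */
  List *match_clauses_to_partkey(RelOptInfo *rel, List *clauses);

  /* Step 2: turn the matched clauses into bounding keys.  At plan time the
   * values come from Consts; for runtime pruning this step would run in the
   * executor, after Params have been evaluated. */
  void extract_bounding_keys(List *matched_clauses,
                             Datum *minkeys, int *n_minkeys, bool *min_incl,
                             Datum *maxkeys, int *n_maxkeys, bool *max_incl);

  /* Step 3 (partition.c): look up the partitions satisfying the bounds;
   * callable from either the planner or the executor. */
  List *get_partitions_for_keys(Relation rel,
                                Datum *minkeys, int n_minkeys, bool min_incl,
                                Datum *maxkeys, int n_maxkeys, bool max_incl);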
I have done some refactoring of the code: I have moved the code that collects the matching clauses into a separate function, so that it can fetch the matching clauses from any given restriction list. It can be applied on top of 0002-WIP: planner-side-changes-for-partition-pruning.patch On Sat, Sep 16, 2017 at 3:13 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/09/15 11:16, Amit Langote wrote: > > Thanks for the updated patch. I was going through the logic of > get_rel_partitions in 0002, as almost the same functionality will be > required by runtime partition pruning, on which Beena is working. The > only difference is that here we are processing the "rel->baserestrictinfo", > whereas for runtime pruning we also need to process join clauses that are > pushed down to the appendrel. > > So can we make some generic logic which can be used for both the patches? > > Basically, we need to make two changes: > > 1. Instead of processing "rel->baserestrictinfo" inside get_rel_partitions, > we can take the clause list as an input; that way we can pass any clause > list to this function. > > 2. Don't call the "get_partitions_for_keys" routine from > "get_rel_partitions"; instead, get_rel_partitions can just prepare minkey > and maxkey, and the caller of get_rel_partitions can call > get_partitions_for_keys, because for runtime pruning we need to call > get_partitions_for_keys at runtime. > > Even after these changes there will be one problem: get_partitions_for_keys > directly fetches "rightop->constvalue", whereas for runtime pruning we need > to store rightop itself and compute the value at runtime by param > evaluation. I haven't yet thought about how we can make this last part > generic. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi Dilip. Thanks for looking at the patches and the comments. On 2017/09/16 18:43, Dilip Kumar wrote: > On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/09/15 11:16, Amit Langote wrote: > > Thanks for the updated patch. I was going through the logic of > get_rel_partitions in 0002, as almost the same functionality will be > required by runtime partition pruning, on which Beena is working. The > only difference is that here we are processing the "rel->baserestrictinfo", > whereas for runtime pruning we also need to process join clauses that are > pushed down to the appendrel. Yeah, I agree with the point you seem to be making that get_rel_partitions() covers a lot of functionality, which it would be nice to break down into reusable function(s) with suitable signature(s) that the executor will also be able to use. Your proposed refactoring patch down-thread seems to be a good step in that direction. Thanks for working on it. > So can we make some generic logic which can be used for both the patches? > > Basically, we need to make two changes: > > 1. Instead of processing "rel->baserestrictinfo" inside get_rel_partitions, > we can take the clause list as an input; that way we can pass any clause > list to this function. > > 2. Don't call the "get_partitions_for_keys" routine from > "get_rel_partitions"; instead, get_rel_partitions can just prepare minkey > and maxkey, and the caller of get_rel_partitions can call > get_partitions_for_keys, because for runtime pruning we need to call > get_partitions_for_keys at runtime. It's not clear to me whether get_rel_partitions() itself, as it is, is callable from outside the planner, because its signature contains RelOptInfo. We have the RelOptInfo in the signature, because we want to mark certain fields in it so that later planning steps can use them. So, get_rel_partitions()'s job is not just to match clauses and find partitions, but also to perform certain planner-specific tasks of generating information that the later planning steps will want to use. That may turn out to be unnecessary, but until we know that, let's not try to export get_rel_partitions() itself out of the planner. OTOH, the function that your refactoring patch separates out to match clauses to partition keys and extract bounding values seems reusable outside the planner and we should export it in such a way that it can be used in the executor. Then, the hypothetical executor function that does the pruning will first call the planner's clause-matching function, followed by calling get_partitions_for_keys() in partition.c to get the selected partitions. We should be careful when designing the interface of the exported function to make sure it's not bound to the planner. Your patch still maintains the RelOptInfo in the signature of the clause-matching function, which the executor pruning function won't have access to. > Even after these changes there will be one problem: get_partitions_for_keys > directly fetches "rightop->constvalue", whereas for runtime pruning we need > to store rightop itself and compute the value at runtime by param > evaluation. I haven't yet thought about how we can make this last part > generic. I don't think any code introduced by the patch in partition.c itself looks inside OpExpr (or any type of clause for that matter). That is, I don't see where get_partitions_for_keys() is looking at rightop->constvalue. 
All it receives to work with are arrays of Datums and some other relevant information like inclusivity, nullness, etc. By the way, I'm now rebasing these patches on top of [1] and will try to merge your refactoring patch in some appropriate way. Will post more tomorrow. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for looking at the patches and the comments. > It's not clear to me whether get_rel_partitions() itself, as it is, is > callable from outside the planner, because its signature contains > RelOptInfo. We have the RelOptInfo in the signature, because we want to > mark certain fields in it so that later planning steps can use them. So, > get_rel_partitions()'s job is not just to match clauses and find > partitions, but also to perform certain planner-specific tasks of > generating information that the later planning steps will want to use. > That may turn out to be unnecessary, but until we know that, let's not try > to export get_rel_partitions() itself out of the planner. > > OTOH, the function that your refactoring patch separates out to match > clauses to partition keys and extract bounding values seems reusable > outside the planner and we should export it in such a way that it can be > used in the executor. Then, the hypothetical executor function that does > the pruning will first call the planner's clause-matching function, > followed by calling get_partitions_for_keys() in partition.c to get the > selected partitions. Thanks for your reply. Actually, we are still planning to call get_matching_clause at optimizer time only. We cannot use the get_rel_partitions function directly for runtime pruning, because it does all the work (finding the matching clauses, creating minkey and maxkey, and calling get_partitions_for_keys) during planning time itself. For runtime pruning, we are planning to first call the get_matching_clause function at optimizer time to identify the clauses that match the partition keys, but for the PARAM_EXEC case we cannot depend upon baserestrictinfo; instead we will get them from the join clauses. That's the reason I have separated out get_matching_clause. But it will still be used during planning time. After separating out the matching clauses, we will do processing somewhat similar to what "get_rel_partitions" is doing. But, at optimizer time, for a PARAM we will not have the Datum value for the rightop, so we will keep track of the PARAM itself. And, finally, at runtime when we get the PARAM value, we can prepare minkey and maxkey and call the get_partitions_for_keys function. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/09/25 20:21, Dilip Kumar wrote: > On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> Thanks for looking at the patches and the comments. > >> It's not clear to me whether get_rel_partitions() itself, as it is, is >> callable from outside the planner, because its signature contains >> RelOptInfo. We have the RelOptInfo in the signature, because we want to >> mark certain fields in it so that later planning steps can use them. So, >> get_rel_partitions()'s job is not just to match clauses and find >> partitions, but also to perform certain planner-specific tasks of >> generating information that the later planning steps will want to use. >> That may turn out to be unnecessary, but until we know that, let's not try >> to export get_rel_partitions() itself out of the planner. >> >> OTOH, the function that your refactoring patch separates out to match >> clauses to partition keys and extract bounding values seems reusable >> outside the planner and we should export it in such a way that it can be >> used in the executor. Then, the hypothetical executor function that does >> the pruning will first call the planner's clause-matching function, >> followed by calling get_partitions_for_keys() in partition.c to get the >> selected partitions. > > Thanks for your reply. > > Actually, we are still planning to call get_matching_clause at optimizer > time only. We cannot use the get_rel_partitions function directly for > runtime pruning, because it does all the work (finding the matching > clauses, creating minkey and maxkey, and calling get_partitions_for_keys) > during planning time itself. > > For runtime pruning, we are planning to first call the get_matching_clause > function at optimizer time to identify the clauses that match the partition > keys, but for the PARAM_EXEC case we cannot depend upon baserestrictinfo; > instead we will get them from the join clauses. That's the reason I have > separated out get_matching_clause. But it will still be used during > planning time. I see. So, in the run-time pruning case, only the work of extracting bounding values is deferred to execution time. Matching clauses with the partition key still occurs during planning time. It's just that the clauses that require run-time pruning are not those in rel->baserestrictinfo. > After separating out the matching clauses, we will do processing somewhat > similar to what "get_rel_partitions" is doing. But, at optimizer time, for > a PARAM we will not have the Datum value for the rightop, so we will keep > track of the PARAM itself. I guess information about which PARAMs map to which partition keys will be kept in the plan somehow. > And, finally, at runtime when we get the PARAM value, we can prepare > minkey and maxkey and call the get_partitions_for_keys function. Note that get_partitions_for_keys() is not planner code, nor is it bound to any other planning code. It's callable from the executor without much change. Maybe you already know that though. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 26, 2017 at 2:45 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/25 20:21, Dilip Kumar wrote: > I see. So, in the run-time pruning case, only the work of extracting > bounding values is deferred to execution time. Matching clauses with the > partition key still occurs during planning time. It's just that the > clauses that require run-time pruning are not those in > rel->baserestrictinfo. Right. > >> After separating out the matching clauses, we will do processing somewhat >> similar to what "get_rel_partitions" is doing. But, at optimizer time, for >> a PARAM we will not have the Datum value for the rightop, so we will keep >> track of the PARAM itself. > > I guess information about which PARAMs map to which partition keys will be > kept in the plan somehow. Yes. > >> And, finally, at runtime when we get the PARAM value, we can prepare >> minkey and maxkey and call the get_partitions_for_keys function. > > Note that get_partitions_for_keys() is not planner code, nor is it bound > to any other planning code. It's callable from the executor without much > change. Maybe you already know that though. Yes, right. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi Amit, On 09/15/2017 04:50 AM, Amit Langote wrote: > On 2017/09/15 11:16, Amit Langote wrote: >> I will post rebased patches later today, although I think the overall >> design of the patch on the planner side of things is not quite there yet. >> Of course, your and others' feedback is greatly welcome. > > Rebased patches attached. Because Dilip complained earlier today about > clauses of the form (const op var) not causing partition-pruning, I've > added code to commute the clause where it is required. Some other > previously mentioned limitations remain -- no handling of OR clauses, no > elimination of redundant clauses for given partitioning column, etc. > > A note about 0001: this patch overlaps with > 0003-Canonical-partition-scheme.patch from the partitionwise-join patch > series that Ashutosh Bapat posted yesterday [1]. Because I implemented > the planner-portion of this patch based on what 0001 builds, I'm posting > it here. It might actually turn out that we will review and commit > 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply > 0001 if you want to play with the later patches. I would certainly like > to review 0003-Canonical-partition-scheme.patch myself, but won't be able > to immediately (see below). > Could you share your thoughts on the usage of PartitionAppendInfo's min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. I'm looking at get_partitions_for_keys. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > Could you share your thoughts on the usage of PartitionAppendInfo's > min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. This brings up something that I've kind of been thinking about. There are sort of four cases when it comes to partition pruning: 1. There is exactly one matching partition. For example, this happens when there is an equality constraint on every partition column. 2. There are multiple matching partitions which are consecutive. For example, there is a single level of range partitioning with no default partition and the single partitioning column is constrained by < > <= or >=. 3. There are multiple matching partitions which are not consecutive. This case is probably rare, but it can happen if there is a default partition, if there are list partitions with multiple bounds that are interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5), and the query allows values >= 4 and <= 5), if the query involves OR conditions, or if there are multiple levels of partitioning (e.g. partition by a, subpartition by b, put a range constraint on a and an equality constraint on b). 4. There are no matching partitions. One of the goals of this algorithm is to be fast. The obvious way to cater to case (3) is to iterate through all partitions and test whether each one works, returning a Bitmapset, but that is O(n). Admittedly, it might be O(n) with a pretty small constant factor, but it still seems like exactly the sort of thing that we want to avoid given the desire to scale to higher partition counts. I propose that we create a structure that looks like this: struct foo { int min_partition; int max_partition; Bitmapset *extra_partitions; }; This indicates that all partitions from min_partition to max_partition need to be scanned, and in addition any partitions in extra_partitions need to be scanned. Assuming that we only consider cases where all partition keys or a leading subset of the partition keys are constrained, we'll generally be able to get by with just setting min_partition and max_partition, but extra_partitions can be used to handle default partitions and interleaved list bounds. For equality on all partitioning columns, we can do a single bsearch of the bounds to identify the target partition at a given partitioning level, and the same thing works for a single range-bound. If there are two range-bounds (< and > or <= and >= or whatever) we need to bsearch twice. The default partition, if any and if matched, must also be included. When there are multiple levels of partitioning things get a bit more complex -- if someone wants to knock out a partition that breaks up the range, we might need to shrink the main range to cover part of it and kick the other indexes out to extra_partitions. But the good thing is that in common cases with only O(lg n) effort we can return O(1) data that describes what will be scanned. In cases where that's not practical we expend more effort but still prune with maximal effectiveness. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
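A caller of such a structure would then do something like the following (a sketch: compute_matching_partitions() and scan_partition() are hypothetical stand-ins, while bms_next_member() is the existing Bitmapset iterator from nodes/bitmapset.h):

  struct foo *sel = compute_matching_partitions(rel, quals);   /* hypothetical */
  int         i;

  /* Cases (1), (2) and (4) above are covered by the range alone; an empty
   * range (min_partition > max_partition) can encode "no partitions". */
  for (i = sel->min_partition; i <= sel->max_partition; i++)
      scan_partition(i);

  /* Case (3): any non-consecutive stragglers, e.g. the default partition
   * or interleaved list partitions knocked out of the main range. */
  i = -1;
  while ((i = bms_next_member(sel->extra_partitions, i)) >= 0)
      scan_partition(i);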
On 09/26/2017 10:33 AM, Robert Haas wrote: > On Tue, Sep 26, 2017 at 9:00 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> Could you share your thoughts on the usage of PartitionAppendInfo's >> min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. > > This brings up something that I've kind of been thinking about. There > are sort of four cases when it comes to partition pruning: > > 1. There is exactly one matching partition. For example, this happens > when there is an equality constraint on every partition column. > > 2. There are multiple matching partitions which are consecutive. For > example, there is a single level of range partitioning with no default > partition and the single partitioning column is constrained by < > <= > or >=. > > 3. There are multiple matching partitions which are not consecutive. > This case is probably rare, but it can happen if there is a default > partition, if there are list partitions with multiple bounds that are > interleaved (e.g. p1 allows (1, 4), p2 allows (2), p3 allows (3, 5), > and the query allows values >= 4 and <= 5), if the query involves OR > conditions, or if there are multiple levels of partitioning (e.g. > partition by a, subpartition by b, put a range constraint on a and an > equality constraint on b). > > 4. There are no matching partitions. > > One of the goals of this algorithm is to be fast. The obvious way to > cater to case (3) is to iterate through all partitions and test > whether each one works, returning a Bitmapset, but that is O(n). > Admittedly, it might be O(n) with a pretty small constant factor, but > it still seems like exactly the sort of thing that we want to avoid > given the desire to scale to higher partition counts. > > I propose that we create a structure that looks like this: > > struct foo { > int min_partition; > int max_partition; > Bitmapset *extra_partitions; > }; > > This indicates that all partitions from min_partition to max_partition > need to be scanned, and in addition any partitions in extra_partitions > need to be scanned. Assuming that we only consider cases where all > partition keys or a leading subset of the partition keys are > constrained, we'll generally be able to get by with just setting > min_partition and max_partition, but extra_partitions can be used to > handle default partitions and interleaved list bounds. For equality > on all partitioning columns, we can do a single bsearch of the bounds > to identify the target partition at a given partitioning level, and > the same thing works for a single range-bound. If there are two > range-bounds (< and > or <= and >= or whatever) we need to bsearch > twice. The default partition, if any and if matched, must also be > included. When there are multiple levels of partitioning things get a > bit more complex -- if someone wants to knock out a partition that > breaks up the range, we might need to shrink the main range to cover > part of it and kick the other indexes out to extra_partitions. > > But the good thing is that in common cases with only O(lg n) effort we > can return O(1) data that describes what will be scanned. In cases > where that's not practical we expend more effort but still prune with > maximal effectiveness. > For OLTP style applications 1) would be the common case, and with hash partitions it would be one equality constraint. So, changing the method signature to use a data type as you described above instead of the explicit min_datum_idx / max_datum_idx output parameters would be more clear. 
One could advocate (*cough*) that the hash partition patch [1] should be merged first, in order to find instances where other CommitFest entries don't account for hash partitions at the moment in their method signatures; Beena noted something similar in [2]. I know that you said otherwise [3], but this is CommitFest 1, so there is time for a revert later, and hash partitions are already useful in internal testing. [1] https://commitfest.postgresql.org/14/1059/ [2] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com [3] http://rhaas.blogspot.com/2017/08/plans-for-partitioning-in-v11.html Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > One could advocate (*cough*) that the hash partition patch [1] should be > merged first, in order to find instances where other CommitFest entries > don't account for hash partitions at the moment in their method > signatures; Beena noted something similar in [2]. I know that you said > otherwise [3], but this is CommitFest 1, so there is time for a revert > later, and hash partitions are already useful in internal testing. Well, that's a fair point. I was assuming that committing things in that order would cause me to win the "least popular committer" award at least for that day, but maybe not. It's certainly not ideal to have to juggle that patch along and keep rebasing it over other changes when it's basically done, and just waiting on other improvements to land. Anybody else wish to express an opinion? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 25 September 2017 at 23:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > By the way, I'm now rebasing these patches on top of [1] and will try to > merge your refactoring patch in some appropriate way. Will post more > tomorrow. > > [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826 Yeah, I see 0001 conflicts with that. I'm going to set this to waiting on author while you're busy rebasing this. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/09/27 1:51, Robert Haas wrote: > On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen > <jesper.pedersen@redhat.com> wrote: >> One could advocate (*cough*) that the hash partition patch [1] should be >> merged first, in order to find instances where other CommitFest entries >> don't account for hash partitions at the moment in their method >> signatures; Beena noted something similar in [2]. I know that you said >> otherwise [3], but this is CommitFest 1, so there is time for a revert >> later, and hash partitions are already useful in internal testing. > > Well, that's a fair point. I was assuming that committing things in > that order would cause me to win the "least popular committer" award > at least for that day, but maybe not. It's certainly not ideal to > have to juggle that patch along and keep rebasing it over other > changes when it's basically done, and just waiting on other > improvements to land. Anybody else wish to express an opinion? FWIW, I tend to agree that it would be nice to get the hash partitioning patch in, even with old constraint exclusion based partition-pruning not working for hash partitions. That way, it might be more clear what we need to do in the partition-pruning patches to account for hash partitions. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi David, On 2017/09/27 6:04, David Rowley wrote: > On 25 September 2017 at 23:04, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> By the way, I'm now rebasing these patches on top of [1] and will try to >> merge your refactoring patch in some appropriate way. Will post more >> tomorrow. >> >> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826 > > Yeah, I see 0001 conflicts with that. I'm going to set this to waiting > on author while you're busy rebasing this. Thanks for the reminder. Just thought I'd say that while I'm actually done with the rebase itself (attaching rebased patches to try 'em out), I'm now considering Robert's comments and will be busy for a bit revising things based on those comments. Some notes about the attached patches: - 0001 includes refactoring that Dilip proposed upthread [1] (added him as an author). I slightly tweaked his patch -- renamed the function get_matching_clause to match_clauses_to_partkey, similar to match_clauses_to_index. - The code to set AppendPath's partitioned_rels in add_paths_to_append_rel(), revised by 0a480502b09 (it was originally introduced in d3cc37f1d80), is still revised to get partitioned_rels from a source that is not PlannerInfo.pcinfo_list. With the new code, partitioned_rels won't contain RT indexes of the partitioned child tables that were pruned. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi Jesper. Firstly, thanks for looking at the patch. On 2017/09/26 22:00, Jesper Pedersen wrote: > Hi Amit, > > On 09/15/2017 04:50 AM, Amit Langote wrote: >> On 2017/09/15 11:16, Amit Langote wrote: >>> I will post rebased patches later today, although I think the overall >>> design of the patch on the planner side of things is not quite there yet. >>> Of course, your and others' feedback is greatly welcome. >> >> Rebased patches attached. Because Dilip complained earlier today about >> clauses of the form (const op var) not causing partition-pruning, I've >> added code to commute the clause where it is required. Some other >> previously mentioned limitations remain -- no handling of OR clauses, no >> elimination of redundant clauses for given partitioning column, etc. >> >> A note about 0001: this patch overlaps with >> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch >> series that Ashutosh Bapat posted yesterday [1]. Because I implemented >> the planner-portion of this patch based on what 0001 builds, I'm posting >> it here. It might actually turn out that we will review and commit >> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply >> 0001 if you want to play with the later patches. I would certainly like >> to review 0003-Canonical-partition-scheme.patch myself, but won't be able >> to immediately (see below). >> > > Could you share your thoughts on the usage of PartitionAppendInfo's > min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. > > I'm looking at get_partitions_for_keys. Sure. You may have noticed that min_datum_idx and max_datum_idx in PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums array, of datums that lie within the query-specified range (that is, using =, >, >=, <, <= btree operators in the query). That array contains the bounds of all partitions, sorted in the order defined by the btree operator class, at least for list and range partitioning. I haven't (yet) closely looked at the composition of hash partition datums in PartitionBoundInfo, which perhaps have different ordering properties (or maybe none) from those of list and range partitioning datums. Now, since they are offsets of datums in PartitionBoundInfo, not indexes of partitions themselves, their utility outside partition.c might be questionable. But the partition-wise join patch, for example, to determine whether two partitioned tables can be joined partition-wise, is going to check whether the PartitionBoundInfos in the RelOptInfos of the two partitioned tables are bound-to-bound equal [1]. Partition-pruning may select only a subset of partitions of each of the joining partitioned tables. The equi-join requirement for partition-wise join means that the subset of partitions will be the same on both sides of the join. My intent in having min_datum_idx and max_datum_idx, along with contains_null_partition and contains_default_partition, in PartitionAppendInfo is to have a sort of cross-check that those values end up being the same on both sides of the join once the equi-join requirement has been satisfied. That is, get_partitions_for_keys will have chosen the same set of partitions for both partitioned tables and hence will have set the same values for those fields. As mentioned above, that may be enough for list and range partitioning, but since hash partition datums do not appear to have the same properties as list and range datums [2], we may need an additional field or fields to describe the hash partition selected by get_partitions_for_keys. 
I guess only one field will be enough; it will be the offset, in the datums array, of the hash partition chosen for the query, or -1 if the query quals couldn't conclusively determine one (maybe not all partition keys were specified to be hashed, or some or all used a non-equality operator). Hope that answers your question at least to some degree. Now, there are some points Robert mentioned in his reply that I will need to also consider in the patch, which I'll go do now. :) Thanks, Amit [1] https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com [2] It appears that in the hash partitioning case, unlike list and range partitioning, PartitionBoundInfo doesn't contain values that are directly comparable to query-specified constants, but a pair (modulus, remainder) for each partition. We'll first hash *all* the key values (mentioned in the query) using the partitioning hash machinery and then determine the hash partition index by using (hash % largest_modulus) as an offset into the PartitionBoundInfo.indexes array. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
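In code form, the lookup described in [2] would be roughly the following (a sketch written against the proposed hash partitioning patch, with illustrative helper names rather than the patch's exact API):

  uint64  rowHash;
  int     greatest_modulus;
  int     part_index;

  /* Hash all of the query-supplied partition key values together. */
  rowHash = compute_hash_for_keys(partkey, values, isnull);      /* illustrative */

  /* The indexes array is laid out per the largest modulus among the hash
   * partitions; the slot at (hash % largest modulus) names the single
   * partition that can hold a row with this hash. */
  greatest_modulus = largest_partition_modulus(boundinfo);       /* illustrative */
  part_index = boundinfo->indexes[rowHash % greatest_modulus];

  if (part_index < 0)
      elog(ERROR, "no partition found for computed hash remainder");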
On Wed, Sep 27, 2017 at 6:09 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/27 1:51, Robert Haas wrote:
>> On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>> One could advocate (*cough*) that the hash partition patch [1] should be
>>> merged first, in order to find instances where other CommitFest entries
>>> don't account for hash partitions at the moment in their method
>>> signatures; Beena noted something similar in [2]. I know that you said
>>> otherwise [3], but this is CommitFest 1, so there is time for a revert
>>> later, and hash partitions are already useful in internal testing.
>>
>> Well, that's a fair point. I was assuming that committing things in
>> that order would cause me to win the "least popular committer" award
>> at least for that day, but maybe not. It's certainly not ideal to
>> have to juggle that patch along and keep rebasing it over other
>> changes when it's basically done, and just waiting on other
>> improvements to land. Anybody else wish to express an opinion?
> FWIW, I tend to agree that it would be nice to get the hash partitioning
> patch in, even with old constraint exclusion based partition-pruning not
> working for hash partitions. That way, it might be more clear what we
> need to do in the partition-pruning patches to account for hash partitions.
+1
regards,
Amul
On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: I was looking into the latest patch set; it seems like we can reuse some more code between this patch and runtime pruning [1]. + foreach(lc1, matchedclauses[i]) + { + Expr *clause = lfirst(lc1); + Const *rightop = (Const *) get_rightop(clause); + Oid opno = ((OpExpr *) clause)->opno, + opfamily = rel->part_scheme->partopfamily[i]; + StrategyNumber strategy; + + strategy = get_op_opfamily_strategy(opno, opfamily); + switch (strategy) + { + case BTLessStrategyNumber: + case BTLessEqualStrategyNumber: + if (need_next_max) + { + maxkeys[i] = rightop->constvalue; + if (!maxkey_set[i]) + n_maxkeys++; + maxkey_set[i] = true; + max_incl = (strategy == BTLessEqualStrategyNumber); + } + if (strategy == BTLessStrategyNumber) + need_next_max = false; I think the above logic is common between this patch and the runtime pruning, so I think we can make it a reusable function. Here we are preparing minkey and maxkey as Datums because we can directly fetch rightop->constvalue, whereas for runtime pruning we are making minkeys and maxkeys as Exprs, because during planning time we don't have the values for the Params. I think we can always make these minkey and maxkey arrays of Expr, and later those can be processed in whatever way we want. So this patch will fetch the constval out of the Expr, and runtime pruning will evaluate that expression at runtime. Does this make sense, or will it cause one level of extra processing for this patch, i.e., converting the Expr array to a Const array? [1] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
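Concretely, in the style of the quoted loop above, the common helper would fill Expr arrays instead of Datum arrays, and each consumer would reduce them as appropriate (an outline sketch: i, clause and minkeys are as in the quoted code, econtext is assumed to be the executor's expression context, and ExecInitExpr()/ExecEvalExprSwitchContext() are the existing executor entry points):

  /* Common helper output: remember the expression itself, Const or Param. */
  Expr   *minkey_exprs[PARTITION_MAX_KEYS];

  minkey_exprs[i] = (Expr *) get_rightop(clause);

  /* Plan-time pruning: the expression is known to be a Const. */
  minkeys[i] = ((Const *) minkey_exprs[i])->constvalue;

  /* Run-time pruning: evaluate the Param-bearing expression instead. */
  {
      ExprState  *keystate = ExecInitExpr(minkey_exprs[i], NULL);
      bool        isnull;

      minkeys[i] = ExecEvalExprSwitchContext(keystate, econtext, &isnull);
  }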
On 2017/09/28 13:58, Dilip Kumar wrote: > On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > I was looking into the latest patch set; it seems like we can reuse some > more code between this patch and runtime pruning [1]. > > + foreach(lc1, matchedclauses[i]) > + { > + Expr *clause = lfirst(lc1); > + Const *rightop = (Const *) get_rightop(clause); > + Oid opno = ((OpExpr *) clause)->opno, > + opfamily = rel->part_scheme->partopfamily[i]; > + StrategyNumber strategy; > + > + strategy = get_op_opfamily_strategy(opno, opfamily); > + switch (strategy) > + { > + case BTLessStrategyNumber: > + case BTLessEqualStrategyNumber: > + if (need_next_max) > + { > + maxkeys[i] = rightop->constvalue; > + if (!maxkey_set[i]) > + n_maxkeys++; > + maxkey_set[i] = true; > + max_incl = (strategy == BTLessEqualStrategyNumber); > + } > + if (strategy == BTLessStrategyNumber) > + need_next_max = false; > > I think the above logic is common between this patch and the runtime > pruning, so I think we can make it a reusable function. Here we are > preparing minkey and maxkey as Datums because we can directly fetch > rightop->constvalue, whereas for runtime pruning we are making minkeys > and maxkeys as Exprs, because during planning time we don't have the > values for the Params. I think we can always make these minkey and > maxkey arrays of Expr, and later those can be processed in whatever way > we want. So this patch will fetch the constval out of the Expr, and > runtime pruning will evaluate that expression at runtime. I think that makes sense. In fact, we could even move the minkey/maxkey collection code to match_clauses_to_partkey() itself. No need for a different function or to worry about defining a separate interface for it. We match clauses exactly because we want to extract the constant or param values out of them. No need to do the two activities independently and in different places. > Does this make sense, or will it cause one level of extra processing > for this patch, i.e., converting the Expr array to a Const array? Hm, it's not such a big cost to pay, I'd think. I will update the planner patch accordingly. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 28, 2017 at 1:44 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/09/28 13:58, Dilip Kumar wrote: >> I think the above logic is common between this patch and the runtime >> pruning, so I think we can make it a reusable function. Here we are >> preparing minkey and maxkey as Datums because we can directly fetch >> rightop->constvalue, whereas for runtime pruning we are making minkeys >> and maxkeys as Exprs, because during planning time we don't have the >> values for the Params. I think we can always make these minkey and >> maxkey arrays of Expr, and later those can be processed in whatever way >> we want. So this patch will fetch the constval out of the Expr, and >> runtime pruning will evaluate that expression at runtime. > > I think that makes sense. In fact, we could even move the minkey/maxkey > collection code to match_clauses_to_partkey() itself. No need for a > different function or to worry about defining a separate interface for > it. We match clauses exactly because we want to extract the > constant or param values out of them. No need to do the two activities > independently and in different places. > +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 27 September 2017 at 14:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > - 0001 includes refactoring that Dilip proposed upthread [1] (added him as > an author). I slightly tweaked his patch -- renamed the function > get_matching_clause to match_clauses_to_partkey, similar to > match_clauses_to_index. Hi Amit, I've made a pass over the 0001 patch while trying to get myself up to speed with the various developments that are going on in partitioning right now. These are just my thoughts from reading over the patch. I understand that there's quite a bit up in the air right now about how all this is going to work, but I have some thoughts about that too, which I wouldn't mind some feedback on as my line of thought may be off. Anyway, I will start with some small typos that I noticed while reading: partition.c: + * telling what kind of NullTest has been applies to the corresponding "applies" should be "applied" plancat.c: * might've occurred to satisfy the query. Rest of the fields are set in "Rest of the" should be "The remaining" Now onto more serious stuff: allpaths.c: get_rel_partitions() I wonder if this function does not belong in partition.c. I'd have expected a function to exist per partition type; RANGE and LIST, I'd imagine, should each have their own function in partition.c to eliminate partitions which cannot possibly match, and return the remainder. I see people speaking of HASH partitioning, but we might one day also want something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine routing records to be processed into foreign tables where you never need to query them again). It would be good if we could easily expand this list and not have to touch files all over the optimizer to do that. Of course, there would be other work to do in the executor to implement any new partitioning method too. I know there's mention somewhere of get_rel_partitions() not being so smart about WHERE partkey > 20 AND partkey > 10; the code does not understand which is more restrictive. I think you could probably follow the same rules here that are done in eval_const_expressions(). Over there I see that evaluate_function() will call anything that's not marked as volatile. I'd imagine, for each partition key, you'd want to store a Datum with the minimum and maximum possible value based on the quals processed. If either the minimum or maximum is still set to NULL, then it's unbounded at that end. If you encounter partkey = Const, then minimum and maximum can be set to the value of that Const. From there you can likely ignore any other quals for that partition key, as if the query did contain another qual with partkey = SomeOtherConst, then that would have become a gating qual during the constant folding process. This way if the user had written WHERE partkey >= 1 AND partkey <= 1 the evaluation would end up the same as if they'd written WHERE partkey = 1 as the minimum and maximum would be the same value in both cases, and when those two values are the same then it would mean just one theoretical binary search on a partition range to find the correct partition instead of two. I see in get_partitions_for_keys you've crafted the function to return something to identify which partitions need to be scanned. I think it would be nice to see a special element in the partition array for the NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these would be defined constants somewhere. The signature of that function is not so pretty and that would likely tidy it up a bit. 
The matching partition indexes could be returned as a Bitmapset, yet, I don't really see any code which handles adding the NULL and DEFAULT partition in get_rel_partitions() either, maybe I've just not looked hard enough yet... -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
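The per-key accumulation being described could look something like this (a hypothetical sketch, not code from any of the posted patches):

  /* Running bounds for one partition key; both ends unset means unbounded. */
  typedef struct PartKeyBound
  {
      Datum   min;
      bool    min_set;
      bool    min_incl;
      Datum   max;
      bool    max_set;
      bool    max_incl;
  } PartKeyBound;

  /*
   * partkey = Const pins both ends at once, so WHERE partkey >= 1 AND
   * partkey <= 1 ends up identical to WHERE partkey = 1: one binary
   * search on the partition bounds instead of two.
   */
  static void
  apply_equality_qual(PartKeyBound *b, Datum constval)
  {
      b->min = b->max = constval;
      b->min_set = b->max_set = true;
      b->min_incl = b->max_incl = true;
  }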
On Thu, Sep 28, 2017 at 5:16 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'd imagine, for > each partition key, you'd want to store a Datum with the minimum and > maximum possible value based on the quals processed. If either the > minimum or maximum is still set to NULL, then it's unbounded at that > end. If you encounter partkey = Const, then minimum and maximum can be > set to the value of that Const. From there you can likely ignore any > other quals for that partition key, as if the query did contain > another qual with partkey = SomeOtherConst, then that would have > become a gating qual during the constant folding process. This way if > the user had written WHERE partkey >= 1 AND partkey <= 1 the > evaluation would end up the same as if they'd written WHERE partkey = > 1 as the minimum and maximum would be the same value in both cases, > and when those two values are the same then it would mean just one > theoretical binary search on a partition range to find the correct > partition instead of two. I have not looked at the code submitted here in detail yet but I do think we should try to avoid wasting cycles in the presumably-quite-common case where equality is being tested. The whole idea of thinking of this as minimum/maximum seems like it might be off precisely for that reason. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/09/30 1:28, Robert Haas wrote: > On Thu, Sep 28, 2017 at 5:16 AM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> I'd imagine, for >> each partition key, you'd want to store a Datum with the minimum and >> maximum possible value based on the quals processed. If either the >> minimum or maximum is still set to NULL, then it's unbounded at that >> end. If you encounter partkey = Const, then minimum and maximum can be >> set to the value of that Const. From there you can likely ignore any >> other quals for that partition key, as if the query did contain >> another qual with partkey = SomeOtherConst, then that would have >> become a gating qual during the constant folding process. This way if >> the user had written WHERE partkey >= 1 AND partkey <= 1 the >> evaluation would end up the same as if they'd written WHERE partkey = >> 1 as the minimum and maximum would be the same value in both cases, >> and when those two values are the same then it would mean just one >> theoretical binary search on a partition range to find the correct >> partition instead of two. > > I have not looked at the code submitted here in detail yet but I do > think we should try to avoid wasting cycles in the > presumably-quite-common case where equality is being tested. The > whole idea of thinking of this as minimum/maximum seems like it might > be off precisely for that reason. I agree. Equality checks are going to be common enough to warrant them to be handled specially, instead of implementing equality-pruning on top of min/max framework. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Oct 1, 2017 at 9:13 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I agree. Equality checks are going to be common enough to warrant them to > be handled specially, instead of implementing equality-pruning on top of > min/max framework. What you might do is pass <btree-strategy-number, bounds> and optionally allow a second <btree-strategy-number, bounds>. Then for the common case of equality you can pass BTEqualStrategyNumber and for a range bounded at both ends you can pass BTGreaterStrategyNumber or BTGreaterEqualStrategyNumber for one bound and BTLessStrategyNumber or BTLessEqualStrategyNumber for the other. Not sure if this is exactly the right idea but it's what pops to mind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
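Put into code, each bound passed down would carry its btree strategy explicitly, along these lines (a hypothetical struct, for illustration only):

  /* One bound extracted from the query, tagged with its btree strategy. */
  typedef struct PartScanKey
  {
      StrategyNumber  strategy;   /* BTEqualStrategyNumber, BTLessStrategyNumber, ... */
      int             nkeys;      /* how many leading key columns are bound */
      Datum           values[PARTITION_MAX_KEYS];
  } PartScanKey;

  /*
   * WHERE a = 10            -> one PartScanKey: (BTEqualStrategyNumber, 10)
   * WHERE a > 0 AND a <= 10 -> two PartScanKeys: (BTGreaterStrategyNumber, 0)
   *                            and (BTLessEqualStrategyNumber, 10)
   */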
Hi David. Thanks a lot for your review comments and sorry it took me a while to reply. On 2017/09/28 18:16, David Rowley wrote: > On 27 September 2017 at 14:22, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as >> an author). I slightly tweaked his patch -- renamed the function >> get_matching_clause to match_clauses_to_partkey, similar to >> match_clauses_to_index. > > Hi Amit, > > I've made a pass over the 0001 patch while trying to get myself up to > speed with the various developments that are going on in partitioning > right now. > > These are just my thoughts from reading over the patch. I understand > that there's quite a bit up in the air right now about how all this is > going to work, but I have some thoughts about that too, which I > wouldn't mind some feedback on as my line of thought may be off. > > Anyway, I will start with some small typos that I noticed while reading: > > partition.c: > > + * telling what kind of NullTest has been applies to the corresponding > > "applies" should be "applied" Will fix. > plancat.c: > > * might've occurred to satisfy the query. Rest of the fields are set in > > "Rest of the" should be "The remaining" Will fix. > Now onto more serious stuff: > > allpaths.c: > > get_rel_partitions() > > I wonder if this function does not belong in partition.c. I'd have > expected a function to exist per partition type; RANGE and LIST, I'd > imagine, should each have their own function in partition.c to eliminate > partitions which cannot possibly match, and return the remainder. I see > people speaking of HASH partitioning, but we might one day also want > something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine > routing records to be processed into foreign tables where you never > need to query them again). It would be good if we could easily expand > this list and not have to touch files all over the optimizer to do > that. Of course, there would be other work to do in the executor to > implement any new partitioning method too. I think there will have to be at least some code in the optimizer. That is, the code to match the query to the table's partition keys and to use the constant values therein to look up the table's partitions. More importantly, once the partitions (viz. their offsets in the table's partition descriptor) have been looked up by partition.c, we must be able to then map them to their planner data structures, viz. their AppendRelInfo, and subsequently RelOptInfo. This last part will have to be in the optimizer. In fact, that was the role of get_rel_partitions in the initial versions of the patch, when neither the code for matching keys nor that for pruning using constants was implemented. One may argue that the first part, that is, matching clauses to the partition key and the subsequent lookup of partitions using constants, could both be implemented in partition.c, but it seems better to me to keep at least the part of matching clauses to the partition keys within the planner (just like matching clauses to indexes is done by the planner). Looking up partitions using constants cannot be done outside partition.c anyway, so that's something we have to implement there. > I know there's mention somewhere of get_rel_partitions() not > being so smart about WHERE partkey > 20 AND partkey > 10; the code > does not understand which is more restrictive. 
I think you could > probably follow the same rules here that are done in > eval_const_expressions(). Over there I see that evaluate_function() > will call anything that's not marked as volatile. Hmm, unless I've missed it, I don't see in evaluate_function() anything about determining which clause is more restrictive. AFAIK, such a determination depends on the btree operator class semantics (at least in the most common cases where, say, ">" means greater than in the sense that the btree code uses it), so I was planning to handle it the way the btree code handles it in _bt_preprocess_keys(). In fact, the existing code in predtest.c, which makes decisions in a similar vein, also relies on btree semantics. It's OK to do so, because the partitioning methods that exist today and for which we'd like to have such smarts use btree semantics to partition the data. Also, we won't need to optimize such cases for hash partitioning anyway. > I'd imagine, for > each partition key, you'd want to store a Datum with the minimum and > maximum possible value based on the quals processed. If either the > minimum or maximum is still set to NULL, then it's unbounded at that > end. If you encounter partkey = Const, then minimum and maximum can be > set to the value of that Const. From there you can likely ignore any > other quals for that partition key, as if the query did contain > another qual with partkey = SomeOtherConst, then that would have > become a gating qual during the constant folding process. This way if > the user had written WHERE partkey >= 1 AND partkey <= 1 the > evaluation would end up the same as if they'd written WHERE partkey = > 1 as the minimum and maximum would be the same value in both cases, > and when those two values are the same then it would mean just one > theoretical binary search on a partition range to find the correct > partition instead of two. Given the way the patch recognizes a given qual as contributing to the equal key or minimum key or maximum key, it will not conclude that the above is in fact an equal key, because that would presumably require comparing the clauses against each other to make such a discovery. I'm slightly hesitant to add code to do that in the first version of the patch. That is, for the time being, let WHERE partkey >= 1 AND partkey <= 1 be handled by passing 1 as both the minimum and maximum key, which requires two binary searches, whereas WHERE partkey = 1 would require only one. Planner code to get rid of the extra binary search lookup could come later, IMHO. > I see in get_partitions_for_keys you've crafted the function to return > something to identify which partitions need to be scanned. I think it > would be nice to see a special element in the partition array for the > NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these > would be defined constants somewhere. The signature of that function > is not so pretty and that would likely tidy it up a bit. The matching > partition indexes could be returned as a Bitmapset, yet, I don't > really see any code which handles adding the NULL and DEFAULT > partition in get_rel_partitions() either, maybe I've just not looked > hard enough yet... The new version of the patch, which I will post soon, cleans up the interface of get_partitions_for_keys quite a bit, particularly the way selected partitions are returned, for which I adopted an idea that Robert Haas mentioned [2]. When it recognizes that a sequence of consecutive partitions are to be scanned, it will return the starting and ending offsets as *min_part_idx and *max_part_idx. 
Those that don't fit this pattern (of which there should be only a few in
many cases) are returned in a Bitmapset, as a supposedly unordered set of
partitions. Since NULL and DEFAULT partitions are partitions of this
latter category, they would be included in the bitmapset if it turns out
that the query will need to scan them after all.

Thanks again.

Regards,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf8269
[2] https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com
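To illustrate that return convention, here is a minimal standalone sketch
(plain ints and a hand-rolled binary search standing in for Datums and
partition_bound_bsearch; the names and layout are made up for the
example, not taken from the patch):

#include <stdio.h>

/*
 * Simplified model: range partition i (0-based) holds keys in
 * [lower[i], lower[i+1]); the last partition is unbounded above.
 */
static const int lower[] = {1, 10, 20, 30};
#define NPARTS 4

/* Binary-search for the last partition whose lower bound is <= key. */
static int
part_for_key(int key)
{
	int		lo = 0, hi = NPARTS - 1, ans = 0;

	while (lo <= hi)
	{
		int		mid = (lo + hi) / 2;

		if (lower[mid] <= key)
		{
			ans = mid;
			lo = mid + 1;
		}
		else
			hi = mid - 1;
	}
	return ans;
}

int
main(void)
{
	/* WHERE key >= 12 AND key <= 27: one search per bounding key */
	int		min_part_idx = part_for_key(12);
	int		max_part_idx = part_for_key(27);

	printf("scan partitions %d..%d\n", min_part_idx, max_part_idx);
	return 0;
}

With WHERE key >= 1 AND key <= 1, both searches land on the same
partition; WHERE key = 1 would get there with a single search, which is
the extra lookup mentioned above. NULL and DEFAULT partitions have no
place in such an ordered search, which is why they travel separately in
the Bitmapset.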
On 2017/09/27 10:22, Amit Langote wrote:
> Thanks for the reminder. Just thought I'd say that while I'm actually
> done rebasing itself (attaching rebased patches to try 'em out), I'm now
> considering Robert's comments and will be busy for a bit revising things
> based on those comments.

And here is the updated, significantly re-designed patch set. I'm
dropping the WIP- from the patches' names and marking this patch set as
the v1 set (Jesper Pedersen pointed out to me offline that the patches
didn't have the vNumber before).

Significant points of revision:

* After thinking a bit more about the applicability/re-usability of the
code being discussed here to the run-time pruning case, I came to the
conclusion that my previous approach was a bit wrongheaded (perhaps also
what David Rowley was thinking when he commented on some aspects of the
old patch's design at [1], so thanks to him for prodding me in what I
ended up thinking is a good direction after all). With the previous
approach, a bit too much work would be done by the planner, with no
possibility of the code doing that work being reusable in the executor
(the interface involved passing RelOptInfo *). So, if some optimization
trick that would lead to a better pruning decision depended on the
constant values in all the clauses being available, we'd have to skip
that optimization for clauses that would otherwise be chosen as run-time
pruning clauses, because, by definition, their constant values would not
be available at plan time.

In the new design, the planner code limits itself to matching the clauses
to the partition key (checking things like whether the operator of a
clause matched to a partition column is compatible with partitioning,
etc.) and adding every matched clause to a list, partclauses. That works
unchanged for both the plan-time pruning case and the run-time pruning
case; we don't look at the supposedly constant operands of the clauses in
that planner code at all. Now, if the clauses in partclauses are all
known to contain their constant operand values (the plan-time pruning
case), the planner can immediately pass them to partition.c to analyze
those clauses, extract bounding keys in a form suitable for lookup in
PartitionBoundInfo, and prune the partitions that cannot satisfy those
bounding keys (and hence the clauses). If partclauses contains clauses
that don't have their constant operands (the run-time pruning case), we
don't go to partition.c just yet; instead, we stuff the list into the
plan (Append) node and go to partition.c only once all the constant
values are available (the patch at [2] will implement that).

* Unlike the previous approach, where partition.c would return a list of
integer indexes that were the offsets of the bound datums rather than of
the partitions themselves, in the new design the indexes of the selected
partitions (their offsets in the PartitionDesc.oids array) are returned
in the way that Robert suggested [3] -- as a range of contiguous
partitions whenever possible (in the form of *min_part_idx and
*max_part_idx) and/or as a set of partitions appearing in no particular
order (in the form of a Bitmapset). The second form is required not only
for default/null-only/list partitions (which have no notion of ordering
among each other), but also when the individual arms of an OR clause
select partitions scattered all over the place.
* In add_paths_to_append_rel(), the partitioned_rels list that is copied
into the Append path is no longer the same as the one found in
PlannerInfo.pcinfo_list, because the latter contains *all* partitioned
child relations in a partition tree. Instead, the patch teaches it to
include only the *live* partitioned child relations.

* Since the partitionwise join code looks at *all* entries in
RelOptInfo.part_rels of both the joining partitioned relations, there is
some new code there to make dead partitions' RelOptInfos look valid with
a dummy path *after the fact*. That's because set_append_rel_size(),
whose job it is to initialize certain fields of child RelOptInfos, will
do so only for *live* partitions under the new arrangement, where only
the live partitions of a partitioned table are processed by its main
loop.

Some notes about the regression tests:

Patch 0001 adds new tests for partition pruning. Constraint exclusion
using internal partition constraints is not disabled in the code until
the last patch, which implements the last piece needed for the new
partition pruning to do any real work. With that patch, we see some
differences in the plans generated using the new partition-pruning code,
which appear in the patch as diffs to expected/inherit.out and
expected/partition.out (the latter is the new output file added by 0001).
I've almost convinced myself that those diffs are simply artifacts of the
difference in implementation between constraint exclusion and the new
partition-pruning code and do not change the output that the plans
produce. The difference stems from the fact that, in some cases, either
the old or the new method fails to prune away a partition that should
have been pruned. OTOH, in neither case do we prune away a partition
that shouldn't have been. :)

Description of the attached patches:

0001: add new tests for partition-pruning
0002: patch that makes all the changes needed in the planner (adds a stub
      function in partition.c)
0003: patch that implements the aforementioned stub (significant amount of
      code to analyze partition clauses and gin up bounding keys to
      compare with the values in PartitionBoundInfo; the actual function
      that will do the comparison is just a stub as of this patch)
0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
      able to pass incomplete partition keys (aka, a prefix of a multi-
      column key) for comparison with the values in PartitionBoundInfo
      (just a refactoring patch)
0005: implements the stub mentioned in 0003 and finally gets the new
      partition-pruning working (also disables constraint exclusion using
      internal partition constraints by teaching get_relation_constraints
      to not include those)

Feedback greatly welcome.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f8Jzix8cs7tCDS_qNPd0CetHjB8x9fmLG4OTbCfthgo1w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com
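To make the plan-time/run-time split described in the first point above
concrete, here is a standalone sketch of the deferral decision (a toy
clause struct with hypothetical names; the real code of course works on
PostgreSQL's Expr nodes):

#include <stdio.h>
#include <stdbool.h>

/* Toy model of a matched clause: "partkey <op> operand" */
typedef struct Clause
{
	char		op;					/* '<', '>', or '=' */
	bool		operand_is_const;	/* false for, e.g., a Param */
	int			const_value;		/* valid only if operand_is_const */
} Clause;

/* Plan-time step: can we prune now, or must the Append node do it later? */
static bool
all_operands_const(const Clause *clauses, int n)
{
	for (int i = 0; i < n; i++)
		if (!clauses[i].operand_is_const)
			return false;
	return true;
}

int
main(void)
{
	Clause		partclauses[] = {
		{'>', true, 10},		/* WHERE partkey > 10 ... */
		{'<', false, 0},		/* ... AND partkey < $1 (a Param) */
	};

	if (all_operands_const(partclauses, 2))
		printf("prune at plan time\n");
	else
		printf("stuff partclauses into the Append node; prune at run time\n");
	return 0;
}

Matching itself (checking that the clause's operator is compatible with
the partitioning scheme, etc.) stays the same in both cases; only the
point at which the constant values are consumed differs.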
On Thu, Oct 19, 2017 at 12:16 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Description of the attached patches:
0001: add new tests for partition-pruning
0002: patch that makes all the changes needed in the planner (adds a stub
function in partition.c)
0003: patch that implements the aforementioned stub (significant amount of
code to analyze partition clauses and gin up bounding keys to
compare with the values in PartitionBoundInfo; the actual function
that will do the comparison is just a stub as of this patch)
0004: make some preparatory changes to partition_bound_cmp/bsearch, to be
able to pass incomplete partition keys (aka, prefix of a multi-
column key) for comparison with the values in PartitionBoundInfo
(just a refactoring patch)
0005: implements the stub mentioned in 0003 and finally gets the new
partition-pruning working (also disables constraint exclusion using
internal partition constraints by teaching get_relation_constraints
to not include those).
Feedback greatly welcome.
Hi Amit,
I have tried to apply the attached patch. The patch applied cleanly on commit id bf54c0f05c0a58db17627724a83e1b6d4ec2712c,
but make failed with the below error.
./../../../src/include/nodes/relation.h:2126: error: redefinition of typedef ‘AppendRelInfo’
../../../../src/include/nodes/relation.h:584: note: previous declaration of ‘AppendRelInfo’ was here
make[4]: *** [gistbuild.o] Error 1
Thanks Rajkumar.

On 2017/10/23 15:58, Rajkumar Raghuwanshi wrote:
> I have tried to apply attached patch. patch applied cleanly on commit id -
> bf54c0f05c0a58db17627724a83e1b6d4ec2712c
> but make failed with below error.
>
> ./../../../src/include/nodes/relation.h:2126: error: redefinition of
> typedef ‘AppendRelInfo’
> ../../../../src/include/nodes/relation.h:584: note: previous declaration of
> ‘AppendRelInfo’ was here
> make[4]: *** [gistbuild.o] Error 1

The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that
about this typedef redefinition introduced by the 0002 patch, but it
turns out the second typedef isn't needed anyway, so I got rid of that
line in the attached updated patch.

Fixed one more useless diff in 0002, but no changes in any other patch.

Thanks,
Amit
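For reference, the construct the older compiler trips over is just a
repeated typedef. C11 allows redeclaring a typedef name with the same
type, which is why a newer gcc (defaulting to a C11 dialect) stays quiet,
while pre-C11 modes reject it. A minimal reproducer (hypothetical file
name; try "gcc -std=gnu89 -c typedef-redef.c" versus "-std=c11"):

/* typedef-redef.c */
struct AppendRelInfo;
typedef struct AppendRelInfo AppendRelInfo;		/* first declaration */
typedef struct AppendRelInfo AppendRelInfo;		/* error before C11 */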
On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
this typedef redefinition introduced by the 0002 patch, but it seems that
it's not needed anyway, so got rid of that line in the attached updated patch.
Fixed one more useless diff in 0002, but no changes in any other patch
Thanks for updated patches, I am able to compile it on head.
While testing this, I got an observation: pruning is not scanning the default partition, leading to wrong output. Below is a test to reproduce this.
create table rp (a int, b varchar) partition by range (a);
create table rp_p1 partition of rp default;
create table rp_p2 partition of rp for values from (1) to (10);
create table rp_p3 partition of rp for values from (10) to (maxvalue);
insert into rp values (-1,'p1');
insert into rp values (1,'p2');
insert into rp values (11,'p3');
postgres=# select tableoid::regclass,* from rp;
tableoid | a | b
----------+----+----
rp_p2 | 1 | p2
rp_p3 | 11 | p3
rp_p1 | -1 | p1
(3 rows)
--with pruning
postgres=# explain (costs off) select * from rp where a <= 1;
QUERY PLAN
--------------------------
Append
-> Seq Scan on rp_p2
Filter: (a <= 1)
(3 rows)
postgres=# select * from rp where a <= 1;
a | b
---+----
1 | p2
(1 row)
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:
>
> On Mon, Oct 23, 2017 at 1:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> The compiler I have here (gcc (GCC) 6.2.0) didn't complain like that for
>> this typedef redefinition introduced by the 0002 patch, but it seems that
>> it's not needed anyway, so got rid of that line in the attached updated
>> patch.
>>
>> Fixed one more useless diff in 0002, but no changes in any other patch
>
> Thanks for updated patches, I am able to compile it on head.
>
> While testing this, I got an observation, pruning is not scanning default
> partition leading to wrong output. below is test to reproduce this.
>
> create table rp (a int, b varchar) partition by range (a);
> create table rp_p1 partition of rp default;
> create table rp_p2 partition of rp for values from (1) to (10);
> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>
> insert into rp values (-1,'p1');
> insert into rp values (1,'p2');
> insert into rp values (11,'p3');
>
> postgres=# select tableoid::regclass,* from rp;
>  tableoid | a  | b
> ----------+----+----
>  rp_p2    |  1 | p2
>  rp_p3    | 11 | p3
>  rp_p1    | -1 | p1
> (3 rows)
>
> --with pruning
> postgres=# explain (costs off) select * from rp where a <= 1;
>        QUERY PLAN
> --------------------------
>  Append
>    ->  Seq Scan on rp_p2
>          Filter: (a <= 1)
> (3 rows)
>
> postgres=# select * from rp where a <= 1;
>  a | b
> ---+----
>  1 | p2
> (1 row)

I had noticed this and also that this crash:

tprt PARTITION BY RANGE(Col1)
    tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
        tprt_11 FOR VALUES FROM (1) TO (10000),
        tprt_1d DEFAULT
    tprt_2 FOR VALUES FROM (50001) TO (100001)

EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks a lot Rajkumar and Beena for the tests.

On 2017/10/24 1:38, Beena Emerson wrote:
> On Mon, Oct 23, 2017 at 3:24 PM, Rajkumar Raghuwanshi wrote:
>> Thanks for updated patches, I am able to compile it on head.
>>
>> While testing this, I got an observation, pruning is not scanning default
>> partition leading to wrong output. below is test to reproduce this.
>>
>> create table rp (a int, b varchar) partition by range (a);
>> create table rp_p1 partition of rp default;
>> create table rp_p2 partition of rp for values from (1) to (10);
>> create table rp_p3 partition of rp for values from (10) to (maxvalue);
>>
>> insert into rp values (-1,'p1');
>> insert into rp values (1,'p2');
>> insert into rp values (11,'p3');
>>
>> postgres=# select tableoid::regclass,* from rp;
>>  tableoid | a  | b
>> ----------+----+----
>>  rp_p2    |  1 | p2
>>  rp_p3    | 11 | p3
>>  rp_p1    | -1 | p1
>> (3 rows)
>>
>> --with pruning
>> postgres=# explain (costs off) select * from rp where a <= 1;
>>        QUERY PLAN
>> --------------------------
>>  Append
>>    ->  Seq Scan on rp_p2
>>          Filter: (a <= 1)
>> (3 rows)
>>
>> postgres=# select * from rp where a <= 1;
>>  a | b
>> ---+----
>>  1 | p2
>> (1 row)

Both this (wrong output)...

> I had noticed this and also that this crash:
>
> tprt PARTITION BY RANGE(Col1)
>     tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>         tprt_11 FOR VALUES FROM (1) TO (10000),
>         tprt_1d DEFAULT
>     tprt_2 FOR VALUES FROM (50001) TO (100001)
>
> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

...and this (crash) were due to bugs in the 0005 patch. Output with the
updated patch for Rajkumar's test:

explain (costs off) select * from rp where a <= 1;
        QUERY PLAN
--------------------------
 Append
   ->  Seq Scan on rp_p2
         Filter: (a <= 1)
   ->  Seq Scan on rp_p1
         Filter: (a <= 1)
(5 rows)

select tableoid::regclass, * from rp where a <= 1;
 tableoid | a  | b
----------+----+----
 rp_p2    |  1 | p2
 rp_p1    | -1 | p1
(2 rows)

-- moreover
select tableoid::regclass, * from rp where a < 1;
 tableoid | a  | b
----------+----+----
 rp_d     | -1 | p1
(1 row)

Should be fixed in the attached updated version. While fixing the bugs,
I made some significant revisions to the code introduced by 0005. No
significant changes to any of the patches 0001-0004.

Thanks,
Amit
On 2017/10/25 15:47, Amit Langote wrote:
> On 2017/10/24 1:38, Beena Emerson wrote:
>> I had noticed this and also that this crash:
>>
>> tprt PARTITION BY RANGE(Col1)
>>     tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>         tprt_11 FOR VALUES FROM (1) TO (10000),
>>         tprt_1d DEFAULT
>>     tprt_2 FOR VALUES FROM (50001) TO (100001)
>>
>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>> server closed the connection unexpectedly
>>         This probably means the server terminated abnormally
>>         before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>
> ...and this (crash) were due to bugs in the 0005 patch.

[ .... ]

> Should be fixed in the attached updated version.

Oops, not quite. The crash that Beena reported wasn't fixed (or rather
was reintroduced by some unrelated change after I had once confirmed it
was fixed).

Really fixed this time.

Thanks,
Amit
Hello Amit,

Thanks for the updated patches.

On Wed, Oct 25, 2017 at 1:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/25 15:47, Amit Langote wrote:
>> On 2017/10/24 1:38, Beena Emerson wrote:
>>> I had noticed this and also that this crash:
>>>
>>> tprt PARTITION BY RANGE(Col1)
>>>     tprt_1 FOR VALUES FROM (1) TO (50001) PARTITION BY RANGE(Col1)
>>>         tprt_11 FOR VALUES FROM (1) TO (10000),
>>>         tprt_1d DEFAULT
>>>     tprt_2 FOR VALUES FROM (50001) TO (100001)
>>>
>>> EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 BETWEEN 20000 AND 70000;
>>> server closed the connection unexpectedly
>>>         This probably means the server terminated abnormally
>>>         before or while processing the request.
>>> The connection to the server was lost. Attempting reset: Failed.
>>> !>
>>
>> ...and this (crash) were due to bugs in the 0005 patch.
>
> [ .... ]
>
>> Should be fixed in the attached updated version.
>
> Oops, not quite. The crash that Beena reported wasn't fixed (or rather
> reintroduced by some unrelated change after once confirming it was fixed).
>
> Really fixed this time.

The crashes are fixed. However, the handling of the DEFAULT partition in
various queries is not proper.

Case 1: In this case the default partition should be selected.

DROP TABLE tprt;
CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
PARTITION BY list(col1);
CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES IN (20000, 25000);
CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES IN (50000, 35000);
CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES IN (10000);
CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;

postgres=# EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)

Case 2: In this case the DEFAULT partition need not be selected (but it
is).

DROP TABLE tprt;
CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
PARTITION BY range(col1);
CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES FROM (1) TO (10000);
CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES FROM (10000) TO (20000);
CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES FROM (20000) TO (30000);
CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
INSERT INTO tprt SELECT generate_series(1,50000), generate_series(1,50000);

postgres=# EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
           QUERY PLAN
--------------------------------
 Append
   ->  Seq Scan on tprt_11
         Filter: (col1 < 10000)
   ->  Seq Scan on tprt_1d
         Filter: (col1 < 10000)
(5 rows)

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello,

On Wed, Oct 25, 2017 at 1:07 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Oops, not quite. The crash that Beena reported wasn't fixed (or rather
> reintroduced by some unrelated change after once confirming it was fixed).
>
> Really fixed this time.

Some minor comments:

1. wrong function name (0003)

The comment on function get_partitions_from_clauses_guts uses the wrong
name: instead of "_from_", "_using_" is written.

/*
+ * get_partitions_using_clauses_guts
+ *      Determine relation's partitions that satisfy *all* of the clauses
+ *      in the list (return value describes the set of such partitions)
+ *

2. typo "informatin" (0003)

+/*
+ * classify_partition_bounding_keys
+ *      Classify partition clauses into equal, min, max keys, along with any
+ *      Nullness constraints and return that informatin in the output argument

3. misspelled "admissible" (0003)

+ * columns, whereas a prefix of all partition key columns is addmissible
+ * as min and max keys.

4. double "and" (0002)

+ * as part of the operator family, check if its negator
+ * exists and and that the latter is compatible with

5. typo "inequality" (0002)

+ * (key < val OR key > val), if the partitioning method
+ * supports such notion of inequlity.

6. typo "output" (0005)

+ * return it as the only scannable partition, that means the query
+ * doesn't want null values in its outout.

7. typo "provide" (0005)

+ /* Valid keys->eqkeys must provoide all partition keys. */
+ Assert(keys->n_eqkeys == 0 || keys->n_eqkeys == partkey->partnatts);

8. comment of struct PartClause (0003)

+/*
+ * Information about a clause matched with a partition key column kept to
+ * avoid repeated recomputation in remove_redundant_clauses().
+ */

Instead of "repeated recomputation", we can use just "repeated
computation" or just "recomputation".

--
Beena Emerson
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Beena,

Thanks for the tests.

On 2017/10/25 18:18, Beena Emerson wrote:
> The crashes are fixed. However, handling of DEFAULT partition in
> various queries is not proper.
>
> Case 1: In this case default should be selected.
>
> DROP TABLE tprt;
> CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
> CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY list(col1);
> CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES IN (20000, 25000);
> CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES IN (50000, 35000);
> CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES IN (10000);
> CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
>
> postgres=# EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>         QUERY PLAN
> --------------------------
>  Result
>    One-Time Filter: false
> (2 rows)

Hmm, this clearly looks wrong. Fixed in the attached.

> Case 2: In this case DEFAULT need not be selected.
>
> DROP TABLE tprt;
> CREATE TABLE tprt (col1 int, col2 int) PARTITION BY range(col1);
> CREATE TABLE tprt_1 PARTITION OF tprt FOR VALUES FROM (1) TO (50001)
> PARTITION BY range(col1);
> CREATE TABLE tprt_11 PARTITION OF tprt_1 FOR VALUES FROM (1) TO (10000);
> CREATE TABLE tprt_12 PARTITION OF tprt_1 FOR VALUES FROM (10000) TO (20000);
> CREATE TABLE tprt_13 PARTITION OF tprt_1 FOR VALUES FROM (20000) TO (30000);
> CREATE TABLE tprt_1d PARTITION OF tprt_1 DEFAULT;
> INSERT INTO tprt SELECT generate_series(1,50000), generate_series(1,50000);
>
> postgres=# EXPLAIN (COSTS OFF) SELECT * FROM tprt WHERE col1 < 10000;
>            QUERY PLAN
> --------------------------------
>  Append
>    ->  Seq Scan on tprt_11
>          Filter: (col1 < 10000)
>    ->  Seq Scan on tprt_1d
>          Filter: (col1 < 10000)
> (5 rows)

Yeah, ideally. But it's kind of hard for the new partition-pruning
algorithm to be *that* correct in this particular case involving default
partitions. Let me try to explain why I think it may be a bit hard to
implement.

I have perhaps mentioned before that the new partition-pruning algorithm
runs for every partitioned table in the tree separately. In this
example, it will first determine for the root table tprt that only the
partition tprt_1 needs to be scanned. Since tprt_1 is itself
partitioned, the algorithm will run again, but the fact that tprt_1
(iow, any of its partitions) is itself constrained to the range
[1, 50001) is, for the most part, lost on the algorithm.

Note that the non-default partitions (tprt_11, tprt_12, ...) have bound
datums in PartitionBoundInfo describing the range of data they contain,
which the algorithm uses to determine the set of partitions satisfying a
given set of clauses. The default partition has no datums; the only
thing describing what it contains is its partition constraint. From the
clause col1 < 10000, the algorithm will conclude that the default
partition might contain some data satisfying the same, because it knows
for sure that there is no non-default partition for keys < 1.

It can perhaps be taught not to make that conclusion by taking into
account the default partition's partition constraint, which includes the
constraint inherited from the parent, viz. 1 <= col1 < 50001. To do
that, it might be possible to summon up predtest.c's powers to conclude
from the default partition's partition constraint that it cannot contain
any keys < 1, but then we'd have to frame up a clause expression
describing the latter. Generating such a clause expression can be a bit
daunting for a multi-column key, so I haven't yet tried really hard to
implement this. Any thoughts on that?
Meanwhile, attached is an updated set of patches including fixes for the
typos you reported in the other message. The updated 0005 fixes the
first bug (Case 1 in your email), while the other patches 0002-0004 are
updated mostly to fix the reported typos. A couple of tests are added in
0001 to test the default partition case a bit more.

Thanks,
Amit
On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It can perhaps be taught not to make that conclusion by taking into account
> the default partition's partition constraint, which includes the constraint
> inherited from the parent, viz. 1 <= col1 < 50001. To do that, it might
> be possible to summon up predtest.c's powers to conclude from the default
> partition's partition constraint that it cannot contain any keys < 1, but
> then we'd have to frame up a clause expression describing the latter.
> Generating such a clause expression can be a bit daunting for a
> multi-column key, so I haven't yet tried really hard to implement this.
> Any thoughts on that?

I don't think we really want to get into theorem-proving here, because
it's slow. Whatever we're going to do, we should be able to do without
that -- keeping it in the form of btree strategy + value. It doesn't
seem that hard. Suppose we're asked to select partitions from tprt
subject to (<, 10000). Well, we determine that some of the tprt_1
partitions may be relevant, so we tell tprt_1 to select partitions
subject to (>=, 1, <, 10000). We know to do that because we know that
10000 < 50001, and we know to include >= 1 because we haven't got any
lower bound currently at all. What's the problem?

In some sense it's tempting to say that this case just doesn't matter
very much; after all, subpartitioning on the same column used to
partition at the top level is arguably lame. But if we can get it right
in a relatively straightforward manner, then let's do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/10/26 20:34, Robert Haas wrote:
> On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It can perhaps be taught not to make that conclusion by taking into account
>> the default partition's partition constraint, which includes the constraint
>> inherited from the parent, viz. 1 <= col1 < 50001. [...] Generating such a
>> clause expression can be a bit daunting for a multi-column key, so I
>> haven't yet tried really hard to implement this. Any thoughts on that?
>
> I don't think we really want to get into theorem-proving here, because
> it's slow.

Just to be clear, I'm saying we could use theorem-proving (if at all)
just for the default partition.

> Whatever we're going to do, we should be able to do without
> that -- keeping it in the form of btree strategy + value. It doesn't
> seem that hard. Suppose we're asked to select partitions from tprt
> subject to (<, 10000). Well, we determine that some of the tprt_1
> partitions may be relevant, so we tell tprt_1 to select partitions
> subject to (>=, 1, <, 10000). We know to do that because we know that
> 10000 < 50001, and we know to include >= 1 because we haven't got any
> lower bound currently at all. What's the problem?

Hmm, that's interesting. With the approach that the patch currently
takes, (>= 1) wouldn't be passed down when selecting the partitions of
tprt_1. As the patch currently implements it, the source of values (+
btree strategies) used to select partitions is the same original set of
clauses for all partitioned tables in the tree; nothing gets added to
that set (such as >= 1 in this example), nor subtracted (such as clauses
that are trivially true). I will think about this approach in general,
and about solving this problem in particular.

> In some sense it's tempting to say that this case just doesn't matter
> very much; after all, subpartitioning on the same column used to
> partition at the top level is arguably lame. But if we can get it right
> in a relatively straightforward manner, then let's do it.

Yeah, I tend to agree. Thanks for the input.

Regards,
Amit
On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I don't think we really want to get into theorem-proving here, because
>> it's slow.
>
> Just to be clear, I'm saying we could use theorem-proving (if at all) just
> for the default partition.

I don't really see why it should be needed there either. We've got all
the bounds in order, so we should know where there are any gaps that are
covered by the default partition in the range we care about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/10/27 13:57, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I don't think we really want to get into theorem-proving here, because
>>> it's slow.
>>
>> Just to be clear, I'm saying we could use theorem-proving (if at all) just
>> for the default partition.
>
> I don't really see why it should be needed there either. We've got
> all the bounds in order, so we should know where there are any gaps
> that are covered by the default partition in the range we care about.

Sorry, I forgot to add: "...just for the default partition, for cases
like the one in Beena's example." In normal cases, default partition
selection doesn't require any theorem-proving; it proceeds in a
straightforward manner, more or less like what you said it should.

After thinking more on it, I think there is a rather straightforward
trick to implement the idea you mentioned and get this working for the
case presented in Beena's example. It works as follows: for any
non-root partitioned table, we add the list of its partition constraint
clauses to the query-provided list of clauses and use the whole list to
drive the partition-pruning algorithm. So, when partition pruning runs
for tprt_1, along with (< 10000), which the original query provides, we
also have (>= 1), which comes from the partition constraint of tprt_1
(which is >= 1 and < 50001). Note that there is a trick in the new code
for the (< 50001) coming from the constraint to be overridden by the
more restrictive (< 10000) coming from the original query.

Thanks,
Amit
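As a minimal standalone sketch of that override logic (plain ints and
one-character strategies standing in for Datums and btree strategy
numbers; all names here are made up for the illustration):

#include <stdio.h>
#include <limits.h>

/*
 * Keep the most restrictive lower and upper bounds from a mix of query
 * clauses and partition-constraint clauses on one key column.
 */
typedef struct Bounds
{
	int		lower;		/* key >= lower */
	int		upper;		/* key < upper */
} Bounds;

static void
apply_clause(Bounds *b, char strategy, int value)
{
	if (strategy == '>')			/* key >= value */
	{
		if (value > b->lower)
			b->lower = value;		/* more restrictive lower bound wins */
	}
	else if (strategy == '<')		/* key < value */
	{
		if (value < b->upper)
			b->upper = value;		/* more restrictive upper bound wins */
	}
}

int
main(void)
{
	Bounds		b = {INT_MIN, INT_MAX};

	apply_clause(&b, '<', 10000);	/* from the query: col1 < 10000 */
	apply_clause(&b, '>', 1);		/* from tprt_1's constraint: col1 >= 1 */
	apply_clause(&b, '<', 50001);	/* from tprt_1's constraint: col1 < 50001 */

	printf("prune tprt_1's partitions with %d <= col1 < %d\n",
		   b.lower, b.upper);		/* prints: 1 <= col1 < 10000 */
	return 0;
}

So tprt_1's own subpartitions end up being selected against
1 <= col1 < 10000, which excludes its default partition in Beena's
Case 2 while still selecting it in Case 1.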
On Thu, Oct 26, 2017 at 4:47 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Meanwhile, attached updated set of patches including fixes for the typos
you reported in the other message. Updated 0005 fixes the first bug (the
Case 1 in your email), while other patches 0002-0004 are updated mostly to
fix the reported typos. A couple of tests are added in 0001 to test the
default partition case a bit more.
Hi Amit,
While testing this feature further, I got a bug with partitions as foreign tables. Test case given below; take a look.
CREATE EXTENSION postgres_fdw;
CREATE SERVER fp_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname 'postgres', port '5432', use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER fp_server;
CREATE TABLE fplt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN ('0000', '0001', '0002', '0003') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p1');
CREATE TABLE fplt1_p2 (a int, b int, c text);
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN ('0004', '0005', '0006', '0007') SERVER fp_server OPTIONS (TABLE_NAME 'fplt1_p2');
INSERT INTO fplt1_p1 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(0, 198, 2) i;
INSERT INTO fplt1_p2 SELECT i, i, to_char(i/50, 'FM0000') FROM generate_series(200, 398, 2) i;
--PG-HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
QUERY PLAN
--------------------------------------------------
Nested Loop
-> Append
-> Foreign Scan on ftplt1_p1 t1
-> Foreign Scan on ftplt1_p2 t1_1
-> Unique
-> Append
-> Foreign Scan on ftplt1_p1 t2
-> Foreign Scan on ftplt1_p2 t2_1
(8 rows)
--PG-HEAD +v5 patches
postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
ERROR: invalid expression for partition key
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Thanks Rajkumar for the test case.

On 2017/10/27 17:05, Rajkumar Raghuwanshi wrote:
> while testing further this feature, I got a bug with partitions as foreign
> tables. Test case given below. Take a look.

[ ... ]

> --PG-HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
>                     QUERY PLAN
> --------------------------------------------------
>  Nested Loop
>    ->  Append
>          ->  Foreign Scan on ftplt1_p1 t1
>          ->  Foreign Scan on ftplt1_p2 t1_1
>    ->  Unique
>          ->  Append
>                ->  Foreign Scan on ftplt1_p1 t2
>                ->  Foreign Scan on ftplt1_p2 t2_1
> (8 rows)
>
> --PG-HEAD +v5 patches
> postgres=# EXPLAIN (COSTS OFF) SELECT t1.c FROM fplt1 t1, LATERAL (SELECT
> DISTINCT t2.c FROM fplt1 t2 WHERE t2.c = t1.c ) q;
>
> *ERROR: invalid expression for partition key*

I looked at this, and it seems the error occurs not because the
partitions are foreign tables, but because the new code was wrong to
assume that Param nodes can never appear in the clauses coming from
baserestrictinfo. When trying to do plan-time pruning for the
partitioned table appearing inside the lateral subquery, there are
Params in the clauses in baserestrictinfo that the new pruning code was
unprepared to handle. Fixed the code to instead give up on plan-time
pruning in such a case.

Attached updated set of patches. In addition to fixing the above bug, it
also fixes one of the cases reported by Beena regarding default
partition pruning that yesterday I had given up on as being too
difficult to implement [1], but today found out is not that difficult to
do [2].

Change summary:

0001: added some new tests
0002: no change
0003: fixed the issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
      issue reported by Beena)

Thanks,
Amit

[1] https://www.postgresql.org/message-id/0d6096e8-7c7b-afed-71d3-dca151306626%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp
On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
0001: added some new tests
0002: no change
0003: fixed issue that Rajkumar reported (cope with Params properly)
0004: no change
0005: fix the case to prune the default partition when warranted (the
issue reported by Beena)
Thanks for the updated patch. I am getting a server crash with the below query.
CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY RANGE(c2);
CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY RANGE(c2);
CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);
INSERT INTO mp VALUES(10, 100, 10);
INSERT INTO mp VALUES(20, 200, 20);
INSERT INTO mp VALUES(21, 150, 30);
INSERT INTO mp VALUES(30, 200, 40);
INSERT INTO mp VALUES(31, 300, 30);
INSERT INTO mp VALUES(40, 400, 40);
EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND c2 < 300;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Thanks Rajkumar.

On 2017/10/27 19:29, Rajkumar Raghuwanshi wrote:
> On Fri, Oct 27, 2017 at 2:41 PM, Amit Langote wrote:
>> 0001: added some new tests
>> 0002: no change
>> 0003: fixed issue that Rajkumar reported (cope with Params properly)
>> 0004: no change
>> 0005: fix the case to prune the default partition when warranted (the
>>       issue reported by Beena)
>
> Thanks for the updated patch, i am getting server crash with below query.
>
> CREATE TABLE mp (c1 int, c2 int, c3 int) PARTITION BY LIST(c3);
> CREATE TABLE mp_p1 PARTITION OF mp FOR VALUES IN (10, 20) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p1_1 PARTITION OF mp_p1 FOR VALUES FROM (0) TO (200);
> CREATE TABLE mp_p1_2 PARTITION OF mp_p1 FOR VALUES FROM (200) TO (400);
> CREATE TABLE mp_p2 PARTITION OF mp FOR VALUES IN (30, 40) PARTITION BY
> RANGE(c2);
> CREATE TABLE mp_p2_1 PARTITION OF mp_p2 FOR VALUES FROM (0) TO (300);
> CREATE TABLE mp_p2_2 PARTITION OF mp_p2 FOR VALUES FROM (300) TO (600);
>
> INSERT INTO mp VALUES(10, 100, 10);
> INSERT INTO mp VALUES(20, 200, 20);
> INSERT INTO mp VALUES(21, 150, 30);
> INSERT INTO mp VALUES(30, 200, 40);
> INSERT INTO mp VALUES(31, 300, 30);
> INSERT INTO mp VALUES(40, 400, 40);
>
> EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM mp WHERE c3 = 40 AND
> c2 < 300;
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

Looks like an update I'd included in the last posted patch (viz. adding
the non-root partitioned tables' partition constraint clauses to the
list of clauses used for pruning) exposed a bug in how ScalarArrayOpExpr
clauses are handled by the new pruning code. The internal partition
constraint of a list partition that is itself partitioned contains an
ArrayExpr as the ScalarArrayOpExpr's right-hand operand, whereas the
pruning code assumed there could only ever be a Const holding an
ArrayType value there.

Fixed in the attached updated patch, along with a new test in 0001 to
cover this case. Also, made a few tweaks to 0003 and 0005 (moved some
code from the former to the latter) around the handling of
ScalarArrayOpExprs.

Thanks,
Amit
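For intuition, a standalone toy sketch of the two right-hand-operand
shapes involved (a simplified tagged struct, not PostgreSQL's actual
node types; deconstruct_array() is the real routine the Const case
corresponds to):

#include <stdio.h>

/* Toy model of a ScalarArrayOpExpr's right-hand operand. */
typedef enum RhsKind
{
	RHS_CONST_ARRAY,		/* a Const holding a whole array value */
	RHS_ARRAY_EXPR			/* an ArrayExpr listing element expressions */
} RhsKind;

typedef struct Rhs
{
	RhsKind		kind;
	const int  *elems;		/* element values, for this toy model */
	int			nelems;
} Rhs;

/* Pruning must cope with either representation of the element list. */
static void
collect_elements(const Rhs *rhs)
{
	if (rhs->kind == RHS_CONST_ARRAY)
	{
		/* analogous to calling deconstruct_array() on the Const's value */
		for (int i = 0; i < rhs->nelems; i++)
			printf("  from Const array: %d\n", rhs->elems[i]);
	}
	else
	{
		/* walk the ArrayExpr's element list instead */
		for (int i = 0; i < rhs->nelems; i++)
			printf("  from ArrayExpr: %d\n", rhs->elems[i]);
	}
}

int
main(void)
{
	int		vals[] = {30, 40};
	Rhs		from_query = {RHS_CONST_ARRAY, vals, 2};		/* c3 = ANY ('{30,40}') */
	Rhs		from_constraint = {RHS_ARRAY_EXPR, vals, 2};	/* c3 = ANY (ARRAY[30, 40]) */

	collect_elements(&from_query);
	collect_elements(&from_constraint);
	return 0;
}

The pre-fix code handled only the first shape, which is why the crash
appeared once the internally generated partition constraints, which use
the second shape, joined the clause list.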
On 2017/10/30 14:55, Amit Langote wrote:
> Fixed in the attached updated patch, along with a new test in 0001 to
> cover this case. Also, made a few tweaks to 0003 and 0005 (moved some
> code from the former to the latter) around the handling of
> ScalarArrayOpExprs.

Sorry, I'd forgotten to include some changes. In the previous versions,
the RT index of the table needed to be passed to partition.c, which I
realized is no longer needed, so I removed that requirement from the
interface. As a result, patches 0002 and 0003 have changed in this
version.

Thanks,
Amit
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
In the previous versions, RT index of the table needed to be passed to
partition.c, which I realized is no longer needed, so I removed that
requirement from the interface. As a result, patches 0002 and 0003 have
changed in this version.
Thanks for the fix.
I am getting wrong output when the default partition is sub-partitioned further; below is a test case.
CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
INSERT INTO lpd VALUES (null,null,null);
--on HEAD
postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
QUERY PLAN
---------------------------------------------
Sort
Sort Key: ((lpd_p1.tableoid)::regclass)
-> Result
-> Append
-> Seq Scan on lpd_p1
Filter: (a IS NOT NULL)
-> Seq Scan on lpd_p2
Filter: (a IS NOT NULL)
-> Seq Scan on lpd_d3
Filter: (a IS NOT NULL)
-> Seq Scan on lpd_d1
Filter: (a IS NOT NULL)
-> Seq Scan on lpd_d2
Filter: (a IS NOT NULL)
(14 rows)
postgres=#
postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
tableoid | a | b | c
----------+----+----+----
lpd_p1 | 1 | 1 | 1
lpd_p1 | 2 | 2 | 2
lpd_p1 | 3 | 3 | 3
lpd_p2 | 4 | 4 | 4
lpd_p2 | 5 | 5 | 5
lpd_d1 | 7 | 7 | 7
lpd_d1 | 8 | 8 | 8
lpd_d1 | 9 | 9 | 9
lpd_d2 | 12 | 12 | 12
lpd_d2 | 10 | 10 | 10
lpd_d2 | 11 | 11 | 11
lpd_d3 | 6 | 6 | 6
(12 rows)
--on HEAD + v8 patches
postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
QUERY PLAN
---------------------------------------------
Sort
Sort Key: ((lpd_p1.tableoid)::regclass)
-> Result
-> Append
-> Seq Scan on lpd_p1
Filter: (a IS NOT NULL)
-> Seq Scan on lpd_p2
Filter: (a IS NOT NULL)
(8 rows)
postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER BY 1;
tableoid | a | b | c
----------+---+---+---
lpd_p1 | 1 | 1 | 1
lpd_p1 | 2 | 2 | 2
lpd_p1 | 3 | 3 | 3
lpd_p2 | 4 | 4 | 4
lpd_p2 | 5 | 5 | 5
(5 rows)
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/10/30 14:55, Amit Langote wrote:
>> Fixed in the attached updated patch, along with a new test in 0001 to
>> cover this case. Also, made a few tweaks to 0003 and 0005 (moved some
>> code from the former to the latter) around the handling of
>> ScalarArrayOpExprs.
>
> Sorry, I'd forgotten to include some changes.
>
> In the previous versions, RT index of the table needed to be passed to
> partition.c, which I realized is no longer needed, so I removed that
> requirement from the interface. As a result, patches 0002 and 0003 have
> changed in this version.

Some minor comments:

+ * get_rel_partitions
+ *      Return the list of partitions of rel that pass the clauses mentioned
+ *      rel->baserestrictinfo
+ *
+ * Returned list contains the AppendRelInfos of chosen partitions.
+ */
+static List *
+get_append_rel_partitions(PlannerInfo *root,

The function name in the function header is not correct.

+ !DatumGetBool(((Const *) clause)->constvalue))
+ {
+     *constfalse = true;
+     continue;

DatumGetBool will return true if the first byte of constvalue is
nonzero, otherwise false. IIUC, this is not the intention here. Or am I
missing something?

+ * clauses in this function ourselves, for example, having both a > 1 and
+ * a = 0 the list

"a = 0 the list" -> "a = 0 in the list"

+static bool
+partkey_datum_from_expr(const Expr *expr, Datum *value)
+{
+    /*
+     * Add more expression types here as needed to support higher-level
+     * code.
+     */
+    switch (nodeTag(expr))
+    {
+        case T_Const:
+            *value = ((Const *) expr)->constvalue;
+            return true;

I think for evaluating other expressions (e.g. T_Param) we will have to
pass an ExprContext to this function. Or we can do something cleaner,
because if we want to access the ExprContext inside
partkey_datum_from_expr then we may need to pass it to
"get_partitions_from_clauses", which is a common function for the
optimizer and the executor.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thanks for the test case.

On 2017/10/30 17:09, Rajkumar Raghuwanshi wrote:
> I am getting wrong output when default is sub-partitioned further, below is
> a test case.
>
> CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a);
> CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3);
> CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5);
> CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a);
> CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9);
> CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12);
> CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null);
> INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i;
> INSERT INTO lpd VALUES (null,null,null);
>
> --on HEAD
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d3
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_d2
>                      Filter: (a IS NOT NULL)
> (14 rows)
>
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a  | b  | c
> ----------+----+----+----
>  lpd_p1   |  1 | 1  |  1
>  lpd_p1   |  2 | 2  |  2
>  lpd_p1   |  3 | 3  |  3
>  lpd_p2   |  4 | 4  |  4
>  lpd_p2   |  5 | 5  |  5
>  lpd_d1   |  7 | 7  |  7
>  lpd_d1   |  8 | 8  |  8
>  lpd_d1   |  9 | 9  |  9
>  lpd_d2   | 12 | 12 | 12
>  lpd_d2   | 10 | 10 | 10
>  lpd_d2   | 11 | 11 | 11
>  lpd_d3   |  6 | 6  |  6
> (12 rows)
>
> --on HEAD + v8 patches
> postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE
> a IS NOT NULL ORDER BY 1;
>                  QUERY PLAN
> ---------------------------------------------
>  Sort
>    Sort Key: ((lpd_p1.tableoid)::regclass)
>    ->  Result
>          ->  Append
>                ->  Seq Scan on lpd_p1
>                      Filter: (a IS NOT NULL)
>                ->  Seq Scan on lpd_p2
>                      Filter: (a IS NOT NULL)
> (8 rows)
>
> postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER
> BY 1;
>  tableoid | a | b | c
> ----------+---+---+---
>  lpd_p1   | 1 | 1 | 1
>  lpd_p1   | 2 | 2 | 2
>  lpd_p1   | 3 | 3 | 3
>  lpd_p2   | 4 | 4 | 4
>  lpd_p2   | 5 | 5 | 5
> (5 rows)

I found bugs in 0003 and 0005 that caused this. Will post the patches
containing the fix in reply to Dilip's email, which contains some code
review comments [1].

Also, I noticed that the new pruning code was having a hard time dealing
with the fact that the default "range" partition doesn't explicitly say
in its partition constraint that it might contain null values. More
precisely, perhaps, the default range partition's constraint appears to
imply that it can only contain non-null values, which confuses the new
pruning code.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAFiTN-thYsobXxPS6bwOA_9erpax_S=iztSn3RtUxKKMKG4V4A@mail.gmail.com
Thanks Dilip for reviewing.

On 2017/10/31 1:50, Dilip Kumar wrote:
> Some minor comments:
>
> + * get_rel_partitions
> + *      Return the list of partitions of rel that pass the clauses mentioned
> + *      rel->baserestrictinfo
> ...
> +static List *
> +get_append_rel_partitions(PlannerInfo *root,
>
> Function name in function header is not correct.

Fixed.

> + !DatumGetBool(((Const *) clause)->constvalue))
> + {
> +     *constfalse = true;
> +     continue;
>
> DatumGetBool will return true if the first byte of constvalue is
> nonzero otherwise false. IIUC, this is not the intention here. Or I am
> missing something?

This coding pattern is in use in quite a few places; see for example
restriction_is_constant_false() and many others like
relation_excluded_by_constraints(), negate_clause(), etc. If a
RestrictInfo is marked pseudoconstant, then the clause therein must be a
Const whose constvalue is 0 if the clause is false (so that
DatumGetBool(constvalue) returns false for a constant-false clause) and
non-zero otherwise.

> + * clauses in this function ourselves, for example, having both a > 1 and
> + * a = 0 the list
>
> a = 0 the list -> a = 0 in the list

Right, fixed.

> +static bool
> +partkey_datum_from_expr(const Expr *expr, Datum *value)
> +{
> +    /*
> +     * Add more expression types here as needed to support higher-level
> +     * code.
> +     */
> +    switch (nodeTag(expr))
> +    {
> +        case T_Const:
> +            *value = ((Const *) expr)->constvalue;
> +            return true;
>
> I think for evaluating other expressions (e.g. T_Param) we will have
> to pass ExprContext to this function.

That's right.

> Or we can do something cleaner
> because if we want to access the ExprContext inside
> partkey_datum_from_expr then we may need to pass it to
> "get_partitions_from_clauses" which is a common function for optimizer
> and executor.

Yeah, I've thought about that a little. Since nothing but the planner
calls it now, and the planner doesn't itself have its hands on the
ExprContext that would be necessary for computing something like Params,
I left it out of the interface for now. That said, I *am* actually
thinking about some interface changes that would be necessary for some
other, unrelated functionality/optimizations that callers like the
run-time pruning code would expect of get_partitions_from_clauses(). We
can design the interface extension such that the aforementioned
ExprContext is passed along with it.

Attached is an updated version of the patches, addressing some of your
comments above and fixing a bug that Rajkumar reported [1]. As mentioned
there, I'm including a patch (0005 of the attached) to tweak the default
range partition's constraint to be explicit about the null values that
it might contain.
So, there are 6 patches now and what used to be patch 0005 in the previous set is patch 0006 in this version of the set. Thanks, Amit [1] https://www.postgresql.org/message-id/cd5a2d2e-0957-042c-40c2-06033fe0abf2@lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-Add-new-tests-for-partition-pruning-v9.patch
- 0002-Planner-side-changes-for-partition-pruning-v9.patch
- 0003-Implement-get_partitions_from_clauses-v9.patch
- 0004-Some-interface-changes-for-partition_bound_-cmp-bsea-v9.patch
- 0005-Tweak-default-range-partition-s-constraint-a-little-v9.patch
- 0006-Implement-get_partitions_for_keys-v9.patch
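(Following up on the ExprContext discussion above, a rough sketch of how partkey_datum_from_expr() might grow T_Param support if an ExprContext were passed down. The extended signature and the PARAM_EXTERN-only handling are assumptions for illustration, not part of the posted patches; the paramFetch hook is ignored for brevity.)

static bool
partkey_datum_from_expr(const Expr *expr, ExprContext *econtext, Datum *value)
{
    switch (nodeTag(expr))
    {
        case T_Const:
            *value = ((Const *) expr)->constvalue;
            return true;

        case T_Param:
            {
                Param      *param = (Param *) expr;
                ParamListInfo paraminfo = econtext->ecxt_param_list_info;

                /* Only external parameters are handled in this sketch. */
                if (param->paramkind == PARAM_EXTERN &&
                    paraminfo != NULL &&
                    param->paramid > 0 &&
                    param->paramid <= paraminfo->numParams)
                {
                    /* paramid is 1-based */
                    ParamExternData *prm = &paraminfo->params[param->paramid - 1];

                    if (!prm->isnull)
                    {
                        *value = prm->value;
                        return true;
                    }
                }
                return false;
            }

        default:
            return false;
    }
}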
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated version of the patches addressing some of your comments > above and fixing a bug that Rajkumar reported [1]. As mentioned there, > I'm including here a patch (the 0005 of the attached) to tweak the default > range partition constraint to be explicit about null values that it might > contain. So, there are 6 patches now and what used to be patch 0005 in > the previous set is patch 0006 in this version of the set. Hi Amit, I've been looking over this. I see the latest patches conflict with cf7ab13bf. Can you send patches rebased on current master? Thanks -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated version of the patches addressing some of your comments I've spent a bit of time looking at these but I'm out of time for now. So far I have noted down the following: 1. This comment seems wrong. /* * Since the clauses in rel->baserestrictinfo should all contain Const * operands, it should be possible to prune partitions right away. */ How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ? baserestrictinfo in this case will contain a single RestrictInfo with an OpExpr containing two Var args and it'll come right through that function too. 2. This code is way more complex than it needs to be. if (num_parts > 0) { int j; all_indexes = (int *) palloc(num_parts * sizeof(int)); j = 0; if (min_part_idx >= 0 && max_part_idx >= 0) { for (i = min_part_idx; i <= max_part_idx; i++) all_indexes[j++] = i; } if (!bms_is_empty(other_parts)) while ((i = bms_first_member(other_parts)) >= 0) all_indexes[j++] = i; if (j > 1) qsort((void *) all_indexes, j, sizeof(int), intcmp); } It looks like the min/max partition stuff is just complicating things here. If you need to build this array of all_indexes[] anyway, I don't quite understand the point of the min/max. It seems like min/max would probably work a bit nicer if you didn't need the other_parts BitmapSet, so I recommend just getting rid of min/max completely and just have a BitmapSet with bit set for each partition's index you need, you'd not need to go to the trouble of performing a qsort on an array and you could get rid of quite a chunk of code too. The entire function would then not be much more complex than: partindexes = get_partitions_from_clauses(parent, partclauses); while ((i = bms_first_member(partindexes)) >= 0) { AppendRelInfo *appinfo = rel->part_appinfos[i]; result = lappend(result, appinfo); } Then you can also get rid of your intcmp() function too. 3. The following code has the wrong size calculation: memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *)); should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have worked on your machine if you're compiling as 32 bit. I'll continue on with the review in the next few days. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 3 November 2017 at 17:32, David Rowley <david.rowley@2ndquadrant.com> wrote: > 2. This code is way more complex than it needs to be. > > if (num_parts > 0) > { > int j; > > all_indexes = (int *) palloc(num_parts * sizeof(int)); > j = 0; > if (min_part_idx >= 0 && max_part_idx >= 0) > { > for (i = min_part_idx; i <= max_part_idx; i++) > all_indexes[j++] = i; > } > if (!bms_is_empty(other_parts)) > while ((i = bms_first_member(other_parts)) >= 0) > all_indexes[j++] = i; > if (j > 1) > qsort((void *) all_indexes, j, sizeof(int), intcmp); > } > > It looks like the min/max partition stuff is just complicating things > here. If you need to build this array of all_indexes[] anyway, I don't > quite understand the point of the min/max. It seems like min/max would > probably work a bit nicer if you didn't need the other_parts > BitmapSet, so I recommend just getting rid of min/max completely and > just have a BitmapSet with bit set for each partition's index you > need, you'd not need to go to the trouble of performing a qsort on an > array and you could get rid of quite a chunk of code too. > > The entire function would then not be much more complex than: > > partindexes = get_partitions_from_clauses(parent, partclauses); > > while ((i = bms_first_member(partindexes)) >= 0) > { > AppendRelInfo *appinfo = rel->part_appinfos[i]; > result = lappend(result, appinfo); > } > > Then you can also get rid of your intcmp() function too. I've read a bit more of the patch and I think even more now that the min/max stuff should be removed. I understand that you'll be bsearching for a lower and upper bound for cases like: SELECT * FROM pt WHERE key BETWEEN 10 and 20; but it looks like the min and max range stuff is thrown away if the query is written as: SELECT * FROM pt WHERE key BETWEEN 10 and 20 OR key BETWEEN 30 AND 40; From reading the code, it seems like partset_union() would be called in this case and if the min/max of each were consecutive then the min/max range would get merged, but there's really a lot of code to support this. I think it would be much better to invent bms_add_range() and just use a Bitmapset to store the partition indexes to scan. You could simply use bms_union for OR cases and bms_intersect() for AND cases. It seems this would allow removal of this complex code. It looks like this would allow you to remove all the partset_range_* macros too. I've attached a patch which implements bms_add_range() which will save you from having to write the tight loops to call bms_add_member() such as the ones in partset_union(). Those would not be so great if there was a huge number of partitions as the Bitmapset->words[] array could be expanded many more times than required. bms_add_range() will handle that much more efficiently with a maximum of 1 repalloc() for the whole operation. It would also likely be faster since it's working at the bitmapword level rather than the bit level. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
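(A minimal sketch of how the suggested primitives would combine, assuming the bms_add_range() from the attached patch; the min/max variables stand for the bounds obtained from the bsearch and the names are illustrative.)

/* key BETWEEN 10 AND 20: one contiguous range from the bsearch */
Bitmapset  *partset = bms_add_range(NULL, min_idx1, max_idx1);

/* OR key BETWEEN 30 AND 40: union in the second range */
partset = bms_union(partset, bms_add_range(NULL, min_idx2, max_idx2));

/* AND some other restriction computed elsewhere */
partset = bms_intersect(partset, other_partset);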
On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated version of the patches match_clauses_to_partkey() needs to allow for the way quals on Bool columns are represented. create table pt (a bool not null) partition by list (a); create table pt_true partition of pt for values in('t'); create table pt_false partition of pt for values in('f'); explain select * from pt where a = true; QUERY PLAN ------------------------------------------------------------------ Append (cost=0.00..76.20 rows=2810 width=1) -> Seq Scan on pt_false (cost=0.00..38.10 rows=1405 width=1) Filter: a -> Seq Scan on pt_true (cost=0.00..38.10 rows=1405 width=1) Filter: a (5 rows) match_clause_to_indexcol() shows an example of how to handle this. explain select * from pt where a = false; will need to be allowed too. This works slightly differently. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
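(A rough sketch of the kind of matching being asked for, modeled on match_clause_to_indexcol()'s boolean handling; the helper name and the exact checks are illustrative, not from the patches.)

static bool
match_boolean_partition_clause(Node *clause, Var *partkey, bool *rightop_value)
{
    /* WHERE a  =>  a = true */
    if (IsA(clause, Var) && equal(clause, partkey))
    {
        *rightop_value = true;
        return true;
    }

    /* WHERE NOT a  =>  a = false */
    if (not_clause(clause) &&
        equal(get_notclausearg((Expr *) clause), partkey))
    {
        *rightop_value = false;
        return true;
    }

    return false;
}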
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David, On 2017/11/03 13:32, David Rowley wrote: > On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated version of the patches addressing some of your comments > > I've spent a bit of time looking at these but I'm out of time for now. Thanks a lot for the review and sorry for the delay in sending rebased patches. > So far I have noted down the following: > > 1. This comment seems wrong. > > /* > * Since the clauses in rel->baserestrictinfo should all contain Const > * operands, it should be possible to prune partitions right away. > */ Yes. I used to think it was true, then realized it isn't and updated the code to get rid of that assumption, but I forgot to update this comment. Fixed. > How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ? > baserestrictinfo in this case will contain a single RestrictInfo with > an OpExpr containing two Var args and it'll come right through that > function too. As it should, I think. Quite similarly, you will be able to see that an index path won't be considered for such a clause: create table foo (a int, b int); create index fooi on foo (a); insert into foo select generate_series(1, 100000); explain select * from foo where a = 1; QUERY PLAN ---------------------------------------------------------------------- Bitmap Heap Scan on foo (cost=12.17..482.50 rows=500 width=8) Recheck Cond: (a = 1) -> Bitmap Index Scan on fooi (cost=0.00..12.04 rows=500 width=0) Index Cond: (a = 1) (4 rows) explain select * from foo where a <= b; QUERY PLAN -------------------------------------------------------- Seq Scan on foo (cost=0.00..1693.00 rows=500 width=8) Filter: (a <= b) (2 rows) We won't be able to use such a clause for pruning at all; neither planning-time pruning nor execution-time pruning. Am I missing something? > 2. This code is way more complex than it needs to be. > > if (num_parts > 0) > { > int j; > > all_indexes = (int *) palloc(num_parts * sizeof(int)); > j = 0; > if (min_part_idx >= 0 && max_part_idx >= 0) > { > for (i = min_part_idx; i <= max_part_idx; i++) > all_indexes[j++] = i; > } > if (!bms_is_empty(other_parts)) > while ((i = bms_first_member(other_parts)) >= 0) > all_indexes[j++] = i; > if (j > 1) > qsort((void *) all_indexes, j, sizeof(int), intcmp); > } > > It looks like the min/max partition stuff is just complicating things > here. If you need to build this array of all_indexes[] anyway, I don't > quite understand the point of the min/max. It seems like min/max would > probably work a bit nicer if you didn't need the other_parts > BitmapSet, so I recommend just getting rid of min/max completely and > just have a BitmapSet with bit set for each partition's index you > need, you'd not need to go to the trouble of performing a qsort on an > array and you could get rid of quite a chunk of code too. > > The entire function would then not be much more complex than: > > partindexes = get_partitions_from_clauses(parent, partclauses); > > while ((i = bms_first_member(partindexes)) >= 0) > { > AppendRelInfo *appinfo = rel->part_appinfos[i]; > result = lappend(result, appinfo); > } > > Then you can also get rid of your intcmp() function too. The design with a min/max partition index interface to partition.c's new partition-pruning facility is intentional. 
You can find hints about how such a design came about in the following email from Robert: https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com For range queries, it is desirable for the partitioning module to return the set of qualifying partitions that are contiguous in a compact (O(1)) min/max representation rather than having to enumerate all those indexes in the set. It's nice to avoid iterating over that set twice -- once when constructing the set in the partitioning module and then again in the caller (in this case, the planner) to perform the planning-related tasks per selected partition. We need the other_parts Bitmapset too, because selected partitions may not always be contiguous, even in the case of range partitioning, if there are OR clauses and the possibility that the default partition is also selected. While computing the selected partition set from a given set of clauses, the partitioning code tries to keep the min/max representation as long as it makes sense to, and once the selected partitions no longer appear to be contiguous, it switches to the Bitmapset mode. > 3. The following code has the wrong size calculation: > > memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *)); > > should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have > worked on your machine if you're compiling as 32 bit. Oops, you're right. Fixed. > I'll continue on with the review in the next few days. Thanks again. Attached is the updated set of patches. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-Add-new-tests-for-partition-pruning-v10.patch
- 0002-Planner-side-changes-for-partition-pruning-v10.patch
- 0003-Implement-get_partitions_from_clauses-v10.patch
- 0004-Some-interface-changes-for-partition_bound_-cmp-bsea-v10.patch
- 0005-Tweak-default-range-partition-s-constraint-a-little-v10.patch
- 0006-Implement-get_partitions_for_keys-v10.patch
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
On 2017/11/06 12:53, David Rowley wrote: > On 3 November 2017 at 17:32, David Rowley <david.rowley@2ndquadrant.com> wrote: >> 2. This code is way more complex than it needs to be. >> >> if (num_parts > 0) >> { >> int j; >> >> all_indexes = (int *) palloc(num_parts * sizeof(int)); >> j = 0; >> if (min_part_idx >= 0 && max_part_idx >= 0) >> { >> for (i = min_part_idx; i <= max_part_idx; i++) >> all_indexes[j++] = i; >> } >> if (!bms_is_empty(other_parts)) >> while ((i = bms_first_member(other_parts)) >= 0) >> all_indexes[j++] = i; >> if (j > 1) >> qsort((void *) all_indexes, j, sizeof(int), intcmp); >> } >> >> It looks like the min/max partition stuff is just complicating things >> here. If you need to build this array of all_indexes[] anyway, I don't >> quite understand the point of the min/max. It seems like min/max would >> probably work a bit nicer if you didn't need the other_parts >> BitmapSet, so I recommend just getting rid of min/max completely and >> just have a BitmapSet with bit set for each partition's index you >> need, you'd not need to go to the trouble of performing a qsort on an >> array and you could get rid of quite a chunk of code too. >> >> The entire function would then not be much more complex than: >> >> partindexes = get_partitions_from_clauses(parent, partclauses); >> >> while ((i = bms_first_member(partindexes)) >= 0) >> { >> AppendRelInfo *appinfo = rel->part_appinfos[i]; >> result = lappend(result, appinfo); >> } >> >> Then you can also get rid of your intcmp() function too. > > I've read a bit more of the patch and I think even more now that the > min/max stuff should be removed. Oops, I didn't catch this email before sending my earlier reply. Thanks for the bms range patch. Will reply to this shortly after reading your patch and thinking on it a bit. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 6 November 2017 at 17:30, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/11/03 13:32, David Rowley wrote: >> On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> 1. This comment seems wrong. >> >> /* >> * Since the clauses in rel->baserestrictinfo should all contain Const >> * operands, it should be possible to prune partitions right away. >> */ > > Yes. I used to think it was true, then realized it isn't and updated the > code to get rid of that assumption, but I forgot to update this comment. > Fixed. > >> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ? >> baserestrictinfo in this case will contain a single RestrictInfo with >> an OpExpr containing two Var args and it'll come right through that >> function too. ... > We won't be able to use such a clause for pruning at all; neither > planning-time pruning nor execution-time pruning. Am I missing something? I just meant the comment was wrong. > > The design with a min/max partition index interface to partition.c's new > partition-pruning facility is intentional. You can find hints about how > such a design came about in the following email from Robert: > > https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com Yeah, I remember reading that before I had looked at the code. I disagree with Robert on this. The fact that the min/max range gets turned into a list of everything in that range in get_append_rel_partitions means that all the advantages of storing the partitions as a range are voided. If you could have kept it a range the entire time, then that might be different, but it seems you need to materialize the entire range in order to sort the partitions into order. I've included Robert in just in case he wants to take a look at the code that resulted from that design. Maybe something is not following what he had in mind, or maybe he'll change his mind based on the code that resulted. > For range queries, it is desirable for the partitioning module to return > the set of qualifying partitions that are contiguous in a compact (O(1)) > min/max representation rather than having to enumerate all those indexes in the > set. It's nice to avoid iterating over that set twice -- once when > constructing the set in the partitioning module and then again in the > caller (in this case, the planner) to perform the planning-related tasks per > selected partition. The idea is that you still get the min and max from the bsearch, but then use bms_add_range() to populate a bitmapset of the matching partitions. The big-O notation of the search shouldn't change. > We need the other_parts Bitmapset too, because selected partitions may not > always be contiguous, even in the case of range partitioning, if there are > OR clauses and the possibility that the default partition is also > selected. While computing the selected partition set from a given set of > clauses, the partitioning code tries to keep the min/max representation as > long as it makes sense to, and once the selected partitions no longer > appear to be contiguous, it switches to the Bitmapset mode. Yip. I understand that. I just think there's no benefit to having min/max since it needs to be materialized into a list of the entire range at some point, it might as well be done as soon as possible using a bitmapset, which would save having all the partset_union, partset_intersect, partset_range_empty, partset_range_overlap, partset_range_adjacent code. 
You'd end up just using bms_union and bms_intersect then bms_add_range to handle the min/max bound you get from the bsearch. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
On 2017/11/06 13:15, David Rowley wrote: > On 31 October 2017 at 21:43, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated version of the patches > > match_clauses_to_partkey() needs to allow for the way quals on Bool > columns are represented. > > create table pt (a bool not null) partition by list (a); > create table pt_true partition of pt for values in('t'); > create table pt_false partition of pt for values in('f'); > explain select * from pt where a = true; > QUERY PLAN > ------------------------------------------------------------------ > Append (cost=0.00..76.20 rows=2810 width=1) > -> Seq Scan on pt_false (cost=0.00..38.10 rows=1405 width=1) > Filter: a > -> Seq Scan on pt_true (cost=0.00..38.10 rows=1405 width=1) > Filter: a > (5 rows) > > match_clause_to_indexcol() shows an example of how to handle this. > > explain select * from pt where a = false; > > will need to be allowed too. This works slightly differently. You're right. I've fixed things to handle Boolean partitioning in the updated set of patches I will post shortly. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 2017/11/06 14:32, David Rowley wrote: > On 6 November 2017 at 17:30, Amit Langote wrote: >> On 2017/11/03 13:32, David Rowley wrote: >>> On 31 October 2017 at 21:43, Amit Langote wrote: >>> 1. This comment seems wrong. >>> >>> /* >>> * Since the clauses in rel->baserestrictinfo should all contain Const >>> * operands, it should be possible to prune partitions right away. >>> */ >> >> Yes. I used to think it was true, then realized it isn't and updated the >> code to get rid of that assumption, but I forgot to update this comment. >> Fixed. >> >>> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ? >>> baserestrictinfo in this case will contain a single RestrictInfo with >>> an OpExpr containing two Var args and it'll come right through that >>> function too. > > ... > >> We won't be able to use such a clause for pruning at all; neither >> planning-time pruning nor execution-time pruning. Am I missing something? > > I just meant the comment was wrong. Ah, gotcha. >> The design with a min/max partition index interface to partition.c's new >> partition-pruning facility is intentional. You can find hints about how >> such a design came about in the following email from Robert: >> >> https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com > > Yeah, I remember reading that before I had looked at the code. I > disagree with Robert on this. The fact that the min/max range gets > turned into a list of everything in that range in > get_append_rel_partitions means that all the advantages of storing the > partitions as a range are voided. If you could have kept it a range the > entire time, then that might be different, but it seems you need to > materialize the entire range in order to sort the partitions into > order. I've included Robert in just in case he wants to take a look at > the code that resulted from that design. Maybe something is not > following what he had in mind, or maybe he'll change his mind based on > the code that resulted. > >> For range queries, it is desirable for the partitioning module to return >> the set of qualifying partitions that are contiguous in a compact (O(1)) >> min/max representation rather than having to enumerate all those indexes in the >> set. It's nice to avoid iterating over that set twice -- once when >> constructing the set in the partitioning module and then again in the >> caller (in this case, the planner) to perform the planning-related tasks per >> selected partition. > > The idea is that you still get the min and max from the bsearch, but > then use bms_add_range() to populate a bitmapset of the matching > partitions. The big-O notation of the search shouldn't change. > >> We need the other_parts Bitmapset too, because selected partitions may not >> always be contiguous, even in the case of range partitioning, if there are >> OR clauses and the possibility that the default partition is also >> selected. While computing the selected partition set from a given set of >> clauses, the partitioning code tries to keep the min/max representation as >> long as it makes sense to, and once the selected partitions no longer >> appear to be contiguous, it switches to the Bitmapset mode. > > Yip. I understand that. 
I just think there's no benefit to having > min/max since it needs to be materialized into a list of the entire > range at some point, it might as well be done as soon as possible > using a bitmapset, which would save having all the partset_union, > partset_intersect, partset_range_empty, partset_range_overlap, > partset_range_adjacent code. You'd end up just using bms_union and > bms_intersect then bms_add_range to handle the min/max bound you get > from the bsearch. OK, I have gotten rid of the min/max partition index interface and instead adopted the bms_add_range() approach by including your patch to add the same in the patch set (which is now 0002 in the whole set). I have to admit that it's simpler to understand the new code with just Bitmapsets to look at, but I'm still a bit concerned about materializing the whole set right within partition.c, although we can perhaps optimize it later. Attached updated set of patches, including the fix to make the new pruning code handle Boolean partitioning. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
- 0001-Add-new-tests-for-partition-pruning-v11.patch
- 0002-Add-bms_add_range-to-add-members-within-the-specifie-v11.patch
- 0003-Planner-side-changes-for-partition-pruning-v11.patch
- 0004-Implement-get_partitions_from_clauses-v11.patch
- 0005-Some-interface-changes-for-partition_bound_-cmp-bsea-v11.patch
- 0006-Tweak-default-range-partition-s-constraint-a-little-v11.patch
- 0007-Implement-get_partitions_for_keys-v11.patch
On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > OK, I have gotten rid of the min/max partition index interface and instead > adopted the bms_add_range() approach by including your patch to add the > same in the patch set (which is now 0002 in the whole set). I have to > admit that it's simpler to understand the new code with just Bitmapsets to > look at, but I'm still a bit concerned about materializing the whole set > right within partition.c, although we can perhaps optimize it later. Thanks for making that change. The code looks much more simple now. For performance, if you're worried about a very large number of partitions, then I think you're better off using bms_next_member() rather than bms_first_member(), (likely this applies globally, but you don't need to worry about those). The problem with bms_first_member is that it must always loop over the 0 words before it finds any bits set for each call, whereas bms_next_member will start on the word it was last called for. There will likely be a pretty big performance difference between the two when processing a large Bitmapset. > Attached updated set of patches, including the fix to make the new pruning > code handle Boolean partitioning. Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
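(The iteration pattern being recommended, as a short sketch; the partindexes variable is illustrative.)

int     i = -1;

/* bms_next_member() resumes scanning from where the previous call left off */
while ((i = bms_next_member(partindexes, i)) >= 0)
{
    /* process the partition with index i */
}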
On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote: > Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) I have a little more review to share: 1. Missing "in" in comment. Should be "mentioned in" * get_append_rel_partitions * Return the list of partitions of rel that pass the clauses mentioned * rel->baserestrictinfo 2. Variable should be declared in inner scope with the following fragment: void set_basic_child_rel_properties(PlannerInfo *root, RelOptInfo *rel, RelOptInfo *childrel, AppendRelInfo *appinfo) { AttrNumber attno; if (rel->part_scheme) { which makes the code the same as where you moved it from. 3. Normally lfirst() is assigned to a variable at the start of a foreach() loop. You have code which does not follow this. foreach(lc, clauses) { Expr *clause; int i; if (IsA(lfirst(lc), RestrictInfo)) { RestrictInfo *rinfo = lfirst(lc); You could assign this to a Node * since the type is unknown to you at the start of the loop. 4. /* * Useless if what we're thinking of as a constant is actually * a Var coming from this relation. */ if (bms_is_member(rel->relid, constrelids)) continue; Should this be moved to just above the op_strict() test? This one seems cheaper. 5. Typo "paritions": /* No clauses to prune paritions, so scan all partitions. */ But thinking about it more, the comment should say something more along the lines of /* No useful clauses for partition pruning. Scan all partitions. */ The key difference is that there might be clauses, just without Consts. Actually, the more I look at get_append_rel_partitions(), the more I think it would be better if you re-shaped that if/else if test so that it only performs the loop over the partindexes if it's been set. I ended up with the attached version of the function after moving things around a little bit. I'm still reviewing but thought I'd share this part so far. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
On 2017/11/06 21:52, David Rowley wrote: > On 6 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> OK, I have gotten rid of the min/max partition index interface and instead >> adopted the bms_add_range() approach by including your patch to add the >> same in the patch set (which is now 0002 in the whole set). I have to >> admit that it's simpler to understand the new code with just Bitmapsets to >> look at, but I'm still a bit concerned about materializing the whole set >> right within partition.c, although we can perhaps optimize it later. > > Thanks for making that change. The code looks much more simple now. > > For performance, if you're worried about a very large number of > partitions, then I think you're better off using bms_next_member() > rather than bms_first_member(), (likely this applies globally, but you > don't need to worry about those). > > The problem with bms_first_member is that it must always loop over the > 0 words before it finds any bits set for each call, whereas > bms_next_member will start on the word it was last called for. There > will likely be a pretty big performance difference between the two > when processing a large Bitmapset. Ah, thanks for the explanation. I will change it to bms_next_member() in the next version. >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. > > Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) Thank you. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote: > Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) Hi Amit, I had another look over this today. Apologies if any of the review seems petty. Here goes: 1. This if test seems to be meant to test for a child that's a partitioned table, but it is actually testing for a non-NULL part_scheme. /* * If childrel is itself partitioned, add it and its partitioned * children to the list being propagated up to the root rel. */ if (childrel->part_scheme && rel->part_scheme) Should this code use IS_PARTITIONED_REL() instead? Seems a bit strange to test for a NULL part_scheme. 2. There are a couple of mistakes in my bms_add_range() code. I've attached bms_add_range_fix.patch. Can you apply this to your tree? 3. This assert seems to be Asserting the same thing twice: Assert(rel->live_partitioned_rels != NIL && list_length(rel->live_partitioned_rels) > 0); A List with length == 0 is always NIL. 4. In get_partitions_from_clauses(), can you comment on why you perform the list_concat() there? I believe this is there so that the partition bound from the parent is passed down to the child so that we can properly eliminate all child partitions when the 2nd level of partitioning is using the same partition key as the 1st level. I think this deserves a paragraph of comment to explain this. 5. Please add a comment to explain what's going on here in classify_partition_bounding_keys() if (partattno == 0) { partexpr = lfirst(partexprs_item); partexprs_item = lnext(partexprs_item); } It looks like, similar to index expressions, partition expressions use attno 0 as a marker to consume the next expression from the list. Does this need validation that there are enough partexprs_item items like what is done in get_range_key_properties()? Or is this validated somewhere else? 6. The comment claims the if test will test something which it does not seem to test for: /* * Redundant key elimination using btree-semantics based tricks. * * Only list and range partitioning use btree operator semantics, so * skip otherwise. Also, if there are expressions whose value is yet * unknown, skip this step, because we need to compare actual values * below. */ memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *)); if (partkey->strategy == PARTITION_STRATEGY_LIST || partkey->strategy == PARTITION_STRATEGY_RANGE) I was expecting this to be skipped when the clauses contained a non-const, but it does not seem to. 7. Should be "compare them" /* * If the leftarg and rightarg clauses' constants are both of the type * expected by "op" clause's operator, then compare then using the * latter's comparison function. */ But if I look at the code, "compare then using the latter's comparison function." is not true; it seems to use op's comparison function, not rightarg's. With most of the calls op and rightarg are the same, but not all of them. The function shouldn't make that assumption even if op always happened to be the same as rightarg. 8. remove_redundant_clauses() needs an overview comment of what the function does. 9. The comment should explain what we would do in the case of key < 3 AND key <= 2 using some examples. /* try to keep only one of <, <= */ 10. Wondering why this loop runs backward? for (s = BTMaxStrategyNumber; --s >= 0;) Why not just: for (s = 0; s < BTMaxStrategyNumber; s++) I can't see a special reason for it to run backward. It seems unusual, but if there's a good reason that I've failed to realise then it's maybe worth a comment. 11. 
Please comment on why *constfalse = true is set here: if (!chk || s == (BTEqualStrategyNumber - 1)) continue; if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk, &test_result)) { if (!test_result) { *constfalse = true; return; } /* discard the redundant key. */ xform[s] = NULL; } Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1. Also, please add a comment when discarding the redundant key; maybe explain that equality is more useful than the other strategies when there's an overlap. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
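(For points 9 and 11, a hedged illustration of the intended behaviour; cmp() stands in for the opfamily comparison the patch performs via partition_cmp_args(), and the variable names are invented.)

/*
 * key < 3 AND key <= 2: keep whichever upper bound is tighter.
 * key = 1 AND key > 1:  the two bounds contradict, so the whole AND
 * is constant-false and *constfalse should be set.
 */
if (lt_const != NULL && le_const != NULL)
{
    if (cmp(le_const, lt_const) < 0)
        lt_const = NULL;    /* key <= 2 implies key < 3; drop the < */
    else
        le_const = NULL;    /* e.g. key < 3 implies key <= 5; drop the <= */
}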
Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. Thanks for the review. (..also looking at the comments you sent earlier today.) On 2017/11/07 11:14, David Rowley wrote: > On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote: >> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) > > I have a little more review to share: > > 1. Missing "in" in comment. Should be "mentioned in" > > * get_append_rel_partitions > * Return the list of partitions of rel that pass the clauses mentioned > * rel->baserestrictinfo > > 2. Variable should be declared in inner scope with the following fragment: > > void > set_basic_child_rel_properties(PlannerInfo *root, > RelOptInfo *rel, > RelOptInfo *childrel, > AppendRelInfo *appinfo) > { > AttrNumber attno; > > if (rel->part_scheme) > { > > which makes the code the same as where you moved it from. It seems like you included the above changes in your attached C file, which I will incorporate into my repository. > 3. Normally lfirst() is assigned to a variable at the start of a > foreach() loop. You have code which does not follow this. > > foreach(lc, clauses) > { > Expr *clause; > int i; > > if (IsA(lfirst(lc), RestrictInfo)) > { > RestrictInfo *rinfo = lfirst(lc); > > You could assign this to a Node * since the type is unknown to you at > the start of the loop. Will make the suggested changes to match_clauses_to_partkey(). > 4. > /* > * Useless if what we're thinking of as a constant is actually > * a Var coming from this relation. > */ > if (bms_is_member(rel->relid, constrelids)) > continue; > > Should this be moved to just above the op_strict() test? This one seems cheaper. Agreed, will do. Also makes sense to move the PartCollMatchesExprColl() test together. > 5. Typo "paritions": /* No clauses to prune paritions, so scan all > partitions. */ > > But thinking about it more, the comment should say something more along the > lines of /* No useful clauses for partition pruning. Scan all > partitions. */ You fixed it. :) > The key difference is that there might be clauses, just without Consts. > > Actually, the more I look at get_append_rel_partitions(), the more I think it > would be better if you re-shaped that if/else if test so that it only > performs the loop over the partindexes if it's been set. > > I ended up with the attached version of the function after moving > things around a little bit. Thanks a lot for that. Looks much better now. > I'm still reviewing but thought I'd share this part so far. As mentioned at the top, I'm looking at your latest comments and they all seem to be good points to me, so will address those in the next version. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
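(For point 3 above, a short sketch of the usual foreach() idiom being agreed on; the loop body is elided.)

foreach(lc, clauses)
{
    Node       *clause = (Node *) lfirst(lc);

    if (IsA(clause, RestrictInfo))
    {
        RestrictInfo *rinfo = (RestrictInfo *) clause;

        /* ... work with rinfo->clause from here on ... */
    }
}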
On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.
Hi Amit,
I have tried pruning for different values of the constraint_exclusion GUC; not sure exactly how it should behave, but I can see that with a delete statement pruning is not happening when constraint_exclusion is off, while select works as expected. Is this expected behaviour?

create table lp (c1 int, c2 text) partition by list(c1);
create table lp1 partition of lp for values in (1,2);
create table lp2 partition of lp for values in (3,4);
create table lp3 partition of lp for values in (5,6);
insert into lp values (1,'p1'),(2,'p1'),(3,'p2'),(4,'p2'),(5,'p3');
show constraint_exclusion ;
constraint_exclusion
----------------------
partition
(1 row)
explain select c1 from lp where c1 >= 1 and c1 < 2;
QUERY PLAN
----------------------------------------------------------
Append (cost=0.00..29.05 rows=6 width=4)
-> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=4)
Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)
explain delete from lp where c1 >= 1 and c1 < 2;
QUERY PLAN
----------------------------------------------------------
Delete on lp (cost=0.00..29.05 rows=6 width=6)
Delete on lp1
-> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=6)
Filter: ((c1 >= 1) AND (c1 < 2))
(4 rows)
set constraint_exclusion = off;
explain select c1 from lp where c1 >= 1 and c1 < 2;
QUERY PLAN
----------------------------------------------------------
Append (cost=0.00..29.05 rows=6 width=4)
-> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=4)
Filter: ((c1 >= 1) AND (c1 < 2))
(3 rows)
explain delete from lp where c1 >= 1 and c1 < 2;
QUERY PLAN
----------------------------------------------------------
Delete on lp (cost=0.00..87.15 rows=18 width=6)
Delete on lp1
Delete on lp2
Delete on lp3
-> Seq Scan on lp1 (cost=0.00..29.05 rows=6 width=6)
Filter: ((c1 >= 1) AND (c1 < 2))
-> Seq Scan on lp2 (cost=0.00..29.05 rows=6 width=6)
Filter: ((c1 >= 1) AND (c1 < 2))
-> Seq Scan on lp3 (cost=0.00..29.05 rows=6 width=6)
Filter: ((c1 >= 1) AND (c1 < 2))
(10 rows)
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Hi Rajkumar, Thanks for testing. On 2017/11/08 15:52, Rajkumar Raghuwanshi wrote: > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> > wrote: > >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. >> > > Hi Amit, > > I have tried pruning for different values of the constraint_exclusion GUC; > not sure exactly how it should behave, but I can see that with a delete > statement pruning is not happening when constraint_exclusion is off, while > select works as expected. Is this expected behaviour? Hmm, the new pruning only works for selects, not DML. The patch also changes get_relation_constraints() to not include the internal partition constraints, but mistakenly does so for all query types, not just select. Will look into it. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
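(A sketch of the sort of gating that seems to be implied for get_relation_constraints(): keep the internal partition constraint whenever the new pruning will not run. The commandType test and the partconstraintquals name are assumptions about the eventual fix, not code from the patches.)

/*
 * The new pruning only runs for selects, so for DML we must keep
 * including the partition constraint, letting constraint exclusion
 * continue to prune leaf partitions as before.
 */
if (root->parse->commandType != CMD_SELECT)
    result = list_concat(result, partconstraintquals);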
On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/11/06 14:32, David Rowley wrote: >> On 6 November 2017 at 17:30, Amit Langote wrote: >>> On 2017/11/03 13:32, David Rowley wrote: >>>> On 31 October 2017 at 21:43, Amit Langote wrote: [....] > > Attached updated set of patches, including the fix to make the new pruning > code handle Boolean partitioning. > I am getting the following warnings on Mac OS X: partition.c:1800:24: warning: comparison of constant -1 with expression of type 'NullTestType' (aka 'enum NullTestType') is always false [-Wtautological-constant-out-of-range-compare] if (keynullness[i] == -1) ~~~~~~~~~~~~~~ ^ ~~ partition.c:1932:25: warning: comparison of constant -1 with expression of type 'NullTestType' (aka 'enum NullTestType') is always false [-Wtautological-constant-out-of-range-compare] if (keynullness[i] == -1) ~~~~~~~~~~~~~~ ^ ~~ 2 warnings generated. Comment for 0004 patch: 270 + /* -1 represents an invalid value of NullTestType. */ 271 + memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType)); I think we should not use memset to set a value other than 0 or true/false. This will work for -1 on systems where values are stored in two's complement, but I am afraid of other architectures. Regards, Amul -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi Amul. On 2017/11/09 20:05, amul sul wrote: > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/11/06 14:32, David Rowley wrote: >>> On 6 November 2017 at 17:30, Amit Langote wrote: >>>> On 2017/11/03 13:32, David Rowley wrote: >>>>> On 31 October 2017 at 21:43, Amit Langote wrote: > [....] >> >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. >> > > I am getting the following warnings on Mac OS X: Thanks for the review. > partition.c:1800:24: warning: comparison of constant -1 with > expression of type 'NullTestType' > (aka 'enum NullTestType') is always false > [-Wtautological-constant-out-of-range-compare] > if (keynullness[i] == -1) > ~~~~~~~~~~~~~~ ^ ~~ > partition.c:1932:25: warning: comparison of constant -1 with > expression of type 'NullTestType' > (aka 'enum NullTestType') is always false > [-Wtautological-constant-out-of-range-compare] > if (keynullness[i] == -1) > ~~~~~~~~~~~~~~ ^ ~~ > 2 warnings generated. > > > Comment for 0004 patch: > 270 + /* -1 represents an invalid value of NullTestType. */ > 271 + memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType)); > > I think we should not use memset to set a value other than 0 or true/false. > This will work for -1 on systems where values are stored in two's > complement, but I am afraid of other architectures. OK, I will remove all instances of comparing and setting variables of type NullTestType to a value of -1. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
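(One portable alternative, sketched: track validity in a separate boolean array instead of abusing -1 as an out-of-range enum value. The keynullness_set name is invented for illustration.)

NullTestType keynullness[PARTITION_MAX_KEYS];
bool        keynullness_set[PARTITION_MAX_KEYS];

/* no NullTest seen for any key column yet */
memset(keynullness_set, 0, sizeof(keynullness_set));

/* when a NullTest on key column i is found: */
keynullness[i] = nulltest->nulltesttype;
keynullness_set[i] = true;

/* later, instead of "if (keynullness[i] == -1)": */
if (!keynullness_set[i])
    continue;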
Hello, At Fri, 10 Nov 2017 09:34:57 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <5fcb1a9f-b4ad-119d-14c7-282c30c7f8d1@lab.ntt.co.jp> > Hi Amul. > > On 2017/11/09 20:05, amul sul wrote: > > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote > > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> On 2017/11/06 14:32, David Rowley wrote: > >>> On 6 November 2017 at 17:30, Amit Langote wrote: > >>>> On 2017/11/03 13:32, David Rowley wrote: > >>>>> On 31 October 2017 at 21:43, Amit Langote wrote: > > [....] > >> > >> Attached updated set of patches, including the fix to make the new pruning > >> code handle Boolean partitioning. > >> > > > > I am getting the following warnings on Mac OS X: > > Thanks for the review. > > > partition.c:1800:24: warning: comparison of constant -1 with > > expression of type 'NullTestType' > > (aka 'enum NullTestType') is always false > > [-Wtautological-constant-out-of-range-compare] > > if (keynullness[i] == -1) > > ~~~~~~~~~~~~~~ ^ ~~ > > partition.c:1932:25: warning: comparison of constant -1 with > > expression of type 'NullTestType' > > (aka 'enum NullTestType') is always false > > [-Wtautological-constant-out-of-range-compare] > > if (keynullness[i] == -1) > > ~~~~~~~~~~~~~~ ^ ~~ > > 2 warnings generated. > > > > > > Comment for 0004 patch: > > 270 + /* -1 represents an invalid value of NullTestType. */ > > 271 + memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType)); > > > > I think we should not use memset to set a value other than 0 or true/false. > > This will work for -1 on systems where values are stored in two's > > complement, but I am afraid of other architectures. > > OK, I will remove all instances of comparing and setting variables of type > NullTestType to a value of -1. In 0002, bms_add_range has a bit naive-looking loop + while (wordnum <= uwordnum) + { + bitmapword mask = (bitmapword) ~0; + + /* If working on the lower word, zero out bits below 'lower'. */ + if (wordnum == lwordnum) + { + int lbitnum = BITNUM(lower); + mask >>= lbitnum; + mask <<= lbitnum; + } + + /* Likewise, if working on the upper word, zero bits above 'upper' */ + if (wordnum == uwordnum) + { + int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1); + mask <<= ushiftbits; + mask >>= ushiftbits; + } + + a->words[wordnum++] |= mask; + } Without some aggressive optimization, the loop spends most of its time checking and jumping for nothing, especially with many partitions, and it is somewhat unintuitive. The following uses a bit tricky bitmap operation but is straightforward as a whole. ===== /* fill the bits upper from BITNUM(lower) (0-based) of the first word */ a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1); /* fill up intermediate words */ while (wordnum < uwordnum) a->words[wordnum++] = ~(bitmapword) 0; /* fill up to BITNUM(upper) bit (0-based) of the last word */ a->workds[wordnum++] |= (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1)); ===== In 0003, +match_clauses_to_partkey(RelOptInfo *rel, ... + if (rinfo->pseudoconstant && + (IsA(clause, Const) && + ((((Const *) clause)->constisnull) || + !DatumGetBool(((Const *) clause)->constvalue)))) + { + *constfalse = true; + continue; If we call this function in both conjunction and disjunction contexts (the latter only in the recursive case), constfalse == true means that no clauses are needed only in the former case. 
Since (I think) a bare list of RestrictInfos is expected to be treated as a conjunction (it's quite dubious, though..), we might do better to call this for each subnode of a disjunction. Or, like match_clauses_to_index, we might do better to provide match_clause_to_partkey(rel, rinfo, contains_const), which returns NULL if constfalse. (I'm not self-confident on this..) + /* + * If no commutator exists, cannot flip the qual's args, + * so give up. + */ + if (!OidIsValid(expr_op)) + continue; I suppose it's better to keep leftop and rightop as is rather than flipping them over so that the var is placed on the left side. Does that make things so complex? + * It the operator happens to be '<>', which is never listed If? + if (!op_in_opfamily(expr_op, partopfamily)) + { + Oid negator = get_negator(expr_op); + + if (!OidIsValid(negator) || + !op_in_opfamily(negator, partopfamily)) + continue; classify_partition_bounding_keys() checks the same thing by checking whether the negator's strategy is BTEqualStrategyNumber. (I'm not sure the operator is guaranteed to be of btree, though..) Shouldn't they be done in a similar way? # In the function, "partkey strategy" and "operator strategy" are # confusing.. + AttrNumber attno; This declaration might be better in a narrower scope. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hello, this is the second part of the review. At Fri, 10 Nov 2017 12:30:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171110.123000.151902771.horiguchi.kyotaro@lab.ntt.co.jp> > In 0002, bms_add_range has a bit naive-looking loop > In 0003, In 0004, regarding the name get_partitions_from_clauses_guts(): it seems to me that we usually use _internal for the recursive part of such functions. (I have the same comment as David about the comment for get_partition_from_clause()) About the same function: Couldn't we get out in the fast path when clauses == NIL? + /* No constraints on the keys, so, return *all* partitions. */ + result = bms_add_range(result, 0, partdesc->nparts - 1); This allows us to return immediately here. And just above this, + if (nkeys > 0 && !constfalse) + result = get_partitions_for_keys(relation, &keys); + else if (!constfalse) Those two conditions are not orthogonal. Maybe something like the following seems more understandable. > if (!constfalse) > { > /* No constraints on the keys, so, return *all* partitions. */ > if (nkeys == 0) > return bms_add_range(result, 0, partdesc->nparts - 1); > > result = get_partitions_for_keys(relation, &keys); > } I'm not sure what is meant to be (formally) done here, but it seems to me that OrExpr is assumed to be only at the top level of the clauses. So the following (just an example, but a meaningful expression in this shape must exist) expression is perhaps wrongly processed here. CREATE TABLE p (a int) PARTITION BY RANGE (a); CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10); CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20); SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5); get_partitions_for_keys() returns both c1 and c2 and still or_clauses here holds (a = 15 OR a = 5) and the function tries to *add* partitions for a = 15 and a = 5 separately. I'd like to pause here. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Ooops! The following comment is wrong. Please ignore it. At Fri, 10 Nov 2017 14:38:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171110.143811.97616847.horiguchi.kyotaro@lab.ntt.co.jp> > Those two conditions are not orthogonal. Maybe something like > the following seems more understandable. > > > if (!constfalse) > > { > > /* No constraints on the keys, so, return *all* partitions. */ > > if (nkeys == 0) > > return bms_add_range(result, 0, partdesc->nparts - 1); > > > > result = get_partitions_for_keys(relation, &keys); > > } > > I'm not sure what is meant to be (formally) done here, but it > seems to me that OrExpr is assumed to be only at the top level of > the clauses. So the following (just an example, but a meaningful > expression in this shape must exist) expression is perhaps > wrongly processed here. > > CREATE TABLE p (a int) PARTITION BY RANGE (a); > CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10); > CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20); > > SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5); > > get_partitions_for_keys() returns both c1 and c2 and still > or_clauses here holds (a = 15 OR a = 5) and the function tries to > *add* partitions for a = 15 and a = 5 separately. This is working fine. Sorry for the bogus comment. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hello, At Fri, 10 Nov 2017 14:44:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171110.144455.117208639.horiguchi.kyotaro@lab.ntt.co.jp> > > Those two conditions are not orthogonal. Maybe something like > > the following seems more understandable. > > > > > if (!constfalse) > > > { > > > /* No constraints on the keys, so, return *all* partitions. */ > > > if (nkeys == 0) > > > return bms_add_range(result, 0, partdesc->nparts - 1); > > > > > > result = get_partitions_for_keys(relation, &keys); > > > } So, the condition (!constfalse && nkeys == 0) cannot return there. I'm badly confused by the variable name. I couldn't find another reasonable structure using the current classify_p_b_keys(), but could you add a comment like the following as an example? + /* + * This function processes clauses other than OR expressions and returns + * the excluded OR expressions in or_clauses + */ > nkeys = classify_partition_bounding_keys(relation, clauses, > &keys, &constfalse, > &or_clauses); > /* > * Only look up in the partition descriptor if the query provides > * constraints on the keys at all. > */ > if (!constfalse) > { > if (nkey > 0) > result = get_partitions_for_keys(relation, &keys); > else -+ /* No constraints on the keys, so, all partitions are passed. */ > result = bms_add_range(result, 0, partdesc->nparts - 1); > } > + /* + * We have a partition set for clauses not returned in or_clauses + * here. Now combine in the result of each OR clause. + */ > foreach(lc, or_clauses) > { > BoolExpr *or = (BoolExpr *) lfirst(lc); > ListCell *lc1; > Bitmapset *or_partset = NULL; > + Assert(or_clause(or)); regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 10 November 2017 at 16:30, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> In 0002, bms_add_range has a bit naive-looking loop
>
> + while (wordnum <= uwordnum)
> + {
> +     bitmapword mask = (bitmapword) ~0;
> +
> +     /* If working on the lower word, zero out bits below 'lower'. */
> +     if (wordnum == lwordnum)
> +     {
> +         int lbitnum = BITNUM(lower);
> +         mask >>= lbitnum;
> +         mask <<= lbitnum;
> +     }
> +
> +     /* Likewise, if working on the upper word, zero bits above 'upper' */
> +     if (wordnum == uwordnum)
> +     {
> +         int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +         mask <<= ushiftbits;
> +         mask >>= ushiftbits;
> +     }
> +
> +     a->words[wordnum++] |= mask;
> + }
>
> Without some aggressive optimization, the loop spends most of its time
> check-and-jumping for nothing, especially with many partitions, and is
> somewhat unintuitive.
>
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
>
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>     a->words[wordnum++] = ~(bitmapword) 0;
>
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->words[wordnum++] |=
>     (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

No objections here to making bms_add_range() perform better, but this is not going to work when lwordnum == uwordnum. You'd need to special-case that. I didn't think it was worth the trouble, but maybe it is...

I assume the += should be |=.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
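For illustration, here is a standalone sketch of the masking approach with the single-word case David points out handled explicitly, and |= rather than +=. The typedef and macros mimic the bitmapset.c conventions; this is a sketch for discussion, not the committed implementation:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t bitmapword;
    #define BITS_PER_BITMAPWORD 64
    #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
    #define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)

    /* Set bits lower..upper (inclusive); the caller guarantees that the
     * words array is long enough. */
    static void
    add_range(bitmapword *words, int lower, int upper)
    {
        int lwordnum = WORDNUM(lower);
        int uwordnum = WORDNUM(upper);
        bitmapword lmask = ~(bitmapword) 0 << BITNUM(lower);
        bitmapword umask = ~(bitmapword) 0 >>
            (BITS_PER_BITMAPWORD - 1 - BITNUM(upper));

        if (lwordnum == uwordnum)
        {
            /* Range fits in a single word: both masks apply at once. */
            words[lwordnum] |= (lmask & umask);
            return;
        }

        words[lwordnum] |= lmask;        /* first word: bits >= BITNUM(lower) */
        for (int w = lwordnum + 1; w < uwordnum; w++)
            words[w] = ~(bitmapword) 0;  /* intermediate words: all bits */
        words[uwordnum] |= umask;        /* last word: bits <= BITNUM(upper) */
    }

    int main(void)
    {
        bitmapword words[2] = {0, 0};

        add_range(words, 3, 3);          /* single-word, single-bit case */
        add_range(words, 60, 70);        /* crosses a word boundary */
        printf("%016llx %016llx\n",
               (unsigned long long) words[0], (unsigned long long) words[1]);
        return 0;
    }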
Horiguchi-san,

Thanks for taking a look. Replying to all your emails here.

On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> In 0002, bms_add_range has a bit naive-looking loop
>
> + while (wordnum <= uwordnum)
> + {
> +     bitmapword mask = (bitmapword) ~0;
> +
> +     /* If working on the lower word, zero out bits below 'lower'. */
> +     if (wordnum == lwordnum)
> +     {
> +         int lbitnum = BITNUM(lower);
> +         mask >>= lbitnum;
> +         mask <<= lbitnum;
> +     }
> +
> +     /* Likewise, if working on the upper word, zero bits above 'upper' */
> +     if (wordnum == uwordnum)
> +     {
> +         int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +         mask <<= ushiftbits;
> +         mask >>= ushiftbits;
> +     }
> +
> +     a->words[wordnum++] |= mask;
> + }
>
> Without some aggressive optimization, the loop spends most of its time
> check-and-jumping for nothing, especially with many partitions, and is
> somewhat unintuitive.
>
> The following uses a bit tricky bitmap operation but
> is straightforward as a whole.
>
> =====
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>     a->words[wordnum++] = ~(bitmapword) 0;
>
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->words[wordnum++] |=
>     (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =====

Considering also David's comment downthread, I will try to incorporate this into bms_add_range().

> In 0003,
>
> +match_clauses_to_partkey(RelOptInfo *rel,
> ...
> + if (rinfo->pseudoconstant &&
> +     (IsA(clause, Const) &&
> +      ((((Const *) clause)->constisnull) ||
> +       !DatumGetBool(((Const *) clause)->constvalue))))
> + {
> +     *constfalse = true;
> +     continue;
>
> We call this function in both conjunction and disjunction contexts
> (the latter only in the recursive case). constfalse == true means no
> need of any clauses for the former case.
>
> Since (I think) just a list of RestrictInfo is expected to be
> treated as a conjunction (it's quite dubious, though..),

I think it makes sense to consider a list of RestrictInfo's, such as baserestrictinfo, that is passed as input to match_clauses_to_partkey(), to be mutually conjunctive for our purpose here.

> we might better call this for each subnode of a disjunction. Or,
> like match_clauses_to_index, we might better provide
> match_clause_to_partkey(rel, rinfo, contains_const), which
> returns NULL if constfalse. (I'm not self-confident on this..)

After reading your comment, I realized that it was wrong that the recursive call to match_clauses_to_partkey() passed the arguments of an OR clause all at once. That broke the assumption mentioned above that all of the clauses in the list passed to match_clauses_to_partkey() are mutually conjunctive. Instead, we must make a single-member list for each of the OR clause's arguments and pass the same. Then, if we get constfalse for all of the OR's arguments, we return constfalse=true to the original caller.

> + /*
> +  * If no commutator exists, cannot flip the qual's args,
> +  * so give up.
> +  */
> + if (!OidIsValid(expr_op))
> +     continue;
>
> I suppose it's better to keep leftop and rightop as is rather than
> flipping them over so that the var is placed on the left side. Does
> that make things so complex?
The reason to do it that way is that code placed far away (the code in partition.c that extracts constant values to use for pruning from matched clauses) can then always assume that the clauses determined to be useful for partition pruning come in the 'partkey op constant' form.

> + * It the operator happens to be '<>', which is never listed
> If?

Right, will fix.

> + if (!op_in_opfamily(expr_op, partopfamily))
> + {
> +     Oid negator = get_negator(expr_op);
> +
> +     if (!OidIsValid(negator) ||
> +         !op_in_opfamily(negator, partopfamily))
> +         continue;
>
> classify_partition_bounding_keys() checks the same thing by
> checking whether the negator's strategy is
> BTEqualStrategyNumber. (I'm not sure the operator is guaranteed
> to be of btree, though..) Don't they need to be handled in a similar
> way?

You're right. The <>'s negator may not always be a btree operator. So, we should add a check in match_clauses_to_partkey() that list or range partitioning is in use, because only those require a btree operator family. We now have hash partitioning, so we need to be careful not to make the assumption that all partitioning operators are from btree operator families. If match_clauses_to_partkey() accepts such a clause with a <> operator, then classify_partition_bounding_keys() can rely on getting the desired btree operators to implement pruning for the same.

> # In the function, "partkey strategy" and "operator strategy" are
> # confusing..

I agree it would be better to make that clear using comments. Partitioning strategy and operator strategies are intimately related. List and range partitioning related optimizations will only work if the clause operators have valid btree strategies; hash partitioning optimizations will only work if the operator in the matched clauses is a valid hash equality operator.

> + AttrNumber attno;
>
> This declaration might be better in a narrower scope.

Agreed, will move.

On 2017/11/10 14:38, Kyotaro HORIGUCHI wrote:
> Hello, this is the second part of the review.
>
> At Fri, 10 Nov 2017 12:30:00 +0900, Kyotaro HORIGUCHI wrote:
>> In 0003,
>
> In 0004,
>
> The name get_partitions_from_clauses_guts(): it seems to me that
> we usually use _internal for the recursive part of some function. (I
> have the same comment as David about the comment for
> get_partition_from_clause())

OK, will replace _guts with _internal. Looking at David's comments too.

> About the same function:
>
> Couldn't we get out in the fast path when clauses == NIL?

Actually get_partitions_for_clauses() won't be called at all if that were true. I think I should add an Assert that clauses is not NIL.

> + /* No constraints on the keys, so, return *all* partitions. */
> + result = bms_add_range(result, 0, partdesc->nparts - 1);
>
> This allows us to return immediately here. And just above this,
>
> + if (nkeys > 0 && !constfalse)
> +     result = get_partitions_for_keys(relation, &keys);
> + else if (!constfalse)
>
> Those two conditions are not orthogonal. Maybe something like
> the following seems more understandable.
>
>> if (!constfalse)
>> {
>>     /* No constraints on the keys, so, return *all* partitions. */
>>     if (nkeys == 0)
>>         return bms_add_range(result, 0, partdesc->nparts - 1);
>>
>>     result = get_partitions_for_keys(relation, &keys);
>> }

Agreed that your suggested rewrite of that portion of the code is easier to read, but we cannot return yet in the nkeys == 0 case, as you also said you found out. I quote your other reply further below.
> I'm not sure what is meant to be (formally) done here, but it
> seems to me that OrExpr is assumed to appear only at the top level of
> the clauses. So the following expression (just an example, but a
> meaningful expression of this shape must exist) is perhaps
> wrongly processed here.
>
> CREATE TABLE p (a int) PARTITION BY RANGE (a);
> CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (10);
> CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (10) TO (20);
>
> SELECT * FROM p WHERE a = 15 AND (a = 15 OR a = 5);
>
> get_partitions_for_keys() returns both c1 and c2, and or_clauses here
> still holds (a = 15 OR a = 5), so the function tries to *add*
> partitions for a = 15 and a = 5 separately.
>
> I'd like to pause here.

[ ... ]

On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
>
> This is working fine. Sorry for the bogus comment.

I'd almost started looking around if something might be wrong after all. :)

On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
>
>>> Those two conditions are not orthogonal. Maybe something like
>>> the following seems more understandable.
>>>
>>>> if (!constfalse)
>>>> {
>>>>     /* No constraints on the keys, so, return *all* partitions. */
>>>>     if (nkeys == 0)
>>>>         return bms_add_range(result, 0, partdesc->nparts - 1);
>>>>
>>>>     result = get_partitions_for_keys(relation, &keys);
>>>> }
>
> So, the condition (!constfalse && nkeys == 0) cannot return
> there. I'm badly confused by the variable name.

Do you mean by 'constfalse'?

> I couldn't find another reasonable structure using the current
> classify_p_b_keys(), but could you add a comment like the
> following as an example?

OK, will add comments explaining what's going on.

Will post the updated patches after also taking care of David's and Amul's review comments upthread.

Thanks,
Amit
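To see the corrected OR handling described above in isolation, here is a standalone toy (clause shapes, partition bounds, and names are illustrative, not patch code) in which each OR argument is pruned separately, as it would be via a single-member list, and constfalse propagates only when every argument is contradictory:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t partset;       /* bit 0 = c1 ([0,10)), bit 1 = c2 ([10,20)) */

    static partset
    prune_eq(int val, bool *constfalse)
    {
        if (val >= 0 && val < 10)  return 1u << 0;
        if (val >= 10 && val < 20) return 1u << 1;
        *constfalse = true;         /* no partition can hold this value */
        return 0;
    }

    static partset
    prune_or(const int *vals, int nvals, bool *constfalse)
    {
        partset result = 0;
        bool    all_false = true;

        for (int i = 0; i < nvals; i++)
        {
            bool arm_false = false;
            partset arm = prune_eq(vals[i], &arm_false);

            if (!arm_false)
            {
                all_false = false;
                result |= arm;      /* union across OR arms */
            }
        }
        *constfalse = all_false;    /* the OR is false only if every arm is */
        return result;
    }

    int main(void)
    {
        bool cf = false;
        int arms[] = {25, 30};      /* WHERE a = 25 OR a = 30 */
        partset s = prune_or(arms, 2, &cf);

        printf("partset=%#x constfalse=%d\n", s, (int) cf);  /* 0, constfalse=1 */
        return 0;
    }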
On 13 November 2017 at 22:46, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
>> In 0002, bms_add_range has a bit naive-looking loop
>>
>> + while (wordnum <= uwordnum)
>> + {
>> +     bitmapword mask = (bitmapword) ~0;
>> +
>> +     /* If working on the lower word, zero out bits below 'lower'. */
>> +     if (wordnum == lwordnum)
>> +     {
>> +         int lbitnum = BITNUM(lower);
>> +         mask >>= lbitnum;
>> +         mask <<= lbitnum;
>> +     }
>> +
>> +     /* Likewise, if working on the upper word, zero bits above 'upper' */
>> +     if (wordnum == uwordnum)
>> +     {
>> +         int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
>> +         mask <<= ushiftbits;
>> +         mask >>= ushiftbits;
>> +     }
>> +
>> +     a->words[wordnum++] |= mask;
>> + }
>>
>> Without some aggressive optimization, the loop spends most of its time
>> check-and-jumping for nothing, especially with many partitions, and is
>> somewhat unintuitive.
>>
>> The following uses a bit tricky bitmap operation but
>> is straightforward as a whole.
>>
>> =====
>> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
>> a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>>
>> /* fill up intermediate words */
>> while (wordnum < uwordnum)
>>     a->words[wordnum++] = ~(bitmapword) 0;
>>
>> /* fill up to BITNUM(upper) bit (0-based) of the last word */
>> a->words[wordnum++] |=
>>     (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
>> =====
>
> Considering also David's comment downthread, I will try to incorporate
> this into bms_add_range().

I've attached an implementation of the patch using this method.

I've also attached bitset.c which runs each through their paces. I'd have expected Kyotaro's method to be faster, but gcc 7.2 with -O2 generates very slightly slower code. I didn't really check why. clang seems to do a better job with it.

$ gcc -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.694254 (6.94254 ns per loop)
bms_add_range2 in 0.726643 (7.26643 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111

$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ clang -O2 bitset.c -o bitset && ./bitset
bms_add_range in 0.866554 (8.66554 ns per loop)
bms_add_range2 in 0.467138 (4.67138 ns per loop)
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
-------------
11111111111111111111111111111110
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111
11111111111111111111111111111111

$ clang --version
clang version 4.0.1-6 (tags/RELEASE_401/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Probably just go with Kyotaro's idea (v2). I don't think this is worth debating, I just wanted to show it's not that clear-cut.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David.

On 2017/11/14 13:00, David Rowley wrote:
> On 13 November 2017 at 22:46, Amit Langote wrote:
>> On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
>>> The following uses a bit tricky bitmap operation but
>>> is straightforward as a whole.
>>>
>>> =====
>>> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
>>> a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>>>
>>> /* fill up intermediate words */
>>> while (wordnum < uwordnum)
>>>     a->words[wordnum++] = ~(bitmapword) 0;
>>>
>>> /* fill up to BITNUM(upper) bit (0-based) of the last word */
>>> a->words[wordnum++] |=
>>>     (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
>>> =====
>>
>> Considering also David's comment downthread, I will try to incorporate
>> this into bms_add_range().
>
> I've attached an implementation of the patch using this method.

[ ... ]

> Probably just go with Kyotaro's idea (v2). I don't think this is worth
> debating, I just wanted to show it's not that clear-cut.

Thanks. I have incorporated the v2 patch in my local repository. I'm still working through some of the review comments and will be able to post a new version no later than tomorrow, including support for the new hash partitioning.

Thanks,
Amit
Thanks for the interesting test, David.

At Tue, 14 Nov 2017 17:00:12 +1300, David Rowley <david.rowley@2ndquadrant.com> wrote in <CAKJS1f-1Lc_b=Y1iicPQzvUgSn1keHSgmRqLuOGq_VR6M==zbw@mail.gmail.com>
> On 13 November 2017 at 22:46, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> > On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote:
> >> Without some aggressive optimization, the loop spends most of its time
> >> check-and-jumping for nothing, especially with many partitions, and is
> >> somewhat unintuitive.
> >>
> >> The following uses a bit tricky bitmap operation but
> >> is straightforward as a whole.
> >>
> >> =====
> >> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> >> a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);

'+='.. ^^;

> >> /* fill up intermediate words */
> >> while (wordnum < uwordnum)
> >>     a->words[wordnum++] = ~(bitmapword) 0;
> >>
> >> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> >> a->words[wordnum++] |=
> >>     (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> >> =====
> >
> > Considering also David's comment downthread, I will try to incorporate
> > this into bms_add_range().
>
> I've attached an implementation of the patch using this method.
>
> I've also attached bitset.c which runs each through their paces. I'd
> have expected Kyotaro's method to be faster, but gcc 7.2 with -O2
> generates very slightly slower code. I didn't really check why. clang
> seems to do a better job with it.
..
> $ gcc -O2 bitset.c -o bitset && ./bitset
> bms_add_range in 0.694254 (6.94254 ns per loop)
> bms_add_range2 in 0.726643 (7.26643 ns per loop)
..
> $ gcc --version
> gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Hmm, bms_add_range doesn't seem to be getting such aggressive optimization, but I had a similar result. Looking at the output of gcc -S, I found that bms_add_range() is inlined into main() (gcc 7.1.0). It's not surprising after finding that, but.. Anyway, I added __attribute((noinline)) to the two functions and got the following result.

> bms_add_range in 1.24 (12.4 ns per loop)
> bms_add_range2 in 0.8 (8 ns per loop)

It seems reasonable.

> $ clang -O2 bitset.c -o bitset && ./bitset
> bms_add_range in 0.866554 (8.66554 ns per loop)
> bms_add_range2 in 0.467138 (4.67138 ns per loop)
..
> $ clang --version
> clang version 4.0.1-6 (tags/RELEASE_401/final)
> Target: x86_64-pc-linux-gnu
> Thread model: posix
> InstalledDir: /usr/bin
>
> Probably just go with Kyotaro's idea (v2). I don't think this is worth
> debating, I just wanted to show it's not that clear-cut.

I agree that it's not so clear-cut.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 16 November 2017 at 15:54, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Anyway I added __attribute((noinline)) to the two
> functions and got the following result.
>
>> bms_add_range in 1.24 (12.4 ns per loop)
>> bms_add_range2 in 0.8 (8 ns per loop)

I see similar here with __attribute((noinline)). Thanks for investigating that. Your way is clearly better. Thanks for suggesting it.

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
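For readers without the bitset.c attachment, the measurement pattern under discussion looks roughly like the following standalone sketch. The function body, loop count, and names are illustrative, not the actual harness; the point is that without noinline the compiler may inline and hoist the measured body, making the comparison meaningless:

    #include <stdio.h>
    #include <time.h>

    #define NLOOPS 100000000L

    /* noinline keeps the call inside the timing loop. */
    __attribute__((noinline))
    static unsigned int fill_range(unsigned int word, int lower, int upper)
    {
        unsigned int mask = ~0u >> (31 - upper);   /* bits <= upper */
        mask &= ~0u << lower;                      /* bits >= lower */
        return word | mask;
    }

    int main(void)
    {
        volatile unsigned int sink = 0;            /* defeat dead-code elimination */
        clock_t start = clock();

        for (long i = 0; i < NLOOPS; i++)
            sink = fill_range(sink, 1, 30);

        double secs = (double) (clock() - start) / CLOCKS_PER_SEC;
        printf("fill_range in %g (%g ns per loop)\n",
               secs, secs * 1e9 / NLOOPS);
        return 0;
    }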
Hi David.

On 2017/11/08 13:44, David Rowley wrote:
> On 7 November 2017 at 01:52, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)
>
> Hi Amit,
>
> I had another look over this today. Apologies if any of the review seems petty.

Thanks a lot for the review.

> Here goes:
>
> 1. The if test seems to be testing for a child that's a partitioned table,
> but is testing for a non-NULL part_scheme.
>
> /*
>  * If childrel is itself partitioned, add it and its partitioned
>  * children to the list being propagated up to the root rel.
>  */
> if (childrel->part_scheme && rel->part_scheme)
>
> Should this code use IS_PARTITIONED_REL() instead? Seems a bit strange
> to test for a NULL part_scheme

I guess that makes sense, done.

> 2. There's a couple of mistakes in my bms_add_range() code. I've
> attached bms_add_range_fix.patch. Can you apply this to your tree?

Thanks. I have used your bms_add_range_v2.patch that you sent earlier today and listed both your and Horiguchi-san's names as authors.

> 3. This assert seems to be Asserting the same thing twice:
>
> Assert(rel->live_partitioned_rels != NIL &&
>        list_length(rel->live_partitioned_rels) > 0);
>
> A List with length == 0 is always NIL.

You're right. I changed it to:

Assert(list_length(rel->live_partitioned_rels) >= 1);

> 4. get_partitions_from_clauses(), can you comment why you perform the
> list_concat() there.
>
> I believe this is there so that the partition bound from the parent is
> passed down to the child so that we can properly eliminate all child
> partitions when the 2nd level of partitioning is using the same
> partition key as the 1st level. I think this deserves a paragraph of
> comment to explain this.

Yes, that's the intent. I implemented it as a solution to fix a problem that was reported upthread, whereby the default partition pruning didn't work as desired. I tried to explain it in the following email:

https://www.postgresql.org/message-id/8499324c-8a33-4be7-9d23-7e6a95e60ddf%40lab.ntt.co.jp

But, since I devised it as a solution to get the desired behavior for the default partition, I modified the code to include partition constraint clauses only when the table has a default partition in the first place. Doing it always is overkill. Please see the comment added nearby; I hope it helps make sense of what's going on.

> 5. Please add a comment to explain what's going on here in
> classify_partition_bounding_keys()
>
> if (partattno == 0)
> {
>     partexpr = lfirst(partexprs_item);
>     partexprs_item = lnext(partexprs_item);
> }
>
> Looks like, similar to index expressions, that partition expressions
> are attno 0 to mark to consume the next expression from the list.

Right.

> Does this need validation that there are enough partexprs_item items
> like what is done in get_range_key_properties()? Or is this validated
> somewhere else?

Yeah, I added the check as follows:

+ /* Set partexpr if needed. */
  if (partattno == 0)
  {
+     if (partexprs_item == NULL)
+         elog(ERROR, "wrong number of partition key expressions");
      partexpr = lfirst(partexprs_item);
      partexprs_item = lnext(partexprs_item);

> 6. Comment claims the if test will test something which it does not
> seem to test for:
>
> /*
>  * Redundant key elimination using btree-semantics based tricks.
>  *
>  * Only list and range partitioning use btree operator semantics, so
>  * skip otherwise.
>  * Also, if there are expressions whose value is yet
>  * unknown, skip this step, because we need to compare actual values
>  * below.
>  */
> memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
> if (partkey->strategy == PARTITION_STRATEGY_LIST ||
>     partkey->strategy == PARTITION_STRATEGY_RANGE)
>
> I was expecting this to be skipped when the clauses contained a
> non-const, but it does not seem to.

Fixed the comment. Actually we might end up with non-consts here if the executor invokes it, so the downstream code is in a position to handle them, skipping any optimizations that depend on constant values being available. There are actually even cases where the planner wouldn't mind calling here even if the matched clauses contained non-const operands, as long as there are at least some constants available.

> 7. Should be "compare them"
>
> /*
>  * If the leftarg and rightarg clauses' constants are both of the type
>  * expected by "op" clause's operator, then compare then using the
>  * latter's comparison function.
>  */
>
> But if I look at the code "compare then using the latter's comparison
> function." is not true, it seems to use op's comparison function not
> rightarg's. With most of the calls op and rightarg are the same, but
> not all of them. The function shouldn't make that assumption even if
> the args op was always the same as rightarg.

Rearranged the code in partition_cmp_args() a bit and added more clarifying comments.

> 8. remove_redundant_clauses() needs an overview comment of what the
> function does.

Done.

> 9. The comment should explain what we would do in the case of key < 3
> AND key <= 2 using some examples.
>
> /* try to keep only one of <, <= */

Done.

> 10. Wondering why this loop runs backward?
>
> for (s = BTMaxStrategyNumber; --s >= 0;)
>
> Why not just:
>
> for (s = 0; s < BTMaxStrategyNumber; s++)
>
> I can't see a special reason for it to run backward. It seems unusual,
> but if there's a good reason that I've failed to realise then it's
> maybe worth a comment.

Hmm, no special reason. So, done the other way. I actually brought this redundant key elimination logic over from nbtutils.c's _bt_preprocess_keys(), and the loop runs that way over there.

> 11. Please comment on why *constfalse = true is set here:
>
> if (!chk || s == (BTEqualStrategyNumber - 1))
>     continue;
>
> if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
>                        &test_result))
> {
>     if (!test_result)
>     {
>         *constfalse = true;
>         return;
>     }
>     /* discard the redundant key. */
>     xform[s] = NULL;
> }
>
> Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Right. Added a comment with an example.

> Also please add a comment when discarding the redundant key, maybe
> explaining that equality is more useful than the other strategies when
> there's an overlap.

Done, too.

Please find attached the updated patch set. There are significant changes in this version as described below, including the support for hash partitioned tables.

Earlier today, I reported [1] what looks to me like a bug in how the default partition's constraint gets generated and how that sometimes makes constraint exclusion mistakenly prune a default partition. I have included the patches I posted there in this series too. They are ahead in the list. So the attached patches are now as follows:

0001-Add-default-partition-case-in-inheritance-testing.patch
0002-Tweak-default-range-partition-s-constraint-a-little.patch

Patches at [1].

0003-Add-new-tests-for-partition-pruning.patch

Tests. Mostly unchanged from the last version.
0004-Add-a-bms_add_range.patch

Uses the bms_add_range_v2 patch.

0005-Planner-side-changes-for-partition-pruning.patch

Fixed some issues with how OR clauses were matched, and the code for checking whether the operator is partitioning-compatible is no longer in the planner code; instead it's now only in partition.c. Other cosmetic improvements, including those that resulted from the review comments.

0006-Implement-get_partitions_from_clauses.patch

Several bug fixes and many cosmetic improvements, including improved commentary. Per Amul's comment upthread, the patch no longer depends on using the value -1 to denote an invalid value of the NullTestType enum. In fact, it doesn't use NullTestType variables at all.

0007-Some-interface-changes-for-partition_bound_-cmp-bsea.patch

No changes.

0008-Implement-get_partitions_for_keys.patch

Support for hash partitioning and tests for the same. Also, since update/delete on partitioned tables still depend on constraint exclusion for pruning, fix things such that get_relation_constraints includes partition constraints in its result only for non-select queries (for selects we have the new pruning code). Other bug fixes.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/ba7aaeb1-4399-220e-70b4-62eade1522d0%40lab.ntt.co.jp
Attachment
- 0001-Add-default-partition-case-in-inheritance-testing.patch
- 0002-Tweak-default-range-partition-s-constraint-a-little.patch
- 0003-Add-new-tests-for-partition-pruning-v12.patch
- 0004-Add-a-bms_add_range-v12.patch
- 0005-Planner-side-changes-for-partition-pruning-v12.patch
- 0006-Implement-get_partitions_from_clauses-v12.patch
- 0007-Some-interface-changes-for-partition_bound_-cmp-bsea-v12.patch
- 0008-Implement-get_partitions_for_keys-v12.patch
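As a side note on the remove_redundant_clauses() behavior discussed in the review above (keeping only the tightest of < and <=, and treating key = 1 AND key > 1 as constant-false), here is a tiny standalone model of those two decisions; all names and values are illustrative, not patch code:

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        /* key < 3 AND key <= 2: "key <= 2" is the tighter bound, since
         * every integer satisfying key <= 2 also satisfies key < 3. */
        int lt = 3, le = 2;
        bool keep_le = (le < lt);
        printf("keep key <= %d, drop key < %d: %s\n",
               le, lt, keep_le ? "yes" : "no");

        /* key = 1 AND key > 1: the equality contradicts the strict lower
         * bound, so the whole condition is constant-false and no
         * partition survives. */
        int eq = 1, gt = 1;
        bool constfalse = !(eq > gt);
        printf("key = %d AND key > %d is %s\n", eq, gt,
               constfalse ? "contradictory (constfalse)" : "satisfiable");
        return 0;
    }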
On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached the updated patch set. There are significant changes in
> this version as described below, including the support for hash
> partitioned tables.
Hi Amit,
Thanks for making those changes and adding the HASH partition support.
There's a good chance that I'm not going to get time to look at this until maybe the last day of the month. I hope someone else can look over it in the meantime.
Hi David.

On 2017/11/17 21:44, David Rowley wrote:
> On 17 November 2017 at 23:01, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
>> Please find attached the updated patch set. There are significant changes in
>> this version as described below, including the support for hash
>> partitioned tables.
>
> Hi Amit,
>
> Thanks for making those changes and adding the HASH partition support.
>
> There's a good chance that I'm not going to get time to look at this
> until maybe the last day of the month. I hope someone else can look over
> it in the meantime.

No problem. Thanks for the review so far, and I look forward to the next time you'll be able to comment.

Regards,
Amit
Thank you, and sorry for the confused comments.

At Mon, 13 Nov 2017 18:46:28 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <8460a6c3-68c5-b78a-7d18-d253180f2188@lab.ntt.co.jp>
> Horiguchi-san,
>
> Thanks for taking a look. Replying to all your emails here.
>
> > In 0003,
> >
> > +match_clauses_to_partkey(RelOptInfo *rel,
> > ...
> > + if (rinfo->pseudoconstant &&
> > +     (IsA(clause, Const) &&
> > +      ((((Const *) clause)->constisnull) ||
> > +       !DatumGetBool(((Const *) clause)->constvalue))))
> > + {
> > +     *constfalse = true;
> > +     continue;
> >
> > We call this function in both conjunction and disjunction
> > contexts (the latter only in the recursive case). constfalse ==
> > true means no need of any clauses for the former case.
> >
> > Since (I think) just a list of RestrictInfo is expected to be
> > treated as a conjunction (it's quite dubious, though..),
>
> I think it makes sense to consider a list of RestrictInfo's, such as
> baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
> to be mutually conjunctive for our purpose here.

You're right, and I know it. I'm ok to leave it since I recalled that clause_selectivity always has similar code.

> On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
> >
> > This is working fine. Sorry for the bogus comment.
>
> I'd almost started looking around if something might be wrong after all. :)

Very sorry for the wrong comment :(

> On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
> >
> >>> Those two conditions are not orthogonal. Maybe something like
> >>> the following seems more understandable.
> >>>
> >>>> if (!constfalse)
> >>>> {
> >>>>     /* No constraints on the keys, so, return *all* partitions. */
> >>>>     if (nkeys == 0)
> >>>>         return bms_add_range(result, 0, partdesc->nparts - 1);
> >>>>
> >>>>     result = get_partitions_for_keys(relation, &keys);
> >>>> }
> >
> > So, the condition (!constfalse && nkeys == 0) cannot return
> > there. I'm badly confused by the variable name.
>
> Do you mean by 'constfalse'?

Perhaps. The name "constfalse" suggests (to me) that the clauses evaluate to false constantly. But actually it means just that the clauses not returned in or_clauses resulted in false. Anyway, I'll take a look at v12 and comment then.

> > I couldn't find another reasonable structure using the current
> > classify_p_b_keys(), but could you add a comment like the
> > following as an example?
>
> OK, will add comments explaining what's going on.
>
> Will post the updated patches after also taking care of David's and Amul's
> review comments upthread.
>
> Thanks,
> Amit

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Nov 17, 2017 at 3:31 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Support for hash partitioning and tests for the same. Also, since
> update/delete on partitioned tables still depend on constraint exclusion
> for pruning, fix things such that get_relation_constraints includes
> partition constraints in its result only for non-select queries (for
> selects we have the new pruning code). Other bug fixes.
Hi Amit,
I have applied the attached patch set on commit 11e264517dff7a911d9e6494de86049cab42cde3 and tried to test hash partitioning. I got a server crash with the test below.

CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (c);
CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 1) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 2);
insert into hp_tbl select i,i,i from generate_series(0,10)i where i not in(2,4,6,7,10);
explain select * from hp_tbl where a = 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Thanks Rajkumar for the test.

On 2017/11/21 19:06, Rajkumar Raghuwanshi wrote:
> CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
> CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus 4, remainder 0) PARTITION BY HASH (c);
> CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH (modulus 4, remainder 0);
> CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4, remainder 1) PARTITION BY HASH (b);
> CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 1);
> CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus 4, remainder 2);
> insert into hp_tbl select i,i,i from generate_series(0,10)i where i not in(2,4,6,7,10);
>
> explain select * from hp_tbl where a = 2;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>

It seems I wrote an Assert in the code to support hash partitioning that wasn't based on a valid assumption. I was wrongly assuming that all hash partitions for a given modulus (the largest modulus) must exist at any given time, but that isn't the case.

Fixed in the attached. No other changes besides that.

Thanks,
Amit
Attachment
- 0001-Add-default-partition-case-in-inheritance-testing.patch
- 0002-Tweak-default-range-partition-s-constraint-a-little.patch
- 0003-Add-new-tests-for-partition-pruning-v13.patch
- 0004-Add-a-bms_add_range-v13.patch
- 0005-Planner-side-changes-for-partition-pruning-v13.patch
- 0006-Implement-get_partitions_from_clauses-v13.patch
- 0007-Some-interface-changes-for-partition_bound_-cmp-bsea-v13.patch
- 0008-Implement-get_partitions_for_keys-v13.patch
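To illustrate the invalid assumption behind the fixed Assert: in the test above, hp_tbl_p2 has children only for remainders 1 and 2 under modulus 4, so the dispatch array for the greatest modulus has unoccupied slots, and code that looks a remainder up must tolerate a missing partition rather than Assert it away. A toy standalone model (the array is illustrative, not the actual PartitionBoundInfo layout):

    #include <stdio.h>

    int main(void)
    {
        /* Partitions exist only for remainders 1 and 2 of modulus 4. */
        int indexes[4] = {-1, 0, 1, -1};
        int hashval = 3;                 /* pretend hash of the scanned value */
        int partidx = indexes[hashval % 4];

        if (partidx < 0)
            printf("no partition for remainder %d: prune everything\n",
                   hashval % 4);
        else
            printf("partition %d\n", partidx);
        return 0;
    }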
Hi Amit,

On 11/22/2017 03:59 AM, Amit Langote wrote:
> Fixed in the attached. No other changes besides that.

I have been using the following script to look at the patch.

-- test.sql --
CREATE TABLE t1 (
  a integer NOT NULL,
  b integer NOT NULL
) PARTITION BY HASH (b);
CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 3);
CREATE INDEX idx_t1_b_a_p00 ON t1_p00 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p01 ON t1_p01 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p02 ON t1_p02 USING btree (b, a);
CREATE INDEX idx_t1_b_a_p03 ON t1_p03 USING btree (b, a);

CREATE TABLE t2 (
  c integer NOT NULL,
  d integer NOT NULL
) PARTITION BY HASH (d);
CREATE TABLE t2_p00 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE t2_p01 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE t2_p02 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE t2_p03 PARTITION OF t2 FOR VALUES WITH (MODULUS 4, REMAINDER 3);
CREATE INDEX idx_t2_c_p00 ON t2_p00 USING btree (c);
CREATE INDEX idx_t2_c_p01 ON t2_p01 USING btree (c);
CREATE INDEX idx_t2_c_p02 ON t2_p02 USING btree (c);
CREATE INDEX idx_t2_c_p03 ON t2_p03 USING btree (c);
CREATE INDEX idx_t2_d_p00 ON t2_p00 USING btree (d);
CREATE INDEX idx_t2_d_p01 ON t2_p01 USING btree (d);
CREATE INDEX idx_t2_d_p02 ON t2_p02 USING btree (d);
CREATE INDEX idx_t2_d_p03 ON t2_p03 USING btree (d);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 10000) AS i);
INSERT INTO t2 (SELECT i, i FROM generate_series(1, 10000) AS i);
ANALYZE;

EXPLAIN (ANALYZE) SELECT t1.a, t1.b FROM t1 WHERE t1.b = 1;
EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;

BEGIN;
EXPLAIN (ANALYZE) UPDATE t1 SET a = 1 WHERE b = 1;
ROLLBACK;

BEGIN;
EXPLAIN (ANALYZE) DELETE FROM t1 WHERE b = 1;
ROLLBACK;
-- test.sql --

I just wanted to highlight that the "JOIN ON" partition isn't pruned - the "WHERE" one is.

Should pruning of partitions for UPDATEs (where the partition key isn't updated) and DELETEs be added to the TODO list?

Thanks for working on this!

Best regards,
Jesper
Thanks Jesper.

On 2017/11/23 3:56, Jesper Pedersen wrote:
> Hi Amit,
>
> On 11/22/2017 03:59 AM, Amit Langote wrote:
>> Fixed in the attached. No other changes besides that.
>
> I have been using the following script to look at the patch
>
> -- test.sql --
[ ... ]
> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
> t2.c = t1.b WHERE t2.d = 1;
>
> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
> "WHERE" one is.

Did you mean to write ON t2.d = t1.b? If so, the equivalence class mechanism will give rise to a t1.b = 1 and hence help prune t1's partitions as well:

EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON t2.d = t1.b WHERE t2.d = 1;
                      QUERY PLAN
-----------------------------------------------------------
 Nested Loop
   ->  Append
         ->  Bitmap Heap Scan on t1_p00
               Recheck Cond: (b = 1)
               ->  Bitmap Index Scan on idx_t1_b_a_p00
                     Index Cond: (b = 1)
   ->  Materialize
         ->  Append
               ->  Bitmap Heap Scan on t2_p00
                     Recheck Cond: (d = 1)
                     ->  Bitmap Index Scan on idx_t2_d_p00
                           Index Cond: (d = 1)

In your original query, you use ON t2.c = t1.b, whereby there is no "constant" value to perform partition pruning with. t2.c is unknown until the join actually executes.

> BEGIN;
> EXPLAIN (ANALYZE) UPDATE t1 SET a = 1 WHERE b = 1;
> ROLLBACK;
>
> BEGIN;
> EXPLAIN (ANALYZE) DELETE FROM t1 WHERE b = 1;
> ROLLBACK;
>
> Should pruning of partitions for UPDATEs (where the partition key isn't
> updated) and DELETEs be added to the TODO list?

Note that partition pruning *does* work for UPDATE and DELETE, but only if you use list/range partitioning. The reason it doesn't work in this case (t1 is hash partitioned) is that the pruning is still based on constraint exclusion in the UPDATE/DELETE case, and constraint exclusion cannot handle hash partitioning. See the example below that uses list partitioning:

drop table t1, t2;
create table t1 (a int, b int) partition by list (b);
create table t1_p0 partition of t1 for values in (0);
create table t1_p1 partition of t1 for values in (1);
create table t1_p2 partition of t1 for values in (2);
create table t1_p3 partition of t1 for values in (3);
create table t2 (c int, d int) partition by list (d);
create table t2_p0 partition of t2 for values in (0);
create table t2_p1 partition of t2 for values in (1);
create table t2_p2 partition of t2 for values in (2);
create table t2_p3 partition of t2 for values in (3);

explain (costs off) update t1 set a = 1 where b = 1;
         QUERY PLAN
----------------------------
 Update on t1
   Update on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)

explain (costs off) delete from t1 where b = 1;
         QUERY PLAN
----------------------------
 Delete on t1
   Delete on t1_p1
   ->  Seq Scan on t1_p1
         Filter: (b = 1)
(4 rows)

I can see how that seems a bit odd. If you use hash partitioning, UPDATE/DELETE do not benefit from partition pruning, even though SELECT does. That's because SELECT uses the new partition-pruning method (this patch set), which supports hash partitioning, whereas UPDATE and DELETE use constraint exclusion, which doesn't. It would be a good idea to make even UPDATE and DELETE use the new method, thus bringing everyone onto the same page, but that requires us to make some pretty non-trivial changes to how UPDATE/DELETE planning works for inheritance/partitioned tables, which we should undertake separately, imho.

Thanks,
Amit
Hi Amit,

On 11/24/2017 12:00 AM, Amit Langote wrote:
>> On 2017/11/23 3:56, Jesper Pedersen wrote:
>> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
>> t2.c = t1.b WHERE t2.d = 1;
>>
>> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
>> "WHERE" one is.
>
> Did you mean to write ON t2.d = t1.b? If so, the equivalence class mechanism
> will give rise to a t1.b = 1 and hence help prune t1's partitions as well:

No, I meant 't2.c = t1.b'. If you take the same example, but don't partition, you will get the following plan:

test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;
                  QUERY PLAN
----------------------------------------------
 Nested Loop
   ->  Index Scan using idx_t2_d on t2
         Index Cond: (d = 1)
   ->  Index Only Scan using idx_t1_b_a on t1
         Index Cond: (b = t2.c)
(5 rows)

Maybe "5.10.2. Declarative Partitioning" could be expanded to include some general "guidelines" on where partition-based plans should be checked against their non-partitioned counterparts (at least the first bullet in 5.10.1 says ".. in certain situations .."). Probably a separate patch from this.

[snip]

>> Should pruning of partitions for UPDATEs (where the partition key isn't
>> updated) and DELETEs be added to the TODO list?
>
> Note that partition pruning *does* work for UPDATE and DELETE, but only if
> you use list/range partitioning. The reason it doesn't work in this case
> (t1 is hash partitioned) is that the pruning is still based on constraint
> exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
> hash partitioning.

Thanks for your description.

> I can see how that seems a bit odd. If you use hash partitioning,
> UPDATE/DELETE do not benefit from partition pruning, even though SELECT
> does. That's because SELECT uses the new partition-pruning method (this
> patch set) which supports hash partitioning, whereas UPDATE and DELETE use
> constraint exclusion which doesn't. It would be a good idea to make even
> UPDATE and DELETE use the new method, thus bringing everyone onto the same
> page, but that requires us to make some pretty non-trivial changes to how
> UPDATE/DELETE planning works for inheritance/partitioned tables, which we
> should undertake separately, imho.

Agreed.

Best regards,
Jesper
Hello,

At Wed, 22 Nov 2017 17:59:48 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <df609168-b7fd-4c0b-e9b2-6e398d411e27@lab.ntt.co.jp>
> Thanks Rajkumar for the test.
>
> On 2017/11/21 19:06, Rajkumar Raghuwanshi wrote:
> > explain select * from hp_tbl where a = 2;
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Failed.
> > !>
>
> It seems I wrote an Assert in the code to support hash partitioning that
> wasn't based on a valid assumption. I was wrongly assuming that all hash
> partitions for a given modulus (the largest modulus) must exist at any given
> time, but that isn't the case.
>
> Fixed in the attached. No other changes besides that.

0001 and 0002 are under discussion with Robert in another thread. I don't have a comment on 0003 and 0004.

0005:

get_partitions_from_clauses is written as _using_ in its comment. (Also, get_partitions_from_clauses_recurse is _guts in its comment.)

get_append_rel_partitions just returns NIL if constfalse. I suppose we'd better reduce the indentation level here. get_partitions_from_clauses_recurse in 0006 does the same thing.

In the same function, there's an else clause separated from the then clause by a multiline comment. It seems better that the else clause has braces and the comment is in the braces, like the following.

> else
> {
>     /*
>      * Else there are no clauses....
>      */
>     partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1);
> }

0006:

In get_partitions_from_clauses_recurse, the following comment seems confusing.

+ /*
+  * The analysis of the matched clauses done by
+  * classify_partition_bounding_keys may have found mutually contradictory
+  * clauses.
+  */

constfalse = true also when the clause was just one false pseudo-constant restrictinfo.

+ if (!constfalse)
+ {
+     /*
+      * If all clauses in the list were OR clauses,
+      * classify_partition_bounding_keys() wouldn't have formed keys
+      * yet. They will be handled below by recursively calling this
+      * function for each of OR clauses' arguments and combining the
+      * resulting partition sets appropriately.
+      */
+     if (nkeys > 0)

classify_p_b_keys() returns zero not only when the list is all OR clauses (all AND clauses with volatile functions also return zero).

+ /* Set partexpr if needed. */
+ if (partattno == 0)

Could you add a description about the meaning of 0 to the comment of PartitionKeyData, something like this?

| AttrNumber *partattrs; /* attribute numbers of columns in the
|                         * partition key. 0 means partexpression */

+ #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
+     ((IsA((expr), Var) &&\
+       ((Var *) (expr))->varattno == (partattno)) ||\
+      equal((expr), (partexpr)))
...
+ if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr))

partattno = 0 has a different meaning than ordinary attnos. I believe that the leftop cannot be a whole-row var, but I suppose we should make it clear. Likewise, even if it doesn't actually happen, it formally has a chance to make a false match, since partexpr is not cleared when partattno > 0. EXPR_MATCHES_PARTKEY might better be something like the following.

| #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \
|     ((partattno) != 0 ? \
|      (IsA((expr), Var) && ((Var *) (expr))->varattno == (partattno)) :\
|      equal((expr), (partexpr)))

+ if (!op_in_opfamily(opclause->opno, partopfamily))
+ {
...
+ negator = get_negator(opclause->opno);
+ if (OidIsValid(negator) &&
+     op_in_opfamily(negator, partopfamily))
+ {
+     get_op_opfamily_properties(negator, partopfamily,
+                                false,
+                                &strategy,
+                                &lefttype, &righttype);
+     if (strategy == BTEqualStrategyNumber)

This seems to me to rely a bit too much on the specific relationship of the access methods' properties. Isn't it reasonable to add a check that partkey->strategy != 'h' before getting the negator?

+ commuted->opno = get_commutator(opclause->opno);

I'm afraid that get_commutator can return InvalidOid for user-defined types, or with a user-defined operator class, or perhaps for other reasons uncertain to me. match_clauses_to_partkey is checking that.

+ else if (IsA(clause, ScalarArrayOpExpr))

I'm not sure what to do with a multidimensional ArrayExpr, but ->multidims is checked in some places.

+ ParseState *pstate = make_parsestate(NULL);

make_parsestate mandates that the caller free it with free_parsestate(). It doesn't seem to leak anything in the context, and I saw the same thing at other places, but it would be better to follow that if possible, or add some apology as a comment.. (or update the comment of make_parsestate?)

+ * If the leftarg_const and rightarg_consr are both of the type expected

rightarg_consr -> const

+ if (partition_cmp_args(partkey, partattoff,
+                        le, lt, le,
+                        &test_result))

+ if (partition_cmp_args(partkey, partattoff, ge, gt, ge,
+                        &test_result))

Please unify the style.

+ * Boolean conditions have a special shape, which would've been
+ * accepted if the partitioning opfamily accepts Boolean
+ * conditions.

I noticed that bare true and false are not accepted by the values list of the CREATE TABLE syntax. This is not a comment on this patch, but is that intentional?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Jesper.

On 2017/11/28 3:30, Jesper Pedersen wrote:
> Hi Amit,
>
> On 11/24/2017 12:00 AM, Amit Langote wrote:
>>> On 2017/11/23 3:56, Jesper Pedersen wrote:
>>> EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
>>> t2.c = t1.b WHERE t2.d = 1;
>>>
>>> I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
>>> "WHERE" one is.
>>
>> Did you mean to write ON t2.d = t1.b? If so, the equivalence class mechanism
>> will give rise to a t1.b = 1 and hence help prune t1's partitions as well:
>
> No, I meant 't2.c = t1.b'. If you take the same example, but don't
> partition, you will get the following plan:
>
> test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER
> JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;
>                   QUERY PLAN
> ----------------------------------------------
>  Nested Loop
>    ->  Index Scan using idx_t2_d on t2
>          Index Cond: (d = 1)
>    ->  Index Only Scan using idx_t1_b_a on t1
>          Index Cond: (b = t2.c)
> (5 rows)

So it appears to me that you're pointing out the inner Index Only Scan on t1, which is a lot better than scanning all of t1 on every loop iteration. As you might know, we can't exactly have the index scan on the partitioned table (that is, the parent table itself), because there wouldn't be any index on it. However, the planner is smart enough to push the clause down to the partitions (leaf tables), which may have the index, and hence an index scan could be chosen for them. But note that the planner will have chosen *all* partitions, because there is no constant value to prune partitions with at that point.

If we get run-time pruning [1], we can get close to what happens in the non-partitioned case. In this case, since t1.b of t2.c = t1.b is the partition key of t1, we will make an Append node with run-time pruning enabled. On every loop iteration, t2.c's value will be used to prune useless partitions, which in most cases will leave us scanning just one partition, possibly with an Index Only Scan using the partition's index.

> Maybe "5.10.2. Declarative Partitioning" could be expanded to include some
> general "guidelines" on where partition-based plans should be checked
> against their non-partitioned counterparts (at least the first bullet in
> 5.10.1 says ".. in certain situations .."). Probably a separate patch from
> this.

I agree about shedding more light on that in the documentation. I will try to write up a patch someday.

Thanks,
Amit

[1] https://commitfest.postgresql.org/15/1330/
On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> It seems I wrote an Assert in the code to support hash partitioning that
> wasn't based on a valid assumption. I was wrongly assuming that all hash
> partitions for a given modulus (largest modulus) must exist at any given
> time, but that isn't the case.

Committed 0003 with some adjustments:

* Renamed the new test to partition_prune.
* Moved the test to what I thought was a better place in the schedule file, and made it consistent between serial_schedule and parallel_schedule.
* commutates -> commuted
* removed wrong /* empty */ comment
* Updated expected output. It surprised me a bit that the tests weren't passing as you had them, but the differences I got - all related to mc3p_default - seemed correct to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 29, 2017 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems I wrote an Assert in the code to support hash partitioning that
>> wasn't based on a valid assumption. I was wrongly assuming that all hash
>> partitions for a given modulus (largest modulus) must exist at any given
>> time, but that isn't the case.
>
> Committed 0003 with some adjustments:
>
> * Renamed the new test to partition_prune.
> * Moved the test to what I thought was a better place in the schedule
>   file, and made it consistent between serial_schedule and
>   parallel_schedule.
> * commutates -> commuted
> * removed wrong /* empty */ comment
> * Updated expected output. It surprised me a bit that the tests
>   weren't passing as you had them, but the differences I got - all
>   related to mc3p_default - seemed correct to me.

Committed 0004 after reviewing the code and testing that it seems to work as advertised.

0005 looks like it might need to be split into smaller patches. More broadly, the commit messages you wrote for 0005, 0006, and 0008 don't seem to me to do a great job explaining the motivation for the changes which they make. They tell me what the patches do, but not why they are doing it. If there's an email in this thread that explains that stuff, please point me to it and I'll go back and reread it more carefully; if not, I think I definitely need some more explanation both of the mission of each patch and the reason why the patch set is divided up in the way that it is.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017/11/30 5:28, Robert Haas wrote:
> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> It seems I wrote an Assert in the code to support hash partitioning that
>> wasn't based on a valid assumption. I was wrongly assuming that all hash
>> partitions for a given modulus (largest modulus) must exist at any given
>> time, but that isn't the case.
>
> Committed 0003 with some adjustments:
>
> * Renamed the new test to partition_prune.
> * Moved the test to what I thought was a better place in the schedule
>   file, and made it consistent between serial_schedule and
>   parallel_schedule.
> * commutates -> commuted
> * removed wrong /* empty */ comment

Thanks a lot.

> * Updated expected output. It surprised me a bit that the tests
>   weren't passing as you had them, but the differences I got - all
>   related to mc3p_default - seemed correct to me.

Yeah, that one I too noticed yesterday while rebasing.

Thanks,
Amit
On 2017/11/30 7:15, Robert Haas wrote:
> On Wed, Nov 29, 2017 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 22, 2017 at 3:59 AM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> It seems I wrote an Assert in the code to support hash partitioning that
>>> wasn't based on a valid assumption. I was wrongly assuming that all hash
>>> partitions for a given modulus (largest modulus) must exist at any given
>>> time, but that isn't the case.
>>
>> Committed 0003 with some adjustments:
>>
>> * Renamed the new test to partition_prune.
>> * Moved the test to what I thought was a better place in the schedule
>>   file, and made it consistent between serial_schedule and
>>   parallel_schedule.
>> * commutates -> commuted
>> * removed wrong /* empty */ comment
>> * Updated expected output. It surprised me a bit that the tests
>>   weren't passing as you had them, but the differences I got - all
>>   related to mc3p_default - seemed correct to me.
>
> Committed 0004 after reviewing the code and testing that it seems to
> work as advertised.

Thank you.

> 0005 looks like it might need to be split into smaller patches. More
> broadly, the commit messages you wrote for 0005, 0006, and 0008
> don't seem to me to do a great job explaining the motivation for the
> changes which they make. They tell me what the patches do, but not
> why they are doing it. If there's an email in this thread that
> explains that stuff, please point me to it and I'll go back and reread
> it more carefully; if not, I think I definitely need some more
> explanation both of the mission of each patch and the reason why the
> patch set is divided up in the way that it is.

I'm working on a revised version of these patches to address recent comments by Horiguchi-san. I will also consider the points above before sending the new version.

Thanks,
Amit
On Thu, Nov 30, 2017 at 10:43 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I'm working on a revised version of these patches to address recent > comments by Horiguchi-san. I will also consider the points above before > sending the new version. Ok, this is fresh news, so I am moving this entry to next CF with waiting on author as status. -- Michael
On 2017/11/30 11:18, Michael Paquier wrote: > On Thu, Nov 30, 2017 at 10:43 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I'm working on a revised version of these patches to address recent >> comments by Horiguchi-san. I will also consider the points above before >> sending the new version. > > Ok, this is fresh news, so I am moving this entry to next CF with > waiting on author as status. That's correct, thanks. Regards, Amit
On 30 November 2017 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote: > Committed 0004 after reviewing the code and testing that it seems to > work as advertised. > > 0005 looks like it might need to be split into smaller patches. More > broadly, the commit messages you wrote for 0005, 0006, and 0008 > don't seem to me to do a great job explaining the motivation for the > changes which they make. They tell me what the patches do, but not > why they are doing it. If there's an email in this thread that > explains that stuff, please point me to it and I'll go back and reread > it more carefully; if not, I think I definitely need some more > explanation both of the mission of each patch and the reason why the > patch set is divided up in the way that it is. Hi Amit, It looks like just 0005 to 0008 remain of this and I see that the v13 0005 patch no longer applies to current master. Are you working on splitting this up as requested by Robert above? I can continue reviewing this once patches are available that apply to current master. Many thanks -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David. On 2017/12/07 19:48, David Rowley wrote: > On 30 November 2017 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote: >> Committed 0004 after reviewing the code and testing that it seems to >> work as advertised. >> >> 0005 looks like it might need to be split into smaller patches. More >> broadly, the commit messages you wrote for 0005, 0006, and 0008 >> don't seem to me to do a great job explaining the motivation for the >> changes which they make. They tell me what the patches do, but not >> why they are doing it. If there's an email in this thread that >> explains that stuff, please point me to it and I'll go back and reread >> it more carefully; if not, I think I definitely need some more >> explanation both of the mission of each patch and the reason why the >> patch set is divided up in the way that it is. > > Hi Amit, > > It looks like just 0005 to 0008 remain of this and I see that the v13 > 0005 patch no longer applies to current master. > > Are you working on splitting this up as requested by Robert above? > > I can continue reviewing this once patches are available that apply to > current master. I'm still working on that. I will be able to submit a new version sometime early next week, that is, if I don't manage to do it by this evening (Japan time). Sorry that it's taking a bit longer. Thanks, Amit
Horiguchi-san, Thanks a lot for the review and sorry it took me a while to reply. On 2017/11/28 20:39, Kyotaro HORIGUCHI wrote: > At Wed, 22 Nov 2017 17:59:48 +0900, Amit Langote wrote: > 0001 and 0002 are under discussion with Robert in another thread. And now committed [1]. > I don't have a comment on 0003, 0004. > > 0005: > > get_partitions_from_clauses is written as _using_ in its > comment. (also get_partitions_from_clauses_recurse is _guts in > its comment.) Fixed both. > get_append_rel_partitions just returns NIL if constfalse. I > suppose we'd better reduce the indentation level > here. get_partitions_from_clauses_recurse in 0006 does the same > thing. Less indentation sounds good to me too, so fixed. > In the same function, there's an else clause separated from the then > clause by a multiline comment. It seems better that the else > clause has braces and the comment is in the braces like the > following. > > > else > > { > > /* > > * Else there are no clauses.... > > */ > > partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1); > > } Done that way. > 0006: > > In get_partitions_from_clauses_recurse, the following comment > seems confusing. > > + /* > + * The analysis of the matched clauses done by > + * classify_partition_bounding_keys may have found mutually contradictory > + * clauses. > + */ > > constfalse = true also when the clause was just one false pseudo > constant restrictinfo. Updated the comment like this: /* * classify_partition_bounding_keys() may have found clauses marked * pseudo-constant that are false that the planner didn't or it may have * itself found contradictions among clauses. */ > + if (!constfalse) > + { > + /* > + * If all clauses in the list were OR clauses, > + * classify_partition_bounding_keys() wouldn't have formed keys > + * yet. They will be handled below by recursively calling this > + * function for each of OR clauses' arguments and combining the > + * resulting partition sets appropriately. > + */ > + if (nkeys > 0) > > classify_p_b_keys() returns zero also when it is not only all-OR > clauses (all AND clauses with volatile functions also return > zero). Hmm, if all AND clauses contained volatile functions, then the planner wouldn't have called here at all. Also, any clauses it passes to get_partitions_from_clauses() are known to be OK to use for pruning. > + /* Set partexpr if needed. */ > + if (partattno == 0) > > Could you add a description about the meaning of 0 to the > comment of PartitionKeyData something like this? Sure, done. > > | AttrNumber *partattrs; /* attribute numbers of columns in the > | * partition key. 0 means partexpression */ > > + #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \ > + ((IsA((expr), Var) &&\ > + ((Var *) (expr))->varattno == (partattno)) ||\ > + equal((expr), (partexpr))) > ... > + if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) > > partattno = 0 has a different meaning than ordinary attnos. > I believe that the leftop cannot be a whole row var, but I > suppose we should make it clear. Likewise, even if it doesn't > actually happen, it formally has a chance to make a false > match since partexpr is not cleared when partattno > 0. > EXPR_MATCHES_PARTKEY might better be something like the following. > > | #define EXPR_MATCHES_PARTKEY(expr, partattno, partexpr) \ > | ((partattno) != 0 ? \ > | (IsA((expr), Var) && ((Var *) (expr))->varattno == (partattno)) :\ > | equal((expr), (partexpr))) That's better, fixed. > + if (!op_in_opfamily(opclause->opno, partopfamily)) > + { > ...
> + negator = get_negator(opclause->opno); > + if (OidIsValid(negator) && > + op_in_opfamily(negator, partopfamily)) > + { > + get_op_opfamily_properties(negator, partopfamily, > + false, > + &strategy, > + &lefttype, &righttype); > + if (strategy == BTEqualStrategyNumber) > > This seems to me to be a bit too much relying on the specific > relationship of the access methods' property. Isn't it > reasonable to add a check that partkey->strategy != 'h' > before getting the negator? > > + commuted->opno = get_commutator(opclause->opno); > > I'm afraid that get_commutator can return InvalidOid for > user-defined types or by user-defined operator class or perhaps > other reasons uncertain to me. match_clauses_to_partkey is > checking that. > > + else if (IsA(clause, ScalarArrayOpExpr)) > > I'm not sure what to do with a multidimensional ArrayExpr but > ->multidims is checked in some places. In another thread that I recently started [2], Tom seemed to point out that ScalarArrayOpExpr cannot really be used for comparing an array on LHS with, say, an array of arrays on RHS. The LHS expression must yield a scalar value which is compared with individual scalar values in the array on RHS. The syntax allows an arbitrarily multi-dimensional array to be specified on RHS, but ultimately only looks at the scalars therein. For example: select 1 = any (array[array[array[1]]]); ?column? |--------- t (1 row) select 1 = any ('{{{1}}}'); ?column? |--------- t (1 row) select 1 = any ('{{{2}}}'); ?column? |--------- f (1 row) But for the code in question on which you seemed to have commented, it only works if the array is presented as a Const that can be deconstructed to a list of scalar values using deconstruct_array(). If the array was presented as an ArrayExpr, and its multidims is true, the elements list cannot simply be assumed to contain Const nodes holding scalar values. So, we should prohibit that case from proceeding, though I haven't been able to frame a test query that results in such a case. > + ParseState *pstate = make_parsestate(NULL); > > make_parsestate mandates that the caller free it with > free_parsestate(). It doesn't seem to leak anything in the > context and I saw the same thing at other places, but it would be > better to follow it if possible, or add some apology as a > comment.. (or update the comment of make_parsestate?) Added a free_parsestate() call. > > + * If the leftarg_const and rightarg_consr are both of the type expected > > rightarg_consr -> const Fixed. > > + if (partition_cmp_args(partkey, partattoff, > + le, lt, le, > + &test_result)) > > + if (partition_cmp_args(partkey, partattoff, ge, gt, ge, > + &test_result)) > > Please unify the style. Fixed. > + * Boolean conditions have a special shape, which would've been > + * accepted if the partitioning opfamily accepts Boolean > + * conditions. > > I noticed that bare true and false are not accepted by the > values list of create table syntax. This is not a comment on > this patch but is that intentional? Hmm, I guess the original partitioning patch forgot to accept TRUE_P, FALSE_P as valid partition bound datums along with Sconst, NumericOnly, and NULL_P. Sent a patch for that in a separate thread [3]. Attached updated patches. As Robert commented [4], I tried to re-arrange the patches after breaking down the planner patch ("0005 looks like it might need to be split into smaller patches").
Brief description of each follows: [PATCH 1/5] Some interface changes for partition_bound_{cmp/bsearch} As the name says, it's a preparatory patch to enable the next patch to use the partition bound binary search function using partial keys. Until now, callers needed to specify the whole key (specify values for all partition columns), but a query may not have clauses on all partition columns, so we should be able to compare such incomplete keys against partition bounds. [PATCH 2/5] Introduce a get_partitions_from_clauses() This used to be the last patch in the previous versions. But I moved it ahead in the list as basic functionality that needs to be in place before starting to modify the planner to use the same for faster pruning. To summarize, just like the existing get_partition_for_tuple() that receives a tuple from ExecFindPartition() and returns the index of the partition that should contain that tuple, get_partitions_from_clauses() will receive a set of clauses that are all known to match some partition key, and from them derive and return the set of indexes of the partitions that satisfy all those clauses. [PATCH 3/5] Move some code of set_append_rel_size to separate function First in the series that modifies the planner. Just a preparatory patch that moves some code. [PATCH 4/5] More refactoring around partitioned table AppendPath creation Another refactoring patch that changes how we manipulate the partitions when generating AppendPath for partitioned tables. Actually, there is one behavior change here - partitioned_rels in Append today carry the RT indexes of even the partitioned child tables that have been pruned. The patch modifies things so that that no longer happens. [PATCH 5/5] Teach planner to use get_partitions_from_clauses() With this patch, we finally hit the new faster pruning functionality. Partitions will be pruned even before looking at the individual partition RelOptInfos. In fact, set_append_rel_size() now only ever looks at non-pruned partitions. Since we can call get_partitions_from_clauses() only from the SELECT planning code path, UPDATE/DELETE cases still rely on constraint exclusion. So, we retrieve the partition constraint only in those cases. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b88d63a9122646ead60c1afffc248a31d4e457d [2] https://www.postgresql.org/message-id/7677.1512743642%40sss.pgh.pa.us [3] https://www.postgresql.org/message-id/e05c5162-1103-7e37-d1ab-6de3e0afaf70%40lab.ntt.co.jp [4] https://www.postgresql.org/message-id/CA%2BTgmoYYrPA21e0y5w2NW2-sbANFR4n2nbrSWEWjzvaa_GNi0g%40mail.gmail.com
Attachment
- 0001-Some-interface-changes-for-partition_bound_-cmp-bsea-v14.patch
- 0002-Introduce-a-get_partitions_from_clauses-v14.patch
- 0003-Move-some-code-of-set_append_rel_size-to-separate-fu-v14.patch
- 0004-More-refactoring-around-partitioned-table-AppendPath-v14.patch
- 0005-Teach-planner-to-use-get_partitions_from_clauses-v14.patch
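To make the goal of patch 5/5 concrete, here is a minimal illustration; the table and partition names are made up, and the plan shapes are what the patch description implies rather than verified output:

    create table orders (region text not null) partition by list (region);
    create table orders_eu partition of orders for values in ('de', 'fr');
    create table orders_us partition of orders for values in ('us');

    -- With patch 5/5, the partitions to scan are derived directly from the
    -- clause, so only orders_eu should appear under the Append node:
    explain select * from orders where region = 'de';

    -- UPDATE/DELETE still go through constraint exclusion, per the patch
    -- description above:
    explain update orders set region = 'fr' where region = 'de';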
On 12 December 2017 at 22:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. Thanks for sending the updated patches. I don't have a complete review at the moment, but the following code in set_append_rel_pathlist() should be removed. /* append_rel_list contains all append rels; ignore others */ if (appinfo->parent_relid != parentRTindex) continue; -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David. On 2017/12/13 18:48, David Rowley wrote: > On 12 December 2017 at 22:13, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated patches. > > Thanks for sending the updated patches. > > I don't have a complete review at the moment, but the following code > in set_append_rel_pathlist() should be removed. > > /* append_rel_list contains all append rels; ignore others */ > if (appinfo->parent_relid != parentRTindex) > continue; Will fix that right away, thanks. Thanks, Amit
On 12 December 2017 at 22:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. Hi Amit, I'm sorry to say this is another micro review per code I'm stumbling over when looking at the run-time partition pruning stuff. 1. In get_partitions_from_clauses_recurse(), since you're assigning the result to the first input, the following should use bms_add_members and not bms_union. The logical end result is the same, but using bms_union means a wasted palloc and a small memory leak within the memory context. /* * Partition sets obtained from mutually-disjunctive clauses are * combined using set union. */ or_partset = bms_union(or_partset, arg_partset); 2. Also in get_partitions_from_clauses_recurse(), it might also be worth putting in a bms_free(or_partset) after: /* * Partition sets obtained from mutually-conjunctive clauses are * combined using set intersection. */ result = bms_intersect(result, or_partset); Also, instead of using bms_intersect here, would it be better to do: result = bms_del_members(result, or_partset); ? That way you don't do a bms_copy and leak memory for each OR branch since bms_intersect also does a bms_copy(). The resulting set could end up with a few more trailing 0 words than what you have now, but it seems to be a better idea not to allocate a new set each time. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 19 December 2017 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote: > Also, instead of using bms_intersect here, would it be better to do: > > result = bms_del_members(result, or_partset); ? I should have said bms_int_members() rather than bms_del_members() -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
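For concreteness, a small sketch (not taken from the patch) contrasting the copying and in-place Bitmapset forms being discussed; or_partset, arg_partset, and result mirror the patch's variable names:

    /* Copying form: builds a brand-new set, leaking the old or_partset. */
    or_partset = bms_union(or_partset, arg_partset);

    /* In-place form: folds arg_partset's members into or_partset itself,
     * reusing its allocation where possible. */
    or_partset = bms_add_members(or_partset, arg_partset);

    /* Likewise, bms_intersect() starts from a bms_copy() of its first
     * argument, while bms_int_members() modifies it in place. */
    result = bms_int_members(result, or_partset);
    bms_free(or_partset);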
On 12 December 2017 at 22:13, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. Hi Amit, I was also wondering about your thoughts on the design of get_partitions_for_keys() and more generally how there are many functions which have some special treatment doing something based on ->strategy == PARTITION_STRATEGY_XXX. If I do: git grep PARTITION_STRATEGY -- src/backend/catalog/partition.c | wc -l I get 62 matches, most of which are case statements, and most of the remainder are things like if (key->strategy == PARTITION_STRATEGY_HASH). git grep --show-function PARTITION_STRATEGY -- src/backend/catalog/partition.c shows that get_partitions_for_keys() is probably the most guilty of having the most strategy condition tests. Also, if we look at get_partitions_for_keys() there's an unconditional: memset(hash_isnull, false, sizeof(hash_isnull)); which is only used for PARTITION_STRATEGY_HASH, but LIST and RANGE must pay the price of that memset. Perhaps it's not expensive enough to warrant only doing that when partkey->strategy == PARTITION_STRATEGY_HASH, but it does make me question if we should have 3 separate functions for this and just have a case statement to call the correct one. I think if we were to put this off as something we'll fix later, then the job would just become harder and harder as time goes on. It might have been fine when we just had RANGE and LIST partitioning, but I think HASH really tips the scales over to this being needed. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
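For concreteness, the kind of dispatch being suggested might look like the sketch below; the three callee names follow the obvious naming pattern, but the keys struct name and argument list are illustrative:

    static Bitmapset *
    get_partitions_for_keys(Relation rel, PartScanKeyInfo *keys)
    {
        PartitionKey partkey = RelationGetPartitionKey(rel);

        switch (partkey->strategy)
        {
            case PARTITION_STRATEGY_HASH:
                return get_partitions_for_keys_hash(rel, keys);
            case PARTITION_STRATEGY_LIST:
                return get_partitions_for_keys_list(rel, keys);
            case PARTITION_STRATEGY_RANGE:
                return get_partitions_for_keys_range(rel, keys);
            default:
                elog(ERROR, "unexpected partition strategy: %d",
                     (int) partkey->strategy);
        }

        return NULL;    /* keep compiler quiet */
    }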
On 19 December 2017 at 17:36, David Rowley <david.rowley@2ndquadrant.com> wrote: > I'm sorry to say this is another micro review per code I'm stumbling > over when looking at the run-time partition pruning stuff. Again, another micro review. I apologise for the slow trickle of review. Again, these are just things I'm noticing while reading through while thinking of the run-time pruning patch. 1. The following Assert appears to be testing for the presence of cosmic rays :-) /* * Determine set of partitions using provided keys, which proceeds in a * manner determined by the partitioning method. */ if (keys->n_eqkeys == partkey->partnatts) { Assert(keys->n_eqkeys == partkey->partnatts); Perhaps it's misplaced during a rewrite? Should be safe enough to remove it, I'd say. 2. The following code in classify_partition_bounding_keys() misses looking under the RelabelType for rightarg: leftop = (Expr *) get_leftop(clause); if (IsA(leftop, RelabelType)) leftop = ((RelabelType *) leftop)->arg; rightop = (Expr *) get_rightop(clause); if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) constexpr = rightop; else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) constexpr = leftop; This breaks the following case: create table thisthat (a varchar not null) partition by list (a); create table this partition of thisthat for values in('this'); create table that partition of thisthat for values in('that'); explain select * from thisthat where 'this' = a; -- does not work QUERY PLAN ------------------------------------------------------------ Append (cost=0.00..54.00 rows=14 width=32) -> Seq Scan on that (cost=0.00..27.00 rows=7 width=32) Filter: ('this'::text = (a)::text) -> Seq Scan on this (cost=0.00..27.00 rows=7 width=32) Filter: ('this'::text = (a)::text) (5 rows) explain select * from thisthat where a = 'this'; -- works as we look through the RelabelType on left arg. QUERY PLAN ------------------------------------------------------------ Append (cost=0.00..27.00 rows=7 width=32) -> Seq Scan on this (cost=0.00..27.00 rows=7 width=32) Filter: ((a)::text = 'this'::text) 3. The following code assumes there will be a commutator for the operator: if (constexpr == rightop) pc->op = opclause; else { OpExpr *commuted; commuted = (OpExpr *) copyObject(opclause); commuted->opno = get_commutator(opclause->opno); commuted->opfuncid = get_opcode(commuted->opno); commuted->args = list_make2(rightop, leftop); pc->op = commuted; } I had to hunt for it, but it appears that you're pre-filtering clauses with the Consts on the left and no valid commutator in match_clauses_to_partkey. I think it's likely worth putting a comment to mention that reversed clauses with no commutator should have been filtered out beforehand. I'd say it's also worthy of an Assert(). 4. The spelling of "arbitrary" is incorrect in: * partattno == 0 refers to arbirtary expressions, which get the 5. I've noticed that partition pruning varies slightly from constraint exclusion in the following case: create table ta (a int not null) partition by list (a); create table ta1 partition of ta for values in(1,2); create table ta2 partition of ta for values in(3,4); explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is not eliminated.
QUERY PLAN ------------------------------------------------------------- Append (cost=0.00..96.50 rows=5050 width=4) -> Seq Scan on ta1 (cost=0.00..48.25 rows=2525 width=4) Filter: ((a <> 1) AND (a <> 2)) -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) Filter: ((a <> 1) AND (a <> 2)) (5 rows) alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a check constraint to see if can be removed. explain select * from ta where a <> 1 and a <> 2; -- it can. QUERY PLAN ------------------------------------------------------------- Append (cost=0.00..48.25 rows=2525 width=4) -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) Filter: ((a <> 1) AND (a <> 2)) (3 rows) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hello David. Thanks for the reviews. Replying to all your emails here. On 2017/12/19 13:36, David Rowley wrote: > On 12 December 2017 at 22:13, Amit Langote wrote: >> Attached updated patches. > > Hi Amit, > > I'm sorry to say this is another micro review per code I'm stumbling > over when looking at the run-time partition pruning stuff. > > 1. In get_partitions_from_clauses_recurse(), since you're assigning > the result to the first input, the following should use > bms_add_members and not bms_union. The logical end result is the same, > but using bms_union means a wasted palloc and a small memory leak > within the memory context. > > /* > * Partition sets obtained from mutually-disjunctive clauses are > * combined using set union. > */ > or_partset = bms_union(or_partset, arg_partset); Done. Replaced with bms_add_members(). > 2. Also in get_partitions_from_clauses_recurse(), it might also be > worth putting in a bms_free(or_partset) after: > > /* > * Partition sets obtained from mutually-conjunctive clauses are > * combined using set intersection. > */ > result = bms_intersect(result, or_partset); Done, too. > Also, instead of using bms_intersect here, would it be better to do: > > result = bms_del_members(result, or_partset); ? > > That way you don't do a bms_copy and leak memory for each OR branch > since bms_intersect also does a bms_copy(). > > The resulting set could end up with a few more trailing 0 words than > what you have now, but it seems to be a better idea not to allocate a new set > each time. You meant bms_int_members(), as you also said in your other email. On 2017/12/19 14:42, David Rowley wrote: > I was also wondering about your thoughts on the design of > get_partitions_for_keys() and more generally how there are many > functions which have some special treatment doing something based on > ->strategy == PARTITION_STRATEGY_XXX. > > If I do: > > git grep PARTITION_STRATEGY -- src/backend/catalog/partition.c | wc -l > > I get 62 matches, most of which are case statements, and most of the > remainder are things like if (key->strategy == > PARTITION_STRATEGY_HASH). > > git grep --show-function PARTITION_STRATEGY -- src/backend/catalog/ > partition.c > > shows that get_partitions_for_keys() is probably the most guilty of > having the most strategy condition tests. I notice that too now that you mention it. > > Also, if we look at get_partitions_for_keys() there's an unconditional: > > memset(hash_isnull, false, sizeof(hash_isnull)); > > which is only used for PARTITION_STRATEGY_HASH, but LIST and RANGE > must pay the price of that memset. Although I know you're talking about something else here (about which I say below), it turns out this hash_isnull was completely unnecessary, so I got rid of it. > Perhaps it's not expensive enough > to warrant only doing that when partkey->strategy == > PARTITION_STRATEGY_HASH, but it does make me question if we should > have 3 separate functions for this and just have a case statement to > call the correct one. > > I think if we were to put this off as something we'll fix later, then > the job would just become harder and harder as time goes on. > > It might have been fine when we just had RANGE and LIST partitioning, > but I think HASH really tips the scales over to this being needed. > > What do you think? I think I somewhat understand your concern with regard to future additions and maintenance and also now tend to agree.
I tried dividing up get_partitions_for_keys() into one function each for hash, list, and range, and the result looks like the attached. I think I like the result. Each function has to deal only with query keys and bounds for a given partitioning method, and that appears to add to the overall clarity of the code. On 2017/12/19 22:44, David Rowley wrote: > Again, another micro review. I apologise for the slow trickle of > review. Again, these are just things I'm noticing while reading > through while thinking of the run-time pruning patch. > > > 1. The following Assert appears to be testing for the presence of > cosmic rays :-) > > /* > * Determine set of partitions using provided keys, which proceeds in a > * manner determined by the partitioning method. > */ > if (keys->n_eqkeys == partkey->partnatts) > { > Assert(keys->n_eqkeys == partkey->partnatts); > > Perhaps it's misplaced during a rewrite? Should be safe enough to > remove it, I'd say. I noticed that too and took care of it. :) > 2. The following code in classify_partition_bounding_keys() misses > looking under the RelabelType for rightarg: > > leftop = (Expr *) get_leftop(clause); > if (IsA(leftop, RelabelType)) > leftop = ((RelabelType *) leftop)->arg; > rightop = (Expr *) get_rightop(clause); > if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) > constexpr = rightop; > else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) > constexpr = leftop; > > This breaks the following case: > > create table thisthat (a varchar not null) partition by list (a); > create table this partition of thisthat for values in('this'); > create table that partition of thisthat for values in('that'); > explain select * from thisthat where 'this' = a; -- does not work > QUERY PLAN > ------------------------------------------------------------ > Append (cost=0.00..54.00 rows=14 width=32) > -> Seq Scan on that (cost=0.00..27.00 rows=7 width=32) > Filter: ('this'::text = (a)::text) > -> Seq Scan on this (cost=0.00..27.00 rows=7 width=32) > Filter: ('this'::text = (a)::text) > (5 rows) > > explain select * from thisthat where a = 'this'; -- works as we look > through the RelabelType on left arg. > QUERY PLAN > ------------------------------------------------------------ > Append (cost=0.00..27.00 rows=7 width=32) > -> Seq Scan on this (cost=0.00..27.00 rows=7 width=32) > Filter: ((a)::text = 'this'::text) Thanks for pointing it out, fixed. > 3. The following code assumes there will be a commutator for the operator: > > if (constexpr == rightop) > pc->op = opclause; > else > { > OpExpr *commuted; > > commuted = (OpExpr *) copyObject(opclause); > commuted->opno = get_commutator(opclause->opno); > commuted->opfuncid = get_opcode(commuted->opno); > commuted->args = list_make2(rightop, leftop); > pc->op = commuted; > } > > I had to hunt for it, but it appears that you're pre-filtering clauses > with the Consts on the left and no valid commutator in > match_clauses_to_partkey. I think it's likely worth putting a comment > to mention that reversed clauses with no commutator should have been > filtered out beforehand. I'd say it's also worthy of an Assert(). Yeah, added a comment and an Assert. > > 4. The spelling of "arbitrary" is incorrect in: > > * partattno == 0 refers to arbirtary expressions, which get the Fixed. > 5.
I've noticed that partition pruning varies slightly from constraint > exclusion in the following case: > > create table ta (a int not null) partition by list (a); > create table ta1 partition of ta for values in(1,2); > create table ta2 partition of ta for values in(3,4); > > explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is > not eliminated. > QUERY PLAN > ------------------------------------------------------------- > Append (cost=0.00..96.50 rows=5050 width=4) > -> Seq Scan on ta1 (cost=0.00..48.25 rows=2525 width=4) > Filter: ((a <> 1) AND (a <> 2)) > -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) > Filter: ((a <> 1) AND (a <> 2)) > (5 rows) > > > alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a > check constraint to see if can be removed. > explain select * from ta where a <> 1 and a <> 2; -- it can. > QUERY PLAN > ------------------------------------------------------------- > Append (cost=0.00..48.25 rows=2525 width=4) > -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) > Filter: ((a <> 1) AND (a <> 2)) > (3 rows) I see. It seems that the current approach of handling <> operators by turning clauses containing the same into (key > const OR key < const) doesn't always work. I think I had noticed that for list partitioning at least. I will work on an alternative way of handling that in the next version of the patch. Meanwhile, please find attached patches (v15) that take care of the rest of the comments. Most of the updates are to patch 0002, compared to the last (v14) version. Thanks again for your thoughtful review comments. Thanks, Amit
Attachment
- 0001-Some-interface-changes-for-partition_bound_-cmp-bsea-v15.patch
- 0002-Introduce-a-get_partitions_from_clauses-v15.patch
- 0003-Move-some-code-of-set_append_rel_size-to-separate-fu-v15.patch
- 0004-More-refactoring-around-partitioned-table-AppendPath-v15.patch
- 0005-Teach-planner-to-use-get_partitions_from_clauses-v15.patch
On 2017/12/20 17:27, Amit Langote wrote: > On 2017/12/19 13:36, David Rowley wrote: >> 5. I've noticed that partition pruning varies slightly from constraint >> exclusion in the following case: >> >> create table ta (a int not null) partition by list (a); >> create table ta1 partition of ta for values in(1,2); >> create table ta2 partition of ta for values in(3,4); >> >> explain select * from ta where a <> 1 and a <> 2; -- partition ta1 is >> not eliminated. >> QUERY PLAN >> ------------------------------------------------------------- >> Append (cost=0.00..96.50 rows=5050 width=4) >> -> Seq Scan on ta1 (cost=0.00..48.25 rows=2525 width=4) >> Filter: ((a <> 1) AND (a <> 2)) >> -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) >> Filter: ((a <> 1) AND (a <> 2)) >> (5 rows) >> >> >> alter table ta1 add constraint ta1_chk check (a in(1,2)); -- add a >> check constraint to see if can be removed. >> explain select * from ta where a <> 1 and a <> 2; -- it can. >> QUERY PLAN >> ------------------------------------------------------------- >> Append (cost=0.00..48.25 rows=2525 width=4) >> -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) >> Filter: ((a <> 1) AND (a <> 2)) >> (3 rows) > > I see. It seems that the current approach of handling <> operators by > turning clauses containing the same into (key > const OR key < const) > doesn't always work. I think I had noticed that for list partitioning at > least. I will work on an alternative way of handling that in the next > version of the patch. I think I was able to make this work and in the process of making it work, also came to the conclusion that this could be made to work sensibly *only* for list partitioned tables. That's because one cannot prune a given partition using a set of <> operator clauses, if we cannot be sure that those clauses exclude *all* values of the partition key allowed by that partition. It's only possible to do that for a list partitioned table, because by definition one is required to spell out every value that a given partition of such a table allows. There is a new function in the updated patch that does the pruning using <> operator clauses and it's implemented by assuming it's only ever called for a list partitioned table. So, sorry, range and hash partitioned tables. Attached updated set of patches. Thanks, Amit
Attachment
- 0001-Some-interface-changes-for-partition_bound_-cmp-bsea-v16.patch
- 0002-Introduce-a-get_partitions_from_clauses-v16.patch
- 0003-Move-some-code-of-set_append_rel_size-to-separate-fu-v16.patch
- 0004-More-refactoring-around-partitioned-table-AppendPath-v16.patch
- 0005-Teach-planner-to-use-get_partitions_from_clauses-v16.patch
On 21 December 2017 at 23:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/12/20 17:27, Amit Langote wrote: > I think I was able to make this work and in the process of making it work, > also came to the conclusion that this could be made to work sensibly > *only* for list partitioned tables. That's because one cannot prune a > given partition using a set of <> operator clauses, if we cannot be sure > that those clauses exclude *all* values of the partition key allowed by > that partition. It's only possible to do that for a list partitioned > table, because by definition one is required to spell out every value that > a given partition of such table allows. Makes sense. Thanks for fixing LIST partitioning to work with that. We have no way to know that there's no value between 1::int and 2::int, so it's completely understandable why this can't work for RANGE. HASH is also understandable since we don't have a complete picture of all the values that can be contained within the partition. I'll try to do another complete review of v16 soon. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
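A made-up example (not from the thread's tests) shows why RANGE can't use <> clauses for pruning: the bound doesn't enumerate the partition's values, so excluding individual values proves nothing.

    create table tr (a int not null) partition by range (a);
    create table tr1 partition of tr for values from (1) to (3); -- holds a = 1 and a = 2
    -- a <> 1 cannot prune tr1, because rows with a = 2 may still be in it:
    explain select * from tr where a <> 1;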
On 21 December 2017 at 23:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated set of patches. Looks like the new not equals code does not properly take into account a missing NULL partition. create table ta (a int not null) partition by list (a); create table ta1 partition of ta for values in(1,2); create table ta2 partition of ta for values in(3,4); explain select * from ta where a <> 1 and a <> 2; ERROR: negative bitmapset member not allowed -- Add null partition create table ta_null partition of ta for values in(null); explain select * from ta where a <> 1 and a <> 2; -- works now. QUERY PLAN ------------------------------------------------------------- Append (cost=0.00..48.25 rows=2525 width=4) -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) Filter: ((a <> 1) AND (a <> 2)) (3 rows) This code appears to be at fault: /* * Also, exclude the "null-only" partition, because strict clauses in * ne_clauses will not select any rows from it. */ if (count_partition_datums(relation, boundinfo->null_index) == 0) excluded_parts = bms_add_member(excluded_parts, boundinfo->null_index); -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Just trying to understand the code here; some very minor comments as I go along. partition_op_strategy returning int32 looks pretty ugly, and the calling code is not super-intelligible either. How about returning a value from a new enum? typedef PartClause is missing a struct name, as is our tradition. > + * We don't a <> operator clause into a key right away. Missing a word there. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi David. On 2017/12/21 21:04, David Rowley wrote: > On 21 December 2017 at 23:38, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated set of patches. > > Looks like the new not equals code does not properly take into account > a missing NULL partition. > > create table ta (a int not null) partition by list (a); > create table ta1 partition of ta for values in(1,2); > create table ta2 partition of ta for values in(3,4); > explain select * from ta where a <> 1 and a <> 2; > ERROR: negative bitmapset member not allowed > > -- Add null partition > create table ta_null partition of ta for values in(null); > explain select * from ta where a <> 1 and a <> 2; -- works now. > QUERY PLAN > ------------------------------------------------------------- > Append (cost=0.00..48.25 rows=2525 width=4) > -> Seq Scan on ta2 (cost=0.00..48.25 rows=2525 width=4) > Filter: ((a <> 1) AND (a <> 2)) > (3 rows) > > This code appears to be at fault: > > /* > * Also, exclude the "null-only" partition, because strict clauses in > * ne_clauses will not select any rows from it. > */ > if (count_partition_datums(relation, boundinfo->null_index) == 0) > excluded_parts = bms_add_member(excluded_parts, > boundinfo->null_index); Oops, we must check that a null partition exists at all before going to count datums. Will post the fixed version shortly, thanks. Regards, Amit
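Presumably the fix is simply to guard the lookup, along these lines (a sketch of the described fix; partition_bound_accepts_nulls() is the existing null_index != -1 test):

    /* Only try to exclude a null-only partition if one actually exists. */
    if (partition_bound_accepts_nulls(boundinfo) &&
        count_partition_datums(relation, boundinfo->null_index) == 0)
        excluded_parts = bms_add_member(excluded_parts,
                                        boundinfo->null_index);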
On 22 December 2017 at 13:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Will post the fixed version shortly, thanks. I've made another partial pass on the patch and have a few more things. #3 and #4 are just comments rather than requests to change something. I think we should change those before PG11 though. 1. If I look at the elog(ERROR) messages in partition.c, there are a number of variations of reporting an invalid partition strategy. There seem to be 3 variations of the same thing. Probably the "unexpected" one would suit most, but I've not looked too closely. elog(ERROR, "invalid partitioning strategy"); elog(ERROR, "unexpected partition strategy: %d", (int) key->strategy); elog(ERROR, "invalid partition strategy %c", RelationGetPartitionKey(rel)->strategy); 2. In get_relation_constraints(). Can you add a comment to why you've added: /* Append partition predicates, if any */ if (root->parse->commandType != CMD_SELECT) { I guess it must be because we use the new partition pruning code for SELECT, but not for anything else. 3. It's a shame that RelOptInfo->live_partitioned_rels is a List and not a Relids. I guess if you were to change that you'd need to also change AppendPath->partitioned_rels too, so probably we can just fix that later. 4. live_part_appinfos I think could be a Relids type too, but probably we can change that after this patch. Append subpaths are sorted in create_append_path() for parallel append, so the order of the subpaths seems non-critical. 5. Small memory leaks in get_partitions_from_clauses_recurse(). if (ne_clauses) result = bms_int_members(result, get_partitions_from_ne_clauses(relation, ne_clauses)); Can you assign the result of get_partitions_from_ne_clauses() and bms_free() it after the bms_int_members()? Same for: result = bms_int_members(result, get_partitions_from_or_clause_args(relation, rt_index, or->args)); The reason I'm being particular about this is that for the run-time pruning patch we'll call this from ExecReScanAppend() which will allocate into the ExecutorState which lives as long as the query does. So any leaks will last the entire length of the query. ExecReScanAppend() could be called millions of billions of times, so we need to be sure that's not going to be a problem. 6. Similar to #5, memory leaks in get_partitions_from_or_clause_args() arg_partset = get_partitions_from_clauses_recurse(relation, rt_index, arg_clauses); /* * Partition sets obtained from mutually-disjunctive clauses are * combined using set union. */ result = bms_add_members(result, arg_partset); Need to bms_free(arg_partset) Running out of time for today, but will look again in about 4 days. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
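For item 5, the leak-free shape being asked for would look something like this sketch (the local variable name is illustrative):

    if (ne_clauses)
    {
        Bitmapset  *ne_partset;

        ne_partset = get_partitions_from_ne_clauses(relation, ne_clauses);
        result = bms_int_members(result, ne_partset);
        bms_free(ne_partset);
    }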
On 2017/12/22 1:06, Alvaro Herrera wrote: > Just trying to understand the code here; some very minor comments as I > go along. > > partition_op_strategy returning int32 looks pretty ugly, and the calling > code is not super-intelligible either. How about returning a value from > a new enum? OK, I made it the following enum: typedef enum PartOpStrategy { PART_OP_EQUAL, PART_OP_LESS, PART_OP_GREATER, } PartOpStrategy; > typedef PartClause is missing a struct name, as is our tradition. Will fix. >> + * We don't a <> operator clause into a key right away. > > Missing a word there. Oops, right. I meant "We don't turn a <> ...". Will fix. Will post a new version after taking care of David's comments. Thanks, Amit
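With that enum in place, a caller might read like the following sketch; the argument list of partition_op_strategy() here is illustrative, not the patch's exact signature:

    bool        incl;

    switch (partition_op_strategy(partkey, pc, &incl))
    {
        case PART_OP_EQUAL:
            /* clause pins the key to a single value */
            break;
        case PART_OP_LESS:
            /* upper bound; incl distinguishes <= from < */
            break;
        case PART_OP_GREATER:
            /* lower bound; incl distinguishes >= from > */
            break;
    }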
Hi David. On 2017/12/22 10:35, David Rowley wrote: > On 22 December 2017 at 13:57, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Will post the fixed version shortly, thanks. > > I've made another partial pass on the patch and have a few more > things. #3 and #4 are just comments rather than requests to change > something. I think we should change those before PG11 though. Thank you. > 1. If I look at the elog(ERROR) messages in partition.c, there are a > number of variations of reporting an invalid partition strategy. > > There seem to be 3 variations of the same thing. Probably the > "unexpected" one would suit most, but I've not looked too closely. > > elog(ERROR, "invalid partitioning strategy"); > elog(ERROR, "unexpected partition strategy: %d", (int) key->strategy); > elog(ERROR, "invalid partition strategy %c", > RelationGetPartitionKey(rel)->strategy); I should have used the "unexpected ..." wording in yesterday's update. Fixed. > 2. In get_relation_constraints(). Can you add a comment to why you've added: > > /* Append partition predicates, if any */ > if (root->parse->commandType != CMD_SELECT) > { > > I guess it must be because we use the new partition pruning code for > SELECT, but not for anything else. Yeah, I explained that a couple of times on email (maybe also in the commit message), but not there. Done. > 3. It's a shame that RelOptInfo->live_partitioned_rels is a List and > not a Relids. I guess if you were to change that you'd need to also > change AppendPath->partitioned_rels too, so probably we can just fix > that later. I agree. > 4. live_part_appinfos I think could be a Relids type too, but probably > we can change that after this patch. Append subpaths are sorted in > create_append_path() for parallel append, so the order of the subpaths > seems non-critical. Hmm, perhaps. > 5. Small memory leaks in get_partitions_from_clauses_recurse(). > > if (ne_clauses) > result = bms_int_members(result, > get_partitions_from_ne_clauses(relation, > ne_clauses)); > > Can you assign the result of get_partitions_from_ne_clauses() and > bms_free() it after the bms_int_members()? > > Same for: > > result = bms_int_members(result, > get_partitions_from_or_clause_args(relation, > rt_index, > or->args)); > > The reason I'm being particular about this is that for the run-time > pruning patch we'll call this from ExecReScanAppend() which will > allocate into the ExecutorState which lives as long as the query does. > So any leaks will last the entire length of the query. > ExecReScanAppend() could be called millions of billions of times, so > we need to be sure that's not going to be a problem. That's a very important point to stress. Thanks. > 6. Similar to #5, memory leaks in get_partitions_from_or_clause_args() > > arg_partset = get_partitions_from_clauses_recurse(relation, rt_index, > arg_clauses); > > /* > * Partition sets obtained from mutually-disjunctive clauses are > * combined using set union. > */ > result = bms_add_members(result, arg_partset); > > Need to bms_free(arg_partset) Fixed all these instances of leaks. > Running out of time for today, but will look again in about 4 days. Thanks again. Please find attached updated patches. Thanks, Amit
Attachment
- 0001-Some-interface-changes-for-partition_bound_-cmp-bsea-v17.patch
- 0002-Introduce-a-get_partitions_from_clauses-v17.patch
- 0003-Move-some-code-of-set_append_rel_size-to-separate-fu-v17.patch
- 0004-More-refactoring-around-partitioned-table-AppendPath-v17.patch
- 0005-Teach-planner-to-use-get_partitions_from_clauses-v17.patch
On 22 December 2017 at 17:25, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Please find attached updated patches. Hi Amit, I've just completed a pass over the v17 patch set. I've found a number of things that need to be addressed. Some might seem a bit nit-picky, sorry about that. However, many of the others genuinely need to be addressed. 1. The following code calls RelationGetPartitionQual, but the output of that is only needed when partition_bound_has_default(boundinfo) is true. Can you change this to only get the RelationGetPartitionQual when it's required? Bitmapset * get_partitions_from_clauses(Relation relation, int rt_index, ParamListInfo prmlist, ExprContext *econtext, List *partclauses) { Bitmapset *result; List *partconstr = RelationGetPartitionQual(relation); 2. The header comment in match_clauses_to_partkey() does not give any warning that 'clauses' is modified within the function. The function should create a copy of the clauses before modifying them. This will save you having to do any list_copy calls when you're calling the function. The header comment is also not very clear about what the return value of the function is. 3. "method" I think should be "strategy". We've pretty much standardised on that term everywhere else, so let's keep to standard. /* * Does the query specify a key to be null or not null? Partitioning * handles null partition keys specially depending on the partitioning * method in use, we store this information. */ 4. "relation" should be in single quotes, since you're talking about the parameter named "relation". Likewise with "partclauses", otherwise it just seems like bad English. * Determine the set of partitions of relation that will satisfy all * the clauses contained in partclauses 5. partdesc's assignment can be delayed until it's needed. This will save generating it when constfalse == true static Bitmapset * get_partitions_from_clauses_recurse(Relation relation, int rt_index, List *clauses) { PartitionDesc partdesc = RelationGetPartitionDesc(relation); 6. In the following comment, I'd have expected almost identical code, just with some other List, but the code probably differs a bit too much to use "Ditto". /* * Ditto, but this time or_clauses. */ 7. Comment claims we use "set union", but we're really just collecting the members from the other set in: /* * Partition sets obtained from mutually-disjunctive clauses are * combined using set union. */ result = bms_add_members(result, arg_partset); 8. These arrays could just be initialized up to partkey->partnatts. I'd imagine most of the time this will be just 1, so would save needlessly setting the 31 other elements, although, perhaps it's a bad idea to optimize this. memset(keyclauses_all, 0, sizeof(keyclauses_all)); /* false means we don't know if a given key is null */ memset(keyisnull, false, sizeof(keyisnull)); /* false means we don't know if a given key is not null */ memset(keyisnotnull, false, sizeof(keyisnull)); The last two of these could just be Bitmapsets and you'd not need any memset at all. PARTITION_MAX_KEYS just so happens to be the same as BITS_PER_BITMAPWORD in a standard build, so you'd always be able to mark everything with a single bitmap word. This would help a bit in the various places that you're counting the true elements, for example, the following code: for (i = 0; i < partkey->partnatts; i++) { if (!keys->keyisnotnull[i]) { include_def = true; break; } } could become: include_def = !bms_is_empty(keys->keyisnotnull); if you converted all these to Bitmapsets. 9. 
The following comment would be better read as: /* clause does not match this partition key */ /* Clause not meant for this column. */ 10. The following comment talks about handling less than operators for hash opfamilies, but the code only handles <> for btree and list partitioning. * Handle some cases wherein the clause's operator may not * belong to the partitioning operator family. For example, * operators named '<>' are not listed in any operator * family whatsoever. Also, ordering opertors like '<' are * not listed in the hash operator family. "opertors" should be spelled "operators" 11. In the following comment "operator" should be "operators": * are constrained by clauses containing equality operator, unless hash Likewise for: * IS NULL clauses while remaining have clauses with equality operator. 12. The following code in classify_partition_bounding_keys probably should use EXPR_MATCHES_PARTKEY. /* Does leftop match with this partition key column? */ if ((IsA(arg, Var) && ((Var *) arg)->varattno == partattno) || equal(arg, partexpr)) 13. The comment for classify_partition_bounding_keys does not seem to define well what the meaning of the return value is: * classify_partition_bounding_keys * Classify partition clauses into equal, min, and max keys, along with * any Nullness constraints and return that information in the output * argument keys (number of keys is the return value) I had thought all along that this must mean the number of distinct partition keys that we've found useful clauses for, but it's possible to fool this with duplicate IS NULL checks. create table hp (a int, b text) partition by hash (a, b); create table hp0 partition of hp for values with (modulus 4, remainder 0); create table hp3 partition of hp for values with (modulus 4, remainder 3); create table hp1 partition of hp for values with (modulus 4, remainder 1); create table hp2 partition of hp for values with (modulus 4, remainder 2); explain select * from hp where a is null and a is null; This causes classify_partition_bounding_keys to return 2. I can't see any bad side effects of this, but I think the comment needs to be improved. Perhaps it's better just to make the function return bool, returning true if there are any useful keys? 14. The switch statement in partition_op_strategy could be simplified a bit and be written more as: case BTLessEqualStrategyNumber: *incl = true; /* fall through */ case BTLessStrategyNumber: result = PART_OP_LESS; break; 15. No default case statements in partition_op_strategy() could result in "result" not being set. 16. I'm unsure what the following comment means: * To set eqkeys, we must have found the same for partition key columns. The word "for" seems wrong, or I'm not sure what it wants to ensure it finds the same for. 17. Header comment for partition_op_strategy is out-of-date. /* * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=, * =, and >/>= operator, respectively. Sets *incl to true if equality is * implied. */ It'll never return -1. 18. The following comment has some grammar issues: /* * If couldn't coerce to the partition key type, that is, the type of * datums stored in PartitionBoundInfo, no hope of using this * expression for anything partitioning-related. */ Would be better with: /* * If we couldn't coerce the partition key type, that is, the type * of datums stored in PartitionBoundInfo, then there's no hope of * using this expression for anything partitioning-related. */ 19. 
In partkey_datum_from_expr() the Assert(false) at the end seems overkill. You could just get rid of the default in the switch statement and have it fall through to the final return. This would save 4 lines of code. 20. The word "and" I think needs removed from the following comment: * Couldn't compare; keep hash_clause set to the previous value and * so add this one directly to the result. Caller would Probably also needs a comment after "value" 21. The following comment seems to indicate 'cur' is an operator, but it's a PartClause: /* The code below is for btree operators, which cur is not. */ It might be better to write /* The code below handles Btree operators which are not relevant to a hash-partitioned table. */ 22. "Stuff" should be "The code" in: * Stuff that follows closely mimics similar processing done by 23. In remove_redundant_clauses() and various other places, you have a variable named partattoff. What's the meaning of this name? I'd imagine it is short for "partition attribute offset", but it's a "partition key index", so why not "partkeyidx"? Of course, you might claim that an array index is just an offset, but it is a little confusing if you think of attribute offsets in a TupleDesc. 24. get_partitions_for_keys() you're using "return" and "break" in the switch statement. You can remove the breaks; 25. The Assert(false) in get_partitions_for_keys seems overkill. I think it's fine to just do: return NULL; /* keep compiler quiet */ 26. The following code comment in get_partitions_for_keys_hash() does not seem very well written: * Hash partitioning handles puts nulls into a normal partition and * doesn't require to define a special null-accpting partition. * Caller didn't count nulls as a valid key; do so ourselves. Maybe "puts" should be "storing"? Also, does hash partitioning actually support a NULL partition? This seems to say it doesn't require, but as far as I can see it does not *support* a NULL partition. The comment is a bit misleading. "accpting" should be "accepting". 27. In get_partitions_for_keys_hash() why is it possible to get a result_index below 0? In that case, the code will end up triggering the Assert(false), but if you want to Assert something here then maybe it should be Assert(result_index >= 0)? if (result_index >= 0) return bms_make_singleton(result_index); 28. Is there any point in the following loop in get_partitions_for_keys_list()? /* * We might be able to get the answer sooner based on the nullness of * keys, so get that out of the way. */ for (i = 0; i < partkey->partnatts; i++) { Won't partkey->partnatts always be 1? Maybe you can Assert that instead, just in case someone forgets to change the code if LIST is to support multiple partition keys. 29. In get_partitions_for_keys_list and get_partitions_for_keys_range partdesc is only used for an Assert. Maybe you can just: Assert(RelationGetPartitionDesc(rel)->nparts > 0); 30. The Assert(false) in get_partitions_for_keys_list() looks suspiciously easy to hit... create table lp (a int not null) partition by list(a); create table lp1 partition of lp for values in(1); explain select * from lp where a > 11; -- Assert fail! 31. ranget? * get_partitions_for_keys_range * Return partitions of a ranget partitioned table for requested keys 32. I'm not quite sure what this comment is saying: /* * eqoff is gives us the bound that is known to be <= eqkeys, * given how partition_bound_bsearch works. The bound at eqoff+1, * then, would be the upper bound of the only partition that needs * to be scanned. */ 33. 
"one" -> "the one" * Find the leftmost bound that satisfies the query, i.e., one that * satisfies minkeys. 34. Why "will" there be multiple partitions sharing the same prefix? * If only a prefix of the whole partition key is provided, there will * be multiple partitions whose bound share the same prefix. If minkey Perhaps "will" should be "may"? 35. I think "would've" in the following comment is not correct. This seems to indicate that this has already happened, which it has not. "will have to be" might be more correct? * satisfy the query, but don't have a valid partition assigned. The * default partition would've been included to cover those values. 36. "will" -> "with"? * Since partition keys will nulls are mapped to default range 37. "If no" -> "If there's no" * If no tuple datum to compare with the bound, consider * the latter to be greater. 38. I don't see anything about setting keynullness here: /* * Get the clauses that match the partition key, including information * about any nullness tests against partition keys. Set keynullness to * a invalid value of NullTestType, which 0 is not. */ partclauses = match_clauses_to_partkey(root, rel, list_copy(rel->baserestrictinfo), &contains_const, &constfalse); 39. "paritions" -> "partitions" * Else there are no clauses that are useful to prune any paritions, 40. In match_clauses_to_partkey, the following code pulls the varnos from each operand of the expression. It would be better to just pull the side that's needed (if any) a bit later rather than always doing both. left_relids = pull_varnos((Node *) leftop); right_relids = pull_varnos((Node *) rightop); 41. I think a list_copy is missing here: /* * Accumulate the live partitioned children of this child, if it's * itself partitioned rel. */ if (childrel->part_scheme) partitioned_rels = list_concat(partitioned_rels, childrel->live_partitioned_rels); Surely if we don't perform a list_copy of childrel->live_partitioned_rels then subsequent additions to the resulting list will inadvertently add new items to the childrel->live_partitioned_rels? 42. Is this because of an existing bug? @@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted where abs(b) = 5; -- scans all parti Filter: (abs(b) = 5) -> Seq Scan on mcrparted3 Filter: (abs(b) = 5) + -> Seq Scan on mcrparted4 + Filter: (abs(b) = 5) -> Seq Scan on mcrparted5 Filter: (abs(b) = 5) 43. In partition_prune.sql you have a mix of /* */ and -- comments. Please just use -- -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 28 December 2017 at 15:07, David Rowley <david.rowley@2ndquadrant.com> wrote: > 43. In partition_prune.sql you have a mix of /* */ and -- comments. > Please just use -- Just a few extras that I found: 44. In match_clauses_to_partkey you're making use of estimate_expression_value(); I don't think this is safe. if (IsA(estimate_expression_value(root, rightop), Const)) *contains_const = true; The only other places I see using this in the planner are for costing purposes. Also, the header comment for that function says it's not safe. Particularly "This effectively means that we plan using the first supplied value of the Param.". If that's the case, then if we're planning a generic plan, wouldn't it be possible that the planner chooses the current supplied parameter value and prunes away partitions based on that value? That would make the plan invalid for any other parameter, but it's meant to be a generic plan, so we can't do that. 45. Why use a list_copy() here? /* * For a nested ArrayExpr, we don't know how to get the * actual scalar values out into a flat list, so we give * up doing anything with this ScalarArrayOpExpr. */ if (arrexpr->multidims) continue; elem_exprs = list_copy(arrexpr->elements); -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
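To make the hazard in item 44 concrete, here is a minimal sketch; the table and statement names are invented for illustration and do not come from the patch:

create table gp (a int) partition by list (a);
create table gp1 partition of gp for values in (1);
create table gp2 partition of gp for values in (2);
insert into gp values (1), (2);

prepare p(int) as select * from gp where a = $1;

-- If building the generic plan pruned gp2 based on the first-seen value
-- of $1 (say 1), this execution would silently return no rows:
execute p(2);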
I happened to notice that Ashutosh's patch series at https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com has a 0001 patch that modifies the partition_bound_cmp stuff too. Are those conflicting? Ashutosh's commit message: Modify bound comparision functions to accept members of PartitionKey Functions partition_bound_cmp(), partition_rbound_cmp() and partition_rbound_datum_cmp() are required to merge partition bounds from joining relations. While doing so, we do not have access to the PartitionKey of either relations. So, modify these functions to accept only required members of PartitionKey so that the functions can be reused for merging bounds. Amit's: Some interface changes for partition_bound_{cmp/bsearch} Introduces a notion of PartitionBoundCmpArg, which replaces the set of arguments void *probe and bool probe_is_bound of both partition_bound_cmp and partition_bound_bsearch. It wasn't possible before to specify the number of datums when a non-bound type of probe is passed. Slightly tweaking the existing interface to allow specifying the same seems awkward. So, instead encapsulate that into PartitionBoundCmpArg. Also, modify partition_rbound_datum_cmp to compare caller-specifed number of datums, instead of key->partnatts datums. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Dec 29, 2017 at 6:32 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I happened to notice that Ashutosh's patch series at > https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com > has a 0001 patch that modifies the partition_bound_cmp stuff too. > Are those conflicting? > > Ashutosh's commit message: > Modify bound comparision functions to accept members of PartitionKey > > Functions partition_bound_cmp(), partition_rbound_cmp() and > partition_rbound_datum_cmp() are required to merge partition bounds > from joining relations. While doing so, we do not have access to the > PartitionKey of either relations. So, modify these functions to accept > only required members of PartitionKey so that the functions can be > reused for merging bounds. > > Amit's: > Some interface changes for partition_bound_{cmp/bsearch} > > Introduces a notion of PartitionBoundCmpArg, which replaces the set > of arguments void *probe and bool probe_is_bound of both > partition_bound_cmp and partition_bound_bsearch. It wasn't possible > before to specify the number of datums when a non-bound type of > probe is passed. Slightly tweaking the existing interface to allow > specifying the same seems awkward. So, instead encapsulate that > into PartitionBoundCmpArg. Also, modify partition_rbound_datum_cmp > to compare caller-specifed number of datums, instead of > key->partnatts datums. > I haven't looked at Amit's changes, but we need a more flexible way to pass information required for datum comparison than using PartitionKey, since that's not available in the optimizer and cannot be associated with join or aggregate relations. If we pass that information through a structure, there are two ways: 1. it will need to be part of PartitionScheme; I am not sure if we can have a substructure in PartitionKey. But if we can do it that way, we can pass that structure to the functions. 2. we will need to construct the structure filling it with comparison information and pass it to the comparison functions. I think what we achieve out of this isn't worth the code we will need to add. I would prefer the first approach over the other. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001 > From: amit <amitlangote09@gmail.com> > Date: Tue, 22 Aug 2017 13:48:13 +0900 > Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses() This one fails to apply. Please rebase. Did you know you can use "git format-patch -v6" to generate appropriately named patch files without having to rename them yourself? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Amit, On 12/21/2017 11:25 PM, Amit Langote wrote: > Thanks again. > > Please find attached updated patches. > I have been looking at this patch from a simple hash partition point of view. -- ddl.sql -- CREATE TABLE t1 ( a integer NOT NULL, b integer NOT NULL ) PARTITION BY HASH (b); CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 0); CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 1); CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 2); CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, REMAINDER 3); CREATE INDEX idx_t1_b_a_p00 ON t1_p00 USING btree (b, a); CREATE INDEX idx_t1_b_a_p01 ON t1_p01 USING btree (b, a); CREATE INDEX idx_t1_b_a_p02 ON t1_p02 USING btree (b, a); CREATE INDEX idx_t1_b_a_p03 ON t1_p03 USING btree (b, a); INSERT INTO t1 (SELECT i, i FROM generate_series(1, 1000000) AS i); ANALYZE; -- ddl.sql -- w/ -- select.sql -- \set b random(1, 1000000) BEGIN; SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b; COMMIT; -- select.sql -- using pgbench -c X -j X -M prepared -T X -f select.sql part-hash On master we have generic_cost planning cost of 33.75, and an avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS of 8893. Using v17 we have generic_cost planning cost of 33.75, and an avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS of 7129 - of course due to the generation of a custom plan for each invocation. Comparing master with a non-partitioned scenario, we have a TPS of 12968, since there is no overhead of ExecInitAppend (PortalStart) and ExecAppend (PortalRun). Could you share your thoughts on 1) if the generic plan mechanics should know about the pruning and hence give a lower planner cost 2) if the patch should be more aggressive in removing planning nodes that aren't necessary, e.g. going from Append -> IndexOnly to just IndexOnly. I have tested with both [1] and [2], but would like to know about your thoughts on the above first. Thanks in advance! [1] https://commitfest.postgresql.org/16/1330/ [2] https://commitfest.postgresql.org/16/1353/ Best regards, Jesper
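For reproducing the plan choice by hand without pgbench, a rough sketch assuming the t1 setup from ddl.sql above (the statement name q is made up): the plancache always builds custom plans for the first five executions of a prepared statement and only then starts comparing generic_cost against avg_custom_cost.

prepare q(int) as select t1.a, t1.b from t1 where t1.b = $1;
explain (costs off) execute q(1);  -- executions 1 through 5 always get a custom plan
explain (costs off) execute q(2);
explain (costs off) execute q(3);
explain (costs off) execute q(4);
explain (costs off) execute q(5);
explain (costs off) execute q(6);  -- from here on, the cost comparison decides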
On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > 2) if the patch should be more aggressive in removing planning nodes that > aren't necessary, e.g. going from Append -> IndexOnly to just IndexOnly. That's not for this patch. There's another patch [1] to do that already. [1] https://commitfest.postgresql.org/16/1353/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > \set b random(1, 1000000) > BEGIN; > SELECT t1.a, t1.b FROM t1 WHERE t1.b = :b; > COMMIT; > -- select.sql -- > > using pgbench -c X -j X -M prepared -T X -f select.sql part-hash > > On master we have generic_cost planning cost of 33.75, and an > avg_custom_cost of 51.25 resulting in use of the generic plan and a TPS of > 8893. > > Using v17 we have generic_cost planning cost of 33.75, and an > avg_custom_cost of 25.9375 resulting in use of the custom plan and a TPS of > 7129 - of course due to the generation of a custom plan for each invocation. > > Comparing master with a non-partitioned scenario, we have a TPS of 12968, > since there is no overhead of ExecInitAppend (PortalStart) and ExecAppend > (PortalRun). > > Could you share your thoughts on > > 1) if the generic plan mechanics should know about the pruning and hence > give a lower planner cost I think the problem here is that cached_plan_cost() is costing the planning cost of the query too low. If this was costed higher then it's more likely the generic plan would have been chosen, instead of generating a custom plan each time. How well does it perform if you change cpu_operator_cost = 0.01? I think cached_plan_cost() does need an overhaul, but I think it's not anything that should be done as part of this patch. You've picked HASH partitioning here just because the current master does not perform any partition pruning for that partitioning strategy. There also might be a tiny argument here to have some method of disabling the planner's partition pruning as we could before with SET constraint_exclusion = 'off', but I think that's about the limit of the interest this patch should have in that problem. (The problem gets more complex again when doing run-time pruning, but that's not a topic for this thread)
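A sketch of the suggested experiment; the exact surcharge formula lives in cached_plan_cost() in plancache.c, but the relevant point is that the planning-cost charge added to each custom plan scales with cpu_operator_cost, so raising it makes repeated re-planning look more expensive and tips the comparison toward the generic plan:

show cpu_operator_cost;        -- 0.0025 by default
set cpu_operator_cost = 0.01;  -- 4x the default
-- now re-run the prepared select.sql workload and compare generic_cost
-- against avg_custom_cost again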
Hi David, On 01/04/2018 09:21 PM, David Rowley wrote: > On 5 January 2018 at 07:16, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: >> Could you share your thoughts on >> >> 1) if the generic plan mechanics should know about the pruning and hence >> give a lower planner cost > > I think the problem here is that cached_plan_cost() is costing the > planning cost of the query too low. If this was costed higher then it's > more likely the generic plan would have been chosen, instead of > generating a custom plan each time. > > How well does it perform if you change cpu_operator_cost = 0.01? > It gives 38.82 for generic_cost, and 108.82 for avg_custom_cost on master (8249 TPS). And, 38.82 for generic_cost, and 79.705 for avg_custom_cost with v17 (7891 TPS). Non-partitioned is 11722 TPS. > I think cached_plan_cost() does need an overhaul, but I think it's not > anything that should be done as part of this patch. You've picked HASH > partitioning here just because the current master does not perform any > partition pruning for that partitioning strategy. > Well, I mainly picked HASH because that is my use-case :) For a range-based setup it gives 39.84 for generic_cost, and 89.705 for avg_custom_cost (7862 TPS). Best regards, Jesper
On 2018/01/05 1:28, Alvaro Herrera wrote: >> From 8d627b910278203151853d324c3319c265cd36c0 Mon Sep 17 00:00:00 2001 >> From: amit <amitlangote09@gmail.com> >> Date: Tue, 22 Aug 2017 13:48:13 +0900 >> Subject: [PATCH 2/5] Introduce a get_partitions_from_clauses() > > This one fails to apply. Please rebase. Sorry about the absence in the last few days. I will post a new version addressing various review comments by the end of this week. > Did you know you can use "git format-patch -v6" to generate > appropriately named patch files without having to rename them yourself? Oh, didn't know that trick. Will try, thanks. Regards, Amit
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Sorry about the absence in the last few days. I will post a new version > addressing various review comments by the end of this week. Good to have you back. There's a small problem with get_partitions_for_keys_list(), the following case hits the Assert(false) at the bottom of that function. create table ab_c (a int not null, b char) partition by list(a); create table abc_a2 (b char, a int not null) partition by list(b); create table abc_a2_b3 partition of abc_a2 for values in ('3'); alter table ab_c attach partition abc_a2 for values in (2); select * from ab_c where a between 1 and 2 and b <= '2'; -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Sorry about the absence in the last few days. I will post a new version > addressing various review comments by the end of this week. One more thing I discovered while troubleshooting a bug Beena reported in the run-time partition pruning patch is that classify_partition_bounding_keys properly does; if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) constexpr = rightop; else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) constexpr = leftop; else /* Clause not meant for this column. */ continue; for OpExpr clauses, but does not do the same for leftop for the ScalarArrayOpExpr test. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 9 January 2018 at 21:40, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Sorry about the absence in the last few days. I will post a new version > addressing various review comments by the end of this week. I'm sorry for the flood of emails today. Beena's tests on the run-time partition pruning patch also indirectly exposed a problem with this patch. Basically, the changes to add_paths_to_append_rel() are causing duplication in partition_rels. A test case is: create table part (a int, b int) partition by list(a); create table part1 partition of part for values in(1) partition by list (b); create table part2 partition of part1 for values in(1); select * from part; partition_rels ends up with 3 items in the list, but there's only 2 partitions here. The reason for this is that, since planning here is recursively calling add_paths_to_append_rel, the list for part ends up with itself and part1 in it, then since part1's list already contains itself, per set_append_rel_size's "rel->live_partitioned_rels = list_make1_int(rti);", then part1 ends up in the list twice. It would be nicer if you could use a RelIds for this, but you'd also need some way to store the target partition relation since nodeModifyTable.c does: /* The root table RT index is at the head of the partitioned_rels list */ if (node->partitioned_rels) { Index root_rti; Oid root_oid; root_rti = linitial_int(node->partitioned_rels); root_oid = getrelid(root_rti, estate->es_range_table); rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ } You could also fix it by instead of doing: /* * Accumulate the live partitioned children of this child, if it's * itself partitioned rel. */ if (childrel->part_scheme) partitioned_rels = list_concat(partitioned_rels, childrel->live_partitioned_rels); do something along the lines of: if (childrel->part_scheme) { ListCell *lc; ListCell *start = lnext(list_head(childrel->live_partitioned_rels)); for_each_cell(lc, start) partitioned_rels = lappend_int(partitioned_rels, lfirst_int(lc)); } Although it seems pretty fragile. It would probably be better to find a nicer way of handling all this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From: Amit Langote
Hi David. Thanks a lot for the review. I'm replying to your multiple emails here. On 2017/12/28 11:07, David Rowley wrote: > I've just completed a pass over the v17 patch set. I've found a number > of things that need to be addressed. Some might seem a bit nit-picky, > sorry about that. However, many of the others genuinely need to be > addressed. > > 1. The following code calls RelationGetPartitionQual, but the output > of that is only needed when partition_bound_has_default(boundinfo) is > true. > > Can you change this to only get the RelationGetPartitionQual when it's required? OK, done that way. > 2. The header comment in match_clauses_to_partkey() does not give any > warning that 'clauses' is modified within the function. > > The function should create a copy of the clauses before modifying > them. This will save you having to do any list_copy calls when you're > calling the function. OK, added a list_copy on clauses at the beginning of the function. > > The header comment is also not very clear about what the return value > of the function is. Fixed the comment to describe return values. > 3. "method" I think should be "strategy". We've pretty much > standardised on that term everywhere else, so let's keep to standard. > > /* > * Does the query specify a key to be null or not null? Partitioning > * handles null partition keys specially depending on the partitioning > * method in use, we store this information. > */ Fixed. > 4. "relation" should be in single quotes, since you're talking about > the parameter named "relation". Likewise with "partclauses", otherwise > it just seems like bad English. > > * Determine the set of partitions of relation that will satisfy all > * the clauses contained in partclauses Fixed. > 5. partdesc's assignment can be delayed until it's needed. This will > save generating it when constfalse == true > > static Bitmapset * > get_partitions_from_clauses_recurse(Relation relation, int rt_index, > List *clauses) > { > PartitionDesc partdesc = RelationGetPartitionDesc(relation); Moved the whole line to a tiny block where partdesc is used. > 6. In the following comment, I'd have expected almost identical code, > just with some other List, but the code probably differs a bit too > much to use "Ditto". > > /* > * Ditto, but this time or_clauses. > */ Hmm, OK. I tried rewriting the comment a bit. > > 7. Comment claims we use "set union", but we're really just collecting > the members from the other set in: > > /* > * Partition sets obtained from mutually-disjunctive clauses are > * combined using set union. > */ > result = bms_add_members(result, arg_partset); Guess the comment doesn't really add much in this case, so just deleted it. > 8. These arrays could just be initialized up to partkey->partnatts. > I'd imagine most of the time this will be just 1, so would save > needlessly setting the 31 other elements, although, perhaps it's a bad > idea to optimize this. > > memset(keyclauses_all, 0, sizeof(keyclauses_all)); > /* false means we don't know if a given key is null */ > memset(keyisnull, false, sizeof(keyisnull)); > /* false means we don't know if a given key is not null */ > memset(keyisnotnull, false, sizeof(keyisnull)); > > The last two of these could just be Bitmapsets and you'd not need any > memset at all. PARTITION_MAX_KEYS just so happens to be the same as > BITS_PER_BITMAPWORD in a standard build, so you'd always be able to > mark everything with a single bitmap word.
This would help a bit in > the various places that you're counting the true elements, for > example, the following code: > > for (i = 0; i < partkey->partnatts; i++) > { > if (!keys->keyisnotnull[i]) > { > include_def = true; > break; > } > } > > could become: > > include_def = !bms_is_empty(keys->keyisnotnull); > > if you converted all these to Bitmapsets. I liked the idea of using Bitmapset for keyisnull and keyisnotnull, so implemented it. > 9. The following comment would be better read as: /* clause does not > match this partition key */ > > /* Clause not meant for this column. */ > > 10. The following comment talks about handling less than operators for > hash opfamilies, but the code only handles <> for btree and list > partitioning. > > * Handle some cases wherein the clause's operator may not > * belong to the partitioning operator family. For example, > * operators named '<>' are not listed in any operator > * family whatsoever. Also, ordering opertors like '<' are > * not listed in the hash operator family. > > "opertors" should be spelled "operators" Rewrote the comments here a bit. > 11. In the following comment "operator" should be "operators": > > * are constrained by clauses containing equality operator, unless hash > > Likewise for: > > * IS NULL clauses while remaining have clauses with equality operator. Fixed. > 12. The following code in classify_partition_bounding_keys probably > should use EXPR_MATCHES_PARTKEY. > > /* Does leftop match with this partition key column? */ > if ((IsA(arg, Var) && > ((Var *) arg)->varattno == partattno) || > equal(arg, partexpr)) Done. > 13. The comment for classify_partition_bounding_keys does not seem to > define well what the meaning of the return value is: > > * classify_partition_bounding_keys > * Classify partition clauses into equal, min, and max keys, along with > * any Nullness constraints and return that information in the output > * argument keys (number of keys is the return value) > > I had thought all along that this must mean the number of distinct > partition keys that we've found useful clauses for, but it's possible > to fool this with duplicate IS NULL checks. > > create table hp (a int, b text) partition by hash (a, b); > create table hp0 partition of hp for values with (modulus 4, remainder 0); > create table hp3 partition of hp for values with (modulus 4, remainder 3); > create table hp1 partition of hp for values with (modulus 4, remainder 1); > create table hp2 partition of hp for values with (modulus 4, remainder 2); > > explain select * from hp where a is null and a is null; > > This causes classify_partition_bounding_keys to return 2. I can't see > any bad side effects of this, but I think the comment needs to be > improved. Perhaps it's better just to make the function return bool, > returning true if there are any useful keys? OK, I made classify_partition_bounding_keys() return a bool instead. > 14. The switch statement in partition_op_strategy could be simplified > a bit and be written more as: > > case BTLessEqualStrategyNumber: > *incl = true; > /* fall through */ > case BTLessStrategyNumber: > result = PART_OP_LESS; > break; That's better, done. > 15. No default case statements in partition_op_strategy() could result > in "result" not being set. I added a default case, but it's simply an elog(ERROR, ...). > 16. I'm unsure what the following comment means: > > * To set eqkeys, we must have found the same for partition key columns. 
> > The word "for" seems wrong, or I'm not sure what it wants to ensure it > finds the same for. I rewrote this comment a bit, since like you, I too am no longer able to make sense of it. Just wanted to say here that to set keys->eqkeys at all, we must have found matching clauses containing equality operators for all partition key columns. > 17. Header comment for partition_op_strategy is out-of-date. > > /* > * Returns -1, 0, or 1 to signify that the partitioning clause has a </<=, > * =, and >/>= operator, respectively. Sets *incl to true if equality is > * implied. > */ > > It'll never return -1. Oops, fixed. > 18. The following comment has some grammar issues: > > /* > * If couldn't coerce to the partition key type, that is, the type of > * datums stored in PartitionBoundInfo, no hope of using this > * expression for anything partitioning-related. > */ > > Would be better with: > > /* > * If we couldn't coerce the partition key type, that is, the type > * of datums stored in PartitionBoundInfo, then there's no hope of > * using this expression for anything partitioning-related. > */ OK, done. > 19. In partkey_datum_from_expr() the Assert(false) at the end seems > overkill. You could just get rid of the default in the switch > statement and have it fall through to the final return. This would > save 4 lines of code. Actually, let's just get rid of the switch. If I remove the default case like you suggest, there are NodeTag enum values not handled warnings. > 20. The word "and" I think needs to be removed from the following comment: > > * Couldn't compare; keep hash_clause set to the previous value and > * so add this one directly to the result. Caller would Actually, it seems better to keep the "and" and remove the "so". > Probably also needs a comma after "value" Done. > 21. The following comment seems to indicate 'cur' is an operator, but > it's a PartClause: > > /* The code below is for btree operators, which cur is not. */ > > It might be better to write > > /* The code below handles Btree operators which are not relevant to a > hash-partitioned table. */ Agreed, done. > 22. "Stuff" should be "The code" in: > > * Stuff that follows closely mimics similar processing done by Done. > 23. In remove_redundant_clauses() and various other places, you have a > variable named partattoff. What's the meaning of this name? I'd > imagine it is short for "partition attribute offset", but it's a > "partition key index", so why not "partkeyidx"? > > Of course, you might claim that an array index is just an offset, but > it is a little confusing if you think of attribute offsets in a > TupleDesc. Hmm, OK. Replaced partattoff with partkeyidx. > 24. get_partitions_for_keys() you're using "return" and "break" in the > switch statement. You can remove the breaks; Oops, fixed. > 25. The Assert(false) in get_partitions_for_keys seems overkill. I > think it's fine to just do: > > return NULL; /* keep compiler quiet */ Done, too. > 26. The following code comment in get_partitions_for_keys_hash() does > not seem very well written: > > * Hash partitioning handles puts nulls into a normal partition and > * doesn't require to define a special null-accpting partition. > * Caller didn't count nulls as a valid key; do so ourselves. > > Maybe "puts" should be "storing"? > > Also, does hash partitioning actually support a NULL partition? This > seems to say it doesn't require one, but as far as I can see it does not > *support* a NULL partition. The comment is a bit misleading. > "accpting" should be "accepting".
I rewrote that comment. I wanted to say that, unlike range and list partitioning which have special handling for null values (there is only one list/range partition at any given time that could contain nulls in the partition key), hash partitioning does not. All hash partitions could contain nulls in one or more of the partition keys and hence we must consider nulls as regular equality keys. > 27. In get_partitions_for_keys_hash() why is it possible to get a > result_index below 0? In that case, the code will end up triggering > the Assert(false), but if you want to Assert something here then maybe > it should be Assert(result_index >= 0)? > > if (result_index >= 0) > return bms_make_singleton(result_index); I too used to think that it's impossible to get result_index < 0, but it's actually possible, because not all required hash partitions may have been defined yet: create table hashp (a int) partition by hash (a); create table hashp0 partition of hashp for values with (modulus 4, remainder 0); If the user only defines one partition like shown above, then there is no partition for when the remainder for a given partition key turns out to be 1, 2, or 3. create table hashp (a int) partition by hash (a); create table hashp0 partition of hashp for values with (modulus 4, remainder 0); insert into hashp values (1); INSERT 0 1 explain select * from hashp where a = 1; QUERY PLAN -------------------------------------------------------------- Append (cost=0.00..41.88 rows=13 width=4) -> Seq Scan on hashp0 (cost=0.00..41.88 rows=13 width=4) Filter: (a = 1) (3 rows) insert into hashp values (2); ERROR: no partition of relation "hashp" found for row DETAIL: Partition key of the failing row contains (a) = (2). explain select * from hashp where a = 2; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. I guess I forgot to remove that Assert(false) when fixing the code after realizing that it's indeed possible to not get a hash partition for some keys. Removed the Assert. > 28. Is there any point in the following loop in get_partitions_for_keys_list()? > > /* > * We might be able to get the answer sooner based on the nullness of > * keys, so get that out of the way. > */ > for (i = 0; i < partkey->partnatts; i++) > { > > Won't partkey->partnatts always be 1? Maybe you can Assert that > instead, just in case someone forgets to change the code if LIST is to > support multiple partition keys. I guess that's a leftover from when all partition strategies were handled in one function. Got rid of the loop and added the Assert. > > 29. In get_partitions_for_keys_list and get_partitions_for_keys_range > partdesc is only used for an Assert. Maybe you can just: > > Assert(RelationGetPartitionDesc(rel)->nparts > 0); OK, done. > 30. The Assert(false) in get_partitions_for_keys_list() looks > suspiciously easy to hit... > > create table lp (a int not null) partition by list(a); > create table lp1 partition of lp for values in(1); > explain select * from lp where a > 11; -- Assert fail! > > 31. ranget? > > * get_partitions_for_keys_range > * Return partitions of a ranget partitioned table for requested keys The Assert indeed was easy to hit. Removed. > 32. I'm not quite sure what this comment is saying: > > /* > * eqoff is gives us the bound that is known to be <= eqkeys, > * given how partition_bound_bsearch works.
The bound at eqoff+1, > * then, would be the upper bound of the only partition that needs > * to be scanned. > */ Rewrote the comment as: /* * The bound at eqoff is known to be <= eqkeys, given the way * partition_bound_bsearch works. Considering the same as the lower * bound of the partition that eqkeys falls into, the bound at * eqoff + 1 would be its upper bound, so use eqoff + 1 to get the * desired partition's index. */ Any better? > 33. "one" -> "the one" > > * Find the leftmost bound that satisfies the query, i.e., one that > * satisfies minkeys. Fixed. > 34. Why "will" there be multiple partitions sharing the same prefix? > > * If only a prefix of the whole partition key is provided, there will > * be multiple partitions whose bound share the same prefix. If minkey > > Perhaps "will" should be "may"? I guess you're right. > 35. I think "would've" in the following comment is not correct. This > seems to indicate that this has already happened, which it has not. > "will have to be" might be more correct? > > * satisfy the query, but don't have a valid partition assigned. The > * default partition would've been included to cover those values. "will have to be included" sounds correct to me too. > 36. "will" -> "with"? > > * Since partition keys will nulls are mapped to default range Fixed. > 37. "If no" -> "If there's no" > > * If no tuple datum to compare with the bound, consider > * the latter to be greater. > > 38. I don't see anything about setting keynullness here: > > /* > * Get the clauses that match the partition key, including information > * about any nullness tests against partition keys. Set keynullness to > * a invalid value of NullTestType, which 0 is not. > */ > partclauses = match_clauses_to_partkey(root, rel, > list_copy(rel->baserestrictinfo), > &contains_const, > &constfalse); I guess the comment is too ancient. Fixed. > > 39. "paritions" -> "partitions" > > * Else there are no clauses that are useful to prune any paritions, Fixed. > 40. In match_clauses_to_partkey, the following code pulls the varnos > from each operand of the expression. It would be better to just pull > the side that's needed (if any) a bit later rather than always doing > both. > > left_relids = pull_varnos((Node *) leftop); > right_relids = pull_varnos((Node *) rightop); Ah, done. > 41. I think a list_copy is missing here: > > /* > * Accumulate the live partitioned children of this child, if it's > * itself partitioned rel. > */ > if (childrel->part_scheme) > partitioned_rels = list_concat(partitioned_rels, > childrel->live_partitioned_rels); > > Surely if we don't perform a list_copy of > childrel->live_partitioned_rels then subsequent additions to the > resulting list will inadvertently add new items to the > childrel->live_partitioned_rels? Yeah, I think what's in the patch now is clearly hazardous. Added a list_copy on childrel->live_partitioned_rels. > 42. Is this because of an existing bug? > > @@ -1906,11 +1904,13 @@ explain (costs off) select * from mcrparted > where abs(b) = 5; -- scans all parti > Filter: (abs(b) = 5) > -> Seq Scan on mcrparted3 > Filter: (abs(b) = 5) > + -> Seq Scan on mcrparted4 > + Filter: (abs(b) = 5) > -> Seq Scan on mcrparted5 > Filter: (abs(b) = 5) Actually, to constraint exclusion's eyes, mcrparted4's partition constraint is refuted by abs(b) = 5. With the new patch, we simply do not perform any partition pruning for that clause, because we do not have a constraint on an earlier partition key column. > 43. 
In partition_prune.sql you have a mix of /* */ and -- comments. > Please just use -- Oops, fixed. On 2017/12/29 15:47, David Rowley wrote: > Just a few extras that I found: > > 44. In match_clauses_to_partkey you're making use of > estimate_expression_value(); I don't think this is safe. > > if (IsA(estimate_expression_value(root, rightop), Const)) > *contains_const = true; > > The only other places I see using this in the planner are for costing > purposes. Also, the header comment for that function says it's not > safe. Particularly "This effectively means that we plan using the > first supplied value of the Param.". If that's the case, then if we're > planning a generic plan, wouldn't it be possible that the planner > chooses the current supplied parameter value and prunes away partitions > based on that value? That would make the plan invalid for any other > parameter, but it's meant to be a generic plan, so we can't do that. You might be right. Perhaps I was thinking of eval_const_expressions() there, which I guess should be enough for this purpose. > > 45. Why use a list_copy() here? > > /* > * For a nested ArrayExpr, we don't know how to get the > * actual scalar values out into a flat list, so we give > * up doing anything with this ScalarArrayOpExpr. > */ > if (arrexpr->multidims) > continue; > > elem_exprs = list_copy(arrexpr->elements); I guess I was just being paranoid there. No need, so removed. On 2018/01/10 7:35, David Rowley wrote: > There's a small problem with get_partitions_for_keys_list(), the > following case hits the Assert(false) at the bottom of that function. > > create table ab_c (a int not null, b char) partition by list(a); > create table abc_a2 (b char, a int not null) partition by list(b); > create table abc_a2_b3 partition of abc_a2 for values in ('3'); > alter table ab_c attach partition abc_a2 for values in (2); > > select * from ab_c where a between 1 and 2 and b <= '2'; I removed that Assert per your comment above (in an earlier email of yours). On 2018/01/10 10:55, David Rowley wrote: > One more thing I discovered while troubleshooting a bug Beena reported > in the run-time partition pruning patch is that > classify_partition_bounding_keys properly does; > > if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) > constexpr = rightop; > else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) > constexpr = leftop; > else > /* Clause not meant for this column. */ > continue; > > for OpExpr clauses, but does not do the same for leftop for the > ScalarArrayOpExpr test. I'm not sure why we'd need to do that? Does the syntax of clauses that use a ScalarArrayOpExpr() allow them to have the partition key on RHS? Can you point me to the email where Beena reported the problem in question? On 2018/01/10 13:18, David Rowley wrote: > Beena's tests on the run-time partition pruning patch also indirectly > exposed a problem with this patch. > > Basically, the changes to add_paths_to_append_rel() are causing > duplication in partition_rels. > > A test case is: > > create table part (a int, b int) partition by list(a); > create table part1 partition of part for values in(1) partition by list (b); > create table part2 partition of part1 for values in(1); > > select * from part; > > partition_rels ends up with 3 items in the list, but there's only 2 > partitions here.
The reason for this is that, since planning here is > recursively calling add_paths_to_append_rel, the list for part ends up > with itself and part1 in it, then since part1's list already contains > itself, per set_append_rel_size's "rel->live_partitioned_rels = > list_make1_int(rti);", then part1 ends up in the list twice. It seems that I found the problem. Currently, the set_append_rel_size() step already accumulates the full list in the root partitioned table's rel->live_partitioned_rels, that is, the list of RT indexes of *all* partitioned relations in the tree. Then, when add_paths_to_append_rel() tries to accumulate child rel's live_partitioned_rels into the parent's, duplication occurs, because the latter already contains all the entries as compiled by the earlier step. I think having only the latter do the accumulation is better, because even partition-wise join code needs this facility and it only ever calls add_paths_to_append_rel(). Attached updated version of the patch set containing fixes for almost all the things mentioned above. Thanks again for your thoughtful review comments and sorry that I couldn't reply sooner. Thanks, Amit
Attachment
- v18-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v18-0002-Introduce-a-get_partitions_from_clauses.patch
- v18-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v18-0004-More-refactoring-around-partitioned-table-Append.patch
- v18-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
On 2018/01/04 23:29, Ashutosh Bapat wrote: > On Fri, Dec 29, 2017 at 6:32 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> I happened to notice that Ashutosh's patch series at >> https://www.postgresql.org/message-id/CAFjFpReJhFSoy6DqH0ipFSHd=sLNEkSzAtz4VWCaS-w2jZL=uw@mail.gmail.com >> has a 0001 patch that modifies the partition_bound_cmp stuff too. >> Are those conflicting? >> >> Ashutosh's commit message: >> Modify bound comparision functions to accept members of PartitionKey >> >> Functions partition_bound_cmp(), partition_rbound_cmp() and >> partition_rbound_datum_cmp() are required to merge partition bounds >> from joining relations. While doing so, we do not have access to the >> PartitionKey of either relations. So, modify these functions to accept >> only required members of PartitionKey so that the functions can be >> reused for merging bounds. >> >> Amit's: >> Some interface changes for partition_bound_{cmp/bsearch} >> >> Introduces a notion of PartitionBoundCmpArg, which replaces the set >> of arguments void *probe and bool probe_is_bound of both >> partition_bound_cmp and partition_bound_bsearch. It wasn't possible >> before to specify the number of datums when a non-bound type of >> probe is passed. Slightly tweaking the existing interface to allow >> specifying the same seems awkward. So, instead encapsulate that >> into PartitionBoundCmpArg. Also, modify partition_rbound_datum_cmp >> to compare caller-specifed number of datums, instead of >> key->partnatts datums. >> > > I haven't looked at Amit's changes, but we need a more flexible way to > pass information required for datum comparison than using > PartitionKey, since that's not available in the optimizer and can not > be associated with join, aggregate relations. If we pass that > information through a structure, there are two ways > 1. it will need to be part of PartitionScheme; I am not sure if we can > have a substructure in PartitionKey. But if we can do it that way, we > can pass that structure to the functions. > 2. we will need to construct the structure filling it with comparison > information and pass it to the comparison functions. I think what we > achieve out of this isn't worth the code we will need to add. > > I would prefer first approach over the other. ISTM that they're non-conflicting for the most part. My patch is about modifying the way to bring "datums" into partition_bound_cmp(), whereas Ashutosh's is about modifying the way we bring the partition key information. Changes seem orthogonal to me, although, the patches definitely won't like each other when applying to the tree. Thanks, Amit
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From: David Rowley
Thanks for addressing that list. Just one thing to reply on before I look at the updated version: On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/01/10 10:55, David Rowley wrote: >> One more thing I discovered while troubleshooting a bug Beena reported >> in the run-time partition pruning patch is that >> classify_partition_bounding_keys properly does; >> >> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) >> constexpr = rightop; >> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) >> constexpr = leftop; >> else >> /* Clause not meant for this column. */ >> continue; >> >> for OpExpr clauses, but does not do the same for leftop for the >> ScalarArrayOpExpr test. > > I'm not sure why we'd need to do that? Does the syntax of clauses that > use a ScalarArrayOpExpr() allow them to have the partition key on RHS? No, but there's no test to ensure the leftop matches the partition key. There's just: ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause; Oid saop_op = saop->opno; Oid saop_opfuncid = saop->opfuncid; Oid saop_coll = saop->inputcollid; Node *leftop = (Node *) linitial(saop->args), *rightop = (Node *) lsecond(saop->args); List *elem_exprs, *elem_clauses; ListCell *lc1; bool negated = false; /* * In case of NOT IN (..), we get a '<>', which while not * listed as part of any operator family, we are able to * handle the same if its negator is indeed a part of the * partitioning operator family. */ if (!op_in_opfamily(saop_op, partopfamily)) { Oid negator = get_negator(saop_op); int strategy; Oid lefttype, righttype; if (!OidIsValid(negator)) continue; get_op_opfamily_properties(negator, partopfamily, false, &strategy, &lefttype, &righttype); if (strategy == BTEqualStrategyNumber) negated = true; } Since there's nothing to reject the clause that does not match the partition key, the IN's left operand might be of any random type, and may well not be in partopfamily, so when it comes to looking up get_op_opfamily_properties() you'll hit: elog(ERROR, "operator %u is not a member of opfamily %u", opno, opfamily); Still looking at the v17 patch here, but in the IsBooleanOpfamily(partopfamily) case I also don't see a test that the clause matches the partition key. > Can you point me to the email where Beena reported the problem in question? https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From: David Rowley
> On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Can you point me to the email where Beena reported the problem in question? > > https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com To save you from having to look at the run-time prune patch, here's a case that breaks in v18. create table xy (a int, b text) partition by range (a,b); create table xy1 partition of xy for values from (0,'a') to (10, 'b'); select * from xy where a = 1 and b in('x','y'); ERROR: operator 531 is not a member of opfamily 1976 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From: Amit Langote
On 2018/01/11 19:23, David Rowley wrote: > On 11 January 2018 at 22:52, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/01/10 10:55, David Rowley wrote: >>> One more thing I discovered while troubleshooting a bug Beena reported >>> in the run-time partition pruning patch is that >>> classify_partition_bounding_keys properly does; >>> >>> if (EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) >>> constexpr = rightop; >>> else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr)) >>> constexpr = leftop; >>> else >>> /* Clause not meant for this column. */ >>> continue; >>> >>> for OpExpr clauses, but does not do the same for leftop for the >>> ScalarArrayOpExpr test. >> >> I'm not sure why we'd need to do that? Does the syntax of clauses that >> use a ScalarArrayOpExpr() allow them to have the partition key on RHS? > > No, but there's no test to ensure the leftop matches the partition key. > > There's just: > > ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause; > Oid saop_op = saop->opno; > Oid saop_opfuncid = saop->opfuncid; > Oid saop_coll = saop->inputcollid; > Node *leftop = (Node *) linitial(saop->args), > *rightop = (Node *) lsecond(saop->args); > List *elem_exprs, > *elem_clauses; > ListCell *lc1; > bool negated = false; > > /* > * In case of NOT IN (..), we get a '<>', which while not > * listed as part of any operator family, we are able to > * handle the same if its negator is indeed a part of the > * partitioning operator family. > */ > if (!op_in_opfamily(saop_op, partopfamily)) > { > Oid negator = get_negator(saop_op); > int strategy; > Oid lefttype, > righttype; > > > if (!OidIsValid(negator)) > continue; > get_op_opfamily_properties(negator, partopfamily, false, > &strategy, > &lefttype, &righttype); > if (strategy == BTEqualStrategyNumber) > negated = true; > } > > Since there's nothing to reject the clause that does not match the > partition key, the IN's left operand might be of any random type, and > may well not be in partopfamily, so when it comes to looking up > get_op_opfamily_properties() you'll hit: elog(ERROR, "operator %u is > not a member of opfamily %u", opno, opfamily); Ah, I completely missed that. So we need something like the following in this IsA(clause, ScalarArrayOpExpr) block: + /* Clause does not match this partition key. */ + if (!EXPR_MATCHES_PARTKEY(leftop, partattno, partexpr)) + continue; + > Still looking at the v17 patch here, but in the IsBooleanOpfamily(partopfamily) > case I also don't see a test that the clause matches the partition key. You're right. Added checks there as well. >> Can you point me to the email where Beena reported the problem in question? > > https://www.postgresql.org/message-id/CAOG9ApERiop7P=GRkqQKa82AuBKjxN3qVixie3WK4WqQpEjS6g@mail.gmail.com > > To save you from having to look at the run-time prune patch, here's > a case that breaks in v18. > > create table xy (a int, b text) partition by range (a,b); > create table xy1 partition of xy for values from (0,'a') to (10, 'b'); > select * from xy where a = 1 and b in('x','y'); > ERROR: operator 531 is not a member of opfamily 1976 You'll be able to see that the error no longer appears with the attached updated set of patches, but I'm now seeing that the resulting plan with the patches applied for this particular query differs from what master (constraint exclusion) produces. Master produces a plan with no partitions (as one would think is the correct plan), whereas patched produces a plan including the xy1 partition.
I will think about that a bit and post something later. Thanks, Amit
Attachment
- v19-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v19-0002-Introduce-a-get_partitions_from_clauses.patch
- v19-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v19-0004-More-refactoring-around-partitioned-table-Append.patch
- v19-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From: David Rowley
On 12 January 2018 at 15:27, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/01/11 19:23, David Rowley wrote: >> ERROR: operator 531 is not a member of opfamily 1976 > > You'll be able to see that the error no longer appears with the attached > updated set of patches, but I'm now seeing that the resulting plan with > the patches applied for this particular query differs from what master > (constraint exclusion) produces. Master produces a plan with no partitions > (as one would think is the correct plan), whereas patched produces a plan > including the xy1 partition. I will think about that a bit and post > something later. Thanks for looking at that. I've got a few more things for you. I'm only partway through another pass, but it makes sense to post what I have now if you're working on a new version. 1. partitioing -> partitioning * Strategy of a partition clause operator per the partitioing operator class 2. get_partitions_from_clauses() modifies partclauses without mentioning it in the header. I think you need to either: a) warn about this in the header comment; b) do a list_copy() before list_concat(); or c) do a list_truncate() back to the original length after you're done with the list. 3. get_partitions_from_clauses_recurse(), with: result = bms_add_range(result, 0, partdesc->nparts - 1); You could change that to bms_add_range(NULL, ...) and ditch the assignment of result to NULL at the start of the function. 4. classify_partition_bounding_keys() now returns bool, but the return statement is still: return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness; my compiler didn't warn about that, but I'd imagine some might. Instead, can you make it: if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 || keys->n_maxkeys > 0 || n_keynullness > 0) return true; return false; Probably equal keys are the most likely case, so it'll be good to short circuit instead of performing addition on a bunch of stuff we don't care about anymore. 5. In classify_partition_bounding_keys, why do we "continue" here? clause = rinfo->clause; if (rinfo->pseudoconstant && !DatumGetBool(((Const *) clause)->constvalue)) { *constfalse = true; continue; } Is there any point in searching further? Also, if you were consistent with the return value for classify_partition_bounding_keys when you've set *constfalse = true; you wouldn't need to handle the case twice like you are in get_partitions_from_clauses_recurse(). 6. I think it would be nicer if get_partitions_from_ne_clauses returns a set of partitions that could be excluded. So instead of: * get_partitions_from_ne_clauses * * Return partitions of relation that satisfy all <> operator clauses in * ne_clauses. Only ever called if relation is a list partitioned table. Have: * get_partitions_from_ne_clauses * * Returns a Bitmapset of partitions that can be safely excluded due to * not-equal clauses existing for all possible partition values. It is only * valid to call this for LIST partitioned tables. and instead of: result = bms_add_range(NULL, 0, partdesc->nparts - 1); result = bms_del_members(result, excluded_parts); bms_free(excluded_parts); return result; Just do: return excluded_parts; and in get_partitions_from_clauses_recurse(), do bms_del_members instead of bms_int_members. There's less bit shuffling and it seems cleaner. Perhaps the function name would need to be changed if we're inverting the meaning too. (I've attached a patch which makes this change along with an idea in #8 below) 7.
The following comment claims the function sets *datum, but there's no param by that name: /* * partkey_datum_from_expr * Extract constant value from expr and set *datum to that value */ static bool partkey_datum_from_expr(PartitionKey key, int partkeyidx, Expr *expr, Datum *value) 8. The code in get_partitions_from_ne_clauses() performs quite a few nested loops. I think a simpler way would be to track the offsets you've seen in a Bitmapset. This would save you having to check for duplicates, as an offset can only contain a single datum. You'd just need to build a couple of arrays after that, one to sum up the offsets found per partition, and one for the total datums allowed in the partition. If the numbers match then you can remove the partition. I've written this and attached it to this email. It saves about 50 lines of code and should perform much better for complex cases, for example, a large NOT IN list. This also implements #6. 9. "the same" -> "it" /* * In case of NOT IN (..), we get a '<>', which while not * listed as part of any operator family, we are able to * handle the same if its negator is indeed a part of the * partitioning operator family. */ 10. In classify_partition_bounding_keys: "0" -> "false" /* Return if no work to do below. */ if (!will_compute_keys || *constfalse) return 0; Likewise for: if (*constfalse) return 0; 11. I don't see partition_bound_bsearch used anywhere below the following comment: * Generate bounding tuple(s). * * We look up partitions in the partition bound descriptor using, say, * partition_bound_bsearch(), which expects a Datum (or Datums if multi- * column key). So, extract the same out of the constant argument of * each clause. I also don't know what the comment is trying to say. 12. * operator and sets *incl if equality is implied should be: * operator and set *incl to true if the operator's strategy is inclusive. 13. What does "the same" mean in: * and add this one directly to the result. Caller would * arbitrarily choose one of the many and perform * partition-pruning with the same. It's possible that mutual I think you quite often use "the same" to mean "it". Can you change that? 14. Not sure what parameter you're talking about here. * Evaluate 'leftarg op rightarg' and set *result to its value. * * leftarg and rightarg referred to above actually refer to the constant * operand (Datum) of the clause contained in the parameters leftarg and * rightarg below, respectively. And op refers to the operator of the * clause contained in the parameter op below. 15. "the latter" is normally used when you're referring to the last thing in a list which was just mentioned. In this case, leftarg_const and rightarg_const form the list, so "the latter" should mean rightarg_const, but I think you mean to compare them using the operator. * If the leftarg_const and rightarg_const are both of the type expected * by op's operator, then compare them using the latter. 16. There are a few things to improve with the following comment: /* * Hash partitioning stores partition keys containing nulls in regular * partitions. That is, the code that determines the hash partition for * a given row admits nulls in the partition key when computing the key's * hash. So, here we treat any IS NULL clauses on partition key columns as * equality keys, along with any other non-null values coming from equality * operator clauses.
*/ "admits" is not the correct word here, and "hash" should be "correct", but there are more mistakes, so might be easier just to rewrite to: /* * Since tuples with NULL values in the partition key columns are stored in regular partitions, * we'll treat any IS NULL clauses here as regular equality clauses. */ 17. The following example will cause get_partitions_for_keys_hash to misbehave: create table hashp (a int, b int) partition by hash (a, b); create table hashp1 partition of hashp for values with (modulus 4, remainder 0); create table hashp2 partition of hashp for values with (modulus 4, remainder 1); create table hashp3 partition of hashp for values with (modulus 4, remainder 3); create table hashp4 partition of hashp for values with (modulus 4, remainder 2); explain select * from hashp where a = 1 and a is null; The following code assumes that you'll never get a NULL test for a key that has an equality test, and ends up trying to prune partitions thinking we got compatible clauses for both partition keys. memset(keyisnull, false, sizeof(keyisnull)); for (i = 0; i < partkey->partnatts; i++) { if (bms_is_member(i, keys->keyisnull)) { keys->n_eqkeys++; keyisnull[i] = true; } } /* * Can only do pruning if we know all the keys and they're all equality * keys including the nulls that we just counted above. */ if (keys->n_eqkeys == partkey->partnatts) The above code will need to be made smarter. It'll likely crash if you change "b" to a pass-by-ref type. 18. The following code: int other_idx = -1; /* * Only a designated partition accepts nulls, which if there * exists one, return the same. */ if (partition_bound_accepts_nulls(boundinfo) || partition_bound_has_default(boundinfo)) other_idx = partition_bound_accepts_nulls(boundinfo) ? boundinfo->null_index : boundinfo->default_index; if (other_idx >= 0) return bms_make_singleton(other_idx); else return NULL; should be simplified to: /* * NULLs may only exist in the NULL partition, or in the * default, if there's no NULL partition. */ if (partition_bound_accepts_nulls(boundinfo)) return bms_make_singleton(boundinfo->null_index); else if (partition_bound_has_default(boundinfo)) return bms_make_singleton(boundinfo->default_index); return NULL; 19. "exists" -> "are" * If there are no datums to compare keys with, but there exist * partitions, it must be the default partition. Also, instead of writing "it must be the default partition." it would be better to say "just return the default partition." 20. I don't think the return NULL should ever be hit; is it worth putting a comment to say /* shouldn't happen */? if (boundinfo->ndatums == 0) { if (partition_bound_has_default(boundinfo)) return bms_make_singleton(boundinfo->default_index); else return NULL; } 21. The following comment does not explain the situation well: /* * boundinfo->ndatums - 1 is the last valid list partition datums * index. */ There's really no possible non-default partition for this case, so perhaps we should just return the default, if one exists. We do go on to check the n_maxkeys needlessly for this case. At the very least the comment should be changed to: /* * minkeys values are greater than any non-default partition. * We'll check for that case below. */ but I think it's worth just doing the default partition check there and returning it, or NULL. It should help reduce confusion. Can you also perform a self-review of the patch? Some of the things I'm picking up are leftovers from a previous version of the patch.
We might never get through this review if you keep leaving those around! I won't continue reviewing until next week, so don't rush. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
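A minimal sketch of the Bitmapset-based scheme from point #8, in the spirit of the patch attached above but not copied from it. The helper datum_is_excluded_by_ne_clauses() is hypothetical; boundinfo->ndatums, boundinfo->indexes and partdesc->nparts are the existing PartitionBoundInfo/PartitionDesc fields:

	Bitmapset  *excluded_parts = NULL;
	Bitmapset  *seen_offsets = NULL;
	int		   *datums_found;
	int		   *datums_total;
	int			i;

	datums_found = (int *) palloc0(sizeof(int) * partdesc->nparts);
	datums_total = (int *) palloc0(sizeof(int) * partdesc->nparts);

	/* Count the total number of datums each list partition accepts. */
	for (i = 0; i < boundinfo->ndatums; i++)
		datums_total[boundinfo->indexes[i]]++;

	/*
	 * Record each bound offset refuted by some <> clause.  Using a
	 * Bitmapset makes duplicate hits from multiple clauses harmless,
	 * since an offset can only be a member once.
	 */
	for (i = 0; i < boundinfo->ndatums; i++)
		if (datum_is_excluded_by_ne_clauses(boundinfo, i))	/* hypothetical */
			seen_offsets = bms_add_member(seen_offsets, i);

	/* Sum up the refuted offsets per partition. */
	i = -1;
	while ((i = bms_next_member(seen_offsets, i)) >= 0)
		datums_found[boundinfo->indexes[i]]++;

	/* A partition with every one of its datums refuted can be excluded. */
	for (i = 0; i < partdesc->nparts; i++)
		if (datums_total[i] > 0 && datums_found[i] == datums_total[i])
			excluded_parts = bms_add_member(excluded_parts, i);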
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
David, On 2018/01/12 12:30, David Rowley wrote: > Can you also perform a self-review of the patch? Some of the things > I'm picking up are leftovers from a previous version of the patch. We > might never get through this review if you keep leaving those around! Sorry, I will look more closely before posting the next version. I guess I may have rushed a bit too much when posting the v18/v19 patches, partly because it's been 3 weeks since v17 and I felt I needed to catch up quickly given the activity on the run-time pruning thread which depends on the patches here. Thanks, Amit
On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote: > Basically, the changes to add_paths_to_append_rel() are causing > duplication in partition_rels. > > A test case is: > > create table part (a int, b int) partition by list(a); > create table part1 partition of part for values in(1) partition by list (b); > create table part2 partition of part1 for values in(1); > > select * from part; > > partition_rels ends up with 3 items in the list, but there's only 2 > partitions here. The reason for this is that, since planning here is > recursively calling add_paths_to_append_rel, the list for part ends up > with itself and part1 in it, then since part1's list already contains > itself, per set_append_rel_size's "rel->live_partitioned_rels = > list_make1_int(rti);", then part1 ends up in the list twice. > > It would be nicer if you could use a RelIds for this, but you'd also > need some way to store the target partition relation since > nodeModifyTable.c does: > > /* The root table RT index is at the head of the partitioned_rels list */ > if (node->partitioned_rels) > { > Index root_rti; > Oid root_oid; > > root_rti = linitial_int(node->partitioned_rels); > root_oid = getrelid(root_rti, estate->es_range_table); > rel = heap_open(root_oid, NoLock); /* locked by InitPlan */ > } > > You could also fix it by instead of doing: > > /* > * Accumulate the live partitioned children of this child, if it's > * itself partitioned rel. > */ > if (childrel->part_scheme) > partitioned_rels = list_concat(partitioned_rels, > childrel->live_partitioned_rels); > > do something along the lines of: > > if (childrel->part_scheme) > { > ListCell *lc; > ListCell *start = lnext(list_head(childrel->live_partitioned_rels)); > > for_each_cell(lc, start) > partitioned_rels = lappend_int(partitioned_rels, > lfirst_int(lc)); > } > > Although it seems pretty fragile. It would probably be better to find > a nicer way of handling all this. Hi Amit, I also noticed earlier that this is still broken in v19. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi. On 2018/01/12 18:09, David Rowley wrote: > On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote: >> Basically, the changes to add_paths_to_append_rel() are causing >> duplication in partition_rels. [ ... ] > I also noticed earlier that this is still broken in v19. I cannot see the duplication here (with v19 + some local changes per your latest review, although I had fixed the issue in v18). create table part (a int, b int) partition by list(a); create table part1 partition of part for values in (1) partition by list (b); create table part2 partition of part1 for values in (1); select * from part; For the above query, I set a breakpoint all the way in ExecInitAppend() to see what partitioned_rels list it ends up with and I see no duplication: :partitioned_rels (i 1 3) where 1 and 3 are the RT indexes of part and part1, respectively. With v17, you'd be able to see the duplication: :partitioned_rels (i 1 3 3) Let me confirm again: were you complaining exactly about this duplication, that is, that the RT index of part1 appears twice due to the bug I claim I fixed in v18? Or something else? Thanks, Amit
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 12 January 2018 at 22:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/01/12 18:09, David Rowley wrote: >> On 10 January 2018 at 17:18, David Rowley <david.rowley@2ndquadrant.com> wrote: >>> Basically, the changes to add_paths_to_append_rel() are causing >>> duplication in partition_rels. > >> I also noticed earlier that this is still broken in v19. > > I cannot see the duplication here (with v19 + some local changes per your > latest review, although I had fixed the issue in v18). I may have made a mistake there. The code I expected to change didn't. I meant to test the case again, but I got distracted just before I did and came back a while later and forgot that I hadn't tested. If you've tested my case and it works, then please don't look any further. I will look when v20 is ready. Sorry for the false alarm... I must've been trying to do too many things at once :-( -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Robert Haas
Date:
On Thu, Jan 11, 2018 at 10:30 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Instead, can you make it: > > if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 || > keys->n_maxkeys > 0 || n_keynullness > 0) > return true; > > return false; Or better yet: return (keys->n_eqkeys > 0 || keys->n_minkeys > 0 || keys->n_maxkeys > 0 || n_keynullness > 0); It's not really necessary to write if (some condition is true) return true; return false when you can just write return (boolean-valued condition). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. Thanks for the review. On 2018/01/12 12:30, David Rowley wrote: > I've got a few more things for you. I'm only partway through another > pass, but it makes sense to post what I have now if you're working on > a new version. > > 1. partitioing -> partitioning > > * Strategy of a partition clause operator per the partitioing operator class Fixed. > 2. get_partitions_from_clauses() modifies partclauses without > mentioning it in the header. I think you need to either: > > a) warn about this in the header comment; or > b) do a list_copy() before list_concat() > c) do list_truncate back to the original length after you're done with the list. Went with (b). > 3. get_partitions_from_clauses_recurse(), with: > > result = bms_add_range(result, 0, partdesc->nparts - 1); > > You could change that to bms_add_range(NULL, ...) and ditch the > assignment of result to NULL at the start of the function. Done. > 4. classify_partition_bounding_keys() now returns bool, but the return > statement is still: > > return keys->n_eqkeys + keys->n_minkeys + keys->n_maxkeys + n_keynullness; > > my compiler didn't warn about that, but I'd imagine some might. Oops, my bad. > > Instead, can you make it: > > if (keys->n_eqkeys > 0 || keys->n_minkeys > 0 || > keys->n_maxkeys > 0 || n_keynullness > 0) > return true; > > return false; > > probably equal keys are the most likely case, so it'll be good to > short circuit instead of performing addition on a bunch of stuff we > don't care about anymore. Changed it to what Robert suggested downthread. > 5. In classify_partition_bounding_keys, why do we "continue" here? > > clause = rinfo->clause; > if (rinfo->pseudoconstant && > !DatumGetBool(((Const *) clause)->constvalue)) > { > *constfalse = true; > continue; > } > > Is there any point in searching further? > > Also, if you were consistent with the return value for > classify_partition_bounding_keys when you've set *constfalse = true; > you wouldn't need to handle the case twice like you are in > get_partitions_from_clauses_recurse(). OK, I made classify_partition_bounding_keys() return true whenever set *constfalse to true. > 6. I think it would be nicer if get_partitions_from_ne_clauses returns > a set of partitions that could be excluded. > > So instead of: > > * get_partitions_from_ne_clauses > * > * Return partitions of relation that satisfy all <> operator clauses in > * ne_clauses. Only ever called if relation is a list partitioned table. > > Have: > > * get_partitions_from_ne_clauses > * > * Returns a Bitmapset of partitions that can be safely excluded due to > * not-equal clauses existing for all possible partition values. It is only > * valid to call this for LIST partitioned tables. > > and instead of: > > result = bms_add_range(NULL, 0, partdesc->nparts - 1); > result = bms_del_members(result, excluded_parts); > bms_free(excluded_parts); > > return result; > > Just do: > > return excluded_parts; > > and in get_partitions_from_clauses_recurse(), do bms_del_members > instead of bms_int_members. > > there's less bit shuffling and it seems cleaner. Perhaps the function > name would need to be changed if we're inverting the meaning too. > > (I've attached a patch which makes this change along with an idea in #8 below) Thanks for the suggestions... (comment continues below) > 7. 
The following comment claims the function sets *datum, but there's > no param by that name: > > /* > * partkey_datum_from_expr > * Extract constant value from expr and set *datum to that value > */ > static bool > partkey_datum_from_expr(PartitionKey key, int partkeyidx, > Expr *expr, Datum *value) Fixed. > 8. The code in get_partitions_from_ne_clauses() does perform quite a > few nested loops. I think a more simple way to would be to track the > offsets you've seen in a Bitmapset. This would save you having to > check for duplicates, as an offset can only contain a single datum. > You'd just need to build a couple of arrays after that, one to sum up > the offsets found per partition, and one for the total datums allowed > in the partition. If the numbers match then you can remove the > partition. > > I've written this and attached it to this email. It saves about 50 > lines of code and should perform much better for complex cases, for > example, a large NOT IN list. This also implements #6. I liked your patch, so I incorporated it, except that I feel slightly uncomfortable about the new name that you chose for the function, because it sounds a bit generic. I mean, the function solves a very specific problem and has very strict requirements for calling it. It's not like we could pass it just any partitioned relation and/or just any set of clauses. It has to be a list-partitioned table and the list of clauses must contain only the clauses containing compatible <> operators. Checks for those requirements are carried out in yet another place, that is, classify_partition_bounding_keys(). Perhaps we can live with that though, because it's not a publicly available function, but someone might get confused in the future. > 9. "the same" -> "it" > > /* > * In case of NOT IN (..), we get a '<>', which while not > * listed as part of any operator family, we are able to > * handle the same if its negator is indeed a part of the > * partitioning operator family. > */ Done. > 10. in classify_partition_bounding_keys: "0" -> "false" > > /* Return if no work to do below. */ > if (!will_compute_keys || *constfalse) > return 0; > > Likewise for: > > if (*constfalse) > return 0; I have fixed these per an earlier comment in this email. > 11. I don't see partition_bound_bsearch used anywhere below the > following comment: > > * Generate bounding tuple(s). > * > * We look up partitions in the partition bound descriptor using, say, > * partition_bound_bsearch(), which expects a Datum (or Datums if multi- > * column key). So, extract the same out of the constant argument of > * each clause. > > I also don't know what the comment is trying to say. The comment is no longer very intelligible to me either. I just wanted to say here that, *elsewhere*, we will use a function like partition_bound_bsearch() to look up partitions from the clauses we matched against individual partition key columns. That function expects the lookup key to be in a Datum array form, not a list-of-clauses form. So, we must construct the lookup key(s) by extracting constant values out of the clauses. I tried to rewrite it that way. Hope that's a bit clearer. > 12. > > * operator and sets *incl if equality is implied > > should be: > > * operator and set *incl to true if the operator's strategy is inclusive. Done that way. > 13. What does "the same" mean in: > > * and add this one directly to the result. Caller would > * arbitrarily choose one of the many and perform > * partition-pruning with the same.
It's possible that mutual It means "the one that the caller would arbitrarily choose of the many that this function will return to it". Anyway, I changed "the same" to "it". > I think you quite often use "the same" to mean "it". Can you change that? I guess that's just one of my many odd habits when writing English, all of which I'm trying to get rid of, but apparently with limited success. Must try harder. :) > 14. Not sure what parameter you're talking about here. > > * Evaluate 'leftarg op rightarg' and set *result to its value. > * > * leftarg and rightarg referred to above actually refer to the constant > * operand (Datum) of the clause contained in the parameters leftarg and > * rightarg below, respectively. And op refers to the operator of the > * clause contained in the parameter op below. I rewrote the above comment block as: * Try to compare the constant arguments of 'leftarg' and 'rightarg', in that * order, using the operator of 'op' and set *result to the result of this * comparison. Is that any better? > 15. "the latter" is normally used when you're referring to the last > thing in a list which was just mentioned. In this case, leftarg_const > and rightarg_const is the list, so "the latter" should mean > rightarg_const, but I think you mean to compare them using the > operator. > > * If the leftarg_const and rightarg_const are both of the type expected > * by op's operator, then compare them using the latter. Rewrote it as: * We can compare leftarg_const and rightarg_const using op's operator * only if both are of the type expected by it. > 16. There are a few things to improve with the following comment: > > /* > * Hash partitioning stores partition keys containing nulls in regular > * partitions. That is, the code that determines the hash partition for > * a given row admits nulls in the partition key when computing the key's > * hash. So, here we treat any IS NULL clauses on partition key columns as > * equality keys, along with any other non-null values coming from equality > * operator clauses. > */ > > "admits" is not the correct word here, and "hash" should be "correct", > but there are more mistakes, so might be easier just to rewrite to: > > /* > * Since tuples with NULL values in the partition key columns are > stored in regular partitions, > * we'll treat any IS NULL clauses here as regular equality clauses. > /* Agreed that your version is better, so I went with it. > 17. The following example will cause get_partitions_for_keys_hash to misbehave: > > create table hashp (a int, b int) partition by hash (a, b); > create table hashp1 partition of hashp for values with (modulus 4, remainder 0); > create table hashp2 partition of hashp for values with (modulus 4, remainder 1); > create table hashp3 partition of hashp for values with (modulus 4, remainder 3); > create table hashp4 partition of hashp for values with (modulus 4, remainder 2); > explain select * from hashp where a = 1 and a is null; > > The following code assumes that you'll never get a NULL test for a key > that has an equality test, and ends up trying to prune partitions > thinking we got compatible clauses for both partition keys. Yeah, I think this example helps spot a problem. I thought we'd never get to get_partitions_for_keys_hash() for the above query, because we should've been able to prove much earlier that the particular clause combination is always false (a cannot be both non-null 1 and null).
Now, because the planner itself doesn't substitute a constant-false for that, I taught classify_partition_bounding_keys() to do so. It would now return constfalse=true if it turns out that clauses in the input list lead to contradictory nullness condition for a given partition column. > memset(keyisnull, false, sizeof(keyisnull)); > for (i = 0; i < partkey->partnatts; i++) > { > if (bms_is_member(i, keys->keyisnull)) > { > keys->n_eqkeys++; > keyisnull[i] = true; > } > } > > /* > * Can only do pruning if we know all the keys and they're all equality > * keys including the nulls that we just counted above. > */ > if (keys->n_eqkeys == partkey->partnatts) > > The above code will need to be made smarter. It'll likely crash if you > change "b" to a pass-by-ref type. Hmm, not sure why. It seems to work: create table hp (a int, b text) partition by hash (a, b); create table hp1 partition of hp for values with (modulus 4, remainder 0); create table hp2 partition of hp for values with (modulus 4, remainder 1); create table hp3 partition of hp for values with (modulus 4, remainder 3); create table hp4 partition of hp for values with (modulus 4, remainder 2); insert into hp values (1, 'xxx'); INSERT 0 1 select tableoid::regclass, * from hp; tableoid | a | b ----------+---+----- hp1 | 1 | xxx (1 row) insert into hp (a) values (1); INSERT 0 1 insert into hp (b) values ('xxx'); INSERT 0 1 select tableoid::regclass, * from hp where a is null; tableoid | a | b ----------+---+----- hp2 | | xxx (1 row) select tableoid::regclass, * from hp where b is null; tableoid | a | b ----------+---+--- hp1 | 1 | (1 row) select tableoid::regclass, * from hp where a = 1 and b is null; tableoid | a | b ----------+---+--- hp1 | 1 | (1 row) select tableoid::regclass, * from hp where a is null and b = 'xxx'; tableoid | a | b ----------+---+----- hp2 | | xxx (1 row) > 18. The following code: > > int other_idx = -1; > > /* > * Only a designated partition accepts nulls, which if there > * exists one, return the same. > */ > if (partition_bound_accepts_nulls(boundinfo) || > partition_bound_has_default(boundinfo)) > other_idx = partition_bound_accepts_nulls(boundinfo) > ? boundinfo->null_index > : boundinfo->default_index; > if (other_idx >= 0) > return bms_make_singleton(other_idx); > else > return NULL; > > should be simplified to: > > /* > * NULLs may only exist in the NULL partition, or in the > * default, if there's no NULL partition. > */ > if (partition_bound_accepts_nulls(boundinfo)) > return bms_make_singleton(boundinfo->null_index); > else if (partition_bound_has_default(boundinfo)) > return bms_make_singleton(boundinfo->default_index); > return NULL; Agreed, done that way. > 19. "exists" -> "are" > > * If there are no datums to compare keys with, but there exist > * partitions, it must be the default partition. > > also, instead of writing "it must be the default partition." it should > be better to say "just return the default partition." OK, done. > 20. I don't think the return NULL should ever hit, is it worth putting > a comment to say /* shouldn't happen */ > > if (boundinfo->ndatums == 0) > { > if (partition_bound_has_default(boundinfo)) > return bms_make_singleton(boundinfo->default_index); > else > return NULL; > } I added a /* shouldn't happen */ comment next to return NULL. > 21. Can the following comment does not explain the situation well: > > /* > * boundinfo->ndatums - 1 is the last valid list partition datums > * index. 
> */ > > There's really no possible non-default partition for this case, so > perhaps we should just return the default, if one exists. We do go on > to check the n_maxkeys needlessly for this case. At the very least the > comment should be changed to: > > /* > * minkeys values are greater than any non-default partition. > * We'll check that for case below. > */ > > but I think it's worth just doing the default partition check there > and returning it, or NULL. It should help reduce confusion. Yep, done. Attached v20. Thanks again. Regards, Amit
Attachment
- v20-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v20-0002-Introduce-a-get_partitions_from_clauses.patch
- v20-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v20-0004-More-refactoring-around-partitioned-table-Append.patch
- v20-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
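A rough sketch of the nullness-contradiction check Amit describes above, which lets clauses such as "a = 1 AND a IS NULL" be proven constant-false before any partition lookup. Only keys->keyisnull appears in the quoted code; keys->keyisnotnull, keys->keyeq and the key column index i are assumed here purely for illustration:

	if (IsA(clause, NullTest))
	{
		NullTest   *nulltest = (NullTest *) clause;

		if (nulltest->nulltesttype == IS_NULL)
		{
			/*
			 * "key IS NULL" contradicts an earlier "key IS NOT NULL" or
			 * "key = const" on the same column; the whole clause list is
			 * then provably false.
			 */
			if (bms_is_member(i, keys->keyisnotnull) ||		/* assumed */
				bms_is_member(i, keys->keyeq))				/* assumed */
			{
				*constfalse = true;
				return true;
			}
			keys->keyisnull = bms_add_member(keys->keyisnull, i);
		}
		else
		{
			/* Symmetrically, "key IS NOT NULL" contradicts "key IS NULL". */
			if (bms_is_member(i, keys->keyisnull))
			{
				*constfalse = true;
				return true;
			}
			keys->keyisnotnull = bms_add_member(keys->keyisnotnull, i);
		}
	}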
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/01/12 12:30, David Rowley wrote: >> 8. The code in get_partitions_from_ne_clauses() does perform quite a >> few nested loops. I think a more simple way to would be to track the >> offsets you've seen in a Bitmapset. This would save you having to >> check for duplicates, as an offset can only contain a single datum. >> You'd just need to build a couple of arrays after that, one to sum up >> the offsets found per partition, and one for the total datums allowed >> in the partition. If the numbers match then you can remove the >> partition. >> >> I've written this and attached it to this email. It saves about 50 >> lines of code and should perform much better for complex cases, for >> example, a large NOT IN list. This also implements #6. > > I liked your patch, so incorporated it, except, I feel slightly > uncomfortable about the new name that you chose for the function because > it sounds a bit generic. You're right. I only renamed it because I inverted the meaning of the function in the patch. It no longer did "get_partitions_from_ne_clauses", it did the opposite and give the partitions which can't match. Please feel free to think of a new better name. Is "get_partitions_excluded_by_ne_clauses" too long? >> I think you quite often use "the same" to mean "it". Can you change that? > > I guess that's just one of my many odd habits when writing English, all of > which I'm trying to get rid of, but apparently with limited success. Must > try harder. :) Oops, on re-reading that it sounded as though I was asking you to change some habit, but I just meant the comments. I understand there will be places that use English where that's normal. It's just I don't recall seeing that in PostgreSQL code before. American English is pretty much the standard for the project, despite that not always being strictly applied (e.g we have a command called ANALYSE which is an alias for ANALYZE). I always try and do my best to spell words in American English (which is not where I'm from), which for me stretches about as far as putting 'z' in the place of some of my 's'es. > I rewrote the above comment block as: > > * Try to compare the constant arguments of 'leftarg' and 'rightarg', in that > * order, using the operator of 'op' and set *result to the result of this > * comparison. > > Is that any better? Sounds good. > >> 15. "the latter" is normally used when you're referring to the last >> thing in a list which was just mentioned. In this case, leftarg_const >> and rightarg_const is the list, so "the latter" should mean >> rightarg_const, but I think you mean to compare them using the >> operator. >> >> * If the leftarg_const and rightarg_const are both of the type expected >> * by op's operator, then compare them using the latter. > > Rewrote it as: > > * We can compare leftarg_const and rightarg_const using op's operator > * only if both are of the type expected by it. I'd probably write "expected type." instead of "type expected by it." >> 17. 
The following example will cause get_partitions_for_keys_hash to misbehave: >> >> create table hashp (a int, b int) partition by hash (a, b); >> create table hashp1 partition of hashp for values with (modulus 4, remainder 0); >> create table hashp2 partition of hashp for values with (modulus 4, remainder 1); >> create table hashp3 partition of hashp for values with (modulus 4, remainder 3); >> create table hashp4 partition of hashp for values with (modulus 4, remainder 2); >> explain select * from hashp where a = 1 and a is null; >> >> The following code assumes that you'll never get a NULL test for a key >> that has an equality test, and ends up trying to prune partitions >> thinking we got compatible clauses for both partition keys. > > Yeah, I think this example helps spot a problem. I thought we'd never get > to get_partitions_for_keys_hash() for the above query, because we > should've been able to prove much earlier that the particular clause > combination should be always false (a cannot be both non-null 1 and null). > Now, because the planner itself doesn't substitute a constant-false for > that, I taught classify_partition_bounding_keys() to do so. It would now > return constfalse=true if it turns out that clauses in the input list lead > to contradictory nullness condition for a given partition column. > >> memset(keyisnull, false, sizeof(keyisnull)); >> for (i = 0; i < partkey->partnatts; i++) >> { >> if (bms_is_member(i, keys->keyisnull)) >> { >> keys->n_eqkeys++; >> keyisnull[i] = true; >> } >> } >> >> /* >> * Can only do pruning if we know all the keys and they're all equality >> * keys including the nulls that we just counted above. >> */ >> if (keys->n_eqkeys == partkey->partnatts) >> >> The above code will need to be made smarter. It'll likely crash if you >> change "b" to a pass-by-ref type. > > Hmm, not sure why. It seems to work: Yeah, works now because you've added new code to test for contradictions in the quals, e.g a = 1 and a is null is now rejected as constfalse. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached v20. Thanks again. Thanks for working on v20. I've had a look over part of it and I've written down the following: 1. The following comment is not correct: /* * Equality look up key. Values in the following array appear in no * particular order (unlike minkeys and maxkeys below which must appear in * the same order as the partition key columns). These must be in partition key order, just like the others. This part is not true either: * the same order as the partition key columns). n_eqkeys must be equal to * the number of partition keys to be valid (except in the case of hash * partitioning where that's not required). When set, minkeys and maxkeys * are ignored. range2 is pruned just fine from the following: create table rangep (a int, b int) partition by range (a,b); create table rangep1 partition of rangep for values from (1,10) to (1,20); create table rangep2 partition of rangep for values from (2,10) to (2,20); explain select * from rangep where a = 1; QUERY PLAN --------------------------------------------------------------- Append (cost=0.00..38.25 rows=11 width=8) -> Seq Scan on rangep1 (cost=0.00..38.25 rows=11 width=8) Filter: (a = 1) (3 rows) 2. You've added a list_copy() to get_partitions_from_clauses so as not to modify the input list, but this function calls get_partitions_from_clauses_recurse, which calls classify_partition_bounding_keys(), which modifies that list. Would it not just be better to make a list copy in get_partitions_from_clauses() without any conditions? If we get new users of that function, e.g. run-time pruning, then they might be surprised to see new items magically added to their input list without mention of that behaviour in the function comment. 3. The following case causes an Assert failure: drop table listp; CREATE TABLE listp (a int, b int) partition by list (a); create table listp1 partition of listp for values in (1); create table listp2 partition of listp for values in (2); prepare q1 (int) as select * from listp where a = 1 and a in($1,10); explain execute q1 (1); explain execute q1 (1); explain execute q1 (1); explain execute q1 (1); explain execute q1 (1); explain execute q1 (1); -- <--- Assert failure! In match_clauses_to_partkey you always add the ScalarArrayOpExpr to the result regardless of whether it is a complete set of Consts; however, the code in classify_partition_bounding_keys() that deals with ScalarArrayOpExpr can't handle non-consts: /* OK to add to the result. */ result = lappend(result, clause); if (IsA(eval_const_expressions(root, rightop), Const)) *contains_const = true; else *contains_const = false; *contains_const is reset to true again by the a = 1 qual, so get_partitions_from_clauses() gets called from get_append_rel_partitions. Later, in classify_partition_bounding_keys(), when processing the ScalarArrayOpExpr, the following code assumes the array exprs are all Consts: foreach(lc1, elem_exprs) { Const *rightop = castNode(Const, lfirst(lc1)); Setting *contains_const = false; in match_clauses_to_partkey() is not correct either. If I understand the intent here correctly, you want this to be set to true if the clause list contains quals with any consts that are useful for partition pruning during planning. If that's the case then you should set it to true if you find a suitable clause, otherwise leave it set to false as you did at the start of the function.
What you have now will have varying results based on the order of the clauses in the list, which is certainly not correct. 4. The following code can be rearranged to not call pull_varnos if there's no commutator op. constexpr = leftop; constrelids = pull_varnos((Node *) leftop); expr_op = get_commutator(expr_op); /* * If no commutator exists, cannot flip the qual's args, * so give up. */ if (!OidIsValid(expr_op)) continue; (See the sketch after this email.) 5. Header comment for match_clauses_to_partkey() says only clauses in the pattern of "partkey op const" and "const op partkey" are handled. NULL tests are also mentioned but nothing is mentioned about ScalarArrayOpExpr. It might be better to be less verbose about what the function handles, but if you're listing what is handled then you should not make false claims. * For an individual clause to match with a partition key column, the clause * must be an operator clause of the form (partkey op const) or (const op * partkey); the latter only if a suitable commutator exists. Furthermore, 6. Which brings me to: why do we need match_clauses_to_partkey at all? classify_partition_bounding_keys seems to do all the work match_clauses_to_partkey does, plus more. Item #3 above is caused by an inconsistency between these functions. What benefit does match_clauses_to_partkey give? I might understand if you were creating a list of clauses matching each partition key, but you're just dumping everything in one big list which causes classify_partition_bounding_keys() to have to match each clause to a partition key again, and classify_partition_bounding_keys is even coded to ignore clauses that don't match any key, so it makes me wonder what match_clauses_to_partkey is actually for? I'm going to stop reviewing there, as removing match_clauses_to_partkey is going to cause churn that'll need to be reviewed again. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
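The rearrangement suggested in point #4 above, adapted directly from the quoted fragment: hoist the commutator check above the pull_varnos() call so the walk is skipped when the qual's args cannot be flipped:

	constexpr = leftop;
	expr_op = get_commutator(expr_op);

	/*
	 * If no commutator exists, cannot flip the qual's args, so give up
	 * before paying for the pull_varnos() walk.
	 */
	if (!OidIsValid(expr_op))
		continue;

	constrelids = pull_varnos((Node *) leftop);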
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. On Wed, Jan 17, 2018 at 12:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 16 January 2018 at 21:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/01/12 12:30, David Rowley wrote: >>> 8. The code in get_partitions_from_ne_clauses() does perform quite a >>> few nested loops. I think a more simple way to would be to track the >>> offsets you've seen in a Bitmapset. This would save you having to >>> check for duplicates, as an offset can only contain a single datum. >>> You'd just need to build a couple of arrays after that, one to sum up >>> the offsets found per partition, and one for the total datums allowed >>> in the partition. If the numbers match then you can remove the >>> partition. >>> >>> I've written this and attached it to this email. It saves about 50 >>> lines of code and should perform much better for complex cases, for >>> example, a large NOT IN list. This also implements #6. >> >> I liked your patch, so incorporated it, except, I feel slightly >> uncomfortable about the new name that you chose for the function because >> it sounds a bit generic. > > You're right. I only renamed it because I inverted the meaning of the > function in the patch. It no longer did > "get_partitions_from_ne_clauses", it did the opposite and give the > partitions which can't match. Please feel free to think of a new > better name. Is "get_partitions_excluded_by_ne_clauses" too long? > >>> I think you quite often use "the same" to mean "it". Can you change that? >> >> I guess that's just one of my many odd habits when writing English, all of >> which I'm trying to get rid of, but apparently with limited success. Must >> try harder. :) > > Oops, on re-reading that it sounded as though I was asking you to > change some habit, but I just meant the comments. I understand there > will be places that use English where that's normal. It's just I don't > recall seeing that in PostgreSQL code before. No worries, I too slightly misread what you'd said. When I double checked, I too couldn't find "the same" used the way as I did in the patch. So I actually ended up finding and replacing more "the same"s with "it" than you had pointed out in your review in the latest v20 patch. > American English is > pretty much the standard for the project, despite that not always > being strictly applied (e.g we have a command called ANALYSE which is > an alias for ANALYZE). I always try and do my best to spell words in > American English (which is not where I'm from), which for me stretches > about as far as putting 'z' in the place of some of my 's'es. I see. >>> 15. "the latter" is normally used when you're referring to the last >>> thing in a list which was just mentioned. In this case, leftarg_const >>> and rightarg_const is the list, so "the latter" should mean >>> rightarg_const, but I think you mean to compare them using the >>> operator. >>> >>> * If the leftarg_const and rightarg_const are both of the type expected >>> * by op's operator, then compare them using the latter. >> >> Rewrote it as: >> >> * We can compare leftarg_const and rightarg_const using op's operator >> * only if both are of the type expected by it. > > I'd probably write "expected type." instead of "type expected by it." OK, will do. >>> 17. 
The following example will cause get_partitions_for_keys_hash to misbehave: >>> >>> create table hashp (a int, b int) partition by hash (a, b); >>> create table hashp1 partition of hashp for values with (modulus 4, remainder 0); >>> create table hashp2 partition of hashp for values with (modulus 4, remainder 1); >>> create table hashp3 partition of hashp for values with (modulus 4, remainder 3); >>> create table hashp4 partition of hashp for values with (modulus 4, remainder 2); >>> explain select * from hashp where a = 1 and a is null; [ ... ] >>> The above code will need to be made smarter. It'll likely crash if you >>> change "b" to a pass-by-ref type. >> >> Hmm, not sure why. It seems to work: > > Yeah, works now because you've added new code to test for > contradictions in the quals, e.g a = 1 and a is null is now rejected > as constfalse. Oh, I see. I thought you were talking of it as an independent issue. Thanks, Amit
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote: > 6. Which brings me to; why do we need match_clauses_to_partkey at all? > classify_partition_bounding_keys seems to do all the work > match_clauses_to_partkey does, plus more. Item #3 above is caused by > an inconsistency between these functions. What benefit does > match_clauses_to_partkey give? I might understand if you were creating > list of clauses matching each partition key, but you're just dumping > everything in one big list which causes > classify_partition_bounding_keys() to have to match each clause to a > partition key again, and classify_partition_bounding_keys is even > coded to ignore clauses that don't' match any key, so it makes me > wonder what is match_clauses_to_partkey actually for? I started to look at this and ended up shuffling the patch around a bit to completely remove the match_clauses_to_partkey function. I also cleaned up some of the comments and shuffled some fields around in some of the structs to shrink them down a bit. All up, this has saved 268 lines of code in the patch. src/backend/catalog/partition.c | 296 ++++++++++++++++----------- src/backend/optimizer/path/allpaths.c | 368 ++-------------------------------- 2 files changed, 198 insertions(+), 466 deletions(-) It's had very minimal testing. Really I've only tested that the regression tests pass. I also fixed up the bad assumption that IN lists will contain Consts only which hopefully fixes the crash I reported earlier. I saw you'd added a check to look for contradicting IS NOT NULL clauses when processing an IS NULL clause, but didn't do anything for the opposite case. I added code for this so it behaves the same regardless of the clause order. Can you look at my changes and see if I've completely broken anything? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
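A guess at the shape of the fix described above for the ScalarArrayOpExpr crash from point #3 of the preceding review: check that every array element really is a Const before using it, rather than castNode()'ing unconditionally. The loop follows the quoted fragment; the bail-out action is an assumption:

	foreach(lc1, elem_exprs)
	{
		Node	   *elem = (Node *) lfirst(lc1);

		/*
		 * A non-Const element (e.g. a Param that didn't fold to a
		 * constant at plan time) can't be used for plan-time pruning,
		 * so treat the clause as not useful instead of crashing.
		 */
		if (!IsA(elem, Const))
			return false;		/* assumed "no pruning" result */
	}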
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. On Wed, Jan 17, 2018 at 6:19 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 17 January 2018 at 17:05, David Rowley <david.rowley@2ndquadrant.com> wrote: >> 6. Which brings me to; why do we need match_clauses_to_partkey at all? >> classify_partition_bounding_keys seems to do all the work >> match_clauses_to_partkey does, plus more. Item #3 above is caused by >> an inconsistency between these functions. What benefit does >> match_clauses_to_partkey give? I might understand if you were creating >> list of clauses matching each partition key, but you're just dumping >> everything in one big list which causes >> classify_partition_bounding_keys() to have to match each clause to a >> partition key again, and classify_partition_bounding_keys is even >> coded to ignore clauses that don't' match any key, so it makes me >> wonder what is match_clauses_to_partkey actually for? > > I started to look at this and ended up shuffling the patch around a > bit to completely remove the match_clauses_to_partkey function. > > I also cleaned up some of the comments and shuffled some fields around > in some of the structs to shrink them down a bit. > > All up, this has saved 268 lines of code in the patch. > > src/backend/catalog/partition.c | 296 ++++++++++++++++----------- > src/backend/optimizer/path/allpaths.c | 368 ++-------------------------------- > 2 files changed, 198 insertions(+), 466 deletions(-) > > It's had very minimal testing. Really I've only tested that the > regression tests pass. > > I also fixed up the bad assumption that IN lists will contain Consts > only which hopefully fixes the crash I reported earlier. > > I saw you'd added a check to look for contradicting IS NOT NULL > clauses when processing an IS NULL clause, but didn't do anything for > the opposite case. I added code for this so it behaves the same > regardless of the clause order. > > Can you look at my changes and see if I've completely broken anything? Thanks for the patch. I applied the patch and see that it didn't break any tests, although haven't closely reviewed the code yet. I'm concerned that after your patch to remove match_clauses_to_partkey(), we'd be doing more work than necessary in some cases. For example, consider the case of using run-time pruning for nested loop where the inner relation is a partitioned table. With the old approach, get_partitions_from_clauses() would only be handed the clauses that are known to match the partition keys (which most likely is fewer than all of the query's clauses), so get_partitions_from_clauses() doesn't have to do the work of filtering non-partition clauses every time (that is, for every outer row). That's why I had decided to keep that part in the planner. Thanks, Amit
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote: > I'm concerned that after your patch to remove > match_clauses_to_partkey(), we'd be doing more work than necessary in > some cases. For example, consider the case of using run-time pruning > for nested loop where the inner relation is a partitioned table. With > the old approach, get_partitions_from_clauses() would only be handed > the clauses that are known to match the partition keys (which most > likely is fewer than all of the query's clauses), so > get_partitions_from_clauses() doesn't have to do the work of filtering > non-partition clauses every time (that is, for every outer row). > That's why I had decided to keep that part in the planner. That might be better served by splitting classify_partition_bounding_keys() into separate functions; the first function would be in charge of building keyclauses_all. That way the remaining work during the executor would never need to match clauses to a partition key as they'd be in lists dedicated to each key. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote: >> I'm concerned that after your patch to remove >> match_clauses_to_partkey(), we'd be doing more work than necessary in >> some cases. For example, consider the case of using run-time pruning >> for nested loop where the inner relation is a partitioned table. With >> the old approach, get_partitions_from_clauses() would only be handed >> the clauses that are known to match the partition keys (which most >> likely is fewer than all of the query's clauses), so >> get_partitions_from_clauses() doesn't have to do the work of filtering >> non-partition clauses every time (that is, for every outer row). >> That's why I had decided to keep that part in the planner. > > That might be better served by splitting > classify_partition_bounding_keys() into separate functions; the first > function would be in charge of building keyclauses_all. That way the > remaining work during the executor would never need to match clauses > to a partition key as they'd be in lists dedicated to each key. I've attached another delta against your v20 patch which does this. It's very rough for now and I've only checked that it passes the regression tests so far. It will need some cleanup work, but I'd be keen to know what you think of the general idea. I've not fully worked out how run-time pruning will use this as it'll need another version of get_partitions_from_clauses but passes in a PartScanClauseInfo instead, and does not call extract_partition_key_clauses. That area probably needs some shuffling around so that does not end up a big copy and paste of all that new logic. (A sketch of the resulting two-function interface appears just after this email.) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
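Sketched out, the split being discussed might end up as two externally visible functions along these lines. The signatures are guesses assembled from names used in this thread (PartScanClauseInfo, extract_partition_key_clauses, get_partitions_from_clauseinfo), not code taken from the attached delta:

	/*
	 * Plan-time only: match the query's clauses to the partition key once,
	 * storing them in per-key lists inside the PartScanClauseInfo.
	 */
	extern void extract_partition_key_clauses(Relation relation,
											  List *clauses,
											  PartScanClauseInfo *info);

	/*
	 * Callable from both planner and executor: prune using the already
	 * matched clauses.  Only remove_redundant_clauses() and the bound
	 * lookups run here, which is the part that would repeat per outer
	 * row for run-time pruning in a nested loop.
	 */
	extern Bitmapset *get_partitions_from_clauseinfo(Relation relation,
													 PartScanClauseInfo *info);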
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. On 2018/01/18 12:14, David Rowley wrote: > On 18 January 2018 at 00:13, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 17 January 2018 at 23:48, Amit Langote <amitlangote09@gmail.com> wrote: >>> I'm concerned that after your patch to remove >>> match_clauses_to_partkey(), we'd be doing more work than necessary in >>> some cases. For example, consider the case of using run-time pruning >>> for nested loop where the inner relation is a partitioned table. With >>> the old approach, get_partitions_from_clauses() would only be handed >>> the clauses that are known to match the partition keys (which most >>> likely is fewer than all of the query's clauses), so >>> get_partitions_from_clauses() doesn't have to do the work of filtering >>> non-partition clauses every time (that is, for every outer row). >>> That's why I had decided to keep that part in the planner. >> >> That might be better served by splitting >> classify_partition_bounding_keys() into separate functions, the first >> function would be in charge of building keyclauses_all. That way the >> remaining work during the executor would never need to match clauses >> to a partition key as they'd be in lists dedicated to each key. > > I've attached another delta against your v20 patch which does this. > It's very rough for now and I've only checked that it passes the > regression test so far. Thanks! > It will need some cleanup work, but I'd be keen to know what you think > of the general idea. This one looks in a much better shape. > I've not fully worked out how run-time pruning > will use this as it'll need another version of > get_partitions_from_clauses but passes in a PartScanClauseInfo > instead, and does not call extract_partition_key_clauses. That area > probably needs some shuffling around so that does not end up a big > copy and paste of all that new logic. So, I've been assuming that the planner changes in the run-time pruning patch have to do with selecting clauses (restriction clauses not containing Consts and/or join clauses) to be passed to the executor by recording them in the Append node. Will they be selected by the planner calling into partition.c? Meanwhile, here is a revised version (v21) that incorporates your changes. I added you as the author in 0002 and 0005 patches. I guess a v22 will have to follow very soon... Thanks, Amit
Attachment
- v21-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v21-0002-Introduce-a-get_partitions_from_clauses.patch
- v21-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v21-0004-More-refactoring-around-partitioned-table-Append.patch
- v21-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I've not fully worked out how run-time pruning >> will use this as it'll need another version of >> get_partitions_from_clauses but passes in a PartScanClauseInfo >> instead, and does not call extract_partition_key_clauses. That area >> probably needs some shuffling around so that does not end up a big >> copy and paste of all that new logic. >> > So, I've been assuming that the planner changes in the run-time pruning > patch have to do with selecting clauses (restriction clauses not > containing Consts and/or join clauses) to be passed to the executor by > recording them in the Append node. Will they be selected by the planner > calling into partition.c? I had thought so. I only have a rough idea in my head and that's that the PartitionPruneInfo struct that I wrote for the run-time pruning patch would have the clause List replaced with this new PartScanClauseInfo struct (which likely means it needs to go into primnodes.h); this struct would contain all the partition pruning clauses in a more structured form so that no matching of quals to the partition key would be required during execution. The idea is that we'd just need to call remove_redundant_clauses, extract_bounding_datums and get_partitions_for_keys. I think this is the bare minimum of work that can be done in execution since we can't remove the redundant clauses until we know the values of the Params. Likely this means there will need to be 2 functions externally accessible for this in partition.c. I'm not sure exactly how best to do this. Maybe it's fine just to have allpaths.c call extract_partition_key_clauses to generate the PartScanClauseInfo then call some version of get_partitions_from_clauses which doesn't do the extract_partition_key_clauses duties. This is made more complex by the code that handles adding the default partition bound to the quals. I'm not yet sure where that should live. I've also been thinking of having some sort of PartitionPruneContext struct that we can pass around the functions. Runtime pruning had to add structs which store the Param values to various functions, which I didn't like. It would be good to just add those to the context and have them passed down without having to bloat the parameters in the functions. (A rough sketch of such a context follows this email.) I might try and do that tomorrow too. This should make the footprint of the runtime pruning patch a bit smaller. > Meanwhile, here is a revised version (v21) that incorporates your changes. > I added you as the author in 0002 and 0005 patches. I guess a v22 will > have to follow very soon... Thanks for merging that in. I'll have a try at making this work tomorrow, although I'm not sure yet how much time I'll have to dedicate to it as I have a few other things to do too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
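A rough sketch of the PartitionPruneContext idea mentioned above. Every field here is an assumption made purely for illustration (none of this is from a posted patch); the point is only that the pruning functions would take one context argument instead of growing new parameters as run-time pruning adds Param values:

	/* Hypothetical context bundling what the pruning functions need. */
	typedef struct PartitionPruneContext
	{
		PartitionKey	partkey;	/* partition key info for the table */
		PartitionDesc	partdesc;	/* its bounds and partition OIDs */

		/* Filled in only for run-time pruning in the executor: */
		ParamListInfo	extparams;	/* values of external Params */
		ExprContext    *econtext;	/* for evaluating exec Params */
	} PartitionPruneContext;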
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Andres Freund
Date:
Hi Amit, It seems your mail system continually adds "[Sender Address Forgery]" prefixes to messages. E.g. this mail now has Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning as its subject, whereas the mail you're replying to only had Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning two of them. I think the two previous occurrences of this are also from you. This is somewhat annoying; could you try to figure out a) what the problem is b) how to prevent the subject being edited like that? Regards, Andres
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
Kyotaro HORIGUCHI
Date:
Hello, At Thu, 18 Jan 2018 11:41:00 -0800, Andres Freund <andres@anarazel.de> wrote in <20180118194100.dy3kxdtktsbvm4eq@alap3.anarazel.de> > Hi Amit, > > It seems your mail system continually adds "[Sender Address Forgery]" > prefixes to messages. E.g. this mail now has > Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning > as its subject, whereas the mail you're replying to only had > Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning > two of them. > > I think the two previous occurances of this also are from you. > > This is somewhat annoying, could you try to figure out a) what the > problem is b) how to prevent the subject being edited like that? Our mail server is failing to fetch the SPF record for David's mails that are received directly (not via the -hackers ML), and the server adds the prefix to the subject header. It is failing to fetch the SPF record for 2ndquadrant.com. The reason might be that the envelope-from of his mails is not consistent with his server's IP address. Anyway, mails via the -hackers ML don't suffer from this, so what Amit (and I) can do on our side is one of the following. - Being careful to reply to the mails coming via the ML. - Removing the added prefix by hand. And I'd like to ask David to check out his mail environment so that the SPF record is available for his messages. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Thank you Horiguchi-san! On 2018/01/19 12:00, Kyotaro HORIGUCHI wrote: > At Thu, 18 Jan 2018 11:41:00 -0800, Andres Freund <andres@anarazel.de> wrote: >> Hi Amit, >> >> It seems your mail system continually adds "[Sender Address Forgery]" >> prefixes to messages. E.g. this mail now has >> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning >> as its subject, whereas the mail you're replying to only had >> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning >> two of them. >> >> I think the two previous occurances of this also are from you. >> >> This is somewhat annoying, could you try to figure out a) what the >> problem is b) how to prevent the subject being edited like that? Sorry about that. I had noticed it and had manually edited the subject line once or twice before, but failed to do it every time. > Our mail server is failing to fetch SPF record for David's mails > that received directly (not via -hakders ML) and the server adds > the subject header. It is failing to fetch SPF record for > 2ndquadrant.com. The reason might be that the envelope-from of > his mails is not consistent with his server's IP address. I was able to notice that too. It seems that the Received-SPF: PermError and X-SPF-Status = fail/warn headers started appearing in the emails only a few months ago, so it appears our mail server was changed to notice these discrepancies only recently. The first email in this thread that got such a subject line was the following, which is my reply to David's email: https://www.postgresql.org/message-id/b8094e71-2c73-ed8e-d8c3-53f232c8c049%40lab.ntt.co.jp There are emails from David before that one and I couldn't see such headers in any of those emails, so no alterations of the subject line there. > Anyway, mails via -hackers ML doesn't suffer so, what Amit (and > I) side can do by myself is one of the following. > > - Being careful to reply to the mails comming via the ML. > - Remove the added header by hand.. Yeah, will make sure to do that. > And I'd like to ask David to check out his mail environment so > that SPF record is available for his message. That'd be nice too. Thanks, Amit
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 19 January 2018 at 16:00, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > And I'd like to ask David to check out his mail environment so > that SPF record is available for his message. Will investigate -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
Hi Amit , On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> So, I've been assuming that the planner changes in the run-time pruning >> patch have to do with selecting clauses (restriction clauses not >> containing Consts and/or join clauses) to be passed to the executor by >> recording them in the Append node. Will they be selected by the planner >> calling into partition.c? > > I had thought so. I only have a rough idea in my head and that's that > the PartitionPruneInfo struct that I wrote for the run-time pruning > patch would have the clause List replaced with this new > PartScanClauseInfo struct (which likely means it needs to go into > primnodes.h), this struct would contain all the partition pruning > clauses in a more structured form so that no matching of quals to the > partition key wouldn't be required during execution. The idea is that > we'd just need to call; remove_redundant_clauses, > extract_bounding_datums and get_partitions_for_keys. I think this is > the bare minimum of work that can be done in execution since we can't > remove the redundant clauses until we know the values of the Params. > > Likely this means there will need to be 2 functions externally > accessible for this in partition.c. I'm not sure exactly how best to > do this. Maybe it's fine just to have allpaths.c call > extract_partition_key_clauses to generate the PartScanClauseInfo then > call some version of get_partitions_from_clauses which does do the > extract_partition_key_clauses duties. This is made more complex by the > code that handles adding the default partition bound to the quals. I'm > not yet sure where that should live. > > I've also been thinking of having some sort of PartitionPruneContext > struct that we can pass around the functions. Runtime pruning had to > add structs which store the Param values to various functions which I > didn't like. It would be good to just add those to the context and > have them passed down without having to bloat the parameters in the > functions. I might try and do that tomorrow too. This should make the > footprint of the runtime pruning patch is a bit smaller. Attached is what I had in mind about how to do this. Only the planner will need to call populate_partition_clause_info. The planner and executor can call get_partitions_from_clauseinfo. I'll just need to change the run-time prune patch to pass the PartitionClauseInfo into the executor in the Append node. I've also added the context struct that I mentioned above. It's currently not carrying much weight, but the idea is that this context will be passed around a bit more with the run-time pruning patch and it will also carry the details about parameter values. I'll need to modify a few signatures of functions like partkey_datum_from_expr to pass the context there too. I didn't do that here because currently, those functions have no use for the context with the fields that they currently have. I've also fixed a bug where when you built the commutator OpExpr in what's now called extract_partition_key_clauses the inputcollid was not being properly set. The changes I made there go much further than just that, I've completely removed the OpExpr from the PartClause struct as only two members were ever used. I thought it was better just to add those to PartClause instead. I also did some renaming of variables that always assumed that the Expr being compared to the partition key was a constant. 
This is true now, but with the run-time pruning patch it won't be, so I thought it was better to do the renaming here rather than having to do it in the run-time pruning patch. One thing I don't yet understand about the patch is the use of predicate_refuted_by() in get_partitions_from_or_clause_args(). I did adjust the comment above that code, but I'm still not certain I fully understand why that function has to be used instead of building the clauses for the OR being processed along with the remaining clauses. Is it that this was too hard to extract, so you ended up using predicate_refuted_by()? I've also removed the makeBoolExpr call that you were encapsulating the or_clauses in. I didn't really see the need for it, since you just removed it again when looping over the or_clauses. The only other changes are just streamlining code and comment changes. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
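To make the two-function split and the context struct David describes above concrete, here is a minimal sketch of the interface shape being discussed. This is not the patch's actual code: apart from the two function names and the keyclauses field mentioned later in the thread, the fields here are illustrative placeholders.

    #include "postgres.h"
    #include "nodes/bitmapset.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"

    /* Context passed down through all the pruning functions. */
    typedef struct PartitionPruneContext
    {
        int         rt_index;       /* RT index of the partitioned table */
        Relation    relation;       /* its open relcache entry */
        /* run-time pruning would later add resolved Param values here */
    } PartitionPruneContext;

    /* Structured form of the clauses matched to the partition key. */
    typedef struct PartitionClauseInfo
    {
        List       *keyclauses;     /* per-key lists of matched clauses */
        List       *or_clauses;     /* OR clauses, handled recursively */
        bool        constfalse;     /* clauses proven self-contradictory */
    } PartitionClauseInfo;

    /* Planner-only: decompose and match the quals once, at plan time. */
    extern PartitionClauseInfo *populate_partition_clause_info(
                                PartitionPruneContext *context,
                                List *clauses);

    /* Planner and executor: compute the surviving partition indexes. */
    extern Bitmapset *get_partitions_from_clauseinfo(
                                PartitionPruneContext *context,
                                PartitionClauseInfo *clauseinfo);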
Re: [HACKERS] path toward faster partition pruning
From
Amit Langote
Date:
Hi David. On 2018/01/23 15:44, David Rowley wrote: > On 19 January 2018 at 04:03, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 18 January 2018 at 23:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> So, I've been assuming that the planner changes in the run-time pruning >>> patch have to do with selecting clauses (restriction clauses not >>> containing Consts and/or join clauses) to be passed to the executor by >>> recording them in the Append node. Will they be selected by the planner >>> calling into partition.c? >> >> I had thought so. I only have a rough idea in my head and that's that >> the PartitionPruneInfo struct that I wrote for the run-time pruning >> patch would have the clause List replaced with this new >> PartScanClauseInfo struct (which likely means it needs to go into >> primnodes.h); this struct would contain all the partition pruning >> clauses in a more structured form, so that no matching of quals to the >> partition key would be required during execution. The idea is that >> we'd just need to call remove_redundant_clauses, >> extract_bounding_datums and get_partitions_for_keys. I think this is >> the bare minimum of work that can be done in execution, since we can't >> remove the redundant clauses until we know the values of the Params. >> >> Likely this means there will need to be 2 functions externally >> accessible for this in partition.c. I'm not sure exactly how best to >> do this. Maybe it's fine just to have allpaths.c call >> extract_partition_key_clauses to generate the PartScanClauseInfo, then >> call some version of get_partitions_from_clauses which doesn't repeat >> the extract_partition_key_clauses duties. This is made more complex by >> the code that handles adding the default partition bound to the quals. >> I'm not yet sure where that should live. >> >> I've also been thinking of having some sort of PartitionPruneContext >> struct that we can pass around the functions. Runtime pruning had to >> add structs which store the Param values to various functions, which I >> didn't like. It would be good to just add those to the context and >> have them passed down without having to bloat the parameters in the >> functions. I might try and do that tomorrow too. This should make the >> footprint of the runtime pruning patch a bit smaller. > > Attached is what I had in mind about how to do this. Thanks for the delta patch. I will start looking at it tomorrow. Regards, Amit
Re: [HACKERS] path toward faster partition pruning
From
David Rowley
Date:
On 23 January 2018 at 23:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/01/23 15:44, David Rowley wrote: >> Attached is what I had in mind about how to do this. > > Thanks for the delta patch. I will start looking at it tomorrow. Thanks. I've been looking more at this and I've made a few more adjustments in the attached. This delta patch should be applied against the faster_partition_prune_v21_delta_drowley_v1.patch that I sent yesterday. It changes a few comments, now correctly passes the context to get_partitions_excluded_by_ne_clauses, and fixes a small error where the patch was failing to record a notnull for the partition key when it saw a strict <> clause. It was only doing this for the opposite case, but both seem to be perfectly applicable. I also made a small adjustment to the regression tests to ensure this is covered. I'm now going to start work on basing the partition pruning patch on top of this. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
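The fix David mentions rests on operator strictness: a strict <> operator cannot return true when its input is null, so either operand order lets pruning conclude the key is not null. A rough sketch of that inference, assuming a hypothetical per-key tracking array keyisnotnull[] maintained by the clause-matching loop (negator, opclause, partopfamily and i are taken from the surrounding code being discussed):

    /* Sketch only, not the patch's code. */
    negator = get_negator(opclause->opno);
    if (OidIsValid(negator) && op_in_opfamily(negator, partopfamily))
    {
        /* ... record the <> clause for list-partition pruning ... */

        /*
         * A strict operator must have seen a non-null key value to
         * return true, so the key cannot be null in any matching row,
         * regardless of which side of the operator the key appeared on.
         */
        if (op_strict(opclause->opno))
            keyisnotnull[i] = true;     /* hypothetical tracking array */
    }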
Hi David. I looked at both of your faster_partition_prune_v21_delta_drowley_v1.patch and faster_partition_prune_v21_delta_drowley_v1_delta.patch and have incorporated them into the main patch series to get us the attached v22 set. I like the result very much. Thank you! On 2018/01/23 15:44, David Rowley wrote: > Attached is what I had in mind about how to do this. Only the planner > will need to call populate_partition_clause_info. The planner and > executor can call get_partitions_from_clauseinfo. I'll just need to > change the run-time prune patch to pass the PartitionClauseInfo into > the executor in the Append node. I liked the division into those two functions, although not quite the embedding of "info" into the function names. I think it's good to just call them populate_partition_clauses and get_partitions_from_clauses. It seems OK though for their arguments to contain "info" in their names. > I've also added the context struct that I mentioned above. It's > currently not carrying much weight, but the idea is that this context > will be passed around a bit more with the run-time pruning patch and > it will also carry the details about parameter values. I'll need to > modify a few signatures of functions like partkey_datum_from_expr to > pass the context there too. I didn't do that here because currently, > those functions have no use for the context with the fields that they > currently have. The context struct seems good to me too. > I've also fixed a bug where, when you built the commutator OpExpr in > what's now called extract_partition_key_clauses, the inputcollid was > not being properly set. The changes I made there go much further than > just that; I've completely removed the OpExpr from the PartClause > struct, as only two of its members were ever used. I thought it was > better just to add those members to PartClause instead. I wondered if we should rename that to something like PartClauseProperties, but maybe that's too long. > I also did some renaming of variables that always assumed that the > Expr being compared to the partition key was a constant. This is true > now, but with the run-time pruning patch it won't be, so I thought it > was better to do this here rather than having to rename them in the > run-time pruning patch. OK, seems fine. > One thing I don't yet understand about the patch is the use of > predicate_refuted_by() in get_partitions_from_or_clause_args(). I did > adjust the comment above that code, but I'm still not certain I fully > understand why that function has to be used instead of building the > clauses for the OR being processed along with the remaining clauses. > Is it that this was too hard to extract, so you ended up using > predicate_refuted_by()? I have tried to explain that better in the updated comment in the new patch, along with some code rearrangement to better make sense of what's going on. Let me just copy-paste the new comment I wrote. I have tried to rethink the solution a number of times but never came up with a sensible alternative.

/*
 * When matching an OR expression, we only check whether at least one of
 * its args matches the partition key, not that all of them do.  For
 * arguments that don't match, we cannot eliminate any of the table's
 * partitions using get_partitions_from_clauses().  However, if the table
 * is itself a partition, we may be able to prove using constraint
 * exclusion that the clause refutes its partition constraint, that is,
 * we can eliminate all of its partitions.
 */
foreach(lc, or_clause_args)
{

> I've also removed the makeBoolExpr call that you were encapsulating > the or_clauses in. I didn't really see the need for this since you > just removed it again when looping over the or_clauses. Ah, OK. At first I was concerned because it sounded like you were adding the arguments of different OR expressions into a single list called or_clauses, but I calmed down after checking that that's not the case. :) > The only other changes are just streamlining code and comment changes. I made a few of those myself in the updated patches. Thanks a lot again for your work on this. Regards, Amit
Attachment
- v22-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v22-0002-Introduce-a-get_partitions_from_clauses.patch
- v22-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v22-0004-More-refactoring-around-partitioned-table-Append.patch
- v22-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
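The control flow described in the comment Amit quotes above can be pictured roughly as follows. This is a simplified sketch, not the patch's code: the two helpers marked hypothetical stand in for the patch's internal clause-matching and per-argument pruning entry points, and partconstr is the table's own partition constraint (non-empty only when the table is itself a partition).

    foreach(lc, or_clause_args)
    {
        Expr       *arg = (Expr *) lfirst(lc);
        Bitmapset  *arg_partset;

        if (arg_matches_partition_key(context, arg))    /* hypothetical */
        {
            /* This arg can prune partitions like any other clause. */
            arg_partset = get_partitions_for_arg(context, arg); /* hypothetical */
        }
        else if (partconstr != NIL &&
                 predicate_refuted_by(partconstr, list_make1(arg), false))
        {
            /* The arg contradicts our own partition constraint. */
            arg_partset = NULL;
        }
        else
        {
            /* Can't prove anything; assume every partition may match. */
            arg_partset = bms_add_range(NULL, 0, partdesc->nparts - 1);
        }

        /* The args of an OR combine with set union. */
        result = bms_union(result, arg_partset);
    }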
Hi Amit, On 01/26/2018 04:17 AM, Amit Langote wrote: > I made a few of those myself in the updated patches. > > Thanks a lot again for your work on this. > This needs a rebase. Best regards, Jesper
Hi Jesper. On 2018/01/29 22:13, Jesper Pedersen wrote: > Hi Amit, > > On 01/26/2018 04:17 AM, Amit Langote wrote: >> I made a few of those myself in the updated patches. >> >> Thanks a lot again for your work on this. >> > > This needs a rebase. AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make check passes. Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c12693d8f3
Hello, let me make some comments. At Tue, 30 Jan 2018 09:57:44 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <4a7dda08-b883-6e5e-b0bf-f5ce95584e9e@lab.ntt.co.jp> > Hi Jesper. > > On 2018/01/29 22:13, Jesper Pedersen wrote: > > Hi Amit, > > > > On 01/26/2018 04:17 AM, Amit Langote wrote: > >> I made a few of those myself in the updated patches. > >> > >> Thanks a lot again for your work on this. > >> > > > > This needs a rebase. > > AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make > check passes. Yes, it cleanly applies to HEAD and seems to be working. 0001 seems fine. I have some random comments on 0002. - extract_partition_key_clauses implicitly assumes that the commutator of an inequality operator is the same as the original operator (commutation for such operators is an identity function). I believe that is always true for a sane operator definition, but perhaps wouldn't it be better to use the commutator instead of opclause->opno as the source of the negator, if any? (Attached diff 1) - get_partitions_from_or_clause_args abandons arg_partset with all bits set when the partition constraint doesn't refute the whole partition. get_partitions_from_clauses eventually does the same thing, but that's a waste of cycles and looks weird. It seems to have been intended to return immediately there.

>  		/* Couldn't eliminate any of the partitions. */
>  		partdesc = RelationGetPartitionDesc(context->relation);
> -		arg_partset = bms_add_range(NULL, 0, partdesc->nparts - 1);
> +		return bms_add_range(NULL, 0, partdesc->nparts - 1);
>  	}
>
>  	subcontext.rt_index = context->rt_index;
>  	subcontext.relation = context->relation;
>  	subcontext.clauseinfo = &partclauseinfo;
!>  	arg_partset = get_partitions_from_clauses(&subcontext);

- get_partitions_from_or_clause_args converts IN (null) into a nulltest, and the nulltest doesn't exclude a child whose partition key column can be null.

drop table if exists p;
create table p (a int, b int) partition by list (a);
create table c1 partition of p for values in (1, 5, 7);
create table c2 partition of p for values in (4, 6, null);
insert into p values (1, 0), (null, 0);

explain select tableoid::regclass, * from p where a in (1, null);

>                             QUERY PLAN
> -----------------------------------------------------------------
>  Result  (cost=0.00..76.72 rows=22 width=12)
>    ->  Append  (cost=0.00..76.50 rows=22 width=12)
>          ->  Seq Scan on c1  (cost=0.00..38.25 rows=11 width=12)
>                Filter: (a = ANY ('{1,NULL}'::integer[]))
>          ->  Seq Scan on c2  (cost=0.00..38.25 rows=11 width=12)
>                Filter: (a = ANY ('{1,NULL}'::integer[]))

Although the clause "a in (1, null)" doesn't match the (null, 0) row, so no harm is done in the end, I don't think this is the right behavior: a null in an SAOP rightop should just be ignored during partition pruning. Or is there some purpose to this behavior? - In extract_bounding_datums, clauseinfo is set twice to the same value.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index ab17524..a2488ab 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2111,7 +2111,6 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
 		PartClause *pc;
 		Oid			partopfamily = partkey->partopfamily[i];
 		Oid			partcoll = partkey->partcollation[i];
-		Oid			commutator = InvalidOid;
 		AttrNumber	partattno = partkey->partattrs[i];
 		Expr	   *partexpr = NULL;

@@ -2144,6 +2143,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
 		if (IsA(clause, OpExpr))
 		{
 			OpExpr	   *opclause = (OpExpr *) clause;
+			Oid			comparator = opclause->opno;
 			Expr	   *leftop,
 					   *rightop,
 					   *valueexpr;
@@ -2161,13 +2161,14 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
 				valueexpr = rightop;
 			else if (EXPR_MATCHES_PARTKEY(rightop, partattno, partexpr))
 			{
-				valueexpr = leftop;
-
-				commutator = get_commutator(opclause->opno);
+				Oid			commutator = get_commutator(opclause->opno);

 				/* nothing we can do unless we can swap the operands */
 				if (!OidIsValid(commutator))
 					continue;
+
+				valueexpr = leftop;
+				comparator = commutator;
 			}
 			else
 				/* Clause does not match this partition key. */
@@ -2212,7 +2213,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
 				 * equality operator *and* this is a list partitioned
 				 * table, we can use it prune partitions.
 				 */
-				negator = get_negator(opclause->opno);
+				negator = get_negator(comparator);
 				if (OidIsValid(negator) &&
 					op_in_opfamily(negator, partopfamily))
 				{
@@ -2236,7 +2237,7 @@ extract_partition_key_clauses(PartitionKey partkey, List *clauses,
 			}

 			pc = (PartClause *) palloc0(sizeof(PartClause));
-			pc->opno = OidIsValid(commutator) ? commutator : opclause->opno;
+			pc->opno = comparator;
 			pc->inputcollid = opclause->inputcollid;
 			pc->value = valueexpr;
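The behavior Horiguchi-san is asking for — ignoring null elements of the SAOP's right-hand array instead of turning them into a nulltest — could look roughly like the following when that array is a Const. This is a sketch under that assumption; make_partkey_eq_expr() is a hypothetical stand-in for however the patch builds each per-element equality clause.

    ArrayType  *arr = DatumGetArrayTypeP(((Const *) rightop)->constvalue);
    int16       elemlen;
    bool        elembyval;
    char        elemalign;
    Datum      *elems;
    bool       *elemnulls;
    int         nelems;
    int         i;

    get_typlenbyvalalign(ARR_ELEMTYPE(arr), &elemlen, &elembyval, &elemalign);
    deconstruct_array(arr, ARR_ELEMTYPE(arr), elemlen, elembyval, elemalign,
                      &elems, &elemnulls, &nelems);

    for (i = 0; i < nelems; i++)
    {
        /*
         * "key = NULL" can never evaluate to true, so a null element
         * selects no rows; just skip it rather than emitting an IS NULL
         * test that would fail to exclude null-accepting partitions.
         */
        if (elemnulls[i])
            continue;

        elem_clauses = lappend(elem_clauses,
                               make_partkey_eq_expr(elems[i])); /* hypothetical */
    }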
Hi Amit, On 01/29/2018 07:57 PM, Amit Langote wrote: >> This needs a rebase. > > AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make > check passes. > It was a rebase error; I should have checked against a clean master. Sorry for the noise. Best regards, Jesper
Horiguchi-san, Thanks for the review. On 2018/01/30 20:43, Kyotaro HORIGUCHI wrote: > At Tue, 30 Jan 2018 09:57:44 +0900, Amit Langote wrote: >> AFAICS, v22 cleanly applies to HEAD (c12693d8f3 [1]), compiles, and make > I have some random comments on 0002. > > - extract_partition_key_clauses implicitly assumes that the > commutator of an inequality operator is the same as the original > operator (commutation for such operators is an identity > function). Yeah, it seems so. > I believe that is always true for a sane operator definition, but > perhaps wouldn't it be better to use the commutator instead of > opclause->opno as the source of the negator, if any? (Attached diff 1) ISTM, the same thing happens with or without the patch.

- pc->opno = OidIsValid(commutator) ? commutator : opclause->opno;
+ pc->opno = comparator;

comparator as added by the patch is effectively equal to the RHS expression in the deleted line. > - get_partitions_from_or_clause_args abandons arg_partset with all > bits set when the partition constraint doesn't refute the whole > partition. get_partitions_from_clauses eventually does the same > thing, but that's a waste of cycles and looks weird. It seems to > have been intended to return immediately there. > > [quoted code above] You're right, fixed. > - get_partitions_from_or_clause_args converts IN (null) into a > nulltest, and the nulltest doesn't exclude a child whose > partition key column can be null. > > [example quoted above] > > Although the clause "a in (1, null)" doesn't match the (null, 0) > row, so no harm is done in the end, I don't think this is the > right behavior: a null in an SAOP rightop should just be ignored > during partition pruning. Or is there some purpose to this > behavior? Yeah, it seems that we're better off ignoring null values appearing in the IN-list. Framing an IS NULL or IS NOT NULL expression corresponding to a null value in the SAOP rightop array doesn't seem to be semantically correct, as you also pointed out. In ExecEvalScalarArrayOpExpr(), I see that a null in the rightop array (IN-list) doesn't lead to selecting rows containing null in the corresponding column. > - In extract_bounding_datums, clauseinfo is set twice to the same > value. Oops, my bad when merging in David's patch. Updated patch set attached. Thanks again. Regards, Amit
Attachment
- v23-0001-Some-interface-changes-for-partition_bound_-cmp-.patch
- v23-0002-Introduce-a-get_partitions_from_clauses.patch
- v23-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v23-0004-More-refactoring-around-partitioned-table-Append.patch
- v23-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
On 31 January 2018 at 21:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Updated patch set attached. Thanks again. (My apologies for being slow to respond here. I've been on leave this week and I'm off again next week. I now have a little time to reply) Hi Amit, Thanks for incorporating my changes into the patchset. A while ago I was rebasing the run-time pruning patch on top of this but ran into a few problems, all of which are results of my own changes. 1. remove_redundant_clauses collapses the PartClause list into the most restrictive set of clauses. This disallows multiple evaluations of the PartitionClauseInfo during runtime pruning. I've written a proposed fix for this and attached it. 2. PartitionClauseInfo->keyclauses is a list of PartClause, which is not a node type. This will cause _copyPartitionClauseInfo() to fail. I'm still not quite sure of the best way to fix #2, since PartClause contains a FmgrInfo. I do have a local fix which moves PartClause to primnodes.h and makes it a proper node type. I also added a copy function which does not copy any of the cache fields in PartClause; it just sets valid_cache to false. I didn't particularly think this was the correct fix, I just couldn't think of how exactly this should be done at the time. The attached patch also adds the missing node tag to PartitionClauseInfo and fixes up other code so that we don't memset the node memory to zero, as that would destroy the tag. I ended up just having extract_partition_key_clauses do the makeNode call. This also resulted in populate_partition_clauses being renamed to generate_partition_clauses. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
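For reference, the copy behavior David describes — a proper node whose copy function deliberately skips the FmgrInfo cache — could look roughly like this in copyfuncs.c terms. The field names value and valid_cache follow the discussion; the exact struct layout is the patch's, so treat this as a sketch:

    static PartClause *
    _copyPartClause(const PartClause *from)
    {
        PartClause *newnode = makeNode(PartClause);

        COPY_SCALAR_FIELD(opno);
        COPY_SCALAR_FIELD(inputcollid);
        COPY_NODE_FIELD(value);

        /*
         * Never copy the cached comparison-function state; an FmgrInfo
         * isn't safely copyable.  Mark the cache invalid so the next
         * evaluation rebuilds it.
         */
        newnode->valid_cache = false;

        return newnode;
    }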
+/*
+ * PartitionBoundCmpArg - Caller-defined argument to be passed to
+ * partition_bound_cmp()
+ *
+ * The first (fixed) argument involved in a comparison is the partition bound
+ * found in the catalog, while an instance of the following struct describes
+ * either a new partition bound being compared against existing bounds
+ * (caller should set is_bound to true and set bound), or a new tuple's
+ * partition key specified in datums (caller should set ndatums to the number
+ * of valid datums that are passed in the array).
+ */
+typedef struct PartitionBoundCmpArg
+{
+	bool		is_bound;
+	union
+	{
+		PartitionListValue *lbound;
+		PartitionRangeBound *rbound;
+		PartitionHashBound *hbound;
+	}			bound;
+
+	Datum	   *datums;
+	int			ndatums;
+} PartitionBoundCmpArg;

This is a pretty strange definition. datums/ndatums are never valid at the same time as any of lbound/rbound/hbound, but are not included in the union. Also, is_bound doesn't tell you which of rbound/lbound/hbound is actually valid. Granted, the current calling convention looks like a mess, too. Apparently, the argument to partition_bound_cmp is a PartitionBoundSpec if using hash partitioning, a Datum if list partitioning, and either a PartitionRangeBound or a Datum * if range partitioning depending on the value of probe_is_bound, and I use the word "apparently" because there are zero words of comments explaining what the argument to partition_bound_cmp of type "void *" is supposed to mean. I really should have noticed that and insisted that it be fixed before partitioning got committed. Looking a bit further, there are exactly two calls to partition_bound_cmp(). One is in partition_bound_bsearch() and the other is in check_new_partition_bound(). Now, looking at this, both the call to partition_bound_cmp() and every single call to partition_bound_bsearch() are inside a switch branch where we've dispatched on the partitioning type, which means that from code that is already specialized by partitioning type we are calling generic code which then turns back around and calls code that is specialized by partitioning type. Now, that could make sense if the generic code is pretty complex, but here it's basically just the logic to do a bsearch. It seems to me that a cleaner solution here would be to duplicate that logic. Then we could have...

static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
				  Datum value, bool *is_equal);
static int partition_range_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
				   PartitionRangeBound *probe);
static int partition_range_datum_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
					 int nvalues, Datum *values);
static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
				  int modulus, int remainder, bool *is_equal);

...which would involve fewer branches at runtime and more type-safety at compile time. partition_hash_bsearch could directly call partition_hbound_cmp, partition_list_bsearch could directly invoke FunctionCall2Coll, partition_range_bsearch could directly call partition_rbound_cmp, and partition_range_datum_bsearch could directly call partition_rbound_datum_cmp. All-in-all that seems a lot nicer to me than what we have here now. IIUC, the purpose of this patch is to let you search on a prefix of the partition keys, but I think that's really only possible for range partitioning, and it seems like the proposed nvalues argument to partition_range_datum_bsearch would give you what you need.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
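To illustrate the kind of specialization Robert is proposing, here is roughly what the list-partitioning search could look like: a plain binary search over the sorted bound datums that invokes the partition key's comparison support function directly, with no generic dispatcher in between. A sketch, not necessarily the final committed form:

    static int
    partition_list_bsearch(PartitionKey key, PartitionBoundInfo boundinfo,
                           Datum value, bool *is_equal)
    {
        int     lo = -1,
                hi = boundinfo->ndatums - 1;

        *is_equal = false;
        while (lo < hi)
        {
            int     mid = (lo + hi + 1) / 2;
            int32   cmpval;

            /* Compare the bound datum at 'mid' against the probe value. */
            cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
                                                     key->partcollation[0],
                                                     boundinfo->datums[mid][0],
                                                     value));
            if (cmpval <= 0)
            {
                lo = mid;
                *is_equal = (cmpval == 0);
                if (*is_equal)
                    break;
            }
            else
                hi = mid - 1;
        }

        /* Index of the greatest bound <= value, or -1 if none. */
        return lo;
    }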
Thanks for the review. On 2018/02/02 7:38, Robert Haas wrote: > +/* > + * PartitionBoundCmpArg - Caller-defined argument to be passed to > + * partition_bound_cmp() > + * > + * The first (fixed) argument involved in a comparison is the partition bound > + * found in the catalog, while an instance of the following struct describes > + * either a new partition bound being compared against existing bounds > + * (caller should set is_bound to true and set bound), or a new tuple's > + * partition key specified in datums (caller should set ndatums to the number > + * of valid datums that are passed in the array). > + */ > +typedef struct PartitionBoundCmpArg > +{ > + bool is_bound; > + union > + { > + PartitionListValue *lbound; > + PartitionRangeBound *rbound; > + PartitionHashBound *hbound; > + } bound; > + > + Datum *datums; > + int ndatums; > +} PartitionBoundCmpArg; > > This is a pretty strange definition. datums/ndatums are never valid > at the same time as any of lbound/rbound/hbound, but are not included > in the union. Also, is_bound doesn't tell you which of > rbound/lbound/hbound is actually valid. Granted, the current calling > convention looks like a mess, too. Apparently, the argument to > partition_bound_cmp is a PartitionBoundSpec if using hash > partitioning, a Datum if list partitioning, and either a > PartitionRangeBound or a Datum * if range partitioning depending on > the value of probe_is_bound, and I use the word "apparently" because > there are zero words of comments explaining what the argument to > partition_bound_cmp of type "void *" is supposed to mean. I really > should have noticed that and insisted that it be fixed before > partitioning got committed. Yeah, I was trying to fix the status quo by introducing that new struct, but I agree it's much better to rework the functions a bit, along the lines you describe below. > Looking a bit further, there are exactly two calls to > partition_bound_cmp(). One is in partition_bound_bsearch() and the > other is in check_new_partition_bound(). Now, looking at this, both > the call to partition_bound_cmp() and every single call to > partition_bound_bsearch() are inside a switch branch where we've > dispatched on the partitioning type, which means that from code that > is already specialized by partitioning type we are calling generic > code which then turns back around and calls code that is specialized > by partitioning type. Now, that could make sense if the generic code > is pretty complex, but here it's basically just the logic to do a > bsearch. It seems to me that a cleaner solution here would be to > duplicate that logic. Then we could have... > > static int partition_list_bsearch(PartitionKey key, PartitionBoundInfo > boundinfo, > Datum value, bool *is_equal); > static int partition_range_bsearch(PartitionKey key, > PartitionBoundInfo boundinfo, > PartitionRangeBound *probe); > static int partition_range_datum_bsearch(PartitionKey key, > PartitionBoundInfo boundinfo, > int nvalues, Datum *values); > static int partition_hash_bsearch(PartitionKey key, PartitionBoundInfo > boundinfo, > int modulus, int remainder, bool *is_equal); > > ...which would involve fewer branches at runtime and more type-safety > at compile time. partition_hash_bsearch could directly call > partition_hbound_cmp, partition_list_bsearch could directly invoke > FunctionCall2Coll, partition_range_bsearch could directly call > partition_rbound_cmp, and partition_range_datum_bsearch could directly > call partition_rbound_datum_cmp.
> > All-in-all that seems a lot nicer to me than what we have here now. > IIUC, the purpose of this patch is to let you search on a prefix of > the partition keys, but I think that's really only possible for range > partitioning, and it seems like the proposed nvalues argument to > partition_range_datum_bsearch would give you what you need. Your proposed cleanup sounds much better, so I implemented it in the attached updated 0001, while dropping the previously proposed PartitionBoundCmpArg approach. Updated set of patches attached (patches 0002 onward mostly unchanged, except I incorporated the delta patch posted by David upthread). Thanks, Amit
Attachment
- v24-0001-Refactor-code-for-partition-bound-searching.patch
- v24-0002-Introduce-a-get_partitions_from_clauses.patch
- v24-0003-Move-some-code-of-set_append_rel_size-to-separat.patch
- v24-0004-More-refactoring-around-partitioned-table-Append.patch
- v24-0005-Teach-planner-to-use-get_partitions_from_clauses.patch
Hi David. On 2018/02/01 8:57, David Rowley wrote: > On 31 January 2018 at 21:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Updated patch set attached. Thanks again. > > (My apologies for being slow to respond here. I've been on leave this > week and I'm off again next week. I now have a little time to reply) No worries. > Thanks for incorporating my changes into the patchset. A while ago I > was rebasing the run-time pruning patch on top of this but ran into a > few problems, all of which are results of my own changes. > > 1. remove_redundant_clauses collapses the PartClause list into the > most restrictive set of clauses. This disallows multiple evaluations > of the PartitionClauseInfo during runtime pruning. I've written a > proposed fix for this and attached it. I've incorporated it in the latest patch I posted today. > 2. PartitionClauseInfo->keyclauses is a list of PartClause, which is > not a node type. This will cause _copyPartitionClauseInfo() to fail. > > I'm still not quite sure of the best way to fix #2, since PartClause > contains a FmgrInfo. I do have a local fix which moves PartClause to > primnodes.h and makes it a proper node type. I also added a copy > function which does not copy any of the cache fields in PartClause; it > just sets valid_cache to false. I didn't particularly think this was > the correct fix, I just couldn't think of how exactly this should be > done at the time. > > The attached patch also adds the missing node tag to > PartitionClauseInfo and fixes up other code so that we don't memset > the node memory to zero, as that would destroy the tag. I ended up > just having extract_partition_key_clauses do the makeNode call. This > also resulted in populate_partition_clauses being renamed to > generate_partition_clauses. I've started wondering whether it's really a good idea to make PartitionClauseInfo a Node at all. I went back to your earlier message [1] where you said that it's put into the Append node for run-time pruning to use, but it doesn't seem nice that we'd be putting into the plan something that looks more like a scratchpad for the partition.c code. I think we should try to keep PartitionClauseInfo in partition.h and put only the list of matched bare clauses into Append. Thanks, Amit [1] https://www.postgresql.org/message-id/CAKJS1f8T_efuAgPWtyGdfwD1kBLR-giFvoez7raYsQ4P1i2OYw%40mail.gmail.com
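A sketch of the shape Amit is suggesting (field and function names illustrative, not from the patch): the plan tree carries only ordinary, copyable expression nodes, and the executor rebuilds the partition.c-private working state at startup.

    /*
     * In the Append plan node: just a List of bare clause trees, which the
     * existing node copy/out/read support already handles.
     */
    List       *part_prune_quals;       /* illustrative field name */

    /* At executor startup, rebuild the internal-only scratchpad state: */
    PartitionClauseInfo *clauseinfo =
        populate_partition_clauses(context, append->part_prune_quals);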
On Fri, Feb 2, 2018 at 1:04 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Your proposed cleanup sounds much better, so I implemented it in the > attached updated 0001, while dropping the previously proposed > PartitionBoundCmpArg approach. > > Updated set of patches attached (patches 0002 onward mostly unchanged, > except I incorporated the delta patch posted by David upthread). Committed 0001. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Feb 2, 2018 at 9:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Updated set of patches attached (patches 0002 onward mostly unchanged, >> except I incorporated the delta patch posted by David upthread). > > Committed 0001. Thanks. Some preliminary thoughts... Regarding 0002, I can't help noticing that this adds a LOT of new code to partition.c. With 0002 applied, it climbs into the top 2% of all ".c" files in terms of lines of code. It seems to me, though, that maybe it would make sense to instead add all of this code to some new .c file, e.g. src/backend/optimizer/util/partprune.c. I realize that's a little awkward in this case because we're hoping to use this code at runtime and not just in the optimizer, but I don't have a better idea. Using src/backend/catalog as a dumping-ground for code that doesn't have a clear-cut place to live is not a superior alternative, for sure. And it seems to me that the code you're adding here is really quite similar to what we've already got in that directory -- for example, predtest.c, which currently does partition pruning, lives there; so does clauses.c, whose evaluate_expr facility this patch wants to use; so does relnode.c, which the other patches modify; and in general this looks an awful lot like other optimizer logic that decomposes clauses. I'm open to other suggestions but I don't think adding all of this directly into partition.c is a good plan. If we do add a new file for this code, the header comment for that file would be a good place to write an overall explanation of this new facility. The individual bits and pieces seem to have good comments, but an overall explanation of what's going on here seems to be lacking. It doesn't seem good that get_partitions_from_clauses requires us to reopen the relation. I'm going to give my standard review feedback any time someone injects additional relation_open or heap_open calls into a patch: please look for a way to piggyback on one of the places that already has the relation open instead of adding code to re-open it elsewhere. Reopening it is not entirely free, and, especially when NoLock is used, makes it hard to tell whether we're doing the locking correctly. Note that we've already got things like set_relation_partition_info (which copies the bounds) and set_baserel_partition_key_exprs (which copies, with some adjustment, the partitioning expressions) and also find_partition_scheme, but instead of using that existing data from the RelOptInfo, this patch is digging it directly out of the relcache entry, which doesn't seem great. The changes to set_append_rel_pathlist probably make it slower in the case where rte->relkind != RELKIND_PARTITIONED_TABLE. We build a whole new list that we don't really need. How about just keeping the (appinfo->parent_relid != parentRTindex) test in the loop and setting rel_appinfos to either root->append_rel_list or rel->live_part_appinfos as appropriate? I understand why COLLATION_MATCH thinks that a collation OID match is OK, but why is InvalidOid also OK? Can you add a comment? Maybe some test cases, too? How fast is this patch these days, compared with the current approach? It would be good to test both when nearly all of the partitions are pruned and when almost none of the partitions are pruned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/02/03 6:05, Robert Haas wrote: > On Fri, Feb 2, 2018 at 9:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> Updated set of patches attached (patches 0002 onward mostly unchanged, >>> except I incorporated the delta patch posted by David upthread). >> >> Committed 0001. Thanks. > > Some preliminary thoughts... Thanks for the review. > Regarding 0002, I can't help noticing that this adds a LOT of new code > to partition.c. With 0002 applied, it climbs into the top 2% of all > ".c" files in terms of lines of code. It seems to me, though, that > maybe it would make sense to instead add all of this code to some new > .c file, e.g. src/backend/optimizer/util/partprune.c. I realize > that's a little awkward in this case because we're hoping to use this > code at runtime and not just in the optimizer, but I don't have a > better idea. Using src/backend/catalog as a dumping-ground for code > that doesn't have a clear-cut place to live is not a superior > alternative, for sure. Agreed. partition.c has gotten quite big, and actually more than half of the code that this patch adds really seems to belong outside of it. > And it seems to me that the code you're adding > here is really quite similar to what we've already got in that > directory -- for example, predtest.c, which currently does partition > pruning, lives there; so does clauses.c, whose evaluate_expr facility > this patch wants to use; so does relnode.c, which the other patches > modify; and in general this looks an awful lot like other optimizer > logic that decomposes clauses. I'm open to other suggestions but I > don't think adding all of this directly into partition.c is a good > plan. Agreed. A partprune.c in the optimizer's util directory seems like a good place. > If we do add a new file for this code, the header comment for that > file would be a good place to write an overall explanation of this new > facility. The individual bits and pieces seem to have good comments, > but an overall explanation of what's going on here seems to be > lacking. OK, I will add such a comment. > It doesn't seem good that get_partitions_from_clauses requires us to > reopen the relation. I'm going to give my standard review feedback > any time someone injects additional relation_open or heap_open calls > into a patch: please look for a way to piggyback on one of the places > that already has the relation open instead of adding code to re-open > it elsewhere. Reopening it is not entirely free, and, especially when > NoLock is used, makes it hard to tell whether we're doing the locking > correctly. Note that we've already got things like > set_relation_partition_info (which copies the bounds) and > set_baserel_partition_key_exprs (which copies, with some adjustment, > the partitioning expressions) and also find_partition_scheme, but > instead of using that existing data from the RelOptInfo, this patch is > digging it directly out of the relcache entry, which doesn't seem > great. OK, I have to admit that the quoted heap_open wasn't quite well thought out, and I'm almost sure that everything should be fine with the information that set_relation_partition_info() fills in the RelOptInfo. I'm now going through the patch to try to figure out how to make that work. > The changes to set_append_rel_pathlist probably make it slower in the > case where rte->relkind != RELKIND_PARTITIONED_TABLE. We build a > whole new list that we don't really need.
> How about just keeping the (appinfo->parent_relid != parentRTindex) test > in the loop and setting rel_appinfos to either root->append_rel_list or > rel->live_part_appinfos as appropriate? That's certainly better. Also in set_append_rel_size. > I understand why COLLATION_MATCH thinks that a collation OID match is > OK, but why is InvalidOid also OK? Can you add a comment? Maybe some > test cases, too? partcollid == InvalidOid means the partition key is of an uncollatable type, so further checking the collation is unnecessary. There is a test in partition_prune.sql that covers the failure to prune when collations don't match for a text partition key. > How fast is this patch these days, compared with the current approach? > It would be good to test both when nearly all of the partitions are > pruned and when almost none of the partitions are pruned. I will include some performance numbers in my next email, which hopefully should not be later than Friday this week. Thanks, Amit
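For reference, the information Amit plans to rely on is already gathered onto the partitioned table's RelOptInfo by plancat.c, so a pruning entry point can work from that alone. A sketch using the existing RelOptInfo fields (the function name is hypothetical, and the actual clause matching is elided):

    static Bitmapset *
    prune_append_rel_partitions(RelOptInfo *rel, List *clauses)
    {
        /* Filled in by set_relation_partition_info() and friends. */
        PartitionScheme part_scheme = rel->part_scheme; /* strategy, opfamilies */
        int             nparts = rel->nparts;           /* partition count */

        /*
         * rel->boundinfo holds the bound datums and rel->partexprs the
         * partition key expressions -- everything clause matching needs,
         * with no extra relation_open() call.
         */
        (void) part_scheme;
        (void) nparts;

        return NULL;        /* clause matching elided in this sketch */
    }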
On Tue, Feb 6, 2018 at 4:55 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I understand why COLLATION_MATCH thinks that a collation OID match is >> OK, but why is InvalidOid also OK? Can you add a comment? Maybe some >> test cases, too? > > partcollid == InvalidOid means the partition key is of an uncollatable type, > so further checking the collation is unnecessary. Yeah, but in that case wouldn't BOTH OIDs be InvalidOid, and thus the equality test would match anyway? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 6, 2018 at 3:25 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > > Agreed. partition.c has gotten quite big, and actually more than half of > the code that this patch adds really seems to belong outside of it. > >> And it seems to me that the code you're adding >> here is really quite similar to what we've already got in that >> directory -- for example, predtest.c, which currently does partition >> pruning, lives there; so does clauses.c, whose evaluate_expr facility >> this patch wants to use; so does relnode.c, which the other patches >> modify; and in general this looks an awful lot like other optimizer >> logic that decomposes clauses. I'm open to other suggestions but I >> don't think adding all of this directly into partition.c is a good >> plan. > > Agreed. > > A partprune.c in the optimizer's util directory seems like a good place. partprune.c looks too much tied to one feature. I am sure that the functions used for partition pruning can be used by other optimizations as well. partition.c seems to have two kinds of functions: 1. functions that build and manage relcache data, create quals from bounds, etc., which are of the metadata-management kind, and 2. partition bound comparison functions and other optimizer-related functions. Maybe we should divide the file that way. The first category of code remains in catalog/ as it is today; the second category of functions moves to optimizer/. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > partprune.c looks too much tied to one feature. I am sure that the > functions used for partition pruning can be used by other > optimizations as well. Uh, I don't know about that; this code looks like it does partition pruning specifically, and nothing else. How else do you think it could be used? > partition.c seems to have two kinds of functions: 1. functions that > build and manage relcache data, create quals from bounds, etc., which > are of the metadata-management kind, and 2. partition bound comparison > functions and other optimizer-related functions. Maybe we should divide > the file that way. The first category of code remains in catalog/ as it > is today; the second category of functions moves to optimizer/. It would be sensible to separate functions that build and manage data in the relcache from other functions. I think we should consider moving the existing functions of that type from partition.c to src/backend/utils/cache/partcache.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 7, 2018 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> partprune.c looks too much tied to one feature. I am sure that the >> functions used for partition pruning can be used by other >> optimizations as well. > > Uh, I don't know about that; this code looks like it does partition > pruning specifically, and nothing else. How else do you think it > could be used? I didn't have any specific thing in mind, and the more I think about it, the less likely it looks that it will be used for something else. While looking at the changes in partition.c I happened to look at the changes in try_partition_wise_join(). They mark partitions deemed dummy by pruning as dummy relations. If we accept those changes, we could very well change the way we handle dummy partitioned tables, which would mean that we also revert the recent commit f069c91a5793ff6b7884120de748b2005ee7756f. But I guess those changes haven't been reviewed yet and so aren't final. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > While looking at the changes in partition.c I happened to look at the > changes in try_partition_wise_join(). They mark partitions deemed > dummy by pruning as dummy relations. If we accept those changes, we > could very well change the way we handle dummy partitioned tables, > which would mean that we also revert the recent commit > f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes > haven't been reviewed yet and so not final. Well, if you have an opinion on those proposed changes, I'd like to hear it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/02/07 1:36, Robert Haas wrote: > On Tue, Feb 6, 2018 at 4:55 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> I understand why COLLATION_MATCH thinks that a collation OID match is >>> OK, but why is InvalidOid also OK? Can you add a comment? Maybe some >>> test cases, too? >> >> partcollid == InvalidOid means the partition key is of an uncollatable type, >> so further checking the collation is unnecessary. > > Yeah, but in that case wouldn't BOTH OIDs be InvalidOid, and thus the > equality test would match anyway? It seems that that's not necessarily true. I remember having copied that logic from the following macro in indxpath.c:

#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
	((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))

which was added by the following commit:

commit cb37c291060dd13b1a8ff61fceee09efcfbc34e1
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu Sep 29 00:43:42 2011 -0400

Fix index matching for operators with mixed collatable/noncollatable inputs. If an indexable operator for a non-collatable indexed datatype has a collatable right-hand input type, any OpExpr for it will be marked with a nonzero inputcollid (since having one collatable input is sufficient to make that happen). However, an index on a non-collatable column certainly doesn't have any collation. This caused us to fail to match such operators to their indexes, because indxpath.c required an exact match of index collation and clause collation. It seems correct to allow a match when the index is collation-less regardless of the clause's inputcollid: an operator with both noncollatable and collatable inputs could perhaps depend on the collation of the collatable input, but it could hardly expect the index for the noncollatable input to have that same collation. [ ... ]

+ * If the index has a collation, the clause must have the same collation.
+ * For collation-less indexes, we assume it doesn't matter; this is
+ * necessary for cases like "hstore ? text", wherein hstore's operators
+ * don't care about collation but the clause will get marked with a
+ * collation anyway because of the text argument.  (This logic is
+ * embodied in the macro IndexCollMatchesExprColl.)
+ *

Discussion leading to the above commit occurred here: https://www.postgresql.org/message-id/flat/201109282050.p8SKoA4O084649%40wwwmaster.postgresql.org It seems that we can think similarly for partitioning and let partition pruning proceed with a clause even if the partition key is non-collatable while the clause's other argument is collatable, even though it seems we don't yet allow the kind of partitioning that would lead to such a situation. Thanks, Amit
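So the macro in the patch presumably mirrors the index one, treating a collation-less partition key as matching any expression collation, i.e., something along these lines:

    /* Sketch of the partitioning analogue of IndexCollMatchesExprColl. */
    #define COLLATION_MATCH(partcoll, exprcoll) \
        (!OidIsValid(partcoll) || (partcoll) == (exprcoll))

With a collatable key, both OIDs must match exactly; with an uncollatable key, the clause's inputcollid — which can be nonzero because of the other, collatable input — is simply ignored.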
Robert Haas wrote: > On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: > > partition.c seems to have two kinds of functions: 1. functions that > > build and manage relcache data, create quals from bounds, etc., which > > are of the metadata-management kind, and 2. partition bound comparison > > functions and other optimizer-related functions. Maybe we should > > divide the file that way. The first category of code remains in > > catalog/ as it is today; the second category of functions moves to > > optimizer/. > > It would be sensible to separate functions that build and manage data > in the relcache from other functions. I think we should consider > moving the existing functions of that type from partition.c to > src/backend/utils/cache/partcache.c. FWIW I've been thinking that perhaps we need some better separation of the code than the status quo. The current partition.c file includes stuff for several modules, and ISTM all these new patches are making more and more of a mess. So +1 to the general idea of splitting things up. Maybe partcache.c is not ambitious enough, but it seems like a good first step. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 7, 2018 at 7:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> While looking at the changes in partition.c I happened to look at the >> changes in try_partition_wise_join(). They mark partitions deemed >> dummy by pruning as dummy relations. If we accept those changes, we >> could very well change the way we handle dummy partitioned tables, >> which would mean that we also revert the recent commit >> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess those changes >> haven't been reviewed yet and so aren't final. > > Well, if you have an opinion on those proposed changes, I'd like to hear it. I am talking about the changes after this comment:

/*
+ * If either child_rel1 or child_rel2 is not a live partition, they'd
+ * not have been touched by set_append_rel_size.  So, its RelOptInfo
+ * would be missing some information that set_append_rel_size sets for
+ * live partitions, such as the target list, child EQ members, etc.
+ * We need to make the RelOptInfo of even the dead partitions look
+ * minimally valid and as having a valid dummy path attached to it.
+ */

There are a couple of problems with this change: 1. An N-way join may call try_partition_wise_join() with the same base relation on one side N times. The condition will be tested that many times. 2. We will have to adjust or make similar changes in try_partition_wise_aggregate() proposed in the partition-wise aggregate patch. Right now it checks if the relation is dummy, but it will have to check whether the pathlist is also NULL. Any partition-wise operation that we try in future will need this adjustment. AFAIU, for pruned partitions, we don't set the necessary properties of the corresponding RelOptInfo. If we were sure that we would not use that RelOptInfo anywhere in the rest of the planning, this would work. But that's not the case. AFAIU, the current planner assumes that a relation which has not been eliminated before planning (i.e., is not DEAD), but is later proved not to contribute any rows to the result, is marked dummy. Partition pruning breaks that assumption and thus may have other side effects that we do not see right now. We have a similar problem with dummy partitioned tables, but we have code in place to avoid looking at the pathlists of their children by not considering such a partitioned table as partitioned. Maybe we want to change that too. Either we add refactoring patches to change the planner so that it doesn't assume something like that, OR we make sure that the pruned partition's RelOptInfo has the necessary properties and a dummy pathlist set. I will vote for the second. We spend CPU cycles marking pruned partitions as dummy even if the dummy pathlist is never used. Maybe we can avoid setting a dummy pathlist if we can detect that a pruned partition is guaranteed not to be used, e.g. when the corresponding partitioned relation does not participate in any join or other upper planning. Apart from that, another change that caught my eye is: Instead of going through root->append_rel_list to pick up the child appinfos, store them in an array called part_appinfos that stores partition appinfos in the same order as RelOptInfos are stored in part_rels, right when the latter are created.
Further, instead of going through root->pcinfo_list to get the list of partitioned child rels, which ends up including even the rels that are pruned by set_append_rel_size(), build up a list of "live" partitioned child rels and use the same to initialize the partitioned_rels field of AppendPath. That was voted down by Robert during the partition-wise join implementation, and I agree with him. Any changes around that should change the way we handle AppendRelInfos for all relations, not just (declarative) partitioned relations. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
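For readers following along: "marking a rel dummy" in the current planner boils down to roughly the following (cf. set_dummy_rel_pathlist() in allpaths.c; create_append_path() is shown with its PG10-era argument list), after which IS_DUMMY_REL() recognizes the rel. The disagreement above is about when pruned partitions should get this treatment, not about the mechanism itself.

    /* Set dummy size estimates and discard any pre-existing paths. */
    rel->rows = 0;
    rel->reltarget->width = 0;
    rel->pathlist = NIL;
    rel->partial_pathlist = NIL;

    /* An empty Append is the canonical "returns no rows" path. */
    add_path(rel, (Path *) create_append_path(rel, NIL, NULL, 0, NIL));

    /* Set cheapest paths immediately so IS_DUMMY_REL() works. */
    set_cheapest(rel);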
Hi Ashutosh. On 2018/02/09 14:09, Ashutosh Bapat wrote: > On Wed, Feb 7, 2018 at 7:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> While looking at the changes in partition.c I happened to look at the >>> changes in try_partition_wise_join(). They mark partitions deemed >>> dummy by pruning as dummy relations. If we accept those changes, we >>> could very well change the way we handle dummy partitioned tables, >>> which would mean that we also revert the recent commit >>> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess those changes >>> haven't been reviewed yet and so aren't final. >> >> Well, if you have an opinion on those proposed changes, I'd like to hear it. > > I am talking about the changes after this comment: > > /* > + * If either child_rel1 or child_rel2 is not a live partition, they'd > + * not have been touched by set_append_rel_size. So, its RelOptInfo > + * would be missing some information that set_append_rel_size sets for > + * live partitions, such as the target list, child EQ members, etc. > + * We need to make the RelOptInfo of even the dead partitions look > + * minimally valid and as having a valid dummy path attached to it. > + */ > > There are a couple of problems with this change: > 1. An N-way join may call try_partition_wise_join() with the same base > relation on one side N times. The condition will be tested that many > times. > > 2. We will have to adjust or make similar changes in > try_partition_wise_aggregate() proposed in the partition-wise > aggregate patch. Right now it checks if the relation is dummy, but it > will have to check whether the pathlist is also NULL. Any > partition-wise operation that we try in future will need this > adjustment. > > AFAIU, for pruned partitions, we don't set the necessary properties of > the corresponding RelOptInfo. If we were sure that we would not use > that RelOptInfo anywhere in the rest of the planning, this would work. > But that's not the case. AFAIU, the current planner assumes that a > relation which has not been eliminated before planning (i.e., is not > DEAD), but is later proved not to contribute any rows to the result, > is marked dummy. Partition pruning breaks that assumption and thus may > have other side effects that we do not see right now. We have a > similar problem with dummy partitioned tables, but we have code in > place to avoid looking at the pathlists of their children by not > considering such a partitioned table as partitioned. Maybe we want to > change that too. > > Either we add refactoring patches to change the planner so that it > doesn't assume something like that, OR we make sure that the pruned > partition's RelOptInfo has the necessary properties and a dummy > pathlist set. I will vote for the second. We spend CPU cycles marking > pruned partitions as dummy even if the dummy pathlist is never used. > Maybe we can avoid setting a dummy pathlist if we can detect that a > pruned partition is guaranteed not to be used, e.g. when the > corresponding partitioned relation does not participate in any join or > other upper planning. Thanks for the analysis. I agree with all the points of concern, so for now I have dropped from my patch all the changes that give rise to them.
With the new patch, changes to the existing optimizer code besides introducing partprune.c in the util directory are pretty thin:

git diff master --stat src/backend/optimizer/
 src/backend/optimizer/path/allpaths.c  |   16 ++
 src/backend/optimizer/util/Makefile    |    2 +-
 src/backend/optimizer/util/clauses.c   |    4 +-
 src/backend/optimizer/util/partprune.c | 1421 +++++++++++
 src/backend/optimizer/util/plancat.c   |   83 ++++---
 src/backend/optimizer/util/relnode.c   |    8 +
 6 files changed, 1504 insertions(+), 30 deletions(-)

So, no refactoring of the existing optimizer code; just replacing the partition pruning mechanism with the partprune.c functions. > Apart from that, another change that caught my eye is: > > Instead of going through root->append_rel_list to pick up the child > appinfos, store them in an array called part_appinfos that stores > partition appinfos in the same order as RelOptInfos are stored in > part_rels, right when the latter are created. > > Further, instead of going through root->pcinfo_list to get the list > of partitioned child rels, which ends up including even the rels > that are pruned by set_append_rel_size(), build up a list of "live" > partitioned child rels and use the same to initialize the > partitioned_rels field of AppendPath. > > That was voted down by Robert during the partition-wise join > implementation, and I agree with him. Any changes around that > should change the way we handle AppendRelInfos for all relations, not > just (declarative) partitioned relations. I removed part_appinfos from the patch. Also, I have made the changes introducing live_partitioned_rels a separate patch, which we can discuss independently of the pruning changes. Will post the latest patch set later in the evening. Thanks, Amit
On 2018/02/06 18:55, Amit Langote wrote:
>> How fast is this patch these days, compared with the current approach?
>> It would be good to test both when nearly all of the partitions are
>> pruned and when almost none of the partitions are pruned.
>
> I will include some performance numbers in my next email, which hopefully
> should not be later than Friday this week.

Here is the latest set of patches. I see about a 2x speedup in planning time across various partition counts, although planning time still grows linearly with the partition count (as it does on HEAD). Detailed performance figures follow.

* Partitioned table schema:

H: create table ht (a int, b int) partition by hash (b);
   create table ht_* partition of ht for values with (modulus N, ...)

L: create table lt (a int, b int) partition by list (b);
   create table lt_1 partition of lt for values in (1)
   ..
   create table lt_N partition of lt for values in (N)

R: create table rt (a int, b int) partition by range (b);
   create table rt_1 partition of rt for values from (1) to (<step>)
   ..
   create table rt_N partition of rt for values from ((N-1) * <step>) to (N * <step>)

* Queries

Prunes every partition but 1: select * from table_name where b = 1
Prunes none:                  select * from table_name where b >= 1

* Planning time in milliseconds (average of 5 runs).

On HEAD:

parts   H-prune  H-noprune  L-prune  L-noprune  R-prune  R-noprune
    8      1.50       1.42     1.60       1.55     1.77       1.75
   16      2.49       2.37     2.32       2.65     3.29       3.07
   32      3.96       4.49     3.83       4.14     5.06       5.70
   64      8.02       7.51     7.14       7.34     9.37      10.02
  128     14.47      14.19    13.31      13.99    18.09      18.86
  256     24.76      27.63    25.59      27.87    34.15      37.19
  512     50.36      55.92    52.56      54.76    69.34      72.55
 1024    102.94     110.59   104.97     110.41   136.89     146.54

Patched:

parts   H-prune  H-noprune  L-prune  L-noprune  R-prune  R-noprune
    8      1.49       0.90     0.87       0.74     0.84       1.09
   16      2.01       1.50     1.42       1.68     1.42       1.41
   32      2.63       2.47     2.08       2.69     2.73       2.81
   64      5.62       4.66     4.45       4.96     4.92       5.08
  128     11.28       9.65     9.00       9.60     8.68       9.91
  256     18.36      18.49    17.11      18.39    17.47      18.43
  512     33.88      36.89    34.06      36.52    34.01      37.26
 1024     66.40      72.75    66.37      73.40    67.06      67.06

Attached are the v25 patches.

0001-Modify-bound-comparision-functions-to-accept-mem.patch

This is Ashutosh's patch that he posted on the "advanced partition matching algorithm for partition-wise join" thread.

0002-Refactor-partition-bound-search-functions.patch

This is similar to 0001. Whereas 0001 modifies just the comparison functions, this one modifies the partition bound search functions, because the main pruning patch uses the search functions.

0003-Add-parttypid-partcollation-partsupfunc-to-Parti.patch

This adds some of the fields to PartitionScheme that are needed by the main pruning patch.

The above three patches are structured that way because we'd like the main pruning patch to implement its functionality relying only on whatever information is made available in the partitioned table's RelOptInfo.

0004-Faster-partition-pruning.patch

The main patch that adds src/backend/optimizer/util/partprune.c, a module to provide the functionality that will replace the current approach of calling relation_excluded_by_constraints() for each partition.

Sorry, but there is still this big TODO here, which I'll try to fix early next week.
+ * partprune.c
+ *		Provides functions to prune partitions of a partitioned table by
+ *		comparing provided set of clauses with the table's partitions'
+ *		boundaries
+ *
+ * TODO: write a longer description of things in this file

0005-Add-only-unpruned-partitioned-child-rels-to-part.patch

This one teaches the planner to put *only* the un-pruned partitioned child tables into the partitioned_rels list of certain plan nodes.

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAFjFpRctst136uN2BvbWLAkS7w278XmKY4_PUB%2Bxk-%2BNezNq8g%40mail.gmail.com
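To give a feel for the module's shape, here is a rough sketch of its entry point (the signature comes up again in review comments later in this thread); get_partitions_from_clauses() below is a placeholder name for the clause-matching and bound-lookup machinery, not necessarily the patch's exact code.

Bitmapset *
prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)
{
    Bitmapset  *result = NULL;
    Bitmapset  *partindexes;
    int         i;

    /*
     * Match rel->baserestrictinfo clauses against the partition key and
     * look the surviving partitions up in the bound descriptor.
     */
    partindexes = get_partitions_from_clauses(root, rel);

    /* Translate partition indexes into the partitions' RT indexes. */
    i = -1;
    while ((i = bms_next_member(partindexes, i)) >= 0)
        result = bms_add_member(result, rel->part_rels[i]->relid);

    return result;
}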
reorganizing partitioning code (was: Re: [HACKERS] path toward faster partition pruning)
From: Amit Langote
On 2018/02/09 2:58, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>
>>> partition.c seems to have two kinds of functions: 1. those that build
>>> and manage the relcache, create quals from bounds, etc., which are of
>>> the metadata management kind, and 2. partition bound comparison
>>> functions and other optimizer-related functions. Maybe we should
>>> divide the file that way. The first category of code remains in
>>> catalog/ as it is today; the second category of functions moves to
>>> optimizer/.
>>
>> It would be sensible to separate functions that build and manage data
>> in the relcache from other functions. I think we should consider
>> moving the existing functions of that type from partition.c to
>> src/backend/utils/cache/partcache.c.
>
> FWIW I've been thinking that perhaps we need some separation of the
> code better than the status quo. The current partition.c file includes
> stuff for several modules and ISTM all these new patches are making
> more and more of a mess. So +1 to the general idea of splitting things
> up. Maybe partcache.c is not ambitious enough, but it seems a good
> first step.

I agree with the proposed reorganization and with adding a partcache.c, which I have tried to do in the attached patch.

* The new src/backend/utils/cache/partcache.c contains functions that initialize the relcache's partitioning-related fields. Various partition bound comparison and search functions (and then some) that work off of the cached information are moved there. Also, since we cache the partition qual, the interface functions RelationGetPartitionQual(Relation) and get_partition_qual_relid(Oid) are moved too.

* The new src/include/utils/partcache.h contains various struct definitions that are moved from backend/catalog/partition.c, include/catalog/partition.h, and include/utils/rel.h, along with declarations of the interface functions of partcache.c.

Thoughts?

Thanks,
Amit
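To make the proposal concrete, here is a hypothetical sketch of the new header's shape; the exact contents are in the attached patch.

/*
 * src/include/utils/partcache.h (sketch)
 */
#ifndef PARTCACHE_H
#define PARTCACHE_H

#include "utils/relcache.h"

/*
 * Struct definitions moved here from catalog/partition.c,
 * catalog/partition.h, and utils/rel.h would appear at this point
 * (PartitionKey, PartitionDesc, etc.).
 */

/* Interface functions that work off of cached partitioning info */
extern List *RelationGetPartitionQual(Relation rel);
extern Expr *get_partition_qual_relid(Oid relid);

#endif							/* PARTCACHE_H */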
On 2018/02/09 21:36, Amit Langote wrote:
> 0004-Faster-partition-pruning.patch
>
> The main patch that adds src/backend/optimizer/util/partprune.c, a module
> to provide the functionality that will replace the current approach of
> calling relation_excluded_by_constraints() for each partition.
>
> Sorry, but there is still this big TODO here, which I'll try to fix early
> next week.
>
> + * partprune.c
> + *		Provides functions to prune partitions of a partitioned table by
> + *		comparing provided set of clauses with the table's partitions'
> + *		boundaries
> + *
> + * TODO: write a longer description of things in this file

And I tried to fix that to some degree in the attached updated version.

Thanks,
Amit
Re: reorganizing partitioning code (was: Re: [HACKERS] path toward faster partition pruning)
From: Alvaro Herrera
Thanks for working on this. May I suggest opening a completely new thread?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reorganizing partitioning code (was: Re: [HACKERS] path toward faster partition pruning)
From: Ashutosh Bapat
On Tue, Feb 13, 2018 at 2:17 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> I agree with the proposed reorganization and with adding a partcache.c,
> which I have tried to do in the attached patch.
>
> * The new src/backend/utils/cache/partcache.c contains functions that
> initialize the relcache's partitioning-related fields. Various partition
> bound comparison and search functions (and then some) that work off of
> the cached information are moved there.

Are you moving the partition bound comparison functions to partcache.c? They will also be used by the optimizer, so maybe leave them out of partcache.c?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [Sender Address Forgery] Re: [Sender Address Forgery] Re: [Sender Address Forgery] Re: [HACKERS] path toward faster partition pruning
From: Alvaro Herrera
David Rowley wrote:
> On 19 January 2018 at 16:00, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> And I'd like to ask David to check his mail environment so
>> that an SPF record is available for his messages.
>
> Will investigate

This should be fixed now. Please let us know if you still see problems.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/02/13 22:23, Alvaro Herrera wrote: > Thanks for working on this. May I suggest to open a completely new > thread? Done. Thanks, Amit
On 2018/02/13 20:08, Amit Langote wrote: > On 2018/02/09 21:36, Amit Langote wrote: >> 0004-Faster-partition-pruning.patch >> >> The main patch that adds src/backend/optimizer/util/partprune.c, a module >> to provide the functionality that will replace the current approach of >> calling relation_excluded_by_constraints() for each partition. >> >> Sorry, but there is still this big TODO here, which I'll try to fix early >> next week. >> >> + * partprune.c >> + * Provides functions to prune partitions of a partitioned table by >> + * comparing provided set of clauses with the table's partitions' >> + * boundaries >> + * >> + * TODO: write a longer description of things in this file > > And I tried to fix that to some degree in the attached updated version. Here is an updated version. I realized that 0005 (Add only unpruned partitioned child rels to partitioned_rels) did that only for (Merge)Append. That is, it didn't handle ModifyTable. I fixed that by teaching inheritance_planner() to do it. In the process, I found out that we don't need the PartitionedChildRelInfo node and related code anymore, so the patch ended up removing more code than adding it. Thanks, Amit
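The inheritance_planner() side of that change amounts to something like the following sketch, not the patch verbatim; child_is_live() is a hypothetical stand-in for the pruning test. Each surviving child's immediate parent is recorded; in a multi-level tree the intermediate parents are themselves children with AppendRelInfos, so every live partitioned table ends up collected.

Relids      partitioned_relids = NULL;
List       *partitioned_rels = NIL;
ListCell   *lc;
int         rti;

foreach(lc, root->append_rel_list)
{
    AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc);

    /* Skip children pruned away (hypothetical test). */
    if (!child_is_live(root, appinfo->child_relid))
        continue;

    partitioned_relids = bms_add_member(partitioned_relids,
                                        appinfo->parent_relid);
}

/* Flatten the set into the plan node's partitioned_rels list. */
rti = -1;
while ((rti = bms_next_member(partitioned_relids, rti)) >= 0)
    partitioned_rels = lappend_int(partitioned_rels, rti);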
On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> 2. PartitionClauseInfo->keyclauses is a list of PartClause, which is
>> not a node type. This will cause _copyPartitionClauseInfo() to fail.
>>
>> I'm still not quite sure of the best way to fix #2, since PartClause
>> contains an FmgrInfo. I do have a local fix which moves PartClause to
>> primnodes.h and makes it a proper node type. I also added a copy
>> function which does not copy any of the cache fields in PartClause; it
>> just sets valid_cache to false. I didn't particularly think this was
>> the correct fix. I just couldn't think of how exactly this should be
>> done at the time.
>>
>> The attached patch also adds the missing node tag to
>> PartitionClauseInfo and also fixes up other code so that we don't
>> memset the node memory to zero, as that would destroy the tag. I ended
>> up just having extract_partition_key_clauses do the makeNode call. This
>> also resulted in populate_partition_clauses being renamed to
>> generate_partition_clauses.
>
> I started wondering if it's not such a good idea to make
> PartitionClauseInfo a Node at all? I went back to your earlier message
> [1] where you said that it's put into the Append node for run-time
> pruning to use, but it doesn't sound nice that we'd be putting into the
> plan something that looks more like a scratchpad for the partition.c
> code. I think we should try to keep PartitionClauseInfo in partition.h
> and put only the list of matched bare clauses into Append.

That sounds like a good idea.

A patch which puts this back is attached.

I've changed the run-time prune patch to process the clause lists during execution instead.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
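To make the shape of that concrete, here is a hypothetical sketch of PartitionClauseInfo as a plain struct (no NodeTag), with field names taken from the discussion on this thread rather than from the attached patch.

/*
 * Sketch only -- a plain struct, not a Node, so copyObject() never sees
 * it and none of it ends up in the stored plan.
 */
typedef struct PartitionClauseInfo
{
    /* Lists of matched PartClause, one list per partition key */
    List       *keyclauses[PARTITION_MAX_KEYS];

    /* Clauses with the <> operator; only usable for list partitioning */
    List       *ne_clauses;

    /* OR clauses, whose arguments are processed recursively */
    List       *or_clauses;

    /* True if the clauses were proven self-contradictory */
    bool        constfalse;
} PartitionClauseInfo;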
On 15 February 2018 at 18:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Here is an updated version.

Thanks for sending v27. I've had a quick look over it while I was working on the run-time prune patch; however, I've not quite managed a complete pass of this version yet. A couple of things so far:

1. The following loop:

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        /* Only the default partition accepts nulls. */
        if (partition_bound_has_default(boundinfo))
            return bms_make_singleton(boundinfo->default_index);
        else
            return NULL;
    }
}

could become:

if (partition_bound_has_default(boundinfo) &&
    !bms_is_empty(keys->keyisnull))
    return bms_make_singleton(boundinfo->default_index);
else
    return NULL;

2. Is the following form of loop necessary?

for (i = 0; i < partnatts; i++)
{
    if (bms_is_member(i, keys->keyisnull))
    {
        keys->n_eqkeys++;
        keyisnull[i] = true;
    }
}

Can't this just be:

i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
    keys->n_eqkeys++;
    keyisnull[i] = true;
}

Perhaps you can just Assert(i < partnatts), if you're worried about that. Similar code exists in get_partitions_for_keys_range.

3. Several comments mention partition_bound_bsearch, but there is now no such function.

4. "us" should be "is":

 * not be any unassigned range to speak of, because the range us unbounded

5. The following code is more complex than it needs to be:

/*
 * Since partition keys with nulls are mapped to the default range
 * partition, we must include the default partition if some keys
 * could be null.
 */
if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts)
{
    for (i = 0; i < partnatts; i++)
    {
        if (!bms_is_member(i, keys->keyisnotnull))
        {
            include_def = true;
            break;
        }
    }
}

Instead of the for loop, couldn't you just write:

include_def = (bms_num_members(keys->keyisnotnull) < partnatts);

6. The following comment is not well written:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of indexes of partitions that can safely be removed
 * due to each such partition's every allowable non-null datum appearing in
 * a <> opeartor clause.

Maybe it would be better to write:

 * get_partitions_excluded_by_ne_datums
 *
 * Returns a Bitmapset of partition indexes that can safely be removed due to
 * the discovery of <> clauses for each datum value allowed in the partition.

if not, then "opeartor" needs the spelling fixed.

7. "The following":

 * Followig entry points exist to this module.

Are there any other .c files where we comment on all the extern functions in this way? I don't recall seeing it before.

8. The following:

context.partnatts = rel->part_scheme->partnatts;

may as well just be:

context.partnatts = partnatts;

9. Why palloc0? Wouldn't palloc be ok?

context.partkeys = (Expr **) palloc0(sizeof(Expr *) * context.partnatts);

Also, no need for context.partnatts here; just partnatts should be fine.

10. I'd rather see bms_next_member() used here:

/* Add selected partitions' RT indexes to result. */
while ((i = bms_first_member(partindexes)) >= 0)
    result = bms_add_member(result, rel->part_rels[i]->relid);

There's not really much use for bms_first_member these days. It can be slower due to having to traverse the unset lower-order bits on each loop. bms_next_member starts where the previous loop left off.

Will try to review more tomorrow.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
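On points 2 and 10 above, the two Bitmapset iteration idioms side by side; either loop produces the same result set, the difference being that bms_first_member() empties its input and rescans from bit 0 on every call, while bms_next_member() leaves the input intact and resumes from the previous position:

Bitmapset  *tmp;
int         i;

/* Destructive idiom: the caller must copy the set if it's needed later. */
tmp = bms_copy(partindexes);
while ((i = bms_first_member(tmp)) >= 0)
    result = bms_add_member(result, rel->part_rels[i]->relid);

/* Non-destructive (and typically faster) equivalent. */
i = -1;
while ((i = bms_next_member(partindexes, i)) >= 0)
    result = bms_add_member(result, rel->part_rels[i]->relid);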
On 17 February 2018 at 22:24, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I started wondering if it's not such a good idea to make >> PartitionClauseInfo a Node at all? > That sounds like a good idea. > > A patch which puts this back is attached. Please find attached an updated patch. The previous one must've got a bit mangled in a bad merge. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 17 February 2018 at 22:39, David Rowley <david.rowley@2ndquadrant.com> wrote: > 10. I'd rather see bms_next_member() used here: > > /* Add selected partitions' RT indexes to result. */ > while ((i = bms_first_member(partindexes)) >= 0) > result = bms_add_member(result, rel->part_rels[i]->relid); > > There's not really much use for bms_first_member these days. It can be > slower due to having to traverse the unset lower significant bits each > loop. bms_next_member starts where the previous loop left off. > > Will try to review more tomorrow. As I mentioned yesterday, here's the remainder of the review: 11. The following comment is misleading. It says: "We currently handle two such cases:", then it goes on to say the 2nd case is not handled. /* * Handle cases where the clause's operator does not belong to * the partitioning operator family. We currently handle two * such cases: 1. Operators named '<>' are not listed in any * operator family whatsoever, 2. Ordering operators like '<' * are not listed in the hash operator families. For 1, check * if list partitioning is in use and if so, proceed to pass * the clause to the caller without doing any more processing * ourselves. 2 cannot be handled at all, so the clause is * simply skipped. */ 12. The following code should test for LIST partitioning before doing anything else: if (!op_in_opfamily(opclause->opno, partopfamily)) { Oid negator; /* * To confirm if the operator is really '<>', check if its * negator is a equality operator. If it's a btree * equality operator *and* this is a list partitioned * table, we can use it prune partitions. */ negator = get_negator(opclause->opno); if (OidIsValid(negator) && op_in_opfamily(negator, partopfamily)) { Oid lefttype; Oid righttype; int strategy; get_op_opfamily_properties(negator, partopfamily, false, &strategy, &lefttype, &righttype); if (strategy == BTEqualStrategyNumber && context->strategy == PARTITION_STRATEGY_LIST) is_ne_listp = true; } /* Cannot handle this clause. */ if (!is_ne_listp) continue; } The code should probably be in the form of: if (!op_in_opfamily(opclause->opno, partopfamily)) { if (context->strategy != PARTITION_STRATEGY_LIST) continue; ... if (strategy == BTEqualStrategyNumber) is_ne_listp = true; } that way we'll save 3 syscache lookups when a <> clause appears in a RANGE or HASH partitioned table. 13. The following code makes assumptions that the partitioning op family is btree: /* * In case of NOT IN (..), we get a '<>', which while not * listed as part of any operator family, we are able to * handle it if its negator is an equality operator that * is in turn part of the operator family. */ if (!op_in_opfamily(saop_op, partopfamily)) { Oid negator = get_negator(saop_op); int strategy; Oid lefttype, righttype; if (!OidIsValid(negator)) continue; get_op_opfamily_properties(negator, partopfamily, false, &strategy, &lefttype, &righttype); if (strategy != BTEqualStrategyNumber) continue; } this might not be breakable today, but it could well break in the future, for example, if hash op family managed to grow two more strategies, then we could get a false match on the matching strategy numbers (both 3). 14. 
The following code assumes there will be a right op: if (IsA(clause, OpExpr)) { OpExpr *opclause = (OpExpr *) clause; Expr *leftop, *rightop, *valueexpr; bool is_ne_listp = false; leftop = (Expr *) get_leftop(clause); if (IsA(leftop, RelabelType)) leftop = ((RelabelType *) leftop)->arg; rightop = (Expr *) get_rightop(clause); This'll crash with the following: create function nonzero(p int) returns bool as $$ begin return (p <> 0); end;$$ language plpgsql; create operator # (procedure = nonzero, leftarg = int); create table listp (a int) partition by list (a); create table listp1 partition of listp for values in(1); select * from listp where a#; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. You need to ensure that there are 2 args. 15. Various if tests in extract_partition_clauses result in a `continue` when they should perhaps be a `break` instead. For example: if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid)) continue; This clause does not depend on the partition key, so there's no point in trying to match this again for the next partition key. This item is perhaps not so important, as it's only a small inefficiency, but I just wanted to point it out. Also, note that plain Var partition keys cannot be duplicated, but expressions can, so there may be cases that you don't want to change to `break` Other conditions which possibly should change to `break` instead of `continue` include: /* Only IS [NOT] TRUE/FALSE are any good to us */ if (btest->booltesttype == IS_UNKNOWN || btest->booltesttype == IS_NOT_UNKNOWN) continue; 16. The following code looks a bit fragile: leftop = IsA(clause, Var) ? (Expr *) clause : (Expr *) get_notclausearg((Expr *) clause); This appears to assume that the partition key will be a plain Var and not an expression. I tried to break this with: create table bp (a bool, b bool) partition by ((a < b)); create table bp_true partition of bp for values in('t'); explain select * from bp where (a < b); however, naturally, the parser builds an OpExpr instead of a BooleanTest for this case. If it had built a BooleanTest, then the above code would mistakenly call get_notclausearg on the (a < b) Expr. Do you have reason to believe that the code is safe and a good idea? 17. Which relation is the comment talking about? /* * get_partitions_from_args * * Returns the set of partitions of relation, each of which satisfies some * clause in or_args. */ static Bitmapset * get_partitions_from_or_args(PartitionPruneContext *context, List *or_args) 18. "sets a field", would it not be better to mention constfalse?: * returns right after finding such a clause and before returning, sets a field * in context->clauseinfo to inform the caller that we found such clause. 19. "clauses" * partitioning, we don't require all of eqkeys to be operator clausses. 20. There does not seem to be a need to palloc0 here. palloc seems fine. keys->ne_datums = (Datum *) palloc0(list_length(clauseinfo->ne_clauses) * sizeof(Datum)); This, of course, may leave unset memory in any unused items, but you never iterate beyond what n_ne_datums gets set to anyway, so I don't see the need to zero any extra elements. 21. A code comment should be added to the following code to mention that these are not arrays indexed by partition key as they're only ever used for LIST partitioning, which only supports a single key. 
/* Datum values from clauses containing <> operator */ Datum *ne_datums; int n_ne_datums; 22. Can you include: "'keyisnotnull' may also be set for the given partition key when a strict OpExpr is encountered" in the following comment? * Based on a IS NULL or IS NOT NULL clause that was matched to a partition * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
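Point 22's "strict OpExpr" case boils down to something like the following sketch; 'keyidx' is an illustrative name for the matched partition key's index, not the patch's variable. A strict operator returns NULL rather than true for a NULL input, so a clause like "partkey = 5" can only hold when the key IS NOT NULL.

/*
 * Sketch: a strict operator clause matched to partition key 'keyidx'
 * implies the key cannot be NULL in any row satisfying the clause.
 */
if (op_strict(opclause->opno))
    keys->keyisnotnull = bms_add_member(keys->keyisnotnull, keyidx);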
Hi David. On 2018/02/17 18:24, David Rowley wrote: > On 2 February 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I started wondering if it's not such a good idea to make >> PartitionClauseInfo a Node at all? I went back to your earlier message >> [1] where you said that it's put into the Append node for run-time pruning >> to use, but it doesn't sound nice that we'd be putting into the plan >> something that's looks more like scratchpad for the partition.c code. I >> think we should try to keep PartitionClauseInfo in partition.h and put >> only the list of matched bare clauses into Append. > > That sounds like a good idea. > > A patch which puts this back is attached. > > I've changed the run-time prune patch to process the clause lists > during execution instead. Thank you. I'll incorporate it in the version I'll send next. Regards, Amit
Hi David. Thanks a lot for the review comments. Replying to all of your comments. On 2018/02/17 18:39, David Rowley wrote: > On 15 February 2018 at 18:57, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Here is an updated version. > > Thanks for sending v27. I've had a quick look over it while I was > working on the run-time prune patch. However, I've not quite managed a > complete pass of this version yet > > A couple of things so far: > > 1. Following loop; > > for (i = 0; i < partnatts; i++) > { > if (bms_is_member(i, keys->keyisnull)) > { > /* Only the default partition accepts nulls. */ > if (partition_bound_has_default(boundinfo)) > return bms_make_singleton(boundinfo->default_index); > else > return NULL; > } > } > > could become: > > if (partition_bound_has_default(boundinfo) && > !bms_is_empty(keys->keyisnull) > return bms_make_singleton(boundinfo->default_index); > else > return NULL; > > 2. Is the following form of loop necessary? > > for (i = 0; i < partnatts; i++) > { > if (bms_is_member(i, keys->keyisnull)) > { > keys->n_eqkeys++; > keyisnull[i] = true; > } > } > > Can't this just be: > > i = -1; > while ((i = bms_next_member(keys->keyisnull, i)) >= 0) > { > keys->n_eqkeys++; > keyisnull[i] = true; > } > > Perhaps you can just Assert(i < partnatts), if you're worried about that. > > Similar code exists in get_partitions_for_keys_range Both 1 and 2 are good suggestions, so done that way. > > 3. Several comments mention partition_bound_bsearch, but there is now > no such function. > > 4. "us" should be "is" > > * not be any unassigned range to speak of, because the range us unbounded Fixed. > 5. The following code is more complex than it needs to be: > > /* > * Since partition keys with nulls are mapped to the default range > * partition, we must include the default partition if some keys > * could be null. > */ > if (keys->n_minkeys < partnatts || keys->n_maxkeys < partnatts) > { > for (i = 0; i < partnatts; i++) > { > if (!bms_is_member(i, keys->keyisnotnull)) > { > include_def = true; > break; > } > } > } > > > Instead of the for loop, couldn't you just write: > > include_def = (bms_num_members(keys->keyisnotnull) < partnatts); Indeed, it's much simpler. Though, I wrote it as: + if (bms_num_members(keys->keyisnotnull) < partnatts) + include_def = true; > 6. The following comment is not well written: > > * get_partitions_excluded_by_ne_datums > * > * Returns a Bitmapset of indexes of partitions that can safely be removed > * due to each such partition's every allowable non-null datum appearing in > * a <> opeartor clause. > > Maybe it would be better to write: > > * get_partitions_excluded_by_ne_datums > * > * Returns a Bitmapset of partition indexes that can safely be removed due to > * the discovery of <> clauses for each datum value allowed in the partition. > > if not, then "opeartor" needs the spelling fixed. Sure, your rewrite sounds much better. > 7. "The following" > > * Followig entry points exist to this module. Fixed. > > Are there any other .c files where we comment on all the extern > functions in this way? I don't recall seeing it before. Hmm, not like the way this patch does, but some files in the executor do have such introductory comments using description of functions exported by the module. See for example, nodeSeqscan.c, execMain.c, etc. Not saying that this is the best way to introduce the module, but this is the one I went with for now. If this format is not very informative, I'm willing to rewrite it some other way. > 8. 
The following may as well just: context.partnatts = partnatts; > > context.partnatts = rel->part_scheme->partnatts; > > 9. Why palloc0? Wouldn't palloc be ok? > > context.partkeys = (Expr **) palloc0(sizeof(Expr *) * > context.partnatts); > > Also, no need for context.partnatts, just partnatts should be fine. Fixed both. Yes, palloc suffices. > 10. I'd rather see bms_next_member() used here: > > /* Add selected partitions' RT indexes to result. */ > while ((i = bms_first_member(partindexes)) >= 0) > result = bms_add_member(result, rel->part_rels[i]->relid); > > There's not really much use for bms_first_member these days. It can be > slower due to having to traverse the unset lower significant bits each > loop. bms_next_member starts where the previous loop left off. Thanks for clarifying. Used bms_next_member(). > Will try to review more tomorrow. > On 2018/02/18 11:25, David Rowley wrote: > As I mentioned yesterday, here's the remainder of the review: > > 11. The following comment is misleading. It says: "We currently handle > two such cases:", then it goes on to say the 2nd case is not handled. > > /* > * Handle cases where the clause's operator does not belong to > * the partitioning operator family. We currently handle two > * such cases: 1. Operators named '<>' are not listed in any > * operator family whatsoever, 2. Ordering operators like '<' > * are not listed in the hash operator families. For 1, check > * if list partitioning is in use and if so, proceed to pass > * the clause to the caller without doing any more processing > * ourselves. 2 cannot be handled at all, so the clause is > * simply skipped. > */ You're right. We don't really "handle" 2. Rewrote the comment like this: + * Normally we only bother with operators that are listed as + * being part of the partitioning operator family. But we + * make an exception in one case -- operators named '<>' are + * not listed in any operator family whatsoever, in which + * case, we try to perform partition pruning with it only if + * list partitioning is in use. > > 12. The following code should test for LIST partitioning before doing > anything else: > > if (!op_in_opfamily(opclause->opno, partopfamily)) > { > Oid negator; > > /* > * To confirm if the operator is really '<>', check if its > * negator is a equality operator. If it's a btree > * equality operator *and* this is a list partitioned > * table, we can use it prune partitions. > */ > negator = get_negator(opclause->opno); > if (OidIsValid(negator) && > op_in_opfamily(negator, partopfamily)) > { > Oid lefttype; > Oid righttype; > int strategy; > > get_op_opfamily_properties(negator, partopfamily, > false, > &strategy, > &lefttype, &righttype); > > if (strategy == BTEqualStrategyNumber && > context->strategy == PARTITION_STRATEGY_LIST) > is_ne_listp = true; > } > > /* Cannot handle this clause. */ > if (!is_ne_listp) > continue; > } > > The code should probably be in the form of: > > if (!op_in_opfamily(opclause->opno, partopfamily)) > { > if (context->strategy != PARTITION_STRATEGY_LIST) > continue; > > ... > > if (strategy == BTEqualStrategyNumber) > is_ne_listp = true; > } > > that way we'll save 3 syscache lookups when a <> clause appears in a > RANGE or HASH partitioned table. Good idea, done. > 13. 
The following code assumes that the partitioning op > family is btree: > > /* > * In case of NOT IN (..), we get a '<>', which while not > * listed as part of any operator family, we are able to > * handle it if its negator is an equality operator that > * is in turn part of the operator family. > */ > if (!op_in_opfamily(saop_op, partopfamily)) > { > Oid negator = get_negator(saop_op); > int strategy; > Oid lefttype, > righttype; > > if (!OidIsValid(negator)) > continue; > get_op_opfamily_properties(negator, partopfamily, false, > &strategy, > &lefttype, &righttype); > if (strategy != BTEqualStrategyNumber) > continue; > } > > this might not be breakable today, but it could well break in the > future, for example, if hash op family managed to grow two more > strategies, then we could get a false match on the matching strategy > numbers (both 3). I guess we'd only be able to do anything useful with a NOT IN (..) with list partitioning, so I changed the code and surrounding comments like I did per your comment above. So, if we process it only when list partitioning is used, the existing code can safely assume that the partitioning operator family is btree. > 14. The following code assumes there will be a right op: > > if (IsA(clause, OpExpr)) > { > OpExpr *opclause = (OpExpr *) clause; > Expr *leftop, > *rightop, > *valueexpr; > bool is_ne_listp = false; > > leftop = (Expr *) get_leftop(clause); > if (IsA(leftop, RelabelType)) > leftop = ((RelabelType *) leftop)->arg; > rightop = (Expr *) get_rightop(clause); > > This'll crash with the following: > > create function nonzero(p int) returns bool as $$ begin return (p <> > 0); end;$$ language plpgsql; > create operator # (procedure = nonzero, leftarg = int); > create table listp (a int) partition by list (a); > create table listp1 partition of listp for values in(1); > select * from listp where a#; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > > You need to ensure that there are 2 args. Oops, so we can't prune with a unary operator clause. Added the check on the number of args. > 15. Various if tests in extract_partition_clauses result in a > `continue` when they should perhaps be a `break` instead. > > For example: > > if (!PartCollMatchesExprColl(partcoll, opclause->inputcollid)) > continue; > > This clause does not depend on the partition key, so there's no point > in trying to match this again for the next partition key. > > This item is perhaps not so important, as it's only a small > inefficiency, but I just wanted to point it out. Also, note that plain > Var partition keys cannot be duplicated, but expressions can, so there > may be cases that you don't want to change to `break` > > Other conditions which possibly should change to `break` instead of > `continue` include: > > /* Only IS [NOT] TRUE/FALSE are any good to us */ > if (btest->booltesttype == IS_UNKNOWN || > btest->booltesttype == IS_NOT_UNKNOWN) > continue; You're quite right. Replaced all those continues with break.
I tried to break this with: > > create table bp (a bool, b bool) partition by ((a < b)); > create table bp_true partition of bp for values in('t'); > explain select * from bp where (a < b); > > however, naturally, the parser builds an OpExpr instead of a > BooleanTest for this case. If it had built a BooleanTest, then the > above code would mistakenly call get_notclausearg on the (a < b) Expr. > > Do you have reason to believe that the code is safe and a good idea? Thanks for the example. I think there were problems here. The way the current code matched these specially shaped clauses with a Boolean partition key wouldn't have resulted in correct pruning for your example and perhaps more cases. I looked at match_boolean_index_clause() and decided that we needed something similar here. So, in the updated patch, I've added a match_boolean_partition_clause(), which will be called at the beginning of the loop to recognize such specially shaped clauses as applicable to a Boolean partition key and add them to the set of pruning clauses. With the new code: create table bp (a int, b int) partition by list ((a < b)); create table bp_true partition of bp for values in ('true'); create table bp_false partition of bp for values in ('false'); explain select * from bp where (a < b) is true; QUERY PLAN ---------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8) Filter: ((a < b) IS TRUE) (3 rows) explain select * from bp where (a < b) is false; QUERY PLAN ------------------------------------------------------------------ Append (cost=0.00..38.25 rows=1507 width=8) -> Seq Scan on bp_false (cost=0.00..38.25 rows=1507 width=8) Filter: ((a < b) IS FALSE) (3 rows) explain select * from bp where (a < b) is not false; QUERY PLAN ---------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8) Filter: ((a < b) IS NOT FALSE) (3 rows) explain select * from bp where (a < b) = true; QUERY PLAN ---------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8) Filter: (a < b) (3 rows) explain select * from bp where (a < b) = false; QUERY PLAN ----------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_false (cost=0.00..38.25 rows=753 width=8) Filter: (a >= b) (3 rows) explain select * from bp where (a < b); QUERY PLAN ---------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_true (cost=0.00..38.25 rows=753 width=8) Filter: (a < b) (3 rows) explain select * from bp where not (a < b); QUERY PLAN ----------------------------------------------------------------- Append (cost=0.00..38.25 rows=753 width=8) -> Seq Scan on bp_false (cost=0.00..38.25 rows=753 width=8) Filter: (a >= b) (3 rows) > 17. Which relation is the comment talking about? > > /* > * get_partitions_from_args > * > * Returns the set of partitions of relation, each of which satisfies some > * clause in or_args. > */ > static Bitmapset * > get_partitions_from_or_args(PartitionPruneContext *context, List *or_args) Fixed the comment. > 18. 
"sets a field", would it not be better to mention constfalse?: > > * returns right after finding such a clause and before returning, sets a field > * in context->clauseinfo to inform the caller that we found such clause. You're right. > 19. "clauses" > > * partitioning, we don't require all of eqkeys to be operator clausses. Fixed. > 20. There does not seem to be a need to palloc0 here. palloc seems fine. > > keys->ne_datums = (Datum *) > palloc0(list_length(clauseinfo->ne_clauses) * > sizeof(Datum)); > > This, of course, may leave unset memory in any unused items, but you > never iterate beyond what n_ne_datums gets set to anyway, so I don't > see the need to zero any extra elements. Agreed about using palloc. > 21. A code comment should be added to the following code to mention > that these are not arrays indexed by partition key as they're only > ever used for LIST partitioning, which only supports a single key. > > /* Datum values from clauses containing <> operator */ > Datum *ne_datums; > int n_ne_datums; OK, done. Also adjusted the comment above the struct's definition. > 22. Can you include: "'keyisnotnull' may also be set for the given > partition key when a strict OpExpr is encountered" in the following > comment? > > * Based on a IS NULL or IS NOT NULL clause that was matched to a partition > * key, the corresponding bit in 'keyisnull' or 'keyisnotnull' is set. Done. Attached updated patches. Thanks again! Regards, Amit
On 19 February 2018 at 22:19, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. Thanks again! Thanks for making those changes. I've made another pass of v28 and have a few more comments. The patch is starting to look good, but there are some new changes in recent versions which still don't look quite right. 1. This does not fully make sense: /* * Remove the indexes of partitions excluded due to each of * those partitions' *all* of allowed datums appearing in * keys->ne_datums, that is compared to the partition key * using <> operator. */ "each of those partitions' *all* of allowed" is not correct. Maybe better to write: /* * Remove the indexes of any partitions which cannot possibly * contain rows matching the clauses due to key->ne_datums containing * all datum values which are allowed in the given partition. This * is only possible to do in LIST partitioning as it's the only * partitioning strategy which allows the specification of exact values. */ 2. Mine does not, but some compilers may complain about the result variable being uninitialised in get_partitions_for_keys(). Probably the easiest fix would be to just assign NULL in the default case. 3. Did you mean to put this Assert() inside the loop? memset(keyisnull, false, sizeof(keyisnull)); i = -1; while ((i = bms_next_member(keys->keyisnull, i)) >= 0) { keys->n_eqkeys++; keyisnull[i] = true; } Assert(i < partnatts); i will always be -2 at the end of the loop. Seems like a useless Assert in its current location. 4. Can you add a comment here to say: "Note: LIST partitioning only supports a single partition key, therefore this function requires no looping over the partition keys"? /* * get_partitions_for_keys_list * Return partitions of a list partitioned table for requested keys * * This interprets the keys and looks up partitions in the partition bound * descriptor using the list partitioning semantics. */ 5. The following comment contains a bit of duplication of the comment which comes after it. Maybe the following: /* * If the query is looking for null keys, there can only be one such * partition. Return the same if one exists. */ can be changed to: /* Handle clauses requesting a NULL valued partition key */ 6. The following comment does not quite make sense: /* Exactly matching datum exists. */ Probably better to write: /* An exact matching datum exists. */ 7. "If it's not equal (<)": I think you mean (>), not (<), in: * The bound at minoff is <= minkeys, given the way * partition_list_bsearch() works. If it's not equal (<), then * increment minoff to make it point to the datum on the right * that necessarily satisfies minkeys. Also do the same if it is * equal but minkeys is exclusive. However, the comment is a bit clumsy. Maybe the following is better? /* * partition_list_bsearch returning a positive number means that * minkeys[0] must be greater than or equal to the smallest datum. * If we didn't find an exact matching datum (!is_equal) or if the * operator used was non-inclusive (>), then in both of these * cases we're not interested in the datum pointed to by minoff, * but we may start getting matches in the partition which the * next datum belongs to, so point to that one instead. (This may * be beyond the last datum in the array, but we'll detect that * later.) */ 8. The following comment could be improved: * minkeys is greater than the datums of all non-default partitions, * meaning there isn't one to return. Return the default partition if * one exists.
how about: * The value of minkeys[0] is greater than all of the datums we have * partitions for. The only possible partition that could contain a * match is the default partition. Return that, if it exists. 9. The following could also be improved: * The bound at maxoff is <= maxkeys, given the way * partition_list_bsearch works. If the bound at maxoff exactly * matches maxkey (is_equal), but the maxkey is exclusive, then * decrement maxoff to point to the bound on the left. how about: * partition_list_bsearch returning a positive number means that * maxkeys[0] must be greater than or equal to the smallest datum. * If the match found is an equal match, but the operator used is * non-inclusive of that value (<), then the partition belonging * to maxoff cannot match, so we'll decrement maxoff to point to * the partition belonging to the previous datum. We might end up * decrementing maxoff down to -1, but we'll handle that later. 10. Can you append " This may not technically be true for some data types (e.g. integer types), however, we currently lack any sort of infrastructure to provide us with proofs that would allow us to do anything smarter here." to: * For range queries, always include the default list partition, * because list partitions divide the key space in a discontinuous * manner, not all values in the given range will have a partition * assigned. 11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1", but get_partitions_for_keys_list likes to "minoff--", can this be made the same? Personally, I like -- over -= 1 as it's shorter. Although I do remember having an argument with my university professor about this. He claimed -= 1 was clearer... I'm still unsure what he found so confusing about -- ... 12. The following code could be optimised a little for the case when there's no default: /* * There may exist a range of values unassigned to any non-default * partition between the datums at minoff and maxoff. */ for (i = minoff; i <= maxoff; i++) { if (boundinfo->indexes[i] < 0) { include_def = true; break; } } /* * Since partition keys with nulls are mapped to the default range * partition, we must include the default partition if some keys * *could* be null. */ if (bms_num_members(keys->keyisnotnull) < partnatts) include_def = true; if (include_def && partition_bound_has_default(boundinfo)) result = bms_add_member(result, boundinfo->default_index); return result; Maybe something more like: if (!partition_bound_has_default(boundinfo)) return result; /* * There may exist a range of values unassigned to any non-default * partition between the datums at minoff and maxoff. */ for (i = minoff; i <= maxoff; i++) { if (boundinfo->indexes[i] < 0) return bms_add_member(result, boundinfo->default_index); } /* * Since partition keys with nulls are mapped to the default range * partition, we must include the default partition if some keys * *could* be null. */ if (bms_num_members(keys->keyisnotnull) < partnatts) return bms_add_member(result, boundinfo->default_index); Which is saves a bit of needless work when there's no default to add, and also saves a few lines, including the line where you declare the include_def variable. 13. Variable name: Bitmapset *partitioned_rels_bms = NULL; This should likely be called partitioned_relids, and be of type Relids instead of Bitmapset. 14. This line removal seems surplus. It should probably be fixed independently of this patch. parent_roots[appinfo->child_relid] = subroot; - continue; 15. 
I'm unsure how safe the following code is: while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0) partitioned_rels = lappend_int(partitioned_rels, parent_rti); You're now putting this list in ascending order of relid, but some code in set_plan_refs assumes the root partition is the first element: root->glob->rootResultRelations = lappend_int(root->glob->rootResultRelations, linitial_int(splan->partitioned_rels)); By luck, the first element might today be the root due to the way we expand the inheritance hierarchy, but I think this code is wrong to rely on that. I'm not really a fan of having the root partition be the first element in the List. I would much rather see a Relids type and a special Index field for the root, but that might be more changes than you'd like to make here. I just don't think what you have now is correct. 16. This should probably return Relids rather than Bitmapset *. Bitmapset * prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel) Please update the mention of Bitmapset in the comment at the top of the function too. 17. hmm, my patch did palloc(), not palloc0(). My original patch was broken and missed this, but v2 got it. context->clauseinfo = partclauseinfo = palloc0(sizeof(PartitionClauseInfo)); There's no need to palloc0() here. You're setting all the fields to zero just below. As far as I understand it, it's only Node types for which we have to go through the rigmarole of doing both. 18. I tentatively agree with you having changed the continue to break in the following: /* We can't use any volatile value to prune partitions. */ if (contain_volatile_functions((Node *) valueexpr)) break; I believe it's not wrong to break here, but keep in mind you're testing valueexpr rather than something about the OpExpr itself. The reason I'm not saying this is wrong is that if the valueexpr is volatile then it cannot possibly match another partition key anyway, so there's likely no point in continuing to look for another match... You should likely write a comment to explain this a bit. I think all of the other places you've changed to break look fine. The ScalarArrayOpExpr volatile function test is fine to break from since the operands cannot be reversed in that case, so rightop certainly can't match a partition key. 19. I might have caused this, but there's no such variable as 'cur': /* cur is more restrictive, so replace the existing. */ 20. Is there a difference between partition_bound_has_default(context->boundinfo) and context->has_default_part? Any reason for both? Your code uses both. I don't yet understand why has_default_part was added. That's all I have for now. It's getting close. Good work! I keep having to turn up my strictness level each time I review. I'd like to be able to turn it up all the way earlier, but I fear I may still be reviewing v1 if I'd done that :-) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
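To spell out the alternative in point 15 as a sketch (these fields are hypothetical; nothing like them appears in the patches posted here): carry the root's RT index in its own field and keep the rest as an unordered set, so that nothing depends on list position, and set_plan_refs() can read the root directly instead of using linitial_int(splan->partitioned_rels).

/* Hypothetical shape, not from any posted patch. */
typedef struct PartitionedRelsInfo
{
    Index       root_rti;           /* RT index of the root partitioned
                                     * table */
    Relids      partitioned_relids; /* every partitioned table in the
                                     * tree; no ordering assumed */
} PartitionedRelsInfo;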
Hi David. Thanks for the review. On 2018/02/19 22:40, David Rowley wrote: > On 19 February 2018 at 22:19, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached updated patches. Thanks again! > > Thanks for making those changes. I've made another pass of v28 and > have a few more comments. > > The patch is starting to look good, but there are some new changes in > recent versions which still don't look quite right. > > 1. This does not fully make sense: > > /* > * Remove the indexes of partitions excluded due to each of > * those partitions' *all* of allowed datums appearing in > * keys->ne_datums, that is compared to the partition key > * using <> operator. > */ > > "each of those partitions' *all* of allowed" is not correct. > > Maybe better to write: > > /* > * Remove the indexes of any partitions which cannot possibly > * contain rows matching the clauses due to key->ne_datums containing > * all datum values which are allowed in the given partition. This > * is only possible to do in LIST partitioning as it's the only > * partitioning strategy which allows the specification of exact values. > */ Ah, your rewrite sounds much better. > 2. Mine does not, but some compilers may complain about > get_partitions_for_keys() result variable being uninitalised in > get_partitions_for_keys. Probably the easiest fix would be to just > assign to NULL in the default case. Done. > 3. Did you mean to put this Assert() inside the loop? > > memset(keyisnull, false, sizeof(keyisnull)); > i = -1; > while ((i = bms_next_member(keys->keyisnull, i)) >= 0) > { > keys->n_eqkeys++; > keyisnull[i] = true; > } > Assert(i < partnatts); > > i will always be -2 at the end of the loop. Seems like a useless > Assert in its current location. Face palm! Fixed. > 4. Can you add a comment here to say: "Note: LIST partitioning only > supports a single partition key, therefore this function requires no > looping over the partition keys" > > /* > * get_partitions_for_keys_list > * Return partitions of a list partitioned table for requested keys > * > * This interprets the keys and looks up partitions in the partition bound > * descriptor using the list partitioning semantics. > */ OK, done. > 5. The following comment contains a bit of duplication to the comment > which comes after it. Maybe the following: > > /* > * If the query is looking for null keys, there can only be one such > * partition. Return the same if one exists. > */ > > can be changed to: > > /* Handle clauses requesting a NULL valued partition key */ You're right. Changed. > 6. The following comment does not quite make sense: > > /* Exactly matching datum exists. */ > > Probably better to write: > > /* An exact matching datum exists. */ OK, done. > 7. "If it's not equal (<)" I think you mean (>), not (<), in: > > * The bound at minoff is <= minkeys, given the way > * partition_list_bsearch() works. If it's not equal (<), then > * increment minoff to make it point to the datum on the right > * that necessarily satisfies minkeys. Also do the same if it is > * equal but minkeys is exclusive. > > However, the comment is a bit clumsy. Maybe the following is better? > > /* > * partition_list_bsearch returning a positive number means that > * minkeys[0] must be greater than or equal to the smallest datum. 
> * If we didn't find an exact matching datum (!is_equal) or if the > * operator used was non-inclusive (>), then in both of these > * cases we're not interested in the datum pointed to by minoff, > * but we may start getting matches in the partition which the > * next datum belongs to, so point to that one instead. (This may > * be beyond the last datum in the array, but we'll detect that > * later.) > */ Your rewrite is much better. > 8. The following comment could be improved: > > * minkeys is greater than the datums of all non-default partitions, > * meaning there isn't one to return. Return the default partition if > * one exists. > > how about: > > * The value of minkeys[0] is greater than all of the datums we have > * partitions for. The only possible partition that could contain a > * match is the default partition. Return that, if it exists. OK, adopted your rewrite. > 9. The following could also be improved: > > * The bound at maxoff is <= maxkeys, given the way > * partition_list_bsearch works. If the bound at maxoff exactly > * matches maxkey (is_equal), but the maxkey is exclusive, then > * decrement maxoff to point to the bound on the left. > > how about: > > * partition_list_bsearch returning a positive number means that > * maxkeys[0] must be greater than or equal to the smallest datum. > * If the match found is an equal match, but the operator used is > * non-inclusive of that value (<), then the partition belonging > * to maxoff cannot match, so we'll decrement maxoff to point to > * the partition belonging to the previous datum. We might end up > * decrementing maxoff down to -1, but we'll handle that later. OK, done. > 10. Can you append " This may not technically be true for some data > types (e.g. integer types), however, we currently lack any sort of > infrastructure to provide us with proofs that would allow us to do > anything smarter here." to: > > * For range queries, always include the default list partition, > * because list partitions divide the key space in a discontinuous > * manner, not all values in the given range will have a partition > * assigned. Hmm, that seems to make sense, so added the text. I guess you know it already, but I'm trying to say in the comment that list partitioning, by definition, does not force you to enumerate all values that a data type may specify to exist. > 11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1", > but get_partitions_for_keys_list likes to "minoff--", can this be made > the same? Personally, I like -- over -= 1 as it's shorter. Although I > do remember having an argument with my university professor about > this. He claimed -= 1 was clearer... I'm still unsure what he found so > confusing about -- ... I will go with minoff-- for consistency as you say. > 12. The following code could be optimised a little for the case when > there's no default: > > /* > * There may exist a range of values unassigned to any non-default > * partition between the datums at minoff and maxoff. > */ > for (i = minoff; i <= maxoff; i++) > { > if (boundinfo->indexes[i] < 0) > { > include_def = true; > break; > } > } > > /* > * Since partition keys with nulls are mapped to the default range > * partition, we must include the default partition if some keys > * *could* be null. 
> */ > if (bms_num_members(keys->keyisnotnull) < partnatts) > include_def = true; > > if (include_def && partition_bound_has_default(boundinfo)) > result = bms_add_member(result, boundinfo->default_index); > > return result; > > Maybe something more like: > > if (!partition_bound_has_default(boundinfo)) > return result; > > /* > * There may exist a range of values unassigned to any non-default > * partition between the datums at minoff and maxoff. > */ > for (i = minoff; i <= maxoff; i++) > { > if (boundinfo->indexes[i] < 0) > return bms_add_member(result, boundinfo->default_index); > } > > /* > * Since partition keys with nulls are mapped to the default range > * partition, we must include the default partition if some keys > * *could* be null. > */ > if (bms_num_members(keys->keyisnotnull) < partnatts) > return bms_add_member(result, boundinfo->default_index); > > Which is saves a bit of needless work when there's no default to add, > and also saves a few lines, including the line where you declare the > include_def variable. Ah, that's neat. Done that way. I also changed other code that sets include_def to instead add the default partition index to result right away, instead of just setting include_def. So, include_def is now gone. > 13. Variable name: > > Bitmapset *partitioned_rels_bms = NULL; > > This should likely be called partitioned_relids, and be of type Relids > instead of Bitmapset. Makes sense, done. > 14. This line removal seems surplus. It should probably be fixed > independently of this patch. > > parent_roots[appinfo->child_relid] = subroot; > - > continue; Hadn't noticed that. Fixed. > 15. I'm unsure how safe the following code is: > > while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0) > partitioned_rels = lappend_int(partitioned_rels, parent_rti); > > You're now putting this list in ascending order of relid, but some > code in set_plan_refs assumes the root partition is the first element: > > root->glob->rootResultRelations = > lappend_int(root->glob->rootResultRelations, > linitial_int(splan->partitioned_rels)); > > By luck, the first element might today be the root due to the way we > expand the inheritance hierarchy, but I think this code is wrong to > rely on that. > > I'm not really a fan of having the root partition be the first element > in the List. I would much rather see a Relids type and a special Index > field for the root, but that might be more changes that you'd like to > make here. I just don't think what you have now is correct. Hmm, so you're saying that it's not future-proof for the code here to assume that the root table will always get the smallest RT index of the tables in a given partition tree. I guess it is a concern that exists independently of the changes this patch makes, but I agree with the concern. We can write another patch to break that assumption by using the method you suggest of adding a Index variable to the ModifyTable node to store the root table index. > 16. This should probably return Relids rather than Bitmapset *. > > Bitmapset * > prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel) > > Please update the mention of Bitmapset in the comment at the top of > the function too. OK, done. > 17. hmm, my patch did palloc(), not palloc0(). My original patch was > broken and missed this, but v2 got it. > > context->clauseinfo = partclauseinfo = > palloc0(sizeof(PartitionClauseInfo)); > > There's no need to palloc0() here. You're setting all the fields to > zero just below. 
> As far as I understand it, it's only Node types that we have to go
> through the rigmarole of doing both.

Ah, you're right. Used palloc.

> 18. I tentatively agree with you having changed the continue to break
> in the following:
>
> /* We can't use any volatile value to prune partitions. */
> if (contain_volatile_functions((Node *) valueexpr))
> break;
>
> I believe it's not wrong to break here, but keep in mind you're
> testing valueexpr rather than something with the OpExpr itself. The
> reason I'm not saying this is wrong is that if the valueexpr is
> volatile then it cannot possibly match another partition key anyway,
> so there's likely no point in continuing to look for another match...
> You should likely write a comment to explain this a bit.
>
> I think all of the other places you've changed to break look fine.
> The ScalarArrayOpExpr volatile function test is fine to break from
> since the operands cannot be reversed in that case, so rightop
> certainly can't match a partition key.

When working on this comment, I realized we shouldn't really "break" on a collation mismatch. Multiple keys can have the same expression but different collations, and "break"ing in this case would mean we'd miss matching the clause to another key that has the matching collation. IOW, the order in which clauses appear in the input list would determine whether or not they're matched to partition keys correctly. See this somewhat made up example:

create table rp (a text) partition by range (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
create table rp1 partition of rp for values from ('a', 'a') to ('a', 'e');
create table rp2 partition of rp for values from ('a', 'e') to ('a', 'z');
create table rp3 partition of rp for values from ('b', 'a') to ('b', 'e');

For the following query:

select * from rp where substr(a, 1) = 'e' collate "en_US" and substr(a, 1) = 'a' collate "en_GB";

With the current code, we'll end up discarding (via break) the 1st clause, because its collation (en_US) doesn't match the 1st key's collation (en_GB), and thus we'll end up with only one clause, that is the 2nd one, being added to the matched clauses. That would result in only rp3 being pruned, whereas both rp1 and rp3 should be pruned. I've fixed that. With the new code, both the expression and the collation must match before we conclude that the clause matches the partition key, and only then do we check other properties of the clause like operator strictness, valueexpr volatility, etc. If those other properties are not satisfied, we can "break", because without those properties the clause won't be useful for pruning, even if it matched some other key.

> 19. I might have caused this, but there's no such variable as 'cur'
>
> /* cur is more restrictive, so replace the existing. */

Fixed this and a few other instances.

> 20. Is there a difference between
> partition_bound_has_default(context->boundinfo) and
> context->has_default_part? Any reason for both? Your code uses both. I
> don't yet understand why has_default_part was added.

One cannot use partition_bound_has_default(context->boundinfo) within partprune.c, because PartitionBoundInfo's definition is private to partition.c. Maybe, if someday we expose it by exporting it (and related partition bound manipulating functions) in, say, a partbound.h, we won't have to resort to the same value being stored in two different places like this.

> That's all I have for now.
>
> It's getting close. Good work!
>
> I keep having to turn up my strictness level each time I review.
> I'd like to be able to turn it up all the way earlier, but I fear I may
> still be reviewing v1 if I'd done that :-)

It's really great that you've looked into these patches in such great detail, suggesting various performance improvements (big and small) and also pointing out various edge cases. Thank you very much! :)

Attached updated version.

Regards,
Amit
Attachment
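For concreteness, the alternative David floats in item 15 might look something like this; the rootRelation field is hypothetical (the patch does not add it), shown only as a sketch of the idea:

    /*
     * Hypothetical sketch: store the root table's RT index in an explicit
     * Index field of ModifyTable, so that set_plan_refs() no longer needs
     * to assume the root is the first element of partitioned_rels.
     */
    root->glob->rootResultRelations =
        lappend_int(root->glob->rootResultRelations,
                    splan->rootRelation);   /* rather than linitial_int(splan->partitioned_rels) */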
On Tue, Feb 20, 2018 at 12:34 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated version.
Hi Amit,
I have applied v29 patch-set on head and got "ERROR: operator 1209 is not a member of opfamily 1994" with the test case below. Please take a look.

CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');
INSERT INTO part VALUES (100,'ABC',10);
INSERT INTO part VALUES (110,'DEF',20);
INSERT INTO part VALUES (120,'GHI',10);
INSERT INTO part VALUES (130,'JKL',100);
explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
ERROR: operator 1209 is not a member of opfamily 1994
Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation
Thanks Rajkumar for the test.

On 2018/02/20 16:56, Rajkumar Raghuwanshi wrote:
> I have applied v29 patch-set on head and got "ERROR: operator 1209 is not
> a member of opfamily 1994" with the test case below. Please take a look.
>
> CREATE TABLE part (c1 INT4, c2 TEXT, c3 INT4) PARTITION BY LIST (c2);
> CREATE TABLE part_p1 PARTITION OF part FOR VALUES IN('ABC');
> CREATE TABLE part_p2 PARTITION OF part FOR VALUES IN('DEF');
> CREATE TABLE part_p3 PARTITION OF part FOR VALUES IN('GHI');
> CREATE TABLE part_p4 PARTITION OF part FOR VALUES IN('JKL');
>
> INSERT INTO part VALUES (100,'ABC',10);
> INSERT INTO part VALUES (110,'DEF',20);
> INSERT INTO part VALUES (120,'GHI',10);
> INSERT INTO part VALUES (130,'JKL',100);
>
> explain (costs off) SELECT * FROM part WHERE c2 LIKE '%ABC%';
> ERROR: operator 1209 is not a member of opfamily 1994

An oversight in the v28 patch seems to have caused this. Fixed in the attached.

Thanks,
Amit
Attachment
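The error above suggests that the clause-matching code looked up the LIKE operator's strategy in the partitioning opfamily without first checking that the operator is a member of it at all. The oversight was presumably fixed with a guard of roughly this shape (a sketch, not the actual patch hunk; op_in_opfamily() is the existing lsyscache.h helper):

    /*
     * Sketch: a clause whose operator (e.g. text ~~, OID 1209) is not a
     * member of the partitioning opfamily cannot be used for pruning, so
     * skip it rather than erroring out in the strategy lookup.
     */
    if (!op_in_opfamily(opclause->opno, partopfamily))
        continue;       /* clause is unusable for pruning */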
v30-0004-Faster-partition-pruning.patch contains:

+create table coll_pruning_multi (a text) partition by range (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");

This'll likely work okay on Linux. Other collate tests seem to use COLLATE "POSIX" or "C" so that they work cross-platform.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/02/21 10:19, David Rowley wrote:
> v30-0004-Faster-partition-pruning.patch contains:
>
> +create table coll_pruning_multi (a text) partition by range
> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
>
> This'll likely work okay on Linux. Other collate tests seem to use
> COLLATE "POSIX" or "C" so that they work cross-platform.

Thanks. I completely forgot about that. I've rewritten those tests to use "POSIX" and "C" in the attached.

Thanks,
Amit
Attachment
On 21 February 2018 at 14:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/21 10:19, David Rowley wrote:
>> v30-0004-Faster-partition-pruning.patch contains:
>>
>> +create table coll_pruning_multi (a text) partition by range
>> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
>>
>> This'll likely work okay on Linux. Other collate tests seem to use
>> COLLATE "POSIX" or "C" so that they work cross-platform.
>
> Thanks. I completely forgot about that. I've rewritten those tests to
> use "POSIX" and "C" in the attached.

Thanks for fixing. I made a pass over v31 and only see a few small things:

1. In get_partitions_for_keys() why is the get_partitions_excluded_by_ne_datums call not part of get_partitions_for_keys_list?

2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1; would be nice to see the style unified here.

4. "other other"

* that is, each of its fields other other than clauseinfo must be valid before

5. "a IS NULL" -> "an IS NULL":

* Based on a IS NULL or IS NOT NULL clause that was matched to a partition

6. Can you add a warning in the header comment for extract_partition_clauses() to explain "Note: the 'clauses' List may be modified inside this function. Callers may like to make a copy of important lists before passing them to this function.", or something like that...

7. "null" -> "nulls"

* Only allow strict operators. This will guarantee null are

8. "dicard" -> "discard"

* contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
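For what item 6 is warning about, a cautious caller might do something like this (a sketch; the exact argument list of extract_partition_clauses() isn't shown in the thread):

    /*
     * extract_partition_clauses() may scribble on the List it is given,
     * so take a copy if the original is still needed afterwards.
     */
    List *clauses_copy = list_copy(clauses);

    partclauseinfo = extract_partition_clauses(context, clauses_copy);
    /* 'clauses' remains intact here */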
Hi David. Thanks for the review. On 2018/02/21 19:15, David Rowley wrote: > Thanks for fixing. I made a pass over v31 and only see a few small things: > > 1. In get_partitions_for_keys() why is the > get_partitions_excluded_by_ne_datums call not part of > get_partitions_for_keys_list? Hmm, there is a question of where exactly to put the call within get_partitions_for_keys_list(). At the end would sound like an obvious answer, but we tend to short-circuit return from that function at various points, which it seems undesirable to change. So, I left things as is here. > 2. Still a stray "minoff += 1;" in get_partitions_for_keys_range I actually found a few and changed them to ++ or --, as applicable. > > 3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1; > would be nice to see the style unified here. Fixed all as mentioned above. > 4. "other other" > > * that is, each of its fields other other than clauseinfo must be valid before Fixed. > 5. "a IS NULL" -> "an IS NULL": > > * Based on a IS NULL or IS NOT NULL clause that was matched to a partition Fixed. > 6. Can you add a warning in the header comment for > extract_partition_clauses() to explain "Note: the 'clauses' List may > be modified inside this function. Callers may like to make a copy of > important lists before passing them to this function.", or something > like that... At least in my patch, extract_partition_clauses() is a local function with just one caller, but I still don't see any problem with warning the reader. So, done. > 7. "null" -> "nulls" > > * Only allow strict operators. This will guarantee null are > > 8. "dicard" -> "discard" > > * contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as Fixed. Please find attached updated patches. Thanks, Amit
Attachment
On 21 February 2018 at 23:44, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Please find attached updated patches. Thanks for updating the code. The question I have now is around NULL handling in partkey_datum_from_expr(). I've not managed to find a way to get a NULL Const in there as it seems all the clauses I try get removed somewhere earlier in planning. Do you know for a fact that a NULL Const is impossible to get there? I'm having to add some NULL handling there for the run-time pruning patch but wondered if it was also required for your patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/02/22 17:41, David Rowley wrote:
> On 21 February 2018 at 23:44, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached updated patches.
>
> Thanks for updating the code.
>
> The question I have now is around NULL handling in
> partkey_datum_from_expr(). I've not managed to find a way to get a
> NULL Const in there as it seems all the clauses I try get removed
> somewhere earlier in planning. Do you know for a fact that a NULL
> Const is impossible to get there?

We only ever call partkey_datum_from_expr() for an OpExpr's arg, and if you have a NULL Const in there, eval_const_expressions() would've folded the OpExpr and subsequently any AND'd OpExprs into a constant-false qual.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);

explain select * from p where a = null and a = 1;
 QUERY PLAN
-------------------------------------------
 Result (cost=0.00..0.00 rows=0 width=40)
 One-Time Filter: false

explain select * from p where (a = null and a = 1) or a = 2;
 QUERY PLAN
----------------------------------------------------------
 Append (cost=0.00..41.94 rows=13 width=4)
 -> Seq Scan on p2 (cost=0.00..41.88 rows=13 width=4)
 Filter: (a = 2)
(3 rows)

> I'm having to add some NULL handling there for the run-time pruning
> patch but wondered if it was also required for your patch.

Hmm, not sure why. Can you explain a bit more?

Thanks,
Amit
On 22 February 2018 at 22:48, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I'm having to add some NULL handling there for the run-time pruning >> patch but wondered if it was also required for your patch. > > Hmm, not sure why. Can you explain a bit more? hmm, yeah, but perhaps we should be discussing on the other thread... With a prepared statement the Param will be unavailable until execution, in which case we don't do the const folding. A simple case is: create table listp (a int) partition by list (a); create table listp1 partition of listp for values in(1); prepare q1 (int) as select * from listp where a = $1; explain analyze execute q1(1); -- repeat 5 times. explain analyze execute q1(null); -- partkey_datum_from_expr() gets a NULL param via the call from nodeAppend.c -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/02/22 20:28, David Rowley wrote: > On 22 February 2018 at 22:48, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> I'm having to add some NULL handling there for the run-time pruning >>> patch but wondered if it was also required for your patch. >> >> Hmm, not sure why. Can you explain a bit more? > > hmm, yeah, but perhaps we should be discussing on the other thread... > > With a prepared statement the Param will be unavailable until > execution, in which case we don't do the const folding. Ah right. > A simple case is: > > create table listp (a int) partition by list (a); > create table listp1 partition of listp for values in(1); > prepare q1 (int) as select * from listp where a = $1; > explain analyze execute q1(1); -- repeat 5 times. > explain analyze execute q1(null); -- partkey_datum_from_expr() gets a > NULL param via the call from nodeAppend.c I wonder if NULLs should somehow be managed at a higher level, resulting in the same behavior as const-folding in the optimizer produces? In any case, I suppose that would be something for the run-time pruning patch to handle. Thanks, Amit
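Taken together, the exchange above describes plan-time behavior along these lines (a sketch; the parameter naming follows the declaration quoted later in the thread, and the constisnull guard is the kind of handling the run-time pruning patch would need rather than something this patch relies on):

    /*
     * Sketch: extract a plan-time Datum from an expression, giving up on
     * anything that is not a Const. At plan time, eval_const_expressions()
     * has already folded "a = null AND ..." into a constant-false qual, so
     * a NULL Const is not expected here; a run-time caller passing an
     * evaluated Param would need the constisnull check.
     */
    static bool
    partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)
    {
        if (IsA(expr, Const))
        {
            Const *c = (Const *) expr;

            if (c->constisnull)
                return false;       /* no partition can match */
            *value = c->constvalue;
            return true;
        }

        return false;               /* value not known at plan time */
    }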
On Wed, Feb 21, 2018 at 5:44 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Please find attached updated patches.

Committed 0001 and 0002.

I'm having some difficulty wrapping my head around 0003 because it has minimal comments and no useful commit message. I think, though, that it's actually broken. Pre-patch, the logic in find_partition_scheme compares partopfamily, partopcintype, and parttypcoll and then asserts equality for parttyplen and parttypbyval; not coincidentally, PartitionSchemeData describes the latter two fields only as "cached data", so that the segregation of fields in PartitionSchemeData into two groups exactly matches what find_partition_scheme is actually doing. However, with the patch, it turns into a sort of hodgepodge. parttypid is added into the "cached" section of PartitionSchemeData and partcollation to the primary section, but both values are compared, not asserted; parttypcoll moves from the "compared" section to the "asserted" section but the declaration in PartitionSchemeData stays where it was.

Moreover, there's no explanation of why this is getting changed. There's an existing comment that explains the motivation for what the code does today, which the patch does not modify:

* We store the opclass-declared input data types instead of the partition key
* datatypes since the former rather than the latter are used to compare
* partition bounds. Since partition key data types and the opclass declared
* input data types are expected to be binary compatible (per ResolveOpClass),
* both of those should have same byval and length properties.

Obviously, this raises the issue of whether changing this is really the right thing to do in the first place, but at any rate it's certainly necessary for the comments to match what the code actually does.

Also, I find this not very helpful:

+ * The collation of the partition key can differ from the collation of the
+ * underlying column, so we must store this separately.

If the comments about parttypcoll and partcollation were clear enough (and I think they could use some work to distinguish them better), then this would be pretty much unnecessary -- clearly the only reason to store two things is if they might be different from each other.

It might be more useful to somehow explain how parttypid and partsupfunc are intended to work/be used, but actually I don't think any satisfactory explanation is possible. Either we have one partition scheme per partopcintype -- in which case parttypid is ill-defined because it could vary among relations with the same PartitionScheme -- or we have one per parttypid -- in which case, without some other change, partition-wise join will stop working between relations with different parttypids but the same partopcintype.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/02/23 23:46, Robert Haas wrote:
> On Wed, Feb 21, 2018 at 5:44 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Please find attached updated patches.
>
> Committed 0001 and 0002.

Thank you for committing and for the review.

> I'm having some difficulty wrapping my head around 0003 because it has
> minimal comments and no useful commit message. I think, though, that
> it's actually broken. Pre-patch, the logic in find_partition_scheme
> compares partopfamily, partopcintype, and parttypcoll and then asserts
> equality for parttyplen and parttypbyval; not coincidentally,
> PartitionSchemeData describes the latter two fields only as "cached
> data", so that the segregation of fields in PartitionSchemeData into
> two groups exactly matches what find_partition_scheme is actually
> doing. However, with the patch, it turns into a sort of hodgepodge.
> parttypid is added into the "cached" section of PartitionSchemeData
> and partcollation to the primary section, but both values are
> compared, not asserted; parttypcoll moves from the "compared" section
> to the "asserted" section but the declaration in PartitionSchemeData
> stays where it was.
>
> Moreover, there's no explanation of why this is getting changed.
> There's an existing comment that explains the motivation for what the
> code does today, which the patch does not modify:
>
> * We store the opclass-declared input data types instead of the partition key
> * datatypes since the former rather than the latter are used to compare
> * partition bounds. Since partition key data types and the opclass declared
> * input data types are expected to be binary compatible (per ResolveOpClass),
> * both of those should have same byval and length properties.
>
> Obviously, this raises the issue of whether changing this is really
> the right thing to do in the first place, but at any rate it's
> certainly necessary for the comments to match what the code actually
> does.
>
> Also, I find this not very helpful:
>
> + * The collation of the partition key can differ from the collation of the
> + * underlying column, so we must store this separately.
>
> If the comments about parttypcoll and partcollation were clear enough
> (and I think they could use some work to distinguish them better),
> then this would be pretty much unnecessary -- clearly the only reason
> to store two things is if they might be different from each other.
>
> It might be more useful to somehow explain how parttypid and
> partsupfunc are intended to work/be used, but actually I don't
> think any satisfactory explanation is possible. Either we have one
> partition scheme per partopcintype -- in which case parttypid is
> ill-defined because it could vary among relations with the same
> PartitionScheme -- or we have one per parttypid -- in which case,
> without some other change, partition-wise join will stop working
> between relations with different parttypids but the same
> partopcintype.

I think I'm convinced that partopcintype OIDs can be used where I thought parttypid ones were necessary. The pruning patch uses the respective OID from this array when extracting the datum from an OpExpr to be compared with the partition bound datums. It's sensible, I now think, to require the extracted datum to be of the partition opclass-declared input type, rather than the type of the partition key involved. So, I removed the parttypid that I'd added to PartitionSchemeData.
Updated the comments to make clear the distinction between, and the purpose of having, both parttypcoll and partcollation. Also expanded the comment about partsupfunc a bit.

Attached updated patches.

Thanks,
Amit
Attachment
On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I think I'm convinced that partopcintype OIDs can be used where I thought
> parttypid ones were necessary. The pruning patch uses the respective OID
> from this array when extracting the datum from an OpExpr to be compared
> with the partition bound datums. It's sensible, I now think, to require
> the extracted datum to be of the partition opclass-declared input type,
> rather than the type of the partition key involved. So, I removed the
> parttypid that I'd added to PartitionSchemeData.
>
> Updated the comments to make clear the distinction between, and the
> purpose of having, both parttypcoll and partcollation. Also expanded
> the comment about partsupfunc a bit.

I don't think this fundamentally fixes the problem, although it does narrow it. By requiring partcollation to match across every relation with the same PartitionScheme, you're making partition-wise join fail to work in some cases where it previously did. Construct a test case where parttypcoll matches and partcollation doesn't; then, without the patch, the two relations will have the same PartitionScheme and thus be eligible for a partition-wise join, but with the patch, they will have different PartitionSchemes and thus won't.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/02/27 3:27, Robert Haas wrote:
> On Sun, Feb 25, 2018 at 11:10 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I think I'm convinced that partopcintype OIDs can be used where I thought
>> parttypid ones were necessary. The pruning patch uses the respective OID
>> from this array when extracting the datum from an OpExpr to be compared
>> with the partition bound datums. It's sensible, I now think, to require
>> the extracted datum to be of the partition opclass-declared input type,
>> rather than the type of the partition key involved. So, I removed the
>> parttypid that I'd added to PartitionSchemeData.
>>
>> Updated the comments to make clear the distinction between, and the
>> purpose of having, both parttypcoll and partcollation. Also expanded
>> the comment about partsupfunc a bit.
>
> I don't think this fundamentally fixes the problem, although it does
> narrow it. By requiring partcollation to match across every relation
> with the same PartitionScheme, you're making partition-wise join fail
> to work in some cases where it previously did. Construct a test case
> where parttypcoll matches and partcollation doesn't; then, without the
> patch, the two relations will have the same PartitionScheme and thus
> be eligible for a partition-wise join, but with the patch, they will
> have different PartitionSchemes and thus won't.

I may be confused, but shouldn't two tables partitioned on the same column (of the same collatable type), but using different collations for partitioning, end up with different PartitionSchemes? Different partitioning collations would mean that the same data may end up in different partitions of the respective tables.

create table p (a text) partition by range (a collate "en_US");
create table p1 partition of p for values from ('a') to ('m');
create table p2 partition of p for values from ('m') to ('z ');

create table q (a text) partition by range (a collate "C");
create table q1 partition of q for values from ('a') to ('m');
create table q2 partition of q for values from ('m') to ('z ');

insert into p values ('A');
INSERT 0 1
insert into q values ('A');
ERROR: no partition of relation "q" found for row
DETAIL: Partition key of the failing row contains (a) = (A).

You may say that partition bounds might have to be different too in this case and hence partition-wise join won't occur anyway, but I'm wondering if the mismatch of partcollation itself isn't enough to conclude that?

Thanks,
Amit
Attached an updated version in which I incorporated some of the revisions that David Rowley suggested for the OR clauses handling (in partprune.c) that he posted as a separate patch on the run-time pruning thread [1].

Thanks,
Amit

[1] https://www.postgresql.org/message-id/CAKJS1f8%2Bp-mXfFUiwR4xZ37STvgJeYF44yAjo5Rfxf92cFJyYQ%40mail.gmail.com
Attachment
On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > You may say that partition bounds might have to be different too in this > case and hence partition-wise join won't occur anyway, but I'm wondering > if the mismatch of partcollation itself isn't enough to conclude that? Yeah, you're right. I think that this is just a bug in partition-wise join, and that the partition scheme should just be using partcollation instead of parttypcoll, as in the attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
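The substance of the attached fix, as described, is to make find_partition_scheme() match on the collation actually used for partitioning rather than the key type's collation; roughly (a sketch, not the committed hunk):

    /*
     * Sketch: two relations share a PartitionScheme only if they use the
     * same collations for partitioning, since those are what determine
     * where a given datum is placed.
     */
    if (memcmp(partkey->partcollation, part_scheme->partcollation,
               sizeof(Oid) * partnatts) != 0)
        continue;       /* partitioning collations differ: no match */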
On 2018/02/28 1:05, Robert Haas wrote:
> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> You may say that partition bounds might have to be different too in this
>> case and hence partition-wise join won't occur anyway, but I'm wondering
>> if the mismatch of partcollation itself isn't enough to conclude that?
>
> Yeah, you're right. I think that this is just a bug in partition-wise
> join, and that the partition scheme should just be using partcollation
> instead of parttypcoll, as in the attached.

Ah, OK. I was missing that there is no need to have both parttypcoll and partcollation in PartitionSchemeData, as the Vars in rel->partexprs are built from a bare PartitionKey (not PartitionSchemeData), and after that point, parttypcoll no longer needs to be kept around.

I noticed that there is a typo in the patch.

+ memcpy(part_scheme->partcollation, partkey->parttypcoll,

s/parttypcoll/partcollation/g

BTW, should there be a relevant test in partition_join.sql? If yes, attached a patch (partitionwise-join-collation-test-1.patch) to add one.

Also attached updated version of your patch (fixed the typo).

Thanks,
Amit
Attachment
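After applying the s/parttypcoll/partcollation/ substitution above, the copied source array is the partitioning collation; the corrected line would read:

    memcpy(part_scheme->partcollation, partkey->partcollation,
           sizeof(Oid) * partnatts);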
On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/28 1:05, Robert Haas wrote:
>> On Mon, Feb 26, 2018 at 10:59 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> You may say that partition bounds might have to be different too in this
>>> case and hence partition-wise join won't occur anyway, but I'm wondering
>>> if the mismatch of partcollation itself isn't enough to conclude that?
>>
>> Yeah, you're right. I think that this is just a bug in partition-wise
>> join, and that the partition scheme should just be using partcollation
>> instead of parttypcoll, as in the attached.
>
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.

Yes. That's right.

> I noticed that there is a typo in the patch.
>
> + memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g
>
> BTW, should there be a relevant test in partition_join.sql? If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

A partition-wise join path will be created but discarded because of higher cost. This test won't see it in that case. So, please add some data like other tests and add command to analyze the partitioned tables. That kind of protects from something like that.

-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested for the OR clauses handling (in partprune.c)
> that he posted as a separate patch on the run-time pruning thread [1].

Some comments on 0001.

partnatts != part_scheme->partnatts)
continue;

- /* Match the partition key types. */
+ /*
+ * Match the partition key types and partitioning-specific collations.
+ */

We are comparing opfamily and opclass input type as well, but this comment doesn't explicitly mention those like it mentions collation. Instead, I think we should just say, "Match partition key type properties"

You had commented on "advanced partition matching code" about asserting that partsupfuncs also match. Robert too has expressed a similar opinion upthread. Maybe we should at least try to assert that the function OIDs match.

- Oid *parttypcoll; /* OIDs of collations of partition keys. */
+
+ /*
+ * We store both the collation implied by the partition key's type and the
+ * one specified for partitioning. Values in the former are used as
+ * varcollid in the Vars corresponding to simple column partition keys so
+ * as to make them match corresponding Vars appearing elsewhere in the
+ * query tree. Whereas, the latter is used when actually comparing values
+ * against partition bounds datums, such as, when doing partition pruning.
+ */
+ Oid *parttypcoll;
+ Oid *partcollation;

As you have already mentioned upthread only partcollation is needed, not parttypcoll.

/* Cached information about partition key data types. */
int16 *parttyplen;
bool *parttypbyval;
+
+ /*
+ * Cached array of partitioning comparison functions' fmgr structs. We
+ * don't compare these when trying to match two partition schemes.
+ */

I think this comment should go away. The second sentence doesn't explain why, and if it did, it should do that in find_partition_scheme(), not here. partsupfunc is another property of partition keys that is cached like parttyplen and parttypbyval. Why does it deserve a separate comment when others don't?

-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 27 February 2018 at 22:33, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested for the OR clauses handling (in partprune.c)
> that he posted as a separate patch on the run-time pruning thread [1].

Thanks for fixing that up and including it.

Micro review of v34:

1. Looks like you've renamed the parttypid parameter in the definition of partkey_datum_from_expr and partition_cmp_args, but not updated the declaration too.

+static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value);

+static bool
+partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value)

+static bool partition_cmp_args(Oid parttypid, Oid partopfamily,
+ PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+ bool *result);

+static bool
+partition_cmp_args(Oid partopcintype, Oid partopfamily,
+ PartClause *pc, PartClause *leftarg, PartClause *rightarg,
+ bool *result)

2. In prune_append_rel_partitions(), it's not exactly illegal, but int i is declared twice in different scopes. Looks like there's no need for the inner one.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Ah, OK. I was missing that there is no need to have both parttypcoll and
> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are
> built from a bare PartitionKey (not PartitionSchemeData), and after that
> point, parttypcoll no longer needs to be kept around.
>
> I noticed that there is a typo in the patch.
>
> + memcpy(part_scheme->partcollation, partkey->parttypcoll,
>
> s/parttypcoll/partcollation/g

Committed your version.

> BTW, should there be a relevant test in partition_join.sql? If yes,
> attached a patch (partitionwise-join-collation-test-1.patch) to add one.

I don't feel strongly about it, but I'm not going to try to prevent you from adding one, either.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/02/28 19:14, Ashutosh Bapat wrote: > On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote: >> BTW, should there be a relevant test in partition_join.sql? If yes, >> attached a patch (partitionwise-join-collation-test-1.patch) to add one. > > A partition-wise join path will be created but discarded because of > higher cost. This test won't see it in that case. So, please add some > data like other tests and add command to analyze the partitioned > tables. That kind of protects from something like that. Thanks for the review. Hmm, the added test is such that the partition collations won't match, so partition-wise join won't be considered at all due to differing PartitionSchemes, unless I'm missing something. Thanks, Amit
On 2018/03/01 2:23, Robert Haas wrote: > On Tue, Feb 27, 2018 at 8:12 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Ah, OK. I was missing that there is no need to have both parttypcoll and >> partcollation in PartitionSchemeData, as the Vars in rel->partexprs are >> built from a bare PartitionKey (not PartitionSchemeData), and after that >> point, parttypcoll no longer needs to kept around. >> >> I noticed that there is a typo in the patch. >> >> + memcpy(part_scheme->partcollation, partkey->parttypcoll, >> >> s/parttypcoll/partcollation/g > > Committed your version. Thank you. >> BTW, should there be a relevant test in partition_join.sql? If yes, >> attached a patch (partitionwise-join-collation-test-1.patch) to add one. > > I don't feel strongly about it, but I'm not going to try to prevent > you from adding one, either. OK. Attached is a revised version of that patch in case you consider committing it, addressing Ashutosh's comment that the tables used in the test should contain some data. Thanks, Amit
Attachment
Thanks Ashutosh and David for reviewing. Replying to both.

On 2018/02/28 20:25, Ashutosh Bapat wrote:
> On Tue, Feb 27, 2018 at 3:03 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached an updated version in which I incorporated some of the revisions
>> that David Rowley suggested for the OR clauses handling (in partprune.c)
>> that he posted as a separate patch on the run-time pruning thread [1].
>
> Some comments on 0001.
>
> partnatts != part_scheme->partnatts)
> continue;
>
> - /* Match the partition key types. */
> + /*
> + * Match the partition key types and partitioning-specific collations.
> + */
>
> We are comparing opfamily and opclass input type as well, but this comment
> doesn't explicitly mention those like it mentions collation. Instead, I think
> we should just say, "Match partition key type properties"

Sounds good, done.

> You had commented on "advanced partition matching code" about asserting that
> partsupfuncs also match. Robert too has expressed a similar opinion upthread.
> Maybe we should at least try to assert that the function OIDs match.

I guess you're referring to this email of mine: https://www.postgresql.org/message-id/e681c283-5fc6-b1c6-1bb9-7102c37e2d55%40lab.ntt.co.jp

Note that I didn't say that we should Assert the equality of partsupfunc members themselves, but rather asked whether we could add a comment explaining why we don't. Although, like you say, we could Assert the equality of function OIDs, so added a block like this:

+ /*
+ * If partopfamily and partopcintype matched, must have the same
+ * partition comparison functions. Note that we cannot reliably
+ * Assert the equality of function structs themselves for they might
+ * be different across PartitionKey's, so just Assert for the function
+ * OIDs.
+ */
+#ifdef USE_ASSERT_CHECKING
+ {
+ int i;
+
+ for (i = 0; i < partkey->partnatts; i++)
+ Assert(partkey->partsupfunc[i].fn_oid ==
+ part_scheme->partsupfunc[i].fn_oid);
+ }
+#endif

> - Oid *parttypcoll; /* OIDs of collations of partition keys. */
> +
> + /*
> + * We store both the collation implied by the partition key's type and the
> + * one specified for partitioning. Values in the former are used as
> + * varcollid in the Vars corresponding to simple column partition keys so
> + * as to make them match corresponding Vars appearing elsewhere in the
> + * query tree. Whereas, the latter is used when actually comparing values
> + * against partition bounds datums, such as, when doing partition pruning.
> + */
> + Oid *parttypcoll;
> + Oid *partcollation;
>
> As you have already mentioned upthread only partcollation is needed, not
> parttypcoll.

This hunk is gone after rebasing over 2af28e603319 (For partitionwise join, match on partcollation, not parttypcoll) that was committed earlier today.

> /* Cached information about partition key data types. */
> int16 *parttyplen;
> bool *parttypbyval;
> +
> + /*
> + * Cached array of partitioning comparison functions' fmgr structs. We
> + * don't compare these when trying to match two partition schemes.
> + */
>
> I think this comment should go away. The second sentence doesn't explain
> why, and if it did, it should do that in find_partition_scheme(), not here.
> partsupfunc is another property of partition keys that is cached like
> parttyplen and parttypbyval. Why does it deserve a separate comment when
> others don't?

Replaced that comment with:

+ /* Cached information about partition comparison functions. */

As mentioned above, I already added a comment in find_partition_scheme().
On 2018/02/28 20:35, David Rowley wrote: > Micro review of v34: > > 1. Looks like you've renamed the parttypid parameter in the definition > of partkey_datum_from_expr and partition_cmp_args, but not updated the > declaration too. > > +static bool partkey_datum_from_expr(Oid parttypid, Expr *expr, Datum *value); > > +static bool > +partkey_datum_from_expr(Oid partopcintype, Expr *expr, Datum *value) > > +static bool partition_cmp_args(Oid parttypid, Oid partopfamily, > + PartClause *pc, PartClause *leftarg, PartClause *rightarg, > + bool *result); > > +static bool > +partition_cmp_args(Oid partopcintype, Oid partopfamily, > + PartClause *pc, PartClause *leftarg, PartClause *rightarg, > + bool *result) Oops, forgot about the declarations. Fixed. > 2. In prune_append_rel_partitions(), it's not exactly illegal, but int > i is declared twice in different scopes. Looks like there's no need > for the inner one. Removed the inner one. Attached updated patches. Thanks, Amit
Attachment
On Thu, Mar 1, 2018 at 6:57 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/02/28 19:14, Ashutosh Bapat wrote:
>> On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote:
>>> BTW, should there be a relevant test in partition_join.sql? If yes,
>>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
>>
>> A partition-wise join path will be created but discarded because of
>> higher cost. This test won't see it in that case. So, please add some
>> data like other tests and add command to analyze the partitioned
>> tables. That kind of protects from something like that.
>
> Thanks for the review.
>
> Hmm, the added test is such that the partition collations won't match, so
> partition-wise join won't be considered at all due to differing
> PartitionSchemes, unless I'm missing something.

The point is we wouldn't know whether PWJ was not selected because of PartitionScheme mismatch OR because the PWJ paths were expensive compared to non-PWJ ones, as happens with empty tables. In both the cases we will see a non-PWJ "plan", although in one case PWJ was not possible and in the other it was possible. I think what we want to test is that PWJ is not possible with collation mismatch.

-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Tue, Feb 27, 2018 at 4:33 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached an updated version in which I incorporated some of the revisions
> that David Rowley suggested for the OR clauses handling (in partprune.c)
> that he posted as a separate patch on the run-time pruning thread [1].

I'm very skeptical about this patch's desire to remove the static qualifier from evaluate_expr(). Why does this patch need that and constraint exclusion not need it? Why should this patch not instead be using eval_const_expressions? partkey_datum_from_expr() is prepared to give up if evaluate_expr() doesn't return a Const, but there's nothing in evaluate_expr() to make it give up if, for example, the input is -- or contains -- a volatile function, e.g. random().

+ if (OidIsValid(get_default_oid_from_partdesc(partdesc)))
+ rel->has_default_part = true;
+ else
+ rel->has_default_part = false;

This can be written a lot more compactly as rel->has_default_part = OidIsValid(get_default_oid_from_partdesc(partdesc));

PartitionPruneContext has no comment explaining its general purpose; I think it should.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
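The missing guard Robert describes would be along these lines (a sketch, not code from the patch; contain_volatile_functions() is the existing helper declared in optimizer/clauses.h):

    /*
     * Sketch: anything volatile must not be pre-evaluated to a single
     * value at plan time; random() may yield a different value per row.
     */
    if (contain_volatile_functions((Node *) expr))
        return false;       /* leave this clause for run time */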
On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.

+ memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
+ sizeof(FmgrInfo) * partnatts);

You can't copy an FmgrInfo by just applying memcpy() to it. Use fmgr_info_copy.

I don't like the comments at the top of partprune.c very much. It seems strange to document individual functions here; those functions can (and should) be documented in their individual header comments. What we should have at the top of the file is a discussion of the overall theory of operation of this module, something that currently seems not to exist anywhere in the patch. I tried to figure that out by looking at the new data structures the patch introduces: PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and PartClause. It looks like the general idea is that code that wants to use these facilities does the following:

Step 1. Generate a PartitionPruneContext. In this patch, this seems to consist entirely of copying information from the RelOptInfo or its PartitionScheme.

Step 2. Call generate_partition_clauses() to extract relevant clauses from rel->baserestrictinfo and generate a PartClauseInfo.

Step 3. Call get_partitions_from_clauses() to generate a set of unpruned partition indexes. Internally, that function will first populate a PartScanKeyInfo from the PartClauseInfo by calling extract_bounding_datums(). Then it calls get_partitions_for_keys() which generates the set of unpruned partition indexes from the PartitionPruneContext and PartScanKeyInfo.

I guess there are two main things about this that seem odd to me:

1. I don't see why the partition pruning code should ever be responsible for evaluating anything itself, as it does currently via evaluate_expr(). For plan-time partition pruning, we are already using eval_const_expressions() to perform as much Const-simplification as possible. If we see an OpExpr with a partitioning column on one side, then the other side is either a Const, in which case we can perform pruning, or it's something whose value can't be known until runtime, in which case we cannot. There should be no case in which eval_const_expressions() fails to simplify something to a constant yet we know the value at plan time; if such a case exists, then it's an opportunity to improve eval_const_expressions(), not a reason to be inconsistent with the rules it applies. For run-time pruning, it is probably semantically wrong and certainly undesirable from a performance perspective to spin up a new ExecutorState and a new ExprState every time we need to reassess the decision about which partitions need to be scanned. Instead, the expressions whose values are inputs to the pruning process should be computed in the ExecutorState/ExprState for the main query and the results should be passed to the partition-pruning code as inputs to the decision-making process.

2. All processing of clauses should happen at plan time, not run time, so that we're not repeating work.
If, for example, a prepared query is executed repeatedly, the clauses should get fully processed when it's planned, and then the only thing that should happen at execution time is that we take the values to which we now have access and use them to decide what to prune. With this design, we're bound to repeat at least the portion of the work done by get_partitions_from_clauses() at runtime, and as things stand, it looks like the current version of the run-time partition pruning patch repeats the generate_partition_clauses() work at runtime as well.

It seems to me as though what we ought to be doing is extracting clauses that are of the correct general form and producing a list of <partition-column-index>, <partition-operator-strategy>, <expression> triplets sorted by increasing partition column index and operator strategy number and in a form that can be represented as a Node tree. This can be attached to the plan for use at runtime or, if there are at least some constants among the expressions, used at plan time for partition exclusion. So the main interfaces would be something like this:

extern List *extract_partition_pruning_clauses(RelOptInfo *rel);
extern PartitionPruneContext *CreatePartitionPruneContext(List *);
extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *);
extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext *, Datum *values, bool *isnull);

In PerformStaticPartitionPruning(), we'd just ignore anything for which the expression was non-constant; to call PerformDynamicPartitionPruning(), the caller would need to evaluate the expressions from the PartitionPruneContext using the appropriate EState and then pass the results into this function.

I realize that I'm hand-waving a bit here, or maybe more than a bit, but in general I think it's right to imagine that the work here needs to be segregated into two very well-separated phases. In the first phase, we do all possible work to assess which clauses are relevant and put them in a form where the actual pruning work can be done as cheaply as possible once we know the values. This phase always happens at plan time. In the second phase, we get the values, either by extracting them from Const nodes at plan time or evaluating expressions at runtime, and then decide which partitions to eliminate. This phase happens at plan time if we have enough constants available to do something useful and at runtime if we have enough non-constants to do something useful. Right now there doesn't seem to be a clean separation between these two phases and I think that's not good.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
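Under the interface sketched above, a caller would look roughly like this (hypothetical, since Robert is explicitly hand-waving; the names come from his extern declarations):

    /* Phase 1: always at plan time -- extract and preprocess clauses. */
    List *pruning_clauses = extract_partition_pruning_clauses(rel);
    PartitionPruneContext *context = CreatePartitionPruneContext(pruning_clauses);

    /* Phase 2a: at plan time, prune using whatever Consts are available. */
    Bitmapset *unpruned = PerformStaticPartitionPruning(context);

    /*
     * Phase 2b: at run time, the executor evaluates the remaining
     * expressions in its own EState and passes the results in:
     *
     *     unpruned = PerformDynamicPartitionPruning(context, values, isnull);
     */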
On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't like the comments at the top of partprune.c very much. It
> seems strange to document individual functions here; those functions
> can (and should) be documented in their individual header comments.
> What we should have at the top of the file is a discussion of the
> overall theory of operation of this module, something that currently
> seems not to exist anywhere in the patch. I tried to figure that out
> by looking at the new data structures the patch introduces:
> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
> PartClause. It looks like the general idea is that code that
> wants to use these facilities does the following:
>
> Step 1. Generate a PartitionPruneContext. In this patch, this seems
> to consist entirely of copying information from the RelOptInfo or its
> PartitionScheme.
>
> Step 2. Call generate_partition_clauses() to extract relevant clauses
> from rel->baserestrictinfo and generate a PartClauseInfo.
>
> Step 3. Call get_partitions_from_clauses() to generate a set of
> unpruned partition indexes. Internally, that function will first
> populate a PartScanKeyInfo from the PartClauseInfo by calling
> extract_bounding_datums(). Then it calls get_partitions_for_keys()
> which generates the set of unpruned partition indexes from the
> PartitionPruneContext and PartScanKeyInfo.

Hi Robert,

I feel I should step in here and answer this part, as it was me who first came up with the idea of the context struct. I've typed up something below which is my first cut at what I'd have imagined the header comment of partprune.c should look like. Some parts are only relevant after run-time pruning is also using this stuff. I've tried to highlight those areas; I'm not sure how much mention, if any, they should get as part of this patch.

Here goes:

partprune.c

Allows efficient identification of the minimal set of partitions which match a given set of clauses, thus allowing useful things such as ignoring unneeded partitions which cannot possibly contain tuples matching the given set of clauses.

This module breaks the process of determining the matching partitions into two distinct steps, each of which has its own function which is externally visible outside of this module. The reason for not performing everything in one step is down to the fact that there are times when we may wish to perform the 2nd step multiple times over. The steps could be thought of as a compilation step followed by an execution step.

Step 1 (compilation): Pre-process the given list of clauses and attempt to match individual clauses up to a partition key. The end result of this process is a PartitionClauseInfo containing details of each clause found to match the partition key. This output is required as input for the 2nd step.

Step 2 (execution): Step 2 outputs the minimal set of matching partitions based on the input from step 1. Internally, this step is broken down into smaller sub-steps, each of which is explained in detail in the comments in the corresponding function. Step 2 can be executed multiple times over its input values. The inputs to this step are not modified by the processing done within. It is expected that this step is executed multiple times in cases where the matching partitions must be determined during query execution. A subsequent evaluation of this step will be required whenever a parameter which was found in a clause matching the partition key changes its value.
PartitionPruneContext: Each of the steps described above also requires a PartitionPruneContext as input. This stores all of the remaining required inputs to each step. The contents of the context will vary slightly depending on where the step is being called from, i.e., the planner or the executor. For example, during query planning, we're unable to determine the value of a Param found matching the partition key. When this step is called from the executor, the PlanState can be set in the context, which allows evaluation of these Params into Datum values.

*** Only after run-time pruning ***
The PartitionPruneContext is also required since many of the query planner node types are unavailable to the executor, which means that the source information used to populate the context will vary depending on whether it's being called from the query planner or the executor.

*** Only after run-time pruning ***
The context is also modified during step 1 to record all of the Param IDs which were found to match the partition key.

-------------

Hopefully that also helps explain the intentions behind the current code structure.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
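In terms of the function names already used in this thread, David's two steps would correspond to a calling pattern like this (a sketch; the exact signatures are assumptions, not taken from the patch):

    PartitionPruneContext context;
    PartitionClauseInfo *partclauseinfo;
    Bitmapset *partindexes;

    /* ... fill 'context' from the RelOptInfo (planner) or PlanState (executor) ... */

    /* Step 1 ("compilation"): match clauses to the partition key once. */
    partclauseinfo = generate_partition_clauses(&context, rel->baserestrictinfo);

    /* Step 2 ("execution"): may be repeated whenever a matched Param changes value. */
    partindexes = get_partitions_from_clauses(&context, partclauseinfo);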
On 2018/03/01 21:56, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 4:33 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached an updated version in which I incorporated some of the revisions
>> that David Rowley suggested for the OR clauses handling (in partprune.c)
>> that he posted as a separate patch on the run-time pruning thread [1].
>
> I'm very skeptical about this patch's desire to remove the static
> qualifier from evaluate_expr(). Why does this patch need that and
> constraint exclusion not need it? Why should this patch not instead
> be using eval_const_expressions? partkey_datum_from_expr() is
> prepared to give up if evaluate_expr() doesn't return a Const, but
> there's nothing in evaluate_expr() to make it give up if, for example,
> the input is -- or contains -- a volatile function, e.g. random().

Thinking on this a bit, I have removed the evaluate_expr() business from partkey_datum_from_expr() and thus switched evaluate_expr() back to static. Let me explain why I'd added it there in the first place -- if the constant expression received in partkey_datum_from_expr() was not of the same type as that of the partition key, it'd try to coerce_to_target_type() the input expression to the partition key type, which may result in a non-Const expression. We'd turn it back into a Const by calling evaluate_expr(). I thought the coercion was needed because we'd be comparing the resulting datum with the partition bound datums using a partition comparison function that would require its arguments to be of given types.

But I realized we don't need the coercion. Earlier steps would have determined that the clause from which the expression originated contains an operator that is compatible with the partitioning operator family. If so, the type of the expression in question, even though different from the partition key type, would be binary coercible with it. So, it'd be okay to pass the datum extracted from such an expression to the partition comparison function to compare it with datums in PartitionBoundInfo, without performing any coercion.

> + if (OidIsValid(get_default_oid_from_partdesc(partdesc)))
> + rel->has_default_part = true;
> + else
> + rel->has_default_part = false;
>
> This can be written a lot more compactly as rel->has_default_part =
> OidIsValid(get_default_oid_from_partdesc(partdesc));

Indeed, will fix.

> PartitionPruneContext has no comment explaining its general purpose; I
> think it should.

Will fix.

Thanks,
Amit
Thanks for your comments.

On 2018/03/02 4:13, Robert Haas wrote:
> On Wed, Feb 28, 2018 at 11:53 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Attached updated patches.
>
> + memcpy(part_scheme->partsupfunc, partkey->partsupfunc,
> + sizeof(FmgrInfo) * partnatts);
>
> You can't copy an FmgrInfo by just applying memcpy() to it. Use fmgr_info_copy.

Oops, will fix.

> I don't like the comments at the top of partprune.c very much. It
> seems strange to document individual functions here; those functions
> can (and should) be documented in their individual header comments.
> What we should have at the top of the file is a discussion of the
> overall theory of operation of this module, something that currently
> seems not to exist anywhere in the patch.

Sorry about that. Thanks to your comments here and David's reply, I will try to address that in the next version.

> I tried to figure that out
> by looking at the new data structures the patch introduces:
> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and
> PartClause. It looks like the general idea is that code that
> wants to use these facilities does the following:
>
> Step 1. Generate a PartitionPruneContext. In this patch, this seems
> to consist entirely of copying information from the RelOptInfo or its
> PartitionScheme.
>
> Step 2. Call generate_partition_clauses() to extract relevant clauses
> from rel->baserestrictinfo and generate a PartClauseInfo.
>
> Step 3. Call get_partitions_from_clauses() to generate a set of
> unpruned partition indexes. Internally, that function will first
> populate a PartScanKeyInfo from the PartClauseInfo by calling
> extract_bounding_datums(). Then it calls get_partitions_for_keys()
> which generates the set of unpruned partition indexes from the
> PartitionPruneContext and PartScanKeyInfo.
>
> I guess there are two main things about this that seem odd to me:
>
> 1. I don't see why the partition pruning code should ever be
> responsible for evaluating anything itself, as it does currently via
> evaluate_expr(). For plan-time partition pruning, we are already
> using eval_const_expressions() to perform as much Const-simplification
> as possible. If we see an OpExpr with a partitioning column on one
> side, then the other side is either a Const, in which case we can
> perform pruning, or it's something whose value can't be known until
> runtime, in which case we cannot. There should be no case in which
> eval_const_expressions() fails to simplify something to a constant yet
> we know the value at plan time; if such a case exists, then it's an
> opportunity to improve eval_const_expressions(), not a reason to be
> inconsistent with the rules it applies. For run-time pruning, it is
> probably semantically wrong and certainly undesirable from a
> performance perspective to spin up a new ExecutorState and a new
> ExprState every time we need to reassess the decision about which
> partitions need to be scanned. Instead, the expressions whose values
> are inputs to the pruning process should be computed in the
> ExecutorState/ExprState for the main query and the results should be
> passed to the partition-pruning code as inputs to the decision-making
> process.

As I said in my earlier reply, I have removed the part that involved the pruning code calling evaluate_expr().
> If, for example, a prepared query > is executed repeatedly, the clauses should get fully processed when > it's planned, and then the only thing that should happen at execution > time is that we take the values to which we now have access and use > them to decide what to prune. With this design, we're bound to repeat > at least the portion of the work done by get_partitions_from_clauses() > at runtime, and as things stand, it looks like the current version of > the run-time partition pruning patch repeats the > generate_partition_clauses() work at runtime as well. > > It seems to me as though what we ought to be doing is extracting > clauses that are of the correct general form and producing a list of > <partition-column-index>, <partition-operator-strategy>, <expression> > triplets sorted by increasing partition column index and operator > strategy number and in a form that can be represented as a Node tree. > This can be attached to the plan for use at runtime or, if there are > at least some constants among the expressions, used at plan time for > partition exclusion. So the main interfaces would be something like > this: > > extern List *extract_partition_pruning_clauses(RelOptInfo *rel); > extern PartitionPruneContext *CreatePartitionPruneContext(List *); > extern Bitmapset *PerformStaticPartitionPruning(PartitionPruneContext *); > extern Bitmapset *PerformDynamicPartitionPruning(PartitionPruneContext > *, Datum *values, bool *isnull); > > In PerformStaticPartitionPruning(), we'd just ignore anything for > which the expression was non-constant; to call > PerformDynamicPartitionPruning(), the caller would need to evaluate > the expressions from the PartitionPruneContext using the appropriate > EState and then pass the results into this function. > > I realize that I'm hand-waving a bit here, or maybe more than a bit, > but in general I think it's right to imagine that the work here needs > to be segregated into two very well-separated phases. In the first > phase, we do all possible work to assess which clauses are relevant > and put them in a form where the actual pruning work can be done as > cheaply as possible once we know the values. This phase always > happens at plan time. In the second phase, we get the values, either > by extracting them from Const nodes at plan time or evaluating > expressions at runtime, and then decide which partitions to eliminate. > This phase happens at plan time if we have enough constants available > to do something useful and at runtime if we have enough non-constants > to do something useful. Right now there doesn't seem to be a clean > separation between these two phases and I think that's not good. Hmm, I see that things can be improved, and the various points you've mentioned here to improve the high-level interfaces seem useful. I'll try to rework the patch based on those and submit one early next week. Looking at the rough interface sketch in your message, it seems that the product of whatever steps we end up grouping into phase 1 should be something that can be put into a Node tree (PartitionPruneContext?), because we may need to attach it to the plan if some of the needed values will only be made available during execution. Given the patch's implementation, we'll have to make the structure of that Node tree a bit more complex than a simple List. For one thing, the patch handles OR clauses by performing pruning separately for each arm and then combining partitions selected across OR arms using set union.
By "performing pruning" in the last sentence I meant following steps similar to the ones you wrote in your message: 1. Segregating pruning clauses into per-partition-key Lists, that is, generate_partition_clauses() producing a PartitionClauseInfo, 2. Removing redundant clauses from each list, that is, remove_redundant_clauses() to produce lists with just one member per operator strategy for each partition key, 3. Extracting Datum values from the clauses to form equal/min/max tuples and setting null or not null bits for individual keys, that is, extract_bounding_datums() producing a PartScanKeyInfo, and 4. Finally pruning with those Datum tuples and null/not null info, that is, get_partitions_for_keys(). Steps 2-4 are dependent on clauses providing Datums, which all the clauses may or may not do. Depending on whether or not they do, we'll have to defer those steps to run time. So, * What do we encode into the Node tree attached to the plan? Clauses that haven't gone through steps 2 and 3 (something like PartitionClauseInfo) or the product of step 3 (something like PartScanKeyInfo)? * How do we account for OR clauses? Perhaps by having the aforementioned Node trees nested inside the top-level one, wherein there will be one nested node per arm of an OR clause. Thanks, Amit
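To illustrate the fmgr_info_copy() point above, here is a minimal sketch of what the corrected copy might look like (the loop variable and memory context argument are assumptions for illustration, not taken from the patch):

    int     i;

    /*
     * memcpy() would share fn_extra and other run-time state with the
     * source; fmgr_info_copy() makes a proper, independent copy of each
     * FmgrInfo entry.
     */
    for (i = 0; i < partnatts; i++)
        fmgr_info_copy(&part_scheme->partsupfunc[i],
                       &partkey->partsupfunc[i],
                       CurrentMemoryContext);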
On 2018/03/02 11:12, David Rowley wrote: > On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't like the comments at the top of partprune.c very much. It >> seems strange to document individual functions here; those functions >> can (and should) be documented in their individual header comments. >> What we should have at the top of the file is a discussion of the >> overall theory of operation of this module, something that currently >> seems not to exist anywhere in the patch. I tried to figure that out >> by looking at the new data structures the patch introduces: >> PartitionPruneContext, PartScanKeyInfo, PartitionClauseInfo, and >> PartClause. It looks like the general idea is that code that >> wants to use these facilities does the following: >> >> Step 1. Generate a PartitionPruneContext. In this patch, this seems >> to consist entirely of copying information from the RelOptInfo or its >> PartitionScheme. >> Step 2. Call generate_partition_clauses() to extract relevant clauses >> from rel->baserestrictinfo and generate a PartClauseInfo. >> Step 3. Call get_partitions_from_clauses() to generate a set of >> unpruned partition indexes. Internally, that function will first >> populate a PartScanKeyInfo from the PartClauseInfo by calling >> extract_bounding_datums(). Then it calls get_partitions_for_keys() >> which generates the set of unpruned partition indexes from the >> PartitionPruneContext and PartScanKeyInfo. > > Hi Robert, > > I feel I should step in here and answer this part as it was me who > first came up with the idea of the context struct. I've typed up > something below which is my first cut at what I'd have imagined the > header comment of partprune.c should look like. Some parts are only > relevant after run-time pruning is also using this stuff. I've tried to > highlight those areas; I'm not sure how much or if there should be any > mention of that at all as part of this patch. > > Here goes: > > partprune.c > > Allows efficient identification of the minimal set of partitions which match a > given set of clauses. Thus allowing useful things such as ignoring unneeded > partitions which cannot possibly contain tuples matching the given set of > clauses. > > This module breaks the process of determining the matching partitions into > two distinct steps, each of which has its own function which is externally > visible outside of this module. The reason for not performing everything > in one step is down to the fact that there are times where we may wish to > perform the 2nd step multiple times over. The steps could be thought of as a > compilation step followed by an execution step. > > Step 1 (compilation): > > Pre-process the given list of clauses and attempt to match individual clauses > up to a partition key. > > The end result of this process is a PartitionClauseInfo containing details of > each clause found to match the partition key. This output is required as > input for the 2nd step. > > Step 2 (execution): > > Step 2 outputs the minimal set of matching partitions based on the input from > step 1. > > Internally, this step is broken down into smaller sub-steps, each of which > is explained in detail in the comments in the corresponding function. > > Step 2 can be executed multiple times for its input values. The inputs to this > step are not modified by the processing done within. It is expected that this > step is executed multiple times in cases where the matching partitions must be > determined during query execution.
> A subsequent evaluation of this step will > be required whenever a parameter which was found in a clause matching the > partition key changes its value. > > PartitionPruneContext: > > Each of the steps described above also requires an input of a > PartitionPruneContext. This stores all of the remaining required inputs to > each step. The context will vary slightly depending on where the step is > being called from; i.e., the planner or executor. For example, > during query planning, we're unable to determine the value of a Param found > matching the partition key. When this step is called from the executor, the > PlanState can be set in the context, which allows evaluation of these Params > into Datum values. *** Only after run-time pruning *** > > The PartitionPruneContext is also required since many of the query planner > node types are unavailable to the executor, which means that the source > information used to populate the context will vary depending on whether it's > being called from the query planner or executor. > > *** Only after run-time pruning *** > > The context is also modified during step 1 to record all of the Param IDs > which were found to match the partition key. > > ------------- > > Hopefully that also helps explain the intentions with the current code structure. Thanks David for writing this down. Thanks, Amit
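To make the two steps described above concrete, a rough usage sketch follows; only the function names come from the patch, while the exact signatures shown here are assumptions:

    /* Step 1 (compilation), done once at plan time */
    PartitionClauseInfo *clauseinfo =
        generate_partition_clauses(context, rel->baserestrictinfo);

    /* Step 2 (execution), possibly repeated whenever a Param changes */
    Bitmapset  *parts =
        get_partitions_from_clauses(context, clauseinfo);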
On Fri, Mar 2, 2018 at 1:22 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > But I realized we don't need the coercion. Earlier steps would have > determined that the clause from which the expression originated contains > an operator that is compatible with the partitioning operator family. If > so, the type of the expression in question, even though different from the > partition key type, would be binary coercible with it. That doesn't follow. Knowing that two types are in the same operator family doesn't guarantee that the types are binary coercible. For example, int8 is not binary-coercible to int2. Moreover, you'd better be pretty careful about trying to cast int8 to int2 because it might turn a query that would have returned no rows into one that fails outright; that's not OK. Imagine that the user types: SELECT * FROM partitioned_by_int2 WHERE a = 1000000000000; I think what needs to happen with cross-type situations is that you look in the opfamily for a comparator that takes the types you want as input; if you can't find one, you have to give up on pruning. If you do find one, then you use it. For example in the above query, once you find btint28cmp, you can use that to compare the user-provided constant against the range bounds for the various partitions to see which one might contain it. You'll end up selecting the partition with upper bound MAXVALUE if there is one, or no partition at all if every partition has a finite upper bound. That's as well as we can do with current infrastructure, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
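A rough sketch of the opfamily lookup Robert describes; get_opfamily_proc() and BTORDER_PROC are existing infrastructure, while the surrounding variable names are assumed for illustration:

    /*
     * Look for a btree comparison function that takes the partition key's
     * type on one side and the expression's type on the other, e.g.
     * btint28cmp for (int2, int8).
     */
    Oid         cmpfn = get_opfamily_proc(partopfamily,
                                          partopcintype, /* e.g. int2 */
                                          exprtype,      /* e.g. int8 */
                                          BTORDER_PROC);

    if (!OidIsValid(cmpfn))
        return false;       /* no such comparator -- give up on pruning */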
On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Looking at the rough interface sketch in your message, it seems that the > product of whatever steps we end up grouping into phase 1 should be > something that can be put into a Node tree (PartitionPruneContext?), > because we may need to attach it to the plan if some of the needed values > will only be made available during execution. Right. You might also end up with two representations: a Node-tree-style representation that contains all of the information we need, and another, faster form into which it gets converted before use. > Given the patch's implementation, we'll have to make the structure of that > Node tree a bit more complex than a simple List. For one thing, the patch > handles OR clauses by performing pruning separately for each arm and then > combining partitions selected across OR arms using set union. By > "performing pruning" in the last sentence I meant following steps similar > to the ones you wrote in your message: > > 1. Segregating pruning clauses into per-partition-key Lists, that is, > generate_partition_clauses() producing a PartitionClauseInfo, > > 2. Removing redundant clauses from each list, that is, > remove_redundant_clauses() to produce lists with just one member per > operator strategy for each partition key, > > 3. Extracting Datum values from the clauses to form equal/min/max tuples > and setting null or not null bits for individual keys, that is, > extract_bounding_datums() producing a PartScanKeyInfo, and > > 4. Finally pruning with those Datum tuples and null/not null info, that > is, get_partitions_for_keys(). > > Steps 2-4 are dependent on clauses providing Datums, which all the clauses > may or may not do. Depending on whether or not they do, we'll have to defer those > steps to run time. I don't see that there's a real need to perform step 2 at all. I mean, if you have x > $1 and x > $2 in the query, you can just compute the set of partitions for the first clause, compute the set of partitions for the second clause, and then intersect. That doesn't seem obviously worse than deciding which of $1 and $2 is greater and then pruning only based on whichever one is greater in this case. > * What do we encode into the Node tree attached to the plan? Clauses that > haven't gone through steps 2 and 3 (something like PartitionClauseInfo) > or the product of step 3 (something like PartScanKeyInfo)? > > * How do we account for OR clauses? Perhaps by having the aforementioned > Node trees nested inside the top-level one, wherein there will be one > nested node per arm of an OR clause. Suppose we define the notion of a pruning program. A pruning program can use any number of registers, which have integer numbers starting with 0 and counting upward as high as necessary. Each register holds a Bitmapset. The result of a pruning program is the value of register 0 when the program completes. A pruning program consists of a list of steps, each of which is either a PruningBaseStep or a PruningCombineStep. A PruningCombineStep modifies the contents of the target register based on the contents of a source register in one of the following three ways: (1) UNION -- all bits set in source become set in target; (2) INTERSECT -- all bits clear in source become clear in target; (3) DIFFERENCE -- all bits set in source become clear in target.
A PruningBaseStep consists of a strategy (equality, less-than, etc.), an output register, and a list of expressions -- either as many as there are partition keys, or for range partitioning perhaps fewer; it prunes based on the strategy and the expressions and overwrites the output register with the partitions that would be selected. Example #1. Table is hash-partitioned on a and b. Given a query like SELECT * FROM tab WHERE a = 100 AND b = 233, we create a single-step program: 1. base-step (strategy =, register 0, expressions 100, 233) If there were an equality constraint on only one of the two columns, we would not create a pruning program at all, because no pruning is possible. Example #2. Table is list-partitioned on a. Given a query like SELECT * FROM tab WHERE (a = $1 OR a = $2) AND a != $3, we create this program: 1. base-step (strategy =, register 0, expressions $1) 2. base-step (strategy =, register 1, expressions $2) 3. base-step (strategy !=, register 2, expressions $3) 4. combine-step (target-register 0, source-register 1, strategy union) 5. combine-step (target-register 0, source-register 2, strategy difference) (This is unoptimized -- one could do better by reversing steps 3 and 4 and reusing register 1 instead of needing register 2, but that kind of optimization is probably not too important.) Example #3. Table is range-partitioned on a and b. Given a query like SELECT * FROM tab WHERE (a = 40 AND b > $1) OR (a = $2 AND b = $3), we do this: 1. base-step (strategy >, register 0, expressions 40, $1) 2. base-step (strategy =, register 1, expressions $2, $3) 3. combine-step (target-register 0, source-register 1, strategy union) You might need a few extra gadgets here to make all of this work -- e.g. another base-step strategy to handle ScalarArrayOpExpr; I'm just trying to convey the basic idea here. It's pretty easy to see how to store a program like this as a node tree: just create PruningBaseStep and PruningCombineStep nodes and stick them into a List. At execution time transform the List into an array and loop over it. Or possibly it would be better to have two lists, one of base steps without explicit register numbers, where step N always outputs to register N, and then a second list of combine steps. Then at execution time you could have an array of PruningBaseStep * and an array of PruningCombineStep * instead of a combined array of Node *, which might be quicker to process. But regardless of what you do exactly, I think you should try to come up with some kind of representation that is basically uniform, handling all the things you support in a similar fashion. The current patch has basically separate and somewhat ad-hoc representations for the regular case, the <> case, and the OR case, which I think is not ideal because you end up with more code and a certain amount of repeated logic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
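A sketch of how executing such a program might look; all of the type, field, and function names here are taken from the outline above rather than from any actual patch:

    Bitmapset  *reg[MAX_PRUNING_REGS];
    ListCell   *lc;

    foreach(lc, program)
    {
        Node       *step = (Node *) lfirst(lc);

        if (IsA(step, PruningBaseStep))
        {
            PruningBaseStep *base = (PruningBaseStep *) step;

            /* overwrite the output register with the selected partitions */
            reg[base->output_reg] = evaluate_base_step(context, base);
        }
        else
        {
            PruningCombineStep *comb = (PruningCombineStep *) step;

            switch (comb->combine_op)
            {
                case COMBINE_UNION:
                    reg[comb->target_reg] =
                        bms_add_members(reg[comb->target_reg],
                                        reg[comb->source_reg]);
                    break;
                case COMBINE_INTERSECT:
                    reg[comb->target_reg] =
                        bms_int_members(reg[comb->target_reg],
                                        reg[comb->source_reg]);
                    break;
                case COMBINE_DIFFERENCE:
                    reg[comb->target_reg] =
                        bms_del_members(reg[comb->target_reg],
                                        reg[comb->source_reg]);
                    break;
            }
        }
    }

    return reg[0];              /* the program's result */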
On 2 March 2018 at 08:13, Robert Haas <robertmhaas@gmail.com> wrote: > 2. All processing of clauses should happen at plan time, not run time, > so that we're not repeating work. If, for example, a prepared query > is executed repeatedly, the clauses should get fully processed when > it's planned, and then the only thing that should happen at execution > time is that we take the values to which we now have access and use > them to decide what to prune. With this design, we're bound to repeat > at least the portion of the work done by get_partitions_from_clauses() > at runtime, and as things stand, it looks like the current version of > the run-time partition pruning patch repeats the > generate_partition_clauses() work at runtime as well. > > It seems to me as though what we ought to be doing is extracting > clauses that are of the correct general form and producing a list of > <partition-column-index>, <partition-operator-strategy>, <expression> > triplets sorted by increasing partition column index and operator > strategy number and in a form that can be represented as a Node tree. > This can be attached to the plan for use at runtime or, if there are > at least some constants among the expressions, used at plan time for > partition exclusion. So the main interfaces would be something like > this: We did try this already as I also thought the same a while back and even wrote a broken patch to do that. The thing is, PartitionClauseInfo is designed to be optimal for having get_partitions_from_clauses() called on it multiple times over, as will likely happen when run-time pruning is pruning away unneeded partitions during a parameterized nested loop join. To make it a Node type, it's not quite as simple as changing arrays to Lists, as the keyclauses List contains PartClause structs, which are also not a Node type, and that struct contains an FmgrInfo, which is also not a Node type, so we'd need to go try to make all those Node types (I doubt that's going to happen) ... but ... It's probably not impossible to come up with some intermediate partially processed type that can be a Node type. Ideally, this could be quickly converted into a PartitionClauseInfo during execution. The problem with this is that it adds a rather silly additional step during query planning to convert the intermediate processed list into the fully processed PartitionClauseInfo that get_partitions_from_clauses would still need. I don't think building it is going to cost a huge amount. Presumably, there are not many partitioned tables with only 10 rows, so probably having get_partitions_from_clauses() work as quickly as possible is better than saving 100 nanoseconds in executor startup. That being said, there's still a small issue with the run-time pruning patch which is caused by me not pre-processing the clauses during planning. Ideally, I'd be able to pre-process at least enough to determine if any Params match the partition key so that I know if run-time pruning can be used or not. As of now, I'm not doing that as it seems wasteful to pre-process during planning just to get the Param Ids out, then not be able to carry the pre-processed ones over to the executor. We also can't really reuse the pre-processed state that was generated during the planner's calls to generate_partition_clauses() since we'll additionally also be passing in the parameterized path clauses as well as the baserestrictinfo clauses. Given the above, I'm happy to listen to ideas on this, as I'm really not sure what's best here.
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 3 March 2018 at 04:47, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Looking at the rough interface sketch in your message, it seems that the >> product of whatever steps we end up grouping into phase 1 should be >> something that can be put into a Node tree (PartitionPruneContext?), >> because we may need to attach it to the plan if some of the needed values >> will only be made available during execution. > > Right. You might also end up with two representations: a > Node-tree-style representation that contains all of the information we > need, and another, faster form into which it gets converted before > use. Hmm, I thought Amit typed PartitionPruneContext instead of PartitionClauseInfo by mistake here. PartitionPruneContext can't be a node type. The run-time pruning patch needs to set the PlanState in this context so that the code can look up the Param values when being called from the executor. You can't make PlanState a node type too! Perhaps you can make some primnode type to store all the stuff from RelOptInfo that's currently being stored in PartitionPruneContext. That could go along with the plan to save the executor having to look that stuff up. We can then make that Node type a field in the PartitionPruneContext. You could even reuse that Node from when it first gets generated during query-planning pruning, keeping it around to pass to the executor for run-time pruning, but you'd probably need to stuff it somewhere like the partition's RelOptInfo so it could be reused again later. Unsure if that's worth the trouble or not. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
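A sketch of the split David suggests, with all type and field names here being guesses for illustration: a small Node-typed struct carries the plan-time information, and a plain (non-Node) context wraps it together with any executor state:

    /* Node-typed: can be copied and attached to the plan */
    typedef struct PartitionPruneInfo
    {
        NodeTag     type;
        /* partitioning strategy, bound info, pruning clauses/steps, ... */
    } PartitionPruneInfo;

    /* not a Node: pairs the plan-time info with run-time state */
    typedef struct PartitionPruneContext
    {
        PartitionPruneInfo *pruneinfo;
        PlanState  *planstate;  /* set only when called from the executor,
                                 * to evaluate Params into Datums */
    } PartitionPruneContext;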
On Fri, Mar 2, 2018 at 10:54 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > I don't think building it is going to cost a huge amount. Presumably, > there are not many partitioned tables with only 10 rows, so probably having > get_partitions_from_clauses() work as quickly as possible is better > than saving 100 nanoseconds in executor startup. I agree that one could go overboard with trying to push work from executor time to planner time, but I don't think the current patch is very close to the point of diminishing returns. It's doing nearly everything at execution time. > That being said, there's still a small issue with the run-time pruning > patch which is caused by me not pre-processing the clauses during > planning. Ideally, I'd be able to pre-process at least enough to > determine if any Params match the partition key so that I know if > run-time pruning can be used or not. As of now, I'm not doing that as > it seems wasteful to pre-process during planning just to get the Param > Ids out, then not be able to carry the pre-processed ones over to the > executor. We also can't really reuse the pre-processed state that was > generated during the planner's calls to generate_partition_clauses() > since we'll additionally also be passing in the parameterized path > clauses as well as the baserestrictinfo clauses. I think it should be possible to have a structure where all the work of classifying clauses happens in the planner. By the time we get to execution time, we should be able to know for sure which clauses are relevant. For example, if the user says WHERE a = $1 + 3 AND b = (random() * 100)::int, and the partition key is (a, b), we should be able to figure out at plan time that the clause containing b is useless (because it's volatile) and the clause containing a is useful only if this is range-partitioning (because with hash-partitioning we must have an equality clause for every partition key column to do anything). I think it should also be possible to know which expressions need to be computed at runtime -- in this case, $1 + 3 -- and to which columns of the partition key they correspond -- in this case, the first. I just proposed a data representation which could track all that stuff and I'm sure there are other ways to do it, too. I think that things like PartClause that include both an opno and various bits of cached information, including FmgrInfo, are not a very good idea. A lot of work has been done to maintain the separation of immutable information -- like Plans or Exprs -- from the run-time state they use -- PlanState or ExprState. I think we would do well to follow that distinction here, too, even if it seems to introduce some "silly" overhead at execution time. I think it will pay for itself in future code maintenance and the ability to apply optimizations such as JIT, which benefit from good divisions of this kind. It is not crazy to imagine that the "pruning program" idea I floated in a previous email could be folded into the JIT stuff Andres is doing, where something with a less-clean separation of concerns would run into problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3 March 2018 at 04:47, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> 2. Removing redundant clauses from each list, that is, >> remove_redundant_clauses() to produce lists with just one member per >> operator strategy for each partition key, > > I don't see that there's a real need to perform step 2 at all. I > mean, if you have x > $1 and x > $2 in the query, you can just compute > the set of partitions for the first clause, compute the set of > partitions for the second clause, and then intersect. That doesn't > seem obviously worse than deciding which of $1 and $2 is greater and > then pruning only based on whichever one is greater in this case. This is an interesting idea. It may simplify the code quite a bit as the clause reduction code is quite bulky. However, unless I'm mistaken, for this to work, certain inputs will need significantly more processing to determine the minimum set of matching partitions. Let's look at the following perhaps unlikely case. (I picked an extreme case to demonstrate why this may be an inferior method) Given the table abc (...) partition by range (a,b,c), with the query: select * from abc where a >= 1 and a >= 2 and a >= 3 and b >= 1 and b >= 2 and b >= 3 and c >= 1 and c >= 2 and c >= 3; We would likely still be parsing those clauses into some struct like PartitionClauseInfo and would end up with some arrays or Lists with the clauses segmented by partition key. It appears to me, for your method to work we'd need to try every combination of the clauses matching each partition key, which in this case is 3 * 3 * 3 searches. Amit's current method is 1 search, after the clause reduction, which is 3 + 3 + 3 (O(N) per partition key). I've tried to think of a more genuine poorly-performing case for this with IN or NOT IN lists, but I can't quite see it since NOT IN will only be supported by LIST partitioning, which can only have a single partition key, and IN would be OR conditions, each of which would be evaluated in a different round of looping. Although I'm not ruling out that my imagination is just not good enough. With that considered, is it still a good idea to do it this way? Or maybe I've misunderstood the idea completely? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/03/02 21:43, Robert Haas wrote: > On Fri, Mar 2, 2018 at 1:22 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> But I realized we don't need the coercion. Earlier steps would have >> determined that the clause from which the expression originated contains >> an operator that is compatible with the partitioning operator family. If >> so, the type of the expression in question, even though different from the >> partition key type, would be binary coercible with it. > > That doesn't follow. Knowing that two types are in the same operator > family doesn't guarantee that the types are binary coercible. For > example, int8 is not binary-coercible to int2. Moreover, you'd better > be pretty careful about trying to cast int8 to int2 because it might > turn a query that would have returned no rows into one that fails > outright; that's not OK. Imagine that the user types: > > SELECT * FROM partitioned_by_int2 WHERE a = 1000000000000; > > I think what needs to happen with cross-type situations is that you > look in the opfamily for a comparator that takes the types you want as > input; if you can't find one, you have to give up on pruning. If you > do find one, then you use it. For example in the above query, once > you find btint28cmp, you can use that to compare the user-provided > constant against the range bounds for the various partitions to see > which one might contain it. You'll end up selecting the partition > with upper bound MAXVALUE if there is one, or no partition at all if > every partition has a finite upper bound. That's as well as we can do > with current infrastructure, I think. Hmm, yes. So while the patch's previous approach to convert the query's constant value to the desired type was wrong, this is wronger. :-( I guess I'll need to change the patch so that, when comparing partition bounds with a query-specified constant, we use not the default comparison function from the PartitionKey but one that accepts the constant's type. Thanks, Amit
On 2018/03/03 0:47, Robert Haas wrote: > On Fri, Mar 2, 2018 at 6:21 AM, Amit Langote wrote: >> Given the patch's implementation, we'll have to make the structure of that >> Node tree a bit more complex than a simple List. For one thing, the patch >> handles OR clauses by performing pruning separately for each arm and then >> combining partitions selected across OR arms using set union. By >> "performing pruning" in the last sentence I meant following steps similar >> to the ones you wrote in your message: >> >> 1. Segregating pruning clauses into per-partition-key Lists, that is, >> generate_partition_clauses() producing a PartitionClauseInfo, >> >> 2. Removing redundant clauses from each list, that is, >> remove_redundant_clauses() to produce lists with just one member per >> operator strategy for each partition key, >> >> 3. Extracting Datum values from the clauses to form equal/min/max tuples >> and setting null or not null bits for individual keys, that is, >> extract_bounding_datums() producing a PartScanKeyInfo, and >> >> 4. Finally pruning with those Datum tuples and null/not null info, that >> is, get_partitions_for_keys(). >> >> Steps 2-4 are dependent on clauses providing Datums, which all the clauses >> may or may not do. Depending on whether or not they do, we'll have to defer those >> steps to run time. > > I don't see that there's a real need to perform step 2 at all. I > mean, if you have x > $1 and x > $2 in the query, you can just compute > the set of partitions for the first clause, compute the set of > partitions for the second clause, and then intersect. That doesn't > seem obviously worse than deciding which of $1 and $2 is greater and > then pruning only based on whichever one is greater in this case. If we can accommodate this step with your "pruning program" idea below, I think we should still try to keep the remove_redundant_clauses() step. If we can keep this step, x > $1 and x > $2 will require 1 comparison + 1 bsearch over bounds, whereas without it, it'd be 2 bsearches over bounds + 1 Bitmapset operation. I think David expressed a similar concern about the performance in his reply down-thread. That is, as long as we implement this step such that having it in the pipeline doesn't restrict the representation we need for the "pruning steps". Let's see. >> * What do we encode into the Node tree attached to the plan? Clauses that >> haven't gone through steps 2 and 3 (something like PartitionClauseInfo) >> or the product of step 3 (something like PartScanKeyInfo)? >> >> * How do we account for OR clauses? Perhaps by having the aforementioned >> Node trees nested inside the top-level one, wherein there will be one >> nested node per arm of an OR clause. > > Suppose we define the notion of a pruning program. A pruning program > can use any number of registers, which have integer numbers starting > with 0 and counting upward as high as necessary. Each register holds > a Bitmapset. The result of a pruning program is the value of register > 0 when the program completes. A pruning program consists of a list of > steps, each of which is either a PruningBaseStep or a > PruningCombineStep. A PruningCombineStep modifies the contents of the > target register based on the contents of a source register in one of > the following three ways: (1) UNION -- all bits set in source become > set in target; (2) INTERSECT -- all bits clear in source become clear > in target; (3) DIFFERENCE -- all bits set in source become clear in > target.
> A PruningBaseStep consists of a strategy (equality, > less-than, etc.), an output register, and a list of expressions -- > either as many as there are partition keys, or for range partitioning > perhaps fewer; it prunes based on the strategy and the expressions and > overwrites the output register with the partitions that would be > selected. > > Example #1. Table is hash-partitioned on a and b. Given a query like > SELECT * FROM tab WHERE a = 100 AND b = 233, we create a single-step > program: > > 1. base-step (strategy =, register 0, expressions 100, 233) > > If there were an equality constraint on only one of the two columns, we > would not create a pruning program at all, because no pruning is > possible. > > Example #2. Table is list-partitioned on a. Given a query like SELECT > * FROM tab WHERE (a = $1 OR a = $2) AND a != $3, we create this > program: > > 1. base-step (strategy =, register 0, expressions $1) > 2. base-step (strategy =, register 1, expressions $2) > 3. base-step (strategy !=, register 2, expressions $3) > 4. combine-step (target-register 0, source-register 1, strategy union) > 5. combine-step (target-register 0, source-register 2, strategy difference) > > (This is unoptimized -- one could do better by reversing steps 3 and 4 > and reusing register 1 instead of needing register 2, but that > kind of optimization is probably not too important.) > > Example #3. Table is range-partitioned on a and b. Given a query like > SELECT * FROM tab WHERE (a = 40 AND b > $1) OR (a = $2 AND b = $3), we > do this: > > 1. base-step (strategy >, register 0, expressions 40, $1) > 2. base-step (strategy =, register 1, expressions $2, $3) > 3. combine-step (target-register 0, source-register 1, strategy union) > > You might need a few extra gadgets here to make all of this work -- > e.g. another base-step strategy to handle ScalarArrayOpExpr; I'm just > trying to convey the basic idea here. It's pretty easy to see how to > store a program like this as a node tree: just create PruningBaseStep > and PruningCombineStep nodes and stick them into a List. At execution > time transform the List into an array and loop over it. > > Or possibly it would be better to have two lists, one of base steps > without explicit register numbers, where step N always outputs to > register N, and then a second list of combine steps. Then at > execution time you could have an array of PruningBaseStep * and an > array of PruningCombineStep * instead of a combined array of Node *, > which might be quicker to process. > > But regardless of what you do exactly, I think you should try to come > up with some kind of representation that is basically uniform, > handling all the things you support in a similar fashion. The current > patch has basically separate and somewhat ad-hoc representations for > the regular case, the <> case, and the OR case, which I think is not > ideal because you end up with more code and a certain amount of > repeated logic. Thanks for outlining this idea. Given that the most important part of your outline seems to be the clean Node structure for carrying the information from one stage to another, which will result from adopting the "partition pruning primitives" described above, I will first try to implement them as a shell around the code that David and I have debugged till now. Then, once everything seems to work, I will start dropping unneeded structures and code that seem duplicative, of which I'm beginning to suspect there will be plenty.
I'll post an update in a couple of days to report on how that works out. Regards, Amit
On Fri, Mar 2, 2018 at 7:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > Let's look at the following perhaps unlikely case. (I picked an > extreme case to demonstrate why this may be an inferior method) > > Given the table abc (...) partition by range (a,b,c), with the query: > > select * from abc where a >= 1 and a >= 2 and a >= 3 and b >= 1 and > b >= 2 and b >= 3 and c >= 1 and c >= 2 and c >= 3; > > We would likely still be parsing those clauses into some struct like > PartitionClauseInfo and would end up with some arrays or Lists with > the clauses segmented by partition key. > > It appears to me, for your method to work we'd need to try every > combination of the clauses matching each partition key, which in this > case is 3 * 3 * 3 searches. Amit's current method is 1 search, after > the clause reduction which is 3 + 3 + 3 (O(N) per partition key) [...] > With that considered, is it still a good idea to do it this way? I dunno. What do you think? That case is indeed pretty unfortunate, but it's also pretty artificial. It's not obvious to me that we shouldn't care about it, but it's also not obvious to me that we should. If we have some bizarre cases that slip through the cracks or don't perform terribly well, maybe nobody would ever notice or care. On the other hand, maybe they would. I suppose in my ideal world, this could be handled by building a GREATEST or LEAST expression. In other words, if someone says foo >= 1 AND foo >= 2, instead of doing separate pruning steps, we'd just prune once based on foo >= GREATEST(1,2). But that doesn't really work, because there's no provision to tell MinMaxExpr from which opfamily we wish to draw the operator used to compare 1 and 2 and no guarantee that such an operator exists for the actual data types of 1 and 2. (Imagine that 1 and 2 are of different data types; the relevant opfamily might have an operator that can compare a value of the same type as foo to 1 and similarly for 2, but no operator that can compare 1 and 2 to each other.) One thing that we could do is just only accept one clause for each column-strategy pairing, presumably either the first one or the last one. So in your example either a >= 1 or a >= 3 would be accepted and the others would be discarded for purposes of partition pruning. If a user complains, we could tell THEM to manually do the rewrite suggested in the previous step, and just write a >= GREATEST(1,2,3). (And of course if it's that simple, they might want to then pre-simplify to a >= 3!) Another alternative is to include some kind of additional type of step in the "pruning program" which can do this GREATEST/LEAST operation ... but that's adding quite a bit of complexity for what seems like pretty much a corner case, and as noted above, there's no guarantee that we even have the correct operator available. It should be fine if the partitioning column/expression and all of the constants being compared are of the same type, and in practice *most* of the time even when they're not, but we're going to have to have some handling for the strange cases -- and I think the only real choices are "try every combination and maybe be slow", "try 1 combination and maybe fail to prune everything that could have been pruned", and some intermediate possibilities along the same lines. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7 March 2018 at 10:15, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 2, 2018 at 7:32 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> It appears to me, for your method to work we'd need to try every >> combination of the clauses matching each partition key, which in this >> case is 3 * 3 * 3 searches. Amit's current method is 1 search, after >> the clause reduction which is 3 + 3 + 3 (O(N) per partition key) > [...] >> With that considered, is it still a good idea to do it this way? > > I dunno. What do you think? > > That case is indeed pretty unfortunate, but it's also pretty > artificial. It's not obvious to me that we shouldn't care about it, > but it's also not obvious to me that we should. If we have some > bizarre cases that slip through the cracks or don't perform terribly > well, maybe nobody would ever notice or care. On the other hand, > maybe they would. One thing I've learned in my time working with PostgreSQL is that, if there's a known hole, someone's probably going to fall down it eventually. I like working with PostgreSQL because we're pretty careful to not make holes that people can fall down, or if there is some hole that cannot be filled in, we try to put a fence around it with a sign (e.g. renaming pg_xlog to pg_wal). I'm not strongly opposed to your ideas; I probably don't have a complete understanding of the idea anyway. But from what I understand it looks like you want to take something that works quite well and make it work less well, and no good reason has been provided for why you want to do that. Is it because you want to simplify the patch due to concerns about it being too much logic to get right for PG11? > One thing that we could do is just only accept one clause for each > column-strategy pairing, presumably either the first one or the last > one. The problem with that is it can cause surprising behaviour. We reorder clauses and clauses get pushed down from upper parts of the query. Let's say there was some view like: CREATE VIEW vw_ledger_2018 AS SELECT * FROM ledger WHERE postdate BETWEEN '2018-01-01' AND '2018-12-31'; And a user comes along and does: SELECT * FROM vw_ledger_2018 WHERE postdate BETWEEN '2018-03-01' AND '2018-03-31' We're going to end up with base quals something like: postdate >= '2018-01-01' AND postdate <= '2018-12-31' AND postdate >= '2018-03-01' AND postdate <= '2018-03-31' If we just take the first from each op strategy then we'll not have managed to narrow the case down to just the March partition. You might argue that this should be resolved at some higher level in the planner, but that does nothing for the run-time pruning case. I don't really want to do or say anything that jeopardises this patch from getting into PG11, so if the path of least resistance is to go with the option you've proposed then I'd much rather that than this getting pushed out to PG12. I really just want to try to make sure we've thought of everything before we create too many surprises for users. Perhaps a compromise would be to check all quals from the first partition key and only the first or last one from the remaining keys. I imagine most cases will have just 1 key anyway. This would significantly reduce the number of possible combinations of quals to try, but unfortunately, it still does have that element of surprise. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 6, 2018 at 8:34 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > One thing I've learned in my time working with PostgreSQL is that, if > there's a known hole, someone's probably going to fall down it > eventually. I like working with PostgreSQL because we're pretty > careful to not make holes that people can fall down, or if there is > some hole that cannot be filled in, we try to put a fence around it > with a sign (e.g. renaming pg_xlog to pg_wal). I'm not strongly opposed > to your ideas; I probably don't have a complete understanding of the > idea anyway. But from what I understand it looks like you want to take > something that works quite well and make it work less well, and no > good reason has been provided for why you want to do that. > > Is it because you want to simplify the patch due to concerns about it > being too much logic to get right for PG11? My understanding is that the patch as submitted is fundamentally broken in multiple ways. As Amit said a few emails upthread, "So while the patch's previous approach to convert the query's constant value to the desired type was wrong, this is wronger. :-(" I agree with that analysis. As I tried to explain in my last email, if you've got something like foo > 'blarfle'::type1 and foo > 'hoge'::type2, there may actually be no way at all to determine which of those clauses is more restrictive. The fact that > was used in the query to compare foo with a value of type1 and, separately, with a value of type2 means that those operators exist, but it does not follow that the opfamily provides an operator which can compare type1 to type2. As far as I can see, what this means is that, in general, the approach the patch takes to eliminating redundant clauses just doesn't work; and in the general case I don't think there's much hope of saving it. The question of whether the patch does too much work at execution time or not is maybe arguable -- my position is that it does -- but working properly in the face of cross-type comparisons is non-negotiable. The use of evaluate_expr() is also completely wrong and has got to be fixed. I already wrote about that upthread so I won't repeat it here. I'm pretty sure that the current design, if allowed to stand, would have lots of bad consequences. As I understand it, Amit is currently hacking on the patch to try to fix these issues. If he comes up with something that works properly with cross-type comparisons and doesn't abuse evaluate_expr() but still does more work than I'd ideally prefer at execution time, I'll hold my nose and consider it anyway. But considering the amount of rework that I think is needed, I don't really see why we wouldn't adopt a design that minimizes execution time work, too. In short, I don't think I'm trying to make something that works quite well work less well, because I don't think the patch as it stands can be correctly described as working quite well. > Let's say there was some view like: > > CREATE VIEW vw_ledger_2018 AS SELECT * FROM ledger WHERE postdate > BETWEEN '2018-01-01' AND '2018-12-31'; > > And a user comes along and does: > > SELECT * FROM vw_ledger_2018 WHERE postdate BETWEEN '2018-03-01' AND > '2018-03-31' > > If we just take the first from each op strategy then we'll not have > managed to narrow the case down to just the March partition. You might > argue that this should be resolved at some higher level in the > planner, but that does nothing for the run-time pruning case.
Yeah, that's a good example of how this could happen in real life. So far I see three ways forward here: 1. If we've got multiple redundant quals, ignore all but one of them for purposes of partition pruning. Hope users don't get mad. 2. If we've got multiple redundant quals, do multiple checks. Hope this isn't annoyingly slow (or too much ugly code). 3. If we've got multiple redundant quals but no cross-type operators are in use, evaluate all of the expressions and pick the highest or lowest value as appropriate. Otherwise fall back to #1 or #2. For bonus points, do this when cross-type operators ARE in use and the additional cross-type operator that we need to figure out the highest or lowest value, as appropriate, is also available. I'm OK with any of those approaches; that is, I will not seek to block a patch merely on the basis of which of those options it chooses. I think they are all defensible. Options that are not OK include (a) trying to cast a value of one type to another type, because that could turn a query that would have simply returned no rows into an error case or (b) supposing that all types in an opfamily are binary coercible to each other, because that's just plain wrong. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi. On 2018/03/05 17:38, Amit Langote wrote: > I'll > post an update in a couple of days to report on how that works out. I'm still working on this and getting most of the tests to pass with the new code, but not all of them yet. Thanks, Amit
On 2018/03/07 20:58, Amit Langote wrote: > Hi. > > On 2018/03/05 17:38, Amit Langote wrote: >> I'll >> post an update in a couple of days to report on how that works out. > > I'm still working on this and getting most of the tests to pass with the > new code, but not all of them yet. Sorry about the delay. Attached is a significantly revised version of the patch, although I admit it could still use some work with regard to comments and other cleanup. The rewrite introduces a notion of PartitionPruneStep nodes based on the ideas described in [1]. So, instead of aggregating *all* of the pruning clauses into a PartitionClauseInfo (which was hard to serialize into a node tree) and then a PartScanKeyInfo (both of which no longer exist), this generates a list of nodes. Each node inherits from the base PartitionPruneStep node type and contains enough information to perform partition pruning by comparing it directly with partition bounds, or contains sub-nodes that do. For example, a PartitionPruneStepOp step contains an integer telling the partitioning operator strategy (such as various btree operator strategies) and a tuple to compare against partition bounds stored in the relcache. A PartitionPruneStepCombine step contains arguments that are in turn pruning steps themselves, which are executed separately, and the partition sets obtained thereby are combined using the specified combineOp. Also, I fixed a bug of the previous design, as detailed in [2]. So, with the patch: create table lparted (a smallint) partition by list (a); create table lparted_1 partition of lparted for values in (1); create table lparted_16384 partition of lparted for values in (16384); -- all partitions pruned (lparted_16384 wouldn't be pruned by previous -- patches due to comparison using a bogus partsupfunc) explain (costs off) select * from lparted where a = 100000000000000; QUERY PLAN -------------------------- Result One-Time Filter: false (2 rows) Also, create table rparted (a smallint) partition by range (a); create table rparted_1 partition of rparted for values from (1) to (10); create table rparted_16384 partition of rparted for values from (10) to (16384); create table rparted_maxvalue partition of rparted for values from (16384) to (maxvalue); -- all partitions except rparted_maxvalue pruned explain (costs off) select * from rparted where a > 100000000000000; QUERY PLAN ------------------------------------------------- Append -> Seq Scan on rparted_maxvalue Filter: (a > '100000000000000'::bigint) (3 rows) I will continue working on improving the comments / cleaning things up and post a revised version soon, but until then please look at the attached. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BTgmoYtKitwsFtA4%2B6cdeYGEfnS1%2BOY%2BG%3DUe26fgSzJZx%3DeJg%40mail.gmail.com
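In rough outline, the node types described above might look like the following; the field names are inferred from the description rather than copied from the patch:

    typedef struct PartitionPruneStep
    {
        NodeTag     type;       /* base type common to all step nodes */
    } PartitionPruneStep;

    typedef struct PartitionPruneStepOp
    {
        PartitionPruneStep step;
        StrategyNumber opstrategy;  /* e.g. BTLessStrategyNumber */
        List       *exprs;          /* tuple to compare against the
                                     * partition bounds in the relcache */
    } PartitionPruneStepOp;

    typedef struct PartitionPruneStepCombine
    {
        PartitionPruneStep step;
        int         combineOp;      /* union or intersect */
        List       *argsteps;       /* sub-steps whose result sets are
                                     * combined using combineOp */
    } PartitionPruneStepCombine;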
Hi Amit, On 03/13/2018 07:37 AM, Amit Langote wrote: > I will continue working on improving the comments / cleaning things up and > post a revised version soon, but until then please look at the attached. > Passes check-world. Some minor comments: 0001: Ok 0002: Ok 0003: * Trailing white space * pruning.c - partkey_datum_from_expr * "Add more expression types..." -- Are you planning to add more of these ? Otherwise change the comment - get_partitions_for_null_keys * Documentation for method * 'break;' missing for _HASH and default case - get_partitions_for_keys * 'break;'s are outside of the 'case' blocks * The 'switch(opstrategy)'s could use some {} blocks * 'break;' missing from default - perform_pruning_combine_step * Documentation for method * nodeFuncs.c - Missing 'break;'s to follow style 0004: Ok Best regards, Jesper
Amit Langote wrote: > I will continue working on improving the comments / cleaning things up and > post a revised version soon, but until then please look at the attached. I tried to give this a read. It looks like pretty neat stuff -- as far as I can tell, it follows Robert's sketch for how this should work. The fact that it's under-commented makes me unable to follow it too closely though (I felt like adding a few "wtf?" comments here and there), so it's taking me a bit to follow things in detail. Please do submit improved versions as you have them. I think you're using an old version of pg_bsd_indent. In particular need of commentary: * match_clause_to_partition_key() should indicate which params are outputs and what they get set to * get_steps_using_prefix already has a comment, but it doesn't really explain much. (I'm not sure why you use the term "tuple" here. I mean, mathematically it probably makes sense, but in the overall context it seems just confusing.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
By the way, I checked whether patch 0002 (additional tests) had an effect on coverage, and couldn't detect any changes in terms of lines/functions. Were you able to find any bugs in your code thanks to the new tests that would not have been covered by existing tests? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 14 March 2018 at 06:54, Jesper Pedersen <jesper.pedersen@redhat.com> wrote: > * "Add more expression types..." -- Are you planning to add more of > these? Otherwise change the comment Run-time pruning will deal with Param types here. The comment there might be just to remind us all that the function must remain generic enough so we can support more node types later. I don't particularly need the comment for that patch, and I'm not quite sure if I should be removing it in that patch. My imagination is not stretching far enough today to think what we could use beyond Const and Param. I don't feel strongly about the comment either way. This is just to let you know that there are more upcoming things to do in that spot. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/03/14 8:26, Alvaro Herrera wrote: > By the way, I checked whether patch 0002 (additional tests) had an > effect on coverage, and couldn't detect any changes in terms of > lines/functions. Were you able to find any bugs in your code thanks to > the new tests that would not have been covered by existing tests? All tests except those for hash partitioning got added as bugs were found in the patch and fixed. As you may know, constraint exclusion doesn't help with pruning hash partitions, so those tests don't exercise any existing functionality but are there for the *new* code. Thanks, Amit
On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached is a significantly revised version of the patch, although I admit > it could still use some work with regard to comments and other cleanup. Thanks for making all those changes. There's been quite a bit of churn! I've looked over the patch and agree that there need to be more comments to explain things. There were certainly times during my review that I just skipped ahead having not understood what I'd been looking at. Here are the notes from my review: 1. I think get_unpruned_partitions is not a great name for this function. get_matching_partitions would be better. 2. HASH partitioning can no longer prune non-matching partitions in cases like: SELECT * FROM hashp WHERE partkey1 IS NULL and partkey2 IS NULL; This might mean you need to process IS NULL and the key clauses in the same step. I see you have a test which checks the plan for explain (costs off) select * from hp where a = 1 and b is null; which ensures all partitions are included. Why? 3. Header comment for get_partitions_for_keys references 'keys' which is not a parameter to that function. 4. If you wrote a function to process an individual step and called that in a foreach loop inside get_unpruned_partitions, then you wouldn't need to list_make1() here. Instead just call the function to perform one step. argparts = get_unpruned_partitions(context, list_make1(step)); 5. I see you're performing AND step combination in two different ways. In perform_pruning_combine_step you do: andparts = andparts == NULL ? argparts : bms_int_members(andparts, argparts); but in get_unpruned_partitions, you add the entire range to the set using bms_add_range, then intersect on that. The code seems a bit fragile and relies on get_unpruned_partitions returning an allocated Bitmapset all the time, even if bms_is_empty() is true. There should be no distinction between NULL and a Bitmapset that returns true on bms_is_empty(). What you've got here probably works for now, but only due to the fact that get_unpruned_partitions allocates the entire range with bms_add_range and that bms_int_members does not return NULL with no matching members if both input sets are non-NULL. If that were to change then your code would misbehave in cases like: WHERE <matches no partition> AND <matches a partition>; When processing the <matches no partition> clause, no partitions would match, and if that resulted in an empty set you'd surprisingly match partitions again despite the AND clause not actually making it possible for any partitions to have matched. Probably you just need to bms_add_range in perform_pruning_combine_step too and perform bms_int_members unconditionally, just like you're doing in get_unpruned_partitions (see the sketch after this message). 6. The switch (last->op_strategy) in generate_partition_pruning_steps_internal is missing a default ERROR for unknown strategy. 7. The switch: switch (context->strategy) in generate_partition_pruning_steps_internal should ERROR rather than break when it sees an unknown partition strategy. 8. Instead of copying the opclause here, wouldn't you be better off just having this code come after you palloc the PartClause and then setting up the PartClause with the negator directly? if (*is_neop_listp) { Assert(OidIsValid(negator)); opclause = copyObject(opclause); opclause->opno = negator; } 9. PartitionedChildRelInfo is still mentioned in typedefs.list 10. I don't quite understand PartitionPruneStepNoop.
Why can't you just skip adding anything to the list in generate_partition_pruning_steps_internal? 10. The following test does not make sense: explain (costs off) select * from hp where (a = 10 and b = 'yyy') or (a = 10 and b = 'xxx') or (a is null and b is null); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Append -> Seq Scan on hp0 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) -> Seq Scan on hp1 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) -> Seq Scan on hp2 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) -> Seq Scan on hp3 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) (9 rows) Why do 4 partitions match when there are only 3 sets of clauses when each one can only match a single partition? 11. What does "root" mean here? -- case for list partitioned table that's not root explain (costs off) select * from rlp where a = 15 and b <> 'ab' and b <> 'cd' and b <> 'xy' and b is not null; -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
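PS: to make point 5 a bit more concrete, here's the rough shape I have in mind for the AND handling in perform_pruning_combine_step(). This is an untested sketch; "argsteps" and perform_pruning_step() are stand-ins for whatever the patch ends up naming them:

/*
 * Start from the set of all partitions and intersect each argument
 * step's result into it unconditionally.  That way a step which
 * matches no partitions can never be "revived" by a later step.
 */
Bitmapset  *andparts = bms_add_range(NULL, 0, context->nparts - 1);
ListCell   *lc;

foreach(lc, step->argsteps)
{
    Bitmapset  *argparts = perform_pruning_step(context, lfirst(lc));

    andparts = bms_int_members(andparts, argparts);
}

This makes the AND case behave identically to the top-level loop, with no special NULL-vs-empty-set distinction anywhere.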
On 2018/03/14 20:50, David Rowley wrote: > On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached is a significantly revised version of the patch, although I admit >> it could still use some work with regard to comments and other cleanup. > > Thanks for making all those changes. There's been quite a bit of churn! Thank you David and Jesper for the reviews. I'm halfway done addressing the comments and will submit an updated patch by tomorrow. Thanks, Amit
Hi David. Thanks for the review.

On 2018/03/14 20:50, David Rowley wrote: > On 14 March 2018 at 00:37, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached is a significantly revised version of the patch, although I admit >> it could still use some work with regard to comments and other cleanup. > > Thanks for making all those changes. There's been quite a bit of churn! > > I've looked over the patch and agree that there need to be more > comments to explain things. There were certainly times during my > review that I just skipped ahead, having not understood what I'd been > looking at.

Hope the attached version is easier to understand.

> Here are the notes from my review: > > 1. I think get_unpruned_partitions is not a great name for this > function. get_matching_partitions would be better.

OK, I changed it to get_matching_partitions.

> 2. HASH partitioning can no longer prune non-matching partitions in > cases like: SELECT * FROM hashp WHERE partkey1 IS NULL and partkey2 IS > NULL; This might mean you need to process IS NULL and the key clauses > in the same step.

OK, I made PartitionPruneStepOp contain a bitmap of null key indexes. It is consulted along with "values" for other columns only in the hash partitioning case, whereas for list and range partitioning, actual value(s) and null key information are passed via separate PartitionPruneStepOp nodes, because in the latter case we cannot mix nulls and "values". So, now there is no need for PartitionPruneStepNullness and get_partitions_for_null_keys().

> I see you have a test which checks the plan for
>
> explain (costs off) select * from hp where a = 1 and b is null;
>
> which ensures all partitions are included. Why?

I had intended to remove the support for hash partition pruning with some keys being null that existed in the previous versions of the patch, but after reading your comment I thought that was too pessimistic. I've added it back, and it works like it used to in the previous versions.

> 3. The header comment for get_partitions_for_keys references 'keys', which > is not a parameter to that function.

Oops, fixed.

> 4. If you wrote a function to process an individual step and called > that in a foreach loop inside get_unpruned_partitions, then you > wouldn't need to list_make1() here. Instead just call the function to > perform one step.
>
> argparts = get_unpruned_partitions(context, list_make1(step));

Hmm, yes. I've introduced a perform_pruning_step() that switches on the pruning step node type, which is called either by get_matching_partitions() while iterating over the list it receives, or by perform_pruning_combine_step() on the arguments of a PartitionPruneStepCombine.

> 5. I see you're performing AND step combination in two different ways. > In perform_pruning_combine_step you do: > > andparts = andparts == NULL ? argparts : bms_int_members(andparts, argparts); > > but in get_unpruned_partitions, you add the entire range to the set > using bms_add_range, then intersect on that. > > The code seems a bit fragile and relies on get_unpruned_partitions > returning an allocated Bitmapset all the time, even if bms_is_empty() > is true. There should be no distinction between NULL and a Bitmapset > that returns true on bms_is_empty(). > > What you've got here probably works for now, but only due to the fact > that get_unpruned_partitions allocates the entire range with > bms_add_range and that bms_int_members does not return NULL when there > are no matching members, provided both input sets are non-NULL. If that were to > change, then your code would misbehave in cases like: > > WHERE <matches no partition> AND <matches a partition>; > > When processing the <matches no partition> clause, no partitions would > match; then, if that resulted in an empty set, you'd surprisingly > match partitions again, despite the AND clause not actually making it > possible for any partitions to have matched. > > Probably you just need to bms_add_range in > perform_pruning_combine_step too and perform bms_int_members > unconditionally, just like you're doing in get_unpruned_partitions.

I noticed a number of things I could improve about this, also considering your points above. Please check if the new structure is an improvement.

> 6. The switch (last->op_strategy) in > generate_partition_pruning_steps_internal is missing a default ERROR > for an unknown strategy.

I've fixed that.

> 7. The switch (context->strategy) in > generate_partition_pruning_steps_internal should ERROR rather than > break when it sees an unknown partition strategy.

This one too.

> 8. Instead of copying the opclause here, wouldn't it be better to have > this code come after you palloc the PartClause, and then just set up > the PartClause with the negator directly?
>
> if (*is_neop_listp)
> {
>     Assert(OidIsValid(negator));
>     opclause = copyObject(opclause);
>     opclause->opno = negator;
> }

Agreed, done.

> 9. PartitionedChildRelInfo is still mentioned in typedefs.list.

Removed.

> 10. I don't quite understand PartitionPruneStepNoop. Why can't you just > skip adding anything to the list in > generate_partition_pruning_steps_internal?
>
> 11. The following test does not make sense:
>
> explain (costs off) select * from hp where (a = 10 and b = 'yyy') or
> (a = 10 and b = 'xxx') or (a is null and b is null);
>                                                         QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Append
>    ->  Seq Scan on hp0
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp1
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp2
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>    ->  Seq Scan on hp3
>          Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> (9 rows)
>
> Why do 4 partitions match when there are only 3 sets of clauses, when
> each one can only match a single partition?

After bringing back the support for hash partition pruning even with IS NULL clauses, this works as you'd expect.

> 12. What does "root" mean here?
>
> -- case for list partitioned table that's not root
> explain (costs off) select * from rlp where a = 15 and b <> 'ab' and b
> <> 'cd' and b <> 'xy' and b is not null;

It means a partitioned table that is not the root partitioned table. Those <> clauses do prune, but not at the root level; only after recursing to a list-partitioned child of rlp.

Attached updated patches. Thanks, Amit
Attachment
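PS: in case it helps review, the revised op step node looks roughly like the following. This is an abridged sketch with approximate field names; the patch has the authoritative definition:

typedef struct PartitionPruneStepOp
{
    PartitionPruneStep step;

    StrategyNumber opstrategy;  /* btree/hash strategy of the clauses */
    List       *exprs;          /* values to compare to partition bounds */
    Bitmapset  *nullkeys;       /* indexes of keys constrained by IS NULL */
} PartitionPruneStepOp;

The nullkeys bitmap is consulted along with exprs only for hash partitioning; for list and range partitioning, values and null key information always arrive in separate steps.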
On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Hope the attached version is easier to understand.

Hi Amit, Thanks for making the updates. I'll look at them soon.

I've been thinking about how we're making these improvements for SELECT only. If planning for an UPDATE or DELETE of a partitioned table, then, since the inheritance planner is planning each partition individually, we gain next to nothing from this patch. Generally, it seems the aim of this patch is to improve the usability of partitions in an OLTP-type workload; OLAP most likely does not matter as much, since planner overhead in that case is generally less of a concern.

I experimented with the attached small patch to see if the situation could be improved if we first plan the entire query with all partitions and then ignore dummy rels when planning for each individual partition. I used something along the lines of:

# create table listp (a int, b int) partition by list(a);
# select 'create table listp'||x||' partition of listp for values in('||x||');' from generate_series(1, <number of tables>)x;
$ echo explain update listp set b = 1 where a = 1; > bench.sql
$ pgbench -f bench.sql -n -T 30 postgres

where <number of tables> started at 1 and went up in powers of 2 until 1024.

Unpatched = your v35 patch
Patched = your v35 + the attached.

The TPS result from a 30-second pgbench run of the above query showed:

Partitions = 1
Unpatched: 7323.3
Patched: 6573.2 (-10.24%)

Partitions = 2
Unpatched: 6784.8
Patched: 6377.1 (-6.01%)

Partitions = 4
Unpatched: 5903.0
Patched: 6106.8 (3.45%)

Partitions = 8
Unpatched: 4582.0
Patched: 5579.9 (21.78%)

Partitions = 16
Unpatched: 3131.5
Patched: 4521.2 (44.38%)

Partitions = 32
Unpatched: 1779.8
Patched: 3387.8 (90.35%)

Partitions = 64
Unpatched: 821.9
Patched: 2245.4 (173.18%)

Partitions = 128
Unpatched: 322.2
Patched: 1319.6 (309.56%)

Partitions = 256
Unpatched: 84.3
Patched: 731.7 (768.27%)

Partitions = 512
Unpatched: 22.5
Patched: 382.8 (1597.74%)

Partitions = 1024
Unpatched: 5.5
Patched: 150.1 (2607.83%)

Which puts the crossover point at just 4 partitions, with just a small overhead for 1, 2 and probably 3 partitions. The planner generated a plan 26 times faster (!) with 1024 partitions.

Likely there's more that could be squeezed out of this if we could get the grouping_planner() to somehow skip creating paths and performing the join search, but that patch is not nearly as simple as the attached. Probably grouping_planner could also be called with inheritance_update = false for this one case too, which might save a small amount of effort.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
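PS: the gist of the attached PoC, in rough pseudo-C (the real diff differs in the details), is to let the per-child planning loop in inheritance_planner() skip children that the initial all-partitions planning pass proved dummy:

foreach(lc, root->append_rel_list)
{
    AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
    RelOptInfo *childrel = find_base_rel(root, appinfo->child_relid);

    /* pruned in the initial pass, so don't plan it individually */
    if (IS_DUMMY_REL(childrel))
        continue;

    /* ... existing per-child planning ... */
}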
Hi David. On 2018/03/19 16:18, David Rowley wrote: > On 17 March 2018 at 01:55, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Hope the attached version is easier to understand. > > Hi Amit, > > Thanks for making the updates. I'll look at them soon. > > I've been thinking about how we're making these improvements for > SELECT only. If planning for an UPDATE or DELETE of a partitioned > table then since the inheritance planner is planning each partition > individually we gain next to nothing from this patch. Unfortunately, yes. :-( Just recently, I replied to a pgsql-bugs report by someone who had OOM kill a backend running `delete from partitioned_table_with_7202_partitions` on their test system [1]. That'd be because running inheritance_planner on a partitioned table doesn't cope very well beyond a few hundred partitions, as we've also written in our partitioning/inheritance documentation. > Generally, it seems the aim of this patch is to improve the usability > of partitions in an OLTP type workload, most likely OLAP does not > matter as much since planner overhead, in that case, is generally less > of a concern. Yes, that makes sense. > I experimented with the attached small patch to see if the situation > could be improved if we first plan the entire query with all > partitions then ignore dummy rels when planning for each individual > partition. > > I used something along the lines of: > > # create table listp (a int, b int) partition by list(a); > # select 'create table listp'||x||' partition of listp for values > in('||x||');' from generate_series(1, <number of tables>)x; > $ echo explain update listp set b = 1 where a = 1; > bench.sql > $ pgbench -f bench.sql -n -T 30 postgres > > where <number of tables> started at 1 and went up in powers of 2 until 1024. > > Unpatched = your v35 patch > Patched = your v35 + the attached. > > The TPS result from a 30-second pgbench run of the above query showed: > > Partitions = 1 > Unpatched: 7323.3 > Patched: 6573.2 (-10.24%) > > Partitions = 2 > Unpatched: 6784.8 > Patched: 6377.1 (-6.01%) > > Partitions = 4 > Unpatched: 5903.0 > Patched: 6106.8 (3.45%) > > Partitions = 8 > Unpatched: 4582.0 > Patched: 5579.9 (21.78%) > > Partitions = 16 > Unpatched: 3131.5 > Patched: 4521.2 (44.38%) > > Partitions = 32 > Unpatched: 1779.8 > Patched: 3387.8 (90.35%) > > Partitions = 64 > Unpatched: 821.9 > Patched: 2245.4 (173.18%) > > Partitions = 128 > Unpatched: 322.2 > Patched: 1319.6 (309.56%) > > Partitions = 256 > Unpatched: 84.3 > Patched: 731.7 (768.27%) > > Partitions = 512 > Unpatched: 22.5 > Patched: 382.8 (1597.74%) > > Partitions = 1024 > Unpatched: 5.5 > Patched: 150.1 (2607.83%) > > Which puts the crossover point at just 4 partitions, and just a small > overhead for 1, 2 and probably 3 partitions. The planner generated a > plan 26 times faster (!) with 1024 partitions. Nice! > Likely there's more than could be squeezed out of this if we could get > the grouping_planner() to somehow skip creating paths and performing > the join search. But that patch is not nearly as simple as the > attached. Yeah, that'd be nice. Do you think that we cannot fix update/delete on partitioned tables until we have such a patch though? IOW, did you intend the patch you posted to just be a PoC to demonstrate that we can save tons just by not doing grouping_planner() on pruned partitions? 
BTW, maybe you know, but if we want this to prune the same partitions as are pruned during select (due to the new pruning facility), we'd need to teach get_relation_constraints() to not fetch the partition constraint (RelationGetPartitionQual) at all. My patch currently teaches it to avoid fetching the partition constraint only for select. If we include the partition constraint in the list of constraints returned by get_relation_constraints, we'd still be redundantly executing the constraint exclusion logic for the selected partitions via the grouping_planner() call on those partitions. Thanks, Amit [1] https://www.postgresql.org/message-id/fecdef72-8c2a-0794-8e0a-2ad76db82c68@lab.ntt.co.jp
On 19 March 2018 at 23:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Just recently, I replied to a pgsql-bugs report by someone who had OOM > kill a backend running `delete from > partitioned_table_with_7202_partitions` on their test system [1]. That'd > be because running inheritance_planner on a partitioned table doesn't cope > very well beyond a few hundred partitions, as we've also written in our > partitioning/inheritance documentation.

hmm, yeah, that's unfortunate. I'd not done much study of the inheritance planner before, but I see how that could happen now that I understand a bit more about it. nparts^2 RelOptInfos will be created for such problems. My patch should help with that, provided that some pruning will actually take place, but it would make the problem very slightly worse if none can be pruned.

> On 2018/03/19 16:18, David Rowley wrote: >> Likely there's more that could be squeezed out of this if we could get >> the grouping_planner() to somehow skip creating paths and performing >> the join search. But that patch is not nearly as simple as the >> attached. > > Yeah, that'd be nice. Do you think that we cannot fix update/delete on > partitioned tables until we have such a patch though? IOW, did you intend > the patch you posted to just be a PoC to demonstrate that we can save tons > just by not doing grouping_planner() on pruned partitions?

The patch was meant as a PoC. I think the performance of the patch is acceptable without any additional optimisation work. More would be nice, but any more code that's added would need more reviewer and committer time, both of which are finite, especially so before the PG11 code cutoff. I think it would be a shame to tell people partitioning is usable now for a decent number of partitions, provided you don't need to perform any OLTP UPDATE/DELETE operations on the partitions. I think, for the few lines of code that the proposed patch takes, it's worth considering for PG11, but only once your work has gone in. I certainly wouldn't want this to hold your work back.

> BTW, maybe you know, but if we want this to prune the same partitions as are > pruned during select (due to the new pruning facility), we'd need to teach > get_relation_constraints() to not fetch the partition constraint > (RelationGetPartitionQual) at all. My patch currently teaches it to avoid > fetching the partition constraint only for select. If we include the > partition constraint in the list of constraints returned by > get_relation_constraints, we'd still be redundantly executing the > constraint exclusion logic for the selected partitions via the > grouping_planner() call on those partitions.

I'd not thought of that. It seems more like a performance optimisation than something that's required for correctness. Removing that would probably make constraint_exclusion = 'partition' pretty useless.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/03/16 21:55, Amit Langote wrote: > Attached updated patches. Attached is further revised version. Of note is getting rid of PartitionPruneContext usage in the static functions of partprune.c. Most of the code there ought to only run during planning, so it can access the necessary information from RelOptInfo directly instead of copying it to PartitionPruneContext and then passing it around. Thanks, Amit
Attachment
On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached is further revised version.

In the 0004 patch I see:

@@ -1439,6 +1441,10 @@ inheritance_planner(PlannerInfo *root)
         if (IS_DUMMY_PATH(subpath))
             continue;

+        /* Add the current parent's RT index to the partitioned rels set. */
+        partitioned_relids = bms_add_member(partitioned_relids,
+                                            appinfo->parent_relid);

This seems to execute regardless of whether the target relation is a partitioned table or an inheritance parent. I think there needs to be a condition so that you only do this when planning for partitioned tables.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 20, 2018 at 7:07 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/03/16 21:55, Amit Langote wrote: >> Attached updated patches. > > Attached is further revised version. > > Of note is getting rid of PartitionPruneContext usage in the static > functions of partprune.c. Most of the code there ought to only run during > planning, so it can access the necessary information from RelOptInfo > directly instead of copying it to PartitionPruneContext and then passing > it around.

+ if (rte->relkind == RELKIND_PARTITIONED_TABLE)
+ {
+     if (rel->baserestrictinfo != NIL)
+     {
+         live_children = prune_append_rel_partitions(root, rel);
+         did_pruning = true;
+     }
+ }

Use &&.

+ case COMBINE_OR:
+     {

Won't survive pgindent, which currently produces a *massive* diff for these patches.

+ /*
+  * XXX- The following ad-hoc method of pruning only works for list
+  * partitioning.  It checks for each partition if all of its
+  * accepted values appear in ne_datums[].
+  */

So why are we doing it this way? How about doing something not ad-hoc? I tried to propose that before.

+ * Set *value to the constant value obtained by evaluating 'expr'
+ *
+ * Note that we may not be able to evaluate the input expression, in which
+ * case, the function returns false to indicate that *value has not been
+ * set.  True is returned otherwise.

These comments need updating, since this function (laudably) no longer does any evaluating. I wonder how this will work for run-time pruning, though.

+ if (context->partopcintype[partkeyidx] != exprTyp)
+ {
+     Oid new_supfuncid;
+     int16 procnum;
+
+     procnum = (context->strategy == PARTITION_STRATEGY_HASH)
+         ? HASHEXTENDED_PROC
+         : BTORDER_PROC;
+     new_supfuncid = get_opfamily_proc(context->partopfamily[partkeyidx],
+                                       context->partopcintype[partkeyidx],
+                                       exprTyp, procnum);
+     fmgr_info(new_supfuncid, &context->partsupfunc[partkeyidx]);
+ }

What's the point of this, exactly? Leftover dead code, maybe?

+ * Input:
+ *  See the comments above the definition of PartScanKeyInfo to see what
+ *  kind of information is contained in 'keys'.

There's no such thing as PartScanKeyInfo any more, and the function has no argument called 'keys'. None of the function's actual arguments are explained.

+ /*
+  * If there are multiple pruning steps, we perform them one after another,
+  * passing the result of one step as input to another.  Based on the type
+  * of pruning step, perform_pruning_step may add or remove partitions from
+  * the set of partitions it receives as the input.
+  */

The comment sounds great, but the code doesn't work that way; it always calls bms_int_members to intersect the new result with any previous result. I'm baffled as to how this manages to DTRT if COMBINE_OR is used. In general I had hoped that the list of pruning steps was something over which we were only going to iterate, not recurse. This definitely recurses for the combine steps, but it's still (sorta) got the idea of a list of iterable steps. That's a weird mix.

+ if (nvalues == context->partnatts)
+ {
+     greatest_modulus = get_greatest_modulus(boundinfo);
+     rowHash = compute_hash_value(partnatts, partsupfunc, values,
+                                  isnull);
+     result_index = partindices[rowHash % greatest_modulus];
+     if (result_index >= 0)
+         return bms_make_singleton(result_index);
+ }
+ else
+     /* Can't do pruning otherwise, so return all partitions. */
+     return bms_add_range(NULL, 0, context->nparts - 1);

Wouldn't we want to (1) arrange things so that this function is never called if nvalues < context->partnatts && context->strategy == PARTITION_STRATEGY_HASH, or at least (2) avoid constructing isnull from nullkeys if we're not going to use it? Also, shouldn't we be sanity-checking the strategy number here?

I'm out of time for right now, but it looks to me like this patch still needs quite a bit of fine-tuning.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/03/20 21:41, David Rowley wrote: > On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached is further revised version. > > In the 0004 patch I see: > > @@ -1439,6 +1441,10 @@ inheritance_planner(PlannerInfo *root) > if (IS_DUMMY_PATH(subpath)) > continue; > > + /* Add the current parent's RT index to the partitioned rels set. */ > + partitioned_relids = bms_add_member(partitioned_relids, > + appinfo->parent_relid); > > This seems to execute regardless of whether the target relation is a > partitioned table or an inheritance parent. I think there needs to be > a condition so that you only do this when planning for partitioned tables. Oops, that's quite wrong. Will fix, thanks. Regards, Amit
On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached is further revised version.

Hi Amit, Thanks for sending the v38 patch.

I started reviewing this, but I just ended up starting to hack at the patch instead. It's still got quite a bit of work to be done since, unfortunately, I think the cross type stuff is still pretty broken. There's not really any sort of checking done to make sure you've found a valid cross type hash or compare function in the code, which results in errors like:

create table hashp (a int, b numeric) partition by hash(a,b);
create table hashp1 partition of hashp for values with (modulus 4, remainder 0);
create table hashp2 partition of hashp for values with (modulus 4, remainder 1);
create table hashp3 partition of hashp for values with (modulus 4, remainder 2);
create table hashp4 partition of hashp for values with (modulus 4, remainder 3);
explain select * from hashp where a = 3::smallint and b = 1.0;
ERROR: cache lookup failed for function 0

I'm not really sure if this should just be a matter of doing an if (!OidIsValid(new_supfuncid)) return false; I think the context->partsupfunc must be pretty broken in cases like:

create table listp (a bigint) partition by list(a);
create table listp1 partition of listp for values in(1);
select * from listp where a <> 1::smallint and a <> 1::bigint;

The current patch simply just remembers the last comparison function for comparing int8 to int4 and uses that one for the int8 to int2 comparison too.

Probably we need to cache the comparison function's Oid in with the Expr in the step and use the correct one each time. I'm unsure of how the fmgr info should be cached, but it looks like it certainly cannot be cached in the context in an array per partition key. I've so far only thought of some sort of hash table, but I'm sure there must be a much better way to do this.

I started hacking at partition.c and ended up changing quite a few things. I changed get_partitions_for_keys into 3 separate functions, one each for hash, list and range, and tidied a few things up in that area. There were a few bugs, for example passing the wrong value for the size of the array into get_partitions_excluded_by_ne_datums.

I also changed how the Bitmapsets are handled in the step functions and got rid of the Noop step type completely. I also got rid of the passing of the srcparts into these functions. I think Robert's idea is to process the steps in isolation and just combine the partitions matching each step.

It would be great if we could coordinate our efforts here. I'm posting this patch now just in case you're working or about to work on this. In the meantime, I'll continue to drip feed cleanup patches. I'll try to start writing some comments too, once I figure a few things out...

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
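PS: at a minimum, I think the cross-type lookup needs to check the returned OID before use, something like the following (untested sketch; variable names are hypothetical):

Oid cmpfn;

cmpfn = get_opfamily_proc(partopfamily[keyno],
                          partopcintype[keyno],
                          exprtype,
                          (strategy == PARTITION_STRATEGY_HASH)
                          ? HASHEXTENDED_PROC : BTORDER_PROC);

/* not every opfamily provides every cross-type support function */
if (!OidIsValid(cmpfn))
    return false;       /* this clause can't be used for pruning */

That at least turns the "cache lookup failed for function 0" error into pruning simply not happening for that clause.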
Hi David. On 2018/03/23 16:38, David Rowley wrote: > On 21 March 2018 at 00:07, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Attached is further revised version. > > Hi Amit, > > Thanks for sending the v38 patch. > > I started reviewing this, but I just ended up starting to hack at the > patch instead. It's still got quite a bit of work to be done since, > unfortunately, I think the cross type stuff is still pretty broken. > There's not really any sort of checking done to make sure you've found > a valid cross type hash or compare function in the code, which results > in errors like: > > create table hashp (a int, b numeric) partition by hash(a,b); > create table hashp1 partition of hashp for values with (modulus 4, remainder 0); > create table hashp2 partition of hashp for values with (modulus 4, remainder 1); > create table hashp3 partition of hashp for values with (modulus 4, remainder 2); > create table hashp4 partition of hashp for values with (modulus 4, remainder 3); > explain select * from hashp where a = 3::smallint and b = 1.0; > ERROR: cache lookup failed for function 0

Hmm, yes. I had realized that while addressing Robert's related comment.

> I'm not really sure if this should just be a matter of doing an if > (!OidIsValid(new_supfuncid)) return false; I think the > context->partsupfunc must be pretty broken in cases like: > > create table listp (a bigint) partition by list(a); > create table listp1 partition of listp for values in(1); > select * from listp where a <> 1::smallint and a <> 1::bigint; > > The current patch simply just remembers the last comparison function > for comparing int8 to int4 and uses that one for the int8 to int2 > comparison too. > > Probably we need to cache the comparison function's Oid in with the > Expr in the step and use the correct one each time. I'm unsure of how > the fmgr info should be cached, but it looks like it certainly cannot be > cached in the context in an array per partition key. I've so far only > thought of some sort of hash table, but I'm sure there must be a much > better way to do this.

Yeah, I realized that simply replacing the context->partsupfunc member is not a solution. In the updated patch (that is, after incorporating your changes), I have moved this partsupfunc switching to the caller of partkey_datum_from_expr instead of doing it there. The new patch also checks that the returned function OID is valid; if it is not, we don't use the expression's value for pruning.

So now, we statically allocate a partsupfunc array on every invocation of perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums(). Considering run-time pruning, we may have to find some other place to cache that.

> I started hacking at partition.c and ended up changing quite a few > things. I changed get_partitions_for_keys into 3 separate functions, > one each for hash, list and range, and tidied a few things up in that area. > There were a few bugs, for example passing the wrong value for the > size of the array into get_partitions_excluded_by_ne_datums. > > I also changed how the Bitmapsets are handled in the step functions > and got rid of the Noop step type completely. I also got rid of the > passing of the srcparts into these functions. I think Robert's idea is > to process the steps in isolation and just combine the partitions > matching each step. > > It would be great if we could coordinate our efforts here. I'm posting > this patch now just in case you're working or about to work on this.

Thanks a lot for making all those changes and sharing the patch. I've incorporated it in the attached latest version.

> In the meantime, I'll continue to drip feed cleanup patches. I'll try > to start writing some comments too, once I figure a few things out...

Here is the updated version. I'm still thinking about what to do about avoiding recursion when performing combine steps [1], as Robert mentioned in his email. Thanks, Amit
Attachment
Thanks for the review. On 2018/03/21 6:29, Robert Haas wrote: > On Tue, Mar 20, 2018 at 7:07 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/03/16 21:55, Amit Langote wrote: >>> Attached updated patches. >> >> Attached is further revised version. >> >> Of note is getting rid of PartitionPruneContext usage in the static >> functions of partprune.c. Most of the code there ought to only run during >> planning, so it can access the necessary information from RelOptInfo >> directly instead of copying it to PartitionPruneContext and then passing >> it around.

> + if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> + {
> +     if (rel->baserestrictinfo != NIL)
> +     {
> +         live_children = prune_append_rel_partitions(root, rel);
> +         did_pruning = true;
> +     }
> + }
>
> Use &&

Fixed in the latest version.

> + case COMBINE_OR:
> +     {
>
> Won't survive pgindent, which currently produces a *massive* diff for > these patches.

That's gone in the latest patch. Things should have improved for the overall patch in the latest version.

> + /*
> +  * XXX- The following ad-hoc method of pruning only works for list
> +  * partitioning.  It checks for each partition if all of its
> +  * accepted values appear in ne_datums[].
> +  */
>
> So why are we doing it this way? How about doing something not > ad-hoc? I tried to propose that before.

Hmm, perhaps I should have written a better comment. What I really meant to say is that pruning using <> operators can be implemented sanely only for list partitions. We can prune a given partition with <> clauses only if we can find a <> clause for *all* values that the partition accepts. That seems doable only for list partitioning, where we require enumerating all values that a partition may contain.

> + * Set *value to the constant value obtained by evaluating 'expr'
> + *
> + * Note that we may not be able to evaluate the input expression, in which
> + * case, the function returns false to indicate that *value has not been
> + * set.  True is returned otherwise.
>
> These comments need updating, since this function (laudably) no longer > does any evaluating. I wonder how this will work for run-time > pruning, though.

Fixed the comment. The run-time pruning patch adds some code to this function to be able to return values for Params; for that, afaik, it also adds some state information to the PartitionPruneContext argument.

> + if (context->partopcintype[partkeyidx] != exprTyp)
> + {
> +     Oid new_supfuncid;
> +     int16 procnum;
> +
> +     procnum = (context->strategy == PARTITION_STRATEGY_HASH)
> +         ? HASHEXTENDED_PROC
> +         : BTORDER_PROC;
> +     new_supfuncid = get_opfamily_proc(context->partopfamily[partkeyidx],
> +                                       context->partopcintype[partkeyidx],
> +                                       exprTyp, procnum);
> +     fmgr_info(new_supfuncid, &context->partsupfunc[partkeyidx]);
> + }
>
> What's the point of this, exactly? Leftover dead code, maybe?

Actually, this *was* an effort to teach the patch to use the correct comparison function for comparisons against partition bounds when the clause value is of a different type. After reading David's comment about this, I concluded that it was in the wrong place, which is fixed in the latest patch. The comparison functions are changed (if needed) in the function that calls partkey_datum_from_expr, not in partkey_datum_from_expr itself.

> + * Input:
> + *  See the comments above the definition of PartScanKeyInfo to see what
> + *  kind of information is contained in 'keys'.
>
> There's no such thing as PartScanKeyInfo any more, and the function has > no argument called 'keys'. None of the function's actual arguments are > explained.

Sorry, this should be gone in the latest patches.

> + /*
> +  * If there are multiple pruning steps, we perform them one after another,
> +  * passing the result of one step as input to another.  Based on the type
> +  * of pruning step, perform_pruning_step may add or remove partitions from
> +  * the set of partitions it receives as the input.
> +  */
>
> The comment sounds great, but the code doesn't work that way; it > always calls bms_int_members to intersect the new result with any > previous result. I'm baffled as to how this manages to DTRT if > COMBINE_OR is used. In general I had hoped that the list of pruning > steps was something over which we were only going to iterate, not > recurse. This definitely recurses for the combine steps, but it's > still (sorta) got the idea of a list of iterable steps. That's a > weird mix.

At the top level (in get_matching_partitions), it is assumed that the steps in the input list come from implicitly AND'd clauses, so we take the intersection of the partition sets that we get for each. Anyway, with David's rewrite of this portion of the patch incorporated in the latest version, things look a bit different here, although there is still recursion for combine steps. I'm still considering how to make the recursion go away.

> + if (nvalues == context->partnatts)
> + {
> +     greatest_modulus = get_greatest_modulus(boundinfo);
> +     rowHash = compute_hash_value(partnatts, partsupfunc, values,
> +                                  isnull);
> +     result_index = partindices[rowHash % greatest_modulus];
> +     if (result_index >= 0)
> +         return bms_make_singleton(result_index);
> + }
> + else
> +     /* Can't do pruning otherwise, so return all partitions. */
> +     return bms_add_range(NULL, 0, context->nparts - 1);
>
> Wouldn't we want to (1) arrange things so that this function is never > called if nvalues < context->partnatts && context->strategy == > PARTITION_STRATEGY_HASH or at least (2) avoid constructing isnull from > nullkeys if we're not going to use it?

We call this function even if nvalues < context->partnatts, as long as we found IS NULL clauses for *all* the remaining columns. Although, given the checks that the planner (partprune.c) performs, we should get here only if pruning is possible, so the code to handle cases where pruning couldn't occur is redundant.

> Also, shouldn't we be sanity-checking the strategy number here?

That's right; fixed.

> I'm out of time for right now but it looks to me like this patch still > needs quite a bit of fine-tuning.

I have posted an updated patch in reply to David's review. Thanks, Amit
On 24 March 2018 at 01:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > In the updated patch (that is, after incorporating your changes), I have > moved this partsupfunc switching to the caller of partkey_datum_from_expr > instead of doing it there. The new patch also checks that the returned function > OID is valid; if it is not, we don't use the expression's value for pruning.

Thanks for accepting those changes.

> So now, we statically allocate a partsupfunc array on every invocation of > perform_pruning_base_step() or of get_partitions_excluded_by_ne_datums(). > Considering run-time pruning, we may have to find some other place to > cache that.

hmm yeah, it's not perfect, but I don't have any better ideas for now, other than that this could probably be done when creating the steps rather than when executing them. That would save having to look up the correct function Oid during execution, and save bothering to create step values that we simply can't compare to the partition key. I've done this in the attached patch against v39.

I also renamed argvalues to argexprs, since they're not values. The PartClauseInfo could probably do with the same change too, but I didn't touch it.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 24 March 2018 at 16:42, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've done this in the attached patch against v39. > > I also renamed argvalues to argexprs, since they're not values. The > PartClauseInfo could probably do with the same change too, but I > didn't touch it.

The attached goes a little further and does a bit more renaming. I don't think "values" is a good name for a list of Exprs; I'd expect that to be a better-suited name for an array of Datums.

I've also added and modified a few comments. More comments are still required. The Step structs are mostly undocumented still, but I'm trying to understand how all this fits together, at least well enough to write about it.

The attached delta applies on top of v39 plus delta1.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote: > The attached delta applies on top of v39 plus delta1. Sorry, the attached should do this. Ignore the last attachment. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi David. On 2018/03/25 14:32, David Rowley wrote: > On 25 March 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote: >> The attached delta applies on top of v39 plus delta1. > > Sorry, the attached should do this. Ignore the last attachment.

I have incorporated both of your delta1 and delta2_1 patches.

Your proposed change to determine the cross-type comparison function OID during planning itself is a good one, although I wasn't sure why it was done only for the <> operators. I also implemented that for PartitionPruneStepOp steps.

Also, I started thinking that implementing pruning using <> operators with a PartitionPruneCombineOp was not such a great idea. That required us to add argexprs and argcmpfns to that struct, which seemed a bit odd. I defined a new pruning node type called PartitionPruneStepOpNe, which still seems a bit odd, but given that our support for pruning using <> is quite specialized, that may be fine.

I added a bunch of hopefully informative comments in partprune.c and for the struct definitions of pruning step nodes.

Please find attached a new version. Thanks, Amit
Attachment
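PS: for reference, the new node is essentially the following (a sketch; see the patch for the exact definition):

typedef struct PartitionPruneStepOpNe
{
    PartitionPruneStep step;

    List       *exprs;      /* the <> comparison values, one per clause */
    List       *cmpfns;     /* cross-type comparison functions to use */
} PartitionPruneStepOpNe;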
On 2018/03/23 21:19, Amit Langote wrote: > On 2018/03/21 6:29, Robert Haas wrote: >> + /* >> + * If there are multiple pruning steps, we perform them one after another, >> + * passing the result of one step as input to another. Based on the type >> + * of pruning step, perform_pruning_step may add or remove partitions from >> + * the set of partitions it receives as the input. >> + */ >> >> The comment sounds great, but the code doesn't work that way; it >> always calls bms_int_members to intersect the new result with any >> previous result. I'm baffled as to how this manages to DTRT if >> COMBINE_OR is used. In general I had hoped that the list of pruning >> steps was something over which we were only going to iterate, not >> recurse. This definitely recurses for the combine steps, but it's >> still (sorta) got the idea of a list of iterable steps. That's a >> weird mix. > > At the top level (in get_matching_partitions), it is assumed that the > steps in the input list come from implicitly AND'd clauses, so we take the > intersection of the partition sets that we get for each. > > Anyway, with David's rewrite of this portion of the patch incorporated in > the latest version, things look a bit different here, although there is > still recursion for combine steps. I'm still considering how to make the > recursion go away.

I have managed to make the recursion go away in the attached updated version. I guess that's the result of employing the idea of an "output register" for individual pruning steps, as mentioned in Robert's email upthread where he detailed the "pruning steps" approach [1].

With the new patch, pruning steps for the arguments of, say, an OR clause are not performed recursively. Instead, each pruning step is performed independently and its output is stored in a slot dedicated to it. Combine steps are always executed after all of the steps corresponding to their arguments have been executed; that's ensured by the way steps are allocated.

Jesper, off-list, reported an unused variable, which has been removed in the updated patch. Thanks Jesper! He also pointed out a case with a list-partitioned table where pruning doesn't produce the result one would expect, and that constraint exclusion would produce.

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;

explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_ad
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(7 rows)

One would expect that lp_ad is not scanned. With the implementation where a > 'a' and a < 'd' are used to prune separately, this cannot be avoided, given the way get_partitions_for_keys_list() added by the patch works. What happens is we prune first with a step generated for a > 'a', which returns partitions for all datums in the table's boundinfo greater than 'a' that have a partition assigned, which means we'll include the partition that accepts 'd'. Then, when pruning with a < 'd', we select partitions for all datums less than 'd' that have a partition assigned, which means we end up including the partition that accepts 'a'. We intersect the results of running these independent steps, but lp_ad is present in both result sets, so it ends up in the final result.

Maybe there is a way to fix that, but I haven't done anything about it yet. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoahUxagjeNeJTcJkD0rbk%2BmHTXROzWcEd%2BtZ8DuQG83cg%40mail.gmail.com
Attachment
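PS: schematically, get_matching_partitions() now does something like the following. This is a simplified sketch; in the patch the steps produce a bit more than a plain Bitmapset, since the null-only and default partition flags are tracked too, and "step_id" is an approximate field name:

Bitmapset **results = (Bitmapset **) palloc0(nsteps * sizeof(Bitmapset *));
ListCell   *lc;

foreach(lc, pruning_steps)
{
    PartitionPruneStep *step = lfirst(lc);

    if (IsA(step, PartitionPruneStepCombine))
        results[step->step_id] =
            perform_pruning_combine_step(context,
                                         (PartitionPruneStepCombine *) step,
                                         results);  /* reads argument slots */
    else
        results[step->step_id] =
            perform_pruning_base_step(context,
                                      (PartitionPruneStepOp *) step);
}

/* steps are allocated such that the last one combines everything */
return results[nsteps - 1];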
On 27 March 2018 at 00:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Also, I started thinking that implementing pruning using <> operators with > a PartitionPruneCombineOp was not such a great idea. That required us to > add argexprs and argcmpfns to that struct, which seemed a bit odd. I > defined a new pruning node type called PartitionPruneStepOpNe, which still > seems a bit odd, but given that our support for pruning using <> is quite > specialized, that may be fine.

Seems better.

> I added a bunch of hopefully informative comments in partprune.c and for > the struct definitions of pruning step nodes.

Yes. That looks better.

> Please find attached a new version.

Thanks. I've made a pass over this, and I only have the attached set of fixes and the following to show for it.

1. Please add more comments in the switch statement in get_partitions_for_keys_range.

2. More an observation than anything else: I see we've lost the ability to prune range queries on LIST partitions in some cases. For example:

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;

This is just down to the new pruning step design, where we first prune on "a > 1" (which matches listp1_3 due to 3), then binary-AND that with the results of "a < 3" (which matches listp1_3 due to 1). This is a shame, but probably not the end of the world. Fixing it would likely mean moving back towards the previous design.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Amit, On 03/27/2018 06:42 AM, Amit Langote wrote: > I have managed to make the recursion go away in the attached updated > version. I guess that's the result of employing the idea of a "output > register" for individual pruning steps as mentioned in Robert's email > upthread where he detailed the "pruning steps" approach [1]. > > With the new patch, pruning steps for arguments of, say, an OR clause are > not performed recursively. Instead, each pruning step is performed > independently and its output is stored in a slot dedicated to it. Combine > steps are always executed after all of the steps corresponding to its > arguments have been executed. That's ensured by the way steps are allocated. > Running v41 with "partition_prune" under valgrind gives the attached report. Best regards, Jesper
Attachment
Amit Langote wrote: > [Jesper] also pointed out a case with a > list-partitioned table where pruning doesn't produce the result one > would expect, and that constraint exclusion would produce. > > create table lp (a char) partition by list (a); > create table lp_ad partition of lp for values in ('a', 'd'); > create table lp_bc partition of lp for values in ('b', 'c'); > create table lp_default partition of lp default; > explain (costs off) select * from lp where a > 'a' and a < 'd'; > QUERY PLAN > ----------------------------------------------------------- > Append > -> Seq Scan on lp_ad > Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) > -> Seq Scan on lp_bc > Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) > -> Seq Scan on lp_default > Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) > (7 rows) > > One would expect that lp_ad is not scanned. One would? I, for one, wouldn't particularly sweat over this case TBH. It seems a pretty silly case. If this works for "a <> 'a' and a <> 'd'" (I mean, lp_ad is pruned for that qual), that sounds sufficient to me. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Amit Langote wrote: >> [Jesper] also pointed out a case with a >> list-partitioned table where pruning doesn't produce the result one >> would expect, and that constraint exclusion would produce. >> >> create table lp (a char) partition by list (a); >> create table lp_ad partition of lp for values in ('a', 'd'); >> create table lp_bc partition of lp for values in ('b', 'c'); >> create table lp_default partition of lp default; >> explain (costs off) select * from lp where a > 'a' and a < 'd'; >> QUERY PLAN >> ----------------------------------------------------------- >> Append >> -> Seq Scan on lp_ad >> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) >> -> Seq Scan on lp_bc >> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) >> -> Seq Scan on lp_default >> Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar)) >> (7 rows) >> >> One would expect that lp_ad is not scanned. > One would? I, for one, wouldn't particularly sweat over this case TBH. That example works in HEAD, so if somebody is proposing a patch that breaks it, seems like that needs investigation. regards, tom lane
Hi, On 03/27/2018 01:46 PM, Jesper Pedersen wrote: > Running v41 with "partition_prune" under valgrind gives the attached > report. > The reports mostly involve interaction with catcache.c and dynahash.c, so something for a separate thread. Best regards, Jesper
On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I have managed to make the recursion go away in the attached updated > version. I guess that's the result of employing the idea of a "output > register" for individual pruning steps as mentioned in Robert's email > upthread where he detailed the "pruning steps" approach [1]. Thanks for making that work. I've only glanced at the patch, and not taken enough time to understand how the new parts work yet. In the meantime, I've attached some fixes for v41 which I previously submitted for v40. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2018/03/28 12:58, David Rowley wrote: > On 27 March 2018 at 23:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I have managed to make the recursion go away in the attached updated >> version. I guess that's the result of employing the idea of an "output >> register" for individual pruning steps, as mentioned in Robert's email >> upthread where he detailed the "pruning steps" approach [1]. > > Thanks for making that work. I've only glanced at the patch, and not > taken enough time to understand how the new parts work yet. > > In the meantime, I've attached some fixes for v41 which I previously > submitted for v40.

Thank you. I've merged it.

Also, I have redesigned how we derive partition indexes after running pruning steps. Previously, for each step we'd determine the indexes of "partitions" that are not pruned, leading to a list partition not being pruned sometimes, as shown in the two recent examples. Instead, in the new approach, we only keep track of the indexes of the "datums" that satisfy individual pruning steps (both base pruning steps and combine steps) and only figure out the partition indexes after we've determined the set of datums that survive all pruning steps, that is, after we're done executing all pruning steps. Whether we need to scan special partitions like the null-only and default partitions is tracked along with the datum indexes for each step. With this change, pruning works as expected in both examples:

create table lp (a char) partition by list (a);
create table lp_ad partition of lp for values in ('a', 'd');
create table lp_bc partition of lp for values in ('b', 'c');
create table lp_default partition of lp default;

explain (costs off) select * from lp where a > 'a' and a < 'd';
                        QUERY PLAN
-----------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a > 'a'::bpchar) AND (a < 'd'::bpchar))
(5 rows)

CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN (1, 3);

EXPLAIN SELECT * FROM listp WHERE a > 1 AND a < 3;
                 QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.00 rows=0 width=4)
   One-Time Filter: false
(2 rows)

Moreover, with pruning now working at a high level with datum indexes instead of partition indexes, pruning for PartitionPruneStepOpNe is simplified greatly. We simply start with a bitmapset containing the indexes of all datums in boundinfo and delete the indexes of those that appear in the query. So:

explain (costs off) select * from lp where a <> 'a' and a <> 'd';
                         QUERY PLAN
-------------------------------------------------------------
 Append
   ->  Seq Scan on lp_bc
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
   ->  Seq Scan on lp_default
         Filter: ((a <> 'a'::bpchar) AND (a <> 'd'::bpchar))
(5 rows)

where we delete the indexes of 'a' and 'd' from the bitmapset initially containing the indexes of all datums, leaving us with only those of 'b' and 'c'. Also, the default partition is scanned, as it always would be for a PartitionPruneStepOpNe step.

Attached is the updated set of patches, which contains other miscellaneous changes such as updated comments, besides the main changes described above. Regards, Amit
Attachment
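PS: the final translation from surviving datum indexes to partition indexes boils down to the following (a sketch; "final_result" and its field names are approximate):

Bitmapset  *partindexes = NULL;
int         i = -1;

while ((i = bms_next_member(final_result->bound_offsets, i)) >= 0)
{
    int     partidx = boundinfo->indexes[i];

    /* range bound datums without an assigned partition have index -1 */
    if (partidx >= 0)
        partindexes = bms_add_member(partindexes, partidx);
}

if (final_result->scan_null)
    partindexes = bms_add_member(partindexes, boundinfo->null_index);
if (final_result->scan_default)
    partindexes = bms_add_member(partindexes, boundinfo->default_index);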
On 2018/03/28 18:29, Amit Langote wrote: > Attached is the updated set of patches, which contains other miscellaneous > changes such as updated comments, besides the main changes described above. Sorry, one of those miscellaneous changes was a typo that would cause compilation to fail... Sigh. Fixed in the updated version. Thanks, Amit
Attachment
On 28 March 2018 at 22:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Also, I have redesigned how we derive partition indexes after running > pruning steps. Previously, for each step we'd determine the indexes of > "partitions" that are not pruned leading to a list partition not being > pruned sometimes, as shown in the two recent examples. Instead, in the > new approach, we only keep track of the indexes of the "datums" that > satisfy individual pruning steps (both base pruning steps and combine > steps) and only figure out the partition indexes after we've determined > set of datums that survive all pruning steps. That is, after we're done > executing all pruning steps. Whether we need to scan special partitions > like null-only and default partition is tracked along with datum indexes > for each step. With this change, pruning works as expected in both examples: Smart thinking! Good to see that solved. I'll try to look at v43 during my working day tomorrow, in around 9 hours time. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 03/28/2018 06:30 AM, Amit Langote wrote: > On 2018/03/28 18:29, Amit Langote wrote: >> Attached is the updated set of patches, which contains other miscellaneous >> changes such as updated comments, besides the main changes described above. > > Sorry, one of those miscellaneous changes was a typo that would cause > compilation to fail... Sigh. Fixed in the updated version.

Just some trivial changes. However,

explain (costs off) select * from mc2p where a = 2 and b < 1;

is picking up

   ->  Seq Scan on mc2p2
         Filter: ((b < 1) AND (a = 2))

which doesn't seem right, as its definition is

create table mc2p2 partition of mc2p for values from (1, 1) to (2, minvalue);

Best regards, Jesper
Attachment
On 2018/03/29 1:41, Jesper Pedersen wrote: > Just some trivial changes.

Thanks Jesper. Merged.

> However, > > explain (costs off) select * from mc2p where a = 2 and b < 1; > > is picking up > > -> Seq Scan on mc2p2 > Filter: ((b < 1) AND (a = 2)) > > which doesn't seem right, as its definition is > > create table mc2p2 partition of mc2p for values from (1, 1) to (2, minvalue);

Yeah, that wasn't right. It boiled down to some code in the range partition pruning function considering a tuple containing a = 2 to fall in this partition, which is wrong because the minvalue in its upper bound makes the partition exclusive of any tuples with a = 2. Fixed that.

Besides fixing that, I have decided to get rid of PartitionPruneStepOpNe (a special kind of base pruning step that was being used to prune list partitions using a set of <> operator clauses) and related functions. Instead, pruning for <> operator clauses is now implemented using a combination of PartitionPruneStepOp and PartitionPruneStepCombine, after adding a new combine op COMBINE_INVERT (I also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and COMBINE_INTERSECT, respectively). I decided to do so because the previous arrangement looked like a "hack" to support a special case and touched quite a few places.

Attached find the updated version of the patches. Thanks, Amit
Attachment
On 29 March 2018 at 21:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Besides fixing that, I have decided to get rid of > PartitionPruneStepOpNe (a special kind of base pruning step that was > being used to prune list partitions using a set of <> operator clauses) > and related functions. Instead, pruning for <> operator clauses is now > implemented by using a combination of PartitionPruneStepOp and > PartitionPruneStepCombine after adding a new combine op COMBINE_INVERT (I > also renamed COMBINE_OR and COMBINE_AND to COMBINE_UNION and > COMBINE_INTERSECT, respectively). I decided to do so because the previous > arrangement looked like a "hack" to support a special case and touched > quite a few places. Hi Amit, I've looked at the v44 patch. Thanks for making those changes. The new not-equal handling code is not quite right.

DROP TABLE listp;
CREATE TABLE listp (a INT) PARTITION BY LIST(a);
CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3);
CREATE TABLE listp_default PARTITION OF listp DEFAULT;

EXPLAIN SELECT * FROM listp WHERE a <> 1;
                            QUERY PLAN
------------------------------------------------------------------
 Append  (cost=0.00..54.56 rows=2537 width=4)
   ->  Seq Scan on listp1_3  (cost=0.00..41.88 rows=2537 width=4)
         Filter: (a <> 1)
(3 rows)

The default should be included here.

INSERT INTO listp VALUES(1),(2),(3);
SELECT * FROM listp WHERE a <> 1;
 a
---
 3
(1 row)

This code assumes it's fine to just reverse the setting for default:

result->scan_default = !source->scan_default;

More complex handling is needed here. I've attached a diff for a small set of other things I noticed while reviewing. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, I think there's a bug in generate_pruning_steps_from_opexprs, which does this for PARTITION_STRATEGY_HASH:

    for_each_cell(lc1, lc)
    {
        pc = lfirst(lc1);

        /*
         * Note that we pass nullkeys for step_nullkeys,
         * because we need to tell hash partition bound search
         * function which of the keys are NULL.
         */
        Assert(pc->op_strategy == HTEqualStrategyNumber);
        pc_steps = get_steps_using_prefix(context,
                                          HTEqualStrategyNumber,
                                          pc->expr,
                                          pc->cmpfn,
                                          pc->keyno,
                                          nullkeys,
                                          prefix);
    }

    opsteps = list_concat(opsteps, list_copy(pc_steps));

Notice that the list_concat() is outside the for_each_cell loop. Doesn't that mean we fail to consider some of the clauses (all except the very last clause) for pruning? I haven't managed to come up with an example, but I haven't tried very hard.

FWIW I've noticed this because gcc complains that pc_steps might be used uninitialized:

partprune.c: In function ‘generate_partition_pruning_steps_internal’:
partprune.c:992:16: warning: ‘pc_steps’ may be used uninitialized in this function [-Wmaybe-uninitialized]
   opsteps = list_concat(opsteps, list_copy(pc_steps));
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
partprune.c:936:14: note: ‘pc_steps’ was declared here
   List *pc_steps;
         ^~~~~~~~
All of PostgreSQL successfully made. Ready to install.

So even if it's not a bug, we probably need to fix the code somehow. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks Tomas for the review. On 2018/03/30 1:55, Tomas Vondra wrote: > Hi, > > I think there's a bug in generate_pruning_steps_from_opexprs, which does > this for PARTITION_STRATEGY_HASH: > > for_each_cell(lc1, lc) > { > pc = lfirst(lc1); > > /* > * Note that we pass nullkeys for step_nullkeys, > * because we need to tell hash partition bound search > * function which of the keys are NULL. > */ > Assert(pc->op_strategy == HTEqualStrategyNumber); > pc_steps = > get_steps_using_prefix(context, > HTEqualStrategyNumber, > pc->expr, > pc->cmpfn, > pc->keyno, > nullkeys, > prefix); > } > > opsteps = list_concat(opsteps, list_copy(pc_steps)); > > Notice that the list_concat() is outside the for_each_cell loop. Doesn't > that mean we fail to consider some of the clauses (all except the very > last clause) for pruning? I haven't managed to come up with an example, > but I haven't tried very hard. > > FWIW I've noticed this because gcc complains that pc_steps might be used > uninitialized. > > So even if it's not a bug, we probably need to fix the code somehow. Yeah, the code needs to be fixed. It seems to me, though, that in the hash partitioning case the loop would iterate at most once, at least if the query didn't contain any Params. That's because, at that point, there cannot be multiple mutually AND'd equality clauses referring to the same key. For example, if there were such clauses in the original query and they contained different values, we wouldn't get this far anyway, as they would be reduced to constant-false at an earlier planning stage. If they all contained the same value (e.g. key = 1 and key = 1::smallint and key = 1::int and key = 1::bigint), then only one of them will be left in rel->baserestrictinfo anyway. But we still need to have the loop, because none of what I said would hold if the clauses contained Params; in that case, the result would be determined at execution time. I have fixed the code as you suggested and will post the fixed version shortly after fixing the issues David reported. Thanks, Amit
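The fix amounts to moving the list_concat() inside the loop so that no clause's steps are dropped, roughly:

    for_each_cell(lc1, lc)
    {
        pc = lfirst(lc1);

        Assert(pc->op_strategy == HTEqualStrategyNumber);
        pc_steps = get_steps_using_prefix(context,
                                          HTEqualStrategyNumber,
                                          pc->expr,
                                          pc->cmpfn,
                                          pc->keyno,
                                          nullkeys,
                                          prefix);

        /* now inside the loop, so every clause's steps are collected */
        opsteps = list_concat(opsteps, list_copy(pc_steps));
    }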
Hi David. On 2018/03/29 20:08, David Rowley wrote: > I've looked at the v44 patch. Thank you. > Thanks for making those changes. > > The new not-equal handling code is not quite right. > > DROP TABLE listp; > CREATE TABLE listp (a INT) PARTITION BY LIST(a); > CREATE TABLE listp1_3 PARTITION OF listp FOR VALUES IN(1,3); > CREATE TABLE listp_default PARTITION OF listp DEFAULT; > > EXPLAIN SELECT * FROM listp WHERE a <> 1; > QUERY PLAN > ------------------------------------------------------------------ > Append (cost=0.00..54.56 rows=2537 width=4) > -> Seq Scan on listp1_3 (cost=0.00..41.88 rows=2537 width=4) > Filter: (a <> 1) > (3 rows) > > The default should be included here. > > INSERT INTO listp VALUES(1),(2),(3); > SELECT * FROM listp WHERE a <> 1; > a > --- > 3 > (1 row) Good catch! Indeed, the default partition should not have been pruned away in this case. > This code assumes it's fine to just reverse the setting for default: > > result->scan_default = !source->scan_default; > > More complex handling is needed here. Hmm, I thought about this and came to the conclusion that we should *always* scan the default partition in this case. The inversion step removes all the datums selected by the source step from the set of *all* datums that the currently defined set of partitions allows. If there's a default partition in the mix, that set contains all the datums of the partition key's data type. So, irrespective of whether or not the source step selected the default partition, there would be datums in the set after inversion that belong in the default partition, if not in some non-default partition that would've been selected. I have written a comment there trying to explain this, but I may not have been able to articulate it properly. Please check. Or does this just sound wrong? > I've attached a diff for a small set of other things I noticed while reviewing. Thanks, merged. Please find attached the updated patches. Thanks, Amit
Attachment
On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Please find attached the updated patches. Thanks. I've noticed that there are no outfuncs or readfuncs for all the new Step types you've added. Also, the copy func does not properly copy the step_id in the base node type. This will remain at 0 after a copyObject() -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote: > I've noticed that there are no outfuncs or readfuncs for all the new > Step types you've added. > > Also, the copy func does not properly copy the step_id in the base > node type. This will remain at 0 after a copyObject() I've attached a quickly put together fix for this. I'm not quite sure what the done thing is to copy/read/write nodes which inherit fields from other nodes, so what I've done in the attached might not be correct. However, I see _copyValue() does what I've done, so perhaps its fine, or that may just be more of a special case. I also manually removed some hunks from the diff, so hopefully, it still works correctly. Attaching it as it may save you some time from doing it yourself. Please check it though. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
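A minimal sketch of the kind of fix needed, following the usual copyfuncs.c conventions (the field list here is abbreviated and illustrative; the point is that the base PartitionPruneStep's step_id must be copied explicitly):

    static PartitionPruneStepOp *
    _copyPartitionPruneStepOp(const PartitionPruneStepOp *from)
    {
        PartitionPruneStepOp *newnode = makeNode(PartitionPruneStepOp);

        /* the field inherited from PartitionPruneStep must be copied too */
        COPY_SCALAR_FIELD(step.step_id);

        COPY_SCALAR_FIELD(opstrategy);
        COPY_NODE_FIELD(exprs);
        /* ... and likewise for the node's remaining fields ... */

        return newnode;
    }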
Attachment
David Rowley wrote: > Also, the copy func does not properly copy the step_id in the base > node type. This will remain at 0 after a copyObject() As I recall, you can find these mistakes by compiling with -DCOPY_PARSE_PLAN_TREES. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
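For instance, one way to enable that flag is to pass it through the preprocessor flags at configure time:

    ./configure CPPFLAGS='-DCOPY_PARSE_PLAN_TREES' ...
    make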
On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote: >> I've noticed that there are no outfuncs or readfuncs for all the new >> Step types you've added. >> >> Also, the copy func does not properly copy the step_id in the base >> node type. This will remain at 0 after a copyObject() > > Attaching it as it may save you some time from doing it yourself. > Please check it though. The attached might be slightly easier to apply. The previous version was based on top of some other changes I'd been making. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Please find attached the updated patches. There's a bit of a strange case with v45 around prepared statements. I've not debugged this yet, but in case you get there first, here's the case:

create table listp (a int, b int) partition by list (a);
create table listp_1 partition of listp for values in(1) partition by list (b);
create table listp_1_1 partition of listp_1 for values in(1);
create table listp_2 partition of listp for values in(2) partition by list (b);
create table listp_2_1 partition of listp_2 for values in(2);

explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this one looks fine.
                                  QUERY PLAN
----------------------------------------------------------------------------
 Append  (cost=0.00..49.66 rows=22 width=8)
   ->  Seq Scan on listp_1_1  (cost=0.00..49.55 rows=22 width=8)
         Filter: ((b = ANY ('{1,2}'::integer[])) AND (2 <> b) AND (0 <> b))
(3 rows)

prepare q1 (int,int,int,int) as select * from listp where b in($1,$2) and $3 <> b and $4 <> b;
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
execute q1 (1,2,3,4);
explain (analyze, costs off, summary off, timing off) execute q1 (1,2,2,0);
           QUERY PLAN
--------------------------------
 Result (actual rows=0 loops=1)
   One-Time Filter: false
(2 rows)

My best guess is that something ate the bits out of a Bitmapset of the matching partitions somewhere. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/03/30 22:41, David Rowley wrote: > On 31 March 2018 at 02:00, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 31 March 2018 at 01:18, David Rowley <david.rowley@2ndquadrant.com> wrote: >>> I've noticed that there are no outfuncs or readfuncs for all the new >>> Step types you've added. >>> >>> Also, the copy func does not properly copy the step_id in the base >>> node type. This will remain at 0 after a copyObject() >> >> Attaching it as it may save you some time from doing it yourself. >> Please check it though. > > The attached might be slightly easier to apply. The previous version > was based on top of some other changes I'd been making. Thanks David. I have merged this. Regards, Amit
Hi David. On 2018/03/31 0:55, David Rowley wrote: > On 30 March 2018 at 18:38, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Please find attached the updated patches. > > There's a bit of a strange case with v45 around prepared statements. > I've not debugged this yet, but in case you get there first, here's > the case: > > create table listp (a int, b int) partition by list (a); > create table listp_1 partition of listp for values in(1) partition by list (b); > create table listp_1_1 partition of listp_1 for values in(1); > create table listp_2 partition of listp for values in(2) partition by list (b); > create table listp_2_1 partition of listp_2 for values in(2); > > explain select * from listp where b in(1,2) and 2<>b and 0<>b; -- this > one looks fine. > > prepare q1 (int,int,int,int) as select * from listp where b in($1,$2) > and $3 <> b and $4 <> b; > execute q1 (1,2,3,4); (x5) > explain (analyze, costs off, summary off, timing off) execute q1 (1,2,2,0); > QUERY PLAN > -------------------------------- > Result (actual rows=0 loops=1) > One-Time Filter: false > (2 rows) > > My best guess is that something ate the bits out of a Bitmapset of the > matching partitions somewhere. Hmm. It is the newly added inversion step that's causing this. When creating a generic plan (that is, when the planning happens via BuildCachedPlan called with boundParams set to NULL), the presence of Params will cause an inversion step's source step to produce a scan-all-partitions sort of result, which the inversion step dutifully inverts to a scan-no-partitions result. I have tried to attack that problem by handling the no-values-to-prune-with case using a side-channel to propagate the scan-all-partitions result through possibly multiple steps. That is, a base pruning step will set datum_offsets in a PruneStepResult only if pruning is carried out by actually comparing values with the partition bounds. If no values were provided (like in the generic plan case), it will set a scan_all_nonnull flag instead and return without setting datum_offsets. Combine steps perform their combining duty only if datum_offsets contains a valid value, that is, if scan_all_nonnull is not set. Attached is the updated version of the patches. Thanks, Amit
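Based on the description above, the result structure would look roughly like this (a sketch using the field names from this mail, not necessarily the patch's exact definition):

    typedef struct PruneStepResult
    {
        Bitmapset  *datum_offsets;      /* offsets of surviving bound datums;
                                         * set only when actual values were
                                         * available to compare against */
        bool        scan_all_nonnull;   /* no values available (e.g. generic
                                         * plan): scan every partition that
                                         * can contain non-NULL data */
        bool        scan_null;          /* scan the NULL-only partition? */
        bool        scan_default;       /* scan the default partition? */
    } PruneStepResult;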
Attachment
On 2 April 2018 at 17:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/03/31 0:55, David Rowley wrote: >> explain (analyze, costs off, summary off, timing off) execute q1 (1,2,2,0); >> QUERY PLAN >> -------------------------------- >> Result (actual rows=0 loops=1) >> One-Time Filter: false >> (2 rows) > > Hmm. It is the newly added inversion step that's causing this. When > creating a generic plan (that is when the planning happens via > BuildCachedPlan called with boundParams set to NULL), the presence of > Params will cause an inversion step's source step to produce > a scan-all-partitions sort of result, which the inversion step dutifully > inverts to a scan-no-partitions result. > > I have tried to attack that problem by handling the > no-values-to-prune-with case using a side-channel to propagate the > scan-all-partitions result through possibly multiple steps. That is, a > base pruning step will set datum_offsets in a PruneStepResult only if > pruning is carried out by actually comparing values with the partition > bounds. If no values were provided (like in the generic plan case), it > will set a scan_all_nonnull flag instead and return without setting > datum_offsets. Combine steps perform their combining duty only if > datum_offsets contains a valid value, that is, if scan_all_nonnull is not set. I'm afraid this is still not correct :-( The following code is not doing the right thing:

+ case COMBINE_UNION:
+     foreach(lc1, cstep->source_stepids)
+     {
+         int step_id = lfirst_int(lc1);
+         PruneStepResult *step_result;
+
+         /*
+          * step_results[step_id] must contain a valid result,
+          * which is confirmed by the fact that cstep's step_id is
+          * greater than step_id and the fact that results of the
+          * individual steps are evaluated in sequence of their
+          * step_ids.
+          */
+         if (step_id >= cstep->step.step_id)
+             elog(ERROR, "invalid pruning combine step argument");
+         step_result = step_results[step_id];
+         Assert(step_result != NULL);
+
+         result->scan_all_nonnull = step_result->scan_all_nonnull;

The last line there is not properly performing a union; it just sets result->scan_all_nonnull to whatever the last step's value was. At the very least it should be |=, but I don't really like this new code. Why did you move away from just storing the matching partitions in a Bitmapset? If you want to store all non-null partitions, then why not just set the bits for all non-null partitions? That would cut down on bugs like this, since the combining of step results would just be simple unions or intersects. Also, the following code could be made a bit nicer:

+ result = (PruneStepResult *) palloc0(sizeof(PruneStepResult));
+
+ switch (context->strategy)
+ {
+     case PARTITION_STRATEGY_HASH:
+         result->bound_offsets = get_matching_hash_bound(context,
+                                                         opstep->opstrategy,
+                                                         values, nvalues,
+                                                         partsupfunc,
+                                                         opstep->nullkeys,
+                                                         &result->scan_all_nonnull);

Why not allocate the PruneStepResult inside the get_matching_*_bound? That way you wouldn't need all those out parameters to set the bool fields. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
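The minimal correction for the union case accumulates rather than overwrites, roughly as follows (a sketch; the bound_offsets union shown here follows the Bitmapset suggestion above rather than the patch's exact code):

    case COMBINE_UNION:
        foreach(lc1, cstep->source_stepids)
        {
            int              step_id = lfirst_int(lc1);
            PruneStepResult *step_result = step_results[step_id];

            result->bound_offsets = bms_add_members(result->bound_offsets,
                                                    step_result->bound_offsets);
            result->scan_all_nonnull |= step_result->scan_all_nonnull;
            result->scan_null |= step_result->scan_null;
            result->scan_default |= step_result->scan_default;
        }
        break;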
Hi David. On 2018/04/02 21:03, David Rowley wrote: > On 2 April 2018 at 17:18, Amit Langote wrote: >> On 2018/03/31 0:55, David Rowley wrote: >>> explain (analyze, costs off, summary off, timing off) execute q1 (1,2,2,0); >>> QUERY PLAN >>> -------------------------------- >>> Result (actual rows=0 loops=1) >>> One-Time Filter: false >>> (2 rows) >> >> Hmm. It is the newly added inversion step that's causing this. When >> creating a generic plan (that is when the planning happens via >> BuildCachedPlan called with boundParams set to NULL), the presence of >> Params will cause an inversion step's source step to produce >> a scan-all-partitions sort of result, which the inversion step dutifully >> inverts to a scan-no-partitions result. >> >> I have tried to attack that problem by handling the >> no-values-to-prune-with case using a side-channel to propagate the >> scan-all-partitions result through possibly multiple steps. That is, a >> base pruning step will set datum_offsets in a PruneStepResult only if >> pruning is carried out by actually comparing values with the partition >> bounds. If no values were provided (like in the generic plan case), it >> will set a scan_all_nonnull flag instead and return without setting >> datum_offsets. Combine steps perform their combining duty only if >> datum_offsets contains a valid value, that is, if scan_all_nonnull is not set. > > I'm afraid this is still not correct :-( > > The following code is not doing the right thing: > > + case COMBINE_UNION: > + foreach(lc1, cstep->source_stepids) > + { > + int step_id = lfirst_int(lc1); > + PruneStepResult *step_result; > + if (step_id >= cstep->step.step_id) > + elog(ERROR, "invalid pruning combine step argument"); > + step_result = step_results[step_id]; > + Assert(step_result != NULL); > + result->scan_all_nonnull = step_result->scan_all_nonnull; > > The last line there is not properly performing a union; it just sets > result->scan_all_nonnull to whatever the last step's value was. > At the very least it should be |=, but I don't really like this new code. > > Why did you move away from just storing the matching partitions in a > Bitmapset? If you want to store all non-null partitions, then why not > just set the bits for all non-null partitions? That would cut down on > bugs like this since the combining of step results would just be > simple unions or intersects. As I mentioned in my previous email, I had to find a side-channel (that is, scan_all_nonnull) to store this information instead of doing it the regular way, to differentiate the case where we need to scan all partitions because the values in the base prune steps are not available from the case where carrying out a step using actual values ends up selecting all partitions. When creating a generic plan, none of the values of the Params that are added to base prune steps are available, and that results in reaching the actual pruning functions (get_matching_hash/list/range_bounds) without any values, which results in each of those functions returning all partitions containing non-null data.
But actually, the presence of only Params in the pruning steps should result in the pruning not being invoked at all (at least for the static pruning case), thus selecting all partitions containing non-null data. It is better to implement that instead of a workaround like the scan_all_nonnull side-channel I was talking about. I have fixed the patch to implement it that way. > Also, the following code could be made a bit nicer > > + result = (PruneStepResult *) palloc0(sizeof(PruneStepResult)); > + > + switch (context->strategy) > + { > + case PARTITION_STRATEGY_HASH: > + result->bound_offsets = get_matching_hash_bound(context, > + opstep->opstrategy, > + values, nvalues, > + partsupfunc, > + opstep->nullkeys, > + &result->scan_all_nonnull); > > Why not allocate the PruneStepResult inside the get_matching_*_bound, > that way you wouldn't need all those out parameters to set the bool > fields. I thought it'd be nice to have perform_pruning_base_step generate the actual PruneStepResult instead of the functions for individual partitioning strategies, which in a way minimizes the places where it is manipulated. Since we've divided bound searching into 3 separate functions anyway, it also seemed better to me to have their signatures be relevant to the partition strategy they cater to. The function for hash partitioning, for example, never has to deal with setting the result for the null or the default partition, and the range partitioning function doesn't have to worry about doing anything for the null partition. Also, the overall footprint of those 3 functions is reduced because they don't have to create the PruneStepResult themselves. Attached v47. Thanks, Amit
Attachment
On 4 April 2018 at 00:02, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > But actually, the presence of only Params in the pruning steps should > result in the pruning not being invoked at all (at least for the static > pruning case), thus selecting all partitions containing non-null data. It > is better to implement that instead of a workaround like the scan_all_nonnull > side-channel I was talking about. I don't think this is quite true. Since we're only using strict clauses, a list of quals with just Params still means that NULLs can't match. If you skip the step altogether, then won't you lose the chance to prune away any NULL-only partition? I think it would be better to just have special handling in get_matching_list_bound so that it knows it's performing <> elimination. I'd thought about passing some other opstrategy but the only safe one I thought to use was InvalidStrategy, which is already used by NULL handling. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 4 April 2018 at 00:02, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> But actually, the presence of only Params in the pruning steps should >> result in the pruning not being invoked at all (at least for the static >> pruning case), thus selecting all partitions containing non-null data. It >> is better to implement that instead of a workaround like scan_all_nonnulls >> side-channel I was talking about. > > I don't think this is quite true. Since we're only using strict > clauses, a list of quals with just Params still means that NULLs can't > match. If you skip the step altogether then won't you have you've lost > the chance at pruning away any NULL-only partition? > > I think it would be better to just have special handling in > get_matching_list_bound so that it knows it's performing <> > elimination. I'd thought about passing some other opstrategy but the > only safe one I thought to use was InvalidStrategy, which is already > used by NULL handling. I'm currently working up a patch to do this the way I think is best. I'll submit it soon and we can review and get your thoughts on it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote: >> I think it would be better to just have special handling in >> get_matching_list_bound so that it knows it's performing <> >> elimination. I'd thought about passing some other opstrategy but the >> only safe one I thought to use was InvalidStrategy, which is already >> used by NULL handling. > > I'm currently working up a patch to do this the way I think is best. > > I'll submit it soon and we can review and get your thoughts on it. I've attached a rough cut version of what I think is a good solution for this. It's based on v46, not your latest v47, sorry. This makes get_matching_list_bounds() aware that it's performing the not-equal pruning via the opstrategy, which allows it to not return all partitions when there are no values in this case. Instead, we return the NULL partition, so that we later invert that and return everything apart from the NULL partition. A strict clause will allow us that much, even if we can't get the actual value being compared to at the time. There's also a bunch of other changes in there:

1. Adding the missing step_id in copyfuncs.c.

2. Simplified including the default partition in a bunch of cases.

3. Made it so scan_default and scan_null are only ever set to true if there's a partition for that.

4. Changed get_matching_*_bounds to return the entire result struct instead of the Bitmapset and pass the remaining bool values back through params. I didn't really like how you'd changed this to pass all the bool flags back as params. There's a perfectly good struct there to provide the entire result in a single return value. I know you've disagreed with this already, so it would be nice to get a 3rd opinion.

5. Renamed get_matching_hash_bound to get_matching_hash_bounds. The LIST and RANGE versions of this function both had a plural name. I didn't see any reason for the hash case to be different.

Let me know what you think. I've patched the run-time pruning v18 against this and it now passes regression. I need to do a bit more testing on this to ensure it works for all cases, but thought I'd send it now as I suspect you're currently around to look. There might be another issue with the patch too, but I'll send a separate email about that. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 4 April 2018 at 13:13, David Rowley <david.rowley@2ndquadrant.com> wrote: > There might be another issue with the patch too, but I'll send a > separate email about that. In the current version of the patch the following comment exists:

/*
 * Fall-through for a NOT clause, which is handled in
 * match_clause_to_partition_key().
 */

The only real handling of NOT clauses is in match_boolean_partition_clause(), which just handles NOT(true) or NOT(false). It's true that the const simplification code will generally rewrite most NOT(clause) to use the negator operator, but if the operator does not have a negator it can't do this. We probably don't have any built-in operators which are members of a btree opclass which have no negator, but it's simple enough to modify the citext extension by commenting out the NEGATOR lines in citext--1.4--1.5.sql.

create extension citext;
create table listp(a citext) partition by list(a citext_pattern_ops);
create table listp_1 partition of listp for values in('1');

explain select * from listp where not (a ~>~ '0' and a ~<~ '2');
                                QUERY PLAN
--------------------------------------------------------------------------
 Append  (cost=0.00..36.45 rows=1209 width=32)
   ->  Seq Scan on listp_1  (cost=0.00..30.40 rows=1209 width=32)
         Filter: ((NOT (a ~>~ '0'::citext)) OR (NOT (a ~<~ '2'::citext)))
(3 rows)

At the moment pruning does not work for this case at all. Perhaps it should? I imagine it might be possible to re-work the COMBINE_INVERT code so that it becomes a flag of the combine step rather than a step operator type. It should then be possible to invert COMBINE_UNION for NOT(clause1 OR clause2) and COMBINE_INTERSECT for NOT(clause1 AND clause2). IOW, it might not take too many lines of code to put this right. Probably the bulk of the work would be writing a test with a btree opclass that will allow us to have the planner not invert the clause during const folding. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > It's true that the const simplification code will generally rewrite > most NOT(clause) to use the negator operator, but if the operator does > not have a negator it can't do this. > ... > At the moment pruning does not work for this case at all. Perhaps it should? It's hard to see why we'd expend extra effort to optimize such situations. The right answer would invariably be to fix the inadequate operator definition, because missing the negator link would hobble many other cases besides this. Now if you can show a case where the extra smarts would be useful without presuming a badly-written opclass, it's a different matter. regards, tom lane
On 4 April 2018 at 16:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> It's true that the const simplification code will generally rewrite >> most NOT(clause) to use the negator operator, but if the operator does >> not have a negator it can't do this. >> ... >> At the moment pruning does not work for this case at all. Perhaps it should? > > It's hard to see why we'd expend extra effort to optimize such situations. > The right answer would invariably be to fix the inadequate operator > definition, because missing the negator link would hobble many other > cases besides this. > > Now if you can show a case where the extra smarts would be useful > without presuming a badly-written opclass, it's a different matter. Okay, well that certainly sounds like less work. In that case, the comment which claims we handle the NOT clauses needs to be updated to mention that we only handle boolean NOT clauses and don't optimize the remainder. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David. On 2018/04/04 10:13, David Rowley wrote: > On 4 April 2018 at 11:22, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 4 April 2018 at 09:47, David Rowley <david.rowley@2ndquadrant.com> wrote: >>> I think it would be better to just have special handling in >>> get_matching_list_bound so that it knows it's performing <> >>> elimination. I'd thought about passing some other opstrategy but the >>> only safe one I thought to use was InvalidStrategy, which is already >>> used by NULL handling. Thanks for this suggestion. Having the special case handling for steps corresponding to <> operator clauses in get_matching_list_bounds() seems like the best way, and that should have been the way all along. It occurred to me that after I changed the patch to store datum offsets in the result, there wasn't any need for special handling of <> operators at a higher level -- like the special pruning function (get_partitions_excluded_by_ne_datums) that we used to have, or the COMBINE_INVERT step I recently proposed. For each datum coming from a <> operator clause (signaled to get_matching_list_bounds by passing InvalidStrategy for opstrategy), we return all datums minus the one that was passed (if the latter is indeed found in boundinfo). Bounds for individual <> operator clauses will be combined using INTERSECT at a higher level to give the desired result. There is no need for the invert step, nor for the planner to set things up very carefully for the invert step to do the right thing.

create table lp (a int) partition by list (a);
create table lp1 partition of lp for values in (1);
create table lp2 partition of lp for values in (2);
create table lp3 partition of lp for values in (3);
create table lp_null partition of lp for values in (null);
create table lp_default partition of lp default;

For

explain select * from lp where a <> 1

get_matching_list_bounds will return the set of offsets of {2, 3} and will set scan_default to true, while setting scan_null to false; and for

explain select * from lp where a <> 1 and a <> 3

it will return the sets of offsets of {2, 3} and {1, 2} for the individual base steps, along with setting scan_default to true and scan_null to false in each; the INTERSECT combine step then combines those to give the offset of 2, with scan_default set to true and scan_null set to false.

                             QUERY PLAN
--------------------------------------------------------------------
 Append  (cost=0.00..121.75 rows=5050 width=4)
   ->  Seq Scan on lp2  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))
   ->  Seq Scan on lp_default  (cost=0.00..48.25 rows=2525 width=4)
         Filter: ((a <> 1) AND (a <> 3))

If there are no values, then the offsets of all the bounds will be returned, and unlike the previous setup with the INVERT step in the mix, they will all survive. I know we've been back and forth quite a bit on this, but this solution seems like the one with the least amount of hackery. I hope you see it the same way. >> I'm currently working up a patch to do this the way I think is best. >> >> I'll submit it soon and we can review and get your thoughts on it. > I've attached a rough cut version of what I think is a good solution > for this. It's based on v46, not your latest v47, sorry. > > This makes get_matching_list_bounds() aware that it's performing the > not-equal pruning via the opstrategy which allows it to not return all > partitions when there are no values in this case. Instead, we return > the NULL partition, so that we later invert that and return everything > apart from the NULL partition.
> A strict clause will allow us that much, even if we can't get the actual > value being compared to at the time. As I explained above, I considered your general idea of teaching get_matching_list_bounds to deal with being passed InvalidStrategy for opstrategy to signal special handling of a <> clause datum. By implementing that, I was able to get rid of a bunch of code in partprune.c and remove the COMBINE_INVERT related code. We can add COMBINE_INVERT later if and when we need it (for some legitimate purpose). > There's also a bunch of other changes in there: Thanks. > 1. Adding the missing step_id in copyfuncs.c. Merged. > 2. Simplified including the default partition in a bunch of cases. > 3. Made it so scan_default and scan_null are only ever set to true if > there's a partition for that. I have merged these too. > 4. Changed get_matching_*_bounds to return the entire result struct > instead of the Bitmapset and pass the remaining bool values back > through params. I didn't really like how you'd changed this to pass all > the bool flags back as params. There's a perfectly good struct there > to provide the entire result in a single return value. I know you've > disagreed with this already, so it would be nice to get a 3rd opinion. I went ahead with them returning the PruneStepResult struct. > 5. Renamed get_matching_hash_bound to get_matching_hash_bounds. The > LIST and RANGE versions of this function both had a plural name. I > didn't see any reason for the hash case to be different. Agreed, merged. > Let me know what you think. I'm not sure about the following change in your patch:

- if (!result->scan_null)
-     result->scan_null = step_result->scan_null;
- if (!result->scan_default)
-     result->scan_default = step_result->scan_default;
+ result->scan_null |= step_result->scan_null;
+ result->scan_default |= step_result->scan_default;

Afaik, |= does a bitwise OR, which, even if it gives the result we want, is not a logical operation. I had written the original code using the following definition of logical OR:

a OR b = if a then true else b

Also, since things work normally even if we pass no values to get_matching_list_bounds, including via the dummy step generated for IS NOT NULL clause(s), I don't see the need to store notnullkeys in the prune step. In particular, it's redundant to set notnullkeys in a pruning step containing non-empty exprs since, by definition, they will select partitions containing non-null datums. > I've patched the run-time pruning v18 against this and it now passes regression. > > I need to do a bit more testing on this to ensure it works for all > cases, but thought I'd send now as I suspect you're currently around > to look. See if the attached works for you. > There might be another issue with the patch too, but I'll send a > separate email about that. I suppose this is the email about support for pruning using NOT clauses in partition.c. It might be possible to do that by re-introducing the COMBINE_INVERT step (which would be legitimately needed in that case) and modifying partprune.c to capture NOT clauses in more cases than it does currently. For now, I modified the comment as you suggested, to say that we only support a Boolean NOT clause in special cases, such as when a Boolean partitioning opfamily is used. Attached v48. Thanks, Amit
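To illustrate the <> rule described above as a self-contained sketch (a hypothetical helper, not the patch's actual function; the real logic lives inside get_matching_list_bounds): start from all datum offsets, delete the matched one if it is a bound at all, and always scan the default partition if one exists:

    #include "nodes/bitmapset.h"

    static Bitmapset *
    list_bounds_for_ne(int ndatums, int matched_off,
                       bool has_default, bool *scan_default)
    {
        Bitmapset  *offsets = NULL;

        /* every list bound survives a <> clause ... */
        if (ndatums > 0)
            offsets = bms_add_range(NULL, 0, ndatums - 1);

        /* ... except the one the clause excludes, if it is a bound at all */
        if (matched_off >= 0)
            offsets = bms_del_member(offsets, matched_off);

        /* the default partition can always hold other non-NULL values */
        *scan_default = has_default;

        return offsets;
    }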
Attachment
On 2018/04/04 14:42, Amit Langote wrote: > Attached v48. I had forgotten to remove the static_pruning parameter I had added in v47, because it is no longer used. Static pruning now occurs even if a step contains all Params, in which case each of the get_matching_hash/list/range_bounds() functions returns the offsets of all non-null datums, because the Params cannot be resolved to actual values during static pruning. I have also included a few changes to get_matching_partitions that David had proposed in his delta patch but that I had failed to include in v48. Attached v49. Thanks, Amit
Attachment
On 4 April 2018 at 17:42, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I'm not sure about the following change in your patch: > > - if (!result->scan_null) > - result->scan_null = step_result->scan_null; > - if (!result->scan_default) > - result->scan_default = step_result->scan_default; > + result->scan_null |= step_result->scan_null; > + result->scan_default |= step_result->scan_default; > > Afaik, |= does a bitwise OR, which, even if it gives the result we want, > is not a logical operation. I had written the original code using the > following definition of logical OR. > > a OR b = if a then true else b Ok, no problem. I only changed that to make it more compact. For the record, we do the same in plenty of other places across the code base, e.g.:

parse->hasSubLinks |= subquery->hasSubLinks;

/* If subquery had any RLS conditions, now main query does too */
parse->hasRowSecurity |= subquery->hasRowSecurity;

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 19:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached v49. Thanks for including the changes. I'll look now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 4 April 2018 at 19:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/04/04 14:42, Amit Langote wrote: >> Attached v48. > > I had forgotten to remove the static_pruning parameter I had added in > v47, because it is no longer used. Static pruning now occurs even if a > step contains all Params, in which case each of the > get_matching_hash/list/range_bounds() functions returns the offsets of all > non-null datums, because the Params cannot be resolved to actual values > during static pruning. Thanks for updating. I've made a pass over v49 and I didn't find very much wrong with it. The only real bug I found was a missing IsA(rinfo->clause, Const) in the pseudoconstant check inside generate_partition_pruning_steps_internal. Most of the changes are comment fixes with a few stylistic changes thrown in, which are pretty much all there just to try to shrink the code a line or two or reduce indentation. I feel pretty familiar with this code now and, assuming the attached is included, I'm happy for someone else, hopefully a committer, to take a look at it. I'll leave the following notes:

1. Still not sure about RelOptInfo->has_default_part. This flag is only looked at in generate_partition_pruning_steps. The RelOptInfo and the boundinfo are available to look at; it's just that the partition_bound_has_default macro is defined in partition.c rather than partition.h.

2. Don't really like the new isopne variable name. It's not very simple to decode; perhaps something like is_not_eq is better?

3. The part of the code I'm least familiar with is get_steps_using_prefix_recurse(). I admit to not having had time to fully understand that and consider ways to break it.

Marking as ready for committer. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
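For reference, a sketch of the corrected pseudoconstant check being described (the surrounding loop and the constfalse flag name are assumptions, not the patch's verbatim code); the point is that the clause must actually be a Const before its fields are inspected:

    if (rinfo->pseudoconstant &&
        IsA(rinfo->clause, Const) &&
        (((Const *) rinfo->clause)->constisnull ||
         !DatumGetBool(((Const *) rinfo->clause)->constvalue)))
    {
        /* clause is constant-false (or NULL): nothing can match */
        *constfalse = true;
        continue;
    }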
Attachment
Hi, On 04/04/2018 09:29 AM, David Rowley wrote: > Thanks for updating. I've made a pass over v49 and I didn't find very > much wrong with it. > > The only real bug I found was a missing IsA(rinfo->clause, Const) in > the pseudoconstant check inside > generate_partition_pruning_steps_internal. > > Most of the changes are comment fixes with a few stylistic changes > thrown which are pretty much all there just to try to shrink the code > a line or two or reduce indentation. > > I feel pretty familiar with this code now and assuming the attached is > included I'm happy for someone else, hopefully, a committer to take a > look at it. > > I'll leave the following notes: > > 1. Still not sure about RelOptInfo->has_default_part. This flag is > only looked at in generate_partition_pruning_steps. The RelOptInfo and > the boundinfo is available to look at, it's just that the > partition_bound_has_default macro is defined in partition.c rather > than partition.h. > > 2. Don't really like the new isopne variable name. It's not very > simple to decode, perhaps something like is_not_eq is better? > > 3. The part of the code I'm least familiar with is > get_steps_using_prefix_recurse(). I admit to not having had time to > fully understand that and consider ways to break it. > > Marking as ready for committer. > Passes check-world, and CommitFest app has been updated to reflect the current patch set. Trivial changes attached. Best regards, Jesper
Attachment
Hi. On 2018/04/05 0:45, Jesper Pedersen wrote: > Hi, > > On 04/04/2018 09:29 AM, David Rowley wrote: >> Thanks for updating. I've made a pass over v49 and I didn't find very >> much wrong with it. >> >> The only real bug I found was a missing IsA(rinfo->clause, Const) in >> the pseudoconstant check inside >> generate_partition_pruning_steps_internal. Fixed. >> Most of the changes are comment fixes with a few stylistic changes >> thrown in, which are pretty much all there just to try to shrink the code >> a line or two or reduce indentation. >> >> I feel pretty familiar with this code now and assuming the attached is >> included I'm happy for someone else, hopefully, a committer to take a >> look at it. Thank you, your changes look good to me. >> I'll leave the following notes: >> >> 1. Still not sure about RelOptInfo->has_default_part. This flag is >> only looked at in generate_partition_pruning_steps. The RelOptInfo and >> the boundinfo is available to look at, it's just that the >> partition_bound_has_default macro is defined in partition.c rather >> than partition.h. Hmm, it might not be such a bad idea to bring PartitionBoundInfoData out into partition.h. If we do that, we won't need the has_default_part that the patch adds to RelOptInfo. In the attached v50 set, 0002 does that. >> 2. Don't really like the new isopne variable name. It's not very >> simple to decode, perhaps something like is_not_eq is better? isopne does sound a bit unintelligible. I propose op_is_ne so that it sounds consistent with the preceding member of the struct that's called opno. I want to keep "ne" and not start calling it not_eq, as a few other places use the string "ne" to refer to a similar thing, like:

/* inequality */
Datum range_ne(PG_FUNCTION_ARGS)
Datum timestamptz_ne_date(PG_FUNCTION_ARGS)

Since the field is local to partprune.c, I guess it's fine as long as the comment where it's defined explains what it is. >> 3. The part of the code I'm least familiar with is >> get_steps_using_prefix_recurse(). I admit to not having had time to >> fully understand that and consider ways to break it. The purpose of that code is to generate *all* the steps needed to be combined using COMBINE_INTERSECT, such that the pruning will occur using the most restrictive set of clauses in cases where the same key is referenced in multiple restriction clauses containing non-equality operators. So, for a range partitioned table on (a, b), for a query like

explain select * from foo where a <= 1 and a <= 3 and b < 5 and b <= 10

the pruning steps generated to be combined with an enclosing INTERSECT step will be as follows:

<= (1, 10)
<  (1, 5)
<= (3, 10)
<  (3, 5)

>> Marking as ready for committer. Thank you! > Passes check-world, and CommitFest app has been updated to reflect the > current patch set. Trivial changes attached. Merged these changes. Thanks again Jesper. Attached v50. Thanks, Amit
On 2018/04/05 17:28, Amit Langote wrote: > Attached v50. Really attached this time. Thanks, Amit
Attachment
Amit Langote wrote: > >> 1. Still not sure about RelOptInfo->has_default_part. This flag is > >> only looked at in generate_partition_pruning_steps. The RelOptInfo and > >> the boundinfo is available to look at, it's just that the > >> partition_bound_has_default macro is defined in partition.c rather > >> than partition.h. > > Hmm, it might not be such a bad idea to bring out the > PartitionBoundInfoData into partition.h. If we do that, we won't need the > has_default_part that the patch adds to RelOptInfo. > > In the Attached v50 set, 0002 does that. After looking at this for a moment, I again come to the conclusion that the overall layout of partitioning code and definitions is terrible. But we already know that, and there's a patch in commitfest to improve things. So my intention right now is to hold my nose and get this pushed; we'll fix it afterwards. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
It seems pretty clear that putting get_matching_partitions() in catalog/partition.c is totally the wrong thing; it belongs wholly in partprune. I think the reason you put it there is that it requires access to a lot of internals that are static in partition.c. In the attached, not yet cleaned up, version of the patch, I have moved a whole lot of what you added to partition.c to partprune.c; and for the functions and struct declarations that were required to make it work, I created catalog/partition_internal.h. I changed a lot of code also, but cosmetic changes only. I'll clean this up a bit more now, and try to commit shortly (or early tomorrow); wanted to share the current status now in case I have to rush out. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>   * parentrte already has the root partrel's updatedCols translated to match
>   * the attribute ordering of parentrel.
>   */
> - if (!*part_cols_updated)
> -     *part_cols_updated =
> + if (!root->partColsUpdated)
> +     root->partColsUpdated =
>           has_partition_attrs(parentrel, parentrte->updatedCols, NULL);

Hmm, surely this should be |= to avoid resetting a value set in a previous call to this function? In the previous coding it wasn't necessary because it was a local variable ... (though, isn't it a bit odd to have this in PlannerInfo? seems like it should be in resultRelInfo, but then you already have it there so I suppose this one does *more*) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I changed a lot of code also, but cosmetic changes only. > > I'll clean this up a bit more now, and try to commit shortly (or early > tomorrow); wanted to share current status now in case I have to rush > out. I made a complete pass over the patch you sent. I only noted down the following few things:

1. A typo:

+ * off < 0, meaning the look-up value is smaller that all bounds, that

"that" -> "than"

2. I guess this will be removed before commit.

+#if 0
<large section of comments>
+#endif

3. This comment seems like a strange thing to write just before testing if the clause matches the partition key.

+ /* Clause does not match this partition key. */
+ if (equal(leftop, partkey))
+     *rightop = not_clause((Node *) clause)
+         ? (Expr *) makeBoolConst(false, false)
+         : (Expr *) makeBoolConst(true, false);

4. Comment needs to be removed.

+ * has_default_part - Whether the table has a default partition

The only other thing I noted on this pass is that we could get rid of:

+ /* go check the next clause. */
+ if (unsupported_clause)
+     break;

and just "continue" instead of "break" in all cases apart from case PARTCLAUSE_UNSUPPORTED: it would save a few lines and a single condition. What's there works, but thought this might be better... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 6 April 2018 at 12:02, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 6 April 2018 at 10:35, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > The only other thing I noted on this pass is that we could get rid of: > > + /* go check the next clause. */ > + if (unsupported_clause) > + break; > > and just "continue" instead of "break" in all cases apart from case > PARTCLAUSE_UNSUPPORTED: I should have said remove:

+ if (unsupported_clause)

The "break" would still be required. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
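The shape being suggested relies on the C rule that a continue inside a switch continues the enclosing loop, so only the unsupported case falls out. A rough sketch (case names other than PARTCLAUSE_UNSUPPORTED, and the surrounding variables, are illustrative):

    foreach(lc, clauses)
    {
        Expr   *clause = (Expr *) lfirst(lc);

        switch (match_clause_to_partition_key(context, clause, partkey))
        {
            case PARTCLAUSE_MATCH_CLAUSE:
                /* record the matched clause, then go check the next one */
                continue;

            case PARTCLAUSE_NOMATCH:
                /* of no use for this key, but maybe for another one */
                continue;

            case PARTCLAUSE_UNSUPPORTED:
                break;      /* leaves only the switch ... */
        }

        break;              /* ... so only the unsupported case gets here */
    }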
Hi. On 2018/04/06 7:35, Alvaro Herrera wrote: > It seems pretty clear that putting get_matching_partitions() in > catalog/partition.c is totally the wrong thing; it belongs wholly in > partprune. I think the reason you put it there is that it requires > access to a lot of internals that are static in partition.c. In the > attached not yet cleaned version of the patch, I have moved a whole lot > of what you added to partition.c to partprune.c; and for the functions > and struct declarations that were required to make it work, I created > catalog/partition_internal.h. Yes, I really wanted most of the new code that this patch adds to land in the planner, especially after Robert's comments here: https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com It would've been nice if we'd gotten the "reorganizing partitioning code" thread resolved sooner. > I changed a lot of code also, but cosmetic changes only. > > I'll clean this up a bit more now, and try to commit shortly (or early > tomorrow); wanted to share current status now in case I have to rush > out. Some comments on the code reorganizing part of the patch:

* Did you intentionally not put PartitionBoundInfoData and its accessor macros in partition_internal.h? partprune.c would not need to include partition.h if we did that.

* Also, I wonder why you left PartitionPruneContext in partition.h. Isn't it better taken out to partprune.h? I have done that in the attached.

* Why isn't gen_partprune_steps() in partprune.h? I see only prune_append_rel_partitions() exported out of partprune.c, but the runtime pruning patch needs gen_partprune_steps() to be called from createplan.c.

* I don't see get_matching_partitions() exported either. The runtime pruning patch needs that too.

Maybe you've thought something about these last two items, though. Thanks, Amit
Attachment
On 2018/04/06 8:33, Alvaro Herrera wrote: >> @@ -1717,8 +1691,8 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
>>   * parentrte already has the root partrel's updatedCols translated to match
>>   * the attribute ordering of parentrel.
>>   */
>> - if (!*part_cols_updated)
>> -     *part_cols_updated =
>> + if (!root->partColsUpdated)
>> +     root->partColsUpdated =
>>           has_partition_attrs(parentrel, parentrte->updatedCols, NULL);
> Hmm, surely this should be |= to avoid resetting a value set in a > previous call to this function? It won't be reset, no? We set it only if it hasn't been set already. Note that there is one PlannerInfo per sub-query, so we determine this information independently for each sub-query. > In the previous coding it wasn't > necessary because it was a local variable ... (though, isn't it a bit > odd to have this in PlannerInfo? seems like it should be in > resultRelInfo, but then you already have it there so I suppose this one > does *more*) Hmm, you'd think that we could figure this out in the executor itself, and hence wouldn't need to have this in PlannerInfo or in ModifyTable. But IIRC, during the discussion of the update tuple routing patch, it became clear that it's best to do it here, given the way things are now wrt the timing of partition/inheritance tree expansion. An update query may modify the partition key of a table at any arbitrary level, and we have to look at all the tables in the partition tree in this planning phase anyway, so it's also the best time to see if the query's modifiedCols overlaps with the partition key of some table in the tree. Once we've found that it does for some table (most likely the root), we're done; that is, we know we got some "partColsUpdated". I realized that I had gotten rid of has_default_part from RelOptInfo but hadn't deleted a comment about it; a patch to fix that is attached. Thanks, Amit
Attachment
Amit Langote wrote: > Hi. > > On 2018/04/06 7:35, Alvaro Herrera wrote: > > I seems pretty clear that putting get_matching_partitions() in > > catalog/partition.c is totally the wrong thing; it belongs wholly in > > partprune. I think the reason you put it there is that it requires > > access to a lot of internals that are static in partition.c. In the > > attached not yet cleaned version of the patch, I have moved a whole lot > > of what you added to partition.c to partprune.c; and for the functions > > and struct declarations that were required to make it work, I created > > catalog/partition_internal.h. > > Yes, I really wanted for most of the new code that this patch adds to land > in the planner, especially after Robert's comments here: > > https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com > > It would've been nice if we'd gotten the "reorganizing partitioning code" > thread resolved sooner. Grumble. I don't actually like very much the idea of putting all this code in optimizer/util. This morning it occurred to me that we should create a new src/backend/partitioning/ (and a src/include/partitioning/ to go with it) and drop a bunch of files there. Even your proposed new partcache.c will seem misplaced *anywhere*, since it contains support code to be used by both planner and executor; in src/{backend,include}/partitioning it will be able to serve both without it being a modularity wart. BTW including partition_internal.h in partition.h would defeat the point of having partition_internal.h in the first place -- at that point you'd be better off just putting it all in partition.h and save the hassle of a separate file. But given the liberty with which catalog/partition.h has been included everywhere else, IMO that would be pretty disastrous. I propose to work on reorganizing this code after the commitfest is over, as part of release stabilization. I'd rather not have us supporting a messy system for only five years, if we restructure during pg12 (which would mean a lot of backpatching pain and pg11-specific bugs); or worse, forever, if we keep the current proposed layout. One thing I don't want to do is create a new file that we'll later have to rename or move, so choosing the best locations is a necessity. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
David Rowley wrote: > 2. I guess this will be removed before commit. > > +#if 0 > <large section of comments> > +#endif Yeah, there is one sentence there I didn't quite understand and would like to add it to the rewritten version of the comment before I remove the whole ifdeffed-out comment. * PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning * step(s)" generated for the clause due to it being a BoolExpr or a * ScalarArrayOpExpr that's turned into one Exactly what does "ScalarArrayOpExpr that's turned into one" mean? Does it mean we turn the SAOP into a BoolExpr? (Yes, I know "#if 0" inside a comment doesn't do anything. It's only documentation for myself.) If you look at the rest of the rewritten comment, you'll notice some things probably need more explaining. Wording suggestions welcome. > 3. This comment seems like a strange thing to write just before > testing if the clause matches the partition key. > > + /* Clause does not match this partition key. */ > + if (equal(leftop, partkey)) > + *rightop = not_clause((Node *) clause) > + ? (Expr *) makeBoolConst(false, false) > + : (Expr *) makeBoolConst(true, false); Yeah. Looking at this function, I noticed it tests for BooleanTest, and falls back to checking "not_clause" and a few equals. Does it make sense if the clause is a SAOP? I added this assert: Assert(IsA(clause, BooleanTest) || IsA(clause, BoolExpr) || IsA(clause, RelabelType)); and it failed: #3 0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068) at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159 2159 Assert(IsA(clause, BooleanTest) || (gdb) print *clause $1 = {type = T_ScalarArrayOpExpr} I'm not sure whether or not this function can trust that what's incoming must absolutely be only those node types. > 4. Comment needs removed. > > + * has_default_part - Whether the table has a default partition Done. > The only other thing I noted on this pass is that we could get rid of: > > + /* go check the next clause. */ > + if (unsupported_clause) > + break; > > and just "continue" instead of "break" in all cases apart from case > PARTCLAUSE_UNSUPPORTED: > > it would save a few lines and a single condition. What's there works, > but thought this might be better... Makes sense -- looking. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
BTW, having both key_is_null and key_is_not_null output args to convey a single bit of info is a bit lame, so I'm removing them. A single boolean does the same job: the return value already indicates it's a matching IS [NOT] NULL clause; we only need to indicate whether the NOT is present. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote: > Yeah. Looking at this function, I noticed it tests for BooleanTest, and > falls back to checking "not_clause" and a few equals. Does it make > sense if the clause is a SAOP? I added this assert: > Assert(IsA(clause, BooleanTest) || > IsA(clause, BoolExpr) || > IsA(clause, RelabelType)); > > and it failed: > #3 0x0000556cf04505db in match_boolean_partition_clause (partopfamily=424, > clause=0x556cf1041670, partkey=0x556cf1042218, rightop=0x7ffe520ec068) > at /pgsql/source/master/src/backend/optimizer/util/partprune.c:2159 > 2159 Assert(IsA(clause, BooleanTest) || > (gdb) print *clause > $1 = {type = T_ScalarArrayOpExpr} > > I'm not sure whether or not this function can trust that what's incoming > must absolutely be only those node types. So this is what I need for current regression tests not to crash anymore: Assert(IsA(clause, BooleanTest) || IsA(clause, BoolExpr) || IsA(clause, RelabelType) || IsA(clause, ScalarArrayOpExpr) || IsA(clause, OpExpr) || IsA(clause, Var)); I'm not confident in my ability to write code to handle all possible cases right now (obviously there must be more cases that are not covered by current regression tests), so I'll leave it without the assert since it handles a couple of the useful cases, but I suspect it could stand some more improvement. I guess the question is, how interesting is boolean partitioning? I bet it has its uses. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
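One way a bare Var (and a NOT-wrapped BoolExpr) can arrive at that code path is when the partition key is itself a boolean column; a minimal sketch with hypothetical table names:

CREATE TABLE bp (b bool) PARTITION BY LIST (b);
CREATE TABLE bp_t PARTITION OF bp FOR VALUES IN (true);
CREATE TABLE bp_f PARTITION OF bp FOR VALUES IN (false);

-- "WHERE b" reaches the matching code as a plain Var,
-- "WHERE NOT b" as a BoolExpr wrapping one.
EXPLAIN (COSTS OFF) SELECT * FROM bp WHERE b;
EXPLAIN (COSTS OFF) SELECT * FROM bp WHERE NOT b;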
On Fri, Apr 6, 2018 at 11:54 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Alvaro Herrera wrote: >> Yeah. Looking at this function, I noticed it tests for BooleanTest, and >> falls back to checking "not_clause" and a few equals. Does it make >> sense if the clause is a SAOP? >> [assert and backtrace quoted in full above] > > So this is what I need for current regression tests not to crash > anymore: > > Assert(IsA(clause, BooleanTest) || > IsA(clause, BoolExpr) || > IsA(clause, RelabelType) || > IsA(clause, ScalarArrayOpExpr) || > IsA(clause, OpExpr) || > IsA(clause, Var)); > > I'm not confident in my ability to write code to handle all possible > cases right now (obviously there must be more cases that are not covered > by current regression tests), so I'll leave it without the assert since > it handles a couple of the useful cases, but I suspect it could stand > some more improvement. > > I guess the question is, how interesting is boolean partitioning? I bet > it has its uses. match_boolean_partition_clause() exists to capture cases where an OpExpr (any expression that returns a Boolean, for that matter) is itself the partition key: create table boolpart (a int) partition by list ((a = 1)); create table boolpart_t partition of boolpart for values in ('true'); create table boolpart_f partition of boolpart for values in ('false'); explain select * from boolpart where a = 1; QUERY PLAN ------------------------------------------------------------------ Append (cost=0.00..41.88 rows=13 width=4) -> Seq Scan on boolpart_t (cost=0.00..41.88 rows=13 width=4) Filter: (a = 1) (3 rows) explain select * from boolpart where a = 2; QUERY PLAN ------------------------------------------------------------------ Append (cost=0.00..41.88 rows=13 width=4) -> Seq Scan on boolpart_f (cost=0.00..41.88 rows=13 width=4) Filter: (a = 2) (3 rows) So, it's not as though we're in a position to accept only certain node types in match_boolean_partition_clause(). Before it existed, the pruning didn't work, because such a clause wasn't matched to the partition key in the special way that match_boolean_partition_clause() does, and would instead end up in the block in match_clause_to_partition_key() where OpExprs are analyzed for normal (non-Boolean) situations, in which we extract either the leftop or the rightop and try to match it with the partition key. The key might as well be: create table boolpart (a int) partition by list ((a in (1, 2))); requiring us to be in a position to match a ScalarArrayOpExpr with the partition key. This resembles match_boolean_index_clause(), by the way. Thanks, Amit
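Completing that last example as a sketch (partition names are hypothetical): the clause one would hope to prune with is the key expression itself, exactly as "a = 1" is for boolpart above, except here it is a ScalarArrayOpExpr rather than an OpExpr:

create table saoppart (a int) partition by list ((a in (1, 2)));
create table saoppart_t partition of saoppart for values in ('true');
create table saoppart_f partition of saoppart for values in ('false');

-- ideally this would scan only saoppart_t, the way the boolpart
-- examples above scan a single partition
explain select * from saoppart where a in (1, 2);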
On Fri, Apr 6, 2018 at 11:38 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Yeah, there is one sentence there I didn't quite understand and would > like to add it to the rewritten version of the comment before I remove > the whole ifdeffed-out comment. > > * PARTCLAUSE_MATCH_STEPS: *clause_steps set to list of "partition pruning > * step(s)" generated for the clause due to it being a BoolExpr or a > * ScalarArrayOpExpr that's turned into one > > Exactly what does "ScalarArrayOpExpr that's turned into one" mean? > Does it mean we turn the SAOP into a BoolExpr? Yes, we turn a ScalarArrayOpExpr into a BoolExpr and generate a prune step for the latter. Maybe someday we'll have a base pruning step that can process a ScalarArrayOpExpr directly. We create base steps only for OpExprs for now. > If you look at the rest of the rewritten comment, you'll notice some > things probably need more explaining. Wording suggestions welcome. When I looked at it earlier today, I thought your rewrite looked much better. Thanks, Amit
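Concretely, the transformation treats an IN list as the equivalent OR of plain OpExprs; a sketch against a hypothetical list-partitioned table:

CREATE TABLE lp (a int) PARTITION BY LIST (a);
CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (1);
CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (2);
CREATE TABLE lp3 PARTITION OF lp FOR VALUES IN (3);

-- The first qual is expanded into the second before prune steps are
-- generated, so both should end up scanning only lp1 and lp2.
EXPLAIN (COSTS OFF) SELECT * FROM lp WHERE a IN (1, 2);
EXPLAIN (COSTS OFF) SELECT * FROM lp WHERE a = 1 OR a = 2;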
Amit Langote wrote: > Some comments on the code reorganizing part of the patch: > > * Did you intentionally not put PartitionBoundInfoData and its accessor > macros in partition_internal.h. partprune.c would not need to include > partition.h if we do that. Not really. After pondering this some more, I decided to call the new file src/include/partition/partbounds.h; and the other new file will become src/include/partition/partprune.h. This leads naturally to the idea that PartitionBoundInfoData will be in partbounds.h. However, the typedef struct PartitionBoundInfoData *PartitionBoundInfo will have to remain in catalog/partition.h, at least for the time being. > * Also, I wonder why you left PartitionPruneContext in partition.h. Isn't > it better taken out to partprune.h? Yes. > * Why isn't gen_partprune_steps() in partprune.h? I see only > prune_append_rel_partitions() exported out of partprune.c, but the runtime > patch needs gen_partprune_steps() to be called from createplan.c. > * I don't see get_matching_partitions() exported either. Runtime pruning > patch needs that too. True -- both exported. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Here's my proposed patch. Idle thought: how about renaming the "constfalse" argument and variables to "contradictory" or maybe just "contradict"? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Apr 7, 2018 at 1:41 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Here's my proposed patch. > > Idle thought: how about renaming the "constfalse" argument and variables > to "contradictory" or maybe just "contradict"? Sounds fine to me. Thanks, Amit
Hi Alvaro, On 04/06/2018 12:41 PM, Alvaro Herrera wrote: > Here's my proposed patch. > > Idle thought: how about renaming the "constfalse" argument and variables > to "contradictory" or maybe just "contradict"? > Passes check-world. The new directories and the variable rename both seem like a good idea; either name is ok. Best regards, Jesper
So I pushed this 25 minutes ago, and already there's a couple of buildfarm members complaining: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07 Both show exactly the same diff in test partition_prune: *** /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out Fri Apr 6 15:55:08 2018 --- /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out Fri Apr 6 16:01:402018 *************** *** 1348,1357 **** ----------+----+----- hp0 | | hp0 | 1 | ! hp0 | 1 | xxx hp3 | 10 | yyy ! hp1 | | xxx ! hp2 | 10 | xxx (6 rows) -- partial keys won't prune, nor would non-equality conditions --- 1348,1357 ---- ----------+----+----- hp0 | | hp0 | 1 | ! hp0 | 10 | xxx ! hp3 | | xxx hp3 | 10 | yyy ! hp2 | 1 | xxx (6 rows) -- partial keys won't prune, nor would non-equality conditions *************** *** 1460,1466 **** QUERY PLAN ------------------------------------------------- Append ! -> Seq Scan on hp0 Filter: ((a = 1) AND (b = 'xxx'::text)) (3 rows) --- 1460,1466 ---- QUERY PLAN ------------------------------------------------- Append ! -> Seq Scan on hp2 Filter: ((a = 1) AND (b = 'xxx'::text)) (3 rows) *************** *** 1468,1474 **** QUERY PLAN ----------------------------------------------------- Append ! -> Seq Scan on hp1 Filter: ((a IS NULL) AND (b = 'xxx'::text)) (3 rows) --- 1468,1474 ---- QUERY PLAN ----------------------------------------------------- Append ! -> Seq Scan on hp3 Filter: ((a IS NULL) AND (b = 'xxx'::text)) (3 rows) *************** *** 1476,1482 **** QUERY PLAN -------------------------------------------------- Append ! -> Seq Scan on hp2 Filter: ((a = 10) AND (b = 'xxx'::text)) (3 rows) --- 1476,1482 ---- QUERY PLAN -------------------------------------------------- Append ! -> Seq Scan on hp0 Filter: ((a = 10) AND (b = 'xxx'::text)) (3 rows) *************** *** 1494,1504 **** Append -> Seq Scan on hp0 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) - -> Seq Scan on hp2 - Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) -> Seq Scan on hp3 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) ! (7 rows) -- hash partitiong pruning doesn't occur with <> operator clauses explain (costs off) select * from hp where a <> 1 and b <> 'xxx'; --- 1494,1502 ---- Append -> Seq Scan on hp0 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) -> Seq Scan on hp3 Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL))) ! (5 rows) -- hash partitiong pruning doesn't occur with <> operator clauses explain (costs off) select * from hp where a <> 1 and b <> 'xxx'; -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 17:28:00 -0300, Alvaro Herrera wrote: > So I pushed this 25 minutes ago, and already there's a couple of > buildfarm members complaining: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07 > > Both show exactly the same diff in test partition_prune: There's also https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01 *** /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/expected/misc.out 2018-02-20 18:45:02.068665297 -0800 --- /opt/src/pgsql-git/build-farm-root/HEAD/pgsql.build/contrib/sepgsql/results/misc.out 2018-04-06 13:55:50.718253850 -0700 *************** *** 32,40 **** (6 rows) SELECT * FROM t1p WHERE o > 50 AND p like '%64%'; - LOG: SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)" - LOG: SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)" - LOG: SELinux: allowed { execute } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=system_u:object_r:sepgsql_proc_exec_t:s0 tclass=db_procedure name="pg_catalog.int4le(integer,integer)" LOG: SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_table name="public.t1p" LOG: SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column o" LOG: SELinux: allowed { select } scontext=unconfined_u:unconfined_r:sepgsql_regtest_superuser_t:s0-s0:c0.c255 tcontext=unconfined_u:object_r:sepgsql_table_t:s0 tclass=db_column name="table t1p column p" --- 32,37 ---- seems you just need to remove those rows from the expected file. - Andres
On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > I don't actually like very much the idea of putting all this code in > optimizer/util. This morning it occurred to me that we should create a new > src/backend/partitioning/ (and a src/include/partitioning/ to go with > it) and drop a bunch of files there. Even your proposed new partcache.c > will seem misplaced *anywhere*, since it contains support code to be > used by both planner and executor; in src/{backend,include}/partitioning > it will be able to serve both without it being a modularity wart. Uh, what? Surely partcache.c is correctly placed next to relcache.c and syscache.c and everything else in src/backend/utils/cache. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Fri, Apr 6, 2018 at 8:24 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > I don't actually like very much the idea of putting all this code in > > optimizer/util. This morning it occurred to me that we should create a new > > src/backend/partitioning/ (and a src/include/partitioning/ to go with > > it) and drop a bunch of files there. Even your proposed new partcache.c > > will seem misplaced *anywhere*, since it contains support code to be > > used by both planner and executor; in src/{backend,include}/partitioning > > it will be able to serve both without it being a modularity wart. > > Uh, what? > > Surely partcache.c is correctly placed next to relcache.c and > syscache.c and everything else in src/backend/utils/cache. Frankly, I'm not real sure about partcache.c yet. Are you? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you Alvaro for the rest of the cleanup and committing. On Sat, Apr 7, 2018 at 5:28 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So I pushed this 25 minutes ago, and already there's a couple of > buildfarm members complaining: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07 > > Both show exactly the same diff in test partition_prune: > [partition_prune diff quoted in full above] So this same failure occurs on (noting the architecture): ppc64: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52 ia64: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24 ppc64 (POWER7): https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13 ppc64 (POWER7): https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13 powerpc: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08 powerpc: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23 powerpc 32-bit userspace on ppc64 host: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07 Seems to be because the hash functions used in partitioning give a different answer for a given set of partition key values on these machines than on others. Thanks, Amit
On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote: > Thank you Alvaro for the rest of the cleanup and committing. +10! > So this same failure occurs on (noting the architecture): > > ppc64: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52 > > ia64: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24 > > ppc64 (POWER7): > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13 > > ppc64 (POWER7): > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13 > > powerpc: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08 > > powerpc: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23 > > powerpc 32-bit userspace on ppc64 host: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07 > > Seems to be because the hash functions used in partitioning give a > different answer for a given set of partition key values on these > machines than on others. They all look like bigendian CPUs. https://en.wikipedia.org/wiki/Comparison_of_instruction_set_architectures#Endianness -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote: >> So this same failure occurs on (noting the architecture): >> >> Seems to be because the hash functions used in partitioning give a >> different answer for a given set of partition key values on these >> machines than on others. > > They all look like bigendian CPUs. I looked at all the regression test diffs for each of the servers you mentioned and I verified that the diffs match on each of the 7 servers. Maybe the best solution is to pull those tests out of partition_prune.sql then create partition_prune_hash and just have an alternative .out file with the partitions which match on bigendian machines. We could also keep them in the same file, but that's a much bigger alternative file to maintain and more likely to get broken if someone forgets to update it. What do you think? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 10:31 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 April 2018 at 12:43, David Rowley <david.rowley@2ndquadrant.com> wrote: >> On 7 April 2018 at 12:35, Amit Langote <amitlangote09@gmail.com> wrote: >>> So this same failure occurs on (noting the architecture): >>> >>> Seems to be because the hash functions used in partitioning give a >>> different answer for a given set of partition key values on these >>> machines than on others. >> >> They all look like bigendian CPUs. > > I looked at all the regression test diffs for each of the servers you > mentioned and I verified that the diffs match on each of the 7 > servers. > > Maybe the best solution is to pull those tests out of > partition_prune.sql then create partition_prune_hash and just have an > alternative .out file with the partitions which match on bigendian > machines. > > We could also keep them in the same file, but that's a much bigger > alternative file to maintain and more likely to get broken if someone > forgets to update it. > > What do you think? Yeah, that's an idea. Is it alright though that the same data may end up in different hash partitions depending on the architecture? IIRC, that's the way we decided to go when using hash partitioning, but it would've been clearer if there was already some evidence in regression tests that that's what we've chosen, such as some existing tests for tuple routing. Thanks, Amit
On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote: > On Sat, Apr 7, 2018 at 10:31 AM, David Rowley >> I looked at all the regression test diffs for each of the servers you >> mentioned and I verified that the diffs match on each of the 7 >> servers. >> >> Maybe the best solution is to pull those tests out of >> partition_prune.sql then create partition_prune_hash and just have an >> alternative .out file with the partitions which match on bigendian >> machines. >> >> We could also keep them in the same file, but that's a much bigger >> alternative file to maintain and more likely to get broken if someone >> forgets to update it. >> >> What do you think? > > Yeah, that's an idea. > > Is it alright though that the same data may end up in different hash > partitions depending on the architecture? IIRC, that's the way we > decided to go when using hash partitioning, but it would've been > clearer if there was already some evidence in regression tests that > that's what we've chosen, such as some existing tests for tuple > routing. The only alternative would be to change all the hash functions so that they normalise their endianness. It does not sound like something that will perform very well. Plus it would break everyone's hash indexes on a pg_upgrade. pg_basebackups can't be transferred over to other architectures anyway, so I'm not so worried about tuples being routed to other partitions. Maybe someone else can see a reason why this is bad? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote: > Maybe the best solution is to pull those tests out of > partition_prune.sql then create partition_prune_hash and just have an > alternative .out file with the partitions which match on bigendian > machines. Here's 1 of 2. I thought it was best to get the buildfarm green again as soon as possible. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018-04-07 14:42:53 +1200, David Rowley wrote: > On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote: > > Maybe the best solution is to pull those tests out of > > partition_prune.sql then create partition_prune_hash and just have an > > alternative .out file with the partitions which match on bigendian > > machines. > > Here's 1 of 2. I thought it was best to get the buildfarm green again > as soon as possible. Do you have an estimate how long it'll take you to produce patch 2? It'd be cool to get this covered again soon. If you don't have access to a big endian machine, we can construct the output from the buildfarm... So pulling the tests out would be the only "urgent" thing, I can go on from there. - Andres
On Sat, Apr 7, 2018 at 7:25 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 April 2018 at 13:50, Amit Langote <amitlangote09@gmail.com> wrote: >> On Sat, Apr 7, 2018 at 10:31 AM, David Rowley >>> I looked at all the regression test diffs for each of the servers you >>> mentioned and I verified that the diffs match on each of the 7 >>> servers. >>> >>> Maybe the best solution is to pull those tests out of >>> partition_prune.sql then create partition_prune_hash and just have an >>> alternative .out file with the partitions which match on bigendian >>> machines. >>> >>> We could also keep them in the same file, but that's a much bigger >>> alternative file to maintain and more likely to get broken if someone >>> forgets to update it. >>> >>> What do you think? >> >> Yeah, that's an idea. >> >> Is it alright though that the same data may end up in different hash >> partitions depending on the architecture? IIRC, that's the way we >> decided to go when using hash partitioning, but it would've been >> clearer if there was already some evidence in regression tests that >> that's what we've chosen, such as some existing tests for tuple >> routing. > > The only alternative would be to change all the hash functions so that > they normalise their endianness. It does not sound like something that > will perform very well. Plus it would break everyone's hash indexes on > a pg_upgrade. > > pg_basebackups can't be transferred over to other architectures > anyway, so I'm not so worried about tuples being routed to other > partitions. > > Maybe someone else can see a reason why this is bad? I don't think the concept is bad by itself. That's expected; in fact, we have added an option to pg_dump (dump through parent or some such) to handle exactly this case. What Amit seems to be complaining about, though, is the regression test. We need to write regression tests so that they produce the same plans, pruning the same partitions by name, on all architectures. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote: > On 2018-04-07 14:42:53 +1200, David Rowley wrote: >> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote: >> > Maybe the best solution is to pull those tests out of >> > partition_prune.sql then create partition_prune_hash and just have an >> > alternative .out file with the partitions which match on bigendian >> > machines. >> >> Here's 1 of 2. I thought it was best to get the buildfarm green again >> as soon as possible. > > Do you have an estimate how long it'll take you to produce patch 2? It'd > be cool to get this covered again soon. If you don't have access to a > big endian machine, we can construct the output from the buildfarm... So > pulling the tests out would be the only "urgent" thing, I can go on from > there. Attached. I've not tested on a bigendian machine, but the diff -c between the two output files match the diff on the failing buildfarm members. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 15:03, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Sat, Apr 7, 2018 at 7:25 AM, David Rowley >> The only alternative would be to change all the hash functions so that >> they normalise their endianness. It does not sound like something that >> will perform very well. Plus it would break everyone's hash indexes on >> a pg_upgrade. >> >> pg_basebackups can't be transferred over to other architectures >> anyway, so I'm not so worried about tuples being routed to other >> partitions. >> >> Maybe someone else can see a reason why this is bad? > > I don't think the concept is bad by itself. That's expected; in fact, > we have added an option to pg_dump (dump through parent or some such) > to handle exactly this case. What Amit seems to be complaining about, > though, is the regression test. We need to write regression tests so > that they produce the same plans, pruning the same partitions by name, > on all architectures. Why is writing tests that produce the same output required? We have many tests with alternative outputs. Look in src/test/regress/expected for files matching _1.out -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 8:37 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 April 2018 at 15:03, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Sat, Apr 7, 2018 at 7:25 AM, David Rowley >>> The only alternative would be to change all the hash functions so that >>> they normalise their endianness. It does not sound like something that >>> will perform very well. Plus it would break everyone's hash indexes on >>> a pg_upgrade. >>> >>> pg_basebackups can't be transferred over to other architectures >>> anyway, so I'm not so worried about tuples being routed to other >>> partitions. >>> >>> Maybe someone else can see a reason why this is bad? >> >> I don't think the concept is bad by itself. That's expected; in fact, >> we have added an option to pg_dump (dump through parent or some such) >> to handle exactly this case. What Amit seems to be complaining about, >> though, is the regression test. We need to write regression tests so >> that they produce the same plans, pruning the same partitions by name, >> on all architectures. > > Why is writing tests that produce the same output required? > > We have many tests with alternative outputs. Look in > src/test/regress/expected for files matching _1.out > That's true, but we usually add such alternative outputs when we know all the possible variants, and only as long as "all the variants" doesn't amount to everything possible. AFAIU, that's not true here. Also, on a given machine a particular row is guaranteed to fall in a given partition. On a different machine it will fall in some other partition, but always that partition on that machine. We don't have a way to select an alternate output based on the architecture. Maybe a better idea is to use a .source file, creating the .out on the fly based on the architecture of the machine: testing the hash output for a given value to decide which partition it will fall into and then crafting the .out with that partition's name. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2018-04-07 15:04:37 +1200, David Rowley wrote: > On 7 April 2018 at 15:00, Andres Freund <andres@anarazel.de> wrote: > > On 2018-04-07 14:42:53 +1200, David Rowley wrote: > >> On 7 April 2018 at 13:31, David Rowley <david.rowley@2ndquadrant.com> wrote: > >> > Maybe the best solution is to pull those tests out of > >> > partition_prune.sql then create partition_prune_hash and just have an > >> > alternative .out file with the partitions which match on bigendian > >> > machines. > >> > >> Here's 1 of 2. I thought it was best to get the buildfarm green again > >> as soon as possible. > > > > Do you have an estimate how long it'll take you to produce patch 2? It'd > > be cool to get this covered again soon. If you don't have access to a > > big endian machine, we can construct the output from the buildfarm... So > > pulling the tests out would be the only "urgent" thing, I can go on from > > there. > > Attached. > > I've not tested on a bigendian machine, but the diff -c between the > two output files match the diff on the failing buildfarm members. I've pushed the two patches (collapsed). Trying to get the BF green-ish again... - Andres
On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote: > I've pushed the two patches (collapsed). Trying to get the BF green-ish > again... Thanks! -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 15:14, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Sat, Apr 7, 2018 at 8:37 AM, David Rowley >> Why is writing tests that produce the same output required? >> >> We have many tests with alternative outputs. Look in >> src/test/regress/expected for files matching _1.out >> > > That's true, but we usually add such alternative outputs when we know > all the possible variants, and only as long as "all the variants" > doesn't amount to everything possible. AFAIU, that's not true here. > Also, on a given machine a particular row is guaranteed to fall in a > given partition. On a different machine it will fall in some other > partition, but always that partition on that machine. We don't have a > way to select an alternate output based on the architecture. Maybe a > better idea is to use a .source file, creating the .out on the fly > based on the architecture of the machine: testing the hash output for > a given value to decide which partition it will fall into and then > crafting the .out with that partition's name. Sounds like you're saying that if we have too many alternative files then there's a chance that one could pass by luck. Maybe we can just back up what's just been committed by actually executing the queries and ensuring that all rows that made it into the table make it back out again. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
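Something along these lines (a sketch reusing the hp table from the failing test; the specific row values come from the diff quoted earlier) would give that cross-check without caring which partition each row hashed to:

-- Show where each row actually landed; the partition names may differ
-- across platforms, but every inserted row must appear exactly once.
SELECT tableoid::regclass AS part, a, b FROM hp ORDER BY part, a, b;

-- And for each pruned EXPLAIN in the test, also run the query itself;
-- e.g. the row (1, 'xxx') must come back whichever partition the
-- pruning step decided to scan.
SELECT * FROM hp WHERE a = 1 AND b = 'xxx';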
On Sat, Apr 7, 2018 at 8:55 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 April 2018 at 15:14, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Sat, Apr 7, 2018 at 8:37 AM, David Rowley >>> Why is writing tests that produce the same output required? >>> >>> We have many tests with alternative outputs. Look in >>> src/test/regress/expected for files matching _1.out >>> >> >> That's true, but we usually add such alternative outputs when we know >> all the possible variants, and only as long as "all the variants" >> doesn't amount to everything possible. AFAIU, that's not true here. >> Also, on a given machine a particular row is guaranteed to fall in a >> given partition. On a different machine it will fall in some other >> partition, but always that partition on that machine. We don't have a >> way to select an alternate output based on the architecture. Maybe a >> better idea is to use a .source file, creating the .out on the fly >> based on the architecture of the machine: testing the hash output for >> a given value to decide which partition it will fall into and then >> crafting the .out with that partition's name. > > Sounds like you're saying that if we have too many alternative files > then there's a chance that one could pass by luck. Yes. > > Maybe we can just back up what's just been committed by actually > executing the queries and ensuring that all rows that made it into the > table make it back out again. That's one way. But how would we make sure that they landed in the proper partition? Actually, we do not know what the proper partition is for a given architecture. And how would we make sure that all rows with the same partition key land in the same partition? That's why I am suggesting calculating the hash value, taking the modulo, and crafting the name of the partition where the corresponding row will land on a given architecture. That way, we are sure that the tuple routing logic is correct, and also the partition pruning logic. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
David Rowley <david.rowley@2ndquadrant.com> writes: > Sounds like you're saying that if we have too many alternative files > then there's a chance that one could pass by luck. Yeah, exactly: it passed, but did it pass for the right reason? If there's just two expected-files, it's likely not a big problem, but if you have a bunch it's something to worry about. I'm also wondering how come we had hash partitioning before and did not have this sort of problem. Is it just that we added a new test that's more sensitive to the details of the hashing (if so, could it be made less so)? Or is there actually more platform dependence now than before (and if so, why is that)? regards, tom lane
On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm also wondering how come we had hash partitioning before and > did not have this sort of problem. Is it just that we added a > new test that's more sensitive to the details of the hashing > (if so, could it be made less so)? Or is there actually more > platform dependence now than before (and if so, why is that)? We didn't prune HASH partitions before today. They were just all returned. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2018-04-06 23:41:22 -0400, Tom Lane wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: > > Sounds like you're saying that if we have too many alternative files > > then there's a chance that one could pass by luck. > > Yeah, exactly: it passed, but did it pass for the right reason? > > If there's just two expected-files, it's likely not a big problem, > but if you have a bunch it's something to worry about. There should be only two alternatives, given our current hashing implementation, right? Greetings, Andres Freund
On 7 April 2018 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> Sounds like you're saying that if we have too many alternative files >> then there's a chance that one could pass by luck. > > Yeah, exactly: it passed, but did it pass for the right reason? > > If there's just two expected-files, it's likely not a big problem, > but if you have a bunch it's something to worry about. Right, I suggest we wait and see if all members go green again as a result of 40e42e1024c, and if they're happy then we could maybe leave it as is with the 2 alternatives output files. If there are some other variations that crop up, then we can think harder about what we can do to improve the coverage. I don't particularly think it matters which hash partition a tuple goes into, as long as the hash function spreads the values out enough and most importantly, the pruning code looks for the tuple in the partition that it was actually inserted into in the first place. Obviously, we also want to ensure we never do anything which would change the matching partition in either minor or major version upgrades too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2018-04-07 15:49:54 +1200, David Rowley wrote: > Right, I suggest we wait and see if all members go green again as a > result of 40e42e1024c, and if they're happy then we could maybe leave > it as is with the 2 alternatives output files. At least the first previously borked animal came back green (termite). > I don't particularly think it matters which hash partition a tuple > goes into, as long as the hash function spreads the values out enough > and most importantly, the pruning code looks for the tuple in the > partition that it was actually inserted into in the first place. > Obviously, we also want to ensure we never do anything which would > change the matching partition in either minor or major version > upgrades too. +1 I've also attempted to fix rhinoceros's failure I remarked upon a couple hours ago in https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de Greetings, Andres Freund
On 7 April 2018 at 09:03, Andres Freund <andres@anarazel.de> wrote: > There's also > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2018-04-06%2020%3A45%3A01 > [sepgsql misc.out diff quoted in full above] > seems you just need to remove those rows from the expected file. Agreed. Patch attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 16:09, Andres Freund <andres@anarazel.de> wrote: > I've also attempted to fix rhinoceros's failure I remarked upon a couple > hours ago in > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de Oh, thanks! I had just been looking at that too... -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2018 at 15:18, Andres Freund <andres@anarazel.de> wrote: > I've pushed the two patches (collapsed). Trying to get the BF green-ish > again... termite has now gone green. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Apr 7, 2018 at 1:09 PM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-04-07 15:49:54 +1200, David Rowley wrote: >> Right, I suggest we wait and see if all members go green again as a >> result of 40e42e1024c, and if they're happy then we could maybe leave >> it as is with the 2 alternatives output files. > > At least the first previously borked animal came back green (termite). > > >> I don't particularly think it matters which hash partition a tuple >> goes into, as long as the hash function spreads the values out enough >> and most importantly, the pruning code looks for the tuple in the >> partition that it was actually inserted into in the first place. >> Obviously, we also want to ensure we never do anything which would >> change the matching partition in either minor or major version >> upgrades too. > > +1 +1 Given that the difference only appeared on animals that David pointed out have big-endian architecture, it seems we'd only need two output files. It does seem true that the extended hashing functions that were added to support partitioning would somehow be affected by endianness. Thank you David for creating the patches and Andres for committing them. Buildfarm seems to be turning green where it had gone red due to the hashing differences. > I've also attempted to fix rhinoceros's failure I remarked upon a couple > hours ago in > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de Thanks Andres. Regards, Amit
Amit Langote <amitlangote09@gmail.com> writes: > Given that the difference only appeared on animals that David pointed > out have big-endian architecture, it seems we'd only need two output > files. Dunno, I'm wondering whether 32 vs 64 bit will make a difference. regards, tom lane
On Sat, Apr 7, 2018 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <amitlangote09@gmail.com> writes: >> Given that the difference only appeared on animals that David pointed >> out have big-endian architecture, it seems we'd only need two output >> files. > > Dunno, I'm wondering whether 32 vs 64 bit will make a difference. There was one 32-bit animal in the failing set, which apparently produces the same hashes as others (allegedly due to endianness difference). powerpc 32-bit userspace on ppc64 host: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07 ...and it has turned green since the alternative outputs fix went in. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-07%2004%3A06%3A09 Thanks, Amit
Andres Freund wrote: > Hi, > > On 2018-04-07 15:49:54 +1200, David Rowley wrote: > > Right, I suggest we wait and see if all members go green again as a > > result of 40e42e1024c, and if they're happy then we could maybe leave > > it as is with the 2 alternatives output files. > > At least the first previously borked animal came back green (termite). Thanks everyone for addressing this. > I've also attempted to fix rhinoceros's failure I remarked upon a couple > hours ago in > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de And this, too. I was unsure if this was because we were failing to call some object access hook from the new code, or because of the additional pruning. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > I've also attempted to fix rhinoceros's failure I remarked upon a couple > > hours ago in > > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de > > And this, too. I was unsure if this was because we were failing to call > some object access hook from the new code, or because of the additional > pruning. That's possible. I did attempt to skim the code; that's where my complaint about the docs originated. There certainly isn't an InvokeFunctionExecuteHook() present. It's not clear to me whether that's an issue - we don't invoke the hooks in a significant number of places either, and as far as I can discern there's not much rule or reason about where we invoke it. Greetings, Andres Freund
Andres Freund wrote: > On 2018-04-07 08:13:23 -0300, Alvaro Herrera wrote: > > Andres Freund wrote: > > > I've also attempted to fix rhinoceros's failure I remarked upon a couple > > > hours ago in > > > https://postgr.es/m/20180406210330.wmqw42wqgiicktli@alap3.anarazel.de > > > > And this, too. I was unsure if this was because we were failing to call > > some object access hook from the new code, or because of the additional > > pruning. > > That's possible. I did attempt to skim the code; that's where my > complaint about the docs originated. There certainly isn't an > InvokeFunctionExecuteHook() present. It's not clear to me whether > that's an issue - we don't invoke the hooks in a significant number of > places either, and as far as I can discern there's not much rule or > reason about where we invoke it. I managed to convince myself that it's not higher-level code's responsibility to invoke the execute hooks; the likelihood of bugs of omission seems just too large. But maybe I'm wrong. There's a small number of InvokeFunctionExecuteHook calls in the executor, but I really doubt that they exhaustively cover everyplace where catalogued functions are called in the executor. CC'ing KaiGai and Stephen Frost; they may want to chip in here. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
While looking at the docs in [1], I saw that we still mention: 4. Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired. This is no longer true. The attached patch removes it. [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi David. On 2018/04/09 12:48, David Rowley wrote: > While looking at the docs in [1], I saw that we still mention: > > 4. Ensure that the constraint_exclusion configuration parameter is not > disabled in postgresql.conf. If it is, queries will not be optimized > as desired. > > This is no longer true. The attached patch removes it. > > [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.html Thanks. I was aware of the changes that would need to be made, but you beat me to writing the patch itself. About the patch: while users no longer need to have constraint_exclusion enabled for SELECT queries, one would still need it for UPDATE/DELETE queries, because the new pruning logic only gets invoked for the former. Alas... Also, further down on that page, there is a 5.10.4 Partitioning and Constraint Exclusion sub-section. I think it would also need some tweaks due to the new developments. I updated your patch to fix that. Please take a look. Thanks, Amit
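The distinction is easy to demonstrate; a sketch using the measurement table from that documentation page (the plan shapes described in the comments are what the above implies, not verified output):

SET constraint_exclusion = off;

-- SELECT: handled by the new partition pruning code, so partitions are
-- still pruned even with the GUC turned off.
EXPLAIN (COSTS OFF) SELECT * FROM measurement WHERE logdate >= DATE '2018-04-01';

-- UPDATE/DELETE: still planned with constraint exclusion in this
-- release, so with the GUC off every partition gets scanned.
EXPLAIN (COSTS OFF) DELETE FROM measurement WHERE logdate >= DATE '2018-04-01';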
On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> Sounds like you're saying that if we have too many alternative files >> then there's a chance that one could pass by luck. > > Yeah, exactly: it passed, but did it pass for the right reason? > > If there's just two expected-files, it's likely not a big problem, > but if you have a bunch it's something to worry about. > > I'm also wondering how come we had hash partitioning before and > did not have this sort of problem. Is it just that we added a > new test that's more sensitive to the details of the hashing > (if so, could it be made less so)? Or is there actually more > platform dependence now than before (and if so, why is that)? The existing hash partitioning tests did have some dependencies on the hash function, but they took care not to use the built-in hash functions. Instead they did stuff like this: CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); CREATE TABLE mchash (a int, b text, c jsonb) PARTITION BY HASH (a test_int4_ops, b test_text_ops); I think that this approach should also be used for the new tests. Variant expected output files are a pain to maintain, and you basically just have to take whatever output you get as the right answer, because nobody knows what output a certain built-in hash function should produce for a given input except by running the code. If you do the kind of thing shown above, though, then you can easily see by inspection that you're getting the right answer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
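For instance, building on the opclass above (the partition names here are hypothetical), the noop "hash" makes the partition each value maps to a matter of simple, platform-independent integer arithmetic, so a single expected file should work everywhere:

CREATE TABLE hp_noop (a int) PARTITION BY HASH (a test_int4_ops);
CREATE TABLE hp_noop0 PARTITION OF hp_noop FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE hp_noop1 PARTITION OF hp_noop FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE hp_noop2 PARTITION OF hp_noop FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE hp_noop3 PARTITION OF hp_noop FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Whichever remainder a = 1 maps to, it is the same on every platform,
-- so the partition name in the EXPLAIN output is stable.
EXPLAIN (COSTS OFF) SELECT * FROM hp_noop WHERE a = 1;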
On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> David Rowley <david.rowley@2ndquadrant.com> writes: >>> Sounds like you're saying that if we have too many alternative files >>> then there's a chance that one could pass by luck. >> >> Yeah, exactly: it passed, but did it pass for the right reason? >> >> If there's just two expected-files, it's likely not a big problem, >> but if you have a bunch it's something to worry about. >> >> I'm also wondering how come we had hash partitioning before and >> did not have this sort of problem. Is it just that we added a >> new test that's more sensitive to the details of the hashing >> (if so, could it be made less so)? Or is there actually more >> platform dependence now than before (and if so, why is that)? > > The existing hash partitioning tests did have some dependencies on the > hash function, but they took care not to use the built-in hash > functions. Instead they did stuff like this: > > CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS > $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; > CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS > OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); > CREATE TABLE mchash (a int, b text, c jsonb) > PARTITION BY HASH (a test_int4_ops, b test_text_ops); > > I think that this approach should also be used for the new tests. > Variant expected output files are a pain to maintain, and you > basically just have to take whatever output you get as the right > answer, because nobody knows what output a certain built-in hash > function should produce for a given input except by running the code. > If you do the kind of thing shown above, though, then you can easily > see by inspection that you're getting the right answer. +1. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2018/04/10 13:27, Ashutosh Bapat wrote: > On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> David Rowley <david.rowley@2ndquadrant.com> writes: >>>> Sounds like you're saying that if we have too many alternative files >>>> then there's a chance that one could pass by luck. >>> >>> Yeah, exactly: it passed, but did it pass for the right reason? >>> >>> If there's just two expected-files, it's likely not a big problem, >>> but if you have a bunch it's something to worry about. >>> >>> I'm also wondering how come we had hash partitioning before and >>> did not have this sort of problem. Is it just that we added a >>> new test that's more sensitive to the details of the hashing >>> (if so, could it be made less so)? Or is there actually more >>> platform dependence now than before (and if so, why is that)? >> >> The existing hash partitioning tests did have some dependencies on the >> hash function, but they took care not to use the built-in hash >> functions. Instead they did stuff like this: >> >> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS >> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; >> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS >> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); >> CREATE TABLE mchash (a int, b text, c jsonb) >> PARTITION BY HASH (a test_int4_ops, b test_text_ops); >> >> I think that this approach should also be used for the new tests. >> Variant expected output files are a pain to maintain, and you >> basically just have to take whatever output you get as the right >> answer, because nobody knows what output a certain built-in hash >> function should produce for a given input except by running the code. >> If you do the kind of thing shown above, though, then you can easily >> see by inspection that you're getting the right answer. Thanks for the idea. I think it makes sense, and I also agree that the alternate-outputs approach is not perfectly reliable or maintainable. > +1. Attached find a patch that rewrites the hash partition pruning tests that way. It creates two hash operator classes, one for int4 and another for the text type, and uses them to create the hash partitioned table used in the tests, as done in the existing tests in hash_part.sql. Since that (hopefully) makes the tests always return the same result, I no longer see the need to keep them in a separate partition_prune_hash.sql. The reasoning behind having the separate file was to keep the alternative output file small, as David explained in [1]. However, I noticed that there is a bug in RelationBuildPartitionKey that causes a crash when using a SQL function as the partition support function, as the revised tests do, so please refer to and apply the patches I posted here before running the revised tests: https://www.postgresql.org/message-id/3041e853-b1dd-a0c6-ff21-7cc5633bffd0%40lab.ntt.co.jp Thanks, Amit [1] https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
Attachment
On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/04/10 13:27, Ashutosh Bapat wrote: >> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS >>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; >>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS >>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); >>> CREATE TABLE mchash (a int, b text, c jsonb) >>> PARTITION BY HASH (a test_int4_ops, b test_text_ops); > > Thanks for the idea. I think it makes sense and also agree that alternate > outputs approach is not perfectly reliable and maintainable. > >> +1. > > Attached find a patch that rewrites hash partition pruning tests that > away. It creates two hash operator classes, one for int4 and another for > text type and uses them to create hash partitioned table to be used in the > tests, like done in the existing tests in hash_part.sql. Since that makes > tests (hopefully) reliably return the same result always, I no longer see > the need to keep them in a separate partition_prune_hash.sql. The > reasoning behind having the separate file was to keep the alternative > output file small as David explained in [1]. > [1] > https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com I had a quick look, but I'm still confused about why a function like hash_uint32_extended() is susceptible to varying results depending on CPU endianness but hash_combine64 is not. Apart from that confusion, looking at the patch: +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT; +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT; Why coalesce here? Maybe I've not thought of something, but coalesce only seems useful to me if there's > 1 argument. Plus the function is strict, so I'm not sure it's really doing anything even if you added a default. I know this one was there before, but I only just noticed it: +-- pruning should work if non-null values are provided for all the keys +explain (costs off) select * from hp where a is null and b is null; The comment is a bit misleading given the first test below it is testing for nulls. Maybe it can be changed to +-- pruning should work if values or is null clauses are provided for all partition keys. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
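Both of those observations can be checked directly in SQL; a minimal demonstration, with an illustrative function name rather than one from the patch:

SELECT coalesce(NULL::int4) IS NULL;   -- true: single-argument coalesce is a no-op
CREATE FUNCTION strict_noop_demo(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1, 0)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
SELECT strict_noop_demo(NULL, 0) IS NULL;  -- true: a STRICT function returns NULL for NULL input without ever evaluating its body

In other words, the coalesce inside a STRICT function can never see a NULL, so any default it supplies is dead code.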
On Tue, Apr 10, 2018 at 5:32 PM, David Rowley <david.rowley@2ndquadrant.com> wrote: > On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/04/10 13:27, Ashutosh Bapat wrote: >>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS >>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; >>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS >>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); >>>> CREATE TABLE mchash (a int, b text, c jsonb) >>>> PARTITION BY HASH (a test_int4_ops, b test_text_ops); >> >> Thanks for the idea. I think it makes sense and also agree that alternate >> outputs approach is not perfectly reliable and maintainable. >> >>> +1. >> >> Attached find a patch that rewrites hash partition pruning tests that >> away. It creates two hash operator classes, one for int4 and another for >> text type and uses them to create hash partitioned table to be used in the >> tests, like done in the existing tests in hash_part.sql. Since that makes >> tests (hopefully) reliably return the same result always, I no longer see >> the need to keep them in a separate partition_prune_hash.sql. The >> reasoning behind having the separate file was to keep the alternative >> output file small as David explained in [1]. >> [1] >> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com > > I had a quick look, but I'm still confused about why a function like > hash_uint32_extended() is susceptible to varying results depending on > CPU endianness but hash_combine64 is not. > > Apart from that confusion, looking at the patch: > > +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS > +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT; > +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS > +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); > +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS > +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT; > > > Why coalesce here? Maybe I've not thought of something, but coalesce > only seems useful to me if there's > 1 argument. Plus the function is > strict, so not sure it's really doing even if you added a default. I think Amit Langote wanted to write coalesce($1, $2), $2 being the seed for the hash function. See how hash operator class functions are defined in sql/insert.sql. Maybe we should just use the same functions or even the same tables. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Thanks for the comment. On 2018/04/10 21:11, Ashutosh Bapat wrote: > On Tue, Apr 10, 2018 at 5:32 PM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> Apart from that confusion, looking at the patch: >> >> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS >> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT; >> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS >> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); >> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS >> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT; >> >> >> Why coalesce here? Maybe I've not thought of something, but coalesce >> only seems useful to me if there's > 1 argument. Plus the function is >> strict, so not sure it's really doing even if you added a default. > > I think Amit Langote wanted to write coalesce($1, $2), $2 being the > seed for hash function. See how hash operator class functions are > defined in sql/insert.sql. Actually, I referenced the functions and operator classes defined in hash_part.sql, not insert.sql. Although, as you point out, I didn't think very hard about the significance of the $2 passed to coalesce in those functions. I will fix that and add it back, along with some other changes that make them almost identical to the definitions in hash_part.sql. > May be we should just use the same > functions or even the same tables. Because the hash_part.sql and partition_prune.sql tests run in parallel, I've decided to rename the functions, operator classes, and the tables in partition_prune.sql. It seems like a good idea in any case. Also, since the existing pruning tests were written with that table, I decided not to change that. Will post an updated patch after addressing David's comment. Regards, Amit
Thanks for the review. On 2018/04/10 21:02, David Rowley wrote: > On 10 April 2018 at 20:56, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/04/10 13:27, Ashutosh Bapat wrote: >>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS >>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE; >>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS >>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8); >>>> CREATE TABLE mchash (a int, b text, c jsonb) >>>> PARTITION BY HASH (a test_int4_ops, b test_text_ops); >> >> Thanks for the idea. I think it makes sense and also agree that alternate >> outputs approach is not perfectly reliable and maintainable. >> >>> +1. >> >> Attached find a patch that rewrites hash partition pruning tests that >> away. It creates two hash operator classes, one for int4 and another for >> text type and uses them to create hash partitioned table to be used in the >> tests, like done in the existing tests in hash_part.sql. Since that makes >> tests (hopefully) reliably return the same result always, I no longer see >> the need to keep them in a separate partition_prune_hash.sql. The >> reasoning behind having the separate file was to keep the alternative >> output file small as David explained in [1]. >> [1] >> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com > > I had a quick look, but I'm still confused about why a function like > hash_uint32_extended() is susceptible to varying results depending on > CPU endianness but hash_combine64 is not. It might as well be the combination of both that's sensitive to endianness. I too am not sure exactly which part. They are both used in succession in compute_hash_value: /* * Compute hash for each datum value by calling respective * datatype-specific hash functions of each partition key * attribute. */ hash = FunctionCall2(&partsupfunc[i], values[i], seed); /* Form a single 64-bit hash value */ rowHash = hash_combine64(rowHash, DatumGetUInt64(hash)); > Apart from that confusion, looking at the patch: > > +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS > +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT; > +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS > +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); > +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS > +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT; > > > Why coalesce here? Maybe I've not thought of something, but coalesce > only seems useful to me if there's > 1 argument. Plus the function is > strict, so not sure it's really doing even if you added a default. After reading Ashutosh's comment, I realized I didn't really mean to add the STRICT to those function definitions. As these are not operators, but support (hash) procedures, it's insignificant to the pruning code whether they are STRICT or not, unlike clause operators, where it is. Also, I've adopted the coalesce-based hashing function from hash_part.sql, albeit with some unnecessary tweaks. I've not read anywhere about why the coalesce was used in the first place, but it's insignificant for our purpose here anyway.
> I know this one was there before, but I only just noticed it: > > +-- pruning should work if non-null values are provided for all the keys > +explain (costs off) select * from hp where a is null and b is null; > > The comment is a bit misleading given the first test below it is > testing for nulls. Maybe it can be changed to > > +-- pruning should work if values or is null clauses are provided for > all partition keys. I have adjusted the comments. Updated patch attached. Thanks, Amit
Attachment
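One way to see why the combine step, at least, should not vary across platforms: hash_combine64 is pure 64-bit integer arithmetic, which behaves identically regardless of byte order, so any platform sensitivity would have to enter through the per-datum hash functions instead. Paraphrasing its definition from src/include/utils/hashutils.h as it stood at the time:

static inline uint64
hash_combine64(uint64 a, uint64 b)
{
	/* only xor, shifts and addition on uint64 values: no byte-order dependence */
	a ^= b + UINT64CONST(0x49a0f4dd15e5a8e3) + (a << 54) + (a >> 7);
	return a;
}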
On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Updated patch attached. Thanks for the updated patch. The only thing I'm not sure about is the changes you've made to the COALESCE function. +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS +$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE; +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS +$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE; Why does one default to the seed and the other to an empty string? Shouldn't they both do the same thing? If you were to copy hash_part.sql, you'd just use coalesce($1, 0) and coalesce($1, ''); any special reason not to do that? Also just wondering if it's worth adding some verification that we've actually eliminated the correct partitions by backing the tests up with a call to satisfies_hash_partition. I've attached a delta patch that applies to your v2 which does this. Do you think it's worth doing? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
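For reference, satisfies_hash_partition() takes the parent table's OID, a modulus, a remainder, and the partition key values, and returns whether a row with those values would route to the partition defined by that modulus and remainder. The delta presumably backs each pruning test with checks of roughly this shape (the key values here are illustrative):

-- would a row with a = 1, b = 'xxx' land in the partition with modulus 4, remainder 0?
SELECT satisfies_hash_partition('hp'::regclass, 4, 0, 1::int4, 'xxx'::text);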
Hi David. Thanks for the review. On 2018/04/11 17:59, David Rowley wrote: > On 11 April 2018 at 18:04, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Updated patch attached. > > Thanks for the updated patch. > > The only thing I'm not sure about is the chances you've made to the > COALESCE function. > > +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS > +$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE; > +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS > +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8); > +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS > +$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE; > > Why does one default to the seed and the other to an empty string? > Shouldn't they both do the same thing? If you were to copy the > hash_part.sql you'd just coalesce($1, 0) and coalesce($1, ''), any > special reason not to do that? Oops, so I hadn't actually restored it to the way it is in hash_part.sql. Also, Ashutosh was talking about the custom hashing function used in insert.sql, not hash_part.sql, which I based my revision upon. Fixed it now. > Also just wondering if it's worth adding some verification that we've > actually eliminated the correct partitions by backing the tests up > with a call to satisfies_hash_partition. > > I've attached a delta patch that applies to your v2 which does this. > Do you think it's worth doing? We can check by inspection that individual values are in the appropriate partitions, which is the point of having the inserts and the select just above the actual pruning-related tests. So, I'm not sure if adding the satisfies_hash_partition check against each pruning test adds much. Attached revised patch. Thanks, Amit
Attachment
On Wed, Apr 11, 2018 at 2:52 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> I've attached a delta patch that applies to your v2 which does this. >> Do you think it's worth doing? > > We can see check by inspection that individual values are in appropriate > partitions, which is the point of having the inserts and the select just > above the actual pruning related tests. So, I'm not sure if adding the > satisfies_hash_partition against each pruning tests adds much. +1. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 11 April 2018 at 21:22, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Also just wondering if it's worth adding some verification that we've >> actually eliminated the correct partitions by backing the tests up >> with a call to satisfies_hash_partition. >> >> I've attached a delta patch that applies to your v2 which does this. >> Do you think it's worth doing? > > We can see check by inspection that individual values are in appropriate > partitions, which is the point of having the inserts and the select just > above the actual pruning related tests. So, I'm not sure if adding the > satisfies_hash_partition against each pruning tests adds much. Right, that's true. > Attached revised patch. Thanks. It looks fine to me, with or without the satisfies_hash_partition tests. I agree that they're probably overkill, but I see you've added them now. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Here's an idea. Why don't we move the function/opclass creation lines to insert.sql, without the DROPs, and use the same functions/opclasses in the four tests insert.sql, alter_table.sql, hash_part.sql and partition_prune.sql, i.e. not recreate what are essentially the same objects three times? This also leaves them around for the pg_upgrade test, which is not a bad thing. (This would require a few updates to insert.sql because the definitions there are different, but it shouldn't be a problem coverage-wise.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/11 21:35, Alvaro Herrera wrote: > Here's an idea. Why don't we move the function/opclass creation lines > to insert.sql, without the DROPs, and use the same functions/opclasses > in the three tests insert.sql, alter_table.sql, hash_part.sql and > partition_prune.sql, i.e. not recreate what are essentially the same > objects three times? This also leaves them around for the pg_upgrade > test, which is not a bad thing. > > (This would require a few updates to insert.sql because the definitions > there are different, but it shouldn't be a problem coverage-wise.) OK, I've tried doing that. Needed adjustments to hash_part.sql as well. The hash function for int4 was defined differently in insert.sql, alter_table.sql, and hash_part.sql. I went with the definition in insert.sql, which although slightly different from the one alter_table.sql, didn't affect the latter's output in any way. Since the definition in hash_part.sql was totally different, a couple of tests needed adjusting after starting to use hash opclasses defined in insert.sql. Attached updated patch. PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates that there are hash partitioning related tests in create_table, foreign_key, and partition_join files as well. Do we want to use the custom opclass in those files as well? Thanks, Amit
Attachment
On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Here's an idea. Why don't we move the function/opclass creation lines > to insert.sql, without the DROPs, and use the same functions/opclasses > in the three tests insert.sql, alter_table.sql, hash_part.sql and > partition_prune.sql, i.e. not recreate what are essentially the same > objects three times? This also leaves them around for the pg_upgrade > test, which is not a bad thing. That sounds good, but maybe we should go further and move the partitioning tests out of generically-named things like insert.sql altogether and have test names that actually mention partitioning. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018/04/13 1:47, Robert Haas wrote: > On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> Here's an idea. Why don't we move the function/opclass creation lines >> to insert.sql, without the DROPs, and use the same functions/opclasses >> in the three tests insert.sql, alter_table.sql, hash_part.sql and >> partition_prune.sql, i.e. not recreate what are essentially the same >> objects three times? This also leaves them around for the pg_upgrade >> test, which is not a bad thing. > > That sounds good, but maybe we should go further and move the > partitioning tests out of generically-named things like insert.sql > altogether and have test names that actually mention partitioning. Do you mean to do that for *all* files that have tests exercising some partitioning code, even if it's just one test? I can see that tests in at least some of them could be put into their own partition_ file as follows: partition_insert (including tests in insert_conflict) partition_update partition_triggers partition_indexing (indexing.sql added when partitioned indexes went in) partition_ddl (for the tests in create_table and alter_table) That leaves: cluster create_index (one test here could be moved to partition_indexing?) foreign_data (could be moved to partition_ddl?) foreign_key (could be moved to partition_ddl?) hash_part (leave alone, because already contains 'part' in the name?) identity join plancache plpgsql publication rowsecurity rules stats_ext tablesample truncate updatable_views vacuum What about the tests in inherit.sql that start with: -- -- Check that constraint exclusion works correctly with partitions using -- implicit constraints generated from the partition bound information. -- Maybe, just move all of them to partition_prune.sql, because we no longer use constraint exclusion for pruning? Thanks, Amit
On 13 April 2018 at 14:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/04/13 1:47, Robert Haas wrote: >> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> Here's an idea. Why don't we move the function/opclass creation lines >>> to insert.sql, without the DROPs, and use the same functions/opclasses >>> in the three tests insert.sql, alter_table.sql, hash_part.sql and >>> partition_prune.sql, i.e. not recreate what are essentially the same >>> objects three times? This also leaves them around for the pg_upgrade >>> test, which is not a bad thing. >> >> That sounds good, but maybe we should go further and move the >> partitioning tests out of generically-named things like insert.sql >> altogether and have test names that actually mention partitioning. > > Do you mean to do that for *all* files that have tests exercising some > partitioning code, even if it's just one test? I can see that tests in at > least some of them could be put into their own partition_ file as follows: Wouldn't it be best to just move hash partition tests into hash_part? Leave all the other stuff where it is? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 13, 2018 at 7:45 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/04/13 1:47, Robert Haas wrote: >> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >>> Here's an idea. Why don't we move the function/opclass creation lines >>> to insert.sql, without the DROPs, and use the same functions/opclasses >>> in the three tests insert.sql, alter_table.sql, hash_part.sql and >>> partition_prune.sql, i.e. not recreate what are essentially the same >>> objects three times? This also leaves them around for the pg_upgrade >>> test, which is not a bad thing. >> >> That sounds good, but maybe we should go further and move the >> partitioning tests out of generically-named things like insert.sql >> altogether and have test names that actually mention partitioning. > > Do you mean to do that for *all* files that have tests exercising some > partitioning code, even if it's just one test? I can see that tests in at > least some of them could be put into their own partition_ file as follows: > > partition_insert (including tests in insert_conflict) > partition_update > partition_triggers > partition_indexing (indexing.sql added when partitioned indexes went in) > partition_ddl (for the tests in create_table and alter_table) > We haven't generally created test files specific to a table type, for example temporary or unlogged tables; instead we have created files by SQL command. But then that's not true for indexes; we have separate files for indexes, and we also have a separate file for materialized views and files for various data types. So, our test file organization seems to have cut across SQL commands and object types. But partitioning seems an area large enough to have files of its own; we already have partition_join and partition_aggregate. Do we want to move to a directory-based organization for tests as well, where sql/ and expected/ will have directories within them for various types of objects like partitioned tables, indexes, regular tables, data types, etc., and each of those will have files organized by SQL command? An immediate question arises as to where to add the files that exercise a mixture of objects; maybe in the directory corresponding to the primary object, so that materialized views over partitioned tables, say, would fit in the materialized view (or just views?) directory. Whatever organization we want to use, it should be easy to find test cases for relevant functionality, e.g. all tests for partitioned tables or all ALTER TABLE command tests. > What about the tests in inherit.sql that start with: > > -- > -- Check that constraint exclusion works correctly with partitions using > -- implicit constraints generated from the partition bound information. > -- > > Maybe, just move all of them to partition_prune.sql, because we no longer > use constraint exclusion for pruning? I think we need to have some test cases somewhere to test constraint exclusion on partitions and partitioned tables; those do not necessarily fit partition pruning. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Robert Haas wrote: > On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Here's an idea. Why don't we move the function/opclass creation lines > > to insert.sql, without the DROPs, and use the same functions/opclasses > > in the three tests insert.sql, alter_table.sql, hash_part.sql and > > partition_prune.sql, i.e. not recreate what are essentially the same > > objects three times? This also leaves them around for the pg_upgrade > > test, which is not a bad thing. > > That sounds good, but maybe we should go further and move the > partitioning tests out of generically-named things like insert.sql > altogether and have test names that actually mention partitioning. I don't think that's necessary to fix the problem that partition_prune_hash.sql file has two expected output files. If you want to propose such a reorganization, feel free, but let's not hijack the patch at hand. For the record, I'm not a fan of the idea. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 13, 2018 at 10:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Robert Haas wrote: >> On Wed, Apr 11, 2018 at 8:35 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: >> > Here's an idea. Why don't we move the function/opclass creation lines >> > to insert.sql, without the DROPs, and use the same functions/opclasses >> > in the three tests insert.sql, alter_table.sql, hash_part.sql and >> > partition_prune.sql, i.e. not recreate what are essentially the same >> > objects three times? This also leaves them around for the pg_upgrade >> > test, which is not a bad thing. >> >> That sounds good, but maybe we should go further and move the >> partitioning tests out of generically-named things like insert.sql >> altogether and have test names that actually mention partitioning. > > I don't think that's necessary to fix the problem that > partition_prune_hash.sql file has two expected output files. If you > want to propose such a reorganization, feel free, but let's not hijack > the patch at hand. For the record, I'm not a fan of the idea. Fair enough. I don't think I'm hijacking the thread much more than it was already hijacked; and it was just a thought. I haven't really studied the tests well enough to have a really clear idea what a better organization would look like. It was just that, for example, the commit that added hash partitioning added tests to 5 different files, and some things had to be duplicated as a result. It sounds like what you've already done improves that, but I was wondering if there's a way to do better. I don't feel super-strongly about it though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Amit Langote wrote: > PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates > that there are hash partitioning related tests in create_table, > foreign_key, and partition_join files as well. Do we want to use the > custom opclass in those files as well? By the way, let me suggest 'git grep -i' instead. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/04/21 0:58, Alvaro Herrera wrote: > Amit Langote wrote: > >> PS: git grep "partition by hash\|PARTITION BY HASH" on src/test indicates >> that there are hash partitioning related tests in create_table, >> foreign_key, and partition_join files as well. Do we want to use the >> custom opclass in those files as well? > > By the way, let me suggest 'git grep -i' instead. Ah, thanks. Regards, Amit
Hello everyone in this thread! I got a server crash similar to the one in [1] on the master branch since commit 9fdb675fc5d2de825414e05939727de8b120ae81, when the assertion fails because the second argument of the ScalarArrayOpExpr is not a Const or an ArrayExpr, but an ArrayCoerceExpr (see [2]): =# create table list_parted ( a varchar ) partition by list (a); =# create table part_ab_cd partition of list_parted for values in ('ab', 'cd'); =# CREATE OR REPLACE FUNCTION public.x_stl_text_integer ( ) RETURNS text STABLE AS $body$ BEGIN RAISE NOTICE 's text integer'; RETURN 1::text; END; $body$ LANGUAGE 'plpgsql'; =# explain (costs off) select * from list_parted where a in ('ab', 'cd', x_stl_text_integer()); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> [1] https://www.postgresql.org/message-id/CAKcux6nCsCmu9oUnnuKZkeBenYvUFbU2Lt4q2MFNEb7QErzn8w%40mail.gmail.com [2] partprune.c, function match_clause_to_partition_key: if (IsA(rightop, Const)) { ... } else { ArrayExpr *arrexpr = castNode(ArrayExpr, rightop); # fails here ... } -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
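The crash is an assertion failure because castNode() is a checked cast; simplified from its definition in src/include/nodes/nodes.h, the failing line expands to roughly:

/* castNode(ArrayExpr, rightop) in an assert-enabled build */
Assert(rightop == NULL || IsA(rightop, ArrayExpr));  /* fails: rightop is an ArrayCoerceExpr */
arrexpr = (ArrayExpr *) rightop;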
On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote: > I got a similar server crash as in [1] on the master branch since the commit > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is > an ArrayCoerceExpr (see [2]): Indeed, I can see the crash. I have been playing with this stuff and I am in the middle of writing the patch, but let's track this properly for now. -- Michael
Attachment
On 07-05-2018 4:37, Michael Paquier wrote: > On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote: >> I got a similar server crash as in [1] on the master branch since the >> commit >> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails >> because >> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, >> but is >> an ArrayCoerceExpr (see [2]): > > Indeed, I can see the crash. I have been playing with this stuff and I > am in the middle of writing the patch, but let's track this properly > for > now. Thank you very much! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote: > On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote: > > I got a similar server crash as in [1] on the master branch since the commit > > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because > > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is > > an ArrayCoerceExpr (see [2]): > > Indeed, I can see the crash. I have been playing with this stuff and I > am in the middle of writing the patch, but let's track this properly for > now. So the problem appears when an expression needs to use COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another, which requires an execution state to be able to build the list of elements. The clause matching happens at the planning stage, so while there are surely cases where this could be improved depending on the expression type, I propose to just discard all clauses which do not match OpExpr and ArrayExpr for now, as per the attached. It would definitely be good practice to have a default code path returning PARTCLAUSE_UNSUPPORTED where the element list cannot be built. Thoughts? -- Michael
Attachment
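In outline, the proposal gives the array-argument handling an explicit fallback instead of a blind cast; a minimal sketch of the shape of the fix (not the committed patch text):

if (IsA(rightop, Const))
{
    /* array constant: deconstruct it into its elements at plan time */
}
else if (IsA(rightop, ArrayExpr))
{
    /* explicit ARRAY[...] construct: walk its element list directly */
}
else
{
    /* anything else, e.g. an ArrayCoerceExpr, would need executor support */
    return PARTCLAUSE_UNSUPPORTED;
}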
Thank you Marina for the report and Michael for following up. On 2018/05/07 16:56, Michael Paquier wrote: > On Mon, May 07, 2018 at 10:37:10AM +0900, Michael Paquier wrote: >> On Fri, May 04, 2018 at 12:32:23PM +0300, Marina Polyakova wrote: >>> I got a similar server crash as in [1] on the master branch since the commit >>> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because >>> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is >>> an ArrayCoerceExpr (see [2]): >> >> Indeed, I can see the crash. I have been playing with this stuff and I >> am in the middle of writing the patch, but let's track this properly for >> now. > > So the problem appears when an expression needs to use > COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another, > which requires an execution state to be able to build the list of > elements. The clause matching happens at planning state, so while there > are surely cases where this could be improved depending on the > expression type, I propose to just discard all clauses which do not > match OpExpr and ArrayExpr for now, as per the attached. It would be > definitely a good practice to have a default code path returning > PARTCLAUSE_UNSUPPORTED where the element list cannot be built. > > Thoughts? I have to agree to go with this conservative approach for now. Although we might be able to evaluate the array elements by applying the coercion specified by ArrayCoerceExpr, let's save that as an improvement to be pursued later. FWIW, constraint exclusion wouldn't prune in this case either (that is, if you try this example with PG 10 or using HEAD as of the parent of 9fdb675fc5), but it doesn't crash like the new pruning code does. Thanks again. Regards, Amit
On Tue, May 08, 2018 at 04:07:41PM +0900, Amit Langote wrote: > I have to agree to go with this conservative approach for now. Although > we might be able to evaluate the array elements by applying the coercion > specified by ArrayCoerceExpr, let's save that as an improvement to be > pursued later. Thanks for confirming. Yes, non-volatile functions would be actually safe, and we'd need to be careful about NULL handling as well, but that's definitely out of scope for v11. > FWIW, constraint exclusion wouldn't prune in this case either (that is, if > you try this example with PG 10 or using HEAD as of the parent of > 9fdb675fc5), but it doesn't crash like the new pruning code does. Yeah, I have noticed that. -- Michael
Attachment
Michael Paquier wrote: > So the problem appears when an expression needs to use > COERCION_PATH_ARRAYCOERCE for a type coercion from one type to another, > which requires an execution state to be able to build the list of > elements. The clause matching happens at planning state, so while there > are surely cases where this could be improved depending on the > expression type, I propose to just discard all clauses which do not > match OpExpr and ArrayExpr for now, as per the attached. It would be > definitely a good practice to have a default code path returning > PARTCLAUSE_UNSUPPORTED where the element list cannot be built. > > Thoughts? I found a related crash and I'm investigating it further. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
So I found that this query also crashed (using your rig), create table coercepart (a varchar) partition by list (a); create table coercepart_ab partition of coercepart for values in ('ab'); create table coercepart_bc partition of coercepart for values in ('bc'); create table coercepart_cd partition of coercepart for values in ('cd'); explain (costs off) select * from coercepart where a ~ any ('{ab}'); The reason for this crash is that gen_partprune_steps_internal() is unable to generate any steps for the clause -- which is natural, since the operator is not in a btree opclass. There are various callers of gen_partprune_steps_internal that are aware that it could return an empty set of steps, but this one in match_clause_to_partition_key for the ScalarArrayOpExpr case was being a bit too optimistic. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
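The fix is correspondingly small in shape: after recursing to build steps from the clauses derived for each array element, an empty result should be treated as "cannot prune on this clause" rather than assumed non-empty. A sketch, with the call's signature abbreviated:

/* steps generated from the clauses built out of the array elements */
elem_steps = gen_partprune_steps_internal(context, rel, elem_clauses,
                                          &contradictory);
if (elem_steps == NIL && !contradictory)
    return PARTCLAUSE_UNSUPPORTED;  /* e.g. operator not in a btree opclass */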
Hi. On 2018/05/09 7:05, Alvaro Herrera wrote: > So I found that this query also crashed (using your rig), > > create table coercepart (a varchar) partition by list (a); > create table coercepart_ab partition of coercepart for values in ('ab'); > create table coercepart_bc partition of coercepart for values in ('bc'); > create table coercepart_cd partition of coercepart for values in ('cd'); > explain (costs off) select * from coercepart where a ~ any ('{ab}'); > > The reason for this crash is that gen_partprune_steps_internal() is > unable to generate any steps for the clause -- which is natural, since > the operator is not in a btree opclass. There are various callers > of gen_partprune_steps_internal that are aware that it could return an > empty set of steps, but this one in match_clause_to_partition_key for > the ScalarArrayOpExpr case was being a bit too optimistic. Yeah, good catch! That fixes the crash, but looking around that code a bit, it seems that we shouldn't even have gotten up to the point you're proposing to fix. If saop_op is not in the partitioning opfamily, it should have bailed out much sooner, that is, here: /* * In case of NOT IN (..), we get a '<>', which we handle if list * partitioning is in use and we're able to confirm that its negator * is a btree equality operator belonging to the partitioning operator * family. */ if (!op_in_opfamily(saop_op, partopfamily)) { <snip> negator = get_negator(saop_op); if (OidIsValid(negator) && op_in_opfamily(negator, partopfamily)) { <snip> } + else + /* otherwise, unsupported! */ + return PARTCLAUSE_UNSUPPORTED; Let me propose that we also have this along with the rest of the changes you made in that part of the function. So, attached is an updated patch. Thanks, Amit
Attachment
On Tue, May 08, 2018 at 07:05:46PM -0300, Alvaro Herrera wrote: > The reason for this crash is that gen_partprune_steps_internal() is > unable to generate any steps for the clause -- which is natural, since > the operator is not in a btree opclass. There are various callers > of gen_partprune_steps_internal that are aware that it could return an > empty set of steps, but this one in match_clause_to_partition_key for > the ScalarArrayOpExpr case was being a bit too optimistic. Indeed. While looking at this code, is there any reason to not make gen_partprune_steps static? This is only used in partprune.c for now, so the intention is to make it available for future patches? -- Michael
Attachment
On 2018/05/09 11:20, Michael Paquier wrote: > While looking at this code, is there any reason to not make > gen_partprune_steps static? This is only used in partprune.c for now, > so the intention is to make it available for future patches? Yeah, making it static might be a good idea. I had made it externally visible because I was under the impression that the runtime-pruning-related code would want to call it from elsewhere within the planner. But instead, it introduced make_partition_pruneinfo(), which in turn calls gen_partprune_steps. Thanks, Amit
On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2018/05/09 11:20, Michael Paquier wrote: >> While looking at this code, is there any reason to not make >> gen_partprune_steps static? This is only used in partprune.c for now, >> so the intention is to make it available for future patches? > > Yeah, making it static might be a good idea. I had made it externally > visible, because I was under the impression that the runtime pruning > related code would want to call it from elsewhere within the planner. > But, instead it introduced a make_partition_pruneinfo() which in turn > calls get_partprune_steps. Yeah. Likely left over from when run-time pruning was generating the steps during execution rather than during planning. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2018/05/09 11:31, David Rowley wrote: > On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2018/05/09 11:20, Michael Paquier wrote: >>> While looking at this code, is there any reason to not make >>> gen_partprune_steps static? This is only used in partprune.c for now, >>> so the intention is to make it available for future patches? >> >> Yeah, making it static might be a good idea. I had made it externally >> visible, because I was under the impression that the runtime pruning >> related code would want to call it from elsewhere within the planner. >> But, instead it introduced a make_partition_pruneinfo() which in turn >> calls get_partprune_steps. > > Yeah. Likely left over from when run-time pruning was generating the > steps during execution rather than during planning. Here is a patch that does that. Thanks, Amit
Attachment
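The change itself is mechanical: drop the extern declaration from the header and make the definition file-local, roughly as follows (the signature shown is as it stood in partprune.c at the time, abbreviated):

/* formerly declared extern in the header; now private to partprune.c */
static List *gen_partprune_steps(RelOptInfo *rel, List *clauses,
                                 bool *contradictory);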
On Wed, May 09, 2018 at 02:01:26PM +0900, Amit Langote wrote: > On 2018/05/09 11:31, David Rowley wrote: >> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2018/05/09 11:20, Michael Paquier wrote: >>>> While looking at this code, is there any reason to not make >>>> gen_partprune_steps static? This is only used in partprune.c for now, >>>> so the intention is to make it available for future patches? >>> >>> Yeah, making it static might be a good idea. I had made it externally >>> visible, because I was under the impression that the runtime pruning >>> related code would want to call it from elsewhere within the planner. >>> But, instead it introduced a make_partition_pruneinfo() which in turn >>> calls get_partprune_steps. >> >> Yeah. Likely left over from when run-time pruning was generating the >> steps during execution rather than during planning. > > Here is a patch that does that. Thanks, Amit. Alvaro, could it be possible to consider as well the patch I posted here? https://www.postgresql.org/message-id/20180424012042.GD1570@paquier.xyz This removes a useless default clause in partprune.c and it got forgotten in the crowd. Just attaching it again here, and it can just be applied on top of the rest. -- Michael
Attachment
Michael Paquier wrote: > Alvaro, could it be possible to consider as well the patch I posted > here? > https://www.postgresql.org/message-id/20180424012042.GD1570@paquier.xyz > > This removes a useless default clause in partprune.c and it got > forgotten in the crowd. Just attaching it again here, and it can just > be applied on top of the rest. Done, thanks for insisting. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Langote wrote: > On 2018/05/09 11:31, David Rowley wrote: > > On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> On 2018/05/09 11:20, Michael Paquier wrote: > >>> While looking at this code, is there any reason to not make > >>> gen_partprune_steps static? This is only used in partprune.c for now, > >>> so the intention is to make it available for future patches? > Here is a patch that does that. Pushed, thanks. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Marina Polyakova wrote: > Hello everyone in this thread! > I got a similar server crash as in [1] on the master branch since the commit > 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because > the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is > an ArrayCoerceExpr (see [2]): Hello Marina, thanks for reporting this. I have pushed all fixes derived from this report -- thanks to Amit and Michaël for those. I verified your test case no longer crashes. If you have more elaborate test cases, please do try these too. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 09, 2018 at 10:39:07AM -0300, Alvaro Herrera wrote: > Michael Paquier wrote: >> This removes a useless default clause in partprune.c and it got >> forgotten in the crowd. Just attaching it again here, and it can just >> be applied on top of the rest. > > Done, thanks for insisting. Thanks! -- Michael
Attachment
On 2018/05/09 22:43, Alvaro Herrera wrote: > Amit Langote wrote: >> On 2018/05/09 11:31, David Rowley wrote: >>> On 9 May 2018 at 14:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> On 2018/05/09 11:20, Michael Paquier wrote: >>>>> While looking at this code, is there any reason to not make >>>>> gen_partprune_steps static? This is only used in partprune.c for now, >>>>> so the intention is to make it available for future patches? > >> Here is a patch that does that. > > Pushed, thanks. Thank you. Regards, Amit
On 09-05-2018 17:30, Alvaro Herrera wrote: > Marina Polyakova wrote: >> Hello everyone in this thread! > >> I got a similar server crash as in [1] on the master branch since the >> commit >> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails >> because >> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, >> but is >> an ArrayCoerceExpr (see [2]): > > Hello Marina, thanks for reporting this. I have pushed all fixes > derived from this report -- thanks to Amit and Michaël for those. > I verified your test case no longer crashes. If you have more > elaborate > test cases, please do try these too. Hello, thank you all very much! :) -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company