Home > mailing lists

Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date	August 9, 2017 16:39:21
Msg-id	CAEepm=16SDdJ9yy7Dmoc+Y6yxsv+f2RDbgOoNHa_8Yz3m61DPA@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
List	pgsql-hackers

Tree view

On Tue, Aug 8, 2017 at 8:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Updated patches attached.

Hi,

I started reviewing this.  It's nicely commented, but it's also very
complicated, and it's going to take me several rounds to understand
what all the parts do, but here's some assorted feedback after reading
some parts of the patches, some tinkering and quite a bit of time
spent trying to break it (so far unsuccessfully).

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join.  I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset.  The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

One micro-optimisation opportunity I noticed was in those
bms_is_subset and bms_overlap calls.  The Bitmapsets don't tend to
have trailing words but often have hundreds of empty leading words.
If I hack bitmapset.{c,h} so that it tracks first_non_empty_wordnum
and then adjust bms_is_subset and bms_overlap so that they start their
searches at Min(a->first_non_empty_wordnum,
b->first_non_empty_wordnum) then the planning time improves
measurably:

1000 partitions: ~1.5s -> 1.3s
2000 partitions: ~7.1s -> 5.8s
4000 partitions: ~50s -> ~44s

When using list-based partitions, it must be possible to omit the part
of a join key that is implied by the partition because the partition
has only one list value.  For example, if I create a two level
hierarchy with one partition per US state and then time-based range
partitions under that, the state part of this merge condition is
redundant:
        Merge Cond: ((sales_wy_2017_10.state =
purchases_wy_2017_10.state) AND (sales_wy_2017_10.created =
purchases_wy_2017_10.created))

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch

-partition_bounds_equal(PartitionKey key,
+partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
PartitionBoundInfob1,
 
PartitionBoundInfo b2)

I wonder is there any value in creating a struct to represent the
common part of PartitionKey and PartitionScheme that functions like
this and others need?  Maybe not.  Perhaps you didn't want to make
PartitionKey contain a PartitionScheme because then you'd lose the
property that every pointer to PartitionScheme in the system must be a
pointer to an interned (canonical) PartitionScheme, so it's always
safe to compare pointers to test for equality?

0005-Canonical-partition-scheme.patch:

+/*
+ * get_relation_partition_info
+ *
+ * Retrieves partitioning information for a given relation.
+ *
+ * For a partitioned table it sets partitioning scheme, partition key
+ * expressions, number of partitions and OIDs of partitions in the given
+ * RelOptInfo.
+ */
+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation relation)

Would this be better called "set_relation_partition_info"?  It doesn't
really "retrieve" the information, it "installs" it.

+{
+       PartitionDesc part_desc;
+
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))
+               return;

Here and elsewhere you use the idiom !(foo = bar), which is perfectly
good C in my book but I understand the project convention is to avoid
implicit pointer->boolean conversion and to prefer expressions like
(foo = bar) != NULL and there is certainly a lot more code like that.

0007-Partition-wise-join-implementation.patch

+       {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,

This GUC should appear in postgresql.conf.sample.

I'm chewing on 0007.  More soon.

-- 
Thomas Munro
http://www.enterprisedb.com

pgsql-hackers by date:

From: Tom Lane
Date: 09 August 2017, 16:39:04
Subject: Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests

From: Tom Lane
Date: 09 August 2017, 17:14:14
Subject: Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix inadequacies in recently added wait events

Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables - Mailing list pgsql-hackers

Previous

Next