Thread: v12.0: ERROR: could not find pathkey item to sort

v12.0: ERROR: could not find pathkey item to sort

From
Justin Pryzby
Date:
I've reduced the failing query as much as possible to this:

-- This is necessary to fail:
SET enable_nestloop=off;

SELECT * FROM
        (SELECT start_time, t1.site_id
        FROM pgw_kpi_view t1
        -- Apparently the where clause is necessary to fail...
        WHERE (start_time>='2019-10-10' AND start_time<'2019-10-11')
        -- The group by MAY be necessary to fail...
        GROUP BY 1,2
        ) AS data
JOIN sites ON ( sites.site_location='' OR sites.site_office=data.site_id)

The view is actually a join of two relkind=p partitioned tables (which I
will acknowledge probably performs poorly).

(gdb) bt
#0  errfinish (dummy=dummy@entry=0) at elog.c:411
#1  0x000000000087a959 in elog_finish (elevel=elevel@entry=20, fmt=fmt@entry=0x9d93d8 "could not find pathkey item to
sort")at elog.c:1365
 
#2  0x00000000006a587f in prepare_sort_from_pathkeys (lefttree=0x7f7cb84492e8, pathkeys=<optimized out>,
relids=0x7f7cb8410700,reqColIdx=reqColIdx@entry=0x0, adjust_tlist_in_place=<optimized out>, 
 
    adjust_tlist_in_place@entry=false, p_numsortkeys=p_numsortkeys@entry=0x7ffc4b2e10c4,
p_sortColIdx=p_sortColIdx@entry=0x7ffc4b2e10c8,p_sortOperators=p_sortOperators@entry=0x7ffc4b2e10d0, 
 
    p_collations=p_collations@entry=0x7ffc4b2e10d8, p_nullsFirst=p_nullsFirst@entry=0x7ffc4b2e10e0) at
createplan.c:5893
#3  0x00000000006a5a6a in make_sort_from_pathkeys (lefttree=<optimized out>, pathkeys=<optimized out>,
relids=<optimizedout>) at createplan.c:6020
 
#4  0x00000000006a6e30 in create_sort_plan (flags=4, best_path=0x7f7cb8410cc8, root=0x7f7fdc3ac6b0) at
createplan.c:1985
#5  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, best_path=0x7f7cb8410cc8, flags=flags@entry=4) at
createplan.c:459
#6  0x00000000006a6e4e in create_group_plan (best_path=0x7f7cb8410d58, root=0x7f7fdc3ac6b0) at createplan.c:2012
#7  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, best_path=best_path@entry=0x7f7cb8410d58, flags=flags@entry=1)
atcreateplan.c:464
 
#8  0x00000000006a8278 in create_merge_append_plan (flags=4, best_path=0x7f7cb8446cd8, root=0x7f7fdc3ac6b0) at
createplan.c:1333
#9  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, best_path=0x7f7cb8446cd8, flags=flags@entry=4) at
createplan.c:402
#10 0x00000000006a6e4e in create_group_plan (best_path=0x7f7cb84486c8, root=0x7f7fdc3ac6b0) at createplan.c:2012
#11 create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, best_path=0x7f7cb84486c8, flags=flags@entry=1) at
createplan.c:464
#12 0x00000000006a9739 in create_plan (root=0x7f7fdc3ac6b0, best_path=<optimized out>) at createplan.c:325
#13 0x00000000006aa988 in create_subqueryscan_plan (scan_clauses=0x0, tlist=0x7f7cb8450820, best_path=0x7f7cb8448db8,
root=0x7f7fdc34b948)at createplan.c:3385
 
#14 create_scan_plan (root=root@entry=0x7f7fdc34b948, best_path=best_path@entry=0x7f7cb8448db8, flags=<optimized out>,
flags@entry=0)at createplan.c:670
 
#15 0x00000000006a6d31 in create_plan_recurse (root=root@entry=0x7f7fdc34b948, best_path=0x7f7cb8448db8,
flags=flags@entry=0)at createplan.c:427
 
#16 0x00000000006a983a in create_nestloop_plan (best_path=0x7f7cb844fb80, root=0x7f7fdc34b948) at createplan.c:4008
#17 create_join_plan (root=root@entry=0x7f7fdc34b948, best_path=best_path@entry=0x7f7cb844fb80) at createplan.c:1020
#18 0x00000000006a6d75 in create_plan_recurse (root=root@entry=0x7f7fdc34b948, best_path=0x7f7cb844fb80,
flags=flags@entry=1)at createplan.c:393
 
#19 0x00000000006a9739 in create_plan (root=root@entry=0x7f7fdc34b948, best_path=<optimized out>) at createplan.c:325
#20 0x00000000006b5a04 in standard_planner (parse=0x1bd2308, cursorOptions=256, boundParams=0x0) at planner.c:413
#21 0x000000000075fb2e in pg_plan_query (querytree=querytree@entry=0x1bd2308, cursorOptions=cursorOptions@entry=256,
boundParams=boundParams@entry=0x0)at postgres.c:878
 
#22 0x000000000075fbee in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256,
boundParams=boundParams@entry=0x0)at postgres.c:968
 
#23 0x000000000076007a in exec_simple_query (
    query_string=0x1ba36e0 "SELECT * FROM\n\t(SELECT start_time, t1.site_id\n\tFROM pgw_kpi_view t1\n\t\n\tWHERE
(start_time>='2019-10-10'AND start_time<'2019-10-11')\n\t\n\tGROUP BY 1,2\n\t) AS data\nJOIN sites ON (
sites.site_location=''OR"...) at postgres.c:1143
 
#24 0x0000000000761212 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x1bd8e70, dbname=0x1bd8d10 "ts",
username=<optimizedout>) at postgres.c:4236
 
#25 0x0000000000483d02 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4431
#26 BackendStartup (port=0x1bd5190) at postmaster.c:4122
#27 ServerLoop () at postmaster.c:1704
#28 0x00000000006f0b1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1b9e280) at postmaster.c:1377
#29 0x0000000000484c93 in main (argc=3, argv=0x1b9e280) at main.c:228


bt f:

#2  0x00000000006a587f in prepare_sort_from_pathkeys (lefttree=0x7f7cb84492e8, pathkeys=<optimized out>,
relids=0x7f7cb8410700,reqColIdx=reqColIdx@entry=0x0, adjust_tlist_in_place=<optimized out>, 
 
    adjust_tlist_in_place@entry=false, p_numsortkeys=p_numsortkeys@entry=0x7ffc4b2e10c4,
p_sortColIdx=p_sortColIdx@entry=0x7ffc4b2e10c8,p_sortOperators=p_sortOperators@entry=0x7ffc4b2e10d0, 
 
    p_collations=p_collations@entry=0x7ffc4b2e10d8, p_nullsFirst=p_nullsFirst@entry=0x7ffc4b2e10e0) at
createplan.c:5893
        sortexpr = <optimized out>
        ec = 0x7f7cb8edbe28
        em = <optimized out>
        tle = <optimized out>
        pathkey = <optimized out>
        pk_datatype = <optimized out>
        sortop = <optimized out>
        j = <optimized out>
        tlist = 0x7f7cb8451bb8
        i = 0x7f7cb8edc2d8
        numsortkeys = 0
        sortColIdx = 0x7f7cb8451c58
        sortOperators = 0x7f7cb8451c70
        collations = 0x7f7cb8451c88
        nullsFirst = 0x7f7cb8451ca0
        __func__ = "prepare_sort_from_pathkeys"
#3  0x00000000006a5a6a in make_sort_from_pathkeys (lefttree=<optimized out>, pathkeys=<optimized out>,
relids=<optimizedout>) at createplan.c:6020
 
        numsortkeys = 32636
        sortColIdx = 0x7f7cb8447468
        sortOperators = 0x7f7cb83fa278
        collations = 0x0
        nullsFirst = 0x7f7cb8edc2f8



Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> I've reduced the failing query as much as possible to this:
> -- This is necessary to fail:
> SET enable_nestloop=off;

> SELECT * FROM
>         (SELECT start_time, t1.site_id
>         FROM pgw_kpi_view t1
>         -- Apparently the where clause is necessary to fail...
>         WHERE (start_time>='2019-10-10' AND start_time<'2019-10-11')
>         -- The group by MAY be necessary to fail...
>         GROUP BY 1,2
>         ) AS data
> JOIN sites ON ( sites.site_location='' OR sites.site_office=data.site_id)

> The view is actually a join of two relkind=p partitioned tables (which I
> will acknowledge probably performs poorly).

Could you provide a self-contained test case please?

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Justin Pryzby
Date:
On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > The view is actually a join of two relkind=p partitioned tables (which I
> > will acknowledge probably performs poorly).
> 
> Could you provide a self-contained test case please?

Working on it.  FWIW explain for a v11 customer looks like this:

 Nested Loop  (cost=10000011818.10..10000076508.23 rows=734500 width=159)
   Join Filter: ((s.site_location = ''::text) OR (s.site_office = (COALESCE(huawei_ggsn_201610.ne_name,
huawei_ggsn_gw_201610.ne_name))))
   ->  Group  (cost=11818.10..11946.31 rows=2937 width=40)
         Group Key: (COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time)),
(COALESCE(huawei_ggsn_201610.ne_name,huawei_ggsn_gw_201610.ne_name))
 
         ->  Merge Append  (cost=11818.10..11931.59 rows=2944 width=40)
               Sort Key: (COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time)),
(COALESCE(huawei_ggsn_201610.ne_name,huawei_ggsn_gw_201610.ne_name))
 
               ->  Group  (cost=332.48..333.10 rows=83 width=40)
                     Group Key: (COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time)),
(COALESCE(huawei_ggsn_201610.ne_name,huawei_ggsn_gw_201610.ne_name))
 
                     ->  Sort  (cost=332.48..332.69 rows=83 width=40)
                           Sort Key: (COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time)),
(COALESCE(huawei_ggsn_201610.ne_name,huawei_ggsn_gw_201610.ne_name))
 
                           ->  Hash Full Join  (cost=46.48..329.84 rows=83 width=40)
                                 Hash Cond: ((huawei_ggsn_201610.ne_name = huawei_ggsn_gw_201610.ne_name) AND
(huawei_ggsn_201610.ggsn_function= huawei_ggsn_gw_201610.ggsn_function) AND (huawei_ggsn_201610.start_time = huawei_
 
ggsn_gw_201610.start_time) AND (huawei_ggsn_201610.interval_seconds = huawei_ggsn_gw_201610.interval_seconds) AND
(huawei_ggsn_201610.device_id= huawei_ggsn_gw_201610.device_id) AND (huawei_ggsn_201610.c_134710251 = huawei_ggs
 
n_gw_201610.c_134710251) AND (huawei_ggsn_201610.c_134710252 = huawei_ggsn_gw_201610.c_134710252) AND
(huawei_ggsn_201610.c_134710253= huawei_ggsn_gw_201610.c_134710253) AND (huawei_ggsn_201610.ne_id =
huawei_ggsn_gw_201610.ne
_id) AND (huawei_ggsn_201610.ugw_function = huawei_ggsn_gw_201610.ugw_function))
                                 Filter: ((COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time) >=
'2019-10-0100:00:00-11'::timestamp with time zone) AND (COALESCE(huawei_ggsn_201610.start_time, huawei_ggs
 
n_gw_201610.start_time) < '2019-10-02 00:00:00-11'::timestamp with time zone))
                                 ->  Seq Scan on huawei_ggsn_201610  (cost=0.00..255.44 rows=744 width=94)
                                 ->  Hash  (cost=20.44..20.44 rows=744 width=94)
                                       ->  Seq Scan on huawei_ggsn_gw_201610  (cost=0.00..20.44 rows=744 width=94)
[...]

I'm suspecting this; is it useful to test with this commit reverted ?

commit 8edd0e79460b414b1d971895312e549e95e12e4f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Mar 25 15:42:35 2019 -0400

    Suppress Append and MergeAppend plan nodes that have a single child.




Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
>> Could you provide a self-contained test case please?

> I'm suspecting this; is it useful to test with this commit reverted ?

I wouldn't bother; we'd still need a test case to find out what the
problem is.

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Justin Pryzby
Date:
On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> Could you provide a self-contained test case please?

SET enable_partitionwise_aggregate = 'on';
SET enable_partitionwise_join = 'on';
SET max_parallel_workers_per_gather=0;
-- maybe not important but explain(settings) suggests I should include them for completeness:
SET effective_io_concurrency = '0';
SET work_mem = '512MB';
SET jit = 'off';

CREATE TABLE s(site_id int, site_location text, site_office text);
INSERT INTO s SELECT generate_series(1,99),'','';

CREATE TABLE t(start_time timestamp, site_id text, i int)PARTITION BY RANGE(start_time);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM ('2019-10-01')TO('2019-10-02');
INSERT INTO t1 SELECT a,b FROM generate_series( '2019-10-01'::timestamp, '2019-10-01 23:45'::timestamp, '15 minutes')a,
generate_series(1,99)b,generate_series(1,99)c;
 
CREATE TABLE t2 PARTITION OF t FOR VALUES FROM ('2019-10-02')TO('2019-10-03');
INSERT INTO t2 SELECT a,b FROM generate_series( '2019-10-02'::timestamp, '2019-10-02 23:45'::timestamp, '15 minutes')a,
generate_series(1,99)b,generate_series(1,99)c;
 

ANALYZE s,t;

explain
SELECT s.* FROM
        (SELECT start_time, site_id::int
        FROM t t1 FULL JOIN t t2 USING(start_time,site_id)
        WHERE (start_time>='2019-10-01' AND start_time<'2019-10-01 01:00')
        GROUP BY 1,2) AS data
JOIN s ON (s.site_location='' OR s.site_office::int=data.site_id)

Justin



Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
>> Could you provide a self-contained test case please?

> [ test case ]

Yup, fails for me too.  Will look shortly.

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
>> Could you provide a self-contained test case please?

> [ test case ]

Oh, this is the same issue Amit described in


https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com

namely that we're not generating EquivalenceClass members corresponding
to sub-joins of a partitionwise join.

Are you interested in helping to test the patches proposed there?

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Justin Pryzby
Date:
On Sun, Oct 13, 2019 at 02:06:02PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> >> Could you provide a self-contained test case please?
> 
> > [ test case ]
> 
> Oh, this is the same issue Amit described in 
> 
>
https://www.postgresql.org/message-id/flat/CA%2BHiwqG2WVUGmLJqtR0tPFhniO%3DH%3D9qQ%2BZ3L_ZC%2BY3-EVQHFGg%40mail.gmail.com
> 
> namely that we're not generating EquivalenceClass members corresponding
> to sub-joins of a partitionwise join.
> 
> Are you interested in helping to test the patches proposed there?

Sure.  Any requests other than testing that our original query works correctly
and maybe endeavoring to read the patch ?

BTW it probably should've been documented as an "Open Item" for v12.

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581



Re: v12.0: ERROR: could not find pathkey item to sort

From
Justin Pryzby
Date:
On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
> BTW it probably should've been documented as an "Open Item" for v12.

https://commitfest.postgresql.org/25/2278/
I realized possibly people were thinking of that as a "feature" and not a
bugfix for backpatch (?)

But, my issue is a query which worked under v11 PWJ but fails under v12
(apparently broken by d25ea01275).

In my mind, if the planner doesn't support that query with PWJ, I think it
should run without PWJ rather than fail.

Justin



Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
>> BTW it probably should've been documented as an "Open Item" for v12.

> https://commitfest.postgresql.org/25/2278/
> I realized possibly people were thinking of that as a "feature" and not a
> bugfix for backpatch (?)
> But, my issue is a query which worked under v11 PWJ but fails under v12
> (apparently broken by d25ea01275).

Yeah, this should have been dealt with as an open item, but it
slipped through the cracks.  We'll make sure to get it fixed,
one way or another, for 12.1.

In view of the proposed patches being dependent on some other
13-only changes, I wonder if we should fix v12 by reverting
d25ea0127.  The potential planner performance loss for large
partition sets could be nasty, but failing to plan at all is worse.

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Amit Langote
Date:
Sorry about the late reply.

On Mon, Oct 14, 2019 at 11:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Sun, Oct 13, 2019 at 01:30:29PM -0500, Justin Pryzby wrote:
> >> BTW it probably should've been documented as an "Open Item" for v12.
>
> > https://commitfest.postgresql.org/25/2278/
> > I realized possibly people were thinking of that as a "feature" and not a
> > bugfix for backpatch (?)
> > But, my issue is a query which worked under v11 PWJ but fails under v12
> > (apparently broken by d25ea01275).
>
> Yeah, this should have been dealt with as an open item, but it
> slipped through the cracks.  We'll make sure to get it fixed,
> one way or another, for 12.1.
>
> In view of the proposed patches being dependent on some other
> 13-only changes, I wonder if we should fix v12 by reverting
> d25ea0127.  The potential planner performance loss for large
> partition sets could be nasty, but failing to plan at all is worse.

Actually, the patch I proposed to fix equivalence code can be applied
on its own.  The example I gave on that thread needs us to fix
partitionwise code to even work, which is perhaps a 13-only change,
but we have an example here that is broken due to d25ea01275.
Perhaps, considering applying my patch seems better than alternatives
which are either reverting d25ea01275 or avoiding doing partitionwise
join for such queries.

Since we've got 3373c71553 ("Speed up finding EquivalenceClasses for a
given set of rels") in HEAD, need two versions of the patch; please
see attached.

Thanks,
Amit

Attachment

Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Mon, Oct 14, 2019 at 11:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In view of the proposed patches being dependent on some other
>> 13-only changes, I wonder if we should fix v12 by reverting
>> d25ea0127.  The potential planner performance loss for large
>> partition sets could be nasty, but failing to plan at all is worse.

> Actually, the patch I proposed to fix equivalence code can be applied
> on its own.  The example I gave on that thread needs us to fix
> partitionwise code to even work, which is perhaps a 13-only change,
> but we have an example here that is broken due to d25ea01275.
> Perhaps, considering applying my patch seems better than alternatives
> which are either reverting d25ea01275 or avoiding doing partitionwise
> join for such queries.

I looked at this a bit, and I see that the core idea is to generate
the missing EC members by applying add_child_rel_equivalences when
building child join rels.  Perhaps we can make that work, but this
patch fails to, because you've ignored the comment pointing out
that the nullable_relids fixup logic only works for baserels:

                 * And likewise for nullable_relids.  Note this code assumes
                 * parent and child relids are singletons.

We could improve that to work for joinrels, I think, but it would become
very much more complicated (and slower).  AFAICS the logic would have
to be "iterate through top_parent_relids, see which ones are in
em_nullable_relids, and for each one that is, find the corresponding
child_relid and substitute that in new_nullable_relids".  This is a
bit of a problem because we don't have any really convenient way to
map individual top parent relids to child relids.  I wonder if we
should extend AppendRelInfo to include the top parent relid as well as
the immediate parent.  (Perhaps, while we're at it, we could make
adjust_appendrel_attrs_multilevel less of an inefficient and
underdocumented mess.)

Also, I'm pretty sure this addition is wrong/broken:

+
+                /*
+                 * There aren't going to be more expressions to translate in
+                 * the same EC.
+                 */
+                break;

What makes you think that an EC can only contain one entry per rel?

More generally, as long as this patch requires changing
add_child_rel_equivalences' API anyway, I wonder if we should
rethink that altogether.  Looking at the code now, I realize that
d25ea01275 resulted in horribly bastardizing that function's API,
because the parent_rel and appinfo arguments are only consulted in
some cases, while in other cases we disregard them and rely on
child_rel->top_parent_relids to figure out how to translate stuff.
It would be possible to make the argument list be just (root, child_rel)
and always rely on child_rel->top_parent_relids.  At the very least,
if we keep the extra arguments, we should document them as being just
optimizations.

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Amit Langote
Date:
Thanks for taking a look and sorry about the delay in replying.

On Fri, Oct 25, 2019 at 1:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > On Mon, Oct 14, 2019 at 11:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> In view of the proposed patches being dependent on some other
> >> 13-only changes, I wonder if we should fix v12 by reverting
> >> d25ea0127.  The potential planner performance loss for large
> >> partition sets could be nasty, but failing to plan at all is worse.
>
> > Actually, the patch I proposed to fix equivalence code can be applied
> > on its own.  The example I gave on that thread needs us to fix
> > partitionwise code to even work, which is perhaps a 13-only change,
> > but we have an example here that is broken due to d25ea01275.
> > Perhaps, considering applying my patch seems better than alternatives
> > which are either reverting d25ea01275 or avoiding doing partitionwise
> > join for such queries.
>
> I looked at this a bit, and I see that the core idea is to generate
> the missing EC members by applying add_child_rel_equivalences when
> building child join rels.  Perhaps we can make that work, but this
> patch fails to, because you've ignored the comment pointing out
> that the nullable_relids fixup logic only works for baserels:
>
>                  * And likewise for nullable_relids.  Note this code assumes
>                  * parent and child relids are singletons.
>
> We could improve that to work for joinrels, I think, but it would become
> very much more complicated (and slower).  AFAICS the logic would have
> to be "iterate through top_parent_relids, see which ones are in
> em_nullable_relids, and for each one that is, find the corresponding
> child_relid and substitute that in new_nullable_relids".  This is a
> bit of a problem because we don't have any really convenient way to
> map individual top parent relids to child relids.

Actually, there is adjust_child_relids_multilevel() which translates
the top parent relids in em_nullable_relids to child relids.

I have updated the patches that way.

> I wonder if we
> should extend AppendRelInfo to include the top parent relid as well as
> the immediate parent.  (Perhaps, while we're at it, we could make
> adjust_appendrel_attrs_multilevel less of an inefficient and
> underdocumented mess.)

Hmm, I agree we should try to fix that situation somehow.  Even better
if we could do away with child expressions in ECs, because they cause
EC related code to show up in profiles when executing complex queries
with thousands of partitions.

> Also, I'm pretty sure this addition is wrong/broken:
>
> +
> +                /*
> +                 * There aren't going to be more expressions to translate in
> +                 * the same EC.
> +                 */
> +                break;
>
> What makes you think that an EC can only contain one entry per rel?

I was wrong about that.  Fixed.

> More generally, as long as this patch requires changing
> add_child_rel_equivalences' API anyway, I wonder if we should
> rethink that altogether.  Looking at the code now, I realize that
> d25ea01275 resulted in horribly bastardizing that function's API,
> because the parent_rel and appinfo arguments are only consulted in
> some cases, while in other cases we disregard them and rely on
> child_rel->top_parent_relids to figure out how to translate stuff.
> It would be possible to make the argument list be just (root, child_rel)
> and always rely on child_rel->top_parent_relids.

Actually, as of 3373c71553, add_child_rel_equivalences() looks at
parent_rel->eclass_indexes to look up ECs, so maybe we can't take out
parent_rel.

>  At the very least,
> if we keep the extra arguments, we should document them as being just
> optimizations.

For common cases that doesn't involve multi-level partitioning, it
really helps to have the appinfos be supplied by the caller because
they're already available.  I've added a comment at the top about
that.

Attached updated patches.

Thanks,
Amit

Attachment

Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> Attached updated patches.

[ looks at that... ]  I seriously, seriously dislike what you did
in build_join_rel, ie adding the new joinrel to the global data
structures before it's fully filled in.  That's inevitably going
to bite us on the ass someday, and you couldn't even be bothered
with a comment.

Worse, the reason you did that seems to be so that
generate_join_implied_equalities can find and modify the joinrel,
which is an undocumented and not very well thought out side-effect.
There are other call sites for that where the joinrel may or may not
already exist, meaning that it might or might not add more members into
the joinrel's eclass_indexes.  At best that's wasted work, and at
worst it costs efficiency, since we could in principle get different
sets of common relids depending on which join input pair we're
considering.  They're equally valid sets, but unioning them is
just going to add more relids than we need.

Also, the existing logic around eclass_indexes is that it's only
set for baserels and we know it is valid after we've finished
EC merging.  I don't much like modifying add_child_rel_equivalences
to have some different opinions about that for joinrels.

It'd probably be better to just eat the cost of doing
get_common_eclass_indexes over again when it's time to do
add_child_rel_equivalences for a joinrel, since we aren't (I hope)
going to do that more than once per joinrel anyway.  This would
probably require refactoring things so that there are separate
entry points to add child equivalences for base rels and join rels.
But that seems cleaner anyway than what you've got here.

David --- much of the complexity here comes from the addition of
the eclass_indexes infrastructure, so do you have any thoughts?

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
David Rowley
Date:
On Thu, 31 Oct 2019 at 05:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David --- much of the complexity here comes from the addition of
> the eclass_indexes infrastructure, so do you have any thoughts?

Hindsight makes me think I should have mentioned in the comment for
eclass_indexes that it's only used for simple rels and remains NULL
for anything else.

All the code in equivclass.c either uses get_common_eclass_indexes()
and get_eclass_indexes_for_relids() which go down to the simple rel
level to obtain their eclass_indexes. When calling
get_eclass_indexes_for_relids() we'll build a union Bitmapset with the
indexes from each simple rel that the join rel is made from. We only
ever directly use the eclass_indexes field when we're certain we're
dealing with a simple rel already. get_eclass_indexes_for_relids()
would do the same job, but using the field directly saves a bit of
needless effort and memory allocations. So, in short, I don't really
see why we need to set eclass_indexes for anything other than simple
rels.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: v12.0: ERROR: could not find pathkey item to sort

From
Amit Langote
Date:
Thanks for checking.

On Thu, Oct 31, 2019 at 1:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Also, the existing logic around eclass_indexes is that it's only
> set for baserels and we know it is valid after we've finished
> EC merging.  I don't much like modifying add_child_rel_equivalences
> to have some different opinions about that for joinrels.
>
> It'd probably be better to just eat the cost of doing
> get_common_eclass_indexes over again when it's time to do
> add_child_rel_equivalences for a joinrel, since we aren't (I hope)
> going to do that more than once per joinrel anyway.

If you mean once per child joinrel, then yes.  Implemented this
approach in the attached updated patch.

ISTM, get_common_eclass_indexes() returns the same value for all the
child joinrels (or really for its outer and inner component rels) as
it does for  the parent joinrel.  So, it might be better to set
eclass_indexes in parent joinrel once and use the same value for all
its descendant joinrels.  Although, we'd need to export
get_common_eclass_indexes() out of equivclass.c to call it from
build_join_rel() such that it doesn't require messing with where the
joinrel is added to the global data structure.  Maybe that complicates
eclass_indexes infrastructure though.

> This would
> probably require refactoring things so that there are separate
> entry points to add child equivalences for base rels and join rels.
> But that seems cleaner anyway than what you've got here.

Separate entry points sounds better, but only in HEAD?  Should we have
separate entry points in PG 12 too?

Attached updated patch only for HEAD.

Thanks,
Amit

Attachment

Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
>> This would
>> probably require refactoring things so that there are separate
>> entry points to add child equivalences for base rels and join rels.
>> But that seems cleaner anyway than what you've got here.

> Separate entry points sounds better, but only in HEAD?

I had actually had in mind that we might have two wrappers around a
common search-and-replace routine, but after studying the code I see that
there's just enough differences to make it probably not worth the trouble
to do it like that.  I did spend a bit of time removing cosmetic
differences between the two versions, though, mostly by creating
common local variables.

I think the way you did the matching_ecs computation is actually wrong:
we need to find ECs that reference any rel of the join, not only those
that reference both inputs.  If nothing else, the way you have it here
makes the results dependent on which pair of input rels gets considered
first, and that's certainly bad for multiway joins.

Also, I thought we should try to put more conditions on whether we
invoke add_child_join_rel_equivalences at all.  In the attached I did

    if ((enable_partitionwise_join || enable_partitionwise_aggregate) &&
        (joinrel->has_eclass_joins ||
         has_useful_pathkeys(root, parent_joinrel)))

but I wonder if we could do more, like checking to see if the parent
joinrel is partitioned.  Alternatively, maybe it's unnecessary because
we won't try to build child joinrels unless these conditions are true?

I did not like the test case either.  Creating a whole new (and rather
large) test table just for this one case is unreasonably expensive ---
it about doubles the runtime of the equivclass test for me.  There's
already a perfectly good test table in partition_join.sql, which seems
like a more natural home for this anyhow.  After a bit of finagling
I was able to adapt the test query to fail on that table.

Patch v4 attached.  I've not looked at what we need to do to make this
work in v12.

            regards, tom lane

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index ccc07ba..082776f 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2209,7 +2209,7 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,

 /*
  * add_child_rel_equivalences
- *      Search for EC members that reference the parent_rel, and
+ *      Search for EC members that reference the root parent of child_rel, and
  *      add transformed members referencing the child_rel.
  *
  * Note that this function won't be called at all unless we have at least some
@@ -2217,6 +2217,12 @@ match_eclasses_to_foreign_key_col(PlannerInfo *root,
  *
  * parent_rel and child_rel could be derived from appinfo, but since the
  * caller has already computed them, we might as well just pass them in.
+ *
+ * The passed-in AppendRelInfo is not used when the parent_rel is not a
+ * top-level baserel, since it shows the mapping from the parent_rel but
+ * we need to translate EC expressions that refer to the top-level parent.
+ * Using it is faster than using adjust_appendrel_attrs_multilevel(), though,
+ * so we prefer it when we can.
  */
 void
 add_child_rel_equivalences(PlannerInfo *root,
@@ -2224,6 +2230,8 @@ add_child_rel_equivalences(PlannerInfo *root,
                            RelOptInfo *parent_rel,
                            RelOptInfo *child_rel)
 {
+    Relids        top_parent_relids = child_rel->top_parent_relids;
+    Relids        child_relids = child_rel->relids;
     int            i;

     /*
@@ -2248,7 +2256,7 @@ add_child_rel_equivalences(PlannerInfo *root,
             continue;

         /* Sanity check eclass_indexes only contain ECs for parent_rel */
-        Assert(bms_is_subset(child_rel->top_parent_relids, cur_ec->ec_relids));
+        Assert(bms_is_subset(top_parent_relids, cur_ec->ec_relids));

         /*
          * We don't use foreach() here because there's no point in scanning
@@ -2268,13 +2276,14 @@ add_child_rel_equivalences(PlannerInfo *root,
              * already-transformed child members.  Otherwise, if some original
              * member expression references more than one appendrel, we'd get
              * an O(N^2) explosion of useless derived expressions for
-             * combinations of children.
+             * combinations of children.  (But add_child_join_rel_equivalences
+             * may add targeted combinations for partitionwise-join purposes.)
              */
             if (cur_em->em_is_child)
                 continue;        /* ignore children here */

             /* Does this member reference child's topmost parent rel? */
-            if (bms_overlap(cur_em->em_relids, child_rel->top_parent_relids))
+            if (bms_overlap(cur_em->em_relids, top_parent_relids))
             {
                 /* Yes, generate transformed child version */
                 Expr       *child_expr;
@@ -2295,8 +2304,8 @@ add_child_rel_equivalences(PlannerInfo *root,
                     child_expr = (Expr *)
                         adjust_appendrel_attrs_multilevel(root,
                                                           (Node *) cur_em->em_expr,
-                                                          child_rel->relids,
-                                                          child_rel->top_parent_relids);
+                                                          child_relids,
+                                                          top_parent_relids);
                 }

                 /*
@@ -2306,21 +2315,20 @@ add_child_rel_equivalences(PlannerInfo *root,
                  * don't want the child member to be marked as constant.
                  */
                 new_relids = bms_difference(cur_em->em_relids,
-                                            child_rel->top_parent_relids);
-                new_relids = bms_add_members(new_relids, child_rel->relids);
+                                            top_parent_relids);
+                new_relids = bms_add_members(new_relids, child_relids);

                 /*
                  * And likewise for nullable_relids.  Note this code assumes
                  * parent and child relids are singletons.
                  */
                 new_nullable_relids = cur_em->em_nullable_relids;
-                if (bms_overlap(new_nullable_relids,
-                                child_rel->top_parent_relids))
+                if (bms_overlap(new_nullable_relids, top_parent_relids))
                 {
                     new_nullable_relids = bms_difference(new_nullable_relids,
-                                                         child_rel->top_parent_relids);
+                                                         top_parent_relids);
                     new_nullable_relids = bms_add_members(new_nullable_relids,
-                                                          child_rel->relids);
+                                                          child_relids);
                 }

                 (void) add_eq_member(cur_ec, child_expr,
@@ -2334,6 +2342,133 @@ add_child_rel_equivalences(PlannerInfo *root,
     }
 }

+/*
+ * add_child_join_rel_equivalences
+ *      Like add_child_rel_equivalences(), but for joinrels
+ *
+ * Here we find the ECs relevant to the top parent joinrel and add transformed
+ * member expressions that refer to this child joinrel.
+ *
+ * Note that this function won't be called at all unless we have at least some
+ * reason to believe that the EC members it generates will be useful.
+ */
+void
+add_child_join_rel_equivalences(PlannerInfo *root,
+                                int nappinfos, AppendRelInfo **appinfos,
+                                RelOptInfo *parent_joinrel,
+                                RelOptInfo *child_joinrel)
+{
+    Relids        top_parent_relids = child_joinrel->top_parent_relids;
+    Relids        child_relids = child_joinrel->relids;
+    Bitmapset  *matching_ecs;
+    int            i;
+
+    Assert(IS_JOIN_REL(child_joinrel) && IS_JOIN_REL(parent_joinrel));
+
+    /* We need consider only ECs that mention the parent joinrel */
+    matching_ecs = get_eclass_indexes_for_relids(root, top_parent_relids);
+
+    i = -1;
+    while ((i = bms_next_member(matching_ecs, i)) >= 0)
+    {
+        EquivalenceClass *cur_ec = (EquivalenceClass *) list_nth(root->eq_classes, i);
+        int            num_members;
+
+        /*
+         * If this EC contains a volatile expression, then generating child
+         * EMs would be downright dangerous, so skip it.  We rely on a
+         * volatile EC having only one EM.
+         */
+        if (cur_ec->ec_has_volatile)
+            continue;
+
+        /* Sanity check on get_eclass_indexes_for_relids result */
+        Assert(bms_overlap(top_parent_relids, cur_ec->ec_relids));
+
+        /*
+         * We don't use foreach() here because there's no point in scanning
+         * newly-added child members, so we can stop after the last
+         * pre-existing EC member.
+         */
+        num_members = list_length(cur_ec->ec_members);
+        for (int pos = 0; pos < num_members; pos++)
+        {
+            EquivalenceMember *cur_em = (EquivalenceMember *) list_nth(cur_ec->ec_members, pos);
+
+            if (cur_em->em_is_const)
+                continue;        /* ignore consts here */
+
+            /*
+             * We consider only original EC members here, not
+             * already-transformed child members.
+             */
+            if (cur_em->em_is_child)
+                continue;        /* ignore children here */
+
+            /*
+             * We may ignore expressions that reference a single baserel,
+             * because add_child_rel_equivalences should have handled them.
+             */
+            if (bms_membership(cur_em->em_relids) != BMS_MULTIPLE)
+                continue;
+
+            /* Does this member reference child's topmost parent rel? */
+            if (bms_overlap(cur_em->em_relids, top_parent_relids))
+            {
+                /* Yes, generate transformed child version */
+                Expr       *child_expr;
+                Relids        new_relids;
+                Relids        new_nullable_relids;
+
+                if (parent_joinrel->reloptkind == RELOPT_JOINREL)
+                {
+                    /* Simple single-level transformation */
+                    child_expr = (Expr *)
+                        adjust_appendrel_attrs(root,
+                                               (Node *) cur_em->em_expr,
+                                               nappinfos, appinfos);
+                }
+                else
+                {
+                    /* Must do multi-level transformation */
+                    Assert(parent_joinrel->reloptkind == RELOPT_OTHER_JOINREL);
+                    child_expr = (Expr *)
+                        adjust_appendrel_attrs_multilevel(root,
+                                                          (Node *) cur_em->em_expr,
+                                                          child_relids,
+                                                          top_parent_relids);
+                }
+
+                /*
+                 * Transform em_relids to match.  Note we do *not* do
+                 * pull_varnos(child_expr) here, as for example the
+                 * transformation might have substituted a constant, but we
+                 * don't want the child member to be marked as constant.
+                 */
+                new_relids = bms_difference(cur_em->em_relids,
+                                            top_parent_relids);
+                new_relids = bms_add_members(new_relids, child_relids);
+
+                /*
+                 * For nullable_relids, we must selectively replace parent
+                 * nullable relids with child ones.
+                 */
+                new_nullable_relids = cur_em->em_nullable_relids;
+                if (bms_overlap(new_nullable_relids, top_parent_relids))
+                    new_nullable_relids =
+                        adjust_child_relids_multilevel(root,
+                                                       new_nullable_relids,
+                                                       child_relids,
+                                                       top_parent_relids);
+
+                (void) add_eq_member(cur_ec, child_expr,
+                                     new_relids, new_nullable_relids,
+                                     true, cur_em->em_datatype);
+            }
+        }
+    }
+}
+

 /*
  * generate_implied_equalities_for_column
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 8541538..1e71e2e 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -843,6 +843,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
     /* Compute information relevant to foreign relations. */
     set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

+    /* Conpute information needed for mapping Vars to the child rel */
     appinfos = find_appinfos_by_relids(root, joinrel->relids, &nappinfos);

     /* Set up reltarget struct */
@@ -854,7 +855,6 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
                                                         (Node *) parent_joinrel->joininfo,
                                                         nappinfos,
                                                         appinfos);
-    pfree(appinfos);

     /*
      * Lateral relids referred in child join will be same as that referred in
@@ -886,6 +886,22 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
     /* Add the relation to the PlannerInfo. */
     add_join_rel(root, joinrel);

+    /*
+     * If we are using partitionwise join or aggregation, we might need
+     * EquivalenceClass members corresponding to the child join, so that we
+     * can represent sort pathkeys for it.  As with children of baserels, we
+     * shouldn't need this unless there are relevant eclass joins (implying
+     * that a merge join might be possible) or pathkeys to sort by.
+     */
+    if ((enable_partitionwise_join || enable_partitionwise_aggregate) &&
+        (joinrel->has_eclass_joins ||
+         has_useful_pathkeys(root, parent_joinrel)))
+        add_child_join_rel_equivalences(root,
+                                        nappinfos, appinfos,
+                                        parent_joinrel, joinrel);
+
+    pfree(appinfos);
+
     return joinrel;
 }

diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 7345137..c6c3463 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -153,6 +153,11 @@ extern void add_child_rel_equivalences(PlannerInfo *root,
                                        AppendRelInfo *appinfo,
                                        RelOptInfo *parent_rel,
                                        RelOptInfo *child_rel);
+extern void add_child_join_rel_equivalences(PlannerInfo *root,
+                                            int nappinfos,
+                                            AppendRelInfo **appinfos,
+                                            RelOptInfo *parent_rel,
+                                            RelOptInfo *child_rel);
 extern List *generate_implied_equalities_for_column(PlannerInfo *root,
                                                     RelOptInfo *rel,
                                                     ec_matches_callback_type callback,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index cad8dd5..975bf67 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -459,6 +459,83 @@ SELECT t1.a, ss.t2a, ss.t2c FROM prt1 t1 LEFT JOIN LATERAL
  550 |     |
 (12 rows)

+-- bug with inadequate sort key representation
+SET enable_partitionwise_aggregate TO true;
+SET enable_hashjoin TO false;
+EXPLAIN (COSTS OFF)
+SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
+  WHERE a BETWEEN 490 AND 510
+  GROUP BY 1, 2 ORDER BY 1, 2;
+                                                    QUERY PLAN
+-------------------------------------------------------------------------------------------------------------------
+ Group
+   Group Key: (COALESCE(prt1_p1.a, p2.a)), (COALESCE(prt1_p1.b, p2.b))
+   ->  Merge Append
+         Sort Key: (COALESCE(prt1_p1.a, p2.a)), (COALESCE(prt1_p1.b, p2.b))
+         ->  Group
+               Group Key: (COALESCE(prt1_p1.a, p2.a)), (COALESCE(prt1_p1.b, p2.b))
+               ->  Sort
+                     Sort Key: (COALESCE(prt1_p1.a, p2.a)), (COALESCE(prt1_p1.b, p2.b))
+                     ->  Merge Full Join
+                           Merge Cond: ((prt1_p1.a = p2.a) AND (prt1_p1.b = p2.b))
+                           Filter: ((COALESCE(prt1_p1.a, p2.a) >= 490) AND (COALESCE(prt1_p1.a, p2.a) <= 510))
+                           ->  Sort
+                                 Sort Key: prt1_p1.a, prt1_p1.b
+                                 ->  Seq Scan on prt1_p1
+                           ->  Sort
+                                 Sort Key: p2.a, p2.b
+                                 ->  Seq Scan on prt2_p1 p2
+         ->  Group
+               Group Key: (COALESCE(prt1_p2.a, p2_1.a)), (COALESCE(prt1_p2.b, p2_1.b))
+               ->  Sort
+                     Sort Key: (COALESCE(prt1_p2.a, p2_1.a)), (COALESCE(prt1_p2.b, p2_1.b))
+                     ->  Merge Full Join
+                           Merge Cond: ((prt1_p2.a = p2_1.a) AND (prt1_p2.b = p2_1.b))
+                           Filter: ((COALESCE(prt1_p2.a, p2_1.a) >= 490) AND (COALESCE(prt1_p2.a, p2_1.a) <= 510))
+                           ->  Sort
+                                 Sort Key: prt1_p2.a, prt1_p2.b
+                                 ->  Seq Scan on prt1_p2
+                           ->  Sort
+                                 Sort Key: p2_1.a, p2_1.b
+                                 ->  Seq Scan on prt2_p2 p2_1
+         ->  Group
+               Group Key: (COALESCE(prt1_p3.a, p2_2.a)), (COALESCE(prt1_p3.b, p2_2.b))
+               ->  Sort
+                     Sort Key: (COALESCE(prt1_p3.a, p2_2.a)), (COALESCE(prt1_p3.b, p2_2.b))
+                     ->  Merge Full Join
+                           Merge Cond: ((prt1_p3.a = p2_2.a) AND (prt1_p3.b = p2_2.b))
+                           Filter: ((COALESCE(prt1_p3.a, p2_2.a) >= 490) AND (COALESCE(prt1_p3.a, p2_2.a) <= 510))
+                           ->  Sort
+                                 Sort Key: prt1_p3.a, prt1_p3.b
+                                 ->  Seq Scan on prt1_p3
+                           ->  Sort
+                                 Sort Key: p2_2.a, p2_2.b
+                                 ->  Seq Scan on prt2_p3 p2_2
+(43 rows)
+
+SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
+  WHERE a BETWEEN 490 AND 510
+  GROUP BY 1, 2 ORDER BY 1, 2;
+  a  | b
+-----+----
+ 490 | 15
+ 492 | 17
+ 494 | 19
+ 495 | 20
+ 496 | 21
+ 498 | 23
+ 500 |  0
+ 501 |  1
+ 502 |  2
+ 504 |  4
+ 506 |  6
+ 507 |  7
+ 508 |  8
+ 510 | 10
+(14 rows)
+
+RESET enable_partitionwise_aggregate;
+RESET enable_hashjoin;
 --
 -- partitioned by expression
 --
diff --git a/src/test/regress/sql/partition_join.sql b/src/test/regress/sql/partition_join.sql
index fb3ba18..92994b4 100644
--- a/src/test/regress/sql/partition_join.sql
+++ b/src/test/regress/sql/partition_join.sql
@@ -91,6 +91,21 @@ SELECT t1.a, ss.t2a, ss.t2c FROM prt1 t1 LEFT JOIN LATERAL
               (SELECT t2.a AS t2a, t3.a AS t3a, t2.b t2b, t2.c t2c, least(t1.a,t2.a,t3.a) FROM prt1 t2 JOIN prt2 t3 ON
(t2.a= t3.b)) ss 
               ON t1.c = ss.t2c WHERE (t1.b + coalesce(ss.t2b, 0)) = 0 ORDER BY t1.a;

+-- bug with inadequate sort key representation
+SET enable_partitionwise_aggregate TO true;
+SET enable_hashjoin TO false;
+
+EXPLAIN (COSTS OFF)
+SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
+  WHERE a BETWEEN 490 AND 510
+  GROUP BY 1, 2 ORDER BY 1, 2;
+SELECT a, b FROM prt1 FULL JOIN prt2 p2(b,a,c) USING(a,b)
+  WHERE a BETWEEN 490 AND 510
+  GROUP BY 1, 2 ORDER BY 1, 2;
+
+RESET enable_partitionwise_aggregate;
+RESET enable_hashjoin;
+
 --
 -- partitioned by expression
 --

Re: v12.0: ERROR: could not find pathkey item to sort

From
Amit Langote
Date:
On Sun, Nov 3, 2019 at 4:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> >> This would
> >> probably require refactoring things so that there are separate
> >> entry points to add child equivalences for base rels and join rels.
> >> But that seems cleaner anyway than what you've got here.
>
> > Separate entry points sounds better, but only in HEAD?
>
> I had actually had in mind that we might have two wrappers around a
> common search-and-replace routine, but after studying the code I see that
> there's just enough differences to make it probably not worth the trouble
> to do it like that.  I did spend a bit of time removing cosmetic
> differences between the two versions, though, mostly by creating
> common local variables.

Agree that having two totally separate routines is better.

> I think the way you did the matching_ecs computation is actually wrong:
> we need to find ECs that reference any rel of the join, not only those
> that reference both inputs.  If nothing else, the way you have it here
> makes the results dependent on which pair of input rels gets considered
> first, and that's certainly bad for multiway joins.

I'm not sure I fully understand the problems, but maybe you're right.

> Also, I thought we should try to put more conditions on whether we
> invoke add_child_join_rel_equivalences at all.  In the attached I did
>
>     if ((enable_partitionwise_join || enable_partitionwise_aggregate) &&
>         (joinrel->has_eclass_joins ||
>          has_useful_pathkeys(root, parent_joinrel)))
>
> but I wonder if we could do more, like checking to see if the parent
> joinrel is partitioned.  Alternatively, maybe it's unnecessary because
> we won't try to build child joinrels unless these conditions are true?

Actually, I think we can assert in add_child_rel_equivalences() that
enable_partitionwise_join is true.  Then checking
enable_partitionwise_aggregate is unnecessary.

> I did not like the test case either.  Creating a whole new (and rather
> large) test table just for this one case is unreasonably expensive ---
> it about doubles the runtime of the equivclass test for me.  There's
> already a perfectly good test table in partition_join.sql, which seems
> like a more natural home for this anyhow.  After a bit of finagling
> I was able to adapt the test query to fail on that table.

That's great.  I tried but I can only finagle so much when it comes to
twisting around plan shapes to what I need. :)

> Patch v4 attached.  I've not looked at what we need to do to make this
> work in v12.

Thanks a lot for the revised patch.

Maybe the only difference between HEAD and v12 is that the former has
eclass_indexes infrastructure, whereas the latter doesn't?  I have
attached a version of your patch adapted for v12.

Also, looking at this in the patched code:

+ /*
+ * We may ignore expressions that reference a single baserel,
+ * because add_child_rel_equivalences should have handled them.
+ */
+ if (bms_membership(cur_em->em_relids) != BMS_MULTIPLE)
+ continue;

I have been thinking maybe add_child_rel_equivalences() doesn't need
to translate EC members that reference multiple appendrels, because
there top_parent_relids is always a singleton set, whereas em_relids
of such expressions is not?  Those half-translated expressions are
useless, only adding to the overhead of scanning ec_members.  I'm
thinking that we should apply this diff:

diff --git a/src/backend/optimizer/path/equivclass.c
b/src/backend/optimizer/path/equivclass.c
index e8e9e9a314..d4d80c8101 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -2169,7 +2169,7 @@ add_child_rel_equivalences(PlannerInfo *root,
                 continue;       /* ignore children here */

             /* Does this member reference child's topmost parent rel? */
-            if (bms_overlap(cur_em->em_relids, top_parent_relids))
+            if (bms_is_subset(cur_em->em_relids, top_parent_relids))
             {
                 /* Yes, generate transformed child version */
                 Expr       *child_expr;

Thoughts?

Thanks,
Amit

Attachment

Re: v12.0: ERROR: could not find pathkey item to sort

From
Tom Lane
Date:
Amit Langote <amitlangote09@gmail.com> writes:
> On Sun, Nov 3, 2019 at 4:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, I thought we should try to put more conditions on whether we
>> invoke add_child_join_rel_equivalences at all.  In the attached I did
>>     if ((enable_partitionwise_join || enable_partitionwise_aggregate) &&
>>         (joinrel->has_eclass_joins ||
>>          has_useful_pathkeys(root, parent_joinrel)))
>> but I wonder if we could do more, like checking to see if the parent
>> joinrel is partitioned.  Alternatively, maybe it's unnecessary because
>> we won't try to build child joinrels unless these conditions are true?

> Actually, I think we can assert in add_child_rel_equivalences() that
> enable_partitionwise_join is true.  Then checking
> enable_partitionwise_aggregate is unnecessary.

After tracing the call sites back a bit further, I agree that we won't
be here in the first place unless enable_partitionwise_join is true,
so the extra tests I proposed are unnecessary.  I took them out again.

> I have been thinking maybe add_child_rel_equivalences() doesn't need
> to translate EC members that reference multiple appendrels, because
> there top_parent_relids is always a singleton set, whereas em_relids
> of such expressions is not?  Those half-translated expressions are
> useless, only adding to the overhead of scanning ec_members.  I'm
> thinking that we should apply this diff:
> -            if (bms_overlap(cur_em->em_relids, top_parent_relids))
> +            if (bms_is_subset(cur_em->em_relids, top_parent_relids))

Meh, I'm not really convinced.  The case where this would be relevant
is an EC generated from something like "WHERE (a.x + b.y) = c.z"
where "a" is partitioned.  It's possible that we'd never have a use
for a sort key corresponding to "a_child.x + b.y", but I think that's
not obvious, and probably short-sighted.  Anyway such EC members are
pretty rare in the first place, so we're not going to win much
performance by trying to optimize them.

Anyway, I've pushed the fix for Justin's problem to v12 and HEAD.
The problem with poor planning of multiway joins that you mentioned
in the other thread remains open, but I imagine the patches you
posted there are going to need rebasing over this commit, so I
set that CF entry to Waiting On Author.

            regards, tom lane



Re: v12.0: ERROR: could not find pathkey item to sort

From
Amit Langote
Date:
On Wed, Nov 6, 2019 at 1:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangote09@gmail.com> writes:
> > I have been thinking maybe add_child_rel_equivalences() doesn't need
> > to translate EC members that reference multiple appendrels, because
> > there top_parent_relids is always a singleton set, whereas em_relids
> > of such expressions is not?  Those half-translated expressions are
> > useless, only adding to the overhead of scanning ec_members.  I'm
> > thinking that we should apply this diff:
> > -            if (bms_overlap(cur_em->em_relids, top_parent_relids))
> > +            if (bms_is_subset(cur_em->em_relids, top_parent_relids))
>
> Meh, I'm not really convinced.  The case where this would be relevant
> is an EC generated from something like "WHERE (a.x + b.y) = c.z"
> where "a" is partitioned.  It's possible that we'd never have a use
> for a sort key corresponding to "a_child.x + b.y", but I think that's
> not obvious, and probably short-sighted.  Anyway such EC members are
> pretty rare in the first place, so we're not going to win much
> performance by trying to optimize them.

OK.

> Anyway, I've pushed the fix for Justin's problem to v12 and HEAD.
> The problem with poor planning of multiway joins that you mentioned
> in the other thread remains open, but I imagine the patches you
> posted there are going to need rebasing over this commit, so I
> set that CF entry to Waiting On Author.

Thank you.  I will send rebased patches on that thread.

Regards,
Amit