Thread: Relids in upper relations
Hi,

While reviewing aggregate pushdown patch [1] we noticed that RelOptInfos for upper relations do not have relids set.

create_foreignscan_plan() copies the relids from the RelOptInfo into ForeignScan::fs_relids. That field is used to identify the RTIs covered by the ForeignScan. For example, postgresBeginForeignScan() uses it to look up the user mapping:

    /*
     * Identify which user to do the remote access as.  This should match what
     * ExecCheckRTEPerms() does.  In case of a join, use the lowest-numbered
     * member RTE as a representative; we would get the same result from any.
     */
    if (fsplan->scan.scanrelid > 0)
        rtindex = fsplan->scan.scanrelid;
    else
        rtindex = bms_next_member(fsplan->fs_relids, -1);
    rte = rt_fetch(rtindex, estate->es_range_table);
    userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();

Core functions also use this field to get the RTIs covered by a ForeignScan, e.g. ExplainPreScanNode(). Since this field is not set for an upper relation, when we push down an upper operation like grouping/aggregation, fs_relids remains NULL, causing undesirable results. In the case of postgres_fdw aggregate pushdown, it crashed in rt_fetch().

We could prevent the crash by passing input_rel->relids to fetch_upper_rel() in create_grouping_path(), as seen in the attached patch. preprocess_minmax_aggregates() needed to be fixed so that both these functions add paths to the same relation. I am sure that if we go this route, we will have to fix each call of fetch_upper_rel() in this manner, but I am not sure whether that is intended. The comment in fetch_upper_rel() says

    * An "upper" relation is identified by an UpperRelationKind and a Relids set.
    * The meaning of the Relids set is not specified here, and very likely will
    * vary for different relation kinds.

which seems to indicate that the relids of an upper relation will be different from those of the lower relations, but it doesn't say how.
We need to fix either the usages of fs_relids or the calls to fetch_upper_rel() for pushdown of upper operations; I am not sure which. Please suggest how to move forward with this.

[1] https://www.postgresql.org/message-id/flat/CANEvxPokcFi7OfEpi3%3DJ%2Bmvxu%2BPPAG2zXASLEkv5DzDPhSTk6A%40mail.gmail.com#CANEvxPokcFi7OfEpi3=J+mvxu+PPAG2zXASLEkv5DzDPhSTk6A@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
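To make the failure mode concrete, the sketch below is a toy Python analogue (NOT PostgreSQL code; the helper name and the relid sets are invented for illustration) of the representative-RTE logic quoted from postgresBeginForeignScan() above. fs_relids is modeled as a plain set of range-table indexes, so an empty set — which is what an unset upper-relation relids yields — leaves no RTE to fetch, mirroring the reported rt_fetch() crash.

```python
# Toy model (not PostgreSQL code) of the fs_relids lookup in
# postgresBeginForeignScan().  The helper name is illustrative only.

def pick_rtindex(scanrelid, fs_relids):
    """Choose the RTE used to decide the remote user mapping."""
    if scanrelid > 0:
        # Simple scan: the scanned relation itself.
        return scanrelid
    # Join (or pushed-down upper) case: take the lowest-numbered
    # member, as bms_next_member(fs_relids, -1) does.
    if not fs_relids:
        raise ValueError("fs_relids is empty: no RTE to fetch")
    return min(fs_relids)

# A pushed-down join carries the RTIs of its component relations:
print(pick_rtindex(0, {1, 2}))  # -> 1

# An upper relation whose relids were never set behaves like the
# reported crash: there is no RTI for rt_fetch() to look up.
try:
    pick_rtindex(0, set())
except ValueError as e:
    print(e)
```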
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> While reviewing aggregate pushdown patch [1] we noticed that
> RelOptInfos for upper relations do not have relids set.

Indeed, because they don't correspond to any particular scan relation or set of scan relations. I have in mind that in future releases, any particular upperrel might have its own definition of what the relids mean --- for example, in UPPERREL_SETOP it would likely be useful for the relids to represent the set of leaf SELECTs that have been merged in a particular path. You could imagine UPPERREL_WINDOW using the relids to track which window functions have been implemented in a path, whenever we get around to considering multiple window function orderings. None of that's there yet.

> create_foreignscan_plan() copies the relids from RelOptInfo into
> ForeignScan::fs_relids. That field is used to identify the RTIs
> covered by the ForeignScan.

That's fine for scan/join paths. If you want to pay attention to relids for an upper rel, it's up to you to know what they mean. I would not counsel assuming that they have anything at all to do with baserel RTIs.

> We could prevent the crash by passing input_rel->relids to
> fetch_upper_rel() in create_grouping_path() as seen in the attached
> patch.

I think this is fundamentally wrongheaded. If we go that route, the only valid relids for any upper path would be the union of all baserel RTIs, making it rather pointless to carry the value around at all, and definitely making it impossible to use the field to distinguish different partial implementations of the same upperrel.

You should look to root->all_baserels, instead, if that's the value you want when considering an upperrel Path.

            regards, tom lane
On Wed, Oct 5, 2016 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> While reviewing aggregate pushdown patch [1] we noticed that
>> RelOptInfos for upper relations do not have relids set.
>
> Indeed, because they don't correspond to any particular scan relation or
> set of scan relations. I have in mind that in future releases, any
> particular upperrel might have its own definition of what the relids
> mean --- for example, in UPPERREL_SETOP it would likely be useful for
> the relids to represent the set of leaf SELECTs that have been merged
> in a particular path. You could imagine UPPERREL_WINDOW using the
> relids to track which window functions have been implemented in a path,
> whenever we get around to considering multiple window function orderings.
> None of that's there yet.

Why do we want to save something in a Relids object which is not relids? Wouldn't this overloading create confusion? I don't know much about window function ordering, but purely from the above description it looks like we would be saving something in the RelOptInfo which is specific to a path. That sounds odd. But most probably I misunderstood you.

>> create_foreignscan_plan() copies the relids from RelOptInfo into
>> ForeignScan::fs_relids. That field is used to identify the RTIs
>> covered by the ForeignScan.
>
> That's fine for scan/join paths. If you want to pay attention to
> relids for an upper rel, it's up to you to know what they mean.
> I would not counsel assuming that they have anything at all to do
> with baserel RTIs.
>
>> We could prevent the crash by passing input_rel->relids to
>> fetch_upper_rel() in create_grouping_path() as seen in the attached
>> patch.
>
> I think this is fundamentally wrongheaded. If we go that route,
> the only valid relids for any upper path would be the union of all
> baserel RTIs, making it rather pointless to carry the value around
> at all,

That won't be true if we start pushing upper operations under Append or even Join. An aggregate over an append may be calculated either as an aggregate of the append or as an append of aggregates. In the latter case, we will need to differentiate between the upper relations from each segment being appended and the upper relation representing the overall result. A natural way is to set the relids of such an upper relation to the relids of the underlying scan relations. In that case, setting relids to the underlying scan relations isn't pointless at all. And this may then apply to all kinds of upper relations, not just aggregate/grouping.

> and definitely making it impossible to use the field to
> distinguish different partial implementations of the same upperrel.

Probably we should add another set of fields to be used exclusively for upper relations, like the members that are used exclusively for base relations. Storing something in relids which is not relids looks confusing, unless we change the name (and type) of that field, or unless what we store has a meaning similar to relids.

> You should look to root->all_baserels, instead, if that's the value
> you want when considering an upperrel Path.

Thanks. It looks like, for now, aggregate pushdown should use all_baserels for an upper relation. Although that will need to change if we push down an aggregate to the foreign server as part of converting an aggregate of an append into an append of aggregates.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could prevent the crash by passing input_rel->relids to
>> fetch_upper_rel() in create_grouping_path() as seen in the attached
>> patch.
>
> I think this is fundamentally wrongheaded. If we go that route,
> the only valid relids for any upper path would be the union of all
> baserel RTIs, ...

Hmm, but this is only true if the upper steps are always done last. Hackers on this list have been hoping to reorder joins with aggregates since at least 2008 - probably sooner, but that's when I started reading this mailing list. A simple example is:

SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
FROM order_line, order
WHERE order_line.order_id = order.order_id
GROUP BY 1, 2;

Doing the aggregation step first is likely to be much faster than doing the join first here, unless of course order_line.order_id = order.order_id turns out to be highly selective. Or, if there were an additional WHERE condition on the order table, then it gets much less obvious which way is better. In any event, I assume we want to eventually be able to cost it out both ways and pick the cheaper one.

This doesn't necessarily mean that the set of relids couldn't be used as you suggest, but if we supported this kind of thing then it wouldn't be true that "the only valid relids for any upper path would be the union of all baserel RTIs". Have you thought about this case? How would you propose that it be handled in the framework you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think this is fundamentally wrongheaded. If we go that route,
>> the only valid relids for any upper path would be the union of all
>> baserel RTIs, ...

> Hmm, but this is only true if the upper steps are always done last.
> Hackers on this list have been hoping to reorder joins with aggregates
> since at least 2008 - probably sooner, but that's when I started
> reading this mailing list. A simple example is:

> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
> FROM order_line, order WHERE order_line.order_id = order.order_id
> GROUP BY 1,2;

> Doing the aggregation step first is likely to be much faster than
> doing the join first here,

Please provide some reason to believe that. It's the nature of an aggregate that it's sensitive to the number of rows going through it, with only a tiny number of exceptions (and SUM ain't one). So you could only push it down past joins that won't change the number of rows the aggregate will process, and how is that going to make it any faster?

I'm also dubious that doing the aggregate first could even be correct in your example, because I sure don't see how you'd group by order.customer_id before joining. But that's an artifact of this example, not a general point.

            regards, tom lane
On 10 October 2016 at 11:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that. It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one). So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example because the GROUP BY has columns from either side of the join; however, it's very simple to come up with a real-world case in which aggregate pushdown can help. We can simulate the aggregate pushdown with a simple subquery.

create table sale (sale_id int primary key, product_id int not null, quantity int not null);
create table product (product_id int primary key, productdesc varchar(32) not null);

insert into sale select x, x%100+1, (random()*10)::int+1 from generate_series(1,10000000) x(x);
insert into product select x, 'ABC'||x::text from generate_series(1,100) x(x);

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,sum(s.quantity) qty from sale s inner join product p on s.product_id = p.product_id group by p.product_id;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=341558.25..341559.25 rows=100 width=17) (actual rows=100 loops=1)
   Group Key: p.product_id
   ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13) (actual rows=10000000 loops=1)
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.308 ms
 Execution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,s.qty from (select product_id,sum(quantity) qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Hash Join  (cost=204058.25..204061.63 rows=100 width=17) (actual rows=100 loops=1)
   Hash Cond: (sale.product_id = p.product_id)
   ->  HashAggregate  (cost=204055.00..204056.00 rows=100 width=12) (actual rows=100 loops=1)
         Group Key: sale.product_id
         ->  Seq Scan on sale  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.610 ms
 Execution time: 4267.145 ms
(10 rows)

So the pushed-down version runs in 56% of the time of the non-pushed-down version. Of course, as mentioned by Robert, both versions would have to be costed in case the join condition was highly selective.

There's a good paper which goes into a bit more detail on this: http://www.vldb.org/conf/1995/P345.PDF

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Oct 9, 2016 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think this is fundamentally wrongheaded. If we go that route,
>>> the only valid relids for any upper path would be the union of all
>>> baserel RTIs, ...
>
>> Hmm, but this is only true if the upper steps are always done last.
>> Hackers on this list have been hoping to reorder joins with aggregates
>> since at least 2008 - probably sooner, but that's when I started
>> reading this mailing list. A simple example is:
>
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that. It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one). So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

Hmm, I guess it only works if you know that order (order_id) is unique. If that's true, then each order_line.order_id will match 0 or 1 values in order.order_id. If you had duplicates in order.order_id, then SUM(order_line.amount) would effectively be multiplied by the number of duplicates, so pushing the join down doesn't work. But if order (order_id) is unique, then I think it's OK. For any given value of order_line.order_id, either every row in the group matches the one and only row in the order table where that value appears in order.order_id, or there is no such row and the whole group vanishes.
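The uniqueness condition described above is easy to check with a toy example. The following is plain Python (not planner code), with made-up data shaped like the order/order_line example: when order_id is unique in orders, grouping before the join and grouping after it give identical sums, and an unmatched group simply vanishes either way.

```python
# Toy check (not planner code) that aggregate-before-join equals
# join-before-aggregate when the join key is unique on the orders side.
# All data values are invented for illustration.

from collections import defaultdict

orders = {101: "cust_A", 102: "cust_B"}                 # order_id -> customer_id; key is unique
order_line = [(101, 5), (101, 7), (102, 3), (103, 9)]   # (order_id, amount); 103 has no order

# Join first, then GROUP BY (order_id, customer_id):
join_first = defaultdict(int)
for oid, amount in order_line:
    if oid in orders:                       # inner join
        join_first[(oid, orders[oid])] += amount

# Aggregate first by order_id, then join:
sums = defaultdict(int)
for oid, amount in order_line:
    sums[oid] += amount
agg_first = {(oid, orders[oid]): total
             for oid, total in sums.items() if oid in orders}

# Same answer either way; the group for order 103 vanishes in both.
assert dict(join_first) == agg_first
print(agg_first)  # {(101, 'cust_A'): 12, (102, 'cust_B'): 3}
```

If orders held duplicate order_id values instead, the join-first version would count each order_line row once per duplicate, which is exactly why the transformation is only valid given the uniqueness assumption.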
As for why it should be faster: my intuition is that the aggregate might be slightly faster because the rows passing through it can be narrower, but the real win is that the join should be quite a lot faster, because the number of rows that need to be joined is probably being reduced by the aggregation step, possibly quite drastically. Here's basically that example with 10GB of TPC-H data:

rhaas=# explain analyze select l_orderkey, o_custkey, sum(l_extendedprice) from lineitem, orders where l_orderkey = o_orderkey group by 1, 2;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=15665868.56..17015554.73 rows=59986052 width=48) (actual time=114794.016..156967.012 rows=15000000 loops=1)
   Group Key: lineitem.l_orderkey, orders.o_custkey
   ->  Sort  (cost=15665868.56..15815833.69 rows=59986052 width=24) (actual time=114793.999..126303.591 rows=59986052 loops=1)
         Sort Key: lineitem.l_orderkey, orders.o_custkey
         Sort Method: external sort  Disk: 2031640kB
         ->  Merge Join  (cost=244.37..4225682.88 rows=59986052 width=24) (actual time=0.020..56368.444 rows=59986052 loops=1)
               Merge Cond: (orders.o_orderkey = lineitem.l_orderkey)
               ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.009..5722.540 rows=15000000 loops=1)
               ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.008..25766.270 rows=59986052 loops=1)
 Planning time: 0.251 ms
 Execution time: 158320.203 ms
(11 rows)

rhaas=# explain analyze select l_orderkey, o_custkey, l_extendedprice from (select l_orderkey, sum(l_extendedprice) l_extendedprice from lineitem group by 1) x, orders where l_orderkey = o_orderkey;
                                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1.00..3789127.49 rows=378670 width=48) (actual time=0.104..76107.364 rows=15000000 loops=1)
   Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
   ->  GroupAggregate  (cost=0.56..3077357.98 rows=378670 width=40) (actual time=0.071..59870.322 rows=15000000 loops=1)
         Group Key: lineitem.l_orderkey
         ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.057..24382.053 rows=59986052 loops=1)
   ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.030..6115.856 rows=15000000 loops=1)
 Planning time: 0.117 ms
 Execution time: 77266.600 ms
(8 rows)

Actually, the major culprit here is that the plan which does the join first introduces a totally useless sort, but it looks like the second plan would be modestly faster even if the planner were smart enough not to do that. If I set work_mem to 1GB and max_parallel_workers_per_gather to 8, the time for the aggregate-first plan improves to 66s, but the join-first plan doesn't change. I'm not sure this is actually a great example; 15 million output rows is a lot. But I've seen this kind of thing plenty of times in the real world; see also David Rowley's example.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company