Thread: Relids in upper relations
Hi,

While reviewing aggregate pushdown patch [1] we noticed that RelOptInfos for upper relations do not have relids set.

create_foreignscan_plan() copies the relids from the RelOptInfo into ForeignScan::fs_relids. That field is used to identify the RTIs covered by the ForeignScan. For example, postgresBeginForeignScan() uses it to look up the user mapping:

    /*
     * Identify which user to do the remote access as.  This should match what
     * ExecCheckRTEPerms() does.  In case of a join, use the lowest-numbered
     * member RTE as a representative; we would get the same result from any.
     */
    if (fsplan->scan.scanrelid > 0)
        rtindex = fsplan->scan.scanrelid;
    else
        rtindex = bms_next_member(fsplan->fs_relids, -1);
    rte = rt_fetch(rtindex, estate->es_range_table);
    userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();

Core functions also use this field to get the RTIs covered by a ForeignScan, e.g. ExplainPreScanNode(). Since this field is not set for an upper relation, when we push down an upper operation like grouping/aggregation, fs_relids remains NULL, causing undesirable results. In the case of postgres_fdw aggregate pushdown, it crashed in rt_fetch().

We could prevent the crash by passing input_rel->relids to fetch_upper_rel() in create_grouping_path(), as seen in the attached patch. preprocess_minmax_aggregates() needed to be fixed so that both these functions add paths to the same relation. I am sure that if we go this route, we will have to fix each call of fetch_upper_rel() in this manner, but I am not sure whether that is intended. The comment in fetch_upper_rel() says

    * An "upper" relation is identified by an UpperRelationKind and a Relids set.
    * The meaning of the Relids set is not specified here, and very likely will
    * vary for different relation kinds.

which seems to indicate that the relids of an upper relation will be different from those of the lower relations, but it doesn't say how.
We need to fix either the usages of fs_relids or the calls to fetch_upper_rel() for pushdown of upper operations; I am not sure which. Please suggest how to move forward with this.

[1] https://www.postgresql.org/message-id/flat/CANEvxPokcFi7OfEpi3%3DJ%2Bmvxu%2BPPAG2zXASLEkv5DzDPhSTk6A%40mail.gmail.com#CANEvxPokcFi7OfEpi3=J+mvxu+PPAG2zXASLEkv5DzDPhSTk6A@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
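To make the failure mode concrete, the sketch below is a toy Python analogue (NOT PostgreSQL code; the helper name and the relid sets are invented for illustration) of the representative-RTE logic quoted from postgresBeginForeignScan() above. fs_relids is modeled as a plain set of range-table indexes, so an empty set — which is what an unset upper-relation relids yields — leaves no RTE to fetch, mirroring the reported rt_fetch() crash.

```python
# Toy model (not PostgreSQL code) of the fs_relids lookup in
# postgresBeginForeignScan().  The helper name is illustrative only.

def pick_rtindex(scanrelid, fs_relids):
    """Choose the RTE used to decide the remote user mapping."""
    if scanrelid > 0:
        # Simple scan: the scanned relation itself.
        return scanrelid
    # Join (or pushed-down upper) case: take the lowest-numbered
    # member, as bms_next_member(fs_relids, -1) does.
    if not fs_relids:
        raise ValueError("fs_relids is empty: no RTE to fetch")
    return min(fs_relids)

# A pushed-down join carries the RTIs of its component relations:
print(pick_rtindex(0, {1, 2}))  # -> 1

# An upper relation whose relids were never set behaves like the
# reported crash: there is no RTI for rt_fetch() to look up.
try:
    pick_rtindex(0, set())
except ValueError as e:
    print(e)
```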
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> While reviewing aggregate pushdown patch [1] we noticed that
> RelOptInfos for upper relations do not have relids set.

Indeed, because they don't correspond to any particular scan relation or set of scan relations. I have in mind that in future releases, any particular upperrel might have its own definition of what the relids mean --- for example, in UPPERREL_SETOP it would likely be useful for the relids to represent the set of leaf SELECTs that have been merged in a particular path. You could imagine UPPERREL_WINDOW using the relids to track which window functions have been implemented in a path, whenever we get around to considering multiple window function orderings. None of that's there yet.

> create_foreignscan_plan() copies the relids from RelOptInfo into
> ForeignScan::fs_relids. That field is used to identify the RTIs
> covered by the ForeignScan.

That's fine for scan/join paths. If you want to pay attention to relids for an upper rel, it's up to you to know what they mean. I would not counsel assuming that they have anything at all to do with baserel RTIs.

> We could prevent the crash by passing input_rel->relids to
> fetch_upper_rel() in create_grouping_path() as seen in the attached
> patch.

I think this is fundamentally wrongheaded. If we go that route, the only valid relids for any upper path would be the union of all baserel RTIs, making it rather pointless to carry the value around at all, and definitely making it impossible to use the field to distinguish different partial implementations of the same upperrel.

You should look to root->all_baserels, instead, if that's the value you want when considering an upperrel Path.

            regards, tom lane
On Wed, Oct 5, 2016 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> While reviewing aggregate pushdown patch [1] we noticed that
>> RelOptInfos for upper relations do not have relids set.
>
> Indeed, because they don't correspond to any particular scan relation or
> set of scan relations. I have in mind that in future releases, any
> particular upperrel might have its own definition of what the relids
> mean --- for example, in UPPERREL_SETOP it would likely be useful for
> the relids to represent the set of leaf SELECTs that have been merged
> in a particular path. You could imagine UPPERREL_WINDOW using the
> relids to track which window functions have been implemented in a path,
> whenever we get around to considering multiple window function orderings.
> None of that's there yet.

Why do we want to save something in a Relids object which is not relids? Wouldn't this overloading create confusion? I don't know much about window function ordering, but purely from the above description it looks like we would be saving something in the RelOptInfo which is specific to a path. That sounds odd. But most probably I misunderstood you.

>> create_foreignscan_plan() copies the relids from RelOptInfo into
>> ForeignScan::fs_relids. That field is used to identify the RTIs
>> covered by the ForeignScan.
>
> That's fine for scan/join paths. If you want to pay attention to
> relids for an upper rel, it's up to you to know what they mean.
> I would not counsel assuming that they have anything at all to do
> with baserel RTIs.
>
>> We could prevent the crash by passing input_rel->relids to
>> fetch_upper_rel() in create_grouping_path() as seen in the attached
>> patch.
>
> I think this is fundamentally wrongheaded. If we go that route,
> the only valid relids for any upper path would be the union of all
> baserel RTIs, making it rather pointless to carry the value around
> at all,

That won't be true if we start pushing upper operations under Append or even Join. An aggregate over an append may be calculated either as an aggregate of the append or as an append of aggregates. In the latter case, we will need to differentiate between the upper relations from each segment being appended and the upper relation representing the overall result. A natural way is to set the relids of such an upper relation to the relids of the underlying scan relations. In that case, setting relids to the underlying scan relations isn't pointless at all. And this may then apply to all kinds of upper relations, not just aggregate/grouping.

> and definitely making it impossible to use the field to
> distinguish different partial implementations of the same upperrel.

Probably we should add another set of fields to be used exclusively for upper relations, like the members that are used exclusively for base relations. Storing something in relids which is not relids looks confusing, unless we change the name (and type) of that field, or unless what we store has a meaning similar to relids.

> You should look to root->all_baserels, instead, if that's the value
> you want when considering an upperrel Path.

Thanks. It looks like, for now, aggregate pushdown should use all_baserels for an upper relation. Although that will need to change if we push down an aggregate to the foreign server as part of converting an aggregate of an append into an append of aggregates.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could prevent the crash by passing input_rel->relids to
>> fetch_upper_rel() in create_grouping_path() as seen in the attached
>> patch.
>
> I think this is fundamentally wrongheaded. If we go that route,
> the only valid relids for any upper path would be the union of all
> baserel RTIs, ...

Hmm, but this is only true if the upper steps are always done last. Hackers on this list have been hoping to reorder joins with aggregates since at least 2008 - probably sooner, but that's when I started reading this mailing list. A simple example is:

SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
FROM order_line, order
WHERE order_line.order_id = order.order_id
GROUP BY 1, 2;

Doing the aggregation step first is likely to be much faster than doing the join first here, unless of course order_line.order_id = order.order_id turns out to be highly selective. Or, if there were an additional WHERE condition on the order table, then it gets much less obvious which way is better. In any event, I assume we want to eventually be able to cost it out both ways and pick the cheaper one.

This doesn't necessarily mean that the set of relids couldn't be used as you suggest, but if we supported this kind of thing then it wouldn't be true that "the only valid relids for any upper path would be the union of all baserel RTIs". Have you thought about this case? How would you propose that it be handled in the framework you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think this is fundamentally wrongheaded. If we go that route,
>> the only valid relids for any upper path would be the union of all
>> baserel RTIs, ...

> Hmm, but this is only true if the upper steps are always done last.
> Hackers on this list have been hoping to reorder joins with aggregates
> since at least 2008 - probably sooner, but that's when I started
> reading this mailing list. A simple example is:

> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
> FROM order_line, order WHERE order_line.order_id = order.order_id
> GROUP BY 1,2;

> Doing the aggregation step first is likely to be much faster than
> doing the join first here,

Please provide some reason to believe that. It's the nature of an aggregate that it's sensitive to the number of rows going through it, with only a tiny number of exceptions (and SUM ain't one). So you could only push it down past joins that won't change the number of rows the aggregate will process, and how is that going to make it any faster?

I'm also dubious that doing the aggregate first could even be correct in your example, because I sure don't see how you'd group by order.customer_id before joining. But that's an artifact of this example, not a general point.

            regards, tom lane
On 10 October 2016 at 11:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that. It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one). So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

I think there's a flaw in Robert's example because the GROUP BY has columns from either side of the join; however, it's very simple to come up with a real-world case in which aggregate pushdown can help. We can simulate the aggregate pushdown with a simple subquery.

create table sale (sale_id int primary key, product_id int not null, quantity int not null);
create table product (product_id int primary key, productdesc varchar(32) not null);

insert into sale select x, x%100+1, (random()*10)::int+1 from generate_series(1,10000000) x(x);
insert into product select x, 'ABC'||x::text from generate_series(1,100) x(x);

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,sum(s.quantity) qty from sale s inner join product p on s.product_id = p.product_id group by p.product_id;
                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=341558.25..341559.25 rows=100 width=17) (actual rows=100 loops=1)
   Group Key: p.product_id
   ->  Hash Join  (cost=3.25..291558.25 rows=10000000 width=13) (actual rows=10000000 loops=1)
         Hash Cond: (s.product_id = p.product_id)
         ->  Seq Scan on sale s  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
         ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.308 ms
 Execution time: 7568.927 ms
(10 rows)

Time: 7569.842 ms (00:07.570)

postgres=# explain (analyze, timing off) select p.product_id,p.productdesc,s.qty from (select product_id,sum(quantity) qty from sale group by product_id) s inner join product p on s.product_id = p.product_id;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Hash Join  (cost=204058.25..204061.63 rows=100 width=17) (actual rows=100 loops=1)
   Hash Cond: (sale.product_id = p.product_id)
   ->  HashAggregate  (cost=204055.00..204056.00 rows=100 width=12) (actual rows=100 loops=1)
         Group Key: sale.product_id
         ->  Seq Scan on sale  (cost=0.00..154055.00 rows=10000000 width=8) (actual rows=10000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Seq Scan on product p  (cost=0.00..2.00 rows=100 width=9) (actual rows=100 loops=1)
 Planning time: 0.610 ms
 Execution time: 4267.145 ms
(10 rows)

So the pushed-down version runs in 56% of the time of the non-pushed-down version. Of course, as mentioned by Robert, both versions would have to be costed in case the join condition was highly selective.

There's a good paper which goes into a bit more detail on this: http://www.vldb.org/conf/1995/P345.PDF

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Oct 9, 2016 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 5, 2016 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think this is fundamentally wrongheaded. If we go that route,
>>> the only valid relids for any upper path would be the union of all
>>> baserel RTIs, ...
>
>> Hmm, but this is only true if the upper steps are always done last.
>> Hackers on this list have been hoping to reorder joins with aggregates
>> since at least 2008 - probably sooner, but that's when I started
>> reading this mailing list. A simple example is:
>
>> SELECT order_line.order_id, order.customer_id, SUM(order_line.amount)
>> FROM order_line, order WHERE order_line.order_id = order.order_id
>> GROUP BY 1,2;
>
>> Doing the aggregation step first is likely to be much faster than
>> doing the join first here,
>
> Please provide some reason to believe that. It's the nature of an
> aggregate that it's sensitive to the number of rows going through it,
> with only a tiny number of exceptions (and SUM ain't one). So you could
> only push it down past joins that won't change the number of rows the
> aggregate will process, and how is that going to make it any faster?

Hmm, I guess it only works if you know that order (order_id) is unique. If that's true, then each order_line.order_id will match 0 or 1 values in order.order_id. If you had duplicates in order.order_id, then SUM(order_line.amount) would effectively be multiplied by the number of duplicates, so pushing the join down doesn't work. But if order (order_id) is unique, then I think it's OK. For any given value of order_line.order_id, either every row in the group matches the one and only row in the order table where that value appears in order.order_id, or there is no such row and the whole group vanishes.
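The uniqueness condition described above is easy to check with a toy example. The following is plain Python (not planner code), with made-up data shaped like the order/order_line example: when order_id is unique in orders, grouping before the join and grouping after it give identical sums, and an unmatched group simply vanishes either way.

```python
# Toy check (not planner code) that aggregate-before-join equals
# join-before-aggregate when the join key is unique on the orders side.
# All data values are invented for illustration.

from collections import defaultdict

orders = {101: "cust_A", 102: "cust_B"}                 # order_id -> customer_id; key is unique
order_line = [(101, 5), (101, 7), (102, 3), (103, 9)]   # (order_id, amount); 103 has no order

# Join first, then GROUP BY (order_id, customer_id):
join_first = defaultdict(int)
for oid, amount in order_line:
    if oid in orders:                       # inner join
        join_first[(oid, orders[oid])] += amount

# Aggregate first by order_id, then join:
sums = defaultdict(int)
for oid, amount in order_line:
    sums[oid] += amount
agg_first = {(oid, orders[oid]): total
             for oid, total in sums.items() if oid in orders}

# Same answer either way; the group for order 103 vanishes in both.
assert dict(join_first) == agg_first
print(agg_first)  # {(101, 'cust_A'): 12, (102, 'cust_B'): 3}
```

If orders held duplicate order_id values instead, the join-first version would count each order_line row once per duplicate, which is exactly why the transformation is only valid given the uniqueness assumption.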
As for why it should be faster: my intuition is that the aggregate might be slightly faster because the rows passing through it can be narrower, but the real win is that the join should be quite a lot faster, because the number of rows that need to be joined is probably being reduced by the aggregation step, possibly quite drastically. Here's basically that example with 10GB of TPC-H data:

rhaas=# explain analyze select l_orderkey, o_custkey, sum(l_extendedprice) from lineitem, orders where l_orderkey = o_orderkey group by 1, 2;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=15665868.56..17015554.73 rows=59986052 width=48) (actual time=114794.016..156967.012 rows=15000000 loops=1)
   Group Key: lineitem.l_orderkey, orders.o_custkey
   ->  Sort  (cost=15665868.56..15815833.69 rows=59986052 width=24) (actual time=114793.999..126303.591 rows=59986052 loops=1)
         Sort Key: lineitem.l_orderkey, orders.o_custkey
         Sort Method: external sort  Disk: 2031640kB
         ->  Merge Join  (cost=244.37..4225682.88 rows=59986052 width=24) (actual time=0.020..56368.444 rows=59986052 loops=1)
               Merge Cond: (orders.o_orderkey = lineitem.l_orderkey)
               ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.009..5722.540 rows=15000000 loops=1)
               ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.008..25766.270 rows=59986052 loops=1)
 Planning time: 0.251 ms
 Execution time: 158320.203 ms
(11 rows)

rhaas=# explain analyze select l_orderkey, o_custkey, l_extendedprice from (select l_orderkey, sum(l_extendedprice) l_extendedprice from lineitem group by 1) x, orders where l_orderkey = o_orderkey;
                                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1.00..3789127.49 rows=378670 width=48) (actual time=0.104..76107.364 rows=15000000 loops=1)
   Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
   ->  GroupAggregate  (cost=0.56..3077357.98 rows=378670 width=40) (actual time=0.071..59870.322 rows=15000000 loops=1)
         Group Key: lineitem.l_orderkey
         ->  Index Scan using idx_lineitem_orderkey on lineitem  (cost=0.56..2772694.34 rows=59986052 width=16) (actual time=0.057..24382.053 rows=59986052 loops=1)
   ->  Index Scan using orders_pkey on orders  (cost=0.43..665749.44 rows=15000000 width=12) (actual time=0.030..6115.856 rows=15000000 loops=1)
 Planning time: 0.117 ms
 Execution time: 77266.600 ms
(8 rows)

Actually, the major culprit here is that the plan which does the join first introduces a totally useless sort, but it looks like the second plan would be modestly faster even if the planner were smart enough not to do that. If I set work_mem to 1GB and max_parallel_workers_per_gather to 8, the time for the aggregate-first plan improves to 66s, but the join-first plan doesn't change. I'm not sure this is actually a great example; 15 million output rows is a lot. But I've seen this kind of thing plenty of times in the real world; see also David Rowley's example.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company