Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery)) - Mailing list pgsql-hackers

From Yeb Havinga
Subject Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))
Date
Msg-id 4E2F163B.6060105@gmail.com
Whole thread Raw
In response to Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)  (Hitoshi Harada <umi.tanuki@gmail.com>)
Responses Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))
Re: Pull up aggregate sublink (was: Parameterized aggregate subquery (was: Pull up aggregate subquery))
List pgsql-hackers
On 2011-07-22 17:35, Hitoshi Harada wrote:
> 2011/7/23 Yeb Havinga<yebhavinga@gmail.com>:
>> Works like a charm :-). However, now there is always a copyObject of a
>> subquery even when the subquery is not safe for qual pushdown. The problem
>> with the previous issafe was that it was only assigned for
>> rel->baserestrictinfo != NIL. If it is assigned before the if statement, it
>> still works. See attached patch that avoids subquery copy for unsafe
>> subqueries, and also exits best_inner_subqueryscan before palloc of
>> differenttypes in case of unsafe queries.
>
> Ah, yeah, right. Too quick fix bloated my brain :P Thanks for testing!
> I'll check it more.

A few days ago I read Tomas Vondra's blog post about dss tpc-h queries 
on PostgreSQL at 
http://fuzzy.cz/en/articles/dss-tpc-h-benchmark-with-postgresql/ - in 
which he showed how to manually pull up a dss subquery to get a large 
speed up. Initially I thought: cool, this is probably now handled by 
Hitoshi's patch, but it turns out the subquery type in the dss query is 
different.

The original and rewritten queries are below. The debug_print_plan 
output shows the subquery is called from a opexpr (< l_quantity, 
subquery output) and the sublink type is EXPR_SUBLINK. Looking at the 
source code; pull_up_sublink only considers ANY and EXISTS sublinks. I'm 
wondering if this could be expanded to deal with EXPR sublinks. Clearly 
in the example Tomas has given this can be done. I'm wondering if there 
are show stoppers that prevent this to be possible in the general case, 
but can't think of any, other than the case of a sublink returning NULL 
and the opexpr is part of a larger OR expression or IS NULL; in which 
case it should not be pulled op, or perhaps it could be pulled up as 
outer join.

Thoughts?

regards,
Yeb


The original query:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < (
select
0.2 * avg(l_quantity)
from
lineitem
where
l_partkey = p_partkey
)
LIMIT 1;                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------ Limit
(cost=183345309.79..183345309.81rows=1 width=8)   ->  Aggregate  (cost=183345309.79..183345309.81 rows=1 width=8)
 ->  Hash Join  (cost=2839.99..183345307.76 rows=815 width=8)               Hash Cond: (public.lineitem.l_partkey =
part.p_partkey)              Join Filter: (public.lineitem.l_quantity < (SubPlan 1))               ->  Seq Scan on
lineitem (cost=0.00..68985.69 
 
rows=2399869 width=17)               ->  Hash  (cost=2839.00..2839.00 rows=79 width=4)                     ->  Seq Scan
onpart  (cost=0.00..2839.00 rows=79 
 
width=4)                           Filter: ((p_brand = 'Brand#13'::bpchar) AND 
(p_container = 'JUMBO PKG'::bpchar))               SubPlan 1                 ->  Aggregate  (cost=74985.44..74985.46
rows=1width=5)                       ->  Seq Scan on lineitem  (cost=0.00..74985.36 
 
rows=31 width=5)                             Filter: (l_partkey = part.p_partkey)

manually rewritten variant:

tpch=# explain select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part,
(SELECT l_partkey AS agg_partkey,                0.2 * avg(l_quantity) AS avg_quantity          FROM lineitem GROUP BY
l_partkey)part_agg
 
where
p_partkey = l_partkey
and agg_partkey = l_partkey
and p_brand = 'Brand#13'
and p_container = 'JUMBO PKG'
and l_quantity < avg_quantity
LIMIT 1;                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
Limit (cost=179643.88..179643.89 rows=1 width=8)   ->  Aggregate  (cost=179643.88..179643.89 rows=1 width=8)         ->
Hash Join  (cost=161865.21..178853.91 rows=315985 width=8)               Hash Cond: (public.lineitem.l_partkey =
part.p_partkey)              Join Filter: (public.lineitem.l_quantity < ((0.2 * 
 
avg(public.lineitem.l_quantity))))               ->  HashAggregate  (cost=80985.04..82148.65 rows=77574 
width=9)                     ->  Seq Scan on lineitem  (cost=0.00..68985.69 
rows=2399869 width=9)               ->  Hash  (cost=80849.63..80849.63 rows=2444 width=21)                     ->  Hash
Join (cost=2839.99..80849.63 rows=2444 
 
width=21)                           Hash Cond: (public.lineitem.l_partkey = 
part.p_partkey)                           ->  Seq Scan on lineitem  
(cost=0.00..68985.69 rows=2399869 width=17)                           ->  Hash  (cost=2839.00..2839.00 rows=79 
width=4)                                 ->  Seq Scan on part  
(cost=0.00..2839.00 rows=79 width=4)                                       Filter: ((p_brand = 
'Brand#13'::bpchar) AND (p_container = 'JUMBO PKG'::bpchar))




pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: sinval synchronization considered harmful
Next
From: Alexander Korotkov
Date:
Subject: Re: WIP: Fast GiST index build