Parameterized aggregate subquery (was: Pull up aggregate subquery) - Mailing list pgsql-hackers

From Hitoshi Harada
Subject Parameterized aggregate subquery (was: Pull up aggregate subquery)
Date
Msg-id BANLkTikGFtGnAaXVh5=ntRdN+4w+r=NPuw@mail.gmail.com
Whole thread Raw
Responses Re: Parameterized aggregate subquery (was: Pull up aggregate subquery)
List pgsql-hackers
Purpose & Goal
--------------
Allow planner to create NestLoop with parameterized aggregate subquery
just like inner IndexScan pattern. This helps to avoid unnecessary
aggregate process that would be filtered at the stage of upper join
filter in such case:

create table size_m as select i as id, repeat(i::text, i % 100) as val
from generate_series(1, 20000) i;
create table size_l as select i as id, m.id as m_id, repeat(i::text, i
% 100) as val from generate_series(1, 1000000) i, size_m m where i.i /
10 = m.id;
analyze size_m;
analyze size_l;
---- make this query faster
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';

In addition, it might be the first step to take account of "general
parameterized scan" design, although this proposal contains only
narrow cases of aggregate subquery.


Related information
-------------------
This work is the next step of the previously proposed "Pull up
aggregate subqery" thread.
http://archives.postgresql.org/message-id/BANLkTimW-EqS_9hk5AYj14R8U%3DuQoc6tuA%40mail.gmail.com


Design
------
In contract to the previous design, I made aggregate subquery
"parameterized" instead of pulling it up. The design is based on the
discussion with Tom Lane and Rober Haas et al. It has some benefits
compared with the pulling up appoach as, 1) parameterizing any scan
node other than index scan as nest loop's inner is our way to go, 2)
pulling up or pushing down any aggregate query has potential risk that
the aggregate results are wrong (, which may be solvable by adding
some constraints like "only when unique" etc,) 3) the decision whether
to make parameterized or not can be made by only cost without any
heuristics.

The idea is very similar to inner IndexScan. When NestLoop path is
created in match_unsorted_outer(), parameterized SubqueryScan path is
also considered, similarly to inner IndexScan paths. While IndexScan
is simple since its information like costs are well known by the base
relation, SubqueryScan should re-plan its Query to gain that, which is
expensive.

To do so, I added SubqueryPath node to store each subquery
information. Before patching, subplan, subrtable and subrowmark is
stored in RelOptInfo, but now we need to save them separately for each
parameterized one. This part might be split from the original patch
for easy review.

Result
------
Without breakage of regression test, it successfully makes subquery as
parameterized as the inner of NestLoop. Query performance is as aimed.

(without patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
                                                            QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=79393.64..82937.42 rows=100 width=12) (actual
time=1256.569..1733.903 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
(actual time=25.182..32.237 rows=1 loops=1)
         Filter: (val = '10101'::text)
   ->  GroupAggregate  (cost=79393.64..81592.65 rows=19901 width=277)
(actual time=772.791..1694.848 rows=20000 loops=1)
         ->  Sort  (cost=79393.64..79893.64 rows=200000 width=277)
(actual time=772.754..929.225 rows=200000 loops=1)
               Sort Key: size_l.m_id
               Sort Method: external sort  Disk: 56848kB
               ->  Seq Scan on size_l  (cost=0.00..9830.00 rows=200000
width=277) (actual time=0.030..198.093 rows=200000 loops=1)
 Total runtime: 1754.111 ms
(10 rows)

(with patch)
db1=# explain analyze
select m_id, sum_len from size_m m inner join(select m_id,
sum(length(val)) as sum_len from size_l group by m_id)l on m.id =
l.m_id where val = '10101';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..11674.86 rows=100 width=12) (actual
time=119.386..122.478 rows=1 loops=1)
   Join Filter: (m.id = size_l.m_id)
   ->  Seq Scan on size_m m  (cost=0.00..897.00 rows=1 width=4)
(actual time=9.720..12.811 rows=1 loops=1)
         Filter: (val = '10101'::text)
   ->  GroupAggregate  (cost=0.00..10330.08 rows=1 width=277) (actual
time=109.639..109.640 rows=1 loops=1)
         ->  Seq Scan on size_l  (cost=0.00..10330.00 rows=10
width=277) (actual time=59.138..109.611 rows=10 loops=1)
               Filter: (m.id = m_id)
 Total runtime: 122.612 ms
(8 rows)

I tested some more cases like "where val IN ('1', '10101')", "where
val between '1' and '10101'", which shows as good as I expected.

Concern
-------
Although execution time will be shorter in many cases, planning time
will be longer. Especially re-planning subquery with pushing join
quals for each joinrel step. I didn't measure the additional cost in
planner stage.


BTW, as I changed title and design from the previous post, should I
throw away the old commit fest entry and make the new one?

Regards,

--
Hitoshi Harada

Attachment

pgsql-hackers by date:

Previous
From: Pavan Deolasee
Date:
Subject: Re: Autoanalyze and OldestXmin
Next
From: Bhavin Kamani
Date:
Subject: postgresql 9.0.4 source compilation issue on OSX