Thread: Wrong plan for subSELECT with GROUP BY

Wrong plan for subSELECT with GROUP BY

From: Antal Attila
Hi!

Please look at the following case. It is a theoretical case that
illustrates a problem I am having. Please help!

CREATE TABLE a (
    id SERIAL PRIMARY KEY,
    col TEXT
);

CREATE TABLE b (
    id SERIAL PRIMARY KEY,
    a_id INT4 REFERENCES a (id),
    value INT4,
    txt TEXT
);

CREATE TABLE c (
    id SERIAL PRIMARY KEY,
    a_id INT4 REFERENCES a (id),
    value INT4,
    txt TEXT
);

I examined the following query; in my opinion, its plan is not optimal.
There are indexes on b.a_id, b.value, b.txt, c.a_id, c.value, and c.txt.
I generated test data: 10,000 rows in table 'a', 100,000 rows in table
'b', and 100,000 rows in table 'c'.
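
For completeness, the setup might have looked like this (the a_id index
names match the second query plan below; the other index names and the
data generator are illustrative assumptions, not the exact script):

CREATE INDEX idx_b_aid ON b (a_id);
CREATE INDEX idx_b_value ON b (value);
CREATE INDEX idx_b_txt ON b (txt);
CREATE INDEX idx_c_aid ON c (a_id);
CREATE INDEX idx_c_value ON c (value);
CREATE INDEX idx_c_txt ON c (txt);

-- Assumed generator: spread the b and c rows evenly over a.id
INSERT INTO a (col)
    SELECT 'a' || g::text FROM generate_series(1, 10000) g;
INSERT INTO b (a_id, value, txt)
    SELECT (g % 10000) + 1, g, 'b' || g::text FROM generate_series(1, 100000) g;
INSERT INTO c (a_id, value, txt)
    SELECT (g % 10000) + 1, g, 'c' || g::text FROM generate_series(1, 100000) g;
ANALYZE;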

SELECT a.id, a.col,
       TABLE_B.minval, TABLE_B.b_txt,
       TABLE_C.maxval, TABLE_C.c_txt
FROM a
LEFT JOIN (
    SELECT a_id,
           min(value) AS minval,
           max(txt) AS b_txt
    FROM b
    GROUP BY a_id) TABLE_B ON (TABLE_B.a_id = a.id)
LEFT JOIN (
    SELECT a_id,
           max(value) AS maxval,
           min(txt) AS c_txt
    FROM c
    GROUP BY a_id) TABLE_C ON (TABLE_C.a_id = a.id)
LIMIT 10;

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6907.56..6932.86 rows=10 width=84) (actual time=750.075..750.184 rows=10 loops=1)
   ->  Hash Left Join  (cost=6907.56..34114.19 rows=10753 width=84) (actual time=750.071..750.163 rows=10 loops=1)
         Hash Cond: ("outer".id = "inner".a_id)
         ->  Merge Left Join  (cost=4171.85..4370.95 rows=10000 width=48) (actual time=400.712..400.775 rows=10 loops=1)
               Merge Cond: ("outer".id = "inner".a_id)
               ->  Sort  (cost=823.39..848.39 rows=10000 width=12) (actual time=31.926..31.935 rows=10 loops=1)
                     Sort Key: a.id
                     ->  Seq Scan on a  (cost=0.00..159.00 rows=10000 width=12) (actual time=0.013..12.120 rows=10000 loops=1)
               ->  Sort  (cost=3348.47..3373.32 rows=9940 width=40) (actual time=368.741..368.762 rows=22 loops=1)
                     Sort Key: table_b.a_id
                     ->  Subquery Scan table_b  (cost=2440.00..2688.50 rows=9940 width=40) (actual time=305.836..338.764 rows=10000 loops=1)
                           ->  HashAggregate  (cost=2440.00..2589.10 rows=9940 width=17) (actual time=305.832..320.891 rows=10000 loops=1)
                                 ->  Seq Scan on b  (cost=0.00..1690.00 rows=100000 width=17) (actual time=0.012..125.888 rows=100000 loops=1)
         ->  Hash  (cost=2708.83..2708.83 rows=10753 width=40) (actual time=349.314..349.314 rows=10000 loops=1)
               ->  Subquery Scan table_c  (cost=2440.00..2708.83 rows=10753 width=40) (actual time=298.826..331.239 rows=10000 loops=1)
                     ->  HashAggregate  (cost=2440.00..2601.30 rows=10753 width=17) (actual time=298.821..313.603 rows=10000 loops=1)
                           ->  Seq Scan on c  (cost=0.00..1690.00 rows=100000 width=17) (actual time=0.015..124.996 rows=100000 loops=1)
 Total runtime: 757.818 ms

But I can optimize the previous query by hand:

SELECT a.id, a.col,
       (SELECT min(value) FROM b WHERE b.a_id = a.id) AS minval,
       (SELECT max(txt)   FROM b WHERE b.a_id = a.id) AS b_txt,
       (SELECT max(value) FROM c WHERE c.a_id = a.id) AS maxval,
       (SELECT min(txt)   FROM c WHERE c.a_id = a.id) AS c_txt
FROM a
LIMIT 10;

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..126.76 rows=10 width=12) (actual time=0.221..1.754 rows=10 loops=1)
   ->  Seq Scan on a  (cost=0.00..126764.21 rows=10000 width=12) (actual time=0.218..1.734 rows=10 loops=1)
         SubPlan
           ->  Aggregate  (cost=3.15..3.16 rows=1 width=9) (actual time=0.039..0.039 rows=1 loops=10)
                 ->  Index Scan using idx_c_aid on c  (cost=0.00..3.12 rows=9 width=9) (actual time=0.007..0.024 rows=10 loops=10)
                       Index Cond: (a_id = $0)
           ->  Aggregate  (cost=3.15..3.16 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=10)
                 ->  Index Scan using idx_c_aid on c  (cost=0.00..3.12 rows=9 width=4) (actual time=0.008..0.025 rows=10 loops=10)
                       Index Cond: (a_id = $0)
           ->  Aggregate  (cost=3.16..3.17 rows=1 width=9) (actual time=0.038..0.039 rows=1 loops=10)
                 ->  Index Scan using idx_b_aid on b  (cost=0.00..3.14 rows=10 width=9) (actual time=0.006..0.022 rows=10 loops=10)
                       Index Cond: (a_id = $0)
           ->  Aggregate  (cost=3.16..3.17 rows=1 width=4) (actual time=0.039..0.040 rows=1 loops=10)
                 ->  Index Scan using idx_b_aid on b  (cost=0.00..3.14 rows=10 width=4) (actual time=0.008..0.025 rows=10 loops=10)
                       Index Cond: (a_id = $0)
 Total runtime: 1.933 ms

There is a huge difference between the two queries' performance. Why?
My problem is that the first query hash-aggregates the whole of tables
'b' and 'c' and takes no notice of the 'LIMIT 10'.
In my real case I cannot use the second formulation; I simplified my
problem down to this theoretical case.
Do you know why the first query cannot take advantage of the 'LIMIT'
clause? These tables are big, so in my opinion the planner could
optimize this better.

If this is a deficiency of the planner, I'd like to suggest this as a
feature for the planner.

Regards,
  Antal Attila

Re: Wrong plan for subSELECT with GROUP BY

From: Tom Lane
Antal Attila <atesz@ritek.hu> writes:
> If this is a deficiency of the planner, I'd like to suggest this as a
> feature for the planner.

This really falls into the category of "you've got to be kidding".
There's no way that it'd be reasonable for the planner to expend cycles
on every query to look for corner cases like this.  Do the
hand-optimization instead.

            regards, tom lane

Re: Wrong plan for subSELECT with GROUP BY

From: "Jim C. Nasby"
If you wrap the LIMITed select into a subquery in the FROM clause, the
planner might figure it out...

SELECT ...
    FROM (SELECT blah FROM a LIMIT 10) AS sub
        LEFT JOIN ...
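
Applied to the original query, that rewrite might look like this (a
sketch only; as in the original, LIMIT without an ORDER BY returns ten
arbitrary rows):

SELECT sub.id, sub.col,
       TABLE_B.minval, TABLE_B.b_txt,
       TABLE_C.maxval, TABLE_C.c_txt
FROM (SELECT id, col FROM a LIMIT 10) AS sub
LEFT JOIN (
    SELECT a_id, min(value) AS minval, max(txt) AS b_txt
    FROM b
    GROUP BY a_id) TABLE_B ON (TABLE_B.a_id = sub.id)
LEFT JOIN (
    SELECT a_id, max(value) AS maxval, min(txt) AS c_txt
    FROM c
    GROUP BY a_id) TABLE_C ON (TABLE_C.a_id = sub.id);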

Unlike some other databases, which spend huge amounts of time trying to
re-arrange queries and then hope they can effectively cache the query
plans, PostgreSQL prefers not to spend cycles on more esoteric cases so
that query planning stays fast.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Wrong plan for subSELECT with GROUP BY

From: Simon Riggs
On Fri, 2006-05-12 at 10:05 -0400, Tom Lane wrote:
> Antal Attila <atesz@ritek.hu> writes:
> > If this is a deficiency of the planner, I'd like to suggest this as a
> > feature for the planner.
>
> This really falls into the category of "you've got to be kidding".

Agreed

> There's no way that it'd be reasonable for the planner to expend cycles
> on every query to look for corner cases like this.

OT: Should we have a way of telling the optimizer how much time and
effort we would like it to expend? Some of the new optimizations, and
many yet to come, cover smaller and smaller sub-cases.

At least internally, we could mark the cost-of-optimization as we go, so
we can play with the external interface later.
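
A purely hypothetical external interface (no such setting exists in
PostgreSQL; this is only an illustration of the idea) might be a single
knob:

-- Hypothetical, for illustration only: not an actual PostgreSQL GUC.
SET planner_effort = 'exhaustive';  -- e.g. 'minimal' | 'default' | 'exhaustive'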

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com