Thread: query optimization differs between view and explicit query

query optimization differs between view and explicit query

From
Reece Hart
Date:
I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan.


The original query:
rkh@csb-dev=> select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
where H.pseq_id=76
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len
\g /dev/null
Time: 277.804 ms
Now as a view:
rkh@csb-dev=> create view v1 as
select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len;
CREATE VIEW
Time: 103.041 ms

rkh@csb-dev=> select * from v1 where pseq_id=76 \g /dev/null
Time: 31973.979 ms

Okay, that's ~100x slower. The plans:

rkh@csb-dev=> explain select distinct on <snip... same as the first query above>
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------- Unique  (cost=11157.75..11187.26 rows=454 width=40)
   ->  GroupAggregate  (cost=11157.75..11186.13 rows=454 width=40)
         ->  Sort  (cost=11157.75..11158.89 rows=454 width=40)
               Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
               ->  Nested Loop  (cost=11125.62..11137.71 rows=454 width=40)
                     ->  Index Scan using pseq_pkey on pseq q  (cost=0.00..3.01 rows=2 width=6)
                           Index Cond: (76 = pseq_id)
                     ->  Materialize  (cost=11125.62..11127.89 rows=227 width=38)
                           ->  Nested Loop  (cost=546.15..11125.62 rows=227 width=38)
                                 ->  Hash Join  (cost=546.15..10438.72 rows=227 width=34)
                                       Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
                                       ->  Seq Scan on p2gblatalnhsp ah  (cost=0.00..6504.03 rows=451503 width=8)
                                       ->  Hash  (cost=545.58..545.58 rows=227 width=34)
                                             ->  Index Scan using p2gblathsp_p_lookup on p2gblathsp h  (cost=0.00..545.58 rows=227 wid                                                   Index Cond: (pseq_id = 76)
                                 ->  Index Scan using p2gblataln_pkey on p2gblataln a  (cost=0.00..3.01 rows=1 width=8)
                                       Index Cond: ("outer".p2gblataln_id = a.p2gblataln_id)
(17 rows)


rkh@csb-dev=> explain select * from v1 where pseq_id=76;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------- Subquery Scan v1  (cost=246907.54..281897.70 rows=2258 width=77)
   Filter: (pseq_id = 76)
   ->  Unique  (cost=246907.54..276254.13 rows=451486 width=40)
         ->  GroupAggregate  (cost=246907.54..275125.41 rows=451486 width=40)
               ->  Sort  (cost=246907.54..248036.25 rows=451486 width=40)
                     Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
                     ->  Hash Join  (cost=14019.29..204503.24 rows=451486 width=40)
                           Hash Cond: ("outer".p2gblataln_id = "inner".p2gblataln_id)
                           ->  Hash Join  (cost=7632.79..191344.45 rows=451486 width=36)
                                 Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
                                 ->  Merge Join  (cost=0.00..176939.38 rows=451486 width=36)
                                       Merge Cond: ("outer".pseq_id = "inner".pseq_id)
                                       ->  Index Scan using p2gblathsp_p_lookup on p2gblathsp h  (cost=0.00..16102.40 rows=451485 widt        

Re: query optimization differs between view and explicit

From
Christopher Kings-Lynne
Date:
> rkh@csb-dev=> create view v1 as
> select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
> max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
> sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
> max(H.gstop) as "gstop"
> from p2gblathsp H
> join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
> join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
> join pseq Q on H.pseq_id=Q.pseq_id
> group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len;
> CREATE VIEW
> Time: 103.041 ms

What happens if you make it a function:

CREATE FUNCTION f1() RETURNS ... AS '
select distinct on (AH.p2gblataln_id)
AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as
"pct_ident",
sum(H.pstop-H.pstart+1) as
"aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
where H.pseq_id=76
group by
AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len
' LANGUAGE SQL;

I suspect that will be even faster than the normal (non-view) query.

Chris

Re: query optimization differs between view and explicit

From
Stephan Szabo
Date:
On Thu, 29 Jan 2004, Reece Hart wrote:

> I have a large query which I would like to place in a view. The explicit
> query is sufficiently fast, but the same query as a view is much slower
> and uses a different plan. I would appreciate an explanation of why this
> is, and, more importantly whether/how I might coax the view to use a
> different plan.

Well, in general

select distinct on (A) A, B
 from table
 where B=10
 order by A,B;

is not always the same as

select * from
 (select distinct on (A) A, B
   from table order by A,B) foo
 where B=10;

If A is not unique, then given two rows of the
same A value one with B=10 and one with another B
value less than 10, the former is guaranteed to give
you an A,10 row, the latter will give no such row AFAICS.

If A is unique then the two queries are equivalent,
but then distinct on (A) isn't terribly meaningful.

Re: query optimization differs between view and explicit

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> On Thu, 29 Jan 2004, Reece Hart wrote:
>> I have a large query which I would like to place in a view. The explicit
>> query is sufficiently fast, but the same query as a view is much slower
>> and uses a different plan. I would appreciate an explanation of why this
>> is, and, more importantly whether/how I might coax the view to use a
>> different plan.

> Well, in general [ they're not the same query ]

Right.  The reason the performance is so much worse is that the
restriction pseq_id=76 cannot be "pushed down" into the view subquery;
we have to form the entire logical output of the view and then filter
on pseq_id=76.  In your inline query you have done the pushing down
anyway and so the restriction is applied much lower in the plan,
resulting in lots less work.  But the results might be different.

The point that Stephan makes is explicitly understood by the planner as
of PG 7.4:

 * 3. If the subquery uses DISTINCT ON, we must not push down any quals that
 * refer to non-DISTINCT output columns, because that could change the set
 * of rows returned.

It's hard to give any advice on how to make a faster view without more
context.  What's the actual intention in all this?  What's the semantics
of pseq_id --- is it unique?  It might be you could fix the problem by
adding pseq_id to the DISTINCT ON list, but we don't have enough info
to understand whether that would break the desired behavior.

            regards, tom lane

Re: query optimization differs between view and explicit

From
Reece Hart
Date:
Stephan, Tom-

Thanks. I now see that DISTINCT can't be moved within the plan as I thought. This is exactly the thinko that I was hoping someone would expose.

I've decided to abandon the DISTINCT clause. The view is more general and sufficiently fast without it, and callers can always add it themselves to achieve what I was doing in the explicit query.

I appreciate your time.

-Reece

-- 
Reece Hart, http://www.in-machina.com/~reece/, GPG:0x25EC91A0