Thread: query optimization differs between view and explicit query
I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is, and, more importantly whether/how I might coax the view to use a different plan.
The original query:
The original query:
rkh@csb-dev=> select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart", max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident", sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart", max(H.gstop) as "gstop" from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id where H.pseq_id=76 group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len \g /dev/null Time: 277.804 msNow as a view:
rkh@csb-dev=> create view v1 as select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart", max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident", sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart", max(H.gstop) as "gstop" from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len; CREATE VIEW Time: 103.041 ms rkh@csb-dev=> select * from v1 where pseq_id=76 \g /dev/null Time: 31973.979 ms Okay, that's ~100x slower. The plans: rkh@csb-dev=> explain select distinct on <snip... same as the first query above> QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=11157.75..11187.26 rows=454 width=40) -> GroupAggregate (cost=11157.75..11186.13 rows=454 width=40) -> Sort (cost=11157.75..11158.89 rows=454 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len -> Nested Loop (cost=11125.62..11137.71 rows=454 width=40) -> Index Scan using pseq_pkey on pseq q (cost=0.00..3.01 rows=2 width=6) Index Cond: (76 = pseq_id) -> Materialize (cost=11125.62..11127.89 rows=227 width=38) -> Nested Loop (cost=546.15..11125.62 rows=227 width=38) -> Hash Join (cost=546.15..10438.72 rows=227 width=34) Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id) -> Seq Scan on p2gblatalnhsp ah (cost=0.00..6504.03 rows=451503 width=8) -> Hash (cost=545.58..545.58 rows=227 width=34) -> Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..545.58 rows=227 wid Index Cond: (pseq_id = 76) -> Index Scan using p2gblataln_pkey on p2gblataln a (cost=0.00..3.01 rows=1 width=8) Index Cond: ("outer".p2gblataln_id = a.p2gblataln_id) (17 rows) rkh@csb-dev=> explain select * from v1 where pseq_id=76; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Subquery Scan v1 (cost=246907.54..281897.70 rows=2258 width=77) Filter: (pseq_id = 76) -> Unique (cost=246907.54..276254.13 rows=451486 width=40) -> GroupAggregate (cost=246907.54..275125.41 rows=451486 width=40) -> Sort (cost=246907.54..248036.25 rows=451486 width=40) Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len -> Hash Join (cost=14019.29..204503.24 rows=451486 width=40) Hash Cond: ("outer".p2gblataln_id = "inner".p2gblataln_id) -> Hash Join (cost=7632.79..191344.45 rows=451486 width=36) Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id) -> Merge Join (cost=0.00..176939.38 rows=451486 width=36) Merge Cond: ("outer".pseq_id = "inner".pseq_id) -> Index Scan using p2gblathsp_p_lookup on p2gblathsp h (cost=0.00..16102.40 rows=451485 widt
> rkh@csb-dev=> create view v1 as > select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart", > max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident", > sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart", > max(H.gstop) as "gstop" > from p2gblathsp H > join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id > join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id > join pseq Q on H.pseq_id=Q.pseq_id > group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len; > CREATE VIEW > Time: 103.041 ms What happens if you make it a function: CREATE FUNCTION f1() RETURNS ... AS ' select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart", max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident", sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart", max(H.gstop) as "gstop" from p2gblathsp H join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id join pseq Q on H.pseq_id=Q.pseq_id where H.pseq_id=76 group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len ' LANGUAGE SQL; I suspect that will be even faster than the normal (non-view) query. Chris
On Thu, 29 Jan 2004, Reece Hart wrote: > I have a large query which I would like to place in a view. The explicit > query is sufficiently fast, but the same query as a view is much slower > and uses a different plan. I would appreciate an explanation of why this > is, and, more importantly whether/how I might coax the view to use a > different plan. Well, in general select distinct on (A) A, B from table where B=10 order by A,B; is not always the same as select * from (select distinct on (A) A, B from table order by A,B) foo where B=10; If A is not unique, then given two rows of the same A value one with B=10 and one with another B value less than 10, the former is guaranteed to give you an A,10 row, the latter will give no such row AFAICS. If A is unique then the two queries are equivalent, but then distinct on (A) isn't terribly meaningful.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > On Thu, 29 Jan 2004, Reece Hart wrote: >> I have a large query which I would like to place in a view. The explicit >> query is sufficiently fast, but the same query as a view is much slower >> and uses a different plan. I would appreciate an explanation of why this >> is, and, more importantly whether/how I might coax the view to use a >> different plan. > Well, in general [ they're not the same query ] Right. The reason the performance is so much worse is that the restriction pseq_id=76 cannot be "pushed down" into the view subquery; we have to form the entire logical output of the view and then filter on pseq_id=76. In your inline query you have done the pushing down anyway and so the restriction is applied much lower in the plan, resulting in lots less work. But the results might be different. The point that Stephan makes is explicitly understood by the planner as of PG 7.4: * 3. If the subquery uses DISTINCT ON, we must not push down any quals that * refer to non-DISTINCT output columns, because that could change the set * of rows returned. It's hard to give any advice on how to make a faster view without more context. What's the actual intention in all this? What's the semantics of pseq_id --- is it unique? It might be you could fix the problem by adding pseq_id to the DISTINCT ON list, but we don't have enough info to understand whether that would break the desired behavior. regards, tom lane
Stephan, Tom-
Thanks. I now see that DISTINCT can't be moved within the plan as I thought. This is exactly the thinko that I was hoping someone would expose.
I've decided to abandon the DISTINCT clause. The view is more general and sufficiently fast without it, and callers can always add it themselves to achieve what I was doing in the explicit query.
I appreciate your time.
-Reece
Thanks. I now see that DISTINCT can't be moved within the plan as I thought. This is exactly the thinko that I was hoping someone would expose.
I've decided to abandon the DISTINCT clause. The view is more general and sufficiently fast without it, and callers can always add it themselves to achieve what I was doing in the explicit query.
I appreciate your time.
-Reece
-- Reece Hart, http://www.in-machina.com/~reece/, GPG:0x25EC91A0 |