join to view over custom aggregate seems like it should be faster - Mailing list pgsql-performance

From Merlin Moncure
Subject join to view over custom aggregate seems like it should be faster
Date
Msg-id b42b73150704091205o668a084y1693520c974f3c6b@mail.gmail.com
Whole thread Raw
Responses Re: join to view over custom aggregate seems like it should be faster
List pgsql-performance
I have an odd performance issue on 8.2 that I'd thought I'd document
here.  I have a workaround, but I'm if there is something that I'm not
seeing.

ok, for starters:
I have a large table that is basically organized like this:
create table big
(
  key1 int,
  key2 int,
  ts timestamp
  [other fields]
);

and a view most_recent_big which lists for each combination of key1
and key2, the '[other fields]' that are behind the highest (most
recent) timestamp.  The original view implementation involved a self
join which is the classic sql approach to pulling values from a
denormalized table (the real solution of course is to normalize the
data but I can't do that for various reasons).  This wasn't very fast,
so I wrote a custom aggregate to optimize the view (there are usuallly
very small #s of records for key1, key2 pair:

create view latest_big_view as
  select key1, key2, max_other_fields[other fields]
  from big
  group by key1, key2;

This worked very well, but sometimes the index on key1, key2 does not
get utilized when joining against latest_big_view.  Let's say I have a
number of key1, key2 pairs in another table:

for example:
select * from foo, latest_big_view using (key1, key2);
breaks down.

here is a example of the 'breakdown' plan on real tables. selecting a
single record from the view is very fast...1ms or less.  The join
can't 'see through' the view to filter the index.

dev20400=# explain analyze select * from foo join latest_download
using (host_id, software_binary_id);

     QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=15.35..4616.65 rows=25 width=90) (actual
time=229.623..10601.317 rows=494 loops=1)
   Hash Cond: ((latest_download.host_id = foo.host_id) AND
(latest_download.software_binary_id = foo.software_binary_id))
   ->  GroupAggregate  (cost=0.00..4499.01 rows=4535 width=94) (actual
time=0.346..10370.383 rows=37247 loops=1)
         ->  Index Scan using software_download_idx on
software_download  (cost=0.00..2526.53 rows=45342 width=94) (actual
time=0.028..344.591
         SubPlan
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.006..0.011 rows=1 loops=37247)
           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual
time=0.005..0.010 rows=1 loops=37247)
   ->  Hash  (cost=7.94..7.94 rows=494 width=8) (actual
time=5.568..5.568 rows=494 loops=1)
         ->  Seq Scan on foo  (cost=0.00..7.94 rows=494 width=8)
(actual time=0.018..2.686 rows=494 loops=1)
 Total runtime: 10604.260 ms
(18 rows)


Here is the same query but on the root table, instead of the view:
dev20400=# explain analyze select * from foo join software_download
using (host_id, software_binary_id);
                                                                    QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1521.60 rows=19 width=94) (actual
time=0.084..24.992 rows=607 loops=1)
   ->  Seq Scan on foo  (cost=0.00..7.94 rows=494 width=8) (actual
time=0.044..2.753 rows=494 loops=1)
   ->  Index Scan using software_download_idx on software_download
(cost=0.00..3.05 rows=1 width=94) (actual time=0.011..0.019 rows=1
loops=49
         Index Cond: ((foo.host_id = software_download.host_id) AND
(foo.software_binary_id = software_download.software_binary_id))
 Total runtime: 28.385 ms
(5 rows)

I can use a trick with a function to make the view give out reasonalbe results:

create function foo(int, int) returns latest_download as
$$ select * from latest_download where software_binary_id = $1 and
host_id = $2; $$ language sql;

dev20400=# explain analyze select (v).* from (select
foo(software_binary_id, host_id) as v from foo) q;
                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Subquery Scan q  (cost=0.00..14.12 rows=494 width=32) (actual
time=1.436..139.644 rows=494 loops=1)
   ->  Seq Scan on foo  (cost=0.00..9.18 rows=494 width=8) (actual
time=1.414..131.144 rows=494 loops=1)
 Total runtime: 142.887 ms
(3 rows)

Time: 144.306 ms

merlin

pgsql-performance by date:

Previous
From: "Alex Deucher"
Date:
Subject: Re: postgres 8.2 seems to prefer Seq Scan
Next
From: "Alex Deucher"
Date:
Subject: Re: postgres 8.2 seems to prefer Seq Scan