join to view over custom aggregate seems like it should be faster - Mailing list pgsql-performance
From | Merlin Moncure |
---|---|
Subject | join to view over custom aggregate seems like it should be faster |
Date | |
Msg-id | b42b73150704091205o668a084y1693520c974f3c6b@mail.gmail.com Whole thread Raw |
Responses |
Re: join to view over custom aggregate seems like it should be faster
|
List | pgsql-performance |
I have an odd performance issue on 8.2 that I'd thought I'd document here. I have a workaround, but I'm if there is something that I'm not seeing. ok, for starters: I have a large table that is basically organized like this: create table big ( key1 int, key2 int, ts timestamp [other fields] ); and a view most_recent_big which lists for each combination of key1 and key2, the '[other fields]' that are behind the highest (most recent) timestamp. The original view implementation involved a self join which is the classic sql approach to pulling values from a denormalized table (the real solution of course is to normalize the data but I can't do that for various reasons). This wasn't very fast, so I wrote a custom aggregate to optimize the view (there are usuallly very small #s of records for key1, key2 pair: create view latest_big_view as select key1, key2, max_other_fields[other fields] from big group by key1, key2; This worked very well, but sometimes the index on key1, key2 does not get utilized when joining against latest_big_view. Let's say I have a number of key1, key2 pairs in another table: for example: select * from foo, latest_big_view using (key1, key2); breaks down. here is a example of the 'breakdown' plan on real tables. selecting a single record from the view is very fast...1ms or less. The join can't 'see through' the view to filter the index. dev20400=# explain analyze select * from foo join latest_download using (host_id, software_binary_id); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Hash Join (cost=15.35..4616.65 rows=25 width=90) (actual time=229.623..10601.317 rows=494 loops=1) Hash Cond: ((latest_download.host_id = foo.host_id) AND (latest_download.software_binary_id = foo.software_binary_id)) -> GroupAggregate (cost=0.00..4499.01 rows=4535 width=94) (actual time=0.346..10370.383 rows=37247 loops=1) -> Index Scan using software_download_idx on software_download (cost=0.00..2526.53 rows=45342 width=94) (actual time=0.028..344.591 SubPlan -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.005..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.005..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.005..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.006..0.011 rows=1 loops=37247) -> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.005..0.010 rows=1 loops=37247) -> Hash (cost=7.94..7.94 rows=494 width=8) (actual time=5.568..5.568 rows=494 loops=1) -> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8) (actual time=0.018..2.686 rows=494 loops=1) Total runtime: 10604.260 ms (18 rows) Here is the same query but on the root table, instead of the view: dev20400=# explain analyze select * from foo join software_download using (host_id, software_binary_id); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..1521.60 rows=19 width=94) (actual time=0.084..24.992 rows=607 loops=1) -> Seq Scan on foo (cost=0.00..7.94 rows=494 width=8) (actual time=0.044..2.753 rows=494 loops=1) -> Index Scan using software_download_idx on software_download (cost=0.00..3.05 rows=1 width=94) (actual time=0.011..0.019 rows=1 loops=49 Index Cond: ((foo.host_id = software_download.host_id) AND (foo.software_binary_id = software_download.software_binary_id)) Total runtime: 28.385 ms (5 rows) I can use a trick with a function to make the view give out reasonalbe results: create function foo(int, int) returns latest_download as $$ select * from latest_download where software_binary_id = $1 and host_id = $2; $$ language sql; dev20400=# explain analyze select (v).* from (select foo(software_binary_id, host_id) as v from foo) q; QUERY PLAN --------------------------------------------------------------------------------------------------------- Subquery Scan q (cost=0.00..14.12 rows=494 width=32) (actual time=1.436..139.644 rows=494 loops=1) -> Seq Scan on foo (cost=0.00..9.18 rows=494 width=8) (actual time=1.414..131.144 rows=494 loops=1) Total runtime: 142.887 ms (3 rows) Time: 144.306 ms merlin
pgsql-performance by date: