Re: slow joining very large table to smaller ones - Mailing list pgsql-performance
| From | Dan Harris |
| --- | --- |
| Subject | Re: slow joining very large table to smaller ones |
| Date | |
| Msg-id | A8A7C7A0-7DC3-449C-A081-C1557DB9E16A@drivefaster.net |
| In response to | Re: slow joining very large table to smaller ones (John A Meinel <john@arbash-meinel.com>) |
| List | pgsql-performance |
On Jul 14, 2005, at 5:12 PM, John A Meinel wrote:

> Dan Harris wrote:
>
>>> Well, postgres is estimating around 500 rows each, is that way off?
>>> Try just doing:
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>
> Once again, do this and post the results. We might just need to tweak
> your settings so that it estimates the number of rows correctly, and
> we don't need to do anything else.

Ok, sorry I missed these the first time through:

explain analyze select incidentid from k_b where id = 107;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using k_b_idx on k_b  (cost=0.00..1926.03 rows=675 width=14) (actual time=0.042..298.394 rows=2493 loops=1)
   Index Cond: (id = 107)
 Total runtime: 299.103 ms

select count(*) from k_b;
 count
--------
 698350

( sorry! I think I said this one only had tens of thousands in it )

explain analyze select incidentid from k_r where id = 94;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Index Scan using k_r_idx on k_r  (cost=0.00..2137.61 rows=757 width=14) (actual time=0.092..212.187 rows=10893 loops=1)
   Index Cond: (id = 94)
 Total runtime: 216.498 ms
(3 rows)

select count(*) from k_r;
 count
--------
 671670

That one is quite a bit slower, yet it has the same table structure and the same index as k_b, and it has fewer records. I did run VACUUM ANALYZE immediately before running these queries. It seems a lot better with join_collapse_limit set.

> Well, the planner is powerful enough to flatten nested selects. To
> make it less "intelligent" you can do:
> SET join_collapse_limit = 1;
> or
> SET join_collapse_limit = 0;
> which should tell postgres to not try and get tricky with your query.
> Again, *usually* the planner knows better than you do, so again just
> do it to see what you get.

Ok, when join_collapse_limit = 1 I get this now:

explain analyze select recordtext from eventactivity
  join ( select incidentid from k_r join k_b using (incidentid)
         where k_r.id = 94 and k_b.id = 107 ) a
  using (incidentid);
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..156509.08 rows=2948 width=35) (actual time=1.555..340.625 rows=24825 loops=1)
   ->  Nested Loop  (cost=0.00..5361.89 rows=6 width=28) (actual time=1.234..142.078 rows=366 loops=1)
         ->  Index Scan using k_b_idx on k_b  (cost=0.00..1943.09 rows=681 width=14) (actual time=0.423..56.974 rows=2521 loops=1)
               Index Cond: (id = 107)
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..5.01 rows=1 width=14) (actual time=0.031..0.031 rows=0 loops=2521)
               Index Cond: ((k_r.id = 94) AND ((k_r.incidentid)::text = ("outer".incidentid)::text))
   ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..25079.55 rows=8932 width=49) (actual time=0.107..0.481 rows=68 loops=366)
         Index Cond: ((eventactivity.incidentid)::text = ("outer".incidentid)::text)
 Total runtime: 347.975 ms

MUCH better! Maybe you can help me understand what I did, and whether I need to make something permanent to get this behavior from now on?
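To make the answer to that last question concrete: join_collapse_limit can be pinned at several scopes, not just per session. A minimal sketch, with hypothetical database and user names ("mydb", "appuser"):

-- Session scope: lasts until this connection closes.
SET join_collapse_limit = 1;

-- Database scope: default for every new session in that database.
ALTER DATABASE mydb SET join_collapse_limit = 1;

-- Role scope: default for every new session opened by that user.
ALTER USER appuser SET join_collapse_limit = 1;

-- Cluster scope: put "join_collapse_limit = 1" in postgresql.conf
-- and reload the server.

The trade-off John describes applies at any scope: with the limit at 1, explicit JOINs are planned strictly in the order the query writes them, which only helps when you can order them better than the planner's estimates do.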
> If you have analyzed recently can you do:
> SELECT relname, reltuples FROM pg_class WHERE relname='eventactivity';
>
> It is a cheaper form than "SELECT count(*) FROM eventactivity" to get
> an approximate estimate of the number of rows. But if it isn't too
> expensive, please also give the value from SELECT count(*) FROM
> eventactivity.
>
> Again, that helps us know if your tables are up-to-date.

Sure:

select relname, reltuples from pg_class where relname='eventactivity';
    relname    |  reltuples
---------------+-------------
 eventactivity | 3.16882e+07

select count(*) from eventactivity;
  count
----------
 31871142

>>> I don't know how selective your keys are, but one of these queries
>>> should probably structure it better for the planner. It depends a
>>> lot on how selective your query is.
>>
>> eventactivity currently has around 36 million rows in it. There
>> should only be maybe 200-300 incidentids at most that will be matched
>> with the combination of k_b and k_r. That's why I was thinking I
>> could somehow get a list of just the incidentids that matched the
>> id = 94 and id = 107 in k_b and k_r first. Then, I would only need to
>> grab a few hundred out of 36 million rows from eventactivity.
>
> Well, you can also try:
> SELECT count(*) FROM k_b JOIN k_r USING (incidentid)
>   WHERE k_b.id=?? AND k_r.id=??;
>
> That will tell you how many rows they have in common.

select count(*) from k_b join k_r using (incidentid)
  where k_b.id=107 and k_r.id=94;
 count
-------
   373

> Well, if you look at the latest plans, things have gone up from 44M to
> 156M. I don't know why it is worse, but it is getting there.

I assume this is because k_r and k_b are growing fairly rapidly right now. The time between queries included a lot of inserts. I was careful to VACUUM ANALYZE before sending statistics, as I did this time. I'm sorry if this has confused the issue.
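A side note on the estimates in these plans: k_b's id = 107 was estimated at 675 rows but returned 2493, and k_r's id = 94 was estimated at 757 but returned 10893. On fast-growing tables like these, raising the per-column statistics target gives ANALYZE a larger sample and usually tightens such estimates. This is a generic sketch of that knob, not advice given in the thread; the target of 1000 is an illustrative value (the default at the time was 10):

-- Sample these columns in more detail, then refresh the statistics.
ALTER TABLE k_b ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE k_r ALTER COLUMN id SET STATISTICS 1000;
ANALYZE k_b;
ANALYZE k_r;

A larger target makes ANALYZE slower and the stored statistics bigger, so it is worth raising only on columns the planner demonstrably misestimates.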