Re: slow joining very large table to smaller ones - Mailing list pgsql-performance
From | Dan Harris |
---|---|
Subject | Re: slow joining very large table to smaller ones |
Date | |
Msg-id | C3470621-9DB0-444A-BAD1-F9995E0142FB@drivefaster.net |
In response to | Re: slow joining very large table to smaller ones (John A Meinel <john@arbash-meinel.com>) |
Responses | Re: slow joining very large table to smaller ones |
List | pgsql-performance |
On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:
>
> You might try giving it a little bit more freedom with:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity, k_r, k_b
> WHERE eventactivity.incidentid = k_r.incidentid
>   AND eventactivity.incidentid = k_b.incidentid
>   AND k_r.id = 94
>   AND k_b.id = 107
> -- AND k_r.incidentid = k_b.incidentid
> ;
>
> I'm pretty sure that would give identical results, just let the
> planner have a little bit more freedom about how it does it.
> Also, the last line is commented out because I think it is redundant.

Ok, I tried this one.  My ssh session keeps getting cut off by a
router somewhere between me and the server due to inactivity
timeouts, so all I know is that both the SELECT and the EXPLAIN
ANALYZE are taking over an hour to run.  Here's the plain EXPLAIN for
that one, since that's the best I can get:

explain select recordtext from eventactivity, k_r, k_b
where eventactivity.incidentid = k_r.incidentid
and eventactivity.incidentid = k_b.incidentid
and k_r.id = 94 and k_b.id = 107;

                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4186753.16 rows=46029271 width=49)
         ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4668.31 rows=1821 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52 rows=1879 width=68)
               Index Cond: (id = 94)
(13 rows)

> You might also try:
>
> EXPLAIN ANALYZE
> SELECT recordtext
> FROM eventactivity JOIN k_r USING (incidentid)
>      JOIN k_b USING (incidentid)
> WHERE k_r.id = 94
>   AND k_b.id = 107
> ;

Similar results here.  The query is taking at least an hour to
finish.

explain select recordtext from eventactivity
join k_r using (incidentid)
join k_b using (incidentid)
where k_r.id = 94 and k_b.id = 107;

                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=9542.77..4672831.12 rows=148391132 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4726.61..4542825.87 rows=15930238 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4184145.43 rows=46000104 width=49)
         ->  Sort  (cost=4726.61..4731.13 rows=1806 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4628.92 rows=1806 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4816.16..4820.82 rows=1863 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4714.97 rows=1863 width=68)
               Index Cond: (id = 94)
(13 rows)

> Also, if possible give us the EXPLAIN ANALYZE so that we know if the
> planner is making accurate estimates.  (You might send an EXPLAIN
> while waiting for the EXPLAIN ANALYZE to finish.)
>
> You can also try disabling merge joins, and see how that changes
> things.

Are there any negative side effects of doing this?
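For what it's worth, here's how I was going to try that -- just a
sketch, assuming enable_mergejoin is the relevant planner setting (a
plain SET is session-local, so it shouldn't affect other backends):

-- discourage merge joins for this session only
set enable_mergejoin = off;
explain select recordtext from eventactivity, k_r, k_b
where eventactivity.incidentid = k_r.incidentid
and eventactivity.incidentid = k_b.incidentid
and k_r.id = 94 and k_b.id = 107;
-- restore the default afterwards
set enable_mergejoin = on;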
> Well, postgres is estimating around 500 rows each, is that way off?
> Try just doing:
>
> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>
> and see if postgres estimates the number of rows properly.
>
> I assume you have recently VACUUM ANALYZEd, which means you might
> need to update the statistics target (ALTER TABLE k_b ALTER COLUMN
> incidentid SET STATISTICS 100).  The default is IIRC 10; the range
> is 1-1000, and higher is more accurate but makes ANALYZE slower.
>
>> Right now, it looks like pg is trying to sort the entire
>> eventactivity table for the merge join, which is taking several
>> minutes to do.  Can I rephrase this so that it does the searching
>> through k_r and k_b FIRST and then goes into eventactivity using
>> the index on incidentid?  It seems like that shouldn't be too hard
>> to make fast, but my SQL query skills are only average.
>
> To me, it looks like it is doing an index scan (on k_b.id) through
> k_b first, sorting the results by incidentid, then merge joining
> that with eventactivity.
>
> I'm guessing you actually want it to merge k_b and k_r to get extra
> selectivity before joining against eventactivity.  I think my
> alternate forms would let postgres realize this.  But if not, you
> could try:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity
>   JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
>         WHERE k_r.id = 94 AND k_b.id = 107)
>   USING (incidentid);

This one looks like the same plan as the others:

explain select recordtext from eventactivity
join ( select incidentid from k_r join k_b using (incidentid)
       where k_r.id = 94 and k_b.id = 107 ) a
using (incidentid);

                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=9793.33..4693149.15 rows=156544758 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4847.75..4557237.59 rows=16365843 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4191691.79 rows=46084161 width=49)
         ->  Sort  (cost=4847.75..4852.38 rows=1852 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4747.24 rows=1852 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4945.58..4950.36 rows=1913 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4841.30 rows=1913 width=68)
               Index Cond: (id = 94)
(13 rows)

> I don't know how selective your keys are, but one of these queries
> should probably structure it better for the planner.  It depends a
> lot on how selective your query is.

eventactivity currently has around 36 million rows in it.  There
should only be maybe 200-300 incidentids at most that will be matched
with the combination of k_b and k_r.  That's why I was thinking I
could somehow get a list of just the incidentids that matched id = 94
and id = 107 in k_b and k_r first.  Then I would only need to grab a
few hundred out of 36 million rows from eventactivity.  (There's a
sketch of what I mean at the end of this mail.)

> If you have 100M rows, the above query looks like it expects k_r to
> restrict it to 44M rows, and k_r + k_b down to 11M rows, which
> really should be a seq scan (> 10% of the rows = seq scan).  But if
> you are saying the selectivity is mis-estimated, it could be
> different.

Yeah, if I understand you correctly, I think the previous paragraph
shows this is a significant misestimate.
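Here's the sketch I mentioned above -- untested, against the same
tables as before, using an IN subquery in the hope that the planner
will run the small k_r/k_b join first and only then probe
eventactivity for the few hundred matching incidentids:

select recordtext
from eventactivity
where incidentid in (
    -- this inner join should yield only ~200-300 incidentids
    select k_r.incidentid
    from k_r
    join k_b using (incidentid)
    where k_r.id = 94
      and k_b.id = 107
);

If the planner still insists on sorting all of eventactivity for a
merge join, I suppose I could select the matching incidentids into a
temp table first and join against that, to force the two-step
approach by hand.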