Thread: Encouraging multi-table join order
I have a query that is intended to select from multiple "small tables" to
get a limited subset of "incidentid" and then join with a "very large"
table.  One of the operations will require a sequential scan, but the
planner is doing the scan on the very large table before joining the small
ones, resulting in a huge amount of disk I/O.  How would I make this query
join the large table only after narrowing down the possible selections
from the smaller tables?  This is running on version 8.0.3.

Thanks for any ideas.

-Dan

QUERY
########################################

explain analyze
select distinct
       eventmain.incidentid,
       eventmain.entrydate,
       eventgeo.eventlocation,
       recordtext as retdata
from eventactivity
join ( select incidentid
       from k_h
       where id = 33396
         and k_h.entrydate >= '2006-1-1 00:00'
         and k_h.entrydate < '2006-4-8 00:00'
     ) id_keywords using ( incidentid ),
     eventmain,
     eventgeo
where eventmain.incidentid = eventactivity.incidentid
  and eventmain.incidentid = eventgeo.incidentid
  and ( ' ' || recordtext || ' ' like '%HAL%' )
  and eventactivity.entrydate >= '2006-1-1 00:00'
  and eventactivity.entrydate < '2006-4-8 00:00'
order by eventmain.entrydate
limit 10000;

EXPLAIN ANALYZE OUTPUT
########################################

Limit  (cost=2521191.65..2521191.90 rows=6 width=187) (actual time=1360935.787..1361072.277 rows=1400 loops=1)
  ->  Unique  (cost=2521191.65..2521191.90 rows=6 width=187) (actual time=1360935.779..1361067.853 rows=1400 loops=1)
        ->  Sort  (cost=2521191.65..2521191.66 rows=6 width=187) (actual time=1360935.765..1360958.258 rows=16211 loops=1)
              Sort Key: eventmain.entrydate, eventmain.incidentid, eventactivity.recordtext, eventgeo.eventlocation
              ->  Nested Loop  (cost=219.39..2521191.57 rows=6 width=187) (actual time=1123.115..1360579.798 rows=16211 loops=1)
                    ->  Nested Loop  (cost=219.39..2521173.23 rows=6 width=154) (actual time=1105.773..1325907.716 rows=16211 loops=1)
                          ->  Hash Join  (cost=219.39..2521153.37 rows=6 width=66) (actual time=1069.476..1289608.261 rows=16211 loops=1)
                                Hash Cond: (("outer".incidentid)::text = ("inner".incidentid)::text)
                                ->  Seq Scan on eventactivity  (cost=0.00..2518092.06 rows=1532 width=52) (actual time=57.205..1288514.530 rows=2621 loops=1)
                                      Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                                ->  Hash  (cost=217.53..217.53 rows=741 width=14) (actual time=899.128..899.128 rows=0 loops=1)
                                      ->  Index Scan using k_h_id_idx on k_h  (cost=0.00..217.53 rows=741 width=14) (actual time=55.097..893.883 rows=1162 loops=1)
                                            Index Cond: (id = 33396)
                                            Filter: ((entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                          ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..3.30 rows=1 width=88) (actual time=1.866..2.227 rows=1 loops=16211)
                                Index Cond: ((eventmain.incidentid)::text = ("outer".incidentid)::text)
                    ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.04 rows=1 width=75) (actual time=1.770..2.126 rows=1 loops=16211)
                          Index Cond: ((eventgeo.incidentid)::text = ("outer".incidentid)::text)
Total runtime: 1361080.787 ms
(19 rows)

Dan Harris <fbsd@drivefaster.net> writes:
> I have a query that is intended to select from multiple "small tables"
> to get a limited subset of "incidentid" and then join with a "very
> large" table.  One of the operations will require a sequential scan, but
> the planner is doing the scan on the very large table before joining the
> small ones, resulting in a huge amount of disk I/O.  How would I make
> this query join the large table only after narrowing down the possible
> selections from the smaller tables?  This is running on version 8.0.3.

That's very strange --- the estimated cost of the seqscan is high enough
that the planner should have chosen a nestloop with inner indexscan on
the big table.  I'm not sure about the join-order point, but the hash
plan for the first join seems wrong in any case.

Um, you do have an index on eventactivity.incidentid, right?  What's the
datatype(s) of the incidentid columns?  What happens to the plan if you
turn off enable_hashjoin and enable_mergejoin?

			regards, tom lane
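
(The experiment Tom suggests is session-local: the enable_* planner
settings only affect the current connection and are not persisted.
A minimal sketch, assuming the same query as in the first message:)

SET enable_hashjoin = off;
SET enable_mergejoin = off;

-- ... re-run the EXPLAIN ANALYZE from the first message here ...

-- Put the planner back to its defaults afterwards:
RESET enable_hashjoin;
RESET enable_mergejoin;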

Tom Lane wrote:
> That's very strange --- the estimated cost of the seqscan is high enough
> that the planner should have chosen a nestloop with inner indexscan on
> the big table.  I'm not sure about the join-order point, but the hash
> plan for the first join seems wrong in any case.
>
> Um, you do have an index on eventactivity.incidentid, right?  What's the
> datatype(s) of the incidentid columns?  What happens to the plan if you
> turn off enable_hashjoin and enable_mergejoin?
>
> 			regards, tom lane

Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).
Although, by checking this, I noticed that k_h.incidentid was
varchar(100).  Perhaps the difference in length between the keys caused
the planner to not use the fastest method?  I have no defense as to why
those aren't the same.. I will make them so and check.

Here's the EXPLAIN ANALYZE with enable_hashjoin = off and
enable_mergejoin = off:

Limit  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.016..74356.521 rows=888 loops=1)
  ->  Unique  (cost=4226535.73..4226544.46 rows=698 width=82) (actual time=74339.011..74354.073 rows=888 loops=1)
        ->  Sort  (cost=4226535.73..4226537.48 rows=698 width=82) (actual time=74339.003..74344.031 rows=3599 loops=1)
              Sort Key: eventmain.entrydate, eventmain.incidentid, eventgeo.eventlocation, eventactivity.recordtext
              ->  Nested Loop  (cost=0.00..4226502.76 rows=698 width=82) (actual time=921.325..74314.959 rows=3599 loops=1)
                    ->  Nested Loop  (cost=0.00..4935.61 rows=731 width=72) (actual time=166.354..14638.308 rows=1162 loops=1)
                          ->  Nested Loop  (cost=0.00..2482.47 rows=741 width=50) (actual time=150.396..7348.013 rows=1162 loops=1)
                                ->  Index Scan using k_h_id_idx on k_h  (cost=0.00..217.55 rows=741 width=14) (actual time=129.540..1022.243 rows=1162 loops=1)
                                      Index Cond: (id = 33396)
                                      Filter: ((entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))
                                ->  Index Scan using eventgeo_incidentid_idx on eventgeo  (cost=0.00..3.04 rows=1 width=36) (actual time=5.260..5.429 rows=1 loops=1162)
                                      Index Cond: ((eventgeo.incidentid)::text = ("outer".incidentid)::text)
                          ->  Index Scan using eventmain_incidentid_idx on eventmain  (cost=0.00..3.30 rows=1 width=22) (actual time=5.976..6.259 rows=1 loops=1162)
                                Index Cond: ((eventmain.incidentid)::text = ("outer".incidentid)::text)
                    ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
                          Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
                          Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))

Dan Harris <fbsd@drivefaster.net> writes:
> Yes, eventactivity.incidentid is indexed.  The datatype is varchar(40).
> Although, by checking this, I noticed that k_h.incidentid was
> varchar(100).  Perhaps the difference in length between the keys caused
> the planner to not use the fastest method?

No, the planner wouldn't care about that.

> Here's the EXPLAIN ANALYZE with enable_hashjoin = off and
> enable_mergejoin = off:

OK, so it does consider the "right" plan, but it's estimating it'll take
longer than the other one.  One thing that's very strange is that the
estimated number of rows out has changed ... did you re-ANALYZE since the
previous message?

>                     ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..5774.81 rows=20 width=52) (actual time=29.768..51.334 rows=3 loops=1162)
>                           Index Cond: (("outer".incidentid)::text = (eventactivity.incidentid)::text)
>                           Filter: ((((' '::text || (recordtext)::text) || ' '::text) ~~ '%HAL%'::text) AND (entrydate >= '2006-01-01 00:00:00'::timestamp without time zone) AND (entrydate < '2006-04-08 00:00:00'::timestamp without time zone))

So it's estimating 5775 cost units per probe into eventactivity, which is
pretty high --- it must think that a lot of rows will be retrieved by the
index (way more than the 20 or so it thinks will get past the filter
condition).

What does the pg_stats entry for eventactivity.incidentid contain?  It
might be worth increasing the statistics target for that column to try to
get a better estimate.

			regards, tom lane

Tom Lane wrote:
> <SNIP>
> So it's estimating 5775 cost units per probe into eventactivity, which
> is pretty high --- it must think that a lot of rows will be retrieved by
> the index (way more than the 20 or so it thinks will get past the filter
> condition).
>
> What does the pg_stats entry for eventactivity.incidentid contain?

select * from pg_stats
where tablename = 'eventactivity' and attname = 'incidentid';

-[ RECORD 1 ]-----+-----------------------------------------------------------------------------------------------------------------------------
schemaname        | public
tablename         | eventactivity
attname           | incidentid
null_frac         | 0
avg_width         | 14
n_distinct        | 8157
most_common_vals  | {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117}
most_common_freqs | {0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00133333,0.00133333,0.00133333}
histogram_bounds  | {P022140319,P030471058,P033090308,P041961082,P042910689,P050311006,P051350254,P052261148,P053270945,P060240316,P061000287}
correlation       | 0.241737

> It might be worth increasing the statistics target for that
> column to try to get a better estimate.

How high should I set this?  I read the default is 10, but I'm not sure
if doubling this would make a difference or if I should be doing a much
larger number.  There are approximately 45 million rows in the table, if
that matters.

Thanks again,
Dan

Dan Harris <fbsd@drivefaster.net> writes:
> Tom Lane wrote:
>> What does the pg_stats entry for eventactivity.incidentid contain?

> most_common_vals  | {P043190299,P051560740,P052581036,P052830218,P053100679,P053190889,P060370845,P042070391,P042690319,P043290117}
> most_common_freqs | {0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00166667,0.00133333,0.00133333,0.00133333}

> How high should I set this?  I read the default is 10, but I'm not sure
> if doubling this would make a difference or if I should be doing a much
> larger number.  There are approximately 45 million rows in the table, if
> that matters.

What the stats entry is saying is that the most common entries occur
about 75000 times apiece (0.00166667 * 45e6), which is what's scaring
the planner here ;-).  I think those frequencies are artificially high
though.  The default statistics sample size is 3000 rows (300 *
statistics target, actually), so those numbers correspond to 5 or 4 rows
in the sample, which is probably just random chance.

Try increasing the stats targets for this table to 100, then re-ANALYZE
and see what you get.  The most_common_freqs entries might drop as much
as a factor of 10.

			regards, tom lane
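
(A minimal sketch of what Tom is suggesting, using the table and column
names from this thread; ALTER TABLE ... SET STATISTICS and ANALYZE are
standard PostgreSQL commands and were already available in 8.0:)

-- Raise the per-column statistics target from the default of 10 to 100,
-- so the next ANALYZE samples roughly 30000 rows (300 * target) instead of 3000.
ALTER TABLE eventactivity ALTER COLUMN incidentid SET STATISTICS 100;

-- Recollect the statistics, then inspect pg_stats again and re-test the query.
ANALYZE eventactivity;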

Tom Lane wrote:
> What the stats entry is saying is that the most common entries occur
> about 75000 times apiece (0.00166667 * 45e6), which is what's scaring
> the planner here ;-).  I think those frequencies are artificially high
> though.  The default statistics sample size is 3000 rows (300 *
> statistics target, actually), so those numbers correspond to 5 or 4
> rows in the sample, which is probably just random chance.
>
> Try increasing the stats targets for this table to 100, then re-ANALYZE
> and see what you get.  The most_common_freqs entries might drop as much
> as a factor of 10.
>
> 			regards, tom lane

Tom:

I believe this was the problem.  I upped the statistics target to 100,
for a sample size of 30k, and now the planner does the correct nested
loop/index scan and takes only 30 seconds!  This is a HUGE performance
increase.

I wonder why the estimates were so far off the first time?  This table
has been ANALYZED regularly ever since creation.

Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on.

-Dan

Dan Harris <fbsd@drivefaster.net> writes:
> I wonder why the estimates were so far off the first time?  This table
> has been ANALYZED regularly ever since creation.

Probably just that you need a bigger sample size for such a large table.
We've been arguing ever since 7.2 about what the default statistics
target ought to be --- a lot of people think 10 is too small.

(It could also be that the fixed 300X multiplier ought to depend on table
size instead.  The math that told us 300X was OK was really about getting
the histogram right, not about whether the most-common-values stats would
be any good.)

			regards, tom lane
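
(For completeness, a sketch of raising the default that Tom mentions
rather than one column's target.  default_statistics_target is an
ordinary configuration parameter, so it can be set per session before an
ANALYZE, or cluster-wide in postgresql.conf; explicit per-column SET
STATISTICS values still take precedence.)

-- Per-session: affects ANALYZE runs issued in this session only.
SET default_statistics_target = 100;
ANALYZE eventactivity;

-- Cluster-wide: put "default_statistics_target = 100" in postgresql.conf
-- and reload the server configuration.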