Re: bad select performance fixed by forbidding hash joins - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: bad select performance fixed by forbidding hash joins |
| Date | |
| Msg-id | 14955.933693143@sss.pgh.pa.us |
| List | pgsql-hackers |
A few weeks ago, George Young <gry@ll.mit.edu> complained that the following query:

> select os.name,r.run_name,ro.status from opset_steps os,runs r,
> run_opsets ro where (ro.status=3 or ro.status=1) and
> ro.opset_id=os.opset_id and ro.run_id=r.run_id and
> ro.opset_ver=os.opset_ver and r.status=1;

had horrible performance when executed via the system's preferred plan,

> Hash Join (cost=1793.58 rows=14560 width=38)
>   -> Hash Join (cost=1266.98 rows=14086 width=24)
>     -> Seq Scan on run_opsets ro (cost=685.51 rows=13903 width=8)
>     -> Hash (cost=70.84 rows=1389 width=16)
>       -> Seq Scan on opset_steps os (cost=70.84 rows=1389 width=16)
>   -> Hash (cost=47.43 rows=374 width=14)
>     -> Seq Scan on runs r (cost=47.43 rows=374 width=14)

I have looked into this, and it seems that the problem is this: the innermost hash join between run_opsets and opset_steps is being done on the join clause ro.opset_ver=os.opset_ver. In George's data, the opset_ver columns have only about 14 distinct values, with a very strong bias towards the values 1, 2, 3. This means that the vast majority of the opset_steps entries go into only three hash buckets, and the vast majority of the probes from run_opsets search one of those same three buckets, so that most of the run_opsets rows are being compared to almost a third of the opset_steps rows, not just a small fraction of them. Almost all of the runtime of the query is going into the tuple comparison tests :-(

It seems clear that we want the system not to risk using a hashjoin unless it has good evidence that the inner table's column has a fairly flat distribution. I'm thinking that the right sort of check would be to check whether the "disbursion" statistic set by VACUUM ANALYZE is fairly small, maybe 0.01 or less (but not zero, which would suggest that VACUUM ANALYZE has never been run).
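[Editor's note: the skew effect described above can be sketched with a toy simulation; this is not part of the original post, and the row counts and key distributions below are merely modeled loosely on the plan's estimates.]

```python
from collections import defaultdict

def hashjoin_comparisons(inner_keys, outer_keys, nbuckets=1024):
    """Count the key comparisons a simple in-memory hash join performs:
    each outer probe is compared against every inner tuple that landed
    in the bucket its key hashes to."""
    buckets = defaultdict(list)
    for k in inner_keys:
        buckets[hash(k) % nbuckets].append(k)
    return sum(len(buckets[hash(k) % nbuckets]) for k in outer_keys)

# Skewed inner key: almost all 1389 opset_steps rows carry one of
# three values (like opset_ver in George's data).
skewed_inner = [i % 3 for i in range(1389)]
# Flat inner key: nearly unique values (like opset_id).
flat_inner = list(range(1389))
# 13903 probes from run_opsets, spread over 14 values.
probes = [i % 14 for i in range(13903)]

print(hashjoin_comparisons(skewed_inner, probes))  # millions-scale
print(hashjoin_comparisons(flat_inner, probes))    # tens-of-thousands-scale
```

With the skewed key, every probe that hits one of the three hot buckets is compared against roughly a third of the inner table, so the comparison count is about fifty times higher than with the flat key, which matches the complaint that nearly all the runtime went into tuple comparison tests.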
This would roughly correspond to the most common value appearing not more than 1% of the time, so that we can be sure at least 100 different hash buckets will be used. Comments? Is that value too small? This change is likely to reduce the optimizer's willingness to use hashjoins by a *lot*, especially if we make the threshold too small.

If you'd like to see what kind of disbursion numbers you get on your own data, try something like

select relname,attname,attdisbursion from pg_class,pg_attribute
where attrelid = pg_class.oid
and relkind = 'r' and attnum > 0
order by relname,attname;

after a vacuum analyze.

regards, tom lane

PS: George, in the meantime I bet your query would run fine if the system would only choose the opset_id clause instead of opset_ver to do the hashjoin with --- opset_id has far better distribution. I'm guessing that it thinks the two clauses are equally attractive and is just choosing whichever one it happens to process first (or last?). You might try rearranging the order of the WHERE clauses as a stopgap solution...
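[Editor's note: the proposed 1% threshold test can be sketched as follows; this is not from the original post, the function names are hypothetical, and "disbursion" is approximated here as the fraction of rows holding the most common value, per the description above.]

```python
from collections import Counter

def most_common_fraction(values):
    """Fraction of rows holding the column's most common value --
    a stand-in for the 'disbursion' statistic discussed above."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

def hashjoin_looks_safe(values, threshold=0.01):
    """Apply the proposed rule: only trust a hashjoin on this column
    if the most common value covers at most ~1% of the rows, which
    guarantees at least ~100 distinct hash buckets get used."""
    return most_common_fraction(values) <= threshold

# An opset_ver-like column: 14 distinct values, heavily biased to 1,2,3.
skewed = [1] * 500 + [2] * 400 + [3] * 300 + list(range(4, 15)) * 10
print(most_common_fraction(skewed))   # far above 0.01
print(hashjoin_looks_safe(skewed))    # False: don't hash on this column
```

A nearly unique column such as opset_id would pass this test easily, which is consistent with the PS above: joining on opset_id instead of opset_ver avoids the hot-bucket problem entirely.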