nested loop semijoin estimates - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | nested loop semijoin estimates |
Date | |
Msg-id | 5568F44C.1010300@2ndquadrant.com Whole thread Raw |
Responses |
Re: nested loop semijoin estimates
Re: nested loop semijoin estimates |
List | pgsql-hackers |
Hi, while looking at this post from pgsql-performance about plan changes http://www.postgresql.org/message-id/flat/20150529095117.GB15813@hjp.at I noticed that initial_cost_nestloop() does this in (9.1, mentioned in the pgsql-performance post uses the same logic): if (jointype == JOIN_SEMI || jointype == JOIN_ANTI) { double outer_matched_rows; Selectivity inner_scan_frac; run_cost += inner_run_cost; outer_matched_rows = rint(outer_path_rows * semifactors->outer_match_frac); inner_scan_frac = 2.0 / (semifactors->match_count + 1.0); if (outer_matched_rows > 1) run_cost += (outer_matched_rows - 1) * inner_rescan_run_cost * inner_scan_frac; ... } I wonder whether the run_cost += inner_run_cost; is actually correct, because this pretty much means we assume scanning the whole inner relation (once). Wouldn't something like this be more appropriate? run_cost += inner_run_cost * inner_scan_frac; i.e. only counting the proportional part of the inner run cost, just like we do for the rescans (i.e. until the first match)? Imagine a simple EXISTS() query that gets turned into a semijoin, with an inner index scan. The current code pretty much means we'll scan the whole result, even though we only really need a single matching query (to confirm the EXISTS clause). For cheap index scans (e.g. using a PK) this does not make much difference, but the example posted to pgsql-performance results in index scans matching ~50% of the table, which makes the whole index scan quite expensive, and much higher than the rescan cost (multiplied by inner_scan_frac). While investigating the pgsql-performance report I've prepared the attached testcase, producing similar data set (attached). So let's see how the code change impacts plan choice. With the current code, the planner produces a plan like this: QUERY PLAN ----------------------------------------------------------------------- Nested Loop (cost=169627.96..169636.03 rows=1 width=74) Join Filter: ((t.term)::text = (f.berechnungsart)::text) -> Index Scan using term_facttablename_columnname_idx on term t (cost=0.55..8.57 rows=1 width=74) Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text)) -> HashAggregate (cost=169627.41..169627.43 rows=2 width=2) Group Key: (f.berechnungsart)::text -> Seq Scan on facttable_stat_fta4 f (cost=0.00..145274.93 rows=9740993 width=2) which seems slightly inefficient, as ~50% of the facttable_stat_fta4 matches the condition (so building the whole hash is a waste of time). Also notice this is a regular join, not a semijoin - that seems to confirm the planner does not realize the cost difference, believes both plans (regular and semijoin) are equally expensive and simply uses the first one. With the run_cost change, I do get this plan: QUERY PLAN ----------------------------------------------------------------------- Nested Loop Semi Join (cost=111615.33..111623.42 rows=1 width=74) -> Index Scan using term_facttablename_columnname_idx on term t (cost=0.55..8.57 rows=1 width=74) Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text)) -> Bitmap Heap Scan on facttable_stat_fta4 f (cost=111614.78..220360.98 rows=4870496 width=2) Recheck Cond: ((berechnungsart)::text = (t.term)::text) -> Bitmap Index Scan on facttable_stat_fta4_berechnungsart_idx (cost=0.00..110397.15 rows=4870496 width=0) Index Cond: ((berechnungsart)::text = (t.term)::text) and this runs about ~3x faster than the original plan (700ms vs. 2000ms), at least when using the testcase dataset on my laptop. This however is not the whole story, because after disabling the bitmap index scan, I do get this plan, running in ~1ms (so ~700x faster than the bitmap index scan): QUERY PLAN ------------------------------------------------------------------------ Nested Loop Semi Join (cost=0.99..9.20 rows=1 width=74) -> Index Scan using term_facttablename_columnname_idx on term t (cost=0.55..8.57 rows=1 width=74) Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text)) -> Index Only Scan using facttable_stat_fta4_berechnungsart_idx on facttable_stat_fta4 f (cost=0.43..310525.02 rows=4870496 width=2) Index Cond: (berechnungsart = (t.term)::text) Notice the cost - it's way lover than the previous plan (9.2 vs ~111k), yet this plan was not chosen. So either the change broke something (e.g. by violating some optimizer assumption), or maybe there's a bug somewhere else ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: