Re: nested loop semijoin estimates - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: nested loop semijoin estimates |
Date | |
Msg-id | 5569F09C.7090204@2ndquadrant.com Whole thread Raw |
In response to | Re: nested loop semijoin estimates (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
List | pgsql-hackers |
On 05/30/15 03:52, Tomas Vondra wrote: > > > On 05/30/15 01:20, Tomas Vondra wrote: >> >> Notice the cost - it's way lover than the previous plan (9.2 vs >> ~111k), yet this plan was not chosen. So either the change broke >> something (e.g. by violating some optimizer assumption), or maybe >> there's a bug somewhere else ... > > After a bit more investigation, what I think is happening here is > add_path() does not realize this is a SEMI join and a single tuple is > enough, and discards the simple indexscan path in favor of the bitmap > index scan as that seems cheaper when scanning everything. > > So not really a bug, but maybe 'consider_startup' would help? A bit more info - the reason why the IndexPath gets discarded (in favor of BitmapHeapScan) seems to be twofold: (a) Semijoin does no imply 'consider_startup' for the inner path, so it gets discarded by the BitmapHeapScan, as it has lower total cost. I'm not sure where exactly the consider_startup should be set, so I've forced that at the beginning on add_path() to see what happens. Sadly, this alone is not sufficient, because of the next point. (b) To actually use startup costs in compare_path_costs_fuzzily, there is an additional condition about handling parameterized paths, as explained in this comment: * This function also includes special hacks to support a policy * enforced by its sole caller, add_path(): paths that have any * parameterization cannot win comparisons on the grounds of having * cheaper startup cost, since we deem only total cost to be of * interest for a parameterized path. (Unparameterized paths are * more common, so we check for this case last.) so the startup_cost comparisons look like this: if (consider_startup && path2->startup_cost > path1->startup_cost * fuzz_factor && path1->param_info == NULL) Of course, path1 (IndexPath) actually has parameters, which makes the whole condition false, and the IndexPath gets discarded :-( ISTM the reported test case seems like a counter example to the policy, so maybe this should really be relaxed somehow. To see what happens, I tweaked both (a) and (b) by forcing consider_startup = true at the beginning of add_path(), and removing the (param_info==NULL) conditions (patch attached). It also includes the run_cost tweak mentioned in the first message. I'm not claiming this is the correct fix, the only goal was to see what happens. And it really does produce the 'right' plan: QUERY PLAN ----------------------------------------------------------------------- Nested Loop Semi Join (cost=0.99..9.20 rows=1 width=74) -> Index Scan using term_facttablename_columnname_idx on term t (cost=0.55..8.57 rows=1 width=74) Index Cond: (((facttablename)::text = 'facttable_stat_fta4'::text) AND ((columnname)::text = 'berechnungsart'::text)) -> Index Only Scan using facttable_stat_fta4_berechnungsart_idx on facttable_stat_fta4 f (cost=0.43..310525.02 rows=4870496 width=2) Index Cond: (berechnungsart = (t.term)::text) (5 rows) If I get the comments in pathnode.c about the startup_cost policy, it's "only" an optimization to reduce the number of parameterized paths, so I believe this does not break anything. Also, it's no fun if the planning is a tad faster, but the query runtime is 1000x higher. Ideas how to relax the policy without significant impact on optimizer? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: