Re: Planning performance problem (67626.278ms) - Mailing list pgsql-performance
From: Manuel Weitzman
Subject: Re: Planning performance problem (67626.278ms)
Msg-id: E4AB7AC2-596B-4B60-A5D4-D4BD48F9DE55@gmail.com
In response to: Re: Planning performance problem (67626.278ms) (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Planning performance problem (67626.278ms)
           Re: Planning performance problem (67626.278ms)
List: pgsql-performance
> 1. create_join_clause doesn't trouble to look for commuted
> equivalents, which perhaps is penny-wise and pound-foolish.
> The cost of re-deriving selectivity estimates could be way
> more than the cost of checking this.

Agreed.

> 2. Although these look like they ought to be equivalent to the
> original clauses (modulo commutation, for some of them), they don't
> look that way to create_join_clause, because it's also checking
> for parent_ec equality.  Per the comment,
>
> * parent_ec is either equal to ec (if the clause is a potentially-redundant
> * join clause) or NULL (if not).  We have to treat this as part of the
> * match requirements --- it's possible that a clause comparing the same two
> * EMs is a join clause in one join path and a restriction clause in another.
>
> It might be worth digging into the git history to see why that
> became a thing and then considering whether there's a way around it.
> (I'm pretty sure that comment is mine, but I don't recall the details
> anymore.)

To me that sounds OK. I cannot prove that they're equivalent to the
original clauses, so I think it is fine to assume they're not (not an
expert here, quite the opposite).

> Anyway, it's certainly not the case that we're making new
> RestrictInfos for every pair of rels.  It looks that way in this
> example because the join vars all belong to the same EC, but
> that typically wouldn't be the case in more complex queries.

Good to know, this wasn't clear to me.

> So we could look into whether this code can be improved to share
> RestrictInfos across more cases.  Another thought is that even
> if we need to keep original and derived clauses separate, maybe it'd
> be all right to copy previously-determined cached selectivity values
> from an original clause to an otherwise-identical derived clause
> (cf. commute_restrictinfo()).  I'm not sure though whether it's
> reliably the case that we'd have filled in selectivities for the
> original clauses before this code wants to clone them.

To be honest, even though that sounds like a good idea to dig into, I
think it wouldn't completely solve the problem of repeated calls to
get_actual_variable_range(). The example query I gave does a lot of
simple auto-joins, which keeps the thought process simple, but I worry
more about the more "common" case in which more than two distinct
tables are involved in the query.

For example, instead of having "b1, b2, ..., bn" as aliases of "b", as
in this query

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

the increasing cost in planning buffers can also be reproduced by
adding a new join on a distinct table each time:

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
-- ... (etc)
where c.a in (1,100,10000,1000000,1000001);

I can imagine that deconstruct_jointree() and
generate_join_implied_equalities() would generate multiple
RestrictInfos, many of which would involve a constraint on a.a (each
one against a different table):

b.a = a.a
c.a = a.a
c.a = b.a
a.a = b.a
a.a = c.a
... (etc)

(If we wanted, we could also add a different WHERE clause on each of
the tables involved to make really sure all the RestrictInfos are
different; a sketch of that variant follows below.)
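For illustration only, this is roughly what that variant could look
like (assuming a, b and c each expose the same single integer column a
as in the examples above; the constants are arbitrary), so that every
table also carries its own restriction clause:

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
where a.a in (1,100,10000,1000000,1000001)
  and b.a in (2,200,20000)
  and c.a in (3,300,30000);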
For each of these RestrictInfos there *could* be one cache miss on
cached_scansel() that *could* force the planner to compute
get_actual_variable_range() for the same variable (a.a) over and over,
as mergejoinscansel() always computes the selectivity for the intervals
that require actual extremal values. In practice, this re-computing of
the variable range seems to happen a lot.

One way I can see to share this kind of information (the extremal
values) across RestrictInfos is to store the known variable ranges in
PlannerInfo (or in a member of that struct), which seems to be
available everywhere it would be needed.

Best regards,
Manuel
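P.S.: For anyone who wants to try the multi-table variant above,
something along these lines should be a reasonable starting point (the
schema and data volume are only a rough guess for illustration, chosen
to match the column names used in the examples):

create table a (a int primary key);
create table b (a int);
create table c (a int);
insert into a select generate_series(1, 1000000);
insert into b select generate_series(1, 1000000);
insert into c select generate_series(1, 1000000);
create index on b (a);
create index on c (a);
vacuum analyze a, b, c;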