
From Manuel Weitzman
Subject Re: Planning performance problem (67626.278ms)
Date
Msg-id E4AB7AC2-596B-4B60-A5D4-D4BD48F9DE55@gmail.com
In response to Re: Planning performance problem (67626.278ms)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
> 1. create_join_clause doesn't trouble to look for commuted
> equivalents, which perhaps is penny-wise and pound-foolish.
> The cost of re-deriving selectivity estimates could be way
> more than the cost of checking this.

Agreed.

> 2. Although these look like they ought to be equivalent to the
> original clauses (modulo commutation, for some of them), they don't
> look that way to create_join_clause, because it's also checking
> for parent_ec equality.  Per the comment,
>
> * parent_ec is either equal to ec (if the clause is a potentially-redundant
> * join clause) or NULL (if not).  We have to treat this as part of the
> * match requirements --- it's possible that a clause comparing the same two
> * EMs is a join clause in one join path and a restriction clause in another.
>
> It might be worth digging into the git history to see why that
> became a thing and then considering whether there's a way around it.
> (I'm pretty sure that comment is mine, but I don't recall the details
> anymore.)

That sounds OK to me. I cannot prove that they're equivalent to the
original clauses, so I think it is fine to assume they're not (I'm not
an expert here, quite the opposite).

> Anyway, it's certainly not the case that we're making new
> RestrictInfos for every pair of rels.  It looks that way in this
> example because the join vars all belong to the same EC, but
> that typically wouldn't be the case in more complex queries.

Good to know; this wasn't clear to me.

> So we could look into whether this code can be improved to share
> RestrictInfos across more cases.  Another thought is that even
> if we need to keep original and derived clauses separate, maybe it'd
> be all right to copy previously-determined cached selectivity values
> from an original clause to an otherwise-identical derived clause
> (cf. commute_restrictinfo()).  I'm not sure though whether it's
> reliably the case that we'd have filled in selectivities for the
> original clauses before this code wants to clone them.

To be honest, even if that sounds like a good idea to dig into, I think
it wouldn't completely solve the problem of repeated calls to
get_actual_variable_range().

The example query I gave does a lot of simple self-joins, which keeps
the thought process simple, but I worry more about the more "common"
case in which more than two distinct tables are involved in the query.

For example, instead of having "b1, b2, ..., bn" as aliases of "b" in
this query

>> explain (analyze, buffers)
>> select * from a
>> join b b1 on (b1.a = a.a)
>> join b b2 on (b2.a = a.a)
>> where b1.a in (1,100,10000,1000000,1000001);

it is also possible to reproduce the increasing cost in planning
buffers as each new join on a distinct table is added:

explain (analyze, buffers)
select * from a
join b on (b.a = a.a)
join c on (c.a = a.a)
-- ... (etc)
where c.a in (1,100,10000,1000000,1000001);

I can imagine that deconstruct_jointree() and
generate_join_implied_equalities() would generate multiple
RestrictInfos, many of which would involve a constraint on a.a (each
against a different table):

b.a = a.a
c.a = a.a
c.a = b.a
a.a = b.a
a.a = c.a
... (etc)

(If we wanted, we could also add a different WHERE clause on each of
the tables involved, to make really sure that all the RestrictInfos
are different.)

For each of these RestrictInfos there *could* be one cache miss on
cached_scansel() that *could* force the planner to compute
get_actual_variable_range() for the same variable (a.a) over and over,
as mergejoinscansel() always computes the selectivity for the
intervals that require the actual extremal values. In practice this
re-computation of the variable range seems to happen a lot.

One way I can see to share this kind of information (the extremal
values) across RestrictInfos would be to store the known variable
ranges in PlannerInfo (or within one of its members), which seems to
be available everywhere it would be needed.
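
To make that idea a bit more concrete, here is a small, standalone C
sketch of the kind of memoization I have in mind. To be clear, this is
not actual PostgreSQL code: RangeCache, get_variable_range_cached()
and the plain (relid, attnum) key are invented for illustration, and
the real get_actual_variable_range() probes the index rather than
returning constants. The point is only that once one clause has paid
for the index probe, every other clause over the same variable could
reuse the result.

#include <stdbool.h>
#include <stdio.h>

#define MAX_CACHED_RANGES 64

typedef struct CachedRange
{
    int     relid;          /* relation the variable belongs to */
    int     attnum;         /* attribute within that relation */
    double  min;            /* cached extremal values */
    double  max;
} CachedRange;

typedef struct RangeCache
{
    CachedRange entries[MAX_CACHED_RANGES];
    int         nentries;
} RangeCache;

/* Stand-in for the expensive index probe get_actual_variable_range() does */
static void
probe_actual_range(int relid, int attnum, double *min, double *max)
{
    printf("probing index for rel %d, att %d\n", relid, attnum);
    *min = 1.0;
    *max = 1000001.0;
}

/*
 * Return the range of (relid, attnum), probing the index only the first
 * time and answering later requests from the shared cache.
 */
static void
get_variable_range_cached(RangeCache *cache, int relid, int attnum,
                          double *min, double *max)
{
    CachedRange *e;
    int          i;

    for (i = 0; i < cache->nentries; i++)
    {
        e = &cache->entries[i];
        if (e->relid == relid && e->attnum == attnum)
        {
            *min = e->min;
            *max = e->max;
            return;             /* cache hit, no index probe needed */
        }
    }

    probe_actual_range(relid, attnum, min, max);

    if (cache->nentries < MAX_CACHED_RANGES)
    {
        e = &cache->entries[cache->nentries++];
        e->relid = relid;
        e->attnum = attnum;
        e->min = *min;
        e->max = *max;
    }
}

int
main(void)
{
    RangeCache  cache = {0};    /* in the planner this would live in PlannerInfo */
    double      min, max;

    /*
     * Several derived clauses asking about the same variable (a.a):
     * the index is probed once, the rest are served from the cache.
     */
    get_variable_range_cached(&cache, 1, 1, &min, &max);
    get_variable_range_cached(&cache, 1, 1, &min, &max);
    get_variable_range_cached(&cache, 1, 1, &min, &max);

    printf("range of a.a: [%g, %g]\n", min, max);
    return 0;
}

In the planner proper the cache would presumably be keyed by something
like the Var or EquivalenceMember involved rather than plain ints, but
the lookup-then-fill shape would be the same.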


Best regards,
Manuel



