Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation - Mailing list pgsql-hackers

From: Ashutosh Bapat
Subject: Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation
Date:
Msg-id: CAFjFpRcex4Q4uSGtPMRkpxidg1QfSj7fChmGLyBLjj9A329uDw@mail.gmail.com
In response to: Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: [HACKERS] [PATCH] Overestimated filter cost and its mitigation (Michael Paquier <michael@paquier.xyz>)
List: pgsql-hackers
On Thu, Jul 26, 2018 at 10:30 PM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 26 July 2018 at 07:12, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> In the patch, clauselist_selectivity() gets called repeatedly for every
>> new qual added to the clause list. Instead, if we try to combine the
>> cost/row estimation with order_qual_clauses() or
>> clauselist_selectivity(), we might be able to do what the current patch
>> does in O(n). clauselist_selectivity() calculates the combined
>> selectivity of all the given quals in O(n). We should do something
>> similar to that in this case.
>
> Duplicating the logic of clauselist_selectivity() seems like quite a
> lot of effort to go to just for improved filter cost estimates. Note
> also that clauselist_selectivity() might get a good deal more complex
> with multivariate stats.

I am not suggesting that we duplicate the code in
clauselist_selectivity(). Instead, I am suggesting that we incorporate
costing into that function. I don't know how feasible that is.

>
> Perhaps a reasonable simplification would be to just treat the clauses
> as independent when estimating the filter cost, and so use the
> following as a "good enough" estimate for the filter cost:
>
>   cost(A) + sel(A) * cost(B) + sel(A) * sel(B) * cost(C) + ...
>
> That would probably be an improvement over what we currently have. It
> would be O(n) to compute, and it would probably use the cached
> selectivity estimates for the clauses.

That looks like a good trade-off. But looking at the following comment
in clause_selectivity(), I doubt that this will work in all cases:
        /*
         * If possible, cache the result of the selectivity calculation for
         * the clause.  We can cache if varRelid is zero or the clause
         * contains only vars of that relid --- otherwise varRelid will affect
         * the result, so mustn't cache.  Outer join quals might be examined
         * with either their join's actual jointype or JOIN_INNER, so we need
         * two cache variables to remember both cases.  Note: we assume the
         * result won't change if we are switching the input relations or
         * considering a unique-ified case, so we only need one cache variable
         * for all non-JOIN_INNER cases.
         */


>
> Note also that with this simplified expression for the filter cost, it
> would be possible to improve order_qual_clauses() to take into account
> the clause selectivities when sorting the clauses, and minimise the
> cost of the filter by evaluating more selective clauses sooner, if
> they're not too expensive. I believe that amounts to sorting by
> cost/(1-sel) rather than just cost, except for clauses with sel==1,
> which it makes sense to move to the end and sort just by cost.

+1, if we could do that.
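For illustration, a rough standalone sketch of that ordering rule (a
hypothetical comparator with made-up numbers, not PostgreSQL's actual
order_qual_clauses()). Intuitively, sorting by cost/(1-sel) minimizes
the per-tuple cost paid for each tuple a clause removes:

/*
 * Hypothetical illustration of ordering quals by cost/(1-sel), with
 * sel==1 clauses moved to the end and sorted by cost alone, since
 * they never remove any rows.  Not actual PostgreSQL code.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Clause
{
    double      cost;
    double      sel;
} Clause;

static int
cmp_quals(const void *a, const void *b)
{
    const Clause *ca = (const Clause *) a;
    const Clause *cb = (const Clause *) b;
    int         a_last = (ca->sel >= 1.0);
    int         b_last = (cb->sel >= 1.0);
    double      ka;
    double      kb;

    if (a_last != b_last)
        return a_last - b_last;     /* sel==1 clauses sort to the end */

    /* cost/(1-sel) for selective clauses, plain cost for sel==1 ones */
    ka = a_last ? ca->cost : ca->cost / (1.0 - ca->sel);
    kb = b_last ? cb->cost : cb->cost / (1.0 - cb->sel);
    return (ka > kb) - (ka < kb);
}

int
main(void)
{
    Clause      quals[] = {{1.0, 0.9}, {2.0, 1.0}, {4.0, 0.5}};

    qsort(quals, 3, sizeof(Clause), cmp_quals);

    /*
     * Prints (4, 0.5), then (1, 0.9), then (2, 1.0): the clause that is
     * cheaper per row removed runs first despite its higher raw cost.
     */
    for (int i = 0; i < 3; i++)
        printf("cost=%g sel=%g\n", quals[i].cost, quals[i].sel);
    return 0;
}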

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

