Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 20200324010838.hbv5gzhexcv5cilg@development
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers
On Mon, Mar 23, 2020 at 08:21:42AM +0000, Dean Rasheed wrote:
>On Sat, 21 Mar 2020 at 21:59, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Ah, right. Yeah, I think that should work. I thought there would be some
>> volatility due to groups randomly not making it into the MCV list, but
>> you're right it's possible to construct the data in a way to make it
>> perfectly deterministic. So that's what I've done in the attached patch.
>>
>
>Looking through those new tests, another issue occurred to me -- under
>some circumstances this patch can lead to extended stats being applied
>twice to the same clause, which is not great, because it involves
>quite a lot of extra work, and also because it can lead to
>overestimates because of the way that MCV stats are applied as a delta
>correction to simple_sel.
>
>The way this comes about is as follows -- if we start with an OR
>clause, that will be passed as a single-item implicitly AND'ed list to
>clauselist_selectivity(), and from there to
>statext_clauselist_selectivity() and then to
>statext_mcv_clauselist_selectivity(). This will call
>clauselist_selectivity_simple() to get simple_sel, before calling
>mcv_clauselist_selectivity(), which will recursively compute all the
>MCV corrections. However, the call to clauselist_selectivity_simple()
>will call clause_selectivity() for each clause (just a single OR
>clause in this example), which will now call
>clauselist_selectivity_or(), which will go back into
>statext_clauselist_selectivity() with "is_or = true", which will apply
>the MCV corrections to the same set of clauses that the outer call was
>about to process.
>

Hmmm. So let's consider a simple OR clause with two arguments, both
covered by single statistics object. Something like this:

   CREATE TABLE t (a int, b int);
   INSERT INTO t SELECT mod(i, 10), mod(i, 10)
     FROM generate_series(1,100000);
   CREATE STATISTICS s (mcv) ON a,b FROM t;

   SELECT * FROM t WHERE a = 0 OR b = 0;

Which is estimated correctly, but the calls graph looks like this:

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity  <---
         clauselist_selectivity_simple
           clause_selectivity
             clauselist_selectivity_or
               statext_clauselist_selectivity
                 statext_mcv_clauselist_selectivity  <---
                   clauselist_selectivity_simple_or
                     clause_selectivity
                     clause_selectivity
                   mcv_clauselist_selectivity
               clauselist_selectivity_simple_or
         mcv_clauselist_selectivity
       clauselist_selectivity_simple
         (already estimated)

IIUC the problem you have in mind is that we end up calling
statext_mcv_clauselist_selectivity twice, once for the top-level AND
clause with a single argument, and then recursively for the expanded OR
clause. Indeed, that seems to be applying the correction twice.


>I'm not sure what's the best way to resolve that. Perhaps
>statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>should ignore OR clauses from an AND-list, on the basis that they will
>get processed recursively later. Or perhaps estimatedclauses can
>somehow be used to prevent this, though I'm not sure exactly how that
>would work.

I don't know. I feel uneasy about just ignoring some of the clauses,
because what happens for complex clauses, where the OR is not directly
in the AND clause, but is negated or something like that?

Isn't it the case that clauselist_selectivity_simple (and the OR
variant) should ignore extended stats entirely? That is, we'd need to
add a flag (or _simple variant) to clause_selectivity, so that it calls
causelist_selectivity_simple_or. So the calls would look like this:

   clauselist_selectivity
     statext_clauselist_selectivity
       statext_mcv_clauselist_selectivity
         clauselist_selectivity_simple  <--- disable extended stats
           clause_selectivity
             clauselist_selectivity_simple_or
               clause_selectivity
               clause_selectivity
         mcv_clauselist_selectivity
     clauselist_selectivity_simple
       already estimated

I've only quickly hacked clause_selectivity, but it does not seems very
invasive (of course, it means disruption to clause_selectivity callers,
but I suppose most call clauselist_selectivity).

BTW do you have an example where this would actually cause an issue?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Define variables in the approprieate scope
Next
From: Andres Freund
Date:
Subject: Re: pgsql: Improve autovacuum logging for aggressive andanti-wraparound ru