Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 20200324143306.rz5siohm4jsoveqm@development
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers
On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote:
>On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hmmm. So let's consider a simple OR clause with two arguments, both
>> covered by single statistics object. Something like this:
>>
>>    CREATE TABLE t (a int, b int);
>>    INSERT INTO t SELECT mod(i, 10), mod(i, 10)
>>      FROM generate_series(1,100000);
>>    CREATE STATISTICS s (mcv) ON a,b FROM t;
>>
>>    SELECT * FROM t WHERE a = 0 OR b = 0;
>>
>> Which is estimated correctly...
>>
>
>Hmm, the reason that is estimated correctly is that the MCV values
>cover all values in the table, so mcv_totalsel is 1 (or pretty close
>to 1), and other_sel is capped to nearly 0, and so the result is
>basically just mcv_sel -- i.e., in this case the MCV estimates are
>returned more-or-less as-is, rather than being applied as a
>correction, and so the result is independent of how many times
>extended stats are applied.
>
>The more interesting (and probably more realistic) case is where the
>MCV values do not cover the all values in the table, as in the new
>mcv_lists_partial examples in the regression tests, for example this
>test case, which produces a significant overestimate:
>
>  SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0
>
>although actually, I think there's another reason for that (in
>addition to the extended stats correction being applied twice) -- the
>way the MCV code makes use of base selectivity doesn't seem really
>appropriate for OR clauses, because the way base_frequency is computed
>is really based on the assumption that every column would be matched.
>I'm not sure that there's any easy answer to that though. I feels like
>what's needed when computing the match bitmaps in mcv.c, is to produce
>a bitmap (would it fit in 1 byte?) per value, to indicate which
>columns match, and base_frequency values per column. That would be
>significantly more work though, so almost certainly isn't feasible for
>PG13.
>

Good point. I haven't thought about the base frequencies. I think 1 byte
should be enough, as we limit the number of columns to 8.

>> IIUC the problem you have in mind is that we end up calling
>> statext_mcv_clauselist_selectivity twice, once for the top-level AND
>> clause with a single argument, and then recursively for the expanded OR
>> clause. Indeed, that seems to be applying the correction twice.
>>
>>
>> >I'm not sure what's the best way to resolve that. Perhaps
>> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause()
>> >should ignore OR clauses from an AND-list, on the basis that they will
>> >get processed recursively later. Or perhaps estimatedclauses can
>> >somehow be used to prevent this, though I'm not sure exactly how that
>> >would work.
>>
>> I don't know. I feel uneasy about just ignoring some of the clauses,
>> because what happens for complex clauses, where the OR is not directly
>> in the AND clause, but is negated or something like that?
>>
>> Isn't it the case that clauselist_selectivity_simple (and the OR
>> variant) should ignore extended stats entirely? That is, we'd need to
>> add a flag (or _simple variant) to clause_selectivity, so that it calls
>> causelist_selectivity_simple_or. So the calls would look like this:
>>
>>    clauselist_selectivity
>>      statext_clauselist_selectivity
>>        statext_mcv_clauselist_selectivity
>>          clauselist_selectivity_simple  <--- disable extended stats
>>            clause_selectivity
>>              clauselist_selectivity_simple_or
>>                clause_selectivity
>>                clause_selectivity
>>          mcv_clauselist_selectivity
>>      clauselist_selectivity_simple
>>        already estimated
>>
>> I've only quickly hacked clause_selectivity, but it does not seems very
>> invasive (of course, it means disruption to clause_selectivity callers,
>> but I suppose most call clauselist_selectivity).
>>
>
>Sounds like a reasonable approach, but I think it would be better to
>preserve the current public API by having clauselist_selectivity()
>become a thin wrapper around  a new function that optionally applies
>extended stats.
>

OK, makes sense. I'll take a stab at it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: documentation pdf build fail (HEAD)
Next
From: Juan José Santamaría Flecha
Date:
Subject: Re: color by default