Re: Additional improvements to extended statistics - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Additional improvements to extended statistics |
Date | |
Msg-id | 20200324143306.rz5siohm4jsoveqm@development Whole thread Raw |
In response to | Re: Additional improvements to extended statistics (Dean Rasheed <dean.a.rasheed@gmail.com>) |
Responses |
Re: Additional improvements to extended statistics
|
List | pgsql-hackers |
On Tue, Mar 24, 2020 at 01:20:07PM +0000, Dean Rasheed wrote: >On Tue, 24 Mar 2020 at 01:08, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> Hmmm. So let's consider a simple OR clause with two arguments, both >> covered by single statistics object. Something like this: >> >> CREATE TABLE t (a int, b int); >> INSERT INTO t SELECT mod(i, 10), mod(i, 10) >> FROM generate_series(1,100000); >> CREATE STATISTICS s (mcv) ON a,b FROM t; >> >> SELECT * FROM t WHERE a = 0 OR b = 0; >> >> Which is estimated correctly... >> > >Hmm, the reason that is estimated correctly is that the MCV values >cover all values in the table, so mcv_totalsel is 1 (or pretty close >to 1), and other_sel is capped to nearly 0, and so the result is >basically just mcv_sel -- i.e., in this case the MCV estimates are >returned more-or-less as-is, rather than being applied as a >correction, and so the result is independent of how many times >extended stats are applied. > >The more interesting (and probably more realistic) case is where the >MCV values do not cover the all values in the table, as in the new >mcv_lists_partial examples in the regression tests, for example this >test case, which produces a significant overestimate: > > SELECT * FROM mcv_lists_partial WHERE a = 0 OR b = 0 OR c = 0 > >although actually, I think there's another reason for that (in >addition to the extended stats correction being applied twice) -- the >way the MCV code makes use of base selectivity doesn't seem really >appropriate for OR clauses, because the way base_frequency is computed >is really based on the assumption that every column would be matched. >I'm not sure that there's any easy answer to that though. I feels like >what's needed when computing the match bitmaps in mcv.c, is to produce >a bitmap (would it fit in 1 byte?) per value, to indicate which >columns match, and base_frequency values per column. That would be >significantly more work though, so almost certainly isn't feasible for >PG13. > Good point. I haven't thought about the base frequencies. I think 1 byte should be enough, as we limit the number of columns to 8. >> IIUC the problem you have in mind is that we end up calling >> statext_mcv_clauselist_selectivity twice, once for the top-level AND >> clause with a single argument, and then recursively for the expanded OR >> clause. Indeed, that seems to be applying the correction twice. >> >> >> >I'm not sure what's the best way to resolve that. Perhaps >> >statext_mcv_clauselist_selectivity() / statext_is_compatible_clause() >> >should ignore OR clauses from an AND-list, on the basis that they will >> >get processed recursively later. Or perhaps estimatedclauses can >> >somehow be used to prevent this, though I'm not sure exactly how that >> >would work. >> >> I don't know. I feel uneasy about just ignoring some of the clauses, >> because what happens for complex clauses, where the OR is not directly >> in the AND clause, but is negated or something like that? >> >> Isn't it the case that clauselist_selectivity_simple (and the OR >> variant) should ignore extended stats entirely? That is, we'd need to >> add a flag (or _simple variant) to clause_selectivity, so that it calls >> causelist_selectivity_simple_or. So the calls would look like this: >> >> clauselist_selectivity >> statext_clauselist_selectivity >> statext_mcv_clauselist_selectivity >> clauselist_selectivity_simple <--- disable extended stats >> clause_selectivity >> clauselist_selectivity_simple_or >> clause_selectivity >> clause_selectivity >> mcv_clauselist_selectivity >> clauselist_selectivity_simple >> already estimated >> >> I've only quickly hacked clause_selectivity, but it does not seems very >> invasive (of course, it means disruption to clause_selectivity callers, >> but I suppose most call clauselist_selectivity). >> > >Sounds like a reasonable approach, but I think it would be better to >preserve the current public API by having clauselist_selectivity() >become a thin wrapper around a new function that optionally applies >extended stats. > OK, makes sense. I'll take a stab at it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: