Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id CAEZATCXvarTUJthsnoSMzhQwivUqsxGcvSm_5h1L-yNapVyo-g@mail.gmail.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers
On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Whooops. A fixed version attached.
>

The change to pg_stats_ext_exprs isn't quite right, because now it
cross joins expressions and their stats, which leads to too many rows,
with the wrong stats being listed against expressions. For example:

CREATE TABLE foo (a int, b text);
INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
ANALYSE foo;

SELECT tablename, statistics_name, expr, most_common_vals
  FROM pg_stats_ext_exprs;

 tablename | statistics_name |   expr   | most_common_vals
-----------+-----------------+----------+------------------
 foo       | foo_s           | (a * 10) | {10}
 foo       | foo_s           | (a * 10) | {XXX}
 foo       | foo_s           | upper(b) | {10}
 foo       | foo_s           | upper(b) | {XXX}
(4 rows)


More protection is still required for tables with no analysable
columns. For example:

CREATE TABLE foo();
CREATE STATISTICS foo_s ON (1) FROM foo;
INSERT INTO foo SELECT FROM generate_series(1,1000);
ANALYSE foo;

Program received signal SIGSEGV, Segmentation fault.
0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
    exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
664            stats[i]->tupDesc = vacatts[0]->tupDesc;

#0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
    attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
    at extended_stats.c:664
#1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
    totalrows=1000, numrows=100, rows=0x216d040, natts=0,
    vacattrstats=0x216cb40) at extended_stats.c:161
#2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
    params=0x7ffc06f7d450, va_cols=0x0,
    acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
    in_outer_xact=false, elevel=13) at analyze.c:595


Attached is an incremental update fixing those issues, together with a
few more suggested improvements:

There was quite a bit of code duplication in extended_stats.c which I
attempted to reduce by

1). Deleting examine_opclause_expression() in favour of examine_clause_args().
2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
3). Merging examine_clause_args() and examine_clause_args2(), renaming
it examine_opclause_args() (which was actually the name it had in its
original doc comment, despite the name in the code being different).
4). Merging statext_extract_expression() and
statext_extract_expression_internal() into
statext_is_compatible_clause() and
statext_is_compatible_clause_internal() respectively.

That last change goes beyond just removing code duplication. It allows
support for compound clauses that contain a mix of attribute and
expression clauses, for example, this simple test case wasn't
previously estimated well:

CREATE TABLE foo (a int, b int, c int);
INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);

I didn't add any new regression tests, but perhaps it would be worth
adding something to test a case like that.

I changed choose_best_statistics() in a couple of ways. Firstly, I
think it wants to only count expressions from fully covered clauses,
just as we only count attributes if the stat covers all the attributes
from a clause, since otherwise the stat cannot estimate the clause, so
it shouldn't count. Secondly, I think the number of expressions in the
stat needs to be added to it's number of keys, so that the choice of
narrowest stat with the same number of matches counts expressions in
the same way as attributes.

I simplified the code in statext_mcv_clauselist_selectivity(), by
attempting to handle expressions and attributes together in the same
way, making it much closer to the original code. I don't think that
the check for the existence of a stat covering all the expressions in
a clause was necessary when pre-processing the list of clauses, since
that's checked later on, so it's enough to just detect compatible
clauses. Also, it now checks for stats that cover both the attributes
and the expressions from each clause, rather than one or the other, to
cope with examples like the one above. I also updated the check for
simple_clauses -- what's wanted there is to identify clauses that only
reference a single column or a single expression, so that the later
code doesn't apply multi-column estimates to it.

I'm attaching it as a incremental patch (0004) on top of your patches,
but if 0003 and 0004 are collapsed together, the total number of diffs
is less than 0003 alone.

Regards,
Dean

Attachment

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: Wrong usage of RelationNeedsWAL
Next
From: Heikki Linnakangas
Date:
Subject: Re: Protect syscache from bloating with negative cache entries