Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 2147e075-fef6-b1d5-e5f5-7f92ffefe827@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers

On 1/27/21 12:02 PM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 03:49, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Whooops. A fixed version attached.
>>
> 
> The change to pg_stats_ext_exprs isn't quite right, because now it
> cross joins expressions and their stats, which leads to too many rows,
> with the wrong stats being listed against expressions. For example:
> 
> CREATE TABLE foo (a int, b text);
> INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
> CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
> ANALYSE foo;
> 
> SELECT tablename, statistics_name, expr, most_common_vals
>   FROM pg_stats_ext_exprs;
> 
>  tablename | statistics_name |   expr   | most_common_vals
> -----------+-----------------+----------+------------------
>  foo       | foo_s           | (a * 10) | {10}
>  foo       | foo_s           | (a * 10) | {XXX}
>  foo       | foo_s           | upper(b) | {10}
>  foo       | foo_s           | upper(b) | {XXX}
> (4 rows)
> 
> 
> More protection is still required for tables with no analysable
> columns. For example:
> 
> CREATE TABLE foo();
> CREATE STATISTICS foo_s ON (1) FROM foo;
> INSERT INTO foo SELECT FROM generate_series(1,1000);
> ANALYSE foo;
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
>     exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
> 664            stats[i]->tupDesc = vacatts[0]->tupDesc;
> 
> #0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
>     attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
>     at extended_stats.c:664
> #1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
>     totalrows=1000, numrows=100, rows=0x216d040, natts=0,
>     vacattrstats=0x216cb40) at extended_stats.c:161
> #2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
>     params=0x7ffc06f7d450, va_cols=0x0,
>     acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
>     in_outer_xact=false, elevel=13) at analyze.c:595
> 
> 
> Attached is an incremental update fixing those issues, together with a
> few more suggested improvements:
> 
> There was quite a bit of code duplication in extended_stats.c which I
> attempted to reduce by
> 
> 1). Deleting examine_opclause_expression() in favour of examine_clause_args().
> 2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
> 3). Merging examine_clause_args() and examine_clause_args2(), renaming
> it examine_opclause_args() (which was actually the name it had in its
> original doc comment, despite the name in the code being different).
> 4). Merging statext_extract_expression() and
> statext_extract_expression_internal() into
> statext_is_compatible_clause() and
> statext_is_compatible_clause_internal() respectively.
> 
> That last change goes beyond just removing code duplication. It allows
> support for compound clauses that contain a mix of attribute and
> expression clauses, for example, this simple test case wasn't
> previously estimated well:
> 
> CREATE TABLE foo (a int, b int, c int);
> INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
> CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);
> 
> I didn't add any new regression tests, but perhaps it would be worth
> adding something to test a case like that.
> 
> I changed choose_best_statistics() in a couple of ways. Firstly, I
> think it wants to only count expressions from fully covered clauses,
> just as we only count attributes if the stat covers all the attributes
> from a clause, since otherwise the stat cannot estimate the clause, so
> it shouldn't count. Secondly, I think the number of expressions in the
> stat needs to be added to it's number of keys, so that the choice of
> narrowest stat with the same number of matches counts expressions in
> the same way as attributes.
> 
> I simplified the code in statext_mcv_clauselist_selectivity(), by
> attempting to handle expressions and attributes together in the same
> way, making it much closer to the original code. I don't think that
> the check for the existence of a stat covering all the expressions in
> a clause was necessary when pre-processing the list of clauses, since
> that's checked later on, so it's enough to just detect compatible
> clauses. Also, it now checks for stats that cover both the attributes
> and the expressions from each clause, rather than one or the other, to
> cope with examples like the one above. I also updated the check for
> simple_clauses -- what's wanted there is to identify clauses that only
> reference a single column or a single expression, so that the later
> code doesn't apply multi-column estimates to it.
> 
> I'm attaching it as a incremental patch (0004) on top of your patches,
> but if 0003 and 0004 are collapsed together, the total number of diffs
> is less than 0003 alone.
> 

Thanks. All of this seems like a clear improvement, both removing the
duplicate copy-pasted code, and fixing the handling of the cases that
mix plain variables and expressions. FWIW I agree there should be a
regression test for this, so I'll add one.

I think the main remaining issue is how we handle the expressions in
bitmapsets. I've been experimenting with this a bit, but shifting the
regular attnums and stashing expressions before them seems quite
complex, especially when we don't know how many expressions there are
(e.g. when merging functional dependencies). It's true using attnums
above MaxHeapAttributeNumber for expressions wastes ~200B, but is that
really an issue, considering it's very short-lived allocation?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Isaac Morland
Date:
Subject: Re: TRIM_ARRAY
Next
From: Pavel Stehule
Date:
Subject: Re: proposal: possibility to read dumped table's name from file