Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: PoC/WIP: Extended statistics on expressions |
Date | |
Msg-id | CAEZATCXvarTUJthsnoSMzhQwivUqsxGcvSm_5h1L-yNapVyo-g@mail.gmail.com Whole thread Raw |
In response to | Re: PoC/WIP: Extended statistics on expressions (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: PoC/WIP: Extended statistics on expressions
|
List | pgsql-hackers |
On Fri, 22 Jan 2021 at 03:49, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Whooops. A fixed version attached. > The change to pg_stats_ext_exprs isn't quite right, because now it cross joins expressions and their stats, which leads to too many rows, with the wrong stats being listed against expressions. For example: CREATE TABLE foo (a int, b text); INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000); CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo; ANALYSE foo; SELECT tablename, statistics_name, expr, most_common_vals FROM pg_stats_ext_exprs; tablename | statistics_name | expr | most_common_vals -----------+-----------------+----------+------------------ foo | foo_s | (a * 10) | {10} foo | foo_s | (a * 10) | {XXX} foo | foo_s | upper(b) | {10} foo | foo_s | upper(b) | {XXX} (4 rows) More protection is still required for tables with no analysable columns. For example: CREATE TABLE foo(); CREATE STATISTICS foo_s ON (1) FROM foo; INSERT INTO foo SELECT FROM generate_series(1,1000); ANALYSE foo; Program received signal SIGSEGV, Segmentation fault. 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664 664 stats[i]->tupDesc = vacatts[0]->tupDesc; #0 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664 #1 0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598, totalrows=1000, numrows=100, rows=0x216d040, natts=0, vacattrstats=0x216cb40) at extended_stats.c:161 #2 0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598, params=0x7ffc06f7d450, va_cols=0x0, acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false, in_outer_xact=false, elevel=13) at analyze.c:595 Attached is an incremental update fixing those issues, together with a few more suggested improvements: There was quite a bit of code duplication in extended_stats.c which I attempted to reduce by 1). Deleting examine_opclause_expression() in favour of examine_clause_args(). 2). Deleting examine_opclause_expression2() in favour of examine_clause_args2(). 3). Merging examine_clause_args() and examine_clause_args2(), renaming it examine_opclause_args() (which was actually the name it had in its original doc comment, despite the name in the code being different). 4). Merging statext_extract_expression() and statext_extract_expression_internal() into statext_is_compatible_clause() and statext_is_compatible_clause_internal() respectively. That last change goes beyond just removing code duplication. It allows support for compound clauses that contain a mix of attribute and expression clauses, for example, this simple test case wasn't previously estimated well: CREATE TABLE foo (a int, b int, c int); INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x); CREATE STATISTICS foo_s on a,b,(c*c) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1); I didn't add any new regression tests, but perhaps it would be worth adding something to test a case like that. I changed choose_best_statistics() in a couple of ways. Firstly, I think it wants to only count expressions from fully covered clauses, just as we only count attributes if the stat covers all the attributes from a clause, since otherwise the stat cannot estimate the clause, so it shouldn't count. Secondly, I think the number of expressions in the stat needs to be added to it's number of keys, so that the choice of narrowest stat with the same number of matches counts expressions in the same way as attributes. I simplified the code in statext_mcv_clauselist_selectivity(), by attempting to handle expressions and attributes together in the same way, making it much closer to the original code. I don't think that the check for the existence of a stat covering all the expressions in a clause was necessary when pre-processing the list of clauses, since that's checked later on, so it's enough to just detect compatible clauses. Also, it now checks for stats that cover both the attributes and the expressions from each clause, rather than one or the other, to cope with examples like the one above. I also updated the check for simple_clauses -- what's wanted there is to identify clauses that only reference a single column or a single expression, so that the later code doesn't apply multi-column estimates to it. I'm attaching it as a incremental patch (0004) on top of your patches, but if 0003 and 0004 are collapsed together, the total number of diffs is less than 0003 alone. Regards, Dean
Attachment
pgsql-hackers by date: