Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: PoC/WIP: Extended statistics on expressions
Date:
Msg-id: 1ed534fb-b42b-f008-35ee-4eb930b0088f@enterprisedb.com
In response to: Re: PoC/WIP: Extended statistics on expressions (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: PoC/WIP: Extended statistics on expressions
List: pgsql-hackers
On 3/24/21 5:48 PM, Tomas Vondra wrote:
>
> On 3/24/21 5:28 PM, Dean Rasheed wrote:
>> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> AFAIK the primary issue here is that the two places disagree. While
>>> estimate_num_groups does this
>>>
>>>     varnos = pull_varnos(root, (Node *) varshere);
>>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>>     { ... }
>>>
>>> the add_unique_group_expr does this
>>>
>>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>>
>>> That is, one looks at the group expression, while the other looks at
>>> the Vars extracted from it by pull_var_clause(). Apparently for
>>> PlaceHolderVar this can differ, causing the crash.
>>>
>>> So we need to change one of those places - my fix tweaked the second
>>> place to also look at the Vars, but maybe we should change the other
>>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>>
>>
>> I think it doesn't make any difference which place is changed.
>>
>> This is a case of an expression with no stats. With your change,
>> you'll get a single GroupExprInfo containing a list of
>> VariableStatData's for each of its Vars, whereas if you changed it
>> the other way, you'd get a separate GroupExprInfo for each Var. But I
>> think they'd both end up being treated the same by
>> estimate_multivariate_ndistinct(), since there wouldn't be any stats
>> matching the expression, only the individual Vars. Maybe changing the
>> first place would be the more bulletproof fix, though.
>>
>
> Yeah, I think that's true. I'll do a bit more research / experiments.
>

Actually, I don't think we need that block at all - there's no point in
keeping the exact expression, because if there was a statistics object
matching it, it would have been matched by examine_variable. So if we
get here, we have to just split it into the Vars anyway, which makes the
second block entirely useless. That however means we don't need the
processing with GroupExprInfo and GroupVarInfo lists, i.e.
we can revert back to the original, simpler processing, with a bit of
extra logic to match expressions, that's all. Patch 0003 does this (it's
a bit crude, but hopefully enough to demonstrate the idea).

Here's an updated patch. 0001 should address most of today's review
items regarding comments etc. 0002 is an attempt to fix an issue I
noticed today - we need to handle type changes. Until now we did not
have problems with that, because we only had attnums - so we just reset
the statistics (with the exception of functional dependencies, on the
assumption that those remain valid). With expressions it's a bit more
complicated, though:

1) we need to transform the expressions so that the Vars contain the
   right type info etc. Otherwise an ANALYZE with the old pg_node_tree
   crashes

2) we need to reset the pg_statistic data too, which however makes
   keeping the functional dependencies a bit less useful, because those
   rely on the expression stats :-(

So I'm wondering what to do about this. I looked into how ALTER TABLE
handles indexes, and 0003 is a PoC doing the same thing for statistics.
Of course, this is a bit unfortunate because it recreates the statistics
object (so we don't keep anything, not even functional dependencies).

I think we have two options:

a) Make UpdateStatisticsForTypeChange smarter, so that it also
   transforms and updates the expression string, and resets the
   pg_statistic data.

b) Just recreate the statistics, like we do for indexes. Currently this
   does not force an ANALYZE, so it just resets all the stats. Maybe it
   should do an ANALYZE, though.

Any opinions? I need to think about this a bit more, but maybe (b) with
the ANALYZE is the right thing to do. Keeping just some of the stats
always seemed a bit weird. (This is why the 0002 patch breaks one of the
regression tests.)

BTW I wonder how useful the updated statistics actually are.
Consider this example:

========================================================================
CREATE TABLE t (a int, b int, c int);

INSERT INTO t SELECT mod(i,10), mod(i,10), mod(i,10)
  FROM generate_series(1,1000000) s(i);

CREATE STATISTICS s (ndistinct) ON (a+b), (b+c) FROM t;

ANALYZE t;

EXPLAIN ANALYZE SELECT 1 FROM t GROUP BY (a+b), (b+c);

test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), ((b + c)) FROM t

test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
   Group Key: (a + b), (b + c)
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
(3 rows)
========================================================================

Great. Now let's change one of the data types to something else:

========================================================================
test=# alter table t alter column c type numeric;
ALTER TABLE
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | numeric |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), (((b)::numeric + c)) FROM t

test=# analyze t;
ANALYZE
test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27906.17 rows=10 width=40)
   Group Key: (a + b), ((b)::numeric + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=36)
(3 rows)
========================================================================

Great!
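As an aside, the rows=10 estimate above can be sanity-checked outside the database. The columns are perfectly correlated, so the actual number of groups is 10, while an independence-based estimate (what you'd get without a statistic covering both expressions) would multiply the per-expression ndistinct values. A minimal standalone sketch mirroring the mod(i,10) data - not PostgreSQL code, just the arithmetic:

```python
# Mirror the table: a, b, c are all mod(i, 10) for i = 1..1000000.
N = 1_000_000
rows = [(i % 10, i % 10, i % 10) for i in range(1, N + 1)]

# Actual number of groups for GROUP BY (a+b), (b+c): the columns are
# perfectly correlated, so only 10 distinct combinations exist.
actual_groups = len({(a + b, b + c) for (a, b, c) in rows})

# Per-expression ndistinct values, i.e. what the extended statistic
# tracks for (a+b) and (b+c) individually.
nd_ab = len({a + b for (a, b, c) in rows})
nd_bc = len({b + c for (a, b, c) in rows})

# An independence assumption multiplies the two, a 10x overestimate here.
naive_estimate = nd_ab * nd_bc

print(actual_groups, nd_ab, nd_bc, naive_estimate)  # 10 10 10 100
```

When the expression matching works, the planner can use the joint ndistinct (10); when it fails, the estimate drifts toward the multiplied value, which is what the broken case below shows.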
Let's change it again:

========================================================================
test=# alter table t alter column c type double precision;
ALTER TABLE
test=# analyze t;
ANALYZE
test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27923.50 rows=1000 width=16)
   Group Key: (a + b), ((b)::double precision + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=12)
(3 rows)
========================================================================

Well, not that great, apparently. We clearly failed to match the second
expression, so we ended up with (b+c) estimated as (10 * 10). Why?
Because the expression now looks like this:

========================================================================
"public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double precision + c)) FROM t
========================================================================

But we're matching it to (((b)::double precision + c)), so that fails.
This is not specific to extended statistics - indexes have exactly the
same issue. Not sure how common this is in practice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company