Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 1ed534fb-b42b-f008-35ee-4eb930b0088f@enterprisedb.com
In response to Re: PoC/WIP: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers
On 3/24/21 5:48 PM, Tomas Vondra wrote:
> 
> 
> On 3/24/21 5:28 PM, Dean Rasheed wrote:
>> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> AFAIK the primary issue here is that the two places disagree. While
>>> estimate_num_groups does this
>>>
>>>     varnos = pull_varnos(root, (Node *) varshere);
>>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>>     { ... }
>>>
>>> the add_unique_group_expr does this
>>>
>>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>>
>>> That is, one looks at the group expression, while the other looks at vars
>>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>>> this can differ, causing the crash.
>>>
>>> So we need to change one of those places - my fix tweaked the second
>>> place to also look at the vars, but maybe we should change the other
>>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>>
>>
>> I think that it doesn't make any difference which place is changed.
>>
>> This is a case of an expression with no stats. With your change,
>> you'll get a single GroupExprInfo containing a list of
>> VariableStatData's for each of its Vars, whereas if you changed it
>> the other way, you'd get a separate GroupExprInfo for each Var. But I
>> think they'd both end up being treated the same by
>> estimate_multivariate_ndistinct(), since there wouldn't be any stats
>> matching the expression, only the individual Vars. Maybe changing the
>> first place would be the more bulletproof fix though.
>>
> 
> Yeah, I think that's true. I'll do a bit more research / experiments.
> 

Actually, I think we don't need that block at all - there's no point in
keeping the exact expression, because if there was a statistics object
matching it, it would have been matched by examine_variable. So if we
get here, we have to split it into the vars anyway. The second block is
entirely useless.

That however means we don't need the processing with GroupExprInfo and
GroupVarInfo lists, i.e. we can revert to the original, simpler
processing, with just a bit of extra logic to match expressions.
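FWIW the intended logic can be sketched like this (a toy Python model,
not the actual C code - the function and parameter names are all made
up for illustration):

```python
# Toy model of the simplified matching logic described above: try to
# match the whole GROUP BY expression against extended statistics
# first; if nothing matches, just split it into its Vars and multiply
# the per-column estimates. Not the real estimate_num_groups code.

def group_expr_ndistinct(expr, expr_vars, ext_stats, column_ndistinct):
    """Estimate ndistinct for one GROUP BY expression.

    expr:             the whole expression, e.g. "(a + b)"
    expr_vars:        the plain columns it references, e.g. ["a", "b"]
    ext_stats:        expressions covered by extended statistics,
                      mapped to their ndistinct estimates
    column_ndistinct: per-column ndistinct estimates (pg_statistic)
    """
    # If an extended statistic matches the whole expression, use it.
    if expr in ext_stats:
        return ext_stats[expr]

    # Otherwise there's no point keeping the expression around, so
    # fall back to its Vars and multiply their estimates.
    result = 1
    for var in expr_vars:
        result *= column_ndistinct[var]
    return result
```

With stats like in the example further down, "(a + b)" would get the
exact estimate 10, while an unmatched "(b + c)" would fall back to
10 * 10.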

The patch 0003 does this (it's a bit crude, but hopefully enough to
demonstrate).

Here's an updated patch series. 0001 should address most of today's
review items regarding comments etc.

0002 is an attempt to fix an issue I noticed today - we need to handle
type changes. Until now we did not have problems with that, because we
only had attnums - so we just reset the statistics (with the exception
of functional dependencies, on the assumption that those remain valid).

With expressions it's a bit more complicated, though.

1) We need to transform the expressions so that the Vars contain the
right type info etc. Otherwise an ANALYZE using the old pg_node_tree
crashes.

2) We need to reset the pg_statistic[] data too, which however makes
keeping the functional dependencies a bit less useful, because those
rely on the expression stats :-(

So I'm wondering what to do about this. I looked into how ALTER TABLE
handles indexes, and 0003 is a PoC to do the same thing for statistics.
Of course, this is a bit unfortunate because it recreates the statistics
(so we don't keep anything, not even functional dependencies).

I think we have two options:

a) Make UpdateStatisticsForTypeChange smarter, so that it also
transforms and updates the expression string and resets the
pg_statistic[] data.

b) Just recreate the statistics, like we do for indexes. Currently this
does not force an ANALYZE, so it simply resets all the stats. Maybe it
should run one, though.

Any opinions? I need to think about this a bit more, but maybe (b) with
the analyze is the right thing to do. Keeping just some of the stats
always seemed a bit weird. (This is why the 0002 patch breaks one of the
regression tests.)

BTW I wonder how useful the updated statistics actually are. Consider
this example:

========================================================================
CREATE TABLE t (a int, b int, c int);

INSERT INTO t SELECT mod(i,10), mod(i,10), mod(i,10)
  FROM generate_series(1,1000000) s(i);

CREATE STATISTICS s (ndistinct) ON (a+b), (b+c) FROM t;

ANALYZE t;

EXPLAIN ANALYZE SELECT 1 FROM t GROUP BY (a+b), (b+c);

test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), ((b + c)) FROM t

test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
   Group Key: (a + b), (b + c)
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
(3 rows)
========================================================================
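(As an aside, here is a toy model of where these row estimates come
from. This is my simplified assumption about the arithmetic, not the
actual estimate_num_groups code:)

```python
# Simplified model of the group-count arithmetic in this example.
# Assumption: the planner uses the extended ndistinct statistics for
# expressions it can match, and multiplies per-column estimates
# (clamped to the row count) for the parts it cannot match.

n_rows = 1_000_000
ndistinct = {"a": 10, "b": 10, "c": 10}

# With the statistics object matched, ndistinct for the pair
# {(a+b), (b+c)} is known directly (the columns are perfectly
# correlated), so the estimate is simply 10.
with_stats = 10

# If (b+c) fails to match (as happens after the type change shown
# below), it gets split into b and c and the estimates multiplied:
# 10 for the matched (a+b), times 10 * 10 for b and c.
without_match = min(10 * ndistinct["b"] * ndistinct["c"], n_rows)

print(with_stats, without_match)  # 10 1000
```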

Great. Now let's change one of the data types to something else:

========================================================================
test=# alter table t alter column c type numeric;
ALTER TABLE
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | numeric |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), (((b)::numeric + c)) FROM t

test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27906.17 rows=10 width=40)
   Group Key: (a + b), ((b)::numeric + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=36)
(3 rows)
========================================================================

Great! Let's change it again:

========================================================================
test=# alter table t alter column c type double precision;
ALTER TABLE
test=# analyze t;
ANALYZE
test=# EXPLAIN  SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27923.50 rows=1000 width=16)
   Group Key: (a + b), ((b)::double precision + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=12)
(3 rows)
========================================================================

Well, not that great, apparently. We clearly failed to match the second
expression, so we ended up with (b+c) estimated as (10 * 10). Why?
Because the stored expression now looks like this:

========================================================================
"public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
precision + c)) FROM t
========================================================================

But we're matching it to (((b)::double precision + c)), so that fails.
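To illustrate, here is a toy model with nested tuples standing in for
node trees. PostgreSQL compares the expression trees exactly, so the
extra intermediate cast left behind by the first ALTER TABLE breaks the
match:

```python
# Toy illustration of the failed match: expression trees modelled as
# nested tuples. Real PostgreSQL compares serialized node trees, but
# the effect is the same - matching is exact, and the stale
# intermediate cast makes the trees unequal.

def cast(expr, typ):
    return ("cast", expr, typ)

# What the two ALTER TABLEs left in the statistics object
# (int -> numeric -> double precision):
stored = ("+", cast(cast("b", "numeric"), "double precision"), "c")

# What the planner builds for GROUP BY (b + c) now that c is
# double precision:
query = ("+", cast("b", "double precision"), "c")

print(stored == query)  # False - the trees differ, so no stats match
```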

This is not specific to extended statistics - indexes have exactly the
same issue. Not sure how common this is in practice.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

