Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 3f20da4d-89dc-0912-98d2-b6f0ef06c87f@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
List pgsql-hackers
On 1/18/21 4:48 PM, Dean Rasheed wrote:
> Looking through extended_stats.c, I found a corner case that can lead
> to a seg-fault:
> 
> CREATE TABLE foo();
> CREATE STATISTICS s ON (1) FROM foo;
> ANALYSE foo;
> 
> This crashes in lookup_var_attr_stats(), because it isn't expecting
> nvacatts to be 0. I can't think of any case where building stats on a
> table with no analysable columns is useful, so it should probably just
> exit early in that case.
> 
> 
> In BuildRelationExtStatistics(), it looks like min_attrs should be
> declared assert-only.
> 
> 
> In evaluate_expressions():
> 
> +   /* set the pointers */
> +   result = (ExprInfo *) ptr;
> +   ptr += sizeof(ExprInfo);
> 
> I think that should probably have a MAXALIGN().
> 

Thanks, I'll fix all of that.

> 
> A slightly bigger issue that I don't like is the way it assigns
> attribute numbers for expressions starting from
> MaxHeapAttributeNumber+1, so the first expression has an attnum of
> 1601. That leads to pretty inefficient use of Bitmapsets, since most
> tables only contain a handful of columns, and there's a large block of
> unused space in the middle the Bitmapset.
> 
> An alternative approach might be to use regular attnums for columns
> and use negative indexes -1, -2, -3, ... for expressions in the stored
> stats. Then when storing and retrieving attnums from Bitmapsets, it
> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> in the Bitmapsets, since there can't be more than that many
> expressions (just like other code stores system attributes using
> FirstLowInvalidHeapAttributeNumber).
> 
> That would be a somewhat bigger change, but hopefully fairly
> mechanical, and then some code like add_expressions_to_attributes()
> would go away.
> 

Well, I tried this but unfortunately it's not that simple. We still need 
to build the bitmaps, so I don't think add_expression_to_attributes can 
be just removed. I mean, we need to do the offsetting somewhere, even if 
we change how we do it.

But the main issue is that in some cases the number of expressions is 
not really limited by STATS_MAX_DIMENSIONS - for example when applying 
functional dependencies, we "merge" multiple statistics, so we may end 
up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts 
the order of processing (so far we've assumed expressions are after 
regular attnums). So the changes are more extensive - I tried doing that 
anyway, and I'm still struggling with crashes and regression failures. 
Of course, that doesn't mean we shouldn't do it, but it's far from 
mechanical. (Some of that is probably a sign this code needs a bit more 
work to polish.)

But I wonder if it'd be easier to just calculate the actual max attnum 
and then use it instead of MaxHeapAttributeNumber ...

> 
> Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
> show expressions until the statistics have been built. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> but after populating and analysing the table, this becomes:
> 
>   statistics_name | tablename |  expr   | n_distinct
> -----------------+-----------+---------+------------
>   s               | foo       | (a + b) |         11
>   s               | foo       | (a * b) |         11
> (2 rows)
> 
> I think it should show the expressions even before the stats have been built.
> 
> Another issue is that it returns rows for non-expression stats as
> well. For example:
> 
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS s ON a, b FROM foo;
> SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;
> 
>   statistics_name | tablename | expr | n_distinct
> -----------------+-----------+------+------------
>   s               | foo       |      |
> (1 row)
> 
> and those values will never be populated, since they're not
> expressions, so I would expect them to not be shown in the view.
> 
> So basically, instead of
> 
> +         LEFT JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +
> unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON sd.stxdexpr IS NOT NULL;
> 
> perhaps just
> 
> +         JOIN LATERAL (
> +             SELECT
> +                 *
> +             FROM (
> +                 SELECT
> +
> unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
> +                     unnest(sd.stxdexpr)::pg_statistic AS a
> +             ) x
> +         ) stat ON true;

Will fix.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: "tsunakawa.takay@fujitsu.com"
Date:
Subject: RE: POC: postgres_fdw insert batching
Next
From: Ian Lawrence Barwick
Date:
Subject: Re: psql \df choose functions by their arguments