Re: Add expressions to pg_restore_extended_stats() - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Add expressions to pg_restore_extended_stats()
Date
Msg-id CADkLM=cPB_V+oV5d+5bfZGA_1Loa7iMmqXWj+SM-YGzxoUYrcQ@mail.gmail.com
Whole thread Raw
In response to Re: Add expressions to pg_restore_extended_stats()  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Add expressions to pg_restore_extended_stats()
List pgsql-hackers

The code paths that catch my attention is how the data is inserted,
though, as in serialize_expr_stats(), and we have the following thing
that skips the insertion of a pg_statistic tuple for an expression
with invalid data (I thought this should be the case, now I am sure it
is the case):
if (!stats->stats_valid)
{
    astate = accumArrayResult(astate,
                              (Datum) 0,
                              true,
                              typOid,
                              CurrentMemoryContext);
    continue;
}

So depending on what ANALYZE thinks in terms of what data is valid or
not, it is entirely possible that we don't insert any data all all for
some of the expressions.

+1

"valid" in this case could also mean "totally inconclusive and therefore not worth storing".
 
That pretty much answers the previous question I had that I could not
answer to yet a few mails ago: the restore function needs to be
flexible with the restore of expressions, and it looks like we should
not expect all the inputs to be set.  It is OK to not restore any data
at all for some of the expressions, or skip the all set of expressions
if nothing is given in input because we found no expression stats data.

I'm all for tolerating an undercount in exported statistics. But this got me wondering how the system view handles lining up stats to its expression counterpart when there's an undercount.

And, uh, well, it doesn't:

    FROM pg_statistic_ext s JOIN pg_class c ON (c.oid = s.stxrelid)
         LEFT JOIN pg_statistic_ext_data sd ON (s.oid = sd.stxoid)
         LEFT JOIN pg_namespace cn ON (cn.oid = c.relnamespace)
         LEFT JOIN pg_namespace sn ON (sn.oid = s.stxnamespace)
         JOIN LATERAL (
             SELECT unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
                    unnest(sd.stxdexpr)::pg_statistic AS a
         ) stat ON (stat.expr IS NOT NULL)

Two unnests like that will just null-pad the shorter list, so if there is a missing pg_statistic, then the system view could be giving us stats from a different expression. Now, we would probably detect that if the data types don't line up...but the data types could be compatible enough...ick. 
 

As far as I can see, the proposed patch is incorrect and inconsistent
with the backend regarding all that.  For example:
CREATE TABLE test (name text);
CREATE STATISTICS stats_exprs (dependencies)
  ON lower(name), upper(name)
  FROM test;

So based on my read of the code, we could expect zero, one or two
pg_statistic tuples based on what ANALYZE thinks.  The restore
functions expects always two elements in its inner array.  IMO, we
should have the flexibility to pass down 0, 1 or 2 elements in this
1-D array, without failing as long as the elements of the input data
have valid data.

The patch allows an object pattern like this one, for example, with a
second object being empty:
'exprs', '[{"avg_width": "7", [all valid fields] .. }, {}]

Unfortunately, on such input we insert TWO pg_statistic tuple.  We
could rely on the order of the items in the 1-D array, but it seems to
me that we will have a much easier life if we shape the input data
based on the following rules, matching with the policy that ANALYZE
can allow:
- An input array can have as many elements as the number of
expressions.

+1 
 
- The expression a single expression refers ought to be tracked in
each element, individually.  This means at least the addition of a
negative attribute number in EACH element (I am entirely rejecting my
wilder idea of the expression text at this stage).  If the "exprs"
data has no element for a given expression, that's fine, this has the
same meaning as no tuples found in pg_statistic for an expression.

I agree in principle, but I also fear that the system view may be wrong, so even if we lined up the expression text from pg_statistic_ext with the expression text from pg_stats_ext, the view itself would have already mis-identified which stats go where, and we have no direct access to pg_statistic_ext_data because of the security barrier.

And before you ask, there is *nothing* in pg_statistic_ext_data that indicate which expression it belongs to in the pg_statistic_ext beyond its position within the array, and we now know that isn't guaranteed.

- An element in the input array should have all its key/value pairs
set, if set for an expression.

That won't fly, because there will be pg_dumps from prior versions that won't know about future keys which would then be required.
 
Does this analysis make sense to you?

It does, but I'm not seeing how we line up the right stats to the right element in the event of an undercount. I've been digging around in the planner, and it seems like it just takes the Anum_pg_statistic_ext_data_stxdexpr, expands it, and then statext_expressions_load just assumes that the array will have something at that index. I hope I'm misreading this, because we may have uncovered something broken.

Then again it could be that expressions always have stats built for them, so the undercount never actually happens, thus the arrays all line up.

pgsql-hackers by date:

Previous
From: wenhui qiu
Date:
Subject: Re: Convert NOT IN sublinks to anti-joins when safe
Next
From: vignesh C
Date:
Subject: Re: [Proposal] Adding Log File Capability to pg_createsubscriber