Re: Statistics Import and Export - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Statistics Import and Export
Date
Msg-id d61e865c-6ff5-065d-6d26-37ea4ade892d@enterprisedb.com
Whole thread Raw
In response to Re: Statistics Import and Export  (Corey Huinker <corey.huinker@gmail.com>)
Responses Re: Statistics Import and Export
List pgsql-hackers
On 12/13/23 11:26, Corey Huinker wrote:
>     Yeah, that was the simplest output function possible, it didn't seem
>
>     worth it to implement something more advanced. pg_mcv_list_items() is
>     more convenient for most needs, but it's quite far from the on-disk
>     representation.
>
>
> I was able to make it work.
>
>
>
>     That's actually a good question - how closely should the exported data
>     be to the on-disk format? I'd say we should keep it abstract, not tied
>     to the details of the on-disk format (which might easily change
between
>     versions).
>
>
> For the most part, I chose the exported data json types and formats in a
> way that was the most accommodating to cstring input functions. So,
> while so many of the statistic values are obviously only ever
> integers/floats, those get stored as a numeric data type which lacks
> direct numeric->int/float4/float8 functions (though we could certainly
> create them, and I'm not against that), casting them to text lets us
> leverage pg_strtoint16, etc.
>
>
>
>     I'm a bit confused about the JSON schema used in pg_statistic_export
>     view, though. It simply serializes stakinds, stavalues, stanumbers
into
>     arrays ... which works, but why not to use the JSON nesting? I mean,
>     there could be a nested document for histogram, MCV, ... with just the
>     correct fields.
>
>       {
>         ...
>         histogram : { stavalues: [...] },
>         mcv : { stavalues: [...], stanumbers: [...] },
>         ...
>       }
>
>
> That's a very good question. I went with this format because it was
> fairly straightforward to code in SQL using existing JSON/JSONB
> functions, and that's what we will need if we want to export statistics
> on any server currently in existence. I'm certainly not locked in with
> the current format, and if it can be shown how to transform the data
> into a superior format, I'd happily do so.
>
>     and so on. Also, what does TRIVIAL stand for?
>
>
> It's currently serving double-duty for "there are no stats in this slot"
> and the situations where the stats computation could draw no conclusions
> about the data.
>
> Attached is v3 of this patch. Key features are:
>
> * Handles regular pg_statistic stats for any relation type.
> * Handles extended statistics.
> * Export views pg_statistic_export and pg_statistic_ext_export to allow
> inspection of existing stats and saving those values for later use.
> * Import functions pg_import_rel_stats() and pg_import_ext_stats() which
> take Oids as input. This is intentional to allow stats from one object
> to be imported into another object.
> * User scripts pg_export_stats and pg_import stats, which offer a
> primitive way to serialize all the statistics of one database and import
> them into another.
> * Has regression test coverage for both with a variety of data types.
> * Passes my own manual test of extracting all of the stats from a v15
> version of the popular "dvdrental" example database, as well as some
> additional extended statistics objects, and importing them into a
> development database.
> * Import operations never touch the heap of any relation outside of
> pg_catalog. As such, this should be significantly faster than even the
> most cursory analyze operation, and therefore should be useful in
> upgrade situations, allowing the database to work with "good enough"
> stats more quickly, while still allowing for regular autovacuum to
> recalculate the stats "for real" at some later point.
>
> The relation statistics code was adapted from similar features in
> analyze.c, but is now done in a query context. As before, the
> rowcount/pagecount values are updated on pg_class in a non-transactional
> fashion to avoid table bloat, while the updates to pg_statistic are
> pg_statistic_ext_data are done transactionally.
>
> The existing statistics _store() functions were leveraged wherever
> practical, so much so that the extended statistics import is mostly just
> adapting the existing _build() functions into _import() functions which
> pull their values from JSON rather than computing the statistics.
>
> Current concerns are:
>
> 1. I had to code a special-case exception for MCELEM stats on array data
> types, so that the array_in() call uses the element type rather than the
> array type. I had assumed that the existing exmaine_attribute()
> functions would have properly derived the typoid for that column, but it
> appears to not be the case, and I'm clearly missing how the existing
> code gets it right.
Hmm, after looking at this, I'm not sure it's such an ugly hack ...

The way this works for ANALYZE is that examine_attribute() eventually
calls the typanalyze function:

  if (OidIsValid(stats->attrtype->typanalyze))
    ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
                                       PointerGetDatum(stats)));

which for arrays is array_typanalyze, and this sets stats->extra_data to
ArrayAnalyzeExtraData with all the interesting info about the array
element type, and then also std_extra_data with info about the array
type itself.

  stats -> extra_data -> std_extra_data

compute_array_stats then "restores" std_extra_data to compute standard
stats for the whole array, and then uses the ArrayAnalyzeExtraData to
calculate stats for the elements.

It's not exactly pretty, because there are global variables and so on.

And examine_rel_attribute() does the same thing - calls typanalyze, so
if I break after it returns, I see this for int[] column:

(gdb) p * (ArrayAnalyzeExtraData *) stat->extra_data

$1 = {type_id = 23, eq_opr = 96, coll_id = 0, typbyval = true, typlen =
4, typalign = 105 'i', cmp = 0x2e57920, hash = 0x2e57950,
std_compute_stats = 0x6681b8 <compute_scalar_stats>, std_extra_data =
0x2efe670}

I think the "problem" will be how to use this in import_stavalues(). You
can't just do this for any array type, I think. I could create an array
type (with ELEMENT=X) but with a custom analyze function, in which case
the extra_data may be something entirely different.

I suppose the correct solution would be to add an "import" function into
the pg_type catalog (next to typanalyze). Or maybe it'd be enough to set
it from the typanalyze? After all, that's what sets compute_stats.

But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Avoid computing ORDER BY junk columns unnecessarily
Next
From: David Rowley
Date:
Subject: Re: Revise the Asserts added to bimapset manipulation functions