From: Corey Huinker
Subject: Re: Statistics Import and Export
Msg-id: CADkLM=doQW3TE_LyBwG10NppKZLz0CjAgu10O79uMQc2frYCmA@mail.gmail.com
In response to: Re: Statistics Import and Export (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers

> Also, it says "statistics are replaced" but it's not quite clear whether
> that applies only to matching statistics or if all stats are deleted
> first and then the new stuff is inserted. (FWIW remove_pg_statistics
> clearly deletes all pre-existing stats).

All are now deleted first, in both the pg_statistic and pg_statistic_ext_data tables. The previous version took a "replace it if we find a new value" approach, but that was overly complicated, so following the example set by extended statistics seemed best.
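
To illustrate the effect (a sketch only - the patch does this through the C catalog functions, not SQL, and public.foo is just a stand-in target):

    -- wipe any pre-existing stats for the target relation first...
    DELETE FROM pg_statistic
     WHERE starelid = 'public.foo'::regclass;

    DELETE FROM pg_statistic_ext_data d
     USING pg_statistic_ext e
     WHERE d.stxoid = e.oid
       AND e.stxrelid = 'public.foo'::regclass;

    -- ...then insert the imported rows fresh, with no attempt to match
    -- and replace individual existing entries.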

> - import_pg_statistics: I somewhat dislike that we're passing arguments
> as datum[] array - it's hard to say what the elements are expected to
> be, etc. Maybe we should expand this, to make it clear. How do we even
> know the array is large enough?

Completely fair. Initially that was done with the expectation that the array would be the same for both regular stats and extended stats, but that turned out not to be the case.
 
> - I don't quite understand why we need examine_rel_attribute. It sets a
> lot of fields in the VacAttrStats struct, but then we only use attrtypid
> and attrtypmod from it - so why bother and not simply load just these
> two fields? Or maybe I'm missing something.
 
I think you're right: we don't need it anymore for regular statistics. We still need it for extended stats because statext_store() takes a subset of the VacAttrStats rows as input.

Which leads to a side issue. We currently have three functions: examine_rel_attribute and the two varieties of examine_attribute (one in analyze.c, the other in extended stats). These are highly similar, but just different enough that I didn't feel comfortable refactoring them into a one-size-fits-all function, and I was particularly reluctant to modify existing code in the ANALYZE path.

> - examine_rel_attribute can return NULL, but get_attrinfo does not check
> for NULL and just dereferences the pointer. Surely that can lead to
> segfaults?

Good catch, and it highlights how little we need VacAttrStats for regular statistics.

> - validate_no_duplicates and the other validate functions would deserve
> better docs explaining what exactly is checked (it took me a while to
> realize we check just for duplicates), what the parameters do, etc.

Those functions are in a fairly formative phase - I expect a conversation about what sorts of validations we want to do to ensure that the statistics being imported make sense, and about the circumstances under which we would forgo some of those checks.

> - Do we want to make the validate_ functions part of the public API? I
> realize we want to use them from multiple places (regular and extended
> stats), but maybe it'd be better to have an "internal" header file, just
> like we have extended_stats_internal?

I see no need to have them be a part of the public API. Will move.

> - I'm not sure we do "\set debug f" elsewhere. It took me a while to
> realize why the query outputs are empty ...

That was an experiment that arose from the difficulty of determining _where_ a difference was when the set-difference checks failed. So far I like it, and I'm hoping it catches on.
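
For anyone reading along, the pattern looks roughly like this (a sketch; before_stats and after_stats are hypothetical names, not the actual tables in the test):

    \set debug f
    \if :debug
        -- with debug set to t, dump both sides of the comparison in full
        TABLE before_stats;
        TABLE after_stats;
    \endif
    -- the actual check: zero rows means the two sides match
    TABLE before_stats EXCEPT TABLE after_stats;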

0002

> - I'd rename create_stat_ext_entry to statext_create_entry.
>
> - Do we even want to include OIDs from the source server? Why not just
> have object names and resolve those? Seems safer - if the target server
> has the OID allocated to a different object, that could lead to
> confusing / hard-to-detect issues.

The import functions would obviously never use the imported oids to look up objects on the destination system. Rather, they're there to verify that the local object oid matches the exported object oid, which is true in the case of a binary upgrade.

The export format is an attempt to export the pg_statistic[_ext_data] for that object as-is, and, as Tom suggested, let the import function do the transformations. We can of course remove them if they truly have no purpose for validation.
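
As a rough illustration of the "as-is" shape (a sketch, not the actual export query in the patch):

    -- serialize the catalog row for one relation more or less verbatim
    SELECT to_jsonb(s)
      FROM pg_statistic s
     WHERE s.starelid = 'public.foo'::regclass;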

> - What happens if we import statistics which include data for an
> extended statistics object which does not exist on the target machine?

The import function takes the oid of the target object (relation or extended statistics object), and the JSON payload is expected to hold the stats for that ONE corresponding object. Multiple objects' stats really don't fit into the JSON format, and statistics exported for an object that does not exist on the destination system would have no meaningful invocation. I envision the dump file looking like this:

    CREATE TABLE public.foo (....);

    SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>, option flag, option flag);

So a call against a nonexistent object would fail on the regclass cast.
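
That is, the failure is just the cast's ordinary error:

    SELECT 'public.no_such_table'::regclass;
    -- ERROR:  relation "public.no_such_table" does not exist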

> - pg_import_ext_stats seems to not use require_match_oids - bug?

I haven't yet seen a good way to make use of matching oids in extended stats. Checking matching operator/collation oids would make sense, but little else.

0003

> - no SGML docs for the new tools?

Correct. I foresee the export tool being folded into pg_dump, and the import tool going away entirely, as psql can handle the import.

> - The help() seems to be wrong / copied from "clusterdb" or something
> like that, right?

Correct, for the reason above.

> >
> > pg_import_rel_stats matches up local columns with the exported stats by
> > column name, not attnum. This allows for stats to be imported when columns
> > have been dropped, added, or reordered.
> >

> Makes sense. What will happen if we try to import data for extended
> statistics (or index) that does not exist on the target server?

One of the parameters to the function is the oid of the object that is the target of the stats. The importer will not seek out objects with matching names, and each JSON payload is limited to holding one object, though clearly someone could wrap the existing format in a format that has a manifest of objects to import.

> >
> > pg_import_ext_stats can also handle column reordering, though it currently
> > would get confused by changes in expressions that maintain the same result
> > data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> > think it's wise to try. I think we'd be better off validating that the
> > destination extended stats object is identical in structure, and to fail
> > the import of that one object if it isn't perfect.
> >

> Yeah, column reordering is something we probably need to handle. The
> stats order them by attnum, so if we want to allow import on a system
> where the attributes were dropped/created in a different way, this is
> necessary. I haven't tested this - is there a regression test for this?

The overlong transformation SQL starts with the object to be imported (whose local oid was specified), and it:

1. grabs all the attributes (or expressions, for extended stats) of that object,
2. looks in the exported JSON for a column/expression with a matching name,
3. takes the exported attnum of that attribute for use in things like stdexprs, and
4. looks up the type, collation, and operators for the exported attribute.

So we can get a situation where there are no importable stats for an attribute of the destination table, in which case we'd import nothing for that column. Stats for exported columns with no matching local column would simply never be referenced.

Yes, there should be a test of this.
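
To make step 2 concrete, the matching is in the spirit of the following (a sketch; the "columns"/"name"/"attnum" JSON layout is hypothetical, not the actual export format):

    WITH exported(js) AS (
      SELECT '{"columns": [{"name": "id",  "attnum": 1},
                           {"name": "val", "attnum": 3}]}'::jsonb
    )
    SELECT a.attnum                AS local_attnum,
           col ->> 'name'          AS column_name,
           (col ->> 'attnum')::int AS exported_attnum
      FROM exported,
           jsonb_array_elements(js -> 'columns') AS col
      JOIN pg_attribute a
        ON a.attname = col ->> 'name'
       AND a.attrelid = 'public.foo'::regclass
     WHERE a.attnum > 0
       AND NOT a.attisdropped;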


> I agree expressions are hard. I don't think it's feasible to import a
> nodetree from other server versions, but why don't we simply deparse the
> expression on the source, and either parse it on the target (and then
> compare the two nodetrees), or deparse the target too and compare the
> two deparsed expressions? I suspect the deparsing may produce slightly
> different results on the two versions (causing false mismatches), but
> perhaps the deparse on source + parse on target + compare nodetrees
> would work? Haven't tried, though.

> > Export formats go back to v10.
> >

> Do we even want/need to go beyond 12? All earlier versions are EOL.

True, but until fairly recently pg_dump and pg_restore supported versions back to 7.x, and a major friction point in getting customers to upgrade their instances off of unsupported versions is the downtime caused by the upgrade. Why wouldn't we make that easier for them?
