Re: Statistics Import and Export - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Statistics Import and Export
Date
Msg-id CADkLM=dNQzmdsWfXVVDBqv7gHsa-7dk3v13AqrAzBoJWTMrWEA@mail.gmail.com
In response to Re: Statistics Import and Export  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Thu, May 16, 2024 at 2:26 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
>
> Per previous comments, it was suggested by others that:
>
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA

I don't understand the gravity of the choice here: what am I missing?

To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?

I'm with you: I don't understand the choice either and would like to, but at the same time it now works the way others strongly suggested it should, so I'm still curious about the why.

There were several people expressing interest in this patch at pgconf.dev, so I thought I'd post a rebase and give a summary of things to date.

THE INITIAL GOAL

The initial goal of this effort was to reduce upgrade downtimes by eliminating the need for the vacuumdb --analyze-in-stages call that is recommended (but often not done) after a pg_upgrade. The analyze-in-stages step is usually by far the longest part of a binary upgrade and is a significant part of a restore from dump, so eliminating this step will save users time and eliminate or greatly reduce a potential pitfall of upgrading...and thus reduce upgrade friction (read: excuses to not upgrade).

THE FUNCTIONS

These patches introduce two functions, pg_set_relation_stats() and pg_set_attribute_stats(), which allow the caller to modify the statistics of any relation, provided that they own that relation or have the MAINTAIN privilege on it.

The function pg_set_relation_stats looks like this:

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 17::integer,
                             'reltuples', 400.0::real,
                             'relallvisible', 4::integer);

The function takes the oid of the relation to have stats imported, a version number (SERVER_VERSION_NUM) for the source of the statistics, and then a series of varargs organized as name-value pairs. Currently, three arg pairs (relpages, reltuples, and relallvisible) are required for the row to be properly set. If all three are not present, the function will issue a warning, and the row will not be updated.
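To make the warning behavior concrete, here is a hedged sketch of a call that omits one of the required pairs; per the rule above it should emit a warning and leave the pg_class row untouched (the warning text shown is illustrative, not the actual message):

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 17::integer,
                             'reltuples', 400.0::real);
-- WARNING (illustrative): required stat 'relallvisible' not supplied; pg_class row not updated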

The choice of varargs is a defensive one, basically ensuring that a pg_dump that includes statistics import calls will not fail on a future version that does not have one or more of these values. The call itself would fail to modify the relation row, but it wouldn't cause the whole restore to fail. I'm personally not against having a fixed-arg version of this function, nor am I against having both at the same time, with the varargs version basically teeing up the fixed-param call appropriate for the destination server version.

This function does an in-place update of the pg_class row to avoid bloating pg_class, just like ANALYZE does. This means that this function call is NON-transactional.
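A minimal illustration of what non-transactional means here, assuming the in-place behavior described above:

BEGIN;
SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 17::integer,
                             'reltuples', 400.0::real,
                             'relallvisible', 4::integer);
ROLLBACK;

-- The pg_class row was updated in place, so the new values survive the ROLLBACK:
SELECT relpages, reltuples, relallvisible
FROM pg_class
WHERE oid = 'stats_export_import.test'::regclass;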

The function pg_set_attribute_stats looks like this:

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.5::real,
    'avg_width', 2::integer,
    'n_distinct', -0.1::real,
    'most_common_vals', '{2,1,3}'::text,
    'most_common_freqs', '{0.3,0.25,0.05}'::real[]
    );

Like the first function, it takes a relation oid and a source server version, though here the version is in the 4th position. It also takes the name of an attribute, and a boolean indicating whether these stats are for inherited statistics (true) or regular (false). Again, what follows is a vararg list of name-value pairs, each name corresponding to an attribute of pg_stats and expecting a value appropriate for said attribute of pg_stats. Note that ANYARRAY values are passed in as text. This is done for a few reasons. First, if the attribute is an array type, then the most_common_vals value will be an array of that array type, and there is no way to represent that in SQL (it instead gives a higher-order array of the same base type). Second, it allows us to import the values with a simple array_in() call. Last, it allows for situations where the type name changed from source system to destination (example: a schema-qualified extension type gets moved to core).
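To make the ANYARRAY-as-text convention concrete, here is a hedged sketch that sets histogram_bounds (an ANYARRAY column in pg_stats) for the same 'id' attribute; the value is passed as text and converted with array_in() against the attribute's type on import:

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.0::real,
    'avg_width', 2::integer,
    'n_distinct', -1.0::real,
    'histogram_bounds', '{1,5,10,50,100,500,1000}'::text
    );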

There are lots of ways that this function call can go wrong: an invalid attribute name, an invalid parameter name in a name-value pair, a value whose data type does not match its parameter, or type coercion errors in array_in(), to name just a few. All of these errors result in a warning and that import failing, but the function completes normally. Internal typecasting and array_in() are all done with the _safe() equivalents, and any such errors are re-emitted as warnings. The central goal here is to not make a restore fail just because the statistics are wonky.
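A hedged illustration of the warn-don't-fail behavior (the bogus stat name and the warning text are purely illustrative):

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'not_a_real_stat', 42::integer
    );
-- WARNING (illustrative): unrecognized statistic name "not_a_real_stat"; stats not imported
-- The statement itself still succeeds, so a surrounding restore keeps going.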

Calls to pg_set_attribute_stats() are transactional. This wouldn't warrant mentioning if not for pg_set_relation_stats() being non-transactional.

DUMP / RESTORE / UPGRADE

The code for pg_dump/restore/upgrade has been modified to allow for statistics to be exported/imported by default. There is a flag to prevent this (--no-statistics) and a flag to ONLY do statistics (--statistics-only), the utility of which will be discussed later.

pg_dump will make queries of the source database, adjusting the syntax to reflect the version of the source system. There is very little variance in those queries, so it should be possible to query as far back as 9.2 and get usable stats. The output of these calls will be a series of SELECT statements, each one making a call to either pg_set_relation_stats (one per table/index/matview) or pg_set_attribute_stats (one per attribute that had a matching pg_statistic row).

The positioning of these calls in the restore sequence was originally set up as SECTION_NONE, but it was strongly suggested that SECTION_DATA / SECTION_POST_DATA was the right spot instead, and that's where they currently reside.

The end result will be that the new database has stats identical (or at least close) to those of the source system. Those statistics might be good or bad, but they're almost certainly better than no stats at all. Even if they are bad, they will be overwritten by the next ANALYZE or autovacuum.


WHAT IS NOT DONE

1. Extended Statistics, which are considerably more complex than regular stats (stxdexprs is itself an array of pg_statistic rows) and thus more difficult to express in a simple function call. They are also used fairly rarely in customer installations, so leaving them out of the v1 patch seemed like an easy trade-off.

2. Any sort of validity checking beyond data types. This was initially provided, verifying that array values representing frequencies must be between 0.0 and 1.0, that arrays representing most-common-value frequencies must be in monotonically non-increasing order, etc., but these checks were rejected as being overly complex, potentially rejecting valid stats, and getting in the way of another use I hadn't considered.

3. Export functions. Strictly speaking we don't need them, but some use-cases described below may make the case for including them.

OTHER USES

Usage of these functions is not restricted to upgrade/restore situations. The most obvious use was to experiment with how the planner behaves when one or more tables grow and/or skew. It is difficult to create a table with 10 billion rows in it, but it's now trivial to create a table that says it has 10 billion rows in it.

This can be taken a step further, and in a way I had not anticipated - actively stress-testing the planner by inserting wildly incorrect and/or nonsensical stats. In that sense, these functions are a fuzzing tool that happens to make upgrades go faster.
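As a hedged sketch of this kind of experimentation: tell the planner that a small table has ten billion rows and watch the plans change (the page count here is an illustrative guess, and the next ANALYZE or autovacuum will overwrite all of it):

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 50000000::integer,
                             'reltuples', 10000000000.0::real,
                             'relallvisible', 0::integer);

EXPLAIN SELECT * FROM stats_export_import.test WHERE id = 42;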

FUTURE PLANS

Integration with postgres_fdw is an obvious next step, allowing an ANALYZE on a foreign table to, instead of asking for a remote row sample, simply export the stats of the remote table and import them into the foreign table.

Extended Statistics.

CURRENT PROGRESS

I believe that all outstanding questions/requests have been addressed, and the patch is now back to needing a review.

FOR YOUR CONSIDERATION

Rebase (current as of f04d1c1db01199f02b0914a7ca2962c531935717) attached.