Re: Statistics Import and Export - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Statistics Import and Export
Date
Msg-id CAOuzzgpw1ridazdKsedj=_srNGvyT2wSoHR9AT77BZ567BVtbg@mail.gmail.com
Whole thread Raw
In response to Re: Statistics Import and Export  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Statistics Import and Export
List pgsql-hackers
Greetings,

On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the paralleliziable part of the process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Using a VALUES construct for this does seem like it might make it cleaner, so +1 for investigating that idea.

Or even just one VALUES for the whole statistics loading?

I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to spend a lot of effort on.  I’m pretty bullish on the idea that this simply won’t happen except in very exceptional cases under a pg_upgrade (where the pg_dump that’s used must match the target server version) and where it happens under a pg_dump it’ll be because it’s an older pg_dump’s output and the answer will likely need to be “you’re using a pg_dump file generated using an older version of pg_dump and need to exclude stats entirely from the load and instead run analyze on the data after loading it.”

What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of columns, or just generally, if we know that the pg_dump version matches the target server version?

Thanks!

Stephen

pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Next
From: Dean Rasheed
Date:
Subject: Re: Proposal to include --exclude-extension Flag in pg_dump