Thread: Statistics Import and Export
pg_stats_export is a view that collects statistics for each relation oid and stores all of the column statistical data in a system-independent (i.e.
no oids, collation information removed, all MCV values rendered as text)
jsonb format, along with the relation's relname, reltuples, and relpages
from pg_class, as well as the schemaname from pg_namespace.
pg_import_rel_stats is a function which takes a relation oid,
server_version_num, num_tuples, num_pages, and a column_stats jsonb in
a format matching that of pg_stats_export, and applies that data to
the pg_class and pg_statistic rows for the specified relation.
The most common use-case for such a function is in upgrades and
dump/restore, wherein the upgrade process would capture the output of
pg_stats_export into a regular table, perform the upgrade, and then
join that data to the existing pg_class rows, updating statistics to be
a close approximation of what they were just prior to the upgrade. The
hope is that these statistics are better than the early stages of
--analyze-in-stages and can be applied faster, thus reducing system
downtime.
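To make that workflow concrete, a rough sketch of the sequence described above follows. The staging table and the export view's column names (server_version_num, num_tuples, num_pages, column_stats) are assumptions about the interface, not its final form:

-- Before the upgrade: snapshot the exported stats into a regular table.
CREATE TABLE stats_snapshot AS
    SELECT * FROM pg_stats_export;

-- ... run pg_upgrade; the snapshot is carried over like any user table ...

-- After the upgrade: re-apply the stats to the matching relations.
SELECT pg_import_rel_stats(c.oid,
                           s.server_version_num,
                           s.num_tuples,
                           s.num_pages,
                           s.column_stats)
  FROM stats_snapshot s
  JOIN pg_namespace n ON n.nspname = s.schemaname
  JOIN pg_class c ON c.relnamespace = n.oid AND c.relname = s.relname;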
The values applied to pg_class are done inline, which is to say
non-transactionally. The values applied to pg_statistic are applied
transactionally, as if an ANALYZE operation were reading from a
cheat sheet.
This function and view will need to be followed up with corresponding
versions for older servers: while we would likely never backport the import
functions, we can have user programs do the same work as the export views,
so that statistics can be brought forward from versions as far back as there
is jsonb to store them.
While the primary purpose of the import function(s) is to reduce downtime
during an upgrade, it is not hard to see that they could also be used to
facilitate tuning and development operations, asking questions like "how might
this query plan change if this table had 1000x rows in it?", without actually
putting those rows into the table.
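As a rough illustration of that what-if use (reusing the hypothetical stats_snapshot table and assumed column names from the sketch above; 'my_table' and the test query are placeholders), one could feed back inflated row and page counts and then inspect the resulting plans with EXPLAIN:

SELECT pg_import_rel_stats('my_table'::regclass,
                           s.server_version_num,
                           s.num_tuples * 1000,  -- pretend the table is 1000x larger
                           s.num_pages * 1000,
                           s.column_stats)
  FROM stats_snapshot s
 WHERE s.schemaname = 'public' AND s.relname = 'my_table';

EXPLAIN SELECT * FROM my_table WHERE a = 42;  -- hypothetical query to re-plan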
Thanks. I think this may be used with postgres_fdw to import
statistics directly from the foreign server, whenever possible,
rather than fetching the rows and building them locally. If it's known
that the stats on foreign and local servers match for a foreign table,
we will be one step closer to accurately estimating the cost of a
foreign plan locally rather than through EXPLAIN.
Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate query for several versions back - they change, not by much, but they do change. So now we'd have each query text in three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the best way to share those in the codebase.
Attached is v2 of this patch. New features:
* imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does, which is to say it derives those stats entirely from table column stats, which are imported, so in that sense we're getting index stats without touching the heap.
* now supports extended statistics except for MCV, which is currently serialized as a difficult-to-decompose bytea field.
* bare-bones CLI script pg_export_stats, which extracts stats from databases back to v12 (tested) and could work back to v10.
* bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from older versions.
Maybe I just don't understand, but I'm pretty sure ANALYZE does not
derive index stats from column stats. It actually builds them from the
row sample.
> * now supports extended statistics except for MCV, which is currently
> serialized as a difficult-to-decompose bytea field.
Doesn't pg_mcv_list_items() already do all the heavy work?
The comment below in mcv.c made me think there was no easy way to get output.
/*
 * pg_mcv_list_out - output routine for type pg_mcv_list.
*
* MCV lists are serialized into a bytea value, so we simply call byteaout()
* to serialize the value into text. But it'd be nice to serialize that into
* a meaningful representation (e.g. for inspection by people).
*
* XXX This should probably return something meaningful, similar to what
* pg_dependencies_out does. Not sure how to deal with the deduplicated
* values, though - do we want to expand that or not?
*/
On Mon, Nov 6, 2023 at 4:16 PM Corey Huinker <corey.huinker@gmail.com> wrote: >> >> >> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate queryfor several versions back - they change, not by much, but they do change. So now we'd have each query text in threeplaces: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the bestway to share those in the codebase. >> > > Attached is v2 of this patch. While applying Patch, I noticed few Indentation issues: 1) D:\Project\Postgres>git am v2-0003-Add-pg_import_rel_stats.patch .git/rebase-apply/patch:1265: space before tab in indent. errmsg("invalid statistics format, stxndeprs must be array or null"); .git/rebase-apply/patch:1424: trailing whitespace. errmsg("invalid statistics format, stxndistinct attnums elements must be strings, but one is %s", .git/rebase-apply/patch:1315: new blank line at EOF. + warning: 3 lines add whitespace errors. Applying: Add pg_import_rel_stats(). 2) D:\Project\Postgres>git am v2-0004-Add-pg_export_stats-pg_import_stats.patch .git/rebase-apply/patch:282: trailing whitespace. const char *export_query_v14 = .git/rebase-apply/patch:489: trailing whitespace. const char *export_query_v12 = .git/rebase-apply/patch:648: trailing whitespace. const char *export_query_v10 = .git/rebase-apply/patch:826: trailing whitespace. .git/rebase-apply/patch:1142: trailing whitespace. result = PQexec(conn, warning: squelched 4 whitespace errors warning: 9 lines add whitespace errors. Applying: Add pg_export_stats, pg_import_stats. Thanks and Regards, Shubham Khanna.
On Tue, Oct 31, 2023 at 12:55 PM Corey Huinker <corey.huinker@gmail.com> wrote: >> >> >> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate queryfor several versions back - they change, not by much, but they do change. So now we'd have each query text in threeplaces: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the bestway to share those in the codebase. >> > > Attached is v2 of this patch. > > New features: > * imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does, whichis to say it derives those stats entirely from table column stats, which are imported, so in that sense we're gettingindex stats without touching the heap. > * now support extended statistics except for MCV, which is currently serialized as an difficult-to-decompose bytea field. > * bare-bones CLI script pg_export_stats, which extracts stats on databases back to v12 (tested) and could work back tov10. > * bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from olderversions. > I did a small experiment with your patches. In a separate database "fdw_dst" I created a table t1 and populated it with 100K rows #create table t1 (a int, b int); #insert into t1 select i, i + 1 from generate_series(1, 100000) i; #analyse t1; In database "postgres" on the same server, I created a foreign table pointing to t1 #create server fdw_dst_server foreign data wrapper postgres_fdw OPTIONS ( dbname 'fdw_dst', port '5432'); #create user mapping for public server fdw_dst_server ; #create foreign table t1 (a int, b int) server fdw_dst_server; The estimates are off #explain select * from t1 where a = 100; QUERY PLAN ----------------------------------------------------------- Foreign Scan on t1 (cost=100.00..142.26 rows=13 width=8) (1 row) Export and import stats for table t1 $ pg_export_stats -d fdw_dst | pg_import_stats -d postgres gives accurate estimates #explain select * from t1 where a = 100; QUERY PLAN ----------------------------------------------------------- Foreign Scan on t1 (cost=100.00..1793.02 rows=1 width=8) (1 row) In this simple case it's working like a charm. Then I wanted to replace all ANALYZE commands in postgres_fdw.sql with import and export of statistics. But I can not do that since it requires table names to match. Foreign table metadata stores the mapping between local and remote table as well as column names. Import can use that mapping to install the statistics appropriately. We may want to support a command or function in postgres_fdw to import statistics of all the tables that point to a given foreign server. That may be some future work based on your current patches. I have not looked at the code though. -- Best Wishes, Ashutosh Bapat
Yeah, that was the simplest output function possible, it didn't seem
worth it to implement something more advanced. pg_mcv_list_items() is
more convenient for most needs, but it's quite far from the on-disk
representation.
I was able to make it work.
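For reference, a query along these lines (adapted from the pg_mcv_list_items() example in the CREATE STATISTICS documentation; the statistics object name 'stts' is illustrative) decomposes the serialized MCV list into per-item values and frequencies:

SELECT m.index, m.values, m.nulls, m.frequency, m.base_frequency
  FROM pg_statistic_ext
  JOIN pg_statistic_ext_data ON (oid = stxoid),
       pg_mcv_list_items(stxdmcv) m
 WHERE stxname = 'stts';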
That's actually a good question - how closely should the exported data
be to the on-disk format? I'd say we should keep it abstract, not tied
to the details of the on-disk format (which might easily change between
versions).
I'm a bit confused about the JSON schema used in pg_statistic_export
view, though. It simply serializes stakinds, stavalues, stanumbers into
arrays ... which works, but why not to use the JSON nesting? I mean,
there could be a nested document for histogram, MCV, ... with just the
correct fields.
{
...
histogram : { stavalues: [...] },
mcv : { stavalues: [...], stanumbers: [...] },
...
}
That's a very good question. I went with this format because it was fairly straightforward to code in SQL using existing JSON/JSONB functions, and that's what we will need if we want to export statistics from any server currently in existence. I'm certainly not locked into the current format, and if it can be shown how to transform the data into a superior format, I'd happily do so.
and so on. Also, what does TRIVIAL stand for?

It's currently serving double-duty for "there are no stats in this slot" and the situations where the stats computation could draw no conclusions about the data.

Attached is v3 of this patch. Key features are:
* Handles regular pg_statistic stats for any relation type.
* Handles extended statistics.
* Export views pg_statistic_export and pg_statistic_ext_export to allow inspection of existing stats and saving those values for later use.
* Import functions pg_import_rel_stats() and pg_import_ext_stats() which take Oids as input. This is intentional to allow stats from one object to be imported into another object.
* User scripts pg_export_stats and pg_import stats, which offer a primitive way to serialize all the statistics of one database and import them into another.
* Import operations never touch the heap of any relation outside of pg_catalog. As such, this should be significantly faster than even the most cursory analyze operation, and therefore should be useful in upgrade situations, allowing the database to work with "good enough" stats more quickly, while still allowing for regular autovacuum to recalculate the stats "for real" at some later point.
The relation statistics code was adapted from similar features in analyze.c, but is now done in a query context. As before, the rowcount/pagecount values are updated on pg_class in a non-transactional fashion to avoid table bloat, while the updates to pg_statistic and pg_statistic_ext_data are done transactionally.
The existing statistics _store() functions were leveraged wherever practical, so much so that the extended statistics import is mostly just adapting the existing _build() functions into _import() functions which pull their values from JSON rather than computing the statistics.
Current concerns are:
1. I had to code a special-case exception for MCELEM stats on array data types, so that the array_in() call uses the element type rather than the array type. I had assumed that the existing examine_attribute() functions would have properly derived the typoid for that column, but it appears not to be the case, and I'm clearly missing how the existing code gets it right.
2. This hasn't been tested with external custom datatypes, but if they have a custom typanalyze function things should be ok.
4. I don't yet have a complete vision for how these tools will be used by pg_upgrade and pg_dump/restore, the places where these will provide the biggest win for users.
Attachment
- v3-0002-Add-system-view-pg_statistic_export.patch
- v3-0001-Additional-internal-jsonb-access-functions.patch
- v3-0003-Add-pg_import_rel_stats.patch
- v3-0004-Add-pg_export_stats-pg_import_stats.patch
- v3-0005-Add-system-view-pg_statistic_ext_export.patch
- v3-0006-Create-create_stat_ext_entry-from-fetch_statentri.patch
- v3-0008-Allow-explicit-nulls-in-container-lookups.patch
- v3-0007-Add-pg_import_ext_stats.patch
- v3-0009-Enable-pg_export_stats-pg_import_stats-to-use-ext.patch
On 13/12/2023 17:26, Corey Huinker wrote:
> 4. I don't yet have a complete vision for how these tools will be used
> by pg_upgrade and pg_dump/restore, the places where these will provide
> the biggest win for users.
Some issues here with docs:
func.sgml:28465: parser error : Opening and ending tag mismatch: sect1
line 26479 and sect2
</sect2>
^
Apologies, will fix.
Also, as I remember, we already had some attempts to invent dump/restore
of statistics [1,2]. They were stopped by the problem of type
verification: what if the definition of the type has changed between the
dump and the restore? As I see in the code, when importing statistics you
just check the column name and don't look into the type.
We look up the imported statistics via column name, that is correct.
However, the values in stavalues and mcv and such are stored purely as text, so they must be cast using the input functions for that particular datatype. If that column definition changed, or the underlying input function changed, the stats import for that particular table would fail. It should be noted, however, that those same input functions were used to bring the data into the table via restore, so it would have already failed at that step. Either way, the structure of the table has effectively changed, so failure to import those statistics would be a good thing.
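As a tiny illustration of that failure mode, suppose a column was numeric at export time but integer at import time; re-reading an exported value through the new type's input function simply errors out, and the import for that table fails with it:

SELECT '1.5'::integer;
-- ERROR:  invalid input syntax for type integer: "1.5"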
[1] Backup and recovery of pg_statistic
https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook
[2] Re: Ideas about a better API for postgres_fdw remote estimates
https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru
I think there is hope for having do_analyze() run a remote query fetching the remote table's exported stats and then storing them locally, possibly after some modification, and that would save us from having to sample a remote table.
Hi, I finally had time to look at the last version of the patch, so here's a couple thoughts and questions in somewhat random order. Please take this as a bit of a brainstorming and push back if you disagree some of my comments. In general, I like the goal of this patch - not having statistics is a common issue after an upgrade, and people sometimes don't even realize they need to run analyze. So, it's definitely worth improving. I'm not entirely sure about the other use case - allowing people to tweak optimizer statistics on a running cluster, to see what would be the plan in that case. Or more precisely - I agree that would be an interesting and useful feature, but maybe the interface should not be the same as for the binary upgrade use case? interfaces ---------- When I thought about the ability to dump/load statistics in the past, I usually envisioned some sort of DDL that would do the export and import. So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands, or something like that, and that'd do all the work. This would mean stats are "first-class citizens" and it'd be fairly straightforward to add this into pg_dump, for example. Or at least I think so ... Alternatively we could have the usual "functional" interface, with a functions to export/import statistics, replacing the DDL commands. Unfortunately, none of this works for the pg_upgrade use case, because existing cluster versions would not support this new interface, of course. That's a significant flaw, as it'd make this useful only for upgrades of future versions. So I think for the pg_upgrade use case, we don't have much choice other than using "custom" export through a view, which is what the patch does. However, for the other use case (tweaking optimizer stats) this is not really an issue - that always happens on the same instance, so no issue with not having the "export" function and so on. I'd bet there are more convenient ways to do this than using the export view. I'm sure it could share a lot of the infrastructure, ofc. I suggest we focus on the pg_upgrade use case for now. In particular, I think we really need to find a good way to integrate this into pg_upgrade. I'm not against having custom CLI commands, but it's still a manual thing - I wonder if we could extend pg_dump to dump stats, or make it built-in into pg_upgrade in some way (possibly disabled by default, or something like that). JSON format ----------- As for the JSON format, I wonder if we need that at all? Isn't that an unnecessary layer of indirection? Couldn't we simply dump pg_statistic and pg_statistic_ext_data in CSV, or something like that? The amount of new JSONB code seems to be very small, so it's OK I guess. I'm still a bit unsure about the "right" JSON schema. I find it a bit inconvenient that the JSON objects mimic the pg_statistic schema very closely. In particular, it has one array for stakind values, another array for stavalues, array for stanumbers etc. I understand generating this JSON in SQL is fairly straightforward, and for the pg_upgrade use case it's probably OK. But my concern is it's not very convenient for the "manual tweaking" use case, because the "related" fields are scattered in different parts of the JSON. 
That's pretty much why I envisioned a format "grouping" the arrays for a particular type of statistics (MCV, histogram) into the same object, as for example in { "mcv" : {"values" : [...], "frequencies" : [...]} "histogram" : {"bounds" : [...]} } But that's probably much harder to generate from plain SQL (at least I think so, I haven't tried). data missing in the export -------------------------- I think the data needs to include more information. Maybe not for the pg_upgrade use case, where it's mostly guaranteed not to change, but for the "manual tweak" use case it can change. And I don't think we want two different formats - we want one, working for everything. Consider for example about the staopN and stacollN fields - if we clone the stats from one table to the other, and the table uses different collations, will that still work? Similarly, I think we should include the type of each column, because it's absolutely not guaranteed the import function will fail if the type changes. For example, if the type changes from integer to text, that will work, but the ordering will absolutely not be the same. And so on. For the extended statistics export, I think we need to include also the attribute names and expressions, because these can be different between the statistics. And not only that - the statistics values reference the attributes by positions, but if the two tables have the attributes in a different order (when ordered by attnum), that will break stuff. more strict checks ------------------ I think the code should be a bit more "defensive" when importing stuff, and do at least some sanity checks. For the pg_upgrade use case this should be mostly non-issue (except for maybe helping to detect bugs earlier), but for the "manual tweak" use case it's much more important. By this I mean checks like: * making sure the frequencies in MCV lists are not obviously wrong (outside [0,1], sum exceeding > 1.0, etc.) * cross-checking that stanumbers/stavalues make sense (e.g. that MCV has both arrays while histogram has only stavalues, that the arrays have the same length for MCV, etc.) * checking there are no duplicate stakind values (e.g. two MCV lists) This is another reason why I was thinking the current JSON format may be a bit inconvenient, because it loads the fields separately, making the checks harder. But I guess it could be done after loading everything, as a separate phase. Not sure if all the checks need to be regular elog(ERROR), perhaps some could/should be just asserts. minor questions --------------- 1) Should the views be called pg_statistic_export or pg_stats_export? Perhaps pg_stats_export is better, because the format is meant to be human-readable (rather than 100% internal). 2) It's not very clear what "non-transactional update" of pg_class fields actually means. Does that mean we update the fields in-place, can't be rolled back, is not subject to MVCC or what? I suspect users won't know unless the docs say that explicitly. 3) The "statistics.c" code should really document the JSON structure. Or maybe if we plan to use this for other purposes, it should be documented in the SGML? Actually, this means that the use supported cases determine if the expected JSON structure is part of the API. For pg_upgrade we could keep it as "internal" and maybe change it as needed, but for "manual tweak" it'd become part of the public API. 4) Why do we need the separate "replaced" flags in import_stakinds? Can it happen that collreplaces/opreplaces differ from kindreplaces? 
5) What happens in we import statistics for a table that already has some statistics? Will this discard the existing statistics, or will this merge them somehow? (I think we should always discard the existing stats, and keep only the new version.) 6) What happens if we import extended stats with mismatching definition? For example, what if the "new" statistics object does not have "mcv" enabled, but the imported data do include MCV? What if the statistics do have the same number of "dimensions" but not the same number of columns and expressions? 7) The func.sgml additions in 0007 seems a bit strange, particularly the first sentence of the paragraph. 8) While experimenting with the patch, I noticed this: create table t (a int, b int, c text); create statistics s on a, b, c, (a+b), (a-b) from t; create table t2 (a text, b text, c text); create statistics s2 on a, c from t2; select pg_import_ext_stats( (select oid from pg_statistic_ext where stxname = 's2'), (select server_version_num from pg_statistic_ext_export where ext_stats_name = 's'), (select stats from pg_statistic_ext_export where ext_stats_name = 's')); WARNING: statistics import has 5 mcv dimensions, but the expects 2. Skipping excess dimensions. ERROR: statistics import has 5 mcv dimensions, but the expects 2. Skipping excess dimensions. I guess we should not trigger WARNING+ERROR with the same message. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 26, 2023 at 02:18:56AM +0100, Tomas Vondra wrote: > interfaces > ---------- > > When I thought about the ability to dump/load statistics in the past, I > usually envisioned some sort of DDL that would do the export and import. > So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands, > or something like that, and that'd do all the work. This would mean > stats are "first-class citizens" and it'd be fairly straightforward to > add this into pg_dump, for example. Or at least I think so ... > > Alternatively we could have the usual "functional" interface, with a > functions to export/import statistics, replacing the DDL commands. > > Unfortunately, none of this works for the pg_upgrade use case, because > existing cluster versions would not support this new interface, of > course. That's a significant flaw, as it'd make this useful only for > upgrades of future versions. > > So I think for the pg_upgrade use case, we don't have much choice other > than using "custom" export through a view, which is what the patch does. > > However, for the other use case (tweaking optimizer stats) this is not > really an issue - that always happens on the same instance, so no issue > with not having the "export" function and so on. I'd bet there are more > convenient ways to do this than using the export view. I'm sure it could > share a lot of the infrastructure, ofc. > > I suggest we focus on the pg_upgrade use case for now. In particular, I > think we really need to find a good way to integrate this into > pg_upgrade. I'm not against having custom CLI commands, but it's still a > manual thing - I wonder if we could extend pg_dump to dump stats, or > make it built-in into pg_upgrade in some way (possibly disabled by > default, or something like that). I have some thoughts on this too. I understand the desire to add something that can be used for upgrades _to_ PG 17, but I am concerned that this will give us a cumbersome API that will hamper future development. I think we should develop the API we want, regardless of how useful it is for upgrades _to_ PG 17, and then figure out what short-term hacks we can add to get it working for upgrades _to_ PG 17; these hacks can eventually be removed. Even if they can't be removed, they are export-only and we can continue developing the import SQL command cleanly, and I think import is going to need the most long-term maintenance. I think we need a robust API to handle two cases: * changes in how we store statistics * changes in how how data type values are represented in the statistics We have had such changes in the past, and I think these two issues are what have prevented import/export of statistics up to this point. Developing an API that doesn't cleanly handle these will cause long-term pain. In summary, I think we need an SQL-level command for this. I think we need to embed the Postgres export version number into the statistics export file (maybe in the COPY header), and then load the file via COPY internally (not JSON) into a temporary table that we know matches the exported Postgres version. We then need to use SQL to make any adjustments to it before loading it into pg_statistic. Doing that internally in JSON just isn't efficient. If people want JSON for such cases, I suggest we add a JSON format to COPY. I think we can then look at pg_upgrade to see if we can simulate the export action which can use the statistics import SQL command. 
-- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Bruce Momjian <bruce@momjian.us> writes: > I think we need a robust API to handle two cases: > * changes in how we store statistics > * changes in how how data type values are represented in the statistics > We have had such changes in the past, and I think these two issues are > what have prevented import/export of statistics up to this point. > Developing an API that doesn't cleanly handle these will cause long-term > pain. Agreed. > In summary, I think we need an SQL-level command for this. I think a SQL command is an actively bad idea. It'll just add development and maintenance overhead that we don't need. When I worked on this topic years ago at Salesforce, I had things set up with simple functions, which pg_dump would invoke by writing more or less SELECT pg_catalog.load_statistics(....); This has a number of advantages, not least of which is that an extension could plausibly add compatible functions to older versions. The trick, as you say, is to figure out what the argument lists ought to be. Unfortunately I recall few details of what I wrote for Salesforce, but I think I had it broken down in a way where there was a separate function call occurring for each pg_statistic "slot", thus roughly load_statistics(table regclass, attname text, stakind int, stavalue ...); I might have had a separate load_statistics_xxx function for each stakind, which would ease the issue of deciding what the datatype of "stavalue" is. As mentioned already, we'd also need some sort of version identifier, and we'd expect the load_statistics() functions to be able to transform the data if the old version used a different representation. I agree with the idea that an explicit representation of the source table attribute's type would be wise, too. regards, tom lane
On 12/26/23 20:19, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: >> I think we need a robust API to handle two cases: > >> * changes in how we store statistics >> * changes in how how data type values are represented in the statistics > >> We have had such changes in the past, and I think these two issues are >> what have prevented import/export of statistics up to this point. >> Developing an API that doesn't cleanly handle these will cause long-term >> pain. > > Agreed. > I agree the format is important - we don't want to end up with a format that's cumbersome or inconvenient to use. But I don't think the proposed format is somewhat bad in those respects - it mostly reflects how we store statistics and if I was designing a format for humans, it might look a bit differently. But that's not the goal here, IMHO. I don't quite understand the two cases above. Why should this affect how we store statistics? Surely, making the statistics easy to use for the optimizer is much more important than occasional export/import. >> In summary, I think we need an SQL-level command for this. > > I think a SQL command is an actively bad idea. It'll just add development > and maintenance overhead that we don't need. When I worked on this topic > years ago at Salesforce, I had things set up with simple functions, which > pg_dump would invoke by writing more or less > > SELECT pg_catalog.load_statistics(....); > > This has a number of advantages, not least of which is that an extension > could plausibly add compatible functions to older versions. The trick, > as you say, is to figure out what the argument lists ought to be. > Unfortunately I recall few details of what I wrote for Salesforce, > but I think I had it broken down in a way where there was a separate > function call occurring for each pg_statistic "slot", thus roughly > > load_statistics(table regclass, attname text, stakind int, stavalue ...); > > I might have had a separate load_statistics_xxx function for each > stakind, which would ease the issue of deciding what the datatype > of "stavalue" is. As mentioned already, we'd also need some sort of > version identifier, and we'd expect the load_statistics() functions > to be able to transform the data if the old version used a different > representation. I agree with the idea that an explicit representation > of the source table attribute's type would be wise, too. > Yeah, this is pretty much what I meant by "functional" interface. But if I said maybe the format implemented by the patch is maybe too close to how we store the statistics, then this has exactly the same issue. And it has other issues too, I think - it breaks down the stats into multiple function calls, so ensuring the sanity/correctness of whole sets of statistics gets much harder, I think. I'm not sure about the extension idea. Yes, we could have an extension providing such functions, but do we have any precedent of making pg_upgrade dependent on an external extension? I'd much rather have something built-in that just works, especially if we intend to make it the default behavior (which I think should be our aim here). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 27, 2023 at 01:08:47PM +0100, Tomas Vondra wrote: > On 12/26/23 20:19, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > >> I think we need a robust API to handle two cases: > > > >> * changes in how we store statistics > >> * changes in how how data type values are represented in the statistics > > > >> We have had such changes in the past, and I think these two issues are > >> what have prevented import/export of statistics up to this point. > >> Developing an API that doesn't cleanly handle these will cause long-term > >> pain. > > > > Agreed. > > > > I agree the format is important - we don't want to end up with a format > that's cumbersome or inconvenient to use. But I don't think the proposed > format is somewhat bad in those respects - it mostly reflects how we > store statistics and if I was designing a format for humans, it might > look a bit differently. But that's not the goal here, IMHO. > > I don't quite understand the two cases above. Why should this affect how > we store statistics? Surely, making the statistics easy to use for the > optimizer is much more important than occasional export/import. The two items above were to focus on getting a solution that can easily handle future statistics storage changes. I figured we would want to manipulate the data as a table internally so I am confused why we would export JSON instead of a COPY format. I didn't think we were changing how we internall store or use the statistics. > >> In summary, I think we need an SQL-level command for this. > > > > I think a SQL command is an actively bad idea. It'll just add development > > and maintenance overhead that we don't need. When I worked on this topic > > years ago at Salesforce, I had things set up with simple functions, which > > pg_dump would invoke by writing more or less > > > > SELECT pg_catalog.load_statistics(....); > > > > This has a number of advantages, not least of which is that an extension > > could plausibly add compatible functions to older versions. The trick, > > as you say, is to figure out what the argument lists ought to be. > > Unfortunately I recall few details of what I wrote for Salesforce, > > but I think I had it broken down in a way where there was a separate > > function call occurring for each pg_statistic "slot", thus roughly > > > > load_statistics(table regclass, attname text, stakind int, stavalue ...); > > > > I might have had a separate load_statistics_xxx function for each > > stakind, which would ease the issue of deciding what the datatype > > of "stavalue" is. As mentioned already, we'd also need some sort of > > version identifier, and we'd expect the load_statistics() functions > > to be able to transform the data if the old version used a different > > representation. I agree with the idea that an explicit representation > > of the source table attribute's type would be wise, too. > > > > Yeah, this is pretty much what I meant by "functional" interface. But if > I said maybe the format implemented by the patch is maybe too close to > how we store the statistics, then this has exactly the same issue. And > it has other issues too, I think - it breaks down the stats into > multiple function calls, so ensuring the sanity/correctness of whole > sets of statistics gets much harder, I think. I was suggesting an SQL command because this feature is going to need a lot of options and do a lot of different things, I am afraid, and a single function might be too complex to manage. > I'm not sure about the extension idea. 
Yes, we could have an extension > providing such functions, but do we have any precedent of making > pg_upgrade dependent on an external extension? I'd much rather have > something built-in that just works, especially if we intend to make it > the default behavior (which I think should be our aim here). Uh, an extension seems nice to allow people in back branches to install it, but not for normal usage. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Hi,
I finally had time to look at the last version of the patch, so here are a
couple of thoughts and questions in somewhat random order. Please take this
as a bit of brainstorming, and push back if you disagree with some of my
comments.
In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.
I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?
interfaces
----------
When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...
Alternatively we could have the usual "functional" interface, with
functions to export/import statistics, replacing the DDL commands.
Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.
This was the reason I settled on the interface that I did: while we can create whatever interface we want for importing the statistics, we would need to be able to extract stats from databases using only the facilities available in those same databases, and then store that in a medium that could be conveyed across databases, either by text files or by saving them off in a side table prior to upgrade. JSONB met the criteria.
So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.
However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.
So, there is a third use case - foreign data wrappers. When analyzing a foreign table, at least one in the postgres_fdw family of foreign servers, we should be able to send a query specific to the version and dialect of that server, get back the JSONB, and import those results. That use case may be more tangible to you than the tweak/tuning case.
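In that scenario, the local side could in principle run something like the following against the remote server instead of sampling rows. The view and column names here mirror the current patch's export view and are assumptions about an interface that is still moving:

SELECT server_version_num, num_tuples, num_pages, column_stats
  FROM pg_statistic_export
 WHERE schemaname = 'public'
   AND relname = 'remote_table';  -- placeholder for the foreign table's remote name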
JSON format
-----------
As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.
I see a few problems with dumping pg_statistic[_ext_data]. The first is that the importer now has to understand all of the past formats of those two tables. The next is that the tables are chock full of Oids that don't necessarily carry forward. I could see us having a text-ified version of those two tables, but we'd need that for all previous iterations of those table formats. Instead, I put the burden on the stats export to de-oid the data and make it *_in() function friendly.
That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in
{
"mcv" : {"values" : [...], "frequencies" : [...]}
"histogram" : {"bounds" : [...]}
}
I agree that would be a lot more readable, and probably a lot more debuggable. But I went into this unsure if there could be more than one stats slot of a given kind per table. Knowing that they must be unique helps.
But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).
I think it would be harder, but far from impossible.
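For instance, a rough sketch along these lines (one column, two slot kinds only; the anyarray-to-text[] double cast is the usual trick for reading pg_stats arrays, and the table/column names are placeholders) seems workable with the existing JSONB functions:

SELECT jsonb_build_object(
         'mcv', jsonb_build_object(
                  'values',      to_jsonb(most_common_vals::text::text[]),
                  'frequencies', to_jsonb(most_common_freqs)),
         'histogram', jsonb_build_object(
                  'bounds',      to_jsonb(histogram_bounds::text::text[])))
  FROM pg_stats
 WHERE schemaname = 'public' AND tablename = 't' AND attname = 'a';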
data missing in the export
--------------------------
I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.
I"m not against this at all, and I started out doing that, but the qualified names of operators got _ugly_, and I quickly realized that what I was generating wouldn't matter, either the input data would make sense for the attribute's stats or it would fail trying.
Consider for example about the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.
I can see including the type of the column, that's a lot cleaner than the operator names for sure, and I can see us rejecting stats or sections of stats in certain situations. Like in your example, if the collation changed, then reject all "<" op stats but keep the "=" ones.
For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.
Correct me if I'm wrong, but I thought expression parse trees change _a lot_ from version to version?
Attribute reordering is a definite vulnerability of the current implementation, so an attribute name export might be a way to mitigate that.
* making sure the frequencies in MCV lists are not obviously wrong
(outside [0,1], sum exceeding > 1.0, etc.)
+1
* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
both arrays while histogram has only stavalues, that the arrays have
the same length for MCV, etc.)
To this end, there's an edge-case hack in the code where I have to derive the array elemtype. I had thought that examine_attribute() or std_typanalyze() was going to do that for me, but it didn't. Very much want your input there.
* checking there are no duplicate stakind values (e.g. two MCV lists)
Per previous comment, it's good to learn these restrictions.
Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.
For this first pass, all errors were one-size-fits-all, save for the WARNING vs ERROR distinction.
minor questions
---------------
1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).
I have no opinion on what the best name would be, and will go with consensus.
2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.
Correct. Cannot be rolled back, not subject to MVCC.
3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?
I agree, but I also didn't expect the format to survive first contact with reviewers, so I held back.
4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?
That was initially done to maximize the amount of code that could be copied from do_analyze(). In retrospect, I like how extended statistics just deletes all the pg_statistic_ext_data rows and replaces them and I would like to do the same for pg_statistic before this is all done.
5) What happens in we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)
In the case of pg_statistic_ext_data, the stats are thrown out and replaced by the imported ones.
6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?
As mentioned already, we'd also need some sort of
version identifier, and we'd expect the load_statistics() functions
to be able to transform the data if the old version used a different
representation. I agree with the idea that an explicit representation
of the source table attribute's type would be wise, too.
There is a version identifier currently (its own column not embedded in the JSON), but I discovered that I was able to put the burden on the export queries to spackle-over the changes in the table structures over time. Still, I knew that we'd need the version number in there eventually.
Yeah, this is pretty much what I meant by "functional" interface. But if
I said maybe the format implemented by the patch is maybe too close to
how we store the statistics, then this has exactly the same issue. And
it has other issues too, I think - it breaks down the stats into
multiple function calls, so ensuring the sanity/correctness of whole
sets of statistics gets much harder, I think.
Export functions was my original plan, for simplicity, maintenance, etc, but it seemed like I'd be adding quite a few functions, so the one view made more sense for an initial version. Also, I knew that pg_dump or some other stats exporter would have to inline the guts of those functions into queries for older versions, and adapting a view definition seemed more straightforward for the reader than function definitions.
Corey Huinker <corey.huinker@gmail.com> writes: > Export functions was my original plan, for simplicity, maintenance, etc, > but it seemed like I'd be adding quite a few functions, so the one view > made more sense for an initial version. Also, I knew that pg_dump or some > other stats exporter would have to inline the guts of those functions into > queries for older versions, and adapting a view definition seemed more > straightforward for the reader than function definitions. Hmm, I'm not sure we are talking about the same thing at all. What I am proposing is *import* functions. I didn't say anything about how pg_dump obtains the data it prints; however, I would advocate that we keep that part as simple as possible. You cannot expect export functionality to know the requirements of future server versions, so I don't think it's useful to put much intelligence there. So I think pg_dump should produce a pretty literal representation of what it finds in the source server's catalog, and then rely on the import functions in the destination server to make sense of that and do whatever slicing-n-dicing is required. That being the case, I don't see a lot of value in a view -- especially not given the requirement to dump from older server versions. (Conceivably we could just say that we won't dump stats from server versions predating the introduction of this feature, but that's hardly a restriction that supports doing this via a view.) regards, tom lane
On Wed, Dec 27, 2023 at 09:41:31PM -0500, Corey Huinker wrote: > When I thought about the ability to dump/load statistics in the past, I > usually envisioned some sort of DDL that would do the export and import. > So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands, > or something like that, and that'd do all the work. This would mean > stats are "first-class citizens" and it'd be fairly straightforward to > add this into pg_dump, for example. Or at least I think so ... > > Alternatively we could have the usual "functional" interface, with a > functions to export/import statistics, replacing the DDL commands. > > Unfortunately, none of this works for the pg_upgrade use case, because > existing cluster versions would not support this new interface, of > course. That's a significant flaw, as it'd make this useful only for > upgrades of future versions. > > > This was the reason I settled on the interface that I did: while we can create > whatever interface we want for importing the statistics, we would need to be > able to extract stats from databases using only the facilities available in > those same databases, and then store that in a medium that could be conveyed > across databases, either by text files or by saving them off in a side table > prior to upgrade. JSONB met the criteria. Uh, it wouldn't be crazy to add this capability to pg_upgrade/pg_dump in a minor version upgrade if it wasn't enabled by default, and if we were very careful. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Corey Huinker <corey.huinker@gmail.com> writes:
> Export functions was my original plan, for simplicity, maintenance, etc,
> but it seemed like I'd be adding quite a few functions, so the one view
> made more sense for an initial version. Also, I knew that pg_dump or some
> other stats exporter would have to inline the guts of those functions into
> queries for older versions, and adapting a view definition seemed more
> straightforward for the reader than function definitions.
Hmm, I'm not sure we are talking about the same thing at all.
Right, I was conflating two things.
What I am proposing is *import* functions. I didn't say anything about
how pg_dump obtains the data it prints; however, I would advocate that
we keep that part as simple as possible. You cannot expect export
functionality to know the requirements of future server versions,
so I don't think it's useful to put much intelligence there.
So I think pg_dump should produce a pretty literal representation of
what it finds in the source server's catalog, and then rely on the
import functions in the destination server to make sense of that
and do whatever slicing-n-dicing is required.
Obviously it can't be purely literal, as we have to replace the oid values with whatever text representation we feel helps us carry forward. In addition, we're setting the number of tuples and number of pages directly in pg_class, and doing so non-transactionally just like ANALYZE does. We could separate that out into its own import function, but then we're locking every relation twice, once for the tuples/pages and once again for the pg_statistic import.
My current line of thinking was that the stats import call, if enabled, would immediately follow the CREATE statement of the object itself, but that requires us to have everything we need to know for the import passed into the import function, so we'd be needing a way to serialize _that_. If you're thinking that we have one big bulk stats import, that might work, but it also means that we're less tolerant of failures in the import step.
On Thu, Dec 28, 2023 at 12:28:06PM -0500, Corey Huinker wrote: > What I am proposing is *import* functions. I didn't say anything about > how pg_dump obtains the data it prints; however, I would advocate that > we keep that part as simple as possible. You cannot expect export > functionality to know the requirements of future server versions, > so I don't think it's useful to put much intelligence there. > > True, but presumably you'd be using the pg_dump/pg_upgrade of that future > version to do the exporting, so the export format would always be tailored to > the importer's needs. I think the question is whether we will have the export functionality in the old cluster, or if it will be queries run by pg_dump and therefore also run by pg_upgrade calling pg_dump. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On 12/13/23 11:26, Corey Huinker wrote: > Yeah, that was the simplest output function possible, it didn't seem > > worth it to implement something more advanced. pg_mcv_list_items() is > more convenient for most needs, but it's quite far from the on-disk > representation. > > > I was able to make it work. > > > > That's actually a good question - how closely should the exported data > be to the on-disk format? I'd say we should keep it abstract, not tied > to the details of the on-disk format (which might easily change between > versions). > > > For the most part, I chose the exported data json types and formats in a > way that was the most accommodating to cstring input functions. So, > while so many of the statistic values are obviously only ever > integers/floats, those get stored as a numeric data type which lacks > direct numeric->int/float4/float8 functions (though we could certainly > create them, and I'm not against that), casting them to text lets us > leverage pg_strtoint16, etc. > > > > I'm a bit confused about the JSON schema used in pg_statistic_export > view, though. It simply serializes stakinds, stavalues, stanumbers into > arrays ... which works, but why not to use the JSON nesting? I mean, > there could be a nested document for histogram, MCV, ... with just the > correct fields. > > { > ... > histogram : { stavalues: [...] }, > mcv : { stavalues: [...], stanumbers: [...] }, > ... > } > > > That's a very good question. I went with this format because it was > fairly straightforward to code in SQL using existing JSON/JSONB > functions, and that's what we will need if we want to export statistics > on any server currently in existence. I'm certainly not locked in with > the current format, and if it can be shown how to transform the data > into a superior format, I'd happily do so. > > and so on. Also, what does TRIVIAL stand for? > > > It's currently serving double-duty for "there are no stats in this slot" > and the situations where the stats computation could draw no conclusions > about the data. > > Attached is v3 of this patch. Key features are: > > * Handles regular pg_statistic stats for any relation type. > * Handles extended statistics. > * Export views pg_statistic_export and pg_statistic_ext_export to allow > inspection of existing stats and saving those values for later use. > * Import functions pg_import_rel_stats() and pg_import_ext_stats() which > take Oids as input. This is intentional to allow stats from one object > to be imported into another object. > * User scripts pg_export_stats and pg_import stats, which offer a > primitive way to serialize all the statistics of one database and import > them into another. > * Has regression test coverage for both with a variety of data types. > * Passes my own manual test of extracting all of the stats from a v15 > version of the popular "dvdrental" example database, as well as some > additional extended statistics objects, and importing them into a > development database. > * Import operations never touch the heap of any relation outside of > pg_catalog. As such, this should be significantly faster than even the > most cursory analyze operation, and therefore should be useful in > upgrade situations, allowing the database to work with "good enough" > stats more quickly, while still allowing for regular autovacuum to > recalculate the stats "for real" at some later point. > > The relation statistics code was adapted from similar features in > analyze.c, but is now done in a query context. 
As before, the > rowcount/pagecount values are updated on pg_class in a non-transactional > fashion to avoid table bloat, while the updates to pg_statistic are > pg_statistic_ext_data are done transactionally. > > The existing statistics _store() functions were leveraged wherever > practical, so much so that the extended statistics import is mostly just > adapting the existing _build() functions into _import() functions which > pull their values from JSON rather than computing the statistics. > > Current concerns are: > > 1. I had to code a special-case exception for MCELEM stats on array data > types, so that the array_in() call uses the element type rather than the > array type. I had assumed that the existing exmaine_attribute() > functions would have properly derived the typoid for that column, but it > appears to not be the case, and I'm clearly missing how the existing > code gets it right. Hmm, after looking at this, I'm not sure it's such an ugly hack ... The way this works for ANALYZE is that examine_attribute() eventually calls the typanalyze function: if (OidIsValid(stats->attrtype->typanalyze)) ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze, PointerGetDatum(stats))); which for arrays is array_typanalyze, and this sets stats->extra_data to ArrayAnalyzeExtraData with all the interesting info about the array element type, and then also std_extra_data with info about the array type itself. stats -> extra_data -> std_extra_data compute_array_stats then "restores" std_extra_data to compute standard stats for the whole array, and then uses the ArrayAnalyzeExtraData to calculate stats for the elements. It's not exactly pretty, because there are global variables and so on. And examine_rel_attribute() does the same thing - calls typanalyze, so if I break after it returns, I see this for int[] column: (gdb) p * (ArrayAnalyzeExtraData *) stat->extra_data $1 = {type_id = 23, eq_opr = 96, coll_id = 0, typbyval = true, typlen = 4, typalign = 105 'i', cmp = 0x2e57920, hash = 0x2e57950, std_compute_stats = 0x6681b8 <compute_scalar_stats>, std_extra_data = 0x2efe670} I think the "problem" will be how to use this in import_stavalues(). You can't just do this for any array type, I think. I could create an array type (with ELEMENT=X) but with a custom analyze function, in which case the extra_data may be something entirely different. I suppose the correct solution would be to add an "import" function into the pg_type catalog (next to typanalyze). Or maybe it'd be enough to set it from the typanalyze? After all, that's what sets compute_stats. But maybe it's enough to just do what you did - if we get an MCELEM slot, can it ever contain anything else than array of elements of the attribute array type? I'd bet that'd cause all sorts of issues, no? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?
Thanks for the explanation of why it wasn't working for me. Knowing that the case of MCELEM + is-array-type is the only case where we'd need to do that puts me at ease.
On 12/29/23 17:27, Corey Huinker wrote:
> But maybe it's enough to just do what you did - if we get an MCELEM
> slot, can it ever contain anything else than array of elements of the
> attribute array type? I'd bet that'd cause all sorts of issues, no?
>
> Thanks for the explanation of why it wasn't working for me. Knowing that
> the case of MCELEM + is-array-type is the only case where we'd need to
> do that puts me at ease.

But I didn't claim MCELEM is the only slot where this might be an issue.
I merely asked if a MCELEM slot can ever contain an array with element
type different from the "original" attribute.

After thinking about this a bit more, and doing a couple experiments with
a trivial custom data type, I think this is true:

1) MCELEM slots for "real" array types are OK

I don't think we allow "real" arrays created by users directly, all
arrays are created implicitly by the system. Those types always have
array_typanalyze, which guarantees MCELEM has the correct element type.

I haven't found a way to either inject my custom array type or alter the
typanalyze to some custom function. So I think this is OK.

2) I'm not sure we can extend this to regular data types / other slots

For example, I think I can implement a data type with custom typanalyze
function (and custom compute_stats function) that fills slots with some
other / strange stuff. For example I might build MCV with hashes of the
original data, a CountMin sketch, or something like that.

Yes, I don't think people do that often, but as long as the type also
implements custom selectivity functions for the operators, I think this
would work.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538
Kind Regards,
Peter Smith.
Attached is v4 of the statistics export/import patch.
This version has been refactored to match the design feedback received previously.
The system views are gone. These were mostly there to serve as a baseline for what an export query would look like. That role is temporarily reassigned to pg_export_stats.c, but hopefully they will be integrated into pg_dump in the next version. The regression test also contains the version of each query suitable for the current server version.
The export format is far closer to the raw format of pg_statistic and pg_statistic_ext_data, respectively. This format involves exporting oid values for types, collations, operators, and attributes - values which are specific to the server they were created on. To make sense of those values, a subset of the columns of pg_type, pg_attribute, pg_collation, and pg_operator are exported as well, which allows pg_import_rel_stats() and pg_import_ext_stats() to reconstitute the data structure as it existed on the old server, and adapt it to the modern structure and local schema objects.
pg_import_rel_stats matches up local columns with the exported stats by column name, not attnum. This allows for stats to be imported when columns have been dropped, added, or reordered.
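To illustrate the by-name matching, a hypothetical sketch only - the table, the psql variable holding the exported JSON, and the trailing option flags are all made up:

-- old server had foo(id int, obsolete_col text, label text); the new server
-- dropped obsolete_col and reordered the rest, but stats still land by name
CREATE TABLE public.foo (label text, id integer);
SELECT pg_import_rel_stats('public.foo'::regclass, :'foo_stats'::jsonb, ...);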
pg_import_ext_stats can also handle column reordering, though it currently would get confused by changes in expressions that maintain the same result data type. I'm not yet brave enough to handle importing nodetrees, nor do I think it's wise to try. I think we'd be better off validating that the destination extended stats object is identical in structure, and to fail the import of that one object if it isn't perfect.
Export formats go back to v10.
Attachment
Hi, I took a quick look at the v4 patches. I haven't done much testing yet, so only some basic review. 0001 - The SGML docs for pg_import_rel_stats may need some changes. It starts with description of what gets overwritten (non-)transactionally (which gets repeated twice), but that seems more like an implementation detail. But it does not really say which pg_class fields get updated. Then it speculates about the possible use case (pg_upgrade). I think it'd be better to focus on the overall goal of updating statistics, explain what gets updated/how, and only then maybe mention the pg_upgrade use case. Also, it says "statistics are replaced" but it's quite clear if that applies only to matching statistics or if all stats are deleted first and then the new stuff is inserted. (FWIW remove_pg_statistics clearly deletes all pre-existing stats). - import_pg_statistics: I somewhat dislike that we're passing arguments as datum[] array - it's hard to say what the elements are expected to be, etc. Maybe we should expand this, to make it clear. How do we even know the array is large enough? - I don't quite understand why we need examine_rel_attribute. It sets a lot of fields in the VacAttrStats struct, but then we only use attrtypid and attrtypmod from it - so why bother and not simply load just these two fields? Or maybe I miss something. - examine_rel_attribute can return NULL, but get_attrinfo does not check for NULL and just dereferences the pointer. Surely that can lead to segfaults? - validate_no_duplicates and the other validate functions would deserve a better docs, explaining what exactly is checked (it took me a while to realize we check just for duplicates), what the parameters do etc. - Do we want to make the validate_ functions part of the public API? I realize we want to use them from multiple places (regular and extended stats), but maybe it'd be better to have an "internal" header file, just like we have extended_stats_internal? - I'm not sure we do "\set debug f" elsewhere. It took me a while to realize why the query outputs are empty ... 0002 - I'd rename create_stat_ext_entry to statext_create_entry. - Do we even want to include OIDs from the source server? Why not to just have object names and resolve those? Seems safer - if the target server has the OID allocated to a different object, that could lead to confusing / hard to detect issues. - What happens if we import statistics which includes data for extended statistics object which does not exist on the target machine? - pg_import_ext_stats seems to not use require_match_oids - bug? 0003 - no SGML docs for the new tools? - The help() seems to be wrong / copied from "clusterdb" or something like that, right? On 2/2/24 09:37, Corey Huinker wrote: > (hit send before attaching patches, reposting message as well) > > Attached is v4 of the statistics export/import patch. > > This version has been refactored to match the design feedback received > previously. > > The system views are gone. These were mostly there to serve as a baseline > for what an export query would look like. That role is temporarily > reassigned to pg_export_stats.c, but hopefully they will be integrated into > pg_dump in the next version. The regression test also contains the version > of each query suitable for the current server version. > OK > The export format is far closer to the raw format of pg_statistic and > pg_statistic_ext_data, respectively. 
This format involves exporting oid > values for types, collations, operators, and attributes - values which are > specific to the server they were created on. To make sense of those values, > a subset of the columns of pg_type, pg_attribute, pg_collation, and > pg_operator are exported as well, which allows pg_import_rel_stats() and > pg_import_ext_stats() to reconstitute the data structure as it existed on > the old server, and adapt it to the modern structure and local schema > objects. I have no opinion on the proposed format - still JSON, but closer to the original data. Works for me, but I wonder what Tom thinks about it, considering he suggested making it closer to the raw data. > > pg_import_rel_stats matches up local columns with the exported stats by > column name, not attnum. This allows for stats to be imported when columns > have been dropped, added, or reordered. > Makes sense. What will happen if we try to import data for extended statistics (or index) that does not exist on the target server? > pg_import_ext_stats can also handle column reordering, though it currently > would get confused by changes in expressions that maintain the same result > data type. I'm not yet brave enough to handle importing nodetrees, nor do I > think it's wise to try. I think we'd be better off validating that the > destination extended stats object is identical in structure, and to fail > the import of that one object if it isn't perfect. > Yeah, column reordering is something we probably need to handle. The stats order them by attnum, so if we want to allow import on a system where the attributes were dropped/created in a different way, this is necessary. I haven't tested this - is there a regression test for this? I agree expressions are hard. I don't think it's feasible to import nodetree from other server versions, but why don't we simply deparse the expression on the source, and either parse it on the target (and then compare the two nodetrees), or deparse the target too and compare the two deparsed expressions? I suspect the deparsing may produce slightly different results on the two versions (causing false mismatches), but perhaps the deparse on source + parse on target + compare nodetrees would work? Haven't tried, though. > Export formats go back to v10. > Do we even want/need to go beyond 12? All earlier versions are EOL. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Also, it says "statistics are replaced" but it's quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).
All are now deleted first, both in the pg_statistic and pg_statistic_ext_data tables. The previous version was taking a more "replace it if we find a new value" approach, but that's overly complicated, so following the example set by extended statistics seemed best.
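Conceptually, what the import now does per relation looks like this (just an illustration of the ordering - the function goes through the catalog APIs in C rather than SQL, and the relation name is made up):

DELETE FROM pg_catalog.pg_statistic WHERE starelid = 'public.foo'::regclass;
-- ... followed by one insert per imported attribute/inherit combination,
-- and the same delete-then-insert pattern for pg_statistic_ext_data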
- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?
Completely fair. Initially that was done with the expectation that the array would be the same for both regular stats and extended stats, but that was no longer the case.
- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.
I think you're right, we don't need it anymore for regular statistics. We still need it in extended stats because statext_store() takes a subset of the vacattrstats rows as an input.
Which leads to a side issue. We currently have 3 functions: examine_rel_attribute and the two varieties of examine_attribute (one in analyze.c and the other in extended stats). These are highly similar but just different enough that I didn't feel comfortable refactoring them into a one-size-fits-all function, and I was particularly reluctant to modify existing code for the ANALYZE path.
- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?
Good catch, and it highlights how little we need VacAttrStats for regular statistics.
- validate_no_duplicates and the other validate functions would deserve
a better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.
Those functions are in a fairly formative phase - I expect a conversation about what sort of validations we want to do to ensure that the statistics being imported make sense, and under what circumstances we would forego some of those checks.
- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?
I see no need to have them be a part of the public API. Will move.
- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...
That was an experiment that arose out of the difficulty in determining _where_ a difference was when the set-difference checks failed. So far I like it, and I'm hoping it catches on.
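For anyone reading along, the regression-test pattern is roughly this (a sketch, not the actual test file; the relation name is made up):

\set debug f
\if :debug
-- only emitted when debug is flipped to t, to show where a set-difference failed
SELECT * FROM exported_stats_for_foo;
\endif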
0002
- I'd rename create_stat_ext_entry to statext_create_entry.
- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.
The export format is an attempt to export the pg_statistic[_ext_data] for that object as-is, and, as Tom suggested, let the import function do the transformations. We can of course remove them if they truly have no purpose for validation.
- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?
The import function takes an oid of the object (relation or extstat object), and the json payload is supposed to be the stats for ONE corresponding object. Multiple objects of data really don't fit into the json format, and statistics exported for an object that does not exist on the destination system would have no meaningful invocation. I envision the dump file looking like this
CREATE TABLE public.foo (....);
SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>, option flag, option flag);
So a call against a nonexistent object would fail on the regclass cast.
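For example, roughly:

SELECT pg_import_rel_stats('public.does_not_exist'::regclass, <json blob>, ...);
ERROR:  relation "public.does_not_exist" does not exist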
- pg_import_ext_stats seems to not use require_match_oids - bug?
I haven't yet seen a good way to make use of matching oids in extended stats. Checking matching operator/collation oids would make sense, but little else.
0003
- no SGML docs for the new tools?
Correct. I foresee the export tool being folded into pg_dump(), and the import tool going away entirely as psql could handle it.
- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?
Correct, for the reason above.
>
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
>
Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?
One of the parameters to the function is the oid of the object that is the target of the stats. The importer will not seek out objects with matching names and each JSON payload is limited to holding one object, though clearly someone could encapsulate the existing format in a format that has a manifest of objects to import.
> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
>
Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?
2. looks for columns/exprs in the exported json for an attribute with a matching name
4. looks up the type, collation, and operators for the exported attribute.
So we get a situation where there might not be importable stats for an attribute of the destination table, and we'd import nothing for that column. Stats for exported columns with no matching local column would never be referenced.
Yes, there should be a test of this.
I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.
> Export formats go back to v10.
>
Do we even want/need to go beyond 12? All earlier versions are EOL.
True, but we had pg_dump and pg_restore stuff back to 7.x until fairly recently, and a major friction point in getting customers to upgrade their instances off of unsupported versions is the downtime caused by an upgrade, so why wouldn't we make it easier for them?
Leaving the export/import scripts off for the time being, as they haven't changed and the next likely change is to fold them into pg_dump.
Attachment
Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), which address many of the concerns listed earlier.
Leaving the export/import scripts off for the time being, as they haven't changed and the next likely change is to fold them into pg_dump.
v6 posted below.
Changes:
- Rewording of SGML docs.
- removed a fair number of columns from the transformation queries.
- moved stats extraction functions to an fe_utils file stats_export.c that will be used by both pg_export_stats and pg_dump.
- pg_export_stats now generates SQL statements rather than a tsv, and has boolean flags to set the validate and require_match_oids parameters in the calls to pg_import_(rel|ext)_stats.
- pg_import_stats is gone, as importing can now be done with psql.
I'm hoping to get feedback on a few areas.
1. The checks for matching oids. On the one hand, in a binary upgrade situation, we would of course want the oid of the relation to match what was exported, as well as all of the atttypids of the attributes to match the type ids exported, same for collations, etc. However, the binary upgrade is the one place where there are absolutely no middle steps that could have altered either the stats jsons or the source tables. Given that and that oid simply will never match in any situation other than a binary upgrade, it may be best to discard those checks.
2. The checks for relnames matching, and typenames of attributes matching (they are already matched by name, so the column order can change without the import missing a beat) seem so necessary that there shouldn't be an option to enable/disable them. But if that's true, then the initial relation parameter becomes somewhat unnecessary, and anyone using these functions for tuning or FDW purposes could easily transform the JSON using SQL to put in the proper relname.
3. The data integrity validation functions may belong in a separate function rather than being a parameter on the existing import functions.
4. Lastly, pg_dump. Each relation object and extended statistics object will have a statistics import statement. From my limited experience with pg_dump, it seems like we would add an additional Stmt variable (statsStmt) to the TOC entry for each object created, and the restore process would check the value of --with-statistics and in cases where the statistics flag was set AND a stats import statement exists, then execute that stats statement immediately after the creation of the object. This assumes that there is no case where additional attributes are added to a relation after its initial CREATE statement. Indexes are independent relations in this regard.
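For concreteness, the restored stream would then look something like this (object names invented, statistics arguments elided):

CREATE TABLE public.foo (id integer, name text);
SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>, ...);

CREATE INDEX foo_name_idx ON public.foo (name);
SELECT pg_import_rel_stats('public.foo_name_idx'::regclass, <json blob>, ...);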
Attachment
Greetings, * Corey Huinker (corey.huinker@gmail.com) wrote: > On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com> > wrote: > > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), > > which address many of the concerns listed earlier. > > > > Leaving the export/import scripts off for the time being, as they haven't > > changed and the next likely change is to fold them into pg_dump. > v6 posted below. > > Changes: > > - Additional documentation about the overall process. > - Rewording of SGML docs. > - removed a fair number of columns from the transformation queries. > - enabled require_match_oids in extended statistics, but I'm having my > doubts about the value of that. > - moved stats extraction functions to an fe_utils file stats_export.c that > will be used by both pg_export_stats and pg_dump. > - pg_export_stats now generates SQL statements rather than a tsv, and has > boolean flags to set the validate and require_match_oids parameters in the > calls to pg_import_(rel|ext)_stats. > - pg_import_stats is gone, as importing can now be done with psql. Having looked through this thread and discussed a bit with Corey off-line, the approach that Tom laid out up-thread seems like it would make the most sense overall- that is, eliminate the JSON bits and the SPI and instead export the stats data by running queries from the new version of pg_dump/server (in the FDW case) against the old server with the intelligence of how to transform the data into the format needed for the current pg_dump/server to accept, through function calls where the function calls generally map up to the rows/information being updated- a call to update the information in pg_class for each relation and then a call for each attribute to update the information in pg_statistic. Part of this process would include mapping from OIDs/attrnum's to names on the source side and then from those names to the appropriate OIDs/attrnum's on the destination side. As this code would be used by both pg_dump and the postgres_fdw, it seems logical that it would go into the common library. Further, it would make sense to have this code be able to handle multiple major versions for the foreign side, such as how postgres_fdw and pg_dump already do. In terms of working to ensure that newer versions support loading from older dumps (that is, that v18 would be able to load a dump file created by a v17 pg_dump against a v17 server in the face of changes having been made to the statistics system in v18), we could have the functions take a version parameter (to handle cases where the data structure is the same but the contents have to be handled differently), use overloaded functions, or have version-specific names for the functions. I'm also generally supportive of the idea that we, perhaps initially, only support dumping/loading stats with pg_dump when in binary-upgrade mode, which removes our need to be concerned with this (perhaps that would be a good v1 of this feature?) as the version of pg_dump needs to match that of pg_upgrade and the destination server for various other reasons. Including a switch to exclude stats on restore might also be an acceptable answer, or even simply excluding them by default when going between major versions except in binary-upgrade mode. Along those same lines when it comes to a 'v1', I'd say that we may wish to consider excluding extended statistics, which I am fairly confident Corey's heard a number of times previously already but thought I would add my own support for that. 
To the extent that we do want to make extended stats work down the road, we should probably have some pre-patches to flush out the missing _in/_recv functions for those types which don't have them today- and that would include modifying the _out of those types to use names instead of OIDs/attrnums. In thinking about this, I was reviewing specifically pg_dependencies. To the extent that there are people who depend on the current output, I would think that they'd actually appreciate this change. I don't generally feel like we need to be checking that the OIDs between the old server and the new server match- I appreciate that that should be the case in a binary-upgrade situation but it still feels unnecessary and complicated and clutters up the output and the function calls. Overall, I definitely think this is a good project to work on as it's an often, rightfully, complained about issue when it comes to pg_upgrade and the amount of downtime required for it before the upgraded system can be reasonably used again. Thanks, Stephen
Attachment
Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.
Thanks for the excellent summary of our conversation, though I do add that we discussed a problem with per-attribute functions: each function would be acquiring locks on both the relation (so it doesn't go away) and pg_statistic, and that lock thrashing would add up. Whether that overhead is judged significant or not is up for discussion. If it is significant, it makes sense to package up all the attributes into one call, passing in an array of some new pg_statistic-esque special type... the very issue that sent me down the JSON path.
I certainly see the flexibility in having per-attribute functions, but am concerned about non-binary-upgrade situations where the attnums won't line up, and if we're passing them by name then the function has to dig around looking for the right matching attnum, and that's overhead too. In the whole-table approach, we just iterate over the attributes that exist, and find the matching parameter row.
That’s certainly a fair point and my initial reaction (which could certainly be wrong) is that it’s unlikely to be an issue- but also, if you feel you could make it work with an array and passing all the attribute info in with one call, which I suspect would be possible but just a bit more complex to build, then sure, go for it. If it ends up being overly unwieldy then perhaps the per-attribute call would be better and we could perhaps acquire the lock before the function calls..? Doing a check to see if we have already locked it would be cheaper than trying to acquire a new lock, I’m fairly sure.
Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it'll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing.
The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
BEGIN;
LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
SELECT pg_import_pg_statistic('schema.relation', 'description', ...);
...
COMMIT;
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..? Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
>
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.
I think this is a reasonable starting point. If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.
>
> v6 posted below.

Thanks! I had in mind to look at it but it looks like a rebase is needed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it'll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
How well would this simplify to the following:
SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);
Or even just one VALUES for the whole statistics loading?
I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
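For concreteness, a minimal sketch of the combined form suggested above, with the elided columns filled in with made-up names and values (the real parameter list would follow whatever the import function ends up taking):

SELECT pg_import_statistic('schema.relation', attname, stanullfrac, stawidth, stadistinct)
FROM (VALUES ('id',   0.0::float4,  4, -1.0::float4),
             ('name', 0.0::float4, 32, -0.5::float4)
     ) AS relation_stats (attname, stanullfrac, stawidth, stadistinct);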
On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote:
> On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>> Or even just one VALUES for the whole statistics loading?
>
> I don't think we'd want to go beyond one relation at a time as then it can
> be parallelized, we won't be trying to lock a whole bunch of objects at
> once, and any failures would only impact that one relation's stats load.

That also makes sense.

>> I suspect the main issue with combining this into one statement
>> (transaction) is that failure to load one column's statistics implies
>> you'll have to redo all the other statistics (or fail to load the
>> statistics at all), which may be problematic at the scale of thousands
>> of relations with tens of columns each.
>
> I'm pretty skeptical that "stats fail to load and lead to a failed
> transaction" is a likely scenario that we have to spend a lot of effort on.

Agreed on the "don't have to spend a lot of time on it", but I'm not so
sure on the "unlikely" part while the autovacuum daemon is involved,
specifically for non-upgrade pg_restore. I imagine (haven't checked) that
autoanalyze is disabled during pg_upgrade, but pg_restore doesn't do that,
while it would have to be able to restore statistics of a table if it is
included in the dump (and the version matches).

> What are the cases where we would be seeing stats reloads failing where
> it would make sense to re-try on a subset of columns, or just generally,
> if we know that the pg_dump version matches the target server version?

Last time I checked, pg_restore's default is to load data on a row-by-row
basis without --single-transaction or --exit-on-error. Of course,
pg_upgrade uses its own set of flags, but if a user is restoring stats
with pg_restore, I suspect they'd rather have some column's stats loaded
than no stats at all; so I would assume this requires one separate
pg_import_pg_statistic()-transaction for every column.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
Greetings, * Matthias van de Meent (boekewurm+postgres@gmail.com) wrote: > On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote: > > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > >> Or even just one VALUES for the whole statistics loading? > > I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to locka whole bunch of objects at once, and any failures would only impact that one relation’s stats load. > > That also makes sense. Great, thanks. > >> I suspect the main issue with combining this into one statement > >> (transaction) is that failure to load one column's statistics implies > >> you'll have to redo all the other statistics (or fail to load the > >> statistics at all), which may be problematic at the scale of thousands > >> of relations with tens of columns each. > > > > > > I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have tospend a lot of effort on. > > Agreed on the "don't have to spend a lot of time on it", but I'm not > so sure on the "unlikely" part while the autovacuum deamon is > involved, specifically for non-upgrade pg_restore. I imagine (haven't > checked) that autoanalyze is disabled during pg_upgrade, but > pg_restore doesn't do that, while it would have to be able to restore > statistics of a table if it is included in the dump (and the version > matches). Even if autovacuum was running and it kicked off an auto-analyze of the relation at just the time that we were trying to load the stats, there would be appropriate locking happening to keep them from causing an outright ERROR and transaction failure, or if not, that's a lack of locking and should be fixed. With the per-attribute-function-call approach, that could lead to a situation where some stats are from the auto-analyze and some are from the stats being loaded but I'm not sure if that's a big concern or not. For users of this, I would think we'd generally encourage them to disable autovacuum on the tables they're loading as otherwise they'll end up with the stats going back to whatever an auto-analyze ends up finding. That may be fine in some cases, but not in others. A couple questions to think about though: Should pg_dump explicitly ask autovacuum to ignore these tables while we're loading them? Should these functions only perform a load when there aren't any existing stats? Should the latter be an argument to the functions to allow the caller to decide? > > What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset ofcolumns, or just generally, if we know that the pg_dump version matches the target server version? > > Last time I checked, pg_restore's default is to load data on a > row-by-row basis without --single-transaction or --exit-on-error. Of > course, pg_upgrade uses it's own set of flags, but if a user is > restoring stats with pg_restore, I suspect they'd rather have some > column's stats loaded than no stats at all; so I would assume this > requires one separate pg_import_pg_statistic()-transaction for every > column. Having some discussion around that would be useful. Is it better to have a situation where there are stats for some columns but no stats for other columns? There would be a good chance that this would lead to a set of queries that were properly planned out and a set which end up with unexpected and likely poor query plans due to lack of stats. 
Arguably that's better overall, but either way an ANALYZE needs to be done to address the lack of stats for those columns and then that ANALYZE is going to blow away whatever stats got loaded previously anyway and all we did with a partial stats load was maybe have a subset of queries have better plans in the interim, after having expended the cost to try and individually load the stats and dealing with the case of some of them succeeding and some failing. Overall, I'd suggest we wait to see what Corey comes up with in terms of doing the stats load for all attributes in a single function call, perhaps using the VALUES construct as you suggested up-thread, and then we can contemplate if that's clean enough to work or if it's so grotty that the better plan would be to do per-attribute function calls. If it ends up being the latter, then we can revisit this discussion and try to answer some of the questions raised above. Thanks! Stephen
Attachment
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
How well would this simplify to the following:
SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);
Or even just one VALUES for the whole statistics loading?
I'm sorry, I don't quite understand what you're suggesting here. I'm about to post the new functions, so perhaps you can rephrase this in the context of those functions.
I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.
Yes, that is a concern, and I can see value in having it both ways: one failure fails the whole table's worth of set_something() functions, but I can also see emitting a warning instead of an error and returning false. I'm eager to get feedback on which the community would prefer, or perhaps even make it a parameter.
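If we go with erroring out, a caller that wants the per-column behavior could still get it with savepoints, something like this (function name and arguments are only illustrative):

BEGIN;
SAVEPOINT attr_stats;
SELECT pg_set_attribute_stats('public.foo'::regclass, 'id'::name, ...);
-- on failure: ROLLBACK TO SAVEPOINT attr_stats; and move on to the next column
RELEASE SAVEPOINT attr_stats;
-- repeat for the remaining columns
COMMIT;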
Having some discussion around that would be useful. Is it better to
have a situation where there are stats for some columns but no stats for
other columns? There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.
It is my (incomplete and entirely second-hand) understanding that pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value and then resets it on completion, presumably because analyzing a table before its data is loaded and indexes are created would just be a waste of time.
Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls. If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.
In the patch below, I ended up doing per-attribute function calls, mostly because it allowed me to avoid creating a custom data type for the portable version of pg_statistic. This comes at the cost of a very high number of parameters, but that's the breaks.
I am a bit concerned about the number of locks on pg_statistic and the relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per attribute rather than once per relation. But I also see that this will mostly get used at a time when no other traffic is on the machine, and whatever it costs, it's still faster than the smallest table sample (insert joke about "don't have to be faster than the bear" here).
I'm putting this out there ahead of the pg_dump / fe_utils work, mostly because what I do there heavily depends on how this is received.
Also, I'm still seeking confirmation that I can create a pg_dump TOC entry with a chain of commands (e.g. BEGIN; ... COMMIT; ) or if I have to fan them out into multiple entries.
Anyway, here's v7. Eagerly awaiting feedback.
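For reference, a bare-bones usage sketch of the two functions as they appear in v7 (argument values are invented; the long tail of stakindN/stanumbersN/stavaluesN arguments is elided):

SELECT pg_set_relation_stats('public.foo'::regclass, 10000, 250);
SELECT pg_set_attribute_stats('public.foo'::regclass, 'id'::name, false,
                              0.0, 4, -1.0, ...);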
Attachment
Greetings, * Corey Huinker (corey.huinker@gmail.com) wrote: > > Having some discussion around that would be useful. Is it better to > > have a situation where there are stats for some columns but no stats for > > other columns? There would be a good chance that this would lead to a > > set of queries that were properly planned out and a set which end up > > with unexpected and likely poor query plans due to lack of stats. > > Arguably that's better overall, but either way an ANALYZE needs to be > > done to address the lack of stats for those columns and then that > > ANALYZE is going to blow away whatever stats got loaded previously > > anyway and all we did with a partial stats load was maybe have a subset > > of queries have better plans in the interim, after having expended the > > cost to try and individually load the stats and dealing with the case of > > some of them succeeding and some failing. > > It is my (incomplete and entirely second-hand) understanding is that > pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value > and then resets it on completion, presumably because analyzing a table > before its data is loaded and indexes are created would just be a waste of > time. No, pg_upgrade starts the postmaster with -b (undocumented binary-upgrade mode) which prevents autovacuum (and logical replication workers) from starting, so we don't need to worry about autovacuum coming in and causing issues during binary upgrade. That doesn't entirely eliminate the concerns around pg_dump vs. autovacuum because we may restore a dump into a non-binary-upgrade-in-progress system though, of course. > > Overall, I'd suggest we wait to see what Corey comes up with in terms of > > doing the stats load for all attributes in a single function call, > > perhaps using the VALUES construct as you suggested up-thread, and then > > we can contemplate if that's clean enough to work or if it's so grotty > > that the better plan would be to do per-attribute function calls. If it > > ends up being the latter, then we can revisit this discussion and try to > > answer some of the questions raised above. > > In the patch below, I ended up doing per-attribute function calls, mostly > because it allowed me to avoid creating a custom data type for the portable > version of pg_statistic. This comes at the cost of a very high number of > parameters, but that's the breaks. Perhaps it's just me ... but it doesn't seem like it's all that many parameters. > I am a bit concerned about the number of locks on pg_statistic and the > relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per > attribute rather than once per relation. But I also see that this will > mostly get used at a time when no other traffic is on the machine, and > whatever it costs, it's still faster than the smallest table sample (insert > joke about "don't have to be faster than the bear" here). I continue to not be too concerned about this until and unless it's actually shown to be an issue. Keeping things simple and straight-forward has its own value. > This raises questions about whether a failure in one attribute update > statement should cause the others in that relation to roll back or not, and > I can see situations where both would be desirable. > > I'm putting this out there ahead of the pg_dump / fe_utils work, mostly > because what I do there heavily depends on how this is received. > > Also, I'm still seeking confirmation that I can create a pg_dump TOC entry > with a chain of commands (e.g. BEGIN; ... 
COMMIT; ) or if I have to fan > them out into multiple entries. If we do go with this approach, we'd certainly want to make sure to do it in a manner which would allow pg_restore's single-transaction mode to still work, which definitely complicates this some. Given that and the other general feeling that the locking won't be a big issue, I'd suggest the simple approach on the pg_dump side of just dumping the stats out without the BEGIN/COMMIT. > Anyway, here's v7. Eagerly awaiting feedback. > Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats. > diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat > index 291ed876fc..d12b6e3ca3 100644 > --- a/src/include/catalog/pg_proc.dat > +++ b/src/include/catalog/pg_proc.dat > @@ -8818,7 +8818,6 @@ > { oid => '3813', descr => 'generate XML text node', > proname => 'xmltext', proisstrict => 't', prorettype => 'xml', > proargtypes => 'text', prosrc => 'xmltext' }, > - > { oid => '2923', descr => 'map table contents to XML', > proname => 'table_to_xml', procost => '100', provolatile => 's', > proparallel => 'r', prorettype => 'xml', > @@ -12163,8 +12162,24 @@ > > # GiST stratnum implementations > { oid => '8047', descr => 'GiST support', > - proname => 'gist_stratnum_identity', prorettype => 'int2', > + proname => 'gist_stratnum_identity',prorettype => 'int2', > proargtypes => 'int2', > prosrc => 'gist_stratnum_identity' }, Random whitespace hunks shouldn't be included > diff --git a/src/backend/statistics/statistics.c b/src/backend/statistics/statistics.c > new file mode 100644 > index 0000000000..999aebdfa9 > --- /dev/null > +++ b/src/backend/statistics/statistics.c > @@ -0,0 +1,360 @@ > +/*------------------------------------------------------------------------- * * statistics.c * > + * IDENTIFICATION > + * src/backend/statistics/statistics.c > + * > + *------------------------------------------------------------------------- > + */ Top-of-file comment should be cleaned up. > +/* > + * Set statistics for a given pg_class entry. > + * > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int) > + * > + * This does an in-place (i.e. non-transactional) update of pg_class, just as > + * is done in ANALYZE. > + * > + */ > +Datum > +pg_set_relation_stats(PG_FUNCTION_ARGS) > +{ > + const char *param_names[] = { > + "relation", > + "reltuples", > + "relpages", > + }; > + > + Oid relid; > + Relation rel; > + HeapTuple ctup; > + Form_pg_class pgcform; > + > + for (int i = 0; i <= 2; i++) > + if (PG_ARGISNULL(i)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("%s cannot be NULL", param_names[i]))); Why not just mark this function as strict..? Or perhaps we should allow NULLs to be passed in and just not update the current value in that case? Also, in some cases we allow the function to be called with a NULL but then make it a no-op rather than throwing an ERROR (eg, if the OID ends up being NULL). Not sure if that makes sense here or not offhand but figured I'd mention it as something to consider. > + pgcform = (Form_pg_class) GETSTRUCT(ctup); > + pgcform->reltuples = PG_GETARG_FLOAT4(1); > + pgcform->relpages = PG_GETARG_INT32(2); Shouldn't we include relallvisible? Also, perhaps we should use the approach that we have in ANALYZE, and only actually do something if the values are different rather than just always doing an update. 
> +/* > + * Import statistics for a given relation attribute > + * > + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool, > + * stanullfrac float4, stawidth int, stadistinct float4, > + * stakind1 int2, stakind2 int2, stakind3 int3, > + * stakind4 int2, stakind5 int2, stanumbers1 float4[], > + * stanumbers2 float4[], stanumbers3 float4[], > + * stanumbers4 float4[], stanumbers5 float4[], > + * stanumbers1 float4[], stanumbers2 float4[], > + * stanumbers3 float4[], stanumbers4 float4[], > + * stanumbers5 float4[], stavalues1 text, > + * stavalues2 text, stavalues3 text, > + * stavalues4 text, stavalues5 text); > + * > + * > + */ Don't know that it makes sense to just repeat the function declaration inside a comment like this- it'll just end up out of date. > +Datum > +pg_set_attribute_stats(PG_FUNCTION_ARGS) > + /* names of columns that cannot be null */ > + const char *required_param_names[] = { > + "relation", > + "attname", > + "stainherit", > + "stanullfrac", > + "stawidth", > + "stadistinct", > + "stakind1", > + "stakind2", > + "stakind3", > + "stakind4", > + "stakind5", > + }; Same comment here as above wrt NULL being passed in. > + for (int k = 0; k < 5; k++) Shouldn't we use STATISTIC_NUM_SLOTS here? Thanks! Stephen
Attachment
Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.
> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).
I continue to not be too concerned about this until and unless it's
actually shown to be an issue. Keeping things simple and
straight-forward has its own value.
Ok, I'm sold on that plan.
> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> + const char *param_names[] = {
> + "relation",
> + "reltuples",
> + "relpages",
> + };
> +
> + Oid relid;
> + Relation rel;
> + HeapTuple ctup;
> + Form_pg_class pgcform;
> +
> + for (int i = 0; i <= 2; i++)
> + if (PG_ARGISNULL(i))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("%s cannot be NULL", param_names[i])));
Why not just mark this function as strict..? Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?
Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).
Thoughts on it emitting a WARN or NOTICE before returning false?
Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.
> + pgcform = (Form_pg_class) GETSTRUCT(ctup);
> + pgcform->reltuples = PG_GETARG_FLOAT4(1);
> + pgcform->relpages = PG_GETARG_INT32(2);
Shouldn't we include relallvisible?
Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.
Looking again at analyze.c (currently lines 1751-1780), we just check if there is a row in the way, and if so we replace it. I don't see where we compare existing values to new values.
> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + * stanullfrac float4, stawidth int, stadistinct float4,
> + * stakind1 int2, stakind2 int2, stakind3 int3,
> + * stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + * stanumbers2 float4[], stanumbers3 float4[],
> + * stanumbers4 float4[], stanumbers5 float4[],
> + * stanumbers1 float4[], stanumbers2 float4[],
> + * stanumbers3 float4[], stanumbers4 float4[],
> + * stanumbers5 float4[], stavalues1 text,
> + * stavalues2 text, stavalues3 text,
> + * stavalues4 text, stavalues5 text);
> + *
> + *
> + */
Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.
> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)
> + /* names of columns that cannot be null */
> + const char *required_param_names[] = {
> + "relation",
> + "attname",
> + "stainherit",
> + "stanullfrac",
> + "stawidth",
> + "stadistinct",
> + "stakind1",
> + "stakind2",
> + "stakind3",
> + "stakind4",
> + "stakind5",
> + };
Same comment here as above wrt NULL being passed in.
> + for (int k = 0; k < 5; k++)
Shouldn't we use STATISTIC_NUM_SLOTS here?
Yes, I had in the past. Not sure why I didn't again.
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote: > > Anyway, here's v7. Eagerly awaiting feedback. Thanks for working on this. It looks useful to have the ability to restore the stats after upgrade and restore. But, the exported stats are valid only until the next ANALYZE is run on the table. IIUC, postgres collects stats during VACUUM, autovacuum and ANALYZE, right? Perhaps there are other ways to collect stats. I'm thinking what problems does the user face if they are just asked to run ANALYZE on the tables (I'm assuming ANALYZE doesn't block concurrent access to the tables) instead of automatically exporting stats. Here are some comments on the v7 patch. I've not dived into the whole thread, so some comments may be of type repeated or need clarification. Please bear with me. 1. The following two are unnecessary changes in pg_proc.dat, please remove them. --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -8818,7 +8818,6 @@ { oid => '3813', descr => 'generate XML text node', proname => 'xmltext', proisstrict => 't', prorettype => 'xml', proargtypes => 'text', prosrc => 'xmltext' }, - { oid => '2923', descr => 'map table contents to XML', proname => 'table_to_xml', procost => '100', provolatile => 's', proparallel => 'r', prorettype => 'xml', @@ -12163,8 +12162,24 @@ # GiST stratnum implementations { oid => '8047', descr => 'GiST support', - proname => 'gist_stratnum_identity', prorettype => 'int2', + proname => 'gist_stratnum_identity',prorettype => 'int2', proargtypes => 'int2', prosrc => 'gist_stratnum_identity' }, 2. + they are replaced by the next auto-analyze. This function is used by + <command>pg_upgrade</command> and <command>pg_restore</command> to + convey the statistics from the old system version into the new one. + </para> Is there any demonstration of pg_set_relation_stats and pg_set_attribute_stats being used either in pg_upgrade or in pg_restore? Perhaps, having them as 0002, 0003 and so on patches might show real need for functions like this. It also clarifies how these functions pull stats from tables on the old cluster to the tables on the new cluster. 3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing to pg_class and might affect the plans as stats can get tampered. Can we REVOKE the execute permissions from the public out of the box in src/backend/catalog/system_functions.sql? This way one can decide who to give permissions to. 4. +SELECT pg_set_relation_stats('stats_export_import.test'::regclass, 3.6::float4, 15000); + pg_set_relation_stats +----------------------- + t +(1 row) + +SELECT reltuples, relpages FROM pg_class WHERE oid = 'stats_export_import.test'::regclass; + reltuples | relpages +-----------+---------- + 3.6 | 15000 Isn't this test case showing a misuse of these functions? Table actually has no rows, but we are lying to the postgres optimizer on stats. I think altering stats of a table mustn't be that easy for the end user. As mentioned in comment #3, permissions need to be tightened. In addition, we can also mark the functions pg_upgrade only with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore (or I don't know if we have a way to know within the server that the server is running for pg_restore). 5. In continuation to the comment #2, is pg_dump supposed to generate pg_set_relation_stats and pg_set_attribute_stats statements for each table? When pg_dump does that , pg_restore can automatically load the stats. 6. 
+/*------------------------------------------------------------------------- * * statistics.c * + * IDENTIFICATION + * src/backend/statistics/statistics.c + * + *------------------------------------------------------------------------- A description of what the new file statistics.c does is missing. 7. pgindent isn't happy with new file statistics.c, please check. 8. +/* + * Import statistics for a given relation attribute + * + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool, + * stanullfrac float4, stawidth int, stadistinct float4, Having function definition in the function comment isn't necessary - it's hard to keep it consistent with pg_proc.dat in future. If required, one can either look at pg_proc.dat or docs. 9. Isn't it good to add a test case where the plan of a query on table after exporting the stats would remain same as that of the original table from which the stats are exported? IMO, this is a more realistic than just comparing pg_statistic of the tables because this is what an end-user wants eventually. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v7. Eagerly awaiting feedback.
Thanks for working on this. It looks useful to have the ability to
restore the stats after upgrade and restore. But, the exported stats
are valid only until the next ANALYZE is run on the table. IIUC,
postgres collects stats during VACUUM, autovacuum and ANALYZE, right?
Perhaps there are other ways to collect stats. I'm thinking what
problems does the user face if they are just asked to run ANALYZE on
the tables (I'm assuming ANALYZE doesn't block concurrent access to
the tables) instead of automatically exporting stats.
Correct. These are just as temporary as any other analyze of the table. Another analyze will happen later, probably through autovacuum, and will wipe out these values. This is designed to QUICKLY get stats into a table so that the database is operational sooner. This is especially important after an upgrade/restore, when all stats were wiped out. Other uses could be adapting this for use by postgres_fdw, so that we don't have to do table sampling on the remote table, and of course statistics injection to test the query planner.
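For the planner-testing case, the usage I have in mind looks roughly like the following; the argument list mirrors the (relation, reltuples, relpages) form from the current patch revision and will change if relallvisible or other fields get added, and the table name is made up, so treat this as a sketch rather than the final interface:

-- Make an empty table look like it has a million rows in 10000 pages,
-- then inspect how plans change; nothing is actually inserted.
CREATE TABLE hypo_test (id integer PRIMARY KEY, payload text);
SELECT pg_catalog.pg_set_relation_stats('hypo_test'::regclass,
                                        1000000::float4,   -- reltuples
                                        10000);            -- relpages
EXPLAIN SELECT * FROM hypo_test WHERE id = 42;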
2.
+ they are replaced by the next auto-analyze. This function is used by
+ <command>pg_upgrade</command> and <command>pg_restore</command> to
+ convey the statistics from the old system version into the new one.
+ </para>
Is there any demonstration of pg_set_relation_stats and
pg_set_attribute_stats being used either in pg_upgrade or in
pg_restore? Perhaps, having them as 0002, 0003 and so on patches might
show real need for functions like this. It also clarifies how these
functions pull stats from tables on the old cluster to the tables on
the new cluster.
That code was adapted from do_analyze(), and yes, there is a patch for pg_dump, but as I noted earlier it is on hold pending feedback.
3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing
to pg_class and might affect the plans as stats can get tampered. Can
we REVOKE the execute permissions from the public out of the box in
src/backend/catalog/system_functions.sql? This way one can decide who
to give permissions to.
4.
+SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
3.6::float4, 15000);
+ pg_set_relation_stats
+-----------------------
+ t
+(1 row)
+
+SELECT reltuples, relpages FROM pg_class WHERE oid =
'stats_export_import.test'::regclass;
+ reltuples | relpages
+-----------+----------
+ 3.6 | 15000
Isn't this test case showing a misuse of these functions? Table
actually has no rows, but we are lying to the postgres optimizer on
stats.
I think altering stats of a table mustn't be that easy for the
end user.
As mentioned in comment #3, permissions need to be
tightened. In addition, we can also mark the functions pg_upgrade only
with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore
(or I don't know if we have a way to know within the server that the
server is running for pg_restore).
5. In continuation to the comment #2, is pg_dump supposed to generate
pg_set_relation_stats and pg_set_attribute_stats statements for each
table? When pg_dump does that , pg_restore can automatically load the
stats.
9. Isn't it good to add a test case where the plan of a query on table
after exporting the stats would remain same as that of the original
table from which the stats are exported? IMO, this is a more realistic
than just comparing pg_statistic of the tables because this is what an
end-user wants eventually.
I'm sure we can add something like that, but query plan formats change a lot and are greatly dependent on database configuration, so maintaining such a test would be a lot of work.
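If we did want something along those lines, the churn could probably be kept down by comparing only plan shape; the table name below is just an illustration, this isn't in the patch:

-- Run the same query before and after importing stats; COSTS OFF hides
-- the numbers that vary across configurations, so only the plan shape
-- is compared in the expected output.
EXPLAIN (COSTS OFF) SELECT * FROM some_table WHERE id < 100;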
Hi, On Fri, Mar 08, 2024 at 01:35:40AM -0500, Corey Huinker wrote: > Anyway, here's v7. Eagerly awaiting feedback. Thanks! A few random comments: 1 === + The purpose of this function is to apply statistics values in an + upgrade situation that are "good enough" for system operation until Worth to add a few words about "influencing" the planner use case? 2 === +#include "catalog/pg_type.h" +#include "fmgr.h" Are those 2 needed? 3 === + if (!HeapTupleIsValid(ctup)) + elog(ERROR, "pg_class entry for relid %u vanished during statistics import", s/during statistics import/when setting statistics/? 4 === +Datum +pg_set_relation_stats(PG_FUNCTION_ARGS) +{ . . + table_close(rel, ShareUpdateExclusiveLock); + + PG_RETURN_BOOL(true); Why returning a bool? (I mean we'd throw an error or return true). 5 === + */ +Datum +pg_set_attribute_stats(PG_FUNCTION_ARGS) +{ This function is not that simple, worth to explain its logic in a comment above? 6 === + if (!HeapTupleIsValid(tuple)) + { + relation_close(rel, NoLock); + PG_RETURN_BOOL(false); + } + + attr = (Form_pg_attribute) GETSTRUCT(tuple); + if (attr->attisdropped) + { + ReleaseSysCache(tuple); + relation_close(rel, NoLock); + PG_RETURN_BOOL(false); + } Why is it returning "false" and not throwing an error? (if ok, then I think we can get rid of returning a bool). 7 === + * If this relation is an index and that index has expressions in + * it, and the attnum specified s/is an index and that index has/is an index that has/? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Greetings, * Corey Huinker (corey.huinker@gmail.com) wrote: > > > +/* > > > + * Set statistics for a given pg_class entry. > > > + * > > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int) > > > + * > > > + * This does an in-place (i.e. non-transactional) update of pg_class, > > just as > > > + * is done in ANALYZE. > > > + * > > > + */ > > > +Datum > > > +pg_set_relation_stats(PG_FUNCTION_ARGS) > > > +{ > > > + const char *param_names[] = { > > > + "relation", > > > + "reltuples", > > > + "relpages", > > > + }; > > > + > > > + Oid relid; > > > + Relation rel; > > > + HeapTuple ctup; > > > + Form_pg_class pgcform; > > > + > > > + for (int i = 0; i <= 2; i++) > > > + if (PG_ARGISNULL(i)) > > > + ereport(ERROR, > > > + > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > + errmsg("%s cannot be NULL", > > param_names[i]))); > > > > Why not just mark this function as strict..? Or perhaps we should allow > > NULLs to be passed in and just not update the current value in that > > case? > > Strict could definitely apply here, and I'm inclined to make it so. Having thought about it a bit more, I generally like the idea of being able to just update one stat instead of having to update all of them at once (and therefore having to go look up what the other values currently are...). That said, per below, perhaps making it strict is the better plan. > > Also, in some cases we allow the function to be called with a > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the > > OID ends up being NULL). > > Thoughts on it emitting a WARN or NOTICE before returning false? Eh, I don't think so? Where this is coming from is that we can often end up with functions like these being called inside of larger queries, and having them spit out WARN or NOTICE will just make them noisy. That leads to my general feeling of just returning NULL if called with a NULL OID, as we would get with setting the function strict. > > Not sure if that makes sense here or not > > offhand but figured I'd mention it as something to consider. > > > > > + pgcform = (Form_pg_class) GETSTRUCT(ctup); > > > + pgcform->reltuples = PG_GETARG_FLOAT4(1); > > > + pgcform->relpages = PG_GETARG_INT32(2); > > > > Shouldn't we include relallvisible? > > Yes. No idea why I didn't have that in there from the start. Ok. > > Also, perhaps we should use the approach that we have in ANALYZE, and > > only actually do something if the values are different rather than just > > always doing an update. > > That was how it worked back in v1, more for the possibility that there was > no matching JSON to set values. > > Looking again at analyze.c (currently lines 1751-1780), we just check if > there is a row in the way, and if so we replace it. I don't see where we > compare existing values to new values. Well, that code is for pg_statistic while I was looking at pg_class (in vacuum.c:1428-1443, where we track if we're actually changing anything and only make the pg_class change if there's actually something different): vacuum.c:1531 /* If anything changed, write out the tuple. */ if (dirty) heap_inplace_update(rd, ctup); Not sure why we don't treat both the same way though ... although it's probably the case that it's much less likely to have an entire pg_statistic row be identical than the few values in pg_class. 
> > > +Datum > > > +pg_set_attribute_stats(PG_FUNCTION_ARGS) > > > > > + /* names of columns that cannot be null */ > > > + const char *required_param_names[] = { > > > + "relation", > > > + "attname", > > > + "stainherit", > > > + "stanullfrac", > > > + "stawidth", > > > + "stadistinct", > > > + "stakind1", > > > + "stakind2", > > > + "stakind3", > > > + "stakind4", > > > + "stakind5", > > > + }; > > > > Same comment here as above wrt NULL being passed in. > > In this case, the last 10 params (stanumbersN and stavaluesN) can be null, > and are NULL more often than not. Hmm, that's a valid point, so a NULL passed in would need to set that value actually to NULL, presumably. Perhaps then we should have pg_set_relation_stats() be strict and have pg_set_attribute_stats() handles NULLs passed in appropriately, and return NULL if the relation itself or attname, or other required (not NULL'able) argument passed in cause the function to return NULL. (What I'm trying to drive at here is a consistent interface for these functions, but one which does a no-op instead of returning an ERROR on values being passed in which aren't allowable; it can be quite frustrating trying to get a query to work where one of the functions decides to return ERROR instead of just ignoring things passed in which aren't valid.) > > > + for (int k = 0; k < 5; k++) > > > > Shouldn't we use STATISTIC_NUM_SLOTS here? > > Yes, I had in the past. Not sure why I didn't again. No worries. Thanks! Stephen
Attachment
Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...). That said, per below, perhaps making it strict is the better
plan.
> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
>
> Thoughts on it emitting a WARN or NOTICE before returning false?
Eh, I don't think so?
Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.
That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.
Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):
Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.
Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably. Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handles NULLs passed in appropriately, and return NULL if the relation
itself or attname, or other required (not NULL'able) argument passed in
cause the function to return NULL.
(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of just ignoring things passed in which
aren't valid.)
I should also point out that we've lost the ability to check if the export values were of a type, and if the destination column is also of that type. That's a non-issue in binary upgrades, but of course if a field changed from integers to text the histograms would now be highly misleading. Thoughts on adding a typname parameter that the function uses as a cheap validity check?
v8 attached, incorporating these suggestions plus those of Bharath and Bertrand. Still no pg_dump.
As for pg_dump, I'm currently leading toward the TOC entry having either a series of commands:
SELECT pg_set_relation_stats('foo.bar'::regclass, ...); pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...
Or one compound command
SELECT pg_set_relation_stats(t.oid, ...)
pg_set_attribute_stats(t.oid, 'id'::name, ...),
pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
...
FROM (VALUES('foo.bar'::regclass)) AS t(oid);
The second one has the feature that if any one attribute fails, then the whole update fails, except, of course, for the in-place update of pg_class. This avoids having an explicit transaction block, but we could get that back by having restore wrap the list of commands in a transaction block (and adding the explicit lock commands) when it is safe to do so.
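Spelled out, the wrapped variant of that TOC entry would look something like this; the '...' stand in for the full argument lists, so this is shape only, not literal SQL:

BEGIN;
LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
SELECT pg_set_relation_stats(t.oid, ...),
       pg_set_attribute_stats(t.oid, 'id'::name, ...),
       pg_set_attribute_stats(t.oid, 'last_name'::name, ...)
FROM (VALUES ('foo.bar'::regclass)) AS t(oid);
COMMIT;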
Attachment
Greetings, * Corey Huinker (corey.huinker@gmail.com) wrote: > > Having thought about it a bit more, I generally like the idea of being > > able to just update one stat instead of having to update all of them at > > once (and therefore having to go look up what the other values currently > > are...). That said, per below, perhaps making it strict is the better > > plan. > > v8 has it as strict. Ok. > > > > Also, in some cases we allow the function to be called with a > > > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the > > > > OID ends up being NULL). > > > > > > Thoughts on it emitting a WARN or NOTICE before returning false? > > > > Eh, I don't think so? > > > > Where this is coming from is that we can often end up with functions > > like these being called inside of larger queries, and having them spit > > out WARN or NOTICE will just make them noisy. > > > > That leads to my general feeling of just returning NULL if called with a > > NULL OID, as we would get with setting the function strict. > > In which case we're failing nearly silently, yes, there is a null returned, > but we have no idea why there is a null returned. If I were using this > function manually I'd want to know what I did wrong, what parameter I > skipped, etc. I can see it both ways and don't feel super strongly about it ... I just know that I've had some cases where we returned an ERROR or otherwise were a bit noisy on NULL values getting passed into a function and it was much more on the annoying side than on the helpful side; to the point where we've gone back and pulled out ereport(ERROR) calls from functions before because they were causing issues in otherwise pretty reasonable queries (consider things like functions getting pushed down to below WHERE clauses and such...). > > Well, that code is for pg_statistic while I was looking at pg_class (in > > vacuum.c:1428-1443, where we track if we're actually changing anything > > and only make the pg_class change if there's actually something > > different): > > I can do that, especially since it's only 3 tuples of known types, but my > reservations are summed up in the next comment. > > Not sure why we don't treat both the same way though ... although it's > > probably the case that it's much less likely to have an entire > > pg_statistic row be identical than the few values in pg_class. > > That would also involve comparing ANYARRAY values, yuk. Also, a matched > record will never be the case when used in primary purpose of the function > (upgrades), and not a big deal in the other future cases (if we use it in > ANALYZE on foreign tables instead of remote table samples, users > experimenting with tuning queries under hypothetical workloads). Sure. Not a huge deal either way, was just pointing out the difference. I do think it'd be good to match what ANALYZE does here, so checking if the values in pg_class are different and only updating if they are, while keeping the code for pg_statistic where it'll just always update. > > Hmm, that's a valid point, so a NULL passed in would need to set that > > value actually to NULL, presumably. Perhaps then we should have > > pg_set_relation_stats() be strict and have pg_set_attribute_stats() > > handles NULLs passed in appropriately, and return NULL if the relation > > itself or attname, or other required (not NULL'able) argument passed in > > cause the function to return NULL. > > > > That's how I have relstats done in v8, and could make it do that for attr > stats. 
That'd be my suggestion, at least, but as I mention above, it's not a position I hold very strongly. > > (What I'm trying to drive at here is a consistent interface for these > > functions, but one which does a no-op instead of returning an ERROR on > > values being passed in which aren't allowable; it can be quite > > frustrating trying to get a query to work where one of the functions > > decides to return ERROR instead of just ignoring things passed in which > > aren't valid.) > > I like the symmetry of a consistent interface, but we've already got an > asymmetry in that the pg_class update is done non-transactionally (like > ANALYZE does). Don't know that I really consider that to be the same kind of thing when it comes to talking about the interface as the other aspects we're discussing ... > One persistent problem is that there is no _safe equivalent to ARRAY_IN, so > that can always fail on us, though it should only do so if the string > passed in wasn't a valid array input format, or the values in the array > can't coerce to the attribute's basetype. That would happen before we even get to being called and there's not much to do about it anyway. > I should also point out that we've lost the ability to check if the export > values were of a type, and if the destination column is also of that type. > That's a non-issue in binary upgrades, but of course if a field changed > from integers to text the histograms would now be highly misleading. > Thoughts on adding a typname parameter that the function uses as a cheap > validity check? Seems reasonable to me. > v8 attached, incorporating these suggestions plus those of Bharath and > Bertrand. Still no pg_dump. > > As for pg_dump, I'm currently leading toward the TOC entry having either a > series of commands: > > SELECT pg_set_relation_stats('foo.bar'::regclass, ...); > pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ... I'm guessing the above was intended to be SELECT ..; SELECT ..; > Or one compound command > > SELECT pg_set_relation_stats(t.oid, ...) > pg_set_attribute_stats(t.oid, 'id'::name, ...), > pg_set_attribute_stats(t.oid, 'last_name'::name, ...), > ... > FROM (VALUES('foo.bar'::regclass)) AS t(oid); > > The second one has the feature that if any one attribute fails, then the > whole update fails, except, of course, for the in-place update of pg_class. > This avoids having an explicit transaction block, but we could get that > back by having restore wrap the list of commands in a transaction block > (and adding the explicit lock commands) when it is safe to do so. Hm, I like this approach as it should essentially give us the transaction block we had been talking about wanting but without needing to explicitly do a begin/commit, which would add in some annoying complications. This would hopefully also reduce the locking concern mentioned previously, since we'd get the lock needed in the first function call and then the others would be able to just see that we've already got the lock pretty quickly. > Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats. [...] > +Datum > +pg_set_relation_stats(PG_FUNCTION_ARGS) [...] > + ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid)); > + if (!HeapTupleIsValid(ctup)) > + elog(ERROR, "pg_class entry for relid %u vanished during statistics import", > + relid); Maybe drop the 'during statistics import' part of this message? Also wonder if maybe we should make it a regular ereport() instead, since it might be possible for a user to end up seeing this? 
> + pgcform = (Form_pg_class) GETSTRUCT(ctup); > + > + reltuples = PG_GETARG_FLOAT4(P_RELTUPLES); > + relpages = PG_GETARG_INT32(P_RELPAGES); > + relallvisible = PG_GETARG_INT32(P_RELALLVISIBLE); > + > + /* Do not update pg_class unless there is no meaningful change */ This comment doesn't seem quite right. Maybe it would be better if it was in the positive, eg: Only update pg_class if there is a meaningful change. Rest of it looks pretty good to me, at least. Thanks! Stephen
Attachment
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.
I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).
I don't have strong feelings either. I think we should get more input on this. Regardless, it's easy to change...for now.
Sure. Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).
Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...
> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.
That would happen before we even get to being called and there's not
much to do about it anyway.
Not sure I follow you here. the ARRAY_IN function calls happen once for every non-null stavaluesN parameter, and it's done inside the function because the result type could be the base type for a domain/array type, or could be the type itself. I suppose we could move that determination to the caller, but then we'd need to call get_base_element_type() inside a client, and that seems wrong if it's even possible.
> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be highly misleading.
> Thoughts on adding a typname parameter that the function uses as a cheap
> validity check?
Seems reasonable to me.
I'd like to hear what Tomas thinks about this, as he was the initial advocate for it.
> As for pg_dump, I'm currently leading toward the TOC entry having either a
> series of commands:
>
> SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...
I'm guessing the above was intended to be SELECT ..; SELECT ..;
> Or one compound command
>
> SELECT pg_set_relation_stats(t.oid, ...)
> pg_set_attribute_stats(t.oid, 'id'::name, ...),
> pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> ...
> FROM (VALUES('foo.bar'::regclass)) AS t(oid);
>
> The second one has the feature that if any one attribute fails, then the
> whole update fails, except, of course, for the in-place update of pg_class.
> This avoids having an explicit transaction block, but we could get that
> back by having restore wrap the list of commands in a transaction block
> (and adding the explicit lock commands) when it is safe to do so.
Hm, I like this approach as it should essentially give us the
transaction block we had been talking about wanting but without needing
to explicitly do a begin/commit, which would add in some annoying
complications. This would hopefully also reduce the locking concern
mentioned previously, since we'd get the lock needed in the first
function call and then the others would be able to just see that we've
already got the lock pretty quickly.
True, we'd get the lock needed in the first function call, but wouldn't we also release that very lock before the subsequent call? Obviously we'd be shrinking the window in which another process could get in line and take a superior lock, and the universe of other processes that would even want a lock that blocks us is nil in the case of an upgrade, identical to existing behavior in the case of an FDW ANALYZE, and perfectly fine in the case of someone tinkering with stats.
> Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.
[...]
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
[...]
> + ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
> + if (!HeapTupleIsValid(ctup))
> + elog(ERROR, "pg_class entry for relid %u vanished during statistics import",
> + relid);
Maybe drop the 'during statistics import' part of this message? Also
wonder if maybe we should make it a regular ereport() instead, since it
might be possible for a user to end up seeing this?
Agreed and agreed. It was copypasta from ANALYZE.
This comment doesn't seem quite right. Maybe it would be better if it
was in the positive, eg: Only update pg_class if there is a meaningful
change.
+1
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > > One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> > > that can always fail on us, though it should only do so if the string
> > > passed in wasn't a valid array input format, or the values in the array
> > > can't coerce to the attribute's basetype.
> >
> > That would happen before we even get to being called and there's not
> > much to do about it anyway.
>
> Not sure I follow you here. the ARRAY_IN function calls happen once for
> every non-null stavaluesN parameter, and it's done inside the function
> because the result type could be the base type for a domain/array type, or
> could be the type itself. I suppose we could move that determination to the
> caller, but then we'd need to call get_base_element_type() inside a client,
> and that seems wrong if it's even possible.

Ah, yeah, ok, I see what you're saying here and sure, there's a risk
those might ERROR too, but that's outright invalid data then as opposed
to a NULL getting passed in.

> > > Or one compound command
> > >
> > > SELECT pg_set_relation_stats(t.oid, ...)
> > > pg_set_attribute_stats(t.oid, 'id'::name, ...),
> > > pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> > > ...
> > > FROM (VALUES('foo.bar'::regclass)) AS t(oid);
> > >
> > > The second one has the feature that if any one attribute fails, then the
> > > whole update fails, except, of course, for the in-place update of
> > > pg_class.
> > > This avoids having an explicit transaction block, but we could get that
> > > back by having restore wrap the list of commands in a transaction block
> > > (and adding the explicit lock commands) when it is safe to do so.
> >
> > Hm, I like this approach as it should essentially give us the
> > transaction block we had been talking about wanting but without needing
> > to explicitly do a begin/commit, which would add in some annoying
> > complications. This would hopefully also reduce the locking concern
> > mentioned previously, since we'd get the lock needed in the first
> > function call and then the others would be able to just see that we've
> > already got the lock pretty quickly.
>
> True, we'd get the lock needed in the first function call, but wouldn't we
> also release that very lock before the subsequent call? Obviously we'd be
> shrinking the window in which another process could get in line and take a
> superior lock, and the universe of other processes that would even want a
> lock that blocks us is nil in the case of an upgrade, identical to existing
> behavior in the case of an FDW ANALYZE, and perfectly fine in the case of
> someone tinkering with stats.

No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it). See analyze.c:268 or so,
where we call relation_close(onerel, NoLock); meaning we're closing the
relation but we're *not* releasing the lock on it- it'll get released at
the end of the transaction.

Thanks!

Stephen
Attachment
No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it). See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.
If that's the case, then changing the two table_close() statements to NoLock should resolve any remaining concern.
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it). See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself. Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end. I'd be inclined to keep it that way in
these functions also. I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works. If it can be shown to
be an issue then we could certainly revisit this.

Thanks,

Stephen
Attachment
Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself. Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end. I'd be inclined to keep it that way in
these functions also. I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.
ANALYZE takes out one lock on StatisticRelationId per relation, not per attribute like we do now. If we didn't release the lock after every attribute, and we only called the function outside of a larger transaction (as we plan to do with pg_restore), then that is the closest we're going to get to being consistent with ANALYZE.
Statistics are preserved by default, but this can be disabled with the option --no-statistics. This follows the prevailing option pattern in pg_dump, etc.
There are currently several failing TAP tests around pg_dump/pg_restore/pg_upgrade. I'm looking at those, but in the mean time I'm seeking feedback on the progress so far.
Attachment
On Fri, 2024-03-15 at 03:55 -0400, Corey Huinker wrote:
>
> Statistics are preserved by default, but this can be disabled with
> the option --no-statistics. This follows the prevailing option
> pattern in pg_dump, etc.

I'm not sure if saving statistics should be the default in 17. I'm
inclined to make it opt-in.

> There are currently several failing TAP tests around
> pg_dump/pg_restore/pg_upgrade.

It is a permissions problem. When the user running pg_dump is not the
superuser, they don't have permission to access pg_statistic. That
causes an error in exportRelationStatsStmt(), which returns NULL, and
then the caller segfaults.

> I'm looking at those, but in the mean time I'm seeking feedback on
> the progress so far.

Still looking, but one quick comment is that the third argument of
dumpRelationStats() should be const, which eliminates a warning.

Regards,
	Jeff Davis
On Fri, 2024-03-15 at 15:30 -0700, Jeff Davis wrote:
> Still looking, but one quick comment is that the third argument of
> dumpRelationStats() should be const, which eliminates a warning.

A few other comments:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Regards,
	Jeff Davis
* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.
Added.
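To make the intent concrete, honoring MAINTAIN allows something like the following; the table and role names are made up, and the argument list is abbreviated to the (relation, reltuples, relpages) form for illustration only:

-- A role that can VACUUM/ANALYZE the table can now also restore its stats.
GRANT MAINTAIN ON TABLE public.orders TO stats_loader;
SET ROLE stats_loader;
SELECT pg_catalog.pg_set_relation_stats('public.orders'::regclass,
                                        250000::float4, 4000);
RESET ROLE;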
* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.
* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?
Done.
In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.
Eh, about the same.
Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.
Aside from the oid being invalid, there's not a whole lot that can go wrong in set_relation_stats(). The error checking I did closely mirrors that in analyze.c.
Aside from the changes you suggested, as well as the error reporting change you suggested for pg_dump, I also filtered out attempts to dump stats on views.
A few TAP tests are still failing and I haven't been able to diagnose why, though the failures in parallel dump seem to be that it tries to import stats on indexes that haven't been created yet, which is odd because I set the dependency.
All those changes are available in the patches attached.
Attachment
On Sun, 2024-03-17 at 23:33 -0400, Corey Huinker wrote:
>
> A few TAP tests are still failing and I haven't been able to diagnose
> why, though the failures in parallel dump seem to be that it tries to
> import stats on indexes that haven't been created yet, which is odd
> because I set the dependency.

From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I', a.nsp...

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

Regards,
	Jeff Davis
From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:
pg_dump: error: prepared statement failed: ERROR: syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...
Thanks. Unfamiliar turf for me.
> All those changes are available in the patches attached.
How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.
Per conversation, it would be trivial to add a helper functions that replace the parameters after the initial oid with a pg_class rowtype, and that would dissect the values needed and call the more complex function:
pg_set_relation_stats( oid, pg_class)
pg_set_attribute_stats( oid, pg_stats)
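For example (purely hypothetical, these wrappers don't exist in the current patch), copying stats from one table to another would then reduce to:

-- Feed whole catalog/view rows back in, rather than spelling out each column.
SELECT pg_set_relation_stats('foo.copy'::regclass, c)
  FROM pg_class c
 WHERE c.oid = 'foo.orig'::regclass;

SELECT pg_set_attribute_stats('foo.copy'::regclass, s)
  FROM pg_stats s
 WHERE s.schemaname = 'foo' AND s.tablename = 'orig';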
I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.
I agree. The tests are currently showing that a fidelity copy can be made from one table to another, but to do so we have to conceal the actual stats values because those are 1. not deterministic/known and 2. subject to change from version to version.
I can add some sets to arbitrary values like was done for pg_set_relation_stats().
- The new range-specific statistics types are now supported. I'm not happy with the typid machinations I do to get them to work, but it is working so far. These are stored out of stakind order (7 before 6), which is odd because all other types seem to store stakinds in ascending order. It shouldn't matter; it was just odd.
Attachment
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you. Comments on 0001:

This test:

+SELECT
+ format('SELECT pg_catalog.pg_set_attribute_stats( '
...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

And can the other tests use pg_stats rather than pg_statistic?

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

Make sure all error paths ReleaseSysCache().

Why are you calling checkCanModifyRelation() twice?

I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

The test also sets the most_common_freqs in an ascending order, which
is weird.

Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

Regards,
	Jeff Davis
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.
Thank you.
Comments on 0001:
This test:
+SELECT
+ format('SELECT pg_catalog.pg_set_attribute_stats( '
...
seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.
And can the other tests use pg_stats rather than pg_statistic?
They can, but part of what I wanted to show was that the values that aren't directly passed in as parameters (staopN, stacollN) get set to the correct values, and those values aren't guaranteed to match across databases, hence testing them in the regression test rather than in a TAP test. I'd still like to be able to test that.
The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?
1. We'd have to compare the stats provided against the stats that are already there, make that list in-memory, and then re-order what remains.
2. There would be no way to un-set statistics of a given stakind, unless we added an "actually set it null" boolean for each parameter that can be null.
3. I tried that with the JSON formats, it made the code even messier than it already was.
Make sure all error paths ReleaseSysCache().
+1
Why are you calling checkCanModifyRelation() twice?
Once for the relation itself, and once for pg_statistic.
I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.
I go back and forth on that. I can see making it void and returning an error for everything that we currently return false for, but if we do that, then a statement with one pg_set_relation_stats, and N pg_set_attribute_stats (which we lump together in one command for the locking benefits and atomic transaction) would fail entirely if one of the set_attributes named a column that we had dropped. It's up for debate whether that's the right behavior or not.
replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.
+1.
The test also sets the most_common_freqs in an ascending order, which
is weird.
Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.
Maybe we could have the functions restricted to a role or roles:
1. pg_write_all_stats (can modify stats on ANY table)
2. pg_write_own_stats (can modify stats on tables owned by user)
I'm iffy on the need for the first one, I list it first purely to show how I derived the name for the second.
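If something like that were adopted, the out-of-the-box lockdown might look roughly like this; neither role exists today, and GRANT alone wouldn't enforce the "own tables" restriction, which would still need a check inside the function:

-- Default: nobody but superuser can call the set functions...
REVOKE EXECUTE ON FUNCTION pg_catalog.pg_set_relation_stats FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION pg_catalog.pg_set_attribute_stats FROM PUBLIC;
-- ...then hand execution back out via the proposed roles.
GRANT EXECUTE ON FUNCTION pg_catalog.pg_set_relation_stats TO pg_write_all_stats;
GRANT EXECUTE ON FUNCTION pg_catalog.pg_set_attribute_stats TO pg_write_all_stats;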
One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.
+1. All suggestions of validation checks welcome.
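As a starting point, here's the sort of consistency condition I mean, written as a query against pg_stats (the real checks would live inside the set function, but the rules are the same):

-- Flag MCV frequency lists that are out of range or not sorted descending.
SELECT schemaname, tablename, attname
  FROM pg_stats
 WHERE most_common_freqs IS NOT NULL
   AND EXISTS (
         SELECT 1
           FROM generate_subscripts(most_common_freqs, 1) AS i
          WHERE most_common_freqs[i] NOT BETWEEN 0.0 AND 1.0
             OR (i > 1 AND most_common_freqs[i] > most_common_freqs[i-1]));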
On Thu, 2024-03-21 at 03:27 -0400, Corey Huinker wrote:
>
> They can, but part of what I wanted to show was that the values that
> aren't directly passed in as parameters (staopN, stacollN) get set to
> the correct values, and those values aren't guaranteed to match
> across databases, hence testing them in the regression test rather
> than in a TAP test. I'd still like to be able to test that.

OK, that's fine.

> > The function signature for pg_set_attribute_stats could be more
> > friendly ...
> 1. We'd have to compare the stats provided against the stats that are
> already there, make that list in-memory, and then re-order what
> remains
> 2. There would be no way to un-set statistics of a given stakind,
> unless we added an "actually set it null" boolean for each parameter
> that can be null.
> 3. I tried that with the JSON formats, it made the code even messier
> than it already was.

How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

CREATE TABLE a(i int);
INSERT INTO a VALUES(1);
INSERT INTO a VALUES(2);
INSERT INTO a VALUES(2);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
ANALYZE a;
SELECT most_common_vals, most_common_freqs
  FROM pg_stats WHERE tablename='a';

 most_common_vals | most_common_freqs
------------------+-------------------
 {4,3,2}          | {0.4,0.3,0.2}
(1 row)

Can you show an example where it's not?

> > Maybe we could have the functions restricted to a role or roles:
> >
> > 1. pg_write_all_stats (can modify stats on ANY table)
> > 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

Regards,
	Jeff Davis
How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.
+1
Default NULL has been implemented for all parameters after n_distinct.
> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.
Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.
In which case wouldn't the checkCanModify on pg_statistic be a proxy for is_superuser/has_special_role_we_havent_created_yet?
>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.
I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.
That's certainly do-able.
>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.
In my simple test, most_common_freqs is descending:
CREATE TABLE a(i int);
INSERT INTO a VALUES(1);
INSERT INTO a VALUES(2);
INSERT INTO a VALUES(2);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(3);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
INSERT INTO a VALUES(4);
ANALYZE a;
SELECT most_common_vals, most_common_freqs
FROM pg_stats WHERE tablename='a';
most_common_vals | most_common_freqs
------------------+-------------------
{4,3,2} | {0.4,0.3,0.2}
(1 row)
Can you show an example where it's not?
>
> Maybe we could have the functions restricted to a role or roles:
>
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)
If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.
A great many test cases coming in the next patch.
On Thu, 2024-03-21 at 15:10 -0400, Corey Huinker wrote:
>
> In which case wouldn't the checkCanModify on pg_statistic be a
> proxy for is_superuser/has_special_role_we_havent_created_yet?

So if someone pg_dumps their table and gets the statistics in the SQL,
then they will get errors loading it unless they are a member of a
special role? If so we'd certainly need to make --no-statistics the
default, and have some way of skipping stats during reload of the dump
(perhaps make the set function a no-op based on a GUC?).

But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

Regards,
Jeff Davis
But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.
> Not off hand, no.
To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.
0001 -
The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough recoverable conditions to justify a boolean return value. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement may no longer be desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore should use these functions.
The function pg_set_attribute_stats() now has NULL defaults for all stakind-based statistics types. Thus, you can set statistics more tersely, like so:
SELECT pg_catalog.pg_set_attribute_stats(
relation => 'stats_export_import.test'::regclass,
attname => 'id'::name,
inherited => false::boolean,
null_frac => 0.5::real,
avg_width => 2::integer,
n_distinct => -0.1::real,
most_common_vals => '{2,1,3}'::text,
most_common_freqs => '{0.3,0.25,0.05}'::real[]
);
This would generate a pg_statistic row with exactly one stakind in it, and replaces whatever statistics previously existed for that attribute.
It now checks for many types of data inconsistencies, and most (35) of those have test coverage in the regression tests. There are a few areas still uncovered, mostly surrounding histograms where the datatype is dependent on the attribute.
The functions both require that the caller be the owner of the table/index.
The function pg_set_relation_stats is largely unchanged from previous versions.
Key areas where I'm seeking feedback:
- Any performance regressions will be remedied with the next autovacuum or manual ANALYZE.
Attachment
v12 attached.
Attachment
v12 attached.
0001 -
+ format('SELECT pg_catalog.pg_set_attribute_stats( '
+ || 'relation => %L::regclass::oid, attname => %L::name, '
+ || 'inherited => %L::boolean, null_frac => %L::real, '
+ || 'avg_width => %L::integer, n_distinct => %L::real, '
+ || 'most_common_vals => %L::text, '
+ || 'most_common_freqs => %L::real[], '
+ || 'histogram_bounds => %L::text, '
+ || 'correlation => %L::real, '
+ || 'most_common_elems => %L::text, '
+ || 'most_common_elem_freqs => %L::real[], '
+ || 'elem_count_histogram => %L::real[], '
+ || 'range_length_histogram => %L::text, '
+ || 'range_empty_frac => %L::real, '
+ || 'range_bounds_histogram => %L::text) ',
+ 'stats_export_import.' || s.tablename || '_clone', s.attname,
+ s.inherited, s.null_frac,
+ s.avg_width, s.n_distinct,
+ s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+ s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+ s.elem_count_histogram, s.range_length_histogram,
+ s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec
Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.
+ <table id="functions-admin-statsimport">
+ <title>Database Object Statistics Import Functions</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ Function
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".
The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough recoverable conditions to justify a boolean return value. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement may no longer be desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore should use these functions.

0002 -

This patch concerns invoking the functions in 0001 via pg_restore/pg_upgrade. Little has changed here. Dumping statistics is currently the default for pg_dump/pg_restore/pg_upgrade, and can be switched off with the switch --no-statistics. Some have expressed concern about whether stats dumping should be the default. I have a slight preference for making it the default, for the following reasons:
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+ tbinfo->dobj.dumpId);
+
Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a
foreign table makes sense, since it does not contain data but the statistics are still meaningful.
Key areas where I'm seeking feedback:
- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
On 3/25/24 09:27, Corey Huinker wrote:
> On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com>
> wrote:
>
>> v12 attached.
>>
>>
> v13 attached. All the same features as v12, but with a lot more type
> checking, bounds checking, value inspection, etc. Perhaps the most notable
> feature is that we're now ensuring that histogram values are in ascending
> order. This could come in handy for detecting when we are applying stats to
> a column of the wrong type, or the right type but with a different
> collation. It's not a guarantee of validity, of course, but it would detect
> egregious changes in sort order.

Hi,

I did take a closer look at v13 today. I have a bunch of comments and
some minor whitespace fixes in the attached review patches.

0001
----

1) The docs say this:

    <para>
     The purpose of this function is to apply statistics values in an
     upgrade situation that are "good enough" for system operation until
     they are replaced by the next <command>ANALYZE</command>, usually via
     <command>autovacuum</command> This function is used by
     <command>pg_upgrade</command> and <command>pg_restore</command> to
     convey the statistics from the old system version into the new one.
    </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

2) Also, I'm not sure about this:

     <parameter>relation</parameter>, the parameters in this are all
     derived from <structname>pg_stats</structname>, and the values
     given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

6) There's a couple minor whitespace fixes or comments etc.

0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

2) I think this condition in dumpTableSchema() is actually incorrect:

    if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
        (tbinfo->relkind == RELKIND_VIEW))
        dumpRelationStats(fout, &tbinfo->dobj, reltypename,

Aren't views pretty much exactly the thing for which we don't want to
dump statistics? In fact this skips dumping statistics for table - if
you dump a database with a single table (-Fc), pg_restore -l will tell
you this:

    217; 1259 16385 TABLE public t user
    3403; 0 16385 TABLE DATA public t user

Which is not surprising, because table is not a view.
With an expression index you get this:

    217; 1259 16385 TABLE public t user
    3404; 0 16385 TABLE DATA public t user
    3258; 1259 16418 INDEX public t_expr_idx user
    3411; 0 0 STATS IMPORT public INDEX t_expr_idx

Unfortunately, fixing the condition does not work:

    $ pg_dump -Fc test > test.dump
    pg_dump: warning: archive items not in correct section order

This happens for a very simple reason - the statistics are marked as
SECTION_POST_DATA, which for the index works, because indexes are in
post-data section. But the table stats are dumped right after data,
still in the "data" section.

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
+\gexec
Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.
We don't strictly need it, but I've found the set-difference operation to be incredibly useful in diagnosing problems. Additionally, the values are subject to change due to changes in test data, no guarantee that the output of ANALYZE is deterministic, etc. But most of all, because the test cares about the correct copying of values, not the values themselves.
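For what it's worth, a minimal sketch of that set-difference check,
restricted to the scalar columns to keep it simple (the schema and the
"_clone" naming follow the test query above):

-- an empty result means the scalar stats were copied faithfully
(SELECT attname, inherited, null_frac, avg_width, n_distinct, correlation
 FROM pg_stats
 WHERE schemaname = 'stats_export_import' AND tablename = 'test')
EXCEPT
(SELECT attname, inherited, null_frac, avg_width, n_distinct, correlation
 FROM pg_stats
 WHERE schemaname = 'stats_export_import' AND tablename = 'test_clone');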
+ <table id="functions-admin-statsimport">
+ <title>Database Object Statistics Import Functions</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ Function
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".
Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a
foreign table makes sense, since it does not contain data but the statistics are still meaningful.
Good points, but I'm not immediately sure how to enforce those rules.
Key areas where I'm seeking feedback:
- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.

Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?

$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump related tests will have statistics output by default. That's good, since the functionality will always be tested.

1. We need additional tests to ensure that the statistics are installed after restore.
2. Some of those tests compare dumps before and after restore. If the statistics change because of an auto-analyze happening post-restore, those tests will fail.
I believe that, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attributes. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?
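For concreteness, something like the following is what I have in mind, per
imported table (purely a sketch: the table, column, and values are invented,
and the relation-level parameter names are my assumption based on the
pg_class columns discussed in this thread):

SELECT pg_catalog.pg_set_relation_stats(
    relation => 'remote_schema.remote_table'::regclass,
    relpages => 1234::integer,
    reltuples => 100000::real,
    relallvisible => 1200::integer);

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'remote_schema.remote_table'::regclass,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 0.0::real,
    avg_width => 4::integer,
    n_distinct => -1::real);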
1) The docs say this:
<para>
The purpose of this function is to apply statistics values in an
upgrade situation that are "good enough" for system operation until
they are replaced by the next <command>ANALYZE</command>, usually via
<command>autovacuum</command> This function is used by
<command>pg_upgrade</command> and <command>pg_restore</command> to
convey the statistics from the old system version into the new one.
</para>
I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.
2) Also, I'm not sure about this:
<parameter>relation</parameter>, the parameters in this are all
derived from <structname>pg_stats</structname>, and the values
given are most often extracted from there.
How do we know where do the values come from "most often"? I mean, where
else would it come from?
The next most likely sources would be 1. stats from another similar table and 2. the imagination of a user testing hypothetical query plans.
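As a sketch of that second use case (names and numbers invented; the
relation-level parameter names are assumed from this thread, and since the
pg_class updates are applied non-transactionally, a manual ANALYZE
afterwards is what restores realistic values):

-- pretend the table is ~1000x its current size and see how plans react
SELECT pg_catalog.pg_set_relation_stats(
    relation => 'public.mytable'::regclass,
    relpages => 1000000::integer,
    reltuples => 300000000::real,
    relallvisible => 0::integer);

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'public.mytable'::regclass,
    attname => 'status'::name,
    inherited => false::boolean,
    null_frac => 0.0::real,
    avg_width => 4::integer,
    n_distinct => 5::real);

EXPLAIN SELECT * FROM public.mytable WHERE status = 3;

-- ANALYZE public.mytable;   -- puts the real statistics back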
3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.
4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.
5) The comment for param_names in pg_set_attribute_stats says this:
/* names of columns that cannot be null */
const char *param_names[] = { ... }
but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?
6) There's a couple minor whitespace fixes or comments etc.
0002
----
1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.
2) I think this condition in dumpTableSchema() is actually incorrect:
IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.
3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".
Hi Tom,

Comparing the current patch set to your advice below:

On Tue, 2023-12-26 at 14:19 -0500, Tom Lane wrote:
> I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
>
> SELECT pg_catalog.load_statistics(....);
>
> This has a number of advantages, not least of which is that an
> extension
> could plausibly add compatible functions to older versions.

Check.

> The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
>
> load_statistics(table regclass, attname text, stakind int, stavalue
> ...);

The problem with basing the function on pg_statistic directly is that
it can only be exported by the superuser. The current patches instead
base it on the pg_stats view, which already does the privilege
checking. Technically, information about which stakinds go in which
slots is lost, but I don't think that's a problem as long as the stats
make it in, right? It's also more user-friendly to have nice names for
the function arguments.

The only downside I see is that it's slightly asymmetric: exporting
from pg_stats and importing into pg_statistic.

I do have some concerns about letting non-superusers import their own
statistics: how robust is the rest of the code to handle malformed
stats once they make it into pg_statistic? Corey has addressed that
with basic input validation, so I think it's fine, but perhaps I'm
missing something.

> As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.

You mean a version argument to the function, which would appear in the
exported stats data? That's not in the current patch set. It's relying
on the new version of pg_dump understanding the old statistics data,
and dumping it out in a form that the new server will understand.

> I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.

That's not in the current patch set, either.

Regards,
Jeff Davis
On Tue, 2024-03-26 at 00:16 +0100, Tomas Vondra wrote:
> I did take a closer look at v13 today. I have a bunch of comments and
> some minor whitespace fixes in the attached review patches.

I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

The attached patch is pretty rough -- not many comments, and perhaps
some things should be moved around. I only tested very basic
dump/reload in SQL format.

Regards,
Jeff Davis
Attachment
I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.
As for v14, here are the highlights:
- broke up pg_set_attribute_stats() into many functions. Every stat kind gets its own validation function. Type derivation is now done in its own function.
- no more mention of pg_dump in the function documentation
- All relstats and attrstats calls are now their own statement instead of a compound statement
- moved the archive TOC entry from post-data back to SECTION_NONE (as it was modeled on object COMMENTs), which seems to work better.
Attachment
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> That is fairly close to what I came up with per our conversation
> (attached below), but I really like the att_stats_arginfo construct
> and I definitely want to adopt that and expand it to a third
> dimension that flags the fields that cannot be null. I will
> incorporate that into v15.

Sounds good. I think it cuts down on the boilerplate.

> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

There's still a failure in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

Regards,
Jeff Davis
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/
Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.
There's still a failure in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

What do you think about starting off with it as non-default, and then
switching it to default in 18?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> I’d certainly think “with stats” would be the preferred default of
>> our users.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?

I'm with Stephen: I find it very hard to imagine that there's any
users who wouldn't want this as default. If we do what you suggest,
then there will be three historical behaviors to cope with not two.
That doesn't sound like it'll make anyone's life better.

As for the "it might break" argument, that could be leveled against
any nontrivial patch. You're at least offering an opt-out switch,
which is something we more often don't do. (I've not read the patch
yet, but I assume the switch works like other pg_dump filters in that
you can apply it on the restore side?)

regards, tom lane
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.
I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.
(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)
Correct. It follows the existing --no-something pattern.
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.
I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.
Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.
What do you think about starting off with it as non-default, and then
switching it to default in 18?
On Fri, 2024-03-29 at 20:54 -0400, Stephen Frost wrote:
> What’s different, given the above arguments, in making the change
> with 18 instead of now?

Acknowledged. You, Tom, and Corey (and perhaps everyone else) seem to
be aligned here, so that's consensus enough for me. Default is with
stats, --no-statistics to disable them.

> Independently, I had a thought around doing an analyze as the data is
> being loaded ..

Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
Jeff Davis
Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.
Regards,
Jeff Davis
v15 attached
- fixed an error involving tsvector types
- only derive element type if element stats available
- general cleanup
0002:
TODO list:
- decision on whether suppressing stats in the pg_upgrade TAP check is for the best
- pg_upgrade option to suppress stats import, there is no real pattern to follow there
- what additional error context we want to add to the array_in() imports of anyarray strings
Attachment
On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.
I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

To that end, I'm going to be modifying the "Optimizer statistics are not transferred by pg_upgrade..." message when stats _were_ transferred, with additional instructions that the user should treat any stats-ish error messages encountered as a reason to manually analyze that table. We should probably say something about extended stats as well.
On Sat, 2024-03-30 at 01:34 -0400, Corey Huinker wrote:
>
> - 002pg_upgrade.pl now dumps before/after databases with --no-
> statistics. I tried to find out why some tables were getting their
> relstats either not set, or set and reset, never affecting the
> attribute stats. I even tried turning autovacuum off for both
> instances, but nothing seemed to change the fact that the same tables
> were having their relstats reset.

I think I found out why this is happening: a schema-only dump first
creates the table, then sets the relstats, then creates indexes. The
index creation updates the relstats, but because the dump was
schema-only, it overwrites the relstats with zeros.

That exposes an interesting dependency, which is that relstats must be
set after index creation, otherwise they will be lost -- at least in
the case of pg_upgrade.

This re-raises the question of whether stats are part of a schema-only
dump or not. Did we settle conclusively that they are?

Regards,
Jeff Davis
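(To make the mechanism Jeff describes concrete, a small illustration; the
relation-stats parameter names are assumed as elsewhere in this thread:)

CREATE TABLE t (i int);

SELECT pg_catalog.pg_set_relation_stats(
    relation => 't'::regclass,
    relpages => 100::integer,
    reltuples => 10000::real,
    relallvisible => 0::integer);

SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
-- shows the values just applied

CREATE INDEX t_i_idx ON t (i);

SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
-- the index build recomputes the heap's relpages/reltuples from the
-- (empty) table, clobbering the values applied above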
Jeff Davis <pgsql@j-davis.com> writes:
> This re-raises the question of whether stats are part of a schema-only
> dump or not. Did we settle conclusively that they are?

Surely they are data, not schema. It would make zero sense to restore
them if you aren't restoring the data they describe.

Hence, it'll be a bit messy if we can't put them in the dump's DATA
section. Maybe we need to revisit CREATE INDEX's behavior rather
than assuming it's graven in stone?

regards, tom lane
On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
> Surely they are data, not schema. It would make zero sense to
> restore
> them if you aren't restoring the data they describe.

The complexity is that pg_upgrade does create the data, but relies on a
schema-only dump. So we'd need to at least account for that somehow,
either with a separate stats-only dump, or make a special case in
binary upgrade mode that dumps schema+stats (and resolves the CREATE
INDEX issue).

> Maybe we need to revisit CREATE INDEX's behavior rather
> than assuming it's graven in stone?

Would there be a significant cost to just not doing that? Or are you
suggesting that we special-case the behavior, or turn it off during
restore with a GUC?

Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
>> Surely they are data, not schema. It would make zero sense to
>> restore them if you aren't restoring the data they describe.

> The complexity is that pg_upgrade does create the data, but relies on a
> schema-only dump. So we'd need to at least account for that somehow,
> either with a separate stats-only dump, or make a special case in
> binary upgrade mode that dumps schema+stats (and resolves the CREATE
> INDEX issue).

Ah, good point. But binary-upgrade mode is special in tons of ways
already. I don't see a big problem with allowing it to dump stats
even though --schema-only would normally imply not doing that.
(You could also imagine an explicit positive --stats switch that would
override --schema-only, but I don't see that it's worth the trouble.)

>> Maybe we need to revisit CREATE INDEX's behavior rather
>> than assuming it's graven in stone?

> Would there be a significant cost to just not doing that? Or are you
> suggesting that we special-case the behavior, or turn it off during
> restore with a GUC?

I didn't have any specific proposal in mind, was just trying to think
outside the box.

regards, tom lane
On Sat, 2024-03-30 at 13:39 -0400, Tom Lane wrote:
> (You could also imagine an explicit positive --stats switch that
> would
> override --schema-only, but I don't see that it's worth the trouble.)

That would have its own utility for reproducing planner problems
outside of production systems. (That could be a separate feature,
though, and doesn't need to be a part of this patch set.)

Regards,
Jeff Davis
I'm getting late into this discussion and I apologize if I've missed this
being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of
this. The information needs to go somewhere where such tools can easily
consume it, and an informational message during runtime (which is also
likely to be translated in many environments) is the exact opposite of
that.
I didn't have any specific proposal in mind, was just trying to think
outside the box.
What if we added a separate section SECTION_STATISTICS which is run following post-data?
Corey Huinker <corey.huinker@gmail.com> writes:
>> I didn't have any specific proposal in mind, was just trying to think
>> outside the box.

> What if we added a separate section SECTION_STATISTICS which is run
> following post-data?

Maybe, but that would have a lot of side-effects on pg_dump's API
and probably on some end-user scripts. I'd rather not.

I haven't looked at the details, but I'm really a bit surprised
by Jeff's assertion that CREATE INDEX destroys statistics on the
base table. That seems wrong from here, and maybe something we
could have it not do. (I do realize that it recalculates reltuples
and relpages, but so what? If it updates those, the results should
be perfectly accurate.)

regards, tom lane
That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.
Table structure would be:

schemaname name
tablename name
attnames text[]
ext_stats text[]
The informational message, if it changes at all, could reference this new view as the place to learn about how well the stats import went.
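A rough sketch of what that view might look like (hypothetical: extended
statistics are omitted for brevity, the real thing would need the same
privilege filtering that pg_stats applies, and it would have to be owned by
a superuser since pg_statistic is not readable by ordinary roles):

CREATE VIEW pg_stats_missing WITH (security_barrier) AS
SELECT n.nspname AS schemaname,
       c.relname AS tablename,
       array_agg(a.attname ORDER BY a.attnum) AS attnames
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid
                   AND a.attnum > 0
                   AND NOT a.attisdropped
LEFT JOIN pg_statistic s ON s.starelid = c.oid
                        AND s.staattnum = a.attnum
WHERE c.relkind IN ('r', 'm')   -- plain tables and materialized views
  AND s.starelid IS NULL        -- no statistics row for this column
GROUP BY n.nspname, c.relname;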
vacuumdb might get a --missing-only option in addition to --analyze-in-stages.
For that matter, we could add --analyze-missing options to pg_restore and pg_upgrade to do the mopping up themselves.
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics. I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here. But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works. So I'd set that whole
topic on the back burner.

It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass. But I think we should leave that till we
have a bit more than zero field experience with this feature.

regards, tom lane
My apologies for having paid so little attention to this thread for
months. I got around to reading the v15 patches today, and while I
think they're going in more or less the right direction, there's a
long way to go IMO.

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics" call per
attribute. (I think at one point I'd suggested a call per stakind
slot, but that would lead to a bunch of UPDATEs on existing
pg_attribute tuples and hence a bunch of dead tuples at the end of an
import, so it's not the way to go. A series of UPDATEs would likely
also play poorly with a background auto-ANALYZE happening
concurrently.)

I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof. What happens when
somebody adds a new stakind (and hence new pg_stats column)? You
could try to add an overloaded pg_set_attribute_stats version with
more parameters, but I'm pretty sure that would lead to "ambiguous
function call" failures when trying to load old dump files containing
only the original parameters. The present design is also fragile in
that an unrecognized parameter will lead to a parse-time failure and
no function call happening, which is less robust than I'd like. As
lesser points, the relation argument ought to be declared regclass
not oid for convenience of use, and I really think that we need to
provide the source server's major version number --- maybe we will
never need that, but if we do and we don't have it we will be sad.

So this leads me to suggest that we'd be best off with a VARIADIC ANY
signature, where the variadic part consists of alternating parameter
labels and values:

pg_set_attribute_stats(table regclass, attribute name,
                       inherited bool, source_version int,
                       variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false, -- not inherited
                              16, -- source server major version
                              -- pairs of labels and values follow
                              'null_frac', 0.4,
                              'avg_width', 42,
                              'histogram_bounds',
                              array['a', 'b', 'c']::text[],
                              ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks. If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation). This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on. This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function. (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

If we write the labels as undecorated string literals as I show above,
I think they will arrive at the function as "unknown"-type constants,
which is a little weird but doesn't seem like it's really a big
problem. The alternative is to cast them all to text explicitly, but
that's adding notation to no great benefit IMO.
pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too). If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I've not read the patches in great detail, but I did note a few
low-level issues:

* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct. Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?) That all looks
overly complicated and duplicative. I think you don't need two
layers of function there.

* I find the stuff with enums and "const char *param_names" to be
way too cute and unlike anything we do elsewhere. Please don't
invent your own notations for coding patterns that have hundreds
of existing instances. pg_set_relation_stats, for example, has
absolutely no reason not to look like the usual

    Oid         relid = PG_GETARG_OID(0);
    float4      relpages = PG_GETARG_FLOAT4(1);
    ... etc ...

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all. (Maybe they don't, see
next item.)

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try. As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely? You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all. (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.) The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA". I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations. We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries. But I don't think we need to.

* dumpRelationStats seems to be dumping the relation-level
stats twice.

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

regards, tom lane
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.
The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.
There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics. I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here. But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works. So I'd set that whole
topic on the back burner.
At least three people told me "nobody uses extended stats" and to just drop that from the initial version. Unhappy with this assessment, I inquired as to whether my employer (AWS) had some internal databases that used extended stats so that I could get good test data, and came up with nothing, nor did anyone know of customers who used the feature. So when the fourth person told me that nobody uses extended stats, and not to let a rarely-used feature get in the way of a feature that would benefit nearly 100% of users, I dropped it.
It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass. But I think we should leave that till we
have a bit more than zero field experience with this feature.
It may be that we leave the recommendation exactly as it is.
Perhaps we enhance the error messages in pg_set_*_stats() to indicate what command would remediate the issue.
Corey Huinker <corey.huinker@gmail.com> writes:
> On Sun, Mar 31, 2024 at 2:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There's a bigger issue though: AFAICS this patch set does nothing
>> about dumping extended statistics. I surely don't want to hold up
>> the patch insisting that that has to happen before we can commit the
>> functionality proposed here. But we cannot rip out pg_upgrade's
>> support for post-upgrade ANALYZE processing before we do something
>> about extended statistics, and that means it's premature to be
>> designing any changes to how that works. So I'd set that whole
>> topic on the back burner.

> So Extended Stats _were_ supported by earlier versions where the medium of
> communication was JSON. However, there were several problems with adapting
> that to the current model where we match params to stat types:
> * Several of the column types do not have functional input functions, so we
> must construct the data structure internally and pass them to
> statext_store().
> * The output functions for some of those column types have lists of
> attnums, with negative values representing positional expressions in the
> stat definition. This information is not translatable to another system
> without also passing along the attnum/attname mapping of the source system.

I wonder if the right answer to that is "let's enhance the I/O
functions for those types". But whether that helps or not, it's
v18-or-later material for sure.

> At least three people told me "nobody uses extended stats" and to just drop
> that from the initial version.

I can't quibble with that view of what has priority. I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

regards, tom lane
I wonder if the right answer to that is "let's enhance the I/O
functions for those types". But whether that helps or not, it's
v18-or-later material for sure.
That was Stephen's take as well, and I agreed given that I had to throw the kitchen-sink of source-side oid mappings (attname, types, collatons, operators) into the JSON to work around the limitation.
I can't quibble with that view of what has priority. I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.
Corey Huinker <corey.huinker@gmail.com> writes:
>> I can't quibble with that view of what has priority. I'm just
>> suggesting that redesigning what pg_upgrade does in this area
>> should come later than doing something about extended stats.

> I mostly agree, with the caveat that pg_upgrade's existing message saying
> that optimizer stats were not carried over wouldn't be 100% true anymore.

I think we can tweak the message wording. I just don't want to be
doing major redesign of the behavior, nor adding fundamentally new
monitoring capabilities.

regards, tom lane
I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute. (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go. A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)
That was my reasoning as well.
I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof. What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.
I don't think we'd overload, we'd just add new parameters to the function signature.
The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.
There was a lot of back-and-forth about what sorts of failures were error-worthy, and which were warn-worthy. I'll discuss further below.
As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use,
+1
and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.
So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:
pg_set_attribute_stats(table regclass, attribute name,
inherited bool, source_version int,
variadic "any") returns void
where a call might look like
SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
false, -- not inherited
16, -- source server major version
-- pairs of labels and values follow
'null_frac', 0.4,
'avg_width', 42,
'histogram_bounds',
array['a', 'b', 'c']::text[],
...);
Note a couple of useful things here:
* AFAICS we could label the function strict and remove all those ad-hoc
null checks. If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation). This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.
I'm not aware of how strict works with variadics. Would the lack of any variadic parameters trigger it?
Also going with strict means that an inadvertent explicit NULL in one parameter would cause the entire attribute import to fail silently. I'd rather fail loudly.
* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on. This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.
* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function. (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)
None of this functionality is accessible from a client program, so we'd have to pull in a lot of backend stuff to pg_dump to make it resolve the typecasts correctly. Text and array_in() was just easier.
pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too). If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...
I'm still iffy about the silent failures of strict.
I looked it up, and the only change needed to switch from oid to regclass is in pg_proc.dat (and the docs, of course). So I'm already on board.
* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct. Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?) That all looks
overly complicated and duplicative. I think you don't need two
layers of function there.
To prove that the caller is the owner (or better) of the table.
* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.
Easily remedied in light of the deconstruct_array() suggestion below, but I do want to add that value_not_null_array_len() does check for nulls, and that function is used to generate all but one of the arrays (for the remaining one we're just verifying that its length matches the length of the other array). There's even a regression test that checks it (search for: "elem_count_histogram null element").
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all. (Maybe they don't, see
next item.)
+1
* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try. As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely? You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all. (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.) The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.
A lot of the feedback I got on this patch over the months concerned giving inaccurate, nonsensical, or malicious data to the planner. Surely the planner does do *some* defensive programming when fetching these values, but this is the first time those values were potentially set by a user, not by our own internal code. We can try to match types, collations, etc from source to dest, but even that would fall victim to another glibc-level collation change. Verifying that the list the source system said was sorted is actually sorted when put on the destination system is the truest test we're ever going to get, albeit for sampled elements.
* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.
All cases were made errors precisely to spark debate about which ones we'd want to continue from and which we'd want to error on. Also, I was under the impression it was bad form to follow up a NOTICE/WARNING with an ERROR in the same function call.
* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA". I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations. We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.
SECTION_NONE works best, but we're getting some situations where relpages/reltuples/relallvisible get reset to 0 in pg_class. Hence the temporary --no-statistics in the pg_upgrade TAP test.
SECTION_POST_DATA (a previous suggestion) causes something weird to happen where certain GRANT/REVOKEs happen outside of their expected section.
In work I've done since v15, I tried giving the table stats archive entry a dependency on every index (and index constraint) as well as the table itself, thinking that would get us past all resets of pg_class, but it hasn't worked.
* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries. But I don't think we need to.
Agreed. Don't need to, each function call now sinks or swims on its own.
* dumpRelationStats seems to be dumping the relation-level
stats twice.
+1
* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??
Temporary. Related to the pg_class overwrite issue above.
Corey Huinker <corey.huinker@gmail.com> writes: >> and I really think that we need to provide >> the source server's major version number --- maybe we will never >> need that, but if we do and we don't have it we will be sad. > The JSON had it, and I never did use it. Not against having it again. Well, you don't need it now seeing that the definition of pg_stats columns hasn't changed in the past ... but there's no guarantee we won't want to change them in the future. >> So this leads me to suggest that we'd be best off with a VARIADIC >> ANY signature, where the variadic part consists of alternating >> parameter labels and values: >> pg_set_attribute_stats(table regclass, attribute name, >> inherited bool, source_version int, >> variadic "any") returns void > I'm not aware of how strict works with variadics. Would the lack of any > variadic parameters trigger it? IIRC, "variadic any" requires having at least one variadic parameter. But that seems fine --- what would be the point, or even the semantics, of calling pg_set_attribute_stats with no data fields? > Also going with strict means that an inadvertent explicit NULL in one > parameter would cause the entire attribute import to fail silently. I'd > rather fail loudly. Not really convinced that that is worth any trouble... > * We can require the calling statement to cast arguments, particularly >> arrays, to the proper type, removing the need for conversions within >> the stats-setting function. (But instead, it'd need to check that the >> next "any" argument is the type it ought to be based on the type of >> the target column.) > So, that's tricky. The type of the values is not always the attribute type, Hmm. You would need to have enough smarts in pg_set_attribute_stats to identify the appropriate array type in any case: as coded, it needs that for coercion, whereas what I'm suggesting would only require it for checking, but either way you need it. I do concede that pg_dump (or other logic generating the calls) needs to know more under my proposal than before. I had been thinking that it would not need to hard-code that because it could look to see what the actual type is of the array it's dumping. However, I see that pg_typeof() doesn't work for that because it just returns anyarray. Perhaps we could invent a new backend function that extracts the actual element type of a non-null anyarray argument. Another way we could get to no-coercions is to stick with your signature but declare the relevant parameters as anyarray instead of text. I still think though that we'd be better off to leave the parameter matching to runtime, so that we-don't-recognize-that-field can be a warning not an error. >> * why is check_relation_permissions looking up the pg_class row? >> There's already a copy of that in the Relation struct. > To prove that the caller is the owner (or better) of the table. I think you missed my point: you're doing that inefficiently, and maybe even with race conditions. Use the relcache's copy of the pg_class row. >> * I'm dubious that we can fully vet the contents of these arrays, >> and even a little dubious that we need to try. > A lot of the feedback I got on this patch over the months concerned giving > inaccurate, nonsensical, or malicious data to the planner. Surely the > planner does do *some* defensive programming when fetching these values, > but this is the first time those values were potentially set by a user, not > by our own internal code. 
We can try to match types, collations, etc from > source to dest, but even that would fall victim to another glibc-level > collation change. That sort of concern is exactly why I think the planner has to, and does, defend itself. Even if you fully vet the data at the instant of loading, we might have the collation change under us later. It could be argued that feeding bogus data to the planner for testing purposes is a valid use-case for this feature. (Of course, as superuser we could inject bogus data into pg_statistic manually, so it's not necessary to have this feature for that purpose.) I guess I'm a great deal more sanguine than other people about the planner's ability to tolerate inconsistent data; but in any case I don't have a lot of faith in relying on checks in pg_set_attribute_stats to substitute for that ability. That idea mainly leads to having a whole lot of code that has to be kept in sync with other code that's far away from it and probably isn't coded in a parallel fashion either. >> * There's a lot of ERROR cases that maybe we ought to downgrade >> to WARN-and-press-on, in the service of not breaking the restore >> completely in case of trouble. > All cases were made error precisely to spark debate about which cases we'd > want to continue from and which we'd want to error from. Well, I'm here to debate it if you want, but I'll just note that *one* error will be enough to abort a pg_upgrade entirely, and most users these days get scared by errors during manual dump/restore too. So we had better not be throwing errors except for cases that we don't think pg_dump could ever emit. > Also, I was under > the impression it was bad form to follow up NOTICE/WARN with an ERROR in > the same function call. Seems like nonsense to me. WARN then ERROR about the same condition would be annoying, but that's not what we are talking about here. regards, tom lane
IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?
If my pg_dump run emitted a bunch of stats that could never be imported, I'd want to know. With silent failures, I don't.
Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.
A backend function that we can't guarantee exists on the source system. :(
Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text. I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.
I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:
# select '{1,2,3}'::anyarray;
ERROR: cannot accept a value of type anyarray
I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions. Use the relcache's copy
of the pg_class row.
Roger Wilco.
Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too. So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.
Hi Corey,
On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.
0001 - Some random comments
+SELECT
+ format('SELECT pg_catalog.pg_set_attribute_stats( '
+ || 'relation => %L::regclass::oid, attname => %L::name, '
+ || 'inherited => %L::boolean, null_frac => %L::real, '
+ || 'avg_width => %L::integer, n_distinct => %L::real, '
+ || 'most_common_vals => %L::text, '
+ || 'most_common_freqs => %L::real[], '
+ || 'histogram_bounds => %L::text, '
+ || 'correlation => %L::real, '
+ || 'most_common_elems => %L::text, '
+ || 'most_common_elem_freqs => %L::real[], '
+ || 'elem_count_histogram => %L::real[], '
+ || 'range_length_histogram => %L::text, '
+ || 'range_empty_frac => %L::real, '
+ || 'range_bounds_histogram => %L::text) ',
+ 'stats_export_import.' || s.tablename || '_clone', s.attname,
+ s.inherited, s.null_frac,
+ s.avg_width, s.n_distinct,
+ s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+ s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+ s.elem_count_histogram, s.range_length_histogram,
+ s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec
Why do we need to construct the command and then execute it? Can we instead execute the function directly? That would also avoid the ECHO magic.
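(For illustration, a direct invocation might look like the untested sketch below, casting the anyarray columns through text much as the generated command does. Whether it works verbatim depends on the exact parameter list, defaults, and casts of the function in the patch version at hand.)

SELECT pg_catalog.pg_set_attribute_stats(
           relation => ('stats_export_import.' || s.tablename || '_clone')::regclass::oid,
           attname => s.attname,
           inherited => s.inherited,
           null_frac => s.null_frac,
           avg_width => s.avg_width,
           n_distinct => s.n_distinct,
           most_common_vals => s.most_common_vals::text,
           most_common_freqs => s.most_common_freqs,
           histogram_bounds => s.histogram_bounds::text,
           correlation => s.correlation,
           most_common_elems => s.most_common_elems::text,
           most_common_elem_freqs => s.most_common_elem_freqs,
           elem_count_histogram => s.elem_count_histogram,
           range_length_histogram => s.range_length_histogram::text,
           range_empty_frac => s.range_empty_frac,
           range_bounds_histogram => s.range_bounds_histogram::text)
FROM pg_catalog.pg_stats AS s
WHERE s.schemaname = 'stats_export_import'
  AND s.tablename IN ('test', 'is_odd');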
+ <table id="functions-admin-statsimport">
+ <title>Database Object Statistics Import Functions</title>
+ <tgroup cols="1">
+ <thead>
+ <row>
+ <entry role="func_table_entry"><para role="func_signature">
+ Function
+ </para>
+ <para>
+ Description
+ </para></entry>
+ </row>
+ </thead>
COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".
+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+ tbinfo->dobj.dumpId);
+
Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are the
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a foreign
table makes sense, since it does not contain local data but the statistics are still meaningful.
Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$
IIUC, pg_dump includes statistics by default. That means all our pg_dump related tests will have statistics output by default. That's good, since the functionality will always be tested.
1. We need additional tests to ensure that the statistics are installed after restore.
2. Some of those tests compare dumps before and after restore. If the statistics are changed because of auto-analyze happening post-restore, these tests will fail.
+ * A more encapsulated version of can_modify_relation for when the the
+ * HeapTuple and Form_pg_class are not needed later.
+ */
+static void
+check_relation_permissions(Relation rel)
This function is used in exactly one place, so it usually wouldn't make much sense to write a separate function. But given that the caller is so long, this seems ok. If this function returned the cached tuple when the permission checks succeed, it could be used in the other place as well. The caller would be responsible for releasing the tuple or updating it.
Attached patch contains a test to invoke this function on a view. ANALYZE throws a WARNING when a view is passed to it. Similarly this function should refuse to update the statistics on relations for which ANALYZE throws a warning. A warning instead of an error seems fine.
+
+ const float4 min = 0.0;
+ const float4 max = 1.0;
When reading the validation condition, I have to look up the variable values. Could that be avoided by using the values directly in the condition itself? If there's some dependency elsewhere in the code, we can use macros. But I have not seen constant variables used in such a way elsewhere in the code.
+ values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
+ values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(attnum);
+ values[Anum_pg_statistic_stainherit - 1] = PG_GETARG_DATUM(P_INHERITED);
For a partitioned table this value has to be true. For a normal table when setting this value to true, it should at least make sure that the table has at least one child. Otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. Prologue of the function needs to be updated accordingly.
Attachment
Corey Huinker <corey.huinker@gmail.com> writes: >> IIRC, "variadic any" requires having at least one variadic parameter. >> But that seems fine --- what would be the point, or even the >> semantics, of calling pg_set_attribute_stats with no data fields? > If my pg_dump run emitted a bunch of stats that could never be imported, > I'd want to know. With silent failures, I don't. What do you think would be silent about that? If there's a complaint to be made, it's that it'd be a hard failure ("no such function"). To be clear, I'm ok with emitting ERROR for something that pg_dump clearly did wrong, which in this case would be emitting a set_statistics call for an attribute it had exactly no stats values for. What I think needs to be WARN is conditions that the originating pg_dump couldn't have foreseen, for example cross-version differences. If we do try to check things like sort order, that complaint obviously has to be WARN, since it's checking something potentially different from what was correct at the source server. >> Perhaps we could >> invent a new backend function that extracts the actual element type >> of a non-null anyarray argument. > A backend function that we can't guarantee exists on the source system. :( [ shrug... ] If this doesn't work for source servers below v17, that would be a little sad, but it wouldn't be the end of the world. I see your point that that is an argument for finding another way, though. >> Another way we could get to no-coercions is to stick with your >> signature but declare the relevant parameters as anyarray instead of >> text. > I'm a bit confused here. AFAIK we can't construct an anyarray in SQL: > # select '{1,2,3}'::anyarray; > ERROR: cannot accept a value of type anyarray That's not what I suggested at all. The function parameters would be declared anyarray, but the values passed to them would be coerced to the correct concrete array types. So as far as the coercion rules are concerned this'd be equivalent to the variadic-any approach. > That's pretty persuasive. It also means that we need to trap for error in > the array_in() calls, as that function does not yet have a _safe() mode. Well, the approach I'm advocating for would have the array input and coercion done by the calling query before control ever reaches pg_set_attribute_stats, so that any incorrect-for-the-data-type values would result in hard errors. I think that's okay for the same reason you probably figured you didn't have to trap array_in: it's the fault of the originating pg_dump if it offers a value that doesn't coerce to the datatype it claims the value is of. My formulation is a bit safer though in that it's the originating pg_dump, not the receiving server, that is in charge of saying which type that is. (If that type doesn't agree with what the receiving server thinks it should be, that's a condition that pg_set_attribute_stats itself will detect, and then it can WARN and move on to the next thing.) regards, tom lane
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote: > I haven't looked at the details, but I'm really a bit surprised > by Jeff's assertion that CREATE INDEX destroys statistics on the > base table. That seems wrong from here, and maybe something we > could have it not do. (I do realize that it recalculates reltuples > and relpages, but so what? If it updates those, the results should > be perfectly accurate.) In the v15 of the patch I was looking at, "pg_dump -s" included the statistics. The stats appeared first in the dump, followed by the CREATE INDEX commands. The latter overwrote the relpages/reltuples set by the former. While zeros are the right answers for a schema-only dump, it defeated the purpose of including relpages/reltuples stats in the dump, and caused the pg_upgrade TAP test to fail. You're right that there are a number of ways this could be resolved -- I don't think it's an inherent problem. Regards, Jeff Davis
Reality check --- are we still targeting this feature for PG 17? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Jeff Davis <pgsql@j-davis.com> writes: > On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote: >> I haven't looked at the details, but I'm really a bit surprised >> by Jeff's assertion that CREATE INDEX destroys statistics on the >> base table. That seems wrong from here, and maybe something we >> could have it not do. (I do realize that it recalculates reltuples >> and relpages, but so what? If it updates those, the results should >> be perfectly accurate.) > In the v15 of the patch I was looking at, "pg_dump -s" included the > statistics. The stats appeared first in the dump, followed by the > CREATE INDEX commands. The latter overwrote the relpages/reltuples set > by the former. > While zeros are the right answers for a schema-only dump, it defeated > the purpose of including relpages/reltuples stats in the dump, and > caused the pg_upgrade TAP test to fail. > You're right that there are a number of ways this could be resolved -- > I don't think it's an inherent problem. I'm inclined to call it not a problem at all. While I do agree there are use-cases for injecting false statistics with these functions, I do not think that pg_dump has to cater to such use-cases. In any case, I remain of the opinion that stats are data and should not be included in a -s dump (with some sort of exception for pg_upgrade). If the data has been loaded, then a subsequent overwrite by CREATE INDEX should not be a problem. regards, tom lane
Bruce Momjian <bruce@momjian.us> writes: > Reality check --- are we still targeting this feature for PG 17? I'm not sure. I think if we put our heads down we could finish the changes I'm suggesting and resolve the other issues this week. However, it is starting to feel like the sort of large, barely-ready patch that we often regret cramming in at the last minute. Maybe we should agree that the first v18 CF would be a better time to commit it. regards, tom lane
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: > Reality check --- are we still targeting this feature for PG 17? I see a few useful pieces here: 1. Support import of statistics (i.e. pg_set_{relation|attribute}_stats()). 2. Support pg_dump of stats 3. Support pg_upgrade with stats It's possible that not all of them make it, but let's not dismiss the entire feature yet. Regards, Jeff Davis
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote: > Corey Huinker <corey.huinker@gmail.com> writes: > >> I can't quibble with that view of what has priority. I'm just > >> suggesting that redesigning what pg_upgrade does in this area > >> should come later than doing something about extended stats. > > > I mostly agree, with the caveat that pg_upgrade's existing message saying > > that optimizer stats were not carried over wouldn't be 100% true anymore. > > I think we can tweak the message wording. I just don't want to be > doing major redesign of the behavior, nor adding fundamentally new > monitoring capabilities. I think pg_upgrade could check for the existence of extended statistics in any database and adjust the analyze recommendation wording accordingly. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
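(The per-database check itself would be cheap; something along the lines of the sketch below, run against each database, would be enough to decide which wording to print.)

-- Sketch: does this database define any extended statistics objects?
SELECT EXISTS (SELECT 1 FROM pg_catalog.pg_statistic_ext);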
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote: > What happens when > somebody adds a new stakind (and hence new pg_stats column)? > You could try to add an overloaded pg_set_attribute_stats > version with more parameters, but I'm pretty sure that would > lead to "ambiguous function call" failures when trying to load > old dump files containing only the original parameters. Why would you need to overload in this case? Wouldn't we just define a new function with more optional named parameters? > The > present design is also fragile in that an unrecognized parameter > will lead to a parse-time failure and no function call happening, > which is less robust than I'd like. I agree on this point; I found this annoying when testing the feature. > So this leads me to suggest that we'd be best off with a VARIADIC > ANY signature, where the variadic part consists of alternating > parameter labels and values: I didn't consider this and I think it has a lot of advantages. It's slightly unfortunate that we can't make them explicitly name/value pairs, but pg_dump can use whitespace or even SQL comments to make it more readable. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote: >> Reality check --- are we still targeting this feature for PG 17? > I see a few useful pieces here: > 1. Support import of statistics (i.e. > pg_set_{relation|attribute}_stats()). > 2. Support pg_dump of stats > 3. Support pg_upgrade with stats > It's possible that not all of them make it, but let's not dismiss the > entire feature yet. The unresolved questions largely have to do with the interactions between these pieces. I think we would seriously regret setting any one of them in stone before all three are ready to go. regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote: >> What happens when >> somebody adds a new stakind (and hence new pg_stats column)? > Why would you need to overload in this case? Wouldn't we just define a > new function with more optional named parameters? Ah, yeah, you could change the function to have more parameters, given the assumption that all calls will be named-parameter style. I still suggest that my proposal is more robust for the case where the dump lists parameters that the receiving system doesn't have. regards, tom lane
That's not what I suggested at all. The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types. So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.
+1
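(To make the distinction concrete: a bare ::anyarray cast of a literal is rejected, as shown earlier in the thread, but a polymorphic anyarray parameter accepts any concrete array value and resolves to its actual type. A throwaway example with a hypothetical demo function:)

CREATE FUNCTION demo_anyarray(a anyarray) RETURNS regtype
    LANGUAGE sql AS $$ SELECT pg_typeof(a) $$;

SELECT demo_anyarray('{1,2,3}'::int4[]);   -- int4[]
SELECT demo_anyarray(ARRAY['x','y']);      -- text[]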
> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.
Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors. I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.
+1
I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.
Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.
Corey Huinker <corey.huinker@gmail.com> writes: > So what's the behavior when the user fails to supply a parameter that is > currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit? I still think that we could just declare the function strict, if we use the variadic-any approach. Passing a null in any position is indisputable caller error. However, if you're allergic to silently doing nothing in such a case, we could have pg_set_attribute_stats check each argument and throw an error. (Or warn and keep going; but according to the design principle I posited earlier, this'd be the sort of thing we don't need to tolerate.) regards, tom lane
I still think that we could just declare the function strict, if we
use the variadic-any approach. Passing a null in any position is
indisputable caller error. However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error. (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)
Any thoughts about going back to having a return value, so a caller could then see that the function returned NULL rather than whatever the expected value was (example: TRUE)?
Corey Huinker <corey.huinker@gmail.com> writes: > Any thoughts about going back to having a return value, a caller could then > see that the function returned NULL rather than whatever the expected value > was (example: TRUE)? If we are envisioning that the function might emit multiple warnings per call, a useful definition could be to return the number of warnings (so zero is good, not-zero is bad). But I'm not sure that's really better than a boolean result. pg_dump/pg_restore won't notice anyway, but perhaps other programs using these functions would care. regards, tom lane
If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad). But I'm not sure that's
really better than a boolean result. pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.
A boolean is what we had before, I'm quite comfortable with that, and it addresses my silent-failure concerns.
Corey Huinker <corey.huinker@gmail.com> writes: > A boolean is what we had before, I'm quite comfortable with that, and it > addresses my silent-failure concerns. WFM. regards, tom lane
Attachment
On Tue, 2024-04-02 at 05:38 -0400, Corey Huinker wrote: > Here's a one-liner patch for disabling update of pg_class > relpages/reltuples/relallvisible during a binary upgrade. This change makes sense to me regardless of the rest of the work. Updating the relpages/reltuples/relallvisible during pg_upgrade before the data is there will store the wrong stats. It could use a brief comment, though. Regards, Jeff Davis
Jeff suggested looking at anyarray_send as a way of extracting the type, and with some extra twiddling we can get and cast the type. However, some of the ANYARRAYs have element types that are themselves arrays, and near as I can tell, such a construct is not expressible in SQL. So, rather than getting an anyarray of an array type, you instead get an array of one higher dimension. Like so:
# select schemaname, tablename, attname,
substring(substring(anyarray_send(histogram_bounds) from 9 for 4)::text,2)::bit(32)::integer::regtype,
substring(substring(anyarray_send(histogram_bounds::text::text[][]) from 9 for 4)::text,2)::bit(32)::integer::regtype
from pg_stats where histogram_bounds is not null
and tablename = 'pg_proc' and attname = 'proargnames' ;
schemaname | tablename | attname | substring | substring
------------+-----------+-------------+-----------+-----------
pg_catalog | pg_proc | proargnames | text[] | text
Luckily, passing in such a value would have done all of the element typechecking for us, so we would just move the data to an array of one lower dimension, typed elem[]. If there's an easy way to do that, I don't know of it.
What remains is just checking the input types against the expected type of the array, stepping down the dimension if need be, and skipping if the type doesn't meet expectations.
On Tue, 2024-04-02 at 12:59 -0400, Corey Huinker wrote: > However, some of the ANYARRAYs have element types that are > themselves arrays, and near as I can tell, such a construct is not > expressible in SQL. So, rather than getting an anyarray of an array > type, you instead get an array of one higher dimension. Fundamentally, you want to recreate the exact same anyarray values on the destination system as they existed on the source. There's some complexity to that on both the export side as well as the import side, but I believe the problems are solvable. On the export side, the problem is that the element type (and dimensionality and maybe hasnull) is an important part of the anyarray value, but it's not part of the output of anyarray_out(). For new versions, we can add a scalar function that simply outputs the information we need. For old versions, we can hack it by parsing the output of anyarray_send(), which contains the information we need (binary outputs are under-specified, but I believe they are specified enough in this case). There may be other hacks to get the information from the older systems; that's just an idea. To get the actual data, doing histogram_bounds::text::text[] seems to be enough: that seems to always give a one-dimensional array with element type "text", even if the element type is an array. (Note: this means we need the function's API to also include this extra information about the anyarray values, so it might be slightly more complex than name/value pairs). On the import side, the problem is that there may not be an input function to go from a 1-D array of text to a 1-D array of any element type we want. For example, there's no input function that will create a 1-D array with element type float4[] (that's because Postgres doesn't really have arrays-of-arrays, it has multi-dimensional arrays). Instead, don't use the input function, pass each element of the 1-D text array to the element type's input function (which may be scalar or not) and then construct a 1-D array out of that with the appropriate element type (which may be scalar or not). Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On the export side, the problem is that the element type (and > dimensionality and maybe hasnull) is an important part of the anyarray > value, but it's not part of the output of anyarray_out(). For new > versions, we can add a scalar function that simply outputs the > information we need. For old versions, we can hack it by parsing the > output of anyarray_send(), which contains the information we need > (binary outputs are under-specified, but I believe they are specified > enough in this case). Yeah, I was thinking yesterday about pulling the anyarray columns in binary and looking at the header fields. However, I fear there is a showstopper problem: anyarray_send will fail if the element type doesn't have a typsend function, which is entirely possible for user-defined types (and I'm not even sure we've provided them for every type in the core distro). I haven't thought of a good answer to that other than a new backend function. However ... > On the import side, the problem is that there may not be an input > function to go from a 1-D array of text to a 1-D array of any element > type we want. For example, there's no input function that will create a > 1-D array with element type float4[] (that's because Postgres doesn't > really have arrays-of-arrays, it has multi-dimensional arrays). > Instead, don't use the input function, pass each element of the 1-D > text array to the element type's input function (which may be scalar or > not) and then construct a 1-D array out of that with the appropriate > element type (which may be scalar or not). Yup. I had hoped that we could avoid doing any array-munging inside pg_set_attribute_stats, but this array-of-arrays problem seems to mean we have to. In turn, that means that the whole idea of declaring the function inputs as anyarray rather than text[] is probably pointless. And that means that we don't need the sending side to know the element type anyway. So, I apologize for sending us down a useless side path. We may as well stick to the function signature as shown in the v15 patch --- although maybe variadic any is still worthwhile so that an unrecognized field name doesn't need to be a hard error? regards, tom lane
side to know the element type anyway. So, I apologize for sending
us down a useless side path. We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?
On Tue, 2024-04-02 at 17:31 -0400, Tom Lane wrote: > And that means that we don't need the sending > side to know the element type anyway. We need to get the original element type on the import side somehow, right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has element type "int4" or "text", which affects the binary representation of the anyarray value in pg_statistic. Either we need to get it at export time (which seems the most reliable in principle, but problematic for older versions) and pass it as an argument to pg_set_attribute_stats(); or we need to derive it reliably from the table schema on the destination side, right? Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > We need to get the original element type on the import side somehow, > right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has > element type "int4" or "text", which affects the binary representation > of the anyarray value in pg_statistic. Yeah, but that problem exists no matter what. I haven't read enough of the patch to find where it's determining that, but I assume there's code in there to intuit the statistics storage type depending on the table column's data type and the statistics kind. > Either we need to get it at export time (which seems the most reliable > in principle, but problematic for older versions) and pass it as an > argument to pg_set_attribute_stats(); or we need to derive it reliably > from the table schema on the destination side, right? We could not trust the exporting side to tell us the correct answer; for one reason, it might be different across different releases. So "derive it reliably on the destination" is really the only option. I think that it's impossible to do this in the general case, since type-specific typanalyze functions can store pretty nearly whatever they like. However, the pg_stats view isn't going to show nonstandard statistics kinds anyway, so we are going to be lossy for custom statistics kinds. regards, tom lane
Yeah, but that problem exists no matter what. I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.
We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.
I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like. However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.
Sadly true.
- both functions now use variadics for anything that can be considered a stat.
- most consistency checks removed, null element tests remain
- functions strive to not ERROR unless absolutely necessary. The biggest exposure is the call to array_in().
- docs have not yet been updated, pending general acceptance of the variadic over the named arg version.
Having variadic arguments is definitely a little more work to manage, and the shift from ERROR to WARN removes a lot of the easy exits it previously had, as well as requiring some extra type checking that we got for free with fixed arguments. Still, I don't think the readability suffers too much, and we are now able to handle downgrades as well as upgrades.
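(Under this scheme, a call would look roughly like the sketch below. The fixed leading arguments follow the signature Tom proposed upthread, the trailing label/value pairs reuse the pg_stats column names, the target table and values are hypothetical, and the exact argument list may differ in the attached version.)

SELECT pg_catalog.pg_set_attribute_stats(
           'stats_export_import.test'::regclass,  -- relation
           'id'::name,                            -- attribute
           false,                                 -- inherited
           170000,                                -- source server version
           'null_frac', 0.0::real,
           'avg_width', 4::integer,
           'n_distinct', -1.0::real,
           'histogram_bounds', '{1,25,50,75,100}'::text);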
Attachment
Corey Huinker <corey.huinker@gmail.com> writes: > - functions strive to not ERROR unless absolutely necessary. The biggest > exposure is the call to array_in(). As far as that goes, it shouldn't be that hard to deal with, at least not for "soft" errors which hopefully cover most input-function failures these days. You should be invoking array_in via InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext. (Look at pg_input_error_info() for useful precedent.) There might be something to be said for handling all the error cases via an ErrorSaveContext and use of ereturn() instead of ereport(). Not sure if it's worth the trouble or not. regards, tom lane
As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days. You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)
There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport(). Not sure if it's worth the trouble or not.
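(For what it's worth, the SQL-visible face of that soft-error machinery can be poked at directly; the two functions below exist as of v16 and exercise the same soft-error input path that InputFunctionCallSafe exposes at the C level:)

SELECT pg_input_is_valid('{1,2,three}', 'int4[]');
-- false, with no error raised

SELECT message FROM pg_input_error_info('{1,2,three}', 'int4[]');
-- something like: invalid input syntax for type integer: "three"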
It would help us tailor the user experience. Right now we have several endgames. To recap:
1. NULL input => Return NULL. (because strict).
2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR (thus ruining binary upgrade)
3. Call values are so bad (examples: attname not found, required stat missing) that nothing can recover => WARN, return FALSE.
4. At least one stakind-stat is wonky (impossible for datatype, missing stat pair, wrong type on input parameter), but that's the worst of it => 1 to N WARNs, write stats that do make sense, return TRUE.
5. Hunky-dory. => No warns. Write all stats. return TRUE.
Which of those seem like good ereturn candidates to you?
Corey Huinker <corey.huinker@gmail.com> writes: >> As far as that goes, it shouldn't be that hard to deal with, at least >> not for "soft" errors which hopefully cover most input-function >> failures these days. You should be invoking array_in via >> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext. >> (Look at pg_input_error_info() for useful precedent.) > Ah, my understanding may be out of date. I was under the impression that > that mechanism relied on the the cooperation of the per-element input > function, so even if we got all the builtin datatypes to play nice with > *Safe(), we were always going to be at risk with a user-defined input > function. That's correct, but it's silly not to do what we can. Also, I imagine that there is going to be high evolutionary pressure on UDTs to support soft error mode for COPY, so over time the problem will decrease --- as long as we invoke the soft error mode. > 1. NULL input => Return NULL. (because strict). > 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR > (thus ruining binary upgrade) > 3. Call values are so bad (examples: attname not found, required stat > missing) that nothing can recover => WARN, return FALSE. > 4. At least one stakind-stat is wonky (impossible for datatype, missing > stat pair, wrong type on input parameter), but that's the worst of it => 1 > to N WARNs, write stats that do make sense, return TRUE. > 5. Hunky-dory. => No warns. Write all stats. return TRUE. > Which of those seem like good ereturn candidates to you? I'm good with all those behaviors. On reflection, the design I was vaguely imagining wouldn't cope with case 4 (multiple WARNs per call) so never mind that. regards, tom lane
On Mon, Apr 01, 2024 at 01:21:53PM -0400, Tom Lane wrote: > I'm not sure. I think if we put our heads down we could finish > the changes I'm suggesting and resolve the other issues this week. > However, it is starting to feel like the sort of large, barely-ready > patch that we often regret cramming in at the last minute. Maybe > we should agree that the first v18 CF would be a better time to > commit it. There are still 4 days remaining, so there's still time, but my overall experience on the matter with my RMT hat on is telling me that we should not rush this patch set. Redesigning portions close to the end of a dev cycle is not a good sign, I am afraid, especially if the sub-parts of the design don't fit well in the global picture as that could mean more maintenance work on stable branches in the long term. Still, it is very good to be aware of the problems because you'd know what to tackle to reach the goals of this patch set. -- Michael
Attachment
I'm good with all those behaviors. On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.
regards, tom lane
v17

0001
- array_in now repackages cast errors as warnings and skips the stat, test added
- version parameter added, though it's mostly for future compatibility, tests modified
- both functions delay object/attribute locking until absolutely necessary
- general cleanup

0002
- added version parameter to dumps
- --schema-only will not dump stats unless in binary upgrade mode
- stats are dumped SECTION_NONE
- general cleanup

I think that covers the outstanding issues.
Attachment
For a partitioned table this value has to be true. For a normal table when setting this value to true, it should at least make sure that the table has at least one child. Otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. Prologue of the function needs to be updated accordingly.
I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.
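(The trivial check in question, at the SQL level, is essentially the one-liner below; in the function it would presumably be a pg_inherits lookup in C. The table name is a placeholder.)

SELECT EXISTS (SELECT 1
                 FROM pg_catalog.pg_inherits
                WHERE inhparent = 'public.some_table'::regclass) AS has_children;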
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes: > I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and > maybe also for IMPORT foreign schema where the SQL is generated by > PostgreSQL itself. But not for simulating statistics. In that case, if the > function happily installs statistics cooked by the user and those aren't > used anywhere, users may be misled by the plans that are generated > subsequently. Thus negating the very purpose of simulating > statistics. I'm not sure what you think the "purpose of simulating statistics" is, but it seems like you have an extremely narrow-minded view of it. I think we should allow injecting any stats that won't actively crash the backend. Such functionality could be useful for stress-testing the planner, for example, or even just to see what it would do in a situation that is not what you have. Note that I don't think pg_dump or pg_upgrade need to support injection of counterfactual statistics. But direct calls of the stats insertion functions should be able to do so. regards, tom lane
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.
I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend. Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.
--
On Thu, 2024-04-04 at 00:30 -0400, Corey Huinker wrote: > > v17 > > 0001 > - array_in now repackages cast errors as warnings and skips the stat, > test added > - version parameter added, though it's mostly for future > compatibility, tests modified > - both functions delay object/attribute locking until absolutely > necessary > - general cleanup > > 0002 > - added version parameter to dumps > - --schema-only will not dump stats unless in binary upgrade mode > - stats are dumped SECTION_NONE > - general cleanup > > I think that covers the outstanding issues. Thank you, this has improved a lot and the fundamentals are very close. I think it could benefit from a bit more time to settle on a few issues: 1. SECTION_NONE. Conceptually, stats are more like data, and so intuitively I would expect this in the SECTION_DATA or SECTION_POST_DATA. However, the two most important use cases (in my opinion) don't involve dumping the data: pg_upgrade (data doesn't come from the dump) and planner simulations/repros. Perhaps the section we place it in is not a critical decision, but we will need to stick with it for a long time, and I'm not sure that we have consensus on that point. 2. We changed the stats import function API to be VARIADIC very recently. After we have a bit of time to think on it, I'm not 100% sure we will want to stick with that new API. It's not easy to document, which is something I always like to consider. 3. The error handling also changed recently to change soft errors (i.e. type input errors) to warnings. I like this change but I'd need a bit more time to get comfortable with how this is done, there is not a lot of precedent for doing this kind of thing. This is connected to the return value, as well as the machine-readability concern that Magnus raised. Additionally, a lot of people are simply very busy around this time, and may not have had a chance to opine on all the recent changes yet. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Thank you, this has improved a lot and the fundamentals are very close. > I think it could benefit from a bit more time to settle on a few > issues: Yeah ... it feels like we aren't quite going to manage to get this over the line for v17. We could commit with the hope that these last details will get sorted later, but that path inevitably leads to a mess. > 1. SECTION_NONE. Conceptually, stats are more like data, and so > intuitively I would expect this in the SECTION_DATA or > SECTION_POST_DATA. However, the two most important use cases (in my > opinion) don't involve dumping the data: pg_upgrade (data doesn't come > from the dump) and planner simulations/repros. Perhaps the section we > place it in is not a critical decision, but we will need to stick with > it for a long time, and I'm not sure that we have consensus on that > point. I think it'll be a serious, serious error for this not to be SECTION_DATA. Maybe POST_DATA is OK, but even that seems like an implementation compromise not "the way it ought to be". > 2. We changed the stats import function API to be VARIADIC very > recently. After we have a bit of time to think on it, I'm not 100% sure > we will want to stick with that new API. It's not easy to document, > which is something I always like to consider. Perhaps. I think the argument of wanting to be able to salvage something even in the presence of unrecognized stats types is stronger, but I agree this could use more time in the oven. Unlike many other things in this patch, this would be nigh impossible to reconsider later. > 3. The error handling also changed recently to change soft errors (i.e. > type input errors) to warnings. I like this change but I'd need a bit > more time to get comfortable with how this is done, there is not a lot > of precedent for doing this kind of thing. I don't think there's much disagreement that that's the right thing, but yeah there could be bugs or some more to do in this area. regards, tom lane
I think it'll be a serious, serious error for this not to be
SECTION_DATA. Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".
We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.
At the request of a few people, attached is an attempt to move stats to DATA/POST-DATA, and the TAP test failure that results from that.
The relevant errors are confusing, in that they all concern GRANT/REVOKE, despite the fact that I made no changes to the TAP test itself.
$ grep 'not ok' build/meson-logs/testlog.txt
not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9348 - section_data: should not dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9349 - section_data: should not dump GRANT SELECT ON TABLE measurement
not ok 9350 - section_data: should not dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9351 - section_data: should not dump GRANT SELECT ON TABLE test_table
not ok 9379 - section_data: should not dump REVOKE SELECT ON TABLE pg_proc FROM public
not ok 9788 - section_pre_data: should dump CREATE TABLE test_table
not ok 9837 - section_pre_data: should dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9838 - section_pre_data: should dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9839 - section_pre_data: should dump GRANT SELECT ON TABLE measurement
not ok 9840 - section_pre_data: should dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9841 - section_pre_data: should dump GRANT SELECT ON TABLE test_table
not ok 9869 - section_pre_data: should dump REVOKE SELECT ON TABLE pg_proc FROM public
Attachment
On Thu, Apr 11, 2024 at 03:54:07PM -0400, Corey Huinker wrote: > At the request of a few people, attached is an attempt to move stats to > DATA/POST-DATA, and the TAP test failure that results from that. > > The relevant errors are confusing, in that they all concern GRANT/REVOKE, > and the fact that I made no changes to the TAP test itself. > > $ grep 'not ok' build/meson-logs/testlog.txt > not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE > test_second_table It looks like the problem is that the ACLs are getting dumped in the data section when we are also dumping stats. I'm able to get the tests to pass by moving the call to dumpRelationStats() that's in dumpTableSchema() to dumpTableData(). I'm not entirely sure why that fixes it yet, but if we're treating stats as data, then it intuitively makes sense for us to dump it in dumpTableData(). However, that seems to prevent the stats from getting exported in the --schema-only/--binary-upgrade scenario, which presents a problem for pg_upgrade. ISTM we'll need some extra hacks to get this to work as desired. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, 2024-04-17 at 11:50 -0500, Nathan Bossart wrote: > It looks like the problem is that the ACLs are getting dumped in the > data > section when we are also dumping stats. I'm able to get the tests to > pass > by moving the call to dumpRelationStats() that's in dumpTableSchema() > to > dumpTableData(). I'm not entirely sure why that fixes it yet, but if > we're > treating stats as data, then it intuitively makes sense for us to > dump it > in dumpTableData(). Would it make sense to have a new SECTION_STATS? > However, that seems to prevent the stats from getting > exported in the --schema-only/--binary-upgrade scenario, which > presents a > problem for pg_upgrade. ISTM we'll need some extra hacks to get this > to > work as desired. Philosophically, I suppose stats are data, but I still don't understand why considering stats to be data is so important in pg_dump. Practically, I want to dump stats XOR data. That's because, if I dump the data, it's so costly to reload and rebuild indexes that it's not very important to avoid a re-ANALYZE. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Would it make sense to have a new SECTION_STATS? Perhaps, but the implications for pg_dump's API would be nontrivial, eg would we break any applications that know about the current options for --section. And you still have to face up to the question "does --data-only include this stuff?". > Philosophically, I suppose stats are data, but I still don't understand > why considering stats to be data is so important in pg_dump. > Practically, I want to dump stats XOR data. That's because, if I dump > the data, it's so costly to reload and rebuild indexes that it's not > very important to avoid a re-ANALYZE. Hmm, interesting point. But the counterargument to that is that the cost of building indexes will also dwarf the cost of installing stats, so why not do so? Loading data without stats, and hoping that auto-analyze will catch up sooner not later, is exactly the current behavior that we're doing all this work to get out of. I don't really think we want it to continue to be the default. regards, tom lane
On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote: > Loading data without stats, and hoping > that auto-analyze will catch up sooner not later, is exactly the > current behavior that we're doing all this work to get out of. That's the disconnect, I think. For me, the main reason I'm excited about this work is as a way to solve the bad-plans-after-upgrade problem and to repro planner issues outside of production. Avoiding the need to ANALYZE at the end of a data load is also a nice convenience, but not a primary driver (for me). Should we just itemize some common use cases for pg_dump, and then choose the defaults that are least likely to cause surprise? As for the section, I'm not sure what to do about that. Based on this thread it seems that SECTION_NONE (or a SECTION_STATS?) is easiest to implement, but I don't understand the long-term consequences of that choice. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote: >> Loading data without stats, and hoping >> that auto-analyze will catch up sooner not later, is exactly the >> current behavior that we're doing all this work to get out of. > That's the disconnect, I think. For me, the main reason I'm excited > about this work is as a way to solve the bad-plans-after-upgrade > problem and to repro planner issues outside of production. Avoiding the > need to ANALYZE at the end of a data load is also a nice convenience, > but not a primary driver (for me). Oh, I don't doubt that there are use-cases for dumping stats without data. I'm just dubious about the reverse. I think data+stats should be the default, even if only because pg_dump's default has always been to dump everything. Then there should be a way to get stats only, and maybe a way to get data only. Maybe this does argue for a four-section definition, despite the ensuing churn in the pg_dump API. > Should we just itemize some common use cases for pg_dump, and then > choose the defaults that are least likely to cause surprise? Per above, I don't find any difficulty in deciding what should be the default. What I think we need to consider is what the pg_dump and pg_restore switch sets should be. There's certainly a few different ways we could present that; maybe we should sketch out the details for a couple of ways. regards, tom lane
On Tue, 23 Apr 2024, 05:52 Tom Lane, <tgl@sss.pgh.pa.us> wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote: > >> Loading data without stats, and hoping > >> that auto-analyze will catch up sooner not later, is exactly the > >> current behavior that we're doing all this work to get out of. > > > That's the disconnect, I think. For me, the main reason I'm excited > > about this work is as a way to solve the bad-plans-after-upgrade > > problem and to repro planner issues outside of production. Avoiding the > > need to ANALYZE at the end of a data load is also a nice convenience, > > but not a primary driver (for me). > > Oh, I don't doubt that there are use-cases for dumping stats without > data. I'm just dubious about the reverse. I think data+stats should > be the default, even if only because pg_dump's default has always > been to dump everything. Then there should be a way to get stats > only, and maybe a way to get data only. Maybe this does argue for a > four-section definition, despite the ensuing churn in the pg_dump API. I've heard of use cases where dumping stats without data would help with production database planner debugging on a non-prod system. Sure, some planner inputs would have to be taken into account too, but having an exact copy of production stats is at least a start and can help build models and alerts for what'll happen when the tables grow larger with the current stats. As for other planner inputs: table size is relatively easy to shim with sparse files; cumulative statistics can be copied from a donor replica if needed, and btree indexes only really really need to contain their highest and lowest values (and need their height set correctly). Kind regards, Matthias van de Meent
I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.
2. Dump to file/dir and restore elsewhere. (schema: on, data: on, stats: on)
Case #4 is handled via --no-statistics.
Attachment
On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote: > I've heard of use cases where dumping stats without data would help > with production database planner debugging on a non-prod system. > > Sure, some planner inputs would have to be taken into account too, but > having an exact copy of production stats is at least a start and can > help build models and alerts for what'll happen when the tables grow > larger with the current stats. > > As for other planner inputs: table size is relatively easy to shim > with sparse files; cumulative statistics can be copied from a donor > replica if needed, and btree indexes only really really need to > contain their highest and lowest values (and need their height set > correctly). Is it possible to prevent stats from being updated by autovacuum and other methods? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Wed, 24 Apr 2024 at 21:31, Bruce Momjian <bruce@momjian.us> wrote: > > On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote: > > I've heard of use cases where dumping stats without data would help > > with production database planner debugging on a non-prod system. > > > > Sure, some planner inputs would have to be taken into account too, but > > having an exact copy of production stats is at least a start and can > > help build models and alerts for what'll happen when the tables grow > > larger with the current stats. > > > > As for other planner inputs: table size is relatively easy to shim > > with sparse files; cumulative statistics can be copied from a donor > > replica if needed, and btree indexes only really really need to > > contain their highest and lowest values (and need their height set > > correctly). > > Is it possible to prevent stats from being updated by autovacuum You can set autovacuum_analyze_threshold and *_scale_factor to excessively high values, which has the effect of disabling autoanalyze until it has had similarly excessive tuple churn. But that won't guarantee autoanalyze won't run; that guarantee only exists with autovacuum = off. > and other methods? No nice ways. AFAIK there is no command (or command sequence) that can "disable" only ANALYZE and which also guarantee statistics won't be updated until ANALYZE is manually "re-enabled" for that table. An extension could maybe do this, but I'm not aware of any extension points where this would hook into PostgreSQL in a nice way. You can limit maintenance access on the table to only trusted roles that you know won't go in and run ANALYZE for those tables, or even only your superuser (so only they can run ANALYZE, and have them promise they won't). Alternatively, you can also constantly keep a lock on the table that conflicts with ANALYZE. The last few are just workarounds though, and not all something I'd suggest running on a production database. Kind regards, Matthias van de Meent
You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until it has had similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.
I'd be a bit afraid to set those values that high, for fear that they wouldn't get reset when normal operations resumed, and nobody would notice until things got bad.
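For anyone who does want to experiment with that workaround, a minimal sketch of the per-table settings might look like this (the table name is hypothetical, and per the discussion above only autovacuum_enabled = off is an actual guarantee):

-- Make autoanalyze effectively unreachable for one table by raising the
-- per-table thresholds to extreme values.
ALTER TABLE planner_repro SET (autovacuum_analyze_threshold = 2000000000,
                               autovacuum_analyze_scale_factor = 100);
-- The only hard guarantee: disable autovacuum for the table entirely
-- (which also stops autovacuum's VACUUM passes on it).
ALTER TABLE planner_repro SET (autovacuum_enabled = off);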
v20 is attached. It resolves the dependency issue in v19, so while I'm still unclear as to why we want it this way vs the simplicity of SECTION_NONE, I'm going to roll with it.
Next up is the question of how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ: correlation / etc.
One wrinkle with having three mutually exclusive options instead of two is that the existing code was able to assume that one of the options being true meant that we could bail out of certain dumpXYZ() functions, and now those tests have to compare against two, which makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?
Attachment
Next up is the question of how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ: correlation / etc.
One wrinkle with having three mutually exclusive options instead of two is that the existing code was able to assume that one of the options being true meant that we could bail out of certain dumpXYZ() functions, and now those tests have to compare against two, which makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?
v21 attached.
0001 is the same.
0002 is a preparatory change to pg_dump introducing DumpOption/RestoreOption variables dumpSchema and dumpData. The current code makes heavy use of the fact that schemaOnly and dataOnly are mutually exclusive and logically opposite. That will not be the case when statisticsOnly is introduced, so I decided to add the new variables, whose values are entirely derivative of the existing command flags but resolve the complexities of those interactions in one spot, as those complexities are about to jump with the new options.
0003 contains the statistics changes to pg_dump, adding the options -X / --statistics-only and the derivative boolean statisticsOnly. The -P option is already used by pg_restore, so instead I chose -X because its passing resemblance to Chi, as in the chi-square statistical test, makes it vaguely statistics-ish. If someone has a better letter, I'm listening.
With that change, people should be able to use pg_dump -X --table=foo to dump existing stats for a table and its dependent indexes, and then tweak those calls to do tuning work. Have fun with it. If this becomes a common use-case then it may make sense to get functions to fetch relation/attribute stats for a given relation, either as a formed SQL statement or as the parameter values.
Attachment
On Mon, 2024-05-06 at 23:43 -0400, Corey Huinker wrote: > > v21 attached. > > 0003 is the statistics changes to pg_dump, adding the options -X / -- > statistics-only, and the derivative boolean statisticsOnly. The -P > option is already used by pg_restore, so instead I chose -X because > of the passing resemblance to Chi as in the chi-square statistics > test makes it vaguely statistics-ish. If someone has a better letter, > I'm listening. > > With that change, people should be able to use pg_dump -X --table=foo > to dump existing stats for a table and its dependent indexes, and > then tweak those calls to do tuning work. Have fun with it. If this > becomes a common use-case then it may make sense to get functions to > fetch relation/attribute stats for a given relation, either as a > formed SQL statement or as the parameter values. Can you explain what you did with the SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why? Regards, Jeff Davis
Can you explain what you did with the
SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?
Per previous comments, it was suggested by others that:
- having them in SECTION_NONE was a grave mistake
- Everything that could belong in SECTION_DATA should, and the rest should be in SECTION_POST_DATA
Turning them into TOC objects was a multi-phase process.
1. the TOC entries are generated with dependencies (the parent pg_class object as well as the potential unique/pk constraint in the case of indexes), but no statements are generated (in case the stats are filtered out or the parent object is filtered out). This TOC entry must have everything we'll need to later generate the function calls. So far, that information is the parent name, parent schema, and relkind of the parent object.
2. The TOC entries get sorted by dependencies, and additional dependencies are added which enforce the PRE/DATA/POST boundaries. This is where knowing the parent object's relkind is required, as that determines the DATA/POST section.
3. Now the TOC entry is able to stand on its own, and generate the statements if they survive the dump/restore filters. Most of the later versions of the patch were efforts to get the objects to fall into the right PRE/DATA/POST sections, and the central bug was that the dependencies passed into ARCHIVE_OPTS were incorrect, as the dependent object passed in was now the new TOC object, not the parent TOC object. Once that was resolved, things fell into place.
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote: > > Per previous comments, it was suggested by others that: > > - having them in SECTION_NONE was a grave mistake > - Everything that could belong in SECTION_DATA should, and the rest > should be in SECTION_POST_DATA I don't understand the gravity of the choice here: what am I missing? To be clear: I'm not arguing against it, but I'd like to understand it better. Perhaps it has to do with the relationship between the sections and the dependencies? Regards, Jeff Davis
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
>
> Per previous comments, it was suggested by others that:
>
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA
I don't understand the gravity of the choice here: what am I missing?
To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?
I'm with you: I don't understand the choice and would like to, but at the same time it now works in the way others strongly suggested that it should, so I'm still curious about the why.
There were several people expressing interest in this patch at pgconf.dev, so I thought I'd post a rebase and give a summary of things to date.
THE INITIAL GOAL
The initial goal of this effort was to reduce upgrade downtimes by eliminating the need for the vacuumdb --analyze-in-stages call that is recommended (but often not done) after a pg_upgrade. The analyze-in-stages step is usually by far the longest part of a binary upgrade and is a significant part of a restore from dump, so eliminating this step will save users time, and eliminate or greatly reduce a potential pitfall to upgrade...and thus reduce upgrade friction (read: excuses to not upgrade).
THE FUNCTIONS
These patches introduce two functions, pg_set_relation_stats() and pg_set_attribute_stats(), which allow the caller to modify the statistics of any relation, provided that they own that relation or have maintainer privilege.
The function pg_set_relation_stats looks like this:
SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
150000::integer,
'relpages', 17::integer,
'reltuples', 400.0::real,
'relallvisible', 4::integer);
The function takes the oid of the relation to have stats imported, a version number (SERVER_VERSION_NUM) for the source of the statistics, and then a series of varargs organized as name-value pairs. Currently, three arg pairs (relpages, reltuples, and relallvisible) are required for the row to be set properly. If all three are not present, the function will issue a warning, and the row will not be updated.
The choice of varargs is a defensive one, basically ensuring that a pg_dump that includes statistics import calls will not fail on a future version that does not have one or more of these values. The call itself would fail to modify the relation row, but it wouldn't cause the whole restore to fail. I'm personally not against having a fixed-arg version of this function, nor am I against having both at the same time, with the varargs version basically teeing up the fixed-param call appropriate for the destination server version.
This function does an in-place update of the pg_class row to avoid bloating pg_class, just like ANALYZE does. This means that this function call is NON-transactional.
The function pg_set_attribute_stats looks like this:
SELECT pg_catalog.pg_set_attribute_stats(
'stats_export_import.test'::regclass,
'id'::name,
false::boolean,
150000::integer,
'null_frac', 0.5::real,
'avg_width', 2::integer,
'n_distinct', -0.1::real,
'most_common_vals', '{2,1,3}'::text,
'most_common_freqs', '{0.3,0.25,0.05}'::real[]
);
Like the first function, it takes a relation oid and a source server version though that is in the 4th position. It also takes the name of an attribute, and a boolean as to whether these stats are for inherited statistics (true) or regular (false). Again what follows is a vararg list of name-value pairs, each name corresponding to an attribute of pg_stats, and expecting a value appropriate for said attribute of pg_stats. Note that ANYARRAY values are passed in as text. This is done for a few reasons. First, if the attribute is an array type, then the most_common_elements value will be an array of that array type, and there is no way to represent that in SQL (it instead gives a higher order array of the same base type). Second, it allows us to import the values with a simple array_in() call. Last, it allows for situations where the type name changed from source system to destination (example: a schema-qualified extension type gets moved to core).
There are lots of ways that this function call can go wrong. An invalid attribute name, an invalid parameter name in a name-value pair, invalid data type of parameter being passed in the value of a name-value pair, or type coercion errors in array_in() to name just a few. All of these errors result in a warning and the import failing, but the function completes normally. Internal typecasting and array_in are all done with the _safe() equivalents, and any such errors are re-emitted as warnings. The central goal here is to not make a restore fail just because the statistics are wonky.
Calls to pg_set_attribute_stats() are transactional. This wouldn't warrant mentioning if not for pg_set_relation_stats() being non-transactional.
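To make that difference concrete, here is a minimal sketch using the example call above (behavior is as described in this message, nothing more):

BEGIN;
SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 17::integer,
                             'reltuples', 400.0::real,
                             'relallvisible', 4::integer);
ROLLBACK;
-- The in-place pg_class update survives the rollback:
SELECT relpages, reltuples, relallvisible
FROM pg_class
WHERE oid = 'stats_export_import.test'::regclass;
-- whereas pg_set_attribute_stats() changes made in the same transaction
-- would have been undone.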
DUMP / RESTORE / UPGRADE
The code for pg_dump/restore/upgrade has been modified to allow for statistics to be exported/imported by default. There are flags to prevent this (--no-statistics) and there are flags to ONLY do statistics (--statistics-only), the utility of which will be discussed later.
pg_dump will make queries of the source database, adjusting the syntax to reflect the version of the source system. There is very little variance in those queries, so it should be possible to query as far back as 9.2 and get usable stats. The output of these calls will be a series of SELECT statements, each one making a call to either pg_set_relation_stats (one per table/index/matview) or pg_set_attribute_stats (one per attribute that had a matching pg_statistic row).
The positioning of these calls in the restore sequence was originally set up as SECTION_NONE, but it was strongly suggested that SECTION_DATA / SECTION_POST_DATA was the right spot instead, and that's where they currently reside.
The end result will be that the new database now has stats identical (or at least close) to those of the source system. Those statistics might be good or bad, but they're almost certainly better than no stats at all. Even if they are bad, they will be overwritten by the next ANALYZE or autovacuum.
WHAT IS NOT DONE
1. Extended Statistics, which are considerably more complex than regular stats (stxdexprs is itself an array of pg_statistic rows) and thus more difficult to express in a simple function call. They are also used fairly rarely in customer installations, so leaving them out of the v1 patch seemed like an easy trade-off.
2. Any sort of validity checking beyond data types. This was initially provided, verifying that array values representing frequencies must be between 0.0 and 1.0, that arrays representing most common value frequencies must be in monotonically non-increasing order, etc., but these checks were rejected as being overly complex, potentially rejecting valid stats, and getting in the way of another use I hadn't considered.
3. Export functions. Strictly speaking we don't need them, but some use-cases described below may make the case for including them.
OTHER USES
Usage of these functions is not restricted to upgrade/restore situations. The most obvious use was to experiment with how the planner behaves when one or more tables grow and/or skew. It is difficult to create a table with 10 billion rows in it, but it's now trivial to create a table that says it has 10 billion rows in it.
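For example, using the signature shown earlier in this message (table and column names are hypothetical, numbers arbitrary):

SELECT pg_set_relation_stats('pretend_big'::regclass,
                             150000::integer,
                             'relpages', 125000000::integer,
                             'reltuples', 10000000000.0::real,
                             'relallvisible', 125000000::integer);
-- Then see how plans change for a table that now claims ten billion rows:
EXPLAIN SELECT * FROM pretend_big WHERE id = 42;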
This can be taken a step further, and in a way I had not anticipated - actively stress-testing the planner by inserting wildly incorrect and/or nonsensical stats. In that sense, these functions are a fuzzing tool that happens to make upgrades go faster.
FUTURE PLANS
Integration with postgres_fdw is an obvious next step, allowing an ANALYZE on a foreign table, instead of asking for a remote row sample, to simply export the stats of the remote table and import them into the foreign table.
Extended Statistics.
CURRENT PROGRESS
I believe that all outstanding questions/requests have been addressed, and the patch is now back to needing a review.
FOR YOUR CONSIDERATION
Rebase (current as of f04d1c1db01199f02b0914a7ca2962c531935717) attached.
Attachment
v23:
Split pg_set_relation_stats into two functions: pg_set_relation_stats with named parameters like it had around v19 and pg_restore_relation_stats with the variadic parameters it has had in more recent versions, which processes the variadic parameters and then makes a call to pg_set_relation_stats.
Split pg_set_attribute_stats into two functions: pg_set_attribute_stats with named parameters like it had around v19 and pg_restore_attribute_stats with the variadic parameters it has had in more recent versions, which processes the variadic parameters and then makes a call to pg_set_attribute_stats.
The intention here is that the named parameters signatures are easier for ad-hoc use, while the variadic signatures are evergreen and thus ideal for pg_dump/pg_upgrade.
Attachment
On Thu, 2024-07-18 at 02:09 -0400, Corey Huinker wrote: > v23: > > Split pg_set_relation_stats into two functions: pg_set_relation_stats > with named parameters like it had around v19 and > pg_restore_relations_stats with the variadic parameters it has had in > more recent versions, which processes the variadic parameters and > then makes a call to pg_set_relation_stats. > > Split pg_set_attribute_stats into two functions: > pg_set_attribute_stats with named parameters like it had around v19 > and pg_restore_attribute_stats with the variadic parameters it has > had in more recent versions, which processes the variadic parameters > and then makes a call to pg_set_attribute_stats. > > The intention here is that the named parameters signatures are easier > for ad-hoc use, while the variadic signatures are evergreen and thus > ideal for pg_dump/pg_upgrade. v23-0001: * I like the split for the reason you mention. I'm not 100% sure that we need both, but from the standpoint of reviewing, it makes things easier. We can always remove one at the last minute if its found to be unnecessary. I also like the names. * Doc build error and malformatting. * I'm not certain that we want all changes to relation stats to be non- transactional. Are there transactional use cases? Should it be an option? Should it be transactional for pg_set_relation_stats() but non- transactional for pg_restore_relation_stats()? * The documentation for the pg_set_attribute_stats() still refers to upgrade scenarios -- shouldn't that be in the pg_restore_attribute_stats() docs? I imagine the pg_set variant to be used for ad-hoc planner stuff rather than upgrades. * For the "WARNING: stat names must be of type text" I think we need an ERROR instead. The calling convention of name/value pairs is broken and we can't safely continue. * The huge list of "else if (strcmp(statname, mc_freqs_name) == 0) ..." seems wasteful and hard to read. I think we already discussed this, what was the reason we can't just use an array to map the arg name to an arg position type OID? * How much error checking did we decide is appropriate? Do we need to check that range_length_hist is always specified with range_empty_frac, or should we just call that the planner's problem if one is specified and the other not? Similarly, range stats for a non-range type. * I think most of the tests should be of pg_set_*_stats(). For pg_restore_, we just want to know that it's translating the name/value pairs reasonably well and throwing WARNINGs when appropriate. Then, for pg_dump tests, it should exercise pg_restore_*_stats() more completely. * It might help to clarify which arguments are important (like n_distinct) vs not. I assume the difference is that it's a non-NULLable column in pg_statistic. * Some arguments, like the relid, just seem absolutely required, and it's weird to just emit a WARNING and return false in that case. * To clarify: a return of "true" means all settings were successfully applied, whereas "false" means that some were applied and some were unrecognized, correct? Or does it also mean that some recognized options may not have been applied? * pg_set_attribute_stats(): why initialize the output tuple nulls array to false? It seems like initializing it to true would be safer. * please use a better name for "k" and add some error checking to make sure it doesn't overrun the available slots. 
* the pg_statistic tuple is always completely replaced, but the way you can call pg_set_attribute_stats() doesn't imply that -- calling pg_set_attribute_stats(..., most_common_vals => ..., most_common_freqs => ...) looks like it would just replace the most_common_vals+freqs and leave histogram_bounds as it was, but it actually clears histogram_bounds, right? Should we make that work or should we document better that it doesn't? Regards, Jeff Davis
* Doc build error and malformatting.
Looking into it.
* I'm not certain that we want all changes to relation stats to be non-
transactional. Are there transactional use cases? Should it be an
option? Should it be transactional for pg_set_relation_stats() but non-
transactional for pg_restore_relation_stats()?
It's non-transactional because that's how ANALYZE does it to avoid bloating pg_class. We _could_ do it transactionally, but on restore we'd immediately have a pg_class that was 50% bloat.
* The documentation for the pg_set_attribute_stats() still refers to
upgrade scenarios -- shouldn't that be in the
pg_restore_attribute_stats() docs? I imagine the pg_set variant to be
used for ad-hoc planner stuff rather than upgrades.
Noted.
* For the "WARNING: stat names must be of type text" I think we need an
ERROR instead. The calling convention of name/value pairs is broken and
we can't safely continue.
They can't be errors, because any one error fails the whole pg_upgrade.
* The huge list of "else if (strcmp(statname, mc_freqs_name) == 0) ..."
seems wasteful and hard to read. I think we already discussed this,
what was the reason we can't just use an array to map the arg name to
an arg position type OID?
That was my overreaction to the dislike that the P_argname enum got in previous reviews.
We'd need an array of structs like:
argname (ex. "mc_vals")
argtypeoid (one of: int, text, real, real[])
argtypename (the name we want to call the argtypeoid; integer, text, real, real[] about covers it)
argpos (position in the arg list of the corresponding pg_set_ function)
* How much error checking did we decide is appropriate? Do we need to
check that range_length_hist is always specified with range_empty_frac,
or should we just call that the planner's problem if one is specified
and the other not? Similarly, range stats for a non-range type.
* I think most of the tests should be of pg_set_*_stats(). For
pg_restore_, we just want to know that it's translating the name/value
pairs reasonably well and throwing WARNINGs when appropriate. Then, for
pg_dump tests, it should exercise pg_restore_*_stats() more completely.
* It might help to clarify which arguments are important (like
n_distinct) vs not. I assume the difference is that it's a non-NULLable
column in pg_statistic.
There are NOT NULL stats...now. They might not be in the future. Does that change your opinion?
* Some arguments, like the relid, just seem absolutely required, and
it's weird to just emit a WARNING and return false in that case.
Again, we can't fail. Any one failure breaks pg_upgrade.
* To clarify: a return of "true" means all settings were successfully
applied, whereas "false" means that some were applied and some were
unrecognized, correct? Or does it also mean that some recognized
options may not have been applied?
True means "at least some stats were applied." False means "nothing was modified."
* pg_set_attribute_stats(): why initialize the output tuple nulls array
to false? It seems like initializing it to true would be safer.
* please use a better name for "k" and add some error checking to make
sure it doesn't overrun the available slots.
* the pg_statistic tuple is always completely replaced, but the way you
can call pg_set_attribute_stats() doesn't imply that -- calling
pg_set_attribute_stats(..., most_common_vals => ..., most_common_freqs
=> ...) looks like it would just replace the most_common_vals+freqs and
leave histogram_bounds as it was, but it actually clears
histogram_bounds, right? Should we make that work or should we document
better that it doesn't?
v23's part one has been broken into three patches:
* pg_set_relation_stats
* pg_set_attribute_stats
* pg_restore_X_stats
And the two pg_dump-related patches remain unchanged.
I think this split is a net-positive for reviewability. The one drawback is that there's a lot of redundancy in the regression tests now, much of which can go away once we decide what other data problems we don't need to check.
Attachment
On Mon, 2024-07-22 at 12:05 -0400, Corey Huinker wrote: > Attached is v24, incorporating Jeff's feedback - looping an arg data > structure rather than individually checking each param type being the > biggest of them. > Thank you for splitting up the patches more finely. v24-0001: * pg_set_relation_stats(): the warning: "cannot export statistics prior to version 9.2" doesn't make sense because the function is for importing. Reword. * I really think there should be a transactional option, just another boolean, and if it has a default it should be true. This clearly has use cases for testing plans, etc., and often transactions will be the right thing there. This should be a trivial code change, and it will also be easier to document. * The return type is documented as 'void'? Please change to bool and be clear about what true/false returns really mean. I think false means "no updates happened at all, and a WARNING was printed indicating why" whereas true means "all updates were applied successfully". * An alternative would be to have an 'error_ok' parameter to say whether to issue WARNINGs or ERRORs. I think we already discussed that and agreed on the boolean return, but I just want to confirm that this was a conscious choice? * tests should be called stats_import.sql; there's no exporting going on * Aside from the above comments and some other cleanup, I think this is a simple patch and independently useful. I am looking to commit this one soon. v24-0002: * Documented return type is 'void' * I'm not totally sure what should be returned in the event that some updates were applied and some not. I'm inclined to say that true should mean that all updates were applied -- otherwise it's hard to automatically detect some kind of typo. * Can you describe your approach to error checking? What kinds of errors are worth checking, and which should we just put into the catalog and let the planner deal with? * I'd check stakindidx at the time that it's incremented rather than summing boolean values cast to integers. v24-0003: * I'm not convinced that we should continue when a stat name is not text. The argument for being lenient is that statistics may change over time, and we might have to ignore something that can't be imported from an old version into a new version because it's either gone or the meaning has changed too much. But that argument doesn't apply to a bogus call, where the name/value pairs get misaligned or something. Regards, Jeff Davis
* pg_set_relation_stats(): the warning: "cannot export statistics
prior to version 9.2" doesn't make sense because the function is for
importing. Reword.
+1
* I really think there should be a transactional option, just another
boolean, and if it has a default it should be true. This clearly has
use cases for testing plans, etc., and often transactions will be the
right thing there. This should be a trivial code change, and it will
also be easier to document.
For it to have a default, the parameter would have to be at the end of the list, and it's a parameter list that will grow in the future. And when that happens we have a jumbled parameter list, which is fine if we only ever call params by name, but I know some people won't do that. Which means it's up front right after `version`. Since `version` is already in there, and we can't default that, I feel ok about moving it there, but alas no default.
If there was some way that the function could detect that it was in a binary upgrade, then we could use that to determine if it should update inplace or transactionally.
* The return type is documented as 'void'? Please change to bool and
be clear about what true/false returns really mean. I think false means
"no updates happened at all, and a WARNING was printed indicating why"
whereas true means "all updates were applied successfully".
Good point, that's a holdover.
* An alternative would be to have an 'error_ok' parameter to say
whether to issue WARNINGs or ERRORs. I think we already discussed that
and agreed on the boolean return, but I just want to confirm that this
was a conscious choice?
That had been discussed as well. If we're adding parameters, then we could add one for that too. It's making the function call progressively more unwieldy, but anyone who chooses to wield these on a regular basis can certainly write a SQL wrapper function to reduce the function call to their presets, I suppose.
* tests should be called stats_import.sql; there's no exporting going
on
* Aside from the above comments and some other cleanup, I think this
is a simple patch and independently useful. I am looking to commit this
one soon.
v24-0002:
* Documented return type is 'void'
* I'm not totally sure what should be returned in the event that some
updates were applied and some not. I'm inclined to say that true should
mean that all updates were applied -- otherwise it's hard to
automatically detect some kind of typo.
I suppose we could return two integers: the number of stats input and the number of stats applied. But that could be confusing, as some parameter pairs form one stat (MCV, ELEM_MCV, etc.).
I suppose we could return a set of (param_name text, was_set boolean, applied boolean), without trying to organize them into their pairs, but that would get really verbose.
* Can you describe your approach to error checking? What kinds of
errors are worth checking, and which should we just put into the
catalog and let the planner deal with?
We let the planner have fun with other error-like things:
1. most-common-element arrays where the elements are not sorted per spec.
3. frequency histograms that have corresponding low bound and high bound values embedded in the array, and the other values in that array must be between the low-high.
* I'd check stakindidx at the time that it's incremented rather than
summing boolean values cast to integers.
v24-0003:
* I'm not convinced that we should continue when a stat name is not
text. The argument for being lenient is that statistics may change over
time, and we might have to ignore something that can't be imported from
an old version into a new version because it's either gone or the
meaning has changed too much. But that argument doesn't apply to a
bogus call, where the name/value pairs get misaligned or something.
Two functions:
pg_set_relation_stats(
out schemaname name,
out relname name,
out row_written boolean,
out params_rejected text[],
kwargs any[]) RETURNS RECORD
and
pg_set_attribute_stats(
out schemaname name,
out relname name,
out inherited bool,
out row_written boolean,
out params_accepted text[],
out params_rejected text[],
kwargs any[]) RETURNS RECORD
The leading OUT parameters tell us the rel/attribute/inh affected (if any), and which params had to be rejected for whatever reason. The kwargs is the variadic key-value pairs that we were using for all stat functions, but now we will be using it for all parameters, both statistics and control; the control parameters will be:
relation - the oid of the relation
attname - the attribute name (does not apply for relstats)
inherited - true/false for attribute stats, defaults to false, does not apply for relstats
warnings - boolean; if supplied AND set to true, then all ERRORs that can be stepped down to WARNINGs will be. This is "binary upgrade mode".
version - the numeric version (a la PG_VERSION_NUM) of the statistics given. If NULL or omitted, assume the current PG_VERSION_NUM of the server.
This allows casual users to set only the params they want for their needs, and get proper errors, while pg_upgrade can set
'warnings', 'true', 'version', 120034
and get the upgrade behavior we need.
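To illustrate, a generated call under this proposal might look something like the following (purely a sketch of the idea in this message, not of any posted patch):

SELECT *
FROM pg_set_attribute_stats(
    'relation', 'stats_import.test'::regclass::oid,
    'attname', 'id'::name,
    'inherited', false::boolean,
    'warnings', true::boolean,
    'version', 120034::integer,
    'null_frac', 0.5::real,
    'avg_width', 4::integer,
    'n_distinct', -0.4::real);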
On Tue, 2024-07-23 at 17:48 -0400, Corey Huinker wrote: > Two functions: I see that you moved back to a combination function to serve both the "restore" use case as well as the "ad-hoc stats hacking" use case. The "restore" use case is the primary point of your patch, and that should be as simple and future-proof as possible. The parameters should be name/value pairs and there shouldn't be any "control" parameters -- it's not the job of pg_dump to specify whether the restore should be transactional or in-place, it should just output the necessary stats. That restore function might be good enough to satisfy the "ad-hoc stats hacking" use case as well, but I suspect we want slightly different behavior. Specifically, I think we'd want the updates to be transactional rather than in-place, or at least optional. > The leading OUT parameters tell us the rel/attribute/inh affected (if > any), and which params had to be rejected for whatever reason. The > kwargs is the variadic key-value pairs that we were using for all > stat functions, but now we will be using it for all parameters, both > statistics and control, the control parameters will be: I don't like the idea of mixing statistics and control parameters in the same list. I do like the idea of returning a set, but I think it should be the positive set (effectively a representation of what is now in the pg_stats view) and any ignored settings would be output as WARNINGs. Regards, Jeff Davis
The "restore" use case is the primary point of your patch, and that
should be as simple and future-proof as possible. The parameters should
be name/value pairs and there shouldn't be any "control" parameters --
it's not the job of pg_dump to specify whether the restore should be
transactional or in-place, it should just output the necessary stats.
That restore function might be good enough to satisfy the "ad-hoc stats
hacking" use case as well, but I suspect we want slightly different
behavior. Specifically, I think we'd want the updates to be
transactional rather than in-place, or at least optional.
Point well taken.
Both function pairs now call a generic internal function.
Which is to say that pg_set_relation_stats and pg_restore_relation_stats both accept parameters in their own way, and both call
an internal function relation_statistics_update(), each with their own defaults.
pg_set_relation_stats always leaves "version" NULL, does transactional updates, and treats any data quality issue as an ERROR. This is is in line with a person manually tweaking stats to check against a query to see if the plan changes.
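As a sketch of that manual-tweaking workflow (parameter names follow the examples elsewhere in this thread; the exact signature has shifted between patch versions):

BEGIN;
SELECT pg_catalog.pg_set_relation_stats(
    relation => 'stats_import.test'::regclass::oid,
    relpages => 17::integer,
    reltuples => 400.0::real,
    relallvisible => 4::integer);
EXPLAIN SELECT * FROM stats_import.test WHERE id = 42;
ROLLBACK;  -- transactional, so the experiment leaves the catalogs unchanged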
> The leading OUT parameters tell us the rel/attribute/inh affected (if
> any), and which params had to be rejected for whatever reason. The
> kwargs is the variadic key-value pairs that we were using for all
> stat functions, but now we will be using it for all parameters, both
> statistics and control, the control parameters will be:
I don't like the idea of mixing statistics and control parameters in
the same list.
There's no way around it, but at least now we need never worry about a confusing order for the parameters in the _restore_ functions, because they can now be in any order you like. But that speaks to another point: there is no "you" in using the restore functions; those function calls will almost exclusively be generated by pg_dump, and we can all live rich and productive lives never having seen one written down. I kid, but they're actually not that gross.
Here is a -set function taken from the regression tests:
SELECT pg_catalog.pg_set_attribute_stats(
relation => 'stats_import.test'::regclass::oid,
attname => 'arange'::name,
inherited => false::boolean,
null_frac => 0.5::real,
avg_width => 2::integer,
n_distinct => -0.1::real,
range_empty_frac => 0.5::real,
range_length_histogram => '{399,499,Infinity}'::text
);
pg_set_attribute_stats
------------------------
(1 row)
and here is a restore function
-- warning: mcv cast failure
SELECT *
FROM pg_catalog.pg_restore_attribute_stats(
'relation', 'stats_import.test'::regclass::oid,
'attname', 'id'::name,
'inherited', false::boolean,
'version', 150000::integer,
'null_frac', 0.5::real,
'avg_width', 2::integer,
'n_distinct', -0.4::real,
'most_common_vals', '{2,four,3}'::text,
'most_common_freqs', '{0.3,0.25,0.05}'::real[]
);
WARNING: invalid input syntax for type integer: "four"
row_written | stats_applied | stats_rejected | params_rejected
-------------+----------------------------------+--------------------------------------+-----------------
t | {null_frac,avg_width,n_distinct} | {most_common_vals,most_common_freqs} |
(1 row)
There's a few things going on here:
I do like the idea of returning a set, but I think it should be the
positive set (effectively a representation of what is now in the
pg_stats view) and any ignored settings would be output as WARNINGs.
Displaying the actual stats in pg_stats could get very, very big. So I wouldn't recommend that.
What do you think of the example presented earlier?
Attached is v25.
Key changes:
- Each set/restore function pair now each call a common function that does the heavy lifting, and the callers mostly marshall parameters into the right spot and form the result set (really just one row).
- The restore functions now have all parameters passed in via a variadic any[].
- test cases simplified a bit. There's still a lot of them, and I think that's a good thing.
- Documentation to reflect significant reorganization.
- pg_dump modified to generate new function signatures.
Attachment
On Sat, 2024-07-27 at 21:08 -0400, Corey Huinker wrote: > > > I don't like the idea of mixing statistics and control parameters > > in > > the same list. > > > > > There's no way around it, at least now we need never worry about a > confusing order for the parameters in the _restore_ functions because > they can now be in any order you like. Perhaps I was not precise enough when I said "control" parameters. Mainly what I was worried about is trying to take parameters that control things like transaction behavior (in-place vs mvcc), and pg_dump should not be specifying that kind of thing. A parameter like "version" is specified by pg_dump anyway, so it's probably fine the way you've done it. > SELECT pg_catalog.pg_set_attribute_stats( > relation => 'stats_import.test'::regclass::oid, > attname => 'arange'::name, > inherited => false::boolean, > null_frac => 0.5::real, > avg_width => 2::integer, > n_distinct => -0.1::real, > range_empty_frac => 0.5::real, > range_length_histogram => '{399,499,Infinity}'::text > ); > pg_set_attribute_stats > ------------------------ > > (1 row) I like it. > and here is a restore function > > -- warning: mcv cast failure > SELECT * > FROM pg_catalog.pg_restore_attribute_stats( > 'relation', 'stats_import.test'::regclass::oid, > 'attname', 'id'::name, > 'inherited', false::boolean, > 'version', 150000::integer, > 'null_frac', 0.5::real, > 'avg_width', 2::integer, > 'n_distinct', -0.4::real, > 'most_common_vals', '{2,four,3}'::text, > 'most_common_freqs', '{0.3,0.25,0.05}'::real[] > ); > WARNING: invalid input syntax for type integer: "four" > row_written | stats_applied | > stats_rejected | params_rejected > -------------+----------------------------------+-------------------- > ------------------+----------------- > t | {null_frac,avg_width,n_distinct} | > {most_common_vals,most_common_freqs} | > (1 row) I think I like this, as well, except for the return value, which seems like too much information and a bit over-engineered. Can we simplify it to what's actually going to be used by pg_upgrade and other tools? > Attached is v25. I believe 0001 and 0002 are in good shape API-wise, and I can start getting those committed. I will try to clean up the code in the process. Regards, Jeff Davis
> WARNING: invalid input syntax for type integer: "four"
> row_written | stats_applied | stats_rejected | params_rejected
> -------------+----------------------------------+--------------------------------------+-----------------
> t | {null_frac,avg_width,n_distinct} | {most_common_vals,most_common_freqs} |
> (1 row)
I think I like this, as well, except for the return value, which seems
like too much information and a bit over-engineered. Can we simplify it
to what's actually going to be used by pg_upgrade and other tools?
We could do other things. It seems a shame to just throw away this information when it could potentially be used in the future.
> Attached is v25.
I believe 0001 and 0002 are in good shape API-wise, and I can start
getting those committed. I will try to clean up the code in the
process.
On Sat, 2024-07-27 at 21:08 -0400, Corey Huinker wrote: > > Attached is v25. I attached new versions of 0001 and 0002. Still working on them, so these aren't final. v25j-0001: * There seems to be confusion between the relation for which we are updating the stats, and pg_class. Permissions and ShareUpdateExclusive should be taken on the former, not the latter. For consistency with vac_update_relstats(), RowExclusiveLock should be fine on pg_class. * Lots of unnecessary #includes were removed. * I refactored substantially to do basic checks in the SQL function pg_set_relation_stats() and make calling the internal function easier. Similar refactoring might not work for pg_set_attribute_stats(), but that's OK. * You don't need to declare the SQL function signatures. They're autogenerated from pg_proc.dat into fmgrprotos.h. * I removed the inplace stuff for this patch because there's no coverage for it and it can be easily added back in 0003. * I renamed the file to import_stats.c. Annoying to rebase, I know, but better now than later. v25j-0002: * I just did some minor cleanup on the #includes and rebased it. I still need to look in more detail. Regards, Jeff Davis
Attachment
function attribute_statistics_update() is significantly shorter. (Thank
you for a good set of tests, by the way, which sped up the refactoring
process.)
yw
* Remind me why the new stats completely replace the new row, rather
than updating only the statistic kinds that are specified?
because:
- complexity
- we'd have to figure out how to reorder the remaining stakinds, or spend effort finding a matching stakind in the existing row to know to replace it
- "do what analyze does" was an initial goal and as a result many test cases directly compared pg_statistic rows from an original table to an empty clone table to see if the "copy" had fidelity.
* I'm not sure what the type_is_scalar() function was doing before,
but I just removed it. If it can't find the element type, then it skips
over the kinds that require it.
That may be sufficient.
* I introduced some hard errors. These happen when it can't find the
table, or the attribute, or doesn't have permissions. I don't see any
reason to demote those to a WARNING. Even for the restore case,
analogous errors happen for COPY, etc.
I can accept that reasoning.
* I'm still sorting through some of the type info derivations. I
think we need better explanations about why it's doing exactly the
things it's doing, e.g. for tsvector and multiranges.
On Sat, Aug 24, 2024 at 4:50 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > I have attached version 28j as one giant patch covering what was > previously 0001-0003. It's a bit rough (tests in particular need some > work), but it implelements the logic to replace only those values > specified rather than the whole tuple. > hi. I did some review for v28j git am shows some whitespace error. +extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS); +extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS); is unnecessary? + <entry role="func_table_entry"> + <para role="func_signature"> + <indexterm> + <primary>pg_set_relation_stats</primary> + </indexterm> + <function>pg_set_relation_stats</function> ( + <parameter>relation</parameter> <type>regclass</type> + <optional>, <parameter>relpages</parameter> <type>integer</type></optional> + <optional>, <parameter>reltuples</parameter> <type>real</type></optional> + <optional>, <parameter>relallvisible</parameter> <type>integer</type></optional> ) + <returnvalue>boolean</returnvalue> + </para> + <para> + Updates table-level statistics for the given relation to the + specified values. The parameters correspond to columns in <link + linkend="catalog-pg-class"><structname>pg_class</structname></link>. Unspecified + or <literal>NULL</literal> values leave the setting + unchanged. Returns <literal>true</literal> if a change was made; + <literal>false</literal> otherwise. + </para> are these <optional> flags wrong? there is only one function currently: pg_set_relation_stats(relation regclass, relpages integer, reltuples real, relallvisible integer) i think you want pg_set_relation_stats(relation regclass, relpages integer default null, reltuples real default null, relallvisible integer default null) we can add following in src/backend/catalog/system_functions.sql: select * from pg_set_relation_stats('emp'::regclass); CREATE OR REPLACE FUNCTION pg_set_relation_stats( relation regclass, relpages integer default null, reltuples real default null, relallvisible integer default null) RETURNS bool LANGUAGE INTERNAL CALLED ON NULL INPUT VOLATILE AS 'pg_set_relation_stats'; typedef enum ... need to add src/tools/pgindent/typedefs.list +/* + * Check that array argument is one dimensional with no NULLs. + * + * If not, emit at elevel, and set argument to NULL in fcinfo. + */ +static void +check_arg_array(FunctionCallInfo fcinfo, struct arginfo *arginfo, + int argnum, int elevel) +{ + ArrayType *arr; + + if (PG_ARGISNULL(argnum)) + return; + + arr = DatumGetArrayTypeP(PG_GETARG_DATUM(argnum)); + + if (ARR_NDIM(arr) != 1) + { + ereport(elevel, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("\"%s\" cannot be a multidimensional array", + arginfo[argnum].argname))); + fcinfo->args[argnum].isnull = true; + } + + if (array_contains_nulls(arr)) + { + ereport(elevel, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("\"%s\" array cannot contain NULL values", + arginfo[argnum].argname))); + fcinfo->args[argnum].isnull = true; + } +} this part elevel should always be ERROR? if so, we can just + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), relation_statistics_update and other functions may need to check relkind? since relpages, reltuples, relallvisible not meaning to all of relkind?
I have attached version 28j as one giant patch covering what was
previously 0001-0003. It's a bit rough (tests in particular need some
work), but it implements the logic to replace only those values
specified rather than the whole tuple.
At least for the interactive "set" variants of the functions, I think
it's an improvement. It feels more natural to just change one stat
without wiping out all the others. I realize a lot of the statistics
depend on each other, but the point is not to replace ANALYZE, the
point is to experiment with planner scenarios. What do others think?
For the "restore" variants, I'm not sure it matters a lot because the
stats will already be empty. If it does matter, we could pretty easily
define the "restore" variants to wipe out existing stats when loading
the table, though I'm not sure if that's a good thing or not.
I agree, and I'm leaning towards doing the clear, because "restore" to me implies that what resides there exactly matches what was in the function call, regardless of what might have been there before. But you're also right: "restore" is expected to be used on default/missing stats, and the restore_* call generated is supposed to cover all stats that were there at the time of dump/upgrade, so the impact would be minimal.
I also made more use of FunctionCallInfo structures to communicate
between functions rather than huge parameter lists. I believe that
reduced the line count substantially, and made it easier to transform
the argument pairs in the "restore" variants into the positional
arguments for the "set" variants.
The code mentions that more explanation should be given for the special cases (tsvector, etc.) and that explanation is basically "this code follows what the corresponding custom typanalyze() function does". In the future, it may make sense to have custom typimport() functions for datatypes that have a custom typanalyze(), which would solve the issue of handling custom stakinds.
I'll continue to work on this.
p.s. dropping invalid email address from the thread
git am shows some whitespace error.
Jeff indicated that this was more of a stylistic/clarity reworking. I'll be handling it again for now.
+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
is unnecessary?
They're autogenerated from pg_proc.dat. I was (pleasantly) surprised too.
this part elevel should always be ERROR?
if so, we can just
I'm personally disinclined to error on any of these things, so I'll be leaving it as is. I suspect that the proper balance lies somewhere between all-ERROR and all-WARNING, but time will tell where.
relation_statistics_update and other functions
may need to check relkind?
since relpages, reltuples, relallvisible not meaning to all of relkind?
I'm not able to understand either of your questions; can you elaborate on them?
On Tue, Sep 17, 2024 at 5:03 PM Corey Huinker <corey.huinker@gmail.com> wrote: >> >> 1. make sure these three functions: 'pg_set_relation_stats', >> 'pg_restore_relation_stats','pg_clear_relation_stats' proisstrict to true. >> because in >> pg_class catalog, these three attributes (relpages, reltuples, relallvisible) is >> marked as not null. updating it to null will violate these constraints. >> tom also mention this at [ > > Things have changed a bit since then, and the purpose of the functions has changed, so the considerations are now different.The function signature could change in the future as new pg_class stats are added, and it might not still be strict. > if you add more arguments to relation_statistics_update, but the first 3 arguments (relpages, reltuples, relallvisible) still not null. and, we are unlikely to add 3 or more (nullable=null) arguments? we have code like: if (!PG_ARGISNULL(RELPAGES_ARG)) { values[ncols] = Int32GetDatum(relpages); ncols++; } if (!PG_ARGISNULL(RELTUPLES_ARG)) { replaces[ncols] = Anum_pg_class_reltuples; values[ncols] = Float4GetDatum(reltuples); } if (!PG_ARGISNULL(RELALLVISIBLE_ARG)) { values[ncols] = Int32GetDatum(relallvisible); ncols++; } newtup = heap_modify_tuple_by_cols(ctup, tupdesc, ncols, replaces, nulls); you just directly declared "bool nulls[3] = {false, false, false};" if any of (RELPAGES_ARG, RELTUPLES_ARG, RELALLVISIBLE_ARG) is null, should you set that null[position] to true? otherwise, i am confused with the variable nulls. Looking at other usage of heap_modify_tuple_by_cols, "ncols" cannot be dynamic, it should be a fixed value? The current implementation works, because the (bool[3] nulls) is always false, never changed. if nulls becomes {false, false, true} then "ncols" must be 3, cannot be 2. >> >> 8. lock_check_privileges function issue. >> ------------------------------------------------ >> --asume there is a superuser jian >> create role alice NOSUPERUSER LOGIN; >> create role bob NOSUPERUSER LOGIN; >> create role carol NOSUPERUSER LOGIN; >> alter database test owner to alice >> GRANT CONNECT, CREATE on database test to bob; >> \c test bob >> create schema one; >> create table one.t(a int); >> create table one.t1(a int); >> set session AUTHORIZATION; --switch to superuser. >> alter table one.t1 owner to carol; >> \c test alice >> --now current database owner alice cannot do ANYTHING WITH table one.t1, >> like ANALYZE, SELECT, INSERT, MAINTAIN etc. > > > Interesting. > database owners do not necessarily have schema USAGE privilege. -------------<<<>>>------------------ create role alice NOSUPERUSER LOGIN; create role bob NOSUPERUSER LOGIN; create database test; alter database test owner to alice; GRANT CONNECT, CREATE on database test to bob; \c test bob create schema one; create table one.t(a int); \c test alice analyze one.t; with cte as ( select oid as the_t from pg_class where relname = any('{t}') and relnamespace = 'one'::regnamespace) SELECT pg_catalog.pg_set_relation_stats( relation => the_t, relpages => 17::integer, reltuples => 400.0::real, relallvisible => 4::integer) from cte; In the above case, alice cannot do "analyze one.t;", but can do pg_set_relation_stats, which seems not ok? 
-------------<<<>>>------------------ src/include/statistics/stats_utils.h comment * Portions Copyright (c) 1994, Regents of the University of California * * src/include/statistics/statistics.h should be "src/include/statistics/stats_utils.h" comment src/backend/statistics/stats_utils.c * IDENTIFICATION * src/backend/statistics/stats_privs.c should be * IDENTIFICATION * src/backend/statistics/stats_utils.c
On Mon, Sep 23, 2024 at 8:57 AM jian he <jian.universality@gmail.com> wrote: > > database owners do not necessarily have schema USAGE privilege. > -------------<<<>>>------------------ > create role alice NOSUPERUSER LOGIN; > create role bob NOSUPERUSER LOGIN; > create database test; > alter database test owner to alice; > GRANT CONNECT, CREATE on database test to bob; > \c test bob > create schema one; > create table one.t(a int); > \c test alice > > analyze one.t; > > with cte as ( > select oid as the_t > from pg_class > where relname = any('{t}') and relnamespace = 'one'::regnamespace) > SELECT > pg_catalog.pg_set_relation_stats( > relation => the_t, > relpages => 17::integer, > reltuples => 400.0::real, > relallvisible => 4::integer) > from cte; > > > In the above case, alice cannot do "analyze one.t;", > but can do pg_set_relation_stats, which seems not ok? sorry for the noise. what you stats_lock_check_privileges about privilege is right. database owner cannot do "ANALYZE one.t;" but it can do "ANALYZE;" to indirect analyzing one.t which seems to be the expected behavior per https://www.postgresql.org/docs/17/sql-analyze.html << To analyze a table, one must ordinarily have the MAINTAIN privilege on the table. However, database owners are allowed to analyze all tables in their databases, except shared catalogs. <<
I took a look at v29-0006. On Tue, Sep 17, 2024 at 05:02:49AM -0400, Corey Huinker wrote: > From: Corey Huinker <corey.huinker@gmail.com> > Date: Sat, 4 May 2024 04:52:38 -0400 > Subject: [PATCH v29 6/7] Add derivative flags dumpSchema, dumpData. > > User-set flags --schema-only and --data-only are often consulted by > various operations to determine if they should be skipped or not. While > this logic works when there are only two mutually-exclusive -only > options, it will get progressively more confusing when more are added. After glancing at v29-0007, I see what you mean. > In anticipation of this, create the flags dumpSchema and dumpData which > are derivative of the existing options schemaOnly and dataOnly. This > allows us to restate current skip-this-section tests in terms of what is > enabled, rather than checking if the other -only mode is turned off. This seems like a reasonable refactoring exercise that we could take care of before the rest of the patch set goes in. I added one new reference to dopt.schemaOnly in commit bd15b7d, so that should probably be revised to !dumpData, too. I also noticed a few references to dataOnly/schemaOnly in comments that should likely be adjusted. One other question I had when looking at this patch is whether we could remove dataOnly/schemaOnly from DumpOptions and RestoreOptions. Once 0007 is applied, those variables become particularly hazardous, so we really want to prevent folks from using them in new code. -- nathan
This seems like a reasonable refactoring exercise that we could take care
of before the rest of the patch set goes in. I added one new reference to
dopt.schemaOnly in commit bd15b7d, so that should probably be revised to
!dumpData, too. I also noticed a few references to dataOnly/schemaOnly in
comments that should likely be adjusted.
I'll be on the lookout for the new usage with the next rebase, and will fix the comments as well.
One other question I had when looking at this patch is whether we could
remove dataOnly/schemaOnly from DumpOptions and RestoreOptions. Once 0007
is applied, those variables become particularly hazardous, so we really
want to prevent folks from using them in new code.
Well, the very next patch in the series adds --statistics-only, so I don't think we're getting rid of user-facing command switches. However, I could see us taking away the dataOnly/schemaOnly internal variables, thus preventing coders from playing with those sharp objects.
On Thu, Oct 10, 2024 at 03:49:16PM -0400, Corey Huinker wrote: >> One other question I had when looking at this patch is whether we could >> remove dataOnly/schemaOnly from DumpOptions and RestoreOptions. Once 0007 >> is applied, those variables become particularly hazardous, so we really >> want to prevent folks from using them in new code. > > Well, the very next patch in the series adds --statistics-only, so I don't > think we're getting rid of user-facing command switches. However, I could > see us taking away the dataOnly/schemaOnly internal variables, thus > preventing coders from playing with those sharp objects. That's what I meant. The user-facing options would stay the same, but the internal variables would be local to main() so that other functions would be forced to use dumpData, dumpSchema, etc. -- nathan
On Mon, 2024-09-23 at 08:57 +0800, jian he wrote: > newtup = heap_modify_tuple_by_cols(ctup, tupdesc, ncols, > replaces, nulls); > > you just directly declared "bool nulls[3] = {false, false, > false};" Those must be false (not NULL), because in pg_class those are non-NULL attributes. They must be set to something whenever we update. > if any of (RELPAGES_ARG, RELTUPLES_ARG, RELALLVISIBLE_ARG) > is null, should you set that null[position] to true? If the corresponding SQL argument is NULL, we leave the existing value unchanged, we don't set it to NULL. > otherwise, i am confused with the variable nulls. > > Looking at other usage of heap_modify_tuple_by_cols, "ncols" cannot > be > dynamic, it should be a fixed value? > The current implementation works, because the (bool[3] nulls) is > always false, never changed. > if nulls becomes {false, false, true} then "ncols" must be 3, cannot > be 2. heap_modify_tuple_by_cols() uses ncols to specify the length of the values/isnull arrays. The "replaces" is an array of attribute numbers to replace (in contrast to plain heap_modify_tuple(), which uses an array of booleans). We are going to replace a maximum of 3 attributes, so the arrays have a maximum size of 3. Predeclaring the arrays to be 3 elements is just fine even if we only use the first 1-2 elements -- it avoids a needless heap allocation/free. Regards, Jeff Davis
However, this function seems to accept -1 for the relpages parameter. Below is an example of execution:
---
postgres=> CREATE TABLE data1(c1 INT PRIMARY KEY, c2 VARCHAR(10));
CREATE TABLE
postgres=> SELECT pg_set_relation_stats('data1', relpages=>-1);
pg_set_relation_stats
-----------------------
t
(1 row)
postgres=> SELECT relname, relpages FROM pg_class WHERE relname='data1';
relname | relpages
---------+----------
data1 | -1
(1 row)
---
The attached patch modifies the pg_set_relation_stats function to work as described in the manual.
Regards,
Noriyoshi Shinoda
Accepting -1 is correct. I thought I had fixed that in a recent patch. Perhaps signals got crossed somewhere along the way.
It seems that partitioned tables have a relpages of -1, so regression tests involving tables alpha_neg and alpha_pos (and 35 others, all seemingly partitioned) fail. So it was the docs that were wrong, not the code.
e839c8ecc9352b7754e74f19ace013c0c0d18613 doesn't include the stuff that modified pg_dump/pg_upgrade, so it wouldn't have turned up this problem.
On Mon, 2024-10-14 at 21:46 -0400, Corey Huinker wrote: > It seems that partitioned tables have a relpages of -1 Oh, I see. It appears that there's a special case for partitioned tables that sets relpages=-1 in do_analyze_rel() around line 680. It's a bit inconsistent, though, because even partitioned indexes have relpages=0. Furthermore, the parameter is of type BlockNumber, so InvalidBlockNumber would make more sense. Not the cleanest code, but if the value exists, we need to be able to import it. Regards, Jeff Davis
Oh, I see. It appears that there's a special case for partitioned
tables that sets relpages=-1 in do_analyze_rel() around line 680. It's
a bit inconsistent, though, because even partitioned indexes have
relpages=0. Furthermore, the parameter is of type BlockNumber, so
InvalidBlockNumber would make more sense.
Not the cleanest code, but if the value exists, we need to be able to
import it.
Thanks for tracking that down. I'll have a patch ready shortly.
Code fix with a comment on why nobody expects a relpages of -1. Test case to demonstrate that relpages = -1 can happen, and updated docs to reflect the new lower bound.
Additional fixes, now in a patch-set:
1. Allow relpages to be set to -1 (partitioned tables with partitions have this value after ANALYZE); see the sketch after this list.
2. Turn off autovacuum on tables (where possible) if they are going to be the target of pg_set_relation_stats().
3. Allow pg_set_relation_stats to continue past an out-of-range detection on one attribute, rather than immediately returning false.
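To make item 1 concrete, here is a minimal sketch (table names are hypothetical, and it assumes the patched lower bound of -1 and the pg_set_relation_stats signature shown earlier in the thread) of the case the regression failures exposed: a partitioned parent that is analyzed after gaining a partition carries relpages = -1, and restoring that value should now be accepted rather than rejected.

-- a partitioned parent analyzed after gaining a partition ends up with relpages = -1
CREATE TABLE sketch_parent (x integer) PARTITION BY RANGE (x);
CREATE TABLE sketch_child PARTITION OF sketch_parent FOR VALUES FROM (0) TO (100);
ANALYZE sketch_parent;

-- with the fix, putting that same value back is allowed
SELECT pg_set_relation_stats('sketch_parent'::regclass, relpages => -1);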
Initially the function (and there was just one) returned void, but it had a bool return added around the time it split into relation/attribute stats.
Returning a boolean seems like good instrumentation and a helper for allowing other tooling to use the functions. However, it's rather limited in what it can convey.
Currently, a return of true means "a record was written", and false means that a record was not written. Cases where a record was not written for pg_set_relation_stats amount to the following:
#2 can be confusing, because false may look like a failure, but it means "the pg_class values were already set to what you wanted".
An alternate use of the boolean, suggested by Jeff, was the following:
2. Return false if any field that was specified was NOT set, even if the other ones were.
#2 is also confusing in that the user has received a false value, but the operation did modify the record, just not as fully as the caller had hoped.
These show that a boolean isn't really up to conveying the nuances of potential outcomes. Complex return types have met with considerable resistance, enumerations are similarly undesirable, no other scalar value seems up to the task, and while an INFO or LOG message could convey considerable complexity, it wouldn't be readily handled programmatically. This re-raises the question of whether the pg_set_*_stats functions should return anything at all.
Any feedback on what users would expect from these functions in terms of return value is appreciated. Bear in mind that these functions will NOT be integrated into pg_upgrade/pg_dump, as that functionality will be handled by functions that are less user friendly but more flexible and forgiving of bad data. We're talking purely about functions meant for tweaking stats to look for changes in planner behavior.
On Thu, 2024-10-17 at 20:54 -0400, Corey Huinker wrote: > There is some uncertainty on what, if anything, should be returned by > pg_set_relation_stats() and pg_set_attribute_stats(). ... > This re-raises the question of whether the pg_set_*_stats functions > should return anything at all. What is the benefit of a return value from the pg_set_*_stats variants? As far as I can tell, there is none because they throw an ERROR if anything goes wrong, so they should just return VOID. What am I missing? The return value is more interesting for pg_restore_relation_stats() and pg_restore_attribute_stats(), which will be used by pg_dump and which are designed to keep going on non-fatal errors. Isn't that what this discussion should be about? Magnus, you previously commented that you'd like some visibility for tooling: https://www.postgresql.org/message-id/CABUevEz1gLOkWSh_Vd9LQh-JM4i%3DMu7PVT9ffc77TmH0Zh3TzA%40mail.gmail.com Is a return value what you had in mind? Or some other function that can help find missing stats later, or something else entirely? Regards, Jeff Davis
What is the benefit of a return value from the pg_set_*_stats variants?
As far as I can tell, there is none because they throw an ERROR if
anything goes wrong, so they should just return VOID. What am I
missing?
The return value is more interesting for pg_restore_relation_stats()
and pg_restore_attribute_stats(), which will be used by pg_dump and
which are designed to keep going on non-fatal errors. Isn't that what
this discussion should be about?
Patch that allows relation_statistics_update to continue after one failed stat (0001) attached, along with bool->void change (0002).
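As a rough illustration of what 0001 changes (the table is hypothetical, the exact warning wording is an assumption rather than the patch's actual message, and it assumes relallvisible must be >= 0): a single out-of-range value should draw a warning for that one stat while the remaining values are still applied.

CREATE TABLE stats_demo (id int);

-- relallvisible is deliberately out of range; with 0001 the call warns about
-- that one value but still applies relpages and reltuples
SELECT pg_set_relation_stats(
    relation      => 'stats_demo'::regclass,
    relpages      => 10::integer,
    reltuples     => 1000.0::real,
    relallvisible => -5::integer);
-- WARNING: relallvisible cannot be < 0  (wording assumed)

SELECT relpages, reltuples FROM pg_class WHERE oid = 'stats_demo'::regclass;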
If the relpages option contains -1 only for partitioned tables, shouldn't pg_set_relation_stats restrict the values that can be specified by table type? The attached patch limits the value to -1 or more if the target is a partitioned table, and 0 or more otherwise.
Changing relpages to -1 on a non-partitioned table seems to significantly change the execution plan.
Short answer: It's working as intended. Significantly changing the execution plan in weird ways is part of the intention of the function, even if the execution plan changes for the worse.
Enforcing -1 on only partitioned tables is tricky, as it seems to be a value for any table that has no local storage. So foreign data wrapper tables could, in theory, also have this value. More importantly, the -1 value seems to be situational; in my experience it only happens on partitioned tables after they have their first partition added, which means that the currently valid stat range is set according to facts that can change. Like so:
chuinker=# select version();
version
------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 16.4 (Postgres.app) on aarch64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.102), 64-bit
(1 row)
chuinker=# create table part_parent (x integer) partition by range (x);
CREATE TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
0
(1 row)
chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
0
(1 row)
chuinker=# create table part_child partition of part_parent for values from (0) TO (100);
CREATE TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
0
(1 row)
chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
-1
(1 row)
chuinker=# drop table part_child;
DROP TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
-1
(1 row)
chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
relpages
----------
-1
(1 row)
Prior versions (March 2024 and earlier) of this patch and the pg_set_attribute_stats patch did have many checks to prevent importing stat values that were "wrong" in some way. Some examples from attribute stats import were:
* Histograms that were not monotonically nondecreasing.
* Frequency values that were out of bounds specified by other values in the array.
All of these checks were removed based on feedback from reviewers and committers who saw the pg_set_*_stats() functions as a fuzzing tool, so the ability to set illogical, wildly implausible, or mathematically impossible values was a feature, not a bug. I would suspect that they would view your demonstration that setting impossible values on a table as proof that the function can be used to experiment with planner scenarios. So, while I previously would have eagerly accepted this patch as another valued validation check, such checks don't fit with the new intention of the functions. Still, I greatly appreciate your helping us discover ways in which we can use this tool to make the planner do odd things.
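In that spirit, here is a short sketch of the kind of planner experiment the functions are meant to enable (the table and values are hypothetical; the call follows the pg_set_relation_stats signature shown earlier in the thread):

CREATE TABLE probe (id int PRIMARY KEY, payload text);
INSERT INTO probe SELECT g, repeat('x', 20) FROM generate_series(1, 1000) g;
ANALYZE probe;

EXPLAIN SELECT * FROM probe WHERE id < 500;   -- baseline plan for a small table

-- pretend the table is roughly 1000x larger without loading any rows
SELECT pg_set_relation_stats(
    relation      => 'probe'::regclass,
    relpages      => 10000::integer,
    reltuples     => 1000000.0::real,
    relallvisible => 10000::integer);

EXPLAIN SELECT * FROM probe WHERE id < 500;   -- the plan may now change, e.g. favor the index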
> I've taken most of Jeff's work, reincorporated it into roughly the > same patch structure as before, and am posting it now. I committed 0001-0004 with significant revision. Regards, Jeff Davis
On Tue, 2024-10-22 at 23:58 +0000, Shinoda, Noriyoshi (SXD Japan FSIP) wrote: > Thanks for developing good features. I tried the patch that was > committed right away. > It seems that the implementation and documentation differ on the > return value of the pg_clear_attribute_stats function. > The attached small patch fixes the documentation. Thank you again, fixed. Regards, Jeff Davis
On Tue, 2024-09-17 at 05:02 -0400, Corey Huinker wrote:
>
> I've taken most of Jeff's work, reincorporated it into roughly the
> same patch structure as before, and am posting it now.

I have committed the import side of this patch series; that is, the
function calls that can load stats into an existing cluster without the
need to ANALYZE.

The pg_restore_*_stats() functions are designed such that pg_dump can
emit the calls. Some design choices of the functions worth noting:

(a) a variadic signature of name/value pairs rather than ordinary SQL
arguments, which makes it easier for future versions to interpret what
has been output from a previous version; and

(b) many kinds of errors are demoted to WARNINGs, to allow some
statistics to be set for an attribute even if other statistics are
malformed (also a future-proofing design); and

(c) we are considering whether to use an in-place heap update for the
relation stats, so that a large restore doesn't bloat pg_class -- I'd
like feedback on this idea

The pg_set_*_stats() functions are designed for interactive use, such as
tweaking statistics for planner testing, experimentation, or reproducing
a plan outside of a production system. The aforementioned design choices
don't make a lot of sense in this context, so that's why the
pg_set_*_stats() functions are separate from the pg_restore_*_stats()
functions. But there's a lot of overlap, so it may be worth discussing
again whether we should only have one set of functions.

Regards,
	Jeff Davis
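For readers following along, the name/value-pair style of (a) looks roughly like the sketch below. The key names are assumed to mirror the pg_set_relation_stats parameters plus a 'version' key; consult the committed documentation for the exact set.

-- restore-style call: everything is passed as name/value pairs so that output
-- taken from an older server can still be interpreted by a newer one
SELECT pg_restore_relation_stats(
    'relation',      'some_table'::regclass,
    'version',       170000,
    'relpages',      42::integer,
    'reltuples',     10000.0::real,
    'relallvisible', 40::integer);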
Hello Jeff and Corey,

26.10.2024 01:18, Jeff Davis wrote:
> On Tue, 2024-09-17 at 05:02 -0400, Corey Huinker wrote:
>> I've taken most of Jeff's work, reincorporated it into roughly the
>> same patch structure as before, and am posting it now.
> I have committed the import side of this patch series; that is, the
> function calls that can load stats into an existing cluster without the
> need to ANALYZE.
>
> The pg_restore_*_stats() functions are designed such that pg_dump can
> emit the calls. Some design choices of the functions worth noting:

Please look at the following seemingly atypical behavior of the new functions:

CREATE TABLE test(id int);
SELECT pg_restore_attribute_stats(
  'relation', 'test'::regclass,
  'attname', 'id'::name,
  'inherited', false);

SELECT pg_restore_attribute_stats(
  'relation', 'test'::regclass,
  'attname', 'id'::name,
  'inherited', false
) FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_update, heapam.c:4353

Or:
SELECT pg_clear_attribute_stats('test'::regclass, 'id'::name, false)
FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_delete, heapam.c:3108

Best regards,
Alexander
(c) we are considering whether to use an in-place heap update for the
relation stats, so that a large restore doesn't bloat pg_class -- I'd
like feedback on this idea
I'd also like feedback, though I feel very strongly that we should do what ANALYZE does. In an upgrade situation, nearly all tables will have stats imported, which would result in an immediate doubling of pg_class - not the end of the world, but not great either.
Given the recent bugs associated with inplace updates and race conditions, if we don't want to do in-place here, we should also consider getting rid of it for ANALYZE. I briefly pondered if it would make sense to vertically partition pg_class into the stable attributes and the attributes that get modified in-place, but that list is pretty long: relpages, reltuples, relallvisible, relhasindex, reltoastrelid, relhasrules, relhastriggers, relfrozenxid, and relminmxid.
If we don't want to do inplace updates in pg_restore_relation_stats(), then we could mitigate the bloat with a VACUUM FULL pg_class at the tail end of the upgrade if stats were enabled.
pg_restore_*_stats() functions. But there's a lot of overlap, so it may
be worth discussing again whether we should only have one set of
functions.
On 23/10/2024 01:27, Jeff Davis wrote:
>> I've taken most of Jeff's work, reincorporated it into roughly the
>> same patch structure as before, and am posting it now.
>
> I committed 0001-0004 with significant revision.

This just caught my eye:

postgres=# select pg_set_attribute_stats('foo', 'xmin', false, 1);
 pg_set_attribute_stats
------------------------

(1 row)

We should probably not allow that, because you cannot ANALYZE system
columns:

postgres=# analyze foo (xmin);
ERROR:  column "xmin" of relation "foo" does not exist

--
Heikki Linnakangas
Neon (https://neon.tech)
We should probably not allow that, because you cannot ANALYZE system
columns:
On Fri, Nov 8, 2024 at 01:25:21PM -0500, Corey Huinker wrote: > WHAT IS NOT DONE - EXTENDED STATISTICS > > It is a general consensus in the community that "nobody uses extended > statistics", though I've had difficulty getting actual figures to back this > up, even from my own employer. Surveying several vendors at PgConf.EU, the > highest estimate was that at most 1% of their customers used extended > statistics, though more probably should. This reinforces my belief that a > feature that would eliminate a major pain point in upgrades for 99% of > customers shouldn't be held back by the fact that the other 1% only have a > reduced hassle. > > However, having relation and attribute statistics carry over on major > version upgrades presents a slight problem: running vacuumdb > --analyze-in-stages after such an upgrade is completely unnecessary for > those without extended statistics, and would actually result in _worse_ > statistics for the database until the last stage is complete. Granted, > we've had great difficulty getting users to know that vacuumdb is a thing > that should be run, but word has slowly spread through our own > documentation and those "This one simple trick will make your postgres go > fast post-upgrade" blog posts. Those posts will continue to lurk in search > results long after this feature goes into release, and it would be a rude > surprise to users to find out that the extra work they put in to learn > about a feature that helped their upgrade in 17 was suddenly detrimental > (albeit temporarily) in 18. We should never punish people for only being a > little-bit current in their knowledge. Moreover, this surprise would > persist even after we add extended statistics import function > functionality. > > I presented this problem to several people at PgConf.EU, and the consensus > least-bad solution was that vacuumdb should filter out tables that are not > missing any statistics when using options --analyze, --analyze-only, and > --analyze-in-stages, with an additional flag for now called --force-analyze > to restore the un-filtered functionality. This gives the outcome tree: > > 1. Users who do not have extended statistics and do not use (or not even > know about) vacuumdb will be blissfully unaware, and will get better > post-upgrade performance. > 2. Users who do not have extended statistics but use vacuumdb > --analyze-in-stages will be pleasantly surprised that the vacuumdb run is > almost a no-op, and completes quickly. Those who are surprised by this and > re-run vacuumdb --analyze-in-stages will get another no-op. > 3. Users who have extended statistics and use vacuumdb --analyze-in-stages > will get a quicker vacuumdb run, as only the tables with extended stats > will pass the filter. Subsequent re-runs of vacuumdb --analyze-in-stages > would be the no-op. > 4. Users who have extended statistics and don't use vacuumdb will still get > better performance than they would have without any stats imported. > > In case anyone is curious, I'm defining "missing stats" as a table/matview > with any of the following: > > 1. A table with an attribute that lacks a corresponding pg_statistic row. > 2. A table with an index with an expression attribute that lacks a > corresponding pg_statistic row (non-expression attributes just borrow the > pg_statistic row from the table's attribute). > 3. A table with at least one extended statistic that does not have a > corresponding pg_statistic_ext_data row. 
> > Note that none of these criteria are concerned with the substance of the > statistics (ex. pg_statistic row should have mcv stats but does not), > merely their row-existence. > > Some rejected alternative solutions were: > > 1. Adding a new option --analyze-missing-stats. While simple, few people > would learn about it, knowledge of it would be drowned out by the > aforementioned sea of existing blog posts. > 2. Adding --analyze-missing-stats and making --analyze-in-stages fail with > an error message educating the user about --analyze-missing-stats. Users > might not see the error, existing tooling wouldn't be able to act on the > error, and there are legitimate non-upgrade uses of --analyze-in-stages. > > MAIN CONCERN GOING FORWARD > > This change to vacuumdb will require some reworking of the > vacuum_one_database() function so that the list of tables analyzed is > preserved across the stages, as subsequent stages runs won't be able to > detect which tables were previously missing stats. You seem to be optimizing for people using pg_upgrade, and for people upgrading to PG 18, without adequately considering people using vacuumdb in non-pg_upgrade situations, and people using PG 19+. Let me explain. First, I see little concern here for how people who use --analyze and --analyze-only independent of pg_upgrade will be affected by this. While I recommend people decrease vacuum and analyze threshold during non-peak periods: https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017 some people might just regenerate all statistics during non-peak periods using these options. You can perhaps argue that --analyze-in-stages would only be used by pg_upgrade so maybe that can be adjusted more easily. Second, the API for what --analyze and --analyze-only do will be very confusing for people running, e.g., PG 20, because the average user reading the option name will not guess it only adds missing statistics. I think you need to rethink your approach and just accept that a mention of the new preserving statistic behavior of pg_upgrade, and the new vacuumdb API required, will be sufficient. In summary, I think you need a new --compute-missing-statistics-only that can be combined with --analyze, --analyze-only, and --analyze-in-stages to compute only missing statistics, and document it in the PG 18 release notes. Frankly, we have a similar problem with partitioned tables: https://www.postgresql.org/docs/current/sql-analyze.html For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions; in addition, it will recurse into each partition and update its statistics. Each leaf partition is analyzed only once, even with multi-level partitioning. No statistics are collected for only the parent table (without data from its partitions), because with partitioning it's guaranteed to be empty. --> The autovacuum daemon does not process partitioned tables, nor does it process inheritance parents if only the children are ever modified. It is usually necessary to periodically run a manual ANALYZE to keep the statistics of the table hierarchy up to date. Now, you can say partitioned table statistics are not as important as extended statistics, but that fact remains that we have these two odd cases where special work must be done to generate statistics. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Mon, Nov 18, 2024 at 08:06:24PM +0000, Wetmore, Matthew (CTR) wrote: > Sorry to chime in with a dumb question: > > How would/could this effect tables that have the vacuum and analyze > scale_factors different from the rest of db via the ALTE RTABLE statement? > > (I do this a lot) > > ALTER TABLE your_schema.your_table SET > (autovacuum_enabled,autovacuum_analyze_scale_factor,autovacuum_vacuum_scale_factor); > > Just wanted to mention I don't think it would affect it since those control autovacuum, and we are talking about manual vacuum/analyze. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
> How would/could this effect tables that have the vacuum and analyze
> scale_factors different from the rest of db via the ALTE RTABLE statement?
>
> (I do this a lot)
I don't think it would affect it since those control autovacuum, and we
are talking about manual vacuum/analyze.
On Mon, Nov 18, 2024 at 08:29:10PM -0500, Corey Huinker wrote: > On Mon, Nov 18, 2024 at 2:47 PM Bruce Momjian <bruce@momjian.us> wrote: > You seem to be optimizing for people using pg_upgrade, and for people > upgrading to PG 18, without adequately considering people using vacuumdb > in non-pg_upgrade situations, and people using PG 19+. Let me explain. > > This was a concern as I was polling people. > > A person using vacuumdb in a non-upgrade situation is, to my limited > imagination, one of three types: > > 1. A person who views vacuumdb as a worthwhile janitorial task for downtimes. > 2. A person who wants stats on a lot of recently created tables. > 3. A person who wants better stats on a lot of recently (re)populated tables. > > The first group would not be using --analyze-in-stages or --analyze-only, > because the vacuuming is a big part of it. They will be unaffected. > > The second group will be pleasantly surprised to learn that they no longer need > to specify a subset of tables, as any table missing stats will get picked up. > > The third group would be surprised that their operation completed so quickly, > check the docs, add in --force-analyze to their script, and re-run. We can't design an API around who is going to be surprised. We have to look at what the options say, what people would expect it to do, and what it does. The reason "surprise" doesn't work in the long run is that while PG 18 users might be surprised, PG 20 users will be confused. > First, I see little concern here for how people who use --analyze and > --analyze-only independent of pg_upgrade will be affected by this. > While I recommend people decrease vacuum and analyze threshold during > non-peak periods: > > https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017 > > some people might just regenerate all statistics during non-peak periods > using these options. You can perhaps argue that --analyze-in-stages > would only be used by pg_upgrade so maybe that can be adjusted more > easily. > > I, personally, would be fine if this only modified --analyze-in-stages, as it > already carries the warning: Right, but I think we would need to rename the option to clarify what it does, e.g. --analyze-missing-in-stages. If they use --analyze-in-stages, they will get an error, and will then need to reference the docs to see the new option wording, or we can suggest the new option in the error message. > But others felt that --analyze-only should be in the mix as well. Again, with those other people not saying so in this thread, I can't really comment on it --- I can only tell you what I have seen and others are going to have to explain why they want such dramatic changes. > No one advocated for changing the behavior of options that involve actual > vacuuming. > > > Second, the API for what --analyze and --analyze-only do will be very > confusing for people running, e.g., PG 20, because the average user > reading the option name will not guess it only adds missing statistics. > > I think you need to rethink your approach and just accept that a mention > of the new preserving statistic behavior of pg_upgrade, and the new > vacuumdb API required, will be sufficient. In summary, I think you need > a new --compute-missing-statistics-only that can be combined with > --analyze, --analyze-only, and --analyze-in-stages to compute only > missing statistics, and document it in the PG 18 release notes. 
> > > A --missing-only/--analyze-missing-in-stages option was my first idea, and it's > definitely cleaner, but as I stated in the rejected ideas section above, when I > reached out to others at PgConf.EU there was pretty general consensus that few > people would actually read our documentation, and the few that have in the past > are unlikely to read it again to discover the new option, and those people > would have a negative impact of using --analyze-in-stages, effectively > punishing them for having once read the documentation (or a blog post) but not > re-read it prior to upgrade. Again, you can't justify such changes based on discussions that are not posted publicly here. > So, to add non-pg_upgrade users to the outcome tree in my email from > 2024-11-04: > > > 5. Users who use vacuumdb in a non-upgrade situation and do not use either > --analyze-in-stages or --analyze-only will be completely unaffected. > 6. Users who use vacuumdb in a non-upgrade situation with either > --analyze-in-stages or --analyze-only set will find that the operation > skips tables that already have stats, and will have to add --force-analyze > to restore previous behavior. > > > That's not a great surprise for group 6, but I have to believe that group is > smaller than group 5, and it's definitely smaller than the group of users that > need to upgrade. Again, a clean API is the goal, not surprise calculus. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Mon, Nov 18, 2024 at 08:42:35PM -0500, Bruce Momjian wrote: > On Mon, Nov 18, 2024 at 08:29:10PM -0500, Corey Huinker wrote: > > That's not a great surprise for group 6, but I have to believe that group is > > smaller than group 5, and it's definitely smaller than the group of users that > > need to upgrade. > > Again, a clean API is the goal, not surprise calculus. Maybe I was too harsh. "Surprise calculus" is fine, but only after we have an API that will be clearly understood by new users. We have to assume that in the long run new users will use this API more than existing users. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Mon, Nov 18, 2024 at 08:42:35PM -0500, Bruce Momjian wrote: > We can't design an API around who is going to be surprised. We have to > look at what the options say, what people would expect it to do, and > what it does. The reason "surprise" doesn't work in the long run is > that while PG 18 users might be surprised, PG 20 users will be confused. I think Bruce makes good points. I'd add that even if we did nothing at all for vacuumdb, folks who continued to use it wouldn't benefit from the new changes, but they also shouldn't be harmed by it, either. >> I, personally, would be fine if this only modified --analyze-in-stages, as it >> already carries the warning: > > Right, but I think we would need to rename the option to clarify what it > does, e.g. --analyze-missing-in-stages. If they use > --analyze-in-stages, they will get an error, and will then need to > reference the docs to see the new option wording, or we can suggest the > new option in the error message. > >> But others felt that --analyze-only should be in the mix as well. > > Again, with those other people not saying so in this thread, I can't > really comment on it --- I can only tell you what I have seen and others > are going to have to explain why they want such dramatic changes. I don't have a strong opinion here, but I suspect that if I was creating vacuumdb from scratch, I'd have suggested a --missing-only flag that would only work for --analyze-only/--analyze-in-stages. That way, folks can still regenerate statistics if they want, but we also have an answer for folks who use pg_upgrade and have extended statistics. -- nathan
I don't have a strong opinion here, but I suspect that if I was creating
vacuumdb from scratch, I'd have suggested a --missing-only flag that would
only work for --analyze-only/--analyze-in-stages. That way, folks can
still regenerate statistics if they want, but we also have an answer for
folks who use pg_upgrade and have extended statistics.
I agree that a clean API is desirable and a goal. And as I stated before, a new flag (--analyze-missing-in-stages / --analyze-post-pgupgrade, etc) or a flag modifier ( --missing-only ) was my first choice.
But if we're going to go that route, we have a messaging problem. We need to reach our customers who plan to upgrade and explain to them that the underlying assumption behind running vacuumdb has gone away for 99% of them (and may be 100% in the next version). For that 99%, running vacuumdb in the old way now actively undoes one of the major improvements to pg_upgrade, while this one additional option keeps the benefits of the new pg_upgrade without the drawbacks.
That, and once we have extended statistics importing on upgrade, then the need for vacuumdb post-upgrade goes away entirely. So we'll have to re-message the users with that news too.
I'd be in favor of this, but I have to be honest, our messaging reach is not good, and takes years to sink in. Years in which the message will change at least one more time. And this outreach will likely confuse users who already weren't (and now shouldn't be) using vacuumdb. In light of that, the big risk was that an action that some users learned to do years ago was now actively undoing whatever gains they were supposed to get in their upgrade downtime, and that downtime is money to them, hence the surprise calculus.
Some other possibilities we could consider:
* create a pg_stats_health_check script that lists tables missing stats, with --fix/--fix-in-stages options, effectively replacing vacuumdb for those purposes, and then crank up the messaging about that change. The "new shiny" effect of a new utility that has "stats", "health", and "check" in the name may be the search/click-bait we need to get the word out effectively. That last sentence may sound facetious, but it isn't; it's just accepting how search engines and eyeballs currently function. With that in place, we can then change the vacuumdb documentation to deter future use in post-upgrade situations. (A rough sketch of the kind of check such a script might run follows this list.)
* move missing-stats rebuilds into pg_upgrade/pg_restore itself, and this would give us the simpler one-time message that users should stop using vacuumdb in upgrade situations.
* Making a concerted push to get extended stats import into v18 despite the high-effort/low-reward nature of it, and then we can go with the simple messaging of "Remember vacuumdb, that thing you probably weren't running post-upgrade but should have been? Now you can stop using it!". I had extended stats imports working back when the function took JSON input, so it's do-able, but the difficulty lies in how to represent an array of incomplete pg_statistic rows in a serial fashion that is cross-version compatible.
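As referenced in the first bullet, here is a rough sketch of the kind of check such a script (or a vacuumdb filter) might run. It covers only the simplest of the three "missing stats" criteria described earlier, ignores expression indexes and extended statistics, and will also flag tables that have simply never been analyzed.

-- tables or matviews with at least one analyzable column that has no pg_statistic row
SELECT DISTINCT c.oid::regclass AS needs_stats
FROM pg_class c
JOIN pg_attribute a
  ON a.attrelid = c.oid AND a.attnum > 0 AND NOT a.attisdropped
LEFT JOIN pg_statistic s
  ON s.starelid = c.oid AND s.staattnum = a.attnum
WHERE c.relkind IN ('r', 'm')
  AND c.relnamespace NOT IN ('pg_catalog'::regnamespace,
                             'information_schema'::regnamespace)
  AND s.starelid IS NULL;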
On Tue, Nov 19, 2024 at 03:47:20PM -0500, Corey Huinker wrote: > I don't have a strong opinion here, but I suspect that if I was creating > vacuumdb from scratch, I'd have suggested a --missing-only flag that would > only work for --analyze-only/--analyze-in-stages. That way, folks can > still regenerate statistics if they want, but we also have an answer for > folks who use pg_upgrade and have extended statistics. > > > (combining responses to Bruce's para about surprise calculus and Nathan here) > > I agree that a clean API is desirable and a goal. And as I stated before, a new > flag (--analyze-missing-in-stages / --analyze-post-pgupgrade, etc) or a flag > modifier ( --missing-only ) was my first choice. Yes, after a clean API is designed, you can then consider surprise calculus. This is an issue not only for this feature, but for all Postgres changes we consider, which is why I think it is worth stating this clearly. If I am thinking incorrectly, we can discuss that here too. > But if we're going to go that route, we have a messaging problem. We need to > reach our customers who plan to upgrade, and explain to them that the > underlying assumption behind running vacuumdb has gone away for 99% of them, > and that may be 100% in the next version, but for that 99% running vacuumdb in > the old way now actively undoes one of the major improvements to pg_upgrade, > but this one additional option keeps the benefits of the new pg_upgrade without > the drawbacks. How much are we supposed to consider users who do not read the major release notes? I realize we might be unrealistic to expect that from the majority of our users, but I also don't want to contort our API to adjust for them. > That, and once we have extended statistics importing on upgrade, then the need > for vacuumdb post-upgrade goes away entirely. So we'll have to re-message the > users with that news too. > > I'd be in favor of this, but I have to be honest, our messaging reach is not > good, and takes years to sink in. Years in which the message will change at > least one more time. And this outreach will likely confuse users who already > weren't (and now shouldn't be) using vacuumdb. In light of that, the big risk > was that an action that some users learned to do years ago was now actively > undoing whatever gains they were supposed to get in their upgrade downtime, and > that downtime is money to them, hence the surprise calculus. That is a big purpose of the major release notes. We can even list this as an incompatibility in the sense that the procedure has changed. > One other possibilities we could consider: > > * create a pg_stats_health_check script that lists tables missing stats, with > --fix/--fix-in-stages options, effectively replacing vacuumdb for those > purposes, and then crank up the messaging about that change. The "new shiny" > effect of a new utility that has "stats", "health", and "check" in the name may > be the search/click-bait we need to get the word out effectively. That last > sentence may sound facetious, but it isn't, it's just accepting how search > engines and eyeballs currently function. With that in place, we can then change > the vacuumdb documentation to be deter future use in post-upgrade situations. We used to create a script until the functionality was added to vacuumdb. 
Since 99% of users will not need to do anything after pg_upgrade, it would make sense to output the script only for the 1% of users who need it and tell users to run it, rather than giving instructions that are a no-op for 99% of users. > * move missing-stats rebuilds into pg_upgrade/pg_restore itself, and this would > give us the simpler one-time message that users should stop using vacuumdb in > upgrade situations. Uh, that would make pg_upgrade take longer for some users, which might be confusing. > * Making a concerted push to get extended stats import into v18 despite the > high-effort/low-reward nature of it, and then we can go with the simple > messaging of "Remember vacuumdb, that thing you probably weren't running > post-upgrade but should have been? Now you can stop using it!". I had extended > stats imports working back when the function took JSON input, so it's do-able, > but the difficulty lies in how to represent an array of incomplete pg_statistic > rows in a serial fashion that is cross-version compatible. I am not a big fan of that at this point. If we get it, we can adjust our API at that time, but I don't want to plan on it. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
On Tue, Nov 19, 2024 at 05:40:20PM -0500, Bruce Momjian wrote: > On Tue, Nov 19, 2024 at 03:47:20PM -0500, Corey Huinker wrote: > > * create a pg_stats_health_check script that lists tables missing stats, with > > --fix/--fix-in-stages options, effectively replacing vacuumdb for those > > purposes, and then crank up the messaging about that change. The "new shiny" > > effect of a new utility that has "stats", "health", and "check" in the name may > > be the search/click-bait we need to get the word out effectively. That last > > sentence may sound facetious, but it isn't, it's just accepting how search > > engines and eyeballs currently function. With that in place, we can then change > > the vacuumdb documentation to be deter future use in post-upgrade situations. > > We used to create a script until the functionality was added to > vacuumdb. Since 99% of users will not need to do anything after > pg_upgrade, it would make sense to output the script only for the 1% of > users who need it and tell users to run it, rather than giving > instructions that are a no-op for 99% of users. One problem with the above approach is that it gives users upgrading or loading via pg_dump no way to know which tables need analyze statistics, right? I think that is why we ended up putting the pg_upgrade statistics functionality in vacuumdb --analyze-in-stages. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"