Thread: Statistics Import and Export

Statistics Import and Export

From
Corey Huinker
Date:
pg_stats_export is a view that aggregates pg_statistic data by relation
oid and stores all of the column statistical data in a system-independent (i.e.
no oids, collation information removed, all MCV values rendered as text)
jsonb format, along with the relation's relname, reltuples, and relpages
from pg_class, as well as the schemaname from pg_namespace.

pg_import_rel_stats is a function which takes a relation oid,
server_version_num, num_tuples, num_pages, and a column_stats jsonb in
a format matching that of pg_stats_export, and applies that data to
the specified pg_class and pg_statistics rows for the relation
specified.

The most common use-case for such a function is in upgrades and
dump/restore, wherein the upgrade process would capture the output of
pg_stats_export into a regular table, perform the upgrade, and then
join that data to the existing pg_class rows, updating statistics to be
a close approximation of what they were just prior to the upgrade. The
hope is that these statistics are better than the early stages of
--analyze-in-stages and can be applied faster, thus reducing system
downtime.
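
For example, a minimal sketch of that capture-and-restore cycle might look like the following. The snapshot table and the exact column names (the version and jsonb columns in particular) are illustrative only, since the view definition is still subject to change:

  -- on the old cluster, before the upgrade
  CREATE TABLE stats_snapshot AS SELECT * FROM pg_stats_export;

  -- on the new cluster, after the upgrade and after restoring stats_snapshot
  SELECT pg_import_rel_stats(
             format('%I.%I', s.schemaname, s.relname)::regclass::oid,
             s.server_version_num,
             s.reltuples,
             s.relpages,
             s.stats)   -- the per-column jsonb captured from pg_stats_export
    FROM stats_snapshot s;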

The values applied to pg_class are done in-place, which is to say
non-transactionally. The values applied to pg_statistic are applied
transactionally, as if an ANALYZE operation were reading from a
cheat-sheet.

This function and view will need to be followed up with corresponding
ones for pg_statistic_ext and pg_statistic_ext_data, and while we would
likely never backport the import functions, we can have user programs
do the same work as the export views such that statistics can be brought
forward from versions as far back as there is jsonb to store it.

While the primary purpose of the import function(s) is to reduce downtime
during an upgrade, it is not hard to see that they could also be used to
facilitate tuning and development operations, asking questions like "how might
this query plan change if this table has 1000x rows in it?", without actually
putting those rows into the table.
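
As a rough sketch of that kind of experiment (again, the jsonb column name is illustrative), one could feed a table's own exported statistics back in with an inflated row count:

  SELECT pg_import_rel_stats(
             't1'::regclass::oid,
             current_setting('server_version_num')::int,
             c.reltuples::bigint * 1000,   -- pretend the table is 1000x larger
             c.relpages * 1000,
             e.stats)
    FROM pg_class c, pg_stats_export e
   WHERE c.oid = 't1'::regclass
     AND e.schemaname = 'public' AND e.relname = 't1';
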
Attachment

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2023 at 12:17 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> While the primary purpose of the import function(s) are to reduce downtime
> during an upgrade, it is not hard to see that they could also be used to
> facilitate tuning and development operations, asking questions like "how might
> this query plan change if this table has 1000x rows in it?", without actually
> putting those rows into the table.

Thanks. I think this may be used with postgres_fdw to import
statistics directly from the foreign server, whenever possible,
rather than fetching the rows and building it locally. If it's known
that the stats on foreign and local servers match for a foreign table,
we will be one step closer to accurately estimating the cost of a
foreign plan locally rather than through EXPLAIN.

--
Best Wishes,
Ashutosh Bapat



Re: Statistics Import and Export

From
Corey Huinker
Date:

Thanks. I think this may be used with postgres_fdw to import
statistics directly from the foreigns server, whenever possible,
rather than fetching the rows and building it locally. If it's known
that the stats on foreign and local servers match for a foreign table,
we will be one step closer to accurately estimating the cost of a
foreign plan locally rather than through EXPLAIN.


Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate query for several versions back - they change, not by much, but they do change. So now we'd have each query text in three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the best way to share those in the codebase.

 

Re: Statistics Import and Export

From
Corey Huinker
Date:

Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate query for several versions back - they change, not by much, but they do change. So now we'd have each query text in three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the best way to share those in the codebase.


Attached is v2 of this patch.

New features:
* imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does, which is to say it derives those stats entirely from table column stats, which are imported, so in that sense we're getting index stats without touching the heap.
* now supports extended statistics except for MCV, which is currently serialized as a difficult-to-decompose bytea field.
* bare-bones CLI script pg_export_stats, which extracts stats on databases back to v12 (tested) and could work back to v10.
* bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from older versions.
 
Attachment

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 10/31/23 08:25, Corey Huinker wrote:
>
> Attached is v2 of this patch.
> 
> New features:
> * imports index statistics. This is not strictly accurate: it 
> re-computes index statistics the same as ANALYZE does, which is to
> say it derives those stats entirely from table column stats, which
> are imported, so in that sense we're getting index stats without
> touching the heap.

Maybe I just don't understand, but I'm pretty sure ANALYZE does not
derive index stats from column stats. It actually builds them from the
row sample.

> * now support extended statistics except for MCV, which is currently 
> serialized as an difficult-to-decompose bytea field.

Doesn't pg_mcv_list_items() already do all the heavy work?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Corey Huinker
Date:


Maybe I just don't understand, but I'm pretty sure ANALYZE does not
derive index stats from column stats. It actually builds them from the
row sample.

That is correct, my error.
 

> * now support extended statistics except for MCV, which is currently
> serialized as an difficult-to-decompose bytea field.

Doesn't pg_mcv_list_items() already do all the heavy work?

Thanks! I'll look into that.

The comment below in mcv.c made me think there was no easy way to get output.

/*
 * pg_mcv_list_out      - output routine for type pg_mcv_list.
 *
 * MCV lists are serialized into a bytea value, so we simply call byteaout()
 * to serialize the value into text. But it'd be nice to serialize that into
 * a meaningful representation (e.g. for inspection by people).
 *
 * XXX This should probably return something meaningful, similar to what
 * pg_dependencies_out does. Not sure how to deal with the deduplicated
 * values, though - do we want to expand that or not?
 */

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 11/2/23 06:01, Corey Huinker wrote:
> 
> 
>     Maybe I just don't understand, but I'm pretty sure ANALYZE does not
>     derive index stats from column stats. It actually builds them from the
>     row sample.
> 
> 
> That is correct, my error.
>  
> 
> 
>     > * now support extended statistics except for MCV, which is currently
>     > serialized as an difficult-to-decompose bytea field.
> 
>     Doesn't pg_mcv_list_items() already do all the heavy work?
> 
> 
> Thanks! I'll look into that.
> 
> The comment below in mcv.c made me think there was no easy way to get
> output.
> 
> /*
>  * pg_mcv_list_out      - output routine for type pg_mcv_list.
>  *
>  * MCV lists are serialized into a bytea value, so we simply call byteaout()
>  * to serialize the value into text. But it'd be nice to serialize that into
>  * a meaningful representation (e.g. for inspection by people).
>  *
>  * XXX This should probably return something meaningful, similar to what
>  * pg_dependencies_out does. Not sure how to deal with the deduplicated
>  * values, though - do we want to expand that or not?
>  */
> 

Yeah, that was the simplest output function possible, it didn't seem
worth it to implement something more advanced. pg_mcv_list_items() is
more convenient for most needs, but it's quite far from the on-disk
representation.

That's actually a good question - how closely should the exported data
be to the on-disk format? I'd say we should keep it abstract, not tied
to the details of the on-disk format (which might easily change between
versions).

I'm a bit confused about the JSON schema used in pg_statistic_export
view, though. It simply serializes stakinds, stavalues, stanumbers into
arrays ... which works, but why not to use the JSON nesting? I mean,
there could be a nested document for histogram, MCV, ... with just the
correct fields.

  {
    ...
    histogram : { stavalues: [...] },
    mcv : { stavalues: [...], stanumbers: [...] },
    ...
  }

and so on. Also, what does TRIVIAL stand for?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Shubham Khanna
Date:
On Mon, Nov 6, 2023 at 4:16 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>>
>> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate
>> query for several versions back - they change, not by much, but they do change. So now we'd have each query text in
>> three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the
>> best way to share those in the codebase.
>>
>
> Attached is v2 of this patch.

While applying the patch, I noticed a few indentation issues:
1) D:\Project\Postgres>git am v2-0003-Add-pg_import_rel_stats.patch
.git/rebase-apply/patch:1265: space before tab in indent.
                                        errmsg("invalid statistics
format, stxndeprs must be array or null");
.git/rebase-apply/patch:1424: trailing whitespace.
                                   errmsg("invalid statistics format,
stxndistinct attnums elements must be strings, but one is %s",
.git/rebase-apply/patch:1315: new blank line at EOF.
+
warning: 3 lines add whitespace errors.
Applying: Add pg_import_rel_stats().

2) D:\Project\Postgres>git am v2-0004-Add-pg_export_stats-pg_import_stats.patch
.git/rebase-apply/patch:282: trailing whitespace.
const char *export_query_v14 =
.git/rebase-apply/patch:489: trailing whitespace.
const char *export_query_v12 =
.git/rebase-apply/patch:648: trailing whitespace.
const char *export_query_v10 =
.git/rebase-apply/patch:826: trailing whitespace.

.git/rebase-apply/patch:1142: trailing whitespace.
        result = PQexec(conn,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
Applying: Add pg_export_stats, pg_import_stats.

Thanks and Regards,
Shubham Khanna.



Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
On Tue, Oct 31, 2023 at 12:55 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>>
>> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate
>> query for several versions back - they change, not by much, but they do change. So now we'd have each query text in
>> three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the
>> best way to share those in the codebase.
>>
>
> Attached is v2 of this patch.
>
> New features:
> * imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does,
> which is to say it derives those stats entirely from table column stats, which are imported, so in that sense we're
> getting index stats without touching the heap.
> * now support extended statistics except for MCV, which is currently serialized as an difficult-to-decompose bytea
> field.
> * bare-bones CLI script pg_export_stats, which extracts stats on databases back to v12 (tested) and could work back
> to v10.
> * bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from
> older versions.
>

I did a small experiment with your patches. In a separate database
"fdw_dst" I created a table t1 and populated it with 100K rows
#create table t1 (a int, b int);
#insert into t1 select i, i + 1 from generate_series(1, 100000) i;
#analyse t1;

In database "postgres" on the same server, I created a foreign table
pointing to t1
#create server fdw_dst_server foreign data wrapper postgres_fdw
OPTIONS ( dbname 'fdw_dst', port '5432');
#create user mapping for public server fdw_dst_server ;
#create foreign table t1 (a int, b int) server fdw_dst_server;

The estimates are off
#explain select * from t1 where a = 100;
                        QUERY PLAN
-----------------------------------------------------------
 Foreign Scan on t1  (cost=100.00..142.26 rows=13 width=8)
(1 row)

Export and import stats for table t1
$ pg_export_stats -d fdw_dst | pg_import_stats -d postgres

gives accurate estimates
#explain select * from t1 where a = 100;
                        QUERY PLAN
-----------------------------------------------------------
 Foreign Scan on t1  (cost=100.00..1793.02 rows=1 width=8)
(1 row)

In this simple case it's working like a charm.

Then I wanted to replace all ANALYZE commands in postgres_fdw.sql with
import and export of statistics. But I can not do that since it
requires table names to match. Foreign table metadata stores the
mapping between local and remote table as well as column names. Import
can use that mapping to install the statistics appropriately. We may
want to support a command or function in postgres_fdw to import
statistics of all the tables that point to a given foreign server.
That may be some future work based on your current patches.

I have not looked at the code though.

--
Best Wishes,
Ashutosh Bapat



Re: Statistics Import and Export

From
Corey Huinker
Date:
Yeah, that was the simplest output function possible, it didn't seem
worth it to implement something more advanced. pg_mcv_list_items() is
more convenient for most needs, but it's quite far from the on-disk
representation.

I was able to make it work.
 

That's actually a good question - how closely should the exported data
be to the on-disk format? I'd say we should keep it abstract, not tied
to the details of the on-disk format (which might easily change between
versions).

For the most part, I chose the exported JSON types and formats in a way that is most accommodating to cstring input functions. So, while many of the statistic values are obviously only ever integers/floats, they get stored as a numeric data type, which lacks direct numeric->int/float4/float8 conversion functions (though we could certainly create them, and I'm not against that); casting them to text lets us leverage pg_strtoint16, etc.
 

I'm a bit confused about the JSON schema used in pg_statistic_export
view, though. It simply serializes stakinds, stavalues, stanumbers into
arrays ... which works, but why not to use the JSON nesting? I mean,
there could be a nested document for histogram, MCV, ... with just the
correct fields.

  {
    ...
    histogram : { stavalues: [...] },
    mcv : { stavalues: [...], stanumbers: [...] },
    ...
  }

That's a very good question. I went with this format because it was fairly straightforward to code in SQL using existing JSON/JSONB functions, and that's what we will need if we want to export statistics on any server currently in existence. I'm certainly not locked into the current format, and if it can be shown how to transform the data into a superior format, I'd happily do so.

and so on. Also, what does TRIVIAL stand for?

It's currently serving double-duty for "there are no stats in this slot" and the situations where the stats computation could draw no conclusions about the data.

Attached is v3 of this patch. Key features are:

* Handles regular pg_statistic stats for any relation type.
* Handles extended statistics.
* Export views pg_statistic_export and pg_statistic_ext_export to allow inspection of existing stats and saving those values for later use.
* Import functions pg_import_rel_stats() and pg_import_ext_stats() which take Oids as input. This is intentional to allow stats from one object to be imported into another object.
* User scripts pg_export_stats and pg_import_stats, which offer a primitive way to serialize all the statistics of one database and import them into another.
* Has regression test coverage for both with a variety of data types.
* Passes my own manual test of extracting all of the stats from a v15 version of the popular "dvdrental" example database, as well as some additional extended statistics objects, and importing them into a development database.
* Import operations never touch the heap of any relation outside of pg_catalog. As such, this should be significantly faster than even the most cursory analyze operation, and therefore should be useful in upgrade situations, allowing the database to work with "good enough" stats more quickly, while still allowing for regular autovacuum to recalculate the stats "for real" at some later point.

The relation statistics code was adapted from similar features in analyze.c, but is now done in a query context. As before, the rowcount/pagecount values are updated on pg_class in a non-transactional fashion to avoid table bloat, while the updates to pg_statistic and pg_statistic_ext_data are done transactionally.

The existing statistics _store() functions were leveraged wherever practical, so much so that the extended statistics import is mostly just adapting the existing _build() functions into _import() functions which pull their values from JSON rather than computing the statistics.

Current concerns are:

1. I had to code a special-case exception for MCELEM stats on array data types, so that the array_in() call uses the element type rather than the array type. I had assumed that the existing examine_attribute() functions would have properly derived the typoid for that column, but that appears not to be the case, and I'm clearly missing how the existing code gets it right.
2. This hasn't been tested with external custom datatypes, but if they have a custom typanalyze function things should be ok.
3. While I think I have cataloged all of the schema-structural changes to pg_statistic[_ext[_data]] since version 10, I may have missed a case where the schema stayed the same, but the values are interpreted differently.
4. I don't yet have a complete vision for how these tools will be used by pg_upgrade and pg_dump/restore, the places where these will provide the biggest win for users.
 
Attachment

Re: Statistics Import and Export

From
Andrei Lepikhov
Date:
On 13/12/2023 17:26, Corey Huinker wrote:
> 4. I don't yet have a complete vision for how these tools will be used
> by pg_upgrade and pg_dump/restore, the places where these will provide
> the biggest win for users.

Some issues here with docs:

func.sgml:28465: parser error : Opening and ending tag mismatch: sect1 
line 26479 and sect2
   </sect2>
           ^

Also, as I remember, we already had some attempts to invent dump/restore 
statistics [1,2]. They were stopped with the problem of type 
verification. What if the definition of the type has changed between the 
dump and restore? As I see in the code, when importing statistics you just
check the column name and don't look into the type.

[1] Backup and recovery of pg_statistic
https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook
[2] Re: Ideas about a better API for postgres_fdw remote estimates
https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Statistics Import and Export

From
Corey Huinker
Date:


On Fri, Dec 15, 2023 at 3:36 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 13/12/2023 17:26, Corey Huinker wrote:
> 4. I don't yet have a complete vision for how these tools will be used
> by pg_upgrade and pg_dump/restore, the places where these will provide
> the biggest win for users.

Some issues here with docs:

func.sgml:28465: parser error : Opening and ending tag mismatch: sect1
line 26479 and sect2
   </sect2>
           ^

Apologies, will fix.
 

Also, as I remember, we already had some attempts to invent dump/restore
statistics [1,2]. They were stopped with the problem of type
verification. What if the definition of the type has changed between the
dump and restore? As I see in the code, Importing statistics you just
check the column name and don't see into the type.

We look up the imported statistics via column name, that is correct.

However, the values in stavalues and mcv and such are stored purely as text, so they must be cast using the input functions for that particular datatype. If that column definition changed, or the underlying input function changed, the stats import of that particular table would fail. It should be noted, however, that those same input functions were used to bring the data into the table via restore, so it would have already failed on that step. Either way, the structure of the table has effectively changed, so failure to import those statistics would be a good thing.
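
As a trivial illustration of the mechanism (this cast is not taken from the patch itself), an exported stavalues array for an integer column travels as text and is re-parsed on import via the type's input function, conceptually the same as:

  SELECT '{1,2,3}'::text::integer[];

so any incompatibility surfaces as a cast error at import time.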
 

[1] Backup and recovery of pg_statistic
https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook

That proposal sought to serialize enough information on the old server such that rows could be directly inserted into pg_statistic on the new server. As was pointed out at the time, version N of a server cannot know what the format of pg_statistic will be in version N+1. 

This patch avoids that problem by inspecting the structure of the object to be faux-analyzed, and using that to determine what parts of the JSON to fetch, and what datatype to cast values to in cases like mcv and stavaluesN. The exported JSON has no oids in it whatsoever, all elements subject to casting on import have already been cast to text, and the record returned carries the server version number of the producing system, which the import function can use to determine how to interpret the data it finds.
 

[2] Re: Ideas about a better API for postgres_fdw remote estimates
https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru


This one seems to be pulling oids from the remote server, and we can't guarantee their stability across systems, especially for objects and operators from extensions. I tried to go the route of extracting the full text name of an operator, but discovered that the qualified names, in addition to being unsightly, were irrelevant because we can't insert stats that disagree about type with the attribute/expression. So it didn't matter what type the remote system thought it had, the local system was going to coerce it into the expected data type or ereport() trying.

I think there is hope for having do_analyze() run a remote query fetching the remote table's exported stats and then storing them locally, possibly after some modification, and that would save us from having to sample a remote table.

Re: Statistics Import and Export

From
Tomas Vondra
Date:
Hi,

I finally had time to look at the last version of the patch, so here's a
couple thoughts and questions in somewhat random order. Please take this
as a bit of brainstorming and push back if you disagree with some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with a
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

I suggest we focus on the pg_upgrade use case for now. In particular, I
think we really need to find a good way to integrate this into
pg_upgrade. I'm not against having custom CLI commands, but it's still a
manual thing - I wonder if we could extend pg_dump to dump stats, or
make it built-in into pg_upgrade in some way (possibly disabled by
default, or something like that).


JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I'm still a bit unsure about the "right" JSON schema. I find it a bit
inconvenient that the JSON objects mimic the pg_statistic schema very
closely. In particular, it has one array for stakind values, another
array for stavalues, array for stanumbers etc. I understand generating
this JSON in SQL is fairly straightforward, and for the pg_upgrade use
case it's probably OK. But my concern is it's not very convenient for
the "manual tweaking" use case, because the "related" fields are
scattered in different parts of the JSON.

That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).


data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

Consider for example about the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.


more strict checks
------------------

I think the code should be a bit more "defensive" when importing stuff,
and do at least some sanity checks. For the pg_upgrade use case this
should be mostly non-issue (except for maybe helping to detect bugs
earlier), but for the "manual tweak" use case it's much more important.

By this I mean checks like:

* making sure the frequencies in MCV lists are not obviously wrong
  (outside [0,1], sum exceeding > 1.0, etc.)

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
  both arrays while histogram has only stavalues, that the arrays have
  the same length for MCV, etc.)

* checking there are no duplicate stakind values (e.g. two MCV lists)

This is another reason why I was thinking the current JSON format may be
a bit inconvenient, because it loads the fields separately, making the
checks harder. But I guess it could be done after loading everything, as
a separate phase.

Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.


minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

Actually, this means that the supported use cases determine whether the
expected JSON structure is part of the API. For pg_upgrade we could keep
it as "internal" and maybe change it as needed, but for "manual tweak"
it'd become part of the public API.

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

5) What happens in we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

7) The func.sgml additions in 0007 seems a bit strange, particularly the
first sentence of the paragraph.

8) While experimenting with the patch, I noticed this:

  create table t (a int, b int, c text);
  create statistics s on a, b, c, (a+b), (a-b) from t;

  create table t2 (a text, b text, c text);
  create statistics s2 on a, c from t2;

  select pg_import_ext_stats(
    (select oid from pg_statistic_ext where stxname = 's2'),
    (select server_version_num from pg_statistic_ext_export
                              where ext_stats_name = 's'),
    (select stats from pg_statistic_ext_export
                 where ext_stats_name = 's'));

WARNING:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.
ERROR:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.

I guess we should not trigger WARNING+ERROR with the same message.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Tue, Dec 26, 2023 at 02:18:56AM +0100, Tomas Vondra wrote:
> interfaces
> ----------
> 
> When I thought about the ability to dump/load statistics in the past, I
> usually envisioned some sort of DDL that would do the export and import.
> So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
> or something like that, and that'd do all the work. This would mean
> stats are "first-class citizens" and it'd be fairly straightforward to
> add this into pg_dump, for example. Or at least I think so ...
> 
> Alternatively we could have the usual "functional" interface, with a
> functions to export/import statistics, replacing the DDL commands.
> 
> Unfortunately, none of this works for the pg_upgrade use case, because
> existing cluster versions would not support this new interface, of
> course. That's a significant flaw, as it'd make this useful only for
> upgrades of future versions.
> 
> So I think for the pg_upgrade use case, we don't have much choice other
> than using "custom" export through a view, which is what the patch does.
> 
> However, for the other use case (tweaking optimizer stats) this is not
> really an issue - that always happens on the same instance, so no issue
> with not having the "export" function and so on. I'd bet there are more
> convenient ways to do this than using the export view. I'm sure it could
> share a lot of the infrastructure, ofc.
> 
> I suggest we focus on the pg_upgrade use case for now. In particular, I
> think we really need to find a good way to integrate this into
> pg_upgrade. I'm not against having custom CLI commands, but it's still a
> manual thing - I wonder if we could extend pg_dump to dump stats, or
> make it built-in into pg_upgrade in some way (possibly disabled by
> default, or something like that).

I have some thoughts on this too.  I understand the desire to add
something that can be used for upgrades _to_ PG 17, but I am concerned
that this will give us a cumbersome API that will hamper future
development.  I think we should develop the API we want, regardless of
how useful it is for upgrades _to_ PG 17, and then figure out what
short-term hacks we can add to get it working for upgrades _to_ PG 17; 
these hacks can eventually be removed.  Even if they can't be removed,
they are export-only and we can continue developing the import SQL
command cleanly, and I think import is going to need the most long-term
maintenance.

I think we need a robust API to handle two cases:

*  changes in how we store statistics
*  changes in how data type values are represented in the statistics

We have had such changes in the past, and I think these two issues are
what have prevented import/export of statistics up to this point.
Developing an API that doesn't cleanly handle these will cause long-term
pain.

In summary, I think we need an SQL-level command for this.  I think we
need to embed the Postgres export version number into the statistics
export file (maybe in the COPY header), and then load the file via COPY
internally (not JSON) into a temporary table that we know matches the
exported Postgres version.  We then need to use SQL to make any
adjustments to it before loading it into pg_statistic.  Doing that
internally in JSON just isn't efficient.  If people want JSON for such
cases, I suggest we add a JSON format to COPY.

I think we can then look at pg_upgrade to see if we can simulate the
export action which can use the statistics import SQL command.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I think we need a robust API to handle two cases:

> *  changes in how we store statistics
> *  changes in how how data type values are represented in the statistics

> We have had such changes in the past, and I think these two issues are
> what have prevented import/export of statistics up to this point.
> Developing an API that doesn't cleanly handle these will cause long-term
> pain.

Agreed.

> In summary, I think we need an SQL-level command for this.

I think a SQL command is an actively bad idea.  It'll just add development
and maintenance overhead that we don't need.  When I worked on this topic
years ago at Salesforce, I had things set up with simple functions, which
pg_dump would invoke by writing more or less

    SELECT pg_catalog.load_statistics(....);

This has a number of advantages, not least of which is that an extension
could plausibly add compatible functions to older versions.  The trick,
as you say, is to figure out what the argument lists ought to be.
Unfortunately I recall few details of what I wrote for Salesforce,
but I think I had it broken down in a way where there was a separate
function call occurring for each pg_statistic "slot", thus roughly

load_statistics(table regclass, attname text, stakind int, stavalue ...);

I might have had a separate load_statistics_xxx function for each
stakind, which would ease the issue of deciding what the datatype
of "stavalue" is.  As mentioned already, we'd also need some sort of
version identifier, and we'd expect the load_statistics() functions
to be able to transform the data if the old version used a different
representation.  I agree with the idea that an explicit representation
of the source table attribute's type would be wise, too.

            regards, tom lane



Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 12/26/23 20:19, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> I think we need a robust API to handle two cases:
> 
>> *  changes in how we store statistics
>> *  changes in how how data type values are represented in the statistics
> 
>> We have had such changes in the past, and I think these two issues are
>> what have prevented import/export of statistics up to this point.
>> Developing an API that doesn't cleanly handle these will cause long-term
>> pain.
> 
> Agreed.
> 

I agree the format is important - we don't want to end up with a format
that's cumbersome or inconvenient to use. But I don't think the proposed
format is particularly bad in those respects - it mostly reflects how we
store statistics, and if I were designing a format for humans, it might
look a bit different. But that's not the goal here, IMHO.

I don't quite understand the two cases above. Why should this affect how
we store statistics? Surely, making the statistics easy to use for the
optimizer is much more important than occasional export/import.

>> In summary, I think we need an SQL-level command for this.
> 
> I think a SQL command is an actively bad idea.  It'll just add development
> and maintenance overhead that we don't need.  When I worked on this topic
> years ago at Salesforce, I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
> 
>     SELECT pg_catalog.load_statistics(....);
> 
> This has a number of advantages, not least of which is that an extension
> could plausibly add compatible functions to older versions.  The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
> 
> load_statistics(table regclass, attname text, stakind int, stavalue ...);
> 
> I might have had a separate load_statistics_xxx function for each
> stakind, which would ease the issue of deciding what the datatype
> of "stavalue" is.  As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.  I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.
> 

Yeah, this is pretty much what I meant by "functional" interface. But if
I said the format implemented by the patch is maybe too close to
how we store the statistics, then this has exactly the same issue. And
it has other issues too, I think - it breaks down the stats into
multiple function calls, so ensuring the sanity/correctness of whole
sets of statistics gets much harder, I think.

I'm not sure about the extension idea. Yes, we could have an extension
providing such functions, but do we have any precedent of making
pg_upgrade dependent on an external extension? I'd much rather have
something built-in that just works, especially if we intend to make it
the default behavior (which I think should be our aim here).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Dec 27, 2023 at 01:08:47PM +0100, Tomas Vondra wrote:
> On 12/26/23 20:19, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> I think we need a robust API to handle two cases:
> > 
> >> *  changes in how we store statistics
> >> *  changes in how how data type values are represented in the statistics
> > 
> >> We have had such changes in the past, and I think these two issues are
> >> what have prevented import/export of statistics up to this point.
> >> Developing an API that doesn't cleanly handle these will cause long-term
> >> pain.
> > 
> > Agreed.
> > 
> 
> I agree the format is important - we don't want to end up with a format
> that's cumbersome or inconvenient to use. But I don't think the proposed
> format is somewhat bad in those respects - it mostly reflects how we
> store statistics and if I was designing a format for humans, it might
> look a bit differently. But that's not the goal here, IMHO.
> 
> I don't quite understand the two cases above. Why should this affect how
> we store statistics? Surely, making the statistics easy to use for the
> optimizer is much more important than occasional export/import.

The two items above were to focus on getting a solution that can easily
handle future statistics storage changes.  I figured we would want to
manipulate the data as a table internally so I am confused why we would
export JSON instead of a COPY format.  I didn't think we were changing
how we internally store or use the statistics.

> >> In summary, I think we need an SQL-level command for this.
> > 
> > I think a SQL command is an actively bad idea.  It'll just add development
> > and maintenance overhead that we don't need.  When I worked on this topic
> > years ago at Salesforce, I had things set up with simple functions, which
> > pg_dump would invoke by writing more or less
> > 
> >     SELECT pg_catalog.load_statistics(....);
> > 
> > This has a number of advantages, not least of which is that an extension
> > could plausibly add compatible functions to older versions.  The trick,
> > as you say, is to figure out what the argument lists ought to be.
> > Unfortunately I recall few details of what I wrote for Salesforce,
> > but I think I had it broken down in a way where there was a separate
> > function call occurring for each pg_statistic "slot", thus roughly
> > 
> > load_statistics(table regclass, attname text, stakind int, stavalue ...);
> > 
> > I might have had a separate load_statistics_xxx function for each
> > stakind, which would ease the issue of deciding what the datatype
> > of "stavalue" is.  As mentioned already, we'd also need some sort of
> > version identifier, and we'd expect the load_statistics() functions
> > to be able to transform the data if the old version used a different
> > representation.  I agree with the idea that an explicit representation
> > of the source table attribute's type would be wise, too.
> > 
> 
> Yeah, this is pretty much what I meant by "functional" interface. But if
> I said maybe the format implemented by the patch is maybe too close to
> how we store the statistics, then this has exactly the same issue. And
> it has other issues too, I think - it breaks down the stats into
> multiple function calls, so ensuring the sanity/correctness of whole
> sets of statistics gets much harder, I think.

I was suggesting an SQL command because this feature is going to need a
lot of options and do a lot of different things, I am afraid, and a
single function might be too complex to manage.

> I'm not sure about the extension idea. Yes, we could have an extension
> providing such functions, but do we have any precedent of making
> pg_upgrade dependent on an external extension? I'd much rather have
> something built-in that just works, especially if we intend to make it
> the default behavior (which I think should be our aim here).

Uh, an extension seems nice to allow people in back branches to install
it, but not for normal usage.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Corey Huinker
Date:
On Mon, Dec 25, 2023 at 8:18 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

I finally had time to look at the last version of the patch, so here's a
couple thoughts and questions in somewhat random order. Please take this
as a bit of a brainstorming and push back if you disagree some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with a
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

This was the reason I settled on the interface that I did: while we can create whatever interface we want for importing the statistics, we would need to be able to extract stats from databases using only the facilities available in those same databases, and then store that in a medium that could be conveyed across databases, either by text files or by saving them off in a side table prior to upgrade. JSONB met the criteria.
 

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

So, there is a third use case - foreign data wrappers. When analyzing a foreign table, at least one in the postgres_fdw family of foreign servers, we should be able to send a query specific to the version and dialect of that server, get back the JSONB, and import those results. That use case may be more tangible to you than the tweak/tuning case.
 



JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I see a few problems with dumping pg_statistic[_ext_data]. The first is that the importer now has to understand all of the past formats of those two tables. The next is that the tables are chock full of Oids that don't necessarily carry forward. I could see us having a text-ified version of those two tables, but we'd need that for all previous iterations of those table formats. Instead, I put the burden on the stats export to de-oid the data and make it *_in() function friendly.
 
That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

I agree that would be a lot more readable, and probably a lot more debuggable. But I went into this unsure if there could be more than one stats slot of a given kind per table. Knowing that they must be unique helps.
 
But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).

I think it would be harder, but far from impossible.
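
For instance, a minimal sketch of one nested group, assuming purely for illustration that the histogram happens to live in slot 2, could lean on the existing jsonb functions:

  SELECT jsonb_build_object(
           'histogram',
           jsonb_build_object('stavalues',
                              to_jsonb(s.stavalues2::text::text[])))
    FROM pg_statistic s
   WHERE s.starelid = 't1'::regclass
     AND s.staattnum = 1;

The fiddly part is picking the right slot for each stakind rather than hard-coding it, which is where the extra SQL complexity comes in.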

 
data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

I"m not against this at all, and I started out doing that, but the qualified names of operators got _ugly_, and I quickly realized that what I was generating wouldn't matter, either the input data would make sense for the attribute's stats or it would fail trying.
 
Consider for example about the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

I can see including the type of the column, that's a lot cleaner than the operator names for sure, and I can see us rejecting stats or sections of stats in certain situations. Like in your example, if the collation changed, then reject all "<" op stats but keep the "=" ones.
 
For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.

Correct me if I'm wrong, but I thought expression parse trees change _a lot_ from version to version?

Attribute reordering is a definite vulnerability of the current implementation, so an attribute name export might be a way to mitigate that.
 

* making sure the frequencies in MCV lists are not obviously wrong
  (outside [0,1], sum exceeding > 1.0, etc.)

+1
 

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
  both arrays while histogram has only stavalues, that the arrays have
  the same length for MCV, etc.)

To this end, there's an edge-case hack in the code where I have to derive the array elemtype. I had thought that examine_attribute() or std_typanalyze() was going to do that for me, but it didn't. Very much want your input there.
 

* checking there are no duplicate stakind values (e.g. two MCV lists)

Per previous comment, it's good to learn these restrictions.
 
Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.

For this first pass, all errors were one-size-fits-all, save for the WARNING vs ERROR distinction.
 


minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

I have no opinion on what the best name would be, and will go with consensus.
 

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

Correct. Cannot be rolled back, not subject to MVCC.

 
3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

I agree, but I also didn't expect the format to survive first contact with reviewers, so I held back.
 

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

That was initially done to maximize the amount of code that could be copied from do_analyze(). In retrospect, I like how extended statistics just deletes all the pg_statistic_ext_data rows and replaces them and I would like to do the same for pg_statistic before this is all done.
 

5) What happens in we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

In the case of pg_statistic_ext_data, the stats are thrown out and replaced by the imported ones.

In the case of pg_statistic, it's basically an upsert, and any values that were missing in the JSON are not updated on the existing row. That's appealing in a tweak situation where you want to only alter one or two bits of a stat, but not really useful in other situations. Per previous comment, I'd prefer a clean slate and forcing tweaking use cases to fill in all the blanks.
 

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

The importer is currently driven by the types of stats to be expected for that pg_attribute/pg_statistic_ext. It only looks for things that are possible for that stat type, and any extra JSON values are ignored.
 

Re: Statistics Import and Export

From
Corey Huinker
Date:

  As mentioned already, we'd also need some sort of
version identifier, and we'd expect the load_statistics() functions
to be able to transform the data if the old version used a different
representation.  I agree with the idea that an explicit representation
of the source table attribute's type would be wise, too.

There is a version identifier currently (its own column, not embedded in the JSON), but I discovered that I was able to put the burden on the export queries to spackle over the changes in the table structures over time. Still, I knew that we'd need the version number in there eventually.

Re: Statistics Import and Export

From
Corey Huinker
Date:
Yeah, this is pretty much what I meant by "functional" interface. But if
I said maybe the format implemented by the patch is maybe too close to
how we store the statistics, then this has exactly the same issue. And
it has other issues too, I think - it breaks down the stats into
multiple function calls, so ensuring the sanity/correctness of whole
sets of statistics gets much harder, I think.

Export functions was my original plan, for simplicity, maintenance, etc, but it seemed like I'd be adding quite a few functions, so the one view made more sense for an initial version. Also, I knew that pg_dump or some other stats exporter would have to inline the guts of those functions into queries for older versions, and adapting a view definition seemed more straightforward for the reader than function definitions.
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> Export functions was my original plan, for simplicity, maintenance, etc,
> but it seemed like I'd be adding quite a few functions, so the one view
> made more sense for an initial version. Also, I knew that pg_dump or some
> other stats exporter would have to inline the guts of those functions into
> queries for older versions, and adapting a view definition seemed more
> straightforward for the reader than function definitions.

Hmm, I'm not sure we are talking about the same thing at all.

What I am proposing is *import* functions.  I didn't say anything about
how pg_dump obtains the data it prints; however, I would advocate that
we keep that part as simple as possible.  You cannot expect export
functionality to know the requirements of future server versions,
so I don't think it's useful to put much intelligence there.

So I think pg_dump should produce a pretty literal representation of
what it finds in the source server's catalog, and then rely on the
import functions in the destination server to make sense of that
and do whatever slicing-n-dicing is required.

That being the case, I don't see a lot of value in a view -- especially
not given the requirement to dump from older server versions.
(Conceivably we could just say that we won't dump stats from server
versions predating the introduction of this feature, but that's hardly
a restriction that supports doing this via a view.)

            regards, tom lane



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Dec 27, 2023 at 09:41:31PM -0500, Corey Huinker wrote:
>     When I thought about the ability to dump/load statistics in the past, I
>     usually envisioned some sort of DDL that would do the export and import.
>     So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
>     or something like that, and that'd do all the work. This would mean
>     stats are "first-class citizens" and it'd be fairly straightforward to
>     add this into pg_dump, for example. Or at least I think so ...
> 
>     Alternatively we could have the usual "functional" interface, with a
>     functions to export/import statistics, replacing the DDL commands.
> 
>     Unfortunately, none of this works for the pg_upgrade use case, because
>     existing cluster versions would not support this new interface, of
>     course. That's a significant flaw, as it'd make this useful only for
>     upgrades of future versions.
> 
> 
> This was the reason I settled on the interface that I did: while we can create
> whatever interface we want for importing the statistics, we would need to be
> able to extract stats from databases using only the facilities available in
> those same databases, and then store that in a medium that could be conveyed
> across databases, either by text files or by saving them off in a side table
> prior to upgrade. JSONB met the criteria.

Uh, it wouldn't be crazy to add this capability to pg_upgrade/pg_dump in
a minor version upgrade if it wasn't enabled by default, and if we were
very careful.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Corey Huinker
Date:
On Wed, Dec 27, 2023 at 10:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
> Export functions was my original plan, for simplicity, maintenance, etc,
> but it seemed like I'd be adding quite a few functions, so the one view
> made more sense for an initial version. Also, I knew that pg_dump or some
> other stats exporter would have to inline the guts of those functions into
> queries for older versions, and adapting a view definition seemed more
> straightforward for the reader than function definitions.

Hmm, I'm not sure we are talking about the same thing at all.

Right, I was conflating two things.
 

What I am proposing is *import* functions.  I didn't say anything about
how pg_dump obtains the data it prints; however, I would advocate that
we keep that part as simple as possible.  You cannot expect export
functionality to know the requirements of future server versions,
so I don't think it's useful to put much intelligence there.

True, but presumably you'd be using the pg_dump/pg_upgrade of that future version to do the exporting, so the export format would always be tailored to the importer's needs.
 

So I think pg_dump should produce a pretty literal representation of
what it finds in the source server's catalog, and then rely on the
import functions in the destination server to make sense of that
and do whatever slicing-n-dicing is required.

Obviously it can't be purely literal, as we have to replace the oid values with whatever text representation we feel helps us carry forward. In addition, we're setting the number of tuples and number of pages directly in pg_class, and doing so non-transactionally just like ANALYZE does. We could separate that out into its own import function, but then we're locking every relation twice, once for the tuples/pages and once again for the pg_statistic import.

My current line of thinking was that the stats import call, if enabled, would immediately follow the CREATE statement of the object itself, but that requires us to have everything we need to know for the import passed into the import function, so we'd need a way to serialize _that_. If you're thinking that we have one big bulk stats import, that might work, but it also means that we're less tolerant of failures in the import step. 

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Thu, Dec 28, 2023 at 12:28:06PM -0500, Corey Huinker wrote:
>     What I am proposing is *import* functions.  I didn't say anything about
>     how pg_dump obtains the data it prints; however, I would advocate that
>     we keep that part as simple as possible.  You cannot expect export
>     functionality to know the requirements of future server versions,
>     so I don't think it's useful to put much intelligence there.
> 
> True, but presumably you'd be using the pg_dump/pg_upgrade of that future
> version to do the exporting, so the export format would always be tailored to
> the importer's needs.

I think the question is whether we will have the export functionality in
the old cluster, or if it will be queries run by pg_dump and therefore
also run by pg_upgrade calling pg_dump.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 12/13/23 11:26, Corey Huinker wrote:
>     Yeah, that was the simplest output function possible, it didn't seem
>
>     worth it to implement something more advanced. pg_mcv_list_items() is
>     more convenient for most needs, but it's quite far from the on-disk
>     representation.
>
>
> I was able to make it work.
>
>
>
>     That's actually a good question - how closely should the exported data
>     be to the on-disk format? I'd say we should keep it abstract, not tied
>     to the details of the on-disk format (which might easily change
between
>     versions).
>
>
> For the most part, I chose the exported data json types and formats in a
> way that was the most accommodating to cstring input functions. So,
> while so many of the statistic values are obviously only ever
> integers/floats, those get stored as a numeric data type which lacks
> direct numeric->int/float4/float8 functions (though we could certainly
> create them, and I'm not against that), casting them to text lets us
> leverage pg_strtoint16, etc.
>
>
>
>     I'm a bit confused about the JSON schema used in pg_statistic_export
>     view, though. It simply serializes stakinds, stavalues, stanumbers
into
>     arrays ... which works, but why not to use the JSON nesting? I mean,
>     there could be a nested document for histogram, MCV, ... with just the
>     correct fields.
>
>       {
>         ...
>         histogram : { stavalues: [...] },
>         mcv : { stavalues: [...], stanumbers: [...] },
>         ...
>       }
>
>
> That's a very good question. I went with this format because it was
> fairly straightforward to code in SQL using existing JSON/JSONB
> functions, and that's what we will need if we want to export statistics
> on any server currently in existence. I'm certainly not locked in with
> the current format, and if it can be shown how to transform the data
> into a superior format, I'd happily do so.
>
>     and so on. Also, what does TRIVIAL stand for?
>
>
> It's currently serving double-duty for "there are no stats in this slot"
> and the situations where the stats computation could draw no conclusions
> about the data.
>
> Attached is v3 of this patch. Key features are:
>
> * Handles regular pg_statistic stats for any relation type.
> * Handles extended statistics.
> * Export views pg_statistic_export and pg_statistic_ext_export to allow
> inspection of existing stats and saving those values for later use.
> * Import functions pg_import_rel_stats() and pg_import_ext_stats() which
> take Oids as input. This is intentional to allow stats from one object
> to be imported into another object.
> * User scripts pg_export_stats and pg_import stats, which offer a
> primitive way to serialize all the statistics of one database and import
> them into another.
> * Has regression test coverage for both with a variety of data types.
> * Passes my own manual test of extracting all of the stats from a v15
> version of the popular "dvdrental" example database, as well as some
> additional extended statistics objects, and importing them into a
> development database.
> * Import operations never touch the heap of any relation outside of
> pg_catalog. As such, this should be significantly faster than even the
> most cursory analyze operation, and therefore should be useful in
> upgrade situations, allowing the database to work with "good enough"
> stats more quickly, while still allowing for regular autovacuum to
> recalculate the stats "for real" at some later point.
>
> The relation statistics code was adapted from similar features in
> analyze.c, but is now done in a query context. As before, the
> rowcount/pagecount values are updated on pg_class in a non-transactional
> fashion to avoid table bloat, while the updates to pg_statistic and
> pg_statistic_ext_data are done transactionally.
>
> The existing statistics _store() functions were leveraged wherever
> practical, so much so that the extended statistics import is mostly just
> adapting the existing _build() functions into _import() functions which
> pull their values from JSON rather than computing the statistics.
>
> Current concerns are:
>
> 1. I had to code a special-case exception for MCELEM stats on array data
> types, so that the array_in() call uses the element type rather than the
> array type. I had assumed that the existing examine_attribute()
> functions would have properly derived the typoid for that column, but it
> appears to not be the case, and I'm clearly missing how the existing
> code gets it right.
Hmm, after looking at this, I'm not sure it's such an ugly hack ...

The way this works for ANALYZE is that examine_attribute() eventually
calls the typanalyze function:

  if (OidIsValid(stats->attrtype->typanalyze))
    ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
                                       PointerGetDatum(stats)));

which for arrays is array_typanalyze, and this sets stats->extra_data to
ArrayAnalyzeExtraData with all the interesting info about the array
element type, and then also std_extra_data with info about the array
type itself.

  stats -> extra_data -> std_extra_data

compute_array_stats then "restores" std_extra_data to compute standard
stats for the whole array, and then uses the ArrayAnalyzeExtraData to
calculate stats for the elements.

It's not exactly pretty, because there are global variables and so on.

And examine_rel_attribute() does the same thing - calls typanalyze, so
if I break after it returns, I see this for int[] column:

(gdb) p * (ArrayAnalyzeExtraData *) stat->extra_data

$1 = {type_id = 23, eq_opr = 96, coll_id = 0, typbyval = true,
      typlen = 4, typalign = 105 'i', cmp = 0x2e57920, hash = 0x2e57950,
      std_compute_stats = 0x6681b8 <compute_scalar_stats>,
      std_extra_data = 0x2efe670}

I think the "problem" will be how to use this in import_stavalues(). You
can't just do this for any array type, I think. I could create an array
type (with ELEMENT=X) but with a custom analyze function, in which case
the extra_data may be something entirely different.

I suppose the correct solution would be to add an "import" function into
the pg_type catalog (next to typanalyze). Or maybe it'd be enough to set
it from the typanalyze? After all, that's what sets compute_stats.

But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Corey Huinker
Date:
But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?


Thanks for the explanation of why it wasn't working for me. Knowing that the case of MCELEM + is-array-type is the only case where we'd need to do that puts me at ease.

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 12/29/23 17:27, Corey Huinker wrote:
>     But maybe it's enough to just do what you did - if we get an MCELEM
>     slot, can it ever contain anything else than array of elements of the
>     attribute array type? I'd bet that'd cause all sorts of issues, no?
> 
> 
> Thanks for the explanation of why it wasn't working for me. Knowing that
> the case of MCELEM + is-array-type is the only case where we'd need to
> do that puts me at ease.
> 

But I didn't claim MCELEM is the only slot where this might be an issue.
I merely asked if a MCELEM slot can ever contain an array with element
type different from the "original" attribute.

After thinking about this a bit more, and doing a couple experiments
with a trivial custom data type, I think this is true:

1) MCELEM slots for "real" array types are OK

I don't think we allow "real" array types to be created by users directly;
all arrays are created implicitly by the system. Those types always have
array_typanalyze, which guarantees MCELEM has the correct element type.

I haven't found a way to either inject my custom array type or alter the
typanalyze to some custom function. So I think this is OK.
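A quick sanity check of that claim (my own query, not part of the patch):

SELECT t.typname, t.typanalyze
FROM pg_type t
WHERE t.typname IN ('_int4', '_text', '_numeric');
-- every row shows array_typanalyze as the typanalyze function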


2) I'm not sure we can extend this to regular data types / other slots

For example, I think I can implement a data type with custom typanalyze
function (and custom compute_stats function) that fills slots with some
other / strange stuff. For example I might build MCV with hashes of the
original data, a CountMin sketch, or something like that.

Yes, I don't think people do that often, but as long as the type also
implements custom selectivity functions for the operators, I think this
would work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.



Re: Statistics Import and Export

From
Corey Huinker
Date:


On Mon, Jan 22, 2024 at 1:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.

Attached is v4 of the statistics export/import patch.

This version has been refactored to match the design feedback received previously.

The system views are gone. These were mostly there to serve as a baseline for what an export query would look like. That role is temporarily reassigned to pg_export_stats.c, but hopefully they will be integrated into pg_dump in the next version. The regression test also contains the version of each query suitable for the current server version.

The export format is far closer to the raw format of pg_statistic and pg_statistic_ext_data, respectively. This format involves exporting oid values for types, collations, operators, and attributes - values which are specific to the server they were created on. To make sense of those values, a subset of the columns of pg_type, pg_attribute, pg_collation, and pg_operator are exported as well, which allows pg_import_rel_stats() and pg_import_ext_stats() to reconstitute the data structure as it existed on the old server, and adapt it to the modern structure and local schema objects.
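For illustration, here is a rough sketch of the kind of per-column export query this format implies (not the patch's actual query, and the column list is heavily abbreviated):

SELECT s.staattnum,
       a.attname,
       t.typname          AS atttypname,
       s.stanullfrac, s.stawidth, s.stadistinct,
       s.stakind1, s.staop1, s.stacoll1,
       s.stanumbers1,
       s.stavalues1::text AS stavalues1   -- anyarray rendered as text
FROM pg_catalog.pg_statistic s
JOIN pg_catalog.pg_attribute a
  ON a.attrelid = s.starelid AND a.attnum = s.staattnum
JOIN pg_catalog.pg_type t
  ON t.oid = a.atttypid
WHERE s.starelid = 'public.foo'::regclass;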

pg_import_rel_stats matches up local columns with the exported stats by column name, not attnum. This allows for stats to be imported when columns have been dropped, added, or reordered.

pg_import_ext_stats can also handle column reordering, though it currently would get confused by changes in expressions that maintain the same result data type. I'm not yet brave enough to handle importing nodetrees, nor do I think it's wise to try. I think we'd be better off validating that the destination extended stats object is identical in structure, and to fail the import of that one object if it isn't perfect.

Export formats go back to v10.

Re: Statistics Import and Export

From
Corey Huinker
Date:
(hit send before attaching patches, reposting message as well)

Attached is v4 of the statistics export/import patch.

This version has been refactored to match the design feedback received previously.

The system views are gone. These were mostly there to serve as a baseline for what an export query would look like. That role is temporarily reassigned to pg_export_stats.c, but hopefully they will be integrated into pg_dump in the next version. The regression test also contains the version of each query suitable for the current server version.

The export format is far closer to the raw format of pg_statistic and pg_statistic_ext_data, respectively. This format involves exporting oid values for types, collations, operators, and attributes - values which are specific to the server they were created on. To make sense of those values, a subset of the columns of pg_type, pg_attribute, pg_collation, and pg_operator are exported as well, which allows pg_import_rel_stats() and pg_import_ext_stats() to reconstitute the data structure as it existed on the old server, and adapt it to the modern structure and local schema objects.

pg_import_rel_stats matches up local columns with the exported stats by column name, not attnum. This allows for stats to be imported when columns have been dropped, added, or reordered.

pg_import_ext_stats can also handle column reordering, though it currently would get confused by changes in expressions that maintain the same result data type. I'm not yet brave enough to handle importing nodetrees, nor do I think it's wise to try. I think we'd be better off validating that the destination extended stats object is identical in structure, and to fail the import of that one object if it isn't perfect.

Export formats go back to v10.


On Mon, Jan 22, 2024 at 1:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.
Attachment

Re: Statistics Import and Export

From
Tomas Vondra
Date:
Hi,

I took a quick look at the v4 patches. I haven't done much testing yet,
so only some basic review.

0001

- The SGML docs for pg_import_rel_stats may need some changes. It starts
with description of what gets overwritten (non-)transactionally (which
gets repeated twice), but that seems more like an implementation detail.
But it does not really say which pg_class fields get updated. Then it
speculates about the possible use case (pg_upgrade). I think it'd be
better to focus on the overall goal of updating statistics, explain what
gets updated/how, and only then maybe mention the pg_upgrade use case.

Also, it says "statistics are replaced" but it's quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).


- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?

- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.

- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?

- validate_no_duplicates and the other validate functions would deserve
a better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.

- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?

- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...


0002

- I'd rename create_stat_ext_entry to statext_create_entry.

- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.

- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?

- pg_import_ext_stats seems to not use require_match_oids - bug?


0003

- no SGML docs for the new tools?

- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?


On 2/2/24 09:37, Corey Huinker wrote:
> (hit send before attaching patches, reposting message as well)
> 
> Attached is v4 of the statistics export/import patch.
> 
> This version has been refactored to match the design feedback received
> previously.
> 
> The system views are gone. These were mostly there to serve as a baseline
> for what an export query would look like. That role is temporarily
> reassigned to pg_export_stats.c, but hopefully they will be integrated into
> pg_dump in the next version. The regression test also contains the version
> of each query suitable for the current server version.
> 

OK

> The export format is far closer to the raw format of pg_statistic and
> pg_statistic_ext_data, respectively. This format involves exporting oid
> values for types, collations, operators, and attributes - values which are
> specific to the server they were created on. To make sense of those values,
> a subset of the columns of pg_type, pg_attribute, pg_collation, and
> pg_operator are exported as well, which allows pg_import_rel_stats() and
> pg_import_ext_stats() to reconstitute the data structure as it existed on
> the old server, and adapt it to the modern structure and local schema
> objects.

I have no opinion on the proposed format - still JSON, but closer to the
original data. Works for me, but I wonder what Tom thinks about it,
considering he suggested making it closer to the raw data.

> 
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
> 

Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?

> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
> 

Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?

I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.

> Export formats go back to v10.
> 

Do we even want/need to go beyond 12? All earlier versions are EOL.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Corey Huinker
Date:
Also, it says "statistics are replaced" but it's quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).

All are now deleted first, both in the pg_statistic and pg_statistic_ext_data tables. The previous version was taking a more "replace it if we find a new value" approach, but that's overly complicated, so following the example set by extended statistics seemed best.

 
- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?

Completely fair. Initially that was done with the expectation that the array would be the same for both regular stats and extended stats, but that was no longer the case.
 
- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.
 
I think you're right, we don't need it anymore for regular statistics. We still need it in extended stats because statext_store() takes a subset of the vacattrstats rows as an input.

Which leads to a side issue. We currently have 3 functions: examine_rel_attribute and the two varieties of examine_attribute (one in analyze.c and the other in extended stats). These are highly similar but just different enough that I didn't feel comfortable refactoring them into a one-size-fits-all function, and I was particularly reluctant to modify existing code for the ANALYZE path.
 

- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?

Good catch, and it highlights how little we need VacAttrStats for regular statistics.
 

- validate_no_duplicates and the other validate functions would deserve
a better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.

Those functions are in a fairly formative phase - I expect a conversation about what sort of validations we want to do to ensure that the statistics being imported make sense, and under what circumstances we would forego some of those checks.
 

- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?

I see no need to have them be a part of the public API. Will move.
 

- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...

That was an experiment that arose out of the difficulty in determining _where_ a difference was when the set-difference checks failed. So far I like it, and I'm hoping it catches on.

 


0002

- I'd rename create_stat_ext_entry to statext_create_entry.

- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.

The import functions would obviously never use the imported oids to look up objects on the destination system. Rather, they're there to verify that the local object oid matches the exported object oid, which is true in the case of a binary upgrade.

The export format is an attempt to export the pg_statistic[_ext_data] for that object as-is, and, as Tom suggested, let the import function do the transformations. We can of course remove them if they truly have no purpose for validation.
 

- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?

The import function takes an oid of the object (relation or extstat object), and the json payload is supposed to be the stats for ONE corresponding object. Multiple objects of data really don't fit into the json format, and statistics exported for an object that does not exist on the destination system would have no meaningful invocation. I envision the dump file looking like this

    CREATE TABLE public.foo (....);

    SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>, option flag, option flag);

So a call against a nonexistent object would fail on the regclass cast.
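For example, the cast itself raises the error before the import function is ever reached:

    SELECT pg_import_rel_stats('public.does_not_exist'::regclass, <json blob>, ...);
    -- ERROR:  relation "public.does_not_exist" does not exist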
 

- pg_import_ext_stats seems to not use require_match_oids - bug?

I haven't yet seen a good way to make use of matching oids in extended stats. Checking matching operator/collation oids would make sense, but little else.
 


0003

- no SGML docs for the new tools?

Correct. I foresee the export tool being folded into pg_dump(), and the import tool going away entirely as psql could handle it.
 

- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?

Correct, for the reason above.

 
>
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
>

Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?

One of the parameters to the function is the oid of the object that is the target of the stats. The importer will not seek out objects with matching names and each JSON payload is limited to holding one object, though clearly someone could encapsulate the existing format in a format that has a manifest of objects to import.
 

> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
>

Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?

The overlong transformation SQL starts with the object to be imported (whose local oid was specified), and it:

1. grabs all the attributes (or exprs, for extended stats) of that object.
2. looks for columns/exprs in the exported json for an attribute with a matching name
3. takes the exported attnum of that exported attribute for use in things like stdexprs
4. looks up the type, collation, and operators for the exported attribute.

So we get a situation where there might not be importable stats for an attribute of the destination table, and we'd import nothing for that column. Stats for exported columns with no matching local column would never be referenced.

Yes, there should be a test of this.
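A minimal sketch of that name-based matching, with the JSON shape simplified well below the actual export format:

SELECT a.attnum AS local_attnum,
       col.stats
FROM pg_catalog.pg_attribute a
JOIN jsonb_array_elements(
       '[{"attname": "id"}, {"attname": "name"}]'::jsonb
     ) AS col(stats)
  ON col.stats->>'attname' = a.attname
WHERE a.attrelid = 'public.foo'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped;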


I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.

> Export formats go back to v10.
>

Do we even want/need to go beyond 12? All earlier versions are EOL.

True, but we had pg_dump and pg_restore support back to 7.x until fairly recently, and a major friction point in getting customers to upgrade their instances off of unsupported versions is the downtime caused by an upgrade, so why wouldn't we make it easier for them?

Re: Statistics Import and Export

From
Corey Huinker
Date:
Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), which address many of the concerns listed earlier.

Leaving the export/import scripts off for the time being, as they haven't  changed and the next likely change is to fold them into pg_dump.


Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), which address many of the concerns listed earlier.

Leaving the export/import scripts off for the time being, as they haven't  changed and the next likely change is to fold them into pg_dump.



v6 posted below.

Changes:

- Additional documentation about the overall process.
- Rewording of SGML docs.
- removed a fair number of columns from the transformation queries.
- enabled require_match_oids in extended statistics, but I'm having my doubts about the value of that.
- moved stats extraction functions to an fe_utils file stats_export.c that will be used by both pg_export_stats and pg_dump.
- pg_export_stats now generates SQL statements rather than a tsv, and has boolean flags to set the validate and require_match_oids parameters in the calls to pg_import_(rel|ext)_stats.
- pg_import_stats is gone, as importing can now be done with psql.

I'm hoping to get feedback on a few areas.

1. The checks for matching oids. On the one hand, in a binary upgrade situation, we would of course want the oid of the relation to match what was exported, as well as all of the atttypids of the attributes to match the type ids exported, the same for collations, etc. However, the binary upgrade is the one place where there are absolutely no middle steps that could have altered either the stats jsons or the source tables. Given that, and that oids simply will never match in any situation other than a binary upgrade, it may be best to discard those checks.

2. The checks for relnames matching, and typenames of attributes matching (they are already matched by name, so the column order can change without the import missing a beat) seem so necessary that there shouldn't be an option to enable/disable them. But if that's true, then the initial relation parameter becomes somewhat unnecessary, and anyone using these functions for tuning or FDW purposes could easily transform the JSON using SQL to put in the proper relname.

3. The data integrity validation functions may belong in a separate function rather than being a parameter on the existing import functions.

4. Lastly, pg_dump. Each relation object and extended statistics object will have a statistics import statement. From my limited experience with pg_dump, it seems like we would add an additional Stmt variable (statsStmt) to the TOC entry for each object created, and the restore process would check the value of --with-statistics; in cases where the statistics flag was set AND a stats import statement exists, it would execute that stats statement immediately after the creation of the object. This assumes that there is no case where additional attributes are added to a relation after its initial CREATE statement. Indexes are independent relations in this regard.
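To illustrate the ordering being proposed, a hypothetical restore-time fragment (the flag name and function arguments are placeholders, not settled):

CREATE TABLE public.foo (id integer, name text);
SELECT pg_import_rel_stats('public.foo'::regclass, ...);           -- only with --with-statistics

CREATE INDEX foo_name_idx ON public.foo (name);
SELECT pg_import_rel_stats('public.foo_name_idx'::regclass, ...);  -- index stats, separate call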
 
Attachment

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.

> v6 posted below.
>
> Changes:
>
> - Additional documentation about the overall process.
> - Rewording of SGML docs.
> - removed a fair number of columns from the transformation queries.
> - enabled require_match_oids in extended statistics, but I'm having my
> doubts about the value of that.
> - moved stats extraction functions to an fe_utils file stats_export.c that
> will be used by both pg_export_stats and pg_dump.
> - pg_export_stats now generates SQL statements rather than a tsv, and has
> boolean flags to set the validate and require_match_oids parameters in the
> calls to pg_import_(rel|ext)_stats.
> - pg_import_stats is gone, as importing can now be done with psql.

Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Part of this process would include mapping from OIDs/attrnum's to names
on the source side and then from those names to the appropriate
OIDs/attrnum's on the destination side.
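For example, resolving exported names back to the destination's identifiers could be as simple as the following sketch (assuming the usual reg* casts and catalog lookups):

SELECT 'pg_catalog.int4'::regtype::oid;   -- type name -> local type oid

SELECT attnum                             -- column name -> local attnum
FROM pg_catalog.pg_attribute
WHERE attrelid = 'public.foo'::regclass
  AND attname = 'id'
  AND NOT attisdropped;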

As this code would be used by both pg_dump and the postgres_fdw, it
seems logical that it would go into the common library.  Further, it
would make sense to have this code be able to handle multiple major
versions for the foreign side, such as how postgres_fdw and pg_dump
already do.

In terms of working to ensure that newer versions support loading from
older dumps (that is, that v18 would be able to load a dump file created
by a v17 pg_dump against a v17 server in the face of changes having been
made to the statistics system in v18), we could have the functions take
a version parameter (to handle cases where the data structure is the
same but the contents have to be handled differently), use overloaded
functions, or have version-specific names for the functions.  I'm also
generally supportive of the idea that we, perhaps initially, only
support dumping/loading stats with pg_dump when in binary-upgrade mode,
which removes our need to be concerned with this (perhaps that would be
a good v1 of this feature?) as the version of pg_dump needs to match
that of pg_upgrade and the destination server for various other reasons.
Including a switch to exclude stats on restore might also be an
acceptable answer, or even simply excluding them by default when going
between major versions except in binary-upgrade mode.

Along those same lines when it comes to a 'v1', I'd say that we may wish
to consider excluding extended statistics, which I am fairly confident
Corey's heard a number of times previously already but thought I would
add my own support for that.  To the extent that we do want to make
extended stats work down the road, we should probably have some
pre-patches to flush out the missing _in/_recv functions for those types
which don't have them today- and that would include modifying the _out
of those types to use names instead of OIDs/attrnums.  In thinking about
this, I was reviewing specifically pg_dependencies.  To the extent that
there are people who depend on the current output, I would think that
they'd actually appreciate this change.

I don't generally feel like we need to be checking that the OIDs between
the old server and the new server match- I appreciate that that should
be the case in a binary-upgrade situation but it still feels unnecessary
and complicated and clutters up the output and the function calls.

Overall, I definitely think this is a good project to work on as it's an
often, rightfully, complained about issue when it comes to pg_upgrade
and the amount of downtime required for it before the upgraded system
can be reasonably used again.

Thanks,

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:


Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Thanks for the excellent summary of our conversation, though I do add that we discussed a problem with per-attribute functions: each function would be acquiring locks on both the relation (so it doesn't go away) and pg_statistic, and that lock thrashing would add up. Whether that overhead is judged significant or not is up for discussion. If it is significant, it makes sense to package up all the attributes into one call, passing in an array of some new pg_statistic-esque special type....the very issue that sent me down the JSON path.

I certainly see the flexibility in having a per-attribute functions, but am concerned about non-binary-upgrade situations where the attnums won't line up, and if we're passing them by name then the function has to dig around looking for the right matching attnum, and that's overhead too. In the whole-table approach, we just iterate over the attributes that exist, and find the matching parameter row.







 

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

On Thu, Feb 29, 2024 at 17:48 Corey Huinker <corey.huinker@gmail.com> wrote:
Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Thanks for the excellent summary of our conversation, though I do add that we discussed a problem with per-attribute functions: each function would be acquiring locks on both the relation (so it doesn't go away) and pg_statistic, and that lock thrashing would add up. Whether that overhead is judged significant or not is up for discussion. If it is significant, it makes sense to package up all the attributes into one call, passing in an array of some new pg_statistic-esque special type....the very issue that sent me down the JSON path.

I certainly see the flexibility in having a per-attribute functions, but am concerned about non-binary-upgrade situations where the attnums won't line up, and if we're passing them by name then the function has to dig around looking for the right matching attnum, and that's overhead too. In the whole-table approach, we just iterate over the attributes that exist, and find the matching parameter row.

That’s certainly a fair point and my initial reaction (which could certainly be wrong) is that it’s unlikely to be an issue- but also, if you feel you could make it work with an array and passing all the attribute info in with one call, which I suspect would be possible but just a bit more complex to build, then sure, go for it. If it ends up being overly unwieldy then perhaps the  per-attribute call would be better and we could perhaps acquire the lock before the function calls..?  Doing a check to see if we have already locked it would be cheaper than trying to acquire a new lock, I’m fairly sure.

Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing. 

Thanks!

Stephen

Re: Statistics Import and Export

From
Corey Huinker
Date:

That’s certainly a fair point and my initial reaction (which could certainly be wrong) is that it’s unlikely to be an issue- but also, if you feel you could make it work with an array and passing all the attribute info in with one call, which I suspect would be possible but just a bit more complex to build, then sure, go for it. If it ends up being overly unwieldy then perhaps the  per-attribute call would be better and we could perhaps acquire the lock before the function calls..?  Doing a check to see if we have already locked it would be cheaper than trying to acquire a new lock, I’m fairly sure.

Well, the do_analyze() code was already OK with acquiring the lock once for non-inherited stats and again for inherited stats, so the locks were already not the end of the world. However, that's at most a 2x of the locking required, and this would be natts * x, quite a bit more. Having the procedures check for a pre-existing lock seems like a good compromise.
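At the SQL level the "do we already hold the lock" check could look roughly like the query below; inside the backend it would presumably be something closer to CheckRelationLockedByMe(). Either way, this is a sketch of the idea, not the patch:

SELECT mode, granted
FROM pg_locks
WHERE locktype = 'relation'
  AND relation = 'public.foo'::regclass
  AND pid = pg_backend_pid();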
 
Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the paralleliziable part of the process, making me a bit less concerned overall about the individual timing. 

The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:

BEGIN;
LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
LOCK TABLE pg_catalog.pg_statistic IN ROW EXCLUSIVE MODE;
SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
SELECT pg_import_pg_statistic('schema.relation', 'description', ...);
...
COMMIT;

Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the  per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..?  Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
> 
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.

I think this is a reasonable starting point.  If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

On Fri, Mar 1, 2024 at 12:14 Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the  per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..?  Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
>
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.

I think this is a reasonable starting point.  If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.

Yeah, this was my general feeling as well.  If it does become an issue, it certainly seems like we would have ways to improve it in the future. Even with this locking it is surely going to be better than having to re-analyze the entire database which is where we are at now.

Thanks,

Stephen

Re: Statistics Import and Export

From
Bertrand Drouvot
Date:
Hi,

On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.
> >
> >
> >
> v6 posted below.

Thanks!

I had in mind to look at it but it looks like a rebase is needed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Matthias van de Meent
Date:
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it'll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Or even just one VALUES for the whole statistics loading?
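Spelled out a bit more (the function name and argument list here are stand-ins, as above):

SELECT pg_import_statistic('schema.relation', attname, nullfrac, avgwidth, ndistinct)
FROM (VALUES
        ('id',          0.0,    4,  -1.0),
        ('name',        0.01,  32,  -0.5),
        ('description', 0.15, 120,  -0.9)
     ) AS relation_stats (attname, nullfrac, avgwidth, ndistinct);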

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the paralleliziable part of the process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Using a VALUES construct for this does seem like it might make it cleaner, so +1 for investigating that idea.

Or even just one VALUES for the whole statistics loading?

I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to spend a lot of effort on.  I’m pretty bullish on the idea that this simply won’t happen except in very exceptional cases under a pg_upgrade (where the pg_dump that’s used must match the target server version) and where it happens under a pg_dump it’ll be because it’s an older pg_dump’s output and the answer will likely need to be “you’re using a pg_dump file generated using an older version of pg_dump and need to exclude stats entirely from the load and instead run analyze on the data after loading it.”

What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of columns, or just generally, if we know that the pg_dump version matches the target server version?

Thanks!

Stephen

Re: Statistics Import and Export

From
Matthias van de Meent
Date:
On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote:
> On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>> Or even just one VALUES for the whole statistics loading?
>
>
> I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to
> lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.

That also makes sense.

>> I suspect the main issue with combining this into one statement
>> (transaction) is that failure to load one column's statistics implies
>> you'll have to redo all the other statistics (or fail to load the
>> statistics at all), which may be problematic at the scale of thousands
>> of relations with tens of columns each.
>
>
> I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to
> spend a lot of effort on.

Agreed on the "don't have to spend a lot of time on it", but I'm not
so sure on the "unlikely" part while the autovacuum daemon is
involved, specifically for non-upgrade pg_restore. I imagine (haven't
checked) that autoanalyze is disabled during pg_upgrade, but
pg_restore doesn't do that, while it would have to be able to restore
statistics of a table if it is included in the dump (and the version
matches).

> What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of
> columns, or just generally, if we know that the pg_dump version matches the target server version?

Last time I checked, pg_restore's default is to load data on a
row-by-row basis without --single-transaction or --exit-on-error. Of
course, pg_upgrade uses its own set of flags, but if a user is
restoring stats with pg_restore, I suspect they'd rather have some
column's stats loaded than no stats at all; so I would assume this
requires one separate pg_import_pg_statistic()-transaction for every
column.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Matthias van de Meent (boekewurm+postgres@gmail.com) wrote:
> On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote:
> > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> >> Or even just one VALUES for the whole statistics loading?
> > I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to
> > lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.
>
> That also makes sense.

Great, thanks.

> >> I suspect the main issue with combining this into one statement
> >> (transaction) is that failure to load one column's statistics implies
> >> you'll have to redo all the other statistics (or fail to load the
> >> statistics at all), which may be problematic at the scale of thousands
> >> of relations with tens of columns each.
> >
> >
> > I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have
> > to spend a lot of effort on.
>
> Agreed on the "don't have to spend a lot of time on it", but I'm not
> so sure on the "unlikely" part while the autovacuum daemon is
> involved, specifically for non-upgrade pg_restore. I imagine (haven't
> checked) that autoanalyze is disabled during pg_upgrade, but
> pg_restore doesn't do that, while it would have to be able to restore
> statistics of a table if it is included in the dump (and the version
> matches).

Even if autovacuum was running and it kicked off an auto-analyze of the
relation at just the time that we were trying to load the stats, there
would be appropriate locking happening to keep them from causing an
outright ERROR and transaction failure, or if not, that's a lack of
locking and should be fixed.  With the per-attribute-function-call
approach, that could lead to a situation where some stats are from the
auto-analyze and some are from the stats being loaded but I'm not sure
if that's a big concern or not.

For users of this, I would think we'd generally encourage them to
disable autovacuum on the tables they're loading as otherwise they'll
end up with the stats going back to whatever an auto-analyze ends up
finding.  That may be fine in some cases, but not in others.
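For instance, just using the existing per-table storage parameter around the load, something like:

    -- disable autovacuum (and therefore autoanalyze) on this table before loading its stats
    ALTER TABLE schema.relation SET (autovacuum_enabled = false);
    -- ... load the stats ...
    ALTER TABLE schema.relation RESET (autovacuum_enabled);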

A couple questions to think about though: Should pg_dump explicitly ask
autovacuum to ignore these tables while we're loading them?
Should these functions only perform a load when there aren't any
existing stats?  Should the latter be an argument to the functions to
allow the caller to decide?

> > What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset
> > of columns, or just generally, if we know that the pg_dump version matches the target server version?
>
> Last time I checked, pg_restore's default is to load data on a
> row-by-row basis without --single-transaction or --exit-on-error. Of
> course, pg_upgrade uses its own set of flags, but if a user is
> restoring stats with pg_restore, I suspect they'd rather have some
> column's stats loaded than no stats at all; so I would assume this
> requires one separate pg_import_pg_statistic()-transaction for every
> column.

Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

Thanks!

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:

> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Or even just one VALUES for the whole statistics loading?

I'm sorry, I don't quite understand what you're suggesting here. I'm about to post the new functions, so perhaps you can rephrase this in the context of those functions.
 
I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Yes, that is a concern, and I can see value to having it both ways: one failure fails the whole table's worth of set_something() calls, but I can also see emitting a warning instead of an error and returning false. I'm eager to get feedback on which the community would prefer, or perhaps even making it a parameter.
 

Re: Statistics Import and Export

From
Corey Huinker
Date:


Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.

It is my (incomplete and entirely second-hand) understanding that pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value and then resets it on completion, presumably because analyzing a table before its data is loaded and indexes are created would just be a waste of time.

 

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

In the patch below, I ended up doing per-attribute function calls, mostly because it allowed me to avoid creating a custom data type for the portable version of pg_statistic. This comes at the cost of a very high number of parameters, but that's the breaks.
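To give a feel for the shape of a single call (every value below is made up, and the argument list may well change):

    SELECT pg_set_attribute_stats(
        'foo.bar'::regclass,        -- relation
        'id'::name,                 -- attname
        false,                      -- stainherit
        0.0::float4,                -- stanullfrac
        4,                          -- stawidth
        -1.0::float4,               -- stadistinct
        1, 3, 0, 0, 0,              -- stakind1..5 (e.g. MCV and correlation slots)
        '{0.5,0.25}'::float4[],     -- stanumbers1 (MCV frequencies)
        '{0.9}'::float4[],          -- stanumbers2 (correlation)
        NULL, NULL, NULL,           -- stanumbers3..5
        '{1,2}',                    -- stavalues1, passed as array text
        NULL, NULL, NULL, NULL);    -- stavalues2..5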

I am a bit concerned about the number of locks on pg_statistic and the relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per attribute rather than once per relation. But I also see that this will mostly get used at a time when no other traffic is on the machine, and whatever it costs, it's still faster than the smallest table sample (insert joke about "don't have to be faster than the bear" here).

This raises questions about whether a failure in one attribute update statement should cause the others in that relation to roll back or not, and I can see situations where both would be desirable.

I'm putting this out there ahead of the pg_dump / fe_utils work, mostly because what I do there heavily depends on how this is received.

Also, I'm still seeking confirmation that I can create a pg_dump TOC entry with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan them out into multiple entries.

Anyway, here's v7. Eagerly awaiting feedback.
Attachment

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
>
> It is my (incomplete and entirely second-hand) understanding that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.

No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
>
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
>
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
>
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>    proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>    proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>    proname => 'table_to_xml', procost => '100', provolatile => 's',
>    proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>    proargtypes => 'int2',
>    prosrc => 'gist_stratnum_identity' },

Random whitespace hunks shouldn't be included

> diff --git a/src/backend/statistics/statistics.c b/src/backend/statistics/statistics.c
> new file mode 100644
> index 0000000000..999aebdfa9
> --- /dev/null
> +++ b/src/backend/statistics/statistics.c
> @@ -0,0 +1,360 @@
> +/*-------------------------------------------------------------------------
> + *
> + * statistics.c
> + *
> + * IDENTIFICATION
> + *       src/backend/statistics/statistics.c
> + *
> + *-------------------------------------------------------------------------
> + */

Top-of-file comment should be cleaned up.

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +    const char *param_names[] = {
> +        "relation",
> +        "reltuples",
> +        "relpages",
> +    };
> +
> +    Oid                relid;
> +    Relation        rel;
> +    HeapTuple        ctup;
> +    Form_pg_class    pgcform;
> +
> +    for (int i = 0; i <= 2; i++)
> +        if (PG_ARGISNULL(i))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("%s cannot be NULL", param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?  Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +    pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +    pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +    pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +    /* names of columns that cannot be null */
> +    const char *required_param_names[] = {
> +        "relation",
> +        "attname",
> +        "stainherit",
> +        "stanullfrac",
> +        "stawidth",
> +        "stadistinct",
> +        "stakind1",
> +        "stakind2",
> +        "stakind3",
> +        "stakind4",
> +        "stakind5",
> +    };

Same comment here as above wrt NULL being passed in.

> +    for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Thanks!

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

It's more than I can memorize, so that feels like a lot to me. Clearly it's not insurmountable.

 
> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

Ok, I'm sold on that plan.
 

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +     const char *param_names[] = {
> +             "relation",
> +             "reltuples",
> +             "relpages",
> +     };
> +
> +     Oid                             relid;
> +     Relation                rel;
> +     HeapTuple               ctup;
> +     Form_pg_class   pgcform;
> +
> +     for (int i = 0; i <= 2; i++)
> +             if (PG_ARGISNULL(i))
> +                     ereport(ERROR,
> +                                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                      errmsg("%s cannot be NULL", param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?

Strict could definitely apply here, and I'm inclined to make it so.

 
Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).

Thoughts on it emitting a WARN or NOTICE before returning false?

 
  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +     pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Yes. No idea why I didn't have that in there from the start.
 
Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

That was how it worked back in v1, more for the possibility that there was no matching JSON to set values.

Looking again at analyze.c (currently lines 1751-1780), we just check if there is a row in the way, and if so we replace it. I don't see where we compare existing values to new values.
 

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

Historical artifact - previous versions had a long explanation of the JSON format.

 

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +     /* names of columns that cannot be null */
> +     const char *required_param_names[] = {
> +             "relation",
> +             "attname",
> +             "stainherit",
> +             "stanullfrac",
> +             "stawidth",
> +             "stadistinct",
> +             "stakind1",
> +             "stakind2",
> +             "stakind3",
> +             "stakind4",
> +             "stakind5",
> +     };

Same comment here as above wrt NULL being passed in.

In this case, the last 10 params (stanumbersN and stavaluesN) can be null, and are NULL more often than not.
 

> +     for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Yes, I had in the past. Not sure why I didn't again. 

Re: Statistics Import and Export

From
Bharath Rupireddy
Date:
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks for working on this. It looks useful to have the ability to
restore the stats after upgrade and restore. But, the exported stats
are valid only until the next ANALYZE is run on the table. IIUC,
postgres collects stats during VACUUM, autovacuum and ANALYZE, right?
Perhaps there are other ways to collect stats. I'm wondering what
problems the user would face if they were just asked to run ANALYZE on
the tables (I'm assuming ANALYZE doesn't block concurrent access to
the tables) instead of automatically exporting stats.
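For example (table name is just a placeholder), I mean something like:

    -- the status-quo alternative: rebuild the stats from scratch after the data is loaded
    ANALYZE my_table;
    -- or, for the whole cluster after an upgrade:
    --   vacuumdb --all --analyze-in-stages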

Here are some comments on the v7 patch. I've not dived into the whole
thread, so some comments may be of type repeated or need
clarification. Please bear with me.

1. The following two are unnecessary changes in pg_proc.dat, please remove them.

--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8818,7 +8818,6 @@
 { oid => '3813', descr => 'generate XML text node',
   proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
   proargtypes => 'text', prosrc => 'xmltext' },
-
 { oid => '2923', descr => 'map table contents to XML',
   proname => 'table_to_xml', procost => '100', provolatile => 's',
   proparallel => 'r', prorettype => 'xml',
@@ -12163,8 +12162,24 @@

 # GiST stratnum implementations
 { oid => '8047', descr => 'GiST support',
-  proname => 'gist_stratnum_identity', prorettype => 'int2',
+  proname => 'gist_stratnum_identity',prorettype => 'int2',
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },

2.
+        they are replaced by the next auto-analyze. This function is used by
+        <command>pg_upgrade</command> and <command>pg_restore</command> to
+        convey the statistics from the old system version into the new one.
+       </para>

Is there any demonstration of pg_set_relation_stats and
pg_set_attribute_stats being used either in pg_upgrade or in
pg_restore? Perhaps, having them as 0002, 0003 and so on patches might
show real need for functions like this. It also clarifies how these
functions pull stats from tables on the old cluster to the tables on
the new cluster.

3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing
to pg_class and might affect the plans, as stats can be tampered with. Can
we REVOKE the execute permissions from the public out of the box in
src/backend/catalog/system_functions.sql? This way one can decide who
to give permissions to.

4.
+SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
3.6::float4, 15000);
+ pg_set_relation_stats
+-----------------------
+ t
+(1 row)
+
+SELECT reltuples, relpages FROM pg_class WHERE oid =
'stats_export_import.test'::regclass;
+ reltuples | relpages
+-----------+----------
+       3.6 |    15000

Isn't this test case showing a misuse of these functions? The table
actually has no rows, but we are lying to the postgres optimizer about
its stats. I think altering the stats of a table mustn't be that easy for the
end user. As mentioned in comment #3, permissions need to be
tightened. In addition, we can also mark the functions pg_upgrade only
with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore
(or I don't know if we have a way to know within the server that the
server is running for pg_restore).

5. In continuation to the comment #2, is pg_dump supposed to generate
pg_set_relation_stats and pg_set_attribute_stats statements for each
table? When pg_dump does that, pg_restore can automatically load the
stats.

6.
+/*-------------------------------------------------------------------------
+ *
+ * statistics.c
+ *
+ * IDENTIFICATION
+ *       src/backend/statistics/statistics.c
+ *
+ *-------------------------------------------------------------------------

A description of what the new file statistics.c does is missing.

7. pgindent isn't happy with new file statistics.c, please check.

8.
+/*
+ * Import statistics for a given relation attribute
+ *
+ * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
+ *                        stanullfrac float4, stawidth int, stadistinct float4,

Having function definition in the function comment isn't necessary -
it's hard to keep it consistent with pg_proc.dat in future. If
required, one can either look at pg_proc.dat or docs.

9. Isn't it good to add a test case where the plan of a query on a table
after importing the stats would remain the same as that of the original
table from which the stats were exported? IMO, this is more realistic
than just comparing pg_statistic of the tables, because this is what an
end-user wants eventually.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Corey Huinker
Date:


On Sun, Mar 10, 2024 at 11:57 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks for working on this. It looks useful to have the ability to
restore the stats after upgrade and restore. But, the exported stats
are valid only until the next ANALYZE is run on the table. IIUC,
postgres collects stats during VACUUM, autovacuum and ANALYZE, right?
Perhaps there are other ways to collect stats. I'm wondering what
problems the user would face if they were just asked to run ANALYZE on
the tables (I'm assuming ANALYZE doesn't block concurrent access to
the tables) instead of automatically exporting stats.

Correct. These are just as temporary as any other analyze of the table. Another analyze will happen later, probably through autovacuum, and wipe out these values. This is designed to QUICKLY get stats into a table so that the database can be operational sooner. This is especially important after an upgrade/restore, when all stats were wiped out. Other uses could be adapting this for use in postgres_fdw, so that we don't have to do table sampling on the remote table, and of course statistics injection to test the query planner.
 
2.
+        they are replaced by the next auto-analyze. This function is used by
+        <command>pg_upgrade</command> and <command>pg_restore</command> to
+        convey the statistics from the old system version into the new one.
+       </para>

Is there any demonstration of pg_set_relation_stats and
pg_set_attribute_stats being used either in pg_upgrade or in
pg_restore? Perhaps, having them as 0002, 0003 and so on patches might
show real need for functions like this. It also clarifies how these
functions pull stats from tables on the old cluster to the tables on
the new cluster.

That code was adapted from do_analyze(), and yes, there is a patch for pg_dump, but as I noted earlier it is on hold pending feedback.
 

3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing
to pg_class and might affect the plans, as stats can be tampered with. Can
we REVOKE the execute permissions from the public out of the box in
src/backend/catalog/system_functions.sql? This way one can decide who
to give permissions to.

You'd have to be the table owner to alter the stats. I can envision these functions getting a special role, but they could also be fine as superuser-only.
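If we did restrict them, I'd expect it to look like the usual pattern, something like (role name invented for the example):

    REVOKE EXECUTE ON FUNCTION pg_set_relation_stats, pg_set_attribute_stats FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION pg_set_relation_stats, pg_set_attribute_stats TO stats_admin;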
 

4.
+SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
3.6::float4, 15000);
+ pg_set_relation_stats
+-----------------------
+ t
+(1 row)
+
+SELECT reltuples, relpages FROM pg_class WHERE oid =
'stats_export_import.test'::regclass;
+ reltuples | relpages
+-----------+----------
+       3.6 |    15000

Isn't this test case showing a misuse of these functions? The table
actually has no rows, but we are lying to the postgres optimizer about
its stats.

Consider this case. You want to know at what point the query planner will start using a given index. You can generate dummy data for a thousand, a million, a billion rows, and wait for that to complete, or you can just tell the table "I say you have a billion rows, twenty million pages, etc" and see when it changes.
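For example (numbers invented, column name assumed, using the v7 signature from the regression test above):

    -- pretend the (empty) table has a billion rows across twenty million pages
    SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                                 1000000000::float4, 20000000);
    EXPLAIN SELECT * FROM stats_export_import.test WHERE id = 42;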

But again, in most cases, you're setting the values to the same values the table had on the old database just before the restore/upgrade. 
 
I think altering the stats of a table mustn't be that easy for the
end user.

Only easy for the end users that happen to be the table owner or a superuser.
 
As mentioned in comment #3, permissions need to be
tightened. In addition, we can also mark the functions pg_upgrade only
with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore
(or I don't know if we have a way to know within the server that the
server is running for pg_restore).

I think they will have uses both in postgres_fdw and for tuning.
 

5. In continuation to the comment #2, is pg_dump supposed to generate
pg_set_relation_stats and pg_set_attribute_stats statements for each
table? When pg_dump does that, pg_restore can automatically load the
stats.

Current plan is to have one TOC entry in the post-data section with a dependency on the table/index/matview. That lets us leverage existing filters. The TOC entry will have a series of statements in it: one pg_set_relation_stats() call, plus one pg_set_attribute_stats() call per attribute.
 
9. Isn't it good to add a test case where the plan of a query on a table
after importing the stats would remain the same as that of the original
table from which the stats were exported? IMO, this is more realistic
than just comparing pg_statistic of the tables, because this is what an
end-user wants eventually.

I'm sure we can add something like that, but query plan formats change a lot and are greatly dependent on database configuration, so maintaining such a test would be a lot of work.

Re: Statistics Import and Export

From
Bertrand Drouvot
Date:
Hi,

On Fri, Mar 08, 2024 at 01:35:40AM -0500, Corey Huinker wrote:
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks!

A few random comments:

1 ===

+        The purpose of this function is to apply statistics values in an
+        upgrade situation that are "good enough" for system operation until

Worth adding a few words about the "influencing the planner" use case?

2 ===

+#include "catalog/pg_type.h"
+#include "fmgr.h"

Are those 2 needed?

3 ===

+       if (!HeapTupleIsValid(ctup))
+               elog(ERROR, "pg_class entry for relid %u vanished during statistics import",

s/during statistics import/when setting statistics/?

4 ===

+Datum
+pg_set_relation_stats(PG_FUNCTION_ARGS)
+{
.
.
+       table_close(rel, ShareUpdateExclusiveLock);
+
+       PG_RETURN_BOOL(true);

Why return a bool? (I mean, we'd either throw an error or return true.)

5 ===

+ */
+Datum
+pg_set_attribute_stats(PG_FUNCTION_ARGS)
+{

This function is not that simple; worth explaining its logic in a comment above?

6 ===

+       if (!HeapTupleIsValid(tuple))
+       {
+               relation_close(rel, NoLock);
+               PG_RETURN_BOOL(false);
+       }
+
+       attr = (Form_pg_attribute) GETSTRUCT(tuple);
+       if (attr->attisdropped)
+       {
+               ReleaseSysCache(tuple);
+               relation_close(rel, NoLock);
+               PG_RETURN_BOOL(false);
+       }

Why is it returning "false" and not throwing an error? (if ok, then I think
we can get rid of returning a bool).

7 ===

+        * If this relation is an index and that index has expressions in
+        * it, and the attnum specified

s/is an index and that index has/is an index that has/?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > > +/*
> > > + * Set statistics for a given pg_class entry.
> > > + *
> > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > > + *
> > > + * This does an in-place (i.e. non-transactional) update of pg_class,
> > just as
> > > + * is done in ANALYZE.
> > > + *
> > > + */
> > > +Datum
> > > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > > +{
> > > +     const char *param_names[] = {
> > > +             "relation",
> > > +             "reltuples",
> > > +             "relpages",
> > > +     };
> > > +
> > > +     Oid                             relid;
> > > +     Relation                rel;
> > > +     HeapTuple               ctup;
> > > +     Form_pg_class   pgcform;
> > > +
> > > +     for (int i = 0; i <= 2; i++)
> > > +             if (PG_ARGISNULL(i))
> > > +                     ereport(ERROR,
> > > +
> >  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                                      errmsg("%s cannot be NULL",
> > param_names[i])));
> >
> > Why not just mark this function as strict..?  Or perhaps we should allow
> > NULLs to be passed in and just not update the current value in that
> > case?
>
> Strict could definitely apply here, and I'm inclined to make it so.

Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
>
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.

> >   Not sure if that makes sense here or not
> > offhand but figured I'd mention it as something to consider.
> >
> > > +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > > +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > > +     pgcform->relpages = PG_GETARG_INT32(2);
> >
> > Shouldn't we include relallvisible?
>
> Yes. No idea why I didn't have that in there from the start.

Ok.

> > Also, perhaps we should use the approach that we have in ANALYZE, and
> > only actually do something if the values are different rather than just
> > always doing an update.
>
> That was how it worked back in v1, more for the possibility that there was
> no matching JSON to set values.
>
> Looking again at analyze.c (currently lines 1751-1780), we just check if
> there is a row in the way, and if so we replace it. I don't see where we
> compare existing values to new values.

Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

vacuum.c:1531
    /* If anything changed, write out the tuple. */
    if (dirty)
        heap_inplace_update(rd, ctup);

Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

> > > +Datum
> > > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
> >
> > > +     /* names of columns that cannot be null */
> > > +     const char *required_param_names[] = {
> > > +             "relation",
> > > +             "attname",
> > > +             "stainherit",
> > > +             "stanullfrac",
> > > +             "stawidth",
> > > +             "stadistinct",
> > > +             "stakind1",
> > > +             "stakind2",
> > > +             "stakind3",
> > > +             "stakind4",
> > > +             "stakind5",
> > > +     };
> >
> > Same comment here as above wrt NULL being passed in.
>
> In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
> and are NULL more often than not.

Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handle NULLs passed in appropriately, and return NULL if the relation
itself, the attname, or any other required (not NULL'able) argument
passed in is NULL.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of just ignoring things passed in which
aren't valid.)

> > > +     for (int k = 0; k < 5; k++)
> >
> > Shouldn't we use STATISTIC_NUM_SLOTS here?
>
> Yes, I had in the past. Not sure why I didn't again.

No worries.

Thanks!

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:



Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

v8 has it as strict.
 

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
>
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.

In which case we're failing nearly silently, yes, there is a null returned, but we have no idea why there is a null returned. If I were using this function manually I'd want to know what I did wrong, what parameter I skipped, etc.
 
Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

I can do that, especially since it's only 3 tuples of known types, but my reservations are summed up in the next comment.

 
Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

That would also involve comparing ANYARRAY values, yuk. Also, a matching record will never happen in the function's primary use case (upgrades), and it's not a big deal in the other future cases (using it in ANALYZE on foreign tables instead of remote table samples, or users experimenting with tuning queries under hypothetical workloads).


 
Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handle NULLs passed in appropriately, and return NULL if the relation
itself, the attname, or any other required (not NULL'able) argument
passed in is NULL.

That's how I have relstats done in v8, and could make it do that for attr stats.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of just ignoring things passed in which
aren't valid.)

I like the symmetry of a consistent interface, but we've already got an asymmetry in that the pg_class update is done non-transactionally (like ANALYZE does).

One persistent problem is that there is no _safe equivalent to ARRAY_IN, so that can always fail on us, though it should only do so if the string passed in wasn't a valid array input format, or the values in the array can't coerce to the attribute's basetype.
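The failure mode is just the type's ordinary input error, e.g. something like:

    -- the sort of error array_in raises when a dumped stavalues string doesn't
    -- match the destination column's type (text labels fed to an integer column)
    SELECT '{low,medium,high}'::integer[];
    ERROR:  invalid input syntax for type integer: "low"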

I should also point out that we've lost the ability to check if the export values were of a type, and if the destination column is also of that type. That's a non-issue in binary upgrades, but of course if a field changed from integers to text the histograms would now be highly misleading. Thoughts on adding a typname parameter that the function uses as a cheap validity check?

v8 attached, incorporating these suggestions plus those of Bharath and Bertrand. Still no pg_dump.

As for pg_dump, I'm currently leaning toward the TOC entry having either a series of commands:

    SELECT pg_set_relation_stats('foo.bar'::regclass, ...); pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

Or one compound command

    SELECT pg_set_relation_stats(t.oid, ...)
         pg_set_attribute_stats(t.oid, 'id'::name, ...),
         pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
         ...
    FROM (VALUES('foo.bar'::regclass)) AS t(oid);

The second one has the feature that if any one attribute fails, then the whole update fails, except, of course, for the in-place update of pg_class. This avoids having an explicit transaction block, but we could get that back by having restore wrap the list of commands in a transaction block (and adding the explicit lock commands) when it is safe to do so.
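i.e. when a transaction block is safe, restore could emit roughly:

    BEGIN;
    LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
    SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
    SELECT pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...);
    SELECT pg_set_attribute_stats('foo.bar'::regclass, 'last_name'::name, ...);
    COMMIT;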

Attachment

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > Having thought about it a bit more, I generally like the idea of being
> > able to just update one stat instead of having to update all of them at
> > once (and therefore having to go look up what the other values currently
> > are...).  That said, per below, perhaps making it strict is the better
> > plan.
>
> v8 has it as strict.

Ok.

> > > > Also, in some cases we allow the function to be called with a
> > > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > > > OID ends up being NULL).
> > >
> > > Thoughts on it emitting a WARN or NOTICE before returning false?
> >
> > Eh, I don't think so?
> >
> > Where this is coming from is that we can often end up with functions
> > like these being called inside of larger queries, and having them spit
> > out WARN or NOTICE will just make them noisy.
> >
> > That leads to my general feeling of just returning NULL if called with a
> > NULL OID, as we would get with setting the function strict.
>
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

> > Well, that code is for pg_statistic while I was looking at pg_class (in
> > vacuum.c:1428-1443, where we track if we're actually changing anything
> > and only make the pg_class change if there's actually something
> > different):
>
> I can do that, especially since it's only 3 tuples of known types, but my
> reservations are summed up in the next comment.

> > Not sure why we don't treat both the same way though ... although it's
> > probably the case that it's much less likely to have an entire
> > pg_statistic row be identical than the few values in pg_class.
>
> That would also involve comparing ANYARRAY values, yuk. Also, a matching
> record will never happen in the function's primary use case (upgrades),
> and it's not a big deal in the other future cases (using it in ANALYZE on
> foreign tables instead of remote table samples, or users experimenting
> with tuning queries under hypothetical workloads).

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

> > Hmm, that's a valid point, so a NULL passed in would need to set that
> > value actually to NULL, presumably.  Perhaps then we should have
> > pg_set_relation_stats() be strict and have pg_set_attribute_stats()
> > handle NULLs passed in appropriately, and return NULL if the relation
> > itself, the attname, or any other required (not NULL'able) argument
> > passed in is NULL.
> >
>
> That's how I have relstats done in v8, and could make it do that for attr
> stats.

That'd be my suggestion, at least, but as I mention above, it's not a
position I hold very strongly.

> > (What I'm trying to drive at here is a consistent interface for these
> > functions, but one which does a no-op instead of returning an ERROR on
> > values being passed in which aren't allowable; it can be quite
> > frustrating trying to get a query to work where one of the functions
> > decides to return ERROR instead of just ignoring things passed in which
> > aren't valid.)
>
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be highly misleading.
> Thoughts on adding a typname parameter that the function uses as a cheap
> validity check?

Seems reasonable to me.

> v8 attached, incorporating these suggestions plus those of Bharath and
> Bertrand. Still no pg_dump.
>
> As for pg_dump, I'm currently leaning toward the TOC entry having either a
> series of commands:
>
>     SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

I'm guessing the above was intended to be SELECT ..; SELECT ..;

> Or one compound command
>
>     SELECT pg_set_relation_stats(t.oid, ...)
>          pg_set_attribute_stats(t.oid, 'id'::name, ...),
>          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
>          ...
>     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
>
> The second one has the feature that if any one attribute fails, then the
> whole update fails, except, of course, for the in-place update of pg_class.
> This avoids having an explicit transaction block, but we could get that
> back by having restore wrap the list of commands in a transaction block
> (and adding the explicit lock commands) when it is safe to do so.

Hm, I like this approach as it should essentially give us the
transaction block we had been talking about wanting but without needing
to explicitly do a begin/commit, which would add in some annoying
complications.  This would hopefully also reduce the locking concern
mentioned previously, since we'd get the lock needed in the first
function call and then the others would be able to just see that we've
already got the lock pretty quickly.

> Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.

[...]

> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)

[...]

> +    ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
> +    if (!HeapTupleIsValid(ctup))
> +        elog(ERROR, "pg_class entry for relid %u vanished during statistics import",
> +             relid);

Maybe drop the 'during statistics import' part of this message?  Also
wonder if maybe we should make it a regular ereport() instead, since it
might be possible for a user to end up seeing this?

> +    pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +
> +    reltuples = PG_GETARG_FLOAT4(P_RELTUPLES);
> +    relpages = PG_GETARG_INT32(P_RELPAGES);
> +    relallvisible = PG_GETARG_INT32(P_RELALLVISIBLE);
> +
> +    /* Do not update pg_class unless there is no meaningful change */

This comment doesn't seem quite right.  Maybe it would be better if it
was in the positive, eg: Only update pg_class if there is a meaningful
change.

Rest of it looks pretty good to me, at least.

Thanks!

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

I don't have strong feelings either. I think we should get more input on this. Regardless, it's easy to change...for now.

 

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

I agree that mirroring ANALYZE wherever possible is the ideal.

 
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

Fair.


 

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

Not sure I follow you here. The ARRAY_IN function calls happen once for every non-null stavaluesN parameter, and they're done inside the function because the result type could be the base type for a domain/array type, or could be the type itself. I suppose we could move that determination to the caller, but then we'd need to call get_base_element_type() inside a client, and that seems wrong if it's even possible.
 
> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be highly misleading.
> Thoughts on adding a typname parameter that the function uses as a cheap
> validity check?

Seems reasonable to me.

I'd like to hear what Tomas thinks about this, as he was the initial advocate for it.
 
> As for pg_dump, I'm currently leaning toward the TOC entry having either a
> series of commands:
>
>     SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

I'm guessing the above was intended to be SELECT ..; SELECT ..;

Yes.
 

> Or one compound command
>
>     SELECT pg_set_relation_stats(t.oid, ...)
>          pg_set_attribute_stats(t.oid, 'id'::name, ...),
>          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
>          ...
>     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
>
> The second one has the feature that if any one attribute fails, then the
> whole update fails, except, of course, for the in-place update of pg_class.
> This avoids having an explicit transaction block, but we could get that
> back by having restore wrap the list of commands in a transaction block
> (and adding the explicit lock commands) when it is safe to do so.

Hm, I like this approach as it should essentially give us the
transaction block we had been talking about wanting but without needing
to explicitly do a begin/commit, which would add in some annoying
complications.  This would hopefully also reduce the locking concern
mentioned previously, since we'd get the lock needed in the first
function call and then the others would be able to just see that we've
already got the lock pretty quickly.

True, we'd get the lock needed in the first function call, but wouldn't we also release that very lock before the subsequent call? Obviously we'd be shrinking the window in which another process could get in line and take a superior lock, and the universe of other processes that would even want a lock that blocks us is nil in the case of an upgrade, identical to existing behavior in the case of an FDW ANALYZE, and perfectly fine in the case of someone tinkering with stats.
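
For reference, the explicit-transaction variant being discussed would look roughly like this (a sketch only; foo.bar is a placeholder, and the lock level mirrors the ShareUpdateExclusiveLock that ANALYZE takes):

    BEGIN;
    -- taking the lock explicitly up front holds it until COMMIT, closing
    -- the window between the individual statistics calls
    LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
    -- ... pg_set_relation_stats() / pg_set_attribute_stats() calls here ...
    COMMIT;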
 

> Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.

[...]

> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)

[...]

> +     ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
> +     if (!HeapTupleIsValid(ctup))
> +             elog(ERROR, "pg_class entry for relid %u vanished during statistics import",
> +                      relid);

Maybe drop the 'during statistics import' part of this message?  Also
wonder if maybe we should make it a regular ereport() instead, since it
might be possible for a user to end up seeing this?

Agreed and agreed. It was copypasta from ANALYZE.

 

This comment doesn't seem quite right.  Maybe it would be better if it
was in the positive, eg: Only update pg_class if there is a meaningful
change.

+1

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > > One persistent problem is that there is no _safe equivalent to ARRAY_IN,
> > so
> > > that can always fail on us, though it should only do so if the string
> > > passed in wasn't a valid array input format, or the values in the array
> > > can't coerce to the attribute's basetype.
> >
> > That would happen before we even get to being called and there's not
> > much to do about it anyway.
>
> Not sure I follow you here. the ARRAY_IN function calls happen once for
> every non-null stavaluesN parameter, and it's done inside the function
> because the result type could be the base type for a domain/array type, or
> could be the type itself. I suppose we could move that determination to the
> caller, but then we'd need to call get_base_element_type() inside a client,
> and that seems wrong if it's even possible.

Ah, yeah, ok, I see what you're saying here and sure, there's a risk
those might ERROR too, but that's outright invalid data then as opposed
to a NULL getting passed in.

> > > Or one compound command
> > >
> > >     SELECT pg_set_relation_stats(t.oid, ...)
> > >          pg_set_attribute_stats(t.oid, 'id'::name, ...),
> > >          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> > >          ...
> > >     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
> > >
> > > The second one has the feature that if any one attribute fails, then the
> > > whole update fails, except, of course, for the in-place update of
> > pg_class.
> > > This avoids having an explicit transaction block, but we could get that
> > > back by having restore wrap the list of commands in a transaction block
> > > (and adding the explicit lock commands) when it is safe to do so.
> >
> > Hm, I like this approach as it should essentially give us the
> > transaction block we had been talking about wanting but without needing
> > to explicitly do a begin/commit, which would add in some annoying
> > complications.  This would hopefully also reduce the locking concern
> > mentioned previously, since we'd get the lock needed in the first
> > function call and then the others would be able to just see that we've
> > already got the lock pretty quickly.
>
> True, we'd get the lock needed in the first function call, but wouldn't we
> also release that very lock before the subsequent call? Obviously we'd be
> shrinking the window in which another process could get in line and take a
> superior lock, and the universe of other processes that would even want a
> lock that blocks us is nil in the case of an upgrade, identical to existing
> behavior in the case of an FDW ANALYZE, and perfectly fine in the case of
> someone tinkering with stats.

No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it).  See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.

Thanks!

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it).  See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.


If that's the case, then changing the two table_close() statements to NoLock should resolve any remaining concern.
 

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it).  See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

If it can be shown to be an issue then we could certainly revisit this.

Thanks,

Stephen

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

ANALYZE takes out one lock on StatisticRelationId per relation, not one per attribute like we do now. If we didn't release the lock after every attribute, and we only called the function outside of a larger transaction (as we plan to do with pg_restore), then that is the closest we're going to get to being consistent with ANALYZE.

Re: Statistics Import and Export

From
Corey Huinker
Date:
ANALYZE takes out one lock on StatisticRelationId per relation, not one per attribute like we do now. If we didn't release the lock after every attribute, and we only called the function outside of a larger transaction (as we plan to do with pg_restore), then that is the closest we're going to get to being consistent with ANALYZE.

v9 attached. This adds pg_dump support. It works in tests against existing databases such as dvdrental, though I was surprised at how few indexes have attribute stats there.

Statistics are preserved by default, but this can be disabled with the option --no-statistics. This follows the prevailing option pattern in pg_dump, etc.

There are currently several failing TAP tests around pg_dump/pg_restore/pg_upgrade. I'm looking at those, but in the mean time I'm seeking feedback on the progress so far.
 
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-03-15 at 03:55 -0400, Corey Huinker wrote:
>
> Statistics are preserved by default, but this can be disabled with
> the option --no-statistics. This follows the prevailing option
> pattern in pg_dump, etc.

I'm not sure if saving statistics should be the default in 17. I'm
inclined to make it opt-in.

> There are currently several failing TAP tests around
> pg_dump/pg_restore/pg_upgrade.

It is a permissions problem. When the user running pg_dump is not a
superuser, they don't have permission to access pg_statistic. That
causes an error in exportRelationStatsStmt(), which returns NULL, and
then the caller segfaults.

> I'm looking at those, but in the mean time I'm seeking feedback on
> the progress so far.

Still looking, but one quick comment is that the third argument of
dumpRelationStats() should be const, which eliminates a warning.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-03-15 at 15:30 -0700, Jeff Davis wrote:
> Still looking, but one quick comment is that the third argument of
> dumpRelationStats() should be const, which eliminates a warning.

A few other comments:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

Added.
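
For anyone following along, the practical effect is that owners can set stats on their own tables, and the new privilege can extend that to other roles, e.g. (a sketch; the role and table names are made up):

    -- MAINTAIN is a per-table privilege; granting it lets a non-owner pass
    -- the proposed ACL check and set statistics on this table
    GRANT MAINTAIN ON TABLE foo.bar TO stats_loader;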
 

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

Done. That was sorta how it was originally, so returning to that wasn't too hard.
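
For illustration, the export side then amounts to an ordinary read of the view (a sketch; the schema and table names are placeholders):

    -- pg_stats applies the privilege checks at read time, so a regular
    -- user only sees the statistics they are entitled to see
    SELECT attname, inherited, null_frac, avg_width, n_distinct,
           most_common_vals, most_common_freqs, histogram_bounds, correlation
    FROM pg_stats
    WHERE schemaname = 'public' AND tablename = 'foo';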
 

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

Done.
 

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Eh, about the same.
 
Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Aside from the oid being invalid, there's not a whole lot that can go wrong in set_relation_stats(). The error checking I did closely mirrors that in analyze.c.

Aside from the changes you suggested, as well as the error reporting change you suggested for pg_dump, I also filtered out attempts to dump stats on views.

A few TAP tests are still failing and I haven't been able to diagnose why, though the failures in parallel dump seem to be that it tries to import stats on indexes that haven't been created yet, which is odd because I set the dependency.

All those changes are available in the patches attached.
 
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sun, 2024-03-17 at 23:33 -0400, Corey Huinker wrote:
>
> A few TAP tests are still failing and I haven't been able to diagnose
> why, though the failures in parallel dump seem to be that it tries to
> import stats on indexes that haven't been created yet, which is odd
> because I sent the dependency.

From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:



From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

Thanks. Unfamiliar turf for me.
 

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

That's tricky. At the base level, those functions would just be an encapsulation of "SELECT * FROM pg_stats WHERE schemaname = $1 AND tablename = $2", which isn't all that much of a savings. Perhaps we can make the documentation more explicit about the source and nature of the parameters going into the pg_set_ functions.

Per conversation, it would be trivial to add helper functions that replace the parameters after the initial oid with a pg_class or pg_stats rowtype, and that would dissect the values needed and call the more complex function:

pg_set_relation_stats( oid, pg_class)
pg_set_attribute_stats( oid, pg_stats)
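
In the meantime, the relation-level half of that can be approximated by pulling the values straight out of pg_class (a sketch; the parameter names follow the reltuples/relpages/relallvisible arguments seen upthread and may not match the final signature):

    -- hypothetical: feed one relation's pg_class values into the setter
    SELECT pg_catalog.pg_set_relation_stats(
               relation      => c.oid,
               reltuples     => c.reltuples,
               relpages      => c.relpages,
               relallvisible => c.relallvisible)
    FROM pg_class AS c
    WHERE c.oid = 'foo.bar'::regclass;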
 

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

I agree. The tests currently show that a faithful copy can be made from one table to another, but to do so we have to conceal the actual stats values, because those are 1. not deterministic/known and 2. subject to change from version to version.

I can add some sets to arbitrary values like was done for pg_set_relation_stats().

Re: Statistics Import and Export

From
Corey Huinker
Date:
v11 attached.

- TAP tests passing (the big glitch was that indexes that are used in constraints should have their stats dependent on the constraint, not the index, thanks Jeff)
- The new range-specific statistics types are now supported. I'm not happy with the typid machinations I do to get them to work, but it is working so far. These are stored out-of-stakind-order (7 before 6), which is odd because all other types seem to store stakinds in ascending order. It shouldn't matter; it was just odd.
- regression tests now make simpler calls with arbitrary stats to demonstrate the function usage more cleanly
- pg_set_*_stats function now have all of their parameters in the same order as the table/view they pull from

Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you.

Comments on 0001:

This test:

   +SELECT
   +    format('SELECT pg_catalog.pg_set_attribute_stats( '
   ...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

And can the other tests use pg_stats rather than pg_statistic?

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

Make sure all error paths ReleaseSysCache().

Why are you calling checkCanModifyRelation() twice?

I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

The test also sets the most_common_freqs in an ascending order, which
is weird.

Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:


On Thu, Mar 21, 2024 at 2:29 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you.

Comments on 0001:

This test:

   +SELECT
   +    format('SELECT pg_catalog.pg_set_attribute_stats( '
   ...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

Fair enough.
 

And can the other tests use pg_stats rather than pg_statistic?

They can, but part of what I wanted to show was that the values that aren't directly passed in as parameters (staopN, stacollN) get set to the correct values, and those values aren't guaranteed to match across databases, hence testing them in the regression test rather than in a TAP test. I'd still like to be able to test that.
 

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

That would be problematic.

1. We'd have to compare the stats provided against the stats that are already there, make that list in-memory, and then re-order what remains
2. There would be no way to un-set statistics of a given stakind, unless we added an "actually set it null" boolean for each parameter that can be null. 
3. I tried that with the JSON formats, it made the code even messier than it already was.

Make sure all error paths ReleaseSysCache().

+1
 

Why are you calling checkCanModifyRelation() twice?

Once for the relation itself, and once for pg_statistic.
 
I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

I go back and forth on that. I can see making it void and returning an error for everything that we currently return false for, but if we do that, then a statement with one pg_set_relation_stats and N pg_set_attribute_stats calls (which we lump together in one command for the locking benefits and atomic transaction) would fail entirely if one of the set_attributes named a column that we had dropped. It's up for debate whether that's the right behavior or not.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

+1.
 

The test also sets the most_common_freqs in an ascending order, which
is weird.

I pulled most of the hardcoded values from pg_stats itself. The sample set is trivially small, and the values inserted were in-order-ish. So maybe that's why.
 
Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

Maybe we could have the functions restricted to a role or roles:

1. pg_write_all_stats (can modify stats on ANY table)
2. pg_write_own_stats (can modify stats on tables owned by user)

I'm iffy on the need for the first one; I list it first purely to show how I derived the name for the second.
 
One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

+1. All suggestions of validation checks welcome.

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-03-21 at 03:27 -0400, Corey Huinker wrote:
>
> They can, but part of what I wanted to show was that the values that
> aren't directly passed in as parameters (staopN, stacollN) get set to
> the correct values, and those values aren't guaranteed to match
> across databases, hence testing them in the regression test rather
> than in a TAP test. I'd still like to be able to test that.

OK, that's fine.

> > The function signature for pg_set_attribute_stats could be more
> > friendly
...
> 1. We'd have to compare the stats provided against the stats that are
> already there, make that list in-memory, and then re-order what
> remains
> 2. There would be no way to un-set statistics of a given stakind,
> unless we added an "actually set it null" boolean for each parameter
> that can be null. 
> 3. I tried that with the JSON formats, it made the code even messier
> than it already was.

How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

   CREATE TABLE a(i int);
   INSERT INTO a VALUES(1);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   ANALYZE a;
   SELECT most_common_vals, most_common_freqs
     FROM pg_stats WHERE tablename='a';
    most_common_vals | most_common_freqs
   ------------------+-------------------
    {4,3,2}          | {0.4,0.3,0.2}
   (1 row)

Can you show an example where it's not?

>
> Maybe we could have the functions restricted to a role or roles:
>
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

+1
Default NULL has been implemented for all parameters after n_distinct.
 

> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

In which case, wouldn't the checkCanModify on pg_statistic be a proxy for is_superuser/has_special_role_we_havent_created_yet?

 

>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

That's certainly do-able.


 

>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

   CREATE TABLE a(i int);
   INSERT INTO a VALUES(1);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   ANALYZE a;
   SELECT most_common_vals, most_common_freqs
     FROM pg_stats WHERE tablename='a';
    most_common_vals | most_common_freqs
   ------------------+-------------------
    {4,3,2}          | {0.4,0.3,0.2}
   (1 row)

Can you show an example where it's not?

Not off hand, no.

 

>
> Maybe we could have the functions restricted to a role or roles:
>
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

A great many test cases coming in the next patch.

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-03-21 at 15:10 -0400, Corey Huinker wrote:
>
> In which case wouldn't the checkCanModify on pg_statistic would be a
> proxy for is_superuser/has_special_role_we_havent_created_yet.

So if someone pg_dumps their table and gets the statistics in the SQL,
then they will get errors loading it unless they are a member of a
special role?

If so we'd certainly need to make --no-statistics the default, and have
some way of skipping stats during reload of the dump (perhaps make the
set function a no-op based on a GUC?).

But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

> >
Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:


But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

That is my strong preference, yes.
 

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

Sorry, I misunderstood; I thought we were talking about values, not the frequencies. Yes, the frequencies should be monotonically non-increasing (i.e. they can go down or stay flat from N->N+1). I'll add a test case for that.

Re: Statistics Import and Export

From
Corey Huinker
Date:
v12 attached.

0001 - 

The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough cases where a failure was considered recoverable to justify having it. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement may no longer be desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore use these functions.

The function pg_set_attribute_stats() now has NULL defaults for all stakind-based statistics types. Thus, you can set statistics on a more terse basis, like so:

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_export_import.test'::regclass,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 0.5::real,
    avg_width => 2::integer,
    n_distinct => -0.1::real,
    most_common_vals => '{2,1,3}'::text,
    most_common_freqs => '{0.3,0.25,0.05}'::real[]
    );

This would generate a pg_statistic row with exactly one stakind in it, and replaces whatever statistics previously existed for that attribute.

It now checks for many types of data inconsistencies, and most (35) of those have test coverage in the regression. There's a few areas still uncovered, mostly surrounding histograms where the datatype is dependent on the attribute.
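
As an example of the sort of call the new checks are intended to reject (a sketch modeled on the call above; the exact error text is omitted since it may still change):

    -- most_common_freqs must be in non-increasing order, so this is rejected
    SELECT pg_catalog.pg_set_attribute_stats(
        relation => 'stats_export_import.test'::regclass,
        attname => 'id'::name,
        inherited => false::boolean,
        null_frac => 0.5::real,
        avg_width => 2::integer,
        n_distinct => -0.1::real,
        most_common_vals => '{2,1,3}'::text,
        most_common_freqs => '{0.05,0.25,0.3}'::real[]  -- ascending: invalid
        );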

The functions both require that the caller be the owner of the table/index.

The function pg_set_relation_stats is largely unchanged from previous versions.

Key areas where I'm seeking feedback:

- What additional checks can be made to ensure validity of statistics?
- What additional regression tests would be desirable?
- What extra information can we add to the error messages to give the user an idea of how to fix the error?
- What are some edge cases we should test concerning putting bad stats in a table to get an undesirable outcome?


0002 -

This patch concerns invoking the functions in 0001 via pg_restore/pg_upgrade. Little has changed here. Dumping statistics is currently the default for pg_dump/pg_restore/pg_upgrade, and can be switched off with the switch --no-statistics. Some have expressed concern about whether stats dumping should be the default. I have a slight preference for making it the default, for the following reasons:

- The existing commandline switches are all --no-something based, and this follows the pattern.
- Complaints about poor performance post-upgrade are often the result of the user not knowing about vacuumdb --analyze-in-stages or the need to manually ANALYZE. If they don't know about that, how can we expect them to know about new switches in pg_upgrade?
- The failure condition means that the user has a table with no stats in it (or possibly partial stats, if we change how we make the calls), which is exactly where they were before they made the call.
- Any performance regressions will be remedied with the next autovacuum or manual ANALYZE.
- If we had a positive flag (e.g. --with-statistics or just --statistics), and we then changed the default, that could be considered a POLA violation.


Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.


Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.


v13 attached. All the same features as v12, but with a lot more type checking, bounds checking, value inspection, etc. Perhaps the most notable feature is that we're now ensuring that histogram values are in ascending order. This could come in handy for detecting when we are applying stats to a column of the wrong type, or the right type but with a different collation. It's not a guarantee of validity, of course, but it would detect egregious changes in sort order.
 
Attachment

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
Hi Corey,


On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.

0001 - 


Some random comments

+SELECT
+    format('SELECT pg_catalog.pg_set_attribute_stats( '
+            || 'relation => %L::regclass::oid, attname => %L::name, '
+            || 'inherited => %L::boolean, null_frac => %L::real, '
+            || 'avg_width => %L::integer, n_distinct => %L::real, '
+            || 'most_common_vals => %L::text, '
+            || 'most_common_freqs => %L::real[], '
+            || 'histogram_bounds => %L::text, '
+            || 'correlation => %L::real, '
+            || 'most_common_elems => %L::text, '
+            || 'most_common_elem_freqs => %L::real[], '
+            || 'elem_count_histogram => %L::real[], '
+            || 'range_length_histogram => %L::text, '
+            || 'range_empty_frac => %L::real, '
+            || 'range_bounds_histogram => %L::text) ',
+        'stats_export_import.' || s.tablename || '_clone', s.attname,
+        s.inherited, s.null_frac,
+        s.avg_width, s.n_distinct,
+        s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+        s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+        s.elem_count_histogram, s.range_length_histogram,
+        s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".
 
 
The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough cases where a failure was considered recoverable to justify having it. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement may no longer be desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore use these functions.


0002 -

This patch concerns invoking the functions in 0001 via pg_restore/pg_upgrade. Little has changed here. Dumping statistics is currently the default for pg_dump/pg_restore/pg_upgrade, and can be switched off with the switch --no-statistics. Some have expressed concern about whether stats dumping should be the default. I have a slight preference for making it the default, for the following reasons:


+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+  tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics on a foreign
table makes sense, since it contains no local data but the statistics are still meaningful.
 

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.



Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump-related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore; if the statistics are changed by auto-analyze running post-restore, those tests will fail.

I believe, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attribute. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 3/25/24 09:27, Corey Huinker wrote:
> On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
>> v12 attached.
>>
>>
> v13 attached. All the same features as v12, but with a lot more type
> checking, bounds checking, value inspection, etc. Perhaps the most notable
> feature is that we're now ensuring that histogram values are in ascending
> order. This could come in handy for detecting when we are applying stats to
> a column of the wrong type, or the right type but with a different
> collation. It's not a guarantee of validity, of course, but it would detect
> egregious changes in sort order.
> 

Hi,

I did take a closer look at v13 today. I have a bunch of comments and
some minor whitespace fixes in the attached review patches.

0001
----

1) The docs say this:

  <para>
   The purpose of this function is to apply statistics values in an
   upgrade situation that are "good enough" for system operation until
   they are replaced by the next <command>ANALYZE</command>, usually via
   <command>autovacuum</command>. This function is used by
   <command>pg_upgrade</command> and <command>pg_restore</command> to
   convey the statistics from the old system version into the new one.
  </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

2) Also, I'm not sure about this:

  <parameter>relation</parameter>, the parameters in this are all
  derived from <structname>pg_stats</structname>, and the values
  given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

6) There's a couple minor whitespace fixes or comments etc.


0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

2) I think this condition in dumpTableSchema() is actually incorrect:

  if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
      (tbinfo->relkind == RELKIND_VIEW))
      dumpRelationStats(fout, &tbinfo->dobj, reltypename,

Aren't indexes pretty much exactly the thing for which we don't want to
dump statistics? In fact this skips dumping statistics for table - if
you dump a database with a single table (-Fc), pg_restore -l will tell
you this:

217; 1259 16385 TABLE public t user
3403; 0 16385 TABLE DATA public t user

Which is not surprising, because table is not a view. With an expression
index you get this:

217; 1259 16385 TABLE public t user
3404; 0 16385 TABLE DATA public t user
3258; 1259 16418 INDEX public t_expr_idx user
3411; 0 0 STATS IMPORT public INDEX t_expr_idx

Unfortunately, fixing the condition does not work:

  $ pg_dump -Fc test > test.dump
  pg_dump: warning: archive items not in correct section order

This happens for a very simple reason - the statistics are marked as
SECTION_POST_DATA, which for the index works, because indexes are in
post-data section. But the table stats are dumped right after data,
still in the "data" section.

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".


regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:


+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

We don't strictly need it, but I've found the set-difference operation to be incredibly useful in diagnosing problems. Additionally, the values are subject to change due to changes in test data, there's no guarantee that the output of ANALYZE is deterministic, etc. But most of all, it's because the test cares about the correct copying of values, not the values themselves.
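
Concretely, the check boils down to something like this (a sketch; the anyarray columns are cast to text so they compare cleanly, and test_clone is the copy the test creates):

    -- expect an empty result: every attribute's stats in the clone should
    -- match the original exactly
    SELECT attname, inherited, null_frac, avg_width, n_distinct,
           most_common_vals::text, most_common_freqs, histogram_bounds::text
    FROM pg_stats
    WHERE schemaname = 'stats_export_import' AND tablename = 'test'
    EXCEPT
    SELECT attname, inherited, null_frac, avg_width, n_distinct,
           most_common_vals::text, most_common_freqs, histogram_bounds::text
    FROM pg_stats
    WHERE schemaname = 'stats_export_import' AND tablename = 'test_clone';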
 

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".

Yes. It should.
 
 Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics on a foreign
table makes sense, since it contains no local data but the statistics are still meaningful.

Good points, but I'm not immediately sure how to enforce those rules.
 
 

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.



Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump-related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore; if the statistics are changed by auto-analyze running post-restore, those tests will fail.

+1
 
I believe, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attribute. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?

I don't think there's time to get it into this release. I think we'd want to extend this functionality to both IMPORT FOREIGN SCHEMA and ANALYZE for foreign tables, in both cases with a server/table option to do regular remote sampling. In both cases, they'd do a remote query very similar to what pg_dump does (hence putting it in fe_utils), with some filters on which columns/tables it believes it can trust. The remote table might itself be a view (in which case the query would turn up nothing) or column data types may change across the wire, and in those cases we'd have to fall back to sampling.

Re: Statistics Import and Export

From
Corey Huinker
Date:
1) The docs say this:

  <para>
   The purpose of this function is to apply statistics values in an
   upgrade situation that are "good enough" for system operation until
   they are replaced by the next <command>ANALYZE</command>, usually via
   <command>autovacuum</command>. This function is used by
   <command>pg_upgrade</command> and <command>pg_restore</command> to
   convey the statistics from the old system version into the new one.
  </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

True, I'll split the docs.
 

2) Also, I'm not sure about this:

  <parameter>relation</parameter>, the parameters in this are all
  derived from <structname>pg_stats</structname>, and the values
  given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

The next most likely sources would be 1. stats from another similar table and 2. the imagination of a user testing hypothetical query plans.
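
For the second case, think of something like this (purely hypothetical numbers and table; the parameter names follow the patch excerpt upthread and may not match the final signature):

    -- pretend the table is vastly larger than it really is and see how the
    -- planner reacts, without loading any extra rows
    SELECT pg_catalog.pg_set_relation_stats(
               relation      => 'foo.bar'::regclass,
               reltuples     => 10000000::real,
               relpages      => 125000::integer,
               relallvisible => 125000::integer);
    EXPLAIN SELECT * FROM foo.bar WHERE id = 42;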
 

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

I wrestle with that myself. I think there's some pieces that can be factored out.


4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

My apologies, it definitely deserves a comment.
 

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

Yes, that is vestigial, I'll remove it.
 

6) There's a couple minor whitespace fixes or comments etc.


0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

It's not strictly necessary. 
 

2) I think this condition in dumpTableSchema() is actually incorrect:

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

Good points.
 

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".

+1
 

Re: Statistics Import and Export

From
Jeff Davis
Date:
Hi Tom,

Comparing the current patch set to your advice below:

On Tue, 2023-12-26 at 14:19 -0500, Tom Lane wrote:
> I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
>
>         SELECT pg_catalog.load_statistics(....);
>
> This has a number of advantages, not least of which is that an
> extension
> could plausibly add compatible functions to older versions.

Check.

>   The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
>
> load_statistics(table regclass, attname text, stakind int, stavalue
> ...);

The problem with basing the function on pg_statistic directly is that
it can only be exported by the superuser.

The current patches instead base it on the pg_stats view, which already
does the privilege checking. Technically, information about which
stakinds go in which slots is lost, but I don't think that's a problem
as long as the stats make it in, right? It's also more user-friendly to
have nice names for the function arguments. The only downside I see is
that it's slightly asymmetric: exporting from pg_stats and importing
into pg_statistic.

I do have some concerns about letting non-superusers import their own
statistics: how robust is the rest of the code to handle malformed
stats once they make it into pg_statistic? Corey has addressed that
with basic input validation, so I think it's fine, but perhaps I'm
missing something.

>  As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.

You mean a version argument to the function, which would appear in the
exported stats data? That's not in the current patch set.

It's relying on the new version of pg_dump understanding the old
statistics data, and dumping it out in a form that the new server will
understand.

>   I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.

That's not in the current patch set, either.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-03-26 at 00:16 +0100, Tomas Vondra wrote:
> I did take a closer look at v13 today. I have a bunch of comments and
> some minor whitespace fixes in the attached review patches.

I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

The attached patch is pretty rough -- not many comments, and perhaps
some things should be moved around. I only tested very basic
dump/reload in SQL format.

Regards,
    Jeff Davis


Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Fri, Mar 29, 2024 at 2:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

That is fairly close to what I came up with per our conversation (attached below), but I really like the att_stats_arginfo construct and I definitely want to adopt that and expand it to a third dimension that flags the fields that cannot be null. I will incorporate that into v15.

As for v14, here are the highlights:

0001:
- broke up pg_set_attribute_stats() into many functions. Every stat kind gets its own validation function. Type derivation is now done in its own function.
- removed check on inherited stats flag that required the table be partitioned. that was in error
- added check for most_common_values to be unique in ascending order, and tests to match
- no more mention of pg_dump in the function documentation
- function documentation cites pg-stats-view as reference for the parameter's data requirements

0002:
- All relstats and attrstats calls are now their own statement instead of a compound statement
- moved the archive TOC entry from post-data back to SECTION_NONE (as it was modeled on object COMMENTs), which seems to work better.
- remove meta-query in favor of more conventional query building
- removed all changes to fe_utils/

Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> That is fairly close to what I came up with per our conversation
> (attached below), but I really like the att_stats_arginfo construct
> and I definitely want to adopt that and expand it to a third
> dimension that flags the fields that cannot be null. I will
> incorporate that into v15.

Sounds good. I think it cuts down on the boilerplate.

> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

There are still failures in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

On Fri, Mar 29, 2024 at 11:05 Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

I’d certainly think “with stats” would be the preferred default of our users.

Thanks!

Stephen

Re: Statistics Import and Export

From
Corey Huinker
Date:

There are still failures in the pg_upgrade TAP test. One seems to be
ordering, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

That's fixed in my current working version, as is a tsvector-specific issue. Working on the TAP issue.
 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

What do you think about starting off with it as non-default, and then
switching it to default in 18?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> I’d certainly think “with stats” would be the preferred default of
>> our users.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?

I'm with Stephen: I find it very hard to imagine that there's any
users who wouldn't want this as default.  If we do what you suggest,
then there will be three historical behaviors to cope with not two.
That doesn't sound like it'll make anyone's life better.

As for the "it might break" argument, that could be leveled against
any nontrivial patch.  You're at least offering an opt-out switch,
which is something we more often don't do.

(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:


On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

To that end, I'm going to be modifying the "Optimizer statistics are not transferred by pg_upgrade..." message when stats _were_ transferred, with additional instructions that the user should treat any stats-related error messages encountered as a reason to manually analyze that table. We should probably say something about extended stats as well.
 

Re: Statistics Import and Export

From
Corey Huinker
Date:
(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

Correct. It follows the existing --no-something pattern. 

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

On Fri, Mar 29, 2024 at 19:35 Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

I understand that point.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

But we don’t expect anything to run analyze … do we?  So I’m not sure why it makes sense to raise this as a concern. 

What do you think about starting off with it as non-default, and then
switching it to default in 18?

What’s different, given the above arguments, in making the change with 18 instead of now?  I also suspect that if we say “we will change the default later” … that later won’t ever come and we will end up making our users always have to remember to say “with-stats” instead.

The stats are important which is why the effort is being made in the first place. If just doing an analyze after loading the data was good enough then this wouldn’t be getting worked on.

Independently, I had a thought around doing an analyze as the data is being loaded, but we can't do that for indexes (though we could perhaps analyze the indexed values as we build the index). This works when we do a truncate or create the table in the same transaction, so we would tie into some of the existing logic that we have around that. We would also adjust COPY to accept an option that specifies the anticipated number of rows being loaded (which we can figure out reasonably well during the dump phase).

Perhaps this would lead to a pg_dump option to do the data load as a transaction with a truncate before the copy (the point being to still be able to do a parallel load while getting the benefit of knowing that we are completely reloading the table). Just some other thoughts; I don't intend for them to take away from the current effort at all, which I see as valuable and should be enabled by default.

Thanks!

Stephen

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-03-29 at 20:54 -0400, Stephen Frost wrote:
> What’s different, given the above arguments, in making the change
> with 18 instead of now?

Acknowledged. You, Tom, and Corey (and perhaps everyone else) seem to
be aligned here, so that's consensus enough for me. Default is with
stats, --no-statistics to disable them.

> Independently, I had a thought around doing an analyze as the data is
> being loaded ..

Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
        Jeff Davis



v15 attached

0001:
- fixed an error involving tsvector types
- only derive element type if element stats available
- general cleanup

0002:

- 002_pg_upgrade.pl now dumps before/after databases with --no-statistics. I tried to find out why some tables were getting their relstats either not set, or set and reset, never affecting the attribute stats. I even tried turning autovacuum off for both instances, but nothing seemed to change the fact that the same tables were having their relstats reset.

TODO list:

- decision on whether suppressing stats in the pg_upgrade TAP check is for the best
- pg_upgrade option to suppress stats import; there is no real pattern to follow there
- what message text to convey to the user about the potential stats import errors and their remediation, and to what degree that replaces the "you ought to run vacuumdb" message.
- what additional error context we want to add to the array_in() imports of anyarray strings

Attachment

Re: Statistics Import and Export

From
Magnus Hagander
Date:


On Sat, Mar 30, 2024 at 1:26 AM Corey Huinker <corey.huinker@gmail.com> wrote:


On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

To that end, I'm going to be modifying the "Optimizer statistics are not transferred by pg_upgrade..." message when stats _were_ transferred, with additional instructions that the user should treat any stats-related error messages encountered as a reason to manually analyze that table. We should probably say something about extended stats as well.
 

I'm getting late into this discussion and I apologize if I've missed this being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

Surely we can come up with something better. Otherwise, I think all those tools are just going to have to end up assuming that it always failed and proceed based on that, and that would be a shame.

--

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-03-30 at 01:34 -0400, Corey Huinker wrote:
>
> - 002_pg_upgrade.pl now dumps before/after databases with --no-
> statistics. I tried to find out why some tables were getting their
> relstats either not set, or set and reset, never affecting the
> attribute stats. I even tried turning autovacuum off for both
> instances, but nothing seemed to change the fact that the same tables
> were having their relstats reset.

I think I found out why this is happening: a schema-only dump first
creates the table, then sets the relstats, then creates indexes. The
index creation updates the relstats, but because the dump was schema-
only, it overwrites the relstats with zeros.

That exposes an interesting dependency, which is that relstats must be
set after index creation, otherwise they will be lost -- at least in
the case of pg_upgrade.
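
For illustration, that overwrite can be reproduced without the patch at all,
using a manual catalog update (superuser only) as a stand-in for the restored
relation stats:

CREATE TABLE t (a int);

-- stand-in for restored relation stats (illustration only, not patch code)
UPDATE pg_class SET relpages = 120, reltuples = 10000
 WHERE oid = 't'::regclass;

SELECT relpages, reltuples FROM pg_class WHERE oid = 't'::regclass;
-- 120 | 10000

CREATE INDEX t_a_idx ON t (a);

SELECT relpages, reltuples FROM pg_class WHERE oid = 't'::regclass;
-- 0 | 0   (the index build recounted the empty heap)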

This re-raises the question of whether stats are part of a schema-only
dump or not. Did we settle conclusively that they are?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> This re-raises the question of whether stats are part of a schema-only
> dump or not. Did we settle conclusively that they are?

Surely they are data, not schema.  It would make zero sense to restore
them if you aren't restoring the data they describe.

Hence, it'll be a bit messy if we can't put them in the dump's DATA
section.  Maybe we need to revisit CREATE INDEX's behavior rather
than assuming it's graven in stone?

            regards, tom lane



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
> Surely they are data, not schema.  It would make zero sense to
> restore
> them if you aren't restoring the data they describe.

The complexity is that pg_upgrade does create the data, but relies on a
schema-only dump. So we'd need to at least account for that somehow,
either with a separate stats-only dump, or make a special case in
binary upgrade mode that dumps schema+stats (and resolves the CREATE
INDEX issue).

> Maybe we need to revisit CREATE INDEX's behavior rather
> than assuming it's graven in stone?

Would there be a significant cost to just not doing that? Or are you
suggesting that we special-case the behavior, or turn it off during
restore with a GUC?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
>> Surely they are data, not schema.  It would make zero sense to
>> restore them if you aren't restoring the data they describe.

> The complexity is that pg_upgrade does create the data, but relies on a
> schema-only dump. So we'd need to at least account for that somehow,
> either with a separate stats-only dump, or make a special case in
> binary upgrade mode that dumps schema+stats (and resolves the CREATE
> INDEX issue).

Ah, good point.  But binary-upgrade mode is special in tons of ways
already.  I don't see a big problem with allowing it to dump stats
even though --schema-only would normally imply not doing that.

(You could also imagine an explicit positive --stats switch that would
override --schema-only, but I don't see that it's worth the trouble.)

>> Maybe we need to revisit CREATE INDEX's behavior rather
>> than assuming it's graven in stone?

> Would there be a significant cost to just not doing that? Or are you
> suggesting that we special-case the behavior, or turn it off during
> restore with a GUC?

I didn't have any specific proposal in mind, was just trying to think
outside the box.

            regards, tom lane



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-03-30 at 13:39 -0400, Tom Lane wrote:
> (You could also imagine an explicit positive --stats switch that
> would
> override --schema-only, but I don't see that it's worth the trouble.)

That would have its own utility for reproducing planner problems
outside of production systems.

(That could be a separate feature, though, and doesn't need to be a
part of this patch set.)

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
I'm getting late into this discussion and I apologize if I've missed this being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

That makes a lot of sense. I'm not sure what form it would take (file, pseudo-table, something else?). Open to suggestions.

Re: Statistics Import and Export

From
Corey Huinker
Date:
I didn't have any specific proposal in mind, was just trying to think
outside the box.

What if we added a separate section, SECTION_STATISTICS, which is run following post-data?

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
>> I didn't have any specific proposal in mind, was just trying to think
>> outside the box.

> What if we added a separate section, SECTION_STATISTICS, which is run
> following post-data?

Maybe, but that would have a lot of side-effects on pg_dump's API
and probably on some end-user scripts.  I'd rather not.

I haven't looked at the details, but I'm really a bit surprised
by Jeff's assertion that CREATE INDEX destroys statistics on the
base table.  That seems wrong from here, and maybe something we
could have it not do.  (I do realize that it recalculates reltuples
and relpages, but so what?  If it updates those, the results should
be perfectly accurate.)

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

Having given this some thought, I'd be inclined to create a view, pg_stats_missing, with the same security barrier as pg_stats, but looking for tables that lack stats on at least one column, or lack stats on an extended statistics object.

Table structure would be 
schemaname name
tablename name
attnames text[]
ext_stats text[]

The informational message, if it changes at all, could reference this new view as the place to learn about how well the stats import went.

vacuumdb might get a --missing-only option in addition to --analyze-in-stages.

For that matter, we could add --analyze-missing options to pg_restore and pg_upgrade to do the mopping up themselves.
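
A rough sketch of what such a view could look like (regular tables and matviews only, with extended stats and the security barrier omitted; purely illustrative, not patch code):

CREATE VIEW pg_stats_missing AS
SELECT n.nspname AS schemaname,
       c.relname AS tablename,
       array_agg(a.attname ORDER BY a.attnum) AS attnames
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE c.relkind IN ('r', 'm')
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND NOT EXISTS (SELECT 1
                  FROM pg_statistic s
                  WHERE s.starelid = c.oid
                    AND s.staattnum = a.attnum)
GROUP BY n.nspname, c.relname;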

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics.  I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here.  But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works.  So I'd set that whole
topic on the back burner.

It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass.  But I think we should leave that till we
have a bit more than zero field experience with this feature.

            regards, tom lane



Re: Statistics Import and Export

From
Tom Lane
Date:
My apologies for having paid so little attention to this thread for
months.  I got around to reading the v15 patches today, and while
I think they're going in more or less the right direction, there's
a long way to go IMO.

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute.  (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go.  A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof.  What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.  The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.  As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use, and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
                       inherited bool, source_version int,
                       variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false, -- not inherited
                  16, -- source server major version
                              -- pairs of labels and values follow
                              'null_frac', 0.4,
                              'avg_width', 42,
                              'histogram_bounds',
                              array['a', 'b', 'c']::text[],
                              ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks.  If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation).  This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on.  This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function.  (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

If we write the labels as undecorated string literals as I show
above, I think they will arrive at the function as "unknown"-type
constants, which is a little weird but doesn't seem like it's
really a big problem.  The alternative is to cast them all to text
explicitly, but that's adding notation to no great benefit IMO.

pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too).  If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...
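
(Illustratively, a call under that scheme might look like the following; the
parameter set here is an assumption based on the pg_class columns discussed in
this thread, not the patch's actual signature.)

SELECT pg_set_relation_stats('public.mytable'::regclass,
                             120,       -- relpages
                             10000,     -- reltuples
                             100);      -- relallvisible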

I've not read the patches in great detail, but I did note a
few low-level issues:

* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct.  Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?)  That all looks
overly complicated and duplicative.  I think you don't need two
layers of function there.

* I find the stuff with enums and "const char *param_names" to
be way too cute and unlike anything we do elsewhere.  Please
don't invent your own notations for coding patterns that have
hundreds of existing instances.  pg_set_relation_stats, for
example, has absolutely no reason not to look like the usual

    Oid    relid = PG_GETARG_OID(0);
    float4    relpages = PG_GETARG_FLOAT4(1);
    ... etc ...

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all.  (Maybe they don't, see
next item.)

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try.  As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely?  You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all.  (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.)  The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA".  I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations.  We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries.  But I don't think we need to.

* dumpRelationStats seems to be dumping the relation-level
stats twice.

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
On Sun, Mar 31, 2024 at 2:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

It was a response to the suggestion that there be some way for tools/automation to read the status of stats. I would view it as a separate patch, as such a view would be useful now for knowing which tables to ANALYZE, regardless of whether this patch goes in or not.
 
There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics.  I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here.  But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works.  So I'd set that whole
topic on the back burner.

So extended stats _were_ supported by earlier versions of this patch, where the medium of communication was JSON. However, there were several problems with adapting that to the current model where we match params to stat types:

* Several of the column types do not have functional input functions, so we must construct the data structure internally and pass them to statext_store().
* The output functions for some of those column types have lists of attnums, with negative values representing positional expressions in the stat definition. This information is not translatable to another system without also passing along the attnum/attname mapping of the source system.

At least three people told me "nobody uses extended stats" and to just drop that from the initial version. Unhappy with this assessment, I inquired as to whether my employer (AWS) had some internal databases that used extended stats so that I could get good test data, and came up with nothing, nor did anyone know of customers who used the feature. So when the fourth person told me that nobody uses extended stats, and not to let a rarely-used feature get in the way of a feature that would benefit nearly 100% of users, I dropped it.
 
It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass.  But I think we should leave that till we
have a bit more than zero field experience with this feature.

It may be that we leave the recommendation exactly as it is.

Perhaps we enhance the error messages in pg_set_*_stats() to indicate what command would remediate the issue.
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> On Sun, Mar 31, 2024 at 2:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There's a bigger issue though: AFAICS this patch set does nothing
>> about dumping extended statistics.  I surely don't want to hold up
>> the patch insisting that that has to happen before we can commit the
>> functionality proposed here.  But we cannot rip out pg_upgrade's
>> support for post-upgrade ANALYZE processing before we do something
>> about extended statistics, and that means it's premature to be
>> designing any changes to how that works.  So I'd set that whole
>> topic on the back burner.

> So Extended Stats _were_ supported by earlier versions where the medium of
> communication was JSON. However, there were several problems with adapting
> that to the current model where we match params to stat types:

> * Several of the column types do not have functional input functions, so we
> must construct the data structure internally and pass them to
> statext_store().
> * The output functions for some of those column types have lists of
> attnums, with negative values representing positional expressions in the
> stat definition. This information is not translatable to another system
> without also passing along the attnum/attname mapping of the source system.

I wonder if the right answer to that is "let's enhance the I/O
functions for those types".  But whether that helps or not, it's
v18-or-later material for sure.

> At least three people told me "nobody uses extended stats" and to just drop
> that from the initial version.

I can't quibble with that view of what has priority.  I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
I wonder if the right answer to that is "let's enhance the I/O
functions for those types".  But whether that helps or not, it's
v18-or-later material for sure.

That was Stephen's take as well, and I agreed, given that I had to throw the kitchen sink of source-side oid mappings (attname, types, collations, operators) into the JSON to work around the limitation.
 
I can't quibble with that view of what has priority.  I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

I mostly agree, with the caveat that pg_upgrade's existing message saying that optimizer stats were not carried over wouldn't be 100% true anymore.
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
>> I can't quibble with that view of what has priority.  I'm just
>> suggesting that redesigning what pg_upgrade does in this area
>> should come later than doing something about extended stats.

> I mostly agree, with the caveat that pg_upgrade's existing message saying
> that optimizer stats were not carried over wouldn't be 100% true anymore.

I think we can tweak the message wording.  I just don't want to be
doing major redesign of the behavior, nor adding fundamentally new
monitoring capabilities.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute.  (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go.  A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

That was my reasoning as well.

 
I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof.  What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.

I don't think we'd overload; we'd just add new parameters to the function signature.
 
  The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.

There was a lot of back-and-forth about what sorts of failures were error-worthy, and which were warn-worthy. I'll discuss further below.
 
  As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use,

+1
 
and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

The JSON had it, and I never did use it. Not against having it again.
 

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
                       inherited bool, source_version int,
                       variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false, -- not inherited
                              16, -- source server major version
                              -- pairs of labels and values follow
                              'null_frac', 0.4,
                              'avg_width', 42,
                              'histogram_bounds',
                              array['a', 'b', 'c']::text[],
                              ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks.  If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation).  This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

I'm not aware of how strict works with variadics. Would the lack of any variadic parameters trigger it?

Also going with strict means that an inadvertent explicit NULL in one parameter would cause the entire attribute import to fail silently. I'd rather fail loudly.

 
* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on.  This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

Interesting.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function.  (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

So, that's tricky. The type of the values is not always the attribute type: for expression indexes, we call exprType() and exprCollation(), in which case we always use the expression type over the attribute type, but only use the expression's collation if the attribute had no collation. This mimics the behavior of ANALYZE.

Then, for the MCELEM and DECHIST stakinds we have to find the type's element type, and that has special logic for tsvectors, ranges, and other non-scalars, borrowing from the various *_typanalyze() functions. For that matter, the existing typanalyze functions don't grab the < operator, which I need for later data validations, so using examine_attribute() was simultaneously overkill and insufficient.

None of this functionality is accessible from a client program, so we'd have to pull in a lot of backend stuff to pg_dump to make it resolve the typecasts correctly. Text and array_in() was just easier.
 
pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too).  If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I'm still iffy about the silent failures of strict. 

I looked it up, and the only change needed for changing oid to regclass is in the pg_proc.dat. (and the docs, of course). So I'm already on board.
 
* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct.  Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?)  That all looks
overly complicated and duplicative.  I think you don't need two
layers of function there.

To prove that the caller is the owner (or better) of the table.
  

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.

Easily remedied in light of the deconstruct_array() suggestion below, but I do want to add that value_not_null_array_len() does check for nulls, and that function is used to generate all but one of the arrays (and for that one we're just verifying that its length matches the length of the other array). There's even a regression test that checks it (search for: "elem_count_histogram null element").
 
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all.  (Maybe they don't, see
next item.)

+1 
 

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try.  As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely?  You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all.  (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.)  The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

A lot of the feedback I got on this patch over the months concerned giving inaccurate, nonsensical, or malicious data to the planner. Surely the planner does do *some* defensive programming when fetching these values, but this is the first time those values were potentially set by a user, not by our own internal code. We can try to match types, collations, etc from source to dest, but even that would fall victim to another glibc-level collation change. Verifying that the list the source system said was sorted is actually sorted when put on the destination system is the truest test we're ever going to get, albeit for sampled elements.
 

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

All cases were made errors precisely to spark debate about which cases we'd want to continue from and which we'd want to error out on. Also, I was under the impression it was bad form to follow up a NOTICE/WARN with an ERROR in the same function call.

 
* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA".  I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations.  We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

SECTION_NONE works the best, but we're getting some situations where relpages/reltuples/relallvisible get reset to 0 in pg_class. Hence the temporary --no-statistics in the pg_upgrade TAP test.

SECTION_POST_DATA (a previous suggestion) causes something weird to happen where certain GRANT/REVOKEs happen outside of their expected section. 

In work I've done since v15, I tried giving the table stats archive entry a dependency on every index (and index constraint) as well as the table itself, thinking that would get us past all resets of pg_class, but it hasn't worked.
 
* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries.  But I don't think we need to.

Agreed. Don't need to, each function call now sinks or swims on its own.
 

* dumpRelationStats seems to be dumping the relation-level
stats twice.

+1 

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

Temporary. Related to the pg_class overwrite issue above.

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
>> and I really think that we need to provide
>> the source server's major version number --- maybe we will never
>> need that, but if we do and we don't have it we will be sad.

> The JSON had it, and I never did use it. Not against having it again.

Well, you don't need it now seeing that the definition of pg_stats
columns hasn't changed in the past ... but there's no guarantee we
won't want to change them in the future.

>> So this leads me to suggest that we'd be best off with a VARIADIC
>> ANY signature, where the variadic part consists of alternating
>> parameter labels and values:
>> pg_set_attribute_stats(table regclass, attribute name,
>>                        inherited bool, source_version int,
>>                        variadic "any") returns void

> I'm not aware of how strict works with variadics. Would the lack of any
> variadic parameters trigger it?

IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?

> Also going with strict means that an inadvertent explicit NULL in one
> parameter would cause the entire attribute import to fail silently. I'd
> rather fail loudly.

Not really convinced that that is worth any trouble...

> * We can require the calling statement to cast arguments, particularly
>> arrays, to the proper type, removing the need for conversions within
>> the stats-setting function.  (But instead, it'd need to check that the
>> next "any" argument is the type it ought to be based on the type of
>> the target column.)

> So, that's tricky. The type of the values is not always the attribute type,

Hmm.  You would need to have enough smarts in pg_set_attribute_stats
to identify the appropriate array type in any case: as coded, it needs
that for coercion, whereas what I'm suggesting would only require it
for checking, but either way you need it.  I do concede that pg_dump
(or other logic generating the calls) needs to know more under my
proposal than before.  I had been thinking that it would not need to
hard-code that because it could look to see what the actual type is
of the array it's dumping.  However, I see that pg_typeof() doesn't
work for that because it just returns anyarray.  Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.

Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text.  I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.

>> * why is check_relation_permissions looking up the pg_class row?
>> There's already a copy of that in the Relation struct.

> To prove that the caller is the owner (or better) of the table.

I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions.  Use the relcache's copy
of the pg_class row.

>> * I'm dubious that we can fully vet the contents of these arrays,
>> and even a little dubious that we need to try.

> A lot of the feedback I got on this patch over the months concerned giving
> inaccurate, nonsensical, or malicious data to the planner. Surely the
> planner does do *some* defensive programming when fetching these values,
> but this is the first time those values were potentially set by a user, not
> by our own internal code. We can try to match types, collations, etc from
> source to dest, but even that would fall victim to another glibc-level
> collation change.

That sort of concern is exactly why I think the planner has to, and
does, defend itself.  Even if you fully vet the data at the instant
of loading, we might have the collation change under us later.

It could be argued that feeding bogus data to the planner for testing
purposes is a valid use-case for this feature.  (Of course, as
superuser we could inject bogus data into pg_statistic manually,
so it's not necessary to have this feature for that purpose.)
I guess I'm a great deal more sanguine than other people about the
planner's ability to tolerate inconsistent data; but in any case
I don't have a lot of faith in relying on checks in
pg_set_attribute_stats to substitute for that ability.  That idea
mainly leads to having a whole lot of code that has to be kept in
sync with other code that's far away from it and probably isn't
coded in a parallel fashion either.

>> * There's a lot of ERROR cases that maybe we ought to downgrade
>> to WARN-and-press-on, in the service of not breaking the restore
>> completely in case of trouble.

> All cases were made error precisely to spark debate about which cases we'd
> want to continue from and which we'd want to error from.

Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too.  So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.

> Also, I was under
> the impression it was bad form to follow up NOTICE/WARN with an ERROR in
> the same function call.

Seems like nonsense to me.  WARN then ERROR about the same condition
would be annoying, but that's not what we are talking about here.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?

If my pg_dump run emitted a bunch of stats that could never be imported, I'd want to know. With silent failures, I don't.

 
Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.

A backend function that we can't guarantee exists on the source system. :(
 
Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text.  I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.

I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

# select '{1,2,3}'::anyarray;
ERROR:  cannot accept a value of type anyarray
 
I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions.  Use the relcache's copy
of the pg_class row.

Roger Wilco.


Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too.  So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.

That's pretty persuasive. It also means that we need to trap for error in the array_in() calls, as that function does not yet have a _safe() mode.

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
Hi Corey,


On Mon, Mar 25, 2024 at 3:38 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi Corey,


On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.

0001 - 


Some random comments

+SELECT
+    format('SELECT pg_catalog.pg_set_attribute_stats( '
+            || 'relation => %L::regclass::oid, attname => %L::name, '
+            || 'inherited => %L::boolean, null_frac => %L::real, '
+            || 'avg_width => %L::integer, n_distinct => %L::real, '
+            || 'most_common_vals => %L::text, '
+            || 'most_common_freqs => %L::real[], '
+            || 'histogram_bounds => %L::text, '
+            || 'correlation => %L::real, '
+            || 'most_common_elems => %L::text, '
+            || 'most_common_elem_freqs => %L::real[], '
+            || 'elem_count_histogram => %L::real[], '
+            || 'range_length_histogram => %L::text, '
+            || 'range_empty_frac => %L::real, '
+            || 'range_bounds_histogram => %L::text) ',
+        'stats_export_import.' || s.tablename || '_clone', s.attname,
+        s.inherited, s.null_frac,
+        s.avg_width, s.n_distinct,
+        s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+        s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+        s.elem_count_histogram, s.range_length_histogram,
+        s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

Addressed in v15
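
(For illustration, the direct-call form looks roughly like the following, abbreviated to a few arguments; the parameter names are taken from the v12 test quoted above and may not match later patch versions.)

SELECT pg_catalog.pg_set_attribute_stats(
           relation => ('stats_export_import.' || s.tablename || '_clone')::regclass::oid,
           attname => s.attname,
           inherited => s.inherited,
           null_frac => s.null_frac,
           avg_width => s.avg_width,
           n_distinct => s.n_distinct)
FROM pg_catalog.pg_stats AS s
WHERE s.schemaname = 'stats_export_import'
  AND s.tablename IN ('test', 'is_odd');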
 

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".

Addressed in v15.
 
+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+  tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are
the statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a foreign
table makes sense, since it does not contain local data but the statistics are still meaningful.

Dumping statistics without data is required for pg_upgrade. This is being discussed in the same thread, but I don't see some of the suggestions, e.g. using the binary-upgrade mode switch, being used in v15.

Also, should we handle sequences and composite types the same way? The latter are probably not dumped, but just in case.
 

Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump-related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore; if the statistics change because of an auto-analyze happening post-restore, these tests will fail.

Fixed.

Thanks for addressing those comments.
 
--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
Hi Corey,

Some more comments on v15.

+/*
+ * A more encapsulated version of can_modify_relation for when the
+ * HeapTuple and Form_pg_class are not needed later.
+ */
+static void
+check_relation_permissions(Relation rel)

This function is used in exactly one place, so it usually wouldn't make much sense to write a separate function. But given that the caller is so long, this seems ok. If this function returned the cached tuple when the permission checks succeed, it could be used at the other place as well. The caller would be responsible for releasing the tuple or updating it.

Attached patch contains a test to invoke this function on a view. ANALYZE throws a WARNING when a view is passed to it. Similarly this function should refuse to update the statistics on relations for which ANALYZE throws a warning. A warning instead of an error seems fine.

+
+ const float4 min = 0.0;
+ const float4 max = 1.0;

When reading the validation condition, I have to look up the variable values. Can that be avoided by using the values directly in the condition itself? If there's some dependency elsewhere in the code, we can use macros. But I have not seen constant variables used in such a way elsewhere in the code.

+ values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
+ values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(attnum);
+ values[Anum_pg_statistic_stainherit - 1] = PG_GETARG_DATUM(P_INHERITED);

For a partitioned table this value has to be true. For a normal table, when setting this value to true, the function should at least make sure that the table has at least one child; otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. The prologue of the function needs to be updated accordingly.
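
(Illustratively, and not patch code: the presence of children can be probed with a simple catalog check, here against a hypothetical table name.)

SELECT EXISTS (SELECT 1
               FROM pg_inherits
               WHERE inhparent = 'public.mytable'::regclass) AS has_children;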

I have fixed a documentation error in the patch as well. Please incorporate it in your next patchset.
--
Best Wishes,
Ashutosh Bapat
Attachment

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
>> IIRC, "variadic any" requires having at least one variadic parameter.
>> But that seems fine --- what would be the point, or even the
>> semantics, of calling pg_set_attribute_stats with no data fields?

> If my pg_dump run emitted a bunch of stats that could never be imported,
> I'd want to know. With silent failures, I don't.

What do you think would be silent about that?  If there's a complaint
to be made, it's that it'd be a hard failure ("no such function").

To be clear, I'm ok with emitting ERROR for something that pg_dump
clearly did wrong, which in this case would be emitting a
set_statistics call for an attribute it had exactly no stats values
for.  What I think needs to be WARN is conditions that the originating
pg_dump couldn't have foreseen, for example cross-version differences.
If we do try to check things like sort order, that complaint obviously
has to be WARN, since it's checking something potentially different
from what was correct at the source server.

>> Perhaps we could
>> invent a new backend function that extracts the actual element type
>> of a non-null anyarray argument.

> A backend function that we can't guarantee exists on the source system. :(

[ shrug... ] If this doesn't work for source servers below v17, that
would be a little sad, but it wouldn't be the end of the world.
I see your point that that is an argument for finding another way,
though.

>> Another way we could get to no-coercions is to stick with your
>> signature but declare the relevant parameters as anyarray instead of
>> text.

> I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

> # select '{1,2,3}'::anyarray;
> ERROR:  cannot accept a value of type anyarray

That's not what I suggested at all.  The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types.  So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.

> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.

Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors.  I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.  My formulation is a bit safer
though in that it's the originating pg_dump, not the receiving server,
that is in charge of saying which type that is.  (If that type doesn't
agree with what the receiving server thinks it should be, that's a
condition that pg_set_attribute_stats itself will detect, and then it
can WARN and move on to the next thing.)

            regards, tom lane



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
> I haven't looked at the details, but I'm really a bit surprised
> by Jeff's assertion that CREATE INDEX destroys statistics on the
> base table.  That seems wrong from here, and maybe something we
> could have it not do.  (I do realize that it recalculates reltuples
> and relpages, but so what?  If it updates those, the results should
> be perfectly accurate.)

In the v15 of the patch I was looking at, "pg_dump -s" included the
statistics. The stats appeared first in the dump, followed by the
CREATE INDEX commands. The latter overwrote the relpages/reltuples set
by the former.

While zeros are the right answers for a schema-only dump, it defeated
the purpose of including relpages/reltuples stats in the dump, and
caused the pg_upgrade TAP test to fail.

You're right that there are a number of ways this could be resolved --
I don't think it's an inherent problem.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Bruce Momjian
Date:
Reality check --- are we still targeting this feature for PG 17?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
>> I haven't looked at the details, but I'm really a bit surprised
>> by Jeff's assertion that CREATE INDEX destroys statistics on the
>> base table.  That seems wrong from here, and maybe something we
>> could have it not do.  (I do realize that it recalculates reltuples
>> and relpages, but so what?  If it updates those, the results should
>> be perfectly accurate.)

> In the v15 of the patch I was looking at, "pg_dump -s" included the
> statistics. The stats appeared first in the dump, followed by the
> CREATE INDEX commands. The latter overwrote the relpages/reltuples set
> by the former.

> While zeros are the right answers for a schema-only dump, it defeated
> the purpose of including relpages/reltuples stats in the dump, and
> caused the pg_upgrade TAP test to fail.

> You're right that there are a number of ways this could be resolved --
> I don't think it's an inherent problem.

I'm inclined to call it not a problem at all.  While I do agree there
are use-cases for injecting false statistics with these functions,
I do not think that pg_dump has to cater to such use-cases.

In any case, I remain of the opinion that stats are data and should
not be included in a -s dump (with some sort of exception for
pg_upgrade).  If the data has been loaded, then a subsequent
overwrite by CREATE INDEX should not be a problem.

            regards, tom lane



Re: Statistics Import and Export

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Reality check --- are we still targeting this feature for PG 17?

I'm not sure.  I think if we put our heads down we could finish
the changes I'm suggesting and resolve the other issues this week.
However, it is starting to feel like the sort of large, barely-ready
patch that we often regret cramming in at the last minute.  Maybe
we should agree that the first v18 CF would be a better time to
commit it.

            regards, tom lane



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
> Reality check --- are we still targeting this feature for PG 17?

I see a few useful pieces here:

1. Support import of statistics (i.e.
pg_set_{relation|attribute}_stats()).

2. Support pg_dump of stats

3. Support pg_upgrade with stats

It's possible that not all of them make it, but let's not dismiss the
entire feature yet.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote:
> Corey Huinker <corey.huinker@gmail.com> writes:
> >> I can't quibble with that view of what has priority.  I'm just
> >> suggesting that redesigning what pg_upgrade does in this area
> >> should come later than doing something about extended stats.
> 
> > I mostly agree, with the caveat that pg_upgrade's existing message saying
> > that optimizer stats were not carried over wouldn't be 100% true anymore.
> 
> I think we can tweak the message wording.  I just don't want to be
> doing major redesign of the behavior, nor adding fundamentally new
> monitoring capabilities.

I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
> What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

Why would you need to overload in this case? Wouldn't we just define a
new function with more optional named parameters?

>   The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

I agree on this point; I found this annoying when testing the feature.

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:

I didn't consider this and I think it has a lot of advantages. It's
slightly unfortunate that we can't make them explicitly name/value
pairs, but pg_dump can use whitespace or even SQL comments to make it
more readable.
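
To make the shape of that calling convention concrete, a hypothetical call might look like the sketch below; the function name matches the one under discussion, but the fixed leading arguments and the stat labels shown are illustrative assumptions, not a committed signature.

-- hypothetical sketch of the alternating label/value variadic style
SELECT pg_set_attribute_stats(
    'public.mytable'::regclass,   -- relation (assumed fixed argument)
    'id'::name,                   -- attribute (assumed fixed argument)
    false,                        -- inherited (assumed fixed argument)
    -- alternating label/value pairs in the variadic tail
    'null_frac', 0.0::real,
    'avg_width', 4,
    'n_distinct', -1.0::real,
    'histogram_bounds', '{1,250,500,1000}'::text);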

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
>> Reality check --- are we still targeting this feature for PG 17?

> I see a few useful pieces here:

> 1. Support import of statistics (i.e.
> pg_set_{relation|attribute}_stats()).

> 2. Support pg_dump of stats

> 3. Support pg_upgrade with stats

> It's possible that not all of them make it, but let's not dismiss the
> entire feature yet.

The unresolved questions largely have to do with the interactions
between these pieces.  I think we would seriously regret setting
any one of them in stone before all three are ready to go.

            regards, tom lane



Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
>> What happens when
>> somebody adds a new stakind (and hence new pg_stats column)?

> Why would you need to overload in this case? Wouldn't we just define a
> new function with more optional named parameters?

Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
That's not what I suggested at all.  The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types.  So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.

+1

 

> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.

Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors.  I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.

+1
 

Re: Statistics Import and Export

From
Corey Huinker
Date:
I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

+1

Re: Statistics Import and Export

From
Corey Huinker
Date:
Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.

So what's the behavior when the user fails to supply a parameter that is currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> So what's the behavior when the user fails to supply a parameter that is
> currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?

I still think that we could just declare the function strict, if we
use the variadic-any approach.  Passing a null in any position is
indisputable caller error.  However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error.  (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)
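
As a tiny, self-contained illustration of the STRICT behavior being weighed here (the function below is a throwaway demo, not part of the patch):

-- a strict function is never invoked when any argument is NULL
CREATE FUNCTION strict_demo(x integer) RETURNS boolean
    LANGUAGE sql STRICT AS $$ SELECT true $$;

SELECT strict_demo(1) AS called, strict_demo(NULL) AS skipped;
-- called = t, skipped = NULL: the second call silently did nothing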

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:

I still think that we could just declare the function strict, if we
use the variadic-any approach.  Passing a null in any position is
indisputable caller error.  However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error.  (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)

Any thoughts about going back to having a return value, so that a caller could then see that the function returned NULL rather than whatever the expected value was (example: TRUE)?

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> Any thoughts about going back to having a return value, so that a caller could then
> see that the function returned NULL rather than whatever the expected value
> was (example: TRUE)?

If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad).  But I'm not sure that's
really better than a boolean result.  pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad).  But I'm not sure that's
really better than a boolean result.  pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.

A boolean is what we had before, I'm quite comfortable with that, and it addresses my silent-failure concerns.
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> A boolean is what we had before, I'm quite comfortable with that, and it
> addresses my silent-failure concerns.

WFM.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
Here's a one-liner patch for disabling update of pg_class relpages/reltuples/relallvisible during a binary upgrade.

This was causing pg_upgrade tests to fail in the existing stats import work.
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-04-02 at 05:38 -0400, Corey Huinker wrote:
> Here's a one-liner patch for disabling update of pg_class
> relpages/reltuples/relallvisible during a binary upgrade.

This change makes sense to me regardless of the rest of the work.
Updating the relpages/reltuples/relallvisible during pg_upgrade before
the data is there will store the wrong stats.

It could use a brief comment, though.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
I have refactored pg_set_relation_stats to be variadic, and I'm working on pg_set_attribute_stats, but I'm encountering an issue with the anyarray values.

Jeff suggested looking at anyarray_send as a way of extracting the type, and with some extra twiddling we can get and cast the type. However, some of the ANYARRAYs have element types that are themselves arrays, and as near as I can tell, such a construct is not expressible in SQL. So, rather than getting an anyarray of an array type, you instead get an array of one higher dimension. Like so:

# select schemaname, tablename, attname,
         substring(substring(anyarray_send(histogram_bounds) from 9 for 4)::text,2)::bit(32)::integer::regtype,
         substring(substring(anyarray_send(histogram_bounds::text::text[][]) from 9 for 4)::text,2)::bit(32)::integer::regtype
  from pg_stats
  where histogram_bounds is not null
    and tablename = 'pg_proc' and attname = 'proargnames';

 schemaname | tablename |   attname   | substring | substring
------------+-----------+-------------+-----------+-----------
 pg_catalog | pg_proc   | proargnames | text[]    | text
Luckily, passing in such a value would have done all of the element type-checking for us, so we would just need to move the data into an array of one less dimension, typed elem[]. If there's an easy way to do that, I don't know of it.

What remains is just checking the input types against the expected type of the array, stepping down the dimension if need be, and skipping if the type doesn't meet expectations.

 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-04-02 at 12:59 -0400, Corey Huinker wrote:
>  However, some of the ANYARRAYs have element types that are
> themselves arrays, and near as I can tell, such a construct is not
> expressible in SQL. So, rather than getting an anyarray of an array
> type, you instead get an array of one higher dimension.

Fundamentally, you want to recreate the exact same anyarray values on
the destination system as they existed on the source. There's some
complexity to that on both the export side as well as the import side,
but I believe the problems are solvable.

On the export side, the problem is that the element type (and
dimensionality and maybe hasnull) is an important part of the anyarray
value, but it's not part of the output of anyarray_out(). For new
versions, we can add a scalar function that simply outputs the
information we need. For old versions, we can hack it by parsing the
output of anyarray_send(), which contains the information we need
(binary outputs are under-specified, but I believe they are specified
enough in this case). There may be other hacks to get the information
from the older systems; that's just an idea. To get the actual data,
doing histogram_bounds::text::text[] seems to be enough: that seems to
always give a one-dimensional array with element type "text", even if
the element type is an array. (Note: this means we need the function's
API to also include this extra information about the anyarray values,
so it might be slightly more complex than name/value pairs).
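
A minimal sketch of that export-side cast, runnable on any database that has statistics:

-- histogram_bounds comes back as a plain 1-D text array, whatever the
-- underlying element type of the column happens to be
SELECT schemaname, tablename, attname,
       histogram_bounds::text::text[] AS bounds_as_text
  FROM pg_stats
 WHERE histogram_bounds IS NOT NULL
 LIMIT 1;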

On the import side, the problem is that there may not be an input
function to go from a 1-D array of text to a 1-D array of any element
type we want. For example, there's no input function that will create a
1-D array with element type float4[] (that's because Postgres doesn't
really have arrays-of-arrays, it has multi-dimensional arrays).
Instead, don't use the input function, pass each element of the 1-D
text array to the element type's input function (which may be scalar or
not) and then construct a 1-D array out of that with the appropriate
element type (which may be scalar or not).

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On the export side, the problem is that the element type (and
> dimensionality and maybe hasnull) is an important part of the anyarray
> value, but it's not part of the output of anyarray_out(). For new
> versions, we can add a scalar function that simply outputs the
> information we need. For old versions, we can hack it by parsing the
> output of anyarray_send(), which contains the information we need
> (binary outputs are under-specified, but I believe they are specified
> enough in this case).

Yeah, I was thinking yesterday about pulling the anyarray columns in
binary and looking at the header fields.  However, I fear there is a
showstopper problem: anyarray_send will fail if the element type
doesn't have a typsend function, which is entirely possible for
user-defined types (and I'm not even sure we've provided them for
every type in the core distro).  I haven't thought of a good answer
to that other than a new backend function.  However ...

> On the import side, the problem is that there may not be an input
> function to go from a 1-D array of text to a 1-D array of any element
> type we want. For example, there's no input function that will create a
> 1-D array with element type float4[] (that's because Postgres doesn't
> really have arrays-of-arrays, it has multi-dimensional arrays).
> Instead, don't use the input function, pass each element of the 1-D
> text array to the element type's input function (which may be scalar or
> not) and then construct a 1-D array out of that with the appropriate
> element type (which may be scalar or not).

Yup.  I had hoped that we could avoid doing any array-munging inside
pg_set_attribute_stats, but this array-of-arrays problem seems to
mean we have to.  In turn, that means that the whole idea of
declaring the function inputs as anyarray rather than text[] is
probably pointless.  And that means that we don't need the sending
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

Variadic is nearly done. This issue was the main blocking point. I can go back to array_in() as we know that code works.


 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-04-02 at 17:31 -0400, Tom Lane wrote:
> And that means that we don't need the sending
> side to know the element type anyway.

We need to get the original element type on the import side somehow,
right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
element type "int4" or "text", which affects the binary representation
of the anyarray value in pg_statistic.

Either we need to get it at export time (which seems the most reliable
in principle, but problematic for older versions) and pass it as an
argument to pg_set_attribute_stats(); or we need to derive it reliably
from the table schema on the destination side, right?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> We need to get the original element type on the import side somehow,
> right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
> element type "int4" or "text", which affects the binary representation
> of the anyarray value in pg_statistic.

Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.
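
For context, the raw material that code has to reconstruct looks something like this (pg_statistic is readable only by superusers; the table and column chosen here are arbitrary):

-- each stakindN slot says which statistics kind the matching stavaluesN /
-- stanumbersN pair holds; the import side has to rebuild those anyarrays
SELECT stakind1, stakind2, stakind3,
       stavalues1 IS NOT NULL AS has_stavalues1
  FROM pg_statistic s
  JOIN pg_attribute a ON a.attrelid = s.starelid AND a.attnum = s.staattnum
 WHERE s.starelid = 'pg_class'::regclass
   AND a.attname = 'relname';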

> Either we need to get it at export time (which seems the most reliable
> in principle, but problematic for older versions) and pass it as an
> argument to pg_set_attribute_stats(); or we need to derive it reliably
> from the table schema on the destination side, right?

We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:
Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.

Correct. It borrows a lot from examine_attribute() and the *_typanalyze() functions. Actually using VacAttrStats proved problematic, but that can be revisited at some point.
 
We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

+1
 
I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

Sadly true.

Re: Statistics Import and Export

From
Corey Huinker
Date:
v16 attached.

- both functions now use variadics for anything that can be considered a stat.
- most consistency checks removed, null element tests remain
- functions strive to not ERROR unless absolutely necessary. The biggest exposure is the call to array_in().
- docs have not yet been updated, pending general acceptance of the variadic over the named arg version.

Having variadic arguments is definitely a bit more work to manage, and the shift from ERROR to WARN removes a lot of the easy exits the code previously had, as well as requiring some extra type checking that we got for free with fixed arguments. Still, I don't think the readability suffers too much, and we are now able to support downgrades as well as upgrades.

Attachment

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> - functions strive to not ERROR unless absolutely necessary. The biggest
> exposure is the call to array_in().

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)
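
For reference, the SQL-visible face of that soft-error machinery (available since v16) reports an input-function failure instead of raising it:

-- array_in exercised in soft-error mode: the bad element is reported, not thrown
SELECT message, sql_error_code
  FROM pg_input_error_info('{1,2,banana}', 'int4[]');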

There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)

Ah, my understanding may be out of date. I was under the impression that that mechanism relied on the cooperation of the per-element input function, so even if we got all the builtin datatypes to play nice with *Safe(), we were always going to be at risk with a user-defined input function.
 
There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.

It would help us tailor the user experience. Right now we have several endgames. To recap:

1. NULL input => Return NULL (because strict).
2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR (thus ruining binary upgrade)
3. Call values are so bad (examples: attname not found, required stat missing) that nothing can recover => WARN, return FALSE.
4. At least one stakind-stat is wonky (impossible for datatype, missing stat pair, wrong type on input parameter), but that's the worst of it => 1 to N WARNs, write stats that do make sense, return TRUE.
5. Hunky-dory. => No warns. Write all stats. return TRUE.

Which of those seem like good ereturn candidates to you?

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
>> As far as that goes, it shouldn't be that hard to deal with, at least
>> not for "soft" errors which hopefully cover most input-function
>> failures these days.  You should be invoking array_in via
>> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
>> (Look at pg_input_error_info() for useful precedent.)

> Ah, my understanding may be out of date. I was under the impression that
> that mechanism relied on the cooperation of the per-element input
> function, so even if we got all the builtin datatypes to play nice with
> *Safe(), we were always going to be at risk with a user-defined input
> function.

That's correct, but it's silly not to do what we can.  Also, I imagine
that there is going to be high evolutionary pressure on UDTs to
support soft error mode for COPY, so over time the problem will
decrease --- as long as we invoke the soft error mode.

> 1. NULL input =>  Return NULL. (because strict).
> 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
> (thus ruining binary upgrade)
> 3. Call values are so bad (examples: attname not found, required stat
> missing) that nothing can recover => WARN, return FALSE.
> 4. At least one stakind-stat is wonky (impossible for datatype, missing
> stat pair, wrong type on input parameter), but that's the worst of it => 1
> to N WARNs, write stats that do make sense, return TRUE.
> 5. Hunky-dory. => No warns. Write all stats. return TRUE.

> Which of those seem like good ereturn candidates to you?

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

            regards, tom lane



Re: Statistics Import and Export

From
Michael Paquier
Date:
On Mon, Apr 01, 2024 at 01:21:53PM -0400, Tom Lane wrote:
> I'm not sure.  I think if we put our heads down we could finish
> the changes I'm suggesting and resolve the other issues this week.
> However, it is starting to feel like the sort of large, barely-ready
> patch that we often regret cramming in at the last minute.  Maybe
> we should agree that the first v18 CF would be a better time to
> commit it.

There are still 4 days remaining, so there's still time, but my
overall experience on the matter with my RMT hat on is telling me that
we should not rush this patch set.  Redesigning portions close to the
end of a dev cycle is not a good sign, I am afraid, especially if the
sub-parts of the design don't fit well in the global picture as that
could mean more maintenance work on stable branches in the long term.
Still, it is very good to be aware of the problems because you'd know
what to tackle to reach the goals of this patch set.
--
Michael

Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

                        regards, tom lane

v17

0001
- array_in now repackages cast errors as warnings and skips the stat, test added
- version parameter added, though it's mostly for future compatibility, tests modified
- both functions delay object/attribute locking until absolutely necessary
- general cleanup

0002
- added version parameter to dumps
- --schema-only will not dump stats unless in binary upgrade mode
- stats are dumped SECTION_NONE
- general cleanup

I think that covers the outstanding issues. 
Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
For a partitioned table this value has to be true. For a normal table, when setting this value to true, the function should at least make sure that the table has at least one child; otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. The prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:


On Fri, Apr 5, 2024 at 7:00 AM Corey Huinker <corey.huinker@gmail.com> wrote:
For a partitioned table this value has to be true. For a normal table, when setting this value to true, the function should at least make sure that the table has at least one child; otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. The prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

I read that discussion, and it may be ok for the pg_upgrade/pg_dump use case, and maybe also for IMPORT FOREIGN SCHEMA, where the SQL is generated by PostgreSQL itself. But not for simulating statistics. In that case, if the function happily installs statistics cooked by the user and those aren't used anywhere, users may be misled by the plans that are generated subsequently, thus negating the very purpose of simulating statistics. Once the feature is out there, we won't be able to restrict its usage unless we document the possible anomalies.

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.

I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend.  Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.

Note that I don't think pg_dump or pg_upgrade need to support
injection of counterfactual statistics.  But direct calls of the
stats insertion functions should be able to do so.

            regards, tom lane



Re: Statistics Import and Export

From
Ashutosh Bapat
Date:


On Fri, Apr 5, 2024 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.

I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend.  Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.

My reply was in the following context
For a partitioned table this value has to be true. For a normal table, when setting this value to true, the function should at least make sure that the table has at least one child; otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. The prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

If a user installs inherited stats for a non-inherited table by accidentally passing true to the corresponding argument, those stats won't even be used. The user wouldn't know that those stats are not used, yet they would think that any change in the plans is the result of their stats. So whatever simulation experiment they are running would lead to wrong conclusions. This could be easily avoided by raising an error. Similarly for installing non-inherited stats for a partitioned table. There might be other scenarios where the error won't be required.

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-04-04 at 00:30 -0400, Corey Huinker wrote:
>
> v17
>
> 0001
> - array_in now repackages cast errors as warnings and skips the stat,
> test added
> - version parameter added, though it's mostly for future
> compatibility, tests modified
> - both functions delay object/attribute locking until absolutely
> necessary
> - general cleanup
>
> 0002
> - added version parameter to dumps
> - --schema-only will not dump stats unless in binary upgrade mode
> - stats are dumped SECTION_NONE
> - general cleanup
>
> I think that covers the outstanding issues. 

Thank you, this has improved a lot and the fundamentals are very close.

I think it could benefit from a bit more time to settle on a few
issues:

1. SECTION_NONE. Conceptually, stats are more like data, and so
intuitively I would expect this in the SECTION_DATA or
SECTION_POST_DATA. However, the two most important use cases (in my
opinion) don't involve dumping the data: pg_upgrade (data doesn't come
from the dump) and planner simulations/repros. Perhaps the section we
place it in is not a critical decision, but we will need to stick with
it for a long time, and I'm not sure that we have consensus on that
point.

2. We changed the stats import function API to be VARIADIC very
recently. After we have a bit of time to think on it, I'm not 100% sure
we will want to stick with that new API. It's not easy to document,
which is something I always like to consider.

3. The error handling also changed recently to change soft errors (i.e.
type input errors) to warnings. I like this change but I'd need a bit
more time to get comfortable with how this is done, there is not a lot
of precedent for doing this kind of thing. This is connected to the
return value, as well as the machine-readability concern that Magnus
raised.

Additionally, a lot of people are simply very busy around this time,
and may not have had a chance to opine on all the recent changes yet.

Regards,
    Jeff Davis





Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Thank you, this has improved a lot and the fundamentals are very close.
> I think it could benefit from a bit more time to settle on a few
> issues:

Yeah ... it feels like we aren't quite going to manage to get this
over the line for v17.  We could commit with the hope that these
last details will get sorted later, but that path inevitably leads
to a mess.

> 1. SECTION_NONE. Conceptually, stats are more like data, and so
> intuitively I would expect this in the SECTION_DATA or
> SECTION_POST_DATA. However, the two most important use cases (in my
> opinion) don't involve dumping the data: pg_upgrade (data doesn't come
> from the dump) and planner simulations/repros. Perhaps the section we
> place it in is not a critical decision, but we will need to stick with
> it for a long time, and I'm not sure that we have consensus on that
> point.

I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

> 2. We changed the stats import function API to be VARIADIC very
> recently. After we have a bit of time to think on it, I'm not 100% sure
> we will want to stick with that new API. It's not easy to document,
> which is something I always like to consider.

Perhaps.  I think the argument of wanting to be able to salvage
something even in the presence of unrecognized stats types is
stronger, but I agree this could use more time in the oven.
Unlike many other things in this patch, this would be nigh
impossible to reconsider later.

> 3. The error handling also changed recently to change soft errors (i.e.
> type input errors) to warnings. I like this change but I'd need a bit
> more time to get comfortable with how this is done, there is not a lot
> of precedent for doing this kind of thing.

I don't think there's much disagreement that that's the right thing,
but yeah there could be bugs or some more to do in this area.

            regards, tom lane



Re: Statistics Import and Export

From
Corey Huinker
Date:


I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Sat, Apr 6, 2024 at 5:23 PM Corey Huinker <corey.huinker@gmail.com> wrote:


I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.

At the request of a few people, attached is an attempt to move stats to DATA/POST-DATA, and the TAP test failure that results from that.

The relevant errors are confusing, in that they all concern GRANT/REVOKE even though I made no changes to the TAP test itself.

$ grep 'not ok' build/meson-logs/testlog.txt
not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9348 - section_data: should not dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9349 - section_data: should not dump GRANT SELECT ON TABLE measurement
not ok 9350 - section_data: should not dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9351 - section_data: should not dump GRANT SELECT ON TABLE test_table
not ok 9379 - section_data: should not dump REVOKE SELECT ON TABLE pg_proc FROM public
not ok 9788 - section_pre_data: should dump CREATE TABLE test_table
not ok 9837 - section_pre_data: should dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9838 - section_pre_data: should dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9839 - section_pre_data: should dump GRANT SELECT ON TABLE measurement
not ok 9840 - section_pre_data: should dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9841 - section_pre_data: should dump GRANT SELECT ON TABLE test_table
not ok 9869 - section_pre_data: should dump REVOKE SELECT ON TABLE pg_proc FROM public
 
Attachment

Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Thu, Apr 11, 2024 at 03:54:07PM -0400, Corey Huinker wrote:
> At the request of a few people, attached is an attempt to move stats to
> DATA/POST-DATA, and the TAP test failure that results from that.
> 
> The relevant errors are confusing, in that they all concern GRANT/REVOKE,
> and the fact that I made no changes to the TAP test itself.
> 
> $ grep 'not ok' build/meson-logs/testlog.txt
> not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE
> test_second_table

It looks like the problem is that the ACLs are getting dumped in the data
section when we are also dumping stats.  I'm able to get the tests to pass
by moving the call to dumpRelationStats() that's in dumpTableSchema() to
dumpTableData().  I'm not entirely sure why that fixes it yet, but if we're
treating stats as data, then it intuitively makes sense for us to dump it
in dumpTableData().  However, that seems to prevent the stats from getting
exported in the --schema-only/--binary-upgrade scenario, which presents a
problem for pg_upgrade.  ISTM we'll need some extra hacks to get this to
work as desired.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Wed, 2024-04-17 at 11:50 -0500, Nathan Bossart wrote:
> It looks like the problem is that the ACLs are getting dumped in the data
> section when we are also dumping stats.  I'm able to get the tests to pass
> by moving the call to dumpRelationStats() that's in dumpTableSchema() to
> dumpTableData().  I'm not entirely sure why that fixes it yet, but if we're
> treating stats as data, then it intuitively makes sense for us to dump it
> in dumpTableData().

Would it make sense to have a new SECTION_STATS?

>  However, that seems to prevent the stats from getting
> exported in the --schema-only/--binary-upgrade scenario, which presents a
> problem for pg_upgrade.  ISTM we'll need some extra hacks to get this to
> work as desired.

Philosophically, I suppose stats are data, but I still don't understand
why considering stats to be data is so important in pg_dump.

Practically, I want to dump stats XOR data. That's because, if I dump
the data, it's so costly to reload and rebuild indexes that it's not
very important to avoid a re-ANALYZE.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Would it make sense to have a new SECTION_STATS?

Perhaps, but the implications for pg_dump's API would be nontrivial,
eg would we break any applications that know about the current
options for --section.  And you still have to face up to the question
"does --data-only include this stuff?".

> Philosophically, I suppose stats are data, but I still don't understand
> why considering stats to be data is so important in pg_dump.
> Practically, I want to dump stats XOR data. That's because, if I dump
> the data, it's so costly to reload and rebuild indexes that it's not
> very important to avoid a re-ANALYZE.

Hmm, interesting point.  But the counterargument to that is that
the cost of building indexes will also dwarf the cost of installing
stats, so why not do so?  Loading data without stats, and hoping
that auto-analyze will catch up sooner not later, is exactly the
current behavior that we're doing all this work to get out of.
I don't really think we want it to continue to be the default.

            regards, tom lane



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> Loading data without stats, and hoping
> that auto-analyze will catch up sooner not later, is exactly the
> current behavior that we're doing all this work to get out of.

That's the disconnect, I think. For me, the main reason I'm excited
about this work is as a way to solve the bad-plans-after-upgrade
problem and to repro planner issues outside of production. Avoiding the
need to ANALYZE at the end of a data load is also a nice convenience,
but not a primary driver (for me).

Should we just itemize some common use cases for pg_dump, and then
choose the defaults that are least likely to cause surprise?

As for the section, I'm not sure what to do about that. Based on this
thread it seems that SECTION_NONE (or a SECTION_STATS?) is easiest to
implement, but I don't understand the long-term consequences of that
choice.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
>> Loading data without stats, and hoping
>> that auto-analyze will catch up sooner not later, is exactly the
>> current behavior that we're doing all this work to get out of.

> That's the disconnect, I think. For me, the main reason I'm excited
> about this work is as a way to solve the bad-plans-after-upgrade
> problem and to repro planner issues outside of production. Avoiding the
> need to ANALYZE at the end of a data load is also a nice convenience,
> but not a primary driver (for me).

Oh, I don't doubt that there are use-cases for dumping stats without
data.  I'm just dubious about the reverse.  I think data+stats should
be the default, even if only because pg_dump's default has always
been to dump everything.  Then there should be a way to get stats
only, and maybe a way to get data only.  Maybe this does argue for a
four-section definition, despite the ensuing churn in the pg_dump API.

> Should we just itemize some common use cases for pg_dump, and then
> choose the defaults that are least likely to cause surprise?

Per above, I don't find any difficulty in deciding what should be the
default.  What I think we need to consider is what the pg_dump and
pg_restore switch sets should be.  There's certainly a few different
ways we could present that; maybe we should sketch out the details for
a couple of ways.

            regards, tom lane



Re: Statistics Import and Export

From
Matthias van de Meent
Date:
On Tue, 23 Apr 2024, 05:52 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> >> Loading data without stats, and hoping
> >> that auto-analyze will catch up sooner not later, is exactly the
> >> current behavior that we're doing all this work to get out of.
>
> > That's the disconnect, I think. For me, the main reason I'm excited
> > about this work is as a way to solve the bad-plans-after-upgrade
> > problem and to repro planner issues outside of production. Avoiding the
> > need to ANALYZE at the end of a data load is also a nice convenience,
> > but not a primary driver (for me).
>
> Oh, I don't doubt that there are use-cases for dumping stats without
> data.  I'm just dubious about the reverse.  I think data+stats should
> be the default, even if only because pg_dump's default has always
> been to dump everything.  Then there should be a way to get stats
> only, and maybe a way to get data only.  Maybe this does argue for a
> four-section definition, despite the ensuing churn in the pg_dump API.

I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.

Sure, some planner inputs would have to be taken into account too, but
having an exact copy of production stats is at least a start and can
help build models and alerts for what'll happen when the tables grow
larger with the current stats.

As for other planner inputs: table size is relatively easy to shim
with sparse files; cumulative statistics can be copied from a donor
replica if needed, and btree indexes only really need to
contain their highest and lowest values (and need their height set
correctly).

Kind regards,

Matthias van de Meent



Re: Statistics Import and Export

From
Corey Huinker
Date:
I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.


So far, I'm seeing these use cases:

1. Binary upgrade. (schema: on, data: off, stats: on)
2. Dump to file/dir and restore elsewhere. (schema: on, data: on, stats: on)
3. Dump stats for one or more objects, either to directly apply those stats to a remote database, or to allow a developer to edit/experiment with those stats. (schema: off, data: off, stats: on)
4. restore situations where stats are not wanted and/or not trusted (whatever: on, stats: off)

Case #1 is handled via pg_upgrade and special case flags in pg_dump.
Case #2 uses the default pg_dump options, so that's covered.
Case #3 would require a --statistics-only option mutually exclusive with --data-only and --schema-only. Alternatively, I could reanimate the script pg_export_statistics, but we'd end up duplicating a lot of filtering options that pg_dump already has solved. Similarly, we may want server-side functions that generate the statements for us (pg_get_*_stats paired with each pg_set_*_stats).
Case #4 is handled via --no-statistics.


Attached is v19, which attempts to put table stats in SECTION_DATA and matview/index stats in SECTION_POST_DATA. It's still failing one TAP test (004_pg_dump_parallel: parallel restore as inserts). I'm still unclear as to why using SECTION_NONE is a bad idea, but I'm willing to go along with DATA/POST_DATA, assuming we can make it work.

 
Attachment

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> I've heard of use cases where dumping stats without data would help
> with production database planner debugging on a non-prod system.
> 
> Sure, some planner inputs would have to be taken into account too, but
> having an exact copy of production stats is at least a start and can
> help build models and alerts for what'll happen when the tables grow
> larger with the current stats.
> 
> As for other planner inputs: table size is relatively easy to shim
> with sparse files; cumulative statistics can be copied from a donor
> replica if needed, and btree indexes only really really need to
> contain their highest and lowest values (and need their height set
> correctly).

Is it possible to prevent stats from being updated by autovacuum and
other methods?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Matthias van de Meent
Date:
On Wed, 24 Apr 2024 at 21:31, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> > I've heard of use cases where dumping stats without data would help
> > with production database planner debugging on a non-prod system.
> >
> > Sure, some planner inputs would have to be taken into account too, but
> > having an exact copy of production stats is at least a start and can
> > help build models and alerts for what'll happen when the tables grow
> > larger with the current stats.
> >
> > As for other planner inputs: table size is relatively easy to shim
> > with sparse files; cumulative statistics can be copied from a donor
> > replica if needed, and btree indexes only really really need to
> > contain their highest and lowest values (and need their height set
> > correctly).
>
> Is it possible to prevent stats from being updated by autovacuum

You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until it has had similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.

> and other methods?

No nice ways. AFAIK there is no command (or command sequence) that can
"disable" only ANALYZE and which also guarantee statistics won't be
updated until ANALYZE is manually "re-enabled" for that table. An
extension could maybe do this, but I'm not aware of any extension
points where this would hook into PostgreSQL in a nice way.

You can limit maintenance access on the table to only trusted roles
that you know won't go in and run ANALYZE for those tables, or even
only your superuser (so only they can run ANALYZE, and have them
promise they won't). Alternatively, you can also constantly keep a
lock on the table that conflicts with ANALYZE. The last few are just
workarounds though, and not all something I'd suggest running on a
production database.

Kind regards,

Matthias van de Meent



Re: Statistics Import and Export

From
Corey Huinker
Date:

You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until it has had similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.

I'd be a bit afraid to set those values so high, for fear that they wouldn't get reset when normal operations resumed, and nobody would notice until things got bad.

v20 is attached. It resolves the dependency issue in v19, so while I'm still unclear as to why we want it this way vs the simplicity of SECTION_NONE, I'm going to roll with it.

Next up for discussion is how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ: correlation / etc.

One wrinkle with having three mutually exclusive options instead of two is that the existing code was able to assume that one of the options being true meant that we could bail out of certain dumpXYZ() functions, and now those tests have to compare against two, which makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?
Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:

Next up for discussion is how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ: correlation / etc.

One wrinkle with having three mutually exclusive options instead of two is that the existing code was able to assume that one of the options being true meant that we could bail out of certain dumpXYZ() functions, and now those tests have to compare against two, which makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?

v21 attached.

0001 is the same.

0002 is a preparatory change to pg_dump introducing DumpOption/RestoreOption variables dumpSchema and dumpData. The current code makes heavy use of the fact that schemaOnly and dataOnly are mutually exclusive and logically opposite. That will not be the case when statisticsOnly is introduced, so I decided to add the new variables whose value is entirely derivative of the existing command flags, but resolves the complexities of those interactions in one spot, as those complexities are about to jump with the new options.

0003 is the statistics changes to pg_dump, adding the options -X / --statistics-only, and the derivative boolean statisticsOnly. The -P option is already used by pg_restore, so instead I chose -X because its passing resemblance to chi, as in the chi-square statistics test, makes it vaguely statistics-ish. If someone has a better letter, I'm listening.

With that change, people should be able to use pg_dump -X --table=foo to dump existing stats for a table and its dependent indexes, and then tweak those calls to do tuning work. Have fun with it. If this becomes a common use-case then it may make sense to get functions to fetch relation/attribute stats for a given relation, either as a formed SQL statement or as the parameter values.
 
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-05-06 at 23:43 -0400, Corey Huinker wrote:
>
> v21 attached.
>
> 0003 is the statistics changes to pg_dump, adding the options -X / --
> statistics-only, and the derivative boolean statisticsOnly. The -P
> option is already used by pg_restore, so instead I chose -X because
> of the passing resemblance to Chi as in the chi-square statistics
> test makes it vaguely statistics-ish. If someone has a better letter,
> I'm listening.
>
> With that change, people should be able to use pg_dump -X --table=foo
> to dump existing stats for a table and its dependent indexes, and
> then tweak those calls to do tuning work. Have fun with it. If this
> becomes a common use-case then it may make sense to get functions to
> fetch relation/attribute stats for a given relation, either as a
> formed SQL statement or as the parameter values.

Can you explain what you did with the
SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
Can you explain what you did with the
SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?

Initially, I got things to work by having statistics import behave like COMMENTs, which meant that they were run immediately after the table/matview/index/constraint that created the pg_class/pg_attribute entries, but they could be suppressed with a --noX flag

Per previous comments, it was suggested by others that:

- having them in SECTION_NONE was a grave mistake
- Everything that could belong in SECTION_DATA should, and the rest should be in SECTION_POST_DATA
- This would almost certainly require the statistics import commands to be TOC objects (one object per pg_class entry, not one object per function call)

Turning them into TOC objects was a multi-phase process.

1. the TOC entries are generated with dependencies (the parent pg_class object as well as the potential unique/pk constraint in the case of indexes), but no statements are generated (in case the stats are filtered out or the parent object is filtered out). This TOC entry must have everything we'll need to later generate the function calls. So far, that information is the parent name, parent schema, and relkind of the parent object.

2. The TOC entries get sorted by dependencies, and additional dependencies are added which enforce the PRE/DATA/POST boundaries. This is where knowing the parent object's relkind is required, as that determines the DATA/POST section.

3. Now the TOC entry is able to stand on its own, and generate the statements if they survive the dump/restore filters. Most of the later versions of the patch were efforts to get the objects to fall into the right PRE/DATA/POST sections, and the central bug was that the dependencies passed into ARCHIVE_OPTS were incorrect, as the dependent object passed in was now the new TOC object, not the parent TOC object. Once that was resolved, things fell into place.
 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
>
> Per previous comments, it was suggested by others that:
>
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA

I don't understand the gravity of the choice here: what am I missing?

To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
On Thu, May 16, 2024 at 2:26 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
>
> Per previous comments, it was suggested by others that:
>
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA

I don't understand the gravity of the choice here: what am I missing?

To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?

I'm with you, I don't understand the choice and would like to, but at the same time it now works in the way others strongly suggested that it should, so I'm still curious about the why.

There were several people expressing interest in this patch at pgconf.dev, so I thought I'd post a rebase and give a summary of things to date.

THE INITIAL GOAL

The initial goal of this effort was to reduce upgrade downtimes by eliminating the need for the vacuumdb --analyze-in-stages call that is recommended (but often not done) after a pg_upgrade. The analyze-in-stages step is usually by far the longest part of a binary upgrade and is a significant part of a restore from dump, so eliminating this step will save users time, and eliminate or greatly reduce a potential pitfall to upgrade...and thus reduce upgrade friction (read: excuses to not upgrade).

THE FUNCTIONS

These patches introduce two functions, pg_set_relation_stats() and pg_set_attribute_stats(), which allow the caller to modify the statistics of any relation, provided that they own that relation or have maintainer privilege.

The function pg_set_relation_stats looks like this:

SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
                             150000::integer,
                             'relpages', 17::integer,
                             'reltuples', 400.0::real,
                             'relallvisible', 4::integer);

The function takes an oid of the relation to have stats imported, a version number (SERVER_VERSION_NUM) for the source of the statistics, and then a series of varargs organized as name-value pairs. Currently, three arg pairs (relpages, reltuples, and relallvisible) are required for the row to be properly set. If all three are not present, the function will issue a warning, and the row will not be updated.

The choice of varargs is a defensive one, basically ensuring that a pgdump that includes statistics import calls will not fail on a future version that does not have one or more of these values. The call itself would fail to modify the relation row, but it wouldn't cause the whole restore to fail. I'm personally not against having a fixed arg version of this function, nor am I against having both at the same time, the varargs version basically teeing up the fixed-param call appropriate for the destination server version.

This function does an in-place update of the pg_class row to avoid bloating pg_class, just like ANALYZE does. This means that this function call is NON-transactional.

The function pg_set_attribute_stats looks like this:

SELECT pg_catalog.pg_set_attribute_stats(
    'stats_export_import.test'::regclass,
    'id'::name,
    false::boolean,
    150000::integer,
    'null_frac', 0.5::real,
    'avg_width', 2::integer,
    'n_distinct', -0.1::real,
    'most_common_vals', '{2,1,3}'::text,
    'most_common_freqs', '{0.3,0.25,0.05}'::real[]
    );

Like the first function, it takes a relation oid and a source server version though that is in the 4th position. It also takes the name of an attribute, and a boolean as to whether these stats are for inherited statistics (true) or regular (false). Again what follows is a vararg list of name-value pairs, each name corresponding to an attribute of pg_stats, and expecting a value appropriate for said attribute of pg_stats. Note that ANYARRAY values are passed in as text. This is done for a few reasons. First, if the attribute is an array type, then the most_common_elements value will be an array of that array type, and there is no way to represent that in SQL (it instead gives a higher order array of the same base type). Second, it allows us to import the values with a simple array_in() call. Last, it allows for situations where the type name changed from source system to destination (example: a schema-qualified extension type gets moved to core).

There are lots of ways that this function call can go wrong. An invalid attribute name, an invalid parameter name in a name-value pair, invalid data type of parameter being passed in the value of a name-value pair, or type coercion errors in array_in() to name just a few. All of these errors result in a warning and the import failing, but the function completes normally. Internal typecasting and array_in are all done with the _safe() equivalents, and any such errors are re-emitted as warnings. The central goal here is to not make a restore fail just because the statistics are wonky.

Calls to pg_set_attribute_stats() are transactional. This wouldn't warrant mentioning if not for pg_set_relation_stats() being non-transactional.

DUMP / RESTORE / UPGRADE

The code for pg_dump/restore/upgrade has been modified to allow for statistics to be exported/imported by default. There are flags to prevent this (--no-statistics) and there are flags to ONLY do statistics (--statistics-only), the utility of which will be discussed later.

pg_dump will make queries of the source database, adjusting the syntax to reflect the version of the source system. There is very little variance in those queries, so it should be possible to query as far back as 9.2 and get usable stats. The output of these calls will be a series of SELECT statements, each one making a call to either pg_set_relation_stats (one per table/index/matview) or pg_set_attribute_stats (one per attribute that had a matching pg_statistic row).

The positioning of these calls in the restore sequence was originally set up as SECTION_NONE, but it was strongly suggested that SECTION_DATA / SECTION_POST_DATA was the right spot instead, and that's where they currently reside.

The end result will be that the new database now has the stats identical (or at least close to) the source system. Those statistics might be good or bad, but they're almost certainly better than no stats at all. Even if they are bad, they will be overwritten by the next ANALYZE or autovacuum.


WHAT IS NOT DONE

1. Extended Statistics, which are considerably more complex than regular stats (stxdexprs is itself an array of pg_statistic rows) and thus more difficult to express in a simple function call. They are also used fairly rarely in customer installations, so leaving them out of the v1 patch seemed like an easy trade-off.

2. Any sort of validity checking beyond data-types. This was initially provided, verifying that array values representing frequencies must be between 0.0 and 1.0, that arrays representing most common value frequencies must be in monotonically non-increasing order, etc., but these were rejected as being overly complex, potentially rejecting valid stats, and getting in the way of another use I hadn't considered.

3. Export functions. Strictly speaking we don't need them, but some use-cases described below may make the case for including them.

OTHER USES

Usage of these functions is not restricted to upgrade/restore situations. The most obvious use was to experiment with how the planner behaves when one or more tables grow and/or skew. It is difficult to create a table with 10 billion rows in it, but it's now trivial to create a table that says it has 10 billion rows in it.
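
For example, with the function signature as it currently stands in this patch series, a sketch of faking a much larger table might look like the following (the table name and numbers are made up, and the signature may still change):

SELECT pg_set_relation_stats('big_table'::regclass,
                             150000::integer,
                             'relpages', 50000000::integer,
                             'reltuples', 10000000000::real,
                             'relallvisible', 50000000::integer);

After that, EXPLAIN against big_table should plan as if the table held 10 billion rows, at least until the next ANALYZE or autovacuum overwrites the stats.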

This can be taken a step further, and in a way I had not anticipated - actively stress-testing the planner by inserting wildly incorrect and/or nonsensical stats. In that sense, these functions are a fuzzing tool that happens to make upgrades go faster.

FUTURE PLANS

Integration with postgres_fdw is an obvious next step, allowing an ANALYZE on a foreign table to, instead of asking for a remote row sample, to simply export the stats of the remote table and import them into the foreign table.

Extended Statistics.

CURRENT PROGRESS

I believe that all outstanding questions/requests have been addressed, and the patch is now back to needing a review.

FOR YOUR CONSIDERATION

Rebase (current as of f04d1c1db01199f02b0914a7ca2962c531935717) attached.
Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:
v23:

Split pg_set_relation_stats into two functions: pg_set_relation_stats with named parameters like it had around v19 and pg_restore_relation_stats with the variadic parameters it has had in more recent versions, which processes the variadic parameters and then makes a call to pg_set_relation_stats.

Split pg_set_attribute_stats into two functions: pg_set_attribute_stats with named parameters like it had around v19 and pg_restore_attribute_stats with the variadic parameters it has had in more recent versions, which processes the variadic parameters and then makes a call to pg_set_attribute_stats.

The intention here is that the named parameters signatures are easier for ad-hoc use, while the variadic signatures are evergreen and thus ideal for pg_dump/pg_upgrade.

rebased to a0a5869a8598cdeae1d2f2d632038d26dcc69d19 (master as of early July 18)

Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-07-18 at 02:09 -0400, Corey Huinker wrote:
> v23:
>
> Split pg_set_relation_stats into two functions: pg_set_relation_stats
> with named parameters like it had around v19 and
> pg_restore_relations_stats with the variadic parameters it has had in
> more recent versions, which processes the variadic parameters and
> then makes a call to pg_set_relation_stats.
>
> Split pg_set_attribute_stats into two functions:
> pg_set_attribute_stats with named parameters like it had around v19
> and pg_restore_attribute_stats with the variadic parameters it has
> had in more recent versions, which processes the variadic parameters
> and then makes a call to pg_set_attribute_stats.
>
> The intention here is that the named parameters signatures are easier
> for ad-hoc use, while the variadic signatures are evergreen and thus
> ideal for pg_dump/pg_upgrade.

v23-0001:

* I like the split for the reason you mention. I'm not 100% sure that
we need both, but from the standpoint of reviewing, it makes things
easier. We can always remove one at the last minute if its found to be
unnecessary. I also like the names.

* Doc build error and malformatting.

* I'm not certain that we want all changes to relation stats to be non-
transactional. Are there transactional use cases? Should it be an
option? Should it be transactional for pg_set_relation_stats() but non-
transactional for pg_restore_relation_stats()?

* The documentation for the pg_set_attribute_stats() still refers to
upgrade scenarios -- shouldn't that be in the
pg_restore_attribute_stats() docs? I imagine the pg_set variant to be
used for ad-hoc planner stuff rather than upgrades.

* For the "WARNING: stat names must be of type text" I think we need an
ERROR instead. The calling convention of name/value pairs is broken and
we can't safely continue.

* The huge list of "else if (strcmp(statname, mc_freqs_name) == 0) ..."
seems wasteful and hard to read. I think we already discussed this,
what was the reason we can't just use an array to map the arg name to
an arg position type OID?

* How much error checking did we decide is appropriate? Do we need to
check that range_length_hist is always specified with range_empty_frac,
or should we just call that the planner's problem if one is specified
and the other not? Similarly, range stats for a non-range type.

* I think most of the tests should be of pg_set_*_stats(). For
pg_restore_, we just want to know that it's translating the name/value
pairs reasonably well and throwing WARNINGs when appropriate. Then, for
pg_dump tests, it should exercise pg_restore_*_stats() more completely.

* It might help to clarify which arguments are important (like
n_distinct) vs not. I assume the difference is that it's a non-NULLable
column in pg_statistic.

* Some arguments, like the relid, just seem absolutely required, and
it's weird to just emit a WARNING and return false in that case.

* To clarify: a return of "true" means all settings were successfully
applied, whereas "false" means that some were applied and some were
unrecognized, correct? Or does it also mean that some recognized
options may not have been applied?

* pg_set_attribute_stats(): why initialize the output tuple nulls array
to false? It seems like initializing it to true would be safer.

* please use a better name for "k" and add some error checking to make
sure it doesn't overrun the available slots.

* the pg_statistic tuple is always completely replaced, but the way you
can call pg_set_attribute_stats() doesn't imply that -- calling
pg_set_attribute_stats(..., most_common_vals => ..., most_common_freqs
=> ...) looks like it would just replace the most_common_vals+freqs and
leave histogram_bounds as it was, but it actually clears
histogram_bounds, right? Should we make that work or should we document
better that it doesn't?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

* Doc build error and malformatting.

Looking into it.
 
* I'm not certain that we want all changes to relation stats to be non-
transactional. Are there transactional use cases? Should it be an
option? Should it be transactional for pg_set_relation_stats() but non-
transactional for pg_restore_relation_stats()?

It's non-transactional because that's how ANALYZE does it to avoid bloating pg_class. We _could_ do it transactionally, but on restore we'd immediately have a pg_class that was 50% bloat.
 

* The documentation for the pg_set_attribute_stats() still refers to
upgrade scenarios -- shouldn't that be in the
pg_restore_attribute_stats() docs? I imagine the pg_set variant to be
used for ad-hoc planner stuff rather than upgrades.

Noted.
 

* For the "WARNING: stat names must be of type text" I think we need an
ERROR instead. The calling convention of name/value pairs is broken and
we can't safely continue.

They can't be errors, because any one error fails the whole pg_upgrade.
 
* The huge list of "else if (strcmp(statname, mc_freqs_name) == 0) ..."
seems wasteful and hard to read. I think we already discussed this,
what was the reason we can't just use an array to map the arg name to
an arg position type OID?

That was my overreaction to the dislike that the P_argname enum got in previous reviews.

We'd need an array of struct like

argname (e.g. "mc_vals")
argtypeoid (one of: int, text, real, real[])
argtypename (the name we want to call the argtypeoid; integer, text, real, real[] about covers it)
argpos (position in the arg list of the corresponding pg_set_ function)
 

* How much error checking did we decide is appropriate? Do we need to
check that range_length_hist is always specified with range_empty_frac,
or should we just call that the planner's problem if one is specified
and the other not? Similarly, range stats for a non-range type.

I suppose we can let that go, and leave incomplete stat pairs in there.

The big risk is that somebody packs the call with more than 5 statkinds, which would overflow the struct.
 
* I think most of the tests should be of pg_set_*_stats(). For
pg_restore_, we just want to know that it's translating the name/value
pairs reasonably well and throwing WARNINGs when appropriate. Then, for
pg_dump tests, it should exercise pg_restore_*_stats() more completely.

I was afraid you'd suggest that, in which case I'd break up the patch into the pg_sets and the pg_restores.
 
* It might help to clarify which arguments are important (like
n_distinct) vs not. I assume the difference is that it's a non-NULLable
column in pg_statistic.

There are NOT NULL stats...now. They might not be in the future. Does that change your opinion?
 

* Some arguments, like the relid, just seem absolutely required, and
it's weird to just emit a WARNING and return false in that case.

Again, we can't fail. Any one failure breaks pg_upgrade.
 
* To clarify: a return of "true" means all settings were successfully
applied, whereas "false" means that some were applied and some were
unrecognized, correct? Or does it also mean that some recognized
options may not have been applied?

True means "at least some stats were applied." False means "nothing was modified."
 
* pg_set_attribute_stats(): why initialize the output tuple nulls array
to false? It seems like initializing it to true would be safer.
 
+1
 

* please use a better name for "k" and add some error checking to make
sure it doesn't overrun the available slots.

k was an inheritance from analyze.c, from whence the very first version was cribbed. No objection to renaming.

 
* the pg_statistic tuple is always completely replaced, but the way you
can call pg_set_attribute_stats() doesn't imply that -- calling
pg_set_attribute_stats(..., most_common_vals => ..., most_common_freqs
=> ...) looks like it would just replace the most_common_vals+freqs and
leave histogram_bounds as it was, but it actually clears
histogram_bounds, right? Should we make that work or should we document
better that it doesn't?

That would complicate things. How would we intentionally null-out one stat, while leaving others unchanged? However, this points out that I didn't re-instate the re-definition that applied the NULL defaults.

Re: Statistics Import and Export

From
Corey Huinker
Date:
Attached is v24, incorporating Jeff's feedback - the biggest item being looping over an arg data structure rather than individually checking each param type.

v23's part one has been broken into three patches:

* pg_set_relation_stats
* pg_set_attribute_stats
* pg_restore_X_stats

And the two pg_dump-related patches remain unchanged.

I think this split is a net-positive for reviewability. The one drawback is that there's a lot of redundancy in the regression tests now, much of which can go away once we decide what other data problems we don't need to check.
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-07-22 at 12:05 -0400, Corey Huinker wrote:
> Attached is v24, incorporating Jeff's feedback - looping an arg data
> structure rather than individually checking each param type being the
> biggest of them.
>

Thank you for splitting up the patches more finely.

v24-0001:

  * pg_set_relation_stats(): the warning: "cannot export statistics
prior to version 9.2" doesn't make sense because the function is for
importing. Reword.

  * I really think there should be a transactional option, just another
boolean, and if it has a default it should be true. This clearly has
use cases for testing plans, etc., and often transactions will be the
right thing there. This should be a trivial code change, and it will
also be easier to document.

  * The return type is documented as 'void'? Please change to bool and
be clear about what true/false returns really mean. I think false means
"no updates happened at all, and a WARNING was printed indicating why"
whereas true means "all updates were applied successfully".

  * An alternative would be to have an 'error_ok' parameter to say
whether to issue WARNINGs or ERRORs. I think we already discussed that
and agreed on the boolean return, but I just want to confirm that this
was a conscious choice?

  * tests should be called stats_import.sql; there's no exporting going
on

  * Aside from the above comments and some other cleanup, I think this
is a simple patch and independently useful. I am looking to commit this
one soon.

v24-0002:

  * Documented return type is 'void'

  * I'm not totally sure what should be returned in the event that some
updates were applied and some not. I'm inclined to say that true should
mean that all updates were applied -- otherwise it's hard to
automatically detect some kind of typo.

  * Can you describe your approach to error checking? What kinds of
errors are worth checking, and which should we just put into the
catalog and let the planner deal with?

  * I'd check stakindidx at the time that it's incremented rather than
summing boolean values cast to integers.

v24-0003:

  * I'm not convinced that we should continue when a stat name is not
text. The argument for being lenient is that statistics may change over
time, and we might have to ignore something that can't be imported from
an old version into a new version because it's either gone or the
meaning has changed too much. But that argument doesn't apply to a
bogus call, where the name/value pairs get misaligned or something.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
  * pg_set_relation_stats(): the warning: "cannot export statistics
prior to version 9.2" doesn't make sense because the function is for
importing. Reword.

+1
 
  * I really think there should be a transactional option, just another
boolean, and if it has a default it should be true. This clearly has
use cases for testing plans, etc., and often transactions will be the
right thing there. This should be a trivial code change, and it will
also be easier to document.

For it to have a default, the parameter would have to be at the end of the list, and it's a parameter list that will grow in the future. And when that happens we have a jumbled parameter list, which is fine if we only ever call params by name, but I know some people won't do that. Which means it's up front right after `version`. Since `version` is already in there, and we can't default that, I feel ok about moving it there, but alas no default.

If there was some way that the function could detect that it was in a binary upgrade, then we could use that to determine if it should update inplace or transactionally.

  * The return type is documented as 'void'? Please change to bool and
be clear about what true/false returns really mean. I think false means
"no updates happened at all, and a WARNING was printed indicating why"
whereas true means "all updates were applied successfully".

Good point, that's a holdover.
 
  * An alternative would be to have an 'error_ok' parameter to say
whether to issue WARNINGs or ERRORs. I think we already discussed that
and agreed on the boolean return, but I just want to confirm that this
was a conscious choice?

That had been discussed as well. If we're adding parameters, then we could add one for that too. It's making the function call progressively more unwieldy, but anyone who chooses to wield these on a regular basis can certainly write a SQL wrapper function to reduce the function call to their presets, I suppose.
 
  * tests should be called stats_import.sql; there's no exporting going
on

Sigh. True.
 
  * Aside from the above comments and some other cleanup, I think this
is a simple patch and independently useful. I am looking to commit this
one soon.

v24-0002:

  * Documented return type is 'void'

  * I'm not totally sure what should be returned in the event that some
updates were applied and some not. I'm inclined to say that true should
mean that all updates were applied -- otherwise it's hard to
automatically detect some kind of typo.

Me either. Suggestions welcome.

I suppose we could return two integers: the number of stats input, and the number of stats applied. But that could be confusing, as some parameter pairs form one stat (MCV, ELEM_MCV, etc.).

I suppose we could return a set of (param_name text, was_set boolean, applied boolean), without trying to organize them into their pairs, but that would get really verbose.

We should decide on something soon, because we'd want relation stats to follow a similar signature.
 

  * Can you describe your approach to error checking? What kinds of
errors are worth checking, and which should we just put into the
catalog and let the planner deal with?

1. When the parameters given make for something nonsensical, such as providing most_common_elems with no corresponding most_common_freqs, then you can't form an MCV stat, so you must throw out the one you did receive. That gets a warning.

2. When the data provided is antithetical to the type of statistic. For instance, most array-ish parameters can't have NULL values in them (there are separate stats for nulls: null_frac, empty_frac). I don't remember if doing so crashes the server or just creates a hard error, but it's a big no-no, and we have to reject such stats, which for now means a warning and trying to carry on with the stats that remain.

3. When the stats provided would overflow the data structure. We attack this from two directions: First, we eliminate stat kinds that are meaningless for the data type (scalars can't have most-common-elements, only ranges can have range stats, etc), issue warnings for those and move on with the remaining stats. If, however, the number of those statkinds exceeds the number of statkind slots available, then we give up because now we'd have to CHOOSE which N-5 stats to ignore, and the caller is clearly just having fun with us.

We let the planner have fun with other error-like things:

1. most-common-element arrays where the elements are not sorted per spec.

2. frequency histograms where the numbers are not monotonically non-increasing per spec.

3. frequency histograms that have corresponding low bound and high bound values embedded in the array, and the other values in that array must be between the low-high.
 

  * I'd check stakindidx at the time that it's incremented rather than
summing boolean values cast to integers.

Which means that we're checking that and potentially raising the same error in 3-4 places (and growing, unless we raise the max slots), rather than 1. That struck me as worse.
 

v24-0003:

  * I'm not convinced that we should continue when a stat name is not
text. The argument for being lenient is that statistics may change over
time, and we might have to ignore something that can't be imported from
an old version into a new version because it's either gone or the
meaning has changed too much. But that argument doesn't apply to a
bogus call, where the name/value pairs get misaligned or something.

I agree with that.

Re: Statistics Import and Export

From
Corey Huinker
Date:
Giving the parameter lists more thought, and considering both the desire for a return code more granular than true/false/null and the likelihood that each function will inevitably get more parameters, both stats and non-stats, I'm proposing the following:

Two functions:

pg_set_relation_stats(
    out schemaname name,
    out relname name,
    out row_written boolean,
    out params_rejected text[],
    kwargs any[]) RETURNS RECORD

and 

pg_set_attribute_stats(
    out schemaname name,
    out relname name,
    out inherited bool,
    out row_written boolean,
    out params_accepted text[],
    out params_rejected text[],
    kwargs any[]) RETURNS RECORD

The leading OUT parameters tell us the rel/attribute/inh affected (if any), and which params had to be rejected for whatever reason. The kwargs is the variadic key-value pairs that we were using for all stat functions, but now we will be using it for all parameters, both statistics and control. The control parameters will be:

relation - the oid of the relation
attname - the attribute name (does not apply for relstats)
inherited - true/false for attribute stats, defaults to false, does not apply for relstats
warnings - boolean; if supplied AND set to true, then all ERRORs that can be stepped down to WARNINGs will be. This is "binary upgrade mode".
version - the numeric version (a la PG_VERSION_NUM) of the statistics given. If NULL or omitted, assume the current PG_VERSION_NUM of the server.
...followed by the actual stats columns.

This allows casual users to set only the params they want for their needs, and get proper errors, while pg_upgrade can set

'warnings', 'true', 'version', 120034

and get the upgrade behavior we need.
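
To make that concrete, a hypothetical call under this proposal might look like the following (none of this signature was ever committed in exactly this form, so treat it purely as an illustration of mixing control parameters into the kwargs list; the table and values are made up):

SELECT *
FROM pg_set_attribute_stats(
    'relation', 'public.my_table'::regclass,
    'attname', 'id'::name,
    'inherited', false,
    'warnings', true,
    'version', 120034,
    'null_frac', 0.1::real,
    'avg_width', 4::integer,
    'n_distinct', -1.0::real
    );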

 and pg_set_attribute_stats.
    pg_set_relation_stats(out schemaname name, out relname name,, out row_written boolean, out params_entered int, out params_accepted int, kwargs any[]) 



Re: Statistics Import and Export

From
Corey Huinker
Date:

 and pg_set_attribute_stats.
    pg_set_relation_stats(out schemaname name, out relname name,, out row_written boolean, out params_entered int, out params_accepted int, kwargs any[]) 


Oops, didn't hit undo fast enough. Disregard this last bit. 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-07-23 at 17:48 -0400, Corey Huinker wrote:
> Two functions:

I see that you moved back to a combination function to serve both the
"restore" use case as well as the "ad-hoc stats hacking" use case.

The "restore" use case is the primary point of your patch, and that
should be as simple and future-proof as possible. The parameters should
be name/value pairs and there shouldn't be any "control" parameters --
it's not the job of pg_dump to specify whether the restore should be
transactional or in-place, it should just output the necessary stats.

That restore function might be good enough to satisfy the "ad-hoc stats
hacking" use case as well, but I suspect we want slightly different
behavior. Specifically, I think we'd want the updates to be
transactional rather than in-place, or at least optional.

> The leading OUT parameters tell us the rel/attribute/inh affected (if
> any), and which params had to be rejected for whatever reason. The
> kwargs is the variadic key-value pairs that we were using for all
> stat functions, but now we will be using it for all parameters, both
> statistics and control, the control parameters will be:

I don't like the idea of mixing statistics and control parameters in
the same list.

I do like the idea of returning a set, but I think it should be the
positive set (effectively a representation of what is now in the
pg_stats view) and any ignored settings would be output as WARNINGs.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
The "restore" use case is the primary point of your patch, and that
should be as simple and future-proof as possible. The parameters should
be name/value pairs and there shouldn't be any "control" parameters --
it's not the job of pg_dump to specify whether the restore should be
transactional or in-place, it should just output the necessary stats.

That restore function might be good enough to satisfy the "ad-hoc stats
hacking" use case as well, but I suspect we want slightly different
behavior. Specifically, I think we'd want the updates to be
transactional rather than in-place, or at least optional.

Point well taken.

Both function pairs now call a generic internal function.

Which is to say that pg_set_relation_stats and pg_restore_relation_stats both accept parameters in their own way, and both call 
an internal function relation_statistics_update(), each with their own defaults.

pg_set_relation_stats always leaves "version" NULL, does transactional updates, and treats any data quality issue as an ERROR. This is in line with a person manually tweaking stats to check against a query to see if the plan changes.

pg_restore_relation_stats does in-place updates, and steps down all errors to warnings. The stats may not write, but at least it won't fail the pg_upgrade for you.

pg_set_attribute_stats is error-maximalist like pg_set_relation_stats. pg_restore_attribute_stats never had an in-place option to begin with.


 

> The leading OUT parameters tell us the rel/attribute/inh affected (if
> any), and which params had to be rejected for whatever reason. The
> kwargs is the variadic key-value pairs that we were using for all
> stat functions, but now we will be using it for all parameters, both
> statistics and control, the control parameters will be:

I don't like the idea of mixing statistics and control parameters in
the same list.

There's no way around it, but at least now we need never worry about a confusing order for the parameters in the _restore_ functions, because they can now be in any order you like. But that speaks to another point: there is no "you" in using the restore functions; those function calls will almost exclusively be generated by pg_dump, and we can all live rich and productive lives never having seen one written down. I kid, but they're actually not that gross.

Here is a -set function taken from the regression tests:

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_import.test'::regclass::oid,
    attname => 'arange'::name,
    inherited => false::boolean,
    null_frac => 0.5::real,
    avg_width => 2::integer,
    n_distinct => -0.1::real,
    range_empty_frac => 0.5::real,
    range_length_histogram => '{399,499,Infinity}'::text
    );
 pg_set_attribute_stats
------------------------
 
(1 row)


and here is a restore function

-- warning: mcv cast failure
SELECT *
FROM pg_catalog.pg_restore_attribute_stats(
    'relation', 'stats_import.test'::regclass::oid,
    'attname', 'id'::name,
    'inherited', false::boolean,
    'version', 150000::integer,
    'null_frac', 0.5::real,
    'avg_width', 2::integer,
    'n_distinct', -0.4::real,
    'most_common_vals', '{2,four,3}'::text,
    'most_common_freqs', '{0.3,0.25,0.05}'::real[]
    );
WARNING:  invalid input syntax for type integer: "four"
 row_written |          stats_applied           |            stats_rejected            | params_rejected
-------------+----------------------------------+--------------------------------------+-----------------
 t           | {null_frac,avg_width,n_distinct} | {most_common_vals,most_common_freqs} |
(1 row)


There's a few things going on here:

1. An intentionally bad, impossible-to-cast value was put in 'most_common_vals'. 'four' cannot cast to integer, so the value fails, and we get a warning.
2. Because most_common_vals failed, we can no longer construct a legit STAKIND_MCV, so we have to throw out most_common_freqs with it.
3. Those failures aren't enough to prevent us from writing the other stats, so we write the record, and report the row written, the stats we could write, the stats we couldn't, and a list of other parameters we entered that didn't make sense and had to be rejected (empty).

Overall, I'd say the format is on the pedantic side, but it's far from unreadable, and mixing control parameters (version) with stats parameters isn't that big a deal.


I do like the idea of returning a set, but I think it should be the
positive set (effectively a representation of what is now in the
pg_stats view) and any ignored settings would be output as WARNINGs.

Displaying the actual stats in pg_stats could get very, very big. So I wouldn't recommend that.

What do you think of the example presented earlier?

Attached is v25.

Key changes:
- Each set/restore function pair now calls a common function that does the heavy lifting, and the callers mostly marshal parameters into the right spot and form the result set (really just one row).
- The restore functions now have all parameters passed in via a variadic any[].
- The set functions now error out on just about any discrepancy, and do not have a result tuple.
- Test cases simplified a bit. There's still a lot of them, and I think that's a good thing.
- Documentation to reflect significant reorganization.
- pg_dump modified to generate new function signatures.
Attachment

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-07-27 at 21:08 -0400, Corey Huinker wrote:
>
> > I don't like the idea of mixing statistics and control parameters
> > in
> > the same list.
> >
>
>
> There's no way around it, at least now we need never worry about a
> confusing order for the parameters in the _restore_ functions because
> they can now be in any order you like.

Perhaps I was not precise enough when I said "control" parameters.
Mainly what I was worried about is trying to take parameters that
control things like transaction behavior (in-place vs mvcc), and
pg_dump should not be specifying that kind of thing. A parameter like
"version" is specified by pg_dump anyway, so it's probably fine the way
you've done it.

> SELECT pg_catalog.pg_set_attribute_stats(
>     relation => 'stats_import.test'::regclass::oid,
>     attname => 'arange'::name,
>     inherited => false::boolean,
>     null_frac => 0.5::real,
>     avg_width => 2::integer,
>     n_distinct => -0.1::real,
>     range_empty_frac => 0.5::real,
>     range_length_histogram => '{399,499,Infinity}'::text
>     );
>  pg_set_attribute_stats
> ------------------------
>  
> (1 row)

I like it.

> and here is a restore function
>
> -- warning: mcv cast failure
> SELECT *
> FROM pg_catalog.pg_restore_attribute_stats(
>     'relation', 'stats_import.test'::regclass::oid,
>     'attname', 'id'::name,
>     'inherited', false::boolean,
>     'version', 150000::integer,
>     'null_frac', 0.5::real,
>     'avg_width', 2::integer,
>     'n_distinct', -0.4::real,
>     'most_common_vals', '{2,four,3}'::text,
>     'most_common_freqs', '{0.3,0.25,0.05}'::real[]
>     );
> WARNING:  invalid input syntax for type integer: "four"
>  row_written |          stats_applied           |          
>  stats_rejected            | params_rejected
> -------------+----------------------------------+--------------------
> ------------------+-----------------
>  t           | {null_frac,avg_width,n_distinct} |
> {most_common_vals,most_common_freqs} |
> (1 row)

I think I like this, as well, except for the return value, which seems
like too much information and a bit over-engineered. Can we simplify it
to what's actually going to be used by pg_upgrade and other tools?

> Attached is v25.

I believe 0001 and 0002 are in good shape API-wise, and I can start
getting those committed. I will try to clean up the code in the
process.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

> WARNING:  invalid input syntax for type integer: "four"
>  row_written |          stats_applied           |          
>  stats_rejected            | params_rejected
> -------------+----------------------------------+--------------------
> ------------------+-----------------
>  t           | {null_frac,avg_width,n_distinct} |
> {most_common_vals,most_common_freqs} |
> (1 row)

I think I like this, as well, except for the return value, which seems
like too much information and a bit over-engineered. Can we simplify it
to what's actually going to be used by pg_upgrade and other tools?

pg_upgrade currently won't need any of it; it does nothing when a statistics import fails. But it could do *something* based on this information. For example, we might have an option --analyze-tables-that-have-a-statistics-import-failure that analyzes tables with at least one statistic that didn't import. For instance, postgres_fdw may try a stats import first, and if that fails, fall back to a remote table sample.

We could do other things. It seems a shame to just throw away this information when it could potentially be used in the future.
 

> Attached is v25.

I believe 0001 and 0002 are in good shape API-wise, and I can start
getting those committed. I will try to clean up the code in the
process.

:) 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-07-27 at 21:08 -0400, Corey Huinker wrote:
>
> Attached is v25.

I attached new versions of 0001 and 0002. Still working on them, so
these aren't final.

v25j-0001:

  * There seems to be confusion between the relation for which we are
updating the stats, and pg_class. Permissions and ShareUpdateExclusive
should be taken on the former, not the latter. For consistency with
vac_update_relstats(), RowExclusiveLock should be fine on pg_class.
  * Lots of unnecessary #includes were removed.
  * I refactored substantially to do basic checks in the SQL function
pg_set_relation_stats() and make calling the internal function easier.
Similar refactoring might not work for pg_set_attribute_stats(), but
that's OK.
  * You don't need to declare the SQL function signatures. They're
autogenerated from pg_proc.dat into fmgrprotos.h.
  * I removed the inplace stuff for this patch because there's no
coverage for it and it can be easily added back in 0003.
  * I renamed the file to import_stats.c. Annoying to rebase, I know,
but better now than later.

v25j-0002:

  * I just did some minor cleanup on the #includes and rebased it. I
still need to look in more detail.

Regards,
    Jeff Davis


Attachment

Re: Statistics Import and Export

From
Corey Huinker
Date:

function attribute_statistics_update() is significantly shorter. (Thank
you for a good set of tests, by the way, which sped up the refactoring
process.)

yw
 
  * Remind me why the new stats completely replace the new row, rather
than updating only the statistic kinds that are specified?

because:
- complexity
- we would then need a mechanism to tell it to *delete* a stakind
- we'd have to figure out how to reorder the remaining stakinds, or spend effort finding a matching stakind in the existing row to know to replace it
- "do what analyze does" was an initial goal and as a result many test cases directly compared pg_statistic rows from an original table to an empty clone table to see if the "copy" had fidelity.
 
  * I'm not sure what the type_is_scalar() function was doing before,
but I just removed it. If it can't find the element type, then it skips
over the kinds that require it.

That may be sufficient.
 
  * I introduced some hard errors. These happen when it can't find the
table, or the attribute, or doesn't have permissions. I don't see any
reason to demote those to a WARNING. Even for the restore case,
analagous errors happen for COPY, etc.

I can accept that reasoning.
 
  * I'm still sorting through some of the type info derivations. I
think we need better explanations about why it's doing exactly the
things it's doing, e.g. for tsvector and multiranges.

I don't have the specifics of each, but any such cases were derived from similar behaviors in the custom typanalyze functions, and the lack of a custom typanalyze function for a given type was taken as evidence that the type was adequately handled by the default rules. I can see that this is an argument for having a second stats-specific custom typanalyze function for datatypes that need them, but I wasn't ready to go that far myself.
 

Re: Statistics Import and Export

From
jian he
Date:
On Sat, Aug 24, 2024 at 4:50 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
>
> I have attached version 28j as one giant patch covering what was
> previously 0001-0003. It's a bit rough (tests in particular need some
> work), but it implements the logic to replace only those values
> specified rather than the whole tuple.
>
hi.
I did some review for v28j

git am shows some whitespace error.


+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
is unnecessary?


+       <entry role="func_table_entry">
+        <para role="func_signature">
+         <indexterm>
+          <primary>pg_set_relation_stats</primary>
+         </indexterm>
+         <function>pg_set_relation_stats</function> (
+         <parameter>relation</parameter> <type>regclass</type>
+         <optional>, <parameter>relpages</parameter>
<type>integer</type></optional>
+         <optional>, <parameter>reltuples</parameter>
<type>real</type></optional>
+         <optional>, <parameter>relallvisible</parameter>
<type>integer</type></optional> )
+         <returnvalue>boolean</returnvalue>
+        </para>
+        <para>
+         Updates table-level statistics for the given relation to the
+         specified values. The parameters correspond to columns in <link
+         linkend="catalog-pg-class"><structname>pg_class</structname></link>.
Unspecified
+         or <literal>NULL</literal> values leave the setting
+         unchanged. Returns <literal>true</literal> if a change was made;
+         <literal>false</literal> otherwise.
+        </para>
Are these <optional> flags wrong? There is only one function currently:
pg_set_relation_stats(relation regclass, relpages integer, reltuples
real, relallvisible integer)
I think you want
pg_set_relation_stats(relation regclass, relpages integer default
null, reltuples real default null, relallvisible integer default null)
We can add the following in src/backend/catalog/system_functions.sql, so that a call like
select * from pg_set_relation_stats('emp'::regclass);
works:
CREATE OR REPLACE FUNCTION
  pg_set_relation_stats(
                        relation regclass,
                        relpages integer default null,
                        reltuples real default null,
                        relallvisible integer default null)
RETURNS bool
LANGUAGE INTERNAL
CALLED ON NULL INPUT VOLATILE
AS 'pg_set_relation_stats';


The typedef enum ... needs to be added to src/tools/pgindent/typedefs.list.


+/*
+ * Check that array argument is one dimensional with no NULLs.
+ *
+ * If not, emit at elevel, and set argument to NULL in fcinfo.
+ */
+static void
+check_arg_array(FunctionCallInfo fcinfo, struct arginfo *arginfo,
+ int argnum, int elevel)
+{
+ ArrayType  *arr;
+
+ if (PG_ARGISNULL(argnum))
+ return;
+
+ arr = DatumGetArrayTypeP(PG_GETARG_DATUM(argnum));
+
+ if (ARR_NDIM(arr) != 1)
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("\"%s\" cannot be a multidimensional array",
+ arginfo[argnum].argname)));
+ fcinfo->args[argnum].isnull = true;
+ }
+
+ if (array_contains_nulls(arr))
+ {
+ ereport(elevel,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("\"%s\" array cannot contain NULL values",
+ arginfo[argnum].argname)));
+ fcinfo->args[argnum].isnull = true;
+ }
+}
this part elevel should always be ERROR?
if so, we can just
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),




relation_statistics_update and other functions
may need to check relkind?
since relpages, reltuples, relallvisible not meaning to all of relkind?



Re: Statistics Import and Export

From
Corey Huinker
Date:
I have attached version 28j as one giant patch covering what was
previously 0001-0003. It's a bit rough (tests in particular need some
work), but it implelements the logic to replace only those values
specified rather than the whole tuple.

I like what you did restoring the parameter enums, especially now that they can be leveraged for the expected type oids data structure.
 
At least for the interactive "set" variants of the functions, I think
it's an improvement. It feels more natural to just change one stat
without wiping out all the others. I realize a lot of the statistics
depend on each other, but the point is not to replace ANALYZE, the
point is to experiment with planner scenarios. What do others think?

When I first heard that was what you wanted to do, I was very uneasy about it. The way you implemented it (one function to wipe out/reset all existing stats, and then the _set_ function works as an upsert) puts my mind at ease. The things I really wanted to avoid were gaps in the stakind array (which can't happen as you wrote it) and getting the stakinds out of order (admittedly that's more a tidiness issue, but pg_statistic before/after fidelity is kept, so I'm happy).

For the "restore" variants, I'm not sure it matters a lot because the
stats will already be empty. If it does matter, we could pretty easily
define the "restore" variants to wipe out existing stats when loading
the table, though I'm not sure if that's a good thing or not.

I agree, and I'm leaning towards doing the clear, because "restore" to me implies that what resides there exactly matches what was in the function call, regardless of what might have been there before. But you're also right, "restore" is expected to be used on default/missing stats, and the restore_* call generated is supposed to be comprehensive of all stats that were there at time of dump/upgrade, so impact would be minimal.
 
I also made more use of FunctionCallInfo structures to communicate
between functions rather than huge parameter lists. I believe that
reduced the line count substantially, and made it easier to transform
the argument pairs in the "restore" variants into the positional
arguments for the "set" variants.

You certainly did, and I see where it pays off given that _set_ / _restore_ functions are just different ways of ordering the shared internal function call.

Observation: there is currently no way to delete a stakind, keeping the rest of the record. It's certainly possible to compose a SQL query that gets the current values, invokes pg_clear_* and then pg_set_* using the values that are meant to be kept, and in fact that pattern is how I imagined the pg_set_* functions would be used when they overwrote everything in the tuple. So I am fine with going forward with this paradigm.
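
For completeness, a rough sketch of that clear-then-set pattern, assuming an attribute-level pg_clear_attribute_stats() counterpart to pg_clear_relation_stats() and the named-parameter pg_set_attribute_stats() signature shown earlier in this thread (both are assumptions at this point; the values to keep would be read out of pg_stats beforehand):

-- wipe the whole pg_statistic row for the column, then re-set only the
-- stats we want to keep (values read from pg_stats before the clear)
SELECT pg_catalog.pg_clear_attribute_stats('stats_import.test'::regclass, 'id'::name, false);
SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_import.test'::regclass::oid,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 0.5::real,
    avg_width => 2::integer,
    n_distinct => -0.1::real);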

The code mentions that more explanation should be given for the special cases (tsvector, etc) and that explanation is basically "this code follows what the corresponding custom typanalyze() function does". In the future, it may make sense to have custom typimport() functions for datatypes that have a custom typanalyze(), which would solve the issue of handling custom stakinds.

I'll continue to work on this.

p.s. dropping invalid email address from the thread

Re: Statistics Import and Export

From
Corey Huinker
Date:

git am shows some whitespace error.


Jeff indicated that this was more of a stylistic/clarity reworking. I'll be handling it again for now.
 


+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
is unnecessary?

They're autogenerated from pg_proc.dat. I was (pleasantly) surprised too.

 
should elevel always be ERROR in this part?
if so, we can just write:

I'm personally disinclined to error on any of these things, so I'll be leaving it as is. I suspect that the proper balance lies between all-ERROR and all-WARNING, but time will tell which.
 
relation_statistics_update and other functions
may need to check relkind?
since relpages, reltuples, and relallvisible are not meaningful for every relkind.

I'm not able to understand either of your questions, can you elaborate on them?

 

Re: Statistics Import and Export

From
jian he
Date:
On Tue, Sep 17, 2024 at 5:03 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>> 1. make sure these three functions: 'pg_set_relation_stats',
>> 'pg_restore_relation_stats','pg_clear_relation_stats' proisstrict to true.
>> because in
>> pg_class catalog, these three attributes (relpages, reltuples, relallvisible) is
>> marked as not null. updating it to null will violate these constraints.
>> tom also mention this at [
>
> Things have changed a bit since then, and the purpose of the functions has changed, so the considerations are now
different.The function signature could change in the future as new pg_class stats are added, and it might not still be
strict.
>

if you add more arguments to relation_statistics_update,
the first 3 arguments (relpages, reltuples, relallvisible) are still not-null,
and we are unlikely to add 3 or more nullable arguments?

we have code like:
    if (!PG_ARGISNULL(RELPAGES_ARG))
    {
            replaces[ncols] = Anum_pg_class_relpages;
            values[ncols] = Int32GetDatum(relpages);
            ncols++;
    }
    if (!PG_ARGISNULL(RELTUPLES_ARG))
    {
            replaces[ncols] = Anum_pg_class_reltuples;
            values[ncols] = Float4GetDatum(reltuples);
            ncols++;
    }
    if (!PG_ARGISNULL(RELALLVISIBLE_ARG))
    {
            replaces[ncols] = Anum_pg_class_relallvisible;
            values[ncols] = Int32GetDatum(relallvisible);
            ncols++;
    }
    newtup = heap_modify_tuple_by_cols(ctup, tupdesc, ncols, replaces, values, nulls);

you just directly declared "bool nulls[3]    = {false, false, false};"
if any of (RELPAGES_ARG, RELTUPLES_ARG, RELALLVISIBLE_ARG)
is null, should you set that null[position] to true?
otherwise, i am confused with the variable nulls.

Looking at other usage of heap_modify_tuple_by_cols, "ncols" cannot be
dynamic, it should be a fixed value?
The current implementation works, because the (bool[3] nulls) is
always false, never changed.
if nulls becomes {false, false, true} then "ncols" must be 3, cannot be 2.




>>
>> 8. lock_check_privileges function issue.
>> ------------------------------------------------
>> --assume there is a superuser jian
>> create role alice NOSUPERUSER LOGIN;
>> create role bob NOSUPERUSER LOGIN;
>> create role carol NOSUPERUSER LOGIN;
>> alter database test owner to alice;
>> GRANT CONNECT, CREATE on database test to bob;
>> \c test bob
>> create schema one;
>> create table one.t(a int);
>> create table one.t1(a int);
>> set session AUTHORIZATION DEFAULT; --switch back to superuser.
>> alter table one.t1 owner to carol;
>> \c test alice
>> --now current database owner alice cannot do ANYTHING WITH table one.t1,
>> like ANALYZE, SELECT, INSERT, MAINTAIN etc.
>
>
> Interesting.
>

database owners do not necessarily have schema USAGE privilege.
-------------<<<>>>------------------
create role alice NOSUPERUSER LOGIN;
create role bob NOSUPERUSER LOGIN;
create database test;
alter database test owner to alice;
GRANT CONNECT, CREATE on database test to bob;
\c test bob
create schema one;
create table one.t(a int);
\c test alice

analyze one.t;

with cte as (
select oid as the_t
from pg_class
where relname = any('{t}') and relnamespace = 'one'::regnamespace)
SELECT
pg_catalog.pg_set_relation_stats(
relation => the_t,
relpages => 17::integer,
reltuples => 400.0::real,
relallvisible => 4::integer)
from cte;


In the above case, alice cannot do "analyze one.t;",
but can do pg_set_relation_stats, which seems not ok?
-------------<<<>>>------------------

src/include/statistics/stats_utils.h
comment
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/statistics/statistics.h

should be "src/include/statistics/stats_utils.h"



comment src/backend/statistics/stats_utils.c
 * IDENTIFICATION
 *       src/backend/statistics/stats_privs.c
should be
 * IDENTIFICATION
 *       src/backend/statistics/stats_utils.c



Re: Statistics Import and Export

From
jian he
Date:
On Mon, Sep 23, 2024 at 8:57 AM jian he <jian.universality@gmail.com> wrote:
>
> database owners do not necessarily have schema USAGE privilege.
> -------------<<<>>>------------------
> create role alice NOSUPERUSER LOGIN;
> create role bob NOSUPERUSER LOGIN;
> create database test;
> alter database test owner to alice;
> GRANT CONNECT, CREATE on database test to bob;
> \c test bob
> create schema one;
> create table one.t(a int);
> \c test alice
>
> analyze one.t;
>
> with cte as (
> select oid as the_t
> from pg_class
> where relname = any('{t}') and relnamespace = 'one'::regnamespace)
> SELECT
> pg_catalog.pg_set_relation_stats(
> relation => the_t,
> relpages => 17::integer,
> reltuples => 400.0::real,
> relallvisible => 4::integer)
> from cte;
>
>
> In the above case, alice cannot do "analyze one.t;",
> but can do pg_set_relation_stats, which seems not ok?

sorry for the noise.
what stats_lock_check_privileges does about privileges is right.

the database owner cannot do
"ANALYZE one.t;"
but can do "ANALYZE;" to indirectly analyze one.t



which seems to be the expected behavior per
https://www.postgresql.org/docs/17/sql-analyze.html
<<
To analyze a table, one must ordinarily have the MAINTAIN privilege on
the table.
However, database owners are allowed to analyze all tables in their
databases, except shared catalogs.
>>



Re: Statistics Import and Export

From
Nathan Bossart
Date:
I took a look at v29-0006.

On Tue, Sep 17, 2024 at 05:02:49AM -0400, Corey Huinker wrote:
> From: Corey Huinker <corey.huinker@gmail.com>
> Date: Sat, 4 May 2024 04:52:38 -0400
> Subject: [PATCH v29 6/7] Add derivative flags dumpSchema, dumpData.
> 
> User-set flags --schema-only and --data-only are often consulted by
> various operations to determine if they should be skipped or not. While
> this logic works when there are only two mutually-exclusive -only
> options, it will get progressively more confusing when more are added.

After glancing at v29-0007, I see what you mean.

> In anticipation of this, create the flags dumpSchema and dumpData which
> are derivative of the existing options schemaOnly and dataOnly. This
> allows us to restate current skip-this-section tests in terms of what is
> enabled, rather than checking if the other -only mode is turned off.

This seems like a reasonable refactoring exercise that we could take care
of before the rest of the patch set goes in.  I added one new reference to
dopt.schemaOnly in commit bd15b7d, so that should probably be revised to
!dumpData, too.  I also noticed a few references to dataOnly/schemaOnly in
comments that should likely be adjusted.
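
For readers skimming the thread, the derivation described in the quoted commit
message is roughly the following (a paraphrase of the idea, not a hunk from the
patch):

    /* derived flags: state what to dump, rather than which -only mode is off */
    dopt.dumpSchema = !dopt.dataOnly;
    dopt.dumpData = !dopt.schemaOnly;

so a test like "if (!dopt.dataOnly)" can be restated as "if (dopt.dumpSchema)".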

One other question I had when looking at this patch is whether we could
remove dataOnly/schemaOnly from DumpOptions and RestoreOptions.  Once 0007
is applied, those variables become particularly hazardous, so we really
want to prevent folks from using them in new code.

-- 
nathan



Re: Statistics Import and Export

From
Corey Huinker
Date:
This seems like a reasonable refactoring exercise that we could take care
of before the rest of the patch set goes in.  I added one new reference to
dopt.schemaOnly in commit bd15b7d, so that should probably be revised to
!dumpData, too.  I also noticed a few references to dataOnly/schemaOnly in
comments that should likely be adjusted.

 I'll be on the lookout for the new usage with the next rebase, and will fix the comments as well.
 
One other question I had when looking at this patch is whether we could
remove dataOnly/schemaOnly from DumpOptions and RestoreOptions.  Once 0007
is applied, those variables become particularly hazardous, so we really
want to prevent folks from using them in new code.


Well, the very next patch in the series adds --statistics-only, so I don't think we're getting rid of user-facing command switches. However, I could see us taking away the dataOnly/schemaOnly internal variables, thus preventing coders from playing with those sharp objects.
 

Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Thu, Oct 10, 2024 at 03:49:16PM -0400, Corey Huinker wrote:
>> One other question I had when looking at this patch is whether we could
>> remove dataOnly/schemaOnly from DumpOptions and RestoreOptions.  Once 0007
>> is applied, those variables become particularly hazardous, so we really
>> want to prevent folks from using them in new code.
> 
> Well, the very next patch in the series adds --statistics-only, so I don't
> think we're getting rid of user-facing command switches. However, I could
> see us taking away the dataOnly/schemaOnly internal variables, thus
> preventing coders from playing with those sharp objects.

That's what I meant.  The user-facing options would stay the same, but the
internal variables would be local to main() so that other functions would
be forced to use dumpData, dumpSchema, etc.

-- 
nathan



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-09-23 at 08:57 +0800, jian he wrote:
>     newtup = heap_modify_tuple_by_cols(ctup, tupdesc, ncols,
> replaces, values, nulls);
>
> you just directly declared "bool nulls[3]    = {false, false,
> false};"

Those must be false (not NULL), because in pg_class those are non-NULL
attributes. They must be set to something whenever we update.

> if any of (RELPAGES_ARG, RELTUPLES_ARG, RELALLVISIBLE_ARG)
> is null, should you set that null[position] to true?

If the corresponding SQL argument is NULL, we leave the existing value
unchanged; we don't set it to NULL.

> otherwise, i am confused with the variable nulls.
>
> Looking at other usage of heap_modify_tuple_by_cols, "ncols" cannot
> be
> dynamic, it should be a fixed value?
> The current implementation works, because the (bool[3] nulls) is
> always false, never changed.
> if nulls becomes {false, false, true} then "ncols" must be 3, cannot
> be 2.

heap_modify_tuple_by_cols() uses ncols to specify the length of the
values/isnull arrays. The "replaces" is an array of attribute numbers
to replace (in contrast to plain heap_modify_tuple(), which uses an
array of booleans).

We are going to replace a maximum of 3 attributes, so the arrays have a
maximum size of 3. Predeclaring the arrays to be 3 elements is just
fine even if we only use the first 1-2 elements -- it avoids a needless
heap allocation/free.
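
In other words, the pattern is roughly this (a condensed sketch, not the patch
code verbatim):

    int         replaces[3];    /* attribute numbers to replace */
    Datum       values[3];
    bool        nulls[3] = {false, false, false};  /* pg_class columns are NOT NULL */
    int         ncols = 0;

    if (!PG_ARGISNULL(RELPAGES_ARG))
    {
        replaces[ncols] = Anum_pg_class_relpages;
        values[ncols] = Int32GetDatum(relpages);
        ncols++;
    }
    /* ... likewise for reltuples and relallvisible ... */

    newtup = heap_modify_tuple_by_cols(ctup, tupdesc, ncols,
                                       replaces, values, nulls);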

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
However, this function seems to accept -1 for the relpages parameter. Below is an example of execution:
---
postgres=> CREATE TABLE data1(c1 INT PRIMARY KEY, c2 VARCHAR(10));
CREATE TABLE
postgres=> SELECT pg_set_relation_stats('data1', relpages=>-1);
 pg_set_relation_stats
-----------------------
 t
(1 row)
postgres=> SELECT relname, relpages FROM pg_class WHERE relname='data1';
 relname | relpages
---------+----------
 data1   |       -1
(1 row)
---

The attached patch modifies the pg_set_relation_stats function to work as described in the manual.

Regards,
Noriyoshi Shinoda

Accepting -1 is correct. I thought I had fixed that in a recent patch. Perhaps signals got crossed somewhere along the way.

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Mon, Oct 14, 2024 at 3:40 PM Corey Huinker <corey.huinker@gmail.com> wrote:
However, this function seems to accept -1 for the relpages parameter. Below is an example of execution:
---
postgres=> CREATE TABLE data1(c1 INT PRIMARY KEY, c2 VARCHAR(10));
CREATE TABLE
postgres=> SELECT pg_set_relation_stats('data1', relpages=>-1);
 pg_set_relation_stats
-----------------------
 t
(1 row)
postgres=> SELECT relname, relpages FROM pg_class WHERE relname='data1';
 relname | relpages
---------+----------
 data1   |       -1
(1 row)
---

The attached patch modifies the pg_set_relation_stats function to work as described in the manual.

Regards,
Noriyoshi Shinoda

Accepting -1 is correct. I thought I had fixed that in a recent patch. Perhaps signals got crossed somewhere along the way.

Just to be sure, I went back to v29, fixed a typo and some whitespace issues in stats_import.out, confirmed that it passed regression tests, then changed the relpages lower bound from -1 back to 0, and sure enough, the regression test for pg_upgrade failed again.

It seems that partitioned tables have a relpages of -1, so regression tests involving tables alpha_neg and alpha_pos (and 35 others, all seemingly partitioned) fail. So it was the docs that were wrong, not the code.

e839c8ecc9352b7754e74f19ace013c0c0d18613 doesn't include the stuff that modified pg_dump/pg_upgrade, so it wouldn't have turned up this problem.

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-10-14 at 21:46 -0400, Corey Huinker wrote:
> It seems that partitioned tables have a relpages of -1

Oh, I see. It appears that there's a special case for partitioned
tables that sets relpages=-1 in do_analyze_rel() around line 680. It's
a bit inconsistent, though, because even partitioned indexes have
relpages=0. Furthermore, the parameter is of type BlockNumber, so
InvalidBlockNumber would make more sense.

Not the cleanest code, but if the value exists, we need to be able to
import it.


Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

Oh, I see. It appears that there's a special case for partitioned
tables that sets relpages=-1 in do_analyze_rel() around line 680. It's
a bit inconsistent, though, because even partitioned indexes have
relpages=0. Furthermore, the parameter is of type BlockNumber, so
InvalidBlockNumber would make more sense.

Not the cleanest code, but if the value exists, we need to be able to
import it.


Thanks for tracking that down. I'll have a patch ready shortly. 

Re: Statistics Import and Export

From
Corey Huinker
Date:
On Wed, Oct 16, 2024 at 7:20 PM Corey Huinker <corey.huinker@gmail.com> wrote:
Code fix with a comment on why nobody expects a relpages of -1, a test case to demonstrate that relpages of -1 can happen, and an updated doc to reflect the new lower bound.

Additional fixes, now in a patch-set:

1. Allow relpages to be set to -1 (partitioned tables with partitions have this value after ANALYZE).
2. Turn off autovacuum on tables (where possible) if they are going to be the target of pg_set_relation_stats().
3. Allow pg_set_relation_stats to continue past an out-of-range detection on one attribute, rather than immediately returning false.
 

There is some uncertainty on what, if anything, should be returned by pg_set_relation_stats() and pg_set_attribute_stats().

Initially the function (and there was just one) returned void, but it had a bool return added around the time it split into relation/attribute stats.

Returning a boolean seems like good instrumentation and a helper for allowing other tooling to use the functions. However, it's rather limited in what it can convey.

Currently, a return of true means "a record was written", and false means that a record was not written. Cases where a record was not written for pg_set_relation_stats amount to the following:

1. No attributes were specified, therefore there is nothing to change.
2. The attributes were set to the values that the record already has, thus no change was necessary.

#2 can be confusing, because false may look like a failure, but it means "the pg_class values were already set to what you wanted".
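
To illustrate with the arguments already shown upthread (an approximation of the behavior described, not a verified transcript):

SELECT pg_set_relation_stats('data1'::regclass, relpages => 10);  -- true, pg_class updated
SELECT pg_set_relation_stats('data1'::regclass, relpages => 10);  -- false, values already match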

An alternate use of boolean, suggested by Jeff was the following:

1. Return true if all of the fields specified were applied to the record.
2. Return false if any field that was specified was NOT set, even if the other ones were.

#2 is also confusing in that the user has received a false value, but the operation did modify the record, just not as fully as the caller had hoped.

These show that a boolean isn't really up to conveying the nuances of potential outcomes. Complex return types have met with considerable resistance, enumerations are similarly undesirable, no other scalar value seems up to the task, and while an INFO or LOG message could convey considerable complexity, it wouldn't be readily handled programmatically. This re-raises the question of whether the pg_set_*_stats functions should return anything at all.

Any feedback on what users would expect from these functions in terms of return value is appreciated. Bear in mind that these functions will NOT be integrated into pg_upgrade/pg_dump, as that functionality will be handled by functions that are less user friendly but more flexible and forgiving of bad data. We're talking purely about functions meant for tweaking stats to look for changes in planner behavior.



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-10-17 at 20:54 -0400, Corey Huinker wrote:
> There is some uncertainty on what, if anything, should be returned by
> pg_set_relation_stats() and pg_set_attribute_stats().

...

> This re-raises the question of whether the pg_set_*_stats functions
> should return anything at all.

What is the benefit of a return value from the pg_set_*_stats variants?
As far as I can tell, there is none because they throw an ERROR if
anything goes wrong, so they should just return VOID. What am I
missing?

The return value is more interesting for pg_restore_relation_stats()
and pg_restore_attribute_stats(), which will be used by pg_dump and
which are designed to keep going on non-fatal errors. Isn't that what
this discussion should be about?

Magnus, you previously commented that you'd like some visibility for
tooling:

https://www.postgresql.org/message-id/CABUevEz1gLOkWSh_Vd9LQh-JM4i%3DMu7PVT9ffc77TmH0Zh3TzA%40mail.gmail.com

Is a return value what you had in mind? Or some other function that can
help find missing stats later, or something else entirely?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
What is the benefit of a return value from the pg_set_*_stats variants?
As far as I can tell, there is none because they throw an ERROR if
anything goes wrong, so they should just return VOID. What am I
missing?

Probably nothing. The nuances of "this stat didn't get set but these other two did" are too complex for a scalar to report, and pg_set_relation_stats() doesn't do nuance, anyway. I've attached a patch that returns void instead.
 

The return value is more interesting for pg_restore_relation_stats()
and pg_restore_attribute_stats(), which will be used by pg_dump and
which are designed to keep going on non-fatal errors. Isn't that what
this discussion should be about?

Yes. I was trying to focus the conversation on something that could be easily resolved first, and then move on to -restore which is trickier.

Patch that allows relation_statistics_update to continue after one failed stat (0001) attached, along with bool->void change (0002).
 

Re: Statistics Import and Export

From
Corey Huinker
Date:


If the relpages option contains -1 only for partitioned tables, shouldn't pg_set_relation_stats restrict the values that can be specified by table type? The attached patch limits the value to -1 or more if the target is a partitioned table, and 0 or more otherwise.

Changing relpages to -1 on a non-partitioned table seems to significantly change the execution plan.


Short answer: It's working as intended. Significantly changing the execution plan in weird ways is part of the intention of the function, even if the execution plan changes for the worse. I appreciate the report, though.

Longer answer:

Enforcing -1 on only partitioned tables is tricky, as it seems to be a value for any table that has no local storage. So foreign data wrapper tables could, in theory, also have this value. More importantly, the -1 value seems to be situational; in my experience it only happens on partitioned tables after they have their first partition added, which means that the current valid stat range is set according to facts that can change. Like so:

chuinker=# select version();
                                                              version                                                              
------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 16.4 (Postgres.app) on aarch64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.102), 64-bit
(1 row)
chuinker=# create table part_parent (x integer) partition by range (x);
CREATE TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
        0
(1 row)

chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
        0
(1 row)

chuinker=# create table part_child partition of part_parent for values from (0) TO (100);
CREATE TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
        0
(1 row)

chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
       -1
(1 row)

chuinker=# drop table part_child;
DROP TABLE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
       -1
(1 row)

chuinker=# analyze part_parent;
ANALYZE
chuinker=# select relpages from pg_class where oid = 'part_parent'::regclass;
 relpages
----------
       -1
(1 row)

Prior versions (March 2024 and earlier) of this patch and the pg_set_attribute_stats patch did have many checks to prevent importing stat values that were "wrong" in some way. Some examples from attribute stats import were:

* Histograms that were not monotonically nondecreasing.
* Frequency values that were out of bounds specified by other values in the array.
* Frequency values outside of the [0.0,1.0] or [-1.0,1.0] range, depending on the stat type.
* Paired arrays of most-common-values and their attendant frequencies not having the same length.

All of these checks were removed based on feedback from reviewers and committers who saw the pg_set_*_stats() functions as a fuzzing tool, so the ability to set illogical, wildly implausible, or mathematically impossible values was a feature, not a bug. I suspect they would view your demonstration of setting impossible values on a table as proof that the function can be used to experiment with planner scenarios. So, while I previously would have eagerly accepted this patch as another valued validation check, such checks don't fit with the new intention of the functions. Still, I greatly appreciate your helping us discover ways in which we can use this tool to make the planner do odd things.
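
As a concrete example of that experimentation use, something along these lines (argument names as shown upthread; the numbers are deliberately arbitrary):

-- pretend data1 is vastly larger than it really is, then see how plans change
SELECT pg_set_relation_stats(
    relation      => 'data1'::regclass,
    relpages      => 125000::integer,
    reltuples     => 10000000::real,
    relallvisible => 0::integer);
EXPLAIN SELECT * FROM data1 WHERE c1 < 10;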

One thing that could cause us to enforce a check like the one you submitted would be if an invalid value caused a query to fail or a session to crash; even then, that would probably spur a change to make the planner more defensive rather than add more checks on the set_* side.

Re: Statistics Import and Export

From
Jeff Davis
Date:
> I've taken most of Jeff's work, reincorporated it into roughly the
> same patch structure as before, and am posting it now.

I committed 0001-0004 with significant revision.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-10-22 at 23:58 +0000, Shinoda, Noriyoshi (SXD Japan FSIP)
wrote:
> Thanks for developing good features. I tried the patch that was
> committed right away.
> It seems that the implementation and documentation differ on the
> return value of the pg_clear_attribute_stats function.
> The attached small patch fixes the documentation.

Thank you again, fixed.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-09-17 at 05:02 -0400, Corey Huinker wrote:
>
> I've taken most of Jeff's work, reincorporated it into roughly the
> same patch structure as before, and am posting it now.

I have committed the import side of this patch series; that is, the
function calls that can load stats into an existing cluster without the
need to ANALYZE.

The pg_restore_*_stats() functions are designed such that pg_dump can
emit the calls. Some design choices of the functions worth noting:

  (a) a variadic signature of name/value pairs rather than ordinary SQL
arguments, which makes it easier for future versions to interpret what
has been output from a previous version; and
  (b) many kinds of errors are demoted to WARNINGs, to allow some
statistics to be set for an attribute even if other statistics are
malformed (also a future-proofing design); and
  (c) we are considering whether to use an in-place heap update for the
relation stats, so that a large restore doesn't bloat pg_class -- I'd
like feedback on this idea

The pg_set_*_stats() functions are designed for interactive use, such
as tweaking statistics for planner testing, experimentation, or
reproducing a plan outside of a production system. The aforementioned
design choices don't make a lot of sense in this context, so that's why
the pg_set_*_stats() functions are separate from the
pg_restore_*_stats() functions. But there's a lot of overlap, so it may
be worth discussing again whether we should only have one set of
functions.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Alexander Lakhin
Date:
Hello Jeff and Corey,

26.10.2024 01:18, Jeff Davis wrote:
> On Tue, 2024-09-17 at 05:02 -0400, Corey Huinker wrote:
>> I've taken most of Jeff's work, reincorporated it into roughly the
>> same patch structure as before, and am posting it now.
> I have committed the import side of this patch series; that is, the
> function calls that can load stats into an existing cluster without the
> need to ANALYZE.
>
> The pg_restore_*_stats() functions are designed such that pg_dump can
> emit the calls. Some design choices of the functions worth noting:

Please look at the following seemingly atypical behavior of the new
functions:
CREATE TABLE test(id int);

SELECT pg_restore_attribute_stats(
   'relation', 'test'::regclass,
   'attname', 'id'::name,
   'inherited', false);

SELECT pg_restore_attribute_stats(
   'relation', 'test'::regclass,
   'attname', 'id'::name,
   'inherited', false
) FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_update, heapam.c:4353

Or:
SELECT pg_clear_attribute_stats('test'::regclass, 'id'::name, false)
FROM generate_series(1, 2);
ERROR:  XX000: tuple already updated by self
LOCATION:  simple_heap_delete, heapam.c:3108

Best regards,
Alexander



Re: Statistics Import and Export

From
Corey Huinker
Date:

  (c) we are considering whether to use an in-place heap update for the
relation stats, so that a large restore doesn't bloat pg_class -- I'd
like feedback on this idea

I'd also like feedback, though I feel very strongly that we should do what ANALYZE does. In an upgrade situation, nearly all tables will have stats imported, which would result in an immediate doubling of pg_class - not the end of the world, but not great either.

Given the recent bugs associated with in-place updates and race conditions, if we don't want to do in-place here, we should also consider getting rid of it for ANALYZE. I briefly pondered if it would make sense to vertically partition pg_class into the stable attributes and the attributes that get modified in-place, but that list is pretty long: relpages, reltuples, relallvisible, relhasindex, reltoastrelid, relhasrules, relhastriggers, relfrozenxid, and relminmxid.

If we don't want to do inplace updates in pg_restore_relation_stats(), then we could mitigate the bloat with a VACUUM FULL pg_class at the tail end of the upgrade if stats were enabled.
 
pg_restore_*_stats() functions. But there's a lot of overlap, so it may
be worth discussing again whether we should only have one set of
functions.

For the reason of in-place updates and error tolerance, I think they have to remain separate functions, but I'm also interested in hearing other's opinions.
 

Re: Statistics Import and Export

From
Heikki Linnakangas
Date:
On 23/10/2024 01:27, Jeff Davis wrote:
>> I've taken most of Jeff's work, reincorporated it into roughly the
>> same patch structure as before, and am posting it now.
> 
> I committed 0001-0004 with significant revision.

This just caught my eye:

postgres=# select pg_set_attribute_stats('foo', 'xmin', false, 1);
  pg_set_attribute_stats
------------------------

(1 row)

We should probably not allow that, because you cannot ANALYZE system 
columns:

postgres=# analyze foo (xmin);
ERROR:  column "xmin" of relation "foo" does not exist

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Statistics Import and Export

From
Corey Huinker
Date:
We should probably not allow that, because you cannot ANALYZE system
columns:

Makes sense, and the fix is changing a single character (unless we think it warrants a test case).
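
For anyone following along, the check presumably ends up looking something like this (a guess at its shape, not the actual diff, and assuming "attr" is the target column's Form_pg_attribute): requiring a strictly positive attribute number rejects system columns such as xmin.

    if (attr->attnum <= 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("cannot modify statistics on system column \"%s\"",
                        NameStr(attr->attname))));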

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Fri, Nov  8, 2024 at 01:25:21PM -0500, Corey Huinker wrote:
>     WHAT IS NOT DONE - EXTENDED STATISTICS
> 
>     It is a general consensus in the community that "nobody uses extended
>     statistics", though I've had difficulty getting actual figures to back this
>     up, even from my own employer. Surveying several vendors at PgConf.EU, the
>     highest estimate was that at most 1% of their customers used extended
>     statistics, though more probably should. This reinforces my belief that a
>     feature that would eliminate a major pain point in upgrades for 99% of
>     customers shouldn't be held back by the fact that the other 1% only have a
>     reduced hassle.
> 
>     However, having relation and attribute statistics carry over on major
>     version upgrades presents a slight problem: running vacuumdb
>     --analyze-in-stages after such an upgrade is completely unnecessary for
>     those without extended statistics, and would actually result in _worse_
>     statistics for the database until the last stage is complete. Granted,
>     we've had great difficulty getting users to know that vacuumdb is a thing
>     that should be run, but word has slowly spread through our own
>     documentation and those "This one simple trick will make your postgres go
>     fast post-upgrade" blog posts. Those posts will continue to lurk in search
>     results long after this feature goes into release, and it would be a rude
>     surprise to users to find out that the extra work they put in to learn
>     about a feature that helped their upgrade in 17 was suddenly detrimental
>     (albeit temporarily) in 18. We should never punish people for only being a
>     little-bit current in their knowledge. Moreover, this surprise would
>     persist even after we add extended statistics import function
>     functionality.
> 
>     I presented this problem to several people at PgConf.EU, and the consensus
>     least-bad solution was that vacuumdb should filter out tables that are not
>     missing any statistics when using options --analyze, --analyze-only, and
>     --analyze-in-stages, with an additional flag for now called --force-analyze
>     to restore the un-filtered functionality. This gives the outcome tree:
> 
>     1. Users who do not have extended statistics and do not use (or not even
>     know about) vacuumdb will be blissfully unaware, and will get better
>     post-upgrade performance.
>     2. Users who do not have extended statistics but use vacuumdb
>     --analyze-in-stages will be pleasantly surprised that the vacuumdb run is
>     almost a no-op, and completes quickly. Those who are surprised by this and
>     re-run vacuumdb --analyze-in-stages will get another no-op.
>     3. Users who have extended statistics and use vacuumdb --analyze-in-stages
>     will get a quicker vacuumdb run, as only the tables with extended stats
>     will pass the filter. Subsequent re-runs of vacuumdb --analyze-in-stages
>     would be the no-op.
>     4. Users who have extended statistics and don't use vacuumdb will still get
>     better performance than they would have without any stats imported.
> 
>     In case anyone is curious, I'm defining "missing stats" as a table/matview
>     with any of the following:
> 
>     1. A table with an attribute that lacks a corresponding pg_statistic row.
>     2. A table with an index with an expression attribute that lacks a
>     corresponding pg_statistic row (non-expression attributes just borrow the
>     pg_statistic row from the table's attribute).
>     3. A table with at least one extended statistic that does not have a
>     corresponding pg_statistic_ext_data row.
> 
>     Note that none of these criteria are concerned with the substance of the
>     statistics (ex. pg_statistic row should have mcv stats but does not),
>     merely their row-existence. 
> 
>     Some rejected alternative solutions were:
> 
>     1. Adding a new option --analyze-missing-stats. While simple, few people
>     would learn about it, knowledge of it would be drowned out by the
>     aforementioned sea of existing blog posts.
>     2. Adding --analyze-missing-stats and making --analyze-in-stages fail with
>     an error message educating the user about --analyze-missing-stats. Users
>     might not see the error, existing tooling wouldn't be able to act on the
>     error, and there are legitimate non-upgrade uses of --analyze-in-stages.
> 
>     MAIN CONCERN GOING FORWARD
> 
>     This change to vacuumdb will require some reworking of the
>     vacuum_one_database() function so that the list of tables analyzed is
>     preserved across the stages, as subsequent stages runs won't be able to
>     detect which tables were previously missing stats.

You seem to be optimizing for people using pg_upgrade, and for people
upgrading to PG 18, without adequately considering people using vacuumdb
in non-pg_upgrade situations, and people using PG 19+.  Let me explain.

First, I see little concern here for how people who use --analyze and
--analyze-only independent of pg_upgrade will be affected by this. 
While I recommend people decrease vacuum and analyze threshold during
non-peak periods:

    https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017

some people might just regenerate all statistics during non-peak periods
using these options.  You can perhaps argue that --analyze-in-stages
would only be used by pg_upgrade so maybe that can be adjusted more
easily.

Second, the API for what --analyze and --analyze-only do will be very
confusing for people running, e.g., PG 20, because the average user
reading the option name will not guess it only adds missing statistics.

I think you need to rethink your approach and just accept that a mention
of the new preserving statistic behavior of pg_upgrade, and the new
vacuumdb API required, will be sufficient.  In summary, I think you need
a new --compute-missing-statistics-only that can be combined with
--analyze, --analyze-only, and --analyze-in-stages to compute only
missing statistics, and document it in the PG 18 release notes.

Frankly, we have a similar problem with partitioned tables:

    https://www.postgresql.org/docs/current/sql-analyze.html
    
    For partitioned tables, ANALYZE gathers statistics by sampling
    rows from all partitions; in addition, it will recurse into
    each partition and update its statistics. Each leaf partition
    is analyzed only once, even with multi-level partitioning. No
    statistics are collected for only the parent table (without data
    from its partitions), because with partitioning it's guaranteed
    to be empty.

-->    The autovacuum daemon does not process partitioned tables, nor
    does it process inheritance parents if only the children are ever
    modified. It is usually necessary to periodically run a manual
    ANALYZE to keep the statistics of the table hierarchy up to date.

Now, you can say partitioned table statistics are not as important as
extended statistics, but that fact remains that we have these two odd
cases where special work must be done to generate statistics.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Mon, Nov 18, 2024 at 08:06:24PM +0000, Wetmore, Matthew  (CTR) wrote:
> Sorry to chime in with a dumb question:
> 
> How would/could this affect tables that have the vacuum and analyze
> scale_factors different from the rest of the db via the ALTER TABLE statement?
> 
> (I do this a lot)
> 
> ALTER TABLE your_schema.your_table SET
> (autovacuum_enabled,autovacuum_analyze_scale_factor,autovacuum_vacuum_scale_factor);
> 
> Just wanted to mention

I don't think it would affect it since those control autovacuum, and we
are talking about manual vacuum/analyze.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Corey Huinker
Date:


On Mon, Nov 18, 2024 at 3:32 PM Bruce Momjian <bruce@momjian.us> wrote:
> How would/could this affect tables that have the vacuum and analyze
> scale_factors different from the rest of the db via the ALTER TABLE statement?
>
> (I do this a lot)

I don't think it would affect it since those control autovacuum, and we
are talking about manual vacuum/analyze.

Correct. The patchset is about carrying over the gathered statistics as-is from the previous major version. Per-table configurations for how to collect statistics would be unchanged.
 

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Mon, Nov 18, 2024 at 08:29:10PM -0500, Corey Huinker wrote:
> On Mon, Nov 18, 2024 at 2:47 PM Bruce Momjian <bruce@momjian.us> wrote:
>     You seem to be optimizing for people using pg_upgrade, and for people
>     upgrading to PG 18, without adequately considering people using vacuumdb
>     in non-pg_upgrade situations, and people using PG 19+.  Let me explain.
> 
> This was a concern as I was polling people.
> 
> A person using vacuumdb in a non-upgrade situation is, to my limited
> imagination, one of three types:
> 
> 1. A person who views vacuumdb as a worthwhile janitorial task for downtimes.
> 2. A person who wants stats on a lot of recently created tables.
> 3. A person who wants better stats on a lot of recently (re)populated tables.
> 
> The first group would not be using --analyze-in-stages or --analyze-only,
> because the vacuuming is a big part of it. They will be unaffected.
> 
> The second group will be pleasantly surprised to learn that they no longer need
> to specify a subset of tables, as any table missing stats will get picked up.
> 
> The third group would be surprised that their operation completed so quickly,
> check the docs, add in --force-analyze to their script, and re-run.

We can't design an API around who is going to be surprised.  We have to
look at what the options say, what people would expect it to do, and
what it does.  The reason "surprise" doesn't work in the long run is
that while PG 18 users might be surprised, PG 20 users will be confused.

>     First, I see little concern here for how people who use --analyze and
>     --analyze-only independent of pg_upgrade will be affected by this.
>     While I recommend people decrease vacuum and analyze threshold during
>     non-peak periods:
> 
>             https://momjian.us/main/blogs/pgblog/2017.html#January_3_2017
> 
>     some people might just regenerate all statistics during non-peak periods
>     using these options.  You can perhaps argue that --analyze-in-stages
>     would only be used by pg_upgrade so maybe that can be adjusted more
>     easily.
> 
> I, personally, would be fine if this only modified --analyze-in-stages, as it
> already carries the warning:

Right, but I think we would need to rename the option to clarify what it
does, e.g. --analyze-missing-in-stages.  If they use
--analyze-in-stages, they will get an error, and will then need to
reference the docs to see the new option wording, or we can suggest the
new option in the error message.

> But others felt that --analyze-only should be in the mix as well.

Again, with those other people not saying so in this thread, I can't
really comment on it --- I can only tell you what I have seen and others
are going to have to explain why they want such dramatic changes.

> No one advocated for changing the behavior of options that involve actual
> vacuuming.
>  
> 
>     Second, the API for what --analyze and --analyze-only do will be very
>     confusing for people running, e.g., PG 20, because the average user
>     reading the option name will not guess it only adds missing statistics.
> 
>     I think you need to rethink your approach and just accept that a mention
>     of the new preserving statistic behavior of pg_upgrade, and the new
>     vacuumdb API required, will be sufficient.  In summary, I think you need
>     a new --compute-missing-statistics-only that can be combined with
>     --analyze, --analyze-only, and --analyze-in-stages to compute only
>     missing statistics, and document it in the PG 18 release notes.
> 
> 
> A --missing-only/--analyze-missing-in-stages option was my first idea, and it's
> definitely cleaner, but as I stated in the rejected ideas section above, when I
> reached out to others at PgConf.EU there was pretty general consensus that few
> people would actually read our documentation, and the few that have in the past
> are unlikely to read it again to discover the new option, and those people
> would have a negative impact of using --analyze-in-stages, effectively
> punishing them for having once read the documentation (or a blog post) but not
> re-read it prior to upgrade.

Again, you can't justify such changes based on discussions that are not
posted publicly here.

> So, to add non-pg_upgrade users to the outcome tree in my email from
> 2024-11-04:
> 
> 
>     5. Users who use vacuumdb in a non-upgrade situation and do not use either
>     --analyze-in-stages or --analyze-only will be completely unaffected.
>     6. Users who use vacuumdb in a non-upgrade situation with either
>     --analyze-in-stages or --analyze-only set will find that the operation
>     skips tables that already have stats, and will have to add --force-analyze
>     to restore previous behavior.
> 
> 
> That's not a great surprise for group 6, but I have to believe that group is
> smaller than group 5, and it's definitely smaller than the group of users that
> need to upgrade.

Again, a clean API is the goal, not surprise calculus.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Mon, Nov 18, 2024 at 08:42:35PM -0500, Bruce Momjian wrote:
> On Mon, Nov 18, 2024 at 08:29:10PM -0500, Corey Huinker wrote:
> > That's not a great surprise for group 6, but I have to believe that group is
> > smaller than group 5, and it's definitely smaller than the group of users that
> > need to upgrade.
> 
> Again, a clean API is the goal, not surprise calculus.

Maybe I was too harsh.  "Surprise calculus" is fine, but only after we
have an API that will be clearly understood by new users.  We have to
assume that in the long run new users will use this API more than
existing users.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Mon, Nov 18, 2024 at 08:42:35PM -0500, Bruce Momjian wrote:
> We can't design an API around who is going to be surprised.  We have to
> look at what the options say, what people would expect it to do, and
> what it does.  The reason "surprise" doesn't work in the long run is
> that while PG 18 users might be surprised, PG 20 users will be confused.

I think Bruce makes good points.  I'd add that even if we did nothing at
all for vacuumdb, folks who continued to use it wouldn't benefit from the
new changes, but they also shouldn't be harmed by it, either.

>> I, personally, would be fine if this only modified --analyze-in-stages, as it
>> already carries the warning:
> 
> Right, but I think we would need to rename the option to clarify what it
> does, e.g. --analyze-missing-in-stages.  If they use
> --analyze-in-stages, they will get an error, and will then need to
> reference the docs to see the new option wording, or we can suggest the
> new option in the error message.
> 
>> But others felt that --analyze-only should be in the mix as well.
> 
> Again, with those other people not saying so in this thread, I can't
> really comment on it --- I can only tell you what I have seen and others
> are going to have to explain why they want such dramatic changes.

I don't have a strong opinion here, but I suspect that if I was creating
vacuumdb from scratch, I'd have suggested a --missing-only flag that would
only work for --analyze-only/--analyze-in-stages.  That way, folks can
still regenerate statistics if they want, but we also have an answer for
folks who use pg_upgrade and have extended statistics.

-- 
nathan



Re: Statistics Import and Export

From
Corey Huinker
Date:
I don't have a strong opinion here, but I suspect that if I was creating
vacuumdb from scratch, I'd have suggested a --missing-only flag that would
only work for --analyze-only/--analyze-in-stages.  That way, folks can
still regenerate statistics if they want, but we also have an answer for
folks who use pg_upgrade and have extended statistics.

(combining responses to Bruce's para about surprise calculus and Nathan here)

I agree that a clean API is desirable and a goal. And as I stated before, a new flag (--analyze-missing-in-stages / --analyze-post-pgupgrade, etc) or a flag modifier ( --missing-only ) was my first choice.

But if we're going to go that route, we have a messaging problem. We need to reach our customers who plan to upgrade and explain to them that the underlying assumption behind running vacuumdb has gone away for 99% of them (and may go away for 100% in the next version), that for that 99% running vacuumdb in the old way now actively undoes one of the major improvements to pg_upgrade, and that this one additional option keeps the benefits of the new pg_upgrade without the drawbacks.

That, and once we have extended statistics importing on upgrade, then the need for vacuumdb post-upgrade goes away entirely. So we'll have to re-message the users with that news too.

I'd be in favor of this, but I have to be honest, our messaging reach is not good, and takes years to sink in. Years in which the message will change at least one more time. And this outreach will likely confuse users who already weren't (and now shouldn't be) using vacuumdb. In light of that, the big risk was that an action that some users learned to do years ago was now actively undoing whatever gains they were supposed to get in their upgrade downtime, and that downtime is money to them, hence the surprise calculus.

One other possibilities we could consider:

* create a pg_stats_health_check script that lists tables missing stats, with --fix/--fix-in-stages options, effectively replacing vacuumdb for those purposes, and then crank up the messaging about that change. The "new shiny" effect of a new utility that has "stats", "health", and "check" in the name may be the search/click-bait we need to get the word out effectively. That last sentence may sound facetious, but it isn't, it's just accepting how search engines and eyeballs currently function. With that in place, we can then change the vacuumdb documentation to deter future use in post-upgrade situations.

* move missing-stats rebuilds into pg_upgrade/pg_restore itself, and this would give us the simpler one-time message that users should stop using vacuumdb in upgrade situations.

* Making a concerted push to get extended stats import into v18 despite the high-effort/low-reward nature of it, and then we can go with the simple messaging of "Remember vacuumdb, that thing you probably weren't running post-upgrade but should have been? Now you can stop using it!". I had extended stats imports working back when the function took JSON input, so it's do-able, but the difficulty lies in how to represent an array of incomplete pg_statistic rows in a serial fashion that is cross-version compatible.



 

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Tue, Nov 19, 2024 at 03:47:20PM -0500, Corey Huinker wrote:
>     I don't have a strong opinion here, but I suspect that if I was creating
>     vacuumdb from scratch, I'd have suggested a --missing-only flag that would
>     only work for --analyze-only/--analyze-in-stages.  That way, folks can
>     still regenerate statistics if they want, but we also have an answer for
>     folks who use pg_upgrade and have extended statistics.
> 
> 
> (combining responses to Bruce's para about surprise calculus and Nathan here)
> 
> I agree that a clean API is desirable and a goal. And as I stated before, a new
> flag (--analyze-missing-in-stages / --analyze-post-pgupgrade, etc) or a flag
> modifier ( --missing-only ) was my first choice.

Yes, after a clean API is designed, you can then consider surprise
calculus.  This is an issue not only for this feature, but for all
Postgres changes we consider, which is why I think it is worth stating
this clearly.  If I am thinking incorrectly, we can discuss that here too.

> But if we're going to go that route, we have a messaging problem. We need to
> reach our customers who plan to upgrade and explain to them that the
> underlying assumption behind running vacuumdb has gone away for 99% of them
> (and may go away for 100% in the next version), that for that 99% running
> vacuumdb in the old way now actively undoes one of the major improvements to
> pg_upgrade, and that this one additional option keeps the benefits of the new
> pg_upgrade without the drawbacks.

How much are we supposed to consider users who do not read the major
release notes?  I realize we might be unrealistic to expect that from
the majority of our users, but I also don't want to contort our API to
adjust for them.

> That, and once we have extended statistics importing on upgrade, then the need
> for vacuumdb post-upgrade goes away entirely. So we'll have to re-message the
> users with that news too.
> 
> I'd be in favor of this, but I have to be honest, our messaging reach is not
> good, and takes years to sink in. Years in which the message will change at
> least one more time. And this outreach will likely confuse users who already
> weren't (and now shouldn't be) using vacuumdb. In light of that, the big risk
> was that an action that some users learned to do years ago was now actively
> undoing whatever gains they were supposed to get in their upgrade downtime, and
> that downtime is money to them, hence the surprise calculus.

That is a big purpose of the major release notes.  We can even list this
as an incompatibility in the sense that the procedure has changed.

> One other possibilities we could consider:
> 
> * create a pg_stats_health_check script that lists tables missing stats, with
> --fix/--fix-in-stages options, effectively replacing vacuumdb for those
> purposes, and then crank up the messaging about that change. The "new shiny"
> effect of a new utility that has "stats", "health", and "check" in the name may
> be the search/click-bait we need to get the word out effectively. That last
> sentence may sound facetious, but it isn't, it's just accepting how search
> engines and eyeballs currently function. With that in place, we can then change
> the vacuumdb documentation to deter future use in post-upgrade situations.

We used to create a script until the functionality was added to
vacuumdb.  Since 99% of users will not need to do anything after
pg_upgrade, it would make sense to output the script only for the 1% of
users who need it and tell users to run it, rather than giving
instructions that are a no-op for 99% of users.

> * move missing-stats rebuilds into pg_upgrade/pg_restore itself, and this would
> give us the simpler one-time message that users should stop using vacuumdb in
> upgrade situations.

Uh, that would make pg_upgrade take longer for some users, which might
be confusing.

> * Making a concerted push to get extended stats import into v18 despite the
> high-effort/low-reward nature of it, and then we can go with the simple
> messaging of "Remember vacuumdb, that thing you probably weren't running
> post-upgrade but should have been? Now you can stop using it!". I had extended
> stats imports working back when the function took JSON input, so it's do-able,
> but the difficulty lies in how to represent an array of incomplete pg_statistic
> rows in a serial fashion that is cross-version compatible.

I am not a big fan of that at this point.  If we get it, we can adjust
our API at that time, but I don't want to plan on it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Tue, Nov 19, 2024 at 05:40:20PM -0500, Bruce Momjian wrote:
> On Tue, Nov 19, 2024 at 03:47:20PM -0500, Corey Huinker wrote:
> > * create a pg_stats_health_check script that lists tables missing stats, with
> > --fix/--fix-in-stages options, effectively replacing vacuumdb for those
> > purposes, and then crank up the messaging about that change. The "new shiny"
> > effect of a new utility that has "stats", "health", and "check" in the name may
> > be the search/click-bait we need to get the word out effectively. That last
> > sentence may sound facetious, but it isn't, it's just accepting how search
> > engines and eyeballs currently function. With that in place, we can then change
> > the vacuumdb documentation to deter future use in post-upgrade situations.
> 
> We used to create a script until the functionality was added to
> vacuumdb.  Since 99% of users will not need to do anything after
> pg_upgrade, it would make sense to output the script only for the 1% of
> users who need it and tell users to run it, rather than giving
> instructions that are a no-op for 99% of users.

One problem with the above approach is that it gives users upgrading or
loading via pg_dump no way to know which tables need analyze statistics,
right?  I think that is why we ended up putting the pg_upgrade
statistics functionality in vacuumdb --analyze-in-stages.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Nathan Bossart
Date:
I took another look at v32-0001 and v32-0002, and they look reasonable to
me.  Unless additional feedback materializes, I'll plan on committing those
soon.

After that, it might be a good idea to take up the vacuumdb changes next,
since there's been quite a bit of recent discussion about those.  I have a
fair amount of experience working on vacuumdb, so I could probably help
there, too.

-- 
nathan



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Fri, 2024-11-22 at 15:09 -0600, Nathan Bossart wrote:
> I took another look at v32-0001 and v32-0002, and they look
> reasonable to
> me.  Unless additional feedback materializes, I'll plan on committing
> those
> soon.

Those refactoring patches look fine to me, the only comment I have is
that they are easier to understand as a single patch.

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Mon, Nov 25, 2024 at 08:11:09AM -0800, Jeff Davis wrote:
> On Fri, 2024-11-22 at 15:09 -0600, Nathan Bossart wrote:
>> I took another look at v32-0001 and v32-0002, and they look
>> reasonable to
>> me.  Unless additional feedback materializes, I'll plan on committing
>> those
>> soon.
> 
> Those refactoring patches look fine to me, the only comment I have is
> that they are easier to understand as a single patch.

Yeah, I intend to combine them when committing.

-- 
nathan



Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Mon, Nov 25, 2024 at 01:12:52PM -0600, Nathan Bossart wrote:
> On Mon, Nov 25, 2024 at 08:11:09AM -0800, Jeff Davis wrote:
>> On Fri, 2024-11-22 at 15:09 -0600, Nathan Bossart wrote:
>>> I took another look at v32-0001 and v32-0002, and they look
>>> reasonable to
>>> me.  Unless additional feedback materializes, I'll plan on committing
>>> those
>>> soon.
>> 
>> Those refactoring patches look fine to me, the only comment I have is
>> that they are easier to understand as a single patch.
> 
> Yeah, I intend to combine them when committing.

Committed.

-- 
nathan



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-11-18 at 20:29 -0500, Corey Huinker wrote:
> Attached is a re-basing of the existing patchset, plus 3 more
> additions:

Comments on 0003:

* If we commit 0003, is it a useful feature by itself or does it
require that we commit some or all of 0004-0014? Which of these need to
be in v18?

* Why does binary upgrade cause statistics to be dumped? Can you just
make pg_upgrade specify the appropriate set of flags?

* It looks like appendNamedArgument() is never called with
positional=true?

* It's pretty awkward to use an array of arrays of strings when an
array of structs might make more sense.


Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

Comments on 0003:

* If we commit 0003, is it a useful feature by itself or does it
require that we commit some or all of 0004-0014? Which of these need to
be in v18?

Useful by itself.

0004 seems needed to me, unless we're fine with ~50% bloat in pg_class on a newly-upgraded system, or we think in-place updates are on their way out.

0005 is basically theoretical; it is only needed if we change the default relpages on partitioned tables.

0006-0011 are the vacuumdb things being debated now. I've reached out to some of the people I spoke to at PgConf.eu to get them to chime in.

0012 is now moot as a similar patch was committed Friday.

0013 is a cleanup/optimization.

0014 is the --no-data flag, which has no meaning without 0004, but 0004 in no way requires it.
 

* Why does binary upgrade cause statistics to be dumped? Can you just
make pg_upgrade specify the appropriate set of flags?

That decision goes back a ways; I tried to dig through the archives last night but was getting a Server Error on postgresql.org.

Today I'm coming up with https://www.postgresql.org/message-id/267624.1711756062%40sss.pgh.pa.us which is actually more about whether stats import should be the default (consensus: yes, yes, yes). The binary_upgrade test may have been there because binary_upgrade shuts off the data section, while table stats are in the data section. Happy to revisit the decision.
 
* It looks like appendNamedArgument() is never called with
positional=true?

That is currently the case: all params are now passed as name/value pairs, but in the past we had leading positionals followed by the stat-y parameters in name/value pairs. I'll be refactoring it to remove the positional=T/F argument, which leaves just a list of name-type pairs and thus probably reduces the urge to make it an array of structs.
 
* It's pretty awkward to use an array of arrays of strings when an
array of structs might make more sense.

That pattern was introduced here: https://www.postgresql.org/message-id/4afa70edab849ff16238d1100b6652404e9a4d9d.camel%40j-davis.com :)

I'll be rebasing (that's done) and refactoring 0003 to get rid of the positional column, and moving 0014 next to 0003 because they touch the same files.

Re: Statistics Import and Export

From
Magnus Hagander
Date:

On Tue, Nov 19, 2024 at 1:50 PM Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 18, 2024 at 08:42:35PM -0500, Bruce Momjian wrote:
> On Mon, Nov 18, 2024 at 08:29:10PM -0500, Corey Huinker wrote:
> > That's not a great surprise for group 6, but I have to believe that group is
> > smaller than group 5, and it's definitely smaller than the group of users that
> > need to upgrade.
>
> Again, a clean API is the goal, not surprise calculus.

Maybe I was too harsh.  "Surprise calculus" is fine, but only after we
have an API that will be clearly understood by new users.  We have to
assume that in the long run new users will use this API more than
existing users.

If you want to avoid both the surprise and confusion factor mentioned before, maybe what's needed is to *remove* --analyze-in-stages, and replace it with --analyze-missing-in-stages and --analyze-all-in-stages (with the clear warning about what --analyze-all-in-stages can do to your system if you already have statistics).

That goes with the "immediate breakage that you see right away is better than silently doing the unexpected where you might not notice the problem until much later".

That might trade some of that surprise and confusion for annoyance instead, but going forward that might be a clearer path?
 
--

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 02:44:01PM +0100, Magnus Hagander wrote:
> If you want to avoid both the surprise and confusion factor mentioned before,
> maybe what's needed is to *remove* --analyze-in-stages, and replace it with
> --analyze-missing-in-stages and --analyze-all-in-stages (with the clear warning
> about what --analyze-all-in-stages can do to your system if you already have
> statistics).
> 
> That goes with the "immediate breakage that you see right away is better than
> silently doing the unexpected where you might not notice the problem until much
> later".
> 
> That might trade some of that surprise and confusion for annoyance instead, but
> going forward that might be a clearer path?

Oh, so remove --analyze-in-stages and have it issue a suggestion, and
make two versions --- yeah, that would work too.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Alvaro Herrera
Date:
On 2024-Nov-27, Bruce Momjian wrote:

> On Wed, Nov 27, 2024 at 02:44:01PM +0100, Magnus Hagander wrote:
> > If you want to avoid both the surprise and confusion factor mentioned before,
> > maybe what's needed is to *remove* --analyze-in-stages, and replace it with
> > --analyze-missing-in-stages and --analyze-all-in-stages (with the clear warning
> > about what --analyze-all-in-stages can do to your system if you already have
> > statistics).
> > 
> > That goes with the "immediate breakage that you see right away is better than
> > silently doing the unexpected where you might not notice the problem until much
> > later".
> > 
> > That might trade some of that surprise and confusion for annoyance instead, but
> > going forward that might be a clearer path?
> 
> Oh, so remove --analyze-in-stages and have it issue a suggestion, and
> make two versions --- yeah, that would work too.

Maybe not remove the option, but add a required parameter:
--analyze-in-stages=all / missing

That way, if the option is missing, the user can adapt the command line
according to need.
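
As a sketch, the proposed (not committed) syntax would look something like:

    # hypothetical spellings under discussion, not existing vacuumdb options
    vacuumdb --all --analyze-in-stages=missing   # analyze only relations lacking stats
    vacuumdb --all --analyze-in-stages=all       # old behavior: recompute everything
    vacuumdb --all --analyze-in-stages           # would now raise an error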

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."                (Robert Davidson)
               http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php



Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Wed, Nov 27, 2024 at 04:00:02PM +0100, Alvaro Herrera wrote:
> On 2024-Nov-27, Bruce Momjian wrote:
> 
>> On Wed, Nov 27, 2024 at 02:44:01PM +0100, Magnus Hagander wrote:
>> > If you want to avoid both the surprise and confusion factor mentioned before,
>> > maybe what's needed is to *remove* --analyze-in-stages, and replace it with
>> > --analyze-missing-in-stages and --analyze-all-in-stages (with the clear warning
>> > about what --analyze-all-in-stages can do to your system if you already have
>> > statistics).
>> > 
>> > That goes with the "immediate breakage that you see right away is better than
>> > silently doing the unexpected where you might not notice the problem until much
>> > later".
>> > 
>> > That might trade some of that surprise and confusion for annoyance instead, but
>> > going forward that might be a clearer path?
>> 
>> Oh, so remove --analyze-in-stages and have it issue a suggestion, and
>> make two versions --- yeah, that would work too.

We did something similar when we removed exclusive backup mode.
pg_start_backup() and pg_stop_backup() were renamed to pg_backup_start()
and pg_backup_stop() to prevent folks' backup scripts from silently
changing behavior after an upgrade.

> Maybe not remove the option, but add a required parameter:
> --analyze-in-stages=all / missing
> 
> That way, if the option is missing, the user can adapt the command line
> according to need.

I like this idea.

-- 
nathan



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 09:18:45AM -0600, Nathan Bossart wrote:
> On Wed, Nov 27, 2024 at 04:00:02PM +0100, Alvaro Herrera wrote:
> > On 2024-Nov-27, Bruce Momjian wrote:
> > 
> >> On Wed, Nov 27, 2024 at 02:44:01PM +0100, Magnus Hagander wrote:
> >> > If you want to avoid both the surprise and confusion factor mentioned before,
> >> > maybe what's needed is to *remove* --analyze-in-stages, and replace it with
> >> > --analyze-missing-in-stages and --analyze-all-in-stages (with the clear warning
> >> > about what --analyze-all-in-stages can do to your system if you already have
> >> > statistics).
> >> > 
> >> > That goes with the "immediate breakage that you see right away is better than
> >> > silently doing the unexpected where you might not notice the problem until much
> >> > later".
> >> > 
> >> > That might trade some of that surprise and confusion for annoyance instead, but
> >> > going forward that might be a clearer path?
> >> 
> >> Oh, so remove --analyze-in-stages and have it issue a suggestion, and
> >> make two versions --- yeah, that would work too.
> 
> We did something similar when we removed exclusive backup mode.
> pg_start_backup() and pg_stop_backup() were renamed to pg_backup_start()
> and pg_backup_stop() to prevent folks' backup scripts from silently
> changing behavior after an upgrade.
> 
> > Maybe not remove the option, but add a required parameter:
> > --analyze-in-stages=all / missing
> > 
> > That way, if the option is missing, the user can adapt the command line
> > according to need.
> 
> I like this idea.

Uh, do we have parameters that require a boolean option like this? 
Would there be a default?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Alvaro Herrera
Date:
On 2024-Nov-27, Bruce Momjian wrote:

> On Wed, Nov 27, 2024 at 09:18:45AM -0600, Nathan Bossart wrote:

> > > Maybe not remove the option, but add a required parameter:
> > > --analyze-in-stages=all / missing
> > > 
> > > That way, if the option is missing, the user can adapt the command line
> > > according to need.
> > 
> > I like this idea.
> 
> Would there be a default?

There would be no default.  Running with no option given would raise an
error.  The point is: you want to break scripts currently running
--analyze-in-stages so that they can make a choice of which of these two
modes to run.  Your proposal (as I understand it) is to remove the
--analyze-in-stages option and add two other options.  My proposal is to
keep --analyze-in-stages, but require it to have a specifier of which
mode to run.  Both achieve what you want, but I think mine achieves it
in a cleaner way.

> Uh, do we have parameters that require a boolean option like this? 

I'm not sure what exactly are you asking here.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"



Re: Statistics Import and Export

From
Nathan Bossart
Date:
On Wed, Nov 27, 2024 at 04:57:35PM +0100, Alvaro Herrera wrote:
> On 2024-Nov-27, Bruce Momjian wrote:
>> Uh, do we have parameters that require a boolean option like this? 
> 
> I'm not sure what exactly are you asking here.

We do have options like initdb's --sync-method that require specifying one
of a small set of valid arguments.  I don't see any reason that wouldn't
work here, too.
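
For reference, that precedent looks like this (datadir is a placeholder):

    initdb --sync-method=fsync  -D datadir   # the default: fsync each file
    initdb --sync-method=syncfs -D datadir   # sync whole filesystems, where supported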

-- 
nathan



Re: Statistics Import and Export

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2024-Nov-27, Bruce Momjian wrote:
>> Would there be a default?

> There would be no default.  Running with no option given would raise an
> error.  The point is: you want to break scripts currently running
> --analyze-in-stages so that they can make a choice of which of these two
> modes to run.  Your proposal (as I understand it) is to remove the
> --analyze-in-stages option and add two other options.  My proposal is to
> keep --analyze-in-stages, but require it to have a specifier of which
> mode to run.  Both achieve what you want, but I think mine achieves it
> in a cleaner way.

I do not like the idea of breaking existing upgrade scripts,
especially not by requiring them to use a parameter that older
vacuumdb versions will reject.  That makes it impossible to have a
script that is version independent.  I really doubt that there is any
usability improvement to be had here that's worth that.

How about causing "--analyze-in-stages" (as currently spelled) to
be a no-op?  We could keep the behavior available under some other
name.

            regards, tom lane



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 04:57:35PM +0100, Álvaro Herrera wrote:
> On 2024-Nov-27, Bruce Momjian wrote:
> There would be no default.  Running with no option given would raise an
> error.  The point is: you want to break scripts currently running
> --analyze-in-stages so that they can make a choice of which of these two
> modes to run.  Your proposal (as I understand it) is to remove the
> --analyze-in-stages option and add two other options.  My proposal is to
> keep --analyze-in-stages, but require it to have a specifier of which
> mode to run.  Both achieve what you want, but I think mine achieves it
> in a cleaner way.
> 
> > Uh, do we have parameters that require a boolean option like this? 
> 
> I'm not sure what exactly are you asking here.

I can't think of a Postgres option that can take only one of two
possible values, and where there is no default value.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 11:44:25AM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2024-Nov-27, Bruce Momjian wrote:
> >> Would there be a default?
> 
> > There would be no default.  Running with no option given would raise an
> > error.  The point is: you want to break scripts currently running
> > --analyze-in-stages so that they can make a choice of which of these two
> > modes to run.  Your proposal (as I understand it) is to remove the
> > --analyze-in-stages option and add two other options.  My proposal is to
> > keep --analyze-in-stages, but require it to have a specifier of which
> > mode to run.  Both achieve what you want, but I think mine achieves it
> > in a cleaner way.
> 
> I do not like the idea of breaking existing upgrade scripts,
> especially not by requiring them to use a parameter that older
> vacuumdb versions will reject.  That makes it impossible to have a
> script that is version independent.  I really doubt that there is any
> usability improvement to be had here that's worth that.
> 
> How about causing "--analyze-in-stages" (as currently spelled) to
> be a no-op?  We could keep the behavior available under some other
> name.

Uh, I guess we could do that, but we should emit something like
"--analyze-in-stages option ignored".

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Alvaro Herrera
Date:
On 2024-Nov-27, Tom Lane wrote:

> I do not like the idea of breaking existing upgrade scripts,
> especially not by requiring them to use a parameter that older
> vacuumdb versions will reject.  That makes it impossible to have a
> script that is version independent.  I really doubt that there is any
> usability improvement to be had here that's worth that.

I was only suggesting to break it because it was said upthread that that
was desirable behavior.

> How about causing "--analyze-in-stages" (as currently spelled) to
> be a no-op?  We could keep the behavior available under some other
> name.

I think making it a no-op isn't useful, because people who run the old
scripts will get the behavior we do not want: clobber the statistics and
recompute them, losing the benefit that this feature brings.

On 2024-Nov-27, Bruce Momjian wrote:

> Uh, I guess we could do that, but we should emit something like
> "--analyze-in-stages option ignored".

I think emitting a message is not useful.  It's quite possible that the
output of pg_upgrade will be redirected somewhere and this will go
unnoticed.


Maybe the most convenient for users is to keep "vacuumdb
--analyze-in-stages" doing exactly what we want to happen after
pg_upgrade, that is, in 18+, only recreate the missing stats.  This is
because of what Corey said about messaging: many users are not going to
get our message that they need to adapt their scripts, so they won't.
Breaking the script would convey that message pretty quickly, but you're
right that it's not very convenient.

For people that want to use the old behavior of recomputing _all_
statistics not just the missing ones, we could add a different switch,
or an (optional) option to --analyze-in-stages.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Statistics Import and Export

From
Magnus Hagander
Date:


On Wed, Nov 27, 2024, 17:44 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2024-Nov-27, Bruce Momjian wrote:
>> Would there be a default?

> There would be no default.  Running with no option given would raise an
> error.  The point is: you want to break scripts currently running
> --analyze-in-stages so that they can make a choice of which of these two
> modes to run.  Your proposal (as I understand it) is to remove the
> --analyze-in-stages option and add two other options.  My proposal is to
> keep --analyze-in-stages, but require it to have a specifier of which
> mode to run.  Both achieve what you want, but I think mine achieves it
> in a cleaner way.

I do not like the idea of breaking existing upgrade scripts,
especially not by requiring them to use a parameter that older
vacuumdb versions will reject.  That makes it impossible to have a
script that is version independent.  I really doubt that there is any
usability improvement to be had here that's worth that.

How about causing "--analyze-in-stages" (as currently spelled) to
be a no-op?  We could keep the behavior available under some other
name.

If we're doing that, we might as well make it be the "when missing" behavior.

If we make it do nothing, then we surprise the users of it today just as much as we do if we make it "when missing". And it would actually solve the problem for the others. But the point was that people didn't like silently changing the behavior of the existing parameter - and making it a no-op would change it even more.

/Magnus 

 

Re: Statistics Import and Export

From
Corey Huinker
Date:

For people that want to use the old behavior of recomputing _all_
statistics not just the missing ones, we could add a different switch,
or an (optional) option to --analyze-in-stages.


The current patchset provides that in the form of the parameter "--force-analyze", which is a modifier to "--analyze-in-stages" and "--analyze-only". 
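
To make the combinations concrete, as currently spelled in the patchset (subject to change):

    vacuumdb --all --analyze-in-stages                   # with the patch: only missing stats
    vacuumdb --all --analyze-in-stages --force-analyze   # recompute all stats, in stages
    vacuumdb --all --analyze-only      --force-analyze   # recompute all stats, single pass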

Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 01:15:45PM -0500, Corey Huinker wrote:
> 
>     For people that want to use the old behavior of recomputing _all_
>     statistics not just the missing ones, we could add a different switch,
>     or an (optional) option to --analyze-in-stages.
> 
> The current patchset provides that in the form of the parameter
> "--force-analyze", which is a modifier to "--analyze-in-stages" and
> "--analyze-only". 

I don't think there is consensus to change --analyze-only, only maybe
--analyze-in-stages.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: Statistics Import and Export

From
Jeff Davis
Date:
On Tue, 2024-11-26 at 17:11 -0500, Corey Huinker wrote:
>
> > * Why does binary upgrade cause statistics to be dumped? Can you
> > just
> > make pg_upgrade specify the appropriate set of flags?
> >
>
>
> That decision goes back a ways, I tried to dig in the archives last
> night but I was getting a Server Error on postgresql.org.

I suggest that pg_upgrade be changed to pass --no-data to pg_dump,
rather than --schema-only.

That way, you don't need to create a special case for the pg_dump
default that depends on whether it's a binary upgrade or not.

If wanted, there could also be a new option to pg_upgrade to specify --
with-statistics (default, passes --no-data to pg_dump) or --no-
statistics (passes --schema-only to pg_dump). But that option is
probably not necessary; everyone upgrading probably wants the stats.
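
Roughly, the pg_dump invocations behind this suggestion would be (a sketch only;
"mydb" is a placeholder, and pg_upgrade passes more options than shown):

    # current approach: schema only, with a binary-upgrade special case to include stats
    pg_dump --binary-upgrade --schema-only mydb
    # suggested approach: everything except table data, so statistics come along normally
    pg_dump --binary-upgrade --no-data mydb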

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
On Sat, Dec 7, 2024 at 2:27 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2024-11-26 at 17:11 -0500, Corey Huinker wrote:
>
> > * Why does binary upgrade cause statistics to be dumped? Can you
> > just
> > make pg_upgrade specify the appropriate set of flags?
> >
>
>
> That decision goes back a ways, I tried to dig in the archives last
> night but I was getting a Server Error on postgresql.org.

I suggest that pg_upgrade be changed to pass --no-data to pg_dump,
rather than --schema-only.

That way, you don't need to create a special case for the pg_dump
default that depends on whether it's a binary upgrade or not.

+1
 
If wanted, there could also be a new option to pg_upgrade to specify --
with-statistics (default, passes --no-data to pg_dump) or --no-
statistics (passes --schema-only to pg_dump). But that option is
probably not necessary; everyone upgrading probably wants the stats.

This makes sense, though instead of --schema-only perhaps we should pass both --no-statistics and --no-data. I don't envision a fourth option to the new data/schema/stats triumvirate, but --no-statistics shouldn't have a bearing on that future fourth option.
 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Sat, 2024-12-07 at 14:56 -0500, Corey Huinker wrote:
> This makes sense, though perhaps instead of --schema-only perhaps we
> should pass both --no-statistics and --no-data. I don't envision a
> fourth option to the new data/schema/stats triumvirate, but --no-
> statistics shouldn't have a bearing on that future fourth option.

+1, assuming such an option is wanted at all. I suppose it should be
there for the unlikely (and hopefully impossible) case that statistics
are causing a problem during upgrade.

Regards,
    Jeff Davis 




Re: Statistics Import and Export

From
Corey Huinker
Date:
The biggest functional change is the way dependencies are handled for
matview stats. Materialized views ordinarily end up in
SECTION_PRE_DATA, but in some cases they can be postponed to
SECTION_POST_DATA. You solved that by always putting the matview stats
in SECTION_POST_DATA.

Accurate.

 
I took a different approach here and, when the matview is postponed,
also postpone the matview stats. It's slightly more code, but it felt
closer to the rest of the structure, where postponing is a special case
(that we might be able to remove in the future).

+1. The fact that this quirk was a knock-on effect of the postponing-quirk, which could go away, makes this change compelling.

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-12-19 at 21:23 -0800, Jeff Davis wrote:
> > 0001-0005 - changes to pg_dump/pg_upgrade
>
> Attached is a version 36j...

The testing can use some work here. I noticed that if I take out the
stats entirely, the tests still pass, because pg_upgrade still gets the
same before/after result.

Also, we need some testing of the output and ordering of pg_dump.
Granted, in most cases problems would result in errors during the
reload. But we have those tests for other kinds of objects, so we
should have the tests for stats, too.

I like the description "STATISTICS DATA" because it differentiates from
the extended stats definitions. It might be worth differentiating
between "RELATION STATISTICS DATA" and "ATTRIBUTE STATISTICS DATA" but
I'm not sure if there's value in that.

But how did you determine what to use for the .tag and prefix? In the
output, it uses the form:

  Name: STATISTICS DATA <name>; Type: STATISTICS DATA; ...

Should that be:

  Name: <name>; Type: STATISTICS DATA; ...

Or:

  Data for Name: ...; Name: ...; Type: STATISTICS DATA; ...

Or:

  Statistics for Name: ...; Name: ...; Type: STATISTICS DATA; ...

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:

I like the description "STATISTICS DATA" because it differentiates from
the extended stats definitions. It might be worth differentiating
between "RELATION STATISTICS DATA" and "ATTRIBUTE STATISTICS DATA" but
I'm not sure if there's value in that.

I have no objection to such a change, even if we don't currently have a use for the differentiation, someone in the future might.

 

But how did you determine what to use for the .tag and prefix? In the
output, it uses the form:

It was the minimal change needed to meet Tomas's suggestion.
 
  Statistics for Name: ...; Name: ...; Type: STATISTICS DATA; ...

I like this one best, because it clarifies the meaning of STATISTICS DATA.