Re: Statistics Import and Export - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Statistics Import and Export
Date
Msg-id CADkLM=ekAg8+1cFR9HmdoEey2TmGNUwVFwdLU0L5Mf_dARQgFA@mail.gmail.com
Whole thread Raw
In response to Re: Statistics Import and Export  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers


+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

We don't strictly need it, but I've found the set-difference operation to be incredibly useful in diagnosing problems. Additionally, the values are subject to change due to changes in test data, there is no guarantee that the output of ANALYZE is deterministic, etc. But most of all, because the test cares about the correct copying of values, not the values themselves.
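For illustration, the test pattern in question might look like the sketch below. This is only an assumption about the shape of the test: the argument list of pg_set_relation_stats() here is invented for illustration, as the signature varied across patch versions.

```sql
-- Sketch only: build one pg_set_relation_stats() call per target table
-- from pg_class, then execute the constructed SQL with \gexec. Because
-- the commands are constructed as query output, ECHO shows exactly what
-- was run, which makes set-difference debugging of stats possible.
SELECT format(
         'SELECT pg_set_relation_stats(%L::regclass, %s, %s, %s)',
         oid::regclass, relpages, reltuples, relallvisible)
FROM pg_class
WHERE relname = 'stats_test_table'
\gexec
```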
 

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments cannot be NULL" or "both arguments in each of the pairs (x, y), (a, b), and (c, d) should be either both non-NULL or both NULL".

Yes. It should.
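To make the pairing rule concrete, a documentation example could show a call that trips the validation. This is a sketch with hypothetical named arguments; the pairing of most_common_vals with most_common_freqs mirrors the pg_stats view, but the exact signature of pg_set_attribute_stats() here is assumed.

```sql
-- Hypothetical: most_common_vals and most_common_freqs must be either
-- both NULL or both non-NULL. Supplying one without the other should
-- be rejected with a validation error rather than stored.
SELECT pg_set_attribute_stats(
         relation          => 'stats_test_table'::regclass,
         attname           => 'a',
         inherited         => false,
         null_frac         => 0.1,
         most_common_vals  => '{1,2,3}',
         most_common_freqs => NULL);  -- expected to fail validation
```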
 
 Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are the
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics on a foreign
table makes sense: it contains no local data, but the statistics are still meaningful.

Good points, but I'm not immediately sure how to enforce those rules.
 
 

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.
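On the grouping question above, one way a restore script could get per-relation all-or-nothing behavior is to wrap each relation's calls in a transaction, so that any validation error rolls back the whole group. This is a sketch, assuming the pg_set_* functions raise errors on bad input; argument lists are hypothetical.

```sql
-- Sketch: group one relation's statistics calls so they succeed or
-- fail together. An error in any call aborts the transaction, leaving
-- that table's statistics untouched and eligible for a manual ANALYZE.
BEGIN;
SELECT pg_set_relation_stats('stats_test_table'::regclass, 10, 1000, 10);
SELECT pg_set_attribute_stats('stats_test_table'::regclass, 'a', false,
                              null_frac => 0.0);  -- one call per attribute
COMMIT;
```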



Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore. If the statistics are changed by auto-analyze running post-restore, these tests will fail.

+1
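One way such tests could shield themselves from point 2 is to disable autovacuum (and hence auto-analyze) on the tables whose dumps are compared. This uses the standard storage parameter, though whether the tests should actually do this is a judgment call:

```sql
-- Disable autovacuum on the test table so statistics set at restore
-- time are not overwritten by auto-analyze before the post-restore
-- dump is taken for comparison.
ALTER TABLE stats_test_table SET (autovacuum_enabled = off);
```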
 
I believe, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attribute. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?

I don't think there's time to get it into this release. I think we'd want to extend this functionality to both IMPORT FOREIGN SCHEMA and ANALYZE for foreign tables, in both cases with a server/table option to do regular remote sampling. In both cases, they'd do a remote query very similar to what pg_dump does (hence putting it in fe_utils), with some filters on which columns/tables it believes it can trust. The remote table might itself be a view (in which case the query would turn up nothing) or column data types may change across the wire, and in those cases we'd have to fall back to sampling.
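The remote query in question could resemble what pg_dump would issue against the pg_stats view (column names below are real pg_stats columns; the trust filters described above are omitted from this sketch):

```sql
-- Sketch: fetch per-column statistics for one remote table, as a
-- foreign-table ANALYZE might, before deciding column by column
-- whether the types match well enough across the wire to trust the
-- values, or whether to fall back to remote sampling.
SELECT attname, inherited, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_catalog.pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'remote_table';
```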
