Re: Statistics Import and Export - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Statistics Import and Export
Msg-id 4268d862-7784-f76f-d7f1-359b0b4db67c@enterprisedb.com
In response to Re: Statistics Import and Export  (Corey Huinker <corey.huinker@gmail.com>)
Responses Re: Statistics Import and Export
Re: Statistics Import and Export
List pgsql-hackers
Hi,

I finally had time to look at the last version of the patch, so here's a
couple of thoughts and questions, in somewhat random order. Please take
this as a bit of brainstorming, and push back if you disagree with some
of my comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively, we could have the usual "functional" interface, with
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, of course.

I suggest we focus on the pg_upgrade use case for now. In particular, I
think we really need to find a good way to integrate this into
pg_upgrade. I'm not against having custom CLI commands, but it's still a
manual thing - I wonder if we could extend pg_dump to dump stats, or
make it built-in into pg_upgrade in some way (possibly disabled by
default, or something like that).


JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.
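For illustration, the plain-CSV alternative could be as simple as this
(a hypothetical sketch only; the table name and file path are
placeholders, and the anyarray stavaluesN columns come out in their text
representation):

```sql
-- Hypothetical sketch: dump the raw stats rows for one table as CSV.
COPY (SELECT * FROM pg_statistic
       WHERE starelid = 'some_table'::regclass)
  TO '/tmp/some_table_stats.csv' WITH (FORMAT csv, HEADER);
```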

I'm still a bit unsure about the "right" JSON schema. I find it a bit
inconvenient that the JSON objects mimic the pg_statistic schema very
closely. In particular, it has one array for stakind values, another
array for stavalues, array for stanumbers etc. I understand generating
this JSON in SQL is fairly straightforward, and for the pg_upgrade use
case it's probably OK. But my concern is it's not very convenient for
the "manual tweaking" use case, because the "related" fields are
scattered in different parts of the JSON.

That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).
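To illustrate what I mean, a query along these lines might produce the
grouped format for a single column (a rough sketch only; it hard-codes
the MCV list in slot 1 and the histogram in slot 2, while a real query
would have to match the stakindN values dynamically, and the anyarray
columns force a detour through their text form):

```sql
-- Hypothetical sketch: build the grouped JSON for attnum 1 of table "t",
-- assuming the MCV list sits in slot 1 and the histogram in slot 2.
SELECT jsonb_build_object(
         'mcv', jsonb_build_object(
                  'values',      stavalues1::text,   -- anyarray, text form
                  'frequencies', to_jsonb(stanumbers1)),
         'histogram', jsonb_build_object(
                  'bounds',      stavalues2::text))
  FROM pg_statistic
 WHERE starelid = 't'::regclass AND staattnum = 1;
```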


data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

Consider for example the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.
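A trivial example of that ordering problem:

```sql
-- The same values compare differently as integers and as text, so a
-- histogram built under one ordering is wrong under the other:
SELECT 9 < 10 AS int_order,                   -- true
       '9'::text < '10'::text AS text_order;  -- false, '9' sorts after '1'
```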

For the extended statistics export, I think we also need to include the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.


more strict checks
------------------

I think the code should be a bit more "defensive" when importing stuff,
and do at least some sanity checks. For the pg_upgrade use case this
should be mostly a non-issue (except maybe for helping to detect bugs
earlier), but for the "manual tweak" use case it's much more important.

By this I mean checks like:

* making sure the frequencies in MCV lists are not obviously wrong
  (outside [0,1], sum exceeding 1.0, etc.)

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
  both arrays while histogram has only stavalues, that the arrays have
  the same length for MCV, etc.)

* checking there are no duplicate stakind values (e.g. two MCV lists)
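The MCV frequency checks, for example, might look something like this on
the import side (a sketch only, assuming the current JSON format with the
frequencies in a 'stanumbers' array, and the export view / column names
from the patch as I understand them):

```sql
-- Hypothetical sketch: flag MCV frequency arrays that are obviously
-- broken (an element outside [0,1], or a sum exceeding 1.0).
SELECT e.ext_stats_name,
       bool_and(f::float8 BETWEEN 0.0 AND 1.0) AS each_in_range,
       sum(f::float8) <= 1.0                   AS sum_ok
  FROM pg_statistic_ext_export AS e,
       LATERAL jsonb_array_elements_text(e.stats -> 'stanumbers') AS t(f)
 GROUP BY e.ext_stats_name;
```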

This is another reason why I was thinking the current JSON format may be
a bit inconvenient, because it loads the fields separately, making the
checks harder. But I guess it could be done after loading everything, as
a separate phase.

Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.


minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

Actually, this means that the supported use cases determine whether the
expected JSON structure is part of the API. For pg_upgrade we could keep
it as "internal" and maybe change it as needed, but for "manual tweak"
it'd become part of the public API.

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

5) What happens if we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

7) The func.sgml additions in 0007 seem a bit strange, particularly the
first sentence of the paragraph.

8) While experimenting with the patch, I noticed this:

  create table t (a int, b int, c text);
  create statistics s on a, b, c, (a+b), (a-b) from t;

  create table t2 (a text, b text, c text);
  create statistics s2 on a, c from t2;

  select pg_import_ext_stats(
    (select oid from pg_statistic_ext where stxname = 's2'),
    (select server_version_num from pg_statistic_ext_export
                              where ext_stats_name = 's'),
    (select stats from pg_statistic_ext_export
                 where ext_stats_name = 's'));

WARNING:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.
ERROR:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.

I guess we should not trigger WARNING+ERROR with the same message.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


