Re: Statistics Import and Export - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Statistics Import and Export
Msg-id: 4268d862-7784-f76f-d7f1-359b0b4db67c@enterprisedb.com
In response to: Re: Statistics Import and Export (Corey Huinker <corey.huinker@gmail.com>)
List: pgsql-hackers
Hi,

I finally had time to look at the last version of the patch, so here's a couple of thoughts and questions in somewhat random order. Please take this as a bit of brainstorming, and push back if you disagree with some of my comments.

In general, I like the goal of this patch - not having statistics is a common issue after an upgrade, and people sometimes don't even realize they need to run ANALYZE. So it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to tweak optimizer statistics on a running cluster, to see what the plan would be in that case. Or more precisely - I agree that would be an interesting and useful feature, but maybe the interface should not be the same as for the binary upgrade use case?

interfaces
----------

When I thought about the ability to dump/load statistics in the past, I usually envisioned some sort of DDL that would do the export and import. So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands, or something like that, and that'd do all the work. This would make stats "first-class citizens", and it'd be fairly straightforward to add this into pg_dump, for example. Or at least I think so ...

Alternatively, we could have the usual "functional" interface, with functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because existing cluster versions would not support this new interface, of course. That's a significant flaw, as it'd make this useful only for upgrades of future versions. So for the pg_upgrade use case, I think we don't have much choice other than a "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not really an issue - that always happens on the same instance, so there's no issue with not having the "export" function and so on. I'd bet there are more convenient ways to do this than using the export view.
I'm sure it could share a lot of the infrastructure, of course.

I suggest we focus on the pg_upgrade use case for now. In particular, I think we really need to find a good way to integrate this into pg_upgrade. I'm not against having custom CLI commands, but it's still a manual thing - I wonder if we could extend pg_dump to dump stats, or build it into pg_upgrade in some way (possibly disabled by default, or something like that).

JSON format
-----------

As for the JSON format, I wonder if we need it at all. Isn't it an unnecessary layer of indirection? Couldn't we simply dump pg_statistic and pg_statistic_ext_data in CSV, or something like that? That said, the amount of new JSONB code seems to be very small, so it's OK I guess.

I'm still a bit unsure about the "right" JSON schema. I find it a bit inconvenient that the JSON objects mimic the pg_statistic schema very closely. In particular, there's one array for stakind values, another array for stavalues, an array for stanumbers, etc. I understand generating this JSON in SQL is fairly straightforward, and for the pg_upgrade use case it's probably OK. But my concern is it's not very convenient for the "manual tweaking" use case, because the "related" fields are scattered in different parts of the JSON.

That's pretty much why I envisioned a format "grouping" the arrays for a particular type of statistics (MCV, histogram) into the same object, for example:

{
  "mcv" : {"values" : [...], "frequencies" : [...]},
  "histogram" : {"bounds" : [...]}
}

But that's probably much harder to generate from plain SQL (at least I think so, I haven't tried).

data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the pg_upgrade use case, where it's mostly guaranteed not to change, but for the "manual tweak" use case it can change. And I don't think we want two different formats - we want one that works for everything.
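FWIW generating a "grouped" document of roughly that shape from plain SQL doesn't look impossible; here's a rough, untested sketch against the pg_stats view (just an illustration, not part of the patch - and note pg_stats exposes the value arrays as anyarray, so I've kept them in their text form here; a real implementation would need per-type handling to emit proper JSON arrays):

```sql
-- Untested sketch: one object per statistics kind, instead of
-- parallel stakind/stavalues/stanumbers arrays.
SELECT jsonb_build_object(
         'mcv', jsonb_build_object(
                  -- anyarray, so kept as its '{...}' text form here
                  'values',      most_common_vals::text,
                  -- real[], so this becomes a proper JSON array
                  'frequencies', to_jsonb(most_common_freqs)),
         'histogram', jsonb_build_object(
                  'bounds', histogram_bounds::text)
       ) AS grouped_stats
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 't' AND attname = 'a';
```

So maybe the "harder to generate" part is mostly about the anyarray columns, not the grouping itself.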
Consider, for example, the staopN and stacollN fields - if we clone the stats from one table to another, and the tables use different collations, will that still work? Similarly, I think we should include the type of each column, because it's absolutely not guaranteed the import function will fail if the type changes. For example, if the type changes from integer to text, the import will work, but the ordering will absolutely not be the same. And so on.

For the extended statistics export, I think we also need to include the attribute names and expressions, because these can differ between the statistics objects. And not only that - the statistics values reference the attributes by position, but if the two tables have the attributes in a different order (when ordered by attnum), that will break stuff.

more strict checks
------------------

I think the code should be a bit more "defensive" when importing stuff, and do at least some sanity checks. For the pg_upgrade use case this should be mostly a non-issue (except for maybe helping to detect bugs earlier), but for the "manual tweak" use case it's much more important.

By this I mean checks like:

* making sure the frequencies in MCV lists are not obviously wrong (outside [0,1], sum exceeding 1.0, etc.)

* cross-checking that stanumbers/stavalues make sense (e.g. that an MCV list has both arrays while a histogram has only stavalues, that the arrays have the same length for an MCV list, etc.)

* checking there are no duplicate stakind values (e.g. two MCV lists)

This is another reason why I was thinking the current JSON format may be a bit inconvenient - it loads the fields separately, making the checks harder. But I guess they could be done after loading everything, as a separate phase. Also, not all the checks need to be regular elog(ERROR); perhaps some could/should be just asserts.

minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be human-readable (rather than 100% internal).

2) It's not very clear what a "non-transactional update" of pg_class fields actually means. Does it mean we update the fields in place, that it can't be rolled back, that it's not subject to MVCC, or what? I suspect users won't know unless the docs say so explicitly.

3) The "statistics.c" code should really document the JSON structure. Or maybe, if we plan to use this for other purposes, it should be documented in the SGML docs? Actually, this means the supported use cases determine whether the expected JSON structure is part of the API. For pg_upgrade we could keep it "internal" and maybe change it as needed, but for "manual tweak" it'd become part of the public API.

4) Why do we need the separate "replaced" flags in import_stakinds? Can it happen that collreplaces/opreplaces differ from kindreplaces?

5) What happens if we import statistics for a table that already has some statistics? Will this discard the existing statistics, or will it merge them somehow? (I think we should always discard the existing stats and keep only the new version.)

6) What happens if we import extended stats with a mismatching definition? For example, what if the "new" statistics object does not have "mcv" enabled, but the imported data do include an MCV list? What if the statistics have the same number of "dimensions" but not the same number of columns and expressions?

7) The func.sgml additions in 0007 seem a bit strange, particularly the first sentence of the paragraph.
8) While experimenting with the patch, I noticed this:

create table t (a int, b int, c text);
create statistics s on a, b, c, (a+b), (a-b) from t;

create table t2 (a text, b text, c text);
create statistics s2 on a, c from t2;

select pg_import_ext_stats(
    (select oid from pg_statistic_ext where stxname = 's2'),
    (select server_version_num from pg_statistic_ext_export
      where ext_stats_name = 's'),
    (select stats from pg_statistic_ext_export
      where ext_stats_name = 's'));

WARNING:  statistics import has 5 mcv dimensions, but the expects 2. Skipping excess dimensions.
ERROR:  statistics import has 5 mcv dimensions, but the expects 2. Skipping excess dimensions.

I guess we should not trigger both a WARNING and an ERROR with the same message.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company