Re: Add expressions to pg_restore_extended_stats() - Mailing list pgsql-hackers
| From | Corey Huinker |
|---|---|
| Subject | Re: Add expressions to pg_restore_extended_stats() |
| Date | |
| Msg-id | CADkLM=fZci8mBEoKX9Gk-+0jXrVmQBuTErP-0JYgE1iUL+EX2A@mail.gmail.com |
| In response to | Re: Add expressions to pg_restore_extended_stats() (Michael Paquier <michael@paquier.xyz>) |
| List | pgsql-hackers |
On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
> 3. Keeping the 2-D text array in #1, but each "row" is a list of
> kwargs-like pairs like the arguments used in pg_restore_attribute_stats
> (i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...])
>
> 4. JSON. The outer structure would be an array of objects, each object
> would be a key-value.
I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics. The format proposed in the attached patch is hard
to work with, anyway. Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
better suited for readability.
Hmm, maybe it isn't so bad:
SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
text
---------------------------------------------------
{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
(1 row)
SELECT array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
array_to_json
---------------------------------------------------
["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
(1 row)
SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
text
---------------------------------------------------
{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
(1 row)
SELECT array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
array_to_json
---------------------------------------------------
["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
(1 row)
Mind you, this is an ANYARRAY first cast to text. If we cast pg_stats_ext_exprs.most_common_values directly to JSON, it will drill down into the innards of the composite values, because it can see the local datatypes, and that breaks our ability to use the regular input functions. I learned that the hard way when using JSON to serialize attribute stats back when this effort first began.
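To make that drill-down concrete, here is a minimal sketch using a throwaway composite type in place of the real stats values (the pair type and the values are illustrative, not actual catalog contents):

CREATE TYPE pair AS (val int, freq int);

-- Cast to text: each composite stays one opaque literal that the
-- type's input function can reparse later.
SELECT ARRAY[(1,1)::pair, (2,1)::pair]::text AS as_text;
      as_text
-------------------
 {"(1,1)","(2,1)"}
(1 row)

-- Cast to JSON: the composites are decomposed into objects, and the
-- round trip through the regular input functions is lost.
SELECT array_to_json(ARRAY[(1,1)::pair, (2,1)::pair]) AS as_json;
                 as_json
------------------------------------------
 [{"val":1,"freq":1},{"val":2,"freq":1}]
(1 row)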
Before seeing that, I wanted to try option 3 first, as it brings clarity with no real increase in tooling beyond checking for repeated keywords; but if you're hyped for JSON then I'll try that instead.
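For reference, a rough sketch of how one expression's stats might be serialized under each option (the keyword set is illustrative, borrowed from pg_restore_attribute_stats, not a final list):

-- Option 3: one flattened text[] of keyword/value pairs per expression
ARRAY['null_frac', '0.5', 'avg_width', '4', 'n_distinct', '-1']

-- Option 4: JSON, one object per expression in an outer array
[{"null_frac": 0.5, "avg_width": 4, "n_distinct": -1}]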
We could also force some checks based on
the order of the arguments in the input array, so that duplicates are
not an issue, I guess?
If we're doing a kwargs-thing then I may as well just track which keywords were already used. We already bail out on the whole expressions array at the first sign of inconsistency, so it's not like we have to decide which of the duplicates to keep.
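For illustration, here is that repeated-keyword check expressed in SQL terms, assuming the flattened pair layout from option 3 (odd ordinals are keywords); the patch itself would of course just track the seen keywords in C:

SELECT kw
FROM unnest(ARRAY['null_frac','0.5','null_frac','0.6'])
     WITH ORDINALITY AS t(kw, ord)
WHERE ord % 2 = 1
GROUP BY kw
HAVING count(*) > 1;
    kw
-----------
 null_frac
(1 row)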
Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce the duplication that the patch presents a bit, aiming at
fewer callers of statatt_build_stavalues() and statatt_set_slot(),
perhaps.
I don't think we can get around those. It's a limitation of how the sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works: we want to fill in each stakind as we find it, and we don't know in advance how many we've already filled out. An array of records would have been better, but we've got five parallel arrays of scalars and we have to live with it.
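For anyone following along, these are the parallel slots in question; each (stakindN, staopN, stacollN, stanumbersN, stavaluesN) group is one slot, and any statistic kind can land in any free slot:

SELECT staattnum, stakind1, stakind2, stakind3, stakind4, stakind5
FROM pg_statistic
LIMIT 1;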