Re: Add expressions to pg_restore_extended_stats() - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Add expressions to pg_restore_extended_stats()
Date
Msg-id CADkLM=du+OcctrsTk+hZryUGy=0OnPep-3LdGzut1nqF391+Eg@mail.gmail.com
In response to Re: Add expressions to pg_restore_extended_stats()  (Corey Huinker <corey.huinker@gmail.com>)
Responses Re: Add expressions to pg_restore_extended_stats()
List pgsql-hackers


On Fri, Jan 30, 2026 at 3:07 AM Corey Huinker <corey.huinker@gmail.com> wrote:
On Fri, Jan 30, 2026 at 12:55 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Jan 30, 2026 at 12:08:49AM -0500, Corey Huinker wrote:
> 3. Keeping the 2-D text array in #1, but each "row" is a list of
> kwargs-like pairs like the arguments used in pg_restore_attribute_stats
> (i.e. ARRAY['null_frac','0.5','avg_width','1.0','most_common_values',...])
>
> 4. JSON. The outer structure would be an array of objects, each object
> would be a key-value.

I'd still favor 4 on the grounds that it's easier to edit and read,
which would be more in line with the MCV, dependencies, ndistinct and
att/rel statistics.  The format proposed in the attached patch is hard
to work with, anyway.  Now, I do take your point that composite
record values cast into a single text value could be confusing
(double-quote issues, I guess?), so perhaps a text[] as in 3 would be
better adapted for readability.

Hmm, maybe it isn't so bad:

SELECT '{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[];
                       text                        
---------------------------------------------------
 {"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}
(1 row)

SELECT array_to_json('{"{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"}'::text[]);
                   array_to_json                  
---------------------------------------------------
 ["{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}"]
(1 row)

Mind you, this is an ANYARRAY first cast to text. If we cast pg_stats_ext_exprs.most_common_vals directly to JSON, it will drill down into the innards of the composite values because it can see the local datatypes, and that breaks our ability to use the regular input functions. I learned that the hard way when using JSON to serialize attribute stats when this effort first began.
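The difference can be sketched with an anonymous ROW() composite standing in for the real expression-stats data (this is an illustration, not the actual catalog values):

```sql
-- text[] keeps each composite value as an opaque string:
SELECT to_jsonb('{"(1,1)","(2,1)"}'::text[]);
--  ["(1,1)", "(2,1)"]

-- but a typed composite array gets decomposed into JSON objects,
-- losing the textual form the input functions expect:
SELECT to_jsonb(ARRAY[ROW(1,1), ROW(2,1)]);
--  [{"f1": 1, "f2": 1}, {"f1": 2, "f2": 1}]
```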

Before seeing that, I wanted to try option 3 first, as it brings clarity with no real increase in tooling other than looking for repeated keywords, but if you're hyped for json then I'll try that first.
 
We could also enforce some checks based on the order of the arguments
in the input array, so that duplicates are not an issue, I guess?

If we're doing a kwargs-thing then I may as well just track which keywords were already used. We already bail out on the whole expressions array at the first sign of inconsistency, so it's not like we have to decide which of the duplicates to keep.

Structurally, I feel that import_expressions() is overcomplicated, and
with the correct structure tracking the state of each field, I would
try to reduce a bit the duplication that the patch presents, aiming at
fewer callers of statatt_build_stavalues() and statatt_set_slot(),
perhaps.

I don't think we can get around those. It's a limitation of how the sta(kindN/opN/collN/numbersN/valuesN) system in pg_statistic works. We want to fill in each stakind as we find it, and we don't know how many of them we've already filled out. An array of records would have been better, but we've got 5 parallel arrays of scalars and we have to live with it.
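For reference, the five parallel slots are visible right in pg_statistic's column list; a quick (illustrative) catalog query:

```sql
-- pg_statistic carries five parallel "slots", each split across
-- kind/op/coll/numbers/values columns:
SELECT attname
FROM pg_attribute
WHERE attrelid = 'pg_catalog.pg_statistic'::regclass
  AND attname ~ '^sta(kind|op|coll|numbers|values)[1-5]$'
ORDER BY attnum;
```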

Ok, so I refactored the exprs input to be a jsonb array, and the good news is that it results in legitimately more readable test cases. See below:

-- ok, with warning: extra exprs param
SELECT pg_catalog.pg_restore_extended_stats(
  'schemaname', 'stats_import',
  'relname', 'test',
  'statistics_schemaname', 'stats_import',
  'statistics_name', 'test_stat_mcelem',
  'inherited', false,
  'exprs', '[
              {
                  "avg_width": 33,
                  "null_frac": 0,
                  "n_distinct": -1,
                  "correlation": 1,
                  "histogram_bounds": "{\"{1,1}\",\"{2,1}\",\"{3,-1}\",\"{NULL,0}\"}",
                  "most_common_vals": null,
                  "most_common_elems": "{-1,0,1,2,3}",
                  "most_common_freqs": null,
                  "elem_count_histogram": "{1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1.5}",
                  "most_common_elem_freqs": "{0.25,0.25,0.5,0.25,0.25,0.25,0.5,0.25}",
                  "bad_param": "text no one will ever parse"
              }
            ]'::jsonb);
WARNING:  malformed expr expression: "bad_param": is not an expression key name, value ignored 

The downside is that it results in 200 more lines of code in extended_stats_funcs.c and 500 more lines of test output in stats_import.sql.

There are a few reasons for this:

- It's up to us to validate the [{expr},{expr},...] structure ourselves, something a 2-D text array did rather succinctly.
- JSON null is a valid jsonb value, distinct from an absent key, and we have to check for it. This quirk alone results in 4 more test cases, though some other test cases did go away.
- jsonb strings are not NUL-terminated C strings, so we have to convert them first to C strings and then to text datums to leverage the existing attribute stats functions
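The null quirk in particular: a present-but-null key and an absent key look the same at first glance but behave differently, e.g.:

```sql
-- a present-but-null key yields a jsonb null...
SELECT jsonb_typeof('{"null_frac": null}'::jsonb -> 'null_frac');
--  'null' (the jsonb type name, not SQL NULL)

-- ...while an absent key yields SQL NULL:
SELECT '{"null_frac": null}'::jsonb -> 'no_such_key' IS NULL;
--  t
```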

I briefly thought about making the exprs input parameter jsonb[] thus saving some structure checks, but what little that buys us is then taken away by making the pg_dump command that much more complex.

There are probably some simplifications that can happen:

- casting all stat values to text to avoid having to deal with the neither-fish-nor-fowl json numeric values.
- utility function for converting jbvString values to TextDatum.
- utility function to simplify json iteration in a simple key/value object
- deciding that unknown parameters aren't worth a warning
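For the first of those, the ->> operator already does most of the work at the SQL level, since it hands back text regardless of the underlying JSON type (the C-level code would do the equivalent through the jsonb iterator API):

```sql
-- ->> returns text whether the stored value was a JSON number or string:
SELECT '{"avg_width": 33}'::jsonb ->> 'avg_width';      -- '33'
SELECT '{"n_distinct": "-1"}'::jsonb ->> 'n_distinct';  -- '-1'
```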

So...does this look better? Is it worth it?
Attachment
