>> For this warning in the regression tests, you should only need one >> element, reducing the number of input lines? > > That's true IF we decide that missing expression keys are a thing that we > allow, and per conversation above it's not clear yet that we do.
If ANALYZE can generate some partial data for a set of expression (aka generating some data for a portion of the expressions defined but skip some of them because of reason x_y_z), the restore function needs to cope with this policy (I'd need to double-check the code to be sure, not sure on top of my mind now and I cannot do that today, unfortunately).
In this latest patch, I decided that we can tolerate missing exprs parameters, and as you'll see from the 0002 patch, it does result in considerable decluttering.
Hmm. I would not rely on the ordering of the items as they are scanned, that seems like a recipe for disaster. We have a text representation of the expression, as of pg_stats_ext_exprs.expr. This could be mapped with the elements of hte text[] in pg_stats_ext.exprs. Here is a wilder idea: why not just putting the expression text itself in the data given in input of the restore function rather than a guessed argument number? For the case of manual stats injections, that kinds of makes things simpler.
The patch I submitted here hasn't yet implemented using the expression text as a key, and honestly I'm not too hyped on the prospect of trying.
I think that:
1. I don't think we can guarantee that the expression node text is stable across major versions, and that would break upgrades, the primary function of this code.
2. Anyone wanting to modify/hack the exprs values has almost certainly extracted it using the jsonb_build_object() code in pg_dump, so they already have all expressions before editing.
3. Array unnest() has proven to give a stable order in all tests so far.
4. We don't decompose mcv into it's parts, so why do that for exprs?
Yeah, it sounds to me that we should just set ok=false and give up rather than have a semi-filled set of numbers for a single extended stats object. There is an argument in favor of that: it can simplify the detection of missing stats for a single extended stats definition. I understand that you'd want to keep going with loading the data even if it's partial. My question is: is it possible for ANALYZE to fill in only a portion of the expressions and can these be partially skipped? If the answer to my question is yes, the restore function should do the same and my idea of the matter is wrong. If the answer to my question is no, then your idea on this matter is the right one.
In which case I think you'll like the latest patchset.
> 4. Does casting the numeric scalar values (null_frac, correlation, > avg_width, n_distinct) to text make sense, since we have to put them > through type-specific input functions anyway?
In terms of JSON, it makes the use of a representation simpler. I don't think that we need to apply a strict casting when fetching them.
Everything is jbvStrings now.
@Tomas (now added in CC for confirmation): would you see a problem against applying a JSONB data type to the argument for the restore of extended stats? This level of data serialization would be required when inserting data for what would show up into pg_stats_ext_exprs.