Re: Add expressions to pg_restore_extended_stats() - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Add expressions to pg_restore_extended_stats()
Date
Msg-id CADkLM=fSvftbJ7fcGmokGGi_w2CbyvwDNg-3shtiZOsaqLamGA@mail.gmail.com
In response to Re: Add expressions to pg_restore_extended_stats()  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Add expressions to pg_restore_extended_stats()
List pgsql-hackers

> v6-0001 has less fuzz, thanks for cleaning up the whole thing.  I am
> looking at the patch, and immediately noted two concepts that caught
> my eye and look rather unreliable.
>
> +    if (!sta_ok)
> +        *exprs_is_perfect = false;
> +    isnull = false;
>
> This bit looks incorrect to me?  If !sta_ok (or !row_is_perfect in
> import_pg_statistic()), then why is a non-NULL value inserted into the
> resulting array?  If we fail to parse even one field or miss one key,
> we should force NULL for this single expression in the worst case and
> attempt to move on with the rest.  But it does not matter anyway
> because import_expressions() would fail the import.  Why don't we just
> skip the rest of the expressions then?  We know that the resulting
> array will go to the garbage bin and that the restore failed, issuing
> a WARNING for the statext object.

This is where the metaphor between pg_statistic as an attribute statistic and pg_statistic as an array element in stxdexpr breaks down a bit.

For pg_restore_attribute_stats(), our goal was to create a new stat if none exists, and if one exists, to replace only those elements that are 1) specified in the call and 2) successfully imported. This means that the function can update a pg_statistic row but still return false, because not all of the attributes specified were actually updated, i.e. it wasn't "perfect".

For pg_restore_extended_stats(), we've learned that we *must* have the right number of elements in the pg_statistic array, so the notion of replace-ability is severely, if not fatally, weakened. However, there are errors that we might want an individual pg_statistic to recover from, a good example being keys that somehow get removed in a future version. So while I've modified import_pg_statistic() to return a null datum on any inconsistency, that might not be the case in the future, and import_expressions() should check whether it happens. Similarly, import_expressions() could have multiple pg_statistic rows, and if one of them has an inconsistency, I'd still like the others to make it in. That's implemented by checking a counter of pg_statistic rows that imported OK against the total number of exprs; if they match, then import_expressions() was "perfect".
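
To make the intended accounting concrete, here is a rough sketch, not the patch's actual code; every name other than exprs_is_perfect and row_is_perfect is an invented placeholder:

#include "postgres.h"

/* Hypothetical helper standing in for the per-expression parsing in the
 * v6 patch; it reports via *row_is_perfect whether every key was parsed. */
static Datum import_one_expression(int idx, bool *row_is_perfect);

/*
 * Sketch only: the "perfect" accounting described above for
 * import_expressions().
 */
static bool
import_expressions_sketch(int nexprs, Datum *values, bool *nulls)
{
    int         nimported = 0;

    for (int i = 0; i < nexprs; i++)
    {
        bool        row_is_perfect = false;

        values[i] = import_one_expression(i, &row_is_perfect);

        if (row_is_perfect)
        {
            nulls[i] = false;
            nimported++;
        }
        else
        {
            /* force a NULL element for this expression, keep going */
            values[i] = (Datum) 0;
            nulls[i] = true;
        }
    }

    /* exprs_is_perfect: true only if every expression imported cleanly */
    return (nimported == nexprs);
}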
 

> I think that import_expressions() has its logic going backwards; by
> this I mean that initializing exprs_is_perfect to true could be risky.
> It seems to me that we should do the exact opposite: initialize it to
> false, and switch it to true *if and only if* all the correct
> conditions we want are reached.  I'd suggest a set of gotos and a
> single exit path at the end of import_expressions() where
> exprs_is_perfect is set to true to let the caller know that the
> expression can be safely used.  Remember import_mcv(), as one example.

+1
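
For the archives, the shape I'm picturing is roughly the following; the helper names are placeholders for illustration, not functions from the patch:

#include "postgres.h"

/* Hypothetical steps, only to illustrate the control flow. */
static bool parse_expression_array(void);
static bool import_all_pg_statistic_rows(void);
static void cleanup_expression_state(void);

/*
 * Sketch of the pessimistic-default, single-exit shape: the flag starts
 * false, any failure jumps straight to the exit, and it only flips to
 * true once every condition we care about has been checked.
 */
static void
import_expressions_shape(bool *exprs_is_perfect)
{
    *exprs_is_perfect = false;          /* assume the worst */

    if (!parse_expression_array())
        goto done;

    if (!import_all_pg_statistic_rows())
        goto done;

    *exprs_is_perfect = true;           /* all conditions reached */

done:
    cleanup_expression_state();
}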
 
> Similarly, the same concept should be applied to import_pg_statistic().
> row_is_perfect should be false to start, and switched to true once we
> are sure that everything we want is valid.

+1
 
> Also, I think that the format of the dump should be better when it
> comes to a set of expressions where some of them have invalid data.
> Even if the stats of an expression are invalid, pg_dump builds a JSON
> blob for the element of the expression with all the keys and their
> values set to NULL.  I'd suggest just publishing a NULL for the
> element where invalid stats are found.  That makes the dumps shorter
> as well.

Stats can't be invalid on the way out, only on the way in; I'm assuming you mean null/pointless data. I used jsonb_strip_nulls() to remove keys where the value is null, and nullif() to map empty objects to null, and I think that's much tidier. We could still get a situation where all the exprs are empty (i.e. [null,null,null]). There is no simple test to map that to a plain null, but even if there were, the SQL is already drifting into "clever" territory, and I know that pg_dump likes to keep things very simple.
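
To illustrate what that tidying does, with toy data rather than the actual pg_dump query:

-- Strip keys whose value is null, then map a now-empty object to plain NULL.
SELECT nullif(jsonb_strip_nulls(stats_blob), '{}'::jsonb)
FROM (VALUES
        ('{"stanullfrac": null, "stawidth": null}'::jsonb),  -- becomes NULL
        ('{"stanullfrac": 0.1, "stawidth": 4}'::jsonb)       -- kept as-is
     ) AS t(stats_blob);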

