Thread: Re: Extended Statistics set/restore/clear functions.
hi. I reviewed 0001 only. in src/backend/statistics/mvdistinct.c no need #include "nodes/pg_list.h" since src/include/statistics/statistics.h sub level include "nodes/pg_list.h" no need #include "utils/palloc.h" sicne #include "postgres.h" already included it. select '[{"6, -32768,,": -11}]'::pg_ndistinct; ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]" LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct; ^ DETAIL: All ndistinct count values are scalar doubles. imho, this errdetail message is not good. select '{}'::pg_ndistinct ; segfault select '{"1,":"1"}'::pg_ndistinct ; ERROR: malformed pg_ndistinct: "{"1,":"1"}" LINE 1: select '{"1,":"1"}'::pg_ndistinct ; ^ DETAIL: All ndistinct attnum lists must be a comma separated list of attnums. imho, this errdetail message is not good. would be better saying that "length of list of attnums must be larger than 1". select pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname from pg_type pt1 where pt1.typname ~*'distinct'; typnamespace | typarray | typcategory | typname --------------+----------+-------------+-------------- 11 | 0 | Z | pg_ndistinct typcategory (Z) marked as Internal-use types. and there is no pg_ndistinct array type, not sure this is fine. all the errcode one pair of the parenthesis is unnecessary. for example (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\""))); can change to errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\"")); see https://www.postgresql.org/docs/current/error-message-reporting.html
hi. select '{"1, 0B100101":"NaN"}'::pg_ndistinct; pg_ndistinct ------------------------ {"1, 37": -2147483648} (1 row) this is not what we expected? For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN. For the key part of pg_ndistinct, see example. select '{"1, 16\t":"1"}'::pg_ndistinct; here \t is not tab character, ascii 9. it's two characters: backslash and character "t". so here it should error out? (apply this to \n, \r, \b) pg_ndistinct_in(PG_FUNCTION_ARGS) ending part should be: freeJsonLexContext(lex); if (result == JSON_SUCCESS) { ...... } else { ereturn(parse_state.escontext, (Datum) 0, errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", str), errdetail("Must be valid JSON.")); PG_RETURN_NULL(); } result should be either JSON_SUCCESS or anything else. all these functions: ndistinct_object_start, ndistinct_array_start, ndistinct_object_field_start, ndistinct_array_element_start have ndistinctParseState *parse = state; do we need to change it to ndistinctParseState *parse = (ndistinctParseState *)state; ? ndistinctParseState need to add to src/tools/pgindent/typedefs.list probably change it to "typedef struct ndistinctParseState". also struct ndistinctParseState need placed in the top of src/backend/statistics/mvdistinct.c not in line 340?
On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com> wrote:
hi.
I reviewed 0001 only.
in src/backend/statistics/mvdistinct.c
no need #include "nodes/pg_list.h" since
src/include/statistics/statistics.h sub level include "nodes/pg_list.h"
no need #include "utils/palloc.h"
sicne #include "postgres.h"
already included it.
Noted.
select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.
What error message do you think is appropriate in that situation?
select '{}'::pg_ndistinct ;
segfault
Mmm, gotta look into that!
select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of attnums.
imho, this errdetail message is not good. would be better saying that
"length of list of attnums must be larger than 1".
That sounds better.
typcategory (Z) marked as Internal-use types. and there is no
pg_ndistinct array type,
not sure this is fine.
I think it's probably ok for now. The datatype currently has no utility other than extended statistics, and I'm doubtful that it ever will.
On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:
hi.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)
I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
this is not what we expected?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.
For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not tab character, ascii 9. it's two characters: backslash
and character "t".
so here it should error out?
(apply this to \n, \r, \b)
I don't have a good answer as to what should happen here. Special cases like this make Tomas' suggestion to change the in/out format more attractive.
pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:
freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result should be either JSON_SUCCESS or anything else.
all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;
do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?
The compiler isn't complaining so far, but I see no harm in it.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.
I've updated and rebased the patches.
The existing pg_ndistinct and pg_dependences formats were kept as-is. The formats are clumsy, more processing-friendly formats would be easier, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
Attachment
I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.I've updated and rebased the patches.The existing pg_ndistinct and pg_dependences formats were kept as-is. The formats are clumsy, more processing-friendly formats would be easier, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
A rebasing, and a few changes
* regnamespace and name parameters changed to statistics_schemaname as text and statistics_name as text, so that there's one less thing that can potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and return false, rather than ERROR
* elevel replaced with hardcoded WARNING most everywhere, as has been done with relation/attribute stats
Attachment
Just rebasing.
Attachment
On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Just rebasing.
At pgconf.dev this year, the subject of changing the formats of pg_ndistinct and pg_depdentencies came up again.
To recap: presently these datatypes have no working input function, but would need one for statistics import to work on extended statistics. The existing input formats are technically JSON, but the keys themselves are a comma-separated list of attnums, so they require additional parsing. That parsing is already done in the patches in this thread, but overall the format is terrible for any sort of manipulation, like the manipulation that people might want to do to translate the values to a table with a different column order (say, after a restore of a table that had dropped columns), or to do query planner experiments.
Because the old formats don't have a corresponding input function, there is no risk of the ouptut not matching required inputs, but there will be once we add new input functions, so this is our last chance to change the format to something we like better.
The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult.
The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult.
I believe that we should take this opportunity to make the change. While we don't have a pressing need to manipulate these structures now, we might in the future and failing to do so now makes a later change much harder.
With that in mind, I'd like people to have a look at the proposed format change if pg_ndistinct (the changes to pg_dependencies are similar), to see if they want to make any improvements or comments. As you can see, the new format is much less compact (about 3x as large), which could get bad if the number of elements grew by a lot, but the number of elements is tied to the number of factors in the extended support (N choose N, then N choose N-1, etc, excluding choose 1), so this can't get too out of hand.
Existing format (newlines/formatting added by me to make head-to-head comparison easier):
'{"2, 3": 4,
'{"2, 3": 4,
"2, -1": 4,
"2, -2": 4,
"3, -1": 4,
"3, -2": 4,
"-1, -2": 3,
"2, 3, -1": 4,
"2, 3, -2": 4,
"2, -1, -2": 4,
"3, -1, -2": 4}'::pg_ndistinct
Proposed new format (again, all formatting here is just for ease of humans reading):
' [ {"attributes" : [2,3], "ndistinct" : 4},
{"attributes" : [2,-1], "ndistinct" : 4},
{"attributes" : [2,-2], "ndistinct" : 4},
{"attributes" : [3,-1], "ndistinct" : 4},
{"attributes" : [3,-2], "ndistinct" : 4},
{"attributes" : [-1,-2], "ndistinct" : 3},
{"attributes" : [2,3,-1], "ndistinct" : 4},
{"attributes" : [2,3,-2], "ndistinct" : 4},
{"attributes" : [2,-1,-2], "ndistinct" : 4},
{"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
Proposed new format (again, all formatting here is just for ease of humans reading):
' [ {"attributes" : [2,3], "ndistinct" : 4},
{"attributes" : [2,-1], "ndistinct" : 4},
{"attributes" : [2,-2], "ndistinct" : 4},
{"attributes" : [3,-1], "ndistinct" : 4},
{"attributes" : [3,-2], "ndistinct" : 4},
{"attributes" : [-1,-2], "ndistinct" : 3},
{"attributes" : [2,3,-1], "ndistinct" : 4},
{"attributes" : [2,3,-2], "ndistinct" : 4},
{"attributes" : [2,-1,-2], "ndistinct" : 4},
{"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
The pg_dependencies structure is only slightly more complex:
An abbreviated example:
{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000},
Becomes:
[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
{"attributes": [2], "dependency": -1, "degree": 1.000000},
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
...,
...,
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
Any thoughts on using/improving these structures?