Thread: Re: Extended Statistics set/restore/clear functions.
hi. I reviewed 0001 only. in src/backend/statistics/mvdistinct.c no need #include "nodes/pg_list.h" since src/include/statistics/statistics.h sub level include "nodes/pg_list.h" no need #include "utils/palloc.h" sicne #include "postgres.h" already included it. select '[{"6, -32768,,": -11}]'::pg_ndistinct; ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]" LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct; ^ DETAIL: All ndistinct count values are scalar doubles. imho, this errdetail message is not good. select '{}'::pg_ndistinct ; segfault select '{"1,":"1"}'::pg_ndistinct ; ERROR: malformed pg_ndistinct: "{"1,":"1"}" LINE 1: select '{"1,":"1"}'::pg_ndistinct ; ^ DETAIL: All ndistinct attnum lists must be a comma separated list of attnums. imho, this errdetail message is not good. would be better saying that "length of list of attnums must be larger than 1". select pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname from pg_type pt1 where pt1.typname ~*'distinct'; typnamespace | typarray | typcategory | typname --------------+----------+-------------+-------------- 11 | 0 | Z | pg_ndistinct typcategory (Z) marked as Internal-use types. and there is no pg_ndistinct array type, not sure this is fine. all the errcode one pair of the parenthesis is unnecessary. for example (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\""))); can change to errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", parse->str), errdetail("Must begin with \"{\"")); see https://www.postgresql.org/docs/current/error-message-reporting.html
hi. select '{"1, 0B100101":"NaN"}'::pg_ndistinct; pg_ndistinct ------------------------ {"1, 37": -2147483648} (1 row) this is not what we expected? For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN. For the key part of pg_ndistinct, see example. select '{"1, 16\t":"1"}'::pg_ndistinct; here \t is not tab character, ascii 9. it's two characters: backslash and character "t". so here it should error out? (apply this to \n, \r, \b) pg_ndistinct_in(PG_FUNCTION_ARGS) ending part should be: freeJsonLexContext(lex); if (result == JSON_SUCCESS) { ...... } else { ereturn(parse_state.escontext, (Datum) 0, errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), errmsg("malformed pg_ndistinct: \"%s\"", str), errdetail("Must be valid JSON.")); PG_RETURN_NULL(); } result should be either JSON_SUCCESS or anything else. all these functions: ndistinct_object_start, ndistinct_array_start, ndistinct_object_field_start, ndistinct_array_element_start have ndistinctParseState *parse = state; do we need to change it to ndistinctParseState *parse = (ndistinctParseState *)state; ? ndistinctParseState need to add to src/tools/pgindent/typedefs.list probably change it to "typedef struct ndistinctParseState". also struct ndistinctParseState need placed in the top of src/backend/statistics/mvdistinct.c not in line 340?
On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com> wrote:
hi.
I reviewed 0001 only.
in src/backend/statistics/mvdistinct.c
no need #include "nodes/pg_list.h" since
src/include/statistics/statistics.h sub level include "nodes/pg_list.h"
no need #include "utils/palloc.h"
sicne #include "postgres.h"
already included it.
Noted.
select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR: malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
^
DETAIL: All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.
What error message do you think is appropriate in that situation?
select '{}'::pg_ndistinct ;
segfault
Mmm, gotta look into that!
select '{"1,":"1"}'::pg_ndistinct ;
ERROR: malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
^
DETAIL: All ndistinct attnum lists must be a comma separated list of attnums.
imho, this errdetail message is not good. would be better saying that
"length of list of attnums must be larger than 1".
That sounds better.
typcategory (Z) marked as Internal-use types. and there is no
pg_ndistinct array type,
not sure this is fine.
I think it's probably ok for now. The datatype currently has no utility other than extended statistics, and I'm doubtful that it ever will.
On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:
hi.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)
I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
this is not what we expected?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf, NaN.
For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
here \t is not tab character, ascii 9. it's two characters: backslash
and character "t".
so here it should error out?
(apply this to \n, \r, \b)
I don't have a good answer as to what should happen here. Special cases like this make Tomas' suggestion to change the in/out format more attractive.
pg_ndistinct_in(PG_FUNCTION_ARGS)
ending part should be:
freeJsonLexContext(lex);
if (result == JSON_SUCCESS)
{
......
}
else
{
ereturn(parse_state.escontext, (Datum) 0,
errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
errmsg("malformed pg_ndistinct: \"%s\"", str),
errdetail("Must be valid JSON."));
PG_RETURN_NULL();
}
result should be either JSON_SUCCESS or anything else.
all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;
do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?
The compiler isn't complaining so far, but I see no harm in it.
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
pg_ndistinct
------------------------
{"1, 37": -2147483648}
(1 row)I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.
I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.
I've updated and rebased the patches.
The existing pg_ndistinct and pg_dependences formats were kept as-is. The formats are clumsy, more processing-friendly formats would be easier, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
Attachment
I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.I've updated and rebased the patches.The existing pg_ndistinct and pg_dependences formats were kept as-is. The formats are clumsy, more processing-friendly formats would be easier, but the need for such processing is minimal bordering on theoretical, so there is little impact in keeping the historical format.
There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that for every attnum cited, the positive attnums are also ones found the in the stxkeys of the pg_statistic_ext tuple, and the negative attnums correspond do not exceed the number of expressions in the attnum. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted, if it has 2 expressions than any value -3 or lower will be rejected, etc.
All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.
A rebasing, and a few changes
* regnamespace and name parameters changed to statistics_schemaname as text and statistics_name as text, so that there's one less thing that can potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and return false, rather than ERROR
* elevel replaced with hardcoded WARNING most everywhere, as has been done with relation/attribute stats