Thread: Re: Extended Statistics set/restore/clear functions.

Re: Extended Statistics set/restore/clear functions.

From: jian he
hi.
I reviewed 0001 only.

in src/backend/statistics/mvdistinct.c

No need for #include "nodes/pg_list.h", since
src/include/statistics/statistics.h already pulls in "nodes/pg_list.h"
through its own includes.

No need for #include "utils/palloc.h",
since #include "postgres.h"
already includes it.


 select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
               ^
DETAIL:  All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.


select '{}'::pg_ndistinct ;
segfault


select '{"1,":"1"}'::pg_ndistinct ;
ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
               ^
DETAIL:  All ndistinct attnum lists must be a comma separated list of attnums.

imho, this errdetail message is not good. It would be better to say
something like "length of list of attnums must be larger than 1".



select  pt1.typnamespace, pt1.typarray, pt1.typcategory, pt1.typname
from    pg_type pt1
where pt1.typname ~*'distinct';

 typnamespace | typarray | typcategory |   typname
--------------+----------+-------------+--------------
           11 |        0 | Z           | pg_ndistinct

typcategory (Z) is marked as "Internal-use types", and there is no
pg_ndistinct array type;
not sure whether this is fine.


In all of the ereport() calls, the extra pair of parentheses around the
errcode()/errmsg()/errdetail() auxiliary calls is unnecessary.
For example
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                 errdetail("Must begin with \"{\"")));
can change to
                errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
                 errdetail("Must begin with \"{\""));

see https://www.postgresql.org/docs/current/error-message-reporting.html
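
For reference, the full call in the parenthesis-free style would look
roughly like this (assuming a plain ereport(ERROR, ...) head here; the
patch may actually use ereturn() with a soft-error context instead):

    /* parenthesis-free ereport style (available since PostgreSQL 12) */
    ereport(ERROR,
            errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
            errmsg("malformed pg_ndistinct: \"%s\"", parse->str),
            errdetail("Must begin with \"{\""));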



Re: Extended Statistics set/restore/clear functions.

From: jian he
hi.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
      pg_ndistinct
------------------------
 {"1, 37": -2147483648}
(1 row)

This is not what we expect, is it?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf and NaN.

For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
Here \t is not the tab character (ASCII 9); it's two characters: a backslash
and the character "t".
So shouldn't this error out?
(The same applies to \n, \r and \b.)


The ending part of pg_ndistinct_in(PG_FUNCTION_ARGS) should be:

    freeJsonLexContext(lex);
    if (result == JSON_SUCCESS)
    {
        ......
    }
    else
    {
       ereturn(parse_state.escontext, (Datum) 0,
                    errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                    errmsg("malformed pg_ndistinct: \"%s\"", str),
                    errdetail("Must be valid JSON."));
       PG_RETURN_NULL();
    }
result will be either JSON_SUCCESS or something else.



all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;

do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?
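
i.e., each of those callbacks would look roughly like this (sketch only;
the return type and body here are for illustration, not the patch's
actual code):

    static JsonParseErrorType
    ndistinct_object_start(void *state)
    {
        ndistinctParseState *parse = (ndistinctParseState *) state;

        /* ... work with the parse state here ... */
        return JSON_SUCCESS;
    }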



ndistinctParseState needs to be added to src/tools/pgindent/typedefs.list;
probably change it to "typedef struct ndistinctParseState".
Also, struct ndistinctParseState should be placed at the top of
src/backend/statistics/mvdistinct.c,
not at line 340?
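
i.e., something like this near the top of the file (the members shown
here are only placeholders; the real ones come from the patch):

    /* placed after the #includes, and listed in typedefs.list for pgindent */
    typedef struct ndistinctParseState
    {
        const char *str;        /* placeholder: original input string */
        Node       *escontext;  /* placeholder: soft-error context */
    } ndistinctParseState;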



Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
On Tue, Jan 28, 2025 at 11:25 AM jian he <jian.universality@gmail.com> wrote:
hi.
I reviewed 0001 only.

in src/backend/statistics/mvdistinct.c

No need for #include "nodes/pg_list.h", since
src/include/statistics/statistics.h already pulls in "nodes/pg_list.h"
through its own includes.

No need for #include "utils/palloc.h",
since #include "postgres.h"
already includes it.


Noted.
 
 select '[{"6, -32768,,": -11}]'::pg_ndistinct;
ERROR:  malformed pg_ndistinct: "[{"6, -32768,,": -11}]"
LINE 1: select '[{"6, -32768,,": -11}]'::pg_ndistinct;
               ^
DETAIL:  All ndistinct count values are scalar doubles.
imho, this errdetail message is not good.

What error message do you think is appropriate in that situation?
 
select '{}'::pg_ndistinct ;
segfault

Mmm, gotta look into that!

 


select '{"1,":"1"}'::pg_ndistinct ;
ERROR:  malformed pg_ndistinct: "{"1,":"1"}"
LINE 1: select '{"1,":"1"}'::pg_ndistinct ;
               ^
DETAIL:  All ndistinct attnum lists must be a comma separated list of attnums.

imho, this errdetail message is not good. It would be better to say
something like "length of list of attnums must be larger than 1".

That sounds better.

 
typcategory (Z) is marked as "Internal-use types", and there is no
pg_ndistinct array type;
not sure whether this is fine.

I think it's probably ok for now. The datatype currently has no utility other than extended statistics, and I'm doubtful that it ever will.

Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
On Wed, Jan 29, 2025 at 2:50 AM jian he <jian.universality@gmail.com> wrote:
hi.

select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
      pg_ndistinct
------------------------
 {"1, 37": -2147483648}
(1 row)

I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.



This is not what we expect, is it?
For the VALUE part of pg_ndistinct, float8 has 3 special values: inf, -inf and NaN.

For the key part of pg_ndistinct, see example.
select '{"1, 16\t":"1"}'::pg_ndistinct;
Here \t is not the tab character (ASCII 9); it's two characters: a backslash
and the character "t".
So shouldn't this error out?
(The same applies to \n, \r and \b.)

I don't have a good answer as to what should happen here. Special cases like this make Tomas' suggestion to change the in/out format more attractive.


 


The ending part of pg_ndistinct_in(PG_FUNCTION_ARGS) should be:

    freeJsonLexContext(lex);
    if (result == JSON_SUCCESS)
    {
        ......
    }
    else
    {
       ereturn(parse_state.escontext, (Datum) 0,
                    errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                    errmsg("malformed pg_ndistinct: \"%s\"", str),
                    errdetail("Must be valid JSON."));
       PG_RETURN_NULL();
    }
result will be either JSON_SUCCESS or something else.



all these functions:
ndistinct_object_start, ndistinct_array_start,
ndistinct_object_field_start, ndistinct_array_element_start
have
ndistinctParseState *parse = state;

do we need to change it to
ndistinctParseState *parse = (ndistinctParseState *)state;
?

The compiler isn't complaining so far, but I see no harm in it.

Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
select '{"1, 0B100101":"NaN"}'::pg_ndistinct;
      pg_ndistinct
------------------------
 {"1, 37": -2147483648}
(1 row)

I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.

I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it with %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.
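
In other words, the output path effectively does something along these
lines (not the exact mvdistinct.c code, just the gist of the cast):

    /* the gist of it: the stored double is printed as an integer */
    appendStringInfo(&buf, "%d", (int) item->ndistinct);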

I've updated and rebased the patches.

The existing pg_ndistinct and pg_dependencies formats were kept as-is. The formats are clumsy, and more processing-friendly formats would be easier to work with, but the need for such processing is minimal, bordering on theoretical, so there is little impact in keeping the historical format.

There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that every positive attnum cited is also found in the stxkeys of the pg_statistic_ext tuple, and that the negative attnums do not exceed (in absolute value) the number of expressions in the statistics object. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted; if it has 2 expressions, then any value of -3 or lower will be rejected, and so on.
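
Roughly speaking, the per-attnum check looks like this (the names and
exact shape here are illustrative, not the actual patch code):

    /* sketch: is this attnum sensible for the given stats object? */
    static bool
    attnum_ok_for_statobj(AttrNumber attnum, int2vector *stxkeys, int nexprs)
    {
        if (attnum > 0)
        {
            /* positive attnums must appear among the object's stxkeys */
            for (int i = 0; i < stxkeys->dim1; i++)
            {
                if (stxkeys->values[i] == attnum)
                    return true;
            }
            return false;
        }

        /* negative attnums reference expressions: -1 .. -nexprs are allowed */
        return attnum < 0 && -attnum <= nexprs;
    }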

All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.

Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker

I think my initial reaction is to just refuse those special values, but I'll look into the parsing code to see what can be done.

I noticed that the output function for pg_ndistinct casts that value to an integer before formatting it with %d, so it's being treated as an integer even if it is not stored as one. After some consultation with Tomas, it made the most sense to just replicate this on the input side as well, and that is addressed in the patches below.

I've updated and rebased the patches.

The existing pg_ndistinct and pg_dependencies formats were kept as-is. The formats are clumsy, and more processing-friendly formats would be easier to work with, but the need for such processing is minimal, bordering on theoretical, so there is little impact in keeping the historical format.

There are now checks to ensure that the pg_ndistinct or pg_dependencies value assigned to an extended statistics object actually makes sense for that object. What this amounts to is checking that every positive attnum cited is also found in the stxkeys of the pg_statistic_ext tuple, and that the negative attnums do not exceed (in absolute value) the number of expressions in the statistics object. In other words, if the stats object has no expressions in it, then no negative numbers will be accepted; if it has 2 expressions, then any value of -3 or lower will be rejected, and so on.

All patches rebased to 71f17823ba010296da9946bd906bb8bcad6325bc.

A rebasing, and a few changes:
* regnamespace and name parameters changed to statistics_schemaname as text and statistics_name as text, so that there's one less thing that can potentially fail in an upgrade
* schema lookup and stat name lookup failures now issue a warning and return false, rather than ERROR (see the sketch after this list)
* elevel replaced with hardcoded WARNING most everywhere, as has been done with relation/attribute stats
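
The lookup-failure handling now follows this rough pattern (sketch only;
variable names are illustrative):

    /* warn and bail out instead of raising an ERROR */
    Oid         nspoid = get_namespace_oid(nspname, true);

    if (!OidIsValid(nspoid))
    {
        ereport(WARNING,
                errmsg("schema \"%s\" does not exist", nspname));
        PG_RETURN_BOOL(false);
    }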
 

Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker

On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Just rebasing.

At pgconf.dev this year, the subject of changing the formats of pg_ndistinct and pg_dependencies came up again.

To recap: presently these datatypes have no working input function, but would need one for statistics import to work on extended statistics. The existing input formats are technically JSON, but the keys themselves are a comma-separated list of attnums, so they require additional parsing. That parsing is already done in the patches in this thread, but overall the format is terrible for any sort of manipulation, like the manipulation that people might want to do to translate the values to a table with a different column order (say, after a restore of a table that had dropped columns), or to do query planner experiments.

Because the old formats don't have a corresponding input function, there is no risk of the output not matching required inputs, but there will be once we add new input functions, so this is our last chance to change the format to something we like better.

The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult.

I believe that we should take this opportunity to make the change. While we don't have a pressing need to manipulate these structures now, we might in the future and failing to do so now makes a later change much harder.

With that in mind, I'd like people to have a look at the proposed format change of pg_ndistinct (the changes to pg_dependencies are similar), to see if they want to make any improvements or comments. As you can see, the new format is much less compact (about 3x as large), which could get bad if the number of elements grew by a lot, but the number of elements is tied to the number of attribute combinations in the extended statistics object (N choose N, then N choose N-1, etc., excluding choose 1), so this can't get too out of hand.

Existing format (newlines/formatting added by me to make head-to-head comparison easier):

'{"2, 3": 4,
  "2, -1": 4,
  "2, -2": 4,
  "3, -1": 4,
  "3, -2": 4,
  "-1, -2": 3,
  "2, 3, -1": 4,
  "2, 3, -2": 4,
  "2, -1, -2": 4,
  "3, -1, -2": 4}'::pg_ndistinct

Proposed new format (again, all formatting here is just for ease of humans reading):

' [ {"attributes" : [2,3], "ndistinct" : 4},
    {"attributes" : [2,-1], "ndistinct" : 4},
    {"attributes" : [2,-2], "ndistinct" : 4},
    {"attributes" : [3,-1], "ndistinct" : 4},
    {"attributes" : [3,-2], "ndistinct" : 4},
    {"attributes" : [-1,-2], "ndistinct" : 3},
    {"attributes" : [2,3,-1], "ndistinct" : 4},
    {"attributes" : [2,3,-2], "ndistinct" : 4},
    {"attributes" : [2,-1,-2], "ndistinct" : 4},
    {"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct


The pg_dependencies structure is only slightly more complex:

An abbreviated example:

{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000},

Becomes:

[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
  {"attributes": [2], "dependency": -1, "degree": 1.000000},
  {"attributes": [2, -2], "dependency":  -1, "degree": 1.000000},
   ...,
   {"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
   {"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]

Any thoughts on using/improving these structures?