Re: Extended Statistics set/restore/clear functions. - Mailing list pgsql-hackers
From | Corey Huinker |
---|---|
Subject | Re: Extended Statistics set/restore/clear functions. |
Date | |
Msg-id | CADkLM=eM0VA-muiu+JMM4-B8eL5ZscPeGtTxh7UtmM+iVO7C=A@mail.gmail.com |
In response to | Re: Extended Statistics set/restore/clear functions. (Corey Huinker <corey.huinker@gmail.com>) |
List | pgsql-hackers |
On Mon, Mar 31, 2025 at 1:10 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Just rebasing.
At pgconf.dev this year, the subject of changing the formats of pg_ndistinct and pg_dependencies came up again.
To recap: presently these datatypes have no working input function, but would need one for statistics import to work on extended statistics. The existing output formats are technically JSON, but the keys themselves are comma-separated lists of attnums, so they require additional parsing. That parsing is already done in the patches in this thread, but overall the format is terrible for any sort of manipulation, such as translating the values to a table with a different column order (say, after a restore of a table that had dropped columns) or doing query planner experiments.
Because the old formats don't have a corresponding input function, there is no risk of the output not matching required inputs, but there will be once we add new input functions, so this is our last chance to change the format to something we like better.
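To illustrate the extra parsing step mentioned above, here is a minimal Python sketch (not the code from the patches). The function name `parse_old_ndistinct` is hypothetical; the sketch only assumes that the existing text format is a JSON object whose keys are comma-separated attnum lists.

```python
import json

def parse_old_ndistinct(text):
    """Parse the existing pg_ndistinct text format into (attnums, ndistinct) pairs."""
    items = []
    for key, ndistinct in json.loads(text).items():
        # Each key is itself a comma-separated attnum list, e.g. "2, -1",
        # so a second parsing pass is needed after the JSON parse.
        attnums = [int(part) for part in key.split(",")]
        items.append((attnums, ndistinct))
    return items

old_value = '{"2, 3": 4, "2, -1": 4, "-1, -2": 3}'
print(parse_old_ndistinct(old_value))
```

A JSON parser alone cannot hand back the attnums here; the key strings have to be split and converted separately.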
The old format can be trivially translated via functions posted earlier in this thread back in January (pg_xlat_ndistinct_to_attnames, pg_xlat_dependencies_to_attnames) as well as the reverse (s/_to_/_from_/), so dumping values from older versions will not be difficult.
I believe that we should take this opportunity to make the change. While we don't have a pressing need to manipulate these structures now, we might in the future, and failing to change the format now makes a later change much harder.
With that in mind, I'd like people to have a look at the proposed format change of pg_ndistinct (the changes to pg_dependencies are similar), to see if they want to make any improvements or comments. As you can see, the new format is much less compact (about 3x as large), which could get bad if the number of elements grew by a lot, but the number of elements is tied to the number of factors (columns and expressions) in the extended statistics object (N choose N, then N choose N-1, etc., excluding N choose 1), so this can't get too out of hand.
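The element count described above can be checked with a short Python sketch. The function name `ndistinct_item_count` is hypothetical, and the upper bound of 8 factors per statistics object is an assumption used only for illustration.

```python
from math import comb

def ndistinct_item_count(n):
    """Number of attribute combinations of size 2..n drawn from n factors."""
    # N choose N, N choose N-1, ..., N choose 2 (sizes 0 and 1 are excluded).
    return sum(comb(n, k) for k in range(2, n + 1))

# Assumption: a statistics object has at most 8 factors.
for n in (2, 3, 8):
    print(n, ndistinct_item_count(n))
```

The sum equals 2^N - N - 1, so it grows exponentially in N but stays in the low hundreds as long as N is capped at a small value.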
Existing format (newlines/formatting added by me to make head-to-head comparison easier):
'{"2, 3": 4,
"2, -1": 4,
"2, -2": 4,
"3, -1": 4,
"3, -2": 4,
"-1, -2": 3,
"2, 3, -1": 4,
"2, 3, -2": 4,
"2, -1, -2": 4,
"3, -1, -2": 4}'::pg_ndistinct
Proposed new format (again, all formatting here is just for ease of humans reading):
' [ {"attributes" : [2,3], "ndistinct" : 4},
{"attributes" : [2,-1], "ndistinct" : 4},
{"attributes" : [2,-2], "ndistinct" : 4},
{"attributes" : [3,-1], "ndistinct" : 4},
{"attributes" : [3,-2], "ndistinct" : 4},
{"attributes" : [-1,-2], "ndistinct" : 3},
{"attributes" : [2,3,-1], "ndistinct" : 4},
{"attributes" : [2,3,-2], "ndistinct" : 4},
{"attributes" : [2,-1,-2], "ndistinct" : 4},
{"attributes" : [3,-1,-2], "ndistinct" : 4}]'::pg_ndistinct
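As a rough illustration of the kind of manipulation the proposed format makes easy, the following Python sketch renumbers attributes in a new-format value, for example after a restore that changed column positions. The function name `remap_ndistinct` and the mapping are hypothetical; only the "attributes" and "ndistinct" keys come from the proposal above.

```python
import json

def remap_ndistinct(text, attnum_map):
    """Rewrite attribute numbers in a proposed-format pg_ndistinct value.

    attnum_map maps old attnums to new ones; attnums that are not in the
    map are left unchanged.
    """
    items = json.loads(text)
    for item in items:
        item["attributes"] = [attnum_map.get(a, a) for a in item["attributes"]]
    return json.dumps(items)

value = ('[{"attributes": [2, 3], "ndistinct": 4},'
         ' {"attributes": [3, -1], "ndistinct": 4}]')
# Hypothetical case: column 1 was dropped, so columns 2 and 3 become 1 and 2.
print(remap_ndistinct(value, {2: 1, 3: 2}))
```

Because the attnums are already JSON arrays of integers, no string splitting is required; an ordinary JSON library is enough.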
The pg_dependencies structure is only slightly more complex:
An abbreviated example:
{"2 => 1": 1.000000, "2 => -1": 1.000000, ..., "2, -2 => -1": 1.000000, "3, -1 => 2": 1.000000},
Becomes:
[ {"attributes": [2], "dependency": 1, "degree": 1.000000},
{"attributes": [2], "dependency": -1, "degree": 1.000000},
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
...,
{"attributes": [2, -2], "dependency": -1, "degree": 1.000000},
{"attributes": [3, -1], "dependency": 2, "degree": 1.000000}]
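The pg_dependencies translation shown above can be sketched in Python as follows. This is not the pg_xlat_* SQL functions mentioned earlier; `dependencies_old_to_new` is a hypothetical name, and the sketch assumes old-format keys always have the shape "attnum, attnum => attnum".

```python
import json

def dependencies_old_to_new(text):
    """Convert an old-format pg_dependencies value to the proposed list of objects."""
    converted = []
    for key, degree in json.loads(text).items():
        # Keys look like "2, -2 => -1": the determining attnums on the left,
        # the dependent attnum on the right.
        lhs, rhs = key.split("=>")
        converted.append({
            "attributes": [int(a) for a in lhs.split(",")],
            "dependency": int(rhs),
            "degree": degree,
        })
    return converted

old_value = '{"2 => 1": 1.000000, "2, -2 => -1": 1.000000}'
for item in dependencies_old_to_new(old_value):
    print(json.dumps(item))
```

Note that Python's json module prints the degree as a plain float, so the six-decimal formatting of the original value is not preserved by this sketch.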
Any thoughts on using/improving these structures?