Thread: Reducing output size of nodeToString

Reducing output size of nodeToString

From

Matthias van de Meent

Date:

06 December 2023, 21:08:38

Hi,

PFA a patch that reduces the output size of nodeToString by 50%+ in
most cases (measured on pg_rewrite), which on my system reduces the
total size of pg_rewrite by 33% to 472KiB. This does keep the textual
pg_node_tree format alive, but reduces its size signficantly.

The basic techniques used are
 - Don't emit scalar fields when they contain a default value, and
make the reading code aware of this.
 - Reasonable defaults are set for most datatypes, and overrides can
be added with new pg_node_attr() attributes. No introspection into
non-null Node/Array/etc. is being done though.
 - Reset more fields to their default values before storing the values.
 - Don't write trailing 0s in outDatum calls for by-ref types. This
saves many bytes for Name fields, but also some other pre-existing
entry points.

Future work will probably have to be on a significantly different
storage format, as the textual format is about to hit its entropy
limits.

See also [0], [1] and [2], where complaints about the verbosity of
nodeToString were vocalized.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
[1]
https://www.postgresql.org/message-id/flat/CACxu%3DvL_SD%3DWJiFSJyyBuZAp_2v_XBqb1x9JBiqz52a_g9z3jA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/4b27fc50-8cd6-46f5-ab20-88dbaadca645%40eisentraut.org

Attachment

v0-0001-Reduce-the-size-of-serialized-nodes-in-nodeToStri.patch

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

07 December 2023, 10:26:10

On 06.12.23 22:08, Matthias van de Meent wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size signficantly.
> 
> The basic techniques used are
>   - Don't emit scalar fields when they contain a default value, and
> make the reading code aware of this.
>   - Reasonable defaults are set for most datatypes, and overrides can
> be added with new pg_node_attr() attributes. No introspection into
> non-null Node/Array/etc. is being done though.
>   - Reset more fields to their default values before storing the values.
>   - Don't write trailing 0s in outDatum calls for by-ref types. This
> saves many bytes for Name fields, but also some other pre-existing
> entry points.
> 
> Future work will probably have to be on a significantly different
> storage format, as the textual format is about to hit its entropy
> limits.

One thing that was mentioned repeatedly is that we might want different 
formats for human consumption and for machine storage.

For human consumption, I would like some format like what you propose, 
because it generally omits the "unset" or "uninteresting" fields.

But since you also talk about the size of pg_rewrite, I wonder whether 
it would be smaller if we just didn't write the field names at all but 
instead all the field values.  (This should be pretty easy to test, 
since the read functions currently ignore the field names anyway; you 
could just write out all field names as "x" and see what happens.)

I don't much like the way your patch uses the term "default".  Most of 
these default values are not defaults at all, but perhaps "most common 
values".  In theory, I would expect a default value to be initialized by 
makeNode().  (That could be an interesting feature, but let's stay 
focused here.)  But even then most of these "defaults" wouldn't be 
appropriate for a real default value.  This part seems quite 
controversial to me, and I would like to see some more details about how 
much this specifically really saves.

I don't quite understand why in your patch you have some fields as 
optional and some not.  Or is that what WRITE_NODE_FIELD() vs. 
WRITE_NODE_FIELD_OPT() means?  How is it decided which one to use?

The part that clears out the location fields in pg_rewrite entries might 
be worth considering as a separate patch.  Could you explain it more? 
Does it affect location pointers when using views at all?

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

07 December 2023, 11:18:28

On Thu, 7 Dec 2023 at 11:26, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size signficantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
> >
> > Future work will probably have to be on a significantly different
> > storage format, as the textual format is about to hit its entropy
> > limits.
>
> One thing that was mentioned repeatedly is that we might want different
> formats for human consumption and for machine storage.
> For human consumption, I would like some format like what you propose,
> because it generally omits the "unset" or "uninteresting" fields.
>
> But since you also talk about the size of pg_rewrite, I wonder whether
> it would be smaller if we just didn't write the field names at all but
> instead all the field values.  (This should be pretty easy to test,
> since the read functions currently ignore the field names anyway; you
> could just write out all field names as "x" and see what happens.)

I've been thinking about using a more binary storage format similar to
protobuf (but with system knowledge baked in, instead of PB's
defaults), but that would be dependent on functions that change the
output functions of pg_node_tree too, which Michel mentioned he would
work on a year ago (iiuc).

I think it would be a logical next step after this, but this patch is
just on building infrastructure that reduces the stored size without
getting in the way of Michel's work, if there was any result.

> I don't much like the way your patch uses the term "default".  Most of
> these default values are not defaults at all, but perhaps "most common
> values".

Yes, some 'defaults' are curated, but they have sound logic behind
them: *typmod is essentially always copied from an attypmod, which
defaults to -1. *isnull for any constant is generally unset. Many of
those other fields (once initialized by the relevant code) default to
those values I used.

> In theory, I would expect a default value to be initialized by
> makeNode().  (That could be an interesting feature, but let's stay
> focused here.)  But even then most of these "defaults" wouldn't be
> appropriate for a real default value.  This part seems quite
> controversial to me, and I would like to see some more details about how
> much this specifically really saves.

The tuning of these "defaults" got the savings from 20-30% to this
50%+ reduction in raw size.

> I don't quite understand why in your patch you have some fields as
> optional and some not.  Or is that what WRITE_NODE_FIELD() vs.
> WRITE_NODE_FIELD_OPT() means?  How is it decided which one to use?

I use _OPT when I know the value is likely to be its defualt value,
and don't change over to _OPT when I know with great certainty the
value is going to be dynamic, such as relation ID in RTEs, but this is
only relevant for manual code as generated code essentially always
uses the _OPT paths.

> The part that clears out the location fields in pg_rewrite entries might
> be worth considering as a separate patch.  Could you explain it more?
> Does it affect location pointers when using views at all?

Views don't store the original query string, so the location pointers
in views point to locations in a now non-existent query string.
Additionally, unless WRITE_READ_PARSE_PLAN_TREES is defined,
READ_LOCATION_FIELD does not actually read the stored value but
instead stores -1 in the indicated field, so in most cases there won't
be any difference between the deserialized data before and after this
part of the patch; the only difference is the amount of debugable
information stored in the view's internal data.
Note that resetting them to 'invalid' value thus makes sense, and
improves compressibility and allows removal from the serialized format
when serialization omits fields with default values.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Reducing output size of nodeToString

From

David Rowley

Date:

07 December 2023, 12:09:07

On Thu, 7 Dec 2023 at 10:09, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size significantly.

It would be very cool to have the technology proposed by Andres back
in 2019 [1]. With that, we could easily write various output
functions.  One could be compact and easily machine-readable and
another designed to be better for humans for debugging purposes.

We could also easily serialize plans to binary format for copying to
parallel workers rather than converting them to a text-based
serialized format. It would also allow us to do things like serialize
PREPAREd plans into a nicely compact single allocation that we could
just pfree in a single pfree call on DEALLOCATE.

Likely we could just use the existing Perl scripts to form the
metadata arrays rather than the clang parsing stuff Andres used in his
patch.

Anyway, just wanted to ensure you knew about this idea.

David

[1] https://postgr.es/m/flat/20190828234136.fk2ndqtld3onfrrp%40alap3.anarazel.de

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

14 December 2023, 06:21:30

On Thu, 7 Dec 2023 at 13:09, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 7 Dec 2023 at 10:09, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
>
> It would be very cool to have the technology proposed by Andres back
> in 2019 [1]. With that, we could easily write various output
> functions.  One could be compact and easily machine-readable and
> another designed to be better for humans for debugging purposes.
>
> We could also easily serialize plans to binary format for copying to
> parallel workers rather than converting them to a text-based
> serialized format. It would also allow us to do things like serialize
> PREPAREd plans into a nicely compact single allocation that we could
> just pfree in a single pfree call on DEALLOCATE.

I'm not sure what benefit you're refering to. If you mean "it's more
compact than the current format" then sure; but the other points can
already be covered by either the current nodeToString format, or by
nodeCopy-ing the prepared plan into its own MemoryContext, which would
allow us to do essentially the same thing.

> Likely we could just use the existing Perl scripts to form the
> metadata arrays rather than the clang parsing stuff Andres used in his
> patch.
>
> Anyway, just wanted to ensure you knew about this idea.

I knew about that thread thread, but didn't notice the metadata arrays
part of it, which indeed looks interesting for this patch. Thanks for
pointing it out. I'll see if I can incorporate parts of that into this
patchset.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

02 January 2024, 10:30:13

On 06.12.23 22:08, Matthias van de Meent wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size signficantly.
> 
> The basic techniques used are
>   - Don't emit scalar fields when they contain a default value, and
> make the reading code aware of this.
>   - Reasonable defaults are set for most datatypes, and overrides can
> be added with new pg_node_attr() attributes. No introspection into
> non-null Node/Array/etc. is being done though.
>   - Reset more fields to their default values before storing the values.
>   - Don't write trailing 0s in outDatum calls for by-ref types. This
> saves many bytes for Name fields, but also some other pre-existing
> entry points.

Based on our discussions, my understanding is that you wanted to produce 
an updated patch set that is split up a bit.

My suggestion is to make incremental patches along these lines:

- Omit from output all fields that have value zero.

- Omit location fields that have value -1.

- Omit trailing zeroes for scalar values.

- Recent location fields before storing in pg_rewrite (or possibly 
catalogs in general?)

- And then whatever is left, including the "default" value system that 
you have proposed.

The last one I have some doubts about, as previously expressed, but the 
first few seem sensible to me.  By splitting it up we can consider these 
incrementally.

Re: Reducing output size of nodeToString

From

David Rowley

Date:

03 January 2024, 02:02:02

On Thu, 14 Dec 2023 at 19:21, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Thu, 7 Dec 2023 at 13:09, David Rowley <dgrowleyml@gmail.com> wrote:
> > We could also easily serialize plans to binary format for copying to
> > parallel workers rather than converting them to a text-based
> > serialized format. It would also allow us to do things like serialize
> > PREPAREd plans into a nicely compact single allocation that we could
> > just pfree in a single pfree call on DEALLOCATE.
>
> I'm not sure what benefit you're refering to. If you mean "it's more
> compact than the current format" then sure; but the other points can
> already be covered by either the current nodeToString format, or by
> nodeCopy-ing the prepared plan into its own MemoryContext, which would
> allow us to do essentially the same thing.

There's significantly less memory involved in just having a plan
serialised into a single chunk of memory vs a plan stored in its own
MemoryContext.  With the serialised plan, you don't have any power of
2 rounding up wastage that aset.c does and don't need extra space for
all the MemoryChunks that would exist for every single palloc'd chunk
in the MemoryContext version.

I think it would be nice if one day in the future if a PREPAREd plan
could have multiple different plans cached. We could then select which
one to use by looking at statistics for the given parameters and
choose the plan that's most suitable for the given parameters.   Of
course, this is a whole entirely different project. I mention it just
because being able to serialise a plan would make the memory
management and overhead for such a feature much more manageable.
There'd likely need to be some eviction logic in such a feature as the
number of possible plans for some complex query is quite likely to be
much more than we'd care to cache.

David

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

03 January 2024, 23:02:13

On Wed, 3 Jan 2024 at 03:02, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 14 Dec 2023 at 19:21, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > On Thu, 7 Dec 2023 at 13:09, David Rowley <dgrowleyml@gmail.com> wrote:
> > > We could also easily serialize plans to binary format for copying to
> > > parallel workers rather than converting them to a text-based
> > > serialized format. It would also allow us to do things like serialize
> > > PREPAREd plans into a nicely compact single allocation that we could
> > > just pfree in a single pfree call on DEALLOCATE.
> >
> > I'm not sure what benefit you're refering to. If you mean "it's more
> > compact than the current format" then sure; but the other points can
> > already be covered by either the current nodeToString format, or by
> > nodeCopy-ing the prepared plan into its own MemoryContext, which would
> > allow us to do essentially the same thing.
>
> There's significantly less memory involved in just having a plan
> serialised into a single chunk of memory vs a plan stored in its own
> MemoryContext.  With the serialised plan, you don't have any power of
> 2 rounding up wastage that aset.c does and don't need extra space for
> all the MemoryChunks that would exist for every single palloc'd chunk
> in the MemoryContext version.

I was envisioning this to use the Bump memory context you proposed
over in [0], as to the best of my knowledge prepared plans are not
modified, so nodeCopy-ing a prepared plan into bump context could be a
good use case for those contexts. This should remove the issue of
rounding and memorychunk wastage in aset.

> I think it would be nice if one day in the future if a PREPAREd plan
> could have multiple different plans cached. We could then select which
> one to use by looking at statistics for the given parameters and
> choose the plan that's most suitable for the given parameters.   Of
> course, this is a whole entirely different project. I mention it just
> because being able to serialise a plan would make the memory
> management and overhead for such a feature much more manageable.
> There'd likely need to be some eviction logic in such a feature as the
> number of possible plans for some complex query is quite likely to be
> much more than we'd care to cache.

Yeah, that'd be nice, but is also definitely future work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]:
https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM%3DBp%3D6xjL_nLys4zdZOpfNyWBk97Xrdj2w%40mail.gmail.com

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

03 January 2024, 23:23:50

On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size signficantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.

In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth the  consideration without taking
into account the as-of-yet unimplemented binary format.

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It splits out into the following
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me.  By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
     , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
     , sum(pg_column_size(ev_action)) as "compressed"
     , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
---------|-----------+------------+------------+---------
 master  |   7545359 |     761856 |     573307 | 2998712
 0001    |   7365135 |     622592 |     438224 | 1943772
 0002    |   7258639 |     573440 |     401660 | 1835803
 0003    |   7258639 |     565248 |     386211 | 1672539
 0004    |   7176719 |     483328 |     317099 | 1316552
 0005    |   7176719 |     483328 |     315556 | 1300420
 0006    |   7160335 |     466944 |     302806 | 1208621
 0007    |   7143951 |     450560 |     287659 | 1187237

While looking through the data, I noticed the larger views now consist
for a significant portion out of range table entries, specifically the
Alias and Var nodes (which are mostly repeated and/or repetative
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

09 January 2024, 08:23:20

On 04.01.24 00:23, Matthias van de Meent wrote:
> Something like the attached? It splits out into the following
> 0001: basic 'omit default values'

  /* Write an integer field (anything written as ":fldname %d") */
-#define WRITE_INT_FIELD(fldname) \
+#define WRITE_INT_FIELD_DIRECT(fldname) \
         appendStringInfo(str, " :" CppAsString(fldname) " %d", 
node->fldname)
+#define WRITE_INT_FIELD_DEFAULT(fldname, default) \
+       ((node->fldname == default) ? (0) : WRITE_INT_FIELD_DIRECT(fldname))
+#define WRITE_INT_FIELD(fldname) \
+       WRITE_INT_FIELD_DEFAULT(fldname, 0)

Do we need the _DIRECT macros at all?  Could we not combine that into 
the _DEFAULT ones?

I think the way the conditional operator (?:) is written is not 
technically correct C, because one side has an integer result (0) and 
the other a void result (from appendStringInfo()).  Also, this could 
break accidentally even more if the result type of appendStringInfo() 
was changed for some reason.  I think it would be better to write this 
in a more straightforward way like

#define WRITE_INT_FIELD_DEFAULT(fldname, default) \
do { \
     if (node->fldname == default) \
         appendStringInfo(str, " :" CppAsString(fldname) " %d", 
node->fldname); \
while (0)

Relatedly, this

+/* a scaffold function to read an optionally-omitted field */
+#define READ_OPT_SCAFFOLD(fldname, read_field_code, default_value) \
+       if (pg_strtoken_next(":" CppAsString(fldname))) \
+       { \
+               read_field_code; \
+       } \
+       else \
+               local_node->fldname = default_value

would need to be written with a do { } while (0) wrapper around it.

> 0002: reset location and other querystring-related node fields for all
> catalogs of type pg_node_tree.

This goal makes sense, but I think it can be done in a better way.  If 
you look into the area of stringToNode(), stringToNodeWithLocations(), 
and stringToNodeInternal(), there already is support for selectively 
resetting or omitting location fields.  Notably, this works with the 
existing automated knowledge of where the location fields are and 
doesn't require a new hand-maintained table.  I think the way forward 
here would be to apply a similar toggle to nodeToString() (the reverse).

> 0003: add default marking on typmod fields.
> 0004 & 0006: various node fields marked with default() based on
> observed common or initial values of those fields

I think we could get about half the benefit here more automatically, by 
creating a separate type for typmods, like

typedef int32 TypMod;

and then having the node support automatically generate the 
serialization support with a -1 default.

(A similar thing could be applied to the location fields, which would 
allow us to get rid of the current hack of parsing out the name.)

Most of the other defaults I'm doubtful about.  First, we are colliding 
here between the goals of minimizing the storage size and making the 
debug output more readable.  If a Query dump would now omit the 
commandType field if it is CMD_SELECT, I think that would be widely 
confusing, and one would need to check the source code to identify the 
reason.  Also, what if we later decide to change a "default" for a 
field.  Then output between version would differ.  Of course, node 
output does change between versions in general, but these kinds of 
differences would be confusing.  Second, this relies on hand-maintained 
annotations that were created by you presumably through a combination of 
intuition and testing, based on what is in the template database.  Do we 
know whether this matches real-world queries created by users later? 
Also, my experience dealing with the node support over the last little 
while is that these manually maintained exceptions get ossified and 
outdated and create a maintenance headache for the future.

> 0005: truncate trailing 0s from outDatum

Does this significantly affect anything other than the "name" type? 
User views don't usually use the "name" type, so this would have limited 
impact outside of system views.

> 0007 (new): do run-length + gap coding for bitmapset and the various
> integer list types. This saves a surprising amount of bytes.

Can you show examples of this?  How would this affects the ability to 
manually interpret the output?

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

30 January 2024, 11:26:05

On Tue, 9 Jan 2024, 09:23 Peter Eisentraut, <peter@eisentraut.org> wrote:
>
> On 04.01.24 00:23, Matthias van de Meent wrote:
> > Something like the attached? It splits out into the following
> > 0001: basic 'omit default values'
>
>   /* Write an integer field (anything written as ":fldname %d") */
> -#define WRITE_INT_FIELD(fldname) \
> +#define WRITE_INT_FIELD_DIRECT(fldname) \
>          appendStringInfo(str, " :" CppAsString(fldname) " %d",
> node->fldname)
> +#define WRITE_INT_FIELD_DEFAULT(fldname, default) \
> +       ((node->fldname == default) ? (0) : WRITE_INT_FIELD_DIRECT(fldname))
> +#define WRITE_INT_FIELD(fldname) \
> +       WRITE_INT_FIELD_DEFAULT(fldname, 0)
>
> Do we need the _DIRECT macros at all?  Could we not combine that into
> the _DEFAULT ones?

I was planning on using them to reduce the size of generated code for
select fields that we know we will always serialize, but then later
decided against doing that in this patch as it'd add even more
arbitrary annotations to nodes. This is a leftover from that.

> I think the way the conditional operator (?:) is written is not
> technically correct C,
> [...]
> I think it would be better to write this
> in a more straightforward way like
>
> #define WRITE_INT_FIELD_DEFAULT(fldname, default) \
> do { \
> [...]
> while (0)
>
> Relatedly, this
>
> +/* a scaffold function to read an optionally-omitted field */
> [...]
> would need to be written with a do { } while (0) wrapper around it.

I'll fix that.

> > 0002: reset location and other querystring-related node fields for all
> > catalogs of type pg_node_tree.
>
> This goal makes sense, but I think it can be done in a better way.  If
> you look into the area of stringToNode(), stringToNodeWithLocations(),
> and stringToNodeInternal(), there already is support for selectively
> resetting or omitting location fields.  Notably, this works with the
> existing automated knowledge of where the location fields are and
> doesn't require a new hand-maintained table.  I think the way forward
> here would be to apply a similar toggle to nodeToString() (the reverse).

I'll try to work something out for that.

> > 0003: add default marking on typmod fields.
> > 0004 & 0006: various node fields marked with default() based on
> > observed common or initial values of those fields
>
> I think we could get about half the benefit here more automatically, by
> creating a separate type for typmods, like
>
> typedef int32 TypMod;
>
> and then having the node support automatically generate the
> serialization support with a -1 default.

Hm, I suspect that the code churn for that would be significant. I'd
also be confused when the type in storage (pg_attribute, pg_type's
typtypmod) is still int32 when it would be TypMod only in nodes.

> (A similar thing could be applied to the location fields, which would
> allow us to get rid of the current hack of parsing out the name.)

I suppose so.

> Most of the other defaults I'm doubtful about.  First, we are colliding
> here between the goals of minimizing the storage size and making the
> debug output more readable.

I've never really wanted to make the output "more readable". The
current one is too verbose, yes.

> If a Query dump would now omit the
> commandType field if it is CMD_SELECT, I think that would be widely
> confusing, and one would need to check the source code to identify the
> reason.

AFAIK, SELECT is the only command type you can possibly store in a
view (insert/delete/update/utility are all invalid there, and while
I'm not fully certain about MERGE, I'd say it's certainly a niche).

> Also, what if we later decide to change a "default" for a
> field.  Then output between version would differ.  Of course, node
> output does change between versions in general, but these kinds of
> differences would be confusing.

I've not heard of anyone trying to read and compare the contents of
pg_node_tree manually where they're not trying to debug some
deep-nested issue. Note

> Second, this relies on hand-maintained
> annotations that were created by you presumably through a combination of
> intuition and testing, based on what is in the template database.  Do we
> know whether this matches real-world queries created by users later?

No, or at least I don't know this for certain. But I think it's a good start.

> Also, my experience dealing with the node support over the last little
> while is that these manually maintained exceptions get ossified and
> outdated and create a maintenance headache for the future.

I'm not sure what headache this would become. nodeToString is a fairly
straightforward API with (AFAIK) no external dependencies, where only
nodes go in and out. The metadata on top of that will indeed require
some maintenance, but AFAIK only in the areas that read and utilize
said metadata. While it certainly wouldn't be great if we didn't have
this metadata, it'd be no worse than not having compression.

> > 0005: truncate trailing 0s from outDatum
>
> Does this significantly affect anything other than the "name" type?
> User views don't usually use the "name" type, so this would have limited
> impact outside of system views.

It saves a few bytes each on byval types like bool, oid, and int on
little-endian systems, as they don't utilize the latter bytes of the
4- or 8-byte Datum. At least in the default catalog this shaves some
bytes off.

> > 0007 (new): do run-length + gap coding for bitmapset and the various
> > integer list types. This saves a surprising amount of bytes.
>
> Can you show examples of this?  How would this affects the ability to
> manually interpret the output?

The ability to interpret the results manually is somewhat reduced for
complex cases (bitmaps), but things like RangeTableEntries are
significantly reduced in size because of this. A good amount of
IntegerLists  is reduced to (i 1 +10) instead of (i 1 2 3 4 5 ... 11).
Specifically notable are the joinleftcols/joinrightcols fields, as
they will often contain large lists of joined columns when many tables
are joined together. While bitmaps are less prevalent/large, they also
benefit from this optimization.
As for bitmapsets, the use of differential coding saves bytes when the
set is large or otherwise has structure: the bitmapset of uneven
numbers (b 1 3 5 7 ... 23 25 27 ... 101 103 ...) takes up more space
(and is less compressible than) the equivalent differential coded (b 1
2 2 2 2 ...). This is at the cost of direct readability, but I think
that's worth it.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

31 January 2024, 08:16:27

On 30.01.24 12:26, Matthias van de Meent wrote:
>> Most of the other defaults I'm doubtful about.  First, we are colliding
>> here between the goals of minimizing the storage size and making the
>> debug output more readable.
> I've never really wanted to make the output "more readable". The
> current one is too verbose, yes.

My motivations at the moment to work in this area are (1) to make the 
output more readable, and (2) to reduce maintenance burden of node 
support functions.

There can clearly be some overlap with your goals.  For example, a less 
verbose and less redundant output can ease readability.  But it can also 
go the opposite direction; a very minimalized output can be less readable.

I would like to understand your target more.  You have shown some 
figures how these various changes reduce storage size in pg_rewrite. 
But it's a few hundred kilobytes, if I read this correctly, maybe some 
megabytes if you add a lot of user views.  Does this translate into any 
other tangible benefits, like you can store more views, or processing 
views is faster, or something like that?

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

31 January 2024, 16:17:03

On Wed, 31 Jan 2024, 09:16 Peter Eisentraut, <peter@eisentraut.org> wrote:

On 30.01.24 12:26, Matthias van de Meent wrote:
>> Most of the other defaults I'm doubtful about. First, we are colliding
>> here between the goals of minimizing the storage size and making the
>> debug output more readable.
> I've never really wanted to make the output "more readable". The
> current one is too verbose, yes.

My motivations at the moment to work in this area are (1) to make the
output more readable, and (2) to reduce maintenance burden of node
support functions.

There can clearly be some overlap with your goals. For example, a less
verbose and less redundant output can ease readability. But it can also
go the opposite direction; a very minimalized output can be less readable.

I would like to understand your target more. You have shown some
figures how these various changes reduce storage size in pg_rewrite.
But it's a few hundred kilobytes, if I read this correctly, maybe some
megabytes if you add a lot of user views. Does this translate into any
other tangible benefits, like you can store more views, or processing
views is faster, or something like that?

I was also thinking about smaller per-attribute expression storage, for index attribute expressions, table default expressions, and functions. Other than that, less memory overhead for the serialized form of these constructs also helps for catalog cache sizes, etc.
People complained about the size of a fresh initdb, and I agreed with them, so I started looking at low-hanging fruits, and this is one.

I've not done any tests yet on whether it's more performant in general. I'd expect the new code to do a bit better given the extremely verbose nature of the data and the rather complex byte-at-a-time token read method used, but this is currently hypothesis.

I do think that serialization itself may be slightly slower, but given that this generally happens only in DDL, and that we have to grow the output buffer less often, this too may still be a net win (but, again, this is an untested hypothesis).

Kind regards,

Matthias van de Meent

Neon (https://neon.tech)

Re: Reducing output size of nodeToString

From

Robert Haas

Date:

31 January 2024, 17:47:39

On Wed, Jan 31, 2024 at 11:17 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I was also thinking about smaller per-attribute expression storage, for index attribute expressions, table default
expressions,and functions. Other than that, less memory overhead for the serialized form of these constructs also helps
forcatalog cache sizes, etc. 
> People complained about the size of a fresh initdb, and I agreed with them, so I started looking at low-hanging
fruits,and this is one. 
>
> I've not done any tests yet on whether it's more performant in general. I'd expect the new code to do a bit better
giventhe extremely verbose nature of the data and the rather complex byte-at-a-time token read method used, but this is
currentlyhypothesis. 
> I do think that serialization itself may be slightly slower, but given that this generally happens only in DDL, and
thatwe have to grow the output buffer less often, this too may still be a net win (but, again, this is an untested
hypothesis).

I think we're going to have to have separate formats for debugging and
storage if we want to get very far here. The current format sucks for
readability because it's so verbose, and tightening that up where we
can makes sense to me. For me, that can include things like emitting
unset location fields for sure, but delta-encoding of bitmap sets is
more questionable. Turning 1 2 3 4 5 6 7 8 9 10 into 1-10 would be
fine with me because that is both shorter and more readable, but
turning 2 4 6 8 10 into 2 2 2 2 2 is way worse for a human reader.
Such optimizations might make sense in a format that is designed for
computer processing only but not one that has to serve multiple
purposes.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

12 February 2024, 18:03:30

On Wed, 31 Jan 2024 at 18:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jan 31, 2024 at 11:17 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I was also thinking about smaller per-attribute expression storage, for index attribute expressions, table default
expressions,and functions. Other than that, less memory overhead for the serialized form of these constructs also helps
forcatalog cache sizes, etc. 
> > People complained about the size of a fresh initdb, and I agreed with them, so I started looking at low-hanging
fruits,and this is one. 
> >
> > I've not done any tests yet on whether it's more performant in general. I'd expect the new code to do a bit better
giventhe extremely verbose nature of the data and the rather complex byte-at-a-time token read method used, but this is
currentlyhypothesis. 
> > I do think that serialization itself may be slightly slower, but given that this generally happens only in DDL, and
thatwe have to grow the output buffer less often, this too may still be a net win (but, again, this is an untested
hypothesis).
>
> I think we're going to have to have separate formats for debugging and
> storage if we want to get very far here. The current format sucks for
> readability because it's so verbose, and tightening that up where we
> can makes sense to me. For me, that can include things like emitting
> unset location fields for sure, but delta-encoding of bitmap sets is
> more questionable. Turning 1 2 3 4 5 6 7 8 9 10 into 1-10 would be
> fine with me because that is both shorter and more readable, but
> turning 2 4 6 8 10 into 2 2 2 2 2 is way worse for a human reader.
> Such optimizations might make sense in a format that is designed for
> computer processing only but not one that has to serve multiple
> purposes.

I suppose so, yes. I've removed the delta-encoding from the
serialization of bitmapsets in the attached patchset.

Peter E. and I spoke about this patchset at FOSDEM PGDay, too. I said
to him that I wouldn't mind if this patchset was only partly applied:
The gains for most of the changes are definitely worth it even if some
others don't get in.

I think it'd be a nice QoL and storage improvement if even only (say)
the first two patches were committed, though the typmod default
markings (or alternatively, using a typedef-ed TypMod and one more
type-specific serialization handler) would also be a good improvement
without introducing too many "common value = default = omitted"
considerations that would reduce debugability.

Attached is patchset v2, which contains the improvements from these patches:

0001 has the "omit defaults" for the current types.
    -23.5%pt / -35.1%pt (toasted / raw)
0002+0003 has new #defined type "Location" for those fields in Nodes
that point into (or have sizes of) query texts, and adds
infrastructure to conditionally omit them at all (see previous
discussions)
    -3.5%pt / -6.3%pt
0004 has new #defined type TypeMod as alias for int32, that uses a
default value of -1 for (de)serialization purposes.
    -3.0%pt / -6.1%pt
0005 updates Const node serialization to omit `:constvalue` if the
value is null.
    +0.1%pt / -0.1%pt [^0]
0006 does run-length encoding for bitmaps and the various typed
integer lists, using "+int" as indicators of a run of a certain
length, excluding the start value.
     Bitmaps, IntLists and XidLists are based on runs with increments
of 1 (so, a notation (i 1 +3) means (i 1 2 3 4), while OidLists are
based on runs with no increments (so, (o 1 +3) means (o 1 1 1 1).
    -2.5%pt / -0.6%pt
0007 does add some select custom 'default' values, in that the
varnosyn and varattnosyn fields now treat the value of varno and
varattno as their default values.
    This reduces the size of lists of Vars significantly and has a
very meaningful impact on the size of the compressed data (the default
pg_rewrite dataset contains some 10.8k Var nodes).
    -10.4%pt / 9.7%pt

Total for the full applied patchset:
    55.5% smaller data in pg_rewrite.ev_action before TOAST
    45.7% smaller data in pg_rewrite.ev_action after applying TOAST

Toast relation size, as fraction of the main pg_rewrite table:
select pg_relation_size(2838) *1.0 / pg_relation_size('pg_rewrite');
  master: 4.7
  0007: 1.3

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^0]: The small difference in size for patch 0005 is presumably due to
low occurrance of NULL-valued Const nodes. Additionally, the inline vs
out-of-line TOASTed data and data (not) fitting on the last blocks of
each relation are likely to cause the change in total compression
ratio. If we had more null-valued Const nodes in pg_rewrite, the ratio
would presumably have been better than this.

PS: query I used for my data collection, + combined data:

select 'master' as "version"
     , pg_database_size('template0') as "template0"
     , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
     , sum(pg_column_size(ev_action)) as "toasted"
     , sum(octet_length(ev_action)) as "raw";

 version | template0 | pg_rewrite | toasted |   raw
---------+-----------+------------+---------+---------
 master  |   7537167 |     770048 |  574003 | 3002556
 0001    |   7348751 |     630784 |  438852 | 1946364
 0002    |   7242255 |     573440 |  403160 | 1840404
 0003    |   7242255 |     573440 |  402325 | 1838367
 0004    |   7225871 |     557056 |  384888 | 1652287
 0005    |   7234063 |     565248 |  385678 | 1648717
 0006    |   7217679 |     548864 |  371256 | 1627733
 0007    |   7143951 |     475136 |  311255 | 1337496

Attachment

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

12 February 2024, 19:32:56

On Mon, 12 Feb 2024 at 19:03, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached is patchset v2, which contains the improvements from these patches:

Attached v3, which fixes an out-of-bounds read in pg_strtoken_next,
detected by asan, that was a likely cause of the problems in CFBot's
FreeBSD regression tests.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

12 February 2024, 23:10:56

On Mon, 12 Feb 2024 at 20:32, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 12 Feb 2024 at 19:03, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Attached is patchset v2, which contains the improvements from these patches:
>
> Attached v3, which fixes an out-of-bounds read in pg_strtoken_next,
> detected by asan, that was a likely cause of the problems in CFBot's
> FreeBSD regression tests.

Apparently that was caused by issues in my updated bitmapset
serializer; where I used bms_next_member(..., x=0) as first iteration
thus skipping the first bit. This didn't show up earlier because that
bit is not exercised in PG's builtin views, but is exercised when
WRITE_READ_PARSE_PLAN_TREES is defined (as on the FreeBSD CI job).

Trivial fix in the attached v4 of the patchset, with some fixes for
other assertions that'd get some exercise in non-pg_node_tree paths in
the WRITE_READ configuration.

Attachment

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

15 February 2024, 12:59:04

Thanks, this patch set is a good way to incrementally work through these 
changes.

I have looked at 
v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today. 
Here are my thoughts:

I believe we had discussed offline to not omit enum fields with value 0 
(WRITE_ENUM_FIELD).  This is because the values of enum fields are 
implementation artifacts, and this could be confusing for readers. 
(This could be added as a squeeze-out-every-byte change later, but if 
we're going to keep the format fit for human reading, I think we should 
skip this.)

I have some concerns about the round-trippability of float values.  If 
we do, effectively, if (node->fldname != 0.0), then I think this would 
also match negative zero, but when we read it back it, it would get 
assigned positive zero.  Maybe there are other edge cases like this. 
Might be safer to not mess with this.

On the reading side, the macro nesting has gotten a bit out of hand. :) 
We had talked earlier in the thread about the _DIRECT macros and you 
said there were left over from something else you want to try, but I see 
nothing else in this patch set uses this.  I think this could all be 
much simpler, like (omitting required punctuation)

#define READ_INT_FIELD(fldname, default)
     if ((token = next_field(fldname, &length)))
         local_node->fldname = atoi(token);
     else
         local_node->fldname = default;

where next_field() would

1. read the next token
2. if it is ":fldname", continue;
    else rewind the read pointer and return NULL
3. read the next token and return that

Not only is this simpler, but it might also have better performance, 
because we don't have separate pg_strtok_next() and pg_strtok() calls in 
sequence.

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

15 February 2024, 14:37:53

On Thu, 15 Feb 2024 at 13:59, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> Thanks, this patch set is a good way to incrementally work through these
> changes.
>
> I have looked at
> v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today.
> Here are my thoughts:
>
> I believe we had discussed offline to not omit enum fields with value 0
> (WRITE_ENUM_FIELD).  This is because the values of enum fields are
> implementation artifacts, and this could be confusing for readers.

Thanks for reminding me, I didn't remember this when I worked on
updating the patchset. I'll update this soon.

> I have some concerns about the round-trippability of float values.  If
> we do, effectively, if (node->fldname != 0.0), then I think this would
> also match negative zero, but when we read it back it, it would get
> assigned positive zero.  Maybe there are other edge cases like this.
> Might be safer to not mess with this.

That's a good point. Would an additional check that the sign of the
field equals the default's sign be enough for this? As for other
cases, I'm not sure we currently want to support non-normal floats,
even if it is technically possible to do the round-trip in the current
format.

> On the reading side, the macro nesting has gotten a bit out of hand. :)
> We had talked earlier in the thread about the _DIRECT macros and you
> said there were left over from something else you want to try, but I see
> nothing else in this patch set uses this.  I think this could all be
> much simpler, like (omitting required punctuation)
[...]
> Not only is this simpler, but it might also have better performance,
> because we don't have separate pg_strtok_next() and pg_strtok() calls in
> sequence.

Good points. I'll see what I can do here.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

19 February 2024, 13:19:58

On Thu, 15 Feb 2024 at 15:37, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Thu, 15 Feb 2024 at 13:59, Peter Eisentraut <peter@eisentraut.org> wrote:
> >
> > Thanks, this patch set is a good way to incrementally work through these
> > changes.
> >
> > I have looked at
> > v4-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch today.
> > Here are my thoughts:
> >
> > I believe we had discussed offline to not omit enum fields with value 0
> > (WRITE_ENUM_FIELD).  This is because the values of enum fields are
> > implementation artifacts, and this could be confusing for readers.
>
> Thanks for reminding me, I didn't remember this when I worked on
> updating the patchset. I'll update this soon.

This has been split into patch 0008 in the set. A query on ev_action
shows that enum default-0-omission is effective on 1994 fields:

select match, count(*)
from pg_rewrite,
    lateral (
        select unnest(regexp_matches(ev_action, '(:\w+ 0)[^0-9]', 'g')) match
    )
group by 1 order by 2 desc;
     match      | count
-----------------+-------
 :funcformat 0   |   587
 :rtekind 0      |   449
 :limitOption 0  |   260
 :querySource 0  |   260
 :override 0     |   260
 :jointype 0     |   156
 :aggsplit 0     |    15
 :subLinkType 0  |     5
 :nulltesttype 0 |     2

> > On the reading side, the macro nesting has gotten a bit out of hand. :)
> > We had talked earlier in the thread about the _DIRECT macros and you
> > said there were left over from something else you want to try, but I see
> > nothing else in this patch set uses this.  I think this could all be
> > much simpler, like (omitting required punctuation)
> [...]
> > Not only is this simpler, but it might also have better performance,
> > because we don't have separate pg_strtok_next() and pg_strtok() calls in
> > sequence.
>
> Good points. I'll see what I can do here.

Attached the updated version of the patch on top of 5497daf3, which
incorporates this last round of feedback. It moves the
default-0-omission for Enums to newly added 0008, and checks the sign
to deal with +0/-0 issues in float default checks.
See below for updated numbers.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

New numbers:

select 'master' as "version"
     , pg_database_size('template0') as "template0"
     , pg_total_relation_size('pg_rewrite') as "rel_total"
     , pg_relation_size('pg_rewrite', 'main') as "rel_main"
     , sum(pg_column_size(ev_action)) as "toasted"
     , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | rel_total | rel_main | toasted |   raw
---------+-----------+-----------+----------+---------+---------
 master  |   7528975 |    770048 |   114688 |  574051 | 3002981
 0001    |   7348751 |    630784 |   131072 |  448495 | 1972854
 0002    |   7250447 |    589824 |   131072 |  412261 | 1866880
 0003    |   7242255 |    581632 |   131072 |  410476 | 1864843
 0004    |   7225871 |    565248 |   139264 |  393801 | 1678735
 0005    |   7225871 |    565248 |   139264 |  393556 | 1675165
 0006    |   7217679 |    557056 |   139264 |  379062 | 1654178
 0007    |   7160335 |    491520 |   155648 |  322145 | 1363885
 0008    |   7135759 |    475136 |   155648 |  311294 | 1337649

Attachment

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

22 February 2024, 12:37:00

On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached the updated version of the patch on top of 5497daf3, which
> incorporates this last round of feedback.

Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
and an issue in the previous patchset: I attached one too many v3-0001
from a previous patch I worked on.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

22 February 2024, 15:07:55

On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Attached the updated version of the patch on top of 5497daf3, which
> > incorporates this last round of feedback.
>
> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> and an issue in the previous patchset: I attached one too many v3-0001
> from a previous patch I worked on.

... and now with a fix for not overwriting newly deserialized location
attributes with -1, which breaks test output for
READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
changes since the patch of last Monday.

Sorry for the noise,

-Matthias

Attachment

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

11 March 2024, 13:19:44

On 22.02.24 16:07, Matthias van de Meent wrote:
> On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>>
>> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
>> <boekewurm+postgres@gmail.com> wrote:
>>> Attached the updated version of the patch on top of 5497daf3, which
>>> incorporates this last round of feedback.
>>
>> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
>> and an issue in the previous patchset: I attached one too many v3-0001
>> from a previous patch I worked on.
> 
> ... and now with a fix for not overwriting newly deserialized location
> attributes with -1, which breaks test output for
> READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
> changes since the patch of last Monday.

* v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch

This patch looks much more complicated than I was expecting.  I had 
suggested to model this after stringToNodeWithLocations().  This uses a 
global variable to toggle the mode.  Your patch creates a function 
nodeToStringNoQLocs() -- why the different naming scheme? -- and passes 
the flag down as an argument to all the output functions.  I mean, in a 
green field, avoiding global variables can be sensible, of course, but I 
think in this limited scope here it would really be better to keep the 
code for the two directions read and write the same.

Attached is a small patch that shows what I had in mind.  (It doesn't 
contain any callers, but your patch shows where all those would go.)

* v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch

This looks sensible, but maybe making Location a global type is a bit 
much?  Maybe something more specific like ParseLocation, or ParseLoc, to 
keep it under 12 characters.

Attachment

0001-Add-nodeToStringWithoutLocations.patch.nocfbot

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

11 March 2024, 20:52:26

On Mon, 11 Mar 2024 at 14:19, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 22.02.24 16:07, Matthias van de Meent wrote:
> > On Thu, 22 Feb 2024 at 13:37, Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> >>
> >> On Mon, 19 Feb 2024 at 14:19, Matthias van de Meent
> >> <boekewurm+postgres@gmail.com> wrote:
> >>> Attached the updated version of the patch on top of 5497daf3, which
> >>> incorporates this last round of feedback.
> >>
> >> Now attached rebased on top of 93db6cbd to fix conflicts with fbc93b8b
> >> and an issue in the previous patchset: I attached one too many v3-0001
> >> from a previous patch I worked on.
> >
> > ... and now with a fix for not overwriting newly deserialized location
> > attributes with -1, which breaks test output for
> > READ_WRITE_PARSE_PLAN_TREES installations. Again, no other significant
> > changes since the patch of last Monday.
>
> * v7-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch
>
> This patch looks much more complicated than I was expecting.  I had
> suggested to model this after stringToNodeWithLocations().  This uses a
> global variable to toggle the mode.  Your patch creates a function
> nodeToStringNoQLocs() -- why the different naming scheme?

It doesn't just exclude .location fields, but also Query.len, a
similar field which contains the length of the query's string. The
name has been further refined to nodeToStringNoParseLocs() in the
attached version, but feel free to replace the names in the patch to
anything else you might want.

>  -- and passes
> the flag down as an argument to all the output functions.  I mean, in a
> green field, avoiding global variables can be sensible, of course, but I
> think in this limited scope here it would really be better to keep the
> code for the two directions read and write the same.

I'm a big fan of _not_ using magic global variables as passed context
without resetting on subnormal exits...
For GUCs their usage is understandable (and there is infrastructure to
reset them, and you're not supposed to manually update them), but IMO
here its usage should be a function-scoped variable or in a
passed-by-reference context struct, not a file-local static.
Regardless, attached is an adapted version with the file-local
variable implementation.

> Attached is a small patch that shows what I had in mind.  (It doesn't
> contain any callers, but your patch shows where all those would go.)

Attached a revised version that does it like stringToNodeInternal's
handling of restore_location_fields.

> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>
> This looks sensible, but maybe making Location a global type is a bit
> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> keep it under 12 characters.

I've gone with ParseLoc in the attached v8 patchset.

Kind regards,

Matthias van de Meent

Attachment

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

19 March 2024, 16:13:47

On 11.03.24 21:52, Matthias van de Meent wrote:
>> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>>
>> This looks sensible, but maybe making Location a global type is a bit
>> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
>> keep it under 12 characters.
> I've gone with ParseLoc in the attached v8 patchset.

I have committed this one.

I moved the typedef to nodes/nodes.h, where we already had similar 
typdefs (Cardinality, etc.).  The fields stmt_location and stmt_len in 
PlannedStmt were not converted, so I fixed that.  Also, between you 
writing your patch and now, at least one new node type was added, so I 
fixed that one up, too.  (I diffed the generated node support functions 
to check.)  Hopefully, future hackers will apply the new type when 
appropriate.

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

19 March 2024, 16:46:26

On Tue, 19 Mar 2024 at 17:13, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 11.03.24 21:52, Matthias van de Meent wrote:
> >> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
> >>
> >> This looks sensible, but maybe making Location a global type is a bit
> >> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> >> keep it under 12 characters.
> > I've gone with ParseLoc in the attached v8 patchset.
>
> I have committed this one.

Thanks!

> I moved the typedef to nodes/nodes.h, where we already had similar
> typdefs (Cardinality, etc.).  The fields stmt_location and stmt_len in
> PlannedStmt were not converted, so I fixed that.  Also, between you
> writing your patch and now, at least one new node type was added, so I
> fixed that one up, too.

Good points, thank you for fixing that.

> (I diffed the generated node support functions
> to check.)  Hopefully, future hackers will apply the new type when
> appropriate.

Are you also planning on committing some of the other patches later,
or should I rebase the set to keep CFBot happy?

-Matthias

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

20 March 2024, 11:49:52

On 19.03.24 17:13, Peter Eisentraut wrote:
> On 11.03.24 21:52, Matthias van de Meent wrote:
>>> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>>>
>>> This looks sensible, but maybe making Location a global type is a bit
>>> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
>>> keep it under 12 characters.
>> I've gone with ParseLoc in the attached v8 patchset.
> 
> I have committed this one.

Next, I was looking at 
v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch.  After 
applying that, I was looking how many uses of nodeToString() (with 
locations) were left.  I think your patch forgot to convert a number of 
them, and there also might have been a few new ones that came in with 
other recent patches.  Might be hard to make sure all new developments 
do this right.  Plus, there are various mentions in the documentation 
that should be updated.  After considering all that, there weren't 
really many callers of nodeToString() left.  It's really only debugging 
support in postgres.c and print.c, and a few places were it doesn't 
matter, like the few places where it initializes "cooked expressions", 
which were in turn already stripped of location fields at some earlier time.

So anyway, my idea was that we should turn this around and make 
nodeToString() always drop location information, and instead add 
nodeToStringWithLocations() for the few debugging uses.  And this would 
also be nice because then it matches exactly with the existing 
stringToNodeWithLocations().

Attached patch shows this.

Attachment

0001-Do-not-output-actual-value-of-location-fields-in-nod.patch

Re: Reducing output size of nodeToString

From

Matthias van de Meent

Date:

20 March 2024, 12:03:39

On Wed, 20 Mar 2024 at 12:49, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 19.03.24 17:13, Peter Eisentraut wrote:
> > On 11.03.24 21:52, Matthias van de Meent wrote:
> >>> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
> >>>
> >>> This looks sensible, but maybe making Location a global type is a bit
> >>> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
> >>> keep it under 12 characters.
> >> I've gone with ParseLoc in the attached v8 patchset.
> >
> > I have committed this one.
>
> Next, I was looking at
> v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch.

[...]

> So anyway, my idea was that we should turn this around and make
> nodeToString() always drop location information, and instead add
> nodeToStringWithLocations() for the few debugging uses.  And this would
> also be nice because then it matches exactly with the existing
> stringToNodeWithLocations().

That seems reasonable, yes.

-Matthias

Re: Reducing output size of nodeToString

From

Peter Eisentraut

Date:

22 March 2024, 09:18:33

On 20.03.24 13:03, Matthias van de Meent wrote:
> On Wed, 20 Mar 2024 at 12:49, Peter Eisentraut <peter@eisentraut.org> wrote:
>>
>> On 19.03.24 17:13, Peter Eisentraut wrote:
>>> On 11.03.24 21:52, Matthias van de Meent wrote:
>>>>> * v7-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch
>>>>>
>>>>> This looks sensible, but maybe making Location a global type is a bit
>>>>> much?  Maybe something more specific like ParseLocation, or ParseLoc, to
>>>>> keep it under 12 characters.
>>>> I've gone with ParseLoc in the attached v8 patchset.
>>>
>>> I have committed this one.
>>
>> Next, I was looking at
>> v8-0003-pg_node_tree-Don-t-store-query-text-locations-in-.patch.
> 
> [...]
> 
>> So anyway, my idea was that we should turn this around and make
>> nodeToString() always drop location information, and instead add
>> nodeToStringWithLocations() for the few debugging uses.  And this would
>> also be nice because then it matches exactly with the existing
>> stringToNodeWithLocations().
> 
> That seems reasonable, yes.

I have committed that one.

This takes care of your patches v8-0002 and v8-0003.

About the rest of your patch set:

As long as we have only one output format, we need to balance several 
uses, including debugging, storage size, (de)serialization performance.

Your patches v8-0005 and up are clearly positive for storage size but 
negative for debugging.  So I think we can't consider them now.

Your patches v8-0001 ("pg_node_tree: Omit serialization of fields with 
default values.") and v8-0004 ("gen_node_support.pl: Add a TypMod type 
for signalling TypMod behaviour") are also good for storage size.  I 
don't know how they affect serialization performance.  I also don't know 
how good they are for debugging.  I have argued here and there that 
omitting unset fields can make node dumps more readable.  But that's 
just me.  I have looked at a lot of Query and RangeTblEntry nodes 
lately, which contain many rarely used fields.  But other people might 
have completely different experiences, with other node and tree types. 
We didn't get much feedback from anyone else in this thread, so I'm very 
hesitant to impose this on everyone without any consensus.

I could see "Omit serialization of fields with default values" as a 
separate toggle for debug node dumps.

Also, there is clearly some lingering interesting in a separate 
binary-ish serialization format for internal use.  This should probably 
also take a look at (de)serialization performance, which we haven't 
really done in this thread.  In a way, with the omit default values 
patch, the serialization and deserialization does more work, so it could 
have an adverse impact.  But we don't know.

I think to proceed we need more buy-in on the "what do I want from my 
node dumps" side, and more performance numbers on the other side. 
Saving a few hundred kilobytes on view storage is fine but isn't by 
itself that useful, I think, if it potentially negatively affects other 
uses.