Re: Reducing output size of nodeToString - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: Reducing output size of nodeToString
Msg-id ff666461-bbcf-4bbf-a3ac-262785004377@eisentraut.org
In response to Reducing output size of nodeToString  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Reducing output size of nodeToString
List pgsql-hackers
On 06.12.23 22:08, Matthias van de Meent wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size significantly.
> 
> The basic techniques used are
>   - Don't emit scalar fields when they contain a default value, and
> make the reading code aware of this.
>   - Reasonable defaults are set for most datatypes, and overrides can
> be added with new pg_node_attr() attributes. No introspection into
> non-null Node/Array/etc. is being done though.
>   - Reset more fields to their default values before storing the values.
>   - Don't write trailing 0s in outDatum calls for by-ref types. This
> saves many bytes for Name fields, but also some other pre-existing
> entry points.
> 
> Future work will probably have to be on a significantly different
> storage format, as the textual format is about to hit its entropy
> limits.
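
To make the first technique in the quoted list concrete, here is a minimal, self-contained sketch. It is not taken from the patch: ToyNode, the TOY_DEFAULT_* values, and both functions are hypothetical, and having the reader look fields up by label is an assumption of this sketch, not necessarily how the patch makes the reading code aware of omitted fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy node; not a real PostgreSQL node type. */
typedef struct ToyNode
{
    int kind;
    int flags;
    int location;
} ToyNode;

/* Assumed "default" (most common) values, chosen only for illustration. */
#define TOY_DEFAULT_FLAGS     0
#define TOY_DEFAULT_LOCATION  (-1)

/* Append " :name value" at offset 'used' and return the new offset. */
static size_t
append_int_field(char *buf, size_t size, size_t used,
                 const char *name, int value)
{
    int w = snprintf(buf + used, size - used, " :%s %d", name, value);

    assert(w >= 0 && (size_t) w < size - used);
    return used + (size_t) w;
}

/* Writer: emit a scalar field only when it differs from its default. */
static size_t
toy_out_compact(const ToyNode *n, char *buf, size_t size)
{
    int    w = snprintf(buf, size, "{TOY");
    size_t used;

    assert(w >= 0 && (size_t) w < size);
    used = (size_t) w;

    used = append_int_field(buf, size, used, "kind", n->kind);
    if (n->flags != TOY_DEFAULT_FLAGS)
        used = append_int_field(buf, size, used, "flags", n->flags);
    if (n->location != TOY_DEFAULT_LOCATION)
        used = append_int_field(buf, size, used, "location", n->location);

    w = snprintf(buf + used, size - used, "}");
    assert(w >= 0 && (size_t) w < size - used);
    return used + (size_t) w;
}

/* Reader: pre-fill the defaults, then override whatever fields are present. */
static int
toy_read_compact(const char *s, ToyNode *n)
{
    char label[32];
    int  value;
    int  consumed;

    n->kind = 0;
    n->flags = TOY_DEFAULT_FLAGS;
    n->location = TOY_DEFAULT_LOCATION;

    if (strncmp(s, "{TOY", 4) != 0)
        return 0;
    s += 4;

    while (sscanf(s, " :%31s %d%n", label, &value, &consumed) == 2)
    {
        if (strcmp(label, "kind") == 0)
            n->kind = value;
        else if (strcmp(label, "flags") == 0)
            n->flags = value;
        else if (strcmp(label, "location") == 0)
            n->location = value;
        else
            return 0;           /* unknown field label */
        s += consumed;
    }
    return strcmp(s, "}") == 0;
}
```

In this toy version the reader has to inspect the labels, because a missing field can no longer be inferred from position alone.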

One thing that was mentioned repeatedly is that we might want different 
formats for human consumption and for machine storage.

For human consumption, I would like some format like what you propose, 
because it generally omits the "unset" or "uninteresting" fields.

But since you also talk about the size of pg_rewrite, I wonder whether 
it would be smaller if we didn't write the field names at all and 
instead wrote out all the field values.  (This should be pretty easy to 
test, since the read functions currently ignore the field names anyway; 
you could just write out all field names as "x" and see what happens.)
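
A rough, standalone illustration of that experiment, outside the PostgreSQL tree: ToyVar, toy_out() and toy_read() are hypothetical stand-ins rather than the real outfuncs.c/readfuncs.c code. The reader skips each label and takes values by position, so replacing every label with "x" shrinks the output without changing what is read back.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical toy node; not a real PostgreSQL node type. */
typedef struct ToyVar
{
    int varno;
    int varattno;
    int location;
} ToyVar;

/*
 * Write every field in a fixed order.  With short_labels set, each field
 * name is replaced by "x", as suggested for the size experiment.
 */
static size_t
toy_out(const ToyVar *v, char *buf, size_t size, int short_labels)
{
    int n = snprintf(buf, size, "{TOYVAR :%s %d :%s %d :%s %d}",
                     short_labels ? "x" : "varno", v->varno,
                     short_labels ? "x" : "varattno", v->varattno,
                     short_labels ? "x" : "location", v->location);

    assert(n >= 0 && (size_t) n < size);
    return (size_t) n;
}

/*
 * Read the fields back purely by position.  "%*s" consumes and discards
 * each label, so the label text does not matter to the reader.
 */
static int
toy_read(const char *s, ToyVar *v)
{
    return sscanf(s, "{TOYVAR :%*s %d :%*s %d :%*s %d",
                  &v->varno, &v->varattno, &v->location) == 3;
}
```

Comparing the two return values of toy_out() for the same node gives the per-node saving from the labels alone; summing that over real pg_rewrite contents would be the actual measurement.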

I don't much like the way your patch uses the term "default".  Most of 
these default values are not defaults at all, but perhaps "most common 
values".  In theory, I would expect a default value to be initialized by 
makeNode().  (That could be an interesting feature, but let's stay 
focused here.)  But even then most of these "defaults" wouldn't be 
appropriate for a real default value.  This part seems quite 
controversial to me, and I would like to see some more details about how 
much this specifically really saves.

I don't quite understand why in your patch you have some fields as 
optional and some not.  Or is that what WRITE_NODE_FIELD() vs. 
WRITE_NODE_FIELD_OPT() means?  How is it decided which one to use?

The part that clears out the location fields in pg_rewrite entries might 
be worth considering as a separate patch.  Could you explain it more? 
Does it affect location pointers when using views at all?



