Re: Proposal to use JSON for Postgres Parser format - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Proposal to use JSON for Postgres Parser format
Date
Msg-id CAEze2Wi0GJiQLTt29Jcabq5DdVk5_GT6EZzqAnheC98hWx8cuQ@mail.gmail.com
Whole thread Raw
In response to Re: Proposal to use JSON for Postgres Parser format  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Proposal to use JSON for Postgres Parser format
List pgsql-hackers
On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Sep-20, Matthias van de Meent wrote:
>
> > Allow me to add: compressability
> >
> > In the thread surrounding [0] there were complaints about the size of
> > catalogs, and specifically the template database. Significant parts of
> > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> > consists mostly of serialized Nodes. If we're going to replace our
> > current NodeToText infrastructure, we'd better know we can effectively
> > compress this data.
>
> True.  Currently, the largest ev_action values compress pretty well.  I
> think if we wanted this to be more succint, we would have to invent some
> binary format -- perhaps something like Protocol Buffers: it'd be stored
> in the binary format in catalogs, but for output it would be converted
> into something easy to read (we already do this for
> pg_statistic_ext_data for example).  We'd probably lose compressibility,
> but that'd be okay because the binary format would already remove most
> of the redundancy by nature.
>
> Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression. As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Then again, I also agree that this argument doesn't hold it's weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is allright for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Kind regards,

Matthias van de Meent



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Next
From: Andres Freund
Date:
Subject: Re: [RFC] building postgres with meson - v13