Re: Proposal to use JSON for Postgres Parser format - Mailing list pgsql-hackers

From Michel Pelletier
Subject Re: Proposal to use JSON for Postgres Parser format
Date
Msg-id CACxu=v+EvGfn-3LLOfMLAuzf7P4-CvSWAJMvtvS8KYXhNMWPAg@mail.gmail.com
Whole thread Raw
In response to Re: Proposal to use JSON for Postgres Parser format  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List pgsql-hackers
On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Sep-20, Matthias van de Meent wrote:
>
> > Allow me to add: compressability
> >
> > In the thread surrounding [0] there were complaints about the size of
> > catalogs, and specifically the template database. Significant parts of
> > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> > consists mostly of serialized Nodes. If we're going to replace our
> > current NodeToText infrastructure, we'd better know we can effectively
> > compress this data.
>
> True.  Currently, the largest ev_action values compress pretty well.  I
> think if we wanted this to be more succint, we would have to invent some
> binary format -- perhaps something like Protocol Buffers: it'd be stored
> in the binary format in catalogs, but for output it would be converted
> into something easy to read (we already do this for
> pg_statistic_ext_data for example).  We'd probably lose compressibility,
> but that'd be okay because the binary format would already remove most
> of the redundancy by nature.
>
> Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable everywhere.  My personal need for this is purely developer experience, and Tom pointed out, a "niche" need for sure, but we are starting to do some serious work with Dan Lynch's plpgsql deparser tool to generate RLS policies from meta schema models, and having the same format come out of the parser would make a complete end to end solution for us, especially if we can get this data from a function in a ddl_command_start event trigger.  Dan also writes a popular deparser for Javascript, and unifying the formats across these tools would be a big win for us.
 
As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Agreed.
 
Then again, I also agree that this argument doesn't hold it's weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

Separating formats seems like a lot of work to me, to get what might not be a huge improvement over compressing JSON, for what seems unlikely to be more than a few megabytes of parsed SQL.
 
As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is allright for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Agreed.

 Thank you everyone who has contributed to this thread, I'm pleased that it got a very spirited debate and I apologize for the delay in getting back to everyone. 

I'd like to spike on a proposed patch that:

  - Converts the existing text format to JSON (or possibly jsonb, considering feedback from this thread)
  - Can be stored compressed
  - Can be passed to a ddl_command_start event trigger with a function.

Thoughts?

-Michel

pgsql-hackers by date:

Previous
From: Zhang Mingli
Date:
Subject: Re: Reducing duplicativeness of EquivalenceClass-derived clauses
Next
From: vignesh C
Date:
Subject: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE