Thread: Proposal to use JSON for Postgres Parser format

Proposal to use JSON for Postgres Parser format

From: Michel Pelletier

Hello hackers,

As noted in the source:

https://github.com/postgres/postgres/blob/master/src/include/nodes/pg_list.h#L6-L11

 * Once upon a time, parts of Postgres were written in Lisp and used real
 * cons-cell lists for major data structures.  When that code was rewritten
 * in C, we initially had a faithful emulation of cons-cell lists, which
 * unsurprisingly was a performance bottleneck.  A couple of major rewrites
 * later, these data structures are actually simple expansible arrays;
 * but the "List" name and a lot of the notation survives.

The Postgres parser format, as described on the wiki page:

https://wiki.postgresql.org/wiki/Query_Parsing

looks almost, but not quite, entirely like JSON:

    SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
       (
          {SELECT
          :distinctClause <>
          :intoClause <>
          :targetList (
             {RESTARGET
             :name <>
             :indirection <>
             :val
                {COLUMNREF
                :fields (
                   {A_STAR
                   }
                )
                :location 7
                }
             :location 7
             }
          )
          :fromClause (
             {RANGEVAR
             :schemaname <>
             :relname foo
             :inhOpt 2
             :relpersistence p
             :alias <>
             :location 14
             }
          )
          ... and so on
       )
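
For comparison, a JSON rendering of the same tree might look something
like this (a sketch only -- the field names and nesting here are
illustrative, not a settled mapping):

    {
      "SelectStmt": {
        "targetList": [
          { "ResTarget": {
              "val": { "ColumnRef": {
                  "fields": [ { "A_Star": {} } ],
                  "location": 7 } },
              "location": 7 } }
        ],
        "fromClause": [
          { "RangeVar": {
              "relname": "foo",
              "relpersistence": "p",
              "location": 14 } }
        ]
      }
    }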

This non-standard format is useful for visual inspection and perhaps
simple parsing, but the parsers that do exist for it are generally
tied to a particular language.  If there were a standard way to parse
queries, tools like code generators and analysis tools could build on
the many libraries that already handle JSON well.  Future potential
would include exposing this data to ddl_command_start event triggers.
Providing a JSON Schema would also aid tools that want to validate or
transform the JSON with rule-based systems.

I would like to propose a discussion about switching from this custom
format to JSON in a future major release.  The format in question is
generated and consumed by macros and functions found in
`src/backend/nodes/outfuncs.c` and `src/backend/nodes/readfuncs.c`,
and converting them to emit and read valid JSON would be relatively
straightforward.

One downside is that the change would not be forward compatible
across releases.  But since it is unlikely that much code relies on
this custom format, that should not be a huge problem for most users.

Thoughts?

-Michel

Re: Proposal to use JSON for Postgres Parser format

From: Tom Lane

Michel Pelletier <pelletier.michel@gmail.com> writes:
> I would like to propose a discussion that in a future major release
> Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose.  JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point.  It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent.  That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.
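
For illustration only (the function name here is invented, not a
settled spelling), usage might look like:

    SELECT pg_node_tree_to_json(ev_action)
    FROM pg_rewrite
    WHERE rulename = '_RETURN'
    LIMIT 1;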

            regards, tom lane



Re: Proposal to use JSON for Postgres Parser format

From: Peter Geoghegan

On Mon, Sep 19, 2022 at 7:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.

That seems like a perfectly good solution, as long as it can be done
in a way that doesn't leave consumers of the JSON output at any kind
of disadvantage.

I find the current node tree format ludicrously verbose, and generally
hard to work with. But it's not the format itself, really -- that's
not the problem. The underlying data structures are typically very
information dense. So having an output format that's a known quantity
sounds very valuable to me.

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement.  It will make the output format even more verbose,
but that might not matter in the same way as it does right now.
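
For example, something along these lines (a sketch only: the
conversion function is invented, and the JSON layout of the node tree
is assumed):

    -- Find rewrite rules whose stored query list contains a node with
    -- a (hypothetical) top-level "hasSubLinks": true field:
    SELECT rulename
    FROM pg_rewrite
    WHERE pg_node_tree_to_jsonb(ev_action) @> '[{"hasSubLinks": true}]';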

-- 
Peter Geoghegan



Re: Proposal to use JSON for Postgres Parser format

From: Thomas Munro

On Tue, Sep 20, 2022 at 12:16 PM Michel Pelletier
<pelletier.michel@gmail.com> wrote:
> This non-standard format

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).  At some point it was probably exchanging data between the
Lisp and C parts of POSTGRES, and you could maybe sorta claim it's
based on an ANSI standard (Lisp!), but not with a straight face :-)



Re: Proposal to use JSON for Postgres Parser format

From: Tom Lane

Peter Geoghegan <pg@bowt.ie> writes:
> Writing declarative @> containment queries against (say) a JSON
> variant of node tree format seems like it could be a huge quality of
> life improvement.

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else.  For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.
Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

            regards, tom lane



Re: Proposal to use JSON for Postgres Parser format

From: Amit Kapila

On Tue, Sep 20, 2022 at 7:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Michel Pelletier <pelletier.michel@gmail.com> writes:
> > I would like to propose a discussion that in a future major release
> > Postgres switch from this custom format to JSON.
>
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.  JSON doesn't fit our needs all that
> closely, so some things like bitmapsets would become a lot longer;
> and even where the semantics are pretty-much-the-same, JSON's
> insistence on details like quoting field names will add bytes.
> Perhaps making the physical storage be JSONB not JSON would help that
> pain point.  It's still far from ideal though.
>
> Maybe a compromise could be found whereby we provide a conversion
> function that converts whatever the catalog storage format is to
> some JSON equivalent.  That would address the needs of external
> code that doesn't want to write a custom parser, while not tying
> us directly to JSON.
>

I think the DDL deparsing stuff that is being discussed as a base for
DDL logical replication provides something like what you are saying
[1][2].

[1] - https://www.postgresql.org/message-id/CAFPTHDaqqGxqncAP42Z%3Dw9GVXDR92HN-57O%3D2Zy6tmayV2_eZw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Proposal to use JSON for Postgres Parser format

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> default reader/printer behaviour in small ways, including being case
> sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> structs for reasons that are lost AFAIK (there's a dark age between
> the commit history of the old Berkeley repo and our current repo, and
> it looks like plan nodes were still printed as #(NAME ...) at
> Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

            regards, tom lane



Re: Proposal to use JSON for Postgres Parser format

From: Peter Geoghegan

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are certainly use-cases for something like that, but let's
> be clear about it: that's a niche case of interest to developers
> and pretty much nobody else.  For ordinary users, what matters about
> the node tree storage format is compactness and speed of loading.

Of course. But is there any reason to think that there has to be even
a tiny cost imposed on users?

> Our existing format is certainly not great on those metrics, but
> I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections. This
is certainly not a new consideration.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

-- 
Peter Geoghegan



Re: Proposal to use JSON for Postgres Parser format

From: Peter Geoghegan

On Mon, Sep 19, 2022 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Wow, where did you find a commit history for Berkeley's code?
> There's evidence in the tarballs I have that they were using
> RCS, but I never heard that the repo was made public.

It's on GitHub:

https://github.com/kelvich/postgres_pre95
-- 
Peter Geoghegan



Re: Proposal to use JSON for Postgres Parser format

From: Thomas Munro

On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> > default reader/printer behaviour in small ways, including being case
> > sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> > structs for reasons that are lost AFAIK (there's a dark age between
> > the commit history of the old Berkeley repo and our current repo, and
> > it looks like plan nodes were still printed as #(NAME ...) at
> > Berkeley).
>
> Wow, where did you find a commit history for Berkeley's code?
> There's evidence in the tarballs I have that they were using
> RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to GitHub, as Peter
G reported faster than I could.



Re: Proposal to use JSON for Postgres Parser format

From: Thomas Munro

On Tue, Sep 20, 2022 at 4:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
> > > default reader/printer behaviour in small ways, including being case
> > > sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
> > > structs for reasons that are lost AFAIK (there's a dark age between
> > > the commit history of the old Berkeley repo and our current repo, and
> > > it looks like plan nodes were still printed as #(NAME ...) at
> > > Berkeley).
> >
> > Wow, where did you find a commit history for Berkeley's code?
> > There's evidence in the tarballs I have that they were using
> > RCS, but I never heard that the repo was made public.
>
> One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
> complete RCS history, but Stas Kelvich imported it to github as Peter
> G has just reported faster than I could.

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM", so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs.  (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

https://github.com/kelvich/postgres_pre95/blob/master/src/backend/lib/lispread.c#L132

It may still be within the bounds of what a real Lisp could be
convinced to read though, given a reader macro to handle {} and maybe
some other little tweaks here and there.



Re: Proposal to use JSON for Postgres Parser format

From: Tom Lane

Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Our existing format is certainly not great on those metrics, but
>> I do not see how "let's use JSON!" is a route to improvement.

> The existing format was designed with developer convenience as a goal,
> though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

> If it didn't have to be easy (or even practical) for developers to
> directly work with the output format, then presumably the format used
> internally could be replaced with something lower level and faster. So
> it seems like the two goals (developer ergonomics and faster
> interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace).  First we need to separate those.  We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

            regards, tom lane



Re: Proposal to use JSON for Postgres Parser format

From: Peter Geoghegan

On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As Munro adduces nearby, it'd be a stretch to conclude that the current
> format was designed with any Postgres-related goals in mind at all.
> I think he's right that it's a variant of some Lisp-y dump format that's
> probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

> I think the principal mistake in what we have now is that the storage
> format is identical to the "developer friendly" text format (plus or
> minus some whitespace).  First we need to separate those.  We could
> have more than one equivalent text format perhaps, and I don't have
> any strong objection to basing the text format (or one of them) on
> JSON.

Agreed.

-- 
Peter Geoghegan



Re: Proposal to use JSON for Postgres Parser format

From: Thomas Munro

On Tue, Sep 20, 2022 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
>
> That sounds very much like the 1980s graduate student equivalent of
> JSON to my ears.

Yeah.  Easy data interchange on Lisp systems is built in, just write
objects into a socket/file/whatever and read them back, as people now
do with JSON/XML/whatever.  That's the format we see here.

> JSON is generally manipulated as native Javascript/python/whatever
> lists, maps, and strings. It's an interchange format that tries not to
> be obtrusive in the same way as things like XML always are, at the
> cost of making things kinda dicey for things like numeric precision
> (unless you can account for everything). Isn't that...basically the
> same concept as the lisp-y dump format, at a high level?

Yes, s-expressions and JSON are absolutely the same concept; simple
representation of simple data structures of a dynamically typed
language.  There's even a chain of events connecting the two: JSON is
roughly the literal data syntax from Javascript's grammar, and
Javascript is the language that Brendan Eich developed after Netscape
hired him to do an embedded Lisp (Scheme) for the browser, except they
decided at some point to change tack and make their new language have
a surface grammar more like Java, the new hotness.  If the goal was to
make sure it caught on, it's hard to conclude they were wrong...



Re: Proposal to use JSON for Postgres Parser format

From: Alexander Korotkov

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Our existing format is certainly not great on those metrics, but
> >> I do not see how "let's use JSON!" is a route to improvement.
>
> > The existing format was designed with developer convenience as a goal,
> > though -- despite my complaints, and in spite of your objections.
>
> As Munro adduces nearby, it'd be a stretch to conclude that the current
> format was designed with any Postgres-related goals in mind at all.
> I think he's right that it's a variant of some Lisp-y dump format that's
> probably far hoarier than even Berkeley Postgres.
>
> > If it didn't have to be easy (or even practical) for developers to
> > directly work with the output format, then presumably the format used
> > internally could be replaced with something lower level and faster. So
> > it seems like the two goals (developer ergonomics and faster
> > interchange format for users) might actually be complementary.
>
> I think the principal mistake in what we have now is that the storage
> format is identical to the "developer friendly" text format (plus or
> minus some whitespace).  First we need to separate those.  We could
> have more than one equivalent text format perhaps, and I don't have
> any strong objection to basing the text format (or one of them) on
> JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage efficiency (shorter is better) and
serialization/deserialization efficiency (faster is better). On
these criteria, a custom binary format looks perfect.
2) Robustness in the face of corruption. It seems much easier to
detect data corruption, and possibly to attempt partial manual
recovery, with a textual format.
3) Standardness. It's better to use something known worldwide, or at
least used in other parts of PostgreSQL, than something completely
custom. From this perspective, JSON/JSONB is better than a custom
format.

------
Regards,
Alexander Korotkov



Re: Proposal to use JSON for Postgres Parser format

From: Alexander Korotkov

On Tue, Sep 20, 2022 at 1:00 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Peter Geoghegan <pg@bowt.ie> writes:
> > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Our existing format is certainly not great on those metrics, but
> > >> I do not see how "let's use JSON!" is a route to improvement.
> >
> > > The existing format was designed with developer convenience as a goal,
> > > though -- despite my complaints, and in spite of your objections.
> >
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
> >
> > > If it didn't have to be easy (or even practical) for developers to
> > > directly work with the output format, then presumably the format used
> > > internally could be replaced with something lower level and faster. So
> > > it seems like the two goals (developer ergonomics and faster
> > > interchange format for users) might actually be complementary.
> >
> > I think the principal mistake in what we have now is that the storage
> > format is identical to the "developer friendly" text format (plus or
> > minus some whitespace).  First we need to separate those.  We could
> > have more than one equivalent text format perhaps, and I don't have
> > any strong objection to basing the text format (or one of them) on
> > JSON.
>
> +1 for considering storage format and text format separately.
>
> Let's consider what our criteria could be for the storage format.
>
> 1) Storage efficiency (shorter is better) and
> serialization/deserialization efficiency (faster is better). On
> these criteria, a custom binary format looks perfect.
> 2) Robustness in the face of corruption. It seems much easier to
> detect data corruption, and possibly to attempt partial manual
> recovery, with a textual format.
> 3) Standardness. It's better to use something known worldwide, or at
> least used in other parts of PostgreSQL, than something completely
> custom. From this perspective, JSON/JSONB is better than a custom
> format.

(sorry, I've accidentally cut the last paragraph from the message)

It seems that there is no perfect fit for this multi-criteria
optimization, and we should pick what is more important.  Any
thoughts?

------
Regards,
Alexander Korotkov



Re: Proposal to use JSON for Postgres Parser format

From: Matthias van de Meent

On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Peter Geoghegan <pg@bowt.ie> writes:
> > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Our existing format is certainly not great on those metrics, but
> > >> I do not see how "let's use JSON!" is a route to improvement.
> >
> > > The existing format was designed with developer convenience as a goal,
> > > though -- despite my complaints, and in spite of your objections.
> >
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
> >
> > > If it didn't have to be easy (or even practical) for developers to
> > > directly work with the output format, then presumably the format used
> > > internally could be replaced with something lower level and faster. So
> > > it seems like the two goals (developer ergonomics and faster
> > > interchange format for users) might actually be complementary.
> >
> > I think the principal mistake in what we have now is that the storage
> > format is identical to the "developer friendly" text format (plus or
> > minus some whitespace).  First we need to separate those.  We could
> > have more than one equivalent text format perhaps, and I don't have
> > any strong objection to basing the text format (or one of them) on
> > JSON.
>
> +1 for considering storage format and text format separately.
>
> Let's consider what our criteria could be for the storage format.
>
> 1) Storage efficiency (shorter is better) and
> serialization/deserialization efficiency (faster is better). On
> these criteria, a custom binary format looks perfect.
> 2) Robustness in the face of corruption. It seems much easier to
> detect data corruption, and possibly to attempt partial manual
> recovery, with a textual format.
> 3) Standardness. It's better to use something known worldwide, or at
> least used in other parts of PostgreSQL, than something completely
> custom. From this perspective, JSON/JSONB is better than a custom
> format.

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of the 8080kB in a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

In that same thread, I also suggested that we could try to not emit a
Node's fields while serializing when they contain their default
values, such as the common `:location -1` or `:mynodefield <>`. Those
fields take up space in the format while conveying no interesting
information (the absence of that field would convey the same, given
the struct definition). It would be useful if this new serialized
format allowed us to do similar tricks cheaply.
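
A sketch of the idea in the current notation (collapsed to one line,
with my own elisions): the second form omits the fields that merely
restate their defaults:

    {RESTARGET :name <> :indirection <> :val {COLUMNREF ...} :location -1}
    {RESTARGET :val {COLUMNREF ...}}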

As for JSON vs JSONB for storage:
I'm fairly certain that JSONB is less compact than JSON (without
taking compression into the picture) due to the 4-byte guaranteed
overhead for each jsonb element; while for JSON that is only 2 bytes
for each (up to 3 when you consider separators, plus potential extra
overhead for escaped values that are unlikely to appear in our catalogs).
Some numbers can be stored more efficiently in JSONB, but only large
numbers and small fractions that we're unlikely to hit in system
views: a back-of-the-envelope calculation puts the cutoff point of
efficient storage between strings-of-decimals and Numeric at >10^12, <
-10^11, or very precise fractional values.
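
The difference is easy to check for any particular document, e.g.
(the sample document is arbitrary):

    -- Byte sizes of one and the same document as text and as jsonb:
    SELECT pg_column_size(j::text) AS as_text,
           pg_column_size(j)       AS as_jsonb
    FROM (SELECT '{"location": -1, "fields": ["a", "b"]}'::jsonb AS j) s;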

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com



Re: Proposal to use JSON for Postgres Parser format

From: Alvaro Herrera

On 2022-Sep-20, Matthias van de Meent wrote:

> Allow me to add: compressibility
>
> In the thread surrounding [0] there were complaints about the size of
> catalogs, and specifically the template database. Significant parts of
> that (688kB of the 8080kB in a fresh PG14 database) are in pg_rewrite, which
> consists mostly of serialized Nodes. If we're going to replace our
> current NodeToText infrastructure, we'd better know we can effectively
> compress this data.

True.  Currently, the largest ev_action values compress pretty well.  I
think if we wanted this to be more succint, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example).  We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.
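
For instance (assuming some extended statistics exist and have been
analyzed), the catalog stores an internal form but the type's output
function already renders it readably:

    SELECT stxdndistinct FROM pg_statistic_ext_data LIMIT 1;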

Do we want to go there?

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming"  (A. Stepanov)



Re: Proposal to use JSON for Postgres Parser format

From: Matthias van de Meent

On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Sep-20, Matthias van de Meent wrote:
>
> > Allow me to add: compressibility
> >
> > In the thread surrounding [0] there were complaints about the size of
> > catalogs, and specifically the template database. Significant parts of
> > that (688kB of the 8080kB in a fresh PG14 database) are in pg_rewrite, which
> > consists mostly of serialized Nodes. If we're going to replace our
> > current NodeToText infrastructure, we'd better know we can effectively
> > compress this data.
>
> True.  Currently, the largest ev_action values compress pretty well.  I
> think if we wanted this to be more succinct, we would have to invent some
> binary format -- perhaps something like Protocol Buffers: it'd be stored
> in the binary format in catalogs, but for output it would be converted
> into something easy to read (we already do this for
> pg_statistic_ext_data for example).  We'd probably lose compressibility,
> but that'd be okay because the binary format would already remove most
> of the redundancy by nature.
>
> Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression. As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Then again, I also agree that this argument doesn't hold its weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is all right for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Kind regards,

Matthias van de Meent



Re: Proposal to use JSON for Postgres Parser format

From: Michel Pelletier

On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2022-Sep-20, Matthias van de Meent wrote:
> >
> > > Allow me to add: compressibility
> > >
> > > In the thread surrounding [0] there were complaints about the size of
> > > catalogs, and specifically the template database. Significant parts of
> > > that (688kB of the 8080kB in a fresh PG14 database) are in pg_rewrite, which
> > > consists mostly of serialized Nodes. If we're going to replace our
> > > current NodeToText infrastructure, we'd better know we can effectively
> > > compress this data.
> >
> > True.  Currently, the largest ev_action values compress pretty well.  I
> > think if we wanted this to be more succinct, we would have to invent some
> > binary format -- perhaps something like Protocol Buffers: it'd be stored
> > in the binary format in catalogs, but for output it would be converted
> > into something easy to read (we already do this for
> > pg_statistic_ext_data for example).  We'd probably lose compressibility,
> > but that'd be okay because the binary format would already remove most
> > of the redundancy by nature.
> >
> > Do we want to go there?

> I don't think that a binary format would be much better for
> debugging/fixing than an optimization of the current textual format
> when combined with compression.

I agree, JSON is not perfect, but it compresses well and it's usable everywhere.  My personal need for this is purely developer experience and, as Tom pointed out, a "niche" need for sure; but we are starting to do some serious work with Dan Lynch's plpgsql deparser tool to generate RLS policies from meta schema models, and having the same format come out of the parser would give us a complete end-to-end solution, especially if we can get this data from a function in a ddl_command_start event trigger.  Dan also writes a popular deparser for JavaScript, and unifying the formats across these tools would be a big win for us.
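
To make that concrete, here is a rough sketch of the developer
experience I have in mind; pg_current_parse_tree_json() is an invented
name for a function that does not exist today:

    -- Sketch only: pg_current_parse_tree_json() is hypothetical.
    CREATE FUNCTION log_parse_tree() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
      RAISE NOTICE 'parse tree: %', pg_current_parse_tree_json();
    END;
    $$;

    CREATE EVENT TRIGGER show_parse_tree
      ON ddl_command_start
      EXECUTE FUNCTION log_parse_tree();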
 
> As I mentioned in that thread, there
> is a lot of improvement possible with the existing format, and I think
> any debugging of serialized nodes would greatly benefit from using a
> textual format.

Agreed.
 
Then again, I also agree that this argument doesn't hold it's weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

Separating the formats seems like a lot of work to me, to get what might not be a huge improvement over compressed JSON, for what seems unlikely to be more than a few megabytes of parsed SQL.
 
> As for best format: I don't know. The current format is usable, and a
> better format would not store any data for default values. JSON can do
> that, but I could think of many formats that could do the same (Smile,
> BSON, xml, etc.).
>
> I do not think that protobuf is the best choice for storage, though,
> because it has its own rules on what it considers a default value and
> what it does or does not serialize: zero is considered the only
> default for numbers, as is the empty string for text, etc.
> I think it is all right for general use, but with e.g. `location: -1`
> in just about every parse node we'd probably want to select our own
> values to ignore during (de)serialization of fields.

Agreed.

Thank you to everyone who has contributed to this thread; I'm pleased that it got a very spirited debate, and I apologize for the delay in getting back to everyone.

I'd like to spike on a proposed patch that:

  - Converts the existing text format to JSON (or possibly jsonb, considering feedback from this thread)
  - Can be stored compressed
  - Can be passed to a ddl_command_start event trigger with a function.

Thoughts?

-Michel

Re: Proposal to use JSON for Postgres Parser format

From: Andres Freund

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
> There are certainly reasons to think about changing the node tree
> storage format; but if we change it, I'd like to see it go to something
> more compact not more verbose.

Very much seconded - the various pg_node_trees are a quite significant
fraction of the overall size of an empty database. And they're not
particularly useful for a human either.

IIRC it's not just catalog storage that's affected; it's also relevant
for parallel query.

My pet peeve is the way datums are output: each individual byte
printed as an integer, which for narrow fixed-width datums includes a
lot of 0s for bytes that aren't even used in the datum.
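
For the record, the notation in question looks something like this for
an int4 Const (byte values here are illustrative and depend on
endianness):

    :constvalue 4 [ 42 0 0 0 ]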


> Maybe a compromise could be found whereby we provide a conversion function
> that converts whatever the catalog storage format is to some JSON
> equivalent.  That would address the needs of external code that doesn't want
> to write a custom parser, while not tying us directly to JSON.

+1

Greetings,

Andres Freund



Re: Proposal to use JSON for Postgres Parser format

From: Andrew Dunstan

On 2022-10-27 Th 19:38, Andres Freund wrote:
> Hi,
>
> On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
>> Maybe a compromise could be found whereby we provide a conversion function
>> that converts whatever the catalog storage format is to some JSON
>> equivalent.  That would address the needs of external code that doesn't want
>> to write a custom parser, while not tying us directly to JSON.
> +1
>


Agreed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Proposal to use JSON for Postgres Parser format

From: Alexander Korotkov

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 2022-10-27 Th 19:38, Andres Freund wrote:
> > Hi,
> >
> > On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
> >> Maybe a compromise could be found whereby we provide a conversion function
> >> that converts whatever the catalog storage format is to some JSON
> >> equivalent.  That would address the needs of external code that doesn't want
> >> to write a custom parser, while not tying us directly to JSON.
> > +1
> >
>
>
> Agreed.

+1

Michel, it seems that you now have a green light to implement the
node-to-JSON function.

------
Regards,
Alexander Korotkov



Re: Proposal to use JSON for Postgres Parser format

From: Matthias van de Meent

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 2022-10-27 Th 19:38, Andres Freund wrote:
>> > Hi,
>> >
>> > On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
>> >> Maybe a compromise could be found whereby we provide a conversion function
>> >> that converts whatever the catalog storage format is to some JSON
>> >> equivalent.  That would address the needs of external code that doesn't want
>> >> to write a custom parser, while not tying us directly to JSON.
>> > +1
>>
>> Agreed.
>
> +1
>
> Michel, it seems that you now have a green light to implement node to
> json function.

I think that Tom's proposal that we +1'd is a pg_node_tree-to-JSON
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure about changing the storage format of pg_node_tree to json.

Kind regards,

Matthias van de Meent



Re: Proposal to use JSON for Postgres Parser format

From: Michel Pelletier



On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> >> On 2022-10-27 Th 19:38, Andres Freund wrote:
> >> > Hi,
> >> >
> >> > On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
> >> >> Maybe a compromise could be found whereby we provide a conversion function
> >> >> that converts whatever the catalog storage format is to some JSON
> >> >> equivalent.  That would address the needs of external code that doesn't want
> >> >> to write a custom parser, while not tying us directly to JSON.
> >> > +1
> >>
> >> Agreed.
> >
> > +1
> >
> > Michel, it seems that you now have a green light to implement the
> > node-to-JSON function.

> I think that Tom's proposal that we +1'd is a pg_node_tree-to-JSON
> SQL function / cast, which is tangentially related to the "nodeToJson
> / changing the storage format of pg_node_tree to json" proposal, but
> not the same.

I agree.
 
> I will add my +1 to Tom's proposal for that function/cast, but I'm not
> sure about changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with any updates.

Thank you!

-Michel

Re: Proposal to use JSON for Postgres Parser format

From: Thomas Munro

On Tue, Sep 20, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
> be seen here, though it's being lexed as "PLAN_SYM", so perhaps the
> author of that C already didn't know that was a general syntax for
> Lisp structs.  (Example: at a Lisp prompt, if you write (defstruct foo
> x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
> printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
> Lisp code, which sadly (for me) was ripped out before even that repo
> IIUC, must have used defstruct-based plans.)

That defstruct guess is confirmed by page 36 and nearby of
https://dsf.berkeley.edu/papers/UCB-MS-zfong.pdf.



Re: Proposal to use JSON for Postgres Parser format

From: Matthias van de Meent

On Mon, 31 Oct 2022 at 15:56, Michel Pelletier
<pelletier.michel@gmail.com> wrote:
> On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>> On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>> On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> On 2022-10-27 Th 19:38, Andres Freund wrote:
>>>>> Hi,
>>>>>
>>>>> On 2022-09-19 22:29:15 -0400, Tom Lane wrote:
>>>>>> Maybe a compromise could be found whereby we provide a conversion function
>>>>>> that converts whatever the catalog storage format is to some JSON
>>>>>> equivalent.  That would address the needs of external code that doesn't want
>>>>>> to write a custom parser, while not tying us directly to JSON.
>>>>> +1
>>>>
>>>> Agreed.
>>>
>>> +1
>>>
>>> Michel, it seems that you now have a green light to implement the
>>> node-to-JSON function.
>>
>> I think that Tom's proposal that we +1'd is a pg_node_tree-to-JSON
>> SQL function / cast, which is tangentially related to the "nodeToJson
>> / changing the storage format of pg_node_tree to json" proposal, but
>> not the same.
>
>
> I agree.
>
>>
>> I will add my +1 to Tom's proposal for that function/cast, but I'm not
>> sure about changing the storage format of pg_node_tree to json.
>
>
> I'm going to spike on this function and will get back to the thread with any updates.

Michel, did you get a result from this spike?

I'm asking because I've spiked most of my ideas on updating the node
text format, and am working on wrapping it up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used in Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation for by-ref types,
so that e.g. Consts holding `name` data don't waste so much space on 0s.
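
To illustrate point 5: a Const holding the `name` datum 'foo'
currently serializes the full NAMEDATALEN buffer, along these lines
(byte values illustrative, elision mine):

    :constvalue 64 [ 102 111 111 0 0 0 ... ]   (today: 64 bytes, mostly 0s)
    :constvalue 64 [ 102 111 111 ]             (with trailing 0s truncated)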

Currently, it reduces the pg_total_relation_size metric of pg_rewrite
after TOAST compression by 35% vs pg16, down to 483328 bytes / 59
pages, from 753664 bytes / 92 pages. The raw size of the ev_action
column's data (that is, before compression) is reduced by 55% to
1.18MB (from 2.80MB), and the largest default shipped row (the
information_schema.columns view) in that table is reduced to 'only'
78kB raw, from 193kB.
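
For reference, these are the kinds of measurements behind those
numbers (approximate, since TOAST details vary):

    SELECT pg_total_relation_size('pg_rewrite');   -- on-disk, incl. TOAST
    SELECT sum(octet_length(ev_action::text))
    FROM pg_rewrite;                               -- raw ev_action text size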

RW performance hasn't been tested yet, so that is still to be determined...

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)