Re: Proposal to use JSON for Postgres Parser format - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: Proposal to use JSON for Postgres Parser format |
Date | |
Msg-id | CAEze2WhWoDv+nwSRyN_D6tNwnhxC5Uq0yyip=1vPEVPaqkANTg@mail.gmail.com Whole thread Raw |
In response to | Re: Proposal to use JSON for Postgres Parser format (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: Proposal to use JSON for Postgres Parser format
|
List | pgsql-hackers |
On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Geoghegan <pg@bowt.ie> writes: > > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> Our existing format is certainly not great on those metrics, but > > >> I do not see how "let's use JSON!" is a route to improvement. > > > > > The existing format was designed with developer convenience as a goal, > > > though -- despite my complaints, and in spite of your objections. > > > > As Munro adduces nearby, it'd be a stretch to conclude that the current > > format was designed with any Postgres-related goals in mind at all. > > I think he's right that it's a variant of some Lisp-y dump format that's > > probably far hoarier than even Berkeley Postgres. > > > > > If it didn't have to be easy (or even practical) for developers to > > > directly work with the output format, then presumably the format used > > > internally could be replaced with something lower level and faster. So > > > it seems like the two goals (developer ergonomics and faster > > > interchange format for users) might actually be complementary. > > > > I think the principal mistake in what we have now is that the storage > > format is identical to the "developer friendly" text format (plus or > > minus some whitespace). First we need to separate those. We could > > have more than one equivalent text format perhaps, and I don't have > > any strong objection to basing the text format (or one of them) on > > JSON. > > +1 for considering storage format and text format separately. > > Let's consider what our criteria could be for the storage format. > > 1) Storage effectiveness (shorter is better) and > serialization/deserialization effectiveness (faster is better). On > this criterion, the custom binary format looks perfect. > 2) Robustness in the case of corruption. It seems much easier to > detect the data corruption and possibly make some partial manual > recovery for textual format. > 3) Standartness. It's better to use something known worldwide or at > least used in other parts of PostgreSQL than something completely > custom. From this perspective, JSON/JSONB is better than custom > things. Allow me to add: compressability In the thread surrounding [0] there were complaints about the size of catalogs, and specifically the template database. Significant parts of that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which consists mostly of serialized Nodes. If we're going to replace our current NodeToText infrastructure, we'd better know we can effectively compress this data. In that same thread, I also suggested that we could try to not emit a Node's fields if they contain their default values while serializing; such as the common `:location -1` or `:mynodefield <>`. Those fields still take up space in the format, while conveying no interesting information (the absense of that field in the struct definition would convey the same). It would be useful if this new serialized format would allow us to do similar tricks cheaply. As for JSON vs JSONB for storage: I'm fairly certain that JSONB is less compact than JSON (without taking compression into the picture) due to the 4-byte guaranteed overhead for each jsonb element; while for JSON that is only 2 bytes for each (up to 3 when you consider separators, plus potential extra overhead for escaped values that are unlikely to appear our catalogs). Some numbers can be stored more efficiently in JSONB, but only large numbers and small fractions that we're unlikely to hit in system views: a back-of-the-envelope calculation puts the cutoff point of efficient storage between strings-of-decimals and Numeric at >10^12, < -10^11, or very precise fractional values. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com
pgsql-hackers by date: