Re: trying again to get incremental backup - Mailing list pgsql-hackers

From Robert Haas
Subject Re: trying again to get incremental backup
Msg-id CA+TgmoZfyTs5JR18omKd+C50rQaXSRKRbDmbqYQ=m-nt1+HZKQ@mail.gmail.com
In response to Re: trying again to get incremental backup  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Robert asked me to work on this quite some time ago, and most of this
> work was done last year.
>
> Here's my WIP for an incremental JSON parser. It works and passes all
> the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
> The reason I haven't posted it before is that it's about 50% slower in
> pure parsing speed than the current recursive descent parser in my
> testing. I've tried various things to make it faster, but haven't made
> much impact. One of my colleagues is going to take a fresh look at it,
> but maybe someone on the list can see where we can save some cycles.
>
> If we can't make it faster, I guess we could use the RD parser for
> non-incremental cases and only use the non-RD parser for incremental,
> although that would be a bit sad. However, I don't think we can make the
> RD parser suitable for incremental parsing - there's too much state
> involved in the call stack.
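
The call-stack problem Andrew describes can be illustrated with a minimal sketch (this is not PostgreSQL's parser, and the class name is invented for illustration): a recursive-descent parser keeps its position implicitly in the call stack, so it cannot be suspended mid-document, whereas an incremental parser keeps all state in an explicit structure that survives between chunks of input.

```python
# Minimal sketch of an incremental, push-based parser: all parse state
# (open containers, in-string flag) lives in the object, not the call
# stack, so feed() can be called with arbitrarily small chunks.
import json

class IncrementalJsonChecker:
    """Tracks JSON nesting across arbitrary chunk boundaries."""
    def __init__(self):
        self.stack = []        # currently open containers: '{' or '['
        self.buf = []          # characters accumulated so far
        self.in_string = False
        self.escaped = False

    def feed(self, chunk):
        """Consume one chunk; may be called any number of times."""
        for ch in chunk:
            self.buf.append(ch)
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == '\\':
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
                continue
            if ch == '"':
                self.in_string = True
            elif ch in '{[':
                self.stack.append(ch)
            elif ch in '}]':
                self.stack.pop()

    def done(self):
        """True once every opened container has been closed."""
        return bool(self.buf) and not self.stack and not self.in_string

# Feed a small document in 7-byte chunks, as a streaming client might.
doc = '{"files": [{"path": "base/1/1259", "size": 16384}]}'
checker = IncrementalJsonChecker()
for i in range(0, len(doc), 7):
    checker.feed(doc[i:i+7])
assert checker.done()
```

A recursive-descent parser would instead be deep inside nested function calls when the input ran out, with no portable way to save and resume that position.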

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.
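
The line-oriented design alluded to above can be sketched as follows (the field names here are invented for illustration and are not the real backup-manifest schema): if each line is a self-contained record, incremental parsing reduces to line splitting, and memory use stays constant per record no matter how many relation files the manifest lists.

```python
# Sketch of a line-per-record manifest (NDJSON-style). Each line parses
# independently with the ordinary non-incremental parser, so no special
# incremental JSON machinery is needed.
import io
import json

manifest = io.StringIO(
    '{"path": "base/1/1259", "size": 16384, "checksum": "abc123"}\n'
    '{"path": "base/1/1260", "size": 8192, "checksum": "def456"}\n'
)

total = 0
for line in manifest:          # one small, bounded parse per line
    record = json.loads(line)
    total += record["size"]
```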

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case, with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.

--
Robert Haas
EDB: http://www.enterprisedb.com