Re: where should I stick that backup? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: where should I stick that backup?
Date
Msg-id 20200412223713.33knundjcxekyefb@alap3.anarazel.de
In response to Re: where should I stick that backup?  (David Steele <david@pgmasters.net>)
Responses Re: where should I stick that backup?
List pgsql-hackers
Hi,

On 2020-04-12 17:57:05 -0400, David Steele wrote:
> On 4/12/20 3:17 PM, Andres Freund wrote:
> > [proposal outline]
>
> This is pretty much what pgBackRest does. We call them "local" processes and
> they do most of the work during backup/restore/archive-get/archive-push.

Hah. I swear, I didn't look.


> > The obvious problem with that proposal is that we don't want to
> > unnecessarily store the incoming data on the system pg_basebackup is
> > running on, just for the subcommand to get access to them. More on that
> > in a second.
> 
> We also implement "remote" processes so the local processes can get data
> that doesn't happen to be local, i.e. on a remote PostgreSQL cluster.

What is the interface between those? I.e. do the files have to be
spooled as a whole locally?


> > There's various ways we could address the issue for how the subcommand
> > can access the file data. The most flexible probably would be to rely on
> > exchanging file descriptors between basebackup and the subprocess (these
> > days all supported platforms have that, I think).  Alternatively we
> > could invoke the subcommand before really starting the backup, and ask
> > how many files it'd like to receive in parallel, and restart the
> > subcommand with that number of file descriptors open.
> 
> We don't exchange FDs. Each local is responsible for getting the data from
> PostgreSQL or the repo based on knowing the data source and a path. For
> pg_basebackup, however, I'd imagine each local would want a replication
> connection with the ability to request specific files that were passed to it
> by the main process.

I don't like this much. It pushes more complexity into each of the
"targets", and we can't easily share that complexity. Also, needing to
request individual files adds a lot of back and forth, and thus
latency. The server would always have to pre-send a list of files, we'd
have to deal with those files vanishing, etc.


> > [2] yes, I already hear json. A line delimited format would have some
> > advantages though.
> 
> We use JSON, but each protocol request/response is linefeed-delimited. So
> for example here's what it looks like when the main process requests a local
> process to backup a specific file:
> 
> {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]}
> 
> And the local responds with:
> 
> {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]}

As long as it's line delimited, I don't really care :)
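
(To spell out what I mean by line delimited: on the driver side the
whole framing can be as dumb as "write one JSON request line, block for
one JSON response line". A minimal sketch, with made-up names; parsing
the JSON payload itself is a separate concern and would use a proper
parser:)

#include <stdio.h>
#include <string.h>

/*
 * Illustrative sketch of newline-delimited framing: send one JSON
 * request per line to the worker and wait for exactly one response line.
 */
static int
send_request(FILE *to_worker, FILE *from_worker,
			 const char *request_json,
			 char *response, size_t resplen)
{
	/* one request per line */
	if (fprintf(to_worker, "%s\n", request_json) < 0 ||
		fflush(to_worker) != 0)
		return -1;

	/* block until the worker answers with exactly one line */
	if (fgets(response, (int) resplen, from_worker) == NULL)
		return -1;

	/* strip the terminating newline, leaving just the JSON text */
	response[strcspn(response, "\n")] = '\0';
	return 0;
}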


> We are considering a move to HTTP since lots of services (e.g. S3, GCS,
> Azure, etc.) require it (so we implement it) and we're not sure it makes
> sense to maintain our own protocol format. That said, we'd still prefer to
> use JSON for our payloads (like GCS) rather than XML (as S3 does).

I'm not quite sure what you mean here. Do you mean an actual request for
each of what are currently lines? If so, that sounds *terrible*.

Greetings,

Andres Freund


