Re: where should I stick that backup? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: where should I stick that backup?
Date
Msg-id CA+TgmoZMsTXBvDFrXKGhTo1DpDOGfTqR-jxp8E9778M7Q6JgxA@mail.gmail.com
In response to Re: where should I stick that backup?  (Magnus Hagander <magnus@hagander.net>)
Responses Re: where should I stick that backup?
Re: where should I stick that backup?
List pgsql-hackers
On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote:
> There are certainly cases for it. It might not be they have to be the
> same connection, but still be the same session, meaning before the
> first time you perform some step of authentication, get a token, and
> then use that for all the files. You'd need somewhere to maintain that
> state, even if it doesn't happen to be a socket. But there are
> definitely plenty of cases where keeping an open socket can be a huge
> performance gain -- especially when it comes to not re-negotiating
> encryption etc.

Hmm, OK.

> For compression and encryption, it could perhaps be as simple as "the
> command has to be pipe on both input and output" and basically send
> the response back to pg_basebackup.
>
> But that won't help if the target is to relocate things...

Right. And, also, it forces things to be sequential in a way I'm not
too happy about. Like, if we have some kind of parallel backup, which
I hope we will, then you can imagine (among other possibilities)
getting files for each tablespace concurrently, and piping them
through the output command concurrently. But if we emit the result in
a tarfile, then it has to be sequential; there's just no other choice.
I think we should try to come up with something that can work in a
multi-threaded environment.

> That is one way to go for it -- and in a case like that, I'd suggest
> the shellscript interface would be an implementation of the other API.
> A number of times through the years I've bounced ideas around for what
> to do with archive_command with different people (never quite to the
> level of "it's time to write a patch"), and it's mostly come down to
> some sort of shlib api where in turn we'd ship a backwards compatible
> implementation that would behave like archive_command. I'd envision
> something similar here.

I agree. Let's imagine that there are a conceptually unlimited number
of "targets" and "filters". Targets and filters accept data via the
same API, but a target is expected to dispose of the data, whereas a
filter is expected to pass it, via that same API, to a subsequent
filter or target. So filters could include things like "gzip", "lz4",
and "encrypt-with-rot13", whereas targets would include things like
"file" (the thing we have today - write my data into some local
files!), "shell" (which writes my data to a shell command, as
originally proposed), and maybe eventually things like "netbackup" and
"s3". Ideally this will all eventually be via a loadable module
interface so that third-party filters and targets can be fully
supported, but perhaps we could consider that an optional feature for
v1. Note that there is quite a bit of work to do here just to
reorganize the code.

I would expect that we would want to provide a flexible way for a
target or filter to be passed options from the pg_basebackup command
line. So one might for example write this:

pg_basebackup --filter='lz4 -9' \
    --filter='encrypt-with-rot13 rotations=2' \
    --target='shell ssh rhaas@depository pgfile create-exclusive - %f.lz4'

The idea is that the first word of the filter or target identifies
which one should be used, and the rest is just options text in
whatever form the provider cares to accept them; but with some
%<character> substitutions allowed, for things like the file name.
(The aforementioned escaping problems for things like filenames with
spaces in them still need to be sorted out, but this is just a sketch,
so while I think it's quite solvable, I am going to refrain from
proposing a precise solution here.)
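
To make the "first word, then options text, with %<character>
substitutions" idea a little more concrete, here is a minimal sketch in
C. Everything in it is an assumption for illustration: the names
BackupSpec, parse_spec and expand_options are made up, only %f (file
name) and %% (a literal percent sign) are recognized, and it makes no
attempt to solve the quoting problem for file names with spaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of splitting "name options..." at the first space. */
typedef struct BackupSpec
{
    char        name[64];       /* provider name, e.g. "lz4" or "shell" */
    const char *options;        /* rest of the string; "" if there is none */
} BackupSpec;

/* Returns 0 on success, -1 if the name is empty or too long. */
static int
parse_spec(const char *spec, BackupSpec *out)
{
    size_t      len;

    assert(spec != NULL && out != NULL);
    len = strcspn(spec, " ");
    if (len == 0 || len >= sizeof(out->name))
        return -1;
    memcpy(out->name, spec, len);
    out->name[len] = '\0';
    out->options = spec + len;
    while (*out->options == ' ')
        out->options++;
    return 0;
}

/*
 * Copy options into buf, replacing %f with filename and %% with %.
 * Returns 0 on success, -1 on an unknown escape or if buf is too small.
 */
static int
expand_options(const char *options, const char *filename,
               char *buf, size_t buflen)
{
    size_t      used = 0;

    if (buflen == 0)
        return -1;
    for (const char *p = options; *p != '\0'; p++)
    {
        const char *piece = p;
        size_t      piecelen = 1;

        if (*p == '%')
        {
            p++;
            if (*p == 'f')
            {
                piece = filename;
                piecelen = strlen(filename);
            }
            else if (*p == '%')
                piece = "%";
            else
                return -1;      /* unknown escape, or '%' at end of string */
        }
        if (used + piecelen >= buflen)
            return -1;
        memcpy(buf + used, piece, piecelen);
        used += piecelen;
    }
    buf[used] = '\0';
    return 0;
}
```

With the --target value from the example above, parse_spec would yield
the name "shell", and expanding the remaining options for a file called
base.tar would give "ssh rhaas@depository pgfile create-exclusive -
base.tar.lz4".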

As to the underlying C API behind this, I propose approximately the
following set of methods:

1. Begin a session. Returns a pointer to a session handle. Gets the
options provided on the command line. In the case of a filter, also
gets a pointer to the session handle for the next filter, or for the
target (which means we set up the final target first, and then stack
the filters on top of it).

2. Begin a file. Gets a session handle and a file name. Returns a
pointer to a file handle.

3. Write data to a file. Gets a file handle, a byte count, and some bytes.

4. End a file. Gets a file handle.

5. End a session. Gets a session handle.
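
To show how those five methods might fit together, here is a rough
sketch of a callback table with one toy target and one toy filter
stacked on top of it. All of the names (BackupSink, BackupSinkOps, the
"capture" target, and so on) are hypothetical and chosen only for this
illustration, the return-value conventions are my own assumption, and
error handling is deliberately minimal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct BackupSink BackupSink;

/* The five proposed methods. All names here are hypothetical. */
typedef struct BackupSinkOps
{
    void *(*begin_session)(const char *options, BackupSink *next);
    void *(*begin_file)(void *session, const char *filename);
    int   (*write_data)(void *file, size_t nbytes, const char *data);
    int   (*end_file)(void *file);
    int   (*end_session)(void *session);
} BackupSinkOps;

struct BackupSink { const BackupSinkOps *ops; void *session; };

/* Toy target: collects bytes in one static buffer so they can be inspected. */
static struct { char data[256]; size_t len; } capture_state;

static void *capture_begin_session(const char *options, BackupSink *next)
{
    (void) options; (void) next;    /* a target has nothing after it */
    capture_state.len = 0;
    return &capture_state;
}
static void *capture_begin_file(void *session, const char *filename)
{
    (void) filename;
    return session;                 /* file handle is just the session */
}
static int capture_write(void *file, size_t nbytes, const char *data)
{
    assert(file == &capture_state);
    if (nbytes > sizeof(capture_state.data) - capture_state.len)
        return -1;
    memcpy(capture_state.data + capture_state.len, data, nbytes);
    capture_state.len += nbytes;
    return 0;
}
static int capture_end_file(void *file) { (void) file; return 0; }
static int capture_end_session(void *session) { (void) session; return 0; }

static const BackupSinkOps capture_ops = {
    capture_begin_session, capture_begin_file, capture_write,
    capture_end_file, capture_end_session
};

/* Toy filter: rot13 each byte and pass it on to the next filter or target. */
typedef struct Rot13File { BackupSink *next; void *next_file; } Rot13File;

static void *rot13_begin_session(const char *options, BackupSink *next)
{
    (void) options;
    return next;                    /* session handle is just the next sink */
}
static void *rot13_begin_file(void *session, const char *filename)
{
    Rot13File *f = malloc(sizeof(*f));

    if (f == NULL)
        return NULL;
    f->next = session;
    f->next_file = f->next->ops->begin_file(f->next->session, filename);
    return f;
}
static int rot13_write(void *file, size_t nbytes, const char *data)
{
    Rot13File *f = file;

    for (size_t i = 0; i < nbytes; i++)   /* byte at a time, for brevity only */
    {
        char c = data[i];

        if ((c >= 'a' && c <= 'm') || (c >= 'A' && c <= 'M'))
            c += 13;
        else if ((c >= 'n' && c <= 'z') || (c >= 'N' && c <= 'Z'))
            c -= 13;
        if (f->next->ops->write_data(f->next_file, 1, &c) != 0)
            return -1;
    }
    return 0;
}
static int rot13_end_file(void *file)
{
    Rot13File *f = file;
    int rc = f->next->ops->end_file(f->next_file);

    free(f);
    return rc;
}
static int rot13_end_session(void *session)
{
    BackupSink *next = session;

    return next->ops->end_session(next->session);
}
static const BackupSinkOps rot13_ops = {
    rot13_begin_session, rot13_begin_file, rot13_write,
    rot13_end_file, rot13_end_session
};
```

The caller would begin the target's session first, then begin the
filter's session with a pointer to the target, and from then on talk
only to the outermost filter. The filter and the target are driven
through exactly the same five calls, which is the point of having them
share one API.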

If we get parallelism at some point, then there could be multiple
files in progress at the same time. Maybe some targets, or even
filters, won't be able to handle that, so we could have a flag
someplace indicating that a particular target or filter isn't
parallelism-capable. As an example, writing output to a bunch of files
in a directory is fine to do in parallel, but if you want the entire
backup in one giant tar file, you need each file sequentially.
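
As a rough illustration of that flag, the check could be as simple as
the sketch below: the whole chain may have several files in progress
only if every filter and the target say they can cope with it. The
StageInfo struct, the parallel_capable field and the
max_concurrent_files function are hypothetical names invented for this
example, not part of any existing code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-provider description, one entry per filter or target. */
typedef struct StageInfo
{
    const char *name;
    bool        parallel_capable;   /* can it have several files in progress? */
} StageInfo;

/*
 * Returns how many files may be in progress at once for this chain:
 * the requested number if every stage is parallelism-capable, else 1.
 */
static int
max_concurrent_files(const StageInfo *stages, size_t nstages,
                     int requested_workers)
{
    assert(stages != NULL || nstages == 0);
    if (requested_workers < 1)
        requested_workers = 1;
    for (size_t i = 0; i < nstages; i++)
    {
        if (!stages[i].parallel_capable)
            return 1;               /* one sequential stage serializes all */
    }
    return requested_workers;
}
```

So a chain of "lz4" into "file" could run with the requested number of
workers, while the same chain ending in a single-tarfile target would
fall back to one file at a time.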

> Is there another language that it would make sense to support in the
> form of "native plugins". Assume we had some generic way to say let
> people write such plugins in python (we can then bikeshed about which
> language we should use). That would give them a much higher level
> language, while also making it possible for a "better" API.

The idea of using Lua has been floated before, and I imagine that an
interface like the above could also be made to have language bindings
for the scripting language of your choice - e.g. Python. However, I
think we should start by trying to square away the C interface and
then anybody who feels motivated can try to put language bindings on
top of it. I tend to feel that's a bit of a fringe feature myself,
since realistically shell commands are about as much as (and
occasionally more than) typical users can manage. However, it would
not surprise me very much if there are power users out there for whom
C is too much but Python or Lua or something is just right, and if
somebody builds something nifty that caters to that audience, I think
that's great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


