Re: where should I stick that backup? - Mailing list pgsql-hackers

From: Magnus Hagander
Subject: Re: where should I stick that backup?
Msg-id: CABUevEya=Ap22FyeruPunUtX3=QUVTxVLzfQhGT_YnXgBp5=ug@mail.gmail.com
In response to: Re: where should I stick that backup? (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: where should I stick that backup?
List: pgsql-hackers


On Sat, Apr 11, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
>> Wouldn't there be state like a S3/ssh/https/... connection? And perhaps
>> a 'backup_id' in the backup metadata DB that one would want to update
>> at the end?

> Good question. I don't know that there would be but, uh, maybe? It's
> not obvious to me why all of that would need to be done using the same
> connection, but if it is, the idea I proposed isn't going to work very
> nicely.

There are certainly cases for it. It might not be that they have to be the same connection, but they may still have to be the same session: before the first file you perform some authentication step and get a token, and then you use that token for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket is a huge performance gain -- especially when it comes to not re-negotiating encryption for every file.
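As a concrete (entirely made-up) sketch of that "same session, not necessarily the same connection" pattern -- authenticate once, then reuse both the token and the pooled connection for every file; the endpoint and auth flow here are invented:

import requests

# Authenticate once; reuse the token and the pooled TCP/TLS
# connection for every file in the backup.
session = requests.Session()
resp = session.post("https://backup.example/auth", json={"user": "backup"})
session.headers["Authorization"] = f"Bearer {resp.json()['token']}"

for name in ["base.tar", "16384.tar", "pg_wal.tar"]:
    with open(name, "rb") as f:
        # Each PUT rides on the already-negotiated TLS session,
        # so there is no per-file handshake.
        session.put(f"https://backup.example/files/{name}", data=f)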


> More generally, can you think of any ideas for how to structure an API
> here that are easier to use than "write some C code"? Or do you think
> we should tell people to write some C code if they want to
> compress/encrypt/relocate their backup in some non-standard way?

For compression and encryption, it could perhaps be as simple as "the command has to be a pipe on both input and output": pg_basebackup writes the stream to the command's stdin and reads the result back from its stdout.
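Just as a sketch, a filter in that style could be as small as this (Python and gzip chosen arbitrarily; anything that streams stdin to stdout would do):

import gzip
import sys

# Read the backup stream from stdin, write the compressed stream to
# stdout for pg_basebackup (or the next command) to consume.
with gzip.GzipFile(fileobj=sys.stdout.buffer, mode="wb") as out:
    while True:
        chunk = sys.stdin.buffer.read(64 * 1024)
        if not chunk:
            break
        out.write(chunk)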

But that won't help if the target is to relocate things...



> For the record, I'm not against eventually having more than one way to
> do this, maybe a shell-script interface for simpler things and some
> kind of API for more complex needs (e.g. NetBackup integration,
> perhaps). And I did wonder if there was some other way we could do
> this. For instance, we could add an option --tar-everything that
> sticks all the things that would have been returned by the backup
> inside another level of tar file and sends the result to stdout. Then
> you can pipe it into a single command that gets invoked only once for
> all the data, rather than once per tablespace. That might be better,
> but I'm not sure it's better. It's better if you want to do
> complicated things that involve steps that happen before and after and
> persistent connections and so on, but it seems worse for simple things
> like piping through a non-default compressor.
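(For concreteness, the single command at the receiving end of such a --tar-everything stream might look like the sketch below; the option itself is only the idea proposed above, not something pg_basebackup has today.)

import sys
import tarfile

# Consume one outer tar on stdin; each member is itself a tar
# (base.tar, one per tablespace, pg_wal.tar) that can be relocated,
# compressed or encrypted from here.
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|") as outer:
    for member in outer:
        inner = outer.extractfile(member)
        # ... hand `inner` to a compressor/uploader here ...
        print(member.name, member.size, file=sys.stderr)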


That is one way to go about it -- and in a case like that, I'd suggest the shell-script interface be an implementation of the other API. A number of times over the years I've bounced ideas around with different people for what to do with archive_command (never quite to the level of "it's time to write a patch"), and it has mostly come down to some sort of shlib API, for which we'd in turn ship a backwards-compatible implementation that behaves like archive_command. I'd envision something similar here.
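To illustrate the backwards-compatible part -- in Python for brevity and with invented names, where the real thing would presumably be a C shared library:

import subprocess

class ArchiveCommandShim:
    """Hypothetical implementation of a new archiving API that just
    runs an old-style archive_command string via the shell."""

    def __init__(self, command):
        # e.g. 'test ! -f /archive/%f && cp %p /archive/%f'
        self.command = command

    def archive_file(self, path, fname):
        cmd = self.command.replace("%p", path).replace("%f", fname)
        # archive_command contract: exit status zero means success
        return subprocess.run(cmd, shell=True).returncode == 0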



> Larry Wall somewhat famously commented that a good programming
> language should (and I paraphrase) make simple things simple and
> complex things possible. My hesitation in going straight to a C API is
> that it does not make simple things simple; and I'd like to be really
> sure that there is no way of achieving that valuable goal before we
> give up on it. However, there is no doubt that a C API is potentially
> more powerful.


Is there another language that it would make sense to support in the form of "native plugins"? Assume we had some generic way to let people write such plugins in Python (we can then bikeshed about which language we should use). That would give them a much higher-level language, while also making a "better" API possible.

Note that I'm not suggesting supporting a Python script running as a regular script -- that could easily be done by anybody writing a shell-script implementation. It would be an actual API where the postgres tool would instantiate the Python interpreter in-process and create an object there. This would allow things like keeping state across calls, and would also give access to the language's extensive library ecosystem (e.g. you could directly import an S3-compatible library to upload files).
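Something along these lines, purely as a sketch (all names invented, with boto3 standing in for "an S3 compatible library"):

import boto3

class BackupTarget:
    """Hypothetical object that pg_basebackup would instantiate once
    in the embedded interpreter, keeping state across calls."""

    def __init__(self, bucket):
        self.bucket = bucket
        # Credentials are resolved and the client built exactly once,
        # then reused for every file in the backup.
        self.s3 = boto3.client("s3")

    def store_file(self, name, fileobj):
        # Stream the file-like object to s3://<bucket>/<name>.
        self.s3.upload_fileobj(fileobj, self.bucket, name)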

Doing that for just pg_basebackup would probably be overkill, but it might be a generic choice that could extend to other things as well.
 
