Re: where should I stick that backup? - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: where should I stick that backup?
Date:
Msg-id: CA+TgmoZMsTXBvDFrXKGhTo1DpDOGfTqR-jxp8E9778M7Q6JgxA@mail.gmail.com
In response to: Re: where should I stick that backup? (Magnus Hagander <magnus@hagander.net>)
Responses: Re: where should I stick that backup?
           Re: where should I stick that backup?
List: pgsql-hackers
On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote:
> There are certainly cases for it. It might not be they have to be the same connection, but still be the same session, meaning before the first time you perform some step of authentication, get a token, and then use that for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc.

Hmm, OK.

> For compression and encryption, it could perhaps be as simple as "the command has to be pipe on both input and output" and basically send the response back to pg_basebackup.
>
> But that won't help if the target is to relocate things...

Right. And, also, it forces things to be sequential in a way I'm not too happy about. Like, if we have some kind of parallel backup, which I hope we will, then you can imagine (among other possibilities) getting files for each tablespace concurrently, and piping them through the output command concurrently. But if we emit the result in a tarfile, then it has to be sequential; there's just no other choice. I think we should try to come up with something that can work in a multi-threaded environment.

> That is one way to go for it -- and in a case like that, I'd suggest the shell script interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here.

I agree. Let's imagine that there are a conceptually unlimited number of "targets" and "filters".
Targets and filters accept data via the same API, but a target is expected to dispose of the data, whereas a filter is expected to pass it, via that same API, to a subsequent filter or target. So filters could include things like "gzip", "lz4", and "encrypt-with-rot13", whereas targets would include things like "file" (the thing we have today - write my data into some local files!), "shell" (which writes my data to a shell command, as originally proposed), and maybe eventually things like "netbackup" and "s3". Ideally this will all eventually be via a loadable module interface so that third-party filters and targets can be fully supported, but perhaps we could consider that an optional feature for v1. Note that there is quite a bit of work to do here just to reorganize the code.

I would expect that we would want to provide a flexible way for a target or filter to be passed options from the pg_basebackup command line. So one might for example write this:

    pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 rotations=2' --target='shell ssh rhaas@depository pgfile create-exclusive - %f.lz4'

The idea is that the first word of the filter or target identifies which one should be used, and the rest is just options text in whatever form the provider cares to accept them; but with some %<character> substitutions allowed, for things like the file name. (The aforementioned escaping problems for things like filenames with spaces in them still need to be sorted out, but this is just a sketch, so while I think it's quite solvable, I am going to refrain from proposing a precise solution here.)

As to the underlying C API behind this, I propose approximately the following set of methods:

1. Begin a session. Returns a pointer to a session handle. Gets the options provided on the command line.
In the case of a filter, also gets a pointer to the session handle for the next filter, or for the target (which means we set up the final target first, and then stack the filters on top of it).

2. Begin a file. Gets a session handle and a file name. Returns a pointer to a file handle.

3. Write data to a file. Gets a file handle, a byte count, and some bytes.

4. End a file. Gets a file handle.

5. End a session. Gets a session handle.

If we get parallelism at some point, then there could be multiple files in progress at the same time. Maybe some targets, or even filters, won't be able to handle that, so we could have a flag someplace indicating that a particular target or filter isn't parallelism-capable. As an example, writing output to a bunch of files in a directory is fine to do in parallel, but if you want the entire backup in one giant tar file, you need each file sequentially.

> Is there another language that it would make sense to support in the form of "native plugins"? Assume we had some generic way to say let people write such plugins in python (we can then bikeshed about which language we should use). That would give them a much higher level language, while also making it possible for a "better" API.

The idea of using LUA has been floated before, and I imagine that an interface like the above could also be made to have language bindings for the scripting language of your choice - e.g. Python. However, I think we should start by trying to square away the C interface and then anybody who feels motivated can try to put language bindings on top of it. I tend to feel that's a bit of a fringe feature myself, since realistically shell commands are about as much as (and occasionally more than) typical users can manage. However, it would not surprise me very much if there are power users out there for whom C is too much but Python or LUA or something is just right, and if somebody builds something nifty that caters to that audience, I think that's great.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company