Re: where should I stick that backup? - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: where should I stick that backup?
Msg-id: CA+TgmoZyU0tDAG30SzNwpGkhtXYsjenAoYt6ubT=3d3matUMGg@mail.gmail.com
In response to: Re: where should I stick that backup? (Andres Freund <andres@anarazel.de>)
Responses: Re: where should I stick that backup?
List: pgsql-hackers
On Wed, Apr 15, 2020 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
> I guess what I perceived to be the fundamental difference, before this
> email, between our positions is that I (still) think that exposing
> detailed postprocessing shell fragment style arguments to pg_basebackup,
> especially as the only option to use the new capabilities, will nail us
> into a corner - but you don't necessarily think so? Where I had/have no
> problems with implementing features by *internally* piping through
> external binaries, as long as the user doesn't have to always specify
> them.

My principal concern is actually around having a C API and a flexible command-line interface. If we rearrange the code and the pg_basebackup command-line syntax so that it's easy to add new "filters" and "targets", then I think that's a very good step forward. It's of less concern to me whether those "filters" and "targets" are (1) C code that we ship as part of pg_basebackup, (2) C code by extension authors that we dynamically load into pg_basebackup, (3) off-the-shelf external programs that we invoke, or (4) special external programs that we provide which do special magic. However, of those options, I like #4 least, because it seems like a pain in the tail to implement. It may turn out to be the most powerful and flexible, though I'm not completely sure about that yet.

As to exactly how far we can get with #3, I think it depends a good deal on the answer to this question you pose in a footnote:

> [1] I am not sure, nor the opposite, that piping is a great idea medium
> term. One concern is that IIRC windows pipe performance is not great,
> and that there's some other portability problems as well. I think
> there's also valid concerns about per-file overhead, which might be a
> problem for some future uses.

If piping stuff through shell commands performs well for use cases like compression, then I think we can get pretty far with that approach. It means we can use any compression at all with no build-time dependency on that compressor. People can install anything they want, stick it in $PATH, and away they go. I see no particular reason to dislike that kind of thing; in fact, I think it offers many compelling advantages. On the other hand, if we really need to interact directly with the library to get decent performance, because, say, pipes are too slow, then the approach of piping things through arbitrary shell commands is a lot less exciting.

Even then, though, I wonder how many runtime dependencies we're seriously willing to add. I imagine we can add one or two more compression algorithms without giving everybody fits, even if it means adding optional build-time and run-time dependencies on some external libraries. Any more than that is likely to provoke a backlash. And I doubt whether we're willing to have the PostgreSQL operating system package depend on something like libgcrypt at all; I would expect such a proposal to meet with vigorous objections. But without such a dependency, how would we realistically get encrypted backups except by piping through a shell command? I don't really see a way, and letting a user specify a shell fragment to define what happens there seems pretty reasonable to me.
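To make the pipe idea concrete, here is a rough sketch; zstd and gpg are just arbitrary stand-ins for whatever tools happen to be installed, and it relies on today's tar-format output written to stdout, which means WAL has to be fetched rather than streamed:

    # plain tar stream on stdout, compressed and encrypted by whatever
    # tools are on $PATH (zstd and gpg are arbitrary examples)
    pg_basebackup -D - -Ft -X fetch \
        | zstd \
        | gpg --symmetric --cipher-algo AES256 -o base.tar.zst.gpg

Neither zstd nor gpg needs any build-time or run-time support from PostgreSQL there; the shell-fragment idea is essentially about letting users push that kind of pipeline inside pg_basebackup itself.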
I'm also not very sure to what extent we can assume, with either compression or encryption, that one size fits all. If there are six popular compression libraries and four popular encryption libraries, does anyone really believe that it's going to be OK for 'yum install postgresql-server' to suck in all of those things? Or, even if that were OK or if we could somehow avoid it, what are the chances that we'd actually go to the trouble of building interfaces to all of those things? I'd rate them as slim to none; we suck at that sort of thing. Exhibit A: the work to make PostgreSQL support more than one SSL library.

I'm becoming fairly uncertain as to how far we can get with shell commands; some of the concerns raised about, for example, connection management when talking to stuff like S3 are very worrying. At the same time, I think we need to think pretty seriously about some of the upsides of shell commands. The average user cannot write a C library that implements an API. The average user cannot write a C binary that speaks a novel, PostgreSQL-specific protocol. Even the above-average user who is capable of doing those things probably won't have the time to actually do it. So if the thing you have to do to make PostgreSQL talk to the new sljgsjl compressor is either of those things, then we will not have sljgsjl compression support for probably a decade after it becomes the gold standard that everyone else in the industry is using.

If what you have to do is 'yum install sljgsjl' and then pg_basebackup --client-filter='shell sljgsjl', people can start using it as soon as their favorite distro packages it, without anyone who reads this mailing list needing to do any work whatsoever. If what you have to do is create a 'sljgsjl.json' file in some PostgreSQL install directory that describes the salient properties of this compressor, and then after that you can say pg_basebackup --client-filter=sljgsjl, that's also accessible to a broad swath of users. Now, it may be that there's no practical way to make things that easy. But, to the extent that we can, I think we should. The ability to integrate new technology without action by PostgreSQL core developers is not the only consideration here, but it's definitely a good thing to have insofar as we reasonably can.

> But I don't think it makes sense to design a C API without a rough
> picture of how things should eventually look like. If we were, e.g.,
> eventually going to do all the work of compressing and transferring data
> in one external binary, then a C API exposing transformations in
> pg_basebackup doesn't necessarily make sense. If it turns out that
> pipes are too inefficient on windows to implement compression filters,
> that we need parallel awareness in the API, etc it'll influence the API.

Yeah. I think we really need to understand the performance characteristics of pipes better. If they're slow, then anything that needs to be fast has to work some other way (but we could still provide a pipe-based slow way for niche uses).

> > That's really, really, really not what I was talking about.
>
> What did you mean with the "marginal improvements" paragraph above?

I was talking about running one compressor process with multiple compression threads each reading from a separate pipe, vs. running multiple processes each with a single thread doing the same thing.
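As a crude first data point on the pipe question, something along these lines gives an upper bound for raw pipe throughput on a given platform (GNU dd syntax; the Windows case, which is the one we're actually worried about, would need a different harness):

    # push ~4GB of zeroes across two pipes; the trailing dd reports the rate
    dd if=/dev/zero bs=1M count=4096 | cat | dd of=/dev/null bs=1M

The cat in the middle is there so the data crosses two pipe boundaries, which is a bit closer to what a filter chain would do. A real test would obviously need to involve actual compressors, but it would at least tell us whether the plumbing itself is a bottleneck.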
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company