Re: where should I stick that backup? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: where should I stick that backup?
Date
Msg-id 20200415015008.t5jmsoaanve2gavg@alap3.anarazel.de
In response to Re: where should I stick that backup?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2020-04-14 11:38:03 -0400, Robert Haas wrote:
> I'm fairly deeply uncomfortable with what Andres is proposing. I see
> that it's very powerful, and can do a lot of things, and that if
> you're building something that does sophisticated things with storage,
> you probably want an API like that. It does a great job making
> complicated things possible. However, I feel that it does a lousy job
> making simple things simple.

I think it's pretty much exactly the opposite. Your approach seems to
move all the complexity to the user, who has to build entire
combinations of commands themselves.  Instead of having one or two
default commands that do backups in common situations, everyone has to
assemble them from pieces.
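
To make the contrast concrete: the "default command" side is roughly
what already exists today, e.g.

    pg_basebackup -D /backup/base -Ft -z

one invocation that produces a compressed tar, versus every
installation assembling its own pipeline of compression and storage
commands (a sketch of that further below).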

Moved from later in your email, since it seems to make more sense to
have it here:
> All they're going to see is that they can use gzip and maybe lz4
> because we provide the necessary special magic tools to integrate with
> those, but for some reason we don't have a special magic tool that
> they can use with their own favorite compressor, and so they can't use
> it. I think people are going to find that fairly unhelpful.

I have no problem with providing people with the opportunity to use
their personal favorite compressor, but forcing them to do that, and to
ensure it's installed etc., strikes me as a spectacularly bad default
situation. Most people don't have the time to research which
compression algorithms work best for which precise situation.

What do you imagine a default scripted invocation of the new backup
stuff looking like?  Having to specify multiple commandline "fragments"
for compression, storing files, ...  can't be what we want the common
case to look like. It'll just again lead to everyone copying & pasting
examples that are all wrong in different ways. They won't work at all
across platforms (or often not across OS versions).
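
For illustration only, a hypothetical fragment-style invocation
(neither --compress-cmd nor --store-cmd exist, they're stand-ins for
the kind of thing users would end up copy & pasting):

    pg_basebackup -Ft \
        --compress-cmd 'pigz -p $numcores' \
        --store-cmd 'aws s3 cp - s3://mybucket/base.tar.gz'

Nearly every piece of that - pigz, the aws CLI, $numcores, the
quoting - can differ or be missing on another platform.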


In general, I think it's good to give expert users the ability to
customize things like backups and archiving. But defaulting to every
non-expert user having to do all that expert work (or copying it from
bad examples) is one of the most user-hostile things in postgres.


> Also, I don't really see what's wrong with the server forking
> processes that exec("/usr/bin/lz4") or whatever. We do similar things
> in other places and, while it won't work for cases where you want to
> compress a shazillion files, that's not really a problem here anyway.
> At least at the moment, the server-side format is *always* tar, so the
> problem of needing a separate subprocess for every file in the data
> directory does not arise.

I really, really don't understand this. Are you suggesting that for
server-side compression etc. we're going to add the ability to specify
shell commands as arguments to the base backup command?  That seems so
obviously a non-starter?  A good default for backup configurations
should be that the PG user that the backup is done under is only allowed
to do that, and not that it directly has arbitrary remote command
execution.
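
To spell out what I mean, something like the following over the
replication protocol (COMPRESSION_PROGRAM is made up, purely to
illustrate the shape of the thing):

    BASE_BACKUP LABEL 'nightly' COMPRESSION_PROGRAM '/usr/bin/lz4 -c'

would let any role with REPLICATION run arbitrary programs as the
server's OS user.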


> Suppose you want to compress using your favorite compression
> program. Well, you can't. Your favorite compression program doesn't
> speak the bespoke PostgreSQL protocol required for backup
> plugins.  Neither does your favorite encryption program. Either would
> be perfectly happy to accept a tarfile on stdin and dump out a
> compressed or encrypted version, as the case may be, on stdout, but
> sorry, no such luck. You need a special program that speaks the magic
> PostgreSQL protocol but otherwise does pretty much the exact same
> thing as the standard one.

But the tool speaking the protocol can just allow piping through
whatever tool?  Given that there are likely benefits to doing things
either on the client side or on the server side, it seems inevitable
that there are multiple places where it would make sense to have that
capability?
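
As a sketch, with a hypothetical --filter option on whatever tool ends
up speaking the protocol:

    pg_basebackup -D /backup -Ft --filter 'zstd -T0'   # --filter doesn't exist, illustration only

i.e. speaking the bespoke protocol doesn't preclude handing the stream
to an arbitrary external program through a pipe.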


> It's possibly not the exact same thing. A special tool might, for example,
> use multiple threads for parallel compression rather than multiple
> processes, perhaps gaining a bit of efficiency. But it's doubtful
> whether all users care about such marginal improvements.

Marginal improvements? Compression scales decently well with the number
of cores.  pg_basebackup's compression is useless because it's so slow
(and because it's client-side, but that's IME the lesser issue).  I feel
I must be misunderstanding what you mean here.

gzip vs. pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The
latter will still sometimes be a bottleneck (it's a bottleneck in pigz,
not in available compression cycles), but a lot less commonly than at
180MB/s.
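
Parallel compression is easy to get on the client side today with
something along these lines (assumes a single tablespace, so the tar
can go to stdout):

    pg_basebackup -D - -Ft -X fetch | pigz -p $numcores > /backup/base.tar.gz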


Greetings,

Andres Freund


