From: Robert Haas
Subject: Re: where should I stick that backup?
Msg-id: CA+TgmoZ3Ak5V+b6J8T9UOD-c6Jt0mQ1o0tM3cnY14Y3Zyi4EMQ@mail.gmail.com
In response to: Re: where should I stick that backup? (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote:
> On 2020-04-14 11:38:03 -0400, Robert Haas wrote:
> > I'm fairly deeply uncomfortable with what Andres is proposing. I see
> > that it's very powerful, and can do a lot of things, and that if
> > you're building something that does sophisticated things with storage,
> > you probably want an API like that. It does a great job making
> > complicated things possible. However, I feel that it does a lousy job
> > making simple things simple.
>
> I think it's pretty much exactly the opposite. Your approach seems to
> move all the complexity to the user, who has to build entire
> combinations of commands themselves.  Instead of having one or two
> default commands that do backups in common situations, everyone has
> to assemble them from pieces.

I think we're mostly talking about different things. I was speaking
mostly about the difficulty of developing it. I agree that a project
which is easier to develop is likely to provide fewer benefits to the
end user. On the other hand, it might be more likely to get done, and
projects that don't get done provide few benefits to users. I strongly
believe we need an incremental approach here.

> In general, I think it's good to give expert users the ability to
> customize things like backups and archiving. But defaulting to every
> non-expert user having to do all that expert work (or copying it from
> bad examples) is one of the most user-hostile things in Postgres.

I'm not against adding more built-in compression algorithms, but I
also believe (as I have said several times now) that the world moves a
lot faster than PostgreSQL, which has never added a single new
compression algorithm to pg_basebackup. We had one compression
algorithm in 2011, and we still have that same one algorithm today.
So, either nobody cares, or adding new algorithms is sufficiently
challenging - for either technical or political reasons - that
nobody's managed to get it done. I think having a simple framework in
pg_basebackup for plugging in new algorithms would make it noticeably
simpler to add LZ4 or whatever your favorite compression algorithm is.
And having that framework also be able to use shell commands, so that
users don't have to wait a decade or more for new choices to show up,
is a good idea as well.
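
To be concrete about what I mean by a simple framework, here's a rough
sketch - entirely hypothetical, every name in it is invented, and
nothing like it exists in pg_basebackup today - of the kind of
callback table I have in mind. A built-in "shell" method could
implement these same callbacks by piping through a user-given command:

#include <stddef.h>

/*
 * Hypothetical sketch only: a table of callbacks pg_basebackup could
 * use to select a compression implementation.
 */
typedef struct CompressionMethod
{
    const char *name;           /* "gzip", "lz4", "shell", ... */

    /* set up compression state from user-supplied options */
    void      *(*init) (const char *options);

    /* compress in_len bytes from "in" into "out"; returns bytes written */
    size_t      (*compress) (void *state, const void *in, size_t in_len,
                             void *out, size_t out_cap);

    /* flush any remaining output and release state */
    void        (*finish) (void *state);
} CompressionMethod;

Adding LZ4 would then mean writing one more such table, not threading
a new special case through the whole tool.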

I don't disagree that the situation around things like archive_command
is awful, but a good part of that is that every time somebody shows up
and says "hey, let's try to make a small improvement," between two and
forty people show up and start explaining why it's still going to be
terrible. Eventually the pile of requirements gets so large, and/or
there are enough contradictory opinions, that the person who made the
proposal for how to improve things gives up and leaves. So then we
still have the documentation suggesting "cp". When people - it happens
to be me in this case, but the problem is much more general - show up
and propose improvements to difficult areas, we can and should give
them good advice on how to improve their proposals. But we should not
insist that they have to build something incredibly complex and
grandiose and solve every problem in that area. We should be happy if
we get ANY improvement in a difficult area, not send dozens of angry
emails complaining that their proposal is imperfect.

> I really really don't understand this. Are you suggesting that for
> server-side compression etc. we're going to add the ability to
> specify shell commands as arguments to the base backup command?  That
> seems so obviously a non-starter?  A good default for backup
> configurations should be that the PG user that the backup is done
> under is only allowed to do that, and not that it directly has
> arbitrary remote command execution.

I hadn't really considered that aspect, and that's certainly a
concern. But I also don't understand why you think it's somehow a big
deal. My point is not that clients should have the ability to execute
arbitrary commands on the server. It's that shelling out to an
external binary provided by the operating system is a reasonable thing
to do, rather than requiring everything to be done by binaries that we
create. Which I think is what you are also saying right here:

> But the tool speaking the protocol can just allow piping through
> whatever tool?  Given that there likely are benefits to doing things
> either on the client side or on the server side, it seems inevitable
> that there are multiple places where it would make sense to have the
> capability?

Unless I am misunderstanding you, this is exactly what I was
proposing, and have been proposing since the first email on the
thread.
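
To make that concrete, here's a sketch - mine, not anything from this
thread - of the sort of thing I mean: a client-side tool handing its
data stream to an arbitrary external filter, using nothing more exotic
than popen():

#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: feed a backup data stream to a user-supplied filter
 * command.  The filter's stdout goes wherever the shell sends it.
 */
static void
pipe_through_filter(FILE *backup_data, const char *filter_cmd)
{
    char    buf[8192];
    size_t  n;
    FILE   *filter = popen(filter_cmd, "w");

    if (filter == NULL)
    {
        perror("popen");
        exit(1);
    }
    while ((n = fread(buf, 1, sizeof(buf), backup_data)) > 0)
    {
        if (fwrite(buf, 1, n, filter) != n)
        {
            fprintf(stderr, "write to filter failed\n");
            break;
        }
    }
    if (pclose(filter) != 0)
        fprintf(stderr, "filter command exited with an error\n");
}

Called as, say, pipe_through_filter(stdin, "pigz -p 8 > base.tar.gz"),
that gets you parallel compression without the tool itself knowing
anything about pigz.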

> > It's possibly not the exact same thing. A special tool might, for
> > example, use multiple threads for parallel compression rather than
> > multiple processes, perhaps gaining a bit of efficiency. But it's
> > doubtful whether all users care about such marginal improvements.
>
> Marginal improvements? Compression scales decently well with the
> number of cores.  pg_basebackup's compression is useless because it's
> so slow (and because it's client-side, but that's IME the lesser
> issue).  I feel I must be misunderstanding what you mean here.
>
> gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The
> latter will still sometimes be a bottleneck (it's a bottleneck in
> pigz, not in available compression cycles), but a lot less commonly
> than at 180MB/s.

That's really, really, really not what I was talking about.

I'm quite puzzled by your reading of this email. You seem to have
missed my point entirely. I don't know whether that's because I did a
poor job writing it or because you didn't read it carefully enough or
what. What I'm saying is: I don't immediately wish to undertake the
problem of building a new wire protocol that the client and server can
use to talk to external binaries. I would prefer to start with a C
API, because I think it will be far less work and still able to meet a
number of important needs. The new wire protocol that can be used to
talk to external binaries can be added later.
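
This thread hasn't spelled out what that C API would look like, so
purely as a hypothetical illustration - every name below is invented -
it could be a small set of callbacks that a backup target implements:

#include <stddef.h>

/*
 * Hypothetical sketch of a backup-target C API; nothing like this
 * exists at the time of this thread.
 */
typedef struct BackupTarget
{
    /* called once, before any data is sent; returns private state */
    void   *(*begin_backup) (const char *target_options);

    /* called repeatedly with successive chunks of archive data */
    void    (*write_data) (void *state, const char *archive_name,
                           const void *data, size_t len);

    /* called once, after all data has been sent */
    void    (*end_backup) (void *state);
} BackupTarget;

A future mechanism that talks to external binaries over the wire could
then be just one more implementation of those same callbacks.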

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


