Re: where should I stick that backup? - Mailing list pgsql-hackers

From: Stephen Frost
Subject: Re: where should I stick that backup?
Msg-id: 20200413011828.GR13712@tamriel.snowman.net
In response to: Re: where should I stick that backup?  (Bruce Momjian <bruce@momjian.us>)
Responses: Re: where should I stick that backup?  (Bruce Momjian <bruce@momjian.us>)
           Re: where should I stick that backup?  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Greetings,

Answering both in one since they're largely the same.

* Bruce Momjian (bruce@momjian.us) wrote:
> On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > > Good point, but if there are multiple APIs, it makes shell script
> > > > flexibility even more useful.
> > >
> > > This is really the key point for me. There are so many existing tools
> > > that store a file someplace that we really can't ever hope to support
> > > them all in core, or even to have well-written extensions that support
> > > them all available on PGXN or wherever. We need to integrate with the
> > > tools that other people have created, not try to reinvent them all in
> > > PostgreSQL.
> >
> > So, this goes to what I was just mentioning to Bruce independently- you
> > could have made the same argument about FDWs, but it just doesn't
> > actually hold any water.  Sure, some of the FDWs aren't great, but
> > there's certainly no shortage of them, and the ones that are
> > particularly important (like postgres_fdw) are well written and in core.
>
> No, no one made that argument.  It isn't clear how a shell script API
> would map to relational database queries.  The point is how well the
> APIs match, and then if they are close, does it give us the flexibility
> we need.  You can't just look at flexibility without an API match.

If what we're talking about is file_fdw, which certainly isn't very
complicated, it's not hard to see how you could use shell scripts for
it.  It starts to get harder, and to require custom code, once you want
to do something more complex, which is very nearly what we're talking
about here too.  Sure, for a simple 'bzip2' filter a shell script might
be alright, but it's not going to cut it for the more complex use-cases
that users today expect solutions for.
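
To make that concrete: the simple case really is simple.  A 'bzip2
filter' target is roughly the sketch below, assuming the proposed option
hands the tar stream to the command on stdin (the output-file argument
and the plumbing are hypothetical):

    # bzip2_target.py: hypothetical command for the proposed shell
    # interface; reads a tar stream on stdin, writes a .tar.bz2 file.
    import bz2
    import shutil
    import sys

    with bz2.open(sys.argv[1], "wb") as out:
        # Copy the backup stream straight into the compressed file.
        shutil.copyfileobj(sys.stdin.buffer, out)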

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost <sfrost@snowman.net> wrote:
> > So, this goes to what I was just mentioning to Bruce independently- you
> > could have made the same argument about FDWs, but it just doesn't
> > actually hold any water.  Sure, some of the FDWs aren't great, but
> > there's certainly no shortage of them, and the ones that are
> > particularly important (like postgres_fdw) are well written and in core.
>
> That's a fairly different use case. In the case of the FDW interface:

There are two different questions we're talking about here, and I feel
like they're being conflated.  To try to clarify:

- Could you implement FDWs with shell scripts, and custom programs?  I'm
  pretty confident that the answer is yes, but the thrust of that
  argument is primarily to show that you *can* implement just about
  anything using a shell script "API"; merely saying it's possible
  doesn't make it a good solution.  The FDW system is complicated, and
  also good, because we made it so and because it's possible to do more
  sophisticated things with a C API, but it could have started out with
  shell scripts that just returned data in much the same way that COPY
  PROGRAM works today (see the sketch after this list).  What matters is
  the forward thinking to consider what you're going to want to do
  tomorrow, not just thinking about how you can solve the simple cases
  today with a shell out to an existing command.

- Does providing a C-library interface deter people from implementing
  solutions that use that interface?  Perhaps it does, but it doesn't
  have nearly the dampening effect that is being portrayed here, and we
  can see that pretty clearly from the FDW situation.  Sure, not all of
  those are good solutions, but lots and lots of archive command shell
  scripts are also pretty terrible, and there *are* a few good solutions
  out there, including the ones that we ourselves ship.  At least when
  it comes to FDWs, there's an option there for us to ship a *good*
  answer ourselves for certain (and, in particular, the very very
  common) use-cases.
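
The sketch referred to in the first point above: a shell-script
"foreign table" is really just a program that writes rows to stdout for
something like COPY ... FROM PROGRAM to pull in.  Purely illustrative,
with a made-up endpoint and columns:

    # rows_from_api.py: hypothetical program emitting CSV on stdout so
    # that COPY mytable FROM PROGRAM 'rows_from_api.py' could load it.
    import csv
    import json
    import sys
    import urllib.request

    # The URL and fields are invented for illustration.
    with urllib.request.urlopen("https://example.org/api/rows") as resp:
        rows = json.load(resp)

    writer = csv.writer(sys.stdout)
    for row in rows:
        writer.writerow([row["id"], row["name"]])

It "works", but you get none of the qual pushdown, costing, or
transaction awareness that the C API makes possible, which is the point
about forward thinking.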

> - We're only talking about writing a handful of tar files, and that's
> in the context of a full-database backup, which is a much
> heavier-weight operation than a query.

This is true for -Ft, but not -Fp, and I don't think enough thought is
being put into parallelism here, or into the fact that you don't want to
be limited to one process per tablespace.

> - There is not really any state that needs to be maintained across calls.

As mentioned elsewhere, this isn't really true.

> > How does this solution give them a good way to do the right thing
> > though?  In a way that will work with large databases and complex
> > requirements?  The answer seems to be "well, everyone will have to write
> > their own tool to do that" and that basically means that, at best, we're
> > only providing half of a solution and expecting all of our users to
> > provide the other half, and to always do it correctly and in a well
> > written way.  Acknowledging that most users aren't going to actually do
> > that and instead they'll implement half measures that aren't reliable
> > shouldn't be seen as an endorsement of this approach.
>
> I don't acknowledge that. I think it's possible to use tools like the
> proposed option in a perfectly reliable way, and I've already given a
> bunch of examples of how it could be done. Writing a file is not such
> a complex operation that every bit of code that writes one reliably
> has to be written by someone associated with the PostgreSQL project. I
> strongly suspect that people who use a cloud provider's tools to
> upload their backup files will be quite happy with the results, and if
> they aren't, I hope they will blame the cloud provider's tool for
> eating the data rather than this option for making it easy to give the
> data to the thing that ate it.

The examples you've given of how this could be done "right" involve
someone writing custom code (or having code that's been written by the
PG project) to be executed from this shell command interface, even just
to perform a local backup.
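
Even the purely local case bears that out: to be reliable, the custom
code has to care about durability.  A minimal sketch of what that
involves (the fsync calls are the point, not the details):

    # local_target.py: hypothetical target command that writes the
    # backup stream from stdin to a file and makes it durable.
    import os
    import shutil
    import sys

    path = sys.argv[1]
    with open(path, "wb") as out:
        shutil.copyfileobj(sys.stdin.buffer, out)
        out.flush()
        os.fsync(out.fileno())   # get the data down to stable storage

    # fsync the directory too, so the new file's entry survives a crash.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

And that's before error handling, retries, or cleaning up a partial
file on failure.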

As for where the blame goes, I don't find that to be a particularly
useful thing to argue about.  In any of this, if we are ultimately
saying "well, it's the user's fault, or the fault of the tools that the
user chose to use with our interface" then it seems like we've lost.
Maybe that's going too far, and maybe we can't hold ourselves to that
high a standard, but I like to think of this project, in particular, as
being the one that's trying really hard to go as far in that direction
as possible.

To that end, if we contemplate adding support for some cloud vendor's
storage, as an example, and discover that the command-line tools for it
suck or don't meet our expectations, I'd expect us either to refuse to
support it, or to forgo the command-line tools and instead implement
support for talking to that vendor's storage interface directly,
provided that interface works well.
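
The difference matters in practice: talking to the storage API directly
means controlling the streaming, retries, and error reporting ourselves
rather than hoping a CLI does.  Purely to illustrate the shape of that
(not how we'd implement it in the server, and assuming an S3-style
object store with its usual client library; the bucket and key are made
up):

    # stream_to_s3.py: illustrative only.  Stream a backup from stdin
    # to an object store through the storage API (boto3 here) instead
    # of shelling out to a vendor CLI.
    import sys
    import boto3

    s3 = boto3.client("s3")
    # upload_fileobj reads the stream in chunks (multipart for large
    # uploads), so the whole backup never has to land on local disk.
    s3.upload_fileobj(sys.stdin.buffer, "my-backup-bucket", "base.tar.bz2")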

Thanks,

Stephen
