Re: Hard limit on WAL space used (because PANIC sucks) - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Hard limit on WAL space used (because PANIC sucks)
Date
Msg-id 51B6220A.9070103@agliodbs.com
In response to Hard limit on WAL space used (because PANIC sucks)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Hard limit on WAL space used (because PANIC sucks)
List pgsql-hackers
Josh, Daniel,

>> Right now, what we're telling users is "You can have continuous backup
>> with Postgres, but you'd better hire an expensive consultant to set it
>> up for you, or use this external tool of dubious provenance which
>> there's no packages for, or you might accidentally cause your database
>> to shut down in the middle of the night."
> 
> This is an outright falsehood. We are telling them, "You better know
> what you are doing" or "You should call a consultant". This is no
> different than, "You better know what you are doing" or "You should take
> driving lessons".

What I'm pointing out is that there is no "simple case" for archiving
the way we have it set up.  That is, every possible way to deploy PITR
for Postgres involves complex, error-prone configuration, setup, and
monitoring.  I don't think that's necessary; simple cases should have
simple solutions.

If you do a quick survey of pgsql-general, you will see that the issue
of databases shutting down unexpectedly due to archiving running them
out of disk space is a very common problem.  People shouldn't be afraid
of their backup solutions.

I'd agree that one possible answer for this is to just get one of the
external tools simplified, well-packaged, distributed, instrumented for
common monitoring systems, and referenced in our main documentation.
I'd say Barman is the closest to "a simple solution for the simple
common case", at least for PITR.  I've been able to give some clients
Barman and have them deploy it themselves.  This isn't true of the other
tools I've tried.  Too bad it's GPL, and doesn't do archiving-for-streaming.

> I have a clear bias in experience here, but I can't relate to someone
> who sets up archives but is totally okay losing a segment unceremoniously,
> because it only takes one of those once in a while to make a really,
> really bad day.  Who is this person that lackadaisically archives, and
> are they just fooling themselves?  And where are these archivers that

If WAL archiving is your *second* level of redundancy, you will
generally be willing to have it break rather than interfere with the
production workload.  This is particularly the case if you're using
archiving just as a backup for streaming replication.  Heck, I've had
one client where archiving was being used *only* to spin up staging
servers, and not for production at all; do you think they wanted
production to shut down if they ran out of archive space (which it did)?

I'll also point out that archiving can silently fail for a number of
reasons having nothing to do with "safety" options, such as an NFS mount
in Linux silently going away (I've also had this happen), or network
issues causing file corruption.  Which just points out that we need
better ways to detect gaps/corruption in archiving.
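
To make "detect gaps" concrete, here is a rough sketch of the kind of check
I mean. It is an illustration only, not an existing tool: the function name
find_wal_gaps is made up, and it assumes default 16 MB segments with the
9.3-style naming (8 hex digits each for timeline, log, and segment, 0x100
segments per log). It only spots missing file names; it says nothing about
corruption inside a segment.

```python
import re

SEGMENTS_PER_LOG = 0x100  # assumption: 16 MB segments, 9.3+ file naming
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_wal_gaps(filenames):
    """Return {timeline: [missing segment names]} for a list of archived files."""
    by_timeline = {}
    for name in filenames:
        if not WAL_NAME.match(name):
            continue  # ignore .backup, .history, and other non-segment files
        tli = int(name[0:8], 16)
        log = int(name[8:16], 16)
        seg = int(name[16:24], 16)
        # Flatten (log, seg) into one running segment number per timeline.
        by_timeline.setdefault(tli, set()).add(log * SEGMENTS_PER_LOG + seg)

    gaps = {}
    for tli, present in by_timeline.items():
        expected = set(range(min(present), max(present) + 1))
        missing = sorted(expected - present)
        if missing:
            gaps[tli] = [
                "%08X%08X%08X" % (tli, n // SEGMENTS_PER_LOG, n % SEGMENTS_PER_LOG)
                for n in missing
            ]
    return gaps
```

Feeding it the output of os.listdir() on the archive directory would be the
obvious way to use it from a monitoring check.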

Anyway, what I'm pointing out is that this is a business decision, and
there is no way that we can make a decision for the users what to do
when we run out of WAL space.  The "stop archiving" option needs to be
there for users, as well as the "shut down" option, *without* requiring
users to learn the internals of the archiving system
to implement it, or to know the implied effects of non-obvious
PostgreSQL settings.
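
For illustration, this is roughly what a user has to hand-roll today to get
that choice. It is a minimal sketch, not a recommended script: the function
archive_segment and its parameters (min_free_bytes, on_full) are
hypothetical, and the only PostgreSQL behavior it relies on is that an
archive_command returning zero means "archived" while a nonzero status
makes the server keep the segment and retry.

```python
import os
import shutil


def archive_segment(src_path, dest_dir, min_free_bytes, on_full="skip"):
    """Copy one WAL segment and return the status an archive_command would exit with."""
    free = shutil.disk_usage(dest_dir).free
    if free < min_free_bytes:
        # "skip": report success, leaving a gap in the archive but letting
        #         production carry on ("stop archiving").
        # "fail": report failure, so WAL accumulates in pg_xlog until
        #         somebody frees space (today's behavior).
        return 0 if on_full == "skip" else 1

    dest = os.path.join(dest_dir, os.path.basename(src_path))
    if os.path.exists(dest):
        return 1  # never overwrite a segment that is already archived

    # Copy to a temporary name first so a partial file never looks complete.
    tmp = dest + ".tmp"
    shutil.copy2(src_path, tmp)
    os.replace(tmp, dest)
    return 0
```

The point is not this particular script; it is that picking between "skip"
and "fail" ought to be a setting, not something each site reinvents.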

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


