Re: Better handling of archive_command problems - Mailing list pgsql-hackers

From Daniel Farina
Subject Re: Better handling of archive_command problems
Date
Msg-id CAAZKuFZ_hYtvvZXKe7Y5OsaMcsO_=O+J6sxUwaP07S+1JPbJLA@mail.gmail.com
In response to Re: Better handling of archive_command problems  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Better handling of archive_command problems  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, May 16, 2013 at 5:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 16, 2013 at 2:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, May 16, 2013 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, I think it IS a Postgres precept that interrupts should get a
>>> timely response.  You don't have to agree, but I think that's
>>> important.
>>
>> Well, yes, but the fact of the matter is that it is taking high single
>> digit numbers of seconds to get a response at times, so I don't think
>> that there is any reasonable expectation that that be almost
>> instantaneous. I don't want to make that worse, but then it might be
>> worth it in order to ameliorate a particular pain point for users.
>
> At times, like when the system is under really heavy load?  Or at
> times, like depending on what the backend is doing?  We can't do a
> whole lot about the fact that it's possible to beat a system to death
> so that, at the OS level, it stops responding.  Linux is unfriendly
> enough to put processes into non-interruptible kernel wait states when
> they're waiting on the disk, a decision that I suspect to have been
> made by a sadomasochist.  But if there are times when a system that
> is not particularly heavily loaded fails to respond to cancels in
> under a second, I would consider that a bug, and we should fix it.
>
>>>> There is a setting called zero_damaged_pages, and enabling it causes
>>>> data loss. I've seen cases where it was enabled within postgresql.conf
>>>> for years.
>>>
>>> That is both true and bad, but it is not a reason to do more bad things.
>>
>> I don't think it's bad. I think that we shouldn't be paternalistic
>> towards our users. If anyone enables a setting like zero_damaged_pages
>> (or, say, wal_write_throttle) within their postgresql.conf
>> indefinitely for no good reason, then they're incompetent. End of
>> story.
>
> That's a pretty user-hostile attitude.  Configuration mistakes are a
> very common user error.  If those configuration mistakes hose the
> system, users expect to be able to change them back, hit reload, and
> get things back
> on track.  But you're proposing a GUC that, if set to a bad value,
> will very plausibly cause the entire system to freeze up in such a way
> that it won't respond to a reload request - or for that matter a fast
> shutdown request.  I think that's 100% unacceptable.  Despite what you
> seem to think, we've put a lot of work into ensuring interruptibility,
> and it does not make sense to abandon that principle for this or any
> other feature.

The inability to shut down in such a situation is, as you say, not
good at all, and the problem of not being able to change the GUC back
because of non-interruptibility is pretty bad too.
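
Just to spell out the recovery path you're describing -- a rough
sketch, nothing from the actual proposal, and wal_write_throttle is
only the placeholder name used upthread:

    # Back out the bad value in postgresql.conf (say, the hypothetical
    # wal_write_throttle), then signal the postmaster:
    pg_ctl -D "$PGDATA" reload    # sends SIGHUP

    # A backend only notices the new value the next time it gets
    # around to processing interrupts / a config reload, so a backend
    # stuck in an uninterruptible wait never picks it up -- which is
    # exactly the failure mode you're objecting to.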

>> Would you feel better about it if the setting had a time-out? Say, the
>> user had to explicitly re-enable it after one hour at the most?
>
> No, but I'd feel better about it if you figured out a way to avoid
> creating a scenario where it might lock up the entire database
> cluster.  I am convinced that it is possible to avoid that

Do you have a sketch of a mechanism that would avoid that problem?

> and that without that this is not a feature worthy of being included
> in PostgreSQL.  Yeah, it's more work that way.  But that's the
> difference between "a quick hack that is useful in our shop" and "a
> production-quality feature ready for a general audience".

However little it may matter, I would like to disagree with your
opinion on this one: the current situation, which I imagine is
encountered by *all* users of archiving, is really unpleasant, 'my'
shop or no.  It would probably not be inaccurate to say that 99.9999%
of archiving users have to live with only hazy control over the amount
of data loss, bounded only by how long it takes for the system to fill
up the WAL file system and then for PostgreSQL to PANIC and crash
(hence, no more writes are processed, and no more data can be lost).
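
To make that failure mode concrete, this is roughly what one ends up
watching while archiving is wedged -- purely illustrative, assuming a
stock data directory layout:

    # WAL piles up for as long as archive_command keeps failing:
    du -sh "$PGDATA/pg_xlog"
    # Count segments still waiting to be archived:
    ls "$PGDATA/pg_xlog/archive_status" | grep -c '\.ready$'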

Once one factors in the human cost of having to deal with that
downtime, or of monitoring things to circumvent it, I feel as though
the bar for quality should be lowered.  As you can see, we've had to
resort to horrific techniques to get around this problem.

I think this is something serious enough that it is worth doing
better, but the bind that people doing archiving find themselves in is
much worse at the margins -- involving data loss and loss of
availability -- and accordingly, I think the bar for some kind of
solution should be lowered, insofar as at least the interface should
be right enough not to become an albatross later (a bar which this
proposal may not meet).

That said, there is probably a way to please everyone and do something
better.  Any ideas?


