Re: Better handling of archive_command problems - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Better handling of archive_command problems
Date
Msg-id CA+Tgmobu5AkOoDv4iSkPd4-+jZ_+j74rvArQz2=yqQPyCvzDpQ@mail.gmail.com
In response to Re: Better handling of archive_command problems  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Better handling of archive_command problems  (Peter Geoghegan <pg@heroku.com>)
Re: Better handling of archive_command problems  (Daniel Farina <daniel@heroku.com>)
List pgsql-hackers
On Thu, May 16, 2013 at 2:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, May 16, 2013 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I think it IS a Postgres precept that interrupts should get a
>> timely response.  You don't have to agree, but I think that's
>> important.
>
> Well, yes, but the fact of the matter is that it is taking high single
> digit numbers of seconds to get a response at times, so I don't think
> that there is any reasonable expectation that that be almost
> instantaneous. I don't want to make that worse, but then it might be
> worth it in order to ameliorate a particular pain point for users.

At times, like when the system is under really heavy load?  Or at
times, like depending on what the backend is doing?  We can't do a
whole lot about the fact that it's possible to beat a system to death
so that, at the OS level, it stops responding.  Linux is unfriendly
enough to put processes into non-interruptible kernel wait states when
they're waiting on the disk, a decision that I suspect to have been
made by a sadomasochist.  But if there are times when a system is not
responding to cancels in under a second when not particularly heavily
loaded, I would consider that a bug, and we should fix it.

>>> There is a setting called zero_damaged_pages, and enabling it causes
>>> data loss. I've seen cases where it was enabled within postgresql.conf
>>> for years.
>>
>> That is both true and bad, but it is not a reason to do more bad things.
>
> I don't think it's bad. I think that we shouldn't be paternalistic
> towards our users. If anyone enables a setting like zero_damaged_pages
> (or, say, wal_write_throttle) within their postgresql.conf
> indefinitely for no good reason, then they're incompetent. End of
> story.

That's a pretty user-hostile attitude.  Configuration mistakes are a
very common user error.  If those mistakes hose the system, users
expect to be able to change them back, hit reload, and get things back
on track.  But you're proposing a GUC that, if set to a bad value,
will very plausibly cause the entire system to freeze up in such a way
that it won't respond to a reload request - or for that matter a fast
shutdown request.  I think that's 100% unacceptable.  Despite what you
seem to think, we've put a lot of work into ensuring interruptibility,
and it does not make sense to abandon that principle for this or any
other feature.

> Would you feel better about it if the setting had a time-out? Say, the
> user had to explicitly re-enable it after one hour at the most?

No, but I'd feel better about it if you figured out a way to avoid
creating a scenario where it might lock up the entire database
cluster.  I am convinced that it is possible to avoid that, and that
without that this is not a feature worthy of being included in
PostgreSQL.  Yeah, it's more work that way.  But that's the difference
between "a quick hack that is useful in our shop" and "a
production-quality feature ready for a general audience".
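
As a rough illustration of the distinction drawn above, here is a minimal
sketch of a throttle delay that stays interruptible even when the
configured delay is absurdly large.  It is written in Python purely for
brevity and is not PostgreSQL code: throttled_wait, delay_s and interrupt
are hypothetical names, and the Event stands in for whatever mechanism
delivers a cancel, reload, or shutdown request.  In the backend the
equivalent would presumably be built on a latch-style wait rather than a
plain sleep.

```python
import threading
import time


def throttled_wait(delay_s, interrupt):
    """Wait up to delay_s seconds, but return early if interrupt is set.

    Returns True if the full delay elapsed, False if the wait was cut
    short by an interrupt (hypothetical stand-in for cancel/reload).
    """
    # Event.wait() blocks for at most delay_s seconds and returns True
    # as soon as the event is set, so a bad delay cannot wedge the caller.
    interrupted = interrupt.wait(timeout=delay_s)
    return not interrupted


if __name__ == "__main__":
    stop = threading.Event()
    # Simulate an administrator fixing the setting 0.1 s after the wait starts.
    threading.Timer(0.1, stop.set).start()

    start = time.monotonic()
    # A misconfigured one-hour throttle delay.
    completed = throttled_wait(3600.0, stop)
    elapsed = time.monotonic() - start
    print("completed full delay:", completed)
    print("returned promptly:", elapsed < 2.0)
```

The point of the sketch is only that the wait primitive, not the
configured value, bounds how long the process ignores a request.  A plain
uninterruptible sleep of the same length would hold the caller for the
full hour.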

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
