Thread: Better handling of archive_command problems

Better handling of archive_command problems

From
Peter Geoghegan
Date:
The documentation says of continuous archiving:

"While designing your archiving setup, consider what will happen if
the archive command fails repeatedly because some aspect requires
operator intervention or the archive runs out of space. For example,
this could occur if you write to tape without an autochanger; when the
tape fills, nothing further can be archived until the tape is swapped.
You should ensure that any error condition or request to a human
operator is reported appropriately so that the situation can be
resolved reasonably quickly. The pg_xlog/ directory will continue to
fill with WAL segment files until the situation is resolved. (If the
file system containing pg_xlog/ fills up, PostgreSQL will do a PANIC
shutdown. No committed transactions will be lost, but the database
will remain offline until you free some space.)"

I think it is not uncommon for archiving to fall seriously behind,
risking a major loss of availability. When this happens, the
DBA has to fight against the clock to fix whatever problem there is
with continuous archiving, hoping to catch up and prevent a PANIC
shutdown. This is a particularly unpleasant problem to have.

At Heroku, we naturally monitor the state of continuous archiving on
all clusters under our control. However, when faced with this
situation, sometimes the least-worst option to buy time is to throttle
Postgres using a crude mechanism: issuing repeated SIGSTOP and SIGCONT
signals to all Postgres processes, with the exception of the archiver
auxiliary process. Obviously this is a terrible thing to have to do,
principally because it slows almost everything right down. It would be
far preferable to just slow down the writing of WAL segments when
these emergencies arise, since that alone is what risks causing a
PANIC shutdown when XLogWrite() cannot write WAL. Even if the pg_xlog
directory is on the same filesystem as database heap files, it is
obviously the case that throttling WAL will have the effect of
throttling operations that might cause those heap files to be
enlarged. Reads (including the writes that enable reads, such as those
performed by the background writer and by backends cleaning dirty
buffers) and checkpointing are not affected (though of course
checkpointing does have to write checkpoint WAL records, so perhaps
not quite).

What I'd like to propose is that we simply sit on WALWriteLock for a
configured delay in order to throttle the writing (though not the
insertion) of WAL records. I've drafted a patch that does just that -
it has the WAL Writer optionally sleep on the WALWriteLock for some
period of time once per activity cycle (avoiding WAL Writer
hibernation). If this sounds similar to commit_delay, that's because
it is almost exactly the same. We just sleep within the WAL Writer
rather than in a group commit leader backend, so that the sleep doesn't
depend upon some backend hitting the XLogFlush()/commit_delay codepath.
In a
bulk loading situation, it's perfectly possible for no backend to
actually hit XLogFlush() with any sort of regularity, so commit_delay
cannot really be abused to do what I describe here. Besides, right now
commit_delay is capped so that it isn't possible to delay for more
than 1/10th of a second.
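
To make the shape of that concrete, here is a minimal sketch of the idea
(not the drafted patch itself; wal_write_throttle is just an illustrative
name for a GUC expressed in milliseconds, with 0 meaning disabled):

/* Hypothetical GUC, in milliseconds; 0 disables throttling. */
int		wal_write_throttle = 0;

/*
 * Sketch only: once per WAL Writer activity cycle, sit on WALWriteLock
 * for the configured delay.  While we hold the lock, nobody can write
 * (flush) WAL, though insertion into WAL buffers can continue until the
 * buffers fill up.
 */
static void
ThrottleWALWriting(void)
{
	if (wal_write_throttle > 0)
	{
		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
		pg_usleep(wal_write_throttle * 1000L);	/* ms -> us */
		LWLockRelease(WALWriteLock);
	}
}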

What I've proposed here has the disadvantage of making activity rounds
of the WAL Writer take longer, thus considerably increasing the window
before any asynchronous commits actually make it out to disk.
However, that's a problem inherent in any throttling of WAL
writing as described here (XLogBackgroundFlush() itself acquires
WALWriteLock anyway), so I don't imagine that there's anything that
can be done about it other than having a clear warning. I envisage
this feature as very much a sharp tool to be used by the DBA only when
they are in a very tight bind. Better to at least be able to handle
read queries when this problem strikes, and to avoid throttling
longer-running transactions whose writes don't need to make
it out to disk right away. I also have a notion that we can usefully
throttle WAL writing less aggressively than almost or entirely
preventing it. For example, a third-party monitoring daemon
could scale the throttling delay up or down as a function of how full
the pg_xlog filesystem is. It might be better to modestly throttle WAL
writing for two hours in order to allow continuous archiving to catch
up, rather than sharply curtailing WAL writing for a shorter period.
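
To illustrate that daemon idea, something as simple as the following could
compute a delay from filesystem fullness (the path, the thresholds and the
wal_write_throttle setting are all made up for the example; a real daemon
would push the value into postgresql.conf and signal a reload, then repeat
periodically):

#include <stdio.h>
#include <sys/statvfs.h>

/* Map pg_xlog filesystem fullness (0.0 - 1.0) to a delay in milliseconds. */
static long
throttle_delay_ms(double fullness)
{
	if (fullness < 0.70)
		return 0;						/* plenty of headroom */
	if (fullness < 0.90)
		return (long) ((fullness - 0.70) / 0.20 * 500.0);	/* ramp up */
	return 2000;						/* nearly full: throttle hard */
}

int
main(void)
{
	struct statvfs fs;
	double		fullness;

	if (statvfs("/pgdata/pg_xlog", &fs) != 0)
		return 1;

	fullness = 1.0 - (double) fs.f_bavail / (double) fs.f_blocks;
	printf("wal_write_throttle = %ld\n", throttle_delay_ms(fullness));
	return 0;
}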

Has anyone else thought about approaches to mitigating the problems
that arise when an archive_command continually fails, and the DBA must
manually clean up the mess?

-- 
Peter Geoghegan



Re: Better handling of archive_command problems

From
Daniel Farina
Date:
On Mon, May 13, 2013 at 3:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Has anyone else thought about approaches to mitigating the problems
> that arise when an archive_command continually fails, and the DBA must
> manually clean up the mess?

Notably, the most common problem in this vein suffered at Heroku has
nothing to do with archive_command failing, and everything to do with
the ratio of block device write performance (hence, backlog) versus
the archiving performance.  When CPU is uncontended it's not a huge
deficit, but it is there and it causes quite a bit of stress.

Archive commands failing are definitely a special case there, where it
might be nice to bring write traffic to exactly zero for a time.



Re: Better handling of archive_command problems

From
Robert Haas
Date:
On Tue, May 14, 2013 at 12:23 AM, Daniel Farina <daniel@heroku.com> wrote:
> On Mon, May 13, 2013 at 3:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Has anyone else thought about approaches to mitigating the problems
>> that arise when an archive_command continually fails, and the DBA must
>> manually clean up the mess?
>
> Notably, the most common problem in this vein suffered at Heroku has
> nothing to do with archive_command failing, and everything to do with
> the ratio of block device write performance (hence, backlog) versus
> the archiving performance.  When CPU is uncontended it's not a huge
> deficit, but it is there and it causes quite a bit of stress.
>
> Archive commands failing are definitely a special case there, where it
> might be nice to bring write traffic to exactly zero for a time.

One possible objection to this line of attack is that, IIUC, waits to
acquire a LWLock are non-interruptible.  If someone tells PostgreSQL
to wait for some period of time before performing each WAL write,
other backends that grab the WALWriteLock will not respond to query
cancels during that time.  Worse, the locks have a tendency to back
up.  What I have observed is that if WAL isn't flushed in a timely
fashion, someone will try to grab WALWriteLock while holding
WALInsertLock.  Now anyone who attempts to insert WAL is in a
non-interruptible wait.  If the system is busy, it won't be long
before someone tries to extend pg_clog, and to do that they'll try to
grab WALInsertLock while holding CLogControlLock.  At that point, any
CLOG lookup that misses in the already-resident pages will send that
backend into a non-interruptible wait. I have seen cases where this
pile-up occurs during a heavy pgbench workload and paralyzes the
entire system, including any read-only queries, until the WAL write
completes.

Now despite all that, I can see this being useful enough that Heroku
might want to insert a very small patch into their version of
PostgreSQL to do it this way, and just live with the downsides.  But
anything that can propagate non-interruptible waits across the entire
system does not sound to me like a feature that is sufficiently
polished that we want to expose it to users less sophisticated than
Heroku (i.e. nearly all of them).  If we do this, I think we ought to
find a way to make the waits interruptible, and to insert them in
places where they really don't interfere with read-only backends.  I'd
probably also argue that we ought to try to design it such that the
GUC can be in MB/s rather than delay/WAL writer cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better handling of archive_command problems

From
Peter Geoghegan
Date:
On Wed, May 15, 2013 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> One possible objection to this line of attack is that, IIUC, waits to
> acquire a LWLock are non-interruptible.  If someone tells PostgreSQL
> to wait for some period of time before performing each WAL write,
> other backends that grab the WALWriteLock will not respond to query
> cancels during that time.

I don't see any reasonable way to make LWLocks care about interrupts
(using all 3 possible underlying semaphore implementations, no less).
As it says within LWLockAcquire:

/*
 * Lock out cancel/die interrupts until we exit the code section protected
 * by the LWLock.  This ensures that interrupts will not interfere with
 * manipulations of data structures in shared memory.
 */
HOLD_INTERRUPTS();

We've been pretty judicious about placing CHECK_FOR_INTERRUPTS() calls
in the right places, but it's still quite possible to see the server
take multiple seconds - perhaps even as many as 10 - to respond to an
interrupt (by psql SIGINT). Now, I didn't have enough of an interest
at the times I noticed this to figure out exactly why that may have
been or to somehow characterize it, but I don't accept that it's a
violation of some Postgres precept that this setting could result in
interrupts taking multiple seconds, and maybe even as many as 10. I'd
go so far as to let the user make the throttling sleep take as long as
they like, even though this admittedly would sort of break such a
precept.

There is a setting called zero_damaged_pages, and enabling it causes
data loss. I've seen cases where it was enabled within postgresql.conf
for years.

> Worse, the locks have a tendency to back up.

Well, yes, sleeping on WALWriteLock has some fairly bad consequences
for performance. That's sort of the idea. The bar is rather low here,
to my mind (the current state of the art is to monitor it and deal with it
yourself, and risk a PANIC shutdown if you fail). I think it's worth
noting that at least one other "Enterprise" RDBMS has a similar
feature.

The only situation that I can see where a backend would acquire
WALWriteLock alongside WALInsertLock is when a segment switch needs to
occur (i.e. an XLOG_SWITCH record is being inserted). If that's the
case, no one is going to be able to insert WAL for as long as the
throttling sleep lasts anyway, so there is no additional harm done.

> Now despite all that, I can see this being useful enough that Heroku
> might want to insert a very small patch into their version of
> PostgreSQL to do it this way, and just live with the downsides.  But
> anything that can propagate non-interruptible waits across the entire
> system does not sound to me like a feature that is sufficiently
> polished that we want to expose it to users less sophisticated than
> Heroku (i.e. nearly all of them).  If we do this, I think we ought to
> find a way to make the waits interruptible, and to insert them in
> places where they really don't interfere with read-only backends.

It would be nice to be able to be sure that CLogControlLock could not
be held for multiple seconds as a result of this. However, I don't see
any reasons to let the perfect be the enemy of the good, or at least
the better. Just how likely is it that the scenario you describe will
affect reads in the real world? In any case, this is a problem in its
own right.

I don't intend to make any promises about how this throttling will
affect read queries, except perhaps something vague and loose.

> I'd probably also argue that we ought to try to design it such that the
> GUC can be in MB/s rather than delay/WAL writer cycle.

That'd certainly be better, but I can't see a way of doing it without
adding a whole bunch of mechanism to some important codepaths, like
within XLogWrite(), which would be quite a hard sell.

-- 
Peter Geoghegan



Re: Better handling of archive_command problems

From
Robert Haas
Date:
On Wed, May 15, 2013 at 6:40 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, May 15, 2013 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> One possible objection to this line of attack is that, IIUC, waits to
>> acquire a LWLock are non-interruptible.  If someone tells PostgreSQL
>> to wait for some period of time before performing each WAL write,
>> other backends that grab the WALWriteLock will not respond to query
>> cancels during that time.
>
> I don't see any reasonable way to make LWLocks care about interrupts
> (using all 3 possible underlying semaphore implementations, no less).

It couldn't be done across the board, but that doesn't mean that
certain cases couldn't get special treatment.

> As it says within LWLockAcquire:
>
> /*
>  * Lock out cancel/die interrupts until we exit the code section protected
>  * by the LWLock.  This ensures that interrupts will not interfere with
>  * manipulations of data structures in shared memory.
>  */
> HOLD_INTERRUPTS();
>
> We've been pretty judicious about placing CHECK_FOR_INTERRUPTS() calls
> in the right places, but it's still quite possible to see the server
> take multiple seconds - perhaps even as many as 10 - to respond to an
> interrupt (by psql SIGINT). Now, I didn't have enough of an interest
> at the times I noticed this to figure out exactly why that may have
> been or to somehow characterize it, but I don't accept that it's a
> violation of some Postgres precept that this setting could result in
> interrupts taking multiple seconds, and maybe even as many as 10. I'd
> go so far as to let the user make the throttling sleep take as long as
> they like, even though this admittedly would sort of break such a
> precept.

Well, I think it IS a Postgres precept that interrupts should get a
timely response.  You don't have to agree, but I think that's
important.

> There is a setting called zero_damaged_pages, and enabling it causes
> data loss. I've seen cases where it was enabled within postgresql.conf
> for years.

That is both true and bad, but it is not a reason to do more bad things.

>> Now despite all that, I can see this being useful enough that Heroku
>> might want to insert a very small patch into their version of
>> PostgreSQL to do it this way, and just live with the downsides.  But
>> anything that can propagate non-interruptible waits across the entire
>> system does not sound to me like a feature that is sufficiently
>> polished that we want to expose it to users less sophisticated than
>> Heroku (i.e. nearly all of them).  If we do this, I think we ought to
>> find a way to make the waits interruptible, and to insert them in
>> places where they really don't interfere with read-only backends.
>
> It would be nice to be able to be sure that CLogControlLock could not
> be held for multiple seconds as a result of this. However, I don't see
> any reasons to let the perfect be the enemy of the good, or at least
> the better. Just how likely is it that the scenario you describe will
> affect reads in the real world? In any case, this is a problem in its
> own right.

That's true, but you're proposing to add a knob which would make it
much easier for users to notice the bad behavior that already exists,
and to prolong it for unbounded periods of time even when the system
is not under load.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better handling of archive_command problems

From
Peter Geoghegan
Date:
On Thu, May 16, 2013 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, I think it IS a Postgres precept that interrupts should get a
> timely response.  You don't have to agree, but I think that's
> important.

Well, yes, but the fact of the matter is that it is taking high single
digit numbers of seconds to get a response at times, so I don't think
that there is any reasonable expectation that that be almost
instantaneous. I don't want to make that worse, but then it might be
worth it in order to ameliorate a particular pain point for users.

>> There is a setting called zero_damaged_pages, and enabling it causes
>> data loss. I've seen cases where it was enabled within postgresql.conf
>> for years.
>
> That is both true and bad, but it is not a reason to do more bad things.

I don't think it's bad. I think that we shouldn't be paternalistic
towards our users. If anyone enables a setting like zero_damaged_pages
(or, say, wal_write_throttle) within their postgresql.conf
indefinitely for no good reason, then they're incompetent. End of
story.

Would you feel better about it if the setting had a time-out? Say, the
user had to explicitly re-enable it after one hour at the most?

-- 
Peter Geoghegan



Re: Better handling of archive_command problems

From
Robert Haas
Date:
On Thu, May 16, 2013 at 2:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, May 16, 2013 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I think it IS a Postgres precept that interrupts should get a
>> timely response.  You don't have to agree, but I think that's
>> important.
>
> Well, yes, but the fact of the matter is that it is taking high single
> digit numbers of seconds to get a response at times, so I don't think
> that there is any reasonable expectation that that be almost
> instantaneous. I don't want to make that worse, but then it might be
> worth it in order to ameliorate a particular pain point for users.

At times, like when the system is under really heavy load?  Or at
times, like depending on what the backend is doing?  We can't do a
whole lot about the fact that it's possible to beat a system to death
so that, at the OS level, it stops responding.  Linux is unfriendly
enough to put processes into non-interruptible kernel wait states when
they're waiting on the disk, a decision that I suspect to have been
made by a sadomasochist.  But if there are times when a system that is
not particularly heavily loaded isn't responding to cancels in under a
second, I would consider that a bug, and we should fix it.

>>> There is a setting called zero_damaged_pages, and enabling it causes
>>> data loss. I've seen cases where it was enabled within postgresql.conf
>>> for years.
>>
>> That is both true and bad, but it is not a reason to do more bad things.
>
> I don't think it's bad. I think that we shouldn't be paternalistic
> towards our users. If anyone enables a setting like zero_damaged_pages
> (or, say, wal_write_throttle) within their postgresql.conf
> indefinitely for no good reason, then they're incompetent. End of
> story.

That's a pretty user-hostile attitude.  Configuration mistakes are a
very common user error.  If those configurations hose the system, users
expect to be able to change them back, hit reload, and get things back
on track.  But you're proposing a GUC that, if set to a bad value,
will very plausibly cause the entire system to freeze up in such a way
that it won't respond to a reload request - or for that matter a fast
shutdown request.  I think that's 100% unacceptable.  Despite what you
seem to think, we've put a lot of work into ensuring interruptibility,
and it does not make sense to abandon that principle for this or any
other feature.

> Would you feel better about it if the setting had a time-out? Say, the
> user had to explicitly re-enable it after one hour at the most?

No, but I'd feel better about it if you figured out a way to avoid
creating a scenario where it might lock up the entire database
cluster.  I am convinced that it is possible to avoid that, and that
without that this is not a feature worthy of being included in
PostgreSQL.  Yeah, it's more work that way.  But that's the difference
between "a quick hack that is useful in our shop" and "a
production-quality feature ready for a general audience".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better handling of archive_command problems

From
Peter Geoghegan
Date:
On Thu, May 16, 2013 at 5:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> At times, like when the system is under really heavy load?  Or at
> times, like depending on what the backend is doing?  We can't do a
> whole lot about the fact that it's possible to beat a system to death
> so that, at the OS level, it stops responding.  Linux is unfriendly
> enough to put processes into non-interruptible kernel wait states when
> they're waiting on the disk, a decision that I suspect to have been
> made by a sadomasochist.

I find it more plausible that the decision was made by someone making
an engineering trade-off.

> But if there are times when a system that is
> not particularly heavily loaded isn't responding to cancels in under a
> second, I would consider that a bug, and we should fix it.

It's not as if the DBA is going to have a hard time figuring out why
that is. It's taking a long time to respond because they've throttled
the entire server. Clearly, if that's something they're doing very
casually, they have bigger problems.

>> I don't think it's bad. I think that we shouldn't be paternalistic
>> towards our users. If anyone enables a setting like zero_damaged_pages
>> (or, say, wal_write_throttle) within their postgresql.conf
>> indefinitely for no good reason, then they're incompetent. End of
>> story.
>
> That's a pretty user-hostile attitude.

I think paternalism is user-hostile. Things should be easy to use
correctly and hard to use incorrectly. I certainly think we should be
novice friendly, but not if that implies being expert hostile.

The fact that a PANIC shutdown can occur when the pg_xlog filesystem
runs out of space is pretty user-hostile. It's hostile to both novices
and experts.

> Configuration mistakes are a
> very common user error.  If those configurations hose the system, users
> expect to be able to change them back, hit reload, and get things back
> on track.  But you're proposing a GUC that, if set to a bad value,
> will very plausibly cause the entire system to freeze up in such a way
> that it won't respond to a reload request - or for that matter a fast
> shutdown request.  I think that's 100% unacceptable.  Despite what you
> seem to think, we've put a lot of work into ensuring interruptibility,
> and it does not make sense to abandon that principle for this or any
> other feature.
>
>> Would you feel better about it if the setting had a time-out? Say, the
>> user had to explicitly re-enable it after one hour at the most?
>
> No, but I'd feel better about it if you figured out a way to avoid
> creating a scenario where it might lock up the entire database
> cluster.  I am convinced that it is possible to avoid that, and that
> without that this is not a feature worthy of being included in
> PostgreSQL.

What if the WALWriter slept on its proc latch within
XLogBackgroundFlush(), rather than calling pg_usleep? That way,
WalSigHupHandler() will set the process latch on a reload, and the
sleep will end immediately if the user determines that they've made a
mistake in setting the sleep. Ditto all other signals. As with all
extant latch sleeps, we wake on postmaster death, so that an
inordinately long sleep doesn't create a denial-of-service that
prevents a restart if, say, the postmaster receives SIGKILL.
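
Roughly, in place of a pg_usleep() the WAL Writer would do something like
the following (a sketch only; wal_write_throttle is still the illustrative
GUC from before, and got_SIGHUP is the WAL Writer's existing reload flag):

int		rc;

ResetLatch(&MyProc->procLatch);

/* Sleep, but wake immediately on any signal or on postmaster death. */
rc = WaitLatch(&MyProc->procLatch,
			   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
			   wal_write_throttle);		/* timeout in milliseconds */

if (rc & WL_POSTMASTER_DEATH)
	exit(1);

if (got_SIGHUP)
{
	/* The user may have just reduced or disabled the throttle. */
	got_SIGHUP = false;
	ProcessConfigFile(PGC_SIGHUP);
}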

Maybe it wouldn't even be much additional work to figure out a way of
making LWLocks care about interrupts. I think an upper limit on the
relevant GUC is sufficient given the nature of what I propose to do,
but I might be convinced if a better approach came to light. Do you
have one?

> Yeah, it's more work that way.  But that's the difference
> between "a quick hack that is useful in our shop" and "a
> production-quality feature ready for a general audience".

It is certainly the case that it would be far easier for me to just
deploy this on our own customer instances. If I'm expected to solve
the problem of this throttling conceivably affecting backends
executing read queries due to the CLogControlLock scenario you
describe, just so users can have total assurance read queries are
unaffected for a couple of hours or less once in a blue moon when
they're fighting off a PANIC shutdown, then the bar is set almost
impossibly high. This is unfortunate, because there is plenty of
evidence that archive_command issues cause serious user pain all the
time.

-- 
Peter Geoghegan



Re: Better handling of archive_command problems

From
Daniel Farina
Date:
On Thu, May 16, 2013 at 5:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 16, 2013 at 2:42 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Thu, May 16, 2013 at 11:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Well, I think it IS a Postgres precept that interrupts should get a
>>> timely response.  You don't have to agree, but I think that's
>>> important.
>>
>> Well, yes, but the fact of the matter is that it is taking high single
>> digit numbers of seconds to get a response at times, so I don't think
>> that there is any reasonable expectation that that be almost
>> instantaneous. I don't want to make that worse, but then it might be
>> worth it in order to ameliorate a particular pain point for users.
>
> At times, like when the system is under really heavy load?  Or at
> times, like depending on what the backend is doing?  We can't do a
> whole lot about the fact that it's possible to beat a system to death
> so that, at the OS level, it stops responding.  Linux is unfriendly
> enough to put processes into non-interruptible kernel wait states when
> they're waiting on the disk, a decision that I suspect to have been
> made by a sadomasochist.  But if there are times when a system that is
> not particularly heavily loaded isn't responding to cancels in under a
> second, I would consider that a bug, and we should fix it.
>
>>>> There is a setting called zero_damaged_pages, and enabling it causes
>>>> data loss. I've seen cases where it was enabled within postgresql.conf
>>>> for years.
>>>
>>> That is both true and bad, but it is not a reason to do more bad things.
>>
>> I don't think it's bad. I think that we shouldn't be paternalistic
>> towards our users. If anyone enables a setting like zero_damaged_pages
>> (or, say, wal_write_throttle) within their postgresql.conf
>> indefinitely for no good reason, then they're incompetent. End of
>> story.
>
> That's a pretty user-hostile attitude.  Configuration mistakes are a
> very common user error.  If those configurations hose the system, users
> expect to be able to change them back, hit reload, and get things back
> on track.  But you're proposing a GUC that, if set to a bad value,
> will very plausibly cause the entire system to freeze up in such a way
> that it won't respond to a reload request - or for that matter a fast
> shutdown request.  I think that's 100% unacceptable.  Despite what you
> seem to think, we've put a lot of work into ensuring interruptibility,
> and it does not make sense to abandon that principle for this or any
> other feature.

The inability to shut down in such a situation is not happy at all, as
you say, and the problems with whacking the GUC around due to
non-interruptibility are pretty bad too.

>> Would you feel better about it if the setting had a time-out? Say, the
>> user had to explicitly re-enable it after one hour at the most?
>
> No, but I'd feel better about it if you figured out a way to avoid
> creating a scenario where it might lock up the entire database
> cluster.  I am convinced that it is possible to avoid that

Do you have a sketch about mechanism to not encounter that problem?

> and that without that this is not a feature worthy of being included
> in PostgreSQL.  Yeah, it's more work that way.  But that's the
> difference between "a quick hack that is useful in our shop" and "a
> production-quality feature ready for a general audience".

However little it may matter, I would like to disagree with your
opinion on this one: the current situation as I imagine encountered by
*all* users of archiving is really unpleasant, 'my' shop or no.  It
would probably not be inaccurate to say that 99.9999% of archiving
users have to live with a hazy control over the amount of data loss,
only bounded by how long it takes for the system to fill up the WAL
file system and then for PostgreSQL to PANIC and crash (hence, no more
writes are processed, and no more data can be lost).

Once one factors in the human cost of having to deal with that
downtime or monitor it to circumvent this, I feel as though the bar for
quality should be lowered.  As you see, we've had to resort to
horrific techniques to get around this problem.

I think this is something serious enough that it is worth doing
better, but the bind that people doing archiving find themselves in is
much worse at the margins -- involving data loss and loss of
availability -- and accordingly, I think the bar for some kind of
solution should be lowered, insofar as at least the interface
should be right enough not to be an albatross later (a bar this
proposal may not meet).

That said, there is probably a way to please everyone and do something
better.  Any ideas?



Re: Better handling of archive_command problems

From
Robert Haas
Date:
On Thu, May 16, 2013 at 10:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> I don't think it's bad. I think that we shouldn't be paternalistic
>>> towards our users. If anyone enables a setting like zero_damaged_pages
>>> (or, say, wal_write_throttle) within their postgresql.conf
>>> indefinitely for no good reason, then they're incompetent. End of
>>> story.
>>
>> That's a pretty user-hostile attitude.
>
> I think paternalism is user-hostile. Things should be easy to use
> correctly and hard to use incorrectly. I certainly think we should be
> novice friendly, but not if that implies being expert hostile.

I don't disagree with that.  I am arguing that your proposal, as it
stands, is far too easy to use incorrectly.

> The fact that a PANIC shutdown can occur when the pg_xlog filesystem
> runs out of space is pretty user-hostile. It's hostile to both novices
> and experts.

+1.

>> No, but I'd feel better about it if you figured out a way to avoid
>> creating a scenario where it might lock up the entire database
>> cluster.  I am convinced that it is possible to avoid that, and that
>> without that this is not a feature worthy of being included in
>> PostgreSQL.
>
> What if the WALWriter slept on its proc latch within
> XLogBackgroundFlush(), rather than calling pg_usleep? That way,
> WalSigHupHandler() will set the process latch on a reload, and the
> sleep will end immediately if the user determines that they've made a
> mistake in setting the sleep. Ditto all other signals. As with all
> extant latch sleeps, we wake on postmaster death, so that an
> inordinately long sleep doesn't create a denial-of-service that
> prevents a restart if, say, the postmaster receives SIGKILL.

That seems like a step in the right direction.  I'm not sure it's
enough: it makes that particular backend a lot more responsive to
interrupts, but other backends that may get caught waiting for the
LWLocks it holds are still going to freeze up.  Now, that no longer
prevents you from reloading the configuration and getting unstuck
(which is good), but it's still a lot of non-interruptible stuff
(which is not good).

I would also suggest considering other possible places to insert the
sleep.  One idea that occurs to me off-hand is to do it just before or
just after each XLogInsert().  We might be holding a buffer lwlock at
that point, though.  That's probably not SO bad, but what is bad is
that we might be extending pg_clog or pg_subtrans while sitting on the
relevant SLRU lock, and that's the quintessentially wrong time to go
into the tank.

Actually, I wonder if we couldn't tie this into ProcessInterrupts().
If this feature is enabled, then after each XLogInsert(), we increment
XLogBytesWrittenSinceLastSleep and set InterruptPending if it crosses
some threshold (8kB?).  In ProcessInterrupts, we add a new stanza at the
bottom of the function.  If that finds that XLogBytesWrittenSinceLastSleep
exceeds the same threshold, then it does gettimeofday(), computes a
sleep time based on the specified WAL write rate, the time since we
last slept, and the number of WAL bytes written.  It then resets
XLogBytesWrittenSinceLastSleep to 0 and sleeps on the process latch
for the calculated sleep time.

Under this system, we never sleep while holding an LWLock (because
LWLockAcquire calls HOLD_INTERRUPTS(), to be matched by
LWLockRelease's call to RESUME_INTERRUPTS()), we can throttle the WAL
write rate of each backend in bytes/second, and the overhead when the
feature's not in use should be negligible.
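
Very roughly, and entirely untested, the new stanza at the bottom of
ProcessInterrupts() might look something like this (the names
XLogBytesWrittenSinceLastSleep and wal_write_rate, and the 8kB threshold,
are placeholders, not anything that exists today):

/* Placeholders for illustration; declared elsewhere in reality. */
extern int		wal_write_rate;			/* MB/s; 0 disables throttling */
extern uint64	XLogBytesWrittenSinceLastSleep;
static instr_time last_throttle_time;

if (wal_write_rate > 0 &&
	XLogBytesWrittenSinceLastSleep >= 8192)
{
	instr_time	now;
	double		elapsed_ms;
	double		budget_ms;

	INSTR_TIME_SET_CURRENT(now);
	INSTR_TIME_SUBTRACT(now, last_throttle_time);
	elapsed_ms = INSTR_TIME_GET_MILLISEC(now);

	/* How long these bytes "should" have taken at wal_write_rate MB/s. */
	budget_ms = (XLogBytesWrittenSinceLastSleep /
				 (wal_write_rate * 1024.0 * 1024.0)) * 1000.0;

	/* Sleep off the difference on the process latch, holding no LWLock. */
	if (budget_ms > elapsed_ms)
		(void) WaitLatch(&MyProc->procLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
						 (long) (budget_ms - elapsed_ms));

	XLogBytesWrittenSinceLastSleep = 0;
	INSTR_TIME_SET_CURRENT(last_throttle_time);
}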

> Maybe it wouldn't even be much additional work to figure out a way of
> making LWLocks care about interrupts. I think an upper limit on the
> relevant GUC is sufficient given the nature of what I propose to do,
> but I might be convinced if a better approach came to light. Do you
> have one?

I was vaguely thinking about something like
LWLockAcquireWithoutHoldingOffInterrupts() that could be used in
select locations.  But we'd have to think about how safe that really
is, and whether it is actually useful.

> It is certainly the case that it would be far easier for me to just
> deploy this on our own customer instances. If I'm expected to solve
> the problem of this throttling conceivably affecting backends
> executing read queries due to the CLogControlLock scenario you
> describe, just so users can have total assurance read queries are
> unaffected for a couple of hours or less once in a blue moon when
> they're fighting off a PANIC shutdown, then the bar is set almost
> impossibly high. This is unfortunate, because there is plenty of
> evidence that archive_command issues cause serious user pain all the
> time.

Maybe so, but I have trouble believing that very many users are
sophisticated enough to use a feature like this unless it's pretty
well-cooked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better handling of archive_command problems

From
Robert Haas
Date:
On Thu, May 16, 2013 at 10:06 PM, Daniel Farina <daniel@heroku.com> wrote:
> Do you have a sketch about mechanism to not encounter that problem?

I didn't until just now, but see my email to Peter.  That idea might
be all wet, but off-hand it seems like it might work...

> However little it may matter, I would like to disagree with your
> opinion on this one: the current situation as I imagine encountered by
> *all* users of archiving is really unpleasant, 'my' shop or no.  It
> would probably not be inaccurate to say that 99.9999% of archiving
> users have to live with a hazy control over the amount of data loss,
> only bounded by how long it takes for the system to fill up the WAL
> file system and then for PostgreSQL to PANIC and crash (hence, no more
> writes are processed, and no more data can be lost).

I'm really not trying to pick a fight here.  What I am saying is that
the problem is not so bad that we should accept the first design
proposed, despite the obvious problems with it, without trying to find
a better one.  The first post to -hackers on this was 4 days ago.
Feature freeze for the first release that could conceivably contain
this feature will most likely occur about 8 months from now.  We can
take a few more days or weeks to explore alternatives, and I believe
it will be possible to find good ones.  If I were trying to kill this
proposal, I could find better ways to do it than clearly articulating
my concerns at the outset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Better handling of archive_command problems

From
Daniel Farina
Date:
On Thu, May 16, 2013 at 9:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 16, 2013 at 10:06 PM, Daniel Farina <daniel@heroku.com> wrote:
>> Do you have a sketch about mechanism to not encounter that problem?
>
> I didn't until just now, but see my email to Peter.  That idea might
> be all wet, but off-hand it seems like it might work...
>
>> However little it may matter, I would like to disagree with your
>> opinion on this one: the current situation as I imagine encountered by
>> *all* users of archiving is really unpleasant, 'my' shop or no.  It
>> would probably not be inaccurate to say that 99.9999% of archiving
>> users have to live with a hazy control over the amount of data loss,
>> only bounded by how long it takes for the system to fill up the WAL
>> file system and then for PostgreSQL to PANIC and crash (hence, no more
>> writes are processed, and no more data can be lost).
>
> I'm really not trying to pick a fight here.  What I am saying is that
> the problem is not so bad that we should accept the first design
> proposed, despite the obvious problems with it, without trying to find
> a better one.  The first post to -hackers on this was 4 days ago.
> Feature freeze for the first release that could conceivably contain
> this feature will most likely occur about 8 months from now.  We can
> take a few more days or weeks to explore alternatives, and I believe
> it will be possible to find good ones.  If I were trying to kill this
> proposal, I could find better ways to do it than clearly articulating
> my concerns at the outset.

I know you are not trying to kill it or anything, and I didn't mean to
pick a fight either: my point is, without rancor, that I think that,
as-is, the state of affairs fails the sniff test on dimensions like
availability and exposure to data loss, much worse than some of the
negative effects of hobbling cancellations.

And I completely agree with you that, given the timing, one can let
this issue marinate for a while.

And, I hope it goes without saying, I am grateful for your noodling on
the matter, as always.