Re: Better handling of archive_command problems - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Better handling of archive_command problems |
Date | |
Msg-id | CA+Tgmoa2z4UK5JMc0vOZbYZ-O5WXBZi80JrZKa7DS0Uzsecudg@mail.gmail.com Whole thread Raw |
In response to | Re: Better handling of archive_command problems (Peter Geoghegan <pg@heroku.com>) |
List | pgsql-hackers |
On Thu, May 16, 2013 at 10:05 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> I don't think it's bad. I think that we shouldn't be paternalistic >>> towards our users. If anyone enables a setting like zero_damaged_pages >>> (or, say, wal_write_throttle) within their postgresql.conf >>> indefinitely for no good reason, then they're incompetent. End of >>> story. >> >> That's a pretty user-hostile attitude. > > I think paternalism is user-hostile. Things should be easy to user > correctly and hard to use incorrectly. I certainly think we should be > novice friendly, but not if that implies being expert hostile. I don't disagree with that. I am arguing that your proposal, as it stands, is far too easy to use incorrectly. > The fact that a PANIC shutdown can occur when the pg_xlog filesystem > runs out of space is pretty user-hostile. It's hostile to both novices > and experts. +1. >> No, but I'd feel better about it if you figured out a way avoid >> creating a scenario where it might lock up the entire database >> cluster. I am convinced that it is possible to avoid that, and that >> without that this is not a feature worthy of being included in >> PostgreSQL. > > What if the WALWriter slept on its proc latch within > XLogBackgroundFlush(), rather than calling pg_usleep? That way, > WalSigHupHandler() will set the process latch on a reload, and the > sleep will end immediately if the user determines that they've made a > mistake in setting the sleep. Ditto all other signals. As with all > extant latch sleeps, we wake on postmaster death, so that an > inordinately long sleep doesn't create a denial-of-service that > prevents a restart if, say, the postmaster receives SIGKILL. That seems like a step in the right direction. I'm not sure it's enough: it makes that particular backend a lot more responsive to interrupts, but other backends that may get caught waiting for the LWLocks it holds are still going to freeze up. Now, that no longer prevents you from reloading the configuration and getting unstuck (which is good), but it's still a lot of non-interruptible stuff (which is not good). I would also suggest considering other possible places to insert the sleep. One idea that occurs to me off-hand is to do it just before or just after each XLogInsert(). We might be holding a buffer lwlock at that point, though. That's probably not SO bad, but what is bad is that we might be extending pg_clog or pg_subtrans while sitting on the relevant SLRU lock, and that's the quintessentially wrong time to go into the tank. Actually, I wonder if we couldn't tie this into ProcessInterrupts(). If this feature is enabled, then after each XLogInsert(), we increment XLogBytesWrittenSinceLastSleep and set InterruptPending if it crosses some threshold (8kB?). In ProcessInterrupts, we add a new stanza at bottom of function. If that finds that XLogBytesWrittenSinceLastSleep exceeds the same threshold, then it does gettimeofday(), computes a sleep time based on the specified WAL write rate, the time since we last slept, and the number of WAL bytes written. It then resets XLogBytesWrittenSinceLastSleep to 0 and sleeps on the process latch for the calculated sleep time. Under this system, we never sleep while holding an LWLock (because LWLockAcquire calls HOLD_INTERRUPTS(), to be matched by LWLockRelease's call to RESUME_INTERRUPTS()), we can throttle the WAL write rate of each backend in bytes/second, and the overhead when the feature's not in use should be negligible. > Maybe it wouldn't even be much additional work to figure out a way of > making LWLocks care about interrupts. I think an upper limit on the > relevant GUC is sufficient given the nature of what I propose to do, > but I might be convinced if a better approach came to light. Do you > have one? I was vaguely thinking about something like LWLockAcquireWithoutHoldingOffInterrupts() that could be used in select locations. But we'd have to think about how safe that really is, and whether it is actually useful. > It is certainly the case that it would be far easier for me to just > deploy this on our own customer instances. If I'm expected to solve > the problem of this throttling conceivably affecting backends > executing read queries due to the CLogControlLock scenario you > describe, just so users can have total assurance read queries are > unaffected for a couple of hours or less once in a blue moon when > they're fighting off a PANIC shutdown, then the bar is set almost > impossibly high. This is unfortunate, because there is plenty of > evidence that archive_command issues cause serious user pain all the > time. Maybe so, but I have trouble believing that very many users are sophisticated enough to use a feature like this unless it's pretty well-cooked. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: