Re: Better handling of archive_command problems - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Better handling of archive_command problems |
Date | |
Msg-id | CAM3SWZS0D5vsP03fcfxb6rt7azb8_EHHB+zY0c6S0dJ4J0zC2g@mail.gmail.com |
In response to | Re: Better handling of archive_command problems (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: Better handling of archive_command problems |
List | pgsql-hackers |
On Thu, May 16, 2013 at 5:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> At times, like when the system is under really heavy load? Or at
> times, like depending on what the backend is doing? We can't do a
> whole lot about the fact that it's possible to beat a system to death
> so that, at the OS level, it stops responding. Linux is unfriendly
> enough to put processes into non-interruptible kernel wait states when
> they're waiting on the disk, a decision that I suspect to have been
> made by a sadomasochist.

I find it more plausible that the decision was made by someone making
an engineering trade-off.

> But if there are times when a system that is
> not responding to cancels in under a second when not particularly
> heavily loaded, I would consider that a bug, and we should fix it.

It's not as if the DBA is going to have a hard time figuring out why
that is. It's taking a long time to respond because they've throttled
the entire server. Clearly, if that's something they're doing very
casually, they have bigger problems.

>> I don't think it's bad. I think that we shouldn't be paternalistic
>> towards our users. If anyone enables a setting like zero_damaged_pages
>> (or, say, wal_write_throttle) within their postgresql.conf
>> indefinitely for no good reason, then they're incompetent. End of
>> story.
>
> That's a pretty user-hostile attitude.

I think paternalism is user-hostile. Things should be easy to use
correctly and hard to use incorrectly. I certainly think we should be
novice friendly, but not if that implies being expert hostile.

The fact that a PANIC shutdown can occur when the pg_xlog filesystem
runs out of space is pretty user-hostile. It's hostile to both novices
and experts.

> Configuration mistakes are a
> very common user error. If those configuration hose the system, users
> expect to be able to change them back, hit reload, and get things back
> on track. But you're proposing a GUC that, if set to a bad value,
> will very plausibly cause the entire system to freeze up in such a way
> that it won't respond to a reload request - or for that matter a fast
> shutdown request. I think that's 100% unacceptable. Despite what you
> seem to think, we've put a lot of work into ensuring interruptibility,
> and it does not make sense to abandon that principle for this or any
> other feature.
>
>> Would you feel better about it if the setting had a time-out? Say, the
>> user had to explicitly re-enable it after one hour at the most?
>
> No, but I'd feel better about it if you figured out a way avoid
> creating a scenario where it might lock up the entire database
> cluster. I am convinced that it is possible to avoid that, and that
> without that this is not a feature worthy of being included in
> PostgreSQL.

What if the WALWriter slept on its proc latch within
XLogBackgroundFlush(), rather than calling pg_usleep()? That way,
WalSigHupHandler() will set the process latch on a reload, and the
sleep will end immediately if the user determines that they've made a
mistake in setting the sleep. Ditto all other signals. As with all
extant latch sleeps, we wake on postmaster death, so that an
inordinately long sleep doesn't create a denial-of-service that
prevents a restart if, say, the postmaster receives SIGKILL.

Maybe it wouldn't even be much additional work to figure out a way of
making LWLocks care about interrupts. I think an upper limit on the
relevant GUC is sufficient given the nature of what I propose to do,
but I might be convinced if a better approach came to light. Do you
have one?

> Yeah, it's more work that way. But that's the difference
> between "a quick hack that is useful in our shop" and "a
> production-quality feature ready for a general audience".

It is certainly the case that it would be far easier for me to just
deploy this on our own customer instances. If I'm expected to solve
the problem of this throttling conceivably affecting backends
executing read queries due to the CLogControlLock scenario you
describe, just so users can have total assurance read queries are
unaffected for a couple of hours or less once in a blue moon when
they're fighting off a PANIC shutdown, then the bar is set almost
impossibly high. This is unfortunate, because there is plenty of
evidence that archive_command issues cause serious user pain all the
time.

-- 
Peter Geoghegan
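
To illustrate the latch-sleep idea proposed above, here is a minimal
sketch in Python rather than PostgreSQL's C. It is not PostgreSQL code:
Latch, throttle_sleep() and demo() are hypothetical names, a
threading.Event stands in for the process latch, and a timer thread
stands in for a signal handler such as WalSigHupHandler() setting that
latch. It only shows the behavioural difference from a plain timed
sleep: the sleeper returns as soon as the latch is set instead of
waiting out the whole delay.

```python
import threading
import time


class Latch:
    """Toy stand-in for a process latch (assumption: not PostgreSQL's API)."""

    def __init__(self):
        self._event = threading.Event()

    def set(self):
        # In the real proposal, a signal handler would do the equivalent of this.
        self._event.set()

    def reset(self):
        self._event.clear()

    def wait(self, timeout_s):
        # Returns True if the latch was set, False if the timeout elapsed.
        return self._event.wait(timeout_s)


def throttle_sleep(latch, delay_s):
    """Sleep for up to delay_s seconds, returning early if the latch is set.

    Returns a (woken_early, elapsed_seconds) tuple.
    """
    start = time.monotonic()
    woken_early = latch.wait(delay_s)
    latch.reset()  # re-arm the latch for the next sleep
    return woken_early, time.monotonic() - start


def demo():
    """Simulate a 'reload' arriving 0.1 s into a 5 s throttling sleep."""
    latch = Latch()
    threading.Timer(0.1, latch.set).start()
    return throttle_sleep(latch, 5.0)


if __name__ == "__main__":
    woken_early, elapsed = demo()
    print(f"woken early: {woken_early}, slept {elapsed:.2f}s of 5.00s")
```

With an unconditional sleep in place of latch.wait(), the same "reload"
would go unnoticed until the full delay had elapsed, which is the
lock-up scenario objected to in the quoted text.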