Re: Better handling of archive_command problems - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Better handling of archive_command problems
Msg-id: CA+TgmoZ=ZZOXFocUe0LpTynhrUcHGy=uHZmEYS5QLu4XF=t6mA@mail.gmail.com
In response to: Re: Better handling of archive_command problems (Daniel Farina <daniel@heroku.com>)
Responses: Re: Better handling of archive_command problems
List: pgsql-hackers
On Tue, May 14, 2013 at 12:23 AM, Daniel Farina <daniel@heroku.com> wrote:
> On Mon, May 13, 2013 at 3:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Has anyone else thought about approaches to mitigating the problems
>> that arise when an archive_command continually fails, and the DBA must
>> manually clean up the mess?
>
> Notably, the most common problem in this vein suffered at Heroku has
> nothing to do with archive_command failing, and everything to do with
> the ratio of block device write performance (hence, backlog) versus
> the archiving performance. When CPU is uncontended it's not a huge
> deficit, but it is there and it causes quite a bit of stress.
>
> Archive commands failing are definitely a special case there, where it
> might be nice to bring write traffic to exactly zero for a time.

One possible objection to this line of attack is that, IIUC, waits to acquire a LWLock are non-interruptible. If someone tells PostgreSQL to wait for some period of time before performing each WAL write, other backends that grab the WALWriteLock will not respond to query cancels during that time. Worse, the locks have a tendency to back up.

What I have observed is that if WAL isn't flushed in a timely fashion, someone will try to grab WALWriteLock while holding WALInsertLock. Now anyone who attempts to insert WAL is in a non-interruptible wait. If the system is busy, it won't be long before someone tries to extend pg_clog, and to do that they'll try to grab WALInsertLock while holding CLogControlLock. At that point, any CLOG lookup that misses in the already-resident pages will send that backend into a non-interruptible wait. I have seen cases where this pile-up occurs during a heavy pgbench workload and paralyzes the entire system, including any read-only queries, until the WAL write completes.
Now despite all that, I can see this being useful enough that Heroku might want to insert a very small patch into their version of PostgreSQL to do it this way, and just live with the downsides. But anything that can propagate non-interruptible waits across the entire system does not sound to me like a feature that is sufficiently polished that we want to expose it to users less sophisticated than Heroku (i.e. nearly all of them).

If we do this, I think we ought to find a way to make the waits interruptible, and to insert them in places where they really don't interfere with read-only backends. I'd probably also argue that we ought to try to design it such that the GUC can be in MB/s rather than delay/WAL writer cycle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company