On Fri, Sep 13, 2013 at 3:20 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 09/13/2013 12:55 PM, Merlin Moncure wrote:
>> what are the specific symptoms of your problem? anything interesting
>> in pg_locks? is $client willing to experiment with custom patches?
>
> 3 servers: 1 master, two replicas.
> 32-core Xeon, hyperthreaded to 64 cores
> 512GB RAM each
> s_b set to 8GB
> Load-balanced between all 3
> ~~ 11 different databases
> combined database size around 600GB
> using pgbouncer
>
> Irregularly, during periods of high activity (although not necessarily
> peak activity) one or another of the systems will go into paralysis,
> with all backends apparently waiting on LWLocks (we need more tracing
> information to completely confirm this). Activity at this time is
> usually somewhere between 50 and 100 concurrent queries (and 80 to 150
> connections). pg_locks doesn't think anything is waiting on a lock.
>
> What's notable is that sometimes it's just *one* of the replicas which
> goes into paralysis. If the master gets this issue though, the replicas
> experience it soon afterwards. Increasing wal_buffers from 16GB to 64GB
> seems to make this issue happen less frequently, but it doesn't go away
> entirely. Only a restart of the server, or killing all backend, ends
> the lockup.
>
> The workload is OLTP, essentially, around 20/80 write/read. They use
> PostGIS. The other notable thing about their workload is that due to an
> ORM defect, they get idle-in-transactions which last from 5 to 15
> seconds several times a minute.
>
> They are willing to use experimental patches, but only if those patches
> can be applied only to a replica.
ok, points similar:
*) master/slave config (two slaves for me)
*) 'big' server 256GB mem, 32 core
*) 80% approx. (perhaps more)
*) some spacial searching (but not very much)
*) OLTP
*) presentation of load, although in my case it did resolve anywhere
from 30 secs to half hour
*) aside from the spike, 100% healthy
points different
*) application side pooling: 96 app servers, max 5 connections each
(aside: are you using transaction mode pgbouncer?)
*) I saw gripes about relation extension in pg_locks
merlin