Home > mailing lists

Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers - Mailing list pgsql-hackers

From	Josh Berkus
Subject	Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Date	September 13, 2013 20:21:01
Msg-id	523373A4.3030304@agliodbs.com Whole thread Raw
In response to	Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers (Josh Berkus <josh@agliodbs.com>)
Responses	Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
List	pgsql-hackers

Tree view

On 09/13/2013 12:55 PM, Merlin Moncure wrote:
> what are the specific symptoms of your problem?  anything interesting
> in pg_locks?  is $client willing to experiment with custom patches?

3 servers: 1 master, two replicas.
32-core Xeon, hyperthreaded to 64 cores
512GB RAM each
s_b set to 8GB
Load-balanced between all 3
~~ 11 different databases
combined database size around 600GB
using pgbouncer

Irregularly, during periods of high activity (although not necessarily
peak activity) one or another of the systems will go into paralysis,
with all backends apparently waiting on LWLocks (we need more tracing
information to completely confirm this).  Activity at this time is
usually somewhere between 50 and 100 concurrent queries (and 80 to 150
connections).  pg_locks doesn't think anything is waiting on a lock.

What's notable is that sometimes it's just *one* of the replicas which
goes into paralysis.  If the master gets this issue though, the replicas
experience it soon afterwards.  Increasing wal_buffers from 16GB to 64GB
seems to make this issue happen less frequently, but it doesn't go away
entirely.  Only a restart of the server, or killing all backend, ends
the lockup.

The workload is OLTP, essentially, around 20/80 write/read.  They use
PostGIS.  The other notable thing about their workload is that due to an
ORM defect, they get idle-in-transactions which last from 5 to 15
seconds several times a minute.

They are willing to use experimental patches, but only if those patches
can be applied only to a replica.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

pgsql-hackers by date:

From: Merlin Moncure
Date: 13 September 2013, 19:55:20
Subject: Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

From: Andres Freund
Date: 13 September 2013, 20:26:27
Subject: Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers - Mailing list pgsql-hackers

Previous

Next