Thread: spinlocks storm bug
Hello
I have a report of a critical bug (the database becomes temporarily unavailable and a restart is necessary). The customer uses:
PostgreSQL 9.2.4,
24 CPU
140G RAM
SSD disks for everything
The server is under high load. There are a few databases with a very high number of similar simple statements. When the application produces higher load, the number of active connections increases to about 300-600.
At some point the event described above starts: there is minimal I/O and all CPUs are at 100%.
Perf result shows:
354246.00 93.0% s_lock /usr/lib/postgresql/9.2/bin/postgres
10503.00 2.8% LWLockRelease /usr/lib/postgresql/9.2/bin/postgres
8802.00 2.3% LWLockAcquire /usr/lib/postgresql/9.2/bin/postgres
828.00 0.2% _raw_spin_lock [kernel.kallsyms]
559.00 0.1% _raw_spin_lock_irqsave [kernel.kallsyms]
340.00 0.1% switch_mm [kernel.kallsyms]
305.00 0.1% poll_schedule_timeout [kernel.kallsyms]
274.00 0.1% native_write_msr_safe [kernel.kallsyms]
257.00 0.1% _raw_spin_lock_irq [kernel.kallsyms]
238.00 0.1% apic_timer_interrupt [kernel.kallsyms]
236.00 0.1% __schedule [kernel.kallsyms]
213.00 0.1% HeapTupleSatisfiesMVCC
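To see at a glance how much of the profile sits in the postgres binary versus the kernel, the sample counts can be summed per binary. This is a minimal Python sketch, assuming the column layout shown above (samples, percent, symbol, optional binary). The function name `summarize` is hypothetical, and the input is an excerpt of the rows above.

```python
from collections import defaultdict

# Excerpt of the perf rows quoted above (not the full profile).
PERF_LINES = """\
354246.00 93.0% s_lock /usr/lib/postgresql/9.2/bin/postgres
10503.00 2.8% LWLockRelease /usr/lib/postgresql/9.2/bin/postgres
8802.00 2.3% LWLockAcquire /usr/lib/postgresql/9.2/bin/postgres
828.00 0.2% _raw_spin_lock [kernel.kallsyms]
213.00 0.1% HeapTupleSatisfiesMVCC
"""


def summarize(text):
    """Sum sample counts per binary; rows without a binary column go under 'unknown'."""
    totals = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        samples = float(fields[0])
        binary = fields[3] if len(fields) > 3 else "unknown"
        totals[binary] += samples
    return dict(totals)


totals = summarize(PERF_LINES)
# Print binaries from most to fewest samples.
for binary, samples in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{samples:12.2f}  {binary}")
```

On this excerpt nearly all samples fall in the postgres binary itself, which matches the picture of backends spinning in user space rather than waiting in the kernel or on I/O.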
We are trying to limit the number of connections to 300, but I am not sure whether this issue is related to some Postgres bug.
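One way to picture such a cap is an admission gate in front of the database, which is roughly the role a connection pooler plays. This is a minimal Python sketch, not the customer's setup: `run_query` is a hypothetical stand-in rather than a real driver call, and `MAX_ACTIVE` is kept small for the demo instead of 300.

```python
import threading
import time

MAX_ACTIVE = 4  # illustrative only; the cap discussed here is 300
gate = threading.BoundedSemaphore(MAX_ACTIVE)
counter_lock = threading.Lock()
active = 0  # workers currently inside the gate
peak = 0    # highest value of `active` observed


def run_query(i):
    """Hypothetical stand-in for a statement sent to the database."""
    time.sleep(0.01)


def worker(i):
    global active, peak
    with gate:  # blocks while MAX_ACTIVE workers are already inside
        with counter_lock:
            active += 1
            peak = max(peak, active)
        try:
            run_query(i)
        finally:
            with counter_lock:
                active -= 1


threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent queries:", peak)
```

Excess clients wait at the gate instead of all contending inside the server at once, so the number of simultaneously running statements never exceeds the cap.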
Regards
Pavel
On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> I have a report of critical bug (database is temporary unavailability ..
> restart is necessary).
> PostgreSQL 9.2.4,
> 24 CPU
> 140G RAM
> SSD disc for all
>
> Database is under high load. There is a few databases with very high number
> of similar simple statements. When application produce higher load, then
> number of active connection is increased to 300-600 about.
>
> In some moment starts described event - there is a minimal IO, all CPU are
> on 100%.
>
> Perf result shows:
> 354246.00 93.0% s_lock /usr/lib/postgresql/9.2/bin/postgres
> 10503.00 2.8% LWLockRelease /usr/lib/postgresql/9.2/bin/postgres
> 8802.00 2.3% LWLockAcquire
>
> We try to limit a connection to 300, but I am not sure if this issue is not
> related to some Postgres bug.
We've seen this issue repeatedly now. None of those times did it turn out to
be a bug, just limitations in postgres' scalability. If you can, I'd
strongly suggest trying to get postgres binaries compiled with
-fno-omit-frame-pointer installed to check which locks are actually
contended.
My bet is BufMappingLock.
There's a CF entry about changing our lwlock implementation to scale
better...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/12/6 Andres Freund <andres@2ndquadrant.com>
> On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> > I have a report of critical bug (database is temporary unavailability ..
> > restart is necessary).
> > PostgreSQL 9.2.4,
> > 24 CPU
> > 140G RAM
> > SSD disc for all
> >
> > Database is under high load. There is a few databases with very high number
> > of similar simple statements. When application produce higher load, then
> > number of active connection is increased to 300-600 about.
> >
> > In some moment starts described event - there is a minimal IO, all CPU are
> > on 100%.
> >
> > Perf result shows:
> > 354246.00 93.0% s_lock /usr/lib/postgresql/9.2/bin/postgres
> > 10503.00 2.8% LWLockRelease /usr/lib/postgresql/9.2/bin/postgres
> > 8802.00 2.3% LWLockAcquire
> >
> > We try to limit a connection to 300, but I am not sure if this issue is not
> > related to some Postgres bug.
> We've seen this issue repeatedly now. None of the times it turned out to
> be a bug, but just limitations in postgres' scalability. If you can I'd
> strongly suggest trying to get postgres binaries compiled with
> -fno-omit-frame-pointer installed to check which locks are actually
> contended.
> My bet is BufMappingLock.
> There's a CF entry about changing our lwlock implementation to scale
> better...
One piece of information was missing: the customer's staff reduced shared_buffers from 30G to 5G without success. The database is about 20G.
Regards
Pavel