Thread: spinlocks storm bug

spinlocks storm bug

From
Pavel Stehule
Date:
Hello

I have a report of a critical bug (the database becomes temporarily unavailable .. a restart is necessary).


The customer uses:

PostgreSQL 9.2.4,
24 CPU
140G RAM
SSD disks for everything


The database server is under high load. There are a few databases with a very high number of similar, simple statements. When the application produces a higher load, the number of active connections increases to about 300-600.

At some point the event described above starts: there is minimal I/O and all CPUs are at 100%.

Perf result shows:


           354246.00 93.0% s_lock                           /usr/lib/postgresql/9.2/bin/postgres    
            10503.00  2.8% LWLockRelease                    /usr/lib/postgresql/9.2/bin/postgres    
             8802.00  2.3% LWLockAcquire                    /usr/lib/postgresql/9.2/bin/postgres    
              828.00  0.2% _raw_spin_lock                   [kernel.kallsyms]                       
              559.00  0.1% _raw_spin_lock_irqsave           [kernel.kallsyms]                       
              340.00  0.1% switch_mm                        [kernel.kallsyms]                       
              305.00  0.1% poll_schedule_timeout            [kernel.kallsyms]                       
              274.00  0.1% native_write_msr_safe            [kernel.kallsyms]                       
              257.00  0.1% _raw_spin_lock_irq               [kernel.kallsyms]                       
              238.00  0.1% apic_timer_interrupt             [kernel.kallsyms]                       
              236.00  0.1% __schedule                       [kernel.kallsyms]                       
              213.00  0.1% HeapTupleSatisfiesMVCC

We are trying to limit the number of connections to 300, but I am not sure whether this issue is related to some Postgres bug.

Regards

Pavel

Re: spinlocks storm bug

From
Andres Freund
Date:
On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> I have a report of critical bug (database is temporary unavailability ..
> restart is necessary).

> PostgreSQL 9.2.4,
> 24 CPU
> 140G RAM
> SSD disc for all
> 
> 
> Database is under high load. There is a few databases with very high number
> of similar simple statements. When application produce higher load, then
> number of active connection is increased to 300-600 about.
> 
> In some moment starts described event - there is a minimal IO, all CPU are
> on 100%.
> 
> Perf result shows:
>            354246.00 93.0% s_lock         /usr/lib/postgresql/9.2/bin/postgres
>             10503.00  2.8% LWLockRelease  /usr/lib/postgresql/9.2/bin/postgres
>              8802.00  2.3% LWLockAcquire

> We try to limit a connection to 300, but I am not sure if this issue is not
> related to some Postgres bug.

We've seen this issue repeatedly now. None of those times did it turn out
to be a bug, just limitations in postgres' scalability. If you can, I'd
strongly suggest trying to get postgres binaries compiled with
-fno-omit-frame-pointer installed to check which locks are actually
contended.
My bet is BufMappingLock.

There's a CF entry about changing our lwlock implementation to scale
better...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
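
[Editorial illustration] The profile above is what one would expect from heavy contention on a spinlock: backends that fail to take the lock keep retrying on the CPU, so time piles up in the spin loop while I/O stays near zero. The following is a minimal C11 sketch of a plain test-and-set spinlock under contention. It is not PostgreSQL's s_lock implementation, and every name in it (demo_shared, demo_spin_acquire, demo_run_contention, and so on) is hypothetical. It assumes a POSIX system with pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical demo types; not taken from the PostgreSQL sources. */
typedef struct {
    atomic_flag locked;   /* the spinlock itself */
    long        counter;  /* shared state protected by the lock */
    atomic_long spins;    /* failed test-and-set attempts, i.e. wasted CPU */
} demo_shared;

typedef struct {
    demo_shared *shared;
    long         iterations;
} demo_worker_arg;

static void demo_spin_acquire(demo_shared *s)
{
    /* test-and-set returns the previous value: true means the lock is held
     * by someone else, so we burn CPU and try again. */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        atomic_fetch_add_explicit(&s->spins, 1, memory_order_relaxed);
}

static void demo_spin_release(demo_shared *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}

static void *demo_worker(void *p)
{
    demo_worker_arg *arg = p;

    for (long i = 0; i < arg->iterations; i++) {
        demo_spin_acquire(arg->shared);
        arg->shared->counter++;          /* very short critical section */
        demo_spin_release(arg->shared);
    }
    return NULL;
}

#define DEMO_MAX_THREADS 64

/* Runs nthreads workers that all fight over one spinlock. Returns the final
 * counter value and, if spins_out is not NULL, stores the number of failed
 * lock attempts there. */
long demo_run_contention(int nthreads, long iterations, long *spins_out)
{
    demo_shared     shared = { .locked = ATOMIC_FLAG_INIT, .counter = 0 };
    demo_worker_arg arg = { &shared, iterations };
    pthread_t       threads[DEMO_MAX_THREADS];

    atomic_init(&shared.spins, 0);
    assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, demo_worker, &arg);
        assert(rc == 0);
        (void) rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    if (spins_out != NULL)
        *spins_out = atomic_load(&shared.spins);
    return shared.counter;
}
```

Calling demo_run_contention(4, 5000, &spins) from a main() should return 20000, since the lock keeps the increments correct. The spins value is the interesting part: it tends to grow as more threads compete for the same lock, which is the kind of wasted work that a profiler attributes to the spin loop. A real implementation such as PostgreSQL's adds backoff and sleeping, which this sketch deliberately leaves out.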



Re: spinlocks storm bug

From
Pavel Stehule
Date:



2013/12/6 Andres Freund <andres@2ndquadrant.com>
> On 2013-12-06 07:22:27 +0100, Pavel Stehule wrote:
> > I have a report of critical bug (database is temporary unavailability ..
> > restart is necessary).
>
> > PostgreSQL 9.2.4,
> > 24 CPU
> > 140G RAM
> > SSD disc for all
> >
> >
> > Database is under high load. There is a few databases with very high number
> > of similar simple statements. When application produce higher load, then
> > number of active connection is increased to 300-600 about.
> >
> > In some moment starts described event - there is a minimal IO, all CPU are
> > on 100%.
> >
> > Perf result shows:
> >            354246.00 93.0% s_lock         /usr/lib/postgresql/9.2/bin/postgres
> >             10503.00  2.8% LWLockRelease  /usr/lib/postgresql/9.2/bin/postgres
> >              8802.00  2.3% LWLockAcquire
>
> > We try to limit a connection to 300, but I am not sure if this issue is not
> > related to some Postgres bug.
>
> We've seen this issue repeatedly now. None of the times it turned out to
> be a bug, but just limitations in postgres' scalability. If you can I'd
> strongly suggest trying to get postgres binaries compiled with
> -fno-omit-frame-pointer installed to check which locks are actually
> contended.
> My bet is BufMappingLock.
>
> There's a CF entry about changing our lwlock implementation to scale
> better...


One missing piece of information: the customer's staff reduced shared_buffers from 30G to 5G, without success. The database is about 20G.

Regards

Pavel


 