Thread: Slow nested loop execution on larger server

Slow nested loop execution on larger server

From
Greg Smith
Date:
Forwarding this on from someone who may pop up on the list too.  This is
a new server that's running a query slowly.  The rate at which new
semops happen may be related.  OS is standard RHEL 5.5, kernel
2.6.18-194.32.1.el5.  One possibly interesting part is that this is
leading edge hardware:  four 8-core processors with HyperThreading, for
64 threads total:

processor       : 63
model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
physical id     : 3
siblings        : 16
cpu cores       : 8

My only vague memory of issues around systems like this that comes to
mind here is Scott Marlowe talking about some kernel tweaks he had to do
on his 48-core AMD boxes to get them to work right.  That seems like a
sketchy association, and I don't seem to have the details handy.

I do have the full query text and associated plans, but it's mostly
noise.  Here's the fun part:

->  Nested Loop  (cost=581.52..3930.81 rows=2 width=51) (actual
time=25.681..17562.198 rows=21574 loops=1)
       ->  Merge Join  (cost=581.52..783.07 rows=2 width=37) (actual
time=25.650..98.268 rows=21574 loops=1)
             Merge Cond: (rim.instrumentid = ri.instrumentid)
             ->  Index Scan using reportinstruments_part_125_pkey on
reportinstruments_part_125 rim  (cost=0.00..199.83 rows=110 width=8)
(actual time=0.033..27.180 rows=20555 loops=1)
                   Index Cond: (reportid = 105668)
             ->  Sort  (cost=581.52..582.31 rows=316 width=33) (actual
time=25.608..34.931 rows=21574 loops=1)
                   Sort Key: ri.instrumentid
                   Sort Method: quicksort  Memory: 2454kB
                   ->  Index Scan using riskbreakdown_part_51_pkey on
riskbreakdown_part_51 ri  (cost=0.00..568.40 rows=316 width=33) (actual
time=0.019..11.599 rows=21574 loops=1)
                         Index Cond: (reportid = 105668)
       ->  Index Scan using energymarketlist_pkey on energymarketlist
ip  (cost=0.00..1573.86 rows=1 width=18) (actual time=0.408..0.808
rows=1 loops=21574)
             Index Cond: ((reportid = 105668) AND (instrumentid =
ri.instrumentid))
...
Total runtime: 21250.377 ms

The stats are terrible and the estimates off by many orders of
magnitude.  But that's not the point.  It expected 2 rows and 21574 came
out; fine.  Why is it taking this server 17 seconds to process 21K rows
of tiny width through a Nested Loop?  Is it bouncing to a new CPU every
time the thing processes a row or something?  I'm finding it hard to
imagine how this could be a PostgreSQL problem; seems more like a kernel
bug aggravated on this system.  I wonder if we could produce a
standalone test case with a similar plan from what the bad query looks
like, and ask the person with this strange system to try it.  See if
it's possible to make it misbehave in the same way with something
simpler others can try, too.
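
Something like this might be close enough in shape (all of the table
names and row counts below are invented, just mimicking the slow plan:
a nested loop driving ~21K inner index probes):

CREATE TABLE risk   (reportid int, instrumentid int, payload text,
                     PRIMARY KEY (reportid, instrumentid));
CREATE TABLE market (reportid int, instrumentid int, payload text,
                     PRIMARY KEY (reportid, instrumentid));
INSERT INTO risk   SELECT 105668, g, 'x' FROM generate_series(1, 21574) g;
INSERT INTO market SELECT 105668, g, 'y' FROM generate_series(1, 21574) g;
ANALYZE risk; ANALYZE market;

-- Push the planner toward the same nested loop / inner index scan shape
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE
SELECT r.instrumentid, m.payload
FROM risk r
JOIN market m USING (reportid, instrumentid)
WHERE r.reportid = 105668;

If that runs in milliseconds on the odd server, the problem is probably
specific to the real tables; if it crawls, we have a reproducer anyone
can try.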

The only slow semops thread I found in the archives turned out to be I/O
bound.  This query is all CPU; combining a few examples here since this
is repeatable and I'm told it acts the same each time:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12634 xxx       25   0  131g 1.4g 1.4g R 100.1  0.6  11:24.20 postgres:
xxx xxx [local] SELECT

$ ./pidstat -d -T ALL -p 27683 1 100
15:28:46          PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
15:28:47        27683      0.00      0.00      0.00  postgres

This PostgreSQL 9.1.1 is a custom build, but looking at pg_config I don't
see anything special that might be related; it was optimized like this:

CFLAGS=-march=core2 -O2 -pipe

Here's a chunk of the observed and seemingly strange semop rate:

$ strace -fttp 12634
Process 12634 attached - interrupt to quit
15:38:55.966047 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:17.490133 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.532522 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.534874 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:17.603637 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:17.640646 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:17.658230 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:18.905137 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:33.396657 semop(16810092, 0x7fffd09dac20, 1) = 0
15:39:50.208516 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:54.640712 semop(16646247, 0x7fffd09dac20, 1) = 0
15:39:55.468458 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:55.488364 semop(16777323, 0x7fffd09dac00, 1) = 0
15:39:55.489344 semop(16777323, 0x7fffd09dac00, 1) = 0
Process 12634 detached
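
To put a number on the rate rather than eyeballing timestamps, attaching
in summary mode for a while and then interrupting should work (this is
plain strace, nothing exotic):

$ strace -c -p 12634
(let it run for 30 seconds or so, then Ctrl-C)

That prints per-syscall counts and cumulative time on detach, which
would show how much time this backend really spends inside semop()
versus everything else.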

pg_locks for PID 12634 shows every lock granted; nothing exciting there.
I asked how well this executes with enable_nestloop turned off, hoping
to see that next.
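
For reference, that experiment is just a session-local toggle, something
like this (the SELECT is a placeholder for the real query):

BEGIN;
SET LOCAL enable_nestloop = off;
EXPLAIN ANALYZE SELECT ... ;  -- the problem query goes here
ROLLBACK;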

This all seems odd, and I get interested and concerned when problems like
this start showing up specifically on newer hardware.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


Re: Slow nested loop execution on larger server

From
Scott Marlowe
Date:
On Fri, Dec 16, 2011 at 11:27 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Forwarding this on from someone who may pop up on the list too.  This is a
> new server that's running a query slowly.  The rate at which new semops
> happen may be related.  OS is standard RHEL 5.5, kernel 2.6.18-194.32.1.el5.
>  One possibly interesting part is that this is leading edge hardware:  four
> 8-core processors with HyperThreading, for 64 threads total:
>
> processor       : 63
> model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> physical id     : 3
> siblings        : 16
> cpu cores       : 8
>
> My only vague memory of issues around systems like this that comes to mind
> here is Scott Marlowe talking about some kernel tweaks he had to do on his
> 48-core AMD boxes to get them to work right.  That seems like a sketchy
> association, and I don't seem to have the details handy.
>
> I do have the full query text and associated plans, but it's mostly noise.
>  Here's the fun part:
>
> ->  Nested Loop  (cost=581.52..3930.81 rows=2 width=51) (actual
> time=25.681..17562.198 rows=21574 loops=1)
>      ->  Merge Join  (cost=581.52..783.07 rows=2 width=37) (actual
> time=25.650..98.268 rows=21574 loops=1)
>            Merge Cond: (rim.instrumentid = ri.instrumentid)
>            ->  Index Scan using reportinstruments_part_125_pkey on
> reportinstruments_part_125 rim  (cost=0.00..199.83 rows=110 width=8) (actual
> time=0.033..27.180 rows=20555 loops=1)
>                  Index Cond: (reportid = 105668)
>            ->  Sort  (cost=581.52..582.31 rows=316 width=33) (actual
> time=25.608..34.931 rows=21574 loops=1)
>                  Sort Key: ri.instrumentid
>                  Sort Method: quicksort  Memory: 2454kB
>                  ->  Index Scan using riskbreakdown_part_51_pkey on
> riskbreakdown_part_51 ri  (cost=0.00..568.40 rows=316 width=33) (actual
> time=0.019..11.599 rows=21574 loops=1)
>                        Index Cond: (reportid = 105668)
>      ->  Index Scan using energymarketlist_pkey on energymarketlist ip
>  (cost=0.00..1573.86 rows=1 width=18) (actual time=0.408..0.808 rows=1
> loops=21574)
>            Index Cond: ((reportid = 105668) AND (instrumentid =
> ri.instrumentid))
> ...
> Total runtime: 21250.377 ms
>
> The stats are terrible and the estimates off by many orders of magnitude.
>  But that's not the point.  It expected 2 rows and 21574 came out; fine.
>  Why is it taking this server 17 seconds to process 21K rows of tiny width
> through a Nested Loop?  Is it bouncing to a new CPU every time the thing
> processes a row or something?  I'm finding it hard to imagine how this could
> be a PostgreSQL problem; seems more like a kernel bug aggravated on this
> system.  I wonder if we could produce a standalone test case with a similar
> plan from what the bad query looks like, and ask the person with this
> strange system to try it.  See if it's possible to make it misbehave in the
> same way with something simpler others can try, too.

What's the difference in speed of running the query naked and with
explain analyze?  It's not uncommon to run into a situation where the
instrumentation of explain analyze costs significantly more than
anything in the query, and for some loops this gets especially bad.
If the naked query runs in say 1 second, and the explain analyze runs
in 17 seconds, then the explain analyze time keeping is costing you
16/17ths of the time to do the accounting.  If that's the case then
the real query is about 16 times faster than the explain analyzed one,
and the rows are being processed at about 25k/second.
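
In psql the comparison could be as simple as this (substituting the real
query both times):

\timing on
SELECT ... ;                  -- naked run, client-side elapsed time
EXPLAIN ANALYZE SELECT ... ;  -- instrumented run

If the first number is a few seconds and the second is ~17, most of what
you're measuring is probably the per-row timing overhead from the
instrumentation, not the join itself.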

Re: Slow nested loop execution on larger server

From
"Strange, John W"
Date:
The naked query runs a very long time and gets stuck in slow semop() syscalls in my strace output.  The explain plans
are probably more to do with our statistics problems, which we are clubbing through slowly but surely.  The main concern
was why the semop() calls are consistently slow; a restart of the server seems to have stopped the issue for now.
 

- John

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Scott
Marlowe
Sent: Friday, December 16, 2011 3:16 PM
To: Greg Smith
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow nested loop execution on larger server

On Fri, Dec 16, 2011 at 11:27 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> Forwarding this on from someone who may pop up on the list too.  This 
> is a new server that's running a query slowly.  The rate at which new 
> semops happen may be related.  OS is standard RHEL 5.5, kernel 2.6.18-194.32.1.el5.
>  One possibly interesting part is that this is leading edge hardware:  
> four 8-core processors with HyperThreading, for 64 threads total:
>
> processor       : 63
> model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz 
> physical id     : 3
> siblings        : 16
> cpu cores       : 8
>
> My only vague memory of issues around systems like this that comes to 
> mind here is Scott Marlowe talking about some kernel tweaks he had to 
> do on his
> 48-core AMD boxes to get them to work right.  That seems like a sketchy 
> association, and I don't seem to have the details handy.
>
> I do have the full query text and associated plans, but it's mostly noise.
>  Here's the fun part:
>
> ->  Nested Loop  (cost=581.52..3930.81 rows=2 width=51) (actual
> time=25.681..17562.198 rows=21574 loops=1)
>      ->  Merge Join  (cost=581.52..783.07 rows=2 width=37) (actual
> time=25.650..98.268 rows=21574 loops=1)
>            Merge Cond: (rim.instrumentid = ri.instrumentid)
>            ->  Index Scan using reportinstruments_part_125_pkey on
> reportinstruments_part_125 rim  (cost=0.00..199.83 rows=110 width=8) 
> (actual
> time=0.033..27.180 rows=20555 loops=1)
>                  Index Cond: (reportid = 105668)
>            ->  Sort  (cost=581.52..582.31 rows=316 width=33) (actual
> time=25.608..34.931 rows=21574 loops=1)
>                  Sort Key: ri.instrumentid
>                  Sort Method: quicksort  Memory: 2454kB
>                  ->  Index Scan using riskbreakdown_part_51_pkey on
> riskbreakdown_part_51 ri  (cost=0.00..568.40 rows=316 width=33) 
> (actual
> time=0.019..11.599 rows=21574 loops=1)
>                        Index Cond: (reportid = 105668)
>      ->  Index Scan using energymarketlist_pkey on energymarketlist ip
>  (cost=0.00..1573.86 rows=1 width=18) (actual time=0.408..0.808 rows=1
> loops=21574)
>            Index Cond: ((reportid = 105668) AND (instrumentid =
> ri.instrumentid))
> ...
> Total runtime: 21250.377 ms
>
> The stats are terrible and the estimates off by many orders of magnitude.
>  But that's not the point.  It expected 2 rows and 21574 came out; fine.
>  Why is it taking this server 17 seconds to process 21K rows of tiny 
> width through a Nested Loop?  Is it bouncing to a new CPU every time 
> the thing processes a row or something?  I'm finding it hard to 
> imagine how this could be a PostgreSQL problem; seems more like a 
> kernel bug aggravated on this system.  I wonder if we could produce a 
> standalone test case with a similar plan from what the bad query looks 
> like, and ask the person with this strange system to try it.  See if 
> it's possible to make it misbehave in the same way with something simpler others can try, too.

What's the difference in speed of running the query naked and with explain analyze?  It's not uncommon to run into a
situation where the instrumentation of explain analyze costs significantly more than anything in the query, and for some
loops this gets especially bad.
 
If the naked query runs in say 1 second, and the explain analyze runs in 17 seconds, then the explain analyze time
keeping is costing you 16/17ths of the time to do the accounting.  If that's the case then the real query is about 16
times faster than the explain analyzed one, and the rows are being processed at about 25k/second.
 


Re: Slow nested loop execution on larger server

From
Robert Haas
Date:
On Fri, Dec 16, 2011 at 1:27 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Forwarding this on from someone who may pop up on the list too.  This is a
> new server that's running a query slowly.  The rate at which new semops
> happen may be related.  OS is standard RHEL 5.5, kernel 2.6.18-194.32.1.el5.
>  One possibly interesting part is that this is leading edge hardware:  four
> 8-core processors with HyperThreading, for 64 threads total:
>
> processor       : 63
> model name      : Intel(R) Xeon(R) CPU           X7550  @ 2.00GHz
> physical id     : 3
> siblings        : 16
> cpu cores       : 8
>
> My only vague memory of issues around systems like this that comes to mind
> here is Scott Marlowe talking about some kernel tweaks he had to do on his
> 48-core AMD boxes to get them to work right.  That seems like a sketchy
> association, and I don't seem to have the details handy.
>
> I do have the full query text and associated plans, but it's mostly noise.
>  Here's the fun part:
>
> ->  Nested Loop  (cost=581.52..3930.81 rows=2 width=51) (actual
> time=25.681..17562.198 rows=21574 loops=1)
>      ->  Merge Join  (cost=581.52..783.07 rows=2 width=37) (actual
> time=25.650..98.268 rows=21574 loops=1)
>            Merge Cond: (rim.instrumentid = ri.instrumentid)
>            ->  Index Scan using reportinstruments_part_125_pkey on
> reportinstruments_part_125 rim  (cost=0.00..199.83 rows=110 width=8) (actual
> time=0.033..27.180 rows=20555 loops=1)
>                  Index Cond: (reportid = 105668)
>            ->  Sort  (cost=581.52..582.31 rows=316 width=33) (actual
> time=25.608..34.931 rows=21574 loops=1)
>                  Sort Key: ri.instrumentid
>                  Sort Method: quicksort  Memory: 2454kB
>                  ->  Index Scan using riskbreakdown_part_51_pkey on
> riskbreakdown_part_51 ri  (cost=0.00..568.40 rows=316 width=33) (actual
> time=0.019..11.599 rows=21574 loops=1)
>                        Index Cond: (reportid = 105668)
>      ->  Index Scan using energymarketlist_pkey on energymarketlist ip
>  (cost=0.00..1573.86 rows=1 width=18) (actual time=0.408..0.808 rows=1
> loops=21574)
>            Index Cond: ((reportid = 105668) AND (instrumentid =
> ri.instrumentid))
> ...
> Total runtime: 21250.377 ms
>
> The stats are terrible and the estimates off by many orders of magnitude.
>  But that's not the point.  It expected 2 rows and 21574 came out; fine.
>  Why is it taking this server 17 seconds to process 21K rows of tiny width
> through a Nested Loop?  Is it bouncing to a new CPU every time the thing
> processes a row or something?  I'm finding it hard to imagine how this could
> be a PostgreSQL problem; seems more like a kernel bug aggravated on this
> system.  I wonder if we could produce a standalone test case with a similar
> plan from what the bad query looks like, and ask the person with this
> strange system to try it.  See if it's possible to make it misbehave in the
> same way with something simpler others can try, too.
>
> The only slow semops thread I found in the archives turned out to be I/O
> bound.  This query is all CPU; combining a few examples here since this is
> repeatable and I'm told it acts the same each time:
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 12634 xxx       25   0  131g 1.4g 1.4g R 100.1  0.6  11:24.20 postgres: xxx
> xxx [local] SELECT
>
> $ ./pidstat -d -T ALL -p 27683 1 100
> 15:28:46          PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
> 15:28:47        27683      0.00      0.00      0.00  postgres
>
> This PostgreSQL 9.1.1 is a custom build, but looking at pg_config I don't see
> anything special that might be related; it was optimized like this:
>
> CFLAGS=-march=core2 -O2 -pipe
>
> Here's a chunk of the observed and seemingly strange semop rate:
>
> $ strace -fttp 12634
> Process 12634 attached - interrupt to quit
> 15:38:55.966047 semop(16646247, 0x7fffd09dac20, 1) = 0
> 15:39:17.490133 semop(16810092, 0x7fffd09dac20, 1) = 0
> 15:39:17.532522 semop(16810092, 0x7fffd09dac20, 1) = 0
> 15:39:17.534874 semop(16777323, 0x7fffd09dac00, 1) = 0
> 15:39:17.603637 semop(16777323, 0x7fffd09dac00, 1) = 0
> 15:39:17.640646 semop(16810092, 0x7fffd09dac20, 1) = 0
> 15:39:17.658230 semop(16810092, 0x7fffd09dac20, 1) = 0
> 15:39:18.905137 semop(16646247, 0x7fffd09dac20, 1) = 0
> 15:39:33.396657 semop(16810092, 0x7fffd09dac20, 1) = 0
> 15:39:50.208516 semop(16777323, 0x7fffd09dac00, 1) = 0
> 15:39:54.640712 semop(16646247, 0x7fffd09dac20, 1) = 0
> 15:39:55.468458 semop(16777323, 0x7fffd09dac00, 1) = 0
> 15:39:55.488364 semop(16777323, 0x7fffd09dac00, 1) = 0
> 15:39:55.489344 semop(16777323, 0x7fffd09dac00, 1) = 0
> Process 12634 detached
>
> pg_locks for PID 12634 shows every lock granted; nothing exciting there.  I
> asked how well this executes with enable_nestloop turned off, hoping to see
> that next.
>
> This all seems odd, and I get interested and concerned when problems like
> this start showing up specifically on newer hardware.

Ridiculously late response here, but IME semop() calls typically
indicate LWLock contention.  With a stock build it's pretty well
impossible to figure out which LWLock is being contended; compiling
with LWLOCK_STATS could tell you that.
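
For anyone following along: LWLOCK_STATS is a compile-time switch, not a
GUC, so it means rebuilding, roughly along these lines (the install
prefix here is just an example):

./configure CPPFLAGS="-DLWLOCK_STATS" --prefix=/usr/local/pgsql-lwstats
make && make install

A backend built that way dumps per-LWLock acquisition and block counts
to stderr when it exits, which normally ends up in the server log.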

Shooting from the hip, the first thing that comes to mind is that the
index isn't fully cached in shared_buffers, and every time you hit a
page that isn't there you have to acquire BufFreelistLock to run the
clock sweep.  If the lock is uncontended then you wouldn't get system
calls, but if there's other activity on the system you might get
something like this.
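
If pg_buffercache is available on that box, one way to sanity-check that
theory would be to see how much of the inner-side index is actually
resident in shared_buffers (this is the standard contrib module, nothing
custom):

CREATE EXTENSION pg_buffercache;  -- once, as superuser

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint)
         AS cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
  AND c.relname = 'energymarketlist_pkey'
GROUP BY c.relname;

Compare that against pg_relation_size('energymarketlist_pkey'); if only a
small fraction is cached, every miss means a buffer allocation and a trip
through the clock sweep.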

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company