Thread: How to investigate slow insert problem

How to investigate slow insert problem

From
Rural Hunter
Date:
Hi,

I'm on 9.2.4 with an Ubuntu server. There are usually hundreds of
connections doing the same insert with different data from different
networks every minute, through pgbouncer in the same network as the
database server. The database has been running for about one year
without problems. Yesterday I hit a problem: the connection count
limit of the database server was reached. I checked the connections and
found that there were many inserts hanging there. I checked the
load (cpu, memory, io) of the db server but everything seemed fine. I
also checked the pg log and only found one "incomplete message
from client" error message every few minutes. Then I recycled
pgbouncer and kept monitoring the connections. I found the majority of
the inserts finish quickly, but every minute there are several inserts
left and they seem to hang there. So after a while, the connection limit is
reached again. Besides those inserts, there are no other long-running
queries or autovacuums. I also checked the locks of the inserts and
found they were all granted. The insert statement itself is very simple
and it only inserts one row, but there are some triggers involved. They
might impact performance, but I have never experienced any problem since the
majority of the inserts are fine. The problem persisted for about 1-2 hours.
I didn't do anything except recycling pgbouncer a few times. After that
period, everything went back to normal. It has been 24 hours and it
hasn't happened again.

From the error message in the pg log, I suspect it might be a network
problem from some clients. Could anyone point out if there are other
possible causes? I'm also wondering what those inserts are actually doing
when they are hanging there, such as whether they are in the trigger
or not. Is there anything I can get similar to the connection snapshots in db2?


Re: How to investigate slow insert problem

From
Sergey Konoplev
Date:
On Mon, Aug 19, 2013 at 6:44 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
> I'm on 9.2.4 with an Ubuntu server. There are usually hundreds of connections
> doing the same insert with different data from different networks every
> minute, through pgbouncer in the same network as the database server. The
> database has been running for about one year without problems. Yesterday I
> hit a problem: the connection count limit of the database server was reached.
> I checked the connections and found that there were many inserts hanging
> there. I checked the load (cpu, memory, io) of the db server but everything
> seemed fine. I also checked the pg log and only found one "incomplete message
> from client" error message every few minutes. Then I recycled pgbouncer and
> kept monitoring the connections. I found the majority of the inserts finish
> quickly, but every minute there are several inserts left and they seem to
> hang there. So after a while, the connection limit is reached again. Besides
> those inserts, there are no other long-running queries or autovacuums. I also
> checked the locks of the inserts and found they were all granted. The insert
> statement itself is very simple and it only inserts one row, but there are
> some triggers involved. They might impact performance, but I have never
> experienced any problem since the majority of the inserts are fine. The
> problem persisted for about 1-2 hours. I didn't do anything except recycling
> pgbouncer a few times. After that period, everything went back to normal. It
> has been 24 hours and it hasn't happened again.
>
> From the error message in the pg log, I suspect it might be a network problem
> from some clients. Could anyone point out if there are other possible causes?
> I'm also wondering what those inserts are actually doing when they are
> hanging there, such as whether they are in the trigger or not. Is there
> anything I can get similar to the connection snapshots in db2?

What do you mean by recycling pgbouncer?

Haven't you noticed what was in the state column of the
pg_stat_activity view? In 9.2 the query column in this view shows the
last statement that was executed in this connection, and it does not
mean that this statement is working at the moment of monitoring. If
the state is active, then it was working; however, my assumption is
that it was idle in transaction. You mentioned the "incomplete message
from client" error, so it might somehow be a network problem that led
to a hanging connection to pgbouncer, which made pgbouncer keep a
connection to postgres after the transaction was started.
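
For example, next time you could capture the state of those backends with a
query along these lines (just a sketch; the one-minute threshold is arbitrary,
adjust it to your case):

SELECT pid, state, waiting,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '1 minute'
ORDER BY query_start;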

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: How to investigate slow insert problem

From
Rural Hunter
Date:
On 2013/8/20 10:38, Sergey Konoplev wrote:
> On Mon, Aug 19, 2013 at 6:44 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
> What do you mean by recycling pgbouncer?
I mean restarting pgbouncer.
> Haven't you noticed what was in the state column of the
> pg_stat_activity view? In 9.2 the query column in this view shows the
> last statement that was executed in this connection, and it does not
> mean that this statement is working at the moment of monitoring. If
> the state is active, then it was working; however, my assumption is
> that it was idle in transaction.
No, they are all in the 'active' state.
> You mentioned the "incomplete message from client" error, so it might
> somehow be a network problem that led to a hanging connection to
> pgbouncer, which made pgbouncer keep a connection to postgres after
> the transaction was started.
pgbouncer and the db server are in the same local network and there
shouldn't be any network problem between them. I also ran ping from the
pgbouncer server to the db server and there was no problem.


Re: How to investigate slow insert problem

From
Sergey Konoplev
Date:
On Mon, Aug 19, 2013 at 7:45 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
>> You mentioned the "incomplete message from client" error, so it might
>> somehow be a network problem that led to a hanging connection to pgbouncer,
>> which made pgbouncer keep a connection to postgres after the transaction
>> was started.
>
> pgbouncer and the db server are in the same local network and there
> shouldn't be any network problem between them. I also ran ping from
> pgbouncer server to the db server and there was no problem.

Next time, when you face this again, set log_min_duration_statement to
a value less than the age of the hanging inserts, and set debug_print_parse,
debug_print_rewritten, debug_print_plan and debug_pretty_print to
'on'. It will allow you to log what is happening with these inserts
and what takes so much time.
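
For example, assuming your database is named mydb (replace the name and the
threshold with yours), a superuser can set these for all new connections to
that database with:

ALTER DATABASE mydb SET log_min_duration_statement = '1s';
ALTER DATABASE mydb SET debug_print_parse = on;
ALTER DATABASE mydb SET debug_print_rewritten = on;
ALTER DATABASE mydb SET debug_print_plan = on;
ALTER DATABASE mydb SET debug_pretty_print = on;

The same settings can also go into postgresql.conf followed by
SELECT pg_reload_conf(); remember to turn the debug_print_* settings off
afterwards, as they make the log very verbose.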

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp
+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979
gray.ru@gmail.com


Re: How to investigate slow insert problem

From
Jeff Janes
Date:
On Monday, August 19, 2013, Rural Hunter wrote:

> Hi,
>
> I'm on 9.2.4 with an Ubuntu server. There are usually hundreds of connections
> doing the same insert with different data from different networks every
> minute, through pgbouncer in the same network as the database server. The
> database has been running for about one year without problems. Yesterday I
> hit a problem: the connection count limit of the database server was reached.

I think that this should generally not happen at the server if you are using pgbouncer, as you should configure it so that pgbouncer has a lower limit than postgresql itself does.  What pooling method (session, transaction, statement) are you using?
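
A quick, rough way to see how close you are to the server-side limit and what
the sessions are doing is something like:

SHOW max_connections;

SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;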
 
> I checked the connections and found that there were many inserts hanging
> there. I checked the load (cpu, memory, io) of the db server but everything
> seemed fine.

Can you provide some example numbers for the io load?

> I also checked the pg log and only found one "incomplete message from
> client" error message every few minutes.

Could you post the complete log message and a few lines of context around it?

> Then I recycled pgbouncer and kept monitoring the connections. I found the
> majority of the inserts finish quickly, but every minute there are several
> inserts left and they seem to hang there.

How long had they been hanging there?  It makes a big difference whether there are several hanging there at one moment, but a few milliseconds later there are several different ones, versus the same few that hang around for many seconds or minutes at a time.

...

> From the error message in the pg log, I suspect it might be a network problem
> from some clients. Could anyone point out if there are other possible causes?

If the identities of the "hung" processes are rapidly changing, it could just be that you are hitting a throughput limit. When you do a lot of inserts into indexed tables, the performance can drop precipitously once the size of the actively updated part of the indexes exceeds shared_buffers.  This would usually show up in the io stats, but if you always have a lot of io going on, it might not be obvious.
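
If you want to check that theory, compare shared_buffers with the total index
size of the table the inserts go into. 'my_table' below is just a placeholder
for your real table name:

SHOW shared_buffers;

SELECT pg_size_pretty(pg_relation_size('my_table')) AS table_size,
       pg_size_pretty(pg_indexes_size('my_table'))  AS total_index_size;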

If it is the same few processes hung for long periods, I would strace them, or gdb them and get a backtrace.

 
> I'm also wondering what those inserts are actually doing when they are
> hanging there, such as whether they are in the trigger or not. Is there
> anything I can get similar to the connection snapshots in db2?

Sorry, I don't know what a connection snapshot in db2 looks like.
 

Cheers,

Jeff

Re: How to investigate slow insert problem

From
Rural Hunter
Date:
<div class="moz-cite-prefix">于 2013/8/20 12:34, Jeff Janes 写道:<br /></div><blockquote
cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com"type="cite">On Monday, August 19, 2013,
RuralHunter wrote: <div><br /></div><div>I think that this should generally not happen at the server if you are using
pgbouncer,as you should configure it so that pgbouncer has a lower limit than postgresql itself does.  What pooling
method(session, transaction, statement) are you using?</div></blockquote> statement. Currently, I set the limit of
pgbouncerconnection to same as db connection. But I also have a few connections connecting to db server directly.
<blockquotecite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com" type="cite"><div><br
/></div><div>Canyou provide some example numbers for the io load?</div></blockquote> I get some when the connection
limitis reached(The database related storage is on sdb/sdd/sde/sdf):<br /> root@ubtserver:~# iostat -xm 3<br /> Linux
3.5.0-22-generic(ubuntu)     2013年08月19日     _x86_64_    (32 CPU)<br /><br /> avg-cpu:  %user   %nice %system %iowait 
%steal  %idle<br />           14.71    0.00    2.86    0.48    0.00   81.96<br /><br /> Device:         rrqm/s  
wrqm/s    r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br /> sda              
0.00    0.26    0.04    0.36     0.00     0.00    24.71     0.00    0.55    3.01    0.30   0.29   0.01<br />
sdb              0.00     0.26    0.18    2.32     0.02     0.38   329.50     0.01    5.36    1.26    5.69   0.21  
0.05<br/> sdc               0.01     4.59   10.13   45.75     0.30     0.92    44.65     0.05    5.14    7.49    4.62  
0.63  3.50<br /> dm-0              0.00     0.00    0.00    0.01     0.00     0.00     8.00     0.00    6.37    6.38   
6.36  3.62   0.00<br /> sdd               0.00     0.42    0.02   42.87     0.00     0.46    22.12     0.03    0.78  
14.09   0.77   0.49   2.10<br /> sde               0.00     3.68   10.23  156.41     0.19     1.45    20.06     0.03   
1.59  21.34    0.29   0.51   8.55<br /> sdf               0.00     2.56    6.29   66.00     0.29     0.71    28.42    
0.04   0.56    4.52    0.19   0.37   2.71<br /><br /> avg-cpu:  %user   %nice %system %iowait  %steal   %idle<br />
         13.99    0.00    1.91    1.04    0.00   83.06<br /><br /> Device:         rrqm/s   wrqm/s     r/s     w/s   
rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br /> sda               0.00     0.00    0.33   
0.00    0.00     0.00    16.00     0.00    4.00    4.00    0.00   4.00   0.13<br /> sdb               0.00     0.00   
0.00   0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br /> sdc               0.00   
15.33   5.33   14.33     0.13     0.21    34.98     0.03    1.63    6.00    0.00   1.02   2.00<br /> dm-0             
0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br />
sdd              0.00     0.00    0.00   31.33     0.00     0.26    17.19     0.01    0.34    0.00    0.34   0.34  
1.07<br/> sde               0.00     0.00   43.00  163.67     0.59     1.29    18.55     2.56   21.34   72.06    8.01  
1.69 34.93<br /> sdf               0.00     0.00    6.00   62.00     0.17     0.55    21.88     0.49    7.16    5.56   
7.31  0.27   1.87<br /><br /> avg-cpu:  %user   %nice %system %iowait  %steal   %idle<br />           15.84    0.00   
2.63   1.70    0.00   79.83<br /><br /> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
avgqu-sz  await r_await w_await  svctm  %util<br /> sda               0.00     1.67    0.00    2.00     0.00    
0.01   14.67     0.07   33.33    0.00   33.33  25.33   5.07<br /> sdb               0.00     0.00    0.00    0.00    
0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br /> sdc               0.00     0.00    4.67   
0.00    0.06     0.00    26.29     0.13    6.29    6.29    0.00  25.14  11.73<br /> dm-0              0.00     0.00   
0.00   0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br /> sdd               0.00    
0.33   0.00   49.00     0.00     0.39    16.49     0.02    0.35    0.00    0.35   0.35   1.73<br /> sde              
0.00   11.00   30.67   81.33     0.38     0.71    19.98    36.46  143.19   43.91  180.62   2.69  30.13<br />
sdf              0.00     9.33    3.00  326.00     0.09     2.75    17.69     3.51   10.66    5.33   10.71   0.11  
3.60<br/><br /> avg-cpu:  %user   %nice %system %iowait  %steal   %idle<br />           14.99    0.00    2.39   
4.89   0.00   77.74<br /><br /> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz  
awaitr_await w_await  svctm  %util<br /> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
0.00   0.00    0.00    0.00   0.00   0.00<br /> sdb               0.00     0.00    0.00    0.00     0.00     0.00    
0.00    0.00    0.00    0.00    0.00   0.00   0.00<br /> sdc               0.00    19.67    7.33   29.00     0.09    
0.60   38.61     1.18   35.41  175.45    0.00  15.93  57.87<br /> dm-0              0.00     0.00    0.00    0.00    
0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br /> sdd               0.00     0.33    0.00  
39.33    0.00     0.31    15.93     0.01    0.37    0.00    0.37   0.37   1.47<br /> sde               0.00    11.33  
29.67 312.67     0.39     2.51    17.34    87.15  314.23  108.13  333.78   2.84  97.20<br /> sdf               0.00    
0.00   8.33    0.00     0.17     0.00    42.24     0.05    6.56    6.56    0.00   2.40   2.00<br /><br /> avg-cpu: 
%user  %nice %system %iowait  %steal   %idle<br />           14.98    0.00    2.23    5.45    0.00   77.34<br /><br />
Device:        rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm 
%util<br/> sda               0.00     0.00    0.00    0.67     0.00     0.01    20.00     0.00    0.00    0.00   
0.00  0.00   0.00<br /> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   
0.00   0.00   0.00   0.00<br /> sdc               0.00     9.67   10.00    6.00     0.12     0.10    27.83     0.08   
5.08   8.13    0.00   1.42   2.27<br /> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00    
0.00   0.00    0.00    0.00   0.00   0.00<br /> sdd               0.00     0.00    0.00   44.33     0.00     0.35   
16.00    0.03    0.72    0.00    0.72   0.72   3.20<br /> sde               0.00     0.00   47.33    0.00     0.58    
0.00   25.18     5.26  111.04  111.04    0.00  19.10  90.40<br /> sdf               0.00    11.00    3.33  683.33    
0.12    7.38    22.37    12.05   17.54  244.00   16.44   0.49  33.33<br /><br /> avg-cpu:  %user   %nice %system
%iowait %steal   %idle<br />           15.21    0.00    2.54    0.56    0.00   81.69<br /><br /> Device:        
rrqm/s  wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br />
sda              0.00     2.00    0.00    1.00     0.00     0.01    24.00     0.00    0.00    0.00    0.00   0.00  
0.00<br/> sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00  
0.00  0.00<br /> sdc               0.00     0.00   14.33    2.00     0.20     0.39    73.80     0.07    4.08    4.65   
0.00  2.37   3.87<br /> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   
0.00   0.00   0.00   0.00<br /> sdd               0.00     0.33    0.00   62.00     0.00     0.52    17.08     0.02   
0.34   0.00    0.34   0.34   2.13<br /> sde               0.00     9.67   30.67  157.33     0.43     1.27    18.54    
1.75   9.33   15.91    8.04   1.09  20.53<br /> sdf               0.00     9.67    6.67    0.67     0.13     0.04   
46.91    0.04    5.09    5.60    0.00   2.36   1.73<br /><br /> avg-cpu:  %user   %nice %system %iowait  %steal  
%idle<br/>           14.72    0.00    1.95    0.58    0.00   82.76<br /><br /> Device:         rrqm/s   wrqm/s    
r/s    w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util<br /> sda               0.00    
0.00   0.00    2.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00<br /> sdb              
0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00<br />
sdc              0.00    13.67    5.33   32.33     0.07     0.31    20.46     0.04    1.03    7.25    0.00   0.46  
1.73<br/> dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00  
0.00  0.00<br /> sdd               0.00     0.00    0.00   42.00     0.00     0.35    17.27     0.03    0.79    0.00   
0.79  0.79   3.33<br /> sde               0.00     0.33   48.00  804.00     0.61     6.34    16.71     8.38    9.82  
14.11   9.57   0.23  19.20<br /> sdf               0.00     0.00    8.00  463.00     0.09     4.12    18.30     5.00  
10.62   7.17   10.68   0.11   5.20<br /><blockquote
cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com"type="cite"><div><br /></div><div><br
/></div><div>Couldyou post the complete log message and a few lines of context around it?</div></blockquote> There is
nocontext from the same connection around that message.  <blockquote
cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com"type="cite"><div>How long had they been
hangingthere?  It makes a big difference whether there are several hanging there at one moment, but a few milliseconds
laterthere are several different ones, versus the same few that hang around of many seconds or minutes at a
time.</div></blockquote>The hanging connections never disappear until I restart pgbouncer. It's like this, At minute 1,
3connections left. At minute 2, another 3 left, total 6. Another minute, another 3 left, total 9....till the limit
reaches.<br/><blockquote cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com" type="cite"><br
/><div>Ifthe identities of the "hung" processes are rapidly changing, it could just be that you are hitting a
throughputlimit. When you do a lot of inserts into indexed the tables, the performance can drop precipitously once the
sizeof the actively updated part of the indexes exceeds shared_buffers.  This would usually show up in the io stats,
butif you always have a lot of io going on, it might not be obvious.</div><div><br /></div><div>If it is the same few
processeshung for long periods, I would strace them, or gdb them and get a backtrace.</div></blockquote> any detail
guideto use strace/gdb on pg process? <blockquote
cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com"type="cite"><div>Sorry, I don't know what
aconnection snapshot in db2 looks like.</div></blockquote><a class="moz-txt-link-freetext"
href="http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.cmd.doc%2Fdoc%2Fr0001945.html">http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.cmd.doc%2Fdoc%2Fr0001945.html</a><br
/>search for "<code>get snapshot for application". Note: some items in the sample are marked as "</code><code>Not
Collected"because some monitor flags are turned off.<br /></code><blockquote
cite="mid:CAMkU=1zWk8dmxCi4uGB4fCmOrb5BvwVB0+Bspu_1gmyO7UUjsg@mail.gmail.com"type="cite"><div> </div><div><br
/></div><div>Cheers,</div><div><br/></div><div>Jeff</div></blockquote><br /> 

Re: How to investigate slow insert problem

From
Jeff Janes
Date:
On Mon, Aug 19, 2013 at 10:30 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
> On 2013/8/20 12:34, Jeff Janes wrote:
>
> > How long had they been hanging there?  It makes a big difference whether
> > there are several hanging there at one moment, but a few milliseconds later
> > there are several different ones, versus the same few that hang around for
> > many seconds or minutes at a time.
>
> The hanging connections never disappear until I restart pgbouncer. It's like
> this: at minute 1, 3 connections are left. At minute 2, another 3 are left,
> total 6. Another minute, another 3, total 9... until the limit is reached.

OK, that certainly does sound like network problems and not disk contention.  But what I don't see is why it would be listed as "active" in pg_stat_activity.  If it is blocked on a network connection, I would think it would show 'idle'.

>
> > If it is the same few processes hung for long periods, I would strace them,
> > or gdb them and get a backtrace.
>
> any detail guide to use strace/gdb on pg process?

strace and gdb are aggressive forms of monitoring, so you should talk to your sys admins before using them (if your organization has that type of segregation of duties) or read their documentation.

It is best to run them on a test system if you can get the problem to occur there.  If you can't, then I would be willing to run them against a hung backend on a production system, but I might be more adventurous than most.

I have occasionally seen strace cause the straced program to seg-fault, but I've only seen this with GUI programs and never with postgresql.

In theory attaching gdb could make a backend pause while holding a spinlock, and if left that way for long enough could cause a postgresql-wide panic over a stuck spinlock.  But I've never seen this happen without intentionally causing it.  If the backend you attach to is truly hung and hasn't already caused a panic, then it almost surely can't happen, but just to be sure you should quit gdb as soon as possible so that the backend can continue unimpeded.

Anyway, both are fairly easy to run once you find the pid of one of the stuck backends (either using ps, or using pg_stat_activity).  Then you give the pid to the debugging program with the -p option.
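
For example, to pick out the pids of the oldest 'active' backends (a rough sketch, adjust to taste):

SELECT pid, now() - query_start AS query_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start
LIMIT 5;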

You probably have to run as the postgres user, or you won't have permissions to attach to the backend.

With strace, once you attach you will see a stream of system calls go to your screen, until you hit ctrl-C to detach.  But if the backend is hung on a network connection, there should really only be one system call, and then it just waits until you detach or the network connection times out, like this:

$ strace -p 21116
Process 21116 attached - interrupt to quit
recvfrom(9,

So it is waiting on a recvfrom call, and that call never returned until I got sick of waiting and hit ctrl-C.  Not very interesting, but it does show it is indeed stuck on the network.

For gdb, the invocation is similar:

$ gdb -p 21116

and it then produces several screenfuls of diagnostic gibberish and gives you an interactive command-line environment.  Once attached, you want to get a backtrace ("bt", return), and then quit promptly ("q", return, "y").

That produces something like this:

Loaded symbols for /lib64/libnss_files.so.2
0x00000032a80e9672 in __libc_recv (fd=<value optimized out>, buf=0xb33f60, n=8192, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:30
30          return INLINE_SYSCALL (recvfrom, 6, fd, buf, n, flags, NULL, NULL);
(gdb) bt
#0  0x00000032a80e9672 in __libc_recv (fd=<value optimized out>, buf=0xb33f60, n=8192, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:30
#1  0x00000000005a4846 in secure_read (port=0x22f1190, ptr=0xb33f60, len=8192) at be-secure.c:304
#2  0x00000000005ae33b in pq_recvbuf () at pqcomm.c:824
#3  0x00000000005ae73b in pq_getbyte () at pqcomm.c:865
#4  0x0000000000651c11 in SocketBackend (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>)
    at postgres.c:342
#5  ReadCommand (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>) at postgres.c:490
#6  PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x22d2a28 "jjanes", username=<value optimized out>) at postgres.c:3923
#7  0x000000000060e861 in BackendRun () at postmaster.c:3614
#8  BackendStartup () at postmaster.c:3304
#9  ServerLoop () at postmaster.c:1367
#10 0x00000000006113b1 in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1127
#11 0x00000000005b0440 in main (argc=3, argv=0x22d1cb0) at main.c:199
(gdb) q
A debugging session is active.

        Inferior 1 [process 21116] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/pgsql9_2/bin/postgres, process 21116

Cheers,

Jeff

Re: How to investigate slow insert problem

From
Rural Hunter
Date:
Hi Jeff,

Thanks a lot for such a detailed guide!




Re: How to investigate slow insert problem

From
Matheus de Oliveira
Date:
(@Jeff, sorry I sent this message only to you by mistake, sending to the list now...)

On Tue, Aug 20, 2013 at 9:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Aug 19, 2013 at 10:30 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
> > On 2013/8/20 12:34, Jeff Janes wrote:
> >
> > > How long had they been hanging there?  It makes a big difference whether
> > > there are several hanging there at one moment, but a few milliseconds
> > > later there are several different ones, versus the same few that hang
> > > around for many seconds or minutes at a time.
> >
> > The hanging connections never disappear until I restart pgbouncer. It's
> > like this: at minute 1, 3 connections are left. At minute 2, another 3 are
> > left, total 6. Another minute, another 3, total 9... until the limit is
> > reached.
>
> OK, that certainly does sound like network problems and not disk contention.
> But what I don't see is why it would be listed as "active" in
> pg_stat_activity.  If it is blocked on a network connection, I would think
> it would show 'idle'.

IIRC, the "state" column will show if the query on "query" column is really running or not (by not I mean, it is "idle[ in transaction]"), the column "waiting" is the one that we should look at to see if the backend is really blocked, which is the case if waiting is true. If it is true, then we should check at pg_locks to see who is blocking it, [1] and [2] has good queries for that.

[1] http://wiki.postgresql.org/wiki/Lock_Monitoring
[2] http://wiki.postgresql.org/wiki/Lock_dependency_information
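
Here is a simplified sketch along the lines of those wiki queries (it only
matches locks by relation and transactionid; the wiki versions are more
complete):

SELECT blocked.pid        AS blocked_pid,
       blocked_act.query  AS blocked_query,
       blocking.pid       AS blocking_pid,
       blocking_act.query AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_act  ON blocked_act.pid = blocked.pid
JOIN pg_locks blocking
  ON blocking.locktype = blocked.locktype
 AND blocking.database      IS NOT DISTINCT FROM blocked.database
 AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
 AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
 AND blocking.pid <> blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted
  AND blocking.granted;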

Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres

Re: How to investigate slow insert problem

From
Matheus de Oliveira
Date:

On Mon, Aug 19, 2013 at 10:44 PM, Rural Hunter <ruralhunter@gmail.com> wrote:
> Hi,
>
> I'm on 9.2.4 with an Ubuntu server. There are usually hundreds of connections
> doing the same insert with different data from different networks every
> minute, through pgbouncer in the same network as the database server. The
> database has been running for about one year without problems. Yesterday I
> hit a problem: the connection count limit of the database server was reached.
> I checked the connections and found that there were many inserts hanging
> there. I checked the load (cpu, memory, io) of the db server but everything
> seemed fine. I also checked the pg log and only found one "incomplete message
> from client" error message every few minutes.

It may not be related; it could be some kind of monitoring tool checking whether PostgreSQL is listening on port 5432 (or whatever port you use). Do you have one?
 
> Then I recycled pgbouncer and kept monitoring the connections. I found the
> majority of the inserts finish quickly, but every minute there are several
> inserts left and they seem to hang there. So after a while, the connection
> limit is reached again. Besides those inserts, there are no other
> long-running queries or autovacuums. I also checked the locks of the inserts
> and found they were all granted. The insert statement itself is very simple
> and it only inserts one row, but there are some triggers involved. They might
> impact performance, but I have never experienced any problem since the
> majority of the inserts are fine.

I would check these triggers first. If you execute (by hand) the same insert (perhaps inside a transaction, followed by a rollback), does it hang? If so, you can try to trace what these triggers are doing; perhaps the easiest/fastest way would be the old and good RAISE NOTICE (if it is PL/pgSQL). Or even try to execute the trigger's source by hand, if it is not really huge; an EXPLAIN ANALYZE of the queries inside it may help.
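
For example, something like this (the table and values are made up; use your
real insert). In 9.2, EXPLAIN ANALYZE of an INSERT also reports the time spent
in each row-level trigger, which may tell you whether a trigger is the slow
part:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
    INSERT INTO my_table (col1, col2) VALUES ('test', 1);
ROLLBACK;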

I have already had problems with a system where some UPDATEs suddenly started hanging (like your case), and it was really a SELECT inside a trigger that was getting bad plans (some adjustment of the ANALYZE parameters for one table helped in that case).
 
> The problem persisted for about 1-2 hours. I didn't do anything except
> recycling pgbouncer a few times. After that period, everything went back to
> normal. It has been 24 hours and it hasn't happened again.
>
> From the error message in the pg log, I suspect it might be a network problem
> from some clients. Could anyone point out if there are other possible causes?
> I'm also wondering what those inserts are actually doing when they are
> hanging there, such as whether they are in the trigger or not. Is there
> anything I can get similar to the connection snapshots in db2?



Regards,
--
Matheus de Oliveira
Analista de Banco de Dados
Dextra Sistemas - MPS.Br nível F!
www.dextra.com.br/postgres