Re: swarm of processes in BIND state? - Mailing list pgsql-general

From hubert depesz lubaczewski
Subject Re: swarm of processes in BIND state?
Date
Msg-id 20160528183202.GC14097@depesz.com
In response to Re: swarm of processes in BIND state?  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: swarm of processes in BIND state?  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general
On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
> > any clues on where to start diagnosing it?
>
> I'd start by using strace (with -y -ttt -T) on one of the processes
> and see what it is doing.  A lot of IO, and on what file?  A lot of
> semop's?

So, I did:

sudo strace -o bad.log -y -ttt -T -p $( ps uwwf -u postgres | grep BIND | awk '{print $2}' | head -n1 )
and killed it after 10 seconds, more or less. Results:

$ wc -l bad.log
6075 bad.log
$ grep -c semop bad.log
6018

The rest were reads, seeks, and a single open on these files:
$ grep -v semop bad.log | grep -oE '/16421/[0-9.]*' | sort | uniq -c
2 /16421/3062403236.20
2 /16421/3062403236.8
25 /16421/3222944583.49
28 /16421/3251043620.60

Which are:
$ select oid::regclass from pg_class  where relfilenode in (3062403236, 3222944583, 3251043620);
               oid
----------------------------------
 app_schema.s_table
 app_schema.v_table
 app_schema.m_table
(3 rows)

These are the 3 largest tables in the database. But the logs don't show
any queries that would touch all 3 of them.
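
Since strace was run with -T, the same log can also show how much wall
time went into each syscall, not just how many calls there were. A
minimal sketch, assuming every completed call in bad.log is a single
line of the form "timestamp name(args) = result <seconds>"; the function
name strace_summary is made up for illustration:

```shell
# Sketch: per-syscall call count and total time from an strace log
# taken with -ttt -T. Assumed line format (not verified against bad.log):
#   1464460322.123456 semop(123, [{0, -1, 0}], 1) = 0 <0.000123>
# "unfinished"/"resumed" lines do not match and are skipped.
strace_summary() {
    awk '
        # field 2 starts with the syscall name; the last field is <seconds>
        $NF ~ /^<[0-9.]+>$/ && $2 ~ /^[a-z_0-9]+\(/ {
            name = $2; sub(/\(.*/, "", name)
            t = $NF; gsub(/[<>]/, "", t)
            calls[name]++; secs[name] += t
        }
        END {
            # output columns: syscall, number of calls, total seconds
            for (n in calls) printf "%s %d %.6f\n", n, calls[n], secs[n]
        }
    ' "$1" | sort -k3,3nr
}
```

Running "strace_summary bad.log" would list the syscalls sorted by total
time, which helps tell many cheap semop calls apart from a few slow ones.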

> If that wasn't informative, I'd attach to one of the processes with
> the gdb debugger and get a backtrace.  (You might want to do that a
> few times, just in case the first one accidentally caught the code
> during a part of its execution which was not in the bottlenecked
> spot.)

I did:
for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done

Since there is a lot of output, I made a tarball of it and put it at
https://depesz.com/various/all.bt.logs.tar.gz

The file is ~ 19kB.
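
For a quick overview of that many backtraces, the innermost frame of
each one can be tallied. A minimal sketch, assuming the *.bt.log files
sit in one directory and that gdb prints frame #0 in one of its two
usual shapes; the function name bt_top_frames is made up for
illustration:

```shell
# Sketch: count which function each backtrace was in at frame #0.
# Assumed gdb frame shapes (func_name is a placeholder):
#   #0  0x00007f... in func_name () from /lib/...
#   #0  func_name (args) at file.c:123
bt_top_frames() {
    cat "$1"/*.bt.log | awk '
        $1 == "#0" {
            # skip the "0x... in" prefix when gdb prints an address
            f = ($2 ~ /^0x/ && $3 == "in") ? $4 : $2
            count[f]++
        }
        END { for (f in count) printf "%d %s\n", count[f], f }
    ' | sort -k1,1nr -k2,2
}
```

Running "bt_top_frames ." in the directory with the logs would print one
line per function, most frequent first.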

> > So far we've:
> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>
> Are you saying that you are empirically not actually doing any IO
> waits, or just that the IO capacity is theoretically sufficient?

There are no iowaits according to iostat. Or rather, there are some, but very low.

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

