On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
> > any clues on where to start diagnosing it?
>
> I'd start by using strace (with -y -ttt -T) on one of the processes
> and see what it is doing. A lot of IO, and on what file? A lot of
> semop's?
So, I did:
sudo strace -o bad.log -y -ttt -T -p $( ps uwwf -u postgres | grep BIND | awk '{print $2}' | head -n1 )
and killed it after 10 seconds, more or less. Results:
$ wc -l bad.log
6075 bad.log
$ grep -c semop bad.log
6018
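If it helps, the -T timings also make it easy to sum how much of the trace was
spent just inside semop - a quick sketch (assuming the bad.log from above):
$ grep semop bad.log | grep -oE '<[0-9.]+>$' | tr -d '<>' | awk '{ total += $1 } END { print total }'
which should show how many of the ~10 traced seconds went to semaphore waits.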
The rest were reads, seeks, and a single open() of these files:
$ grep -v semop bad.log | grep -oE '/16421/[0-9.]*' | sort | uniq -c
2 /16421/3062403236.20
2 /16421/3062403236.8
25 /16421/3222944583.49
28 /16421/3251043620.60
Which are:
select oid::regclass from pg_class where relfilenode in (3062403236, 3222944583, 3251043620);
oid
----------------------------------
app_schema.s_table
app_schema.v_table
app_schema.m_table
(3 rows)
which are the 3 largest tables we have. But the logs don't show any queries
that would touch all 3 of them.
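(For the record, 16421 in those paths is the database oid, which can be
cross-checked with:
select datname from pg_database where oid = 16421;
and the .49/.60 suffixes are just 1GB segment numbers, which fits these being
the largest tables.)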
> If that wasn't informative, I'd attach to one of the processes with
> the gdb debugger and get a backtrace. (You might want to do that a
> few times, just in case the first one accidentally caught the code
> during a part of its execution which was not in the bottlenecked
> spot.)
I did:
for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done
Since there is lots of output, I made a tarball with it, and put it on
https://depesz.com/various/all.bt.logs.tar.gz
The file is ~ 19kB.
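(Essentially the same thing, in case anyone wants to reproduce it, could be
done with gdb's batch mode - something like:
for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do gdb -batch -ex bt -p $a > $a.bt.log 2>&1; done
but the echo-pipe version above did the job.)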
> > So far we've:
> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>
> Are you saying that you are empirically not actually doing any IO
> waits, or just that the IO capacity is theoretically sufficient?
There are no iowaits according to what iostat reports. Or rather, there are, but they are very low.
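For reference, what I've been watching is basically extended iostat output, e.g.:
$ iostat -x 5
and both %iowait and the per-device await/%util columns stay very low there.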
Best regards,
depesz
--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/