Thread: swarm of processes in BIND state?

swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date: 2016-05-28 07:19 +0200
hi,
we have the following situation:
pg 9.3.11 on ubuntu.
we have a master and a slave.
the db is large-ish, but we're removing *most* of its data from all
across the tables, and from lots of tables too.

while we're doing it, sometimes we get LOTS of processes, but only on the
slave, never on the master, that spend a long time in the BIND state. The same
queries run on the master without any problem, and are fast.

any clues on where to start diagnosing it?

So far we've:
1. ruled out IO problems (enough io both in terms of bandwidth and iops)
2. ruled out memory (plenty)
3. vacuumed all tables, including system ones, with analyze
4. run pg_repack on most of the tables to remove bloat.

we are in the process of doing the same operation on a series of servers, so
i'm not sure if the conditions on all of them are the same, but perhaps
you can think of something we didn't?

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


Re: swarm of processes in BIND state?

From
Pavel Stehule
Date: 2016-05-28 07:25:18 +0200
Hi

2016-05-28 7:19 GMT+02:00 hubert depesz lubaczewski <depesz@depesz.com>:
> hi,
> we have the following situation:
> pg 9.3.11 on ubuntu.
> we have a master and a slave.
> the db is large-ish, but we're removing *most* of its data from all
> across the tables, and from lots of tables too.
>
> while we're doing it, sometimes we get LOTS of processes, but only on the
> slave, never on the master, that spend a long time in the BIND state. The same
> queries run on the master without any problem, and are fast.
>
> any clues on where to start diagnosing it?
>
> So far we've:
> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
> 2. ruled out memory (plenty)
> 3. vacuumed all tables, including system ones, with analyze
> 4. run pg_repack on most of the tables to remove bloat.
>
> we are in the process of doing the same operation on a series of servers, so
> i'm not sure if the conditions on all of them are the same, but perhaps
> you can think of something we didn't?

It looks like a spinlock issue.

Try looking at it with "perf top".
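
For reference, a minimal way to run that (a sketch; plain "sudo perf top" is
enough, and whether the pgrep filter below matches the "BIND" process titles
depends on the platform):

# profile the whole box, or only the backends currently stuck in BIND
sudo perf top
sudo perf top -p "$(pgrep -d, -u postgres -f 'BIND')"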

Pavel
 

> depesz
>
> --
> The best thing about modern society is how easy it is to avoid contact with it.
>                                                              http://depesz.com/



Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date: 2016-05-28 07:45 +0200
On Sat, May 28, 2016 at 07:25:18AM +0200, Pavel Stehule wrote:
> It looks like a spinlock issue.
> Try looking at it with "perf top".

First results look like:

Samples: 64K of event 'cpu-clock', Event count (approx.): 2394094576

 32.08%  [unknown]                [.] 0x00007f27ed20a296
  6.31%  [kernel]                 [k] finish_task_switch
  5.72%  [kernel]                 [k] _raw_spin_unlock_irqrestore
  3.46%  postgres                 [.] 0x00000000002764ec
  3.38%  libruby-2.1.so.2.1.0     [.] 0x000000000008a8d0
  2.76%  [kernel]                 [k] __do_page_fault
  2.08%  postgres                 [.] hash_search_with_hash_value
  1.94%  libc-2.19.so             [.] 0x0000000000160b96
  1.79%  postgres                 [.] LWLockAcquire
  1.50%  postgres                 [.] heap_page_prune_opt
  1.28%  postgres                 [.] LWLockRelease
  1.09%  postgres                 [.] heap_hot_search_buffer
  1.00%  [kernel]                 [k] tick_nohz_idle_enter

I guess I have to use pg with debugging flags?

Best regards,

depesz



Re: swarm of processes in BIND state?

From
Pavel Stehule
Date: 2016-05-28 07:46:52 +0200


2016-05-28 7:45 GMT+02:00 hubert depesz lubaczewski <depesz@depesz.com>:
> On Sat, May 28, 2016 at 07:25:18AM +0200, Pavel Stehule wrote:
> > It looks like a spinlock issue.
> > Try looking at it with "perf top".
>
> First results look like:
>
> Samples: 64K of event 'cpu-clock', Event count (approx.): 2394094576
>  32.08%  [unknown]                [.] 0x00007f27ed20a296
>   6.31%  [kernel]                 [k] finish_task_switch
>   5.72%  [kernel]                 [k] _raw_spin_unlock_irqrestore
>   3.46%  postgres                 [.] 0x00000000002764ec
>   3.38%  libruby-2.1.so.2.1.0     [.] 0x000000000008a8d0
>   2.76%  [kernel]                 [k] __do_page_fault
>   2.08%  postgres                 [.] hash_search_with_hash_value
>   1.94%  libc-2.19.so             [.] 0x0000000000160b96
>   1.79%  postgres                 [.] LWLockAcquire
>   1.50%  postgres                 [.] heap_page_prune_opt
>   1.28%  postgres                 [.] LWLockRelease
>   1.09%  postgres                 [.] heap_hot_search_buffer
>   1.00%  [kernel]                 [k] tick_nohz_idle_enter
>
> I guess I have to use pg with debugging flags?

You should install debug info, or compile with debug symbols.
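
For the stock Ubuntu / apt.postgresql.org packages, debug info usually comes
as a separate package (the package name below is an assumption about that
packaging; adjust to however your postgres was installed):

# install debug symbols for the packaged 9.3 server, then re-run perf top
sudo apt-get install postgresql-9.3-dbg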

Pavel
 

Best regards,

depesz


Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date:
On Sat, May 28, 2016 at 07:46:52AM +0200, Pavel Stehule wrote:
> You should install debug info, or compile with debug symbols.

Installed debug info, and the problem stopped.

Don't think it's related - it could be just timing. I'll report back
if/when the problem re-appears.

Best regards,

depesz



Re: swarm of processes in BIND state?

From
Jeff Janes
Date: 2016-05-28 10:32:15 -0700
On Fri, May 27, 2016 at 10:19 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
> hi,
> we have the following situation:
> pg 9.3.11 on ubuntu.
> we have a master and a slave.
> the db is large-ish, but we're removing *most* of its data from all
> across the tables, and from lots of tables too.
>
> while we're doing it, sometimes we get LOTS of processes, but only on the
> slave, never on the master, that spend a long time in the BIND state. The same
> queries run on the master without any problem, and are fast.

This sounds like a known issue, except the fix for that issue was
already incorporated into 9.3.11:

commit 4162a55c77cbb54acb4ac442ef3565b813b9d07a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Feb 25 16:04:09 2014 -0500

    Use SnapshotDirty rather than an active snapshot to probe index endpoints.


But it might be some variant not covered by that fix.  The essence is
that if the planner is thinking about doing a merge join (even if it
never actually uses one) with the merge key being an indexed column,
it will consult the index to find the current min and max values.  If
the min and max portions of the index are full of dead or uncommitted
tuples, it does a lot of work digging through the index looking for a
non-dead one.
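
One rough way to see whether that endpoint digging is what the slave is paying
for: time a min()/max() probe on the indexed join column on both master and
slave and compare buffers touched. This is only a sketch; "dbname", "big_table"
and "join_col" are placeholders, not names from this thread:

# run on master and on the slave; a big gap in buffers/runtime points at
# dead or not-yet-hinted tuples around the index endpoints
psql -d dbname -c "EXPLAIN (ANALYZE, BUFFERS) SELECT min(join_col), max(join_col) FROM big_table;"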


> any clues on where to start diagnosing it?

I'd start by using strace (with -y -ttt -T) on one of the processes
and see what it is doing.  A lot of IO, and on what file?  A lot of
semop's?

If that wasn't informative, I'd attach to one of the processes with
the gdb debugger and get a backtrace.  (You might want to do that a
few times, just in case the first one accidentally caught the code
during a part of its execution which was not in the bottlenecked
spot.)
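
A non-interactive variant of the same idea, in case it is easier to script
(just a sketch; $pid is any of the backends sitting in BIND):

# grab one backtrace and detach; repeat a few times per process
sudo gdb --batch -ex bt -p "$pid"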


> So far we've:
> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)

Are you saying that you are empirically not actually doing any IO
waits, or just that the IO capacity is theoretically sufficient?

Thanks,

Jeff


Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date:
On Sat, May 28, 2016 at 08:04:43AM +0200, Pavel Stehule wrote:
> > > You should install debug info, or compile with debug symbols.
> > Installed debug info, and the problem stopped.

OK, the problem came back.

ps looked like this:

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres   9762  0.0  1.3 6816588 424512 ?      S    05:49   0:09 /usr/lib/postgresql/9.3/bin/postgres -D /var/lib/postgresql/9.3/main -c config_file=/etc/postgresql/9.3/main/postgresql.conf
postgres   9763  1.6 20.3 6819980 6284520 ?     Ss   05:49  11:49  \_ postgres: startup process   recovering 0000000400007D61000000D2
postgres  10300  0.1 20.0 6827396 6190236 ?     Ss   05:49   1:22  \_ postgres: checkpointer process
postgres  10301  0.1 19.9 6819736 6145144 ?     Ss   05:49   0:55  \_ postgres: writer process
postgres  11111  0.0  0.0 104704  2076 ?        Ss   05:49   0:11  \_ postgres: stats collector process
postgres  11148  0.7  0.0 6832848 4840 ?        Ds   05:49   5:42  \_ postgres: wal receiver process   streaming 7D61/D2CE4D90
postgres  43248  0.0  0.2 6823908 79072 ?       Ss   06:41   0:00  \_ postgres: monitor dbname [local] idle
postgres  14802  4.0 18.5 6898688 5723652 ?     Ss   09:57  19:54  \_ postgres: app_user dbname 10.1.15.177(44508) BIND
postgres  14804  2.9 16.3 6870980 5061020 ?     Ss   09:57  14:25  \_ postgres: app_user dbname 10.1.15.177(44510) BIND
postgres 121044  6.2 16.8 6871756 5189808 ?     Ss   13:19  17:52  \_ postgres: app_user dbname 10.1.14.77(58756) BIND
postgres 130506  5.0 15.6 6871536 4844704 ?     Ss   13:38  13:22  \_ postgres: app_user dbname 10.1.15.177(47734) BIND
postgres   3767  2.8 13.6 6870876 4226928 ?     Ss   13:45   7:20  \_ postgres: app_user dbname 10.1.14.77(59142) BIND
postgres   3768  3.7 16.0 6877232 4943916 ?     Ss   13:45   9:35  \_ postgres: app_user dbname 10.1.14.77(59143) BIND
postgres   7168  0.0  0.0 6821248 8260 ?        Ss   13:52   0:00  \_ postgres: pgbouncer dbname 10.1.16.158(50990) idle
postgres   7170  0.0  2.0 6863740 637368 ?      Ss   13:52   0:01  \_ postgres: app_user dbname 10.1.16.158(50991) idle
postgres  23083  5.9 14.7 6872584 4552872 ?     Ds   14:22  13:16  \_ postgres: app_user dbname 10.1.15.177(48363) BIND
postgres  70043  0.0  0.0 6821224 7548 ?        Ss   15:52   0:00  \_ postgres: pgbouncer dbname 10.1.14.77(60980) idle
postgres  70719  0.0  0.0 6821248 7448 ?        Ss   15:53   0:00  \_ postgres: pgbouncer dbname 10.1.15.177(49701) idle
postgres  71257  5.8 14.5 6870940 4479808 ?     Ss   15:54   7:33  \_ postgres: app_user dbname 10.1.14.77(32783) BIND
postgres  77780  5.9 13.3 6871008 4119172 ?     Ss   16:06   6:58  \_ postgres: app_user dbname 10.1.14.77(32951) BIND
postgres  84682  9.2 15.5 6877216 4806100 ?     Ss   16:19   9:40  \_ postgres: app_user dbname 10.1.15.177(50074) BIND
postgres 100924  7.1 13.5 6870804 4176864 ?     Ss   16:51   5:11  \_ postgres: app_user dbname 10.1.15.177(50537) BIND
postgres 118767  7.7 13.7 6870736 4251048 ?     Ss   17:24   3:04  \_ postgres: app_user dbname 10.1.14.77(34090) BIND
postgres 118897  6.2 13.6 6870588 4219824 ?     Ss   17:24   2:27  \_ postgres: app_user dbname 10.1.14.77(34094) BIND
postgres 118916  8.6 12.4 6869524 3852648 ?     Ss   17:25   3:22  \_ postgres: app_user dbname 10.1.15.177(51026) BIND
postgres 118978  3.3 13.0 6868456 4033924 ?     Ss   17:25   1:18  \_ postgres: app_user dbname 10.1.14.77(34100) BIND
postgres 119022  2.3 13.4 6829152 4148088 ?     Ss   17:25   0:53  \_ postgres: app_user dbname 10.1.14.77(34101) BIND
postgres 119060  3.3 13.8 6869172 4265976 ?     Ss   17:25   1:18  \_ postgres: app_user dbname 10.1.15.177(51034) BIND
postgres 119096  1.9 12.1 6828716 3754204 ?     Ss   17:25   0:46  \_ postgres: app_user dbname 10.1.15.177(51037) BIND
postgres 119097  6.4 13.8 6870868 4263168 ?     Ss   17:25   2:29  \_ postgres: app_user dbname 10.1.15.177(51038) BIND
postgres 119111  1.9 12.1 6826656 3763260 ?     Ss   17:25   0:44  \_ postgres: app_user dbname 10.1.14.77(34105) BIND
postgres 119152  3.5 12.7 6869468 3921916 ?     Ss   17:25   1:22  \_ postgres: app_user dbname 10.1.14.77(34107) BIND
postgres 119266  7.3 13.5 6868908 4193496 ?     Ss   17:25   2:49  \_ postgres: app_user dbname 10.1.15.177(51041) BIND
postgres 119298  5.4 13.3 6870552 4107692 ?     Ss   17:25   2:05  \_ postgres: app_user dbname 10.1.14.77(34110) BIND
postgres 119303  6.2 14.7 6870816 4553052 ?     Ss   17:25   2:24  \_ postgres: app_user dbname 10.1.15.177(51043) BIND
postgres 119314  4.9 13.0 6869704 4024040 ?     Ss   17:25   1:54  \_ postgres: app_user dbname 10.1.14.77(34113) BIND
postgres 119315  2.0 12.5 6869208 3881956 ?     Ss   17:25   0:47  \_ postgres: app_user dbname 10.1.15.177(51045) BIND
postgres 119316  7.0 13.3 6870908 4136492 ?     Ss   17:25   2:41  \_ postgres: app_user dbname 10.1.15.177(51046) BIND
postgres 119317  3.2 13.0 6870100 4018860 ?     Ss   17:25   1:14  \_ postgres: app_user dbname 10.1.14.77(34114) BIND
postgres 119318  8.2 14.4 6870868 4457528 ?     Ss   17:25   3:09  \_ postgres: app_user dbname 10.1.14.77(34115) BIND
postgres 119319  2.3 12.0 6827084 3724248 ?     Ss   17:25   0:54  \_ postgres: app_user dbname 10.1.15.177(51047) BIND
postgres 119320  5.0 13.0 6870208 4037396 ?     Ss   17:25   1:56  \_ postgres: app_user dbname 10.1.15.177(51048) BIND
postgres 119368  4.5 13.2 6868480 4085724 ?     Ss   17:25   1:44  \_ postgres: app_user dbname 10.1.15.177(51050) BIND
postgres 119417  4.9 12.8 6869740 3981108 ?     Ss   17:25   1:52  \_ postgres: app_user dbname 10.1.14.77(34117) BIND
postgres 119418  3.4 13.0 6869204 4027344 ?     Ss   17:25   1:18  \_ postgres: app_user dbname 10.1.14.77(34118) BIND
postgres 119441  2.4 12.3 6828944 3826244 ?     Ss   17:25   0:57  \_ postgres: app_user dbname 10.1.15.177(51051) BIND
postgres 119442  3.4 12.6 6869248 3896108 ?     Ss   17:26   1:18  \_ postgres: app_user dbname 10.1.15.177(51052) BIND
postgres 119550 10.7 14.4 6870876 4448704 ?     Ss   17:26   4:04  \_ postgres: app_user dbname 10.1.15.177(51062) BIND
postgres 119560  7.0 14.2 6870744 4402940 ?     Ss   17:26   2:40  \_ postgres: app_user dbname 10.1.14.77(34125) BIND
postgres 119621  3.3 13.1 6870516 4052152 ?     Ss   17:26   1:16  \_ postgres: app_user dbname 10.1.14.77(34128) BIND
postgres 119631  1.9 12.4 6827044 3858696 ?     Ss   17:26   0:43  \_ postgres: app_user dbname 10.1.14.77(34129) BIND
postgres 130410  0.0  1.5 6824028 465040 ?      Ss   17:47   0:00  \_ postgres: app_user dbname 10.1.15.177(51372) idle

I ran perf top and got:

Samples: 620K of event 'cpu-clock', Event count (approx.): 30018858044

 23.77%  postgres                [.] s_lock
 13.58%  postgres                [.] 0x000000000027668c
  8.00%  postgres                [.] LWLockAcquire
  6.45%  [kernel]                [k] _raw_spin_unlock_irqrestore
  6.13%  postgres                [.] hash_search_with_hash_value
  4.97%  postgres                [.] LWLockRelease
  4.12%  [kernel]                [k] finish_task_switch
  3.95%  postgres                [.] heap_page_prune_opt
  1.33%  postgres                [.] RecoveryInProgress
  1.19%  [kernel]                [k] copy_user_enhanced_fast_string
  1.09%  postgres                [.] _bt_checkkeys
  1.05%  postgres                [.] heap_hot_search_buffer
  1.02%  libc-2.19.so            [.] 0x0000000000160b96
  0.99%  postgres                [.] hash_any
...

Does that help us in any way?

Best regards,

depesz



Re: swarm of processes in BIND state?

From
Tom Lane
Date: 2016-05-28 14:15:07 -0400
hubert depesz lubaczewski <depesz@depesz.com> writes:
> Does that help us in any way?

Not terribly.  That confirms that the processes are contending for a
spinlock, but we can't tell which one.  Can you collect a few stack traces
from those processes?

            regards, tom lane


Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date:
On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
> > any clues on where to start diagnosing it?
>
> I'd start by using strace (with -y -ttt -T) on one of the processes
> and see what it is doing.  A lot of IO, and on what file?  A lot of
> semop's?

So, I did:

sudo strace -o bad.log -y -ttt -T -p $( ps uwwf -u postgres | grep BIND | awk '{print $2}' | head -n1 )
and killed it after 10 seconds, more or less. Results:

$ wc -l bad.log
6075 bad.log
$ grep -c semop bad.log
6018

The rest were reads, seeks, and a single open on these files:
$ grep -v semop bad.log | grep -oE '/16421/[0-9.]*' | sort | uniq -c
2 /16421/3062403236.20
2 /16421/3062403236.8
25 /16421/3222944583.49
28 /16421/3251043620.60

Which are:
$ select oid::regclass from pg_class  where relfilenode in (3062403236, 3222944583, 3251043620);
               oid
----------------------------------
 app_schema.s_table
 app_schema.v_table
 app_schema.m_table
(3 rows)

which are the 3 largest tables there are. But the logs don't show any queries
that would touch all 3 of them.

> If that wasn't informative, I'd attach to one of the processes with
> the gdb debugger and get a backtrace.  (You might want to do that a
> few times, just in case the first one accidentally caught the code
> during a part of its execution which was not in the bottlenecked
> spot.)

I did:
for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done

Since there is lots of output, I made a tarball with it, and put it on
https://depesz.com/various/all.bt.logs.tar.gz

The file is ~ 19kB.

> > So far we've:
> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>
> Are you saying that you are empirically not actually doing any IO
> waits, or just that the IO capacity is theoretically sufficient?

There are no iowaits, per what iostat returns. Or there are, but very low.
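
(For context, the check meant here is roughly the following; the await and
%util columns are the ones to watch for stalls. Just the usual invocation,
nothing specific to this setup:)

# extended per-device statistics, refreshed every second
iostat -x 1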

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/


Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date:
On Sat, May 28, 2016 at 02:15:07PM -0400, Tom Lane wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
> > Does that help us in any way?
>
> Not terribly.  That confirms that the processes are contending for a
> spinlock, but we can't tell which one.  Can you collect a few stack traces
> from those processes?

Sure, tarballed them into https://depesz.com/various/all.bt.logs.tar.gz.

Best regards,

depesz



Re: swarm of processes in BIND state?

From
Jeff Janes
Date: 2016-05-30 11:05:17 -0700
On Sat, May 28, 2016 at 11:32 AM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
> On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:

>> If that wasn't informative, I'd attach to one of the processes with
>> the gdb debugger and get a backtrace.  (You might want to do that a
>> few times, just in case the first one accidentally caught the code
>> during a part of its execution which was not in the bottlenecked
>> spot.)
>
> I did:
> for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done
>
> Since there is lots of output, I made a tarball with it, and put it on
> https://depesz.com/various/all.bt.logs.tar.gz
>
> The file is ~ 19kB.

If you look at the big picture, it is what I thought: the planner
probing the index end points when planning a merge join.  Although I
don't know how that turns into the low-level specifics you are seeing.

It looks like everyone becomes interested in the same disk page at the
same time.  One process starts reading in that page, and all the
others queue up on a lock waiting for that one to finish.  So what
you see is 1 disk wait and N-1 semop waits.

But, if the page is that popular, why is it not staying in cache?
Either which page is popular is moving around quickly (which is hard
to see how that would be plausible if it represents the index
end-points) or there are so many simultaneously popular pages that
they can't all fit in cache.
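
One way to check what is actually occupying the cache, if the pg_buffercache
extension happens to be installed (it has to be created on the master so it
exists on the slave; the query is adapted from the extension's documentation,
so treat it as a sketch):

psql -d dbname <<'SQL'
-- which relations currently occupy the most shared buffers
SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                              WHERE datname = current_database()))
 GROUP BY c.relname
 ORDER BY 2 DESC
 LIMIT 10;
SQL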

So my theory is that you deleted a huge number of entries off from
either end of the index, that transaction committed, and that commit
became visible to all.  Planning a mergejoin needs to dig through all
those tuples to probe the true end-point.  On master, the index
entries quickly get marked as LP_DEAD so future probes don't have to
do all that work, but on the replicas those index hint bits are, for
some unknown to me reason, not getting set.  So it has to scour the
all the heap pages which might have the smallest/largest tuple, on
every planning cycle, and that list of pages is very large leading to
occasional IO stalls.

Or perhaps the master realizes the deleting transaction is
committed-visible-to-all, but the replicas believe there are still
some transactions which could care about them, and that is the reason
they are not getting hinted?
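
If the replica-side visibility angle is right, one cheap thing to rule out is
a long-open transaction or snapshot on the slave itself (a sketch, not a
definitive test; "dbname" is a placeholder):

# list the oldest open transactions on the slave
psql -d dbname -c "SELECT pid, state, now() - xact_start AS xact_age FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start LIMIT 5;"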


>> > So far we've:
>> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>>
>> Are you saying that you are empirically not actually doing any IO
>> waits, or just that the IO capacity is theoretically sufficient?
>
> There are no iowaits, per what iostat returns. Or there are, but very low.

If each IO wait has a pile-up of processes waiting behind it on
semops, then it could have a much larger effect than the raw numbers
would indicate.

Cheers,

Jeff


Re: swarm of processes in BIND state?

From
hubert depesz lubaczewski
Date:
On Mon, May 30, 2016 at 11:05:17AM -0700, Jeff Janes wrote:
> So my theory is that you deleted a huge number of entries off from
> either end of the index, that transaction committed, and that commit
> became visible to all.  Planning a mergejoin needs to dig through all
> those tuples to probe the true end-point.  On master, the index
> entries quickly get marked as LP_DEAD so future probes don't have to
> do all that work, but on the replicas those index hint bits are, for
> some unknown to me reason, not getting set.  So it has to scour all
> the heap pages which might have the smallest/largest tuple, on
> every planning cycle, and that list of pages is very large leading to
> occasional IO stalls.

This I get, but why was the same backend reading data from all 3 largest
tables, while I know for sure (well, 99.9% sure) that no query touches
all of them?

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/