Re: swarm of processes in BIND state? - Mailing list pgsql-general

From Jeff Janes
Subject Re: swarm of processes in BIND state?
Date
Msg-id CAMkU=1xx3h0auwwjNfiJ69QEB9xcbVyAxQHD2hn8S_0pVf-b+A@mail.gmail.com
Whole thread Raw
In response to swarm of processes in BIND state?  (hubert depesz lubaczewski <depesz@depesz.com>)
Responses Re: swarm of processes in BIND state?  (hubert depesz lubaczewski <depesz@depesz.com>)
List pgsql-general
On Fri, May 27, 2016 at 10:19 PM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
> hi,
> we have following situation:
> pg 9.3.11 on ubuntu.
> we have master and slave.
> the db is large-ish, but we're removing *most* of its data from all
> across the tables, and lots of tables too.
>
> while we're doing it, sometimes, we get LOTS of processes, but only on
> slave, never on master, that spend long time in BIND state. same queries
> ran on master run without any problem, and are fast.

This sounds like a known issue, except the fix for that issue was
already incorporated into 9.3.11:

commit 4162a55c77cbb54acb4ac442ef3565b813b9d07a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Tue Feb 25 16:04:09 2014 -0500

    Use SnapshotDirty rather than an active snapshot to probe index endpoints.


But it might be some variant not covered by that fix.  The essence is
that if the planner is thinking about doing a merge join (even if it
never actually uses one) with the merge key being an indexed column,
it will consult the index to find the current min and max values.  If
the min and max portions of the index are full of dead or uncommitted
tuples, it does a lot of work digging through the index looking for a
non-dead one.


> any clues on where to start diagnosing it?

I'd start by using strace (with -y -ttt -T) on one of the processes
and see what it is doing.  A lot of IO, and one what file?  A lot of
semop's?

If that wasn't informative, I'd attach to one of the processes with
the gdb debugger and get a backtrace.  (You might want to do that a
few times, just in case the first one accidentally caught the code
during a part of its execution which was not in the bottlenecked
spot.)


> So far we've:
> 1. ruled out IO problems (enough io both in terms of bandwidth and iops)

Are you saying that you are empirically not actually doing any IO
waits, or just that the IO capacity is theoretically sufficient?

Thanks,

Jeff


pgsql-general by date:

Previous
From: Francisco Olarte
Date:
Subject: Re: Migrate 2 DB's - v8.3
Next
From: Martín Marqués
Date:
Subject: Re: Migrate 2 DB's - v8.3