From: Justin Pryzby
Subject: Re: backends stuck in "startup"
Msg-id: 20171122223117.GB5668@telsasoft.com
In response to: Re: backends stuck in "startup" (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general

On Wed, Nov 22, 2017 at 01:27:12PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Tue, Nov 21, 2017 at 03:40:27PM -0800, Andres Freund wrote:
> >> Could you try stracing next time?
> 
> > I straced all the "startup" PIDs, which were all in futex, without exception:
> 
> If you've got debug symbols installed, could you investigate the states
> of the LWLocks the processes are stuck on?
> 
> My hypothesis about a missed memory barrier would imply that there's (at
> least) one process that's waiting but is not in the lock's wait queue and
> has MyProc->lwWaiting == false, while the rest are in the wait queue and
> have MyProc->lwWaiting == true.  Actually chasing through the list
> pointers would be slightly tedious, but checking MyProc->lwWaiting,
> and maybe MyProc->lwWaitMode, in each process shouldn't be too hard.

> Also verify that they're all waiting for the same LWLock (by address).

I believe my ~40 cores are actually for backends from two separate instances of
this issue on the VM, as evidenced by different argv pointers.

And for each instance, I have cores for only a fraction of the backends
(max_connections=400).

For starters, I found that PID 27427 has:

(gdb) p proc->lwWaiting
$1 = 0 '\000'
(gdb) p proc->lwWaitMode
$2 = 1 '\001'

..where all the others have lwWaiting=1
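
A batch loop along these lines makes that check quick across all the cores (the
postgres binary and core paths below are illustrative, not the real ones):

for core in /var/tmp/cores/core.*; do
  echo "== $core"
  gdb -batch -ex 'p MyProc->pid' -ex 'p MyProc->lwWaiting' -ex 'p MyProc->lwWaitMode' \
    /usr/pgsql-10/bin/postgres "$core" 2>/dev/null | grep '^\$'
done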

For #27427:
(gdb) p *lock
$27 = {tranche = 59, state = {value = 1627389952}, waiters = {head = 147, tail = 308}}

(gdb) info locals
mustwait = 1 '\001'
proc = 0x7f1a77dba500
result = 1 '\001'
extraWaits = 0
__func__ = "LWLockAcquire"
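
If I'm decoding that state value against the flag bits in lwlock.c correctly,
1627389952 breaks down as:

1627389952 = 0x61000000
           = LW_FLAG_HAS_WAITERS (0x40000000)
           | LW_FLAG_RELEASE_OK  (0x20000000)
           | LW_VAL_EXCLUSIVE    (0x01000000)

..i.e. the lock is held exclusive, the waiters flag is set, and the shared
count is zero.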

And at this point I have to ask for help with how to finish traversing these
structures.  I could upload the cores for someone (I don't think there's anything
too private in them), but so far they add up to 16GB gz-compressed.
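
My best guess at how the traversal would go, assuming the shared memory segment
made it into the cores and that I'm reading PG 10's proclist right (waiters.head
and .tail are pgprocnos indexing into ProcGlobal->allProcs), is something like:

(gdb) p lock->waiters.head
(gdb) p ProcGlobal->allProcs[147].pid
(gdb) p ProcGlobal->allProcs[147].lwWaiting
(gdb) p ProcGlobal->allProcs[147].lwWaitLink.next

..then repeat with each pgprocno that .next returns until it comes back as
INVALID_PGPROCNO, which should visit every queued waiter from head = 147 through
tail = 308.  Checking whether 27427's PGPROC ever shows up in that chain would
presumably confirm (or refute) the missed-wakeup theory.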

Note: I've locally compiled PG 10.1 with PREFERRED_SEMAPHORES=SYSV to keep the
service up (and to the degree that serves to verify that this avoids the issue,
great).

But I could start an instance running pgbench on this VM to try to trigger the
issue, with smaller shared_buffers and fewer backends/clients so that full cores
of every backend are feasible (I don't think I'll be able to dump all 400 cores,
each up to 2GB, from the production instance).

Would you suggest how I can maximize the likelihood/speed of triggering it?
Five years ago, with a report of similar symptoms, you said "You need to hack
pgbench to suppress the single initialization connection it normally likes to
make, else the test degenerates to the one-incoming-connection case"
https://www.postgresql.org/message-id/8896.1337998337%40sss.pgh.pa.us

..but pgbench --connect seems to do what's needed(?)  (I see that option dates
back to 2001, having been added in ba708ea3).
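
Concretely I had something like this in mind (database name and numbers are
placeholders), on the theory that --connect keeps new backends constantly going
through the startup path:

pgbench -i -s 10 testdb
pgbench --connect -n -S -c 64 -j 8 -T 3600 testdb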

(I don't know that there's any suggestion or reason to believe the bug is
specific to the connection/startup phase, or that that phase is necessary or
sufficient to hit the bug, but it's at least known to be affected and it's all
I have to go on for now).

Justin

