Thread: [HACKERS] Freeze on Cygwin w/ concurrency

[HACKERS] Freeze on Cygwin w/ concurrency

From
Noah Misch
Date:
"pgbench -i -s 50; pgbench -S -j2 -c16 -T900 -P5" freezes consistently on
Cygwin 2.2.1 and Cygwin 2.6.0.  (I suspect most other versions are affected.)
I've pinged[1] the Cygwin bug thread with some additional detail.  If a Cygwin
buildfarm member starts using --enable-tap-tests, you may see failures in the
pgbench test suite.  (lorikeet used --enable-tap-tests from 2017-03-18 to
2017-03-20, but it failed before reaching the pgbench test suite.)  Curious
that "make check" has too little concurrency to see more effects from this.


Frozen backends show a stack trace like this:

#0  0x000000007710139a in ntdll!ZwWriteFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x000007fefd851b2b in WriteFile () from /cygdrive/c/Windows/system32/KERNELBASE.dll
#2  0x0000000076fb3576 in WriteFile () from /cygdrive/c/Windows/system32/kernel32.dll
#3  0x0000000180160c6c in transport_layer_pipes::write (this=<optimized out>, buf=<optimized out>, len=<optimized out>) at /usr/src/debug/cygwin-2.6.0-1/winsup/cygserver/transport_pipes.cc:224
#4  0x000000018015feb6 in client_request::send (this=0xffffa930, conn=0x6000e8290) at /usr/src/debug/cygwin-2.6.0-1/winsup/cygserver/client.cc:134
#5  0x0000000180160591 in client_request::make_request (this=this@entry=0xffffa930) at /usr/src/debug/cygwin-2.6.0-1/winsup/cygserver/client.cc:473
#6  0x0000000180114f79 in semop (semid=65540, sops=0xffffaa00, nsops=1) at /usr/src/debug/cygwin-2.6.0-1/winsup/cygwin/sem.cc:125
#7  0x0000000180117a4b in _sigfe () at sigfe.s:35
#8  0x000000010063c81a in PGSemaphoreLock (sema=sema@entry=0x6ffffe06a18) at pg_sema.c:387
#9  0x00000001006a962b in LWLockAcquire (lock=lock@entry=0x6fff6774d80, mode=mode@entry=LW_SHARED) at lwlock.c:1286
#10 0x0000000100687d46 in BufferAlloc (foundPtr=0xffffab0b <incomplete sequence \367\377\006>, strategy=0x0, blockNum=290, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x6000ea588) at bufmgr.c:1012

The postmaster, also frozen, shows a stack trace like this:

#0  0x00000000771018ca in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x000007fefd851420 in KERNELBASE!GetCurrentProcess () from /cygdrive/c/Windows/system32/KERNELBASE.dll
#2  0x0000000076fa1220 in WaitForMultipleObjects () from /cygdrive/c/Windows/system32/kernel32.dll
#3  0x0000000180120173 in child_info::sync (this=this@entry=0xffffc008, pid=4692, hProcess=@0xffffc1b0: 0x4b8, howlong=howlong@entry=300000) at /usr/src/debug/cygwin-2.6.0-1/winsup/cygwin/sigproc.cc:1010
#4  0x00000001800aa163 in frok::parent (this=0xffffc000, stack_here=0xffffbfa0 "") at /usr/src/debug/cygwin-2.6.0-1/winsup/cygwin/fork.cc:501
#5  0x00000001800aaa05 in fork () at /usr/src/debug/cygwin-2.6.0-1/winsup/cygwin/fork.cc:607
#6  0x0000000180117a4b in _sigfe () at sigfe.s:35
#7  0x0000000100641618 in fork_process () at fork_process.c:61
#8  0x000000010063e80a in StartAutoVacWorker () at autovacuum.c:1436

The postmaster log eventually has:
    28 [main] postgres 4408 child_info::sync: wait failed, pid 4692, Win32 error 183
   292 [main] postgres 4408 fork: child 4692 - died waiting for dll loading, errno 11


[1] https://cygwin.com/ml/cygwin/2017-03/msg00218.html



Re: [HACKERS] Freeze on Cygwin w/ concurrency

From
Robert Haas
Date:
On Mon, Mar 20, 2017 at 11:47 PM, Noah Misch <noah@leadboat.com> wrote:
> "pgbench -i -s 50; pgbench -S -j2 -c16 -T900 -P5" freezes consistently on
> Cygwin 2.2.1 and Cygwin 2.6.0.  (I suspect most other versions are affected.)
> I've pinged[1] the Cygwin bug thread with some additional detail.

Ouch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Freeze on Cygwin w/ concurrency

From
Andrew Dunstan
Date:

On 03/20/2017 11:47 PM, Noah Misch wrote:
> "pgbench -i -s 50; pgbench -S -j2 -c16 -T900 -P5" freezes consistently on
> Cygwin 2.2.1 and Cygwin 2.6.0.  (I suspect most other versions are affected.)
> I've pinged[1] the Cygwin bug thread with some additional detail.  If a Cygwin
> buildfarm member starts using --enable-tap-tests, you may see failures in the
> pgbench test suite.  (lorikeet used --enable-tap-tests from 2017-03-18 to
> 2017-03-20, but it failed before reaching the pgbench test suite.)  Curious
> that "make check" has too little concurrency to see more effects from this.


Yeah, I abandoned --enable-tap-tests on lorikeet; I didn't have time to get
to the bottom of the problems. Glad I'm not totally alone in keeping this
alive.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Freeze on Cygwin w/ concurrency

From
Noah Misch
Date:
On Mon, Mar 20, 2017 at 11:47:03PM -0400, Noah Misch wrote:
> "pgbench -i -s 50; pgbench -S -j2 -c16 -T900 -P5" freezes consistently on
> Cygwin 2.2.1 and Cygwin 2.6.0.  (I suspect most other versions are affected.)
> I've pinged[1] the Cygwin bug thread with some additional detail.

The problem was cygserver thread exhaustion; cygserver needs a thread per
simultaneous waiter.  With "cygserver -r 40" or the equivalent config file
setting, this test does not freeze.  Cygwin 2.8.0 introduced a change to
dynamically grow the thread count:
https://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git;a=commitdiff;h=0b73dba4de3fdadde499edfbc7ca9d9a01c11487
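
To illustrate the failure mode described above, here is a toy model in Python,
not cygserver's actual code.  It assumes a fixed-size pool in which each blocked
waiter pins one thread, so the request that would wake the waiters cannot run
once the pool is full.  The function name, the scaled-down pool sizes, and the
timeout (which stands in for "frozen forever") are all illustrative
assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def waiters_get_woken(pool_size: int, waiters: int, timeout: float = 0.5) -> bool:
    """Return True if every blocked waiter is woken before `timeout` expires."""
    wakeup = threading.Event()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Each waiter occupies one pool thread until the wakeup arrives
        # (or until the timeout, our stand-in for a permanent freeze).
        blocked = [pool.submit(wakeup.wait, timeout) for _ in range(waiters)]
        # The request that would release the waiters also needs a free thread;
        # if the pool is exhausted it sits in the queue behind them.
        pool.submit(wakeup.set)
        # Event.wait() returns False when it timed out without being set.
        return all(f.result() for f in blocked)


if __name__ == "__main__":
    print(waiters_get_woken(pool_size=2, waiters=2))  # pool exhausted
    print(waiters_get_woken(pool_size=3, waiters=2))  # one thread to spare
```

In this model, raising the pool size past the number of simultaneous waiters is
what lets the wakeup through, which is the effect attributed to "cygserver -r 40"
above.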

However, Cygwin 2.8.0 introduced another source of cygserver freezes:
https://cygwin.com/git/gitweb.cgi?p=newlib-cygwin.git;a=commitdiff;h=b80b2c011936f7f075b76b6e59f9e8a5ec49caa1

The 2.8.0-specific freezes have no known workaround.  Cygwin 2.8.1 works,
having reverted the problem commit.  Do not use PostgreSQL with Cygwin 2.8.0.
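
The version guidance above could be encoded in a small check, sketched here in
Python.  The function name and the returned labels are hypothetical, and the
sketch assumes that releases after 2.8.1 behave like 2.8.1, which this thread
does not confirm.

```python
import re


def cygwin_advice(release: str) -> str:
    """Classify a Cygwin release string according to the findings above."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized Cygwin release: {release!r}")
    version = tuple(int(part) for part in match.groups())
    if version == (2, 8, 0):
        # 2.8.0-specific freezes have no known workaround.
        return "avoid"
    if version < (2, 8, 0):
        # Fixed-size cygserver pool: raise it, e.g. "cygserver -r 40".
        return "raise-request-threads"
    # 2.8.1 reverted the problem commit; later releases are an assumption.
    return "ok"


if __name__ == "__main__":
    for release in ("2.2.1", "2.6.0", "2.8.0", "2.8.1"):
        print(release, cygwin_advice(release))
```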

> If a Cygwin
> buildfarm member starts using --enable-tap-tests, you may see failures in the
> pgbench test suite.  (lorikeet used --enable-tap-tests from 2017-03-18 to
> 2017-03-20, but it failed before reaching the pgbench test suite.)  Curious
> that "make check" has too little concurrency to see more effects from this.

I now understand the bug required eleven concurrent lock waiters, and it's
plausible that "make check" doesn't experience that.  The pgbench test suite
uses -c5, so I expect it to be stable on almost any Cygwin.