Re: [EXTERNAL] Re: Add non-blocking version of PQcancel - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Date
Msg-id e4d0cb33-6be5-e4d5-ae49-9eac3ff2b005@gmail.com
Whole thread Raw
In response to Re: [EXTERNAL] Re: Add non-blocking version of PQcancel  (Jelte Fennema-Nio <postgres@jeltef.nl>)
Responses Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
List pgsql-hackers
Hello,

25.06.2024 11:24, Jelte Fennema-Nio wrote:
> My expectation is that that should remove all failure cases. If it
> doesn't, I think our best bet is removing the test again.

It looks like that test eventually showed what could be called a virtue.
Please take a look at a recent BF failure [1]:
timed out after 10800 secs
...
# +++ regress install-check in contrib/postgres_fdw +++
# using postmaster on /home/andrew/bf/root/tmp/buildfarm-e2ahpQ, port 5878

So the postgres_fdw test hanged for several hours while running on the
Cygwin animal lorikeet.

I've managed to reproduce this issue in my Cygwin environment by running
the postgres_fdw test in a loop (10 iterations are enough to get the
described effect). And what I'm seeing is that a query-cancelling backend
is stuck inside pgfdw_xact_callback() -> pgfdw_abort_cleanup() ->
pgfdw_cancel_query() -> pgfdw_cancel_query_begin() -> libpqsrv_cancel() ->
WaitLatchOrSocket() -> WaitEventSetWait() -> WaitEventSetWaitBlock() ->
poll().

The timeout value (approximately 30 seconds), which is passed to poll(),
is effectively ignored by this call — the waiting lasts for unlimited time.

This definitely is caused by 2466d6654. (I applied the test change from that
commit to 2466d6654~1 and saw no issue when running the same test in a
loop.)

With gdb attached to a hanging backend, I see the following stack trace:
#0  0x00007ffb7f70d5e4 in ntdll!ZwWaitForSingleObject () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffb7d2e920e in WaitForSingleObjectEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ffb5ce78862 in fhandler_socket_wsock::evaluate_events (this=0x800126968, event_mask=50, events=@0x7ffffb208:

0, erase=erase@entry=false)
     at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/fhandler/socket_inet.cc:268
#3  0x00007ffb5cdef0f5 in peek_socket (me=0xa001a43c0) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:1771
#4  0x00007ffb5cdf211e in select_stuff::poll (this=this@entry=0x7ffffb300, readfds=0x7ffffb570, 
readfds@entry=0x800000000, writefds=0x7ffffb560, writefds@entry=0x7ffffb5c0, exceptfds=0x7ffffb550,
     exceptfds@entry=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:554
#5  0x00007ffb5cdf257e in select (maxfds=maxfds@entry=45, readfds=0x800000000, writefds=0x7ffffb5c0, 
exceptfds=0x7ffb5cdf2c97 <cygwin_select(int, fd_set*, fd_set*, fd_set*, timeval*)+71>, us=4308570016,
     us@entry=29973000) at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:204
#6  0x00007ffb5cdf2927 in pselect (maxfds=45, readfds=0x7ffffb570, writefds=0x7ffffb560, exceptfds=0x7ffffb550, 
to=<optimized out>, to@entry=0x7ffffb500, set=<optimized out>, set@entry=0x0)
     at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:120
#7  0x00007ffb5cdf2c97 in cygwin_select (maxfds=<optimized out>, readfds=<optimized out>, writefds=<optimized out>, 
exceptfds=<optimized out>, to=0x7ffffb5b0)
     at /usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/select.cc:147
#8  0x00007ffb5cddc112 in poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at 
/usr/src/debug/cygwin-3.5.3-1/winsup/cygwin/poll.cc:83
...
and socket_inet.c:268 ([2]) indeed contains an infinite wait call
(LOCK_EVENTS; / WaitForSingleObject (wsock_mtx, INFINITE)).

So it looks like a Cygwin bug, but maybe something should be done on our side
too, at least to prevent such lorikeet failures.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2024-07-12%2010%3A05%3A27
[2] https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/fhandler/socket_inet.cc?h=cygwin-3.5.3

Best regards,
Alexander



pgsql-hackers by date:

Previous
From: Rajan Pandey
Date:
Subject: Why is 'use_alias' hardcoded to true in deparseFromExprForRel() for some cases
Next
From: Nazir Bilal Yavuz
Date:
Subject: Re: Use read streams in CREATE DATABASE command when the strategy is wal_log