Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs

From Tom Lane
Subject Re: conchuela timeouts since 2021-10-09 system upgrade
Msg-id 83446.1635258579@sss.pgh.pa.us
In response to Re: conchuela timeouts since 2021-10-09 system upgrade  (Noah Misch <noah@leadboat.com>)
Responses Re: conchuela timeouts since 2021-10-09 system upgrade  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
Noah Misch <noah@leadboat.com> writes:
> On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
>> Or more
>> practically, use advisory locks in that script to enforce that only one
>> runs at once.

> The author did try that.

Hmm ... that ought to have done the trick, I'd think.  However:

> Both sound doable, but I don't expect either to fix prairiedog's trouble.

Yeah :-(.  I think this test is somehow stumbling over a pre-existing bug.

>> So what we have is that libpq thinks it's sent the next DROP INDEX,
>> but the backend hasn't seen it.

> Thanks for isolating that.

The plot thickens.  When I went back to look at that machine this morning,
I found this in the postmaster log:

2021-10-26 02:52:09.324 EDT [1013] 002_cic.pl LOG:  statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl LOG:  could not send data to client: Broken pipe
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl FATAL:  connection to client lost

The timestamps correspond (more or less anyway) to when I killed off the
stuck test run and went to bed.  So the DROP command *was* sent, and it
was eventually received by the backend, but it seems to have taken killing
the pgbench process to do it.

I think this probably exonerates the pgbench/libpq side of things, and
instead we have to wonder about a backend or kernel bug.  A kernel bug
could explain the otherwise inexplicable connection to what's happening on
some other file descriptor.  I'd be prepared to believe that prairiedog's
ancient macOS version has some weird bug preventing kevent() from noticing
available data ... but (a) surely conchuela wouldn't share such a bug,
and (b) we've been using kevent() for a couple years now, so how come
we didn't see this before?

Still baffled.  I'm currently experimenting to see if the bug reproduces
when latch.c is made to use poll() instead of kevent().  But the failure
rate was low enough that it'll be hours before I can say confidently
that it doesn't (unless, of course, it does).
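
For readers unfamiliar with the alternative being tested, here is a minimal
sketch of waiting for readable data on a single descriptor with poll().
It is not the code in latch.c; the function names wait_readable() and
pending_byte_is_seen() are hypothetical and exist only for illustration.
The second function is a small self-check: it sends one byte down a
socketpair and confirms that poll() reports it.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if a read would not block (data, EOF, or error condition),
 * 0 on timeout, -1 if poll() itself failed.
 */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int         rc;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait (the full timeout restarts). */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}

/*
 * Hypothetical self-check: returns 1 if poll() sees nothing before a byte
 * is written and sees the byte afterwards, 0 if not, -1 on setup failure.
 */
int
pending_byte_is_seen(void)
{
    int         sv[2];
    int         before;
    int         after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    before = wait_readable(sv[0], 0);   /* nothing has been sent yet */

    if (write(sv[1], "x", 1) != 1)
    {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    after = wait_readable(sv[0], 1000); /* one byte is now pending */

    close(sv[0]);
    close(sv[1]);
    return (before == 0 && after == 1) ? 1 : 0;
}
```

The symptom described above would correspond to the wait never returning
even though the peer has already written data; this sketch only shows what
the expected behavior looks like and says nothing about where the bug is.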

            regards, tom lane


