Re: conchuela timeouts since 2021-10-09 system upgrade - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: conchuela timeouts since 2021-10-09 system upgrade
Date	October 26, 2021 14:29:39
Msg-id	83446.1635258579@sss.pgh.pa.us
In response to	Re: conchuela timeouts since 2021-10-09 system upgrade (Noah Misch <noah@leadboat.com>)
Responses	Re: conchuela timeouts since 2021-10-09 system upgrade
List	pgsql-bugs

Noah Misch <noah@leadboat.com> writes:
> On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
>> Or more
>> practically, use advisory locks in that script to enforce that only one
>> runs at once.

> The author did try that.

Hmm ... that ought to have done the trick, I'd think.  However:

> Both sound doable, but I don't expect either to fix prairiedog's trouble.

Yeah :-(.  I think this test is somehow stumbling over a pre-existing bug.

>> So what we have is that libpq thinks it's sent the next DROP INDEX,
>> but the backend hasn't seen it.

> Thanks for isolating that.

The plot thickens.  When I went back to look at that machine this morning,
I found this in the postmaster log:

2021-10-26 02:52:09.324 EDT [1013] 002_cic.pl LOG:  statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl LOG:  could not send data to client: Broken pipe
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl FATAL:  connection to client lost

The timestamps correspond (more or less anyway) to when I killed off the
stuck test run and went to bed.  So the DROP command *was* sent, and it
was eventually received by the backend, but it seems to have taken killing
the pgbench process to do it.

I think this probably exonerates the pgbench/libpq side of things, and
instead we have to wonder about a backend or kernel bug.  A kernel bug
could possibly explain the unexplainable connection to what's happening on
some other file descriptor.  I'd be prepared to believe that prairiedog's
ancient macOS version has some weird bug preventing kevent() from noticing
available data ... but (a) surely conchuela wouldn't share such a bug,
and (b) we've been using kevent() for a couple years now, so how come
we didn't see this before?

Still baffled.  I'm currently experimenting to see if the bug reproduces
when latch.c is made to use poll() instead of kevent().  But the failure
rate was low enough that it'll be hours before I can say confidently
that it doesn't (unless, of course, it does).
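
For anyone following along, the property at issue is simply that a readiness
call should report data pending on a socket.  The following is a minimal
Python sketch of that check using select.poll(); it is not the latch.c code,
the names data_is_readable and demo are hypothetical, and a socketpair stands
in for the client/backend connection (an assumption made only so the example
is self-contained):

```python
import select
import socket


def data_is_readable(sock, timeout_ms=100):
    """Return True if poll() reports pending input on sock within timeout_ms."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    events = poller.poll(timeout_ms)  # list of (fd, event mask) pairs
    return any(fd == sock.fileno() and (mask & select.POLLIN)
               for fd, mask in events)


def demo():
    # Hypothetical stand-in for the connection: a connected socket pair.
    client, backend = socket.socketpair()
    try:
        before = data_is_readable(backend)  # nothing has been sent yet
        client.sendall(b"DROP INDEX CONCURRENTLY idx;")
        after = data_is_readable(backend)   # the command is now pending
        return before, after
    finally:
        client.close()
        backend.close()
```

On a healthy kernel the second check should come back true as soon as the
bytes are queued; the symptom described above is the equivalent of that
notification not arriving until something else happens to the process.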

            regards, tom lane
