Re: Why is citext/regress failing on hamerkop? - Mailing list pgsql-hackers

From: Alexander Lakhin
Subject: Re: Why is citext/regress failing on hamerkop?
Msg-id: 17112876-887c-43aa-e81f-2ec01e3c776f@gmail.com
In response to: Re: Why is citext/regress failing on hamerkop? (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
13.05.2024 23:17, Tom Lane wrote:
> 3. We don't know exactly why hamerkop suddenly started seeing these
> failures, but a plausible theory emerges after noting that its
> reported time for the successful "make check" step dropped pretty
> substantially right when this started.  In the v13 branch, "make
> check" was taking 2:18 or more in the several runs right before the
> first isolationcheck failure, but 1:40 or less just after.  So it
> looks like the animal was moved onto faster hardware.  That feeds
> into this problem because, with a successful isolationcheck run
> taking more than a minute, there was enough time for some of the
> earlier stuck sessions to time out and exit before their
> postmaster-child slots were needed.

Yes, and one thing I can't explain yet is why the timings on REL_14_STABLE and
later differ substantially from those on REL_13_STABLE and earlier, say, for the
check stage: on REL_14_STABLE the oldest available test log (from 2021-10-30)
shows check (00:03:47) and the newest one (from 2024-05-12) shows check (00:03:18),
whereas on REL_13_STABLE the oldest available log (from 2021-08-06) shows a check
duration of 00:03:00, which then decreased to 00:02:24 (2021-09-22) and
00:02:14 (2021-11-07), and now it's 1:40, as you said.

Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27
secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
> I was beginning to suspect that lingering odour myself.  I haven't
> looked at the GSS code but I was imagining that what we have here is
> perhaps not unsent data dropped on the floor due to linger policy
> (unclean socket close on process exit), but rather that the server
> didn't see the socket as ready to read because it lost track of the
> FD_CLOSE somewhere because the client closed it gracefully, and our
> server-side FD_CLOSE handling has always been a bit suspect.  I wonder
> if the GSS code is somehow more prone to brokenness.  One thing we
> learned in earlier problems was that abortive/error disconnections
> generate FD_CLOSE repeatedly, while graceful ones give you only one.
> In other words, if the other end politely calls closesocket(), the
> server had better not miss the FD_CLOSE event, because it won't come
> again.   That's what
>
> https://commitfest.postgresql.org/46/3523/
>
> is intended to fix.  Does it help here?  Unfortunately that's
> unpleasantly complicated and unbackpatchable (keeping a side-table of
> socket FDs and event handles, so we don't lose events between the
> cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last
commit to which that patch set still applies).
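
To make the FD_CLOSE hazard Thomas describes more concrete, here is a minimal
stand-alone WinSock sketch (illustrative only, not PostgreSQL code; the
remembered_close flag and the function name are made up for this example and
merely stand in for the side-table of sockets and event handles that the patch
set keeps):

/*
 * A graceful closesocket() on the peer side is reported as FD_CLOSE only
 * once, so the waiter has to remember it; otherwise a later wait on the
 * same socket can block forever.
 */
#include <winsock2.h>
#include <stdbool.h>

static bool remembered_close = false;	/* sticky: FD_CLOSE won't come again */

static long
wait_for_socket_events(SOCKET s, WSAEVENT ev)
{
	WSANETWORKEVENTS nev;

	if (remembered_close)
		return FD_CLOSE;		/* already consumed earlier; don't wait again */

	if (WSAEventSelect(s, ev, FD_READ | FD_WRITE | FD_CLOSE) == SOCKET_ERROR)
		return -1;

	if (WSAWaitForMultipleEvents(1, &ev, FALSE, WSA_INFINITE, FALSE) ==
		WSA_WAIT_FAILED)
		return -1;

	if (WSAEnumNetworkEvents(s, ev, &nev) == SOCKET_ERROR)
		return -1;

	if (nev.lNetworkEvents & FD_CLOSE)
		remembered_close = true;	/* latch the one-shot close notification */

	return nev.lNetworkEvents;
}

Without something like that latch, a backend that misses the single FD_CLOSE
can sit in its wait indefinitely, which matches the failed run below.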

An excerpt (all lines related to process 12500) from a failed run log
without the patch set:
2024-05-14 05:57:29.526 UTC [8228:128] DEBUG:  forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG:  connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG:  server process (PID 12500) exited with exit code 1
# at the end of the test run

And an excerpt (all lines related to process 11736) from a successful run
log with the patch set applied:
2024-05-14 06:03:57.216 UTC [4524:130] DEBUG:  forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG:  connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG:  exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG:  forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG:  server process (PID 11736) exited with exit code 0
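
For reference, the "!!!read_or_wait" trace lines above follow a
nonblocking-read-then-wait loop roughly shaped like the sketch below (simplified
and inferred from the log, not a verbatim copy of be-secure-gssapi.c; errno
10035 is WSAEWOULDBLOCK):

/*
 * secure_raw_read() on the nonblocking socket returns -1 with EWOULDBLOCK
 * (or 0), so the backend parks in WaitLatchOrSocket() until the socket
 * becomes readable.  If the client has already closed the socket
 * gracefully and the single FD_CLOSE was lost, that wait never returns,
 * which is what the failed-run excerpt shows for PID 12500.
 */
#include "postgres.h"
#include "libpq/libpq.h"
#include "libpq/libpq-be.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

static bool
read_or_wait_sketch(Port *port, char *buf, ssize_t len)
{
	ssize_t		done = 0;

	while (done < len)
	{
		ssize_t		ret = secure_raw_read(port, buf + done, len - done);

		if (ret <= 0)
		{
			/* a real error (not "would block") ends the attempt */
			if (ret < 0 && errno != EWOULDBLOCK && errno != EAGAIN)
				return false;

			/* nothing readable yet: wait for the socket to wake us up */
			(void) WaitLatchOrSocket(MyLatch,
									 WL_LATCH_SET | WL_SOCKET_READABLE |
									 WL_EXIT_ON_PM_DEATH,
									 port->sock, 0,
									 WAIT_EVENT_GSS_OPEN_SERVER);
			continue;
		}

		done += ret;
	}
	return true;
}

With the patch set applied, the second wait returns and the connection attempt
fails promptly instead, as the successful-run excerpt shows.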

Best regards,
Alexander


