Re: Why is citext/regress failing on hamerkop? - Mailing list pgsql-hackers
| From | Alexander Lakhin |
| --- | --- |
| Subject | Re: Why is citext/regress failing on hamerkop? |
| Date | |
| Msg-id | 17112876-887c-43aa-e81f-2ec01e3c776f@gmail.com |
| In response to | Re: Why is citext/regress failing on hamerkop? (Thomas Munro <thomas.munro@gmail.com>) |
| Responses | Re: Why is citext/regress failing on hamerkop?; Re: Why is citext/regress failing on hamerkop? |
| List | pgsql-hackers |
13.05.2024 23:17, Tom Lane wrote:
> 3. We don't know exactly why hamerkop suddenly started seeing these
> failures, but a plausible theory emerges after noting that its
> reported time for the successful "make check" step dropped pretty
> substantially right when this started. In the v13 branch, "make
> check" was taking 2:18 or more in the several runs right before the
> first isolationcheck failure, but 1:40 or less just after. So it
> looks like the animal was moved onto faster hardware. That feeds
> into this problem because, with a successful isolationcheck run
> taking more than a minute, there was enough time for some of the
> earlier stuck sessions to time out and exit before their
> postmaster-child slots were needed.

Yes, and one thing I can't explain yet is why the timings on REL_14_STABLE and later differ substantially from those on REL_13_STABLE and earlier, say, for the check stage:
On REL_14_STABLE, the oldest available test log (from 2021-10-30) shows check (00:03:47), and the newest one (from 2024-05-12) shows check (00:03:18).
Whereas on REL_13_STABLE, the oldest available log (from 2021-08-06) shows a check duration of 00:03:00; it then decreased to 00:02:24 (2021-09-22) / 00:02:14 (2021-11-07), and now it's 1:40, as you said.
Locally I see more or less the same timings on REL_13_STABLE (34, 28, 27 secs) and on REL_14_STABLE (33, 29, 29 secs).

14.05.2024 03:38, Thomas Munro wrote:
> I was beginning to suspect that lingering odour myself. I haven't
> looked at the GSS code, but I was imagining that what we have here is
> perhaps not unsent data dropped on the floor due to linger policy
> (unclean socket close on process exit), but rather that the server
> didn't see the socket as ready to read because it lost track of the
> FD_CLOSE somewhere because the client closed it gracefully, and our
> server-side FD_CLOSE handling has always been a bit suspect. I wonder
> if the GSS code is somehow more prone to brokenness. One thing we
> learned in earlier problems was that abortive/error disconnections
> generate FD_CLOSE repeatedly, while graceful ones give you only one.
> In other words, if the other end politely calls closesocket(), the
> server had better not miss the FD_CLOSE event, because it won't come
> again. That's what
>
> https://commitfest.postgresql.org/46/3523/
>
> is intended to fix. Does it help here? Unfortunately that's
> unpleasantly complicated and unbackpatchable (keeping a side-table of
> socket FDs and event handles, so we don't lose events between the
> cracks).

Yes, that cure helps here too. I've tested it on b282fa88d~1 (the last state at which that patch set can still be applied).
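To restate the hazard Thomas describes in plain Winsock terms: abortive disconnections keep re-signalling FD_CLOSE, while a graceful closesocket() reports it exactly once, so whichever wait consumes that report has to remember it for later callers. Below is a minimal illustrative sketch of that latching idea; the names (SocketWaitState, wait_for_readable) are made up here, and this is not the actual code from the commitfest entry above:

```c
/*
 * Illustrative sketch only -- not the patch from commitfest entry 46/3523,
 * and not PostgreSQL's WaitEventSet code.  It just shows the Winsock
 * behaviour discussed above: FD_CLOSE for a graceful close is reported
 * once, so it has to be latched ("seen_close") rather than waited for again.
 */
#include <winsock2.h>
#include <stdbool.h>

typedef struct SocketWaitState
{
	SOCKET		sock;			/* non-blocking socket */
	WSAEVENT	event;			/* created earlier with WSACreateEvent() */
	bool		seen_close;		/* latched FD_CLOSE indication */
} SocketWaitState;

/* Wait until the socket is readable or the peer has closed it. */
static bool
wait_for_readable(SocketWaitState *st, DWORD timeout_ms)
{
	WSANETWORKEVENTS ev;

	if (st->seen_close)
		return true;			/* close seen earlier; don't wait again */

	WSAEventSelect(st->sock, st->event, FD_READ | FD_CLOSE);

	if (WSAWaitForMultipleEvents(1, &st->event, FALSE, timeout_ms, FALSE)
		!= WSA_WAIT_EVENT_0)
		return false;			/* timed out */

	/* Reports which conditions fired and resets the event object. */
	WSAEnumNetworkEvents(st->sock, st->event, &ev);

	if (ev.lNetworkEvents & FD_CLOSE)
		st->seen_close = true;	/* it will not be signalled again */

	return (ev.lNetworkEvents & (FD_READ | FD_CLOSE)) != 0;
}
```

The real patch set has to keep equivalent per-socket state inside WaitEventSet (the side-table of socket FDs and event handles Thomas mentions), which is what makes it complicated and hard to backpatch.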
An excerpt (all lines related to process 12500) from a failed run log without the patch set:

2024-05-14 05:57:29.526 UTC [8228:128] DEBUG: forked new backend, pid=12500 socket=5524
2024-05-14 05:57:29.534 UTC [12500:1] [unknown] LOG: connection received: host=::1 port=51394
2024-05-14 05:57:29.534 UTC [12500:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 05:57:29.534 UTC [12500:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 05:57:29.534 UTC [12500:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 05:57:29.534 UTC [12500:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.534 UTC [12500:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 05:57:29.534 UTC [12500:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 05:57:29.549 UTC [12500:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 05:57:29.549 UTC [12500:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 05:57:29.549 UTC [12500:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 05:57:29.549 UTC [12500:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
...
2024-05-14 05:57:52.024 UTC [8228:3678] DEBUG: server process (PID 12500) exited with exit code 1    # at the end of the test run
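For readers of the log: errno 10035 is WSAEWOULDBLOCK, and the later secure_raw_read() returning 0 means the client closed the connection gracefully (the 10035 printed next to it is presumably left over from the previous call). A minimal sketch of how a non-blocking recv() on Windows reports those outcomes (illustrative only, not PostgreSQL's secure_raw_read()):

```c
/*
 * Illustrative only -- not PostgreSQL's secure_raw_read().  Shows how a
 * non-blocking recv() on Windows reports the three outcomes visible in the
 * log above: data, would-block (WSAEWOULDBLOCK = 10035), and graceful EOF.
 */
#include <winsock2.h>
#include <stdio.h>

static void
read_once(SOCKET s)
{
	char		buf[4];
	int			r = recv(s, buf, (int) sizeof(buf), 0);

	if (r > 0)
		printf("read %d bytes\n", r);
	else if (r == 0)
		printf("EOF: peer closed the connection gracefully\n");
	else if (WSAGetLastError() == WSAEWOULDBLOCK)
		printf("no data yet; wait on the socket and retry\n");
	else
		printf("hard error: %d\n", WSAGetLastError());
}
```

After that 0-byte read the backend goes back to WaitLatchOrSocket(), and since the single FD_CLOSE notification for the graceful close has presumably already been consumed, that wait never returns and the process lingers until the end of the test run.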
And an excerpt (all lines related to process 11736) from a successful run log with the patch set applied:

2024-05-14 06:03:57.216 UTC [4524:130] DEBUG: forked new backend, pid=11736 socket=5540
2024-05-14 06:03:57.226 UTC [11736:1] [unknown] LOG: connection received: host=::1 port=51914
2024-05-14 06:03:57.226 UTC [11736:2] [unknown] LOG: !!!BackendInitialize| before ProcessStartupPacket
2024-05-14 06:03:57.226 UTC [11736:3] [unknown] LOG: !!!ProcessStartupPacket| before secure_open_gssapi(), GSSok: G
2024-05-14 06:03:57.226 UTC [11736:4] [unknown] LOG: !!!secure_open_gssapi| before read_or_wait
2024-05-14 06:03:57.226 UTC [11736:5] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.226 UTC [11736:6] [unknown] LOG: !!!read_or_wait| after secure_raw_read: -1, errno: 10035
2024-05-14 06:03:57.226 UTC [11736:7] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:8] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:9] [unknown] LOG: !!!read_or_wait| before secure_raw_read(); PqGSSRecvLength: 0, len:4
2024-05-14 06:03:57.240 UTC [11736:10] [unknown] LOG: !!!read_or_wait| after secure_raw_read: 0, errno: 10035
2024-05-14 06:03:57.240 UTC [11736:11] [unknown] LOG: !!!read_or_wait| before WaitLatchOrSocket()
2024-05-14 06:03:57.240 UTC [11736:12] [unknown] LOG: !!!read_or_wait| after WaitLatchOrSocket
2024-05-14 06:03:57.240 UTC [11736:13] [unknown] LOG: !!!secure_open_gssapi| read_or_wait returned -1
2024-05-14 06:03:57.240 UTC [11736:14] [unknown] LOG: !!!ProcessStartupPacket| secure_open_gssapi() returned error
2024-05-14 06:03:57.240 UTC [11736:15] [unknown] LOG: !!!BackendInitialize| after ProcessStartupPacket
2024-05-14 06:03:57.240 UTC [11736:16] [unknown] LOG: !!!BackendInitialize| status: -1
2024-05-14 06:03:57.240 UTC [11736:17] [unknown] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:18] [unknown] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:19] [unknown] DEBUG: proc_exit(0): 1 callbacks to make
2024-05-14 06:03:57.240 UTC [11736:20] [unknown] DEBUG: exit(0)
2024-05-14 06:03:57.240 UTC [11736:21] [unknown] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:22] [unknown] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-05-14 06:03:57.240 UTC [11736:23] [unknown] DEBUG: proc_exit(-1): 0 callbacks to make
2024-05-14 06:03:57.243 UTC [4524:132] DEBUG: forked new backend, pid=10536 socket=5540
2024-05-14 06:03:57.243 UTC [4524:133] DEBUG: server process (PID 11736) exited with exit code 0

Best regards,
Alexander