Thread: postgres_fdw test timeouts
I noticed that the postgres_fdw test periodically times out on Windows:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-10%2003%3A12%3A58
https://cirrus-ci.com/task/5504294095421440
https://cirrus-ci.com/task/4814111003901952
https://cirrus-ci.com/task/5343037535027200
https://cirrus-ci.com/task/5893655966253056
https://cirrus-ci.com/task/4568159320014848
https://cirrus-ci.com/task/5238067850641408
(and many more from cfbot)

From a quick sampling, the relevant test logs end with either

    ERROR:  server "unknownserver" does not exist
    STATEMENT:  SELECT postgres_fdw_disconnect('unknownserver');

or

    ERROR:  relation "public.non_existent_table" does not exist
    CONTEXT:  remote SQL command: SELECT a, b, c FROM public.non_existent_table
    STATEMENT:  SELECT * FROM async_pt;

before the test seems to hang.

AFAICT the failures began around September 10th, which leads me to wonder
if this is related to commit 04a09ee. That is little more than a wild
guess, though. I haven't been able to deduce much else from the logs I can
find, and I didn't find any previous reports about this in the archives
after lots of searching, so I thought I'd at least park these notes here in
case anyone else has ideas.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 1, 2023 at 9:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> AFAICT the failures began around September 10th, which leads me to wonder
> if this is related to commit 04a09ee. That is little more than a wild
> guess, though. I haven't been able to deduce much else from the logs I can
> find, and I didn't find any previous reports about this in the archives
> after lots of searching, so I thought I'd at least park these notes here in
> case anyone else has ideas.

Thanks for finding this correlation. Yeah, poking around in the cfbot
history database I see about 1 failure like that per day since that date,
and there doesn't seem to be anything else as obviously likely to be
related to wakeups and timeouts. I don't understand what's wrong with the
logic, and I think it would take someone willing to debug it locally to
figure that out. Unless someone has an idea, I'm leaning towards reverting
that commit and leaving the relatively minor problem that it was intended
to fix as a TODO.
Hello Thomas,

03.12.2023 02:48, Thomas Munro wrote:
> Thanks for finding this correlation. Yeah, poking around in the cfbot
> history database I see about 1 failure like that per day since that
> date, and there doesn't seem to be anything else as obviously likely
> to be related to wakeups and timeouts. I don't understand what's
> wrong with the logic, and I think it would take someone willing to
> debug it locally to figure that out. Unless someone has an idea, I'm
> leaning towards reverting that commit and leaving the relatively minor
> problem that it was intended to fix as a TODO.

I've managed to reproduce the failure locally when running
postgres_fdw_x/regress in parallel (--num-processes 10). It reproduced
for me on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1
(30 iterations passed).

I'm going to investigate this case within days. Maybe we could find a
better fix for the issue.

Best regards,
Alexander
On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've managed to reproduce the failure locally when running
> postgres_fdw_x/regress in parallel (--num-processes 10). It reproduced
> for me on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1
> (30 iterations passed).
>
> I'm going to investigate this case within days. Maybe we could find a
> better fix for the issue.

Thanks. One thing I can recommend to anyone trying to understand the
change is that you view it with:

    git show --ignore-all-space 04a09ee

... because it changed a lot of indentation when wrapping a bunch of stuff
in a new for loop.
Hello Thomas,

03.12.2023 09:00, Thomas Munro wrote:
> On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've managed to reproduce the failure locally when running
>> postgres_fdw_x/regress in parallel (--num-processes 10). It reproduced
>> for me on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1
>> (30 iterations passed).
>>
>> I'm going to investigate this case within days. Maybe we could find a
>> better fix for the issue.
> Thanks. One thing I can recommend to anyone trying to understand the
> change is that you view it with:

I think I've found out what's going on here. The culprit is
WSAEnumNetworkEvents() assisted by the non-trivial logic of
ExecAppendAsyncEventWait().

For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one is also processed inside it, through a recursive call to
ExecAppendAsyncEventWait():

-> ExecAsyncNotify -> produce_tuple_asynchronously
 -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
 -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
 -> ExecAppendAsyncEventWait.

Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset its callback_pending yet),
and that leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():

    AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

This WaitEvent seemingly misses an event that we should get for that
socket.

In fact, getting noccurred > 1 in ExecAppendAsyncEventWait() is not even
required to see the failure; it's enough to call WSAEnumNetworkEvents()
inside WaitEventSetWaitBlock() for the second socket (I tried to exit from
WaitEventSetWaitBlock's new loop prematurely, without touching
occurred_events/returned_events on the second iteration of the loop).

So it looks like we have the same issue with multiple event handles
associated with a single socket here. And
v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).

[1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

Best regards,
Alexander
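To make the Winsock-level hazard Alexander describes concrete, here is a
minimal standalone sketch in plain Windows API calls. It is not PostgreSQL
code: the function name is invented, WSAStartup()/error handling are
omitted, and the real interplay inside latch.c is more involved; the
mapping onto the executor call chain is only assumed from the description
above.

    #include <winsock2.h>

    /* link with ws2_32; WSAStartup() and error handling omitted */

    static void
    two_handles_one_socket(SOCKET s)
    {
        WSAEVENT    evA = WSACreateEvent();
        WSAEVENT    evB = WSACreateEvent();
        WSANETWORKEVENTS ne;

        /* First registration: evA should be signalled when s becomes readable. */
        WSAEventSelect(s, evA, FD_READ | FD_CLOSE);

        /*
         * Second registration for the same socket (roughly what happens when
         * ExecAsyncConfigureWait() is reached again through the recursive
         * ExecAppendAsyncEventWait() call): this cancels the association
         * with evA and clears the socket's internal network event record.
         */
        WSAEventSelect(s, evB, FD_READ | FD_CLOSE);

        /*
         * From here on only evB can be signalled, and enumerating the events
         * through evB clears the record again, so a later wait that watches
         * only evA blocks forever -- the "missed event" behind the hang.
         */
        WSAEnumNetworkEvents(s, evB, &ne);

        WSACloseEvent(evA);
        WSACloseEvent(evB);
    }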
On Thu, Dec 7, 2023 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I think I've found out what's going on here. The culprit is
> WSAEnumNetworkEvents() assisted by the non-trivial logic of
> ExecAppendAsyncEventWait().
>
> For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop
> where ExecAsyncNotify() is called for the first AsyncRequest, but the
> second one is also processed inside it, through a recursive call to
> ExecAppendAsyncEventWait():
>
> -> ExecAsyncNotify -> produce_tuple_asynchronously
>  -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
>  -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
>  -> ExecAppendAsyncEventWait.
>
> Here we get into the first loop and call ExecAsyncConfigureWait() for the
> second AsyncRequest (because we haven't reset its callback_pending yet),
> and that leads to creating another WaitEvent for the second socket inside
> postgresForeignAsyncConfigureWait():
>
>     AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

Oh, wow. Nice detective work! Thank you for figuring that out.

> So it looks like we have the same issue with multiple event handles
> associated with a single socket here. And
> v2-0003-Redesign-Windows-socket-event-management.patch from [1]
> "surprisingly" helps in this case as well (I could not see a failure for
> 100 iterations of 10 tests in parallel).

Yeah, this makes perfect sense. So, commit 04a09ee is not guilty. But as
the saying goes, "no good deed goes unpunished", and work on our Windows
port seems to be especially prone to backfiring when kludges combine...

Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures). I don't love the hidden complexity of the socket table and am
not in a hurry to commit it, but I don't currently see another way... on
the other hand, we have other CI flapping due to that problem too, so
reverting 04a09ee would be sweeping problems under the carpet. I still
need to process your feedback/discoveries on that other thread, and it may
take a few weeks for me to get to it.
On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
> Oh, wow. Nice detective work! Thank you for figuring that out.

+1

> Now we have the question of whether to go forwards (commit the "socket
> table" thing), or backwards (revert 04a09ee for now to clear the CI
> failures). I don't love the hidden complexity of the socket table and am
> not in a hurry to commit it, but I don't currently see another way... on
> the other hand, we have other CI flapping due to that problem too, so
> reverting 04a09ee would be sweeping problems under the carpet. I still
> need to process your feedback/discoveries on that other thread, and it may
> take a few weeks for me to get to it.

I don't think we need to revert 04a09ee provided the issue is unrelated and
a fix is in development.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
08.12.2023 02:02, Nathan Bossart wrote:
> On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
>> Now we have the question of whether to go forwards (commit the "socket
>> table" thing), or backwards (revert 04a09ee for now to clear the CI
>> failures). I don't love the hidden complexity of the socket table and am
>> not in a hurry to commit it, but I don't currently see another way... on
>> the other hand, we have other CI flapping due to that problem too, so
>> reverting 04a09ee would be sweeping problems under the carpet. I still
>> need to process your feedback/discoveries on that other thread, and it
>> may take a few weeks for me to get to it.
> I don't think we need to revert 04a09ee provided the issue is unrelated
> and a fix is in development.

I've reviewed the links posted upthread and analyzed the statistics of such
failures: yes, it happens rather frequently in Cirrus CI, but there might
be dozens of successful runs between failures. For example,
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F45%2F3686
has 1 postgres_fdw failure on Windows per 32 runs. And there is only one
such failure in the buildfarm for the last 90 days. (Perhaps the
probability of the failure depends on external factors, such as concurrent
activity.)

So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (Revert +
Revert of revert) while moving to a more permanent solution.

Best regards,
Alexander
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
> So I would not say that it's a dominant failure for now, and given that
> 04a09ee lives in master only, maybe we can save two commits (Revert +
> Revert of revert) while moving to a more permanent solution.

I just checked in on this one to see whether we needed to create an "open
item" for v17. While I'm not seeing the failures anymore, the patch that
Alexander claimed should fix it [0] doesn't appear to have been committed,
either. Perhaps this was fixed by something else...

[0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hello Nathan,

08.03.2024 01:00, Nathan Bossart wrote:
> On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
>> So I would not say that it's a dominant failure for now, and given that
>> 04a09ee lives in master only, maybe we can save two commits (Revert +
>> Revert of revert) while moving to a more permanent solution.
> I just checked in on this one to see whether we needed to create an "open
> item" for v17. While I'm not seeing the failures anymore, the patch that
> Alexander claimed should fix it [0] doesn't appear to have been committed,
> either. Perhaps this was fixed by something else...
>
> [0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

I have re-run the tests and found out that the issue was fixed by
d3c5f37dd. It changed the body of the "while (PQisBusy(conn))" loop,
formerly contained in pgfdw_get_result(), as follows:

    /* Data available in socket? */
    if (wc & WL_SOCKET_READABLE)
    {
        if (!PQconsumeInput(conn))
            pgfdw_report_error(ERROR, NULL, conn, false, query);
    }

->

    /* Consume whatever data is available from the socket */
    if (PQconsumeInput(conn) == 0)
    {
        /* trouble; expect PQgetResult() to return NULL */
        break;
    }

That is, the now-unconditional PQconsumeInput() call eliminates the test
timeout.

Best regards,
Alexander
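To spell out why the new shape avoids the hang, here is a simplified sketch
of that draining pattern. It is not the committed code: drain_result_sketch()
is an invented name, select() merely stands in for the WaitEventSet
machinery the backend really uses, and the headers assume an ordinary
libpq client build.

    #include <sys/select.h>
    #include <libpq-fe.h>

    static PGresult *
    drain_result_sketch(PGconn *conn)
    {
        while (PQisBusy(conn))
        {
            int         sock = PQsocket(conn);
            fd_set      readfds;

            /* Wait until the socket looks readable (stand-in for WaitEventSet). */
            FD_ZERO(&readfds);
            FD_SET(sock, &readfds);
            (void) select(sock + 1, &readfds, NULL, NULL, NULL);

            /*
             * Consume whatever data is available -- unconditionally, unlike
             * the pre-d3c5f37dd code, which only called PQconsumeInput() when
             * the wait reported WL_SOCKET_READABLE. A missed readiness
             * notification therefore can no longer leave the loop stuck while
             * data sits unread in the connection.
             */
            if (PQconsumeInput(conn) == 0)
                break;          /* trouble; expect PQgetResult() to return NULL */
        }

        return PQgetResult(conn);
    }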
On Sat, Mar 09, 2024 at 10:00:00AM +0300, Alexander Lakhin wrote:
> I have re-run the tests and found out that the issue was fixed by
> d3c5f37dd. It changed the body of the "while (PQisBusy(conn))" loop,
> formerly contained in pgfdw_get_result(), as follows:
>
>     /* Data available in socket? */
>     if (wc & WL_SOCKET_READABLE)
>     {
>         if (!PQconsumeInput(conn))
>             pgfdw_report_error(ERROR, NULL, conn, false, query);
>     }
>
> ->
>
>     /* Consume whatever data is available from the socket */
>     if (PQconsumeInput(conn) == 0)
>     {
>         /* trouble; expect PQgetResult() to return NULL */
>         break;
>     }
>
> That is, the now-unconditional PQconsumeInput() call eliminates the test
> timeout.

Thanks for confirming! I'm assuming this just masks the underlying
issue...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com