Thread: postgres_fdw test timeouts

postgres_fdw test timeouts

From
Nathan Bossart
Date:
I noticed that the postgres_fdw test periodically times out on Windows:

    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-11-10%2003%3A12%3A58
    https://cirrus-ci.com/task/5504294095421440
    https://cirrus-ci.com/task/4814111003901952
    https://cirrus-ci.com/task/5343037535027200
    https://cirrus-ci.com/task/5893655966253056
    https://cirrus-ci.com/task/4568159320014848
    https://cirrus-ci.com/task/5238067850641408
    (and many more from cfbot)

From a quick sampling, the relevant test logs end with either

    ERROR:  server "unknownserver" does not exist
    STATEMENT:  SELECT postgres_fdw_disconnect('unknownserver');

or

    ERROR:  relation "public.non_existent_table" does not exist
    CONTEXT:  remote SQL command: SELECT a, b, c FROM public.non_existent_table
    STATEMENT:  SELECT * FROM async_pt;

before the test seems to hang.

AFAICT the failures began around September 10th, which leads me to wonder
if this is related to commit 04a09ee.  That is little more than a wild
guess, though.  I haven't been able to deduce much else from the logs I can
find, and I didn't find any previous reports about this in the archives
after lots of searching, so I thought I'd at least park these notes here in
case anyone else has ideas.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: postgres_fdw test timeouts

From
Thomas Munro
Date:
On Fri, Dec 1, 2023 at 9:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> AFAICT the failures began around September 10th, which leads me to wonder
> if this is related to commit 04a09ee.  That is little more than a wild
> guess, though.  I haven't been able to deduce much else from the logs I can
> find, and I didn't find any previous reports about this in the archives
> after lots of searching, so I thought I'd at least park these notes here in
> case anyone else has ideas.

Thanks for finding this correlation.  Yeah, poking around in the cfbot
history database I see about 1 failure like that per day since that
date, and there doesn't seem to be anything else as obviously likely
to be related to wakeups and timeouts.  I don't understand what's
wrong with the logic, and I think it would take someone willing to
debug it locally to figure that out.  Unless someone has an idea, I'm
leaning towards reverting that commit and leaving the relatively minor
problem that it was intended to fix as a TODO.



Re: postgres_fdw test timeouts

From
Alexander Lakhin
Date:
Hello Thomas,

03.12.2023 02:48, Thomas Munro wrote:
> Thanks for finding this correlation.  Yeah, poking around in the cfbot
> history database I see about 1 failure like that per day since that
> date, and there doesn't seem to be anything else as obviously likely
> to be related to wakeups and timeouts.  I don't understand what's
> wrong with the logic, and I think it would take someone willing to
> debug it locally to figure that out.  Unless someone has an idea, I'm
> leaning towards reverting that commit and leaving the relatively minor
> problem that it was intended to fix as a TODO

I've managed to reproduce the failure locally when running postgres_fdw_x/
regress in parallel (--num-processes 10). It reproduced for me on
on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
passed).

I'm going to investigate this case within days. Maybe we could find a
better fix for the issue.

Best regards,
Alexander



Re: postgres_fdw test timeouts

From
Thomas Munro
Date:
On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've managed to reproduce the failure locally when running postgres_fdw_x/
> regress in parallel (--num-processes 10). It reproduced for me on
> on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
> passed).
>
> I'm going to investigate this case within days. Maybe we could find a
> better fix for the issue.

Thanks.  One thing I can recommend to anyone trying to understand the
change is that you view it with:

git show --ignore-all-space 04a09ee

... because it changed a lot of indentation when wrapping a bunch of
stuff in a new for loop.



Re: postgres_fdw test timeouts

From
Alexander Lakhin
Date:
Hello Thomas,

03.12.2023 09:00, Thomas Munro wrote:
> On Sun, Dec 3, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've managed to reproduce the failure locally when running postgres_fdw_x/
>> regress in parallel (--num-processes 10). It reproduced for me on
>> on 04a09ee94 (iterations 1, 2, 4), but not on 04a09ee94~1 (30 iterations
>> passed).
>>
>> I'm going to investigate this case within days. Maybe we could find a
>> better fix for the issue.
> Thanks.  One thing I can recommend to anyone trying to understand the
> change is that you view it with:

I think, I've found out what's going on here.
The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of
ExecAppendAsyncEventWait().
For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
where ExecAsyncNotify() is called for the first AsyncRequest, but the
second one also processed inside, through a recursive call to
ExecAppendAsyncEventWait():
  -> ExecAsyncNotify -> produce_tuple_asynchronously
-> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
-> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
-> ExecAppendAsyncEventWait.
Here we get into the first loop and call ExecAsyncConfigureWait() for the
second AsyncRequest (because we haven't reset it's callback_pending yet),
and it leads to creating another WaitEvent for the second socket inside
postgresForeignAsyncConfigureWait():
     AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

This WaitEvent seemingly misses an event that we should get for that socket.
It's not that important to get noccured > 1 in
ExecAppendAsyncEventWait() to see the failure, it's enough to call
WSAEnumNetworkEvents() inside WaitEventSetWaitBlock() for the second socket
(I tried to exit from the WaitEventSetWaitBlock's new loop prematurely,
without touching occurred_events, returned_events on a second iteration of
the loop).

So it looks like we have the same issue with multiple event handles
associated with a single socket here.
And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
"surprisingly" helps in this case as well (I could not see a failure for
100 iterations of 10 tests in parallel).

[1] https://www.postgresql.org/message-id/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

Best regards,
Alexander



Re: postgres_fdw test timeouts

From
Thomas Munro
Date:
On Thu, Dec 7, 2023 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I think, I've found out what's going on here.
> The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of
> ExecAppendAsyncEventWait().
> For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
> where ExecAsyncNotify() is called for the first AsyncRequest, but the
> second one also processed inside, through a recursive call to
> ExecAppendAsyncEventWait():
>   -> ExecAsyncNotify -> produce_tuple_asynchronously
> -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
> -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
> -> ExecAppendAsyncEventWait.
> Here we get into the first loop and call ExecAsyncConfigureWait() for the
> second AsyncRequest (because we haven't reset it's callback_pending yet),
> and it leads to creating another WaitEvent for the second socket inside
> postgresForeignAsyncConfigureWait():
>      AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

Oh, wow.  Nice detective work!  Thank you for figuring that out.

> So it looks like we have the same issue with multiple event handles
> associated with a single socket here.
> And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
> "surprisingly" helps in this case as well (I could not see a failure for
> 100 iterations of 10 tests in parallel).

Yeah, this makes perfect sense.

So, commit 04a09ee is not guilty.  But as the saying goes, "no good
deed goes unpunished", and work on our Windows port seems to be
especially prone to backfiring when kludges combine...

Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures).  I don't love the hidden complexity of the socket table and
am not in a hurry to commit it, but I don't currently see another
way... on the other hand we have other CI flapping due to that problem
too so reverting 04a09ee would be sweeping problems under the carpet.
I still need to process your feedback/discoveries on that other thread
and it may take a few weeks for me to get to it.



Re: postgres_fdw test timeouts

From
Nathan Bossart
Date:
On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
> Oh, wow.  Nice detective work!  Thank you for figuring that out.

+1

> Now we have the question of whether to go forwards (commit the "socket
> table" thing), or backwards (revert 04a09ee for now to clear the CI
> failures).  I don't love the hidden complexity of the socket table and
> am not in a hurry to commit it, but I don't currently see another
> way... on the other hand we have other CI flapping due to that problem
> too so reverting 04a09ee would be sweeping problems under the carpet.
> I still need to process your feedback/discoveries on that other thread
> and it may take a few weeks for me to get to it.

I don't think we need to revert 04a09ee provided the issue is unrelated and
a fix is in development.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: postgres_fdw test timeouts

From
Alexander Lakhin
Date:
08.12.2023 02:02, Nathan Bossart wrote:
> On Fri, Dec 08, 2023 at 09:55:58AM +1300, Thomas Munro wrote:
>> Now we have the question of whether to go forwards (commit the "socket
>> table" thing), or backwards (revert 04a09ee for now to clear the CI
>> failures).  I don't love the hidden complexity of the socket table and
>> am not in a hurry to commit it, but I don't currently see another
>> way... on the other hand we have other CI flapping due to that problem
>> too so reverting 04a09ee would be sweeping problems under the carpet.
>> I still need to process your feedback/discoveries on that other thread
>> and it may take a few weeks for me to get to it.
> I don't think we need to revert 04a09ee provided the issue is unrelated and
> a fix is in development.

I've reviewed the links posted upthread and analyzed statistics of such
failures:
yes, it happens rather frequently in Cirrus CI, but there might be dozens
of successful runs, for example:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F45%2F3686
has 1 postgres_fdw failure on Windows per 32 runs.
And there is only one such failure for 90 days in the buildfarm.
(Perhaps the probability of the failure depend on external factors, such as
concurrent activity.)

So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (Revert +
Revert of revert) while moving to a more persistent solution.

Best regards,
Alexander



Re: postgres_fdw test timeouts

From
Nathan Bossart
Date:
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
> So I would not say that it's a dominant failure for now, and given that
> 04a09ee lives in master only, maybe we can save two commits (Revert +
> Revert of revert) while moving to a more persistent solution.

I just checked in on this one to see whether we needed to create an "open
item" for v17.  While I'm not seeing the failures anymore, the patch that
Alexander claimed should fix it [0] doesn't appear to have been committed,
either.  Perhaps this was fixed by something else...

[0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: postgres_fdw test timeouts

From
Alexander Lakhin
Date:
Hello Nathan,

08.03.2024 01:00, Nathan Bossart wrote:
> On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:
>> So I would not say that it's a dominant failure for now, and given that
>> 04a09ee lives in master only, maybe we can save two commits (Revert +
>> Revert of revert) while moving to a more persistent solution.
> I just checked in on this one to see whether we needed to create an "open
> item" for v17.  While I'm not seeing the failures anymore, the patch that
> Alexander claimed should fix it [0] doesn't appear to have been committed,
> either.  Perhaps this was fixed by something else...
>
> [0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com

I have re-run the tests and found out that the issue was fixed by
d3c5f37dd. It changed the inner of the loop "while (PQisBusy(conn))",
formerly contained in pgfdw_get_result() as follows:
                 /* Data available in socket? */
                 if (wc & WL_SOCKET_READABLE)
                 {
                     if (!PQconsumeInput(conn))
                         pgfdw_report_error(ERROR, NULL, conn, false, query);
                 }
->
         /* Consume whatever data is available from the socket */
         if (PQconsumeInput(conn) == 0)
         {
             /* trouble; expect PQgetResult() to return NULL */
             break;
         }

That is, the unconditional "if PQconsumeInput() ..." eliminates the test
timeout.

Best regards,
Alexander



Re: postgres_fdw test timeouts

From
Nathan Bossart
Date:
On Sat, Mar 09, 2024 at 10:00:00AM +0300, Alexander Lakhin wrote:
> I have re-run the tests and found out that the issue was fixed by
> d3c5f37dd. It changed the inner of the loop "while (PQisBusy(conn))",
> formerly contained in pgfdw_get_result() as follows:
>                 /* Data available in socket? */
>                 if (wc & WL_SOCKET_READABLE)
>                 {
>                     if (!PQconsumeInput(conn))
>                         pgfdw_report_error(ERROR, NULL, conn, false, query);
>                 }
> ->
>         /* Consume whatever data is available from the socket */
>         if (PQconsumeInput(conn) == 0)
>         {
>             /* trouble; expect PQgetResult() to return NULL */
>             break;
>         }
> 
> That is, the unconditional "if PQconsumeInput() ..." eliminates the test
> timeout.

Thanks for confirming!  I'm assuming this just masks the underlying
issue...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com