Thread: pgsql: Make postgres_fdw's query_cancel test less flaky.

pgsql: Make postgres_fdw's query_cancel test less flaky.

From
Tom Lane
Date:
Make postgres_fdw's query_cancel test less flaky.

This test occasionally shows

+WARNING:  could not get result of cancel request due to timeout

which appears to be because the cancel request is sometimes unluckily
sent to the remote session between queries, and then it's ignored.

This patch tries to make that less probable in three ways:

1. Use a test query that does not involve remote estimates, so that
no EXPLAINs are sent.
2. Make sure that the remote session is ready-to-go (transaction
started, SET commands sent) before we start the timer.
3. Increase the statement_timeout to 100ms, to give the local
session enough time to plan and issue the query.

We might have to go higher than 100ms to make this adequately
stable in the buildfarm, but let's see how it goes.

Back-patch to v17 where this test was introduced.

Jelte Fennema-Nio and Tom Lane

Discussion: https://postgr.es/m/578934.1725045685@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/0e5c823806a3519a3f6966637cb62755dee659d4

Modified Files
--------------
contrib/postgres_fdw/expected/query_cancel.out | 26 +++++++++++++++++++-------
contrib/postgres_fdw/sql/query_cancel.sql      | 18 +++++++++++++-----
2 files changed, 32 insertions(+), 12 deletions(-)


Re: pgsql: Make postgres_fdw's query_cancel test less flaky.

From
Michael Paquier
Date:
Hi Tom,

On Fri, Aug 30, 2024 at 08:47:46PM +0000, Tom Lane wrote:
> Make postgres_fdw's query_cancel test less flaky.
>
> This test occasionally shows
>
> +WARNING:  could not get result of cancel request due to timeout
>
> which appears to be because the cancel request is sometimes unluckily
> sent to the remote session between queries, and then it's ignored.
>
> Back-patch to v17 where this test was introduced.

Based on this recent report from the buildfarm, there is still
something happening here:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-07%2004%3A31%3A37

In details:
 -- This would take very long if not canceled:
 SELECT count(*) FROM ft1 a CROSS JOIN ft1 b CROSS JOIN ft1 c CROSS JOIN ft1 d;
 ERROR:  canceling statement due to statement timeout
+WARNING:  could not send cancel request: cancel request timed out
 COMMIT;
--
Michael

Attachment

Re: pgsql: Make postgres_fdw's query_cancel test less flaky.

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Aug 30, 2024 at 08:47:46PM +0000, Tom Lane wrote:
>> Make postgres_fdw's query_cancel test less flaky.

> Based on this recent report from the buildfarm, there is still
> something happening here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-07%2004%3A31%3A37
> In details:
>  -- This would take very long if not canceled:
>  SELECT count(*) FROM ft1 a CROSS JOIN ft1 b CROSS JOIN ft1 c CROSS JOIN ft1 d;
>  ERROR:  canceling statement due to statement timeout
> +WARNING:  could not send cancel request: cancel request timed out
>  COMMIT;

I only promised to make it less flaky, not zero flaky ;-).
But this is an interesting failure, because it's different from
the previous symptoms.  Before we were getting

>> +WARNING:  could not get result of cancel request due to timeout

but here it's failing to send the cancel request in the first place.
Grepping the buildfarm logs finds one prior occurrence of this in
the last three months, also on drongo:

 sysname | branch |      snapshot       |                        stage                        |
        l                                    

---------+--------+---------------------+-----------------------------------------------------+----------------------------------------------------------------------
 drongo  | HEAD   | 2025-02-01 13:35:14 | postgres_fdwInstallCheck-C                          | +WARNING:  could not
sendcancel request: cancel request timed out\r 
 drongo  | HEAD   | 2025-03-07 04:31:37 | postgres_fdwInstallCheck-English_United-States.1252 | +WARNING:  could not
sendcancel request: cancel request timed out\r 
(2 rows)

Looking at the code, we must have gotten past PQcancelStart, and
then we're waiting vainly for PQcancelPoll to report success.
The timeout for that is 30 seconds, so it's hard to credit that
we didn't wait long enough.  Something must have gone wrong that
we couldn't detect.

It would be interesting to know what conn->status was, so that
we'd have some clue how far the cancel request got; though I have
relatively little hope that that would actually yield a fix.
Do you think it'd be reasonable to add PQcancelStatus() to the
"timed out" message?  It looks like we don't have any infrastructure
for printing ConnStatusType, so we'd have to just print it as an
integer; but since this ought to be a can't-happen case I think
that'd be all right.

            regards, tom lane