Thread: Add non-blocking version of PQcancel

Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

12 January 2022, 15:22:18

The existing PQcancel API is using blocking IO. This makes PQcancel
impossible to use in an event loop based codebase, without blocking the
event loop until the call returns.

This patch adds a new cancellation API to libpq which is called
PQcancelConnectionStart. This API can be used to send cancellations in a
non-blocking fashion. To do this it internally uses regular PGconn
connection establishment. This has as a downside that
PQcancelConnectionStart cannot be safely called from a  signal handler.

Luckily, this should be fine for most usages of this API. Since most
code that's using an event loop handles signals in that event loop as
well (as opposed to calling functions from the signal handler directly).

There are also a few advantages of this approach:
1. No need to add and maintain a second non-blocking connection
   establishment codepath.
2. Cancel connections benefit automatically from any improvements made
   to the normal connection establishment codepath. Examples of things
   that it currently gets for free currently are TLS support and
   keepalive settings.

This patch also includes a test for this new API (and also the already
existing cancellation APIs). The test can be easily run like this:

    cd src/test/modules/libpq_pipeline
    make && ./libpq_pipeline cancel

NOTE: I have not tested this with GSS for the moment. My expectation is
that using this new API with a GSS connection will result in a
CONNECTION_BAD status when calling PQcancelStatus. The reason for this
is that GSS reads will also need to communicate back that an EOF was
found, just like I've done for TLS reads and unencrypted reads. Since in
case of a cancel connection an EOF is actually expected, and should not
be treated as an error.

Attachment

0001-Add-non-blocking-version-of-PQcancel.patch

Re: Add non-blocking version of PQcancel

From

Andres Freund

Date:

13 January 2022, 00:44:11

Hi,

On 2022-01-12 15:22:18 +0000, Jelte Fennema wrote:
> This patch also includes a test for this new API (and also the already
> existing cancellation APIs). The test can be easily run like this:
>
>     cd src/test/modules/libpq_pipeline
>     make && ./libpq_pipeline cancel

Right now tests fails to build on windows with:

[15:45:10.518] src/interfaces/libpq/libpqdll.def : fatal error LNK1121: duplicate ordinal number '189'
[c:\cirrus\libpq.vcxproj]
on fails tests on other platforms. See
https://cirrus-ci.com/build/4791821363576832


> NOTE: I have not tested this with GSS for the moment. My expectation is
> that using this new API with a GSS connection will result in a
> CONNECTION_BAD status when calling PQcancelStatus. The reason for this
> is that GSS reads will also need to communicate back that an EOF was
> found, just like I've done for TLS reads and unencrypted reads. Since in
> case of a cancel connection an EOF is actually expected, and should not
> be treated as an error.

The failures do not seem related to this.

Greetings,

Andres Freund

Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

From

Jelte Fennema

Date:

30 March 2022, 16:08:16

I attached a new version of this patch. Which does three main things:
1. Change the PQrequestCancel implementation to use the regular
connection establishement code, to support all connection options
including encryption.
2. Add PQrequestCancelStart which is a thread-safe and non-blocking
version of this new PQrequestCancel implementation.
3. Add PQconnectComplete, which completes a connection started by
PQrequestCancelStart. This is useful if you want a thread-safe, but
blocking cancel (without having a need for signal safety).

This change un-deprecates PQrequestCancel, since now there's actually an
advantage to using it over PQcancel. It also includes user facing documentation
for all these functions.

As a API design change from the previous version, PQrequestCancelStart now
returns a regular PGconn for the cancel connection.

@Tom Lane regarding this:
> Even in the non-blocking case, callers should only deal with the original PGconn.

This would by definition result in non-threadsafe code (afaict). So I refrained from doing this.
The blocking version doesn't expose a PGconn at all, but the non-blocking one now returns a new PGconn.

There's two more changes that I at least want to do before considering this patch mergable:
1. Go over all the functions that can be called with a PGconn, but should not be
called with a cancellation PGconn and error out or exit early.
2. Copy over the SockAddr from the original connection and always connect to
the same socket. I believe with the current code the cancellation could end up
at the wrong server if there are multiple hosts listed in the connection string.

And there's a third item that I would like to do as a bonus:
3. Actually use the non-blocking API for the postgres_fdw code to implement a
timeout. Which would allow this comment can be removed:
/*
* Issue cancel request. Unfortunately, there's no good way to limit the
* amount of time that we might block inside PQgetCancel().
*/

So a next version of this patch can be expected somewhere later this week.
But any feedback on the current version would be appreciated. Because
these 3 changes won't change the overall design much.

Jelte

Attachment

Re: Add non-blocking version of PQcancel

From

Justin Pryzby

Date:

31 March 2022, 05:47:04

Note that the patch is still variously failing in cirrus.
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3511

You may already know that it's possible to trigger the cirrus ci tasks using a
github branch.  See src/tools/ci/README.

Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

01 April 2022, 16:13:07

Attached is the latest version of this patch, which I think is now in a state
in which it could be merged. The changes are:

1. Don't do host and address discovery for cancel connections. It now
reuses raddr and whichhost from the original connection. This makes
sure the cancel always goes to the right server, even when DNS records
change or another server would be chosen now in case of connnection
strings containing multiple hosts.
2. Fix the windows CI failure. This is done by both using the threadsafe code
in the the dblink cancellation code, and also by not erroring a cancellation
connection on windows in case of any errors. This last one is to work around
the issue described in this thread:
https://www.postgresql.org/message-id/flat/90b34057-4176-7bb0-0dbb-9822a5f6425b%40greiz-reinsdorf.de

I also went over most of the functions that take a PGconn, to see if they needed
extra checks to guard against being executed on cancel. So far all seemed fine,
either they should be okay to execute against a cancellation connection, or
they failed already anyway because a cancellation connection never reaches
the CONNECTION_OK state. So I didn't add any checks specifically for cancel
connections. I'll do this again next week with a fresh head, to see if I haven't
missed any cases.

I'll try to find some time early next week to implement non-blocking cancellation
usage in postgres_fdw, i.e. the bonus task I mentioned in my previous email. But
I don't think it's necessary to have that implemented before merging.

Attachment

0002-Add-non-blocking-version-of-PQcancel.patch

Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

04 April 2022, 15:21:54

Hereby what I consider the final version of this patch. I don't have any
changes planned myself (except for ones that come up during review).
Things that changed since the previous iteration:
1. postgres_fdw now uses the non-blocking cancellation API (including test).
2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Attachment

0002-Add-non-blocking-version-of-PQcancel.patch

Re: Add non-blocking version of PQcancel

From

Justin Pryzby

Date:

25 June 2022, 00:36:16

Resending with a problematic email removed from CC...

On Mon, Apr 04, 2022 at 03:21:54PM +0000, Jelte Fennema wrote:
> 2. Added some extra sleeps to the cancellation test, to remove random failures on FreeBSD.

Apparently there's still an occasional issue.
https://cirrus-ci.com/task/6613309985128448

result 232/352 (error): ERROR:  duplicate key value violates unique constraint "ppln_uniqviol_pkey"
DETAIL:  Key (id)=(116) already exists.

This shows that the issue is pretty rare:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/38/3511

-- 
Justin

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

27 June 2022, 09:29:39

(resent because it was blocked from the mailing-list due to inclusion of a blocked email address in the To line)

From: Andres Freund <andres@anarazel.de>
> On 2022-04-04 15:21:54 +0000, Jelte Fennema wrote:
> > 2. Added some extra sleeps to the cancellation test, to remove random
> > failures on FreeBSD.
>
> That's extremely extremely rarely the solution to address test reliability
> issues. It'll fail when running test under valgrind etc.
>
> Why do you need sleeps / can you find another way to make the test reliable?

The problem they are solving is racy behaviour between sending the query
and sending the cancellation. If the cancellation is handled before the query
is started, then the query doesn't get cancelled. To solve this problem I used
the sleeps to wait a bit before sending the cancelation request.

When I wrote this, I couldn't think of a better way to do it then with sleeps.
But I didn't like it either (and I still don't). These emails made me start to think
again, about other ways of solving the problem. I think I've found another
solution (see attached patch). The way I solve it now is by using another
connection to check the state of the first one.

Jelte

Thanks for all the feedback. I attached a new patch that I think
addresses all of it. Below some additional info.

> On the whole it feels like a mistake to have two separate kinds of
> PGconn with fundamentally different behaviors and yet no distinction
> in the API.  I think I'd recommend having a separate struct type
> (which might internally contain little more than a pointer to a
> cloned PGconn), and provide only a limited set of operations on it.

In my first version of this patch, this is exactly what I did. But then
I got this feedback from Jacob, so I changed it to reusing PGconn:

> >  /* issue a cancel request */
> >  extern int PQcancel(PGcancel *cancel, char *errbuf, int errbufsize);
> > +extern PGcancelConn * PQcancelConnectStart(PGconn *conn);
> > +extern PGcancelConn * PQcancelConnect(PGconn *conn);
> > +extern PostgresPollingStatusType PQcancelConnectPoll(PGcancelConn * cancelConn);
> > +extern ConnStatusType PQcancelStatus(const PGcancelConn * cancelConn);
> > +extern int PQcancelSocket(const PGcancelConn * cancelConn);
> > +extern char *PQcancelErrorMessage(const PGcancelConn * cancelConn);
> > +extern void PQcancelFinish(PGcancelConn * cancelConn);
>
> That's a lot of new entry points, most of which don't do anything
> except call their twin after a pointer cast. How painful would it be to
> just use the existing APIs as-is, and error out when calling
> unsupported functions if conn->cancelRequest is true?

I changed it back to use PGcancelConn as per your suggestion and I
agree that the API got better because of it.

> +       CONNECTION_CANCEL_FINISHED,
>        /* Non-blocking mode only below here */
>
> is an absolute non-starter: it breaks ABI for every libpq client,
> even ones that aren't using this facility. 

I removed this now. The main reason was so it was clear that no
queries could be sent over the connection, like is normally the case
when CONNECTION_OK happens. I don't think this is as useful anymore
now that this patch has a dedicated PGcancelStatus function.
NOTE: The CONNECTION_STARTING ConnStatusType is still necessary.
But to keep ABI compatibility I moved it to the end of the enum.

> Alternatively, we could teach libpq_pipeline
> to do getenv("PG_TEST_TIMEOUT_DEFAULT") with a fallback to 180,
> but that feels like it might be overly familiar with the innards
> of Utils.pm.

I went with this approach, because this environment variable was
already used in 2 other places than Utils.pm:
- contrib/test_decoding/sql/twophase.sql
- src/test/isolation/isolationtester.c

So, one more place seemed quite harmless.

P.S. I noticed a logical conflict between this patch and my libpq load
balancing patch. Because this patch depends on the connhost array
is constructed the exact same on the second invocation of connectOptions2.
But the libpq loadbalancing patch breaks this assumption. I'm making
a mental (and public) note that whichever of these patches gets merged last
should address this issue.

Attachment

0001-Add-non-blocking-version-of-PQcancel.patch

Re: Add non-blocking version of PQcancel

From

Jacob Champion

Date:

04 November 2022, 15:58:34

On 10/5/22 06:23, Jelte Fennema wrote:
> In my first version of this patch, this is exactly what I did. But then
> I got this feedback from Jacob, so I changed it to reusing PGconn:
> 
>>  [snip]
> 
> I changed it back to use PGcancelConn as per your suggestion and I 
> agree that the API got better because of it.

Sorry for the whiplash!

Is the latest attachment the correct version? I don't see any difference
between the latest 0001 and the previous version's 0002 -- it has no
references to PG_TEST_TIMEOUT_DEFAULT, PGcancelConn, etc.

Thanks,
--Jacob

Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

15 November 2022, 11:38:00

Ugh, it indeed seems like I somehow messed up sending the new patch.
Here's the correct one.

Attachment

0001-Add-non-blocking-version-of-PQcancel.patch

Re: Add non-blocking version of PQcancel

From

Daniel Gustafsson

Date:

29 November 2022, 19:17:47

> On 15 Nov 2022, at 12:38, Jelte Fennema <Jelte.Fennema@microsoft.com> wrote:

> Here's the correct one.<0001-Add-non-blocking-version-of-PQcancel.patch>

This version of the patch no longer applies, a rebased version is needed.

--
Daniel Gustafsson        https://vmware.com/

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

30 November 2022, 09:20:42

> This version of the patch no longer applies, a rebased version is needed.

Attached is a patch that applies cleanly again and is also changed
to use the recently introduced libpq_append_conn_error.

I also attached a patch that runs pgindent after the introduction of
libpq_append_conn_error. I noticed that this hadn't happened when
trying to run pgindent on my own changes.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

19 January 2023, 11:10:01

Is there anything that is currently blocking this patch? I'd quite
like it to get into PG16.

Especially since I ran into another use case that I would want to use
this patch for recently: Adding an async cancel function to Python
it's psycopg3 library. This library exposes both a Connection class
and an AsyncConnection class (using python its asyncio feature). But
one downside of the AsyncConnection type is that it doesn't have a
cancel method.

I ran into this while changing the PgBouncer tests to use python. And
the cancellation tests were the only tests that required me to use a
ThreadPoolExecutor instead of simply being able to use async-await
style programming:
https://github.com/pgbouncer/pgbouncer/blob/master/test/test_cancel.py#LL9C17-L9C17

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

26 January 2023, 16:42:37

After discussing this patch privately with Andres I created a new
version of this patch.
The main changes are:
1. Build on top of a refactor to addrinfo handling I had done for
another patch of mine (libpq load balancing). This allows creation of
a fake addrinfo list, which made it possible to remove lots of special
cases for cancel requests from PQconnectPoll
2. Move -2 return value of pqReadData to a separate commit.
3. Move usage of new cancel APIs to a separate commit.
4. Move most of the logic that's specific to cancel requests to cancel
related functions, e.g. PQcancelPoll does more than simply forwarding
to PQconnectPoll now.
5. Copy over the connhost data from the original connection, instead
of assuming that it will be rebuilt identically in the cancel
connection. The main reason for this is that when/if the loadbalancing
patch gets merged, then it won't necessarily be rebuilt identically
anymore.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

27 January 2023, 11:50:27

Another small update. Mostly some trivial cleanup in the comments/docs/code. But
also change patch 0005 to call PQcancelFinish in more error cases.

On Wed, 1 Mar 2023 at 20:09, Greg S <stark.cfm@gmail.com> wrote:
>
> On Tue, 28 Feb 2023 at 15:59, Gregory Stark <stark@postgresql.org> wrote:
> >
> > This looks like it needs a rebase.

done

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

"Gregory Stark (as CFM)"

Date:

01 March 2023, 19:50:45

On Wed, 1 Mar 2023 at 14:48, Jelte Fennema <postgres@jeltef.nl> wrote:

> > > This looks like it needs a rebase.
>
> done

Great. Please update the CF entry to Needs Review or Ready for
Committer as appropriate :)

-- 
Gregory Stark
As Commitfest Manager

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

01 March 2023, 20:00:49

On Wed, 1 Mar 2023 at 20:51, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
> Great. Please update the CF entry to Needs Review or Ready for
> Committer as appropriate :)

I realised I rebased a slightly outdated version of my branch (thanks
to git its --force-with-lease flag). Attached is the newest version
rebased (only patch 0004 changed slightly).

And I updated the CF entry to Ready for Committer now.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

06 March 2023, 15:42:13

Updated wording in the docs slightly.

On Wed, 1 Mar 2023 at 21:00, Jelte Fennema <postgres@jeltef.nl> wrote:
>
> On Wed, 1 Mar 2023 at 20:51, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
> > Great. Please update the CF entry to Needs Review or Ready for
> > Committer as appropriate :)
>
> I realised I rebased a slightly outdated version of my branch (thanks
> to git its --force-with-lease flag). Attached is the newest version
> rebased (only patch 0004 changed slightly).
>
> And I updated the CF entry to Ready for Committer now.

The rebase was indeed trivial (git handled everything automatically),
because my first patch was doing a superset of the changes that were
committed in b6dfee28f. Attached are the new patches.

On Tue, 14 Mar 2023 at 19:04, Greg Stark <stark@mit.edu> wrote:
>
> On Tue, 14 Mar 2023 at 13:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > "Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes:
> > > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and
> > > fe-connect.c. Every hunk is failing which perhaps means the code
> > > you're patching has been moved or refactored?
> >
> > The cfbot is giving up after
> > v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails,
> > but that's been superseded (at least in part) by b6dfee28f.
>
> Ah, same with Jelte Fennema's patch for load balancing in libpq.
>
> --
> greg

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

22 March 2023, 12:32:30

Rebased after conflicts with bfc9497ece01c7c45437bc36387cb1ebe346f4d2

Also included the fix for feedback from Daniel on patch 2, which he
had shared in the load balancing thread.

On Wed, 15 Mar 2023 at 09:49, Jelte Fennema <postgres@jeltef.nl> wrote:
>
> The rebase was indeed trivial (git handled everything automatically),
> because my first patch was doing a superset of the changes that were
> committed in b6dfee28f. Attached are the new patches.
>
> On Tue, 14 Mar 2023 at 19:04, Greg Stark <stark@mit.edu> wrote:
> >
> > On Tue, 14 Mar 2023 at 13:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > "Gregory Stark (as CFM)" <stark.cfm@gmail.com> writes:
> > > > It looks like this needs a big rebase in fea-uth.c fe-auth-scram.c and
> > > > fe-connect.c. Every hunk is failing which perhaps means the code
> > > > you're patching has been moved or refactored?
> > >
> > > The cfbot is giving up after
> > > v14-0001-libpq-Run-pgindent-after-a9e9a9f32b3.patch fails,
> > > but that's been superseded (at least in part) by b6dfee28f.
> >
> > Ah, same with Jelte Fennema's patch for load balancing in libpq.
> >
> > --
> > greg

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Denis Laxalde

Date:

28 March 2023, 14:53:18

Hi Jelte,

I had a look into your patchset (v16), did a quick review and played a
bit with the feature.

Patch 2 is missing the documentation about PQcancelSocket() and contains
a few typos; please find attached a (fixup) patch to correct these.

--- a/src/interfaces/libpq/libpq-fe.h
+++ b/src/interfaces/libpq/libpq-fe.h
@@ -321,16 +328,28 @@ extern PostgresPollingStatusType PQresetPoll(PGconn *conn);
 /* Synchronous (blocking) */
 extern void PQreset(PGconn *conn);

+/* issue a cancel request */
+extern PGcancelConn * PQcancelSend(PGconn *conn);
[...]

Maybe I'm missing something, but this function above seems a bit
strange. Namely, I wonder why it returns a PGcancelConn and what's the
point of requiring the user to call PQcancelStatus() to see if something
got wrong. Maybe it could be defined as:

  int PQcancelSend(PGcancelConn *cancelConn);

where the return value would be status? And the user would only need to
call PQcancelErrorMessage() in case of error. This would leave only one
single way to create a PGcancelConn value (i.e. PQcancelConn()), which
seems less confusing to me.

Jelte Fennema wrote:
> Especially since I ran into another use case that I would want to use
> this patch for recently: Adding an async cancel function to Python
> it's psycopg3 library. This library exposes both a Connection class
> and an AsyncConnection class (using python its asyncio feature). But
> one downside of the AsyncConnection type is that it doesn't have a
> cancel method.

As part of my testing, I've implemented non-blocking cancellation in
Psycopg, based on v16 on this patchset. Overall this worked fine and
seems useful; if you want to try it:

  https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel

(The only thing I found slightly inconvenient is the need to convey the
connection encoding (from PGconn) when handling error message from the
PGcancelConn.)

Cheers,
Denis

Attachment

0001-fixup-Add-non-blocking-version-of-PQcancel.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

28 March 2023, 15:54:06

On Tue, 28 Mar 2023 at 16:54, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> I had a look into your patchset (v16), did a quick review and played a
> bit with the feature.
>
> Patch 2 is missing the documentation about PQcancelSocket() and contains
> a few typos; please find attached a (fixup) patch to correct these.

Thanks applied that patch and attached a new patchset

> Namely, I wonder why it returns a PGcancelConn and what's the
> point of requiring the user to call PQcancelStatus() to see if something
> got wrong. Maybe it could be defined as:
>
>   int PQcancelSend(PGcancelConn *cancelConn);
>
> where the return value would be status? And the user would only need to
> call PQcancelErrorMessage() in case of error. This would leave only one
> single way to create a PGcancelConn value (i.e. PQcancelConn()), which
> seems less confusing to me.

To clarify what you mean, the API would then be like this:
PGcancelConn cancelConn = PQcancelConn(conn);
if (PQcancelSend(cancelConn) == CONNECTION_BAD) {
   printf("ERROR %s\n", PQcancelErrorMessage(cancelConn))
   exit(1)
}

Instead of:
PGcancelConn cancelConn = PQcancelSend(conn);
if (PQcancelStatus(cancelConn) == CONNECTION_BAD) {
   printf("ERROR %s\n", PQcancelErrorMessage(cancelConn))
   exit(1)
}

Those are so similar, that I have no preference either way. If more
people prefer one over the other I'm happy to change it, but for now
I'll keep it as is.

> As part of my testing, I've implemented non-blocking cancellation in
> Psycopg, based on v16 on this patchset. Overall this worked fine and
> seems useful; if you want to try it:
>
>   https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel

That's great to hear! I'll try to take a closer look at that change tomorrow.

> (The only thing I found slightly inconvenient is the need to convey the
> connection encoding (from PGconn) when handling error message from the
> PGcancelConn.)

Could you expand a bit more on this? And if you have any idea on how
to improve the API with regards to this?

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Denis Laxalde

Date:

29 March 2023, 08:43:14

Jelte Fennema a écrit :
> > Namely, I wonder why it returns a PGcancelConn and what's the
> > point of requiring the user to call PQcancelStatus() to see if something
> > got wrong. Maybe it could be defined as:
> >
> >   int PQcancelSend(PGcancelConn *cancelConn);
> >
> > where the return value would be status? And the user would only need to
> > call PQcancelErrorMessage() in case of error. This would leave only one
> > single way to create a PGcancelConn value (i.e. PQcancelConn()), which
> > seems less confusing to me.
> 
> To clarify what you mean, the API would then be like this:
> PGcancelConn cancelConn = PQcancelConn(conn);
> if (PQcancelSend(cancelConn) == CONNECTION_BAD) {
>    printf("ERROR %s\n", PQcancelErrorMessage(cancelConn))
>    exit(1)
> }

I'm not sure it's worth returning the connection status, maybe just an
int value (the return value of connectDBComplete() for instance).

More importantly, not having PQcancelSend() creating the PGcancelConn
makes reuse of that value, passing through PQcancelReset(), more
intuitive. E.g., in the tests:

diff --git a/src/test/modules/libpq_pipeline/libpq_pipeline.c b/src/test/modules/libpq_pipeline/libpq_pipeline.c
index 6764ab513b..91363451af 100644
--- a/src/test/modules/libpq_pipeline/libpq_pipeline.c
+++ b/src/test/modules/libpq_pipeline/libpq_pipeline.c
@@ -217,17 +217,18 @@ test_cancel(PGconn *conn, const char *conninfo)
         pg_fatal("failed to run PQrequestCancel: %s", PQerrorMessage(conn));
     confirm_query_cancelled(conn);
 
+    cancelConn = PQcancelConn(conn);
+
     /* test PQcancelSend */
     send_cancellable_query(conn, monitorConn);
-    cancelConn = PQcancelSend(conn);
-    if (PQcancelStatus(cancelConn) == CONNECTION_BAD)
+    if (PQcancelSend(cancelConn) == CONNECTION_BAD)
         pg_fatal("failed to run PQcancelSend: %s", PQcancelErrorMessage(cancelConn));
     confirm_query_cancelled(conn);
-    PQcancelFinish(cancelConn);
+
+    PQcancelReset(cancelConn);
 
     /* test PQcancelConn and then polling with PQcancelPoll */
     send_cancellable_query(conn, monitorConn);
-    cancelConn = PQcancelConn(conn);
     if (PQcancelStatus(cancelConn) == CONNECTION_BAD)
         pg_fatal("bad cancel connection: %s", PQcancelErrorMessage(cancelConn));
     while (true)

Otherwise, it's not clear if the PGcancelConn created by PQcancelSend()
should be reused or not. But maybe that's a matter of documentation?


> > As part of my testing, I've implemented non-blocking cancellation in
> > Psycopg, based on v16 on this patchset. Overall this worked fine and
> > seems useful; if you want to try it:
> >
> >   https://github.com/dlax/psycopg3/tree/pg16/non-blocking-pqcancel
> 
> That's great to hear! I'll try to take a closer look at that change tomorrow.

See also https://github.com/psycopg/psycopg/issues/534 if you want to
discuss about this.

> > (The only thing I found slightly inconvenient is the need to convey the
> > connection encoding (from PGconn) when handling error message from the
> > PGcancelConn.)
> 
> Could you expand a bit more on this? And if you have any idea on how
> to improve the API with regards to this?

The thing is that we need the connection encoding (client_encoding) when
eventually forwarding the result of PQcancelErrorMessage(), decoded, to
the user. More specifically, it seems to me that we'd the encoding of
the *cancel connection*, but since PQparameterStatus() cannot be used
with a PGcancelConn, I use that of the PGconn. Roughly, in Python:

    encoding = conn.parameter_status(b"client_encoding")
    # i.e, in C: char *encoding PQparameterStatus(conn, "client_encoding");
    cancel_conn = conn.cancel_conn()
    # i.e., in C: PGcancelConn *cancelConn = PQcancelConn(conn);
    # [... then work with with cancel_conn ...]
    if cancel_conn.status == ConnStatus.BAD:
        raise OperationalError(cancel_conn.error_message().decode(encoding))

This feels a bit non-atomic to me; isn't there a risk that
client_encoding be changed between PQparameterStatus(conn) and
PQcancelConn(conn) calls?

So maybe PQcancelParameterStatus(PGcancelConn *cancelConn, char *name)
is needed?

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

29 March 2023, 15:58:51

On Wed, 29 Mar 2023 at 10:43, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> More importantly, not having PQcancelSend() creating the PGcancelConn
> makes reuse of that value, passing through PQcancelReset(), more
> intuitive. E.g., in the tests:

You convinced me. Attached is an updated patch where PQcancelSend
takes the PGcancelConn and returns 1 or 0.

> The thing is that we need the connection encoding (client_encoding) when
> eventually forwarding the result of PQcancelErrorMessage(), decoded, to
> the user.

Cancel connections don't have an encoding specified. They never
receive an error from the server. All errors come from the machine
that libpq is on. So I think you're making the decoding more
complicated than it needs to be.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Denis Laxalde

Date:

30 March 2023, 08:07:28

Jelte Fennema a écrit :
> On Wed, 29 Mar 2023 at 10:43, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> > More importantly, not having PQcancelSend() creating the PGcancelConn
> > makes reuse of that value, passing through PQcancelReset(), more
> > intuitive. E.g., in the tests:
> 
> You convinced me. Attached is an updated patch where PQcancelSend
> takes the PGcancelConn and returns 1 or 0.

Patch 5 is missing respective changes; please find attached a fixup
patch for these.

Attachment

0001-fixup-Start-using-new-libpq-cancel-APIs.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

30 March 2023, 10:17:21

On Thu, 30 Mar 2023 at 10:07, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
> Patch 5 is missing respective changes; please find attached a fixup
> patch for these.

Thanks, attached are newly rebased patches that include this change. I
also cast the result of PQcancelSend to to void in the one case where
it's ignored on purpose. Note that the patchset shrunk by one, since
the original patch 0002 has been committed now.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Denis Laxalde

Date:

07 April 2023, 08:01:01

The patch set does not apply any more.

I tried to rebase locally; even leaving out 1 ("libpq: Run pgindent 
after a9e9a9f32b3"), patch 4 ("Start using new libpq cancel APIs") is 
harder to resolve following 983ec23007b (I suppose).

Appart from that, the implementation in v19 sounds good to me, and seems 
worthwhile. FWIW, as said before, I also implemented it in Psycopg in a 
sort of an end-to-end validation.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

21 April 2023, 08:20:35

Okay, I rebased again. Indeed 983ec23007b gave the most problems.

On Fri, 7 Apr 2023 at 10:02, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
>
> The patch set does not apply any more.
>
> I tried to rebase locally; even leaving out 1 ("libpq: Run pgindent
> after a9e9a9f32b3"), patch 4 ("Start using new libpq cancel APIs") is
> harder to resolve following 983ec23007b (I suppose).
>
> Appart from that, the implementation in v19 sounds good to me, and seems
> worthwhile. FWIW, as said before, I also implemented it in Psycopg in a
> sort of an end-to-end validation.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema

Date:

19 June 2023, 10:52:48

I noticed that cfbot was unable to run tests due to some rebase
conflict. It seems the pgindent changes from patch 1 have now been
made.
So adding the rebased patches without patch 1 now to unblock cfbot.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

20 December 2023, 13:46:52

On Thu, 14 Dec 2023 at 13:57, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I changed all the places that were not adhering to those spellings.

It seems I forgot a /g on my sed command to do this so it turned out I
missed one that caused the test to fail to compile... Attached is a
fixed version.

I also updated the patchset to use the EOF detection provided by
0a5c46a7a488f2f4260a90843bb9de6c584c7f4e instead of introducing a new
way of EOF detection using a -2 return value.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

vignesh C

Date:

26 January 2024, 01:59:07

On Wed, 20 Dec 2023 at 19:17, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Thu, 14 Dec 2023 at 13:57, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I changed all the places that were not adhering to those spellings.
>
> It seems I forgot a /g on my sed command to do this so it turned out I
> missed one that caused the test to fail to compile... Attached is a
> fixed version.
>
> I also updated the patchset to use the EOF detection provided by
> 0a5c46a7a488f2f4260a90843bb9de6c584c7f4e instead of introducing a new
> way of EOF detection using a -2 return value.

CFBot shows that the patch does not apply anymore as in [1]:
patching file doc/src/sgml/libpq.sgml
...
patching file src/interfaces/libpq/exports.txt
Hunk #1 FAILED at 191.
1 out of 1 hunk FAILED -- saving rejects to file
src/interfaces/libpq/exports.txt.rej
patching file src/interfaces/libpq/fe-connect.c

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log

Regards,
Vignesh

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

26 January 2024, 10:44:24

On Fri, 26 Jan 2024 at 02:59, vignesh C <vignesh21@gmail.com> wrote:
> Please post an updated version for the same.

Done.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

26 January 2024, 12:11:02

Pushed 0001.

I wonder, would it make sense to put all these new functions in a
separate file fe-cancel.c?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"        (Andrew Morton)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

26 January 2024, 16:52:45

On Fri, 26 Jan 2024 at 13:11, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I wonder, would it make sense to put all these new functions in a
> separate file fe-cancel.c?

Okay I tried doing that. I think the end result is indeed quite nice,
having all the cancellation related functions together in a file. But
it did require making a bunch of static functions in fe-connect
extern, and adding them to libpq-int.h. On one hand that seems fine to
me, on the other maybe that indicates that this cancellation logic
makes sense to be in the same file as the other connection functions
(in a sense, connecting is all that a cancel request does).

On Sun, 28 Jan 2024 at 04:15, vignesh C <vignesh21@gmail.com> wrote:
> CFBot shows that the patch has few compilation errors as in [1]:
> [17:07:07.621] /usr/bin/ld:
> ../../../src/fe_utils/libpgfeutils.a(cancel.o): in function
> `handle_sigint':
> [17:07:07.621] cancel.c:(.text+0x50): undefined reference to `PQcancel'

I forgot to update ./configure based builds with the new file, only
meson was working. Also it seems I trimmed the header list fe-cancel.c
a bit too much for OSX, so I added unistd.h back.

Both of those are fixed now.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

28 January 2024, 12:39:42

On Sun, 28 Jan 2024 at 10:51, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Both of those are fixed now.

Okay, there turned out to also be an issue on Windows with
setKeepalivesWin32 not being available in fe-cancel.c. That's fixed
now too (as well as some minor formatting issues).

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

29 January 2024, 11:44:44

On 2024-Jan-28, Jelte Fennema-Nio wrote:

> On Sun, 28 Jan 2024 at 10:51, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > Both of those are fixed now.
> 
> Okay, there turned out to also be an issue on Windows with
> setKeepalivesWin32 not being available in fe-cancel.c. That's fixed
> now too (as well as some minor formatting issues).

Thanks!  I committed 0001 now.  I also renamed the new
pq_parse_int_param to pqParseIntParam, for consistency with other
routines there.  Please rebase the other patches.

Thanks,

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

29 January 2024, 12:28:22

On Mon, 29 Jan 2024 at 12:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thanks!  I committed 0001 now.  I also renamed the new
> pq_parse_int_param to pqParseIntParam, for consistency with other
> routines there.  Please rebase the other patches.

Awesome! Rebased, and renamed pq_release_conn_hosts to
pqReleaseConnHosts for the same consistency reasons.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

02 February 2024, 12:19:38

On 2024-Jan-29, Jelte Fennema-Nio wrote:

> On Mon, 29 Jan 2024 at 12:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Thanks!  I committed 0001 now.  I also renamed the new
> > pq_parse_int_param to pqParseIntParam, for consistency with other
> > routines there.  Please rebase the other patches.
> 
> Awesome! Rebased, and renamed pq_release_conn_hosts to
> pqReleaseConnHosts for the same consistency reasons.

Thank you, looks good.

I propose the following minor/trivial fixes over your initial 3 patches.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

02 February 2024, 14:03:39

On Fri, 2 Feb 2024 at 13:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thank you, looks good.
>
> I propose the following minor/trivial fixes over your initial 3 patches.

All of those seem good like fixes. Attached is an updated patchset
where they are all applied. As well as adding a missing word ("been")
in a comment that I noticed while reading your fixes.

On Sun, 4 Feb 2024 at 16:39, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Maybe this is okay?  I'll have a look at the whole final situation more
> carefully later; or if somebody else wants to share an opinion, please
> do so.

Attached is a new version of the final patches, with much improved
docs (imho) and the new function names: PQcancelStart and
PQcancelBlocking.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

14 February 2024, 17:41:37

On 2024-Feb-14, Jelte Fennema-Nio wrote:

> Attached is a new version of the final patches, with much improved
> docs (imho) and the new function names: PQcancelStart and
> PQcancelBlocking.

Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

14 February 2024, 18:22:06

On Wed, 14 Feb 2024 at 18:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hmm, I think the changes to libpq_pipeline in 0005 should be in 0004.

Yeah, you're correct. Fixed that now.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Denis Laxalde

Date:

06 March 2024, 14:03:20

In patch 0004, I noticed a couple of typos in the documentation; please 
find attached a fixup patch correcting these.

Still in the documentation, same patch, the last paragraph documenting 
PQcancelPoll() ends as:

+       indicate the current stage of the connection procedure and might 
be useful
+       to provide feedback to the user for example. These statuses are:
+      </para>

while not actually listing the "statuses". Should we list them? Adjust 
the wording? Or refer to PQconnectPoll() documentation (since the 
paragraph is copied from there it seems)?


Otherwise, the feature still works fine as far as I can tell.

Attachment

0001-fixup-libpq-Add-encrypted-and-non-blocking-versions-.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

06 March 2024, 18:09:35

On Wed, 6 Mar 2024 at 15:03, Denis Laxalde <denis.laxalde@dalibo.com> wrote:
>
> In patch 0004, I noticed a couple of typos in the documentation; please
> find attached a fixup patch correcting these.

Thanks, applied.

> while not actually listing the "statuses". Should we list them?

I listed the relevant statuses over now and updated the PQcancelStatus
docs to look more like the PQstatus one. I didn't list any statuses
that a cancel connection could never have (but a normal connection
can).

While going over the list of statuses possible for a cancel connection
I realized that the docs for PQconnectStart were not listing all
relevant statuses, so I fixed that in patch 0001.

Attached is a new patchset with various changes. I created a dedicated
0002 patch to add tests for the already existing cancellation
functions, because that seemed useful for another thread where changes
to the cancellation protocol are being proposed[1].

[1]: https://www.postgresql.org/message-id/flat/508d0505-8b7a-4864-a681-e7e5edfe32aa%40iki.fi

On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Docs: one bogus "that that".

This was already fixed by my previous doc changes in v32, I guess that
email got crossed with this one

> Did we consider having PQcancelConn() instead be called
> PQcancelCreate()?

Done

> "Asynchronously cancel a query on the given connection. This requires
> polling the returned PGcancelConn to actually complete the cancellation
> of the query." but this is no longer a good description of what this
> function does.

Fixed

>  Anyway, maybe there are reasons for this; but in any case we
> should set ->cancelRequest in all cases, not only after the first tests
> for errors.

Done

> I think the extra PGconn inside pg_cancel_conn is useless; it would be
> simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the
> indirection through the extra struct.  You're actually dereferencing the
> object in two ways in the new code, both by casting the outer object
> straight to PGconn (taking advantage that the struct member is first in
> the struct), and by using PGcancelConn->conn.  This seems pointless.  I
> mean, if we're going to cast to "PGconn *" in some places anyway, then
> we may as well access all members directly.  Perhaps, if you want, you
> could add asserts that ->cancelRequest is set true in all the
> fe-cancel.c functions.  Anyway, we'd still have compiler support to tell
> you that you're passing the wrong struct to the function.  (I didn't
> actually try to change the code this way, so I might be wrong.)

Turns out you were wrong about the compiler support to tell us we're
passing the wrong struct: When both the PGconn and PGcancelConn
typedefs refer to the same struct, the compiler allows passing PGconn
to PGcancelConn functions and vice versa without complaining. This
seems enough reason for me to keep indirection through the extra
struct.

So instead of adding the proposed typed this typedef I chose to add a
comment to pg_cancel_conn explaining its purpose, as well as not
casting PGcancelConn to PGconn but always accessing the conn field for
consistency.

> We could move the definition of struct pg_cancel to fe-cancel.c.  Nobody
> outside that needs to know that definition anyway.

Done in 0003

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

12 March 2024, 09:19:27

Here's a last one for the cfbot.

I have a question about this one

int
PQcancelStart(PGcancelConn *cancelConn)
{
    [...]
    if (cancelConn->conn.status != CONNECTION_ALLOCATED)
    {
        libpq_append_conn_error(&cancelConn->conn,
                                "cancel request is already being sent on this connection");
        cancelConn->conn.status = CONNECTION_BAD;
        return 0;
    }


If we do this and we see conn.status is not ALLOCATED, meaning a cancel
is already ongoing, shouldn't we leave conn.status alone instead of
changing to CONNECTION_BAD?  I mean, we shouldn't be juggling the elbow
of whoever's doing that, should we?  Maybe just add the error message
and return 0?

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)

On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Here's a last one for the cfbot.

Thanks for committing the first 3 patches btw. Attached a tiny change
to 0001, which adds "(backing struct for PGcancelConn)" to the comment
on pg_cancel_conn.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

12 March 2024, 15:45:41

On Tue, 12 Mar 2024 at 15:04, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Attached a tiny change to 0001

One more tiny comment change, stating that pg_cancel is used by the
deprecated PQcancel function.

On Wed, 13 Mar 2024 at 04:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suspect it's basically just a
> timing dependency.  Have you thought about the fact that a cancel
> request is a no-op if it arrives after the query's done?

I agree it's probably a timing issue. The cancel being received after
the query is done seems very unlikely, since the query takes 180
seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these
animals). I think it's more likely that the cancel request arrives too
early, and thus being ignored because no query is running yet. The
test already had logic to wait until the query backend was in the
"active" state, before sending a cancel to solve that issue. But my
guess is that that somehow isn't enough.

Sadly I'm having a hard time reliably reproducing this race condition
locally. So it's hard to be sure what is happening here. Attached is a
patch with a wild guess as to what the issue might be (i.e. seeing an
outdated "active" state and thus passing the check even though the
query is not running yet)

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

13 March 2024, 19:00:52

On 2024-Mar-13, Jelte Fennema-Nio wrote:

> I agree it's probably a timing issue. The cancel being received after
> the query is done seems very unlikely, since the query takes 180
> seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these
> animals). I think it's more likely that the cancel request arrives too
> early, and thus being ignored because no query is running yet. The
> test already had logic to wait until the query backend was in the
> "active" state, before sending a cancel to solve that issue. But my
> guess is that that somehow isn't enough.
> 
> Sadly I'm having a hard time reliably reproducing this race condition
> locally. So it's hard to be sure what is happening here. Attached is a
> patch with a wild guess as to what the issue might be (i.e. seeing an
> outdated "active" state and thus passing the check even though the
> query is not running yet)

I tried leaving the original running in my laptop to see if I could
reproduce it, but got no hits ... and we didn't get any other failures
apart from the three ones already reported ... so it's not terribly high
probability.  Anyway I pushed your patch now since the theory seems
plausible; let's see if we still get the issue to reproduce.  If it
does, we could make the script more verbose to hunt for further clues.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jacob Champion

Date:

13 March 2024, 19:08:30

On Wed, Mar 13, 2024 at 12:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2024-Mar-13, Jelte Fennema-Nio wrote:
> > Sadly I'm having a hard time reliably reproducing this race condition
> > locally. So it's hard to be sure what is happening here. Attached is a
> > patch with a wild guess as to what the issue might be (i.e. seeing an
> > outdated "active" state and thus passing the check even though the
> > query is not running yet)
>
> I tried leaving the original running in my laptop to see if I could
> reproduce it, but got no hits ... and we didn't get any other failures
> apart from the three ones already reported ... so it's not terribly high
> probability.  Anyway I pushed your patch now since the theory seems
> plausible; let's see if we still get the issue to reproduce.  If it
> does, we could make the script more verbose to hunt for further clues.

I hit this on my machine. With the attached diff I can reproduce
constantly (including with the most recent test patch); I think the
cancel must be arriving between the bind/execute steps?

Thanks,
--Jacob

Attachment

repro.diff.txt

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

14 March 2024, 09:51:13

On Wed, 13 Mar 2024 at 20:08, Jacob Champion
<jacob.champion@enterprisedb.com> wrote:
> I hit this on my machine. With the attached diff I can reproduce
> constantly (including with the most recent test patch); I think the
> cancel must be arriving between the bind/execute steps?

Nice find! Your explanation makes total sense. Attached a patchset
that fixes/works around this issue by using the simple query protocol
in the cancel test.

I enabled the test again and also pushed the changes to dblink,
isolationtester and fe_utils (which AFAICS is used by pg_dump,
pg_amcheck, reindexdb and vacuumdb).  I chickened out of committing the
postgres_fdw changes though, so here they are again.  Not sure I'll find
courage to get these done by tomorrow, or whether I should just let them
for Fujita-san or Noah, who have been the last committers to touch this.


-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."

Attachment

v39-0001-postgres_fdw-Start-using-new-libpq-cancel-APIs.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Noah Misch

Date:

21 March 2024, 02:54:38

On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote:
> I enabled the test again and also pushed the changes to dblink,
> isolationtester and fe_utils (which AFAICS is used by pg_dump,

I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to
use from dblink and postgres_fdw.  pgxn modules calling PQcancel() from the
backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt
the new way.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

22 March 2024, 08:54:29

On Thu, 21 Mar 2024 at 03:54, Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote:
> > I enabled the test again and also pushed the changes to dblink,
> > isolationtester and fe_utils (which AFAICS is used by pg_dump,
>
> I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to
> use from dblink and postgres_fdw.  pgxn modules calling PQcancel() from the
> backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt
> the new way.

Done

Attachment

v37-0001-postgres_fdw-Start-using-new-libpq-cancel-APIs.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

27 March 2024, 15:34:57

On 2024-Mar-22, Jelte Fennema-Nio wrote:

> On Thu, 21 Mar 2024 at 03:54, Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 07:40:10PM +0100, Alvaro Herrera wrote:
> > > I enabled the test again and also pushed the changes to dblink,
> > > isolationtester and fe_utils (which AFAICS is used by pg_dump,
> >
> > I recommend adding a libpqsrv_cancel() function to libpq-be-fe-helpers.h, to
> > use from dblink and postgres_fdw.  pgxn modules calling PQcancel() from the
> > backend (citus pg_bulkload plproxy pmpp) then have a better chance to adopt
> > the new way.
> 
> Done

Nice, thanks.  I played with it a bit, mostly trying to figure out if
the chosen API is usable.  I toyed with making it return boolean success
and the error message as an output argument, because I was nervous about
what'd happen in OOM.  But since this is backend environment, what
actually happens is that we elog(ERROR) anyway, so we never return a
NULL error message.  So after the detour I think Jelte's API is okay.

I changed it so that the error messages are returned as translated
phrases, and was bothered by the fact that if errors happen repeatedly,
the memory for them might be leaked.  Maybe this is fine depending on
the caller's memory context, but since it's only at most one string each
time, it's quite easy to just keep track of it so that we can release it
on the next.

I ended up reducing the two PG_TRY blocks to a single one.  I see no
reason to split them up, and this way it looks more legible.

What do you think?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)

Attachment

libpqsrv_cancel.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

27 March 2024, 18:46:19

On 2024-Mar-27, Alvaro Herrera wrote:

> I changed it so that the error messages are returned as translated
> phrases, and was bothered by the fact that if errors happen repeatedly,
> the memory for them might be leaked.  Maybe this is fine depending on
> the caller's memory context, but since it's only at most one string each
> time, it's quite easy to just keep track of it so that we can release it
> on the next.

(Actually this sounds clever but fails pretty obviously if the caller
does free the string, such as in a memory context reset.  So I guess we
have to just accept the potential leakage.)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

28 March 2024, 09:31:30

On Wed, 27 Mar 2024 at 19:46, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Mar-27, Alvaro Herrera wrote:
>
> > I changed it so that the error messages are returned as translated
> > phrases, and was bothered by the fact that if errors happen repeatedly,
> > the memory for them might be leaked.  Maybe this is fine depending on
> > the caller's memory context, but since it's only at most one string each
> > time, it's quite easy to just keep track of it so that we can release it
> > on the next.
>
> (Actually this sounds clever but fails pretty obviously if the caller
> does free the string, such as in a memory context reset.  So I guess we
> have to just accept the potential leakage.)

Your changes look good, apart from the prverror stuff indeed. If you
remove the prverror stuff again I think this is ready to commit.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

28 March 2024, 09:33:13

On Wed, 27 Mar 2024 at 19:27, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I ended up reducing the two PG_TRY blocks to a single one.  I see no
> reason to split them up, and this way it looks more legible.

I definitely agree this looks better. Not sure why I hadn't done that,
maybe it wasn't possible in one of the earlier iterations of the API.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 10:33:00

On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Your changes look good, apart from the prverror stuff indeed. If you
> remove the prverror stuff again I think this is ready to commit.

Great, thanks for looking.  Pushed now, I'll be closing the commitfest
entry shortly.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."                                (Julien PUYDT)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 11:15:08

Hm, indri failed:

ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new-Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing-fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 -fno-common
-Werror -fvisibility=hidden -bundle -o dblink.dylib  dblink.o -L../../src/port -L../../src/common
-L../../src/interfaces/libpq-lpq -isysroot
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk
-L/opt/local/libexec/llvm-15/lib-L/opt/local/lib -L/opt/local/lib -L/opt/local/lib  -L/opt/local/lib
-Wl,-dead_strip_dylibs -Werror  -fvisibility=hidden -bundle_loader ../../src/backend/postgres
 

Undefined symbols for architecture arm64:
  "_libintl_gettext", referenced from:
      _libpqsrv_cancel in dblink.o
      _libpqsrv_cancel in dblink.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[1]: *** [dblink.dylib] Error 1
make: *** [all-dblink-recurse] Error 2

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 12:19:14

On 2024-Mar-28, Alvaro Herrera wrote:

> Undefined symbols for architecture arm64:
>   "_libintl_gettext", referenced from:
>       _libpqsrv_cancel in dblink.o
>       _libpqsrv_cancel in dblink.o
> ld: symbol(s) not found for architecture arm64
> clang: error: linker command failed with exit code 1 (use -v to see invocation)
> make[1]: *** [dblink.dylib] Error 1
> make: *** [all-dblink-recurse] Error 2

I just removed the _() from the new function.  There's not much point in
wasting more time on this, given that contrib doesn't have translation
support anyway, and we're not using this in libpqwalreceiver.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Crear es tan difícil como ser libre" (Elsa Triolet)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 16:34:08

Eh, kestrel has also failed[1], apparently every query after the large
JOIN that this commit added as test fails with a statement timeout error.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

28 March 2024, 16:37:50

On Thu, 28 Mar 2024 at 17:34, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Eh, kestrel has also failed[1], apparently every query after the large
> JOIN that this commit added as test fails with a statement timeout error.
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2024-03-28%2016%3A01%3A14

Ugh that's annoying, the RESET is timing out too I guess. That can
hopefully be easily fixed by changing the new test to:

BEGIN;
SET LOCAL statement_timeout = '10ms';
select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5;
-- this takes very long
ROLLBACK;

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 16:43:29

On 2024-Mar-28, Jelte Fennema-Nio wrote:

> Ugh that's annoying, the RESET is timing out too I guess.

Hah, you're right, I can reproduce with a smaller timeout, and using SET
LOCAL works as a fix.  If we're doing that, why not reduce the timeout
to 1ms?  We don't need to wait extra 9ms ...

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
“Cuando no hay humildad las personas se degradan” (A. Christie)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

28 March 2024, 17:13:56

On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hah, you're right, I can reproduce with a smaller timeout, and using SET
> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
> to 1ms?  We don't need to wait extra 9ms ...

I think we don't really want to make the timeout too short. Otherwise
the query might get cancelled before we push any query down to the
FDW. I guess that means that for some slow machines even 10ms is not
enough to make the test do the intended purpose. I'd keep it at 10ms,
which seems long enough for normal systems, while still being pretty
short.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

28 March 2024, 17:35:46

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Thu, 28 Mar 2024 at 17:43, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Hah, you're right, I can reproduce with a smaller timeout, and using SET
>> LOCAL works as a fix.  If we're doing that, why not reduce the timeout
>> to 1ms?  We don't need to wait extra 9ms ...

> I think we don't really want to make the timeout too short. Otherwise
> the query might get cancelled before we push any query down to the
> FDW. I guess that means that for some slow machines even 10ms is not
> enough to make the test do the intended purpose. I'd keep it at 10ms,
> which seems long enough for normal systems, while still being pretty
> short.

If the test fails both when the machine is too slow and when it's
too fast, then there's zero hope of making it stable and we should
just remove it.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

28 March 2024, 17:53:30

On 2024-Mar-28, Tom Lane wrote:

> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> 
> > I think we don't really want to make the timeout too short. Otherwise
> > the query might get cancelled before we push any query down to the
> > FDW. I guess that means that for some slow machines even 10ms is not
> > enough to make the test do the intended purpose. I'd keep it at 10ms,
> > which seems long enough for normal systems, while still being pretty
> > short.
> 
> If the test fails both when the machine is too slow and when it's
> too fast, then there's zero hope of making it stable and we should
> just remove it.

It doesn't fail when it's too fast -- it's just that it doesn't cover
the case we want to cover.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

28 March 2024, 18:02:59

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2024-Mar-28, Tom Lane wrote:
>> If the test fails both when the machine is too slow and when it's
>> too fast, then there's zero hope of making it stable and we should
>> just remove it.

> It doesn't fail when it's too fast -- it's just that it doesn't cover
> the case we want to cover.

That's hardly better, because then you think you have test
coverage but maybe you don't.

Could we make this test bulletproof by using an injection point?
If not, I remain of the opinion that we're better off without it.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

29 March 2024, 08:17:55

On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > It doesn't fail when it's too fast -- it's just that it doesn't cover
> > the case we want to cover.
>
> That's hardly better, because then you think you have test
> coverage but maybe you don't.

Honestly, that seems quite a lot better. Instead of having randomly
failing builds, you have a test that creates coverage 80+% of the
time. And that also seems a lot better than having no coverage at all
(which is what we had for the last 7 years since introduction of
cancellations to postgres_fdw). It would be good to expand the comment
in the test though saying that the test might not always cover the
intended code path, due to timing problems.

> Could we make this test bulletproof by using an injection point?
> If not, I remain of the opinion that we're better off without it.

Possibly, and if so, I agree that would be better than the currently
added test. But I honestly don't feel like spending the time on
creating such a test. And given 7 years have passed without someone
adding any test for this codepath at all, I don't expect anyone else
will either.

If you both feel we're better off without the test, feel free to
remove it. This was just some small missing test coverage that I
noticed while working on this patch, that I thought I'd quickly
address. I don't particularly care a lot about the specific test.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Noah Misch

Date:

29 March 2024, 22:17:24

On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote:
> On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Could we make this test bulletproof by using an injection point?
> > If not, I remain of the opinion that we're better off without it.
> 
> Possibly, and if so, I agree that would be better than the currently
> added test. But I honestly don't feel like spending the time on
> creating such a test.

The SQL test is more representative of real applications, and it's way simpler
to understand.  In general, I prefer 6-line SQL tests that catch a problem 10%
of the time over injection point tests that catch it 100% of the time.  For
low detection rate to be exciting, it needs to be low enough to have a serious
chance of all buildfarm members reporting green for the bad commit.  With ~115
buildfarm members running in the last day, 0.1% detection rate would have been
low enough to bother improving, but 4% would be high enough to call it good.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

03 April 2024, 21:29:44

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Great, thanks for looking.  Pushed now, I'll be closing the commitfest
> entry shortly.

On my machine, headerscheck does not like this:

$ src/tools/pginclude/headerscheck --cplusplus
In file included from /tmp/headerscheck.4gTaW5/test.cpp:3:
./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* libpqsrv_cancel(PGconn*, TimestampTz)':
./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids converting a string constant to 'char*'
[-Wwrite-strings]
   return "out of memory";
          ^~~~~~~~~~~~~~~
./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids converting a string constant to 'char*'
[-Wwrite-strings]
     error = "cancel request timed out";
             ^~~~~~~~~~~~~~~~~~~~~~~~~~

The second part of that could easily be fixed by declaring "error" as
"const char *".  As for the first part, can we redefine the whole
function as returning "const char *"?  (If not, this coding is very
questionable anyway.)

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

04 April 2024, 08:45:33

On 2024-Apr-03, Tom Lane wrote:

> On my machine, headerscheck does not like this:
> 
> $ src/tools/pginclude/headerscheck --cplusplus
> In file included from /tmp/headerscheck.4gTaW5/test.cpp:3:
> ./src/include/libpq/libpq-be-fe-helpers.h: In function 'char* libpqsrv_cancel(PGconn*, TimestampTz)':
> ./src/include/libpq/libpq-be-fe-helpers.h:393:10: warning: ISO C++ forbids converting a string constant to 'char*'
[-Wwrite-strings]
>    return "out of memory";
>           ^~~~~~~~~~~~~~~
> ./src/include/libpq/libpq-be-fe-helpers.h:421:13: warning: ISO C++ forbids converting a string constant to 'char*'
[-Wwrite-strings]
>      error = "cancel request timed out";
>              ^~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> The second part of that could easily be fixed by declaring "error" as
> "const char *".  As for the first part, can we redefine the whole
> function as returning "const char *"?  (If not, this coding is very
> questionable anyway.)

Yeah, this seems to work and I no longer get that complaint from
headerscheck.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Attachment

0001-Make-libpqsrv_cancel-s-return-type-const-char.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

04 April 2024, 08:47:03

On Thu, 4 Apr 2024 at 10:45, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Yeah, this seems to work and I no longer get that complaint from
> headerscheck.

patch LGTM

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

04 April 2024, 18:06:50

[ from a week ago ]

Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Hm, indri failed:
> ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla
-Werror=unguarded-availability-new-Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security
-fno-strict-aliasing-fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O2 -fno-common
-Werror -fvisibility=hidden -bundle -o dblink.dylib  dblink.o -L../../src/port -L../../src/common
-L../../src/interfaces/libpq-lpq -isysroot
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk
-L/opt/local/libexec/llvm-15/lib-L/opt/local/lib -L/opt/local/lib -L/opt/local/lib  -L/opt/local/lib
-Wl,-dead_strip_dylibs -Werror  -fvisibility=hidden -bundle_loader ../../src/backend/postgres 

> Undefined symbols for architecture arm64:
>   "_libintl_gettext", referenced from:
>       _libpqsrv_cancel in dblink.o
>       _libpqsrv_cancel in dblink.o
> ld: symbol(s) not found for architecture arm64
> clang: error: linker command failed with exit code 1 (use -v to see invocation)
> make[1]: *** [dblink.dylib] Error 1
> make: *** [all-dblink-recurse] Error 2

Having just fixed the same issue for test_json_parser, I now realize
what's going on there: dblink's link command doesn't actually mention
any of the external libraries that we might need, such as libintl.
You can get away with that on some platforms, but not macOS.
It would probably be possible to fix that if anyone cared to.
I'm not sufficiently excited about it to do so right now --- as
you say, we don't support translation in contrib anyway.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alexander Lakhin

Date:

22 June 2024, 15:00:01

Hello hackers,

30.03.2024 01:17, Noah Misch wrote:
> On Fri, Mar 29, 2024 at 09:17:55AM +0100, Jelte Fennema-Nio wrote:
>> On Thu, 28 Mar 2024 at 19:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Could we make this test bulletproof by using an injection point?
>>> If not, I remain of the opinion that we're better off without it.
>> Possibly, and if so, I agree that would be better than the currently
>> added test. But I honestly don't feel like spending the time on
>> creating such a test.
> The SQL test is more representative of real applications, and it's way simpler
> to understand.  In general, I prefer 6-line SQL tests that catch a problem 10%
> of the time over injection point tests that catch it 100% of the time.  For
> low detection rate to be exciting, it needs to be low enough to have a serious
> chance of all buildfarm members reporting green for the bad commit.  With ~115
> buildfarm members running in the last day, 0.1% detection rate would have been
> low enough to bother improving, but 4% would be high enough to call it good.

As a recent buildfarm failure on orlingo (which tests asan-enabled builds)
[1] shows, that test can still fail:
70/70 postgresql:postgres_fdw-running / postgres_fdw-running/regress               ERROR           278.67s exit status
1

@@ -2775,6 +2775,7 @@
  SET LOCAL statement_timeout = '10ms';
  select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
  ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
  COMMIT;

(from the next run we can see normal duration:
"postgres_fdw-running/regress               OK                6.30s ")

I reproduced the failure with an asan-enabled build on a slowed-down VM
and as far as I can see, it's caused by the following condition in
ProcessInterrupts():
                 /*
                  * If we are reading a command from the client, just ignore the cancel
                  * request --- sending an extra error message won't accomplish
                  * anything.  Otherwise, go ahead and throw the error.
                  */
                 if (!DoingCommandRead)
                 {
                         LockErrorCleanup();
                         ereport(ERROR,
(errcode(ERRCODE_QUERY_CANCELED),
                                          errmsg("canceling statement due to user request")));
                 }

I think this failure can be reproduced easily (without asan/slowing down)
with this modification:
@@ -4630,6 +4630,7 @@ PostgresMain(const char *dbname, const char *username)
              idle_session_timeout_enabled = false;
          }

+if (rand() % 10 == 0) pg_usleep(10000);
          /*
           * (5) disable async signal conditions again.
           *

Running this test in a loop (for ((i=1;i<=100;i++)); do \
echo "iteration $i"; make -s check -C contrib/postgres_fdw/ || break; \
done), I get:
...
iteration 56
# +++ regress check in contrib/postgres_fdw +++
# initializing database system by copying initdb template
# using temp instance on port 55312 with PID 991332
ok 1         - postgres_fdw                            20093 ms
1..1
# All 1 tests passed.
iteration 57
# +++ regress check in contrib/postgres_fdw +++
# initializing database system by copying initdb template
# using temp instance on port 55312 with PID 992152
not ok 1     - postgres_fdw                            62064 ms
1..1
...
--- .../contrib/postgres_fdw/expected/postgres_fdw.out 2024-06-22 02:52:42.991574907 +0000
+++ .../contrib/postgres_fdw/results/postgres_fdw.out 2024-06-22 14:43:43.949552927 +0000
@@ -2775,6 +2775,7 @@
  SET LOCAL statement_timeout = '10ms';
  select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
  ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
  COMMIT;

I also came across another failure of the test:
@@ -2774,7 +2774,7 @@
  BEGIN;
  SET LOCAL statement_timeout = '10ms';
  select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request
  COMMIT;

which is reproduced with a sleep added here:
@@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
       */
      parsetree_list = pg_parse_query(query_string);
+pg_usleep(11000);

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-06-20%2009%3A52%3A04

Best regards,
Alexander

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

23 June 2024, 22:59:48

On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> @@ -2775,6 +2775,7 @@
>   SET LOCAL statement_timeout = '10ms';
>   select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
>   ERROR:  canceling statement due to statement timeout
> +WARNING:  could not get result of cancel request due to timeout
>   COMMIT;

As you describe it, this problem occurs when the cancel request is
processed by the foreign server, before the query is actually
received. And postgres then (rightly) ignores the cancel request. I'm
not sure if the existing test is easily changeable to fix this. The
only thing that I can imagine works in practice is increasing the
statement_timeout, e.g. to 100ms.

> I also came across another failure of the test:
> @@ -2774,7 +2774,7 @@
>   BEGIN;
>   SET LOCAL statement_timeout = '10ms';
>   select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
> -ERROR:  canceling statement due to statement timeout
> +ERROR:  canceling statement due to user request
>   COMMIT;
>
> which is reproduced with a sleep added here:
> @@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
>        */
>       parsetree_list = pg_parse_query(query_string);
> +pg_usleep(11000);

After investigating, I realized this actually exposes a bug in our
statement timeout logic. It has nothing to do with posgres_fdw and
reproduces with any regular postgres query too. Attached is a patch
that fixes this issue. This one should probably be backported.

Attachment

v1-0001-Do-not-reset-statement_timeout-indicator-outside-.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alexander Lakhin

Date:

25 June 2024, 05:00:00

24.06.2024 01:59, Jelte Fennema-Nio wrote:
> On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> @@ -2775,6 +2775,7 @@
>>    SET LOCAL statement_timeout = '10ms';
>>    select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
>>    ERROR:  canceling statement due to statement timeout
>> +WARNING:  could not get result of cancel request due to timeout
>>    COMMIT;
> As you describe it, this problem occurs when the cancel request is
> processed by the foreign server, before the query is actually
> received. And postgres then (rightly) ignores the cancel request. I'm
> not sure if the existing test is easily changeable to fix this. The
> only thing that I can imagine works in practice is increasing the
> statement_timeout, e.g. to 100ms.

I'd just like to add that that one original query assumes several "remote"
queries (see the attached excerpt from postmaster.log with verbose logging
enabled).

Best regards,
Alexander

Attachment

postmaster.log.txt

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

25 June 2024, 08:24:37

On Tue, 25 Jun 2024 at 07:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> I'd just like to add that that one original query assumes several "remote"
> queries (see the attached excerpt from postmaster.log with verbose logging
> enabled).

Nice catch! All those EXPLAIN queries are definitely not intentional,
and likely to greatly increase the likelihood of this flakiness.

Attached is a patch that fixes that by moving the test before enabling
use_remote_estimate on any of the foreign tables, as well as
increasing the statement_timeout to 100ms.

My expectation is that that should remove all failure cases. If it
doesn't, I think our best bet is removing the test again.

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Noah Misch

Date:

30 June 2024, 19:00:40

On Tue, Mar 12, 2024 at 05:50:48PM +0100, Alvaro Herrera wrote:
> On 2024-Mar-12, Jelte Fennema-Nio wrote:
> > On Tue, 12 Mar 2024 at 10:19, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > > Here's a last one for the cfbot.
> > 
> > Thanks for committing the first 3 patches btw.
> 
> Thanks, I included it.

PGcancelConn *
PQcancelCreate(PGconn *conn)
{
...
oom_error:
    conn->status = CONNECTION_BAD;
    libpq_append_conn_error(cancelConn, "out of memory");
    return (PGcancelConn *) cancelConn;
}

Shouldn't that be s/conn->status/cancelConn->status/?

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

30 June 2024, 22:38:46

On Sun, 30 Jun 2024 at 21:00, Noah Misch <noah@leadboat.com> wrote:
> Shouldn't that be s/conn->status/cancelConn->status/?

Ugh yes, I think this was a copy paste error. See attached patch 0003
to fix this (rest of the patches are untouched from previous
revision).

Attachment

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

10 July 2024, 12:10:55

On Mon, 1 Jul 2024 at 00:38, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Ugh yes, I think this was a copy paste error. See attached patch 0003
> to fix this (rest of the patches are untouched from previous
> revision).

Alvaro committed 0003, which caused cfbot to think a rebase is
necessary. Attached should solve that.

On 2024-Jul-16, Alvaro Herrera wrote:

> Maybe we can disable this test specifically on Cygwin.  We could do that
> by creating a postgres_fdw_cancel.sql file, with the current output for
> all platforms, and a "SELECT version() ~ 'cygwin' AS skip_test" query,
> as we do for encoding tests and such.

Something like this.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Attachment

0001-Split-out-the-query_cancel-test-in-postgres_fdw.patch

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Thomas Munro

Date:

17 July 2024, 00:05:23

On Wed, Jul 17, 2024 at 3:08 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Ugh.  I tried to follow what's going on in that cygwin code, but I gave
> up pretty quickly.  It depends on a mutex, but I didn't see the mutex
> being defined or initialized anywhere.

https://github.com/cygwin/cygwin/blob/cygwin-3.5.3/winsup/cygwin/fhandler/socket_inet.cc#L217C1-L217C77

Not obvious how it'd be deadlocking (?), though...  it's hard to see
how anything between LOCK_EVENTS and UNLOCK_EVENTS could escape/return
early.  (Something weird going on with signal handlers?  I can't
imagine where one would call poll() though).

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alexander Lakhin

Date:

17 July 2024, 19:00:00

Hello Thomas,

17.07.2024 03:05, Thomas Munro wrote:
> On Wed, Jul 17, 2024 at 3:08 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Ugh.  I tried to follow what's going on in that cygwin code, but I gave
>> up pretty quickly.  It depends on a mutex, but I didn't see the mutex
>> being defined or initialized anywhere.
> https://github.com/cygwin/cygwin/blob/cygwin-3.5.3/winsup/cygwin/fhandler/socket_inet.cc#L217C1-L217C77
>
> Not obvious how it'd be deadlocking (?), though...  it's hard to see
> how anything between LOCK_EVENTS and UNLOCK_EVENTS could escape/return
> early.  (Something weird going on with signal handlers?  I can't
> imagine where one would call poll() though).

I've simplified the repro to the following:
echo "
-- setup foreign server "loopback" --

CREATE TABLE t1(i int);
CREATE FOREIGN TABLE ft1 (i int) SERVER loopback OPTIONS (table_name 't1');
CREATE FOREIGN TABLE ft2 (i int) SERVER loopback OPTIONS (table_name 't1');

INSERT INTO t1 SELECT i FROM generate_series(1, 100000) g(i);
" | psql

cat << 'EOF' | psql
Select pg_sleep(10);
SET statement_timeout = '10ms';
SELECT 'SELECT count(*) FROM ft1 CROSS JOIN ft2;' FROM generate_series(1, 100)
\gexec
EOF

I've attached strace (with --mask=0x251, per [1]) to the query-cancelling
backend and got strace.log (see in attachment), while observing:
ERROR:  canceling statement due to statement timeout
...
ERROR:  canceling statement due to statement timeout
-- total 14 lines, then the process hanged --
-- I interrupted it several seconds later --

As far as I can see (having analyzed a number of runs), the hanging occurs
when some itimer-related activity happens before "peek_socket" in this
event sequence:
[main] postgres {pid} select_stuff::wait: res after verify 0
[main] postgres {pid} select_stuff::wait: returning 0
[main] postgres {pid} select: sel.wait returns 0
[main] postgres {pid} peek_socket: read_ready: 0, write_ready: 1, except_ready: 0

(See the last occurrence of the sequence in the log.)

[1] https://cygwin.com/cygwin-ug-net/strace.html

Best regards,
Alexander

Attachment

strace.log

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Thomas Munro

Date:

18 July 2024, 01:06:48

On Thu, Jul 18, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> As far as I can see (having analyzed a number of runs), the hanging occurs
> when some itimer-related activity happens before "peek_socket" in this
> event sequence:
> [main] postgres {pid} select_stuff::wait: res after verify 0
> [main] postgres {pid} select_stuff::wait: returning 0
> [main] postgres {pid} select: sel.wait returns 0
> [main] postgres {pid} peek_socket: read_ready: 0, write_ready: 1, except_ready: 0
>
> (See the last occurrence of the sequence in the log.)

Yeah, right, there's a lot going on between those two lines from the
[main] thread.  There are messages from helper threads [itimer], [sig]
and [socksel].  At a guess, [socksel] might be doing extra secret
communication over our socket in order to exchange SO_PEERCRED
information, huh, is that always there?  Seems worth filing a bug
report.

For the record, I know of one other occasional test failure on Cygwin:
it randomly panics in SnapBuildSerialize().  While I don't expect
there to be any users of PostgreSQL on Cygwin (it was unusably broken
before we refactored the postmaster in v16), that one is interesting
because (1) it also happen on native Windows builds, and (2) at least
one candidate fix[1] sounds like it would speed up logical replication
on all operating systems.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BJ4jSFk%3D-hdoZdcx%2Bp7ru6xuipzCZY-kiKoDc2FjsV7g%40mail.gmail.com#afb5dc4208cc0776a060145f9571dec2

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alvaro Herrera

Date:

22 July 2024, 11:26:06

On 2024-Jul-16, Alvaro Herrera wrote:

> On 2024-Jul-16, Alvaro Herrera wrote:
> 
> > Maybe we can disable this test specifically on Cygwin.  We could do that
> > by creating a postgres_fdw_cancel.sql file, with the current output for
> > all platforms, and a "SELECT version() ~ 'cygwin' AS skip_test" query,
> > as we do for encoding tests and such.
> 
> Something like this.

I have pushed this "fix", so we shouldn't see this failure anymore.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

30 August 2024, 22:21:25

Alexander Lakhin <exclusion@gmail.com> writes:
> Let me show you another related anomaly, which drongo kindly discovered
> recently: [1]. That test failed with:
>   SELECT dblink_cancel_query('dtest1');
> - dblink_cancel_query
> ----------------------
> - OK
> +   dblink_cancel_query
> +--------------------------
> + cancel request timed out
>   (1 row)

While we're piling on, has anyone noticed that *non* Windows buildfarm
animals are also failing this test pretty frequently?  The most recent
occurrence is at [1], and it looks like this:

diff -U3 /home/bf/bf-build/mylodon/HEAD/pgsql/contrib/postgres_fdw/expected/query_cancel.out
/home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/postgres_fdw/regress/results/query_cancel.out
--- /home/bf/bf-build/mylodon/HEAD/pgsql/contrib/postgres_fdw/expected/query_cancel.out    2024-07-22
11:09:50.638133878+0000 
+++ /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/postgres_fdw/regress/results/query_cancel.out    2024-08-30
06:28:01.971083945+0000 
@@ -17,4 +17,5 @@
 SET LOCAL statement_timeout = '10ms';
 select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
 ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
 COMMIT;

I trawled the buildfarm database for other occurrences of "could not
get result of cancel request" since this test went in.  I found 34
of them (see attachment), and none that weren't the timeout flavor.

Most of the failing machines are not especially slow, so even though
the hard-wired 30 second timeout that's being used here feels a little
under-engineered, I'm not sure that arranging to raise it would help.
My spidey sense feels that there's some actual bug here, but it's hard
to say where.  mylodon's postmaster log confirms that the 30 seconds
did elapse, and that there wasn't anything much else going on:

2024-08-30 06:27:31.926 UTC client backend[3668381] pg_regress/query_cancel ERROR:  canceling statement due to
statementtimeout 
2024-08-30 06:27:31.926 UTC client backend[3668381] pg_regress/query_cancel STATEMENT:  select count(*) from ft1 CROSS
JOINft2 CROSS JOIN ft4 CROSS JOIN ft5; 
2024-08-30 06:28:01.946 UTC client backend[3668381] pg_regress/query_cancel WARNING:  could not get result of cancel
requestdue to timeout 

Any thoughts?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-08-30%2006%3A25%3A46

    sysname    |    branch     |      snapshot       |           stage            |
                                l                                                                       

---------------+---------------+---------------------+----------------------------+---------------------------------------------------------------------------------------------------------------------------------------------
 flaviventris  | HEAD          | 2024-04-02 23:58:01 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 calliphoridae | HEAD          | 2024-04-11 01:54:17 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 kestrel       | REL_17_STABLE | 2024-07-13 04:15:25 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 sarus         | REL_17_STABLE | 2024-07-20 15:02:23 | ContribCheck-C             | 2024-07-20 15:10:15.547 UTC
[613791:141]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 sarus         | REL_17_STABLE | 2024-07-20 15:02:23 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-07-26 13:15:09 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 adder         | HEAD          | 2024-07-02 00:24:10 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 adder         | HEAD          | 2024-07-02 00:24:10 | postgres_fdwCheck          | 2024-07-02 00:33:45.770 UTC client
backend[1036068]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 calliphoridae | REL_17_STABLE | 2024-07-16 22:45:08 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-08-09 05:25:24 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-08-09 05:25:24 | postgres_fdwCheck          | 2024-08-09 05:28:21.990 UTC client
backend[1775253]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
 mylodon       | HEAD          | 2024-04-24 07:12:22 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 skink         | HEAD          | 2024-07-20 20:57:01 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 skink         | HEAD          | 2024-07-20 20:57:01 | postgres_fdwCheck          | 2024-07-20 21:06:58.166 UTC client
backend[2223425]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 grassquit     | HEAD          | 2024-08-20 19:29:20 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 grassquit     | HEAD          | 2024-08-20 19:29:20 | postgres_fdwCheck          | 2024-08-20 19:36:26.162 UTC client
backend[2332294]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
 adder         | REL_17_STABLE | 2024-08-29 12:52:00 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 adder         | REL_17_STABLE | 2024-08-29 12:52:00 | postgres_fdwCheck          | 2024-08-29 12:58:57.965 UTC client
backend[2022100]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
 mylodon       | HEAD          | 2024-08-30 06:25:46 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-08-30 06:25:46 | postgres_fdwCheck          | 2024-08-30 06:28:01.946 UTC client
backend[3668381]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
 mylodon       | HEAD          | 2024-04-16 18:31:28 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-04-16 18:31:28 | postgres_fdwCheck          | 2024-04-16 18:37:25.510 UTC client
backend[1915743]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 francolin     | HEAD          | 2024-07-09 00:17:16 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 francolin     | HEAD          | 2024-07-09 00:17:16 | postgres_fdwCheck          | 2024-07-09 00:22:05.554 UTC client
backend[3576551]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 mylodon       | HEAD          | 2024-07-09 03:46:50 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon       | HEAD          | 2024-07-09 03:46:50 | postgres_fdwCheck          | 2024-07-09 03:50:26.354 UTC client
backend[3424317]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 adder         | HEAD          | 2024-08-29 10:42:09 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 olingo        | HEAD          | 2024-06-20 09:52:04 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 francolin     | HEAD          | 2024-07-09 19:02:27 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 francolin     | HEAD          | 2024-07-09 19:02:27 | postgres_fdwCheck          | 2024-07-09 19:05:09.198 UTC client
backend[377098]pg_regress/postgres_fdw WARNING:  could not get result of cancel request due to timeout 
 skink         | HEAD          | 2024-08-10 19:52:56 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 skink         | HEAD          | 2024-08-10 19:52:56 | postgres_fdwCheck          | 2024-08-10 19:56:41.222 UTC client
backend[1138302]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
 snakefly      | REL_17_STABLE | 2024-08-19 11:30:04 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
 snakefly      | REL_17_STABLE | 2024-08-19 11:30:04 | ContribCheck-C             | 2024-08-19 12:21:06.879 UTC
[1312:11]pg_regress/query_cancel WARNING:  could not get result of cancel request due to timeout 
(34 rows)

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

30 August 2024, 22:49:44

On Fri, Aug 30, 2024, 21:21 Tom Lane <tgl@sss.pgh.pa.us> wrote:

While we're piling on, has anyone noticed that *non* Windows buildfarm
animals are also failing this test pretty frequently?

<snip>

Any thoughts?

Yes. Fixes are here (see the ~10 emails above in the thread for details): https://www.postgresql.org/message-id/CAGECzQQO8Cn2Rw45xUYmvzXeSSsst7-bcruuzUfMbGQc3ueSdw@mail.gmail.com

They don't apply anymore after the change to move this test to a dedicated file. It shouldn't be too hard to update those patches though. I'll try to do that in a few weeks when I'm back behind my computer. But feel free to commit something earlier.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

30 August 2024, 23:11:57

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Fri, Aug 30, 2024, 21:21 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While we're piling on, has anyone noticed that *non* Windows buildfarm
>> animals are also failing this test pretty frequently?

> Yes. Fixes are here (see the ~10 emails above in the thread for details):
> https://www.postgresql.org/message-id/CAGECzQQO8Cn2Rw45xUYmvzXeSSsst7-bcruuzUfMbGQc3ueSdw@mail.gmail.com

Hmm.  I'm not convinced that 0001 is an actual *fix*, but it should
at least reduce the frequency of occurrence a lot, which'd help.

I don't want to move the test case to where you propose, because
that's basically not sensible.  But can't we avoid remote estimates
by just cross-joining ft1 to itself, and not using the tables for
which remote estimate is enabled?

I think 0002 is probably outright wrong, or at least the change to
disable_statement_timeout is.  Once we get to that, we don't want
to throw a timeout error any more, even if an interrupt was received
just before it.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

30 August 2024, 23:55:09

I wrote:
> Hmm.  I'm not convinced that 0001 is an actual *fix*, but it should
> at least reduce the frequency of occurrence a lot, which'd help.

After enabling log_statement = all to verify what commands are being
sent to the remote, I realized that there's a third thing this patch
can do to stabilize matters: issue a regular remote query inside the
test transaction, before we enable the timeout.  This will ensure
that we've dealt with configure_remote_session() and started a
remote transaction, so that there aren't extra round trips happening
for that while the clock is running.

Pushed with that addition and some comment-tweaking.  We'll see
whether that actually makes things more stable, but I don't think
it could make it worse.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

31 August 2024, 00:24:58

On Fri, 30 Aug 2024 at 22:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> > On Fri, Aug 30, 2024, 21:21 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> While we're piling on, has anyone noticed that *non* Windows buildfarm
> >> animals are also failing this test pretty frequently?
>
> > Yes. Fixes are here (see the ~10 emails above in the thread for details):
> > https://www.postgresql.org/message-id/CAGECzQQO8Cn2Rw45xUYmvzXeSSsst7-bcruuzUfMbGQc3ueSdw@mail.gmail.com
>
> Hmm.  I'm not convinced that 0001 is an actual *fix*, but it should
> at least reduce the frequency of occurrence a lot, which'd help.

I also don't think it's an actual fix, but I couldn't think of a way
to fix this. And since this only happens if you cancel right at the
start of a postgres_fdw query, I don't think it's worth investing too
much time on a fix.

> I don't want to move the test case to where you propose, because
> that's basically not sensible.  But can't we avoid remote estimates
> by just cross-joining ft1 to itself, and not using the tables for
> which remote estimate is enabled?

Yeah that should work too (I just saw your next email, where you said
it's committed like this).

> I think 0002 is probably outright wrong, or at least the change to
> disable_statement_timeout is.  Once we get to that, we don't want
> to throw a timeout error any more, even if an interrupt was received
> just before it.

The disable_statement_timeout change was not the part of that patch
that was necessary for stable output, only the change in the first
branch of enable_statement_timeout was necessary. The reason being
that enable_statement_timeout is called multiple times for a query,
because start_xact_command is called multiple times in
exec_simple_query. The change to disable_statement_timeout just seemed
like the logical extension of that change, especially since there was
basically a verbatim copy of disable_statement_timeout in the second
branch of enable_statement_timeout.

To make sure I understand your suggestion correctly: Are you saying
you would want to completely remove the outstanding interrupt if it
was caused by de statement_timout when disable_statement_timeout is
called? Because I agree that would probably make sense, but that
sounds like a more impactful change. But the current behaviour seems
strictly worse than the behaviour proposed in the patch to me, because
currently the backend would still be interrupted, but the error would
indicate a reason for the interrupt that is simply incorrect i.e. it
will say it was cancelled due to a user request, which never happened.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alexander Lakhin

Date:

31 August 2024, 07:04:04

Hello Tom,

30.08.2024 23:55, Tom Lane wrote:
> Pushed with that addition and some comment-tweaking.  We'll see
> whether that actually makes things more stable, but I don't think
> it could make it worse.

Thank you for fixing that issue!

I've tested your fix with the modification I proposed upthread:
              idle_session_timeout_enabled = false;
          }
+if (rand() % 10 == 0) pg_usleep(10000);
          /*
           * (5) disable async signal conditions again.

and can confirm that the issue is gone. On 8749d850f~1, the test failed
on iterations 3, 3. 12 for me, but on current REL_17_STABLE, 100 test
iterations succeeded.

At the same time, mylodon confirmed my other finding at [1] and failed [2] with:
-ERROR:  canceling statement due to statement timeout
+ERROR:  canceling statement due to user request

[1] https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-08-30%2023%3A03%3A31

Best regards,
Alexander

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

31 August 2024, 10:08:51

On Sat, 31 Aug 2024 at 06:04, Alexander Lakhin <exclusion@gmail.com> wrote:
> At the same time, mylodon confirmed my other finding at [1] and failed [2] with:
> -ERROR:  canceling statement due to statement timeout
> +ERROR:  canceling statement due to user request
>
> [1] https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-08-30%2023%3A03%3A31

Interestingly that's a different test that failed, but it looks like
it failed for the same reason that my 0002 patch fixes.

I also took a quick look at the code again, and completely removing
the outstanding interrupt seems hard to do. Because there's no way to
know if there were multiple causes for the interupt, i.e. someone
could have pressed ctrl+c as well and we wouldn't want to undo that.

So I think the solution in 0002, while debatable if strictly correct,
is the only fix that we can easily do. Also I personally believe the
behaviour resulting from 0002 is totally correct: The new behaviour
would be that if a timeout occurred, right before it was disabled or
reset, but the interrupt was not processed yet, then we process that
timeout as normal. That seems totally reasonable behaviour to me from
the perspective of an end user: You get a timeout error when the
timeout occurred before the timeout was disabled/reset.

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Andres Freund

Date:

17 November 2024, 04:37:57

Hi,

On 2024-10-01 15:00:00 +0300, Alexander Lakhin wrote:
> Hello Tom and Jelte,
> 
> 31.08.2024 07:04, Alexander Lakhin wrote:
> > I've tested your fix with the modification I proposed upthread:
> >              idle_session_timeout_enabled = false;
> >          }
> > +if (rand() % 10 == 0) pg_usleep(10000);
> >          /*
> >           * (5) disable async signal conditions again.
> > 
> > and can confirm that the issue is gone. On 8749d850f~1, the test failed
> > on iterations 3, 3. 12 for me, but on current REL_17_STABLE, 100 test
> > iterations succeeded.
> > 
> 
> One month later, treehopper has found a way to break that test: [1].
> The failure log contains:
> 2024-09-30 19:34:31.347 UTC [3201034:13] fdw_retry_check LOG: execute <unnamed>: DECLARE c2 CURSOR FOR
>     SELECT count(*) FROM ((("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON
> (TRUE)) INNER JOIN "S 1"."T 1" r4 ON (TRUE)) INNER JOIN "S 1"."T 1" r6 ON
> (TRUE))
> 2024-09-30 19:34:31.464 UTC [3201033:10] pg_regress/query_cancel ERROR:  canceling statement due to statement
timeout
> 2024-09-30 19:34:31.464 UTC [3201033:11] pg_regress/query_cancel STATEMENT: 
> SELECT count(*) FROM ft1 a CROSS JOIN ft1 b CROSS JOIN ft1 c CROSS JOIN ft1
> d;
> 2024-09-30 19:34:31.466 UTC [3201035:1] [unknown] LOG:  connection received: host=[local]
> 2024-09-30 19:34:31.474 UTC [3201034:14] fdw_retry_check LOG: statement: FETCH 100 FROM c2
> 2024-09-30 19:35:01.485 UTC [3201033:12] pg_regress/query_cancel WARNING: 
> could not get result of cancel request due to timeout

Another failure in CI, that cleared up after a retry:
https://cirrus-ci.com/task/5725647677423616
https://api.cirrus-ci.com/v1/artifact/task/5725647677423616/log/contrib/postgres_fdw/regression.diffs
https://api.cirrus-ci.com/v1/artifact/task/5725647677423616/log/contrib/postgres_fdw/log/postmaster.log

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/query_cancel.out
/tmp/cirrus-ci-build/contrib/postgres_fdw/results/query_cancel.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/query_cancel.out    2024-11-16 22:13:32.174593005 +0000
+++ /tmp/cirrus-ci-build/contrib/postgres_fdw/results/query_cancel.out    2024-11-16 22:21:20.165877954 +0000
@@ -29,4 +29,5 @@
 -- This would take very long if not canceled:
 SELECT count(*) FROM ft1 a CROSS JOIN ft1 b CROSS JOIN ft1 c CROSS JOIN ft1 d;
 ERROR:  canceling statement due to statement timeout
+WARNING:  could not get result of cancel request due to timeout
 COMMIT;


Statement logging isn't enabled for the test, so the log isn't that helpful:

2024-11-16 22:20:49.962 UTC [38643][not initialized] [[unknown]][:0] LOG:  connection received: host=[local]
2024-11-16 22:20:49.964 UTC [38643][client backend] [[unknown]][67/1:0] LOG:  connection authenticated: user="postgres"
method=trust(/tmp/cirrus-ci-build/contrib/postgres_fdw/tmp_check/data/pg_hba.conf:117)
 
2024-11-16 22:20:49.964 UTC [38643][client backend] [[unknown]][67/1:0] LOG:  connection authorized: user=postgres
database=contrib_regressionapplication_name=pg_regress/query_cancel
 
2024-11-16 22:20:50.007 UTC [38645][not initialized] [[unknown]][:0] LOG:  connection received: host=[local]
2024-11-16 22:20:50.010 UTC [38645][client backend] [[unknown]][68/1:0] LOG:  connection authenticated: user="postgres"
method=trust(/tmp/cirrus-ci-build/contrib/postgres_fdw/tmp_check/data/pg_hba.conf:117)
 
2024-11-16 22:20:50.010 UTC [38645][client backend] [[unknown]][68/1:0] LOG:  connection authorized: user=postgres
database=contrib_regressionapplication_name=fdw_retry_check
 
2024-11-16 22:20:50.148 UTC [38643][client backend] [pg_regress/query_cancel][67/4:0] ERROR:  canceling statement due
tostatement timeout
 
2024-11-16 22:20:50.148 UTC [38643][client backend] [pg_regress/query_cancel][67/4:0] STATEMENT:  SELECT count(*) FROM
ft1a CROSS JOIN ft1 b CROSS JOIN ft1 c CROSS JOIN ft1 d;
 
2024-11-16 22:20:50.159 UTC [38656][not initialized] [[unknown]][:0] LOG:  connection received: host=[local]
2024-11-16 22:21:20.167 UTC [38643][client backend] [pg_regress/query_cancel][67/0:0] WARNING:  could not get result of
cancelrequest due to timeout
 
2024-11-16 22:21:20.170 UTC [38643][client backend] [pg_regress/query_cancel][:0] LOG:  disconnection: session time:
0:00:30.211user=postgres database=contrib_regression host=[local]
 
2024-11-16 22:21:20.315 UTC [36800][postmaster] LOG:  received fast shutdown request

Greetings,

Andres Freund

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

17 November 2024, 05:33:11

Andres Freund <andres@anarazel.de> writes:
> On 2024-10-01 15:00:00 +0300, Alexander Lakhin wrote:
>> One month later, treehopper has found a way to break that test: [1].
>> The failure log contains:
>> 2024-09-30 19:35:01.485 UTC [3201033:12] pg_regress/query_cancel WARNING: 
>> could not get result of cancel request due to timeout

> Another failure in CI, that cleared up after a retry:
> +WARNING:  could not get result of cancel request due to timeout

Yeah.  This has been happening off-and-on in the buildfarm ever
since we added that test.  I'm not sure if it's just "the test
is unstable" or if it's telling us there's a problem with the
cancel logic.  Scraping the last 3 months worth of buildfarm
logs finds these instances:

  sysname   |    branch     |      snapshot       |           stage            |                                 l

------------+---------------+---------------------+----------------------------+------------------------------------------------------------------
 adder      | HEAD          | 2024-08-29 10:42:09 | postgres_fdwInstallCheck-C | +WARNING:  could not get result of
cancelrequest due to timeout 
 adder      | REL_17_STABLE | 2024-08-29 12:52:00 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 froghopper | HEAD          | 2024-10-25 08:31:55 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
 grassquit  | HEAD          | 2024-08-20 19:29:20 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 mylodon    | HEAD          | 2024-08-30 06:25:46 | postgres_fdwCheck          | +WARNING:  could not get result of
cancelrequest due to timeout 
 pipit      | HEAD          | 2024-11-13 01:12:28 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
 snakefly   | REL_17_STABLE | 2024-08-19 11:30:04 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
 treehopper | REL_17_STABLE | 2024-09-30 19:21:14 | ContribCheck-C             | +WARNING:  could not get result of
cancelrequest due to timeout 
(8 rows)

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Alexander Lakhin

Date:

17 November 2024, 06:00:01

Hello Tom and Andres,

17.11.2024 05:33, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>
>> Another failure in CI, that cleared up after a retry:
>> +WARNING:  could not get result of cancel request due to timeout
> Yeah.  This has been happening off-and-on in the buildfarm ever
> since we added that test.  I'm not sure if it's just "the test
> is unstable" or if it's telling us there's a problem with the
> cancel logic.  Scraping the last 3 months worth of buildfarm
> logs finds these instances:

Yes, I counted those bf failures at [1] too and posted my explanation
upthread [2].

[1] 

https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#posgtres_fdw.2Fquery_cancel_fails_due_to_an_unexpected_warning_on_canceling_a_statement
[2] https://www.postgresql.org/message-id/c68225b4-fce9-3425-1534-a21a815d5846@gmail.com

Best regards,
Alexander

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

21 November 2024, 04:30:54

Alexander Lakhin <exclusion@gmail.com> writes:
> 17.11.2024 05:33, Tom Lane wrote:
>> Yeah.  This has been happening off-and-on in the buildfarm ever
>> since we added that test.  I'm not sure if it's just "the test
>> is unstable" or if it's telling us there's a problem with the
>> cancel logic.  Scraping the last 3 months worth of buildfarm
>> logs finds these instances:

> Yes, I counted those bf failures at [1] too and posted my explanation
> upthread [2].

Sorry, I'd forgotten about that.  I added some more debug logging
to the modifications you made, and confirmed your theory that the
remote session is ignoring the cancel request because it receives it
while DoingCommandRead is true; which must mean that it hasn't started
the slow query yet.

This implies that the 100ms delay in query_cancel.sql is not reliably
enough for the remote to receive the command, which surprises me,
especially since the failing animals aren't particularly slow ones.
Maybe there is something else happening?  But I do reproduce the
failure after adding your delays, and the patch I'm about to propose
does fix it.

Anyway, given that info, Jelte's unapplied 0002 patch earlier in the
thread is not the answer, because this is about dropping a query
cancel not about losing a timeout interrupt.  The equivalent thing
to what he suggested would be to not clear the cancel request flag
during DoingCommandRead, instead letting it kill the next query.
But I didn't like the idea for timeouts, and I like it even less for
query cancel.  What I think we should do instead is to re-issue
the cancel request if we've waited a little and nothing came of it.
This corresponds more or less to what a human user would likely do
(or at least this human would).  The attached patch is set up to
re-cancel after 1 second, then 2 more seconds, then 4 more, etc
until we reach the 30-second "it's dead Jim" threshold.

This seems to fix the problem here.  Thoughts?

BTW, while I didn't do it in the attached, I'm tempted to greatly
reduce the 100ms delay in query_cancel.sql.  If this does make it
more robust, we shouldn't need that much time anymore.

            regards, tom lane

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 2326f391d3..7a8cac83cb 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -95,6 +95,13 @@ static uint32 pgfdw_we_get_result = 0;
  */
 #define CONNECTION_CLEANUP_TIMEOUT    30000

+/*
+ * Milliseconds to wait before issuing another cancel request.  This covers
+ * the race condition where the remote session ignored our cancel request
+ * because it arrived while idle.
+ */
+#define RE_CANCEL_TIMEOUT    1000
+
 /* Macro for constructing abort command to be sent */
 #define CONSTRUCT_ABORT_COMMAND(sql, entry, toplevel) \
     do { \
@@ -145,6 +152,7 @@ static void pgfdw_reset_xact_state(ConnCacheEntry *entry, bool toplevel);
 static bool pgfdw_cancel_query(PGconn *conn);
 static bool pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime);
 static bool pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime,
+                                   TimestampTz recanceltime,
                                    bool consume_input);
 static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
                                      bool ignore_errors);
@@ -154,6 +162,7 @@ static bool pgfdw_exec_cleanup_query_end(PGconn *conn, const char *query,
                                          bool consume_input,
                                          bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
+                                     TimestampTz recanceltime,
                                      PGresult **result, bool *timed_out);
 static void pgfdw_abort_cleanup(ConnCacheEntry *entry, bool toplevel);
 static bool pgfdw_abort_cleanup_begin(ConnCacheEntry *entry, bool toplevel,
@@ -1322,18 +1331,25 @@ pgfdw_reset_xact_state(ConnCacheEntry *entry, bool toplevel)
 static bool
 pgfdw_cancel_query(PGconn *conn)
 {
+    TimestampTz now = GetCurrentTimestamp();
     TimestampTz endtime;
+    TimestampTz recanceltime;

     /*
      * If it takes too long to cancel the query and discard the result, assume
      * the connection is dead.
      */
-    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
-                                          CONNECTION_CLEANUP_TIMEOUT);
+    endtime = TimestampTzPlusMilliseconds(now, CONNECTION_CLEANUP_TIMEOUT);
+
+    /*
+     * Also, lose patience and re-issue the cancel request after a little bit.
+     * (This serves to close some race conditions.)
+     */
+    recanceltime = TimestampTzPlusMilliseconds(now, RE_CANCEL_TIMEOUT);

     if (!pgfdw_cancel_query_begin(conn, endtime))
         return false;
-    return pgfdw_cancel_query_end(conn, endtime, false);
+    return pgfdw_cancel_query_end(conn, endtime, recanceltime, false);
 }

 /*
@@ -1359,9 +1375,10 @@ pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime)
 }

 static bool
-pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime, bool consume_input)
+pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime,
+                       TimestampTz recanceltime, bool consume_input)
 {
-    PGresult   *result = NULL;
+    PGresult   *result;
     bool        timed_out;

     /*
@@ -1380,7 +1397,8 @@ pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime, bool consume_input)
     }

     /* Get and discard the result of the query. */
-    if (pgfdw_get_cleanup_result(conn, endtime, &result, &timed_out))
+    if (pgfdw_get_cleanup_result(conn, endtime, recanceltime,
+                                 &result, &timed_out))
     {
         if (timed_out)
             ereport(WARNING,
@@ -1453,7 +1471,7 @@ pgfdw_exec_cleanup_query_end(PGconn *conn, const char *query,
                              TimestampTz endtime, bool consume_input,
                              bool ignore_errors)
 {
-    PGresult   *result = NULL;
+    PGresult   *result;
     bool        timed_out;

     Assert(query != NULL);
@@ -1471,7 +1489,7 @@ pgfdw_exec_cleanup_query_end(PGconn *conn, const char *query,
     }

     /* Get the result of the query. */
-    if (pgfdw_get_cleanup_result(conn, endtime, &result, &timed_out))
+    if (pgfdw_get_cleanup_result(conn, endtime, endtime, &result, &timed_out))
     {
         if (timed_out)
             ereport(WARNING,
@@ -1495,28 +1513,36 @@ pgfdw_exec_cleanup_query_end(PGconn *conn, const char *query,
 }

 /*
- * Get, during abort cleanup, the result of a query that is in progress.  This
- * might be a query that is being interrupted by transaction abort, or it might
- * be a query that was initiated as part of transaction abort to get the remote
- * side back to the appropriate state.
+ * Get, during abort cleanup, the result of a query that is in progress.
+ * This might be a query that is being interrupted by a cancel request or by
+ * transaction abort, or it might be a query that was initiated as part of
+ * transaction abort to get the remote side back to the appropriate state.
  *
  * endtime is the time at which we should give up and assume the remote
- * side is dead.  Returns true if the timeout expired or connection trouble
- * occurred, false otherwise.  Sets *result except in case of a timeout.
- * Sets timed_out to true only when the timeout expired.
+ * side is dead.  recanceltime is the time at which we should issue a fresh
+ * cancel request (pass the same value as endtime if this is not wanted).
+ *
+ * Returns true if the timeout expired or connection trouble occurred,
+ * false otherwise.  Sets *result except in case of a true result.
+ * Sets *timed_out to true only when the timeout expired.
  */
 static bool
-pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime, PGresult **result,
+pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
+                         TimestampTz recanceltime,
+                         PGresult **result,
                          bool *timed_out)
 {
     volatile bool failed = false;
     PGresult   *volatile last_res = NULL;

+    *result = NULL;
     *timed_out = false;

     /* In what follows, do not leak any PGresults on an error. */
     PG_TRY();
     {
+        int            canceldelta = RE_CANCEL_TIMEOUT * 2;
+
         for (;;)
         {
             PGresult   *res;
@@ -1527,8 +1553,33 @@ pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime, PGresult **result,
                 TimestampTz now = GetCurrentTimestamp();
                 long        cur_timeout;

+                /* If timeout has expired, give up. */
+                if (now >= endtime)
+                {
+                    *timed_out = true;
+                    failed = true;
+                    goto exit;
+                }
+
+                /* If we need to re-issue the cancel request, do that. */
+                if (now >= recanceltime)
+                {
+                    /* We ignore failure to issue the repeated request. */
+                    (void) libpqsrv_cancel(conn, endtime);
+
+                    /* Recompute "now" in case that took measurable time. */
+                    now = GetCurrentTimestamp();
+
+                    /* Adjust re-cancel timeout in increasing steps. */
+                    recanceltime = TimestampTzPlusMilliseconds(now,
+                                                               canceldelta);
+                    canceldelta += canceldelta;
+                }
+
                 /* If timeout has expired, give up, else get sleep time. */
-                cur_timeout = TimestampDifferenceMilliseconds(now, endtime);
+                cur_timeout = TimestampDifferenceMilliseconds(now,
+                                                              Min(endtime,
+                                                                  recanceltime));
                 if (cur_timeout <= 0)
                 {
                     *timed_out = true;
@@ -1849,7 +1900,9 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List *cancel_requested,
         foreach(lc, cancel_requested)
         {
             ConnCacheEntry *entry = (ConnCacheEntry *) lfirst(lc);
+            TimestampTz now = GetCurrentTimestamp();
             TimestampTz endtime;
+            TimestampTz recanceltime;
             char        sql[100];

             Assert(entry->changing_xact_state);
@@ -1863,10 +1916,13 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List *cancel_requested,
              * remaining entries in the list, leading to slamming that entry's
              * connection shut.
              */
-            endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+            endtime = TimestampTzPlusMilliseconds(now,
                                                   CONNECTION_CLEANUP_TIMEOUT);
+            recanceltime = TimestampTzPlusMilliseconds(now,
+                                                       RE_CANCEL_TIMEOUT);

-            if (!pgfdw_cancel_query_end(entry->conn, endtime, true))
+            if (!pgfdw_cancel_query_end(entry->conn, endtime,
+                                        recanceltime, true))
             {
                 /* Unable to cancel running query */
                 pgfdw_reset_xact_state(entry, toplevel);

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

22 November 2024, 00:33:06

On Thu, 21 Nov 2024 at 02:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyway, given that info, Jelte's unapplied 0002 patch earlier in the
> thread is not the answer, because this is about dropping a query
> cancel not about losing a timeout interrupt.

Agreed that 0002 does not fix the issue re-reported by Andres (let's
call this issue number 1). But I'm still of the opinion that 0002
fixes a real bug: i.e. a bug which causes timeouts.spec to randomly
fail[1] (let's call this issue number 2).
> This seems to fix the problem here.  Thoughts?

Overall, a good approach to fix issue number 1. I think it would be
best if this was integrated into libpqsrv_cancel instead though. That
way the dblink would benefit from it too.

nit: Maybe call it RETRY_CANCEL_TIME. The RE_ prefix wasn't instantly
obvious what it meant to me, it seemed like an abbreviation when I first saw it.

> BTW, while I didn't do it in the attached, I'm tempted to greatly
> reduce the 100ms delay in query_cancel.sql.  If this does make it
> more robust, we shouldn't need that much time anymore.

Seems sensible to me.

Finally there's a third issue[2] (let's call this issue number 3).
Alexander did some investigation into this issue too[3]. For this one
I have a hard time understanding what is going on, or at least how
this issue only seems to apply to cancel connections. From his
description of the problem and my reading of the code it seems that if
we fail to send the StartupPacket/CancelRequest due to a socket error,
we set the write_failed flag. But we don't actually check this flag
during the CONNECTION_AWAITING_RESPONSE phase of PQconnectPoll, so we
just wait until we reach a timeout because the server never sends us
anything.

[1]:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason
[2]:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#dblink.sql_.28and_postgres_fdw.sql.29_fail_on_Windows_due_to_the_cancel_packet_not_sent
[3]: https://www.postgresql.org/message-id/5ea25e4d-1ee2-b9bf-7806-119ffa658826@gmail.com

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

22 November 2024, 03:37:34

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Thu, 21 Nov 2024 at 02:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This seems to fix the problem here.  Thoughts?

> Overall, a good approach to fix issue number 1. I think it would be
> best if this was integrated into libpqsrv_cancel instead though. That
> way the dblink would benefit from it too.

How would we do that?  libpqsrv_cancel is not chartered to wait around
for the results of the cancel, and I'm not even sure that it could
know what to check for.

(I did get the impression that all this code was not very well
factored, but I'm not volunteering to rewrite it wholesale.)

> nit: Maybe call it RETRY_CANCEL_TIME.

Sure.

            regards, tom lane

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Jelte Fennema-Nio

Date:

22 November 2024, 12:19:16

On Fri, 22 Nov 2024 at 01:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> How would we do that?  libpqsrv_cancel is not chartered to wait around
> for the results of the cancel, and I'm not even sure that it could
> know what to check for.

Ah yeah, you're right. I got confused by the two timeouts (the one to
wait for the response of the cancel request itself, and the one to
wait for the running query to actually be cancelled).

Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

From

Tom Lane

Date:

23 December 2024, 23:15:40

Jelte Fennema-Nio <postgres@jeltef.nl> writes:
> On Fri, 22 Nov 2024 at 01:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> How would we do that?  libpqsrv_cancel is not chartered to wait around
>> for the results of the cancel, and I'm not even sure that it could
>> know what to check for.

> Ah yeah, you're right. I got confused by the two timeouts (the one to
> wait for the response of the cancel request itself, and the one to
> wait for the running query to actually be cancelled).

Not having heard any better ideas, I pushed that to HEAD and v17
(with the renaming you suggested).

            regards, tom lane