Thread: Query cancel seems to be broken in master since Oct 17

Query cancel seems to be broken in master since Oct 17

From
Vladimir Sitnikov
Date:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?

Vladimir

Re: Query cancel seems to be broken in master since Oct 17

From
Magnus Hagander
Date:


On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?



That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.

What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken? 


--

Re: Query cancel seems to be broken in master since Oct 17

From
Vladimir Gordiychuk
Date:
>What platform does the postgres server run on?

Ubuntu
OS name: "linux", version: "3.19.0-66-generic", arch: "amd64", family: "unix"

2016-10-18 11:05 GMT+03:00 Magnus Hagander <magnus@hagander.net>:


On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
Hi,

In pgjdbc we have regular regression testing against "build from master" PostgreSQL, and recent master builds fail for "statement cancel" test.

The PostgreSQL as of Mon Oct 17 00:09:39 UTC 2016 was fine, then "statement cancel" started to fail.
The test executes "select pg_sleep(10)" and tries to cancel it. In recent master builds, cancel seems to be ignored, and the statement lasts for 10 seconds.

Exactly the same driver and test version works fine for 8.4, 9.1, 9.2, 9.3, 9.4, 9.5, and 9.6:

Any hints what could be the issue?
Was the breakage intentional?



That's certainly not intended to break things, but that was changed on Oct 17 and it relates to cancel keys.

What platform does the postgres server run on? Can can you check if query cancel works on libpq or if it's completely broken? 


--

Re: Query cancel seems to be broken in master since Oct 17

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
> sitnikov.vladimir@gmail.com> wrote:
>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>> master builds, cancel seems to be ignored, and the statement lasts for 10
>> seconds.

> My guess is it's related to this:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
> That's certainly not intended to break things, but that was changed on Oct
> 17 and it relates to cancel keys.
> What platform does the postgres server run on? Can can you check if query
> cancel works on libpq or if it's completely broken?

I can confirm that query cancel is broken in HEAD on RHEL6.

regression=# select pg_sleep(10);
^CCancel request sent
... nothing happens for the balance of the 10 seconds ...
regression=#

There's a smoking gun in the postmaster log:

2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

From
Heikki Linnakangas
Date:
On 10/18/2016 04:13 PM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, Oct 18, 2016 at 1:00 AM, Vladimir Sitnikov <
>> sitnikov.vladimir@gmail.com> wrote:
>>> The test executes "select pg_sleep(10)" and tries to cancel it. In recent
>>> master builds, cancel seems to be ignored, and the statement lasts for 10
>>> seconds.
>
>> My guess is it's related to this:
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9e083fd4683294f41544e6d0d72f6e258ff3a77c
>> That's certainly not intended to break things, but that was changed on Oct
>> 17 and it relates to cancel keys.
>> What platform does the postgres server run on? Can can you check if query
>> cancel works on libpq or if it's completely broken?
>
> I can confirm that query cancel is broken in HEAD on RHEL6.
>
> regression=# select pg_sleep(10);
> ^CCancel request sent
> ... nothing happens for the balance of the 10 seconds ...
> regression=#
>
> There's a smoking gun in the postmaster log:
>
> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491

Ok, I've reverted that commit for now. It clearly needs more thought, 
because of this, and the pademelon failure discussed on the other thread.

- Heikki




Re: Query cancel seems to be broken in master since Oct 17

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 10/18/2016 04:13 PM, Tom Lane wrote:
>> There's a smoking gun in the postmaster log:
>> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491

> Ok, I've reverted that commit for now. It clearly needs more thought,
> because of this, and the pademelon failure discussed on the other thread.

I think that was an overreaction.  The problem is pretty obvious after
adding some instrumentation:

2016-10-18 09:57:47.508 EDT [21229] LOG:  wrong key (0x7B7E4D5E, expected 0xF0F804017B7E4D5E) in cancel request for
process21228 

To wit, the various cancel_key backend variables are declared as "long",
and the new code
if (!pg_strong_random(&MyCancelKey, sizeof(MyCancelKey)))

is therefore computing an 8-byte random value on 64-bit-long machines.
But only 4 bytes go to the client and come back.

The cleanest fix might be to change those various "long" variables
to uint32.  You'd have to think about how to handle the ntohl/htonl
calls that are used on them, though.
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

From
Tom Lane
Date:
I wrote:
> The cleanest fix might be to change those various "long" variables
> to uint32.  You'd have to think about how to handle the ntohl/htonl
> calls that are used on them, though.

Or actually, no, you wouldn't have to think very hard.  I was supposing
that those calls were declared to traffic in "long"s, but they aren't
and never have been, at least not since SUSv2:

uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);

So s/long/uint32/ would probably fix it just fine.
        regards, tom lane



Re: Query cancel seems to be broken in master since Oct 17

From
Michael Paquier
Date:
On Tue, Oct 18, 2016 at 11:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> The cleanest fix might be to change those various "long" variables
>> to uint32.  You'd have to think about how to handle the ntohl/htonl
>> calls that are used on them, though.
>
> Or actually, no, you wouldn't have to think very hard.  I was supposing
> that those calls were declared to traffic in "long"s, but they aren't
> and never have been, at least not since SUSv2:
>
> uint32_t htonl(uint32_t hostlong);
> uint16_t htons(uint16_t hostshort);
> uint32_t ntohl(uint32_t netlong);
> uint16_t ntohs(uint16_t netshort);
>
> So s/long/uint32/ would probably fix it just fine.

The pg_strong_random patch has been reverted, what's done is done.
FWIW only doing that is needed to fix the problem as already mentioned
by Tom:
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index f232083..634578d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -38,7 +38,7 @@ volatile uint32 CritSectionCount = 0;int            MyProcPid;pg_time_t  MyStartTime;struct Port
*MyProcPort;
-long       MyCancelKey;
+uint32     MyCancelKey;int            MyPMChildSlot;
/*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 78545da..5e623a1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -162,7 +162,7 @@ extern PGDLLIMPORT int MyProcPid;extern PGDLLIMPORT pg_time_t MyStartTime;extern PGDLLIMPORT struct
Port*MyProcPort;extern PGDLLIMPORT struct Latch *MyLatch;
 
-extern long MyCancelKey;
+extern uint32 MyCancelKey;extern int MyPMChildSlot;
extern char OutputFileName[];

I'll send an update for that, as well as a solution for the pademelon
problem on the SCRAM thread.
-- 
Michael



Re: Query cancel seems to be broken in master since Oct 17

From
Noah Misch
Date:
On Tue, Oct 18, 2016 at 10:03:39AM -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On 10/18/2016 04:13 PM, Tom Lane wrote:
> >> There's a smoking gun in the postmaster log:
> >> 2016-10-18 09:10:34.547 EDT [18502] LOG:  wrong key in cancel request for process 18491
> 
> > Ok, I've reverted that commit for now. It clearly needs more thought, 
> > because of this, and the pademelon failure discussed on the other thread.
> 
> I think that was an overreaction.  The problem is pretty obvious after
> adding some instrumentation:

I endorse Heikki's revert.