Thread: pgsql: psql: Add test for query canceling

pgsql: psql: Add test for query canceling

From
Peter Eisentraut
Date:
psql: Add test for query canceling

Query canceling in psql was accidentally broken by
3a5130672296ed4e682403a77a9a3ad3d21cef75 (since reverted), so having
some test coverage for that seems useful.

Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/18c78a01-4a34-9dd4-f78b-6860f1420c8e@enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/5b3f471ff23a2773e6c1ee1704377581c987107d

Modified Files
--------------
src/bin/psql/t/020_cancel.pl | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)


Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 20 Aug 2021, at 11:44, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> psql: Add test for query canceling

Seems sidewinder didn’t quite like this:

==~_~===-=-===~_~== pgsql.build/src/bin/psql/tmp_check/log/020_cancel_main.log ==~_~===-=-===~_~==
2021-08-20 12:52:29.992 CEST [9820:1] LOG:  starting PostgreSQL 15devel on x86_64-unknown-netbsd7.0, compiled by gcc
(nb220150115) 4.8.4, 64-bit 
2021-08-20 12:52:29.992 CEST [9820:2] LOG:  listening on Unix socket "/tmp/V5NA0EWHUl/.s.PGSQL.62222"
2021-08-20 12:52:29.994 CEST [9023:1] LOG:  database system was shut down at 2021-08-20 12:52:29 CEST
2021-08-20 12:52:29.995 CEST [9820:3] LOG:  database system is ready to accept connections
2021-08-20 12:52:30.093 CEST [656:1] [unknown] LOG:  connection received: host=[local]
2021-08-20 12:52:30.093 CEST [656:2] [unknown] LOG:  connection authorized: user=pgbf database=postgres
application_name=020_cancel.pl
2021-08-20 12:52:30.098 CEST [656:3] 020_cancel.pl LOG:  statement: select pg_sleep(3);
2021-08-20 12:52:31.121 CEST [9820:4] LOG:  received immediate shutdown request
2021-08-20 12:52:31.121 CEST [656:4] 020_cancel.pl LOG:  could not send data to client: Broken pipe
2021-08-20 12:52:31.121 CEST [656:5] 020_cancel.pl STATEMENT:  select pg_sleep(3);
2021-08-20 12:52:31.125 CEST [9820:5] LOG:  database system is shut down

Nothing obvious sticks out, could it be timing related?

https://brekka.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-08-20%2010%3A45%3A24

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Fabien COELHO
Date:
>> psql: Add test for query canceling
>
> Seems sidewinder didn’t quite like this:

> Nothing obvious sticks out, could it be timing related?

Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section, 
which should mean that the test should be ignore when it fails. It seems 
that this part did not work. Maybe it should not state that it is 
expecting 2 results?

   use Test::More tests => 2;

Replace by:

   use Test::More;

Or something is amiss with the "SKIP" section trick?

-- 
Fabien.

Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 20 Aug 2021, at 17:40, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
>>> psql: Add test for query canceling
>>
>> Seems sidewinder didn’t quite like this:
>
>> Nothing obvious sticks out, could it be timing related?
>
> Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section, which should mean that the test should be
ignorewhen it fails. It seems that this part did not work. Maybe it should not state that it is expecting 2 results? 
>
>  use Test::More tests => 2;
>
> Replace by:
>
>  use Test::More;
>
> Or something is amiss with the "SKIP" section trick?

SKIP blocks only skip tests if the condition is met, in the test as it stands
now it skips it on Windows since it requires a Unix shell:

+SKIP: {
+   skip "cancel test requires a Unix shell", 2 if $windows_os;

On all other platforms the test will execute and count towards the overall test
result of the suite.

If we want the test to run but not fail the entire test suite if it fails then
it should use a TODO block instead, but that’s intended for tests known to fail
and this doesn’t seem to fall in that category.

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
> If we want the test to run but not fail the entire test suite if it fails then
> it should use a TODO block instead, but that’s intended for tests known to fail
> and this doesn’t seem to fall in that category.

That seems pretty useless.  If we did break things in this area,
such a test would not help us notice.

The problem with the test seems blindingly obvious from here: it
is assuming first that psql will start fast enough to print its
PID within one second, and next that we'll be able to issue
the cancel (and have the backend react) in less than 2 seconds
more.  This seems about guaranteed to fail on cache-clobber
animals, for example, but animals that are merely slow or overloaded
would have issues too.

I think you should drop the overly-cute bit with a SIGALRM handler,
and instead have a loop-with-delay around an attempt to read the
psql.pid file, after launching the psql run without an immediate
wait for termination.  That gets rid of the first problem (though
you still want the loop to timeout eventually, it could wait up
to say 180 seconds, as we do elsewhere).  Then the second problem
is easy to solve by making the pg_sleep delay twice as much.

            regards, tom lane



Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> If we want the test to run but not fail the entire test suite if it fails then
>> it should use a TODO block instead, but that’s intended for tests known to fail
>> and this doesn’t seem to fall in that category.
>
> That seems pretty useless.  If we did break things in this area,
> such a test would not help us notice.

For sure.  I wasn’t advocating it, merely indicating that the SKIP block isn’t
working the way attributed to upthread.

> The problem with the test seems blindingly obvious from here: it
> is assuming first that psql will start fast enough to print its
> PID within one second, and next that we'll be able to issue
> the cancel (and have the backend react) in less than 2 seconds
> more.  This seems about guaranteed to fail on cache-clobber
> animals, for example, but animals that are merely slow or overloaded
> would have issues too.
>
> I think you should drop the overly-cute bit with a SIGALRM handler,
> and instead have a loop-with-delay around an attempt to read the
> psql.pid file, after launching the psql run without an immediate
> wait for termination.  That gets rid of the first problem (though
> you still want the loop to timeout eventually, it could wait up
> to say 180 seconds, as we do elsewhere).  Then the second problem
> is easy to solve by making the pg_sleep delay twice as much.

This could perhaps be done with a PostgresNode::interactive_psql session?  I
used that in a similar, but far from the same, test setup in the online
checksums patchset.

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think you should drop the overly-cute bit with a SIGALRM handler,
>> and instead have a loop-with-delay around an attempt to read the
>> psql.pid file, after launching the psql run without an immediate
>> wait for termination.

> This could perhaps be done with a PostgresNode::interactive_psql
> session?

Yeah, we do have that infrastructure available in the
010_tab_completion.pl test.

BTW, after looking more closely at the test, I think there's an
independent problem with the SIGALRM business.  If I read the perl
script correctly, it's arming a SIGALRM timer in the Perl script's
process that is supposed to expire after the psql child process
has been launched.  However, per the NetBSD man pages:

* exec() causes signal handlers to be reset to default actions in the
child process.  The default action for SIGALRM is to terminate the
process.

* Active timer events are *inherited* by the child process.  This seems
not very wise, but execve(2) is pretty definite about it [1].

This means that on NetBSD and similar systems, what actually happens
when the SIGALRM timer times out is that the psql process is killed,
likely before the Perl script is able to signal it, but in any case
not too darn much later.  So I do not think that this test is testing
what it means to test at all: there is no reason to believe that any
query cancel request ever gets sent, even without the timing issue.
Since the Perl script is only testing that psql returns nonzero exit
status, and not bothering to inquire into just which status it was,
it cannot tell "query was canceled" apart from "psql died horribly".
And the latter is what's actually happening.

            regards, tom lane

[1] https://man.netbsd.org/NetBSD-9.2-STABLE/execve.2



Re: pgsql: psql: Add test for query canceling

From
Tom Lane
Date:
I see that c818c25f4 worked around sidewinder's issue with
this test, but wrasse is still unhappy.  I was able to
duplicate the problem on wrasse's host, so I ran it under
truss to see if I could see what's happening.

What the truss log shows is that the Perl script successfully
reads psql.pid and issues kill(SIGINT) to the correct process
... but very shortly thereafter it issues kill(SIGTERM) to the
same process, so of course psql goes belly-up and doesn't
complete the test.

I suppose this is some undocumented interaction between IPC::Run
and the SIGALRM signal handler.  I stand by the recommendation
that you need to not use a signal handler here.

            regards, tom lane



Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Daniel Gustafsson <daniel@yesql.se> writes:
>> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I think you should drop the overly-cute bit with a SIGALRM handler,
>>> and instead have a loop-with-delay around an attempt to read the
>>> psql.pid file, after launching the psql run without an immediate
>>> wait for termination.
>
>> This could perhaps be done with a PostgresNode::interactive_psql
>> session?
>
> Yeah, we do have that infrastructure available in the
> 010_tab_completion.pl test.

In order to satisfy my own curiosity I decided to take a stab at this which
turned into the attached.  I wasn’t able to reliably know that the pg_sleep
query had started except letting the timer expire, but that may very well be
the rust on my Perl skills showing.

Not sure if this moves the needle on this in either direction, but I figured I
should report back progress either way.

--
Daniel Gustafsson        https://vmware.com/


Attachment

Re: pgsql: psql: Add test for query canceling

From
Peter Eisentraut
Date:
On 23.08.21 13:12, Daniel Gustafsson wrote:
>> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Daniel Gustafsson <daniel@yesql.se> writes:
>>> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I think you should drop the overly-cute bit with a SIGALRM handler,
>>>> and instead have a loop-with-delay around an attempt to read the
>>>> psql.pid file, after launching the psql run without an immediate
>>>> wait for termination.
>>
>>> This could perhaps be done with a PostgresNode::interactive_psql
>>> session?
>>
>> Yeah, we do have that infrastructure available in the
>> 010_tab_completion.pl test.
> 
> In order to satisfy my own curiosity I decided to take a stab at this which
> turned into the attached.  I wasn’t able to reliably know that the pg_sleep
> query had started except letting the timer expire, but that may very well be
> the rust on my Perl skills showing.

Thanks for looking into this.  With your patch applied, tests 2, 3, and 
4 fail.  Do they pass for you?



Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 23.08.21 13:12, Daniel Gustafsson wrote:
>>> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>> Daniel Gustafsson <daniel@yesql.se> writes:
>>>> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> I think you should drop the overly-cute bit with a SIGALRM handler,
>>>>> and instead have a loop-with-delay around an attempt to read the
>>>>> psql.pid file, after launching the psql run without an immediate
>>>>> wait for termination.
>>>
>>>> This could perhaps be done with a PostgresNode::interactive_psql
>>>> session?
>>>
>>> Yeah, we do have that infrastructure available in the
>>> 010_tab_completion.pl test.
>> In order to satisfy my own curiosity I decided to take a stab at this which
>> turned into the attached.  I wasn’t able to reliably know that the pg_sleep
>> query had started except letting the timer expire, but that may very well be
>> the rust on my Perl skills showing.
>
> Thanks for looking into this.  With your patch applied, tests 2, 3, and 4 fail.  Do they pass for you?

Thats odd, all tests pass for me.  Did the logs give an indication as to what
was failing?

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Tom Lane
Date:
Daniel Gustafsson <daniel@yesql.se> writes:
>> On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>> Thanks for looking into this.  With your patch applied, tests 2, 3, and 4 fail.  Do they pass for you?

> Thats odd, all tests pass for me.  Did the logs give an indication as to what
> was failing?

I also tried the patch.  It passes for me on my Linux box, although
it takes noticeably longer than before which doesn't seem great.
On wrasse's host, it fails.  The Perl log shows:

ok 1 - PPID timer expired
ok 2 - psql was signaled
not ok 3 - query was canceled

#   Failed test 'query was canceled'
#   at t/020_cancel.pl line 50.
#                   'postgres=# SELECT pg_sleep(15);
# Cancel request sent
# '
#     doesn't match '(?-xism:canceling statement due to user request)'
ok 4 - Sleep query timer expired

The postmaster log shows that the cancel was received and acted on:

2021-08-24 17:39:39.021 CEST [2178] LOG:  database system is ready to accept connections
2021-08-24 17:39:39.270 CEST [2187] 020_cancel.pl LOG:  statement: SELECT pg_sleep(15);
2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl ERROR:  canceling statement due to user request
2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl STATEMENT:  SELECT pg_sleep(15);
2021-08-24 17:39:42.426 CEST [2178] LOG:  received fast shutdown request

so I'm not sure why the error message didn't make its way back to the
client.  Lack of pumping maybe?  It's notable that there's not a lot
of time elapsed between the cancel and the shutdown, according to the
postmaster log, which says that we didn't wait for a timeout there.

Anyway, I'm not terribly thrilled with the patch as-presented, because
what I was hoping for was a test free of timing assumptions.  This seems
to be worse not better by that light.

            regards, tom lane



Re: pgsql: psql: Add test for query canceling

From
Peter Eisentraut
Date:
On 20.08.21 20:47, Tom Lane wrote:
> I think you should drop the overly-cute bit with a SIGALRM handler,
> and instead have a loop-with-delay around an attempt to read the
> psql.pid file, after launching the psql run without an immediate
> wait for termination.  That gets rid of the first problem (though
> you still want the loop to timeout eventually, it could wait up
> to say 180 seconds, as we do elsewhere).  Then the second problem
> is easy to solve by making the pg_sleep delay twice as much.

Here is a proposal.  It waits separately for the pid file to appear and 
also checks for the sleep query to be registered by the backend, so it 
doesn't have any more dependencies on things happening "fast enough". 
And it's also faster in the normal case now.  Thoughts?

Attachment

Re: pgsql: psql: Add test for query canceling

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Here is a proposal.  It waits separately for the pid file to appear and
> also checks for the sleep query to be registered by the backend, so it
> doesn't have any more dependencies on things happening "fast enough".
> And it's also faster in the normal case now.  Thoughts?

That looks pretty solid to me, and I can confirm that it passes
on wrasse's host.  The only nit I can find to pick is that this:

+    usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');

basically assumes that psql.pid will be written atomically.
It'd be marginally safer to wait till psql.pid can be seen to
contain a newline.  I don't think that would be too hard to do,
if you put the slurp_file call inside the wait loop and inspect
its result.

            regards, tom lane



Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 24 Aug 2021, at 18:43, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>
> On 20.08.21 20:47, Tom Lane wrote:
>> I think you should drop the overly-cute bit with a SIGALRM handler,
>> and instead have a loop-with-delay around an attempt to read the
>> psql.pid file, after launching the psql run without an immediate
>> wait for termination.  That gets rid of the first problem (though
>> you still want the loop to timeout eventually, it could wait up
>> to say 180 seconds, as we do elsewhere).  Then the second problem
>> is easy to solve by making the pg_sleep delay twice as much.
>
> Here is a proposal.  It waits separately for the pid file to appear and also checks for the sleep query to be
registeredby the backend, so it doesn't have any more dependencies on things happening "fast enough". And it's also
fasterin the normal case now.  Thoughts? 

From reading the patch, that looks like a really good combination of ideas for
this test.  +1.

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Fabien COELHO
Date:
> That looks pretty solid to me, and I can confirm that it passes
> on wrasse's host.  The only nit I can find to pick is that this:
>
> +    usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');
>
> basically assumes that psql.pid will be written atomically.

> It'd be marginally safer to wait till psql.pid can be seen to
> contain a newline.  I don't think that would be too hard to do,
> if you put the slurp_file call inside the wait loop and inspect
> its result.

I finally came around to have a look at the patch.

For the issue raised above, I can see that the file could be created but 
not yet written, but as the patch waits for the file to be non zero, ISTM 
that the probability of a torn write of a single integer by echo is so 
remote that we could let it as that?

Attached a version with the slurping in the loop anyway.

-- 
Fabien.
Attachment

Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 25 Aug 2021, at 07:37, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
>> That looks pretty solid to me, and I can confirm that it passes
>> on wrasse's host.  The only nit I can find to pick is that this:
>>
>> +    usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear');
>>
>> basically assumes that psql.pid will be written atomically.
>
>> It'd be marginally safer to wait till psql.pid can be seen to
>> contain a newline.  I don't think that would be too hard to do,
>> if you put the slurp_file call inside the wait loop and inspect
>> its result.
>
> I finally came around to have a look at the patch.
>
> For the issue raised above, I can see that the file could be created but not yet written, but as the patch waits for
thefile to be non zero, ISTM that the probability of a torn write of a single integer by echo is so remote that we
couldlet it as that? 
>
> Attached a version with the slurping in the loop anyway.

It might be that the risk of a torn write is close to negligible, but that just
means that it’s bound to happen, and pretty soon too =) Checking with slurping
in the loop seems like a good idea to me.

--
Daniel Gustafsson        https://vmware.com/




Re: pgsql: psql: Add test for query canceling

From
Peter Eisentraut
Date:
On 25.08.21 07:37, Fabien COELHO wrote:
> I finally came around to have a look at the patch.
> 
> For the issue raised above, I can see that the file could be created but 
> not yet written, but as the patch waits for the file to be non zero, 
> ISTM that the probability of a torn write of a single integer by echo is 
> so remote that we could let it as that?
> 
> Attached a version with the slurping in the loop anyway.

It looks like this fixed it.  Thanks all.




Re: pgsql: psql: Add test for query canceling

From
Daniel Gustafsson
Date:
> On 30 Aug 2021, at 06:27, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Aug 26, 2021 at 08:39:31AM +0200, Peter Eisentraut wrote:
>> It looks like this fixed it.  Thanks all.
>
> conchuela has complained today on this test, as of HEAD:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-08-30%2001%3A00%3A12

#                   ''
#     doesn't match '(?^:canceling statement due to user request)'

This smells like a timing issue again, which is unexpected given how it's coded.

--
Daniel Gustafsson        https://vmware.com/