Thread: pgsql: psql: Add test for query canceling
psql: Add test for query canceling Query canceling in psql was accidentally broken by 3a5130672296ed4e682403a77a9a3ad3d21cef75 (since reverted), so having some test coverage for that seems useful. Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr> Discussion: https://www.postgresql.org/message-id/18c78a01-4a34-9dd4-f78b-6860f1420c8e@enterprisedb.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/5b3f471ff23a2773e6c1ee1704377581c987107d Modified Files -------------- src/bin/psql/t/020_cancel.pl | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+)
> On 20 Aug 2021, at 11:44, Peter Eisentraut <peter@eisentraut.org> wrote: > > psql: Add test for query canceling Seems sidewinder didn’t quite like this: ==~_~===-=-===~_~== pgsql.build/src/bin/psql/tmp_check/log/020_cancel_main.log ==~_~===-=-===~_~== 2021-08-20 12:52:29.992 CEST [9820:1] LOG: starting PostgreSQL 15devel on x86_64-unknown-netbsd7.0, compiled by gcc (nb220150115) 4.8.4, 64-bit 2021-08-20 12:52:29.992 CEST [9820:2] LOG: listening on Unix socket "/tmp/V5NA0EWHUl/.s.PGSQL.62222" 2021-08-20 12:52:29.994 CEST [9023:1] LOG: database system was shut down at 2021-08-20 12:52:29 CEST 2021-08-20 12:52:29.995 CEST [9820:3] LOG: database system is ready to accept connections 2021-08-20 12:52:30.093 CEST [656:1] [unknown] LOG: connection received: host=[local] 2021-08-20 12:52:30.093 CEST [656:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=020_cancel.pl 2021-08-20 12:52:30.098 CEST [656:3] 020_cancel.pl LOG: statement: select pg_sleep(3); 2021-08-20 12:52:31.121 CEST [9820:4] LOG: received immediate shutdown request 2021-08-20 12:52:31.121 CEST [656:4] 020_cancel.pl LOG: could not send data to client: Broken pipe 2021-08-20 12:52:31.121 CEST [656:5] 020_cancel.pl STATEMENT: select pg_sleep(3); 2021-08-20 12:52:31.125 CEST [9820:5] LOG: database system is shut down Nothing obvious sticks out, could it be timing related? https://brekka.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-08-20%2010%3A45%3A24 -- Daniel Gustafsson https://vmware.com/
>> psql: Add test for query canceling > > Seems sidewinder didn’t quite like this: > Nothing obvious sticks out, could it be timing related? Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section, which should mean that the test should be ignore when it fails. It seems that this part did not work. Maybe it should not state that it is expecting 2 results? use Test::More tests => 2; Replace by: use Test::More; Or something is amiss with the "SKIP" section trick? -- Fabien.
> On 20 Aug 2021, at 17:40, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > >>> psql: Add test for query canceling >> >> Seems sidewinder didn’t quite like this: > >> Nothing obvious sticks out, could it be timing related? > > Hmmm. Possibly, as the test is time dependent, hence the "SKIP:" section, which should mean that the test should be ignorewhen it fails. It seems that this part did not work. Maybe it should not state that it is expecting 2 results? > > use Test::More tests => 2; > > Replace by: > > use Test::More; > > Or something is amiss with the "SKIP" section trick? SKIP blocks only skip tests if the condition is met, in the test as it stands now it skips it on Windows since it requires a Unix shell: +SKIP: { + skip "cancel test requires a Unix shell", 2 if $windows_os; On all other platforms the test will execute and count towards the overall test result of the suite. If we want the test to run but not fail the entire test suite if it fails then it should use a TODO block instead, but that’s intended for tests known to fail and this doesn’t seem to fall in that category. -- Daniel Gustafsson https://vmware.com/
Daniel Gustafsson <daniel@yesql.se> writes: > If we want the test to run but not fail the entire test suite if it fails then > it should use a TODO block instead, but that’s intended for tests known to fail > and this doesn’t seem to fall in that category. That seems pretty useless. If we did break things in this area, such a test would not help us notice. The problem with the test seems blindingly obvious from here: it is assuming first that psql will start fast enough to print its PID within one second, and next that we'll be able to issue the cancel (and have the backend react) in less than 2 seconds more. This seems about guaranteed to fail on cache-clobber animals, for example, but animals that are merely slow or overloaded would have issues too. I think you should drop the overly-cute bit with a SIGALRM handler, and instead have a loop-with-delay around an attempt to read the psql.pid file, after launching the psql run without an immediate wait for termination. That gets rid of the first problem (though you still want the loop to timeout eventually, it could wait up to say 180 seconds, as we do elsewhere). Then the second problem is easy to solve by making the pg_sleep delay twice as much. regards, tom lane
> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Daniel Gustafsson <daniel@yesql.se> writes: >> If we want the test to run but not fail the entire test suite if it fails then >> it should use a TODO block instead, but that’s intended for tests known to fail >> and this doesn’t seem to fall in that category. > > That seems pretty useless. If we did break things in this area, > such a test would not help us notice. For sure. I wasn’t advocating it, merely indicating that the SKIP block isn’t working the way attributed to upthread. > The problem with the test seems blindingly obvious from here: it > is assuming first that psql will start fast enough to print its > PID within one second, and next that we'll be able to issue > the cancel (and have the backend react) in less than 2 seconds > more. This seems about guaranteed to fail on cache-clobber > animals, for example, but animals that are merely slow or overloaded > would have issues too. > > I think you should drop the overly-cute bit with a SIGALRM handler, > and instead have a loop-with-delay around an attempt to read the > psql.pid file, after launching the psql run without an immediate > wait for termination. That gets rid of the first problem (though > you still want the loop to timeout eventually, it could wait up > to say 180 seconds, as we do elsewhere). Then the second problem > is easy to solve by making the pg_sleep delay twice as much. This could perhaps be done with a PostgresNode::interactive_psql session? I used that in a similar, but far from the same, test setup in the online checksums patchset. -- Daniel Gustafsson https://vmware.com/
Daniel Gustafsson <daniel@yesql.se> writes: > On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think you should drop the overly-cute bit with a SIGALRM handler, >> and instead have a loop-with-delay around an attempt to read the >> psql.pid file, after launching the psql run without an immediate >> wait for termination. > This could perhaps be done with a PostgresNode::interactive_psql > session? Yeah, we do have that infrastructure available in the 010_tab_completion.pl test. BTW, after looking more closely at the test, I think there's an independent problem with the SIGALRM business. If I read the perl script correctly, it's arming a SIGALRM timer in the Perl script's process that is supposed to expire after the psql child process has been launched. However, per the NetBSD man pages: * exec() causes signal handlers to be reset to default actions in the child process. The default action for SIGALRM is to terminate the process. * Active timer events are *inherited* by the child process. This seems not very wise, but execve(2) is pretty definite about it [1]. This means that on NetBSD and similar systems, what actually happens when the SIGALRM timer times out is that the psql process is killed, likely before the Perl script is able to signal it, but in any case not too darn much later. So I do not think that this test is testing what it means to test at all: there is no reason to believe that any query cancel request ever gets sent, even without the timing issue. Since the Perl script is only testing that psql returns nonzero exit status, and not bothering to inquire into just which status it was, it cannot tell "query was canceled" apart from "psql died horribly". And the latter is what's actually happening. regards, tom lane [1] https://man.netbsd.org/NetBSD-9.2-STABLE/execve.2
I see that c818c25f4 worked around sidewinder's issue with this test, but wrasse is still unhappy. I was able to duplicate the problem on wrasse's host, so I ran it under truss to see if I could see what's happening. What the truss log shows is that the Perl script successfully reads psql.pid and issues kill(SIGINT) to the correct process ... but very shortly thereafter it issues kill(SIGTERM) to the same process, so of course psql goes belly-up and doesn't complete the test. I suppose this is some undocumented interaction between IPC::Run and the SIGALRM signal handler. I stand by the recommendation that you need to not use a signal handler here. regards, tom lane
> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Daniel Gustafsson <daniel@yesql.se> writes: >> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think you should drop the overly-cute bit with a SIGALRM handler, >>> and instead have a loop-with-delay around an attempt to read the >>> psql.pid file, after launching the psql run without an immediate >>> wait for termination. > >> This could perhaps be done with a PostgresNode::interactive_psql >> session? > > Yeah, we do have that infrastructure available in the > 010_tab_completion.pl test. In order to satisfy my own curiosity I decided to take a stab at this which turned into the attached. I wasn’t able to reliably know that the pg_sleep query had started except letting the timer expire, but that may very well be the rust on my Perl skills showing. Not sure if this moves the needle on this in either direction, but I figured I should report back progress either way. -- Daniel Gustafsson https://vmware.com/
Attachment
On 23.08.21 13:12, Daniel Gustafsson wrote: >> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Daniel Gustafsson <daniel@yesql.se> writes: >>> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I think you should drop the overly-cute bit with a SIGALRM handler, >>>> and instead have a loop-with-delay around an attempt to read the >>>> psql.pid file, after launching the psql run without an immediate >>>> wait for termination. >> >>> This could perhaps be done with a PostgresNode::interactive_psql >>> session? >> >> Yeah, we do have that infrastructure available in the >> 010_tab_completion.pl test. > > In order to satisfy my own curiosity I decided to take a stab at this which > turned into the attached. I wasn’t able to reliably know that the pg_sleep > query had started except letting the timer expire, but that may very well be > the rust on my Perl skills showing. Thanks for looking into this. With your patch applied, tests 2, 3, and 4 fail. Do they pass for you?
> On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 23.08.21 13:12, Daniel Gustafsson wrote: >>> On 21 Aug 2021, at 00:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> >>> Daniel Gustafsson <daniel@yesql.se> writes: >>>> On 20 Aug 2021, at 20:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>> I think you should drop the overly-cute bit with a SIGALRM handler, >>>>> and instead have a loop-with-delay around an attempt to read the >>>>> psql.pid file, after launching the psql run without an immediate >>>>> wait for termination. >>> >>>> This could perhaps be done with a PostgresNode::interactive_psql >>>> session? >>> >>> Yeah, we do have that infrastructure available in the >>> 010_tab_completion.pl test. >> In order to satisfy my own curiosity I decided to take a stab at this which >> turned into the attached. I wasn’t able to reliably know that the pg_sleep >> query had started except letting the timer expire, but that may very well be >> the rust on my Perl skills showing. > > Thanks for looking into this. With your patch applied, tests 2, 3, and 4 fail. Do they pass for you? Thats odd, all tests pass for me. Did the logs give an indication as to what was failing? -- Daniel Gustafsson https://vmware.com/
Daniel Gustafsson <daniel@yesql.se> writes: >> On 24 Aug 2021, at 15:40, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: >> Thanks for looking into this. With your patch applied, tests 2, 3, and 4 fail. Do they pass for you? > Thats odd, all tests pass for me. Did the logs give an indication as to what > was failing? I also tried the patch. It passes for me on my Linux box, although it takes noticeably longer than before which doesn't seem great. On wrasse's host, it fails. The Perl log shows: ok 1 - PPID timer expired ok 2 - psql was signaled not ok 3 - query was canceled # Failed test 'query was canceled' # at t/020_cancel.pl line 50. # 'postgres=# SELECT pg_sleep(15); # Cancel request sent # ' # doesn't match '(?-xism:canceling statement due to user request)' ok 4 - Sleep query timer expired The postmaster log shows that the cancel was received and acted on: 2021-08-24 17:39:39.021 CEST [2178] LOG: database system is ready to accept connections 2021-08-24 17:39:39.270 CEST [2187] 020_cancel.pl LOG: statement: SELECT pg_sleep(15); 2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl ERROR: canceling statement due to user request 2021-08-24 17:39:42.407 CEST [2187] 020_cancel.pl STATEMENT: SELECT pg_sleep(15); 2021-08-24 17:39:42.426 CEST [2178] LOG: received fast shutdown request so I'm not sure why the error message didn't make its way back to the client. Lack of pumping maybe? It's notable that there's not a lot of time elapsed between the cancel and the shutdown, according to the postmaster log, which says that we didn't wait for a timeout there. Anyway, I'm not terribly thrilled with the patch as-presented, because what I was hoping for was a test free of timing assumptions. This seems to be worse not better by that light. regards, tom lane
On 20.08.21 20:47, Tom Lane wrote: > I think you should drop the overly-cute bit with a SIGALRM handler, > and instead have a loop-with-delay around an attempt to read the > psql.pid file, after launching the psql run without an immediate > wait for termination. That gets rid of the first problem (though > you still want the loop to timeout eventually, it could wait up > to say 180 seconds, as we do elsewhere). Then the second problem > is easy to solve by making the pg_sleep delay twice as much. Here is a proposal. It waits separately for the pid file to appear and also checks for the sleep query to be registered by the backend, so it doesn't have any more dependencies on things happening "fast enough". And it's also faster in the normal case now. Thoughts?
Attachment
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > Here is a proposal. It waits separately for the pid file to appear and > also checks for the sleep query to be registered by the backend, so it > doesn't have any more dependencies on things happening "fast enough". > And it's also faster in the normal case now. Thoughts? That looks pretty solid to me, and I can confirm that it passes on wrasse's host. The only nit I can find to pick is that this: + usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear'); basically assumes that psql.pid will be written atomically. It'd be marginally safer to wait till psql.pid can be seen to contain a newline. I don't think that would be too hard to do, if you put the slurp_file call inside the wait loop and inspect its result. regards, tom lane
> On 24 Aug 2021, at 18:43, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 20.08.21 20:47, Tom Lane wrote: >> I think you should drop the overly-cute bit with a SIGALRM handler, >> and instead have a loop-with-delay around an attempt to read the >> psql.pid file, after launching the psql run without an immediate >> wait for termination. That gets rid of the first problem (though >> you still want the loop to timeout eventually, it could wait up >> to say 180 seconds, as we do elsewhere). Then the second problem >> is easy to solve by making the pg_sleep delay twice as much. > > Here is a proposal. It waits separately for the pid file to appear and also checks for the sleep query to be registeredby the backend, so it doesn't have any more dependencies on things happening "fast enough". And it's also fasterin the normal case now. Thoughts? From reading the patch, that looks like a really good combination of ideas for this test. +1. -- Daniel Gustafsson https://vmware.com/
> That looks pretty solid to me, and I can confirm that it passes > on wrasse's host. The only nit I can find to pick is that this: > > + usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear'); > > basically assumes that psql.pid will be written atomically. > It'd be marginally safer to wait till psql.pid can be seen to > contain a newline. I don't think that would be too hard to do, > if you put the slurp_file call inside the wait loop and inspect > its result. I finally came around to have a look at the patch. For the issue raised above, I can see that the file could be created but not yet written, but as the patch waits for the file to be non zero, ISTM that the probability of a torn write of a single integer by echo is so remote that we could let it as that? Attached a version with the slurping in the loop anyway. -- Fabien.
Attachment
> On 25 Aug 2021, at 07:37, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > >> That looks pretty solid to me, and I can confirm that it passes >> on wrasse's host. The only nit I can find to pick is that this: >> >> + usleep(10_000) until -s "$tempdir/psql.pid" or ($count++ > 180 * 100 and die 'pid file did not appear'); >> >> basically assumes that psql.pid will be written atomically. > >> It'd be marginally safer to wait till psql.pid can be seen to >> contain a newline. I don't think that would be too hard to do, >> if you put the slurp_file call inside the wait loop and inspect >> its result. > > I finally came around to have a look at the patch. > > For the issue raised above, I can see that the file could be created but not yet written, but as the patch waits for thefile to be non zero, ISTM that the probability of a torn write of a single integer by echo is so remote that we couldlet it as that? > > Attached a version with the slurping in the loop anyway. It might be that the risk of a torn write is close to negligible, but that just means that it’s bound to happen, and pretty soon too =) Checking with slurping in the loop seems like a good idea to me. -- Daniel Gustafsson https://vmware.com/
On 25.08.21 07:37, Fabien COELHO wrote: > I finally came around to have a look at the patch. > > For the issue raised above, I can see that the file could be created but > not yet written, but as the patch waits for the file to be non zero, > ISTM that the probability of a torn write of a single integer by echo is > so remote that we could let it as that? > > Attached a version with the slurping in the loop anyway. It looks like this fixed it. Thanks all.
> On 30 Aug 2021, at 06:27, Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Aug 26, 2021 at 08:39:31AM +0200, Peter Eisentraut wrote: >> It looks like this fixed it. Thanks all. > > conchuela has complained today on this test, as of HEAD: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-08-30%2001%3A00%3A12 # '' # doesn't match '(?^:canceling statement due to user request)' This smells like a timing issue again, which is unexpected given how it's coded. -- Daniel Gustafsson https://vmware.com/