Thread: interrupted tap tests leave postgres instances around

interrupted tap tests leave postgres instances around

From
Andres Freund
Date:
Hi,

When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
postgres instances etc. That doesn't strike me as a good thing.

In contrast, the postgres instances started by pg_regress do terminate. I
assume this is because pg_regress starts postgres directly, whereas tap tests
largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
without a controlling terminal. Thus a ctrl-c won't be delivered to it.

ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
the stuff we already do in END.

That still leaves us with some other potential processes around that won't
immediately exec, but it'd be much better already.

Greetings,

Andres Freund



Re: interrupted tap tests leave postgres instances around

From
Michael Paquier
Date:
On Thu, Sep 29, 2022 at 09:07:34PM -0700, Andres Freund wrote:
> ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
> the stuff we already do in END.

Hmm, indeed.  And here I thought that END was actually taking care of
that on an interrupt..
--
Michael

Attachment

Re: interrupted tap tests leave postgres instances around

From
Alvaro Herrera
Date:
On 2022-Sep-30, Michael Paquier wrote:

> On Thu, Sep 29, 2022 at 09:07:34PM -0700, Andres Freund wrote:
> > ISTM we should at least install a SIGINT/TERM handler in Cluster.pm that does
> > the stuff we already do in END.
> 
> Hmm, indeed.  And here I thought that END was actually taking care of
> that on an interrupt..

Me too.  But the perlmod manpage says

       An "END" code block is executed as late as possible, that is, after perl has
       finished running the program and just before the interpreter is being exited,
       even if it is exiting as a result of a die() function.  (But not if it's
       morphing into another program via "exec", or being blown out of the water by a
       signal--you have to trap that yourself (if you can).)

So clearly we need to fix it.  I thought it should be as simple as the
attached, since exit() calls END.  (Would it be better to die() instead
of exit()?)

But on testing, some nodes linger after being sent a shutdown signal.
I'm not clear why this is -- I think it's due to the fact that we send
the signal just as the node is starting up, which means the signal
doesn't reach the process.  (I added the 0002 patch --not for commit--
to see which Clusters were being shut down and in the trace file I can
clearly see that the nodes that linger were definitely subject to
->teardown_node).


Another funny thing: C-C'ing one run, I got this lingering process:

alvherre  800868 98.2  0.0  12144  5052 pts/9    R    11:03   0:26 /pgsql/install/master/bin/psql -X -c BASE_BACKUP
(CHECKPOINT'fast', MAX_RATE 32); -c SELECT pg_backup_stop() -d port=54380 host=/tmp/O_2PPNj9Fg dbname='postgres'
replication=database

This is probably a bug in psql.  Backtrace is:

#0  PQclear (res=<optimized out>) at /pgsql/source/master/src/interfaces/libpq/fe-exec.c:748
#1  PQclear (res=res@entry=0x55ad308c6190) at /pgsql/source/master/src/interfaces/libpq/fe-exec.c:718
#2  0x000055ad2f303323 in ClearOrSaveResult (result=0x55ad308c6190) at /pgsql/source/master/src/bin/psql/common.c:472
#3  ClearOrSaveAllResults () at /pgsql/source/master/src/bin/psql/common.c:488
#4  ExecQueryAndProcessResults (query=query@entry=0x55ad308bc7a0 "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);", 
    elapsed_msec=elapsed_msec@entry=0x7fff9c9941d8, svpt_gone_p=svpt_gone_p@entry=0x7fff9c9941d7,
is_watch=is_watch@entry=false,
 
    opt=opt@entry=0x0, printQueryFout=printQueryFout@entry=0x0) at /pgsql/source/master/src/bin/psql/common.c:1608
#5  0x000055ad2f301b9d in SendQuery (query=0x55ad308bc7a0 "BASE_BACKUP (CHECKPOINT 'fast', MAX_RATE 32);")
    at /pgsql/source/master/src/bin/psql/common.c:1172
#6  0x000055ad2f2f7bd9 in main (argc=<optimized out>, argv=<optimized out>) at
/pgsql/source/master/src/bin/psql/startup.c:384


-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."                (Robert Davidson)
               http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php

Attachment

Re: interrupted tap tests leave postgres instances around

From
Andres Freund
Date:
Hi,

On 2022-09-30 11:17:00 +0200, Alvaro Herrera wrote:
> But on testing, some nodes linger after being sent a shutdown signal.
> I'm not clear why this is -- I think it's due to the fact that we send
> the signal just as the node is starting up, which means the signal
> doesn't reach the process.

I suspect it's when a test gets interrupt while pg_ctl is starting the
backend. The start() routine only does _update_pid() after pg_ctl finished,
and terminate()->stop() returns before doing anything if pid isn't defined.

Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
0 and _pid is undefined?

That does seem to reduce the incidence of "leftover" postgres
instances. 001_start_stop.pl leaves some behind, but that makes sense, because
it's bypassing the whole node management. But I still occasionally see some
remaining processes if I crank up test concurrency.

Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
course it can fail as part of the shutdown.

But there's still some that survive, where your perl.trace doesn't contain the
node getting shut down...


Greetings,

Andres Freund



Re: interrupted tap tests leave postgres instances around

From
Peter Eisentraut
Date:
On 30.09.22 06:07, Andres Freund wrote:
> When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
> postgres instances etc. That doesn't strike me as a good thing.
> 
> In contrast, the postgres instances started by pg_regress do terminate. I
> assume this is because pg_regress starts postgres directly, whereas tap tests
> largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
> without a controlling terminal. Thus a ctrl-c won't be delivered to it.

I ran into the problem recently that pg_upgrade starts the servers with 
pg_ctl, and thus without terminal, and so you can't get any password 
prompts for SSL keys, for example.  Taking out the setsid() call in 
pg_ctl.c fixed that.  I suspect this is ultimately the same problem.

We could make TAP tests and pg_upgrade not use pg_ctl and start 
postmaster directly.  I'm not sure how much work that would be, but 
seeing that pg_regress does it, it doesn't seem unreasonable.

Alternatively, perhaps we could make a mode for pg_ctl that it doesn't 
call setsid().  This could be activated by an environment variable. 
That might address all these problems, too.




Re: interrupted tap tests leave postgres instances around

From
Andres Freund
Date:
Hi,

On 2022-10-04 10:24:19 +0200, Peter Eisentraut wrote:
> On 30.09.22 06:07, Andres Freund wrote:
> > When tap tests are interrupted (e.g. with ctrl-c), we don't cancel running
> > postgres instances etc. That doesn't strike me as a good thing.
> > 
> > In contrast, the postgres instances started by pg_regress do terminate. I
> > assume this is because pg_regress starts postgres directly, whereas tap tests
> > largely start postgres via pg_ctl. pg_ctl will, as it should, start postgres
> > without a controlling terminal. Thus a ctrl-c won't be delivered to it.
> 
> I ran into the problem recently that pg_upgrade starts the servers with
> pg_ctl, and thus without terminal, and so you can't get any password prompts
> for SSL keys, for example.

For this specific case I wonder if pg_upgrade should disable ssl... That would
require fixing pg_upgrade to use a unix socket on windows, but that'd be a
good idea anyway.


> Taking out the setsid() call in pg_ctl.c fixed that.  I suspect this is
> ultimately the same problem.

> We could make TAP tests and pg_upgrade not use pg_ctl and start postmaster
> directly.  I'm not sure how much work that would be, but seeing that
> pg_regress does it, it doesn't seem unreasonable.

It's not trivial, particularly from perl. Check all the stuff pg_regress and
pg_ctl do around windows accounts and tokens.


> Alternatively, perhaps we could make a mode for pg_ctl that it doesn't call
> setsid().  This could be activated by an environment variable. That might
> address all these problems, too.

It looks like that won't help. Because pg_ctl exits after forking postgres,
postgres parent isn't the shell anymore...

Greetings,

Andres Freund



Re: interrupted tap tests leave postgres instances around

From
Alvaro Herrera
Date:
On 2022-Oct-01, Andres Freund wrote:

> Perhaps the END{} routine should call $node->_update_pid(-1); if $exit_code !=
> 0 and _pid is undefined?

Yeah, that sounds reasonable.

> That does seem to reduce the incidence of "leftover" postgres
> instances. 001_start_stop.pl leaves some behind, but that makes sense, because
> it's bypassing the whole node management. But I still occasionally see some
> remaining processes if I crank up test concurrency.
> 
> Ah! At least part of the problem is that sub stop() does BAIL_OUT, and of
> course it can fail as part of the shutdown.

I made teardown_node pass down fail_ok=>1 to avoid this problem, so we
no longer BAIL_OUT in that case.


> But there's still some that survive, where your perl.trace doesn't contain the
> node getting shut down...

Yeah, something's still unexplained.  I'll get this pushed soon, which
already reduces the number of leftover instances a good amount.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Attachment

Re: interrupted tap tests leave postgres instances around

From
Alvaro Herrera
Date:
Pushed this.  It should improve things significantly.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"



Re: interrupted tap tests leave postgres instances around

From
Andres Freund
Date:
On 2022-10-19 17:38:50 +0200, Alvaro Herrera wrote:
> Pushed this.  It should improve things significantly.

Thanks for working on this!