TAP test fail: we don't always detect backend crashes - Mailing list pgsql-hackers

From Tom Lane
Subject TAP test fail: we don't always detect backend crashes
Date
Msg-id 2366244.1651681550@sss.pgh.pa.us
Whole thread Raw
List pgsql-hackers
I discovered while poking at an LDAP problem that our TAP tests are
100% reproducibly capable of ignoring server crashes and reporting
success anyway.  This problem occurs if the postmaster doesn't get
the child failure report until after shutdown has been initiated,
in which case you find something like this in the postmaster log:

2022-05-04 12:01:33.946 EDT [57945] [unknown] LOG:  connection received: host=[local]
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG:  connection authenticated: identity="uid=test1,dc=example,dc=net"
method=ldap(/Users/tgl/pgsql/src/test/ldap/tmp_check/t_001_auth_node_data/pgdata/pg_hba.conf:1) 
2022-05-04 12:01:33.995 EDT [57945] [unknown] LOG:  connection authorized: user=test1 database=postgres
application_name=001_auth.pl
2022-05-04 12:01:33.998 EDT [57945] 001_auth.pl LOG:  statement: SELECT $$connected with user=test1$$
2022-05-04 12:01:34.003 EDT [57937] LOG:  received fast shutdown request
2022-05-04 12:01:34.003 EDT [57937] LOG:  aborting any active transactions
2022-05-04 12:01:34.003 EDT [57937] LOG:  background worker "logical replication launcher" (PID 57943) exited with exit
code1 
2022-05-04 12:01:35.750 EDT [57937] LOG:  server process (PID 57945) was terminated by signal 11: Segmentation fault:
11
2022-05-04 12:01:35.751 EDT [57937] LOG:  terminating any other active server processes
2022-05-04 12:01:35.751 EDT [57937] LOG:  abnormal database system shutdown
2022-05-04 12:01:35.751 EDT [57937] LOG:  database system is shut down

Our TAP scripts don't notice the "abnormal database system shutdown",
so it looks like things have passed.  It's even worse when a script
demands an immediate shutdown, because then the postmaster won't wait
around for the child status.

If you have core dumps enabled, that adds some time before the child
exit status is delivered, making this scenario extremely probable.
I'm finding that src/test/ldap reports success 100% reproducibly
after doing "ulimit -c unlimited", even though four backend core dumps
are produced during the run.

I think that (a) at least by default, node->stop() ought to check for
normal shutdown status, and (b) immediate shutdown should only be used
in cases where we've already decided that the test failed.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: automatically generating node support functions
Next
From: Pavel Stehule
Date:
Subject: Re: strange slow query - lost lot of time somewhere