Re: How abnormal server shutdown could be detected by tests? - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: How abnormal server shutdown could be detected by tests?
Date
Msg-id 5921355f-4cfb-c91a-24b8-6bbde53c990c@gmail.com
In response to Re: How abnormal server shutdown could be detected by tests?  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
Hello Shveta,

12.12.2023 11:44, shveta malik wrote:
>
>> The postmaster process exits with exit code 1, but pg_ctl can't get the
>> code and just reports that stop was completed successfully.
>>
> For what it's worth, there is another thread which stated the similar problem:
> https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us
>

Thank you for the reference!
So I have revisited the first part of the question Tom Lane raised before...

I've made a quick experiment with leaving postmaster.pid intact in case of
abnormal shutdown:
@@ -1113,6 +1113,7 @@ UnlinkLockFiles(int status, Datum arg)
      {
          char       *curfile = (char *) lfirst(l);

+         if (strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0)
          unlink(curfile);
          /* Should we complain if the unlink fails? */
      }

and `make check-world` passed for me with no failure.
(At the same time, the assertion failure forced as above is detected.)
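
To spell out the condition added by the patch, here is a minimal standalone sketch in C. The helper name should_unlink_lock_file is hypothetical (it is not part of PostgreSQL); DIRECTORY_LOCK_FILE is redefined locally on the assumption that it matches the "postmaster.pid" definition in the PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: mirrors PostgreSQL's definition of the data directory lock file. */
#define DIRECTORY_LOCK_FILE "postmaster.pid"

/*
 * Hypothetical helper (not a PostgreSQL function): returns true if the
 * given lock file should be removed at process exit.
 *
 * Every lock file other than postmaster.pid is always removed;
 * postmaster.pid itself is removed only on a clean exit (status == 0),
 * so that it survives an abnormal shutdown.
 */
bool
should_unlink_lock_file(const char *curfile, int status)
{
    assert(curfile != NULL);
    return strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0;
}
```

With this rule, a postmaster.pid that still exists after the server has stopped indicates that the exit status was non-zero (or that the process never ran its exit callbacks at all).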

Though there is a minor issue with a couple of tests. Namely,
003_recovery_targets.pl does the following:
# wait for postgres to terminate
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
     last if !-f $node_standby->data_dir . '/postmaster.pid';
     usleep(100_000);
}
$logfile = slurp_file($node_standby->logfile());
ok( $logfile =~
       qr/FATAL: .* recovery ended before configured recovery target was reached/,
     'recovery end before target reached is a fatal error');

With postmaster.pid left in place after an unclean shutdown, the test waits
for the full timeout (180 seconds by default) and then completes successfully.

If I rewrite that loop as follows:
# wait for the error message in the standby log
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
     $logfile = slurp_file($node_standby->logfile());
     $res = ($logfile =~
         qr/FATAL: .* recovery ended before configured recovery target was reached/);
     if ($res) {
         last;
     }
     usleep(100_000);
}
ok($res,
     'recovery end before target reached is a fatal error');

the test completes as quickly as before.
(standby.log is only 2kb, so rereading it isn't a big deal, IMO)
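
For readers less familiar with the Perl TAP helpers, the same "poll the log until the message appears or the timeout expires" idea can be sketched in plain C. This is an illustration only: file_contains and wait_for_log_message are hypothetical names, the substring search is a simplification of the regular expression used in the test, and nanosleep() assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L     /* for nanosleep() */

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: true if the file at `path` contains `needle`. */
bool
file_contains(const char *path, const char *needle)
{
    FILE   *fp;
    long    size;
    bool    found = false;

    assert(path != NULL && needle != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return false;               /* e.g. the log does not exist yet */

    /* Reread the whole file; acceptable for a log of a few kilobytes. */
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0)
    {
        char   *buf = malloc((size_t) size + 1);

        rewind(fp);
        if (buf != NULL)
        {
            size_t  nread = fread(buf, 1, (size_t) size, fp);

            buf[nread] = '\0';
            found = strstr(buf, needle) != NULL;
            free(buf);
        }
    }
    fclose(fp);
    return found;
}

/*
 * Hypothetical helper: poll the log every 100 ms, up to roughly
 * timeout_sec seconds, and return as soon as the message shows up.
 */
bool
wait_for_log_message(const char *log_path, const char *needle, int timeout_sec)
{
    const struct timespec delay = {0, 100 * 1000 * 1000};   /* 100 ms */

    for (int i = 0; i <= 10 * timeout_sec; i++)
    {
        if (file_contains(log_path, needle))
            return true;
        nanosleep(&delay, NULL);
    }
    return false;
}
```

The point of the design is that the loop exits on the event the test actually cares about (the FATAL message), so it no longer depends on postmaster.pid disappearing.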

So maybe it's the way to go?

Another way I can think of is sending some signal to pg_ctl in case the
postmaster terminates with status 0. Though I think it would complicate
things a little, as it allows for three different states:
postmaster.pid preserved (in case the postmaster is killed with -9),
postmaster.pid removed and the signal received, and
postmaster.pid removed and the signal not received.
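
The three states of the signal-based alternative can be written down as a small decision function. This is only a sketch of one reading of the idea: the enum and function names are hypothetical, and the mapping of "pid file removed, no signal received" to an abnormal exit is an interpretation rather than something implemented anywhere.

```c
#include <stdbool.h>

/* Hypothetical classification of how the postmaster went away. */
typedef enum
{
    SHUTDOWN_CLEAN,             /* pid file removed, signal received */
    SHUTDOWN_ABNORMAL_EXIT,     /* pid file removed, no signal received */
    SHUTDOWN_KILLED             /* pid file preserved, e.g. kill -9 */
} ShutdownKind;

/*
 * Hypothetical helper: combine the two observations pg_ctl would have
 * under this scheme into one of the three states.
 */
ShutdownKind
classify_shutdown(bool pidfile_exists, bool signal_received)
{
    if (pidfile_exists)
        return SHUTDOWN_KILLED;
    return signal_received ? SHUTDOWN_CLEAN : SHUTDOWN_ABNORMAL_EXIT;
}
```

By comparison, the approach of leaving postmaster.pid in place needs only one observation: whether the file still exists after the server has stopped.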

Best regards,
Alexander


