[RFC] Should we fix postmaster to avoid slow shutdown? - Mailing list pgsql-hackers

From Tsunakawa, Takayuki
Subject [RFC] Should we fix postmaster to avoid slow shutdown?
Date
Msg-id 0A3221C70F24FB45833433255569204D1F5EF25A@G01JPEXMBYT05
Responses Re: [RFC] Should we fix postmaster to avoid slow shutdown?
List pgsql-hackers
Hello,

Please let me ask you about possible causes of a certain problem, the slow shutdown of postmaster when a backend
crashes, and whether we should fix PostgreSQL.
 

Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4.  Yes, the PostgreSQL version is rather old, but there's no
relevant bug fix in later 9.2.x releases.
 


PROBLEM
==============================

One backend process (postgres) for an application session crashed due to a segmentation fault and dumped a core file.
The cause is a bug in pg_dbms_stats.  Note also that restart_after_crash is off so that failover happens.
 

The problem here is that postmaster took as long as 15 seconds to terminate after it had detected a crashed backend.
The messages were output as follows:
 

20:12:35.004
LOG:  server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: DELETE...(snip)
LOG:  terminating any other active server processes

From 20:12:35.013 to 20:12:39.074, the following message was output 80 times.

FATAL:  the database system is in recovery mode

20:12:50
The custom monitoring system detected the death of postmaster as a result of running "pg_ctl status".
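
For reference, the intervals in the timeline above can be worked out with a short sketch. The timestamps are taken from the log excerpt; the sub-second part of 20:12:50 is not given and is assumed to be zero here, and that time is when the monitor noticed the death, so the real termination may have been somewhat earlier. The variable names are only illustrative.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

def seconds_between(start, end):
    """Elapsed seconds between two same-day HH:MM:SS.fff timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

crash_detected = "20:12:35.004"   # "terminated by signal 11" was logged
burst_start = "20:12:35.013"      # first "in recovery mode" FATAL
burst_end = "20:12:39.074"        # last "in recovery mode" FATAL
death_observed = "20:12:50.000"   # assumption: log only says 20:12:50

burst_seconds = seconds_between(burst_start, burst_end)
rejections_per_second = 80 / burst_seconds
total_seconds = seconds_between(crash_detected, death_observed)
quiet_seconds = seconds_between(burst_end, death_observed)

print(f"burst: {burst_seconds:.3f} s, about {rejections_per_second:.1f} rejections/s")
print(f"crash to observed death: {total_seconds:.3f} s")
print(f"after last FATAL until observed death: {quiet_seconds:.3f} s")
```

So the 80 rejections account for roughly 4 of the 15 seconds; the remaining time, close to 11 seconds, has no log output at all.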

That's it.  You may say the following message should also have been emitted, but it was not.  This is because we
commented out the ereport() call in quickdie() in tcop/postgres.c.  That ereport() call can hang depending on the
timing, which is fixed in 9.4.
 

WARNING:  terminating connection because of crash of another server process

The customer insists that PostgreSQL takes longer to shut down than expected, which risks exceeding their allowed
failover time.
 


CAUSE
==============================

There's no apparent evidence to indicate the cause, but I could guess a few reasons.  Do you think these are
correct and that we should fix PostgreSQL?  (I think so.)
 

1) postmaster should close the listening ports earlier
As cited above, for 4 seconds, postmaster created 80 dead-end child processes which just output "FATAL:  the database
system is in recovery mode".  This indicates that postmaster is busy handling re-connection requests from disconnected
applications, preventing postmaster from reaping dead children as fast as possible.  This is a waste because postmaster
will only shut down.
 

I think the listening ports should be closed in HandleChildCrash() when the condition "(RecoveryError ||
!restart_after_crash)" is true.
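
To make the idea concrete, here is a minimal sketch in Python rather than postmaster's C.  ToyPostmaster, handle_child_crash() and can_connect() are hypothetical names made up for this illustration; only the condition itself comes from the proposal above.  It assumes a loopback interface is available.

```python
import socket

class ToyPostmaster:
    """Hypothetical stand-in for postmaster; it only models the listen sockets."""

    def __init__(self, restart_after_crash, n_sockets=2):
        self.restart_after_crash = restart_after_crash
        self.recovery_error = False
        self.listen_sockets = []
        for _ in range(n_sockets):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
            s.listen(16)
            self.listen_sockets.append(s)

    def ports(self):
        return [s.getsockname()[1] for s in self.listen_sockets]

    def handle_child_crash(self):
        """Close the listen sockets early when the server will only shut down.

        Returns True if the sockets were closed.
        """
        if self.recovery_error or not self.restart_after_crash:
            for s in self.listen_sockets:
                s.close()
            self.listen_sockets = []
            return True
        return False


def can_connect(port, timeout=2.0):
    """Report whether a TCP connection to the local port is accepted by the kernel."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False
```

With restart_after_crash off, clients that reconnect after handle_child_crash() are refused by the kernel at once, so the toy postmaster never has to fork a dead-end child for them.  Whether closing the sockets at that point is safe inside the real postmaster is exactly the question being asked.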
 

2) make stats collector terminate immediately
The stats collector seems to write the permanent stats file even when it receives SIGQUIT.  But that is useless because
the stats file is reset during recovery.  And Tom claimed that writing the stats file can take a long time:
 

https://www.postgresql.org/message-id/11800.1455135203@sss.pgh.pa.us
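
The behaviour I have in mind for item 2 can be sketched with a toy process outside PostgreSQL.  Everything here is hypothetical: the child script, the file name, and the use of SIGTERM for the orderly path are choices made for the illustration, not a description of which signals the real stats collector handles.  It needs a Unix-like system for SIGQUIT.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a stats collector; this is not PostgreSQL code.
CHILD_SOURCE = r"""
import os, signal, sys, time

stats_path = sys.argv[1]

def on_sigquit(signum, frame):
    # Crash path: skip the stats file and exit at once.
    os._exit(2)

def on_sigterm(signum, frame):
    # Orderly path (toy choice of signal): persist the stats, then exit.
    with open(stats_path, "w") as f:
        f.write("stats\n")
    os._exit(0)

signal.signal(signal.SIGQUIT, on_sigquit)
signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
while True:
    time.sleep(1)
"""

def run_collector(sig):
    """Start the toy collector, send it `sig`, return (exit_code, stats_file_written)."""
    with tempfile.TemporaryDirectory() as tmp:
        stats_path = os.path.join(tmp, "toy_permanent.stat")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, stats_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        child.stdout.readline()  # wait until the handlers are installed
        child.send_signal(sig)
        code = child.wait(timeout=10)
        child.stdout.close()
        return code, os.path.exists(stats_path)
```

Calling run_collector(signal.SIGQUIT) should show the toy collector exiting without touching the file, while run_collector(signal.SIGTERM) writes it first.  In the real server the cost of the write depends on the size of the stats file, which this sketch does not model.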


3) Anything else?
While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports open but doesn't call select()/accept().
As a result, incoming connection requests accumulate in the listen queue of the sockets.  Does the OS have any bug
that slows process termination when the listen queue is not empty?
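
The accumulation itself is easy to reproduce with a small sketch.  The function name and the numbers are arbitrary choices for the illustration, and it assumes a loopback interface; it does not answer the question about OS behaviour at process exit, it only shows that clients connect successfully while nobody calls accept().

```python
import socket

def queue_connections_without_accept(n_clients=3, backlog=8):
    """Open a listening socket, never accept(), and count client connects that succeed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)
    port = listener.getsockname()[1]

    clients = []
    connected = 0
    try:
        for _ in range(n_clients):
            # The kernel completes the handshake and parks the connection
            # in the listen queue; the server side does nothing.
            c = socket.create_connection(("127.0.0.1", port), timeout=2.0)
            clients.append(c)
            connected += 1
    finally:
        for c in clients:
            c.close()
        listener.close()
    return connected
```

As long as n_clients stays below the backlog, every connect should succeed, which matches the situation where applications believe they have reconnected even though postmaster will never serve them.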
 


Regards
Takayuki Tsunakawa



