[RFC] Should we fix postmaster to avoid slow shutdown? - Mailing list pgsql-hackers
From | Tsunakawa, Takayuki |
---|---|
Subject | [RFC] Should we fix postmaster to avoid slow shutdown? |
Date | |
Msg-id | 0A3221C70F24FB45833433255569204D1F5EF25A@G01JPEXMBYT05 |
Responses | Re: [RFC] Should we fix postmaster to avoid slow shutdown? |
List | pgsql-hackers |
Hello,

Please let me ask you about possible causes of a certain problem, the slow shutdown of postmaster when a backend crashes, and whether we should fix PostgreSQL.

Our customer is using 64-bit PostgreSQL 9.2.8 on RHEL 6.4. Yes, the PostgreSQL version is rather old, but there's no relevant bug fix in later 9.2.x releases.

PROBLEM
==============================

One backend process (postgres) for an application session crashed due to a segmentation fault and dumped a core file. The cause is a bug in pg_dbms_stats. Another note is that restart_after_crash is off so that failover happens.

The problem here is that postmaster took as long as 15 seconds to terminate after it had detected the crashed backend. The messages were output as follows:

At 20:12:35.004:
LOG: server process (PID 31894) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: DELETE...(snip)
LOG: terminating any other active server processes

From 20:12:35.013 to 20:12:39.074, the following message was output 80 times:
FATAL: the database system is in recovery mode

At 20:12:50:
The custom monitoring system detected the death of postmaster as a result of running "pg_ctl status".

That's it. You may say the following message should also have been emitted, but it was not. This is because we commented out the ereport() call in quickdie() in tcop.c. That ereport() call can hang depending on the timing, which is fixed in 9.4.

WARNING: terminating connection because of crash of another server process

The customer insists that PostgreSQL takes longer to shut down than expected, which risks exceeding their allowed failover time.

CAUSE
==============================

There's no apparent evidence to indicate the cause, but I could guess a few reasons. Do you think these are correct and that we should fix PostgreSQL?
(I think so)

1) postmaster should close the listening ports earlier
As cited above, for 4 seconds, postmaster created 80 dead-end child processes which just output "FATAL: the database system is in recovery mode". This indicates that postmaster is busy handling re-connection requests from disconnected applications, which prevents it from reaping dead children as fast as possible. This is a waste because postmaster will only shut down. I think the listening ports should be closed in HandleChildCrash() when the condition "(RecoveryError || !restart_after_crash)" is true.

2) make stats collector terminate immediately
The stats collector seems to write the permanent stats file even when it receives SIGQUIT. But that's useless because the stats file is reset during recovery. And Tom claimed that writing the stats file can take long:
https://www.postgresql.org/message-id/11800.1455135203@sss.pgh.pa.us

3) Anything else?
While postmaster is in PM_WAIT_DEAD_END state, it leaves the listening ports open but doesn't call select()/accept(). As a result, incoming connection requests accumulate in the listen queue of the sockets. Does the OS have any bug that slows process termination when the listen queue is not empty?

Regards
Takayuki Tsunakawa
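
To make proposal 1) concrete, here is a minimal standalone sketch of the idea, not a patch against postmaster.c. All names in it (listen_sockets, add_listen_socket, close_listen_sockets, handle_child_crash_sketch) are hypothetical stand-ins; the real postmaster keeps its own socket array and close routine. The only part taken from the proposal is the condition "(RecoveryError || !restart_after_crash)".

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAXLISTEN  64      /* assumed upper bound on listening sockets */
#define INVALID_FD (-1)

/* Hypothetical stand-in for the postmaster's listening socket array. */
static int listen_sockets[MAXLISTEN];

/* Mark every slot unused; needed because 0 is a valid descriptor. */
static void init_listen_sockets(void)
{
    for (int i = 0; i < MAXLISTEN; i++)
        listen_sockets[i] = INVALID_FD;
}

/* Store fd in the first free slot; returns false if the array is full. */
static bool add_listen_socket(int fd)
{
    assert(fd != INVALID_FD);
    for (int i = 0; i < MAXLISTEN; i++)
    {
        if (listen_sockets[i] == INVALID_FD)
        {
            listen_sockets[i] = fd;
            return true;
        }
    }
    return false;
}

/* Close every open listening socket and return how many were closed. */
static int close_listen_sockets(void)
{
    int closed = 0;

    for (int i = 0; i < MAXLISTEN; i++)
    {
        if (listen_sockets[i] != INVALID_FD)
        {
            close(listen_sockets[i]);
            listen_sockets[i] = INVALID_FD;
            closed++;
        }
    }
    return closed;
}

/*
 * Sketch of the proposed step in the crash handler: when the server will
 * not restart anyway, stop accepting connections right away so that no
 * more dead-end children are forked.  Returns the number of sockets closed.
 */
static int handle_child_crash_sketch(bool recovery_error,
                                     bool restart_after_crash)
{
    if (recovery_error || !restart_after_crash)
        return close_listen_sockets();
    return 0;
}
```

With restart_after_crash off, as in the customer's configuration, the sketch closes the sockets on the first crash; with it on and no recovery error, the sockets stay open. Once a listening socket is closed, new connection attempts are refused by the OS rather than handed to postmaster.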