Re: Error stopping postgresql service on a standby server. - Mailing list pgsql-general

From Dipti Bharvirkar
Subject Re: Error stopping postgresql service on a standby server.
Date
Msg-id CAL097-FzG9Di=43WAvt-U8+sGMqDK7VbmyNK6xnth2mTqh-MjA@mail.gmail.com
Whole thread Raw
In response to Error stopping postgresql service on a standby server.  (Dipti Bharvirkar <dbharvirkar@gmail.com>)
List pgsql-general
Hi,

Has anyone encountered this issue? Why would the WAL receiver process not stop when postmaster is shutdown?
Any suggestions on how to avoid running into this error or ways to recover from it?

Thank you in advance for any inputs on this,

Dipti

On Fri, Aug 31, 2012 at 1:17 PM, Dipti Bharvirkar wrote:
Hi,

In our project, we use Postgres 9.1.3 version and asynchronous streaming replication.
In recent times, on couple of our setups, we saw issues stopping Postgres service on the standby server after streaming replication was setup.

The command "service postgresql stop" returned with a failure message. We use "pg_ctl stop -D '$PGDATA' -s -m fast" in the Postgres service script to stop the server.
To see if there were some active client connections that were causing a failure in stopping Postgres service, I ran the query "SELECT * FROM pg_stat_activity;".
It failed with the following error: psql: FATAL:  the database system is shutting down

"ps -ef | grep postgres" returned the following:
postgres 14033     1  0 Aug28 ?        00:00:01 /usr/pgsql-9.1/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres 14044 14033  0 Aug28 ?        00:00:00 postgres: logger process
postgres 14046 14033  0 Aug28 ?        00:00:00 postgres: writer process
postgres 14047 14033  0 Aug28 ?        00:00:00 postgres: stats collector process
postgres 14912 14033  0 Aug28 ?        00:00:00 postgres: wal receiver process
root     31519  3003  0 06:18 pts/2    00:00:00 grep postgres

"netstat -anp | grep 5432" returns the following: 
tcp        0      0 0.0.0.0:5432                0.0.0.0:*                   LISTEN      14033/postmaster
tcp        0      0 127.0.0.1:5432              127.0.0.1:60597             TIME_WAIT   -
tcp        0      0 127.0.0.1:5432              127.0.0.1:60589             TIME_WAIT   -
tcp    67288      0 1.1.1.1:61500          2.2.2.2:5432           ESTABLISHED 14912/postgres: wal   

I had a few queries based on some of the observations -
  1. On one of the setups where similar issue was observed, we stopped Postgres service on the master server. As a result of this, the sender process on the master server and consequently the receiver process on standby stopped. After this, Postgres service could successfully be stopped on the standby server. This fact coupled with the output of the two commands mentioned above makes me believe that it is the "wal receiver" process that is not getting terminated because of which the Postgres service on standby server does not stop. Is this assumption right?
  2. If yes, what could be the possible cause for the receiver process to not terminate? Shouldn't it stop gracefully when a shutdown command is received? When the issue occurred, we had minimal activity on the master server. There were no long running transactions being committed to the master and streamed to the standby when the issue occurred. Even if there were, could it cause the receiver process to not terminate?
  3. How can we avoid running into this issue? Could we be missing some step that is essential for a graceful shutdown of the service on a standby?
  4. On one setup where the issue was seen, since "-m fast" option with "pg_ctl stop" did not help in stopping the service, I used the "-m immediate" option. The service stopped (I understand that this option aborts the processes without a clean shutdown and so is not a safe option). The service would not start back up. We saw the "invalid record length" error during the startup (I guess this was expected since it wasn't a clean shutdown). A pg_resetxlog helped recover from this issue. However, that seems risky too since there is a chance of data inconsistency. What is the best way to recover from this error if it occurs again?
Thanks,
Dipti

pgsql-general by date:

Previous
From: Herouth Maoz
Date:
Subject: Re: Maintaining a materialized view only on a replica
Next
From: Achilleas Mantzios
Date:
Subject: Re: "Too far out of the mainstream"