Re: Streaming replication - unable to stop the standby - Mailing list pgsql-hackers
From: Stefan Kaltenbrunner
Subject: Re: Streaming replication - unable to stop the standby
Msg-id: 4BDF19E7.2040205@kaltenbrunner.cc
In response to: Re: Streaming replication - unable to stop the standby (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Robert Haas wrote:
> On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc> wrote:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems
>>>> quite easy to end up in a situation where you have a standby that
>>>> seems to be stuck in:
>>>>
>>>> $ psql -p 5433
>>>> psql: FATAL: the database system is shutting down
>>>>
>>>> but not actually shutting down, ever. I ran into that a few times now
>>>> (mostly because I'm trying to chase a recovery issue I hit during
>>>> earlier testing) by simply having the master iterate between a pgbench
>>>> run and "idle" while doing pg_ctl restart in a loop on the standby.
>>>> I do vaguely recall some discussion of that, but I thought the issue
>>>> got settled somehow?
>>>
>>> Hm, I haven't pushed this hard but "pg_ctl stop" seems to stop the
>>> standby for me. Which subprocesses of the slave postmaster are still
>>> around? Could you attach to them with gdb and get stack traces?
>>
>> It is not always failing to shut down - it only fails sometimes - I
>> have not exactly pinpointed yet what is causing this, but the standby
>> is in a weird state now:
>>
>> * the master is currently idle
>> * the standby has no connections at all
>>
>> Logs from the standby:
>>
>> FATAL: the database system is shutting down
>> FATAL: the database system is shutting down
>> FATAL: replication terminated by primary server
>> LOG: restored log file "000000010000001900000054" from archive
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
>> LOG: record with zero length at 19/55000078
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
>> FATAL: could not connect to the primary server: could not connect to server: Connection refused
>>         Is the server running on host "localhost" and accepting
>>         TCP/IP connections on port 5432?
>> could not connect to server: Connection refused
>>         Is the server running on host "localhost" and accepting
>>         TCP/IP connections on port 5432?
>>
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
>> LOG: streaming replication successfully connected to primary
>> FATAL: the database system is shutting down
>>
>> The first two "FATAL: the database system is shutting down" are from
>> me trying to connect using psql after I noticed that pg_ctl failed to
>> shut down the slave.
>> The next thing I tried was restarting the master - which led to the
>> following logs and the standby noticing that and reconnecting - but
>> you cannot actually connect...
>>
>> The process tree for the standby is:
>>
>> 29523 pts/2 S  0:00 /home/postgres9/pginst/bin/postgres -D /mnt/space/pgdata_standby
>> 29524 ?     Ss 0:06  \_ postgres: startup process   waiting for 000000010000001900000055
>> 29529 ?     Ss 0:00  \_ postgres: writer process
>> 29835 ?     Ss 0:00  \_ postgres: wal receiver process   streaming 19/55000078
>
> <uninformed-speculation>
>
> Hmm. When I committed that patch to fix smart shutdown on the
> standby, we discussed the fact that the startup process can't simply
> release its locks and die at shutdown time, because the locks it holds
> prevent other backends from seeing the database in an inconsistent
> state. Therefore, if we were to terminate recovery as soon as the
> smart shutdown request is received, we might never complete, because a
> backend might be waiting on a lock that will never get released. If
> that's really a danger scenario, then it follows that we might also
> fail to shut down if we can't connect to the primary, because we might
> not be able to replay enough WAL to release the locks the remaining
> backends are waiting for. That sort of looks like what is happening
> to you, except based on your test scenario I can't figure out where
> this came from:
>
> FATAL: replication terminated by primary server

As I said before, I restarted the master at that point; the standby
logged the above, restored 000000010000001900000054 from the archive,
tried reconnecting, and logged the "connection refused" errors. A few
seconds later the master was up again and the standby succeeded in
reconnecting.

Stefan
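For anyone trying to reproduce this, the test loop described above could be scripted roughly as follows. This is a minimal sketch, not the exact harness used: it borrows the binary path and standby data directory from the process tree (/home/postgres9/pginst/bin, /mnt/space/pgdata_standby) and the ports from the thread (master on 5432, standby on 5433), while the pgbench durations are made-up illustrative values; a pgbench database is assumed to have been initialized beforehand with pgbench -i.

#!/bin/sh
# Rough reproduction of the scenario from the thread: the master
# alternates between pgbench load and idling while the standby is
# restarted in a loop. Note that "pg_ctl restart" without -m uses
# smart shutdown mode, the mode under discussion here.

PGBIN=/home/postgres9/pginst/bin          # binaries, per the process tree
STANDBY_DATA=/mnt/space/pgdata_standby    # standby data dir, per the thread

# Master side (port 5432): alternate 30s of write load with 30s idle.
# The durations are arbitrary.
while true; do
    "$PGBIN/pgbench" -p 5432 -T 30 postgres
    sleep 30
done &

# Standby side: restart repeatedly; -w -t makes pg_ctl report a hang
# instead of waiting forever.
while true; do
    "$PGBIN/pg_ctl" -D "$STANDBY_DATA" -w -t 60 restart \
        || echo "standby did not restart within 60s - possible stuck shutdown"
    sleep 5
done

When the bug bites, the second loop prints the timeout message while the standby's process tree still shows the startup process waiting on a WAL segment, as in the listing above.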
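Tom's request for stack traces from the surviving subprocesses can be answered with something along these lines; a sketch assuming gdb is available and the commands are run as a user permitted to ptrace the postgres processes (the postmaster PID 29523 is taken from the process tree above and would need adjusting):

#!/bin/sh
# Collect a backtrace from every remaining child of the standby's
# postmaster. Attaching briefly suspends each process.

POSTMASTER_PID=29523    # postmaster PID from the process tree; adjust

for pid in $(pgrep -P "$POSTMASTER_PID"); do
    echo "=== backtrace of PID $pid ==="
    gdb -p "$pid" -batch -ex bt 2>/dev/null
done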