Thread: Streaming replication - unable to stop the standby
I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite easy to end up in a situation where you have a standby that seems to be stuck in:

$ psql -p 5433
psql: FATAL: the database system is shutting down

but not actually shutting down, ever. I ran into that a few times now (mostly because I'm trying to chase a recovery issue I hit during earlier testing) by simply having the master iterate between a pgbench run and "idle" while doing pg_ctl restart in a loop on the standby.
I do vaguely recall some discussions of that, but I thought the issue got settled somehow?

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems
> quite easy to end up in a situation where you have a standby that seems
> to be stuck in:

> $ psql -p 5433
> psql: FATAL: the database system is shutting down

> but not actually shutting down, ever. I ran into that a few times now
> (mostly because I'm trying to chase a recovery issue I hit during
> earlier testing) by simply having the master iterate between a pgbench
> run and "idle" while doing pg_ctl restart in a loop on the standby.
> I do vaguely recall some discussions of that, but I thought the issue got
> settled somehow?

Hm, I haven't pushed this hard, but "pg_ctl stop" seems to stop the
standby for me. Which subprocesses of the slave postmaster are still
around? Could you attach to them with gdb and get stack traces?

regards, tom lane
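[A minimal sketch of how the stack traces Tom asks for could be collected. The PIDs are placeholders to be read off the ps output; the gdb -batch/-ex usage is standard gdb, nothing specific to this thread.]

    # List the remaining children of the standby's postmaster
    # (startup process, background writer, wal receiver, ...).
    ps auxf | grep 'postgres:'

    # Grab a one-shot backtrace from each of them; substitute the real PIDs.
    gdb -batch -ex bt -p <startup_pid>
    gdb -batch -ex bt -p <writer_pid>
    gdb -batch -ex bt -p <walreceiver_pid>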
On Mon, May 3, 2010 at 2:04 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
> easy to end up in a situation where you have a standby that seems to be
> stuck in:
>
> $ psql -p 5433
> psql: FATAL: the database system is shutting down
>
> but not actually shutting down, ever. I ran into that a few times now
> (mostly because I'm trying to chase a recovery issue I hit during earlier
> testing) by simply having the master iterate between a pgbench run and
> "idle" while doing pg_ctl restart in a loop on the standby.
> I do vaguely recall some discussions of that, but I thought the issue got
> settled somehow?

Yes - I thought it was, too. Specifically, I thought I fixed it. The
default mode is 'smart' shutdown, just as it is on the primary, so it
won't shut down until all clients have disconnected, but it should
work provided you don't leave a session open somewhere. Can you describe
steps to reproduce?

...Robert
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems
>> quite easy to end up in a situation where you have a standby that seems
>> to be stuck in:
>
>> $ psql -p 5433
>> psql: FATAL: the database system is shutting down
>
>> but not actually shutting down, ever. I ran into that a few times now
>> (mostly because I'm trying to chase a recovery issue I hit during
>> earlier testing) by simply having the master iterate between a pgbench
>> run and "idle" while doing pg_ctl restart in a loop on the standby.
>> I do vaguely recall some discussions of that, but I thought the issue got
>> settled somehow?
>
> Hm, I haven't pushed this hard, but "pg_ctl stop" seems to stop the
> standby for me. Which subprocesses of the slave postmaster are still
> around? Could you attach to them with gdb and get stack traces?

It is not always failing to shut down - it only fails sometimes - and I have not exactly pinpointed yet what is causing it, but the standby is in a weird state now:

* the master is currently idle
* the standby has no connections at all

logs from the standby:

FATAL: the database system is shutting down
FATAL: the database system is shutting down
FATAL: replication terminated by primary server
LOG: restored log file "000000010000001900000054" from archive
cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
LOG: record with zero length at 19/55000078
cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
FATAL: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "localhost" and accepting
        TCP/IP connections on port 5432?
could not connect to server: Connection refused
        Is the server running on host "localhost" and accepting
        TCP/IP connections on port 5432?

cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such file or directory
LOG: streaming replication successfully connected to primary
FATAL: the database system is shutting down

The first two "FATAL: the database system is shutting down" are from me trying to connect using psql after I noticed that pg_ctl failed to shut down the slave.
The next thing I tried was restarting the master - which led to the logs above, with the standby noticing that and reconnecting, but you cannot actually connect...

process tree for the standby is:

29523 pts/2  S   0:00 /home/postgres9/pginst/bin/postgres -D /mnt/space/pgdata_standby
29524 ?      Ss  0:06  \_ postgres: startup process   waiting for 000000010000001900000055
29529 ?      Ss  0:00  \_ postgres: writer process
29835 ?      Ss  0:00  \_ postgres: wal receiver process   streaming 19/55000078

Stefan
Robert Haas wrote:
> On Mon, May 3, 2010 at 2:04 PM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc> wrote:
>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>> easy to end up in a situation where you have a standby that seems to be
>> stuck in:
>>
>> $ psql -p 5433
>> psql: FATAL: the database system is shutting down
>>
>> but not actually shutting down, ever. I ran into that a few times now
>> (mostly because I'm trying to chase a recovery issue I hit during earlier
>> testing) by simply having the master iterate between a pgbench run and
>> "idle" while doing pg_ctl restart in a loop on the standby.
>> I do vaguely recall some discussions of that, but I thought the issue got
>> settled somehow?
>
> Yes - I thought it was, too. Specifically, I thought I fixed it. The
> default mode is 'smart' shutdown, just as it is on the primary, so it
> won't shut down until all clients have disconnected, but it should
> work provided you don't leave a session open somewhere. Can you describe
> steps to reproduce?

Well, this is basically master and standby on the same box - with the master doing a short pgbench run interleaved with a "sleep 20", while the standby is doing nothing in terms of queries and just executing pg_ctl restart in a loop (simulating a typical maintenance reboot of a standby).

Stefan
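[For reference, the test loop Stefan describes boils down to something like the following - a sketch, not his actual scripts; the ports, database name, pgbench duration, and sleep between restarts are assumptions based on the thread, and the data directory is the one from his process listing.]

    # Master (port 5432): alternate a short pgbench run with ~20 seconds of idle.
    while true; do
        pgbench -p 5432 -T 30 pgbench     # short write load against the primary
        sleep 20                          # "idle" phase
    done &

    # Standby (port 5433): restart it in a loop, simulating routine maintenance
    # reboots. pg_ctl restart uses the default (smart) shutdown mode.
    while true; do
        pg_ctl -D /mnt/space/pgdata_standby -w restart
        sleep 5
    done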
On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>>> easy to end up in a situation where you have a standby that seems to be
>>> stuck in:
>>>
>>> $ psql -p 5433
>>> psql: FATAL: the database system is shutting down
>>>
>>> but not actually shutting down, ever. I ran into that a few times now
>>> (mostly because I'm trying to chase a recovery issue I hit during earlier
>>> testing) by simply having the master iterate between a pgbench run and
>>> "idle" while doing pg_ctl restart in a loop on the standby.
>>> I do vaguely recall some discussions of that, but I thought the issue got
>>> settled somehow?
>>
>> Hm, I haven't pushed this hard, but "pg_ctl stop" seems to stop the
>> standby for me. Which subprocesses of the slave postmaster are still
>> around? Could you attach to them with gdb and get stack traces?
>
> It is not always failing to shut down - it only fails sometimes - and I have
> not exactly pinpointed yet what is causing it, but the standby is in a weird
> state now:
>
> * the master is currently idle
> * the standby has no connections at all
>
> logs from the standby:
>
> FATAL: the database system is shutting down
> FATAL: the database system is shutting down
> FATAL: replication terminated by primary server
> LOG: restored log file "000000010000001900000054" from archive
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> LOG: record with zero length at 19/55000078
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> FATAL: could not connect to the primary server: could not connect to
> server: Connection refused
>         Is the server running on host "localhost" and accepting
>         TCP/IP connections on port 5432?
> could not connect to server: Connection refused
>         Is the server running on host "localhost" and accepting
>         TCP/IP connections on port 5432?
>
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> LOG: streaming replication successfully connected to primary
> FATAL: the database system is shutting down
>
> The first two "FATAL: the database system is shutting down" are from me
> trying to connect using psql after I noticed that pg_ctl failed to shut down
> the slave.
> The next thing I tried was restarting the master - which led to the logs
> above, with the standby noticing that and reconnecting, but you cannot
> actually connect...
>
> process tree for the standby is:
>
> 29523 pts/2  S   0:00 /home/postgres9/pginst/bin/postgres -D
> /mnt/space/pgdata_standby
> 29524 ?      Ss  0:06  \_ postgres: startup process   waiting for
> 000000010000001900000055
> 29529 ?      Ss  0:00  \_ postgres: writer process
> 29835 ?      Ss  0:00  \_ postgres: wal receiver process   streaming
> 19/55000078

<uninformed-speculation>

Hmm. When I committed that patch to fix smart shutdown on the
standby, we discussed the fact that the startup process can't simply
release its locks and die at shutdown time, because the locks it holds
prevent other backends from seeing the database in an inconsistent state.
Therefore, if we were to terminate recovery as soon as the smart
shutdown request is received, we might never complete the shutdown,
because a backend might be waiting on a lock that will never get released.
If that's really a danger scenario, then it follows that we might also
fail to shut down if we can't connect to the primary, because we might
not be able to replay enough WAL to release the locks the remaining
backends are waiting for. That sort of looks like what is happening to
you, except based on your test scenario I can't figure out where this
came from:

FATAL: replication terminated by primary server

</uninformed-speculation>

...Robert
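[A hypothetical illustration of the kind of lock wait Robert is describing - not Stefan's actual workload; the table name, ports, timings, and data directory are invented or taken from earlier in the thread, and it assumes a pgbench database exists and hot_standby is enabled on the standby.]

    # On the primary: hold an ACCESS EXCLUSIVE lock in an open transaction.
    # The lock acquisition is WAL-logged, so once that record reaches the
    # standby, the startup process takes the same lock and only releases it
    # when it replays the eventual commit (or abort).
    psql -p 5432 -d pgbench -c "BEGIN; LOCK TABLE pgbench_accounts IN ACCESS EXCLUSIVE MODE; SELECT pg_sleep(600); COMMIT;" &

    # On the standby: a hot-standby query against that table blocks behind
    # the startup process's lock.
    psql -p 5433 -d pgbench -c "SELECT count(*) FROM pgbench_accounts;" &

    # A smart shutdown of the standby now waits for that blocked backend; if
    # recovery can no longer advance (say, the primary has gone away), the
    # lock is never released and the shutdown never completes.
    pg_ctl -D /mnt/space/pgdata_standby stop -m smart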
Robert Haas wrote:
> On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc> wrote:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>>>> easy to end up in a situation where you have a standby that seems to be
>>>> stuck in:
>>>> $ psql -p 5433
>>>> psql: FATAL: the database system is shutting down
>>>> but not actually shutting down, ever. I ran into that a few times now
>>>> (mostly because I'm trying to chase a recovery issue I hit during earlier
>>>> testing) by simply having the master iterate between a pgbench run and
>>>> "idle" while doing pg_ctl restart in a loop on the standby.
>>>> I do vaguely recall some discussions of that, but I thought the issue got
>>>> settled somehow?
>>> Hm, I haven't pushed this hard, but "pg_ctl stop" seems to stop the
>>> standby for me. Which subprocesses of the slave postmaster are still
>>> around? Could you attach to them with gdb and get stack traces?
>> It is not always failing to shut down - it only fails sometimes - and I have
>> not exactly pinpointed yet what is causing it, but the standby is in a weird
>> state now:
>>
>> * the master is currently idle
>> * the standby has no connections at all
>>
>> logs from the standby:
>>
>> FATAL: the database system is shutting down
>> FATAL: the database system is shutting down
>> FATAL: replication terminated by primary server
>> LOG: restored log file "000000010000001900000054" from archive
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG: record with zero length at 19/55000078
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> FATAL: could not connect to the primary server: could not connect to
>> server: Connection refused
>>         Is the server running on host "localhost" and accepting
>>         TCP/IP connections on port 5432?
>> could not connect to server: Connection refused
>>         Is the server running on host "localhost" and accepting
>>         TCP/IP connections on port 5432?
>>
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG: streaming replication successfully connected to primary
>> FATAL: the database system is shutting down
>>
>> The first two "FATAL: the database system is shutting down" are from me
>> trying to connect using psql after I noticed that pg_ctl failed to shut down
>> the slave.
>> The next thing I tried was restarting the master - which led to the logs
>> above, with the standby noticing that and reconnecting, but you cannot
>> actually connect...
>>
>> process tree for the standby is:
>>
>> 29523 pts/2  S   0:00 /home/postgres9/pginst/bin/postgres -D
>> /mnt/space/pgdata_standby
>> 29524 ?      Ss  0:06  \_ postgres: startup process   waiting for
>> 000000010000001900000055
>> 29529 ?      Ss  0:00  \_ postgres: writer process
>> 29835 ?      Ss  0:00  \_ postgres: wal receiver process   streaming
>> 19/55000078
>
> <uninformed-speculation>
>
> Hmm. When I committed that patch to fix smart shutdown on the
> standby, we discussed the fact that the startup process can't simply
> release its locks and die at shutdown time, because the locks it holds
> prevent other backends from seeing the database in an inconsistent
> state.
> Therefore, if we were to terminate recovery as soon as the
> smart shutdown request is received, we might never complete the shutdown,
> because a backend might be waiting on a lock that will never get released.
> If that's really a danger scenario, then it follows that we might also
> fail to shut down if we can't connect to the primary, because we might
> not be able to replay enough WAL to release the locks the remaining
> backends are waiting for. That sort of looks like what is happening
> to you, except based on your test scenario I can't figure out where
> this came from:
>
> FATAL: replication terminated by primary server

As I said before, I restarted the master at that point; the standby logged the above, restored 000000010000001900000054 from the archive, tried reconnecting, and logged the "connection refused". A few seconds later the master was up again and the standby succeeded in reconnecting.

Stefan
Robert Haas <robertmhaas@gmail.com> writes:
> Hmm. When I committed that patch to fix smart shutdown on the
> standby, we discussed the fact that the startup process can't simply
> release its locks and die at shutdown time, because the locks it holds
> prevent other backends from seeing the database in an inconsistent
> state. Therefore, if we were to terminate recovery as soon as the
> smart shutdown request is received, we might never complete the shutdown,
> because a backend might be waiting on a lock that will never get released.
> If that's really a danger scenario, then it follows that we might also
> fail to shut down if we can't connect to the primary, because we might
> not be able to replay enough WAL to release the locks the remaining
> backends are waiting for. That sort of looks like what is happening
> to you, except based on your test scenario I can't figure out where
> this came from:

> FATAL: replication terminated by primary server

I suspect you have it right, because my experiments where the standby
did shut down correctly were all done with an idle master.

Seems like we could go ahead and forcibly kill the startup process *once
all the standby backends are gone*. There is then no need to worry
about not releasing locks, and re-establishing a consistent state when
we later restart is logic that we have to have anyway.

regards, tom lane
On Mon, May 3, 2010 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Hmm. When I committed that patch to fix smart shutdown on the
>> standby, we discussed the fact that the startup process can't simply
>> release its locks and die at shutdown time, because the locks it holds
>> prevent other backends from seeing the database in an inconsistent
>> state. Therefore, if we were to terminate recovery as soon as the
>> smart shutdown request is received, we might never complete the shutdown,
>> because a backend might be waiting on a lock that will never get released.
>> If that's really a danger scenario, then it follows that we might also
>> fail to shut down if we can't connect to the primary, because we might
>> not be able to replay enough WAL to release the locks the remaining
>> backends are waiting for. That sort of looks like what is happening
>> to you, except based on your test scenario I can't figure out where
>> this came from:
>
>> FATAL: replication terminated by primary server
>
> I suspect you have it right, because my experiments where the standby
> did shut down correctly were all done with an idle master.
>
> Seems like we could go ahead and forcibly kill the startup process *once
> all the standby backends are gone*. There is then no need to worry
> about not releasing locks, and re-establishing a consistent state when
> we later restart is logic that we have to have anyway.

That's exactly what we already do. The problem is that smart shutdown
doesn't actually kill off the standby backends - it waits for them to
exit on their own. Except, if they're blocked on a lock that's never
going to get released, then they never do.

...Robert
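[For reference, the waiting behavior Robert describes is specific to the default smart mode; a sketch of the difference is below, using the data directory from Stefan's process listing. That a fast shutdown would unstick a wedged 9.0beta1 standby is an assumption based on ordinary pg_ctl behavior, not something tested in this thread.]

    # Smart shutdown (the default): the postmaster waits for all client
    # backends to disconnect, so a backend stuck behind a lock held by the
    # startup process can keep the standby "shutting down" indefinitely.
    pg_ctl -D /mnt/space/pgdata_standby stop -m smart

    # Fast shutdown: client backends are terminated instead of waited for,
    # so a hung hot-standby session no longer blocks the shutdown.
    pg_ctl -D /mnt/space/pgdata_standby stop -m fast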