Re: Streaming replication - unable to stop the standby - Mailing list pgsql-hackers

From Stefan Kaltenbrunner
Subject Re: Streaming replication - unable to stop the standby
Date
Msg-id 4BDF19E7.2040205@kaltenbrunner.cc
In response to Re: Streaming replication - unable to stop the standby  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas wrote:
> On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
> <stefan@kaltenbrunner.cc> wrote:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>>>> easy to end up in a situation where you have a standby that seems to be
>>>> stuck in:
>>>> $ psql -p 5433
>>>> psql: FATAL:  the database system is shutting down
>>>> but never actually shutting down. I ran into that a few times now
>>>> (mostly because I'm trying to chase a recovery issue I hit during earlier
>>>> testing) by simply having the master iterate between a pgbench run and
>>>> "idle" while simply doing pg_ctl restart in a loop on the standby.
>>>> I do vaguely recall some discussions of that, but I thought the issue got
>>>> settled somehow?
>>> Hm, I haven't pushed this hard but "pg_ctl stop" seems to stop the
>>> standby for me.  Which subprocesses of the slave postmaster are still
>>> around?  Could you attach to them with gdb and get stack traces?
>> it is not always failing to shut down - it only fails sometimes - I have not
>> exactly pinpointed what is causing this yet, but the standby is in a weird
>> state now:
>>
>> * the master is currently idle
>> * the standby has no connections at all
>>
>> logs from the standby:
>>
>> FATAL:  the database system is shutting down
>> FATAL:  the database system is shutting down
>> FATAL:  replication terminated by primary server
>> LOG:  restored log file "000000010000001900000054" from archive
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG:  record with zero length at 19/55000078
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> FATAL:  could not connect to the primary server: could not connect to
>> server: Connection refused
>>                Is the server running on host "localhost" and accepting
>>                TCP/IP connections on port 5432?
>>        could not connect to server: Connection refused
>>                Is the server running on host "localhost" and accepting
>>                TCP/IP connections on port 5432?
>>
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
>> file or directory
>> LOG:  streaming replication successfully connected to primary
>> FATAL:  the database system is shutting down
>>
>>
>> the first two "FATAL: the database system is shutting down" lines are from me
>> trying to connect using psql after I noticed that pg_ctl had failed to shut
>> down the standby.
>> The next thing I tried was restarting the master - which led to the
>> following logs, with the standby noticing the restart and reconnecting, but
>> you still cannot actually connect...
>>
>> process tree for the standby is:
>>
>> 29523 pts/2    S      0:00 /home/postgres9/pginst/bin/postgres -D
>> /mnt/space/pgdata_standby
>> 29524 ?        Ss     0:06  \_ postgres: startup process   waiting for
>> 000000010000001900000055
>> 29529 ?        Ss     0:00  \_ postgres: writer process
>> 29835 ?        Ss     0:00  \_ postgres: wal receiver process streaming
>> 19/55000078
> 
> <uninformed-speculation>
> 
> Hmm.  When I committed that patch to fix smart shutdown on the
> standby, we discussed the fact that the startup process can't simply
> release its locks and die at shutdown time because the locks it holds
> prevent other backends from seeing the database in an inconsistent
> state.  Therefore, if we were to terminate recovery as soon as the
> smart shutdown request is received, we might never complete, because a
> backend might be waiting on a lock that will never get released.  If
> that's really a danger scenario, then it follows that we might also
> fail to shut down if we can't connect to the primary, because we might
> not be able to replay enough WAL to release the locks the remaining
> backends are waiting for.  That sort of looks like what is happening
> to you, except based on your test scenario I can't figure out where
> this came from:
> 
> FATAL:  replication terminated by primary server

as I said before, I restarted the master at that point; the standby 
logged the above, restored 000000010000001900000054 from the archive, 
tried reconnecting, and logged the "connection refused" errors. A few 
seconds later the master was up again and the standby succeeded in 
reconnecting.
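For reference, the standby-side restart loop mentioned earlier is roughly the
following sketch (paths, port, and sleep interval are illustrative rather than
the exact commands used; the master concurrently alternates between pgbench
runs and sitting idle):

```shell
#!/bin/sh
# Illustrative reproduction loop for the stuck-shutdown symptom:
# repeatedly restart the standby while the master alternates between
# pgbench load and idle periods.
STANDBY_DATA=/mnt/space/pgdata_standby   # assumed data directory

while true; do
    # pg_ctl restart uses "smart" shutdown mode by default in 9.0,
    # which is the mode that occasionally never completes here;
    # -w makes pg_ctl wait so a hang is visible immediately.
    pg_ctl -D "$STANDBY_DATA" -w restart
    sleep 5
done
```

A stuck iteration shows up as pg_ctl waiting indefinitely while psql
connections to the standby fail with "FATAL: the database system is
shutting down", matching the logs above.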


Stefan

