Re: Streaming replication - unable to stop the standby - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Streaming replication - unable to stop the standby
Date
Msg-id AANLkTikiMa8eaVnkWt8sMFTsPNqWmeISsTwd9g59gCqN@mail.gmail.com
In response to Re: Streaming replication - unable to stop the standby  (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>)
Responses Re: Streaming replication - unable to stop the standby
Re: Streaming replication - unable to stop the standby
List pgsql-hackers
On Mon, May 3, 2010 at 2:22 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> Tom Lane wrote:
>>
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>
>>> I'm currently testing SR/HS in 9.0beta1 and I noticed that it seems quite
>>> easy to end up in a situation where you have a standby that seems to be
>>> stuck in:
>>
>>> $ psql -p 5433
>>> psql: FATAL:  the database system is shutting down
>>
>>> but never actually shuts down. I ran into that a few times now (mostly
>>> because I'm trying to chase a recovery issue I hit during earlier
>>> testing) by simply having the master iterate between a pgbench run and
>>> being idle while doing pg_ctl restart in a loop on the standby.
>>> I do vaguely recall some discussion of that, but I thought the issue got
>>> settled somehow?
>>
>> Hm, I haven't pushed this hard but "pg_ctl stop" seems to stop the
>> standby for me.  Which subprocesses of the slave postmaster are still
>> around?  Could you attach to them with gdb and get stack traces?
>
> It does not always fail to shut down - it only fails sometimes. I have not
> pinpointed exactly what is causing this yet, but the standby is in a weird
> state now:
>
> * the master is currently idle
> * the standby has no connections at all
>
> logs from the standby:
>
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  replication terminated by primary server
> LOG:  restored log file "000000010000001900000054" from archive
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> LOG:  record with zero length at 19/55000078
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> FATAL:  could not connect to the primary server: could not connect to
> server: Connection refused
>                Is the server running on host "localhost" and accepting
>                TCP/IP connections on port 5432?
>        could not connect to server: Connection refused
>                Is the server running on host "localhost" and accepting
>                TCP/IP connections on port 5432?
>
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> cp: cannot stat `/mnt/space/wal-archive/000000010000001900000055': No such
> file or directory
> LOG:  streaming replication successfully connected to primary
> FATAL:  the database system is shutting down
>
>
> The first two "FATAL: the database system is shutting down" lines are from
> me trying to connect using psql after I noticed that pg_ctl had failed to
> shut down the slave.
> The next thing I tried was restarting the master, which led to the log
> lines above: the standby noticed it and reconnected, but you still cannot
> actually connect...
>
> process tree for the standby is:
>
> 29523 pts/2    S      0:00 /home/postgres9/pginst/bin/postgres -D
> /mnt/space/pgdata_standby
> 29524 ?        Ss     0:06  \_ postgres: startup process   waiting for
> 000000010000001900000055
> 29529 ?        Ss     0:00  \_ postgres: writer process
> 29835 ?        Ss     0:00  \_ postgres: wal receiver process streaming
> 19/55000078
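
To collect the stack traces Tom asked for, something along these lines run
against the PIDs shown above should do it (a rough sketch: it assumes gdb is
installed and is run with enough privileges to attach, e.g. as the postgres
user or root):

    # dump a backtrace of each remaining postgres child process
    for pid in 29524 29529 29835; do
        echo "=== backtrace of pid $pid ==="
        gdb --batch -ex bt -p "$pid"
    done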

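For reference, the reproduction Stefan describes upthread boils down to
roughly the two loops below (a sketch rather than his actual script; the
database name "bench", the durations, and the smart shutdown mode are
guesses):

    # on the master: alternate pgbench activity with idle periods
    # (assumes a database previously initialized with pgbench -i)
    while true; do
        pgbench -p 5432 -T 60 bench
        sleep 60
    done

    # on the standby, in parallel: restart it in a loop
    while true; do
        pg_ctl restart -D /mnt/space/pgdata_standby -m smart
        sleep 30
    done
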
<uninformed-speculation>

Hmm.  When I committed that patch to fix smart shutdown on the
standby, we discussed the fact that the startup process can't simply
release its locks and die at shutdown time because the locks it holds
prevent other backends from seeing the database in an inconsistent
state.  Therefore, if we were to terminate recovery as soon as the
smart shutdown request is received, the shutdown might never complete,
because a backend might be waiting on a lock that will never get released.  If
that's really a danger scenario, then it follows that we might also
fail to shut down if we can't connect to the primary, because we might
not be able to replay enough WAL to release the locks the remaining
backends are waiting for.  That sort of looks like what is happening
to you, except based on your test scenario I can't figure out where
this came from:

FATAL:  replication terminated by primary server

</uninformed-speculation>
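
If that theory is right, the interesting question for a standby in this
state is whether the startup process is still holding AccessExclusiveLocks
that other backends are queued behind.  On a standby that is still accepting
connections, a query like the following should show it; once the shutdown
request is pending, new connections are refused, so you would have to run
the same query from an already-open session (a sketch, not tested against
9.0beta1; port 5433 is the standby port from Stefan's setup):

    # list recovery-held locks and any backends waiting behind them
    psql -p 5433 -c "
        SELECT pid, locktype, relation::regclass AS rel, mode, granted
        FROM pg_locks
        ORDER BY granted, pid;"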

...Robert

