On Thu, Jun 17, 2010 at 12:16 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Greg Stark <gsstark@mit.edu> wrote:
>
>> TCP keepalives are for detecting broken network connections
>
> Yeah. That seems like what we have here. If you shoot the OS in
> the head, the network connection is broken rather abruptly, without
> the normal packets exchanged to close the TCP connection. It sounds
> like it behaves just fine except for not detecting a broken
> connection.
So I think there are two things happening here. If you shut down the
master and don't replace it then you'll get no network errors until
TCP gives up entirely. Similarly if you pull the network cable or your
switch powers off or your routing table becomes messed up, or anything
else occurs which prevents packets from getting through then you'll
see similar breakage. You wouldn't want your database to suddenly come
up as master in such circumstances though when you'll have to fix the
problem anyways, doing so won't solve any problems it would just
create a second problem.
But there's a second case. The Postgres master just stops responding
-- perhaps it starts seeing disk errors and becomes stuck in disk-wait
or the machine just becomes very heaviliy loaded and Postgres can't
get any cycles, or someone attaches to it with gdb, or one of any
number of things happen which cause it to stop sending data. In that
case replication will not see any data from the master but TCP will
never time out because the network is just fine. That's why there
needs to be an application level health check if you want to have
timeouts. You can't depend on the network layer to detect problems
between the application.
--
greg