Oliver Jowett <oliver@opencloud.com> writes:
> The scenario I need to deal with is this:
> There are multiple nodes, network-separated, participating in a cluster.
> One node is selected to talk to a particular postgresql instance (call
> this node A).
> A starts a transaction and grabs some locks in the course of that
> transaction. Then A falls off the network before committing because of a
> hardware or network failure. A's connection might be completely idle
> when this happens.
> The cluster liveness machinery notices that A is dead and selects a new
> node to talk to postgresql (call this node B). B resumes the work that A
> was doing prior to failure.
> B has to wait for any locks held by A to be released before it can make
> any progress.
> Without some sort of tunable timeout, it could take a very long time (2+
> hours by default on Linux) before A's connection finally times out and
> releases the locks.
Wouldn't it be reasonable to expect the "cluster liveness machinery" to
notify the database server's kernel that connections to A are now dead?
I find it really unconvincing to suppose that the above problem should
be solved at the database level.
regards, tom lane