On 6/4/19 6:24 PM, Laurenz Albe wrote:
> Tomasz Ostrowski wrote:
>
>> The TCP keepalives functionality on the database server's operating
>> system would figure out that the connection is not working anymore and
>> close it.
>>
>> You'd assume that the database would get the info and cancel the query.
>> Except - it does not.
>>
>> The database would get an error on the connection socket only after it
>> tries to read or write to it next time. But it does not try to do this -
>> it's busy counting those quarks until statement_timeout is reached.
>
> By default "tcp_keepalives_idle" is quite large: 2 hours.
>
> Are you sure that the queries keep running for longer than that?
>
> Try to experiment with lower settings. It will cause marginally more
> network traffic, but dead connections will be detected more quickly.
The keepalive settings we use is idle/interval/count=60/10/60. It means
that the the dead connections should be cleared after 60s+10*60s=11m. I
started to investigate when I found a query running for over 11 days
(there was no query_timeout set on this server).
But the problem is not that the dead connections aren't detected - they
are, and the operating system clears them. They disappear from "netstat
--tcp" output after expected time.
The problem is that the database does not get the info and does not kill
the query.
I've reproduced this with running the SQL below on a local Postgresql 11
connected with TCP, configured with idle/interval/count=15/15/15:
create or replace function pg_temp.fib(n int) returns int language
plpgsql as $$
begin if n<=1 then return n; end if; return
pg_temp.fib(n-1)+pg_temp.fib(n-2); end;
$$;
select pg_temp.fib(50);
And then filtering out the connection with iptables.
--
Tomasz "Tometzky" Ostrowski