Thread: tcp_keepalives_idle ignored

tcp_keepalives_idle ignored

From
"henry"
Date:
Hello all,

PG:  8.2.4

We have a very busy setup using multiple clusters, slony, etc.  My problem
relates to the number of postgres procs increasing, and not decreasing
when idle.  I eventually end up with thousands of idle processes listening
on /tmp/.s.PGSQL.5432 and not quitting (eventually bumping into
max_connections).

I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG,
but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting).

I have even tried setting setting /proc/sys/net/ipv4/tcp_keepalive_time
directly, but the idle procs just keep increasing in number.

tcp_keepalives_interval and tcp_keepalives_count I have left on default.
After a few hours worth of running, theres a few thousand idle postgres
procs, and they're all idle...

Any idea what's going on here?

Thanks
Henry


Re: tcp_keepalives_idle ignored

From
Gregory Stark
Date:
"henry" <henry@zen.co.za> writes:

> tcp_keepalives_interval and tcp_keepalives_count I have left on default.
> After a few hours worth of running, theres a few thousand idle postgres
> procs, and they're all idle...

Are you sure the clients are actually gone? tcp keepalives are only going to
help if you have clients disappearing -- such as networks being unplugged or
client machines getting turned off. Under normal conditions when a client
disconnects the postgres process will go away immediately.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

Re: tcp_keepalives_idle ignored

From
Jean-Michel Pouré
Date:
> We have a very busy setup using multiple clusters, slony, etc.  My problem
> relates to the number of postgres procs increasing, and not decreasing
> when idle.  I eventually end up with thousands of idle processes listening
> on /tmp/.s.PGSQL.5432 and not quitting (eventually bumping into
> max_connections).

What about your client session timeouts?


Re: tcp_keepalives_idle ignored

From
"henry"
Date:
On Sat, January 12, 2008 1:20 pm, Gregory Stark wrote:
> "henry" <henry@zen.co.za> writes:
>
>> tcp_keepalives_interval and tcp_keepalives_count I have left on default.
>> After a few hours worth of running, theres a few thousand idle postgres
>> procs, and they're all idle...
>
> Are you sure the clients are actually gone? tcp keepalives are only going

hmm, I think you're right - in a way.  This is beginning to look like a
friggin leak somewhere.  What's supposed to be persistent connections in
long-lived processes are not so persistent after all (and the old
connections are not being released).

Thanks Gregory.

Henry


Re: tcp_keepalives_idle ignored

From
Tom Lane
Date:
"henry" <henry@zen.co.za> writes:
> I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG,
> but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting).

Just FYI, this is the expected behavior on platforms where the kernel
doesn't allow adjustment of the TCP keepalive parameters from
userspace.  You didn't say what you are running the server on, but
the reference to /proc/sys makes me think it's Linux ... which seems
odd, because modern Linuxen do allow these things to be adjusted.

[ thinks... ]  Maybe you were inspecting the value in a Unix-socket
connection instead of a TCP connection?

This is all irrelevant to your real problem, to judge by the rest of
the thread, but I'm curious.

            regards, tom lane

Re: tcp_keepalives_idle ignored

From
"henry"
Date:

On Sat, January 12, 2008 6:50 pm, Tom Lane wrote:
> "henry" <henry@zen.co.za> writes:
>> I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG,
>> but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting).
>
> Just FYI, this is the expected behavior on platforms where the kernel
> doesn't allow adjustment of the TCP keepalive parameters from
> userspace.  You didn't say what you are running the server on, but
> the reference to /proc/sys makes me think it's Linux ... which seems
> odd, because modern Linuxen do allow these things to be adjusted.

Yes, this is on Linux - adjusting the param under /proc/sys works as
expected, but doesn't seem to have an effect on my prob.

> [ thinks... ]  Maybe you were inspecting the value in a Unix-socket
> connection instead of a TCP connection?

See below (not sure I understand when you ask 'inspecting').

> This is all irrelevant to your real problem, to judge by the rest of
> the thread, but I'm curious.

I did in fact find a leak in long-lived procs (some of which can run for
days) - but squashing that did not make my problem go away.  In fact,
these procs are connecting to port TCP 5432 - not a socket
(/tmp/.s.PGSQL.5432), TCP connections to 5432 come and go nicely in sync
with the number of active procs.

The number of /tmp/.s.PGSQL.5432 connections just keep growing...  I have
no idea what's causing it.

lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432
either.  Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is
used - what would connect to PG via a domain socket?  procs which don't
explicitly use -p5432, or some other mechanism which I'm ignorant of?

If I could just figure out what the hell's using /tmp/.s.PGSQL.5432 then I
could get a handle on the problem.

Increasing max_connections to cope without understanding what's happening
is an irritation (although that could be what's required in the final
analysis anyway).  It's currently set to 2048, which gets masticated in
several hours (a result which doesn't make sense considering the number of
active procs) - and I haven't really started using the cluster/s to the
full potential yet.

Any suggestions/pointers are greatly appreciated.

Regards
Henry


Re: tcp_keepalives_idle ignored

From
Martijn van Oosterhout
Date:
On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote:
> > This is all irrelevant to your real problem, to judge by the rest of
> > the thread, but I'm curious.
>
> I did in fact find a leak in long-lived procs (some of which can run for
> days) - but squashing that did not make my problem go away.  In fact,
> these procs are connecting to port TCP 5432 - not a socket
> (/tmp/.s.PGSQL.5432), TCP connections to 5432 come and go nicely in sync
> with the number of active procs.
>
> The number of /tmp/.s.PGSQL.5432 connections just keep growing...  I have
> no idea what's causing it.

Do you have any kind of logging? At the very least pg_stat_activity
should tell you if they're doing anything.

> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432
> either.  Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is
> used - what would connect to PG via a domain socket?  procs which don't
> explicitly use -p5432, or some other mechanism which I'm ignorant of?

Maintainence programs? lsof will tell you what's doing it. Try (as
root):

lsof |grep '.s.PGSQL'

That will list a lot of postgres processes, you're looking for the
other ones.

Connecting to unix domain socket happens if you don't specify a host.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Attachment

Re: tcp_keepalives_idle ignored

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote:
>> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432
>> either.  Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is
>> used - what would connect to PG via a domain socket?

> Connecting to unix domain socket happens if you don't specify a host.

Specifically, a local connection goes through the socket file by
default, and via TCP only if you say "-h localhost" or equivalent.

Now that we know the problem connections are local ones, setting a TCP
timeout would be 100% useless anyway.  As Martijn says, the other ends
of the socket connections *must* be on that box someplace --- I'm
inclined to think that lsof didn't show them to you because you didn't
run it as root and they belong to some other userid.

            regards, tom lane

Re: tcp_keepalives_idle ignored

From
"henry"
Date:

On Sun, January 13, 2008 7:25 pm, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote:
>>> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432
>>> either.  Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is
>>> used - what would connect to PG via a domain socket?
>
>> Connecting to unix domain socket happens if you don't specify a host.
>
> Specifically, a local connection goes through the socket file by
> default, and via TCP only if you say "-h localhost" or equivalent.
>
> Now that we know the problem connections are local ones, setting a TCP
> timeout would be 100% useless anyway.  As Martijn says, the other ends
> of the socket connections *must* be on that box someplace --- I'm
> inclined to think that lsof didn't show them to you because you didn't
> run it as root and they belong to some other userid.

Yes, it was run as root, but all lsof tells you is that postgres has that
socket open - not the name/id of the other end of the connection.

A genuine thanks to Tom and Martijn for your helpful suggestions.  It's
exactly what I needed to identify what was going on.

The problem was the following (to ensure googlability):

To achieve DB replication from a master to quote a few slaves, in multiple
clusters, we're using PostgreSQL, Slony and PgPool.  By default, PgPool is
configured to connect to a local domain socket (backend_host_name = '').
This is fine, but it's also configured by default to never time out when
idle (connection_life_time = 0)...

Changing those two (localhost and [eg] 300) has resolved my problem of
ever-increasing /tmp/.s.PGSQL.5432 domain socket connections.

Once again, thanks.  Now, back to the lounge to catch late-night Sunday TV.

Regards
Henry


Re: tcp_keepalives_idle ignored

From
"henry"
Date:
On Sun, January 13, 2008 6:53 pm, henry wrote:
> On Sun, January 13, 2008 7:25 pm, Tom Lane wrote:
>> Martijn van Oosterhout <kleptog@svana.org> writes:
>>> On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote:
>>>> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432
>>>> either.  Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is
>>>> used - what would connect to PG via a domain socket?
>>
>>> Connecting to unix domain socket happens if you don't specify a host.
>>
>> Specifically, a local connection goes through the socket file by
>> default, and via TCP only if you say "-h localhost" or equivalent.

WRT the origional question:  why is tcp_keepalives_idle being ignored
(whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no
matter what I do.

Regards
Henry


Re: tcp_keepalives_idle ignored

From
"henry"
Date:

On Mon, January 14, 2008 12:49 pm, henry wrote:
> WRT the origional question:  why is tcp_keepalives_idle being ignored
> (whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no
> matter what I do.

A quick follow-on with more info to my own post.

The culprit in my case was a local trigger firing on INSERTs using
dblink_exec() without 'host=127.0.0.1'.  Bad news though, even _with_
'host=127.0.0.1' the connections do not idle timeout.  They just hang
around waiting for the rapture.

So,... this appears to be dblink related after all.  I'll be trying
explicit open/exec/close.  Weird that dblink_exec in a trigger doesn't
release resources.

Regards
Henry






Re: tcp_keepalives_idle ignored

From
Tom Lane
Date:
"henry" <henry@zen.co.za> writes:
> WRT the origional question:  why is tcp_keepalives_idle being ignored
> (whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no
> matter what I do.

I think you're looking at it in a session that's connecting over a Unix
socket.  You need to be connected over TCP in order for the variable
to be meaningful.  It'll read as 0 in a socket connection.

            regards, tom lane

Re: tcp_keepalives_idle ignored

From
Tom Lane
Date:
"henry" <henry@zen.co.za> writes:
> The culprit in my case was a local trigger firing on INSERTs using
> dblink_exec() without 'host=127.0.0.1'.  Bad news though, even _with_
> 'host=127.0.0.1' the connections do not idle timeout.  They just hang
> around waiting for the rapture.

> So,... this appears to be dblink related after all.  I'll be trying
> explicit open/exec/close.  Weird that dblink_exec in a trigger doesn't
> release resources.

Hmm, this is the transient-connection form of dblink?  If so, that would
be a bug.  Can you put together a test case?  Also, which PG version are
you using, exactly?

            regards, tom lane

Re: tcp_keepalives_idle ignored

From
"henry"
Date:
On Mon, January 14, 2008 5:46 pm, Tom Lane wrote:
>> So,... this appears to be dblink related after all.  I'll be trying
>> explicit open/exec/close.  Weird that dblink_exec in a trigger doesn't
>> release resources.
>
> Hmm, this is the transient-connection form of dblink?  If so, that would
> be a bug.  Can you put together a test case?  Also, which PG version are
> you using, exactly?

Version 8.2.4.  I *would* normally be using the latest, but I recently
made a switch to CentOS (from slackware) and am trying real hard to break
old habits (which using a clean package system requires) and not roll my
own.

I'll get on the test case.

Regards
Henry