Thread: tcp_keepalives_idle ignored
Hello all, PG: 8.2.4 We have a very busy setup using multiple clusters, slony, etc. My problem relates to the number of postgres procs increasing, and not decreasing when idle. I eventually end up with thousands of idle processes listening on /tmp/.s.PGSQL.5432 and not quitting (eventually bumping into max_connections). I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG, but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting). I have even tried setting setting /proc/sys/net/ipv4/tcp_keepalive_time directly, but the idle procs just keep increasing in number. tcp_keepalives_interval and tcp_keepalives_count I have left on default. After a few hours worth of running, theres a few thousand idle postgres procs, and they're all idle... Any idea what's going on here? Thanks Henry
"henry" <henry@zen.co.za> writes: > tcp_keepalives_interval and tcp_keepalives_count I have left on default. > After a few hours worth of running, theres a few thousand idle postgres > procs, and they're all idle... Are you sure the clients are actually gone? tcp keepalives are only going to help if you have clients disappearing -- such as networks being unplugged or client machines getting turned off. Under normal conditions when a client disconnects the postgres process will go away immediately. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
> We have a very busy setup using multiple clusters, slony, etc. My problem > relates to the number of postgres procs increasing, and not decreasing > when idle. I eventually end up with thousands of idle processes listening > on /tmp/.s.PGSQL.5432 and not quitting (eventually bumping into > max_connections). What about your client session timeouts?
On Sat, January 12, 2008 1:20 pm, Gregory Stark wrote: > "henry" <henry@zen.co.za> writes: > >> tcp_keepalives_interval and tcp_keepalives_count I have left on default. >> After a few hours worth of running, theres a few thousand idle postgres >> procs, and they're all idle... > > Are you sure the clients are actually gone? tcp keepalives are only going hmm, I think you're right - in a way. This is beginning to look like a friggin leak somewhere. What's supposed to be persistent connections in long-lived processes are not so persistent after all (and the old connections are not being released). Thanks Gregory. Henry
"henry" <henry@zen.co.za> writes: > I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG, > but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting). Just FYI, this is the expected behavior on platforms where the kernel doesn't allow adjustment of the TCP keepalive parameters from userspace. You didn't say what you are running the server on, but the reference to /proc/sys makes me think it's Linux ... which seems odd, because modern Linuxen do allow these things to be adjusted. [ thinks... ] Maybe you were inspecting the value in a Unix-socket connection instead of a TCP connection? This is all irrelevant to your real problem, to judge by the rest of the thread, but I'm curious. regards, tom lane
On Sat, January 12, 2008 6:50 pm, Tom Lane wrote: > "henry" <henry@zen.co.za> writes: >> I have tried setting tcp_keepalives_idle = 120 (eg), then restarting PG, >> but SHOW ALL; shows tcp_keepalives_idle=0 (ignoring my setting). > > Just FYI, this is the expected behavior on platforms where the kernel > doesn't allow adjustment of the TCP keepalive parameters from > userspace. You didn't say what you are running the server on, but > the reference to /proc/sys makes me think it's Linux ... which seems > odd, because modern Linuxen do allow these things to be adjusted. Yes, this is on Linux - adjusting the param under /proc/sys works as expected, but doesn't seem to have an effect on my prob. > [ thinks... ] Maybe you were inspecting the value in a Unix-socket > connection instead of a TCP connection? See below (not sure I understand when you ask 'inspecting'). > This is all irrelevant to your real problem, to judge by the rest of > the thread, but I'm curious. I did in fact find a leak in long-lived procs (some of which can run for days) - but squashing that did not make my problem go away. In fact, these procs are connecting to port TCP 5432 - not a socket (/tmp/.s.PGSQL.5432), TCP connections to 5432 come and go nicely in sync with the number of active procs. The number of /tmp/.s.PGSQL.5432 connections just keep growing... I have no idea what's causing it. lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432 either. Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is used - what would connect to PG via a domain socket? procs which don't explicitly use -p5432, or some other mechanism which I'm ignorant of? If I could just figure out what the hell's using /tmp/.s.PGSQL.5432 then I could get a handle on the problem. Increasing max_connections to cope without understanding what's happening is an irritation (although that could be what's required in the final analysis anyway). It's currently set to 2048, which gets masticated in several hours (a result which doesn't make sense considering the number of active procs) - and I haven't really started using the cluster/s to the full potential yet. Any suggestions/pointers are greatly appreciated. Regards Henry
On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote: > > This is all irrelevant to your real problem, to judge by the rest of > > the thread, but I'm curious. > > I did in fact find a leak in long-lived procs (some of which can run for > days) - but squashing that did not make my problem go away. In fact, > these procs are connecting to port TCP 5432 - not a socket > (/tmp/.s.PGSQL.5432), TCP connections to 5432 come and go nicely in sync > with the number of active procs. > > The number of /tmp/.s.PGSQL.5432 connections just keep growing... I have > no idea what's causing it. Do you have any kind of logging? At the very least pg_stat_activity should tell you if they're doing anything. > lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432 > either. Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is > used - what would connect to PG via a domain socket? procs which don't > explicitly use -p5432, or some other mechanism which I'm ignorant of? Maintainence programs? lsof will tell you what's doing it. Try (as root): lsof |grep '.s.PGSQL' That will list a lot of postgres processes, you're looking for the other ones. Connecting to unix domain socket happens if you don't specify a host. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy
Attachment
Martijn van Oosterhout <kleptog@svana.org> writes: > On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote: >> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432 >> either. Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is >> used - what would connect to PG via a domain socket? > Connecting to unix domain socket happens if you don't specify a host. Specifically, a local connection goes through the socket file by default, and via TCP only if you say "-h localhost" or equivalent. Now that we know the problem connections are local ones, setting a TCP timeout would be 100% useless anyway. As Martijn says, the other ends of the socket connections *must* be on that box someplace --- I'm inclined to think that lsof didn't show them to you because you didn't run it as root and they belong to some other userid. regards, tom lane
On Sun, January 13, 2008 7:25 pm, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: >> On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote: >>> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432 >>> either. Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is >>> used - what would connect to PG via a domain socket? > >> Connecting to unix domain socket happens if you don't specify a host. > > Specifically, a local connection goes through the socket file by > default, and via TCP only if you say "-h localhost" or equivalent. > > Now that we know the problem connections are local ones, setting a TCP > timeout would be 100% useless anyway. As Martijn says, the other ends > of the socket connections *must* be on that box someplace --- I'm > inclined to think that lsof didn't show them to you because you didn't > run it as root and they belong to some other userid. Yes, it was run as root, but all lsof tells you is that postgres has that socket open - not the name/id of the other end of the connection. A genuine thanks to Tom and Martijn for your helpful suggestions. It's exactly what I needed to identify what was going on. The problem was the following (to ensure googlability): To achieve DB replication from a master to quote a few slaves, in multiple clusters, we're using PostgreSQL, Slony and PgPool. By default, PgPool is configured to connect to a local domain socket (backend_host_name = ''). This is fine, but it's also configured by default to never time out when idle (connection_life_time = 0)... Changing those two (localhost and [eg] 300) has resolved my problem of ever-increasing /tmp/.s.PGSQL.5432 domain socket connections. Once again, thanks. Now, back to the lounge to catch late-night Sunday TV. Regards Henry
On Sun, January 13, 2008 6:53 pm, henry wrote: > On Sun, January 13, 2008 7:25 pm, Tom Lane wrote: >> Martijn van Oosterhout <kleptog@svana.org> writes: >>> On Sun, Jan 13, 2008 at 08:40:34AM +0200, henry wrote: >>>> lsof doesn't tell me what's talking to PG through /tmp/.s.PGSQL.5432 >>>> either. Maybe I'm not understanding exactly how /tmp/.s.PGSQL.5432 is >>>> used - what would connect to PG via a domain socket? >> >>> Connecting to unix domain socket happens if you don't specify a host. >> >> Specifically, a local connection goes through the socket file by >> default, and via TCP only if you say "-h localhost" or equivalent. WRT the origional question: why is tcp_keepalives_idle being ignored (whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no matter what I do. Regards Henry
On Mon, January 14, 2008 12:49 pm, henry wrote: > WRT the origional question: why is tcp_keepalives_idle being ignored > (whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no > matter what I do. A quick follow-on with more info to my own post. The culprit in my case was a local trigger firing on INSERTs using dblink_exec() without 'host=127.0.0.1'. Bad news though, even _with_ 'host=127.0.0.1' the connections do not idle timeout. They just hang around waiting for the rapture. So,... this appears to be dblink related after all. I'll be trying explicit open/exec/close. Weird that dblink_exec in a trigger doesn't release resources. Regards Henry
"henry" <henry@zen.co.za> writes: > WRT the origional question: why is tcp_keepalives_idle being ignored > (whether with SET or in postgresql.conf)? - ie, SHOW ALL says it's 0 no > matter what I do. I think you're looking at it in a session that's connecting over a Unix socket. You need to be connected over TCP in order for the variable to be meaningful. It'll read as 0 in a socket connection. regards, tom lane
"henry" <henry@zen.co.za> writes: > The culprit in my case was a local trigger firing on INSERTs using > dblink_exec() without 'host=127.0.0.1'. Bad news though, even _with_ > 'host=127.0.0.1' the connections do not idle timeout. They just hang > around waiting for the rapture. > So,... this appears to be dblink related after all. I'll be trying > explicit open/exec/close. Weird that dblink_exec in a trigger doesn't > release resources. Hmm, this is the transient-connection form of dblink? If so, that would be a bug. Can you put together a test case? Also, which PG version are you using, exactly? regards, tom lane
On Mon, January 14, 2008 5:46 pm, Tom Lane wrote: >> So,... this appears to be dblink related after all. I'll be trying >> explicit open/exec/close. Weird that dblink_exec in a trigger doesn't >> release resources. > > Hmm, this is the transient-connection form of dblink? If so, that would > be a bug. Can you put together a test case? Also, which PG version are > you using, exactly? Version 8.2.4. I *would* normally be using the latest, but I recently made a switch to CentOS (from slackware) and am trying real hard to break old habits (which using a clean package system requires) and not roll my own. I'll get on the test case. Regards Henry