Thread: memory issues when running with mod_perl
Someone posted an issue to the mod_perl list a few weeks ago about their machine losing a ton of memory under a mod_perl2/Apache/Postgres system, and only being able to reclaim it by rebooting.

A few weeks later I ran into some memory-related problems and noticed a similar issue. Starting and stopping the clients had no effect on memory (which was expected: the new ones just pulled in from the shared cache). But stopping the daemon didn't affect memory either. Running ipcs, I saw the shared memory lock freed, but the kernel never seemed to get the memory back, and then I'd run into swap.

I saw this on FreeBSD 6.x + pg 8.1.x; separate people on the list had it under the 2.4 / 2.6 kernels with 7.x and 8.x pgs. Someone just posted that they have the same issue on one machine under 7.4.9, but not (yet) under 7.4.13.

Does anyone have suggestions on how to test this to make sure it's a pg issue? It often takes a few days for pg to consume enough memory for this behavior to set in.
On Wed, Sep 27, 2006 at 05:03:15PM -0400, Jonathan Vanasco wrote:
> Someone posted an issue to the mod-perl list a few weeks ago about
> their machine losing a ton of memory under a mod-perl2/apache/
> postgres system - and only being able to reclaim it from reboots

Are you sure you're looking at the right numbers? Disk cache should be counted as part of free memory, for example.

Could you provide some actual output of your tests, so we can see exactly what you mean?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn,

> Are you sure you're looking at the right numbers? Disk cache should be
> counted as part of free memory, for example.

I am the guy who posted the problem to mod_perl, and yes, I am quite sure that we are talking about the right numbers. The best argument is that the machine in fact starts swapping when memory is gone, and this means there is neither free nor cached memory left.

> Could you provide some actual output of your tests, so we can see
> exactly what you mean?

Several examples have been discussed on the mod_perl mailing list; just have a look at http://mail-archives.apache.org/mod_mbox/perl-modperl/200609.mbox/browser

However, my configuration may be quite a good example, as I am using very current software packages:

Linux 2.6.13-15 (SuSE 10.0 i586)
Apache 2.0.59
Postgresql 8.1.4
mod_perl 2.0.2
perl 5.8.8
DBI 1.52
DBD::Pg 1.49
Embperl 2.2.0
OpenSSL 0.9.8b

When I boot my machine with 1 G physical RAM and about 4 G of swap, it uses about 300 M for the OS, apache and pg, and the rest is free. As usual, the free memory goes to cached memory after some time; no reason to worry. However, within two weeks, the cached memory shrinks more and more, and the machine starts to swap. When I stop apache at that time, I get some 100-150 M back; when I stop pg, however, I get nearly nothing back. On the other hand, it does not cost me anything if I restart pg, whereas apache of course takes back the 100-150 M. As swapping increases, the machine gets slower and slower, until it no longer answers anything except ping.

I thought about a kernel memory leak for a long time; however, on the mod_perl mailing list we heard about someone reporting the same problem on BSD, so this may be the wrong track.

Using ipcs, I see that the postmaster uses 10 M of shared memory. Why do I not see any increase of free memory (or cache) when stopping pg?
Thanks in advance,
Andreas

P.S.: Jonathan has described nearly the same thing in the mod_perl list on Wed, 06 Sep, 21:36.
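The "right numbers" question above can be made concrete with a small logging helper. This is only a sketch, assuming a Linux system with /proc/meminfo; the function names `available_kb` and `log_sample` are made up for illustration, and `date +%s` is a common extension rather than strict POSIX. It adds up free memory, buffers and page cache, which is the quantity Martijn and Andreas are arguing about.

```shell
#!/bin/sh
# Sum MemFree, Buffers and Cached (all reported in kB) from a
# meminfo-style file. The file defaults to /proc/meminfo (Linux only).
available_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { printf "%d\n", sum }' "${1:-/proc/meminfo}"
}

# Print "epoch_seconds kB" for one sample. Appending this to a file
# from cron gives a record that can be posted to the list.
log_sample() {
    printf '%s %s\n' "$(date +%s)" "$(available_kb "$@")"
}
```

Running something like `log_sample >> /var/tmp/mem.log` once an hour (the path is arbitrary) should show whether free plus cached memory really shrinks over the two weeks, independent of how `free` or `top` present it.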
Andreas Rieke <andreas.rieke@isl.de> writes:
> I am the guy who posted the problem to mod_perl, and yes, I am quite
> sure that we are talking about the right numbers. The best argument is
> that the machine in fact starts swapping when memory is gone - and this
> means there is neither free nor cached memory left.

Andreas, what it sounds like to me is a kernel memory leak probably triggered by Postgres' use of SysV shared memory (which is not a heavily used kernel feature these days, so bugs in it are hardly out of the question).

A couple of facts that might help you narrow your theories:

1. When the postmaster starts up, it allocates one, count 'em one, shared memory segment that is never thereafter changed in size.

2. When the postmaster shuts down, it issues a shmctl(IPC_RMID) call against that segment. The kernel should thereupon mark the segment for destruction, and then actually destroy it when the last process connected to it is gone. In a normal shutdown that would mean immediately (because the postmaster waits for all its child processes to die first), but in an "immediate mode" shutdown there might still be children alive at the instant of the shmctl.

Within this context, the only way to cause a memory leak is to "kill -9" the postmaster instead of giving it a chance to exit gracefully. In that case the shmctl(IPC_RMID) never happens and the memory segment isn't reclaimed. However, if that were your problem then the evidence would be real clear in "ipcs -m -a" output: lots of postgres-owned segments with zero attached processes. (There actually is code in the postmaster to try to find and destroy such orphaned segments during postmaster restart, but it's not 100% guaranteed to find everything.)

If the shared segment is no longer present according to ipcs, and there are no postgres processes still running, then it's simply not possible for it to be postgres' fault if memory has not been reclaimed. So you're looking at a kernel bug.
As to the nature of the bug ... we saw something similar in older versions of OS X:
http://archives.postgresql.org/pgsql-general/2004-08/msg00972.php
Since Darwin is BSD-derived, an ancient common bug seems possible. (BTW, I just repeated the above experiment in OS X 10.4.8, and see no leak, so Apple did fix it somewhere along the line.)

Anyway I'd suggest trying to duplicate the problem without apache by firing new backends rapidly as in the above message. If you can, file a kernel bug report.

regards, tom lane
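The check Tom describes (postgres-owned segments with zero attached processes) can be scripted. The sketch below assumes the column layout printed by `ipcs -m` on Linux (key, shmid, owner, perms, bytes, nattch, status); FreeBSD's `ipcs -m -a` orders its columns differently, so the field numbers would need adjusting there. The function name `orphaned_segments` is hypothetical.

```shell
#!/bin/sh
# Read Linux-style "ipcs -m" output on stdin and print "shmid bytes"
# for every segment owned by the given user (default: postgres) that
# has no attached processes.
# Assumed columns: key shmid owner perms bytes nattch [status]
orphaned_segments() {
    awk -v owner="${1:-postgres}" '
        NF >= 6 && $1 ~ /^0x/ && $3 == owner && $6 == 0 { print $2, $5 }'
}
```

Typical use would be `ipcs -m | orphaned_segments postgres` after stopping the postmaster. Empty output matches what the posters report: the segment is gone according to ipcs, so any missing memory has to be explained some other way.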
On Sep 30, 2006, at 12:28 PM, Tom Lane wrote:
> If the shared segment is no longer present according to ipcs,
> and there are no postgres processes still running, then it's
> simply not possible for it to be postgres' fault if memory has
> not been reclaimed. So you're looking at a kernel bug.

that's got to be it then. i've been running ipcs *hoping* to see something, but it's all 'freed'. the mem just disappears.

> As to the nature of the bug ... we saw something similar in older
> versions of OS X:
> http://archives.postgresql.org/pgsql-general/2004-08/msg00972.php

thanks for the link!
Tom,

thanks for all the facts first.

Tom Lane wrote:
> If the shared segment is no longer present according to ipcs,
> and there are no postgres processes still running, then it's
> simply not possible for it to be postgres' fault if memory has
> not been reclaimed. So you're looking at a kernel bug.

Before we switch this item to the linux kernel mailing list, let me add two results and two more questions.

R1: First of all, I tried the loop from your older OS X problem:

while true
do
    psql -c "select count(*) from tenk1" regression
done

Even after running the psql command more than a million times over quite a small table with about 10 000 entries, I can NOT see any lost memory. Thus, we have a different problem than the OS X people.

R2: After having a look at the linux kernel mailing list, it seems that this problem is not yet known there. So far, so good.

Q1: The first question is quite easy: Is there any way to tell pg NOT to use shmem? Although I expect lower performance with that configuration, I would like to try it out.

Q2: You say that pg allocates one shared memory block which is never changed in size, and I can see with ipcs that we are talking about 10 MBytes on my machine (which uses the default configuration). Although I usually do not kill -9 the postmaster, the maximum loss of memory from that cause seems to be 10 M. However, my machine loses between 500 M and 800 M in two weeks, and within that time I restart pg only very few times, say 3-4 times. Does pg allocate other shmem blocks? If there is really a kernel memory problem in shmem, how can I lose so much memory?

Thanks in advance,
Andreas
> However, my machine loses between 500 M and 800 M in two weeks, and
> within that time, I restart pg only very few times, say 3-4 times.
> Does pg allocate other shmem blocks? If there is really a kernel memory
> problem in shmem, how can I lose so much memory?

This is the same thing I am seeing -- 500MB-1GB of memory lost every two weeks -- and I don't restart pg at all. So, whatever is causing this is not due to pg restarts. I am running the same software as everyone else who has had this problem though: Apache, Postgres, and mod_perl on Linux (I know there's the guy seeing it on BSD also).

As I mentioned on the mod_perl list, I'm seeing the loss on a machine with ~350 vhosted domains all running a mod_perl CMS:

Apache 1.3.37
Postgres 7.4.9
Linux 2.6.12.6
mod_perl 1.29

However, I am not seeing any loss at all on another machine with ~100 vhosted domains running the same CMS, but with the following software:

Apache 1.3.37
Postgres 7.4.13
Linux 2.6.16.27
mod_perl 1.29

I cannot be certain that it's not just due to the lighter load (100 vs 350 vhosts), but I have not seen a single MB of lost memory on the second machine, and am inclined to believe that the problem is fixed with that setup (aside from the kernel and postgres, the two machines are running the same software).

Tonight I am going to upgrade postgres on the first machine and see if it makes any difference, but it'll be about a week before I know for sure if memory is still being lost (it's such a slow leak that you cannot tell with just a couple of days).
Fred,

Fred Tyler wrote:
> Tonight I am going to upgrade postgres on the first machine and see if
> it makes any difference, but it'll be about a week before I know for
> sure if memory is still being lost (it's such a slow leak that you
> cannot tell with just a couple days).

I use the latest 8.1.4 postgres software on my machine; this does not help.

You say that you are using Linux 2.6.12.6 on the machine with the problem and Linux 2.6.16.27 on the one where you do not lose memory. Are there any other differences (RAID, Reiser FS, DBI, DBD::Pg, Embperl, physical memory or swap size, ...)?

Andreas
Andreas Rieke <andreas.rieke@isl.de> writes:
> R1: First of all, I tried the loop from your older OS X problem:
> while true
> do
> psql -c "select count(*) from tenk1" regression
> done
> Even after running the psql command for more than a million times over
> quite a small table with about 10 000 entries, I can NOT see any lost
> memory. Thus, we have another problem as the OS X people.

OK, that kills the theory that the leak is triggered by subprocess exit. Another thing that would be worth trying is to just stop and start the postmaster a large number of times, to see if the leak occurs at postmaster exit.

Also, do you have any problems with backends crashing (ie, forced database restarts)? That scenario should be equivalent to a postmaster restart, but it might be worth trying a few cycles of kill -9'ing a backend (not the postmaster) to check for a leak in that path.

> R2: After having a look at the linux kernel mailing list, it seems that
> this problem is not yet known there.

It's premature to complain to them until we have a clearly reproducible way of causing the leak.

> Q1: The first question is quite easy: Is there any way to tell pg NOT to
> use shmem?

No.

> Does pg allocate other shmem blocks?

No.

regards, tom lane
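Tom's suggestion to stop and start the postmaster many times could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes Linux (/proc/meminfo), that `pg_ctl` is on the PATH of the postgres user, and that a "fast" shutdown is acceptable on the test box. `PG_CTL`, `MEMINFO` and `restart_cycles` are names invented here so the loop can be dry-run without a server.

```shell
#!/bin/sh
# Both settings are overridable, e.g. PG_CTL=true for a dry run.
PG_CTL="${PG_CTL:-pg_ctl}"
MEMINFO="${MEMINFO:-/proc/meminfo}"

# Free + buffers + page cache in kB (Linux meminfo format assumed).
free_plus_cache_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { printf "%d\n", sum }' "$MEMINFO"
}

# restart_cycles DATADIR COUNT
# Stop and start the postmaster COUNT times, printing the cycle number
# and the free+cache figure after every restart.
restart_cycles() {
    datadir="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        "$PG_CTL" stop  -D "$datadir" -m fast -w || return 1
        "$PG_CTL" start -D "$datadir" -w -l "$datadir/restart-test.log" || return 1
        printf '%s %s\n' "$i" "$(free_plus_cache_kb)"
        i=$((i + 1))
    done
}
```

If the second column drops by roughly the size of the shared segment on every cycle, that would point at postmaster exit; a flat column would support the reports that the loss happens without any restarts.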
> > Tonight I am going to upgrade postgres on the first machine and see if
> > it makes any difference, but it'll be about a week before I know for
> > sure if memory is still being lost (it's such a slow leak that you
> > cannot tell with just a couple days).
>
> I use the latest 8.1.4 postgres software on my machine, this does not help.

Actually, I was going to upgrade to 7.4.13 thinking that maybe some bugfix got into the 7.4 series that didn't make it into the 8.1 series. That said, however, I think this is a long shot and I'm not really expecting it to help.

All my research a few months ago pointed to this being a memory leak in the 2.6.12 kernel, and I've been patiently waiting for a good time to upgrade the kernel, cross my fingers, and hope everything is magically fixed. But since it came up on the mod_perl list recently and I saw that others with very similar configurations were having the same problems, it's renewed my interest in trying to pinpoint the problem.

> You say that you are using Linux 2.6.12.6 on the machine with the
> problem and Linux 2.6.16.27 on the one where you do not lose memory.
> Are there any other differences (RAID, Reiser FS, DBI, DBD::Pg, Embperl,
> physical memory or swap

As far as software goes, they are identical with the exception of the kernel version and postgres version. Both run reiserfs. No RAID. The physical memory is the same (2GB). The differences are:

Leaky machine: 256MB swap, P4 2.8GHz
Non-leaky machine: 512MB swap, Pentium D (dual core) 3.0GHz

What is your kernel version?
> OK, that kills the theory that the leak is triggered by subprocess exit.
> Another thing that would be worth trying is to just stop and start the
> postmaster a large number of times, to see if the leak occurs at
> postmaster exit.

It is not from the exit. I see the exact same problem and I never restart postgres and it never crashes. It runs constantly and with no crashes for 20-30 days until the box is out of memory and I have to reboot.

> > R2: After having a look at the linux kernel mailing list, it seems that
> > this problem is not yet known there.

It is possible that it has already been fixed. I am seeing this memory leak quite clearly on 2.6.12.6, but there's no evidence of it at all on 2.6.16.27. The changelogs between those two versions show a lot of bugfixes for memory leaks. Also, that message from Will Glenn references a discussion where someone else running 2.6.12 was having a memory leak issue.
"Fred Tyler" <fredty8@gmail.com> writes: >>> R2: After having a look at the linux kernel mailing list, it seems that >>> this problem is not yet known there. > It is possible that it has already been fixed. I am seeing this memory > leak quite clearly on 2.6.12.6, but there's no evidence of it at all > on 2.6.16.27. The changelogs between those two versions show a lot of > bugfixes for memory leaks. We should not forget the possibility that there's more than one bug here. The people who were seeing leakage on BSD might be facing a different problem than what you are dealing with on Linux. It'd be worth trying the psql-in-a-tight-loop example on the BSD boxes that are showing the problem. regards, tom lane
Fred,

> What is your kernel version?

It's 2.6.13-15. Thus, if we have a kernel bug, the newest known leaky version is 2.6.13-15, whereas the oldest fixed version should be 2.6.16.27. As many people run pg on older kernel versions, I would expect many others to have memory problems in that case.

However, we are heading toward a dead end. Although I am very sure that I can stop the postmaster without getting the 10 M shmem back, my real problem is worse, because I lose up to 800 M in a continuous way within 14 days. Tom Lane raised the question whether there is more than one bug. Any ideas?

Andreas
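Because the loss is slow (500-800 M over two weeks is only a few tens of MB per day), a rate computed from logged samples may be easier to compare between machines than eyeballing `free`. The helper below is an illustrative sketch: it assumes a log of "epoch_seconds kB" lines, one per sample, and the name `leak_rate_mb_per_day` is made up. It only looks at the first and last sample, so short-term cache fluctuations can skew the result.

```shell
#!/bin/sh
# Read "epoch_seconds kB" lines on stdin and print the average loss in
# MB per day between the first and the last sample. A positive number
# means memory went down. Prints "n/a" with fewer than two usable samples.
leak_rate_mb_per_day() {
    awk 'NR == 1 { t0 = $1; m0 = $2 }
         { t1 = $1; m1 = $2 }
         END {
             if (NR < 2 || t1 == t0) { print "n/a"; exit }
             printf "%.1f\n", ((m0 - m1) / 1024) / ((t1 - t0) / 86400)
         }'
}
```

For example, a log whose first and last lines are two days apart and 153600 kB lower would be reported as 75.0 MB/day.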
Andreas Rieke <andreas.rieke@isl.de> writes:
> It's 2.6.13-15. Thus, if we have a kernel bug, the newest known leaky
> version is 2.6.13-15, whereas the oldest fixed version should be 2.6.16.27.

I have a few servers with 2.6.16.21 and I don't see the problem either.

--
Jorge Godoy <jgodoy@gmail.com>
On Oct 1, 2006, at 11:56 AM, Tom Lane wrote:
> OK, that kills the theory that the leak is triggered by subprocess
> exit.
> Another thing that would be worth trying is to just stop and start the
> postmaster a large number of times, to see if the leak occurs at
> postmaster exit.

On FreeBSD I'm not seeing any leak on subprocess exit. Multiple psql clients just consume memory - often shared - then toss it back nicely. The consumed memory / available memory does grow - but it's all allocated for, and within expected constraints.

I believe, however, I'm seeing a leak on postmaster exit.

Can someone suggest to me a SQL query I can loop a few thousand times to drive up shared memory use? Basically, to test I'd like to do something like what was suggested in the archived osx thread:
http://archives.postgresql.org/pgsql-general/2004-08/msg00972.php

====
while true
do
    psql -c "select count(*) from tenk1" regression
done
====

except instead of relying on a leak to increase memory, I'd like a rather intensive large function with a dataset to consume massive amounts of ram. I just can't think of any function to do that.
Jonathan Vanasco <postgres@2xlp.com> writes:
> except instead of relying on a leak to increase memory, I'd like a
> rather intensive large function with a dataset to consumer massive
> amounts of ram. I just can't think of any function to do that.

Sort a big chunk of data with a high work_mem setting, eg

select random() from generate_series(1,1000000) order by 1;

or

select count(*) from (select random() from generate_series(1,1000000) order by 1) ss;

The former will drive psql's memory usage up too, the latter not.

regards, tom lane
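Combining Tom's second query with the earlier psql loop might look like the sketch below. The assumptions are that the integer passed to `SET work_mem` is interpreted as kilobytes (as in the 8.x releases discussed here) and that a database such as `regression` exists to connect to. `sort_query`, `run_sort_loop` and the `PSQL` override are illustrative names, not anything from the thread.

```shell
#!/bin/sh
# Overridable so the loop can be dry-run (e.g. PSQL=true) without a server.
PSQL="${PSQL:-psql}"

# sort_query ROWS WORK_MEM_KB
# Emit SQL that raises work_mem for the session and then sorts ROWS
# random values inside the backend, returning only a count to the client.
sort_query() {
    printf 'SET work_mem = %d; SELECT count(*) FROM (SELECT random() FROM generate_series(1,%d) ORDER BY 1) ss;\n' "$2" "$1"
}

# run_sort_loop DB ITERATIONS ROWS WORK_MEM_KB
# Run the sort in a fresh backend ITERATIONS times.
run_sort_loop() {
    i=0
    while [ "$i" -lt "$2" ]; do
        sort_query "$3" "$4" | "$PSQL" -q -d "$1" > /dev/null || return 1
        i=$((i + 1))
    done
}
```

Something like `run_sort_loop regression 1000 1000000 262144` would start a thousand short-lived backends that each sort a million rows with about 256 MB of work_mem allowed; the numbers are arbitrary and should be sized to the test machine.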
On Oct 1, 2006, at 12:24 PM, Fred Tyler wrote:
> It is not from the exit. I see the exact same problem and I never
> restart postgres and it never crashes. It runs constanty and with no
> crashes for 20-30 days until the box is out of memory and I have to
> reboot.

my theory, which i hope to prove/disprove tonight, is that two things are happening to us:

postgres is slurping all that memory because it should; there's probably a setting on your box that is letting it consume more shared memory than you want.

some issue with the kernel/postgres is not truly freeing the shared memory when the postmaster exits.

in other words, i think the leak isn't in those 20-30 days -- i think that's likely a configuration issue. but i think there is a leak when you stop the daemon.
On 10/1/06, Fred Tyler <fredty8@gmail.com> wrote:
> > However, my machine looses between 500 M and 800 M in two weeks, and
> > within that time, I restart pg only very few times, say 3-4 times.
> > Does pg allocate other shmem blocks? If there is really a kernel memory
> > problem in shmem, how can I loose so much memory?
>
> This is the same thing I am seeing -- 500-1GB memory lost every two
> weeks -- and I don't restart pg at all. So, whatever is causing this
> is not due to pg restarts. I am running the same software as everyone
> else who has had this problem though: Apache, Postgres, and mod_perl
> on Linux (I know there's the guy seeing it on BSD also).

For the record, after upgrading the leaky machine from postgres 7.4.9 to 7.4.13, the memory leak is still very much present. (My original configuration details are quoted below.) From a fresh reboot of the server, with no postgres restarts, crashes, or anything abnormal, I'm seeing a memory loss of around 50-100MB per day.

As things now stand, I have two machines with the exact same software configuration, but one machine (the leaky one) is running kernel 2.6.12.6, and the other machine (the non-leaky one) is running 2.6.16.27. The hardware is slightly different (both servers are Intel, but one is dual core), and the non-leaky machine has about 1/2 the load of the leaky machine, but all evidence right now is pointing to a kernel leak that was fixed somewhere between 2.6.12.6 and 2.6.16.27.
> As I mentioned on the mod_perl list, I'm seeing the loss on a machine
> with ~350 vhosted domains all running a mod_perl CMS:
>
> Apache 1.3.37
> Postgres 7.4.9
> Linux 2.6.12.6
> mod_perl 1.29
>
> However, I am not seeing any loss at all on another machine with ~100
> vhosted domains running the same CMS, but with the following software:
>
> Apache 1.3.37
> Postgres 7.4.13
> Linux 2.6.16.27
> mod_perl 1.29
>
> I cannot be certain that it's not just due to the lighter load (100 vs
> 350 vhosts), but I have not seen a single MB of lost memory on the
> second machine, and am inclined to believe that the problem is fixed
> with that setup (aside from the kernel and postgres, the two machines
> are running the same software).
>
> Tonight I am going to upgrade postgres on the first machine and see if
> it makes any difference, but it'll be about a week before I know for
> sure if memory is still being lost (it's such a slow leak that you
> cannot tell with just a couple days).