Thread: autovacuum launcher process eating up 17G+ of ram?
I have a production server running postgres 8.3.11.  I did a dump all
and loaded it into postgres 9.0.1 on another server.  On the new
server, the postgres autovacuum launcher process eats up an insane
amount of RAM (I have seen 17G virt with 6.5G res).  On the older
version, it's at a reasonable 9MB res after running for a month
straight.

If I change the autovacuum_naptime parameter to 1, I can actually
watch the memory count up in MB with top.  It looks like it grows
faster after a while, with the resident memory going up about 1MB
every 6-8 seconds.  After about 35 minutes, the virt was 455MB and
the res was 296MB.

The only change I made in postgresql.conf (on both versions) is
setting the maximum number of connections to 1000, so all other
options are at the defaults for their respective postgres versions.
Both servers have the same databases and data; the number of
databases is 134.  There are no queries running on the test box with
the new version.  Both are running RHEL 5 64-bit as the operating
system.

Any idea how to fix this?  I would really like to upgrade, but this
is going to be a deal breaker.
Nick <t32@2thebatcave.com> wrote:

> I have a production server running postgres 8.3.11.  I did a dump
> all and loaded up postgres 9.0.1 on another server.  On the new
> server, the postgres autovacuum launcher process eats up an insane
> amount of ram (I have seen 17G virt with 6.5G res).

You're not looking at this in top and adding up the RAM per process,
are you?  That is notoriously unreliable; in particular it tends to
count the shared memory over and over.  What does free say about
total usage?  How do things look with vmstat 1?

> The only change I made in the postgresql.conf (on both versions)
> is setting the maximum number of connections to 1000.

Unless you have four or five hundred cores in the box, that's
probably a bad idea.  You should seriously look at using a
connection pool.  Search the archives for previous discussions on
that topic.

> So all other options are default to their particular postgres
> versions.

Since the defaults are designed to allow the server to come up on
even a very small machine, you should probably be setting several
more.  You might want to start here:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

-Kevin
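[As a rough illustration of the kind of settings the wiki page above
covers -- the values here are generic starting points assumed for a
dedicated box with around 8GB of RAM, not numbers suggested anywhere
in this thread -- a first pass over postgresql.conf often touches
lines like these:]

    # illustrative starting values only; tune for the actual workload
    max_connections = 100           # keep modest and use a connection pooler
    shared_buffers = 2GB            # roughly 25% of RAM on a dedicated server
    effective_cache_size = 6GB      # what the OS page cache is expected to hold
    work_mem = 16MB                 # per sort/hash node, per connection
    maintenance_work_mem = 256MB    # helps VACUUM and index builds
    checkpoint_segments = 16        # spread out checkpoints under write load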
On Thu, Oct 21, 2010 at 3:14 PM, Nick <t32@2thebatcave.com> wrote:
> I have a production server running postgres 8.3.11.  I did a dump all
> and loaded up postgres 9.0.1 on another server.  On the new server,
> the postgres autovacuum launcher process eats up an insane amount of
> ram (I have seen 17G virt with 6.5G res).  On the older version, it's
> at a reasonable 9MB res after running for a month straight.

What's your shared_buffers set to?  What do VIRT, RES and SHR show?
Is SHR really big?  If RES is close to SHR, then don't worry too much.
Yes I was using top, but I am not adding up anything.  I'm only ever
looking at the one process called "postgres: autovacuum launcher
process", not any of the other postgres processes.

Here is a vmstat 1 right after postgres has been started:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free    buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  55480 760172 1709424 3758492    0   12     4   294   23   26  3  2 94  0  0
 0  0  55480 756080 1709460 3762308    0    0     8  6836 1139  442 10  8 79  3  0
 0  0  55480 755088 1709492 3763328    0    0     0     0 1044  409 10  8 83  0  0
 0  0  55480 754096 1709544 3763992    0    0     0    64 1023  405 10  7 83  0  0
 1  0  55480 749696 1709692 3766816    0    0  2028     0 1141  539 11  9 74  7  0
 0  0  55480 746544 1709764 3769900    0    0  2164     0 1128  527 11  7 83  0  0
 1  0  55480 743332 1709812 3773040    0    0  2216  7440 1314  572 10  8 80  2  0
 3  0  55480 739808 1709864 3776044    0    0  2128     0 1170  541 11  8 81  0  0
 0  0  55480 737356 1709928 3779260    0    0  2216     0 1149  552 10  8 82  0  0

Here is another chunk after it's been running for about 18 hours:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd  free  buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 1812168 20332 59452 102172    0   12     5   302    8   19  3  3 93  0  0
 0  0 1812168 19960 59456 102164    0    0     0  3260 1051  540  8  7 85  1  0
 0  0 1812168 19932 59456 102324    0    0     0     0 1022  585  8  8 83  0  0
 1  0 1812168 19056 59456 102620    0    0     0     0 1035  541  7  7 86  0  0
 0  0 1812168 19808 59456 102128    0    0     0     0 1024  572  8  9 84  0  0
 0  1 1812168 20304 59460 102124    0    0     0    28 1035  553  8  8 85  0  0
 1  0 1812168 19428 59476 102312    0    0     0  3152 1031  535  8  6 84  2  0
 0  0 1812168 19808 59476 102168    0    0     0     8 1041  573  8  9 83  0  0
 1  0 1812168 18924 59476 102316    0    0     0     0 1016  526  9  8 84  0  0
 0  0 1812168 18940 59476 102316    0    0     0     0 1042  535  7  8 85  0  0
 0  0 1812168 19560 59476 102168    0    0     0     0 1021  439  8  8 85  0  0
 0  0 1812168 19312 59488 102156    0    0     0  3136 1045  454  7  8 84  1  0
 0  0 1812168 18692 59496 102168    0    0     0  3236 1221  505  9  8 82  2  0
 1  0 1812168 21432 59496 102184    0    0     0     0 1044  498  8  8 85  0  0
 0  0 1812168 22184 59496 102128    0    0     0     0 1019  511  7  6 86  0  0
 0  0 1812168 21936 59496 102168    0    0     0     0 1039  582  8  9 84  0  0
 0  0 1812168 21316 59504 102168    0    0     0  3284 1030  450  8  7 85  1  0
 0  0 1812168 21440 59504 102168    0    0     0     0 1034  466  8  7 85  0  0
 1  0 1812168 20812 59512 102168    0    0     0    16 1021  546  9  7 84  1  0
 0  0 1812168 21316 59512 102168    0    0     0     0 1039  532  7  8 85  0  0

Here is a snapshot of all the postgres processes as listed in top
(29582 = postgres: autovacuum launcher process):

29582 pgsql     15   0 8739m 6.7g 1816 S  8.9 86.1  91:10.44 postgres
29583 pgsql     15   0  108m 7648  440 S 13.3  0.1 143:04.91 postgres
29578 pgsql     15   0  157m 3704 2992 S  0.3  0.0   1:46.29 postgres
29580 pgsql     15   0  157m 1472  760 S  0.0  0.0   0:00.78 postgres
29581 pgsql     15   0  157m 1204  492 S  0.0  0.0   0:00.12 postgres

As you can see, after a while the system is using a lot of swap.  The
first time I noticed the problem, the swap was up to 12G (there is
8GB of ram in this system).  I believe postgres was running for
several days to achieve that much memory utilization.

Thanks for the other thoughts on tuning.  I will investigate those
once I figure out what is up with this memory utilization problem.
Nick <t32@2thebatcave.com> wrote:

> Here is a vmstat 1 right after postgres has been started:
>
> procs -----------memory----------
>  r  b   swpd   free    buff   cache
>  0  0  55480 760172 1709424 3758492

> Here is another chunk after it's been running for about 18 hours:
>
> procs -----------memory----------
>  r  b    swpd  free  buff  cache
>  0  0 1812168 20332 59452 102172

> Here is a snapshot of all the postgres processes as listed in top
> (29582 = postgres: autovacuum launcher process):
>
> 29582 pgsql 15 0 8739m 6.7g 1816 S 8.9 86.1 91:10.44
> postgres

Ouch!  I've not seen anything remotely like that.  I wonder what's
different....

> As you can see, after a while the system is using a lot of swap.
> The first time I noticed the problem, the swap was up to 12G
> (there is 8GB of ram in this system).  I believe postgres was
> running for several days to achieve that much memory utilization.

Is there anything unusual in the logs?

-Kevin
On Fri, Oct 22, 2010 at 11:17 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> I've not seen anything remotely like that.  I wonder what's
> different....

I did a test with the same postgres build (and autovacuum_naptime = 1
as well) on the same OS with only the system databases (postgres,
template0, template1), and after the weekend the memory utilization
was acceptable:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14879 pgsql     15   0  312m  177m 1108 S  0.0  8.8   0:24.43 postgres

However, on the other machine, which has 136 total databases (133
user databases plus the same 3 system ones listed above), the kernel
had to kill off the postgres process because the box ran out of
memory (the box has 8GB of RAM + 30GB of swap), and I can see it
climbing again.  It seems like the problem is a function of how many
databases there are.

> Is there anything unusual in the logs?

I set log_autovacuum_min_duration = 0, but that doesn't even seem to
show that anything is being vacuumed.  The only thing unusual in the
logs was a result of the box running out of memory and the kernel
killing off processes.
Nick <t32@2thebatcave.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>
>> I've not seen anything remotely like that.  I wonder what's
>> different....
>
> I did a test with the same postgres build (and autovacuum_naptime
> = 1 as well) on the same OS with only the system databases
> (postgres, template0, template1), and after the weekend the memory
> utilization was acceptable:

That's probably it.  Our shop normally has just one application
database in a cluster, and I can't think of any clusters with more
than three or four.  It sounds like there might be some sort of
memory leak when a vacuum worker switches databases.

-Kevin
Nick <t32@2thebatcave.com> writes:
> It seems like the problem is a function of how many databases there are.

Yeah, I can reproduce this in HEAD.  Create a hundred or so databases,
and make sure there's a stats table entry for each, eg

    create database d0;
    \c d0
    create table t1 as select generate_series(1,100000) x;
    create database d1;
    \c d1
    create table t1 as select generate_series(1,100000) x;
    create database d2;
    \c d2
    create table t1 as select generate_series(1,100000) x;
    ...

Set autovacuum_naptime to 1sec, and watch it bloat.  (It bloats faster
once it's run through the databases once and isn't mostly waiting for
children.)

It looks to me like the problem is that get_database_list()
intentionally pushes its result into a long-lived context, and then
the result is never cleaned up.  It doesn't help any that it does a
lot of leaky things like heap_beginscan with CurrentMemoryContext
totally unnecessarily pointing at the long-lived context.

I'm not sure what the intent was here, but it's pretty broken as-is.
Alvaro, what did you have in mind?  Should this be using a less
long-lived context, or do we need code to free an avw_dbase list?

            regards, tom lane
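[For anyone wanting to reproduce this without typing the commands by
hand, a throwaway shell loop along the following lines should build
the same setup.  It is only a sketch of Tom's recipe above -- the
database names, table name, and row count are copied from his
example; no such script was posted in the thread.]

#!/bin/sh
# Create ~100 small databases, each with one table that gets a stats
# entry, so the autovacuum launcher has many databases to cycle through.
for i in $(seq 0 99); do
    createdb "d$i"
    psql -d "d$i" -c "create table t1 as select generate_series(1,100000) x;"
done
# Then set autovacuum_naptime = 1 in postgresql.conf, reload, and
# watch the launcher process in top.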
Excerpts from Tom Lane's message of lun oct 25 16:57:10 -0300 2010:

> It looks to me like the problem is that get_database_list()
> intentionally pushes its result into a long-lived context, and then
> the result is never cleaned up.  It doesn't help any that it does a
> lot of leaky things like heap_beginscan with CurrentMemoryContext
> totally unnecessarily pointing at the long-lived context.
>
> I'm not sure what the intent was here, but it's pretty broken as-is.
> Alvaro, what did you have in mind?  Should this be using a less
> long-lived context, or do we need code to free an avw_dbase list?

Sorry for not noticing this earlier.  Seems my mail filters are broken
:-(  Thanks Tom for the ping via private email.

Yes, get_database_list is clearly broken as is.  I don't remember why
it's coded like this; this memcxt shuffling was introduced in the 9.0
rewrite that reads pg_database directly instead of the flat file.

The right way for it to work seems to be to allocate the output list
in the caller's context; both current callers are in a throwaway
context, and so the allocated list will go away automatically when
they're done with it.

Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4
doesn't need to be patched, but I'll run the test case now just to be
sure.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
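[A simplified sketch of the approach Alvaro describes -- capture the
caller's CurrentMemoryContext before starting the internal
transaction, and switch to it only while building each list cell --
for readers who want to see the pattern in code.  This is not the
attached patch: the pg_database scan is abbreviated and the avw_dbase
field assignments are elided.]

static List *
get_database_list(void)
{
    List         *dblist = NIL;
    Relation      rel;
    HeapScanDesc  scan;
    HeapTuple     tup;
    MemoryContext resultcxt;

    /* This is the context the output list must be allocated in. */
    resultcxt = CurrentMemoryContext;

    /* Use a transaction so pg_database can be scanned. */
    StartTransactionCommand();
    (void) GetTransactionSnapshot();

    rel = heap_open(DatabaseRelationId, AccessShareLock);
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);

    while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
    {
        avw_dbase    *avdb;
        MemoryContext oldcxt;

        /*
         * Allocate each entry in the caller's context rather than the
         * transaction's long-lived one; both callers use a throwaway
         * context, so the list is freed automatically when they finish.
         */
        oldcxt = MemoryContextSwitchTo(resultcxt);
        avdb = (avw_dbase *) palloc(sizeof(avw_dbase));
        /* ... copy the needed pg_database fields into avdb here ... */
        dblist = lappend(dblist, avdb);
        MemoryContextSwitchTo(oldcxt);
    }

    heap_endscan(scan);
    heap_close(rel, AccessShareLock);
    CommitTransactionCommand();

    return dblist;
}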
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> need to be patched, but I'll run the test case now just to be sure.)

Possibly s/cxt/resultcxt/, or some other less-generic name.  Seems
pretty sane other than that cosmetic issue.

            regards, tom lane
Excerpts from Tom Lane's message of lun nov 08 12:15:02 -0300 2010:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> > need to be patched, but I'll run the test case now just to be sure.)
>
> Possibly s/cxt/resultcxt/, or some other less-generic name.  Seems
> pretty sane other than that cosmetic issue.

Thanks, committed that way and pushed.  I verified that 8.4 doesn't
have this problem.

Oops, I just noticed that I forgot to credit everyone for the report
and diagnosis in the commit message.  Too late to fix :-(  So: thanks
Nick for the report and Tom for the diagnosis.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support