Thread: autovacuum launcher process eating up 17G+ of ram?

autovacuum launcher process eating up 17G+ of ram?

From: Nick
I have a production server running postgres 8.3.11.  I did a dump all
and loaded up postgres 9.0.1 on another server.  On the new server,
the postgres autovacuum launcher process eats up an insane amount of
ram (I have seen 17G virt with 6.5G res).  On the older version, it's
at a reasonable 9MB res after running for a month straight.

If I change the autovacuum_naptime parameter to 1, I can actually
watch the memory count up in MB in top.  It looks like it grows faster
after a while, with the resident memory going up about 1MB every 6-8
seconds.  After about 35 minutes, virt was 455MB and res was 296MB.

The only change I made in the postgresql.conf (on both versions) is
setting the maximum number of connections to 1000.  So all other
options are at their defaults for their respective postgres versions.
Both servers have the same databases and data; the number of databases
is 134.  There are no queries running on the test box with the new
version.  Both are running 64-bit RHEL 5.

Any idea how to fix this?  I would really like to upgrade but this is
going to be a deal breaker.

Re: autovacuum launcher process eating up 17G+ of ram?

From: "Kevin Grittner"
Nick <t32@2thebatcave.com> wrote:

> I have a production server running postgres 8.3.11.  I did a dump
> all and loaded up postgres 9.0.1 on another server.  On the new
> server, the postgres autovacuum launcher process eats up an insane
> amount of ram (I have seen 17G virt with 6.5G res).

You're not looking at this in top and adding up the RAM per process,
are you?  That is notoriously unreliable; in particular it tends to
count the shared memory over and over.  What does free say about
total usage?  How do things look with vmstat 1?

> The only change I made in the postgresql.conf (on both versions)
> is setting the maximum number of connections to 1000.

Unless you have four or five hundred cores in the box, that's
probably a bad idea.  You should seriously look at using a
connection pool.  Search the archives for previous discussions on
that topic.

> So all other options are at their defaults for their respective
> postgres versions.

Since the defaults are designed to allow the server to come up on
even a very small machine, you should probably be setting several
more.  You might want to start here:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

-Kevin

Re: autovacuum launcher process eating up 17G+ of ram?

From: Scott Marlowe
On Thu, Oct 21, 2010 at 3:14 PM, Nick <t32@2thebatcave.com> wrote:
> I have a production server running postgres 8.3.11.  I did a dump all
> and loaded up postgres 9.0.1 on another server.  On the new server,
> the postgres autovacuum launcher process eats up an insane amount of
> ram (I have seen 17G virt with 6.5G res).  On the older version, it's
> at a reasonable 9MB res after running for a month straight.

What's your shared_buffers set to?

What are VIRT, RES and SHR for that process?  Is SHR really big?  If
RES is close to SHR, then don't worry too much.

Re: autovacuum launcher process eating up 17G+ of ram?

From: Nick
Yes, I was using top, but I'm not adding anything up.  I'm only ever
looking at the one process called "postgres: autovacuum launcher
process", not any of the other postgres processes.

Here is a vmstat 1 right after postgres has been started:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  55480 760172 1709424 3758492    0   12     4   294   23   26  3  2 94  0  0
 0  0  55480 756080 1709460 3762308    0    0     8  6836 1139  442 10  8 79  3  0
 0  0  55480 755088 1709492 3763328    0    0     0     0 1044  409 10  8 83  0  0
 0  0  55480 754096 1709544 3763992    0    0     0    64 1023  405 10  7 83  0  0
 1  0  55480 749696 1709692 3766816    0    0  2028     0 1141  539 11  9 74  7  0
 0  0  55480 746544 1709764 3769900    0    0  2164     0 1128  527 11  7 83  0  0
 1  0  55480 743332 1709812 3773040    0    0  2216  7440 1314  572 10  8 80  2  0
 3  0  55480 739808 1709864 3776044    0    0  2128     0 1170  541 11  8 81  0  0
 0  0  55480 737356 1709928 3779260    0    0  2216     0 1149  552 10  8 82  0  0

Here is another chunk after it's been running for about 18 hours:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 1812168  20332  59452 102172    0   12     5   302    8   19  3  3 93  0  0
 0  0 1812168  19960  59456 102164    0    0     0  3260 1051  540  8  7 85  1  0
 0  0 1812168  19932  59456 102324    0    0     0     0 1022  585  8  8 83  0  0
 1  0 1812168  19056  59456 102620    0    0     0     0 1035  541  7  7 86  0  0
 0  0 1812168  19808  59456 102128    0    0     0     0 1024  572  8  9 84  0  0
 0  1 1812168  20304  59460 102124    0    0     0    28 1035  553  8  8 85  0  0
 1  0 1812168  19428  59476 102312    0    0     0  3152 1031  535  8  6 84  2  0
 0  0 1812168  19808  59476 102168    0    0     0     8 1041  573  8  9 83  0  0
 1  0 1812168  18924  59476 102316    0    0     0     0 1016  526  9  8 84  0  0
 0  0 1812168  18940  59476 102316    0    0     0     0 1042  535  7  8 85  0  0
 0  0 1812168  19560  59476 102168    0    0     0     0 1021  439  8  8 85  0  0
 0  0 1812168  19312  59488 102156    0    0     0  3136 1045  454  7  8 84  1  0
 0  0 1812168  18692  59496 102168    0    0     0  3236 1221  505  9  8 82  2  0
 1  0 1812168  21432  59496 102184    0    0     0     0 1044  498  8  8 85  0  0
 0  0 1812168  22184  59496 102128    0    0     0     0 1019  511  7  6 86  0  0
 0  0 1812168  21936  59496 102168    0    0     0     0 1039  582  8  9 84  0  0
 0  0 1812168  21316  59504 102168    0    0     0  3284 1030  450  8  7 85  1  0
 0  0 1812168  21440  59504 102168    0    0     0     0 1034  466  8  7 85  0  0
 1  0 1812168  20812  59512 102168    0    0     0    16 1021  546  9  7 84  1  0
 0  0 1812168  21316  59512 102168    0    0     0     0 1039  532  7  8 85  0  0


Here is a snapshot of all the postgres processes as listed in top
(29582 = postgres: autovacuum launcher process):

29582 pgsql     15   0 8739m 6.7g 1816 S  8.9 86.1  91:10.44 postgres
29583 pgsql     15   0  108m 7648  440 S 13.3  0.1 143:04.91 postgres
29578 pgsql     15   0  157m 3704 2992 S  0.3  0.0   1:46.29 postgres
29580 pgsql     15   0  157m 1472  760 S  0.0  0.0   0:00.78 postgres
29581 pgsql     15   0  157m 1204  492 S  0.0  0.0   0:00.12 postgres

As you can see, after a while the system is using a lot of swap.  The
first time I noticed the problem, swap usage was up to 12G (there is
8GB of ram in this system).  I believe postgres had been running for
several days to reach that much memory use.


Thanks for the other thoughts on tuning.  I will investigate those
once I figure out what is up with this memory utilization problem.


On Thu, Oct 21, 2010 at 2:59 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Nick <t32@2thebatcave.com> wrote:
>
>> I have a production server running postgres 8.3.11.  I did a dump
>> all and loaded up postgres 9.0.1 on another server.  On the new
>> server, the postgres autovacuum launcher process eats up an insane
>> amount of ram (I have seen 17G virt with 6.5G res).
>
> You're not looking at this in top and adding up the RAM per process,
> are you?  That is notoriously unreliable; in particular it tends to
> count the shared memory over and over.  What does free say about
> total usage?  How do things look with vmstat 1?
>
>> The only change I made in the postgresql.conf (on both versions)
>> is setting the maximum number of connections to 1000.
>
> Unless you have four or five hundred cores in the box, that's
> probably a bad idea.  You should seriously look at using a
> connection pool.  Search the archives for previous discussions on
> that topic.
>
>> So all other options are at their defaults for their respective
>> postgres versions.
>
> Since the defaults are designed to allow the server to come up on
> even a very small machine, you should probably be setting several
> more.  You might want to start here:
>
> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
>
> -Kevin
>

Re: autovacuum launcher process eating up 17G+ of ram?

From: "Kevin Grittner"
Nick <t32@2thebatcave.com> wrote:

> Here is a vmstat 1 right after postgres has been started:
>
> procs -----------memory----------
>  r  b   swpd   free   buff  cache
>  0  0  55480 760172 1709424 3758492

> Here is another chunk after it's been running for about 18 hours:
>
> procs -----------memory----------
>  r  b   swpd   free   buff  cache
>  0  0 1812168  20332  59452 102172

> Here is a snapshot of all the postgres processes as listed in top
> (29582 = postgres: autovacuum launcher process):
>
> 29582 pgsql     15   0 8739m 6.7g 1816 S  8.9 86.1  91:10.44 postgres

Ouch!

I've not seen anything remotely like that.  I wonder what's
different....

> As you can see, after a while the system is using a lot of swap.
> The first time I noticed the problem, the swap was up to 12G
> (there is 8GB of ram in this system).  I believe postgres was
> running for several days to achieve that much memory utilization.

Is there anything unusual in the logs?

-Kevin

Re: autovacuum launcher process eating up 17G+ of ram?

From: Nick
On Fri, Oct 22, 2010 at 11:17 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> I've not seen anything remotely like that.  I wonder what's
> different....

I did a test with the same postgres build (and autovacuum_naptime = 1
as well) on the same OS with only the system databases (postgres,
template0, template1), and after the weekend the memory utilization
was acceptable:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14879 pgsql     15   0  312m 177m 1108 S  0.0  8.8   0:24.43 postgres


However, on the other machine, which has 136 databases in total (133
user databases plus the same three system ones listed above), the
kernel had to kill the postgres process because the box ran out of
memory (the box has 8GB of ram + 30GB of swap), and I can see the
usage climbing again.

It seems like the problem is a function of how many databases there are.


> Is there anything unusual in the logs?

I set log_autovacuum_min_duration = 0, but that doesn't even show that
anything is being vacuumed.  The only unusual entries in the logs came
from the box running out of memory and the kernel killing off
processes.

Re: autovacuum launcher process eating up 17G+ of ram?

From: "Kevin Grittner"
Nick <t32@2thebatcave.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>
>> I've not seen anything remotely like that.  I wonder what's
>> different....
>
> I did a test with the same postgres build (and autovacuum_naptime
> = 1 as well) on the same OS with only the system databases
> (postgres, template0, template1), and after the weekend the memory
> utilization was acceptable:

That's probably it.  Our shop normally has just one application
database in a cluster, and I can't think of any clusters with more
than three or four.  It sounds like there might be some sort of
memory leak when a vacuum worker switches databases.

-Kevin

Re: autovacuum launcher process eating up 17G+ of ram?

From: Tom Lane
Nick <t32@2thebatcave.com> writes:
> It seems like the problem is a function of how many databases there are.

Yeah, I can reproduce this in HEAD.  Create a hundred or so databases,
and make sure there's a stats table entry for each, e.g.

create database d0;
\c d0
create table t1 as select generate_series(1,100000) x;
create database d1;
\c d1
create table t1 as select generate_series(1,100000) x;
create database d2;
\c d2
create table t1 as select generate_series(1,100000) x;
...

Set autovacuum_naptime to 1sec, and watch it bloat.  (It bloats faster
once it's run through the databases once and isn't mostly waiting for
children.)
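
To script that repro end to end, something like the following works; a
minimal sketch using libpq, assuming a server on the default socket and
a role allowed to create databases (the d0..d99 names and the t1 table
follow the script above).  Build with something like
cc repro.c -o repro -lpq:

/*
 * repro.c: create 100 databases, each with one table, so every
 * database gets a stats entry for the autovacuum launcher to track.
 */
#include <stdio.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int
main(void)
{
    int         i;

    for (i = 0; i < 100; i++)
    {
        char        sql[128];
        char        conninfo[64];
        PGconn     *conn;

        /* create database d<i> from the maintenance database */
        conn = PQconnectdb("dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            return 1;
        }
        snprintf(sql, sizeof(sql), "CREATE DATABASE d%d", i);
        run(conn, sql);
        PQfinish(conn);

        /* connect to the new database and give it a table */
        snprintf(conninfo, sizeof(conninfo), "dbname=d%d", i);
        conn = PQconnectdb(conninfo);
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            return 1;
        }
        run(conn, "CREATE TABLE t1 AS SELECT generate_series(1,100000) x");
        PQfinish(conn);
    }
    return 0;
}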

It looks to me like the problem is that get_database_list()
intentionally pushes its result into a long-lived context, and then the
result is never cleaned up.  It doesn't help any that it does a lot of
leaky things like heap_beginscan with CurrentMemoryContext totally
unnecessarily pointing at the long-lived context.

I'm not sure what the intent was here, but it's pretty broken as-is.
Alvaro, what did you have in mind?  Should this be using a less
long-lived context, or do we need code to free an avw_dbase list?

            regards, tom lane

Re: autovacuum launcher process eating up 17G+ of ram?

From: Alvaro Herrera
Excerpts from Tom Lane's message of Mon Oct 25 16:57:10 -0300 2010:

> It looks to me like the problem is that get_database_list()
> intentionally pushes its result into a long-lived context, and then the
> result is never cleaned up.  It doesn't help any that it does a lot of
> leaky things like heap_beginscan with CurrentMemoryContext totally
> unnecessarily pointing at the long-lived context.
>
> I'm not sure what the intent was here, but it's pretty broken as-is.
> Alvaro, what did you have in mind?  Should this be using a less
> long-lived context, or do we need code to free an avw_dbase list?

Sorry for not noticing this earlier.  Seems my mail filters are broken
:-(  Thanks Tom for the ping via private email.

Yes, get_database_list is clearly broken as is.  I don't remember why
it's coded like this; this memcxt shuffling was introduced in the 9.0
rewrite to read pg_database directly instead of the flat file.

The right way for it to work seems to be to allocate the output list in
the caller's context; both current callers are in a throwaway context and
so the allocated list will go away automatically when they're done with
the list.

Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
need to be patched, but I'll run the test case now just to be sure.)
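
For readers without the attachment, here is a sketch of what that
approach looks like in 9.0-era autovacuum.c.  It is reconstructed from
the description above rather than copied from the patch, and the
resultcxt name is an assumption:

/*
 * Sketch: get_database_list() builds its avw_dbase list in the
 * caller's memory context instead of a long-lived one.  This lives in
 * src/backend/postmaster/autovacuum.c and leans on backend
 * infrastructure (heap_beginscan, palloc, List), so it is not
 * standalone code.
 */
static List *
get_database_list(void)
{
    List       *dblist = NIL;
    MemoryContext resultcxt = CurrentMemoryContext; /* caller's context */
    Relation    rel;
    HeapScanDesc scan;
    HeapTuple   tup;

    /* run the pg_database scan inside a short-lived transaction */
    StartTransactionCommand();
    (void) GetTransactionSnapshot();

    rel = heap_open(DatabaseRelationId, AccessShareLock);
    scan = heap_beginscan(rel, SnapshotNow, 0, NULL);

    while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
    {
        Form_pg_database pgdatabase = (Form_pg_database) GETSTRUCT(tup);
        avw_dbase  *avdb;
        MemoryContext oldcxt;

        /*
         * Switch to the caller's context only while building the result
         * entry, so anything leaky things like heap_getnext() allocate
         * lands in the transaction's context and dies with it.
         */
        oldcxt = MemoryContextSwitchTo(resultcxt);
        avdb = (avw_dbase *) palloc(sizeof(avw_dbase));
        avdb->adw_datid = HeapTupleGetOid(tup);
        avdb->adw_name = pstrdup(NameStr(pgdatabase->datname));
        avdb->adw_frozenxid = pgdatabase->datfrozenxid;
        avdb->adw_entry = NULL;     /* filled in later */
        dblist = lappend(dblist, avdb);
        MemoryContextSwitchTo(oldcxt);
    }

    heap_endscan(scan);
    heap_close(rel, AccessShareLock);
    CommitTransactionCommand();

    return dblist;
}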

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum launcher process eating up 17G+ of ram?

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> need to be patched, but I'll run the test case now just to be sure.)

Possibly s/cxt/resultcxt/, or some other less-generic name.  Seems
pretty sane other than that cosmetic issue.

            regards, tom lane

Re: autovacuum launcher process eating up 17G+ of ram?

From: Alvaro Herrera
Excerpts from Tom Lane's message of Mon Nov 08 12:15:02 -0300 2010:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Patch for this is attached (applies cleanly to 9.0 and HEAD; 8.4 doesn't
> > need to be patched, but I'll run the test case now just to be sure.)
>
> Possibly s/cxt/resultcxt/, or some other less-generic name.  Seems
> pretty sane other than that cosmetic issue.

Thanks, committed that way and pushed.  I verified that 8.4 doesn't have
this problem.

Oops, I just noticed that I forgot to credit everyone for the report and
diagnosis in the commit message.  Too late to fix :-(

So: thanks Nick for the report and Tom for the diagnosis.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support