Thread: Heavy virtual memory usage on production system

Heavy virtual memory usage on production system

From: Alexander Stanier
Date:
We are having terrible performance issues with a production instance of
PostgreSQL version 7.4.5, but are struggling to work out which parameters
in postgresql.conf we should change.

Our database server is an Apple G5 (2 x 2GHz CPU, 2GB RAM). The
operating system is Mac OS X 10.3.

The database seems to be fine to start with, but then as the load increases
it seems to reach a threshold where the number of non-idle queries in
pg_stat_activity grows heavily and we appear to get something similar to
a motorway tailback, with up to perhaps 140 queries awaiting processing.
At the same time the virtual memory usage (reported by the OS) appears
to grow heavily too (sometimes up to 50GB). The CPUs do not seem to be
working overly hard, nor do the disks, and the memory monitor reports
about 600MB of inactive memory. Once in this situation, the database
never catches up with itself and the only way to put it back on an even
keel is to stop the application and restart the database.

The system memory settings are:

kern.sysv.shmmax: 536870912
kern.sysv.shmmin: 1
kern.sysv.shmmni: 4096
kern.sysv.shmseg: 4096
kern.sysv.shmall: 131072

We have set the process and open-file limits to unlimited for the user
running PostgreSQL (which on this system means a maximum of 2048
processes and 12288 open files).

Non-default postgresql.conf parameters are:

tcpip_socket = true
max_connections = 500
unix_socket_directory = '/Local/PostgreSQL'
shared_buffers = 8192           # min 16, at least max_connections*2, 8KB each
sort_mem = 2048                 # min 64, size in KB
wal_buffers = 32                # min 4, 8KB each
effective_cache_size = 100000   # typically 8KB each
random_page_cost = 2            # units are one sequential page fetch cost
log_min_error_statement = info
log_duration = true
log_pid = true
log_statement = true
log_timestamp = true
stats_command_string = true

although on the last restart I changed the following (since the current
config clearly isn't working):

shared_buffers = 16384          # min 16, at least max_connections*2, 8KB each
effective_cache_size = 10000   # typically 8KB each

We don't know whether these have helped yet - but we should get a good
idea around 10am tomorrow morning.

We currently have the application limited to a maximum of 40 concurrent
connections to the database.
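
For what it's worth, my rough arithmetic on what those settings mean in
terms of memory (assuming the standard 8KB page size, and assuming I have
understood the documentation correctly) is:

shared_buffers = 16384        # 16384 * 8KB = 128MB of shared memory,
                              # well within kern.sysv.shmmax (512MB)
sort_mem = 2048               # 2MB per sort operation, per backend; a
                              # complex query can run several sorts, so 40
                              # busy backends could use a few hundred MB
effective_cache_size = 10000  # planner hint only (~80MB); nothing is
                              # actually allocated for this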

Our application produces a fairly varied mix of queries, some quite
complex and plenty of them. We seem to average about 400,000 queries per
hour. At first I thought it might be one or two inefficient queries
blocking the CPUs, but the CPUs don't seem to be very stretched. My guess
is that we have our PostgreSQL memory settings wrong; however, there is
lots of conflicting advice about what to set (anywhere from 1000 to
100000 shared buffers).

Does this heavy use of VM and the query tailback indicate which memory
settings are wrong? Presumably if there are 140 queries in
pg_stat_activity then PostgreSQL will be trying to service all of these
queries at once? I also presume that if VM usage is high then we are
paging a vast amount to disk, but I am not sure why.
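
(For reference, the figure of 140 comes from counting the non-idle rows
in pg_stat_activity with something like the query below; we have
stats_command_string enabled, so current_query is populated.)

-- count backends that are actually executing something
SELECT count(*)
FROM pg_stat_activity
WHERE current_query NOT LIKE '<IDLE>%';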

Has anyone seen this behaviour before and can anyone point me in the
right direction?

Regards,

Alexander Stanier


Re: Heavy virtual memory usage on production system

From: Tom Lane
Date:
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> The database seems to be fine to start with, but then as the load increases
> it seems to reach a threshold where the number of non-idle queries in
> pg_stat_activity grows heavily and we appear to get something similar to
> a motorway tailback, with up to perhaps 140 queries awaiting processing.
> At the same time the virtual memory usage (reported by the OS) appears
> to grow heavily too (sometimes up to 50GB). The CPUs do not seem to be
> working overly hard, nor do the disks, and the memory monitor reports
> about 600MB of inactive memory.

You shouldn't be putting a lot of credence in the virtual memory usage
then, methinks.  Some versions of top count the Postgres shared memory
against *each* backend process, leading to a wildly inflated figure for
total memory used.  I'd suggest watching the output of "vmstat 1" (or
local equivalent) to observe whether there's any significant amount of
swapping going on; if not, excessive memory usage isn't the problem.

Are you sure that the problem isn't at the level of some query taking an
exclusive lock and then sitting on it?  I would expect either CPU or
disk bandwidth or both to be saturated if you were having a conventional
resource limitation problem.  Again, comparing vmstat readings during
normal and slow response conditions would be instructive.
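
One easy check is whether anything in pg_locks is sitting ungranted;
something along these lines would show any blocked lock requests:

-- any rows here mean a backend is waiting on a lock
SELECT pid, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE NOT granted;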

            regards, tom lane

Re: Heavy virtual memory usage on production system

From: Alexander Stanier
Date:
The problem happened again this morning and I took the chance to check
out the locking situation. The number of locks increased dramatically up
to over 1000, but they were all "AccessShareLocks" and all were granted.
The odd "RowExclusiveLock" appeared but none persisted. On the basis
that nothing seems to be waiting for a lock, I don't think it is a
locking problem. I think the vast number of locks is a symptom of the fact
that the server is trying to service a vast number of requests.
Eventually, the server seemed to catch up with itself - the CPU went up,
the VM went down and the number of queries in pg_stat_activity reduced.
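
(For the record, the sort of thing I was running against pg_locks was
roughly:

-- breakdown of current locks by mode and whether they are granted
SELECT mode, granted, count(*)
FROM pg_locks
GROUP BY mode, granted;

which is how I could see that they were all granted AccessShareLocks.)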

The problem then occurred a second time and there seemed to be a lot of
pageouts and pageins going on, but I was only looking at top so it was
difficult to tell. I have now restarted with a statement_timeout of 2
mins to protect the server from poorly performing queries (fairly brutal
- but it does at least stop the downward spiral). I have also reduced
the sort_mem to 1024. I guess it could be that we simply need more
memory in the server.
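
In postgresql.conf terms, those two changes amount to:

statement_timeout = 120000      # in milliseconds, i.e. 2 minutes
sort_mem = 1024                 # in KB, down from 2048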

I have got vmstat (vm_stat on Mac) running and I will watch the
behaviour...

Regards, Alex Stanier.

Tom Lane wrote:

>Alexander Stanier <alexander.stanier@egsgroup.com> writes:
>
>
>>The database seems to be fine to start with, but then as the load increases
>>it seems to reach a threshold where the number of non-idle queries in
>>pg_stat_activity grows heavily and we appear to get something similar to
>>a motorway tailback, with up to perhaps 140 queries awaiting processing.
>>At the same time the virtual memory usage (reported by the OS) appears
>>to grow heavily too (sometimes up to 50GB). The CPUs do not seem to be
>>working overly hard, nor do the disks, and the memory monitor reports
>>about 600MB of inactive memory.
>>
>>
>
>You shouldn't be putting a lot of credence in the virtual memory usage
>then, methinks.  Some versions of top count the Postgres shared memory
>against *each* backend process, leading to a wildly inflated figure for
>total memory used.  I'd suggest watching the output of "vmstat 1" (or
>local equivalent) to observe whether there's any significant amount of
>swapping going on; if not, excessive memory usage isn't the problem.
>
>Are you sure that the problem isn't at the level of some query taking an
>exclusive lock and then sitting on it?  I would expect either CPU or
>disk bandwidth or both to be saturated if you were having a conventional
>resource limitation problem.  Again, comparing vmstat readings during
>normal and slow response conditions would be instructive.
>
>            regards, tom lane
>
>
>

Re: Heavy virtual memory usage on production system

From: Tom Lane
Date:
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> The problem happened again this morning and I took the chance to check
> out the locking situation. The number of locks increased dramatically up
> to over 1000, but they were all "AccessShareLocks" and all were granted.
> The odd "RowExclusiveLock" appeared but none persisted. On the basis
> that nothing seems to be waiting for a lock, I don't think it is a
> locking problem.

Hmm.  How many active processes were there, and how many locks per
process?  (A quick "SELECT pid, count(*) GROUP BY pid" query should give
you this info next time.)  We just recently got rid of some O(N^2)
behavior in the lock manager for cases where a single backend holds many
different locks.  So if there's a single query acquiring a whole lot of
locks, that could possibly have something to do with this.
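
(Spelled out, that quick pg_locks query would be something like:

-- locks currently held, per backend, busiest first
SELECT pid, count(*)
FROM pg_locks
GROUP BY pid
ORDER BY count(*) DESC;

)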

            regards, tom lane

Re: Heavy virtual memory usage on production system

From: Alexander Stanier
Date:
Looks as though there are several processes which are acquiring a load
of locks:

pid  | count
------+-------
 3193 |     2
 3192 |     9
 3191 |     7
 3190 |     3
 3189 |     2
 3188 |     3
 3187 |     3
 3186 |     3
 3185 |     3
 3184 |     3
 3183 |     3
 3182 |    13
 3181 |     3
 3179 |    10
 3175 |    13
 3174 |     2
 3173 |    10
 2917 |     3
 3153 |     8
 3150 |     8
 3149 |     8
 3146 |     9
 3145 |     8
 3144 |     8
 3143 |     9
 3142 |     3
 3141 |    10
 3127 |     8
 3125 |    13
 3124 |    13
 3121 |     8
 3118 |     8
 3114 |     8
 3113 |     8
 3110 |     8
 3106 |     8
 3104 |     9
 3102 |     8
 3100 |    13
 2314 |     2
(40 rows)

I guess it might be worth us getting this server up to PostgreSQL 8.0.3.
At least we can then discount that as a problem.

Regards, Alex Stanier.

Tom Lane wrote:

>Alexander Stanier <alexander.stanier@egsgroup.com> writes:
>
>
>>The problem happened again this morning and I took the chance to check
>>out the locking situation. The number of locks increased dramatically up
>>to over 1000, but they were all "AccessShareLocks" and all were granted.
>>The odd "RowExclusiveLock" appeared but none persisted. On the basis
>>that nothing seems to be waiting for a lock, I don't think it is a
>>locking problem.
>>
>>
>
>Hmm.  How many active processes were there, and how many locks per
>process?  (A quick "SELECT pid, count(*) GROUP BY pid" query should give
>you this info next time.)  We just recently got rid of some O(N^2)
>behavior in the lock manager for cases where a single backend holds many
>different locks.  So if there's a single query acquiring a whole lot of
>locks, that could possibly have something to do with this.
>
>            regards, tom lane
>
>
>

Re: Heavy virtual memory usage on production system

From: Tom Lane
Date:
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> Looks as though there are several processes which are acquiring a load
> of locks:

13 locks isn't "a load".  I was worried about scenarios in which a
single process might take hundreds or thousands of locks; it doesn't
look like you have that.

            regards, tom lane