Thread: Heavy virtual memory usage on production system
We are having terrible performance issues with a production instance of PostgreSQL 7.4.5 and are struggling to decide which parameters in postgresql.conf to change. Our database server is an Apple G5 (2 x 2GHz CPUs, 2GB RAM) running Mac OS X 10.3.

The database seems fine to start with, but as the load increases it appears to reach a threshold where the number of non-idle queries in pg_stat_activity grows heavily and we get something like a motorway tailback, with up to perhaps 140 queries awaiting processing. At the same time the virtual memory usage reported by the OS grows heavily too (sometimes up to 50GB). The CPUs do not seem to be working overly hard, nor do the disks, and the memory monitor reports about 600MB of inactive memory. Once in this situation the database never catches up with itself, and the only way to put it back on an even keel is to stop the application and restart the database.

The system memory settings are:

    kern.sysv.shmmax: 536870912
    kern.sysv.shmmin: 1
    kern.sysv.shmmni: 4096
    kern.sysv.shmseg: 4096
    kern.sysv.shmall: 131072

We have set the number of processes and open files to unlimited for the user running PostgreSQL (which on this platform means a maximum of 2048 processes and 12288 open files).

Non-default postgresql.conf parameters are:

    tcpip_socket = true
    max_connections = 500
    unix_socket_directory = '/Local/PostgreSQL'
    shared_buffers = 8192           # min 16, at least max_connections*2, 8KB each
    sort_mem = 2048                 # min 64, size in KB
    wal_buffers = 32                # min 4, 8KB each
    effective_cache_size = 100000   # typically 8KB each
    random_page_cost = 2            # units are one sequential page fetch cost
    log_min_error_statement = info  # values in order of increasing severity
    log_duration = true
    log_pid = true
    log_statement = true
    log_timestamp = true
    stats_command_string = true

although on the last restart I changed the following (since the current config clearly isn't working):

    shared_buffers = 16384          # min 16, at least max_connections*2, 8KB each
    effective_cache_size = 10000    # typically 8KB each

We don't know whether these have helped yet, but we should get a good idea around 10am tomorrow morning. We currently have the application limited to a maximum of 40 concurrent connections to the database.

Our application produces a fairly varied mix of queries, some quite complex and plenty of them; we average about 400,000 queries per hour. At first I thought it might be one or two inefficient queries blocking the CPUs, but the CPUs don't seem to be very stretched. My guess is that we have our PostgreSQL memory settings wrong, but there is a lot of conflicting advice about what to set (anywhere from 1000 to 100000 shared buffers).

Does this heavy use of VM and the query tailback indicate which memory settings are wrong? Presumably if there are 140 queries in pg_stat_activity then PostgreSQL will be trying to service all of these queries at once? I also presume that if VM usage is high then we are paging a vast amount to disk, but I am not sure why.

Has anyone seen this behaviour before, and can anyone point me in the right direction?

Regards, Alexander Stanier
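P.S. For reference, the count of queries "awaiting processing" above comes from counting the non-idle rows in pg_stat_activity; a rough sketch of the sort of query we run (assuming the 7.4 layout, where stats_command_string = true makes idle backends report '<IDLE>' in current_query) is:

    -- count backends that are actually running something
    SELECT count(*)
    FROM pg_stat_activity
    WHERE current_query NOT LIKE '<IDLE>%';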
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> The database seems fine to start with, but as the load increases it
> appears to reach a threshold where the number of non-idle queries in
> pg_stat_activity grows heavily and we get something like a motorway
> tailback, with up to perhaps 140 queries awaiting processing. At the
> same time the virtual memory usage reported by the OS grows heavily
> too (sometimes up to 50GB). The CPUs do not seem to be working overly
> hard, nor do the disks, and the memory monitor reports about 600MB of
> inactive memory.

You shouldn't be putting a lot of credence in the virtual memory usage then, methinks. Some versions of top count the Postgres shared memory against *each* backend process, leading to a wildly inflated figure for total memory used. I'd suggest watching the output of "vmstat 1" (or local equivalent) to observe whether there's any significant amount of swapping going on; if not, excessive memory usage isn't the problem.

Are you sure that the problem isn't at the level of some query taking an exclusive lock and then sitting on it? I would expect either CPU or disk bandwidth or both to be saturated if you were having a conventional resource limitation problem. Again, comparing vmstat readings during normal and slow response conditions would be instructive.

			regards, tom lane
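P.S. A quick way to check the "sitting on an exclusive lock" theory is to look for lock requests that have not been granted; something along these lines should do it (a sketch assuming the 7.4 pg_locks layout):

    -- show any backend waiting on a lock it has not yet been granted
    SELECT pid, mode, relation::regclass AS rel
    FROM pg_locks
    WHERE NOT granted;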
The problem happened again this morning and I took the chance to check out the locking situation. The number of locks increased dramatically, up to over 1000, but they were all "AccessShareLocks" and all were granted. The odd "RowExclusiveLock" appeared but none persisted. On the basis that nothing seems to be waiting for a lock, I don't think it is a locking problem. I think the vast number of locks is a symptom of the fact that the server is trying to service a vast number of requests.

Eventually, the server seemed to catch up with itself: the CPU went up, the VM went down and the number of queries in pg_stat_activity reduced. The problem then occurred a second time, and there seemed to be a lot of pageouts and pageins going on, but I was only looking at top so it was difficult to tell.

I have now restarted with a statement_timeout of 2 mins to protect the server from poorly performing queries (fairly brutal, but it does at least stop the downward spiral). I have also reduced sort_mem to 1024. I guess it could be that we simply need more memory in the server.

I have got vmstat (vm_stat on Mac) running and I will watch the behaviour...

Regards, Alex Stanier.

Tom Lane wrote:

>Alexander Stanier <alexander.stanier@egsgroup.com> writes:
>
>>The database seems fine to start with, but as the load increases it
>>appears to reach a threshold where the number of non-idle queries in
>>pg_stat_activity grows heavily and we get something like a motorway
>>tailback, with up to perhaps 140 queries awaiting processing. At the
>>same time the virtual memory usage reported by the OS grows heavily
>>too (sometimes up to 50GB). The CPUs do not seem to be working overly
>>hard, nor do the disks, and the memory monitor reports about 600MB of
>>inactive memory.
>
>You shouldn't be putting a lot of credence in the virtual memory usage
>then, methinks. Some versions of top count the Postgres shared memory
>against *each* backend process, leading to a wildly inflated figure for
>total memory used. I'd suggest watching the output of "vmstat 1" (or
>local equivalent) to observe whether there's any significant amount of
>swapping going on; if not, excessive memory usage isn't the problem.
>
>Are you sure that the problem isn't at the level of some query taking an
>exclusive lock and then sitting on it? I would expect either CPU or
>disk bandwidth or both to be saturated if you were having a conventional
>resource limitation problem. Again, comparing vmstat readings during
>normal and slow response conditions would be instructive.
>
>			regards, tom lane
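P.S. For anyone wanting to apply the same stopgap, the two settings I changed look roughly like this in postgresql.conf (exact values assumed; statement_timeout is in milliseconds, sort_mem in KB):

    statement_timeout = 120000   # cancel any statement running longer than 2 minutes
    sort_mem = 1024              # per-sort memory, down from 2048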
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> The problem happened again this morning and I took the chance to check
> out the locking situation. The number of locks increased dramatically,
> up to over 1000, but they were all "AccessShareLocks" and all were
> granted. The odd "RowExclusiveLock" appeared but none persisted. On the
> basis that nothing seems to be waiting for a lock, I don't think it is
> a locking problem.

Hmm. How many active processes were there, and how many locks per process? (A quick "SELECT pid, count(*) GROUP BY pid" query should give you this info next time.) We just recently got rid of some O(N^2) behavior in the lock manager for cases where a single backend holds many different locks. So if there's a single query acquiring a whole lot of locks, that could possibly have something to do with this.

			regards, tom lane
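P.S. Spelled out in full, the query sketched above would presumably be run against pg_locks, something like:

    -- locks currently held or requested, per backend
    SELECT pid, count(*)
    FROM pg_locks
    GROUP BY pid
    ORDER BY count(*) DESC;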
Looks as though there are several processes which are acquiring a load of locks:

     pid  | count
    ------+-------
     3193 |     2
     3192 |     9
     3191 |     7
     3190 |     3
     3189 |     2
     3188 |     3
     3187 |     3
     3186 |     3
     3185 |     3
     3184 |     3
     3183 |     3
     3182 |    13
     3181 |     3
     3179 |    10
     3175 |    13
     3174 |     2
     3173 |    10
     2917 |     3
     3153 |     8
     3150 |     8
     3149 |     8
     3146 |     9
     3145 |     8
     3144 |     8
     3143 |     9
     3142 |     3
     3141 |    10
     3127 |     8
     3125 |    13
     3124 |    13
     3121 |     8
     3118 |     8
     3114 |     8
     3113 |     8
     3110 |     8
     3106 |     8
     3104 |     9
     3102 |     8
     3100 |    13
     2314 |     2
    (40 rows)

I guess it might be worth us getting this server up to PostgreSQL 8.0.3. At least we can then discount that as a problem.

Regards, Alex Stanier.

Tom Lane wrote:

>Alexander Stanier <alexander.stanier@egsgroup.com> writes:
>
>>The problem happened again this morning and I took the chance to check
>>out the locking situation. The number of locks increased dramatically,
>>up to over 1000, but they were all "AccessShareLocks" and all were
>>granted. The odd "RowExclusiveLock" appeared but none persisted. On the
>>basis that nothing seems to be waiting for a lock, I don't think it is
>>a locking problem.
>
>Hmm. How many active processes were there, and how many locks per
>process? (A quick "SELECT pid, count(*) GROUP BY pid" query should give
>you this info next time.) We just recently got rid of some O(N^2)
>behavior in the lock manager for cases where a single backend holds many
>different locks. So if there's a single query acquiring a whole lot of
>locks, that could possibly have something to do with this.
>
>			regards, tom lane
Alexander Stanier <alexander.stanier@egsgroup.com> writes:
> Looks as though there are several processes which are acquiring a load
> of locks:

13 locks isn't "a load". I was worried about scenarios in which a single process might take hundreds or thousands of locks; it doesn't look like you have that.

			regards, tom lane