Thread: Postgres consuming way too much memory???
I have a box with an app and postgresql on it. Hardware includes with 2 2.8 Ghz xeons 512KB cache, 4 GB of memory, 6 scsidisk in a software raid 5 on a trustix 2.2 with a 2.6.15.3 kernel. The data and indexes are on the raid array while the tx log is on disk with the OS. All is well. The one application executes one transaction every 60 seconds or so. The transaction can range from tiny to relatively large. Maybe 30-70k inserts, 60-100k updates... nothing too heavy, take about 8-12 seconds to finish the the entire update in the worst case. The application is using the latest jdbc.... I am using preparedStatements with addBatch/executebatch/clearBatch to send statements in batches of 10 thousand... (is that high?) The box itself is a little over subscribed for memory which is causing us to swap a bit... As the application runs, I notice the postgres process which handles this particular app connection grows in memory seemingly uncrontrollably until kaboom. Once the kernel kills off enough processes and the system settles, I see the postgres processis at 1.9GB of res memory and 77MB of shared memory. This challenges a number of assumptions I have made in the last while and raisesa few questions... BTW, I am assuming this is not a memory leak b/c the same install of our software on a box with 8GB of memory and no swap being used has no unexplained growth in the memory... it is perfectly healthy and quite performant. Anyway, due to errors in the transaction, it is rolledback afterwhich the postgres process remains at 901MB of resident memory and 91MB of of shared memory. 27116 postgres 15 0 1515m 901m 91m S 0.0 22.9 18:33.96 postgres: qradar qradar ::ffff:x.x.x.x(51149) idle There are a few things I would like to understand. - What in the postgres will grow at an uncontrolled rate when the system is under heavy load or the transaction is larger... there must be something not governed by the shared memory or other configuration in postgresql.conf. It seems like, once we start hitting swap, postgres grows in memory resulting in more swapping... until applications start getting killed. - when the transaction was rolled back why did the process hold onto the 901MB of memory? - when is a transaction too big? is this determined by the configuration and performance of wal_buffers and wal log or isthere house cleaning which MUST be done at commit/rollback to avoid siutations like this thus indicating there is an upper bound. I have been configuring postgres from tidbits I collected reading this list in the last few months.... not sure if what I have is totally right for the work load, but when I have adequate memory and avoid swap, we are more than happy with performance. Configuration which is not below is just the default. shared_buffers = 32767 work_mem = 20480 maintenance_work_mem = 32768 max_fsm_pages = 4024000 max_fsm_relations = 2000 fsync = false wal_sync_method = fsync wal_buffers = 4096 checkpoint_segments = 32 checkpoint_timeout = 1200 checkpoint_warning = 60 commit_delay = 5000 commit_siblings = 5 effective_cache_size = 175000 random_page_cost = 2 autovacuum = true autovacuum_naptime = 60 autovacuum_vacuum_threshold = 500 autovacuum_analyze_threshold = 250 autovacuum_vacuum_scale_factor = 0.2 autovacuum_analyze_scale_factor = 0.1 autovacuum_vacuum_cost_delay=100 autovacuum_vacuum_cost_limit=100 default_statistics_target = 40 Is there anything here which looks weird or mis configured? I am just starting to play with the bg writer configuration soI did not include. typically, there is little or no iowait... and no reason to think there is something miconfigured... from what I have seen. In one transaction i have seen as many as 5 checkpoint_segments be created/used so I was considering increasing wal_buffersto 8192 from 4096 given as many as 4 segments in memory/cache at once... need to test this though .... Anyone have any thoughts on what could have caused the bloat? thanks
"jody brownell" <jody.brownell@q1labs.com> writes: > 27116 postgres 15 0 1515m 901m 91m S 0.0 22.9 18:33.96 postgres: qradar qradar ::ffff:x.x.x.x(51149) idle This looks like a memory leak, but you haven't provided enough info to let someone else reproduce it. Can you log what your application is doing and extract a test case? What PG version is this, anyway? regards, tom lane
Sorry about that, I was in a slight panic :) I am using postgresql 8.1.4. I will install 8.1.3 and see if the same behavior exists.. we may have started seeing this in 8.1.3, but I dont think before. I will check some stability machines for similar bloating. The query (calling a store proc) which is always running when the spiral begins is below. It simply performs bulk linking of two objects. Depending on what the application is detecting, it could be called to insert 40 - 50k records, 500 at a time. When the box is healthy, this is a 200 - 500 ms op, but this starts to become a 20000+ ms op. I guess this makes sense considering the paging..... Jun 14 12:50:18 xxx postgres[5649]: [3-1] LOG: duration: 20117.984 ms statement: EXECUTE <unnamed> [PREPARE: select *from link_attacker_targets($1, $2, $3) as CREATE OR REPLACE FUNCTION link_attacker_targets (p_attacker bigint, p_targets varchar, p_targets_size integer) returns bigint[] as $body$ DECLARE v_targets bigint[]; v_target bigint; v_returns bigint[]; v_returns_size integer := 0; BEGIN v_targets := convert_string2bigint_array (p_targets, p_targets_size); FOR i IN 1..p_targets_size LOOP v_target := v_targets[i]; BEGIN INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); v_returns_size := v_returns_size + 1; v_returns[v_returns_size] := v_target; EXCEPTION WHEN unique_violation THEN -- do nothing... app cache may be out of date. END; END LOOP; RETURN v_returns; END; $body$ LANGUAGE plpgsql VOLATILE CALLED ON NULL INPUT SECURITY INVOKER; On Wednesday 14 June 2006 17:03, you wrote: > "jody brownell" <jody.brownell@q1labs.com> writes: > > 27116 postgres 15 0 1515m 901m 91m S 0.0 22.9 18:33.96 postgres: qradar qradar ::ffff:x.x.x.x(51149) idle > > This looks like a memory leak, but you haven't provided enough info to > let someone else reproduce it. Can you log what your application is > doing and extract a test case? What PG version is this, anyway? > > regards, tom lane > > >
The last version of postgres we had in production was 8.1.1 actually, not 8.1.3. So far, on my stability box and older production stability boxes I dont see the same behavior. I will install 8.1.1 on these boxes and see what I see. On Thursday 15 June 2006 09:01, jody brownell wrote: > Sorry about that, I was in a slight panic :) > > I am using postgresql 8.1.4. I will install 8.1.3 and see if the same behavior exists.. we > may have started seeing this in 8.1.3, but I dont think before. I will check some stability > machines for similar bloating. > > The query (calling a store proc) which is always running when the spiral begins is below. It simply performs > bulk linking of two objects. Depending on what the application is detecting, it could be called to insert > 40 - 50k records, 500 at a time. When the box is healthy, this is a 200 - 500 ms op, but this starts to become > a 20000+ ms op. I guess this makes sense considering the paging..... > > Jun 14 12:50:18 xxx postgres[5649]: [3-1] LOG: duration: 20117.984 ms statement: EXECUTE <unnamed> [PREPARE: select* from link_attacker_targets($1, $2, $3) as > > CREATE OR REPLACE FUNCTION link_attacker_targets (p_attacker bigint, p_targets varchar, p_targets_size integer) > returns bigint[] as > $body$ > DECLARE > v_targets bigint[]; > v_target bigint; > v_returns bigint[]; > v_returns_size integer := 0; > BEGIN > v_targets := convert_string2bigint_array (p_targets, p_targets_size); > > FOR i IN 1..p_targets_size LOOP > v_target := v_targets[i]; > > BEGIN > INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); > v_returns_size := v_returns_size + 1; > v_returns[v_returns_size] := v_target; > > EXCEPTION WHEN unique_violation THEN > -- do nothing... app cache may be out of date. > END; > END LOOP; > RETURN v_returns; > END; > $body$ > LANGUAGE plpgsql VOLATILE CALLED ON NULL INPUT SECURITY INVOKER; > > On Wednesday 14 June 2006 17:03, you wrote: > > "jody brownell" <jody.brownell@q1labs.com> writes: > > > 27116 postgres 15 0 1515m 901m 91m S 0.0 22.9 18:33.96 postgres: qradar qradar ::ffff:x.x.x.x(51149) idle > > > > This looks like a memory leak, but you haven't provided enough info to > > let someone else reproduce it. Can you log what your application is > > doing and extract a test case? What PG version is this, anyway? > > > > regards, tom lane > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
Some more information... When postgresql starts to go into this bloating state, I can only make it happen from my java app. If I simultaneously perform insert of 10million rows into another table, it behaves as expected, but the postgresql process handling the java connection slows down and bloats. This leads me to think it has something to do with either the long lived connection. I am using dbcp jdbc pool from jakarta OR I am trigger this behavior with something I am doing in the link routine I sent earlier. I am going to try closing the connection after each TX to see if this resolves it for now. If not, I will write a java app, stored procedure (table etc) reproduce it without our application. Oh yeah, it is when I use about have of my swap, dstat starts reporting heavy paging, memory climbs very quickly. On Thursday 15 June 2006 09:15, jody brownell wrote: > The last version of postgres we had in production was 8.1.1 actually, not 8.1.3. > > So far, on my stability box and older production stability boxes I dont see the same behavior. > > I will install 8.1.1 on these boxes and see what I see. > > On Thursday 15 June 2006 09:01, jody brownell wrote: > > Sorry about that, I was in a slight panic :) > > > > I am using postgresql 8.1.4. I will install 8.1.3 and see if the same behavior exists.. we > > may have started seeing this in 8.1.3, but I dont think before. I will check some stability > > machines for similar bloating. > > > > The query (calling a store proc) which is always running when the spiral begins is below. It simply performs > > bulk linking of two objects. Depending on what the application is detecting, it could be called to insert > > 40 - 50k records, 500 at a time. When the box is healthy, this is a 200 - 500 ms op, but this starts to become > > a 20000+ ms op. I guess this makes sense considering the paging..... > > > > Jun 14 12:50:18 xxx postgres[5649]: [3-1] LOG: duration: 20117.984 ms statement: EXECUTE <unnamed> [PREPARE: select* from link_attacker_targets($1, $2, $3) as > > > > CREATE OR REPLACE FUNCTION link_attacker_targets (p_attacker bigint, p_targets varchar, p_targets_size integer) > > returns bigint[] as > > $body$ > > DECLARE > > v_targets bigint[]; > > v_target bigint; > > v_returns bigint[]; > > v_returns_size integer := 0; > > BEGIN > > v_targets := convert_string2bigint_array (p_targets, p_targets_size); > > > > FOR i IN 1..p_targets_size LOOP > > v_target := v_targets[i]; > > > > BEGIN > > INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); > > v_returns_size := v_returns_size + 1; > > v_returns[v_returns_size] := v_target; > > > > EXCEPTION WHEN unique_violation THEN > > -- do nothing... app cache may be out of date. > > END; > > END LOOP; > > RETURN v_returns; > > END; > > $body$ > > LANGUAGE plpgsql VOLATILE CALLED ON NULL INPUT SECURITY INVOKER; > > > > On Wednesday 14 June 2006 17:03, you wrote: > > > "jody brownell" <jody.brownell@q1labs.com> writes: > > > > 27116 postgres 15 0 1515m 901m 91m S 0.0 22.9 18:33.96 postgres: qradar qradar ::ffff:x.x.x.x(51149) idle > > > > > > This looks like a memory leak, but you haven't provided enough info to > > > let someone else reproduce it. Can you log what your application is > > > doing and extract a test case? What PG version is this, anyway? > > > > > > regards, tom lane > > > > > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 6: explain analyze is your friend > > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq >
"jody brownell" <jody.brownell@q1labs.com> writes: > When postgresql starts to go into this bloating state, I can only make it happen from my java app. That's interesting. The JDBC driver uses protocol features that aren't used by psql, so it's possible that the leak is triggered by one of those features. I wouldn't worry too much about duplicating the problem from psql anyway --- a Java test case will do fine. > I am going to try closing the connection after each TX to see if this > resolves it for now. If not, I will write a java app, stored procedure > (table etc) reproduce it without our application. Even if that works around it for you, please pursue getting a test case together so we can find and fix the underlying problem. regards, tom lane
"jody brownell" <jody.brownell@Q1Labs.com> writes: > BEGIN > INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); > v_returns_size := v_returns_size + 1; > v_returns[v_returns_size] := v_target; > EXCEPTION WHEN unique_violation THEN > -- do nothing... app cache may be out of date. > END; Hmm. There is a known problem that plpgsql leaks some memory when catching an exception: http://archives.postgresql.org/pgsql-hackers/2006-02/msg00885.php So if your problem case involves a whole lot of duplicates then that could explain the initial bloat. However, AFAIK that leakage is in a transaction-local memory context, so the space ought to be freed at transaction end. And Linux's malloc does know about giving space back to the kernel (unlike some platforms). So I'm not sure why you're seeing persistent bloat. Can you rewrite the function to not use an EXCEPTION block (perhaps a separate SELECT probe for each row --- note this won't be reliable if there are concurrent processes making insertions)? If so, does that fix the problem? regards, tom lane
Tom - that make sense... and fits the timeline of when the instability may have been introduced. I use soft references in java to track these relationships. When the GC needs memory it will collect objects referenced by soft references so I need to have this exception caught where my caches may get cleaned. When the system is under load as it would be in this case, there references would be cleaned causing a large number of exceptions in the pgplsql, subsequently causing the leak... hence the swift downward spiral. The previous version of these routines used selects but due to volume of selects, performance suffered quite a bit. I dont think I could revert now for production use... closing the connection maybe the workaround for us for this release IF this is in fact what the problem is. Unfortunatly, I use the catch in about 20 similar routines to reset sequences etc.... this may be painful :( I will modify the routine to help isolate the problem. stay tuned. BTW - the fix you mentioned .... is that targeted for 8.2? Is there a timeline for 8.2? On Thursday 15 June 2006 12:44, Tom Lane wrote: > "jody brownell" <jody.brownell@Q1Labs.com> writes: > > BEGIN > > INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); > > v_returns_size := v_returns_size + 1; > > v_returns[v_returns_size] := v_target; > > > EXCEPTION WHEN unique_violation THEN > > -- do nothing... app cache may be out of date. > > END; > > Hmm. There is a known problem that plpgsql leaks some memory when > catching an exception: > http://archives.postgresql.org/pgsql-hackers/2006-02/msg00885.php > > So if your problem case involves a whole lot of duplicates then that > could explain the initial bloat. However, AFAIK that leakage is in > a transaction-local memory context, so the space ought to be freed at > transaction end. And Linux's malloc does know about giving space back > to the kernel (unlike some platforms). So I'm not sure why you're > seeing persistent bloat. > > Can you rewrite the function to not use an EXCEPTION block (perhaps > a separate SELECT probe for each row --- note this won't be reliable > if there are concurrent processes making insertions)? If so, does > that fix the problem? > > regards, tom lane > > >
"jody brownell" <jody.brownell@Q1Labs.com> writes: > BTW - the fix you mentioned .... is that targeted for 8.2? Is there a timeline for 8.2? There is no fix as yet, but it's on the radar screen to fix for 8.2. We expect 8.2 will go beta towards the end of summer (I forget whether Aug 1 or Sep 1 is the target). regards, tom lane
On Thu, 2006-06-15 at 11:34 -0400, Tom Lane wrote: > "jody brownell" <jody.brownell@q1labs.com> writes: > > When postgresql starts to go into this bloating state, I can only make it happen from my java app. > > That's interesting. The JDBC driver uses protocol features that aren't > used by psql, so it's possible that the leak is triggered by one of > those features. I wouldn't worry too much about duplicating the problem > from psql anyway --- a Java test case will do fine. > > > I am going to try closing the connection after each TX to see if this > > resolves it for now. If not, I will write a java app, stored procedure > > (table etc) reproduce it without our application. Just to mention another possible culprit; this one doesn't seem all that likely to me, but at least it's easy to investigate. With DBCP and non-ancient versions of the JDBC driver that use v3 protocol and real prepared statements, it is possible to (mis)configure the system to create an unbounded number of cached prepared statements on any particular connection. Older versions of DBCP were also known to have bugs which aggravated this issue when prepared statement caching was enabled, IIRC. -- Mark Lewis
Added to TODO: > o Fix memory leak from exceptions > > http://archives.postgresql.org/pgsql-performance/2006-06/msg0$ --------------------------------------------------------------------------- Tom Lane wrote: > "jody brownell" <jody.brownell@Q1Labs.com> writes: > > BEGIN > > INSERT into attacker_target_link (attacker_id, target_id) values (p_attacker, v_target); > > v_returns_size := v_returns_size + 1; > > v_returns[v_returns_size] := v_target; > > > EXCEPTION WHEN unique_violation THEN > > -- do nothing... app cache may be out of date. > > END; > > Hmm. There is a known problem that plpgsql leaks some memory when > catching an exception: > http://archives.postgresql.org/pgsql-hackers/2006-02/msg00885.php > > So if your problem case involves a whole lot of duplicates then that > could explain the initial bloat. However, AFAIK that leakage is in > a transaction-local memory context, so the space ought to be freed at > transaction end. And Linux's malloc does know about giving space back > to the kernel (unlike some platforms). So I'm not sure why you're > seeing persistent bloat. > > Can you rewrite the function to not use an EXCEPTION block (perhaps > a separate SELECT probe for each row --- note this won't be reliable > if there are concurrent processes making insertions)? If so, does > that fix the problem? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +