Thread: Disk filling, CPU filling, renegade inserts and deletes?
Hi,

I'm having a pretty serious problem with PostgreSQL's performance. Currently, I have a cron task set to restart the server and run vacuumdb -faz every six hours. If that doesn't happen, the disk goes from 10% full to 95% full within 2 days (and it's a 90GB disk... with the database being a 2MB download after dump), and the CPU goes from around a 2% load to a 99+% load right away (the stats look like a square wave). So it's problem-hunting time, I guess.

The problem has something to do with the following errors (there are a lot; I'm posting a short sample):

NOTICE:  relation "pg_depend" TID 43/27: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_depend" TID 43/28: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_depend" TID 43/29: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_depend" TID 43/30: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_depend" TID 43/31: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_depend" TID 43/32: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_type" TID 17/44: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_type" TID 17/45: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/11: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/12: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/13: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/14: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/15: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/16: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_attribute" TID 133/17: InsertTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_class" TID 41/18: DeleteTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_class" TID 41/19: DeleteTransactionInProgress 209545 --- can't shrink relation
NOTICE:  relation "pg_class" TID 41/20: DeleteTransactionInProgress 209545 --- can't shrink relation

When I run VACUUM FULL, I can't get rid of these errors unless I restart the database (and then I restart, VACUUM FULL, and everything's fine). Once I do a successful VACUUM FULL, CPU usage returns to normal and the disk is no longer almost full (back to 10% full). I'm at a loss to figure out where the problem is coming from and how to fix it.

My machine: Xserve G5 Dual 2GHz running Mac OS X Server 10.3.9, PostgreSQL 8.0.1.

Thanks for any responses/ideas/solutions (best of all!),
Richard
> I'm having a pretty serious problem with postgresql's performance.
> Currently, I have a cron task that is set to restart and vacuumdb -faz
> every six hours. If that doesn't happen, the disk goes from 10% full
> to 95% full within 2 days (and it's a 90GB disk... with the database
> being a 2MB download after dump), and the CPU goes from running at
> around a 2% load to a 99+% load right away (the stats look like a
> square wave).

Are you running frequent queries which use temporary tables?

--
No, I don't think so. I don't think there are any temp table queries (and I'll check), but even if there are, site traffic is very low, and queries would be very infrequent.

On Apr 20, 2005, at 12:36 PM, Rod Taylor wrote:

>> I'm having a pretty serious problem with postgresql's performance.
>> Currently, I have a cron task that is set to restart and vacuumdb -faz
>> every six hours. If that doesn't happen, the disk goes from 10% full
>> to 95% full within 2 days (and it's a 90GB disk... with the database
>> being a 2MB download after dump), and the CPU goes from running at
>> around a 2% load to a 99+% load right away (the stats look like a
>> square wave).
>
> Are you running frequent queries which use temporary tables?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>        http://www.postgresql.org/docs/faq
As a follow-up, I've found a function that used the following code:

    CREATE TEMPORARY TABLE results (nOrder integer, page_id integer, name text)
    WITHOUT OIDS ON COMMIT DROP;

I would assume that the "WITHOUT OIDS" would be part of the source of the problem, so I've commented it out.

On Apr 20, 2005, at 12:36 PM, Rod Taylor wrote:

>> I'm having a pretty serious problem with postgresql's performance.
>> Currently, I have a cron task that is set to restart and vacuumdb -faz
>> every six hours. If that doesn't happen, the disk goes from 10% full
>> to 95% full within 2 days (and it's a 90GB disk... with the database
>> being a 2MB download after dump), and the CPU goes from running at
>> around a 2% load to a 99+% load right away (the stats look like a
>> square wave).
>
> Are you running frequent queries which use temporary tables?
Richard Plotkin <richard@richardplotkin.com> writes:
> I'm having a pretty serious problem with postgresql's performance.
> Currently, I have a cron task that is set to restart and vacuumdb -faz
> every six hours. If that doesn't happen, the disk goes from 10% full
> to 95% full within 2 days (and it's a 90GB disk... with the database
> being a 2MB download after dump), and the CPU goes from running at
> around a 2% load to a 99+% load right away (the stats look like a
> square wave).

Q: what have you got the FSM parameters set to?

Q: what exactly is bloating? Without knowing which tables or indexes are growing, it's hard to speculate about the exact causes. Use du and oid2name, or look at pg_class.relpages after a plain VACUUM.

It's likely that the real answer is "you need to vacuum more often than every six hours", but I'm trying not to jump to conclusions.

			regards, tom lane
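A query along these lines, run after a plain VACUUM, is one way to rank relations by on-disk size without guessing (a sketch only; the size arithmetic assumes the default 8KB block size, and the LIMIT is arbitrary):

```sql
-- Largest relations by page count; relfilenode is the number to match
-- against file names under data/base/<database-oid>/.
SELECT relname, relkind, relfilenode, relpages,
       relpages * 8 / 1024 AS approx_mb   -- assumes 8KB blocks
FROM pg_class
ORDER BY relpages DESC
LIMIT 10;
```

Note that relpages is only refreshed by VACUUM and ANALYZE, so it reflects the last vacuum, not the instantaneous file size.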
Hi Tom,

> Q: what have you got the FSM parameters set to?

Here's from postgresql.conf -- FSM at default settings:

# - Memory -

shared_buffers = 30400          # min 16, at least max_connections*2, 8KB each
work_mem = 32168                # min 64, size in KB
#maintenance_work_mem = 16384   # min 1024, size in KB
#max_stack_depth = 2048         # min 100, size in KB

# - Free Space Map -

#max_fsm_pages = 20000          # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000       # min 100, ~50 bytes each

# - Kernel Resource Usage -

max_files_per_process = 750 #1000  # min 25
#preload_libraries = ''

> Q: what exactly is bloating? Without knowing which tables or indexes
> are growing, it's hard to speculate about the exact causes. Use du and
> oid2name, or look at pg_class.relpages after a plain VACUUM.

This I do not know. I've disabled the cron jobs and will let the system bloat, then I will gather statistics (I'll give it 12-24 hours).

> It's likely that the real answer is "you need to vacuum more often
> than every six hours", but I'm trying not to jump to conclusions.

That could be it, except that I would expect the problem to then look more like a gradual increase in CPU usage and a gradual increase in use of disk space. Mine could be an invalid assumption, but the system here looks like it goes from no problem to 100% problem within a minute.

Thanks again!
Richard
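For reference, raising the FSM settings above their defaults would look like the fragment below in postgresql.conf. The values here are purely illustrative, not tuned recommendations, and on 8.0 both settings require a server restart to take effect:

```
# - Free Space Map - (illustrative values only)
max_fsm_pages = 200000      # pages with free space tracked across the cluster
max_fsm_relations = 2000    # tables + indexes tracked across all databases
```

A plain VACUUM VERBOSE on 8.0 prints a summary at the end indicating whether the current FSM settings are large enough, which is a safer guide than guessing.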
More info on what is bloating:

It's only in one database (the one that's most used), and after running oid2name on the bloated files, the result is (mysteriously) empty. Here's the run on the three enormous files:

$ /usr/local/bin/oid2name -d smt -o 160779
From database "smt":
  Filenode  Table Name
  ----------------------

$ /usr/local/bin/oid2name -d smt -o 65782869
From database "smt":
  Filenode  Table Name
  ----------------------

$ /usr/local/bin/oid2name -d smt -o 83345634
From database "smt":
  Filenode  Table Name
  ----------------------

The file list looks like this (with normal sized files mostly removed):

1.0G  ./106779
1.0G  ./106779.1
1.0G  ./106779.2
1.0G  ./106779.3
978M  ./106779.4
1.0G  ./65782869
248M  ./65782869.1
  0B  ./65782871
8.0K  ./65782873
780M  ./83345634
  0B  ./83345636
8.0K  ./83345638

So does the empty result mean it's a temporary table? There is one temporary table (in the function previously mentioned) that does get created and dropped with some regularity.

Thanks again,
Richard

On Apr 20, 2005, at 2:06 PM, Richard Plotkin wrote:

> Hi Tom,
>
>> Q: what have you got the FSM parameters set to?
>
> Here's from postgresql.conf -- FSM at default settings.
> [postgresql.conf excerpt snipped]
>
>> Q: what exactly is bloating? Without knowing which tables or indexes
>> are growing, it's hard to speculate about the exact causes. Use du and
>> oid2name, or look at pg_class.relpages after a plain VACUUM.
>
> This I do not know. I've disabled the cron jobs and will let the
> system bloat, then I will gather statistics (I'll give it 12-24 hours).
>
>> It's likely that the real answer is "you need to vacuum more often
>> than every six hours", but I'm trying not to jump to conclusions.
>
> That could be it, except that I would expect the problem to then look
> more like a gradual increase in CPU usage and a gradual increase in
> use of disk space. Mine could be an invalid assumption, but the
> system here looks like it goes from no problem to 100% problem within
> a minute.
>
> Thanks again!
> Richard
On Thu, Apr 21, 2005 at 11:38:22AM -0700, Richard Plotkin wrote:
> More info on what is bloating:
>
> It's only in one database (the one that's most used), and after running
> oid2name on the bloated files, the result is (mysteriously) empty.
> Here's the run on the three enormous files:
>
> $ /usr/local/bin/oid2name -d smt -o 160779
> From database "smt":
>   Filenode  Table Name
>   ----------------------

Try -f instead of -o ...

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"World domination is proceeding according to plan"  (Andrew Morton)
That returned the same result. I also tried oid2name -d smt -x -i -S and, separately, -s, and also, separately, -d with all the other databases, and none of the databases turned up any listing, in either oid or filenode, for any of these three bloated files.

One thing I've noticed is that these oids are all extremely large numbers, whereas the rest of the oids in /data/base/* are no higher than 40000 or 50000.

On Apr 21, 2005, at 1:46 PM, Alvaro Herrera wrote:

> [...]
>
> Try -f instead of -o ...
>
> --
> Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
> "World domination is proceeding according to plan"  (Andrew Morton)
I've also now tried looking at pg_class.relpages. I compared the results before and after vacuum. The results stayed the same, except for five rows that increased after the vacuum. Here is the select on those rows after the vacuum (wide output, reflowed):

relname | relnamespace | reltype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | reltoastrelid | reltoastidxid | relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys | relrefs | relhasoids | relhaspkey | relhasrules | relhassubclass | relacl
pg_attribute_relid_attnam_index | 11 | 0 | 1 | 403 | 16686 | 0 | 292 | 10250 | 0 | 0 | f | f | i | 2 | 0 | 0 | 0 | 0 | 0 | f | f | f | f |
pg_class_oid_index | 11 | 0 | 1 | 403 | 16690 | 0 | 18 | 2640 | 0 | 0 | f | f | i | 1 | 0 | 0 | 0 | 0 | 0 | f | f | f | f |
pg_depend_depender_index | 11 | 0 | 1 | 403 | 16701 | 0 | 52 | 6442 | 0 | 0 | f | f | i | 3 | 0 | 0 | 0 | 0 | 0 | f | f | f | f |
pg_type_oid_index | 11 | 0 | 1 | 403 | 16731 | 0 | 8 | 1061 | 0 | 0 | f | f | i | 1 | 0 | 0 | 0 | 0 | 0 | f | f | f | f |
pg_depend | 11 | 16677 | 1 | 0 | 16676 | 0 | 32 | 4200 | 0 | 0 | t | f | r | 7 | 0 | 0 | 0 | 0 | 0 | f | f | f | f | {=r/postgres}

On Apr 20, 2005, at 1:51 PM, Tom Lane wrote:

> Richard Plotkin <richard@richardplotkin.com> writes:
>> I'm having a pretty serious problem with postgresql's performance.
>> Currently, I have a cron task that is set to restart and vacuumdb -faz
>> every six hours. If that doesn't happen, the disk goes from 10% full
>> to 95% full within 2 days (and it's a 90GB disk... with the database
>> being a 2MB download after dump), and the CPU goes from running at
>> around a 2% load to a 99+% load right away (the stats look like a
>> square wave).
>
> Q: what have you got the FSM parameters set to?
>
> Q: what exactly is bloating? Without knowing which tables or indexes
> are growing, it's hard to speculate about the exact causes. Use du and
> oid2name, or look at pg_class.relpages after a plain VACUUM.
>
> It's likely that the real answer is "you need to vacuum more often
> than every six hours", but I'm trying not to jump to conclusions.
>
> 			regards, tom lane
If anybody has additional advice on this problem, I would really, really appreciate it...

I updated postgres to 8.0.2 and am running vacuumdb -faz every 3 hours, and 50 minutes after a vacuum the CPU usage still skyrocketed and the disk started filling. This time, there is only a single file that is spanning multiple GB, but running oid2name again returns no result on the oid or filenode.

With the increased vacuuming, fixed temp tables, etc., I really am at a loss for what's happening, and could really use some additional help.

Thank you,
Richard
I also forgot to mention: vacuumdb now fails on the command line with the following error:

vacuumdb: could not connect to database smt: FATAL:  sorry, too many clients already

On Apr 23, 2005, at 9:57 AM, Richard Plotkin wrote:

> If anybody has additional advice on this problem, I would really,
> really appreciate it...
>
> I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours,
> and 50 minutes after a vacuum the CPU usage still skyrocketed, and the
> disk started filling. This time, there is only a single file that is
> spanning multiple GB, but running oid2name again returns no result on
> the oid or filenode.
>
> With the increased vacuuming, fixed temp tables, etc., I really am at
> a loss for what's happening, and could really use some additional
> help.
>
> Thank you,
> Richard
Richard Plotkin <richard@richardplotkin.com> writes:
> I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours,
> and 50 minutes after a vacuum the CPU usage still skyrocketed, and the
> disk started filling. This time, there is only a single file that is
> spanning multiple GB, but running oid2name again returns no result on
> the oid or filenode.

What is the filename exactly (full path)?

			regards, tom lane
/usr/local/pgsql/data/base/17234/42791
/usr/local/pgsql/data/base/17234/42791.1
/usr/local/pgsql/data/base/17234/42791.2
/usr/local/pgsql/data/base/17234/42791.3
/usr/local/pgsql/data/base/17234/42791.4
/usr/local/pgsql/data/base/17234/42791.5
/usr/local/pgsql/data/base/17234/42791.6
/usr/local/pgsql/data/base/17234/42791.7
/usr/local/pgsql/data/base/17234/42791.8
/usr/local/pgsql/data/base/17234/42791.9
/usr/local/pgsql/data/base/17234/42791.10
/usr/local/pgsql/data/base/17234/42791.11

On Apr 23, 2005, at 11:06 AM, Tom Lane wrote:

> Richard Plotkin <richard@richardplotkin.com> writes:
>> I updated postgres to 8.0.2, am running vacuumdb -faz every 3 hours,
>> and 50 minutes after a vacuum the CPU usage still skyrocketed, and the
>> disk started filling. This time, there is only a single file that is
>> spanning multiple GB, but running oid2name again returns no result on
>> the oid or filenode.
>
> What is the filename exactly (full path)?
>
> 			regards, tom lane
Richard Plotkin <richard@richardplotkin.com> writes:
> /usr/local/pgsql/data/base/17234/42791
> /usr/local/pgsql/data/base/17234/42791.1
> /usr/local/pgsql/data/base/17234/42791.2
> /usr/local/pgsql/data/base/17234/42791.3
> ...

Well, that is certainly a table or index of some kind.

Go into database 17234 --- if you are not certain which one that is, see

    select datname from pg_database where oid = 17234

and do

    select relname from pg_class where relfilenode = 42791

The only way I could see for this to not find the table is if the table creation has not been committed yet. Do you have any apps that create and fill a table in a single transaction?

			regards, tom lane
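Since a relation's on-disk filenode and its OID can diverge (operations such as CLUSTER, TRUNCATE, and REINDEX assign a new relfilenode), it may be worth checking both columns in one query; a sketch:

```sql
-- Match the data file name against either the OID or the current
-- filenode of every relation in the connected database.
SELECT oid, relname, relkind, relfilenode
FROM pg_class
WHERE relfilenode = 42791 OR oid = 42791;
```

If this also returns zero rows in the right database, that points at an uncommitted (or temporary) relation, as described above.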
Hi Tom,

Thanks for your responses this morning. I did the select relname, and it returned 0 rows. I do have one function that creates a temp table and fills it within the same transaction. I'm pasting it below. Perhaps the "ON COMMIT DROP" is causing problems, and I need to drop the table at the end of the function instead of using ON COMMIT DROP?

--
-- Name: crumbs(integer, text, boolean); Type: FUNCTION; Schema: public
--

CREATE FUNCTION crumbs(integer, text, boolean) RETURNS text
    AS $_$
DECLARE
    starting_page ALIAS FOR $1;
    current_page integer;
    delimiter text DEFAULT ': ';
    withLinkTags BOOLEAN DEFAULT FALSE;
    page_id_temp INTEGER;
    page_name_temp TEXT;
    current_nOrder INTEGER := 1;
    page_results record;
    path TEXT DEFAULT '';
BEGIN
    IF starting_page IS NULL THEN
        RETURN NULL;
    END IF;
    current_page := starting_page;
    IF $2 IS NOT NULL THEN
        delimiter := $2;
    END IF;
    IF $3 IS NOT NULL THEN
        withLinkTags := $3;
    END IF;

    --Create a table consisting of three columns: nOrder, page_id, name
    CREATE TEMPORARY TABLE results (nOrder integer, page_id integer, name text)
        ON COMMIT DROP;

    --Select the current page into the results table
    SELECT INTO page_id_temp, page_name_temp
        p.page_id,
        CASE WHEN p.title_abbr IS NOT NULL THEN p.title_abbr ELSE p.title END as name
    FROM page p
    WHERE p.page_id = starting_page;
    IF FOUND THEN
        EXECUTE 'INSERT INTO results (nOrder, page_id, name) VALUES ('
            || current_nOrder || ',' || page_id_temp || ','
            || quote_literal(page_name_temp) || ')';
        current_nOrder := current_nOrder + 1;
    END IF;

    --Loop through results for page parents
    LOOP
        SELECT INTO page_id_temp, page_name_temp
            parent.page_id as parent_id,
            CASE WHEN parent.title_abbr IS NOT NULL THEN parent.title_abbr ELSE parent.title END as name
        FROM page AS child
        INNER JOIN page AS parent ON child.subcat_id = parent.page_id
        WHERE child.page_id = current_page;
        IF FOUND THEN
            EXECUTE 'INSERT INTO results (nOrder, page_id, name) VALUES ('
                || current_nOrder || ',' || page_id_temp || ','
                || quote_literal(page_name_temp) || ')';
            current_page = page_id_temp;
            current_nOrder := current_nOrder + 1;
        ELSE
            EXIT;
        END IF;
    END LOOP;

    SELECT INTO page_id_temp, page_name_temp
        c.default_page as parent_id,
        c.name
    FROM page p
    INNER JOIN category c ON c.cat_id = p.cat_id
    WHERE page_id = starting_page;
    IF FOUND THEN
        EXECUTE 'INSERT INTO results (nOrder, page_id, name) VALUES ('
            || current_nOrder || ',' || page_id_temp || ','
            || quote_literal(page_name_temp) || ')';
    END IF;

    FOR page_results IN EXECUTE 'SELECT * FROM results ORDER BY nOrder DESC' LOOP
        IF path = '' THEN
            IF withLinkTags IS TRUE THEN
                path := '<a href="index.php?pid=' || page_results.page_id || '">';
                path := path || page_results.name;
                path := path || '</a>';
            ELSE
                path := page_results.name;
            END IF;
        ELSE
            IF withLinkTags IS TRUE THEN
                path := path || delimiter;
                path := path || '<a href="index.php?pid=' || page_results.page_id || '">';
                path := path || page_results.name;
                path := path || '</a>';
            ELSE
                path := path || delimiter || page_results.name;
            END IF;
        END IF;
    END LOOP;

    RETURN path;
END;$_$
    LANGUAGE plpgsql;

On Apr 23, 2005, at 11:17 AM, Tom Lane wrote:

> Richard Plotkin <richard@richardplotkin.com> writes:
>> /usr/local/pgsql/data/base/17234/42791
>> /usr/local/pgsql/data/base/17234/42791.1
>> /usr/local/pgsql/data/base/17234/42791.2
>> /usr/local/pgsql/data/base/17234/42791.3
>> ...
>
> Well, that is certainly a table or index of some kind.
>
> Go into database 17234 --- if you are not certain which one that is, see
>     select datname from pg_database where oid = 17234
> and do
>     select relname from pg_class where relfilenode = 42791
>
> The only way I could see for this to not find the table is if the table
> creation has not been committed yet. Do you have any apps that create
> and fill a table in a single transaction?
>
> 			regards, tom lane
Richard Plotkin <richard@richardplotkin.com> writes:
> Thanks for your responses this morning. I did the select relname, and
> it returned 0 rows. I do have one function that creates a temp table
> and fills it within the same transaction. I'm pasting it below.
> Perhaps the "ON COMMIT DROP" is causing problems, and I need to drop
> the table at the end of the function instead of using ON COMMIT DROP?

Well, I think we can conclude that the function is pushing way more data into the temp table than you expect. I am wondering if that loop in the middle of the function is turning into an infinite loop --- could it be finding some sort of cycle in your page data? You might want to add some RAISE NOTICE commands to the loop so you can track what it's doing.

			regards, tom lane
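A minimal instrumentation of the parent-walk loop might look like this (a sketch only, using the variable names from the function above); each iteration then leaves a trace in the server log, so a runaway walk is visible immediately:

```sql
-- Inside the LOOP, just before the INSERT:
RAISE NOTICE 'crumbs: step %, current_page = %, next parent = %',
    current_nOrder, current_page, page_id_temp;
```

If the same page_id values start repeating in the log output, the parent chain contains a cycle.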
Hi Tom, Thanks! That's exactly what it was. There was a discrepancy in the data that turned this into an endless loop. Everything has been running smoothly since I made a change. Thanks so much, Richard On Apr 23, 2005, at 12:50 PM, Tom Lane wrote: > Richard Plotkin <richard@richardplotkin.com> writes: >> Thanks for your responses this morning. I did the select relname, and >> it returned 0 rows. I do have one function that creates a temp table >> and fills it within the same transaction. I'm pasting it below. >> Perhaps the "ON COMMIT DROP" is causing problems, and I need to drop >> the table at the end of the function instead of using ON COMMIT DROP? > > Well, I think we can conclude that the function is pushing way more > data into the temp table than you expect. I am wondering if that loop > in the middle of the function is turning into an infinite loop --- > could > it be finding some sort of cycle in your page data? You might want to > add some RAISE NOTICE commands to the loop so you can track what it's > doing. > > regards, tom lane >
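For anyone finding this thread later: besides fixing the data, the walk can be made cycle-proof with a simple depth limit, so a future data discrepancy fails loudly instead of filling the disk. A sketch (the cap of 100 is arbitrary, not from the original function):

```sql
-- Defensive version of the parent walk: abort if the chain is
-- implausibly deep, which also catches cycles in subcat_id.
LOOP
    IF current_nOrder > 100 THEN
        RAISE EXCEPTION 'crumbs: probable cycle in page hierarchy near page %',
            current_page;
    END IF;
    -- ... existing SELECT / INSERT / EXIT logic as above ...
END LOOP;
```

RAISE EXCEPTION aborts the transaction, so the ON COMMIT DROP temp table is cleaned up rather than left growing.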