Thread: VACUUM FULL hangs
Hello list. I recently tried to do a slony replica of my database, and doing it falied. I retried, and then it succeeded (why it failed is another story). This caused that in the replica there is a lot of dead tuples ( If i understand correctly, a failure in creating the replica means a HUGE aborted transaction - and Slony should TRUNCATE the table, getting rid of dead tuples, but that is a subject for another list). so I did vacuum full verbose analyze (does it make sense ?) This hanged on a (quite large) table: INFO: vacuuming "public.calls" INFO: "calls": found 7980456 removable, 3989705 nonremovable row versions in 296943 pages DETAIL: 0 dead row versions cannot be removed yet. Nonremovable row versions range from 136 to 224 bytes long. There were 891 unused item pointers. Total free space (including removable row versions) is 1594703944 bytes. 197958 pages are or will become empty, including 0 at the end of the table. 212719 pages containing 1588415680 free bytes are potential move destinations. CPU 7.25s/3.28u sec elapsed 144.95 sec. INFO: index "calls_pkey" now contains 3989705 row versions in 8975 pages DETAIL: 108927 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.35s/0.59u sec elapsed 39.03 sec. INFO: index "calls_cli" now contains 3989705 row versions in 13504 pages DETAIL: 108927 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.51s/0.60u sec elapsed 58.60 sec. INFO: index "calls_dnis" now contains 3989705 row versions in 13600 pages DETAIL: 108927 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.60s/0.90u sec elapsed 27.05 sec. INFO: index "calls_u" now contains 3989705 row versions in 23820 pages DETAIL: 108927 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.92s/0.78u sec elapsed 80.51 sec. INFO: index "calls_z" now contains 3989705 row versions in 13607 pages DETAIL: 108927 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU 0.60s/0.85u sec elapsed 39.77 sec. It was hanging in this state for more than 3 hours, and I had to kill the vacuum process. From iostat I saw that there was continuous write activity, steadilly about 1.3 MB/s (the disk system can do about 40 MB/s), and there were iowait processes. There was no read activity. There were no other clients for that database (but there were clients in other databases in the instance). version is 8.1.0 . Autovacuum is off. I upped maintenance_work_mem to 512 MB . Any hints? If nothing comes up today, I am scratching that replica. telefony=# \d calls Table "public.calls" Column | Type | Modifiers ---------------+-----------------------------+------------------------------ ------------------------------ dt | timestamp without time zone | machine_ip | integer | port | integer | filename | character varying(15) | account | character(11) | duration | integer | ani | character(32) | application | character(32) | dnis | integer | z | integer | client | integer | taryfa | integer | operator | character varying(20) | id | integer | not null default nextval(('seq_calls_id'::text)::regclass) outgoing | character(12) | release_cause | text | waiting | integer | oper_pin | integer | Indexes: "calls_pkey" PRIMARY KEY, btree (id) "calls_u" UNIQUE, btree (dt, dnis, port, machine_ip, account) "calls_cli" btree (client, dt) "calls_dnis" btree (dnis, dt) "calls_z" btree (z, dt) Triggers: _ctele_denyaccess_5 BEFORE INSERT OR DELETE OR UPDATE ON calls FOR EACH ROW EXECUTE PROCEDURE _ctele.denyaccess('_ctele') Pozdrawiam Marcin Mańk
Marcin Mańk wrote: > Hello list. > > I recently tried to do a slony replica of my database, and doing it falied. > I retried, and then it succeeded (why it failed is another story). > > This caused that in the replica there is a lot of dead tuples ( If i > understand correctly, a failure in creating the replica means a HUGE aborted > transaction - and Slony should TRUNCATE the table, getting rid of dead > tuples, but that is a subject for another list). > > so I did vacuum full verbose analyze (does it make sense ?) Fair enough. If you want empty tables TRUNCATE is probably a better bet though. > This hanged on a (quite large) table: > > INFO: vacuuming "public.calls" > INFO: "calls": found 7980456 removable, 3989705 nonremovable row versions > in 296943 pages > DETAIL: 0 dead row versions cannot be removed yet. > Nonremovable row versions range from 136 to 224 bytes long. > There were 891 unused item pointers. > Total free space (including removable row versions) is 1594703944 bytes. > 197958 pages are or will become empty, including 0 at the end of the table. > 212719 pages containing 1588415680 free bytes are potential move > destinations. OK, so there are 7.9 million removable rows and 3.9 million nonremovable so truncate isn't an option since you have data you presumably want to keep. It estimates about 200,000 pages will become empty, but none of them are at the end of the table. This represents 1.5GB of unused disk-space. I'm a bit puzzled as to how you managed to get so much free space at the start of the table. Did the replication work on the second try? > CPU 7.25s/3.28u sec elapsed 144.95 sec. > INFO: index "calls_pkey" now contains 3989705 row versions in 8975 pages > DETAIL: 108927 index row versions were removed. > 0 index pages have been deleted, 0 are currently reusable. > CPU 0.35s/0.59u sec elapsed 39.03 sec. > INFO: index "calls_cli" now contains 3989705 row versions in 13504 pages > DETAIL: 108927 index row versions were removed. > 0 index pages have been deleted, 0 are currently reusable. > CPU 0.51s/0.60u sec elapsed 58.60 sec. > INFO: index "calls_dnis" now contains 3989705 row versions in 13600 pages > DETAIL: 108927 index row versions were removed. > 0 index pages have been deleted, 0 are currently reusable. > CPU 0.60s/0.90u sec elapsed 27.05 sec. > INFO: index "calls_u" now contains 3989705 row versions in 23820 pages > DETAIL: 108927 index row versions were removed. > 0 index pages have been deleted, 0 are currently reusable. > CPU 0.92s/0.78u sec elapsed 80.51 sec. > INFO: index "calls_z" now contains 3989705 row versions in 13607 pages > DETAIL: 108927 index row versions were removed. > 0 index pages have been deleted, 0 are currently reusable. > CPU 0.60s/0.85u sec elapsed 39.77 sec. It's done all the indexes (and seems to have done them quite quickly), and is presumably working on the data now. > It was hanging in this state for more than 3 hours, and I had to kill the > vacuum process. > >From iostat I saw that there was continuous write activity, steadilly about > 1.3 MB/s (the disk system can do about 40 MB/s), and there were iowait > processes. There was no read activity. > > There were no other clients for that database (but there were clients in > other databases in the instance). OK, so you might well be getting the vacuum writing one page, then WAL, then vacuum, etc. That will mean the disk spends most of its time seeking back and fore. How many disks do you have, and is the WAL on a separate set of disks? I think it's just taking a long time because you have so many pages to move and not enough disk bandwidth. Of course the root of the problem is that you had so many dead rows after a failed replication, but you're right and that's another email. -- Richard Huxton Archonet Ltd
> I'm a bit puzzled as to how you managed to get so much free space at the > start of the table. Did the replication work on the second try? It actually worked on third try, I guess. > OK, so you might well be getting the vacuum writing one page, then WAL, > then vacuum, etc. That will mean the disk spends most of its time > seeking back and fore. How many disks do you have, and is the WAL on a > separate set of disks? It is 2 spindles software RAID1 . Till now there were no performance problems with this machine that would mandate trying anything more fancy, this machine is low traffic. Greetings Marcin Mańk