Thread: VACUUM FULL hangs

VACUUM FULL hangs

From
Marcin Mańk
Date:
Hello list.

I recently tried to do a slony replica of my database, and doing it falied.
I retried, and then it succeeded (why it failed is another story).

This caused that in the replica there is a lot of dead tuples ( If i
understand correctly, a failure in creating the replica means a HUGE aborted
transaction - and Slony should TRUNCATE the table, getting rid of dead
tuples, but that is a subject for another list).

so I did vacuum full verbose analyze (does it make sense ?)

This hanged on a (quite large) table:

INFO:  vacuuming "public.calls"
INFO:  "calls": found 7980456 removable, 3989705 nonremovable row versions
in 296943 pages
DETAIL:  0 dead row versions cannot be removed yet.
Nonremovable row versions range from 136 to 224 bytes long.
There were 891 unused item pointers.
Total free space (including removable row versions) is 1594703944 bytes.
197958 pages are or will become empty, including 0 at the end of the table.
212719 pages containing 1588415680 free bytes are potential move
destinations.
CPU 7.25s/3.28u sec elapsed 144.95 sec.
INFO:  index "calls_pkey" now contains 3989705 row versions in 8975 pages
DETAIL:  108927 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.35s/0.59u sec elapsed 39.03 sec.
INFO:  index "calls_cli" now contains 3989705 row versions in 13504 pages
DETAIL:  108927 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.51s/0.60u sec elapsed 58.60 sec.
INFO:  index "calls_dnis" now contains 3989705 row versions in 13600 pages
DETAIL:  108927 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.60s/0.90u sec elapsed 27.05 sec.
INFO:  index "calls_u" now contains 3989705 row versions in 23820 pages
DETAIL:  108927 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.92s/0.78u sec elapsed 80.51 sec.
INFO:  index "calls_z" now contains 3989705 row versions in 13607 pages
DETAIL:  108927 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.60s/0.85u sec elapsed 39.77 sec.

It was hanging in this state for more than 3 hours, and I had to kill the
vacuum process.

From iostat I saw that there was continuous write activity, steadilly about
1.3 MB/s (the disk system can do about 40 MB/s), and there were iowait
processes. There was no read activity.

There were no other clients for that database (but there were clients in
other databases in the instance).

version is 8.1.0 . Autovacuum is off. I upped maintenance_work_mem to 512 MB
. Any hints? If nothing comes up today, I am scratching that replica.


telefony=# \d calls
                                           Table "public.calls"
    Column     |            Type             |
Modifiers
---------------+-----------------------------+------------------------------
------------------------------
 dt            | timestamp without time zone |
 machine_ip    | integer                     |
 port          | integer                     |
 filename      | character varying(15)       |
 account       | character(11)               |
 duration      | integer                     |
 ani           | character(32)               |
 application   | character(32)               |
 dnis          | integer                     |
 z             | integer                     |
 client        | integer                     |
 taryfa        | integer                     |
 operator      | character varying(20)       |
 id            | integer                     | not null default
nextval(('seq_calls_id'::text)::regclass)
 outgoing      | character(12)               |
 release_cause | text                        |
 waiting       | integer                     |
 oper_pin      | integer                     |
Indexes:
    "calls_pkey" PRIMARY KEY, btree (id)
    "calls_u" UNIQUE, btree (dt, dnis, port, machine_ip, account)
    "calls_cli" btree (client, dt)
    "calls_dnis" btree (dnis, dt)
    "calls_z" btree (z, dt)
Triggers:
    _ctele_denyaccess_5 BEFORE INSERT OR DELETE OR UPDATE ON calls FOR EACH
ROW EXECUTE PROCEDURE _ctele.denyaccess('_ctele')


Pozdrawiam
Marcin Mańk


Re: VACUUM FULL hangs

From
Richard Huxton
Date:
Marcin Mańk wrote:
> Hello list.
>
> I recently tried to do a slony replica of my database, and doing it falied.
> I retried, and then it succeeded (why it failed is another story).
>
> This caused that in the replica there is a lot of dead tuples ( If i
> understand correctly, a failure in creating the replica means a HUGE aborted
> transaction - and Slony should TRUNCATE the table, getting rid of dead
> tuples, but that is a subject for another list).
>
> so I did vacuum full verbose analyze (does it make sense ?)

Fair enough. If you want empty tables TRUNCATE is probably a better bet
though.

> This hanged on a (quite large) table:
>
> INFO:  vacuuming "public.calls"
> INFO:  "calls": found 7980456 removable, 3989705 nonremovable row versions
> in 296943 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 136 to 224 bytes long.
> There were 891 unused item pointers.
> Total free space (including removable row versions) is 1594703944 bytes.
> 197958 pages are or will become empty, including 0 at the end of the table.
> 212719 pages containing 1588415680 free bytes are potential move
> destinations.

OK, so there are 7.9 million removable rows and 3.9 million nonremovable
so truncate isn't an option since you have data you presumably want to
keep. It estimates about 200,000 pages will become empty, but none of
them are at the end of the table. This represents 1.5GB of unused
disk-space.

I'm a bit puzzled as to how you managed to get so much free space at the
start of the table. Did the replication work on the second try?

> CPU 7.25s/3.28u sec elapsed 144.95 sec.
> INFO:  index "calls_pkey" now contains 3989705 row versions in 8975 pages
> DETAIL:  108927 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.35s/0.59u sec elapsed 39.03 sec.
> INFO:  index "calls_cli" now contains 3989705 row versions in 13504 pages
> DETAIL:  108927 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.51s/0.60u sec elapsed 58.60 sec.
> INFO:  index "calls_dnis" now contains 3989705 row versions in 13600 pages
> DETAIL:  108927 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.60s/0.90u sec elapsed 27.05 sec.
> INFO:  index "calls_u" now contains 3989705 row versions in 23820 pages
> DETAIL:  108927 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.92s/0.78u sec elapsed 80.51 sec.
> INFO:  index "calls_z" now contains 3989705 row versions in 13607 pages
> DETAIL:  108927 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.60s/0.85u sec elapsed 39.77 sec.

It's done all the indexes (and seems to have done them quite quickly),
and is presumably working on the data now.

> It was hanging in this state for more than 3 hours, and I had to kill the
> vacuum process.
>
>From iostat I saw that there was continuous write activity, steadilly about
> 1.3 MB/s (the disk system can do about 40 MB/s), and there were iowait
> processes. There was no read activity.
>
> There were no other clients for that database (but there were clients in
> other databases in the instance).

OK, so you might well be getting the vacuum writing one page, then WAL,
then vacuum, etc. That will mean the disk spends most of its time
seeking back and fore. How many disks do you have, and is the WAL on a
separate set of disks?

I think it's just taking a long time because you have so many pages to
move and not enough disk bandwidth. Of course the root of the problem is
that you had so many dead rows after a failed replication, but you're
right and that's another email.

--
   Richard Huxton
   Archonet Ltd


Re: VACUUM FULL hangs

From
Marcin Mańk
Date:
> I'm a bit puzzled as to how you managed to get so much free space at the
> start of the table. Did the replication work on the second try?

It actually worked on third try, I guess.

> OK, so you might well be getting the vacuum writing one page, then WAL,
> then vacuum, etc. That will mean the disk spends most of its time
> seeking back and fore. How many disks do you have, and is the WAL on a
> separate set of disks?

It is 2 spindles software RAID1 . Till now there were no performance
problems with this machine that would mandate trying anything more fancy,
this machine is low traffic.

Greetings
Marcin Mańk