Re: Vacuum Full Analyze Stalled - Mailing list pgsql-admin

From: Jim C. Nasby
Subject: Re: Vacuum Full Analyze Stalled
Date:
Msg-id: 20051003194114.GD40138@pervasive.com
In response to: Re: Vacuum Full Analyze Stalled ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses: Re: Vacuum Full Analyze Stalled
List: pgsql-admin
I haven't looked at autovacuum in 8.1 in detail yet, but in previous
versions it was a bad idea to depend on it to vacuum a small,
heavily-updated table frequently enough. The problem is that autovacuum
is single-threaded, so once it starts vacuuming a large table it can be
hours before it gets around to the small table again.
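
If you want to confirm that's what's happening, you can check which
relations the autovacuum backend is holding locks on. A quick sketch,
using the autovacuum pid that shows up in your ps output below:

    select c.relname, l.mode, l.granted
      from pg_locks l
      join pg_class c on c.oid = l.relation
     where l.pid = 3158;  -- pid of the autovacuum process, from ps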

In a case like this, your best bet is to either have the application
vacuum the small table periodically or just do it once a minute via
cron; see the sketch below. Vacuuming a 72-row table should take next
to no time, so running it every minute adds very little overhead even
when the vacuum isn't needed.
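
For example, something like this crontab entry for the postgres user
would do it (a sketch only; the database and table names are taken from
your output below, so adjust to match your setup):

    * * * * * psql -d dtr -c 'VACUUM ANALYZE "DbTranImageStatus"' >/dev/null 2>&1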

On Mon, Oct 03, 2005 at 02:09:37PM -0500, Kevin Grittner wrote:
> The previous report was from the same databases -- I'm a consultant
> tasked with evaluating the various open source options and making
> one of them work with our existing framework.  Jeff developed a new
> app (which is in beta testing) which is our first use of PostgreSQL
> with real production load for prolonged periods of time.
>
> Jeff was monitoring the app over the weekend, and posted here when
> he had problems.  Unfortunately, he didn't capture enough information
> to diagnose much, but I'm passing along the sequence of events
> since they lead up to a failure this morning for which we have more
> detail.
>
> Saturday, 9:00 a.m. -- Noticed performance beginning to degrade.
> Saturday, 1:00 p.m. -- After continued degradation, stopped the client app
>    and ran vacuum full analyze on both the Windows and Linux boxes.
> Saturday, 3:00 p.m. -- Windows box completes the vacuum.
> Sunday, 1:00 p.m. -- Vacuum on Linux still not done.  CPU on Linux
>    box is completely idle.  Killed the vacuum process.  Ran vacuum
>    full analyze on each table individually.  All fine except for the 72 row
>    table with very frequent updates.  Left this one running.
> Sunday, 9:00 p.m. -- Vacuum of 72 row table still running.  Killed all
>    processes and restarted postgres.  Started a vacuum full analyze
>    of the 72 row table again.
> Sunday, 11:00 p.m. -- Vacuum of 72 row table finishes.  Started
>    applications, which had been stopped since Saturday, 1:00 p.m.
> Monday, 9:20 a.m. -- Performance is beginning to degrade again.
>    Detailed information from this point below.
>
> dtr=# select * from pg_locks order by pid;
>    locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid  |           mode           | granted
> ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
>  transactionid |          |          |      |       |      36430565 |         |       |          |    36430565 | 2771 | ExclusiveLock            | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771 | AccessShareLock          | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430565 | 2771 | RowExclusiveLock         | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771 | AccessShareLock          | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430565 | 2771 | RowExclusiveLock         | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777 | AccessShareLock          | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430561 | 2777 | RowExclusiveLock         | t
>  transactionid |          |          |      |       |      36430561 |         |       |          |    36430561 | 2777 | ExclusiveLock            | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777 | AccessShareLock          | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430561 | 2777 | RowExclusiveLock         | t
>  transactionid |          |          |      |       |      36430563 |         |       |          |    36430563 | 2783 | ExclusiveLock            | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783 | AccessShareLock          | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430563 | 2783 | RowExclusiveLock         | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783 | AccessShareLock          | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430563 | 2783 | RowExclusiveLock         | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790 | AccessShareLock          | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    36430559 | 2790 | RowExclusiveLock         | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790 | AccessShareLock          | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    36430559 | 2790 | RowExclusiveLock         | t
>  transactionid |          |          |      |       |      36430559 |         |       |          |    36430559 | 2790 | ExclusiveLock            | t
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
>  transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158 | ExclusiveLock            | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
>  relation      |    30793 |    10340 |      |       |               |         |       |          |    36430567 | 4238 | AccessShareLock          | t
>  transactionid |          |          |      |       |      36430567 |         |       |          |    36430567 | 4238 | ExclusiveLock            | t
> (25 rows)
>
> dtr=# \q
> postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres 28937  0.0  0.0  2436  992 pts/1    S    Sep30   0:00 su postgres
> postgres 28938  0.0  0.0  2944 1620 pts/1    S    Sep30   0:00 bash
> postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
> postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:08 postgres: writer process
> postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:27 postgres: stats buffer process
> postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:22 postgres: stats collector process
> postgres  2724  0.0  0.0 171828 7156 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32986) idle
> postgres  2771  6.4  2.0 171996 167176 ?     S    Oct02  44:33 postgres: dtr dtr 127.0.0.1(32990) idle
> postgres  2773  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32991) idle
> postgres  2774  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32992) idle
> postgres  2775  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32993) idle
> postgres  2776  0.0  0.0 171720 6848 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32994) idle
> postgres  2777  6.4  2.0 171996 167228 ?     R    Oct02  44:29 postgres: dtr dtr 127.0.0.1(32995) UPDATE
> postgres  2778  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32996) idle
> postgres  2779  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32997) idle
> postgres  2780  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32998) idle
> postgres  2781  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(32999) idle
> postgres  2782  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33000) idle
> postgres  2783  6.4  2.0 172016 167292 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33001) idle
> postgres  2784  0.0  0.0 171460 5824 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33002) idle
> postgres  2785  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33003) idle
> postgres  2786  6.4  2.0 172000 167216 ?     S    Oct02  44:38 postgres: dtr dtr 127.0.0.1(33004) idle
> postgres  2787  6.4  2.0 172000 167236 ?     R    Oct02  44:33 postgres: dtr dtr 127.0.0.1(33005) UPDATE
> postgres  2788  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33006) idle
> postgres  2789  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33007) idle
> postgres  2790  6.4  2.0 171992 167280 ?     R    Oct02  44:27 postgres: dtr dtr 127.0.0.1(33008) UPDATE
> postgres  2791  0.0  0.0 171460 5828 ?       S    Oct02   0:00 postgres: dtr dtr 127.0.0.1(33009) idle
> postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
> postgres  4289  0.0  0.0  2148  672 pts/1    R+   09:21   0:00 ps aux
> postgres  4290  0.0  0.0  1688  688 pts/1    R+   09:21   0:00 grep ^postgres
>
> oid 30871 is the 72 row table.
> oid 30976 is the index for its primary key.
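
[Side note: those oids come straight from the relation column of
pg_locks, and can be resolved back to names with a quick catalog
lookup, e.g.

    select oid, relname from pg_class where oid in (30871, 30976);
]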
>
> We stopped the applications, and the autovacuum process didn't
> wake up.
>
> dtr=# select * from pg_locks order by pid;
>    locktype    | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid  |           mode           | granted
> ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
>  relation      |    30793 |    30976 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
>  transactionid |          |          |      |       |      35633343 |         |       |          |    35633343 | 3158 | ExclusiveLock            | t
>  relation      |    30793 |    30871 |      |       |               |         |       |          |    35633343 | 3158 | ShareUpdateExclusiveLock | t
>  relation      |    30793 |    10340 |      |       |               |         |       |          |    36600517 | 4498 | AccessShareLock          | t
>  transactionid |          |          |      |       |      36600517 |         |       |          |    36600517 | 4498 | ExclusiveLock            | t
> (5 rows)
>
> dtr=# \q
> postgres@linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres  2612  0.0  0.0 170968 4952 ?       S    Oct02   0:00 /usr/local/pgsql/bin/postmaster
> postgres  2614  0.0  1.9 171200 165528 ?     S    Oct02   0:10 postgres: writer process
> postgres  2615  0.0  0.0  7240 2788 ?        S    Oct02   0:32 postgres: stats buffer process
> postgres  2617  0.0  0.0  6556 1988 ?        S    Oct02   0:26 postgres: stats collector process
> postgres  3158  0.0  0.0 188324 8044 ?       S    Oct02   0:00 postgres: autovacuum process   dtr
> postgres  4481  0.0  0.0  2432  988 pts/1    S    10:48   0:00 su postgres
> postgres  4482  0.0  0.0  2944 1604 pts/1    S    10:48   0:00 bash
> postgres  4499  0.0  0.0  2148  672 pts/1    R+   10:49   0:00 ps aux
> postgres  4500  0.0  0.0  1680  636 pts/1    S+   10:49   0:00 grep ^postgres
> postgres@linpost:/var/pgsql/data> uptime
>  10:48am  up 6 days 19:20,  2 users,  load average: 0.00, 0.28, 0.73
> postgres@linpost:/var/pgsql/data> vmstat 10
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  0  0      8 278376  62676 7770684    0    0    15     7   17    18 11  1 87  1
>  0  0      8 278376  62700 7770660    0    0     0    11 1035     0  1  0 99  0
>  0  0      8 278440  62700 7770660    0    0     0     0 1013     0  0  0 100  0
>  0  0      8 278440  62700 7770660    0    0     0     0 1016     0  1  0 99  0
>  0  0      8 278440  62700 7770660    0    0     0     4 1018     0  1  0 99  0
>
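[Side note: every row in both pg_locks listings above shows granted = t,
so nothing was actually waiting on a lock; whatever the autovacuum
process was stuck on, it wasn't a lock wait. A quick way to filter for
genuine lock waits:

    select * from pg_locks where not granted;
]
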
> We found the 72 row table somewhat bloated, so we cleaned it up.
>
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6665 pages
> DETAIL:  6555 index pages have been deleted, 6555 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.01 sec.
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 479137 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.01s/0.01u sec elapsed 0.02 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 3000 of 4244 pages, containing 68 live rows and 0 dead rows; 68 rows in sample,
96estimated total rows 
> VACUUM
> dtr=# vacuum full analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 68 to 68 bytes long.
> There were 479137 unused item pointers.
> Total free space (including removable row versions) is 32760236 bytes.
> 4227 pages are or will become empty, including 0 at the end of the table.
> 4244 pages containing 32760236 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL:  0 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "DbTranImageStatus": moved 22 row versions, truncated 4244 to 1 pages
> DETAIL:  CPU 0.00s/0.02u sec elapsed 0.03 sec.
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL:  22 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 0 to 0 bytes long.
> There were 0 unused item pointers.
> Total free space (including removable row versions) is 0 bytes.
> 0 pages are or will become empty, including 0 at the end of the table.
> 0 pages containing 0 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72
estimatedtotal rows 
> VACUUM
> dtr=# reindex table "DbTranImageStatus";
> REINDEX
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO:  vacuuming "public.DbTranImageStatus"
> INFO:  index "DbTranImageStatusPK" now contains 72 row versions in 2 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 1 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 48 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  vacuuming "pg_toast.pg_toast_272793"
> INFO:  index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL:  0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO:  analyzing "public.DbTranImageStatus"
> INFO:  "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72
estimatedtotal rows 
> VACUUM
>
> These all ran sub-second.  We updated postgresql.conf for more
> aggressive autovacuum, and restarted postgres, then restarted the
> application.
>
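[Side note: "more aggressive" autovacuum in 8.1 usually means tightening
postgresql.conf settings along these lines -- illustrative values only,
not necessarily what Kevin used:

    autovacuum = on
    autovacuum_naptime = 60               # seconds between autovacuum runs
    autovacuum_vacuum_threshold = 200     # base number of updated rows
    autovacuum_vacuum_scale_factor = 0.1  # plus 10% of the table size
]
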
> This failure seems rather different from the one I previously posted,
> since it was blocking on the application table, rather than
> pg_constraint_contypid_index, and it did not wake up when all
> other processes were stopped.
>
> As before, both boxes are running 8.1beta2.  Windows is not showing
> these problems with the autovacuum blocking; just Linux.
>
> -Kevin
>
>
> >>> Tom Lane <tgl@sss.pgh.pa.us> 10/02/05 8:53 PM >>>
> "Jeff Kirby" <Jeff.Kirby@wicourts.gov> writes:
> > the Linux box however is still chugging away this morning... and
> > appears to be stuck on vacuuming "pg_constraint_contypid_index".  How
> > do I know... well I don't really... I'm inferring based on the order
> > of the log output on the Windows box.
>
> Looking in pg_locks would give you a more reliable indicator of what the
> VACUUM is currently working on.
>
> Is the Linux box otherwise idle?  There was another report recently of a
> vacuum hung up for a long time on pg_constraint_contypid_index, but it
> seemed to be due to extremely heavy usage of domain types ...
>
>             regards, tom lane
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
