Thread: vacuum output question

vacuum output question

From
"Dan Armbrust"
Date:
I have a system backed by a PostgreSQL DB at a customer site that
mysteriously slowed way down - and couldn't keep up with the load for
no apparent reason.

I had them run a vacuum analyze verbose on my database, and had these
lines come back which made me suspicious:

INFO:  index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
DETAIL:  5224 index row versions were removed.
1543 index pages have been deleted, 1373 are currently reusable.
CPU 13.09s/3.51u sec elapsed 157.85 sec.

INFO:  index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
DETAIL:  5224 index row versions were removed.
3706 index pages have been deleted, 3291 are currently reusable.
CPU 31.27s/8.22u sec elapsed 687.60 sec.

INFO:  "cpe": found 5224 removable, 41626 nonremovable row versions in
1303 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 22416 unused item pointers.
0 pages are entirely empty.
CPU 44.46s/11.82u sec elapsed 852.85 sec.


Why did those particular tables and indexes take _so_ long to vacuum?
Perhaps we have a disk level IO problem on this system?

Can someone tell me what 'CPU 44.46s/11.82u sec' means?  I have a
guess, but I'm not sure.

Thanks,

Dan

Re: vacuum output question

From
"Scott Marlowe"
Date:
On Thu, Nov 13, 2008 at 4:08 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> I have a system backed by a PostgreSQL DB at a customer site that
> mysteriously slowed way down - and couldn't keep up with the load for
> no apparent reason.
>
> I had them run a vacuum analyze verbose on my database, and had these
> lines come back which made me suspicious:
>
> INFO:  index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
> DETAIL:  5224 index row versions were removed.
> 1543 index pages have been deleted, 1373 are currently reusable.
> CPU 13.09s/3.51u sec elapsed 157.85 sec.
>
> INFO:  index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
> DETAIL:  5224 index row versions were removed.
> 3706 index pages have been deleted, 3291 are currently reusable.
> CPU 31.27s/8.22u sec elapsed 687.60 sec.
>
> INFO:  "cpe": found 5224 removable, 41626 nonremovable row versions in
> 1303 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 22416 unused item pointers.
> 0 pages are entirely empty.
> CPU 44.46s/11.82u sec elapsed 852.85 sec.

That's a fair bit of dead space, but 60k rows isn't really that many.

> Why did those particular tables and indexes take _so_ long to vacuum?
> Perhaps we have a disk level IO problem on this system?

Assuming pagesize is 8k, then we're talking about scanning 1303*8192
bytes or 10 Megabytes.  My laptop can scan that in less than a second.

So, either the hard drive is incredibly fragmented, or there's
something wrong with that machine.

Re: vacuum output question

From
Tom Lane
Date:
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
> Why did those particular tables and indexes take _so_ long to vacuum?
> Perhaps we have a disk level IO problem on this system?

FWIW, I agree with Scott that you seem to have an overstressed I/O
system.  It's hard to tell why from here.

> Can someone tell me what 'CPU 44.46s/11.82u sec' means?  I have a
> guess, but I'm not sure.

That's the vacuum process's system and user CPU-time consumption as
reported by getrusage(2).  It's evidently only a minor component of the
elapsed runtime, though, so you need to be looking at I/O costs.

            regards, tom lane

Re: vacuum output question

From
Craig Ringer
Date:
Scott Marlowe wrote:
> On Thu, Nov 13, 2008 at 4:08 PM, Dan Armbrust
> <daniel.armbrust.list@gmail.com> wrote:

>> Why did those particular tables and indexes take _so_ long to vacuum?
>> Perhaps we have a disk level IO problem on this system?
>
> Assuming pagesize is 8k, then we're talking about scanning 1303*8192
> bytes or 10 Megabytes.  My laptop can scan that in less than a second.
>
> So, either the hard drive is incredibly fragmented, or there's
> something wrong with that machine.

If you're using a system with RAID storage, I'd check to make sure
there's no high priority background rebuild or RAID scrubbing going on.
I've had issues in the past where RAID issues have crippled a server to
the point where it's barely able to handle ssh logins.

[Assuming a Linux server:] Naturally it's also worth checking for
processes doing absurd amounts of I/O. See pidstat -d, iostat, vmstat.
Also look for processes constantly in the `D' state in `ps aux' or `top'
which is usually due to heavy I/O. dmesg may also contain useful
information if it's a low level issue.

--
Craig Ringer

Re: vacuum output question

From
"Dan Armbrust"
Date:
Thanks everyone.  You have helped back up my suspicions.  It is indeed
a Linux system, and it has a RAID IO system, but I don't yet know the
details of that IO system.  Time to put them back to work looking at
their hardware, rather than blaming our software :)

Thanks for the extra tips on hunting for excess IO from other processes.

Dan

Re: vacuum output question

From
Simon Riggs
Date:
On Thu, 2008-11-13 at 17:30 -0700, Scott Marlowe wrote:

> > I had them run a vacuum analyze verbose on my database, and had these
> > lines come back which made me suspicious:
> >
> > INFO:  index "ix_cpe_ispid" now contains 41626 row versions in 13727 pages
> > DETAIL:  5224 index row versions were removed.
> > 1543 index pages have been deleted, 1373 are currently reusable.
> > CPU 13.09s/3.51u sec elapsed 157.85 sec.
> >
> > INFO:  index "ix_cpe_enable" now contains 41628 row versions in 29417 pages
> > DETAIL:  5224 index row versions were removed.
> > 3706 index pages have been deleted, 3291 are currently reusable.
> > CPU 31.27s/8.22u sec elapsed 687.60 sec.
> >
> > INFO:  "cpe": found 5224 removable, 41626 nonremovable row versions in
> > 1303 pages
> > DETAIL:  0 dead row versions cannot be removed yet.
> > There were 22416 unused item pointers.
> > 0 pages are entirely empty.
> > CPU 44.46s/11.82u sec elapsed 852.85 sec.
>
> That's a fair bit of dead space, but 60k rows isn't really that many.
>
> > Why did those particular tables and indexes take _so_ long to vacuum?
> > Perhaps we have a disk level IO problem on this system?
>
> Assuming pagesize is 8k, then we're talking about scanning 1303*8192
> bytes or 10 Megabytes.  My laptop can scan that in less than a second.
>
> So, either the hard drive is incredibly fragmented, or there's
> something wrong with that machine.

There was concurrent access to the table during VACUUMing, so the long
delay is explainable as long waits for cleanup lock, plus probably
thrashing the cache with bloated indexes. The CPU overhead per row seems
OK. We should instrument the wait time during a VACUUM and report that
also.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: vacuum output question

From
"Dan Armbrust"
Date:
>
> There was concurrent access to the table during VACUUMing, so the long
> delay is explainable as long waits for cleanup lock, plus probably
> thrashing the cache with bloated indexes. The CPU overhead per row seems
> OK. We should instrument the wait time during a VACUUM and report that
> also.
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support


Is that a guess?  Or something you can tell from the log above?
Because there shouldn't have been any concurrent access while the
VACUUM was run - the customers had failed over to a different system,
so while I can't be sure, I expect that there was no other database
activity at the time the command was run.

Thanks,

Dan

Re: vacuum output question

From
Simon Riggs
Date:
On Fri, 2008-11-14 at 09:00 -0600, Dan Armbrust wrote:

> > There was concurrent access to the table during VACUUMing, so the long
> > delay is explainable as long waits for cleanup lock, plus probably
> > thrashing the cache with bloated indexes. The CPU overhead per row seems
> > OK. We should instrument the wait time during a VACUUM and report that
> > also.

> Is that a guess?  Or something you can tell from the log above?

The number of row versions in each index was different after vacuuming.
That tells me some writes occurred and I inferred from that that other
read-only activity occurred as well. Reads or writes will slow down a
VACUUM.

Perhaps you have vacuum_cost_delay set also?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support