Thread: Slow Vacuum was: vacuum output question

Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
To follow up on an old thread that I started - I had a customer who
had a system where manual vacuum runs were taking a very long time to
run.  I was seeing output like this:

INFO:  "cpe": found 95498 removable, 18757 nonremovable row versions
in 11117 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 280173 unused item pointers.
0 pages are entirely empty.
CPU 5.35s/0.99u sec elapsed 724.38 sec.

Then, running vacuum again immediately afterward, on a system that was
basically idle, would result in nearly the same amount of time to
vacuum the table.

Getting a copy of the database from the customer, loading it onto
my Postgres system, and running the vacuum would result in runs that
took less than a second (as expected).

General opinion was that it was a disk I/O problem.  We rebooted the
system, and magically, the problem went away.

As tends to be the case with "magically" fixed problems, this one is
back.  Now I have a different customer running Postgres 8.1 on Fedora
Core 6, and their system produced the log snippet above.

hdparm shows disk I/O throughput being perfectly normal on their system.
Everything else on the system seems to be working correctly.  The
reboot solution doesn't help.  Vacuum still runs painfully slow.

I'm still waiting for a copy of their postgresql config file, but I'm
guessing that almost everything was left on the default settings.

I don't suppose anyone knows of any bugs that existed between Postgres
8.1 and older Linux kernels that led to behavior like this?  I can't
really just ask them to upgrade their OS and/or Postgres on the hunch
that the problem should go away....  And I haven't yet been able to
reproduce anything like this on a system that I have control over.

Thanks for any ideas.

Dan

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Dec 30, 2008 at 9:10 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> To follow up on an old thread that I started - I had a customer who
> had a system where manual vacuum runs were taking a very long time to
> run.  I was seeing output like this:
>
> INFO:  "cpe": found 95498 removable, 18757 nonremovable row versions
> in 11117 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> There were 280173 unused item pointers.
> 0 pages are entirely empty.
> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>
> Then, running vacuum again immediately afterward, on a system that was
> basically idle, would result in nearly the same amount of time to
> vacuum the table.

You do realize that except for the end of a table, vacuum recovers no
actual space, just makes it available for new tuples to move in there.
 So it makes sense that the second vacuum would take just as long, or
nearly so.

CLUSTER, or VACUUM FULL followed by REINDEX, will regain the lost space.
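For the table from the thread's log output, that maintenance would look something like the following sketch ("cpe" and "pk_cpe" are the names that appear in the thread; this is 8.1-era syntax):

```sql
-- Rewrite the table to reclaim the space, then rebuild its indexes:
VACUUM FULL cpe;
REINDEX TABLE cpe;

-- Or rewrite the table in index order (the pre-8.4 spelling of CLUSTER):
CLUSTER pk_cpe ON cpe;
```

Both take exclusive locks on the table, so this is maintenance-window work rather than something to run under load.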

> Getting a copy of the database from the customer, and loading it onto
> my Postgres System and running the vacuum would result in runs that
> took less than a second (as expected).
>
> General opinion was that it was a disk-io problem.  We rebooted the
> system, and magically, the problem went away.

Hard to say.  Have them run

vmstat 1

while vacuuming and see what bi/bo look like.
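A tiny helper along those lines can reduce a captured vmstat log to one number (a hedged sketch; the field positions assume the standard vmstat layout, and "vmstat.log" is a placeholder name):

```shell
# Capture with:  vmstat 1 60 > vmstat.log   (while the vacuum runs)
# Then average the bi/bo (blocks in/out) columns of the log.
avg_io() {
    # Skip the two header lines; bi is field 9, bo is field 10.
    awk 'NR > 2 { bi += $9; bo += $10; n++ }
         END { if (n) printf "avg bi=%.0f  avg bo=%.0f\n", bi / n, bo / n }' "$1"
}
```

Run it as `avg_io vmstat.log`; a bi/bo average far below what the disks should sustain points the same direction as a high wa column.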

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
>> INFO:  "cpe": found 95498 removable, 18757 nonremovable row versions
>> in 11117 pages
>> DETAIL:  0 dead row versions cannot be removed yet.
>> There were 280173 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>>
>> Then, running vacuum again immediately afterward, on a system that was
>> basically idle, would result in nearly the same amount of time to
>> vacuum the table.
>
> You do realize that except for the end of a table, vacuum recovers no
> actual space, just makes it available for new tuples to move in there.
>  So it makes sense that the second vacuum would take just as long, or
> nearly so.
>

Yep.  The real issue is that it took 724 seconds, instead of say, a
half second - like it does on my system.  I wasn't sure if I should
expect vacuum to take longer to run when it finds a large number of
tuples that it needs to make available, so I just have them run it
twice so I can easily compare the time that it takes with essentially
no work to do.
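Comparing the two runs is easier if the timings are pulled out of the logs mechanically; a hedged sketch, assuming the log lines look like the CPU/elapsed line quoted above ("vacuum.log" is a placeholder name):

```shell
# Extract the "elapsed NNN sec" figures from a saved VACUUM VERBOSE log,
# one number per line, so back-to-back runs can be compared at a glance.
elapsed_secs() {
    sed -n 's/.*elapsed \([0-9.]*\) sec.*/\1/p' "$1"
}
```

Against the log at the top of the thread this would print 724.38.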

>
> Hard to say.  Have them run
>
> vmstat 1
>
> while vacuuming and see what bi/bo look like.
>

Haven't looked at that yet on this particular system.  Last time, on
different hardware when this occurred the vmstat 'wa' column showed
very large values while vacuum was running.  I don't recall what the
bi/bo columns indicated.

top also showed very high load averages while vacuum was running - but
basically no CPU use.

Are there any common tools that could do a better disk benchmark than
hdparm -Tt?

Thanks,

Dan

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
>> INFO:  "cpe": found 95498 removable, 18757 nonremovable row versions
>> in 11117 pages
>> DETAIL:  0 dead row versions cannot be removed yet.
>> There were 280173 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 5.35s/0.99u sec elapsed 724.38 sec.
>
> How many idle transactions are there?
>

Not sure.  I don't expect that there were many; their system isn't
very highly loaded compared to most of our customers' systems.

The way that they reported the problem to us was that if they enabled
autovacuum, whenever it ran (about 4 times an hour), their system would
stop processing the things it needed to process for several minutes,
due to table lock contention.

Their workaround had been to run a daily autovacuum at the lowest load
time of day, to cause the least disruption.

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Dec 30, 2008 at 9:32 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> Haven't looked at that yet on this particular system.  Last time, on
> different hardware when this occurred the vmstat 'wa' column showed
> very large values while vacuum was running.  I don't recall what the
> bi/bo columns indicated.

Definitely sounds like poor I/O.

> top also showed very high load averages while vacuum was running - but
> basically no cpu use.

Yeah, load is the number of things running or waiting to run.  If
vacuum is sucking up all the I/O, and this machine doesn't have much I/O
capability, then it's quite possible for other processes stuck behind
it to crank up the load factor.

Also, were there any vacuum cost delay settings over 0 on this machine
when the test was run?

> Are there any common tools that could do a better disk benchmark than
> hdparm -Tt?

Keep in mind, hdparm hits the drive directly, not through the
filesystem.  I use bonnie++ or iozone to test I/O.

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Dec 30, 2008 at 9:47 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Keep in mind, hdparm hits the drive directly, not through the
> filesystem.  I use bonnie++ or iozone to test io.

Also dd and vmstat together.
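A minimal version of the dd approach, assuming GNU dd and a scratch path on the same filesystem as the data directory (the path and sizes here are illustrative, not recommendations):

```shell
# Write ~100 MB through the filesystem and report throughput.
# hdparm -t reads the raw device, so it can look healthy even
# when writes through the filesystem are slow.
TESTFILE=/tmp/ddtest.bin
dd if=/dev/zero of="$TESTFILE" bs=8k count=12800 conv=fsync 2>&1 | tail -n 1
rm -f "$TESTFILE"
```

Watching `vmstat 1` in another terminal while this runs shows whether bo hits the same ceiling the vacuum does.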

Re: Slow Vacuum was: vacuum output question

From
Andrew Sullivan
Date:
On Tue, Dec 30, 2008 at 10:37:04AM -0600, Dan Armbrust wrote:

> The way that they reported the problem to us was that if they enable
> autovacuum, whenever it runs (about 4 times an hour) it would stop
> processing the things it needed to process, due to table lock
> contention for several minutes.

If that's true, there are a couple of possibilities for why.  One of them
is that autovacuum needs tuning.  Another is that certain tables
aren't really good candidates for autovacuum (before 8.3, this is pretty
common).  Another is that they're starved for I/O; you need to check
that when processing and vacuum is happening, not the rest of the
time.  Another is that they're doing something like explicit locking
or very large numbers of UPDATEs in one transaction, and vacuum is
just highlighting a problem with what they're doing.

> Their workaround had been to run a daily autovacuum at the lowest load
> time of day, to cause the least disruption.

What is a "daily autovacuum"?  It sounds like some tables just need
vacuuming more often.  If they find that the system is not responsive
during that, it tells us that they need more disk bandwidth or that
they need to integrate vacuuming some tables with their program.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
>
>> Their workaround had been to run a daily autovacuum at the lowest load
>> time of day, to cause the least disruption.
>
> What is a "daily autovacuum"?  It sounds like some tables just need
> vacuuming more often.  If they find that the system is not responsive
> during that, it tells us that they need more disk bandwidth or that
> they need to integrate vacuuming some tables with their program.
>

Sorry, I meant a daily manual vacuum.

On paper, their hardware is plenty fast for their workload.  Out of
hundreds of sites, all running the same software putting load on the
database, this is only the second time where we have seen this odd
behaviour of very slow vacuums.

I guess I was hoping that someone would be able to chime in and say -
yes, in so and so version, we fixed an obscure bug that sometimes
caused huge slowdowns, perhaps when combined with certain Linux
kernels.  It was a nice dream anyway :)

iozone looks useful.  I'll see if I can get on their system and do
some proper benchmarks.

Thanks,

Dan

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Dec 30, 2008 at 10:14 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
>
> On paper, their hardware is plenty fast for their workload.  Out of
> hundreds of sites, all running the same software putting load on the
> database, this is only the second time where we have seen this odd
> behaviour of very slow vacuums.
>
> I guess I was hoping that someone would be able to chime in and say -
> yes, in so and so version, we fixed an obscure bug that sometimes
> caused huge slowdowns, perhaps when combined with certain linux
> kernels.  It was a nice dream anyway :)

This actually kinda sounds like one of those RAID cards that every so
often decides to check the consistency of the RAID array, and when
it's doing that, the whole system slows to a crawl.

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
Here is an interesting new datapoint.

Modern Ubuntu distro - PostgreSQL 8.1.  SATA drive.  No RAID.  Cannot
reproduce slow vacuum performance - vacuums take less than a second
for the whole database.

Reinstall OS - Fedora Core 6 - PostgreSQL 8.1.  Push data through
PostgreSQL for a couple hours (same as above) and now vacuum reports
this:

INFO:  vacuuming "public.cpe"
INFO:  index "pk_cpe" now contains 50048 row versions in 2328 pages
DETAIL:  415925 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.05s/0.33u sec elapsed 0.94 sec.
INFO:  index "ix_cpe_ispid" now contains 50090 row versions in 1338 pages
DETAIL:  415925 index row versions were removed.
953 index pages have been deleted, 0 are currently reusable.
CPU 0.27s/0.22u sec elapsed 8.93 sec.
INFO:  index "ix_cpe_enable" now contains 50676 row versions in 1637 pages
DETAIL:  415925 index row versions were removed.
1161 index pages have been deleted, 0 are currently reusable.
CPU 0.45s/0.31u sec elapsed 14.01 sec.
INFO:  "cpe": removed 415925 row versions in 10844 pages
DETAIL:  CPU 1.48s/0.25u sec elapsed 35.86 sec.
INFO:  "cpe": found 415925 removable, 50003 nonremovable row versions
in 10849 pages
DETAIL:  6 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 2.40s/1.18u sec elapsed 61.13 sec.



It took 61 seconds to vacuum, and the number of index row versions
removed was huge.
We then issued a REINDEX command for the entire database, and now the
vacuum times are back down under a second.

What on earth could be going on between PostgreSQL 8.1 and Fedora 6
that is bloating and/or corrupting the indexes like this?

Thanks,

Dan

Re: Slow Vacuum was: vacuum output question

From
Alvaro Herrera
Date:
Dan Armbrust escribió:

> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?

Postgres 8.1 was slow to vacuum btree indexes.  My guess is that your
indexes are so bloated that it takes a lot of time to scan them.

I think the solution here is to vacuum this table far more often.
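One way to check that theory is to watch the indexes' on-disk size before and after a REINDEX. A sketch (the size functions already exist in 8.1; the index names are the ones from the VACUUM VERBOSE output quoted in this thread):

```sql
-- Compare current index size against what a freshly rebuilt index occupies.
SELECT relname,
       relpages,
       pg_size_pretty(pg_relation_size(oid)) AS on_disk
FROM pg_class
WHERE relname IN ('pk_cpe', 'ix_cpe_ispid', 'ix_cpe_enable')
ORDER BY relpages DESC;
```

If the sizes drop by an order of magnitude after REINDEX, bloat is the story, and more frequent vacuuming is what keeps it from coming back.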

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> Here is an interesting new datapoint.
>
> Modern Ubuntu distro - PostgreSQL 8.1.  SATA drive.  No Raid.  Cannot
> reproduce slow vacuum performance - vacuums take less than a second
> for the whole database.
>
> Reinstall OS - Fedora Core 6 - PostgreSQL 8.1.  Push data through
> PostgreSQL for a couple hours (same as above) and now vacuum reports
> this:

Are you pushing the same amount of data through the Ubuntu server?  If
not, then the comparison is invalid; if so, then yeah, there's some
kind of difference between the platforms.

Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10.
Also it's more likely to be installed on older and / or slower
equipment.

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
On Tue, Jan 6, 2009 at 3:01 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Dan Armbrust escribió:
>
>> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
>> that is bloating and/or corrupting the indexes like this?
>
> Postgres 8.1 was slow to vacuum btree indexes.  My guess is that your
> indexes are so bloated that it takes a lot of time to scan them.
>
> I think the solution here is to vacuum this table far more often.
>
> --
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.
>

Actually, the customer-reported problem is that when they enable
autovacuum, the performance basically tanks because vacuum runs so
slowly they can't bear to have it run frequently.

Though, perhaps they had bloated indexes before they started
autovacuum, and it never fixed them.  Perhaps it will behave properly
if we do a reindex, and then enable autovacuum.

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
> On Tue, Jan 6, 2009 at 1:39 PM, Dan Armbrust
> <daniel.armbrust.list@gmail.com> wrote:
>> Here is an interesting new datapoint.
>>
>> Modern Ubuntu distro - PostgreSQL 8.1.  SATA drive.  No Raid.  Cannot
>> reproduce slow vacuum performance - vacuums take less than a second
>> for the whole database.
>>
>> Reinstall OS - Fedora Core 6 - PostgreSQL 8.1.  Push data through
>> PostgreSQL for a couple hours (same as above) and now vacuum reports
>> this:
>
> Are you pushing the same amount of data through the Ubuntu server?  If
> not, then the comparison is invalid; if so, then yeah, there's some
> kind of difference between the platforms.
>
> Note that Fedora Core 6 is quite old compared to Ubuntu 8.04 or 8.10.
> Also it's more likely to be installed on older and / or slower
> equipment.
>

Yep - actually, we pushed much more data through the Ubuntu system and
could never reproduce the problem.  On the Fedora Core 6 system, the
problem happened very quickly.

In our testing here, the Ubuntu test was on the same hardware as the
Fedora Core 6 system (not just identical, but the same actual box).

It seems that there is some sort of bad interaction between some part
of the older OS and PostgreSQL.  We have also seen what appears to be
the same issue on a CentOS 4.4 system, which is at a rather similar
package level to Fedora Core 6.

Re: Slow Vacuum was: vacuum output question

From
Alan Hodgson
Date:
On Tuesday 06 January 2009, "Dan Armbrust" <daniel.armbrust.list@gmail.com>
wrote:
> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?

Obviously the choice of operating system has no impact on the contents of
your index.

A better question might be, what did your application or maintenance
procedures do different in the different tests?


--
Alan

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
>
> Obviously the choice of operating system has no impact on the contents of
> your index.
>
> A better question might be, what did your application or maintenance
> procedures do different in the different tests?
>
>
> --
> Alan

Our problem for a long time has been assuming the "obvious".  But we
now have tests that show otherwise.

I'm now thinking something along the lines of an obscure file system
or kernel interaction bug, one that was perhaps corrected in newer
releases of the OS.

Now that we can finally reproduce the problem in house, we are still
doing more tests to figure out specifics - does the problem go away
with Postgres 8.3, ext2/ext3/reiserFS, etc.

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Tue, Jan 6, 2009 at 2:07 PM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
>
> Actually, the customer-reported problem is that when they enable
> autovacuum, the performance basically tanks because vacuum runs so
> slowly they can't bear to have it run frequently.

Actually this is kinda backwards.  What's happening is that the vacuum
uses up so much I/O that nothing else can get through.  The answer is
to make it run slower, by setting autovacuum_vacuum_cost_delay to 10
or 20 and seeing if they can then run autovacuum during the day
without these issues.

Note that vacuum was improved a fair bit from 8.1 to 8.2, and even
more so from 8.2 to 8.3.

> Though, perhaps they had bloated indexes before they started
> autovacuum, and it never fixed them.  Perhaps it will behave properly
> if we do a reindex, and then enable autovacuum.

Definitely look at the cost_delay setting.  Makes a huge difference.
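In postgresql.conf terms, that suggestion amounts to something like the following fragment (8.1 parameter names; the delay values are the ones floated above, not tuned recommendations):

```
# Throttle autovacuum's I/O appetite instead of disabling it:
autovacuum = on
autovacuum_vacuum_cost_delay = 20   # ms to sleep each time the page-cost budget is spent
autovacuum_vacuum_cost_limit = -1   # -1 means: fall back to vacuum_cost_limit (default 200)
```

With a nonzero delay, each autovacuum takes longer on the wall clock but leaves far more I/O bandwidth for everything else.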

Re: Slow Vacuum was: vacuum output question

From
Tom Lane
Date:
"Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
> INFO:  "cpe": found 415925 removable, 50003 nonremovable row versions
> in 10849 pages

> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
> that is bloating and/or corrupting the indexes like this?

You're focusing on the indexes when the problem is dead table rows.

It's very hard to believe that there's any OS dependence as such
involved in that.  I wonder whether (a) the Ubuntu and Fedora packages
you're using are the same 8.1.x point release; (b) if there's any
interesting non-default behavior built into the Ubuntu packaging
... like running autovacuum automatically, for instance.

            regards, tom lane

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
On Tue, Jan 6, 2009 at 3:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Dan Armbrust" <daniel.armbrust.list@gmail.com> writes:
>> INFO:  "cpe": found 415925 removable, 50003 nonremovable row versions
>> in 10849 pages
>
>> What on earth could be going on between PostgreSQL 8.1 and Fedora 6
>> that is bloating and/or corrupting the indexes like this?
>
> You're focusing on the indexes when the problem is dead table rows.
>
> It's very hard to believe that there's any OS dependence as such
> involved in that.  I wonder whether (a) the Ubuntu and Fedora packages
> you're using are the same 8.1.x point release; (b) if there's any
> interesting non-default behavior built into the Ubuntu packaging
> ... like running autovacuum automatically, for instance.
>
>                        regards, tom lane
>

In our testing, Postgres 8.1 was built from source (PostgreSQL website
source) on both systems.  No distro packages involved.

Believe me, we are as baffled as you.  We have been chasing this bug
off and on for months on a couple of different customer sites now.

Thanks,

Dan

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
I'm no closer to a solution, but here are some additional data points
- all taken on Fedora Core 6.

Postgres 8.1 built from source.  Auto vacuum disabled.

Create Empty Database.
Run our load on the system for 2 hours to populate and exercise the database.
Run Vacuum.  Takes more than a minute.
Run Vacuum immediately again.  Takes more than a minute.
Reindex the database.  Takes 10 seconds.
Run Vacuum again.  Takes 2 seconds.
Allow load to run on the system for 30 minutes.
Run Vacuum again.  Takes more than a minute.


Postgres 8.3 built from source.  Auto vacuum disabled.

Create Empty Database.
Run our load on the system for 2 hours to populate and exercise the database.
Run Vacuum.  Takes more than a minute.
Run Vacuum immediately again.  Takes 15 seconds.
Run Vacuum immediately again.  Takes 15 seconds.
Reindex the database.  Takes 10 seconds.
Run Vacuum again.  Takes 2 seconds.

So, PostgreSQL 8.3 shows better behaviour, but it is still showing
some sort of performance issue which a reindex fixes.
And then of course, the kicker is that we can't recreate any of these
issues when running the exact same test, on the exact same hardware,
but using a different underlying OS.  When we were running under a
modern Ubuntu, vacuum never took more than 2 seconds.  We will be
checking other OSes soon.  I guess if we can't figure out what is
causing it, we can at least isolate the distros that we need to tell
our customers not to use (or to schedule a reindex if they insist on
not upgrading their OS).

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Wed, Jan 7, 2009 at 10:26 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> I'm no closer to a solution, but here are some additional data points
> - all taken on Fedora Core 6.

So, what does vmstat 10 say when you're running the "long" vacuum on
each system?

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
On PostgreSQL 8.1, while a long vacuum is running, the output of
vmstat 10 looks like this (sorry, can't format this very well in this
e-mail client):

 r  b  swpd   free   buff    cache  si  so    bi    bo   in     cs us sy id wa st
 5  2   112  53732   4388  1163404   0   0 13524  1322  941  75053 45 55  0  0  0
 3  2   112  51596   2336  1160032   0   0 20731   732  999 196116 35 58  0  7  0
 3  2   112  53784   2448  1155388   0   0  5729  1680  716 114442 46 52  0  2  0
 7  1   112  52088   2568  1156796   0   0   205  1919  546  28929 56 44  0  0  0
 3  2   112  51772   2652  1157028   0   0   162  2010  534   2845 56 44  0  0  0
 3  2   112  52296   2736  1156332   0   0   174  1910  517   2648 53 47  0  0  0
 2  2   112  52736   2832  1155388   0   0   173  1960  521   2698 57 43  0  0  0
 6  1   112  51172   2908  1157400   0   0   173  1825  511   2621 52 48  0  0  0
 2  1   112  52156   2984  1155876   0   0   152  1766  495   2669 53 47  0  0  0
 5  2   112  53828   3064  1154028   0   0   124  1762  498   2673 54 46  0  0  0
 5  0   112  53396   3140  1154684   0   0   122  1803  532   2871 52 48  0  0  0
 6  1   112  52004   3240  1155892   0   0   150  2004  522   3712 56 44  0  0  0
 5  2   112  53100   3348  1152364   0   0   128  2142  543   2711 54 38  0  8  0
 1  2   112  56704   3428  1148132   0   0   102  2229  560   2925 56 44  0  0  0
 2  1   112  54876   3500  1150080   0   0    91  2129  598   3068 65 35  0  0  0
 3  1   112  53196   3588  1152164   0   0    90  2091  542   2803 57 43  0  0  0
 8  2   112  53528   3664  1151868   0   0   106  2180  553   2947 55 39  0  6  0

Vacuum hasn't stopped yet at this point - it's been running for many minutes.


On PostgreSQL 8.3, we had this:

 r  b  swpd   free   buff    cache  si  so   bi    bo   in   cs us sy id wa st
 6  0  9604  51932  14276  1138996   0   0    3  1808  523 5115 45  9 42  4  0
 2  0  9580  53284  14124  1138092  10   0  564  2561  585 5126 53 10 19 18  0
 7  0  9564  53412  14144  1137696   0   0    0  1682  497 4985 42  9 48  2  0
 3  0  9532  53320  13880  1137940   6   0   37  1757  538 4979 48  8 41  3

More random data:

On an overnight test of PostgreSQL 8.3 on Fedora Core 6, with
autovacuum on (using all default settings), the amount of time that it
takes to run a manual vacuum on a constant-sized database (only row
updates, or deletes followed by a replacement add) has doubled from 2
seconds to 4 seconds.  A reindex brings the time back down to 2
seconds.

Dan

Re: Slow Vacuum was: vacuum output question

From
"Scott Marlowe"
Date:
On Thu, Jan 8, 2009 at 10:43 AM, Dan Armbrust
<daniel.armbrust.list@gmail.com> wrote:
> On PostgreSQL 8.1, while a long vacuum is running, the output of
> vmstat 10 looks like this (sorry, can't format this very well in this
> e-mail client):
>
>  r  b  swpd   free   buff    cache  si  so    bi    bo   in     cs us sy id wa st
>  5  2   112  53732   4388  1163404   0   0 13524  1322  941  75053 45 55  0  0  0
>  3  2   112  51596   2336  1160032   0   0 20731   732  999 196116 35 58  0  7  0
>  3  2   112  53784   2448  1155388   0   0  5729  1680  716 114442 46 52  0  2  0
>  7  1   112  52088   2568  1156796   0   0   205  1919  546  28929 56 44  0  0  0
>  3  2   112  51772   2652  1157028   0   0   162  2010  534   2845 56 44  0  0  0
>  3  2   112  52296   2736  1156332   0   0   174  1910  517   2648 53 47  0  0  0
>  2  2   112  52736   2832  1155388   0   0   173  1960  521   2698 57 43  0  0  0
>  6  1   112  51172   2908  1157400   0   0   173  1825  511   2621 52 48  0  0  0
>  2  1   112  52156   2984  1155876   0   0   152  1766  495   2669 53 47  0  0  0
>  5  2   112  53828   3064  1154028   0   0   124  1762  498   2673 54 46  0  0  0
>  5  0   112  53396   3140  1154684   0   0   122  1803  532   2871 52 48  0  0  0
>  6  1   112  52004   3240  1155892   0   0   150  2004  522   3712 56 44  0  0  0
>  5  2   112  53100   3348  1152364   0   0   128  2142  543   2711 54 38  0  8  0
>  1  2   112  56704   3428  1148132   0   0   102  2229  560   2925 56 44  0  0  0
>  2  1   112  54876   3500  1150080   0   0    91  2129  598   3068 65 35  0  0  0
>  3  1   112  53196   3588  1152164   0   0    90  2091  542   2803 57 43  0  0  0
>  8  2   112  53528   3664  1151868   0   0   106  2180  553   2947 55 39  0  6  0
>
> Vacuum hasn't stopped yet at this point - it's been running for many minutes.
>
>
> On PostgreSQL 8.3, we had this:
>
>  r  b  swpd   free   buff    cache  si  so   bi    bo   in   cs us sy id wa st
>  6  0  9604  51932  14276  1138996   0   0    3  1808  523 5115 45  9 42  4  0
>  2  0  9580  53284  14124  1138092  10   0  564  2561  585 5126 53 10 19 18  0
>  7  0  9564  53412  14144  1137696   0   0    0  1682  497 4985 42  9 48  2  0
>  3  0  9532  53320  13880  1137940   6   0   37  1757  538 4979 48  8 41  3
>
> More random data:
>
> On an overnight test of PostgreSQL 8.3 on Fedora Core 6, with
> auto-vacuum on (using all default settings) the amount of time that it
> takes to run a manual vacuum on a constant sized database (only row
> updates, or deletes followed by a replacement add) has doubled from 2
> seconds to 4 seconds.  A reindex brings the time back down to 2
> seconds.

Well, your throughput on this machine is horrible.  It looks like with
8.1 all your time is sys + user CPU, while with 8.3 you've
got more idle and more I/O wait, which tells me that 8.3 is smarter
about vacuuming, so it's spending less time working the CPUs and more
time waiting for the I/O subsystem.

Either way, getting only 2 or so megs a second of writes is pretty bad.  I
can get those numbers from a laptop.  An older laptop, like a single-core
1.6GHz Pentium M based T42 or something.  My laptop, which is new
from last year, is about twice as fast as your server in terms of I/O.
I'm running 8.3.5, btw.  Here's vmstat 10 output while vacuuming a
db:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1  45104  27228  26876 1806996    0    0   132  3545 1009 4531 17  4 50 29
 0  1  45104  31680  26804 1803040    0    0   108  2144 1012 2187 15  3 39 44
 7  0  45104  27120  26916 1807800    0    0   103  4439 1057 5018 17  4 37 42
 2  1  45104  25732  26972 1808868    0    0    92  4174 1049 4439 16  4 37 43
 0  1  45104  32928  26988 1801116    0    0    97  3709 1029 3254 17  3 34 45

pgbench gives me numbers like this:
 0  1  45116  29920  26772 1810308    0    0   132  4540  932 2100 16  3 41 40
 1  2  45116  31956  26732 1807844    0    0   380  8864  897 1824 12  4 32 52

You could replace that machine with something much faster for the
price of an upgrade or two on the Dell.  I spent a lot of effort in
the last year trying to increase the performance of our Dell 1950s and
finally gave up and just replaced them as db servers.  One decent-sized
dual quad-core Opteron can run the same load as 5 other database
servers while running under a much lower load and with better
performance than any of the 5 dbs it replaced.

We had bought a 1950 last year with an MD-1000 before I started, and
it had cost about $12k in total for 7 SAS drives, controller, and the
original 1950.  The new db server to replace it cost $500 less, had 16
SAS drives, a much faster Areca 1680 RAID controller, and 8 Opteron
cores.  It is literally 10 to 40 times faster depending on the
workload.

The 1850s we have are half as fast as our 1950s as db servers, so I'm
pretty sure you could run faster with only a handful of SAS drives or
Intel X25s.

Re: Slow Vacuum was: vacuum output question

From
"Grzegorz Jaśkiewicz"
Date:
Well, I can confirm that the problem is caused by indices here as well; I
reindex twice a month because of that (and because of general
performance problems caused by large indexes).  Maybe it is time to
focus on index space reuse - HOT is already there, and autovacuum does
the job too.

Re: Slow Vacuum was: vacuum output question

From
"Dan Armbrust"
Date:
> Well, your throughput on this machine is horrible.  It looks like with
> 8.1 all your time is sys + cpu for your cpus, while with 8.3 you've
> got more idle and more i/o wait, which tells me that 8.3 is smarter
> about vacuuming, so it's spending less time working the cpus and more
> time waiting for the i/o subsystem.
>
> Either way, getting only 2 or so megs a second of writes is pretty bad.  I
> can get those numbers from a laptop.  An older laptop, like a single-core
> 1.6GHz Pentium M based T42 or something.  My laptop, which is new
> from last year, is about twice as fast as your server in terms of I/O.

This is my problem in a nutshell.  As of yet, I have no rational
explanation for this performance.

The servers in question are not slow.  This particular server never
shows this problem when running a newer OS - but I have not yet
finished isolating which OS's have problems on this hardware.  No
other software on the system exhibits any sort of IO issue other than
PostgreSQL.

I have customers with $20K servers that can't handle the workload that
I can handle on an old cruddy laptop.  However, much of the appeal of
our product is low per-site installation costs, so expensive servers
don't fit into the mix.

Random futzing - reindexing, manual full vacuums, rebooting the server
- each of these has cleared the error condition on one site or
another, and allowed the system to go back to functioning the way it
should for months, until the problem randomly crops up again.

I'm still looking into it, but, at the same time, we have enough
workarounds to the issue now (scheduled reindex, install a newer OS,
upgrade to Postgres 8.3)  that this is becoming a low priority
mystery, rather than the high priority one it has been.

Thanks for your thoughts,

Dan