Thread: stats collector suddenly causing lots of IO

From
Chris
Date:
I have a lot of centos servers which are running postgres.  Postgres isn't used
that heavily on any of them, but lately, the stats collector process keeps
causing tons of IO load.  It seems to happen only on servers with centos 5.
The versions of postgres that are running are:

8.1.18
8.2.6
8.3.1
8.3.5
8.3.6
8.3.7
8.3.8
8.3.9
8.4.2
8.4.3

I've tried turning off everything under RUNTIME STATISTICS in postgresql.conf
except track_counts (since autovacuum says it needs it), but it seems to have
little effect on the IO caused by the stats collector.

Has anyone else noticed this?  Have there been recent kernel changes
that could cause this that anyone knows about?  Since we haven't touched
postgres on these boxes since they were setup initially, I'm a bit baffled as
to what might be causing the problem, and why I can't make it go away short of
kill -STOP.

Any suggestions would be much appreciated!


Re: stats collector suddenly causing lots of IO

From
Cédric Villemain
Date:
2010/4/13 Chris <lists@deksai.com>:
> I have a lot of centos servers which are running postgres.  Postgres isn't used
> that heavily on any of them, but lately, the stats collector process keeps
> causing tons of IO load.  It seems to happen only on servers with centos 5.
> The versions of postgres that are running are:
>
> 8.1.18
> 8.2.6
> 8.3.1
> 8.3.5
> 8.3.6
> 8.3.7
> 8.3.8
> 8.3.9
> 8.4.2
> 8.4.3
>
> I've tried turning off everything under RUNTIME STATISTICS in postgresql.conf
> except track_counts (since autovacuum says it needs it), but it seems to have
> little effect on the IO caused by the stats collector.
>
> Has anyone else noticed this?  Have there been recent kernel changes
> that could cause this that anyone knows about?  Since we haven't touched
> postgres on these boxes since they were setup initially, I'm a bit baffled as
> to what might be causing the problem, and why I can't make it go away short of
> kill -STOP.
>
> Any suggestions would be much appreciated!

The stats file is written to disk every 500ms (the interval can be changed
when building Postgres), but this was improved in 8.4: the file should now
be written only when needed.

In 8.4 you can change the directory the stats file is written to with the
config param stats_temp_directory. Perhaps run a test and change the
filesystem (you might want to try a ramdisk, and another fs - ext3, XFS,
ext4, depending on your kernel) and see if that changes anything in your
IO load.
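For instance, a minimal sketch of that change (the mount point below is
illustrative, not a recommendation; any RAM-backed directory writable by the
postgres user works, and the setting only exists in 8.4+):

```
# /etc/fstab: a small tmpfs mount for the stats temp file (path is made up)
# tmpfs  /var/run/pgstats  tmpfs  size=64M,uid=postgres,gid=postgres  0 0

# postgresql.conf (8.4+): point the stats temp file at the tmpfs
stats_temp_directory = '/var/run/pgstats'
```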

Anyway, it looks like this is CentOS 5 related, so what is your current
running kernel? (And what FS?)

>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>



--
Cédric Villemain

Re: stats collector suddenly causing lots of IO

From
Alvaro Herrera
Date:
Chris wrote:
> I have a lot of centos servers which are running postgres.  Postgres isn't used
> that heavily on any of them, but lately, the stats collector process keeps
> causing tons of IO load.  It seems to happen only on servers with centos 5.

Does this correlate to an increase in size of the pgstat.stat file?
Maybe you could try resetting stats, so that the file goes back to an
initial size and is slowly repopulated.  I'd suggest monitoring the size
of the stats file, just in case there's something abnormal with it.
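A sketch of both steps (pg_stat_reset() is standard; the path shown is the
pre-8.4 location of the file, relative to $PGDATA, and monitoring it requires
superuser):

```sql
-- reset collected statistics for the current database,
-- shrinking the stats file back toward its initial size
SELECT pg_stat_reset();

-- then check the file size periodically to see whether it regrows
SELECT size FROM pg_stat_file('global/pgstat.stat');
```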

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Chris <lists@deksai.com> writes:
> I have a lot of centos servers which are running postgres.  Postgres isn't used
> that heavily on any of them, but lately, the stats collector process keeps
> causing tons of IO load.  It seems to happen only on servers with centos 5.
> The versions of postgres that are running are:

> 8.1.18
> 8.2.6
> 8.3.1
> 8.3.5
> 8.3.6
> 8.3.7
> 8.3.8
> 8.3.9
> 8.4.2
> 8.4.3

Do these different server versions really all show the problem to the
same extent?  I'd expect 8.4.x in particular to be cheaper than the
older branches.  Are their pgstat.stat files all of similar sizes?
(Note that 8.4.x keeps pgstat.stat under $PGDATA/pg_stat_tmp/ whereas
in earlier versions it was under $PGDATA/global/.)

If your applications create/use/drop a lot of tables (perhaps temp
tables) then bloat of the pgstat.stat file is to be expected, but
it should get cleaned up by vacuum (including autovacuum).  What is
your vacuuming policy on these servers ... do you use autovacuum?
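A quick way to check that, assuming nothing fancier than the stock catalog
views (query sketch):

```sql
-- is autovacuum enabled at all, and are counts being tracked for it?
SELECT name, setting FROM pg_settings
WHERE name IN ('autovacuum', 'track_counts');

-- when did autovacuum last touch anything in this database?
SELECT max(last_autovacuum) AS newest_autovacuum
FROM pg_stat_all_tables;
```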

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Thu, Apr 15, 2010 at 6:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Chris <lists@deksai.com> writes:
>> I have a lot of centos servers which are running postgres.  Postgres isn't used
>> that heavily on any of them, but lately, the stats collector process keeps
>> causing tons of IO load.  It seems to happen only on servers with centos 5.
>
> Say, I just realized that both of you are complaining about stats
> collector overhead on centos 5 servers.  I hadn't been thinking in terms
> of OS-specific causes, but maybe that is what we need to consider.
> Can you tell me the exact kernel versions you are seeing these problems
> with?

uname -a says "... 2.6.18-92.1.13.el5 #1 SMP ... x86_64", and it's CentOS 5.2.

I'm not sure whether this is related to the stats collector problems
on this machine, but I noticed alarming table bloat in the catalog
tables pg_attribute, pg_attrdef, pg_depend, and pg_type. Perhaps this
has happened slowly over the past few months, but I discovered the
bloat when I ran the query from:
http://pgsql.tapoueh.org/site/html/news/20080131.bloat.html

on the most-active database on this server (OID 16389 from the
pgstat.stat I sent in). See attached table_bloat.txt. The autovacuum
settings for this server haven't been tweaked from the default; they
probably should have been, given the heavy bulk updates/inserts done.
Maybe there's another cause for this extreme catalog bloat, besides
the weak autovacuum settings, though.

Table sizes, according to pg_size_pretty(pg_total_relation_size(...)):
 * pg_attribute: 145 GB
 * pg_attrdef: 85 GB
 * pg_depend: 38 GB
 * pg_type: 3465 MB

I'll try to send in strace outputs later today.

Josh

Attachment

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> I'm not sure whether this is related to the stats collector problems
> on this machine, but I noticed alarming table bloat in the catalog
> tables pg_attribute, pg_attrdef, pg_depend, and pg_type.

Hmm.  That makes me wonder if autovacuum is functioning properly at all.
What does pg_stat_all_tables show for the last vacuum and analyze times
of those tables?  Try something like

select relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
from pg_stat_all_tables
where schemaname = 'pg_catalog' order by 1;

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Chris <lists@deksai.com> writes:
> After the file was made larger and I stopped the vacuum process, I started
> seeing the problem.  All other postgres processes were quiet, but the stats
> collector was constantly causing anywhere from 20-60% of the IO on the server.
> Since all the other postgres processes weren't really doing anything, and it is
> a busy web server which is predominately MySQL, I'm fairly curious as to what
> it is doing.

Yeah, the stats collector rewrites the stats file every half second, if
there have been any changes since last time --- so the bigger the file,
the more overhead.  (8.4 is smarter about this, but that doesn't help
you on 8.3.)

> I straced the stats collector process.  I wasn't sure what else to trace as
> there wasn't a single other postgres process doing anything.

That strace doesn't really prove much; it's what I'd expect.  Here's
what to do: start a PG session, and strace that session's backend *and*
the stats collector while you manually do VACUUM some-small-table.
The VACUUM command should try to send some messages to the stats collector
process.  I'm wondering if those get dropped somehow.

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Fri, Apr 16, 2010 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> I'm not sure whether this is related to the stats collector problems
>> on this machine, but I noticed alarming table bloat in the catalog
>> tables pg_attribute, pg_attrdef, pg_depend, and pg_type.
>
> Hmm.  That makes me wonder if autovacuum is functioning properly at all.
> What does pg_stat_all_tables show for the last vacuum and analyze times
> of those tables?  Try something like
>
> select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze
> from pg_stat_all_tables where schemaname = 'pg_catalog' order by 1;
>

Output attached. Note that I ran pg_stat_reset() a few days ago.
Josh

Attachment

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> On Fri, Apr 16, 2010 at 11:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm.  That makes me wonder if autovacuum is functioning properly at all.
>> What does pg_stat_all_tables show for the last vacuum and analyze times
>> of those tables?  Try something like
>>
>> select relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze
>> from pg_stat_all_tables where schemaname = 'pg_catalog' order by 1;

> Output attached. Note that I ran pg_stat_reset() a few days ago.

Wow.  Well, we have a smoking gun here: for some reason, autovacuum
isn't running, or isn't doing its job if it is.  If it's not running
at all, that would explain failure to prune the stats collector's file
too.

Is there anything in the postmaster log that would suggest autovac
difficulties?

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> I made a small half-empty table like this:
>   CREATE TABLE test_vacuum (i  int  PRIMARY KEY);
>   INSERT INTO test_vacuum (i) SELECT a FROM generate_series(1,500000) AS a;
>   DELETE FROM test_vacuum WHERE RANDOM() < 0.5;

> and then ran:
>   VACUUM test_vacuum;

> while an strace of the stats collector process was running. Then after
> a few seconds, found the PID of the VACUUM process, and ran strace on
> it. I killed them after the VACUUM finished. Outputs attached.

Huh.  The VACUUM strace clearly shows a boatload of TABPURGE messages
being sent:

sendto(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0\324\206<\24\321uC\24\320\350)\24\225\345,\24"..., 1000, 0, NULL, 0) = 1000
sendto(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0C\274?\24\365\323?\24\241N@\24\217\0309\24"..., 1000, 0, NULL, 0) = 1000
sendto(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0\375Z2\24\211\f@\0241\3047\24\357mH\24"..., 1000, 0, NULL, 0) = 1000
sendto(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0\242\3529\24\234K\'\24\17\227)\24\300\22+\24"..., 1000, 0, NULL, 0) = 1000

and the stats collector is receiving them:

recvfrom(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0\324\206<\24\321uC\24\320\350)\24\225\345,\24"..., 1000, 0, NULL, NULL) = 1000
recvfrom(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0C\274?\24\365\323?\24\241N@\24\217\0309\24"..., 1000, 0, NULL, NULL) = 1000
recvfrom(7, "\2\0\0\0\350\3\0\0\5@\0\0\366\0\0\0\375Z2\24\211\f@\0241\3047\24\357mH\24"..., 1000, 0, NULL, NULL) = 1000

So this *should* have resulted in the stats file shrinking.  Did you
happen to notice if it did, after you did this?

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Fri, Apr 16, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Wow.  Well, we have a smoking gun here: for some reason, autovacuum
> isn't running, or isn't doing its job if it is.  If it's not running
> at all, that would explain failure to prune the stats collector's file
> too.

Hrm, well autovacuum is at least trying to do work: it's currently
stuck on those bloated pg_catalog tables, of course. Another developer
killed an autovacuum of pg_attribute (or maybe it was pg_attrdef)
after it had been running for two weeks. See current pg_stat_activity
output attached, which shows the three autovacuum workers running plus
two manual VACUUM ANALYZEs I started yesterday.

> Is there anything in the postmaster log that would suggest autovac
> difficulties?

Yup, there are logs from April 1st which I just grepped through. I
attached the redacted output, and I see a few warnings about "[table]
contains more than "max_fsm_pages" pages with useful free space", as
well as "ERROR:  canceling autovacuum task".

Perhaps bumping up max_fsm_pages and making autovacuum settings more
aggressive will help me? I was also planning to run a CLUSTER of those
four bloated pg_catalog tables -- is this safe, particularly for
tables like pg_attrdef which rely on OIDs?

Josh

Attachment

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
I wrote:
> So this *should* have resulted in the stats file shrinking.  Did you
> happen to notice if it did, after you did this?

Oh, never mind that --- I can see that it did shrink, just from counting
the write() calls in the collector's strace.  So what we have here is a
demonstration that the tabpurge mechanism does work for you, when it's
invoked.  Which is further evidence that for some reason autovacuum is
not running for you.

What I'd suggest at this point is cranking up log_min_messages to DEBUG2
or so in postgresql.conf, restarting the postmaster, and keeping an eye
on the log to see if you can spot anything about why autovac isn't
working.

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> On Fri, Apr 16, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Wow.  Well, we have a smoking gun here: for some reason, autovacuum
>> isn't running, or isn't doing its job if it is.  If it's not running
>> at all, that would explain failure to prune the stats collector's file
>> too.

> Hrm, well autovacuum is at least trying to do work: it's currently
> stuck on those bloated pg_catalog tables, of course. Another developer
> killed an autovacuum of pg_attribute (or maybe it was pg_attrdef)
> after it had been running for two weeks. See current pg_stat_activity
> output attached, which shows the three autovacuum workers running plus
> two manual VACUUM ANALYZEs I started yesterday.

Two weeks?  What have you got the autovacuum cost delays set to?

Once you're up to three AV workers, no new ones can get launched until
one of those finishes or is killed.  So that would explain failure to
prune the stats collector's tables (the tabpurge code is only run during
AV worker launch).  So what we need to figure out is why it's taking so
obscenely long to vacuum these tables ...

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Scott Carey
Date:
On Apr 16, 2010, at 9:48 AM, Tom Lane wrote:

> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> On Fri, Apr 16, 2010 at 11:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Wow.  Well, we have a smoking gun here: for some reason, autovacuum
>>> isn't running, or isn't doing its job if it is.  If it's not running
>>> at all, that would explain failure to prune the stats collector's file
>>> too.
>
>> Hrm, well autovacuum is at least trying to do work: it's currently
>> stuck on those bloated pg_catalog tables, of course. Another developer
>> killed an autovacuum of pg_attribute (or maybe it was pg_attrdef)
>> after it had been running for two weeks. See current pg_stat_activity
>> output attached, which shows the three autovacuum workers running plus
>> two manual VACUUM ANALYZEs I started yesterday.
>
> Two weeks?  What have you got the autovacuum cost delays set to?
>
> Once you're up to three AV workers, no new ones can get launched until
> one of those finishes or is killed.  So that would explain failure to
> prune the stats collector's tables (the tabpurge code is only run during
> AV worker launch).  So what we need to figure out is why it's taking so
> obscenely long to vacuum these tables ...
>

On any large system with good I/O I have had to significantly increase the aggressiveness of autovacuum.
Even with the below settings, it doesn't interfere with other activity (~2200 iops random, ~900 MB/sec sequential-capable I/O).

My relevant autovacuum parameters are (from 'show *'):
 autovacuum                      | on        | Starts the autovacuum subprocess.
 autovacuum_analyze_scale_factor | 0.1       | Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples.
 autovacuum_analyze_threshold    | 50        | Minimum number of tuple inserts, updates or deletes prior to analyze.
 autovacuum_freeze_max_age       | 200000000 | Age at which to autovacuum a table to prevent transaction ID wraparound.
 autovacuum_max_workers          | 3         | Sets the maximum number of simultaneously running autovacuum worker processes.
 autovacuum_naptime              | 1min      | Time to sleep between autovacuum runs.
 autovacuum_vacuum_cost_delay    | 20ms      | Vacuum cost delay in milliseconds, for autovacuum.
 autovacuum_vacuum_cost_limit    | 2000      | Vacuum cost amount available before napping, for autovacuum.
 autovacuum_vacuum_scale_factor  | 0.2       | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.
 autovacuum_vacuum_threshold     | 50




For what it is worth, I just went onto one of my systems -- one with lots of partition tables and temp table creation/destruction -- and looked at the system tables in question there.

Postgres 8.4, using \dt+  (trimmed result below to interesting tables)

   Schema   |          Name           | Type  |  Owner   |    Size    | Description
------------+-------------------------+-------+----------+------------+-------------
 pg_catalog | pg_attrdef              | table | postgres | 195 MB     |
 pg_catalog | pg_attribute            | table | postgres | 1447 MB    |
 pg_catalog | pg_class                | table | postgres | 1694 MB    |
 pg_catalog | pg_constraint           | table | postgres | 118 MB     |
 pg_catalog | pg_depend               | table | postgres | 195 MB     |
 pg_catalog | pg_statistic            | table | postgres | 2300 MB    |
 pg_catalog | pg_type                 | table | postgres | 181 MB     |


So, I did a VACUUM FULL / REINDEX TABLE / ANALYZE sequence on each of these.  I wish I could just CLUSTER them, but the above works.

now the tables are:
   Schema   |          Name           | Type  |  Owner   |    Size    | Description
------------+-------------------------+-------+----------+------------+-------------
 pg_catalog | pg_attrdef              | table | postgres | 44 MB      |
 pg_catalog | pg_attribute            | table | postgres | 364 MB     |
 pg_catalog | pg_class                | table | postgres | 1694 MB    |
 pg_catalog | pg_constraint           | table | postgres | 118 MB     |
 pg_catalog | pg_depend               | table | postgres | 195 MB     |
 pg_catalog | pg_statistic            | table | postgres | 656 MB     |
 pg_catalog | pg_type                 | table | postgres | 45 MB      |


I've learned to accept about 50% bloat (2x the compacted size) in postgres as just the way it usually is on a busy table, but the 3x and 4x bloat of statistic, attrdef, and attribute have me wondering.

I have had some 'idle in transaction' connections hanging out from time to time that have caused issues on this machine that could explain the above perma-bloat.  That is one thing that could affect the case reported here as well.  The worst thing about those is you can't even force kill those connections from within postgres (pg_cancel_backend doesn't work on them, and killing them via the OS bounces postgres ...) so you have to hunt down the offending client.


>             regards, tom lane
>


Re: stats collector suddenly causing lots of IO

From
Scott Carey
Date:
>
> I have had some 'idle in transaction' connections hanging out from time to time that have caused issues on this machine that could explain the above perma-bloat.  That is one thing that could affect the case reported here as well.  The worst thing about those is you can't even force kill those connections from within postgres (pg_cancel_backend doesn't work on them, and killing them via the OS bounces postgres ...) so you have to hunt down the offending client.
>

Ooh, I just noticed pg_terminate_backend() ... maybe this will let me kill annoying idle in transaction clients.  I guess this arrived in 8.4?  Hopefully this won't cause the whole thing to bounce and close all other backends....
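For the record, pg_terminate_backend() did arrive in 8.4, and it terminates
only the one backend without bouncing the postmaster. A sketch (the pid is
made up; 8.4-era pg_stat_activity column names):

```sql
-- find idle-in-transaction sessions
SELECT procpid, usename, current_query
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction';

-- then terminate a single offender by pid (pid here is illustrative)
SELECT pg_terminate_backend(12345);
```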


>
>>             regards, tom lane
>>


Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Fri, Apr 16, 2010 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> Hrm, well autovacuum is at least trying to do work: it's currently
>> stuck on those bloated pg_catalog tables, of course. Another developer
>> killed an autovacuum of pg_attribute (or maybe it was pg_attrdef)
>> after it had been running for two weeks. See current pg_stat_activity
>> output attached, which shows the three autovacuum workers running plus
>> two manual VACUUM ANALYZEs I started yesterday.
>
> Two weeks?  What have you got the autovacuum cost delays set to?

SELECT name, current_setting(name), source FROM pg_settings WHERE
source != 'default' AND name ILIKE '%vacuum%';
         name         | current_setting |       source
----------------------+-----------------+--------------------
 vacuum_cost_delay    | 200ms           | configuration file
 vacuum_cost_limit    | 100             | configuration file
 vacuum_cost_page_hit | 6               | configuration file
(3 rows)

I'm guessing these values and the default autovacuum configuration
values need to be cranked significantly to make vacuum much more
aggressive :-(

> Once you're up to three AV workers, no new ones can get launched until
> one of those finishes or is killed.  So that would explain failure to
> prune the stats collector's tables (the tabpurge code is only run during
> AV worker launch).  So what we need to figure out is why it's taking so
> obscenely long to vacuum these tables ...
>

Hopefully changing those three vacuum_cost_* params will speed up the
manual- and auto-vacuums. It'll take me a few days to see any results,
since I still need to do something about the bloat that's already there.

Josh

Re: stats collector suddenly causing lots of IO

From
Greg Smith
Date:
Josh Kupershmidt wrote:
> SELECT name, current_setting(name), source FROM pg_settings WHERE
> source != 'default' AND name ILIKE '%vacuum%';
>          name         | current_setting |       source
> ----------------------+-----------------+--------------------
>  vacuum_cost_delay    | 200ms           | configuration file
>  vacuum_cost_limit    | 100             | configuration file
>  vacuum_cost_page_hit | 6               | configuration file
>
>
> Hopefully changing those three vacuum_cost_* params will speed up the
> manual- and auto-vacuums..

Those only impact manual VACUUM statements.  There's a different set
with names like autovacuum_vacuum_cost_delay that control the daemon.
You can set those to "-1" in order to match the regular VACUUM, but
that's not the default.

You really need to sort out the max_fsm_pages setting too, because until
that issue goes away these tables are unlikely to ever stop growing.
And, no, you can't use CLUSTER on the system tables to clean those up.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Fri, Apr 16, 2010 at 2:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Josh Kupershmidt wrote:
>>
>> SELECT name, current_setting(name), source FROM pg_settings WHERE
>> source != 'default' AND name ILIKE '%vacuum%';
>>         name         | current_setting |       source
>> ----------------------+-----------------+--------------------
>>  vacuum_cost_delay    | 200ms           | configuration file
>>  vacuum_cost_limit    | 100             | configuration file
>>  vacuum_cost_page_hit | 6               | configuration file
>>
>>  Hopefully changing those three vacuum_cost_* params will speed up the
>> manual- and auto-vacuums..
>
> Those only impact manual VACUUM statements.  There's a different set with
> names like autovacuum_vacuum_cost_delay that control the daemon.  You can
> set those to "-1" in order to match the regular VACUUM, but that's not the
> default.

It looks like I have the default autovacuum_vacuum_cost_limit = -1, which
means it's inheriting the vacuum_cost_limit of 100 I had set. I'll try
bumping vacuum_cost_limit up to 1000 or so.

> You really need to sort out the max_fsm_pages setting too, because until
> that issue goes away these tables are unlikely to ever stop growing.  And,
> no, you can't use CLUSTER on the system tables to clean those up.

I have max_fsm_pages = 524288 , but from the hints in the logfiles
this obviously needs to go up much higher. And it seems the only way
to compact the pg_catalog tables is VACUUM FULL + REINDEX on 8.3 -- I
had tried the CLUSTER on my 9.0 machine and wrongly assumed it would
work on 8.3, too.
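The 8.3-era sequence for one catalog table looks like this (a sketch;
pg_attribute stands in for each bloated table, and every step takes an
exclusive lock, so it needs a maintenance window):

```sql
VACUUM FULL pg_catalog.pg_attribute;
REINDEX TABLE pg_catalog.pg_attribute;
ANALYZE pg_catalog.pg_attribute;
```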

Josh

Re: stats collector suddenly causing lots of IO

From
Greg Smith
Date:
Josh Kupershmidt wrote:
> And it seems the only way
> to compact the pg_catalog tables is VACUUM FULL + REINDEX on 8.3 -- I
> had tried the CLUSTER on my 9.0 machine and wrongly assumed it would
> work on 8.3, too.
>

Right; that just got implemented a couple of months ago.  See the news
from http://www.postgresql.org/community/weeklynews/pwn20100214 for a
summary of how the code was gyrated around to support that.  This is a
tough situation to get out of in <9.0 because VACUUM FULL is slow and
takes an exclusive lock on the table.  That tends to lead toward an
unpredictable window for required downtime, which is never good.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: stats collector suddenly causing lots of IO

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
>         name         | current_setting |       source
> ----------------------+-----------------+--------------------
>  vacuum_cost_delay    | 200ms           | configuration file
>  vacuum_cost_limit    | 100             | configuration file
>  vacuum_cost_page_hit | 6               | configuration file
>
> It looks like the default which I have of autovacuum_vacuum_cost_limit
> = -1, which means it's inheriting the vacuum_cost_limit of 100 I had
> set. I'll try bumping vacuum_cost_limit up to 1000 or so.

Actually I think the main problem is that cost_delay value, which is
probably an order of magnitude too high.  The way to limit vacuum's
I/O impact on other stuff is to make it take frequent short delays,
not have it run full speed and then sleep a long time.  In any case,
your current settings have got it sleeping way too much.  Two WEEKS !!!??
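In postgresql.conf terms that suggestion is roughly (values illustrative,
not a tuned recommendation):

```
vacuum_cost_delay = 20ms    # short, frequent naps instead of 200ms
vacuum_cost_limit = 200     # back toward the default, instead of 100
```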

            regards, tom lane

Re: stats collector suddenly causing lots of IO

From
Josh Kupershmidt
Date:
On Fri, Apr 16, 2010 at 3:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>>         name         | current_setting |       source
>> ----------------------+-----------------+--------------------
>>  vacuum_cost_delay    | 200ms           | configuration file
>>  vacuum_cost_limit    | 100             | configuration file
>>  vacuum_cost_page_hit | 6               | configuration file
>>
>> It looks like the default which I have of autovacuum_vacuum_cost_limit
>> = -1, which means it's inheriting the vacuum_cost_limit of 100 I had
>> set. I'll try bumping vacuum_cost_limit up to 1000 or so.
>
> Actually I think the main problem is that cost_delay value, which is
> probably an order of magnitude too high.  The way to limit vacuum's
> I/O impact on other stuff is to make it take frequent short delays,
> not have it run full speed and then sleep a long time.  In any case,
> your current settings have got it sleeping way too much.  Two WEEKS !!!??

Yup, I was going to turn vacuum_cost_delay down to 20. The two weeks
was for the pg_catalog table which has bloated to 145 GB, I think. One
of those manual VACUUMs I kicked off just finished, after 48 hours --
and that table was only 25 GB or so. I wasn't the one who set up this
postgresql.conf, but I am stuck fixing things :/