Thread: Problem after VACUUM ANALYZE

From: mljv@planwerk6.de
Hi all,

our PostgreSQL DB was running fine for a long time, but suddenly we
encountered a huge problem which we have fixed only temporarily.

We are running debian stable with postgresql 8.1.11. Our app is connecting via
JDBC and uses Prepared Statements.

We are not running autovacuum but a nightly VACUUM ANALYZE at 3:00 am
(without 'full' option of course).
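
For reference, the nightly run is just a scheduled job; a sketch of such a
crontab entry (the path and options here are assumptions, not our exact setup):

```
# hypothetical crontab entry: nightly VACUUM ANALYZE of all databases at 3:00 am
0 3 * * * postgres /usr/bin/vacuumdb --all --analyze --quiet
```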

Our largest table has 80,000,000 records and a relation size of 4.4 GB.
The indexes of this table take 5.5 GB (total_relation_size = 10GB). This table
is queried very often. The second largest table is not even 20% of the
largest table. Other tables range from 10KB to 300MB (total_relation_size).

The server is a dedicated DB server with a dual-core CPU (AMD Athlon(tm) 64 X2),
8 GB of RAM, and a 3ware RAID-1 with SATA hard disks.

Everything ran fine. The Vacuum process took quite a long time each night
(about 60 minutes) but we didn't care as it was a very low traffic time.

Suddenly one morning the database was still running but it was VERY slow. Lots
of SELECT queries were hanging around waiting for an answer.

The VACUUM process was already done at this moment.

Our first try was to restart the database. After the restart the problem
occurred again. The vacuum process had already finished BEFORE the restart and
was not run again; we just restarted the database. Next we tried a REINDEX
while disconnecting all clients, but it didn't help either.

Next we disconnected all clients again (servlet containers), restarted the
database, and ran "VACUUM ANALYZE" as the only query, with no other concurrent
queries. It took an hour. Afterwards we brought everything back up and it ran
smoothly.

We thought it might have been just some kind of weird circumstance, so we
left everything else as it was.

The next night the vacuum process ran again, and afterwards the database slowed
down immediately; the same behaviour as the night before. We stopped the
database, disconnected all clients and ran VACUUM ANALYZE without concurrent
queries. After restarting, everything was fine again.

We looked at our CPU monitoring and saw that we have huge IOwait while VACUUM
is running, which is not unusual. But just after VACUUM finished, the
user CPU load rose to 200% (dual CPU). At this time in the early morning
there were only very few users using our app. We analyzed the situation
further and saw that most queries had a very long duration (minutes
instead of milliseconds). BTW: the vacuum process was not running at this
moment; it was logged as done before.

As we cannot afford a downtime every morning, we disabled the nightly VACUUM
process. The next morning everything was fine, and has been ever since (that
was 10 days ago).

Of course, not vacuuming at all is not a good idea.

I don't have any clue what is happening. My thoughts about this are as follows
(might be totally wrong):

* Maybe the VACUUM ANALYZE process has not enough memory and therefore could
not ANALYZE the tables correctly. It then writes wrong statistics to the
database, which results in wrong execution plans using sequential scans instead
of index scans. This only happens if the VACUUM ANALYZE process runs
concurrently with user requests. If it runs on its own, the vacuum process
has enough memory and writes correct statistics to the database.

Here are some of our configuration parameters. We never really tweaked it as
it ran fine. We just raised some parameters. The following list should show
all parameters changed from the default:

$ cat /etc/postgresql/8.1/main/postgresql.conf | grep -v '^[[:space:]]*#' |
grep -v '^$'
listen_addresses = '*'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
shared_buffers = 30000
work_mem = 10240
maintenance_work_mem = 163840
vacuum_mem = 32000
max_fsm_pages = 500000
bgwriter_lru_percent = 10.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 5
bgwriter_all_maxpages = 200
wal_buffers = 16
checkpoint_segments = 10
checkpoint_warning = 3600
effective_cache_size = 180000
random_page_cost = 3
log_min_messages = info
log_min_error_statement = warning
log_min_duration_statement = 500
log_line_prefix = '%t [%p]'
stats_command_string = off
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = on
autovacuum = off
autovacuum_naptime = 3000
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'

I already tried to google my problem and came across some answers which seemed
to fit, but they were all slightly different. Of course, updating to 8.3 or
restoring the whole database might help, but I really want to understand what I
did wrong, to improve my skills.

If any further information is needed, please ask. I hope my information was
not too verbose and somebody can help me with my problem. Your help is very
much appreciated.

kind regards,
janning


Re: Problem after VACUUM ANALYZE

From: Richard Huxton
mljv@planwerk6.de wrote:
> We looked in our cpu monitoring and saw that we have huge IOwait while VACUUM
> is running, not unusual though. But just after VACUUM was finished, the
> userCPU load raised to 200% (dual cpu). At this time in the early morning
> there were only very few users using our app. We analyzed the situation
> further and saw that most queries had a very long duration time (minutes
> instead of milliseconds). BTW: the vacuum process was not running at this
> moment. It was logged before as done.

But WHAT was using all of your cpu? Was it PostgreSQL, and if so just
one backend? If something else was using all your cpu, then it might
just be that the PostgreSQL server wasn't getting a chance to run your
queries.

> * Maybe the Vacuum analyze process has not enough memory and therefore could
> not ANALYZE the tables correctly. It then writes wrong statistics to the
> database which results in wrong execution plans using sequence scans instead
> of index scans. This only happens if the vacuum analyze process runs
> concurrently with user requests. If it runs on its own, the vacuum process
> has enough memory and writes correct statistics to the database.

Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
into swap you'd have noticed the disk I/O.

> Here are some of our configuration parameters. We never really tweaked it as
> it ran fine. We just raised some parameters. The following list should show
> all parameters changed from the default:

> max_connections = 300
> shared_buffers = 30000
> work_mem = 10240

OK, so that's 30,000 * 8KB = 240MB of shared_buffers
You have 10MB of work_mem and if all 300 connections were using that
much you'd have committed 3GB of your RAM for that. Of course they'll
want more than just that.

Do you really have 300 concurrent connections?
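
One way to check, as a sketch: count the backends while the problem is
happening.

```sql
-- number of connected backends right now
SELECT count(*) FROM pg_stat_activity;
```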

> maintenance_work_mem = 163840

160MB for vacuums - should be OK given how much memory you have and the
fact that it's quiet when you vacuum.

> vacuum_mem = 32000

This is what maintenance_work_mem used to be called. You can delete this entry.

> max_fsm_pages = 500000

You can track at most 500,000 pages with free space on them. In 8.2+
versions VACUUM VERBOSE will show you how many are currently being used;
not sure about 8.1.

> bgwriter_lru_percent = 10.0
> bgwriter_lru_maxpages = 100
> bgwriter_all_percent = 5
> bgwriter_all_maxpages = 200
> wal_buffers = 16

> checkpoint_segments = 10

If you have bursts of write activity you might want to increase this.

> checkpoint_warning = 3600

> effective_cache_size = 180000

That's 180,000 * 8KB = 180 * 8MB = 1.4GB
If that's really all you're using as cache, I'd reduce the number of
concurrent connections. Check free/top and see how much RAM is really
being used as disk cache.

> random_page_cost = 3

Might be too high - you don't mention what disks you have.

> stats_command_string = off

If you turn this one on, you'll be able to see the queries each backend
is executing as they happen. Might be useful, but does have some cost.
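
With it on, a query along these lines would show what each backend is doing.
This is just a sketch using the 8.1-era column names:

```sql
-- procpid and current_query are the pre-9.2 column names;
-- idle backends report the literal string '<IDLE>'
SELECT procpid, usename, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';
```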


The crucial thing is to find out exactly what is happening when things
get very slow. Check vmstat and top, look in the pg_locks system-table
and if needs be we can see what strace says a particular backend is doing.
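
For the pg_locks check, a minimal starting point might be (a sketch):

```sql
-- any ungranted lock request means that backend is waiting on someone else
SELECT pid, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE NOT granted;
```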

--
   Richard Huxton
   Archonet Ltd

Re: Problem after VACUUM ANALYZE

From: mljv@planwerk6.de
Hi Richard,

thanks for your immediate response. I will answer your questions below:

On Tuesday, April 8, 2008 17:40, Richard Huxton wrote:
> mljv@planwerk6.de wrote:
> > We looked in our cpu monitoring and saw that we have huge IOwait while
> > VACUUM is running, not unusual though. But just after VACUUM was
> > finished, the userCPU load raised to 200% (dual cpu). At this time in the
> > early morning there were only very few users using our app. We analyzed
> > the situation further and saw that most queries had a very long duration
> > time (minutes instead of milliseconds). BTW: the vacuum process was not
> > running at this moment. It was logged before as done.
>
> But WHAT was using all of your cpu? Was it PostgreSQL, and if so just
> one backend? If something else was using all your cpu, then it might
> just be that the PostgreSQL server wasn't getting a chance to run your
> queries.

The CPU was used by PostgreSQL. As I saw many SELECTs when looking at "ps ax",
I guess that many backends were using all of the CPU.

> > * Maybe the Vacuum analyze process has not enough memory and therefore
> > could not ANALYZE the tables correctly. It then writes wrong statistics
> > to the database which results in wrong execution plans using sequence
> > scans instead of index scans. This only happens if the vacuum analyze
> > process runs concurrently with user requests. If it runs on its own, the
> > vacuum process has enough memory and writes correct statistics to the
> > database.
>
> Doesn't sound likely to me. You've got 8GB of RAM, and if you were going
> into swap you'd have noticed the disk I/O.

That makes sense. We didn't have any IOwait at that moment, but 200% user CPU.
But maybe vacuum can't use the RAM because of a misconfiguration.

> > Here are some of our configuration parameters. We never really tweaked it
> > as it ran fine. We just raised some parameters. The following list should
> > show all parameters changed from the default:
> >
> > max_connections = 300
> > shared_buffers = 30000
> > work_mem = 10240
>
> OK, so that's 30,000 * 8KB = 240MB of shared_buffers
> You have 10MB of work_mem and if all 300 connections were using that
> much you'd have committed 3GB of your RAM for that. Of course they'll
> want more than just that.
>
> Do you really have 300 concurrent connections?

At peaktime, yes.

> > maintenance_work_mem = 163840
>
> 160MB for vacuums - should be OK given how much memory you have and the
> fact that it's quiet when you vacuum.
>
> > vacuum_mem = 32000
>
> This is what maintenance_work_mem used to be called. You can delete this
>   entry.
>
> > max_fsm_pages = 500000
>
> You can track at most 500,000 pages with free space on them. In 8.2+
> versions VACUUM VERBOSE will show you how many are currently being used.
> Not sure about 8.1
>
> > bgwriter_lru_percent = 10.0
> > bgwriter_lru_maxpages = 100
> > bgwriter_all_percent = 5
> > bgwriter_all_maxpages = 200
> > wal_buffers = 16
> >
> > checkpoint_segments = 10
>
> If you have bursts of write activity you might want to increase this.
>
> > checkpoint_warning = 3600
> >
> > effective_cache_size = 180000
>
> That's 180,000 * 8KB = 180 * 8MB = 1.4GB
> If that's really all you're using as cache, I'd reduce the number of
> concurrent connections. Check free/top and see how much RAM is really
> being used as disk cache.
>
> > random_page_cost = 3
>
> Might be too high - you don't mention what disks you have.

3ware RAID-1 Controller with plain (cheap) SATA disks

> The crucial thing is to find out exactly what is happening when things
> get very slow. Check vmstat and top, look in the pg_locks system-table
> and if needs be we can see what strace says a particular backend is doing.

The problem is that we have peak times where everything is running fine. It has
something to do with the vacuum process running. To simplify my problem:

- If I run VACUUM ANALYZE concurrently with a few user queries: everything
slows down to a crawl after vacuum is finished.

- If I run it while no user is connected, everything runs fine afterwards.

It has something to do with the VACUUM ANALYZE process.

kind regards,
janning


Re: Problem after VACUUM ANALYZE

From: Alan Hodgson
On Tuesday 08 April 2008, mljv@planwerk6.de wrote:
> The problem is that we have peak times where everything is running fine. It
> has something to do with the vacuum process running. To simplify my
> problem:
>
> - I run vacuum analyze concurrently with some few user queries: slows
> down to a crawl after vacuum is finished.
>
> - if i run it while no user is connected, everything runs fine
> afterwards.
>
> It has something to do with the vacuum analyze process.

It's probably running the machine into swap. Check swap usage and disk I/O
when it happens.


--
Alan

Re: Problem after VACUUM ANALYZE

From: Scott Marlowe
It sounds to me like two possible problems, maybe combined.

One possibility is that you have a data distribution that results in
statistics being gathered that don't really represent your data.  Try
increasing the stats target for that column (or the whole db if you'd
rather) and re-running analyze.

The other possibility is that you've got some index bloat happening
and you might benefit from  reindexing the problematic table.
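
As a sketch, both suggestions look like this in SQL (the table and column names
here are borrowed from elsewhere in the thread, and 100 is just an example
statistics target):

```sql
-- raise the statistics target for the join column, then re-gather statistics
ALTER TABLE spieltipp ALTER COLUMN tippspiel_id SET STATISTICS 100;
ANALYZE spieltipp;

-- rebuild possibly bloated indexes (note: takes an exclusive lock on the table)
REINDEX TABLE spieltipp;
```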

Re: Problem after VACUUM ANALYZE

From: mljv@planwerk6.de
On Tuesday, April 8, 2008 18:38, Scott Marlowe wrote:
> It sounds to me like two possible problems, maybe combined.
>
> One possibility is that you have a data distribution that results in
> statistics being gathered that don't really represent your data.  Try
> increasing the stats target for that column (or the whole db if you'd
> rather) and re-running analyze.
>
> The other possibility is that you've got some index bloat happening
> and you might benefit from  reindexing the problematic table.

But if I do "VACUUM ANALYZE" without concurrent queries, everything runs fine
afterwards.

If I run "VACUUM ANALYZE" with a few concurrent queries, it slows down to a
crawl.

Could it be that something like this is happening:
- In the early morning a new DB connection is opened.
- While VACUUM ANALYZE is running, the planner uses different plans because
some index could not be used, or the statistics are at that very moment not
present because they are being updated. So the query gets a wrong plan: it uses
a seqscan instead of an index scan.
- This wrongly planned statement is prepared, so even after VACUUM ANALYZE is
done, the statement still uses the wrong plan.
- Rising load triggers many concurrent queries with wrong plans, so everything
slows down.

kind regards,
janning


Re: Problem after VACUUM ANALYZE

From: David Wilson
On Wed, Apr 9, 2008 at 3:29 AM,  <mljv@planwerk6.de> wrote:

>  But if i do "VACUUM ANALYZE" without concurrent queries, everything runs fine
>  afterwards.
>
>  If i run  "VACUUM ANALYZE" with few concurrent queries, it slows down to a
>  crawl.
>
>  Could it be that something like this is happening:
>  - In the early morning a new DB connection is opened.
>  - While running VACUUM ANALYZE the planner uses different plans because some
>  index could not be used or the statistics are right in that moment not
>  present because they are updated... So the query gets a wrong plan. It uses a
>  seqScan instead of an index scan.
>  - This wrongly planned statement is prepared so even after VACUUM ANALYZE is
>  done, the statement still uses the wrong plan.
>  - rising load triggers many concurrent queries with wrong plans. so everything
>  slows down.
>

I've run into a very similar problem. I have some long-running
processes that generate a large amount of data and then query that
data, and they must periodically drop their connections and reconnect to
ensure that query plans get re-planned according to updated
statistics. This is especially true when a new DB is first being
brought up with an initial set of data (~1 week of compute time, ~30GB
of data and ~120m rows) with processes that live through the entire
process.

My solution, as mentioned above, is to rebuild the connection approximately
every hour in my long-running processes. This is a tricky question,
because you want the benefit of not re-planning queries in 99.9% of
the cases. I'm not really opposed to the current system that requires
the reconnections, given that the overhead involved in them is
completely negligible in the grand scheme of things. There are
downsides to not replanning queries, and if you can't live with them
then you should either force re-planning at intervals or avoid
prepared statements.
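
Since the original poster connects via JDBC, one way to avoid server-side
prepared plans entirely is the driver's prepareThreshold connection parameter
(host and database name below are placeholders):

```
jdbc:postgresql://dbhost/mydb?prepareThreshold=0
```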

--
- David T. Wilson
Princeton Satellite Systems
david.t.wilson@gmail.com

Re: Problem after VACUUM ANALYZE

From: Pavan Deolasee
On Wed, Apr 9, 2008 at 1:41 PM, David Wilson <david.t.wilson@gmail.com> wrote:

>
>
>  I've run into a very similar problem. I have some long-running
>  processes that generate a large amount of data and then query that
>  data that must periodically drop their connections and rebuild to
>  ensure that query plans get re-planned according to updated
>  statistics. This is especially true when a new DB is first being
>  brought up with an initial set of data (~1 week of compute time, ~30gb
>  of data and ~120m rows) with processes that live through the entire
>  process.
>

I believe plan-invalidation in 8.3 should address this, shouldn't it?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

Re: Problem after VACUUM ANALYZE

From: mljv@planwerk6.de
On Wednesday, April 9, 2008 10:11, David Wilson wrote:
> On Wed, Apr 9, 2008 at 3:29 AM,  <mljv@planwerk6.de> wrote:
> >  But if i do "VACUUM ANALYZE" without concurrent queries, everything runs
> > fine afterwards.
> >
> >  If i run  "VACUUM ANALYZE" with few concurrent queries, it slows down to
> > a crawl.
> >
> >  Could it be that something like this is happening:
> >  - In the early morning a new DB connection is opened.
> >  - While running VACUUM ANALYZE the planner uses different plans because
> > some index could not be used or the statistics are right in that moment
> > not present because they are updated... So the query gets a wrong plan.
> > It uses a seqScan instead of an index scan.
> >  - This wrongly planned statement is prepared so even after VACUUM
> > ANALYZE is done, the statement still uses the wrong plan.
> >  - rising load triggers many concurrent queries with wrong plans. so
> > everything slows down.
>
> I've run into a very similar problem. I have some long-running
> processes that generate a large amount of data and then query that
> data that must periodically drop their connections and rebuild to
> ensure that query plans get re-planned according to updated
> statistics. This is especially true when a new DB is first being
> brought up with an initial set of data (~1 week of compute time, ~30gb
> of data and ~120m rows) with processes that live through the entire
> process.

This is not the case at my place. We are not producing lots of data, just a
little data with small INSERTs all the time. I don't think that the statistics
are really changing very often, as the tables are rather fixed and data is only
added.

> My solution, as mentioned above, is to rebuild the connection approx
> every hour on my long-running processes. This is a tricky question,
> because you want the benefit of not re-planning queries in 99.9% of
> the cases- I'm not really opposed to the current system that requires
> the reconnections, given that the overhead involved in them is
> completely negligible in the grand scheme of things. There are
> downsides to not replanning queries, and if you can't live with them
> then you should either force re-planning at intervals or avoid
> prepared statements.

In release 8.3:
http://www.postgresql.org/docs/8.3/interactive/release-8-3.html
"Automatically re-plan cached queries when table definitions change or
statistics are updated"

So no reconnection is necessary in 8.3 to replan prepared statements. Maybe it
helps you.

kind regards,
janning


Re: Problem after VACUUM ANALYZE

From: Gregory Stark
<mljv@planwerk6.de> writes:

> Could it be that something like this is happening:
> - In the early morning a new DB connection is opened.
> - While running VACUUM ANALYZE the planner uses different plans because some
> index could not be used or the statistics are right in that moment not
> present because they are updated... So the query gets a wrong plan. It uses a
> seqScan instead of an index scan.

This isn't supposed to happen. The old statistics remain visible until the new
ones become visible. Nothing is locked or unavailable while analyze is running.

However one of the things that can happen is the statistics for one table get
updated and until the statistics for another table are updated the planner has
skewed results. It has the new statistics for the first table but old
statistics for the second table.

Normally the solution to this is to run analyze more frequently so things
don't change too drastically from one set of statistics to another. In some
situations this isn't good enough, for example if you're truncating a bunch of
tables together.

You can also run analyze (but not vacuum analyze) inside a transaction so all
the stats go into effect together when you commit. This does have downsides
however. I think it will block a vacuum analyze which tries to update the same
statistics, for example.
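
A sketch of that approach, with placeholder table names:

```sql
-- both tables' new statistics become visible together at COMMIT
BEGIN;
ANALYZE table_a;
ANALYZE table_b;
COMMIT;
```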

Another thing to be aware of is that *empty* tables cause Postgres to default
to a heuristic of assuming 1,000 records. That can cause strange things to
happen if you truncate just some tables and run analyze on them.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Problem after VACUUM ANALYZE

From: mljv@planwerk6.de
Hi all, dear Richard,

your mail about my configuration parameters was the right hint, but I am still
struggling with the problem. I would appreciate it if you or somebody else
could help me even further.

After some investigation I got some new results on my problem. The following
query is not working as it should, and it is my most important query:

# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id=1254056;

- Spieltipp has about 80,000,000 records
- Tippspiel has about 10,000,000 records
- random_page_cost = 3

Both tables have indexes, of course. So there should be no seqscan in use, but
the planner is using a sequential scan:

# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=333.93..1647646.04 rows=1383 width=28) (actual
time=104193.150..104193.150 rows=0 loops=1)
   Hash Cond: ("outer".tippspiel_id = "inner".tippspiel_id)
   ->  Seq Scan on spieltipp  (cost=0.00..1253846.52 rows=78690352 width=16)
(actual time=10.355..69195.235 rows=78690348 loops=1)
   ->  Hash  (cost=333.44..333.44 rows=198 width=16) (actual
time=44.821..44.821 rows=9 loops=1)
         ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel
(cost=0.00..333.44 rows=198 width=16) (actual time=44.798..44.809 rows=9
loops=1)
               Index Cond: (tippspieltag_id = 817372)
 Total runtime: 104193.209 ms
(7 rows)

------------------------------------------------------------------------------------------------------------------------------------------------------

Just to see how wrong the plan is, I disabled seqscan:

# set enable_seqscan to off;
# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..1682807.57 rows=1383 width=28) (actual
time=0.186..0.186 rows=0 loops=1)
   ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel
(cost=0.00..333.44 rows=198 width=16) (actual time=0.072..0.082 rows=9
loops=1)
         Index Cond: (tippspieltag_id = 817372)
   ->  Index Scan using ix_spieltipp_tippspiel_id on spieltipp
(cost=0.00..8458.83 rows=3081 width=16) (actual time=0.010..0.010 rows=0
loops=9)
         Index Cond: (spieltipp.tippspiel_id = "outer".tippspiel_id)
 Total runtime: 0.232 ms
(6 rows)

----------------------------------------------------------------------------------------------------------------------------------------------------

Now I tried lowering random_page_cost:

# set enable_seqscan to on;
# set random_page_cost to 1.5;
# explain analyze
# select * from SpielTipp natural join Tippspiel
# where tippspieltag_id = 817372;

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..731643.62 rows=1383 width=28) (actual
time=0.089..0.089 rows=0 loops=1)
   ->  Index Scan using tippspiel_tippspieltag_id_key on tippspiel
(cost=0.00..146.17 rows=198 width=16) (actual time=0.017..0.024 rows=9
loops=1)
         Index Cond: (tippspieltag_id = 817372)
   ->  Index Scan using ix_spieltipp_tippspiel_id on spieltipp
(cost=0.00..3655.92 rows=3081 width=16) (actual time=0.005..0.005 rows=0
loops=9)
         Index Cond: (spieltipp.tippspiel_id = "outer".tippspiel_id)
 Total runtime: 0.135 ms
(6 rows)

----------------------------------------------------------------------------------------------------------------------------------------------------

This looks fine at first glance, but the total estimated cost of 731643 is
still far too high, right?

So what happened on my production server with a random_page_cost of 3: the
estimated costs of a seq scan and an index scan are not too different. So
sometimes it will use a sequential scan after a fresh ANALYZE and sometimes
not, as the statistics vary across ANALYZE runs. So I had the problem that the
plan changed by running the nightly ANALYZE on the database.

As I have 200-240 connections at peak time, after reading your advice and the
annotated postgresql.conf, my conclusion is:

max_connections = 250
shared_buffers = 200000 # 1.6 GB = 20% of avail. RAM
work_mem = 20000
maintenance_work_mem = 160000
effective_cache_size = 600000 # 4.8 GB = 60% of avail. RAM
random_page_cost = 2

Are those settings reasonable for my box?
my box is:
- dedicated
- AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
- 3ware RAID 1 Controller with two rather cheap SATA disks
- 8 GB RAM

kind regards
Janning

On Tuesday, April 8, 2008 17:40, Richard Huxton wrote:
> > Here are some of our configuration parameters. We never really tweaked it
> > as it ran fine. We just raised some parameters. The following list should
> > show all parameters changed from the default:
> >
> > max_connections = 300
> > shared_buffers = 30000
> > work_mem = 10240
>
> OK, so that's 30,000 * 8KB = 240MB of shared_buffers
> You have 10MB of work_mem and if all 300 connections were using that
> much you'd have committed 3GB of your RAM for that. Of course they'll
> want more than just that.
>
> Do you really have 300 concurrent connections?
>
> > maintenance_work_mem = 163840
>
> 160MB for vacuums - should be OK given how much memory you have and the
> fact that it's quiet when you vacuum.
>
> > max_fsm_pages = 500000
>
> You can track at most 500,000 pages with free space on them. In 8.2+
> versions VACUUM VERBOSE will show you how many are currently being used.
> Not sure about 8.1
>
> > bgwriter_lru_percent = 10.0
> > bgwriter_lru_maxpages = 100
> > bgwriter_all_percent = 5
> > bgwriter_all_maxpages = 200
> > wal_buffers = 16
> >
> > checkpoint_segments = 10
>
> If you have bursts of write activity you might want to increase this.
>
> > checkpoint_warning = 3600
> >
> > effective_cache_size = 180000
>
> That's 180,000 * 8KB = 180 * 8MB = 1.4GB
> If that's really all you're using as cache, I'd reduce the number of
> concurrent connections. Check free/top and see how much RAM is really
> being used as disk cache.
>
> > random_page_cost = 3
>
> Might be too high - you don't mention what disks you have.
>
> > stats_command_string = off
>
> If you turn this one on, you'll be able to see the queries each backend
> is executing as they happen. Might be useful, but does have some cost.
>
>
> The crucial thing is to find out exactly what is happening when things
> get very slow. Check vmstat and top, look in the pg_locks system-table
> and if needs be we can see what strace says a particular backend is doing.
>
> --
>    Richard Huxton
>    Archonet Ltd