Problem after VACUUM ANALYZE - Mailing list pgsql-general

From mljv@planwerk6.de
Subject Problem after VACUUM ANALYZE
Date
Msg-id 200804081657.26402.mljv@planwerk6.de
Responses Re: Problem after VACUUM ANALYZE  (Richard Huxton <dev@archonet.com>)
Re: Problem after VACUUM ANALYZE  ("Scott Marlowe" <scott.marlowe@gmail.com>)
List pgsql-general
Hi all,

Our PostgreSQL DB had been running fine for a long time, but suddenly we
encountered a huge problem which we have only been able to fix temporarily.

We are running Debian stable with PostgreSQL 8.1.11. Our app connects via JDBC
and uses prepared statements.

We are not running autovacuum, but a nightly VACUUM ANALYZE at 3:00 am
(without the 'full' option, of course).
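
For clarity, the nightly job is essentially just a plain database-wide VACUUM
ANALYZE run from cron, roughly like this (the database name and user here are
only placeholders, not our real setup):

    # crontab: run a database-wide VACUUM ANALYZE at 3:00 am
    0 3 * * *   psql -U postgres -d ourdb -c "VACUUM ANALYZE;"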

Our largest table has 80,000,000 records and a relation size of 4.4 GB. The
indexes on this table take 5.5 GB (total_relation_size = 10 GB). This table is
queried very frequently. The second largest table is not even 20% of the
largest one. Other tables range from 10 KB to 300 MB (total_relation_size).
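
(The sizes above come from the built-in size functions in 8.1, roughly like
this, with "big_table" standing in for the real table name:)

    SELECT pg_size_pretty(pg_relation_size('big_table'))       AS table_size,
           pg_size_pretty(pg_total_relation_size('big_table')) AS total_size;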

The server is a dedicated DB server with a dual-core CPU (AMD Athlon(tm) 64
X2), 8 GB of RAM and a 3ware RAID-1 with SATA hard disks.

Everything ran fine. The vacuum run took quite a long time each night (about
60 minutes), but we didn't care as it was a very low-traffic time.

Suddenly one morning the database was still running but it was VERY slow. Lots
of SELECT queries were hanging around waiting for an answer.

The VACUUM process had already finished by that time.

Our first try was to restart the database, but after the restart the problem
occurred again. The vacuum process had already finished BEFORE the restart and
was not run again; we just restarted the database. The next try was a REINDEX
with all clients disconnected, but that didn't help either.

The next try was to disconnect all clients again (servlet containers), then
restart the database and run "VACUUM ANALYZE" as the only query, with no other
concurrent queries. It took an hour. Afterwards we brought the clients back up
and everything ran smoothly.
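
In commands, that recovery procedure was roughly the following (the init
script name is the one from Debian's postgresql-8.1 package; the database
name is again just a placeholder):

    # with all servlet containers shut down
    /etc/init.d/postgresql-8.1 restart
    psql -U postgres -d ourdb -c "VACUUM ANALYZE;"   # the only session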

We thought it might have been just some kind of weird circumstance, so we left
everything else as it was.

The next night the vacuum process ran again, and immediately after it finished
the database slowed down, the same behaviour as the night before. We stopped
the database, disconnected all clients and ran VACUUM ANALYZE without
concurrent queries. After restarting, everything was fine again.

Looking at our CPU monitoring we saw huge I/O wait while VACUUM was running,
which is not unusual. But just after VACUUM had finished, the user CPU load
rose to 200% (dual CPU). At that time in the early morning there were only
very few users using our app. We analyzed the situation further and saw that
most queries had a very long duration (minutes instead of milliseconds). BTW:
the vacuum process was not running at this moment; it had already been logged
as done.
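
(For completeness, the running backends can be listed with something like the
query below; note that current_query only shows the real query text when
stats_command_string is on, which it is not in our config further down:)

    SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
    ORDER BY query_start;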

As we cannot afford a downtime every morning, we disabled the nightly VACUUM.
The next morning everything was fine, and it has been ever since (that was 10
days ago).

Of course, not vacuuming at all is not a good idea.

I don't have any clue what is happening. My thoughts about it are as follows
(and might be totally wrong):

* Maybe the VACUUM ANALYZE process does not get enough memory and therefore
cannot ANALYZE the tables correctly. It then writes wrong statistics to the
database, which results in wrong execution plans using sequential scans
instead of index scans. This only happens when VACUUM ANALYZE runs
concurrently with user requests; if it runs on its own, it has enough memory
and writes correct statistics to the database. A rough way to check this is
sketched below.
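
One way to test the hypothesis would be to take a query that suddenly became
slow and compare its plan (and the statistics ANALYZE produced) right after
the nightly run versus after a standalone VACUUM ANALYZE. The table and
column names below are only examples:

    EXPLAIN ANALYZE
    SELECT * FROM big_table WHERE some_indexed_column = 42;

    -- statistics ANALYZE gathered for that column
    SELECT attname, n_distinct, null_frac, correlation
    FROM pg_stats
    WHERE tablename = 'big_table' AND attname = 'some_indexed_column';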

Here are some of our configuration parameters. We never really tweaked it as
it ran fine. We just raised some parameters. The following list should show
all parameters changed from the default:

$ cat /etc/postgresql/8.1/main/postgresql.conf | grep -v '^[[:space:]]*#' | grep -v '^$'
listen_addresses = '*'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
shared_buffers = 30000
work_mem = 10240
maintenance_work_mem = 163840
vacuum_mem = 32000
max_fsm_pages = 500000
bgwriter_lru_percent = 10.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 5
bgwriter_all_maxpages = 200
wal_buffers = 16
checkpoint_segments = 10
checkpoint_warning = 3600
effective_cache_size = 180000
random_page_cost = 3
log_min_messages = info
log_min_error_statement = warning
log_min_duration_statement = 500
log_line_prefix = '%t [%p]'
stats_command_string = off
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = on
autovacuum = off
autovacuum_naptime = 3000
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'
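
(One more thing worth checking against this config: on 8.1 a database-wide
VACUUM VERBOSE run as superuser prints a "free space map" summary at the very
end, which shows whether max_fsm_pages = 500000 is actually large enough for
this database:)

    VACUUM VERBOSE;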

I already tried to google my problem and came across some answers which seemed
to fit, but they were all slightly different. Of course, updating to 8.3 or
restoring the whole database might help, but I really want to understand what
I did wrong, to improve my skills.

If any further information is needed, please ask. I hope my information was
not too verbose and that somebody can help me with my problem. Your help is
very much appreciated.

kind regards,
janning

