Postgres version 8.4.4
Hardware:
12 cpu intel
sda 15K 600GB raid 1
sdb ssd 173GB raid 1 (most tables and indexes on sdb)
adaptec controller
24G of memory
shared buffers 12GB
Machine age 3 months
Normal load 2-4
200-300 Transactions per second
We had a database failure last night after one of the tables developed
a corrupted block. Some time after we noticed the corruption, all
available memory plus swap was used up. The database died (or rather,
was killed by the kernel) with an out of memory error. We switched to
the warm standby, which doesn't have any corruption.
In the postmortem I found 4 tables with corruption. The only thing
linking these tables is that autovacuum (to prevent wraparound) was
either running or had recently run on each of them. All of the tables
are in the gigabyte or multi-gigabyte range. The vacuum of some of the
tables had been going on for days.
The errors in the log file were of the form:
ERROR: invalid page header in block 290125 of relation
pg_tblspc/16385/18674/205612
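For anyone wanting to inspect the damaged page directly, the block number in that message can be mapped to a file on disk. The following is only a sketch: it assumes a default build (8 kB blocks, 1 GB segment files) and the 8.4 tablespace layout of pg_tblspc/<tablespace OID>/<database OID>/<relfilenode>. The function name locate_block is made up for illustration.

```python
# Assumptions (defaults for a stock build; not confirmed for this server):
BLOCK_SIZE = 8192                 # BLCKSZ, bytes per page
SEGMENT_SIZE = 1024 ** 3          # relation files are split every 1 GB
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE   # 131072 pages per segment


def locate_block(relation_path, block_number):
    """Map a relation path and block number to a segment file and byte offset."""
    # The last three path components are assumed to be
    # tablespace OID / database OID / relfilenode.
    tablespace_oid, database_oid, relfilenode = relation_path.split("/")[-3:]

    # Segment 0 has no suffix; later segments are named <relfilenode>.1, .2, ...
    segment, block_in_segment = divmod(block_number, BLOCKS_PER_SEGMENT)
    segment_file = relation_path if segment == 0 else f"{relation_path}.{segment}"

    return {
        "tablespace_oid": int(tablespace_oid),
        "database_oid": int(database_oid),
        "relfilenode": int(relfilenode),
        "segment_file": segment_file,
        "block_in_segment": block_in_segment,
        "byte_offset": block_in_segment * BLOCK_SIZE,
    }


if __name__ == "__main__":
    info = locate_block("pg_tblspc/16385/18674/205612", 290125)
    for key, value in info.items():
        print(f"{key}: {value}")
```

Under those assumptions, block 290125 falls in the third segment file (suffix .2) at byte offset 229220352, which is where a tool such as dd or a hex dump could be pointed to look at the 8192-byte page header.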
After an attempted vacuum we had this error:
2010-06-24 17:31:09 UTC [31766]: [36-1]WARNING: PD_ALL_VISIBLE flag was
incorrectly set in relation "org_crawl_page_scrape_result" page 128902
The first error was logged at 10:15pm
At 1 am a pg_dump was run from cron and failed after 20 minutes while
trying to allocate an immense amount of memory as it attempted to
dump one of the corrupted tables.
At 2:00 am all memory was used up, the CPU was maxed out, and the load
average was 56.
We transferred to the standby and rebooted the machine. At this time
the database is sitting there though we'll need to remake the database
and turn it into the warm standby.