Thread: implosion follow up, 7.4.5

From: Cott Lang

The new thread on 7.4.5 losing committed transactions popped up just as
I discovered something that was at least unexpected to me. 

While cleaning up after the pg_resetxlog from today's earlier fun, I
found some missing rows and some duplicate row versions showing up in my
restore. All of this was within a 90 second period, which makes sense to
me.

What doesn't make sense to me is that I'm missing 19 records in one
table that were committed 3 hours before my crash. There were no errors
before the crash and no errors in the dump after the
pg_resetxlog. I have application logs that confirm these records were
present; not only do I have logs showing they were saved, but logs from
later processes manipulating these records.

I'm running 7.4.5 on RHAS 3 x86-64 on a 4x244 32GB system. It's NFS
attached. Derogatory remarks about NFS welcome, but you're preaching to
the choir. :)

The only unusual thing I noticed today was abominable performance
for several hours before the crash (Load=30, iowait=95%).  This machine
has been running for weeks with excellent performance - generally 4
times faster than my dual Xeon 2.4GHz, 12GB RAM, 6x36GB U320 RAID 1+0
systems.

Typically in my benchmarking sessions and application runs, I rarely saw
any read activity - it appeared that everything was pulled straight out
of the disk buffer cache. Today, NFS was choked with reads, despite
having 10GB of RAM free (!).  Nothing has changed on this machine in at
least 4 weeks. 

Any ideas are appreciated. While I'm sure the crash is hardware/config
related, the missing 19 records from something committed 3 hours earlier
is confusing. :)

As always, any insight is appreciated. We are very committed to
PostgreSQL after booting a large Oracle installation out 16 months ago.

thanks!