Re: Losing records when server hang - Mailing list pgsql-general

From lec
Subject Re: Losing records when server hang
Msg-id 41182683.6090409@streamyx.com
In response to Losing records when server hang  (lec <limec@streamyx.com>)
List pgsql-general
Tom Lane wrote:
Marco Colombo <marco@esi.it> writes: 
Tom Lane wrote:   
However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.     
 
Well, no. How about this theory:   
 
1) everything is ok:
   the backend executes write()/fsync() for transactions 1-5

2) hardware fails somehow at MB level (imagine CPU/RAM overheating):
   RAM gets corrupted - kernel starts oopsing (but goes on)
   meanwhile, the backend executes write()/fsync() for transactions 6-10,
   but randomly corrupted data gets written to disk.
 
3) unrecoverable kernel error occurs, the show stops.   
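
The write()/fsync() step in the theory above can be sketched roughly as follows. This is a minimal Python illustration, not PostgreSQL's actual code; the helper name `append_durably` and the file it writes to are hypothetical.

```python
import os
import tempfile

def append_durably(path, payload):
    """Append payload to path and ask the OS to flush it to stable storage.

    Hypothetical helper for illustration only. Returns the number of
    bytes handed to the kernel by write().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = os.write(fd, payload)  # hand the bytes to the kernel
        os.fsync(fd)                     # block until the kernel reports them flushed
        return written
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "demo.log")
        for txid in range(1, 6):
            append_durably(log_path, f"commit {txid}\n".encode())
```

Note that fsync() only promises that the bytes the kernel was given have been flushed. If RAM is already corrupted when write() is called, the corrupted bytes are what gets flushed, and the call still reports success.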
 
On recovery, transactions 6-9 don't even look like valid log entries, while
10, for some reason, does (maybe only data is corrupted).   
 
I'm not familiar with the details of WAL files and post-crash recovery,
but is that possible? Or does the process stop at the first failure?   
Recovery will stop at the first corrupted record, so it would not happen
like that.  But you are right, the MB failure alone might have been
enough to corrupt the outgoing WAL log data and thus produce the
scenario I described.  Once Postgres *thinks* transactions 1-10 are
safely down to disk in the WAL log, it will feel free to update the data
files in any random order that seems convenient.  So the write of record
10 could have occurred before the rest, and if that happened not to get
corrupted by the MB problem, we could see the result lec describes.
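
The "stop at the first corrupted record" behaviour can be illustrated with a toy log replay. This is only a sketch under stated assumptions: the record layout (CRC, transaction id, length, payload) and the names `make_record` and `replay` are invented for the example and are not PostgreSQL's actual WAL format or code.

```python
import struct
import zlib

HEADER = 12  # 4-byte CRC + 4-byte txid + 4-byte payload length (made-up layout)

def make_record(txid, payload):
    """Build one toy log record: CRC32 over (txid, length, payload)."""
    body = struct.pack("<II", txid, len(payload)) + payload
    return struct.pack("<I", zlib.crc32(body)) + body

def replay(log):
    """Return the txids replayed, stopping at the first record that fails its check."""
    replayed = []
    offset = 0
    while offset + HEADER <= len(log):
        (crc,) = struct.unpack_from("<I", log, offset)
        txid, length = struct.unpack_from("<II", log, offset + 4)
        end = offset + HEADER + length
        if end > len(log):
            break  # truncated record: treat as end of usable log
        if zlib.crc32(log[offset + 4:end]) != crc:
            break  # corrupted record: nothing after it is replayed
        replayed.append(txid)
        offset = end
    return replayed

# Ten records; flip the last payload byte of records 6-9 to mimic bad writes.
records = [make_record(t, f"row {t}".encode()) for t in range(1, 11)]
damaged = []
for txid, rec in enumerate(records, start=1):
    if 6 <= txid <= 9:
        rec = rec[:-1] + bytes([rec[-1] ^ 0xFF])
    damaged.append(rec)
log = b"".join(damaged)

print(replay(log))  # [1, 2, 3, 4, 5]
```

In this toy model record 10 is intact in the log but is never reached by replay, so it could only show up after recovery if its change had already been written to the data files, which is the scenario described above.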

Of course this is all guesswork since we have no direct evidence to look
at, but it seems fairly plausible.
 
Anyway, if your CPU/RAM is failing, no DB technology can save you.   
Agreed.  Software certainly cannot make any guarantees if it can't even
execute correctly ...
 
Same here.  I would rather not have to prove anything when the hardware isn't reliable, but "management" is asking about the lost transactions and blaming the system, the software, and the database.  I could show them that the lost transactions were due to the system hang, but transaction #10 being there casts doubt on my reasoning.

Thanks for all your feedback and reasoning.

--lec
