Tom Lane wrote:
Marco Colombo <marco@esi.it> writes:
Tom Lane wrote:
However this would seem to imply disk drive misfeasance above and beyond
your motherboard problem.
Well, no. How about this theory:
1) everything is ok: the backend executes write()/fsync() for transactions 1-5
2) hardware fails somehow at the MB level (imagine CPU/RAM overheating): RAM gets corrupted and the kernel starts oopsing (but keeps going). Meanwhile, the backend executes write()/fsync() for transactions 6-10, but randomly corrupted data gets written to disk.
3) unrecoverable kernel error occurs, the show stops.
On recovery, transactions 6-9 don't even look like valid log entries, while
10, for some reason, does (maybe only the data is corrupted).
I'm not familiar with the details of WAL files and post-crash recovery,
but is that possible? Or does the process stop at the first failure?
Recovery will stop at the first corrupted record, so it would not happen
like that. But you are right, the MB failure alone might have been
enough to corrupt the outgoing WAL log data and thus produce the
scenario I described. Once Postgres *thinks* transactions 1-10 are
safely down to disk in the WAL log, it will feel free to update the data
files in any random order that seems convenient. So the write of record
10 could have occurred before the rest, and if that happened not to get
corrupted by the MB problem, we could see the result lec describes.
Of course this is all guesswork since we have no direct evidence to look
at, but it seems fairly plausible.
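
To make the "recovery stops at the first corrupted record" point concrete, here is a minimal Python sketch. It is not PostgreSQL's WAL format or its recovery code: the record layout (a length plus CRC-32 header) and the names encode_record and replay are assumptions invented for this illustration.

```python
import struct
import zlib
from typing import List

# Hypothetical record header: payload length, then CRC-32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes) -> List[bytes]:
    """Return the payloads of valid records, stopping at the first bad one."""
    recovered = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(log):
            break  # truncated record: treat as end of log
        payload = log[start:end]
        if zlib.crc32(payload) != crc:
            break  # corrupted record: nothing after it is replayed
        recovered.append(payload)
        offset = end
    return recovered


# Ten log records, one per transaction.
records = [encode_record(f"tx {n}".encode()) for n in range(1, 11)]

# Simulate the hardware fault: flip one payload byte in record 6.
damaged = bytearray(b"".join(records))
offset_of_6 = sum(len(r) for r in records[:5])
damaged[offset_of_6 + HEADER.size] ^= 0xFF

recovered = replay(bytes(damaged))
print([p.decode() for p in recovered])
```

In this sketch record 10 is still intact in the log, yet replay never reaches it. Under these assumptions, any visible effect of transaction 10 would have to come from a data-file write made before the crash, which is consistent with the guess above.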
Anyway, if your CPU/RAM is failing, no DB technology can save you.
Agreed. Software certainly cannot make any guarantees if it can't even
execute correctly ...
Same here. I don't even want to have to prove anything if the hardware isn't reliable, but "management" is asking about the lost transactions and blaming the system/software/database. I could prove to them that the lost transactions were due to the system hang, but transaction #10 being there makes my reasoning look doubtful.
Thanks for all your feedback and reasoning.
--lec