I've been testing the crash recovery of REL9_2_BETA1, using the same
method I posted in the "Scaling XLog insertion" thread. I have the
checkpointer occasionally throw a FATAL error, which causes the
postmaster to take down all of the other processes (DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.) and initiate recovery.
However, sometimes the automatic recovery never initiates. It looks
like the postmaster is waiting for the archiver to exit before it
starts recovery, and the archiver is waiting for something, though I
don't really know what.
This happens on about 10% of the crashes on REL9_2_BETA1, although I
imagine that number is extremely dependent on the minutiae of the setup,
hardware, and phase of the moon, as it is probably some kind of race.
This behavior is also present in 9_1_STABLE, although at a much lower
prevalence (about 1%). In fact it seems to go back at least to 8.4.0.
If I kill -9 the archiver, then recovery initiates and proceeds as normal.
I don't know the best way to tackle this: by staring at the code, by
"git bisect" (which is hard to do, because I don't know whether the
behavior was ever absent, and because the problem only occurs
statistically, so each iteration can take many hours), or by some
other method?
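To put a rough number on why each bisect iteration is so expensive, here
is a minimal sketch of the arithmetic. It assumes each induced crash is
an independent trial with a fixed chance of hanging, which is probably
only approximately true for a race. The function name trials_needed is
hypothetical and is not part of any PostgreSQL tooling.

```python
import math


def trials_needed(hang_rate: float, confidence: float) -> int:
    """Number of consecutive clean crash recoveries needed before
    concluding, at the given confidence, that a build does not hang.

    Assumes independent crashes with a fixed per-crash hang probability,
    so the chance of seeing n clean recoveries on an affected build is
    (1 - hang_rate) ** n. We want that to be <= 1 - confidence.
    """
    if not (0.0 < hang_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("hang_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    # Rates taken from the observations above: ~10% on REL9_2_BETA1,
    # ~1% on 9_1_STABLE. The 99% confidence level is an arbitrary choice.
    for rate in (0.10, 0.01):
        print(f"hang rate {rate:.0%}: {trials_needed(rate, 0.99)} clean crashes")
```

Under those assumptions, marking a commit "good" with 99% confidence
takes 44 clean crashes at a 10% hang rate and 459 at a 1% hang rate, so
the older branches are far more costly to bisect than 9.2.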
Thanks,
Jeff