Archiver not exiting upon crash - Mailing list pgsql-hackers

From Jeff Janes
Subject Archiver not exiting upon crash
Msg-id CAMkU=1z7de-P8=r8EWQZpJoY=ccyE8OhFQey8zymKW7_ASX+Uw@mail.gmail.com
List pgsql-hackers
I've been testing the crash recovery of REL9_2_BETA1, using the same
method I posted in the "Scaling XLog insertion" thread.  I have the
checkpointer occasionally throw a FATAL error, which causes the
postmaster to take down all of the other processes (DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.)  and initiate recovery.

However, sometimes the automatic recovery never initiates.  It looks
like the postmaster is waiting for the archiver to exit before it
starts recovery, and the archiver is itself waiting for something,
though I don't really know what.

This happens on about 10% of the crashes on REL9_2_BETA1, although I
imagine that number is extremely dependent on the minutiae of the setup,
hardware, and phase of the moon, as it is probably some kind of race.

This behavior is also present in 9_1_STABLE, although at a much lower
prevalence (about 1%).  In fact, it seems to go back at least to 8.4.0.

If I kill -9 the archiver, then recovery initiates and proceeds as normal.
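
For anyone wanting to reproduce the workaround, here is a minimal Python
sketch of locating the archiver and sending it SIGKILL.  It assumes the
archiver's process title contains both "postgres:" and "archiver" (as ps
typically shows it on these releases); that marker, the function names,
and the use of "ps -A -o pid,args" are illustrative assumptions, not
anything taken from the PostgreSQL source.

```python
import os
import signal
import subprocess


def find_archiver_pids(ps_output, marker="archiver"):
    """Return PIDs from 'ps -A -o pid,args' output whose command line
    looks like a PostgreSQL archiver (assumed title: 'postgres: archiver ...')."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header line and anything without a command column.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if "postgres:" in args and marker in args:
            pids.append(pid)
    return pids


def kill_archiver():
    """Equivalent of 'kill -9' on every process that matches the assumed title."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_archiver_pids(out):
        os.kill(pid, signal.SIGKILL)
```

Calling kill_archiver() on a hung instance should have the same effect
as the manual kill -9.  Note that it matches every cluster on the
machine, so on a host running several postmasters the PID list should
be checked by hand first.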

I don't know the best way to tackle this.  Should I stare at the code,
use "git bisect" (which is hard to do, because I don't know if the
behavior was ever not there, and because the problem only occurs
statistically, so it can take many hours per iteration), or try some
other method?
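
To put a rough number on the cost of each bisect iteration, here is a
small Python sketch.  It assumes every crash hangs independently with a
fixed probability, which is a simplification for what is probably a
race; the function name and the 95% confidence level are illustrative
choices.  A build can only be marked "good" after n crashes in a row
recover cleanly, where (1 - p)^n is at most the accepted risk of
missing the bug.

```python
import math


def crashes_needed(hang_rate, confidence=0.95):
    """Number of consecutive clean recoveries required before a build can be
    called 'good', assuming independent crashes with a fixed hang probability."""
    if not 0.0 < hang_rate < 1.0:
        raise ValueError("hang_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Smallest n such that (1 - hang_rate) ** n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    for rate in (0.10, 0.01):
        print(f"hang rate {rate:.0%}: {crashes_needed(rate)} clean crashes needed")
```

Under these assumptions a 10% hang rate needs 29 clean crashes per
bisect step, and a 1% rate needs 299, which is why testing the older
branches is so much slower.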

Thanks,

Jeff