Vadim,

I have committed changes to separate the notions of critical sections
and sections that just want to hold off cancel/die interrupts, as we
discussed. While I was doing that, I noticed a couple of places that
I think you should take a second look at:

1. src/backend/access/nbtree/nbtinsert.c, line 867: shouldn't this
END_CRIT_SECTION be moved up to before the _bt_wrtbuf call? It seems
to me that an elog during _bt_wrtbuf is not a critical failure. If
this code is correct, then all the other crit sections are wrong,
because all of them release the crit section before writing buffers,
not after.

2. src/backend/commands/vacuum.c, line 1907: does this
START_CRIT_SECTION really have to be here, and not down at line 1935,
just before PageRepairFragmentation()? I really don't like the idea of
turning those elogs that are inside the loop into reasons to force a
system-wide restart.

3. src/backend/access/transam/xlog.c, routine CreateCheckPoint:
does this *entire* routine need to be a critical section? Again,
I fear a shotgun approach will mean a net decrease in reliability,
not an improvement. How much of this code really has to be critical?
Do you really want a failure in, say, MoveOfflineLogs to take down the
whole database?
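
To make the concern concrete, here is a minimal sketch of why the
placement of END_CRIT_SECTION matters. It is not the backend code: the
counter, the macro bodies, report_error(), the Outcome type, and both
function names are simplified stand-ins invented for illustration. The
only assumption it models is the one stated above, that an error raised
while a critical section is open is escalated to a system-wide restart.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins, NOT the backend's definitions: a nesting
 * counter and two macros that raise and lower it.
 */
static int crit_section_count = 0;

#define START_CRIT_SECTION()  (crit_section_count++)
#define END_CRIT_SECTION() \
    do { assert(crit_section_count > 0); crit_section_count--; } while (0)

typedef enum { LOCAL_ERROR, SYSTEM_RESTART } Outcome;

/*
 * Simplified model of reporting an error: inside a critical section
 * the failure is escalated, outside it stays an ordinary error.
 */
static Outcome report_error(void)
{
    return (crit_section_count > 0) ? SYSTEM_RESTART : LOCAL_ERROR;
}

/* Ordering questioned in point 1: the buffer write fails while the
 * critical section is still open. */
static Outcome write_inside_crit_section(void)
{
    Outcome result;

    START_CRIT_SECTION();
    /* ... change the page and log the change ... */
    result = report_error();    /* the buffer write fails here */
    END_CRIT_SECTION();
    return result;
}

/* Ordering used elsewhere: close the critical section first, then
 * write the buffer. */
static Outcome write_after_crit_section(void)
{
    START_CRIT_SECTION();
    /* ... change the page and log the change ... */
    END_CRIT_SECTION();
    return report_error();      /* the buffer write fails here */
}
```

With these stand-ins, write_inside_crit_section() yields SYSTEM_RESTART
and write_after_crit_section() yields LOCAL_ERROR. The same distinction
is what points 2 and 3 are about: every error path left inside the
section becomes a restart.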
regards, tom lane