Dear Hackers,
I would like to discuss the ProcessTwoPhaseBuffer function. It reads two-phase transaction states from disk or the WAL.
It takes xid as well as some other input parameters and executes the following steps:
Step #1: Check whether xid is committed or aborted in clog (TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check that xid is not equal to or greater than ShmemVariableCache->nextXid
Step #3: Read the two-phase state for the specified xid from memory or the corresponding file and return it
In some very rare scenarios, the postgres instance will never recover because of this logic. Imagine that the
two_phase directory contains files with two-phase states of transactions from the distant future. I assume this can
happen if some WAL segments are broken and ignored (as well as clog data) but the two_phase directory is intact. During
recovery, PostgreSQL reads all the files in two_phase and tries to recover the two-phase states.
The problem appears in the functions TransactionIdDidCommit and TransactionIdDidAbort. These functions may fail with a
FATAL message like the one below when no clog state is available on disk for the xid:
FATAL: could not access status of transaction 286331153
DETAIL: Could not open file "pg_xact/0111": No such file or directory.
Such an error prevents the PostgreSQL instance from starting.
My guess is that if we swap Step #1 with Step #2, this error will disappear, because such transactions will be filtered
out when comparing xid with ShmemVariableCache->nextXid, before clog is accessed. The function will be more robust. In
general, it works, but I'm not sure that this change will not break some rare boundary cases. Another solution is to
catch and ignore the error, but the first solution is the simpler one. I would appreciate any thoughts on this topic.
Maybe you know of some cases where such a change in logic would not be appropriate?
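
To illustrate the reordering, here is a minimal toy model in C. It is not the actual PostgreSQL code: ToyState,
TwoPhaseOutcome, toy_clog_lookup and both process_* functions are hypothetical names made up for this sketch. It
assumes that clog status exists only for xids below next_xid, and it models the FATAL error as a return value. The
helper toy_xid_follows_or_equals uses a wraparound-aware comparison in the spirit of TransactionIdFollowsOrEquals.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical outcomes of examining one two-phase state file. */
typedef enum
{
    TWOPHASE_KEEP,          /* state is read and returned (Step #3) */
    TWOPHASE_SKIP_FINISHED, /* xid already committed or aborted (Step #1) */
    TWOPHASE_SKIP_FUTURE,   /* xid >= next_xid (Step #2) */
    TWOPHASE_FATAL          /* clog lookup failed, startup would stop */
} TwoPhaseOutcome;

/* Toy stand-in for shared state: next xid plus a list of finished xids. */
typedef struct
{
    TransactionId        next_xid;
    const TransactionId *finished;
    size_t               n_finished;
} ToyState;

/* Wraparound-aware "id1 >= id2" using modulo-2^32 arithmetic. */
static bool
toy_xid_follows_or_equals(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) >= 0;
}

/*
 * Toy clog lookup. ASSUMPTION: clog pages exist only for xids below next_xid.
 * Returns false when no clog data exists (the real functions raise FATAL).
 */
static bool
toy_clog_lookup(const ToyState *s, TransactionId xid, bool *finished)
{
    if (toy_xid_follows_or_equals(xid, s->next_xid))
        return false;
    *finished = false;
    for (size_t i = 0; i < s->n_finished; i++)
        if (s->finished[i] == xid)
            *finished = true;
    return true;
}

/* Current order: Step #1 (clog), then Step #2 (nextXid). */
TwoPhaseOutcome
process_clog_first(const ToyState *s, TransactionId xid)
{
    bool finished;

    if (!toy_clog_lookup(s, xid, &finished))
        return TWOPHASE_FATAL;
    if (finished)
        return TWOPHASE_SKIP_FINISHED;
    if (toy_xid_follows_or_equals(xid, s->next_xid))
        return TWOPHASE_SKIP_FUTURE;
    return TWOPHASE_KEEP;
}

/* Proposed order: Step #2 (nextXid) first, so future xids never reach clog. */
TwoPhaseOutcome
process_nextxid_first(const ToyState *s, TransactionId xid)
{
    bool finished;

    if (toy_xid_follows_or_equals(xid, s->next_xid))
        return TWOPHASE_SKIP_FUTURE;
    if (!toy_clog_lookup(s, xid, &finished))
        return TWOPHASE_FATAL;
    return finished ? TWOPHASE_SKIP_FINISHED : TWOPHASE_KEEP;
}
```

In this model, with next_xid = 1000, the xid 286331153 from the log above ends in TWOPHASE_FATAL with the current order
and in TWOPHASE_SKIP_FUTURE with the swapped order, while xids below next_xid are handled identically by both. The model
says nothing about the boundary cases in the real code, which is exactly what I am unsure about.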
Thank you in advance!
With best regards,
Vitaly