Hello, all.
Recently I found a very strange situation in which a primary node loses data. The
details are in the first patch: 0001-Add-test-case-data-lost-after-restart.patch.
The first patch shows that data can be lost if someone else truncates a physical
relation file before the primary node is started. Nevertheless, the primary node
still starts up normally without any warning, even though it finds invalid pages
during crash recovery.
I then found another situation, involving an aborted transaction. The details are
in the second patch: 0002-Add-test-case-for-abort-transaction-across-checkpoin.patch.
The second patch shows that a transaction abort that spans a checkpoint can also
produce invalid pages during crash recovery, and can leave some relation files
undeleted forever. Again the primary node starts up normally without any warning,
just as in the first situation.
By the way, both of the experiments above were run after setting the following
parameters:
$node_primary->append_conf('postgresql.conf', 'synchronous_commit=on');
$node_primary->append_conf('postgresql.conf', 'full_page_writes=off');
$node_primary->append_conf('postgresql.conf', 'log_min_messages=debug2');
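For anyone who wants to reproduce this by hand outside the TAP framework, I assume
the equivalent plain postgresql.conf entries would look like the fragment below.
This is only a sketch of the three settings listed above; the test cases in the
patches may set up more than this, and everything else is assumed to be left at
its default.

```
# Same three settings as the append_conf() calls above
synchronous_commit = on
full_page_writes = off        # no full-page images in WAL after a checkpoint
log_min_messages = debug2     # verbose enough to see recovery debug messages
```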
In my opinion, the primary node should report any invalid pages found during
crash recovery, just as a standby node does after it has reached a consistent
recovery state. So I am attaching a third patch with the bug fix,
0003-Check-invalid-pages-at-the-end-of-recovery.patch, which does the following two things:
(1) The primary node checks for invalid pages at the end of recovery;
(2) The abort WAL record is flushed before any relation files are truncated or deleted.
Best wishes,
rogers.ww.