There was a bug in the redo 2PC remove code path. Because of which, autovac would think that the 2PC is gone and cause removal of the corresponding clog entry earlier than needed.
Please find attached, the bug fix: 2pc_redo_remove_bug.patch.
I have been testing this on top of Michael's 2pc-restore-fix.patch and things seem to be ok for the past one+ hour. Will keep it running for long.
Jeff, thanks for these very useful scripts. I am going to make a habit to run these scripts on my side from now on. Do you have any other script that I could try against these patches? Please let me know.
This script is the only one I have that specifically targets 2PC. I wrote it last year when the previous round of speed-up code (which avoided writing the files upon "PREPARE" by delaying them until the next checkpoint) was developed. I just decided to dust that test off to try again here. I don't know how to change it to make it more targeted towards this set of patches. Would this bug have been seen in a replica server in the absence of crashes, or was it only vulnerable during crash recovery rather than streaming replication?