Wow. Have you seen that in an actual production environment?
Yes, we see it regularly, and it is reproducible in test environments as well.
my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
	"\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);
So you are emulating a failure by zero-filling the second page, where the last emit_message() generated a record, while the page before it holds the beginning of the continuation record. Then you abuse WAL archiving to force the replay of the last record. That's kind of cool.
Right, at this point it is easier than causing an artificial crash on the primary right after it has finished writing just one page.
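For readers who want to try the page-zeroing trick outside the TAP test, here is a minimal, self-contained sketch of the idea. It is not the code from the test: zero_page() is a hypothetical stand-in for write_wal(), it takes a byte offset within a file rather than an LSN and timeline, and the 8 kB page size is an assumption (the default XLOG_BLCKSZ).

```perl
use strict;
use warnings;

# Assumption: 8 kB WAL pages (the default XLOG_BLCKSZ). A real test should
# take the value from the server rather than hard-coding it.
my $WAL_BLOCK_SIZE = 8192;

# Round a byte position down to the start of the page that contains it.
sub start_of_page
{
	my ($pos) = @_;
	return $pos - ($pos % $WAL_BLOCK_SIZE);
}

# Overwrite one full page of an existing file with zeros, starting at the
# given byte offset. Hypothetical helper, named here for illustration only.
sub zero_page
{
	my ($path, $offset) = @_;

	open(my $fh, '+<:raw', $path) or die "could not open $path: $!";
	seek($fh, $offset, 0) or die "could not seek in $path: $!";
	print {$fh} ("\x00" x $WAL_BLOCK_SIZE)
	  or die "could not write to $path: $!";
	close($fh) or die "could not close $path: $!";
	return;
}
```

Calling zero_page($file, start_of_page($pos)) wipes the whole page containing $pos and leaves the preceding page, where the record begins, untouched.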
> To be honest, I don't know yet how to fix it nicely. I am thinking about
> returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> timeline while trying to read a page and if this page is invalid.
Hmm. I suspect that you may be right about a TLI change when reading a page. There are a bunch of corner cases with continuation records and header validation around XLogReaderValidatePageHeader(). Perhaps you have an idea for a patch to show your point?
Not yet, but hopefully I will get something done next week.
Nit: in your test, it seems to me that you should not call set_standby_mode and enable_restoring directly; just rely on has_restoring with the standby option included.