Hi,
I spent a bit more time fixing the TAP test. The attached patch makes it
"work" for me (or I think it should, in principle). I'm not saying it's
the best way to do stuff.
With the patch applied, I tried running it, and I got a failure when
running pg_checksums. There's a log snippet describing the issue, but
AFAICS it's happening like this:
1) checksums are disabled
2) flip_data_checksums gets called
3) both clusters go through 'inprogress-on' and 'on' states
4) primary gets shutdown in 'immediate' mode
5) standby gets shutdown in 'fast' mode
6) we try to validate checksums on the standby, but control file still
says checksums=inprogress-on
This seems like a bug to me - AFAICS the expectation is that after fast
shutdown, we don't forget the checksum state. Or is that expected? In
that case the TAP test probably needs to check the control file, instead
of relying on the perl variable $data_checksum_state. Or maybe it should
check that the control file has the correct / expected state?
FWIW I don't think the primary shutdown matters. I've seen multiple of
these failures, and it happens even without primary shutdown. But the
standby "fast" shutdown is always there.
But this also shows a limitation of the TAP test - it never triggers the
shutdowns while flipping the checksums (in flip_data_checksums). I think
that's something worth testing.
regards
--
Tomas Vondra