Hi,
On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Add TAP test for archive_cleanup_command and recovery_end_command
>
> grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.

> # Postmaster PID for node "standby" is 291160
> ok 1 - check content from archives
> not ok 2 - archive_cleanup_command executed on checkpoint
>
> # Failed test 'archive_cleanup_command executed on checkpoint'
> # at t/002_archiving.pl line 74.
>
> This test is sending a CHECKPOINT command to the standby and
> expecting it to run the archive_cleanup_command, but it looks
> like the standby did not actually run any checkpoint:
>
> 2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG: connection received: host=[local]
> 2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG: connection authorized: user=bf database=postgres application_name=002_archiving.pl
> 2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG: statement: CHECKPOINT
> 2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG: disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]
>
> I am suspicious that the reason is that ProcessUtility does not
> ask for a forced checkpoint when in recovery:
>
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
> (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
>
> The trouble with this theory is that this test has been there for
> nearly six months and this is the first such failure (I scraped the
> buildfarm logs to be sure). Seems like failures should be a lot
> more common than that.
>
> I wondered if the recent pg_stats changes could have affected this, but I
> don't really see how.

I don't really see how either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?
It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.
I guess we'll have to wait and see what the frequency of the problem is?
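
For reference, a minimal standalone sketch of how the flag expression quoted
above composes. The flag values and the helper name checkpoint_request_flags()
are assumptions made for illustration, not copied from the tree; the only
point is that CHECKPOINT_FORCE is left out when recovery is in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits: assumed values, not taken from the PostgreSQL tree. */
#define CHECKPOINT_IMMEDIATE 0x0004
#define CHECKPOINT_FORCE     0x0008
#define CHECKPOINT_WAIT      0x0020

/* The flags are combined with bitwise OR, so they must not share bits. */
static_assert((CHECKPOINT_IMMEDIATE & CHECKPOINT_FORCE) == 0 &&
              (CHECKPOINT_IMMEDIATE & CHECKPOINT_WAIT) == 0 &&
              (CHECKPOINT_FORCE & CHECKPOINT_WAIT) == 0,
              "checkpoint flag bits must not overlap");

/*
 * Hypothetical helper mirroring the quoted expression: always request an
 * immediate checkpoint and wait for it, but only add the "force" bit when
 * the server is not in recovery.
 */
int
checkpoint_request_flags(bool recovery_in_progress)
{
    return CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
        (recovery_in_progress ? 0 : CHECKPOINT_FORCE);
}
```

With these assumed values, checkpoint_request_flags(true), the standby case,
comes back without the CHECKPOINT_FORCE bit set, which is the condition the
theory above depends on.
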
Greetings,
Andres Freund