Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers

From Andres Freund
Subject Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date
Msg-id 20220407175210.q44nnrvkovprxo2a@alap3.anarazel.de
In response to Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-04-07 13:40:30 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Add TAP test for archive_cleanup_command and recovery_end_command
> 
> grassquit just showed a non-reproducible failure in this test [1]:

I was just staring at that as well.


> # Postmaster PID for node "standby" is 291160
> ok 1 - check content from archives
> not ok 2 - archive_cleanup_command executed on checkpoint
> 
> #   Failed test 'archive_cleanup_command executed on checkpoint'
> #   at t/002_archiving.pl line 74.
> 
> This test is sending a CHECKPOINT command to the standby and
> expecting it to run the archive_cleanup_command, but it looks
> like the standby did not actually run any checkpoint:
> 
> 2022-04-07 16:11:33.060 UTC [291806][not initialized][:0] LOG:  connection received: host=[local]
> 2022-04-07 16:11:33.078 UTC [291806][client backend][2/15:0] LOG:  connection authorized: user=bf database=postgres application_name=002_archiving.pl
> 2022-04-07 16:11:33.084 UTC [291806][client backend][2/16:0] LOG:  statement: CHECKPOINT
> 2022-04-07 16:11:33.092 UTC [291806][client backend][:0] LOG:  disconnection: session time: 0:00:00.032 user=bf database=postgres host=[local]
> 
> I am suspicious that the reason is that ProcessUtility does not
> ask for a forced checkpoint when in recovery:
> 
>             RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
>                               (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
> 
> The trouble with this theory is that this test has been there for
> nearly six months and this is the first such failure (I scraped the
> buildfarm logs to be sure).  Seems like failures should be a lot
> more common than that.
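
To make the quoted expression concrete, here is a minimal standalone sketch of how the request flags differ between a primary and a standby. The helper name checkpoint_request_flags is hypothetical, and the flag values are illustrative stand-ins rather than something to rely on; only the shape of the expression comes from the snippet above.

```c
#include <stdbool.h>

/*
 * Illustrative bit values (assumption): stand-ins for the flag
 * definitions in the PostgreSQL headers. Only their distinctness
 * matters for this sketch.
 */
#define CHECKPOINT_IMMEDIATE 0x0004
#define CHECKPOINT_FORCE     0x0008
#define CHECKPOINT_WAIT      0x0020

/*
 * Hypothetical helper mirroring the quoted expression: the FORCE bit
 * is only added when the server is not in recovery.
 */
static int
checkpoint_request_flags(bool recovery_in_progress)
{
    return CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
           (recovery_in_progress ? 0 : CHECKPOINT_FORCE);
}
```

With this, checkpoint_request_flags(true), the standby case, yields IMMEDIATE | WAIT without FORCE, while checkpoint_request_flags(false) includes all three bits. That is the asymmetry the theory rests on.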

> I wondered if the recent pg_stats changes could have affected this, but I
> don't really see how.

I don't really see how either. It's a bit more conceivable that the recovery
prefetching changes could affect the timing sufficiently?

It's also possible that it requires an animal of a certain speed to happen -
we didn't have an -fsanitize=address animal until recently.

I guess we'll have to wait and see what the frequency of the problem is?

Greetings,

Andres Freund


