On Sat, Feb 25, 2023 at 12:02 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Feb 24, 2023 at 09:36:50AM +0900, Michael Paquier wrote:
> > I was thinking about that, and you may be fine as long as you skip
> > some parts of the restartpoint logic. The case reported in this
> > thread does not cause crash recovery, actually, because startup
> > switches to +archive+ recovery any time it sees a backup_label file.
> > One thing I did not remember here is that we also set minRecoveryPoint
> > at a much earlier LSN than it should be (see 6c4f666). However, we
> > rely heavily on backupEndRequired in the control file to make sure
> > that we've replayed up to the end-of-backup record to decide if the
> > system is consistent or not.
>
> I have been spending more time on that to see if I was missing
> something, and reproducing the issue is rather easy by using pgbench
> that gets stopped with a SIGINT so that restart points would be able to
> see transactions still running in the code path triggering the assert.
> A cheap regression test should be possible, actually, though for now
> the only thing I have been able to rely on is a hack to force
> checkpoint_timeout at 1s to make the failure rate more aggressive.
>
> Anyway, with this simple method (and an increase of short pgbench runs
> that are interrupted to increase the chance of hits), a bisect points
> at 7ff23c6 :/

Thanks. I've been thinking about how to make a deterministic test
script to study this and possible fixes, too. Unfortunately I came
down with a nasty cold and stopped computing for a couple of days, so
sorry for the slow response on this thread, but I seem to have
rebooted now. Looking.