Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.
> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check? See here:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
> Line 3331 was added by this commit.

Note that while that commit was some time back, tern has only just started
running recovery-check, following its update to the latest buildfarm
script. It looks like it's run that test four times and failed twice,
so far. So, not 100% reproducible, but there's something rotten there.
Timing-dependent, maybe?

Some excavation in the buildfarm database says that the coverage for
the recovery-check test has been mighty darn thin up until just recently.
These are all the reports we have:

pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)

So, other than hamster which is certainly going to have its own spin
on the timing question, we have next to no track record for this test.
I wouldn't bet that this issue is unique to tern; more likely, that's
just the first critter to show an intermittent issue.

regards, tom lane