Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date
Msg-id 27895.1492808648@sss.pgh.pa.us
In response to Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Andres Freund <andres@anarazel.de>)
Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check?  See here:

> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)

> Line 3331 was added by this commit.

Note that while that commit was some time back, tern has only just started
running recovery-check, following its update to the latest buildfarm
script.  It looks like it's run that test four times and failed twice,
so far.  So, not 100% reproducible, but there's something rotten there.
Timing-dependent, maybe?

Some excavation in the buildfarm database says that the coverage for
the recovery-check test has been mighty darn thin up until just recently.
These are all the reports we have:

pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count 
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)

So, other than hamster which is certainly going to have its own spin
on the timing question, we have next to no track record for this test.
I wouldn't bet that this issue is unique to tern; more likely, that's
just the first critter to show an intermittent issue.
        regards, tom lane


