Thread: [HACKERS] Small bug in replication lag tracking
Hi,

I discovered a thinko in the new replication lag interpolation code that can cause a strange number to be reported occasionally.

The interpolation code is designed to report increasing lag when replay gets stuck somewhere between two LSNs for which we have timestamp samples. The bug is that after sitting idle and fully replayed for a while and then encountering a new burst of WAL activity, we interpolate between an ancient sample and the not-yet-reached one for the new traffic, which is inappropriate.

It's hard to see obviously strange lag times, because they normally exist only for a very short moment between receiving the first and second replies from the standby, and they often look reasonable even if you do manage to catch one in pg_stat_replication. You can see the problem by pausing replay on the standby between two bursts of WAL separated by a long period of idleness.

Please find attached a patch to fix that, with comments to explain.

--
Thomas Munro
http://www.enterprisedb.com
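To illustrate the problem described above, here is a minimal Python model of the interpolation. It is only a sketch under assumptions, not the attached patch and not the server's C code: the names `Sample` and `interpolated_lag`, the integer LSNs, and the choice to model the fix as "forget the old sample once fully caught up" (passing `prev=None`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical (LSN, local timestamp) pair recorded when WAL was generated."""
    lsn: int
    time: float  # seconds


def interpolated_lag(prev: Optional[Sample], nxt: Sample,
                     replay_lsn: int, now: float) -> float:
    """Estimate replay lag when the standby sits between two samples.

    prev is the last sample the standby has already reached (None if it was
    forgotten after fully catching up); nxt is the first sample not yet reached.
    """
    if prev is None:
        # Only a future sample is known: report the lag that would apply
        # if that sample were replayed right now.
        return now - nxt.time
    # Linearly interpolate the time at which WAL at replay_lsn was generated.
    fraction = (replay_lsn - prev.lsn) / (nxt.lsn - prev.lsn)
    estimated_time = prev.time + fraction * (nxt.time - prev.time)
    return now - estimated_time


# Intended use: replay is stuck halfway between two recent samples.
stuck = interpolated_lag(Sample(1000, 10.0), Sample(2000, 11.0), 1500, 20.0)

# Problem case: fully replayed at LSN 1000, idle for an hour, then a new burst.
ancient = Sample(1000, 0.0)
burst = Sample(2000, 3600.0)
buggy = interpolated_lag(ancient, burst, 1000, 3600.5)  # uses the ancient sample
fixed = interpolated_lag(None, burst, 1000, 3600.5)     # ancient sample forgotten

print(stuck, buggy, fixed)
```

In this model, interpolating from the ancient sample reports roughly an hour of lag for a standby that is only half a second behind the new burst, while dropping the stale sample gives the half-second figure.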
On 23 June 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I discovered a thinko in the new replication lag interpolation code
> that can cause a strange number to be reported occasionally.

Thanks, will review.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 23 June 2017 at 08:18, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 23 June 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
>> I discovered a thinko in the new replication lag interpolation code
>> that can cause a strange number to be reported occasionally.
>
> Thanks, will review.

I've pushed this, but I think we should leave the code alone after this and wait for user feedback.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 24, 2017 at 6:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 23 June 2017 at 08:18, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 23 June 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>
>>> I discovered a thinko in the new replication lag interpolation code
>>> that can cause a strange number to be reported occasionally.
>>
>> Thanks, will review.
>
> I've pushed this, but I think we should leave the code alone after
> this and wait for user feedback.

Thanks! Yeah, I haven't heard any feedback about this, which I've been interpreting as good news... but if anyone's looking for something to beta test, reports would be most welcome.

--
Thomas Munro
http://www.enterprisedb.com