BUG #16760: Standby database missed records for at least 1 table - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #16760: Standby database missed records for at least 1 table
Date
Msg-id 16760-6c90634964ab51b0@postgresql.org
Whole thread Raw
Responses Re: BUG #16760: Standby database missed records for at least 1 table
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16760
Logged by:          Andriy Bartash
Email address:      abartash@xmatters.com
PostgreSQL version: 12.3
Operating system:   CentOS7
Description:

We run postgres in Google cloud and HA configured as follow:
Primary instance runs in US-EAST1 region
1st Secondary in US-EAST1 
2nd Secondary in US-CENTRAL1
Recently we discovered that 1st Secondary instance missed 8 records in one
table at least . At the same time, recovery process was working fine and
continued recovering. The 2nd Secondary had those records though.
What we missed : 
8 records in audit_evs_all table, audit_ev_id from 221535154 to 221535161
Where audit_ev_id is a PK of audit_evs_all
---------------------------------------------------------------
Below output from Primary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
Same query's output from 1st Secondary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

When it was discovered: roughly 7 hours after.
Column when_created represents when the record was inserted into the table,
so, as a proof that recovery process was running fine we can:
select max(when_created) from audit_evs_all;
              max
-------------------------------
 2020-12-01 07:38:18.258866+00
(1 row)

We didn't find any ERRORS in postgres logs on Primary either on 1st
Secondary around  2020-12-01 00:00:20.955348+00 (as we see all those 9
records were inserted at the same time as a bulk insert).
What was running on 1st Secondary around that time: We had pg_dump running
there between 12:00 a.m. and 1 a.m., so, we know that postgres suspended
recovery process while pg_dump was working and definitely it caused some lag
between Primary and 1st Secondary. 
When problem was found, 1st Secondary instance was restarted, we hoped that
postgres might identify this issue and apply missed wal files if any, but
nothing happened and it continued applying latest changes from the
Primary.
Recovery config from 1st Secondary (host IP replaced with xx.xx.xx.xx):
# recovery.conf
primary_conninfo = 'user=replication passfile=/var/lib/pgsql/pgpass
host=xx.xx.xx.xx port=5432 sslmode=prefer
application_name=postgres-prd-useast1-naprd8-1'
primary_slot_name = 'postgres_prd_useast1_naprd8_1'
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''

Please let me know if you need anything else form our end, we have a cold
backup of PGDATA from1st Secondary as well as Postgres logs from Primary and
Secondary.   

PS: It is the second time we see this issue in our environment (about 30 PG
clusters) within last 3 weeks but different cluster  this time.


pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #16759: Estimation of the planner is wrong for hash join
Next
From: Tom Lane
Date:
Subject: Re: BUG #16759: Estimation of the planner is wrong for hash join