BUG #16760: Standby database missed records for at least 1 table - Mailing list pgsql-bugs

From	PG Bug reporting form
Subject	BUG #16760: Standby database missed records for at least 1 table
Date	December 2, 2020 23:22:37
Msg-id	16760-6c90634964ab51b0@postgresql.org
Responses	Re: BUG #16760: Standby database missed records for at least 1 table
List	pgsql-bugs

The following bug has been logged on the website:

Bug reference:      16760
Logged by:          Andriy Bartash
Email address:      abartash@xmatters.com
PostgreSQL version: 12.3
Operating system:   CentOS7
Description:

We run Postgres in Google Cloud, with HA configured as follows:
Primary instance runs in the US-EAST1 region
1st Secondary in US-EAST1
2nd Secondary in US-CENTRAL1
Recently we discovered that the 1st Secondary instance was missing 8 records
in at least one table. At the same time, the recovery process was working
fine and continued recovering. The 2nd Secondary had those records, though.
What was missing:
8 records in the audit_evs_all table, audit_ev_id from 221535154 to 221535161,
where audit_ev_id is the PK of audit_evs_all.
---------------------------------------------------------------
Output from the Primary:
select audit_ev_id, when_created from audit_evs_all where audit_ev_id between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
Output of the same query from the 1st Secondary:
select audit_ev_id, when_created from audit_evs_all where audit_ev_id between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)
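
To narrow down which keys a standby is missing, the id lists returned by the same query on both instances can be diffed and collapsed into ranges. The following is only a minimal sketch: the function name `missing_id_ranges` is hypothetical, and it assumes the ids have already been fetched from each instance into plain Python collections.

```python
from typing import Iterable, List, Tuple


def missing_id_ranges(primary_ids: Iterable[int],
                      standby_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges of ids that are present
    on the primary but absent on the standby."""
    missing = sorted(set(primary_ids) - set(standby_ids))
    ranges: List[Tuple[int, int]] = []
    for key in missing:
        if ranges and key == ranges[-1][1] + 1:
            # Extend the current run of consecutive missing ids.
            ranges[-1] = (ranges[-1][0], key)
        else:
            # Start a new run.
            ranges.append((key, key))
    return ranges


# Values taken from the two query outputs above.
primary = range(221535154, 221535163)  # 221535154 .. 221535162
standby = [221535162]
print(missing_id_ranges(primary, standby))  # [(221535154, 221535161)]
```

For large tables this comparison would presumably have to be done in batches of key ranges rather than by loading every id at once.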

When it was discovered: roughly 7 hours later.
The when_created column records when the row was inserted into the table,
so, as proof that the recovery process was running fine, we can run:
select max(when_created) from audit_evs_all;
              max
-------------------------------
 2020-12-01 07:38:18.258866+00
(1 row)

We didn't find any ERRORs in the Postgres logs on either the Primary or the
1st Secondary around 2020-12-01 00:00:20.955348+00 (as we can see, all 9
records were inserted at the same time, as a bulk insert).
What was running on the 1st Secondary around that time: we had pg_dump running
there between 12:00 a.m. and 1 a.m., so we know that Postgres suspended the
recovery process while pg_dump was working, and that definitely caused some
lag between the Primary and the 1st Secondary.
When the problem was found, the 1st Secondary instance was restarted. We hoped
that Postgres might identify the issue and apply the missed WAL files, if any,
but nothing happened and it continued applying the latest changes from the
Primary.
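
The replay lag mentioned above can be expressed in bytes by comparing the value of `pg_current_wal_lsn()` on the Primary with `pg_last_wal_replay_lsn()` on the standby. A `pg_lsn` value is printed as two hexadecimal numbers separated by a slash (high and low 32 bits of the WAL position). The sketch below only does that arithmetic; the helper names and the sample LSN values are hypothetical and are not taken from the affected cluster. A small lag does not prove that no rows are missing. It only shows how far behind replay is.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """Bytes of WAL the standby has not yet replayed."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_replay_lsn)


# Hypothetical LSNs, for illustration only.
print(replay_lag_bytes("16/B374D848", "16/B3000000"))
```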
Recovery config from the 1st Secondary (host IP replaced with xx.xx.xx.xx):
# recovery.conf
primary_conninfo = 'user=replication passfile=/var/lib/pgsql/pgpass host=xx.xx.xx.xx port=5432 sslmode=prefer application_name=postgres-prd-useast1-naprd8-1'
primary_slot_name = 'postgres_prd_useast1_naprd8_1'
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''

Please let me know if you need anything else from our end. We have a cold
backup of PGDATA from the 1st Secondary, as well as Postgres logs from the
Primary and Secondary.

PS: This is the second time we have seen this issue in our environment (about
30 PG clusters) within the last 3 weeks, but on a different cluster this time.
