BUG #16760: Standby database missed records for at least 1 table - Mailing list pgsql-bugs
From: PG Bug reporting form
Subject: BUG #16760: Standby database missed records for at least 1 table
Msg-id: 16760-6c90634964ab51b0@postgresql.org
Responses: Re: BUG #16760: Standby database missed records for at least 1 table
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      16760
Logged by:          Andriy Bartash
Email address:      abartash@xmatters.com
PostgreSQL version: 12.3
Operating system:   CentOS7

Description:

We run Postgres in Google Cloud with HA configured as follows:
  - Primary instance runs in the US-EAST1 region
  - 1st Secondary in US-EAST1
  - 2nd Secondary in US-CENTRAL1

Recently we discovered that the 1st Secondary instance was missing at least 8
records in one table. At the same time, the recovery process was working fine
and continued recovering. The 2nd Secondary had those records, though.

What was missing: 8 records in the audit_evs_all table, audit_ev_id from
221535154 to 221535161, where audit_ev_id is the primary key of audit_evs_all.

---------------------------------------------------------------
Output from the Primary:

select audit_ev_id, when_created from audit_evs_all
where audit_ev_id between 221535154 and 221535162;

 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)

---------------------------------------------------------------
The same query's output from the 1st Secondary:

select audit_ev_id, when_created from audit_evs_all
where audit_ev_id between 221535154 and 221535162;

 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

When it was discovered: roughly 7 hours after.
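For anyone auditing a standby the same way, diffing the primary-key sets exported from each node (e.g. via COPY or psql) pinpoints exactly which rows are missing. A minimal sketch, using the ID range from this report as sample data:

```python
# Minimal sketch: given the PK sets fetched from the primary and from a
# standby, return the IDs the standby is missing.
def missing_on_standby(primary_ids, standby_ids):
    """Sorted list of primary keys present on the primary but not the standby."""
    return sorted(set(primary_ids) - set(standby_ids))

# The range from this report: the Primary has 221535154..221535162,
# while the 1st Secondary only has 221535162.
primary = range(221535154, 221535163)
standby = [221535162]
print(missing_on_standby(primary, standby))
```

This would print the 8 missing audit_ev_id values, 221535154 through 221535161.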
The column when_created records when the row was inserted into the table, so,
as evidence that the recovery process was still running fine, we can run:

select max(when_created) from audit_evs_all;
              max
-------------------------------
 2020-12-01 07:38:18.258866+00
(1 row)

We did not find any ERRORs in the Postgres logs, either on the Primary or on
the 1st Secondary, around 2020-12-01 00:00:20.955348+00 (as we can see, all 9
records were inserted at the same time as a bulk insert).

What was running on the 1st Secondary around that time: we had pg_dump running
there between 12:00 a.m. and 1 a.m., so we know Postgres suspended the
recovery process while pg_dump was working, which certainly caused some lag
between the Primary and the 1st Secondary.

When the problem was found, the 1st Secondary instance was restarted; we hoped
Postgres might detect the issue and apply any missed WAL files, but nothing
happened and it continued applying the latest changes from the Primary.

Recovery config from the 1st Secondary (host IP replaced with xx.xx.xx.xx):

# recovery.conf
primary_conninfo = 'user=replication passfile=/var/lib/pgsql/pgpass host=xx.xx.xx.xx port=5432 sslmode=prefer application_name=postgres-prd-useast1-naprd8-1'
primary_slot_name = 'postgres_prd_useast1_naprd8_1'
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''

Please let me know if you need anything else from our end; we have a cold
backup of PGDATA from the 1st Secondary as well as Postgres logs from the
Primary and the Secondary.

PS: This is the second time we have seen this issue in our environment (about
30 PG clusters) within the last 3 weeks, but on a different cluster this time.
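To quantify the kind of replay lag described above, one can compare pg_current_wal_lsn() on the primary with pg_last_wal_replay_lsn() on the standby; PostgreSQL's pg_wal_lsn_diff() does this server-side, but the byte difference can also be computed client-side from the textual LSNs. A sketch, assuming the two LSN strings have already been fetched (the LSN values below are hypothetical, for illustration only):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte offset.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both in hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replay_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay to catch up."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)

# Hypothetical LSNs: primary's pg_current_wal_lsn() vs. the standby's
# pg_last_wal_replay_lsn().
print(replay_lag_bytes("16/B374D848", "16/B3600000"))
```

A persistently zero (or small, steadily shrinking) lag during a window when rows are nevertheless missing would support the suspicion that the records were skipped rather than merely delayed.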