BUG #16760: Standby database missed records for at least 1 table

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16760
Logged by:          Andriy Bartash
Email address:      abartash@xmatters.com
PostgreSQL version: 12.3
Operating system:   CentOS7
Description:

We run postgres in Google Cloud with HA configured as follows:
Primary instance runs in the US-EAST1 region
1st Secondary in US-EAST1
2nd Secondary in US-CENTRAL1
Recently we discovered that the 1st Secondary instance missed at least 8
records in one table. At the same time, the recovery process was working fine
and continued recovering. The 2nd Secondary had those records, though.
What we missed:
8 records in the audit_evs_all table, audit_ev_id from 221535154 to 221535161,
where audit_ev_id is the primary key of audit_evs_all.
---------------------------------------------------------------
Below is the output from the Primary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
The same query's output from the 1st Secondary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)
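For clarity, the gap is exactly audit_ev_id 221535154 through 221535161. A quick sketch of the comparison (row sets copied from the two outputs above):

```python
# audit_ev_id values returned by the range query on each node,
# copied from the two psql outputs above.
primary_ids = set(range(221535154, 221535163))  # 9 rows on the Primary
standby_ids = {221535162}                       # 1 row on the 1st Secondary

# Rows present on the Primary but absent on the 1st Secondary.
missing = sorted(primary_ids - standby_ids)
print(missing)  # 221535154 .. 221535161, i.e. exactly 8 rows
```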

When it was discovered: roughly 7 hours later.
Column when_created records when each row was inserted into the table, so as
proof that the recovery process was still running fine we can run:
select max(when_created) from audit_evs_all;
              max
-------------------------------
 2020-12-01 07:38:18.258866+00
(1 row)

We didn't find any ERRORs in the postgres logs on the Primary or on the 1st
Secondary around 2020-12-01 00:00:20.955348+00 (as we can see, all 9 records
were inserted at the same time as a bulk insert).
What was running on the 1st Secondary around that time: we had pg_dump running
there between 12:00 a.m. and 1 a.m., so we know that postgres suspended the
recovery process while pg_dump was working, and it definitely caused some lag
between the Primary and the 1st Secondary.
When the problem was found, the 1st Secondary instance was restarted; we hoped
that postgres might identify the issue and apply any missed WAL files, but
nothing happened and it continued applying the latest changes from the
Primary.
Recovery config from the 1st Secondary (host IP replaced with xx.xx.xx.xx):
# recovery.conf
primary_conninfo = 'user=replication passfile=/var/lib/pgsql/pgpass
host=xx.xx.xx.xx port=5432 sslmode=prefer
application_name=postgres-prd-useast1-naprd8-1'
primary_slot_name = 'postgres_prd_useast1_naprd8_1'
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''

Please let me know if you need anything else from our end; we have a cold
backup of PGDATA from the 1st Secondary, as well as Postgres logs from the
Primary and Secondary.

PS: This is the second time we have seen this issue in our environment (about
30 PG clusters) within the last 3 weeks, but on a different cluster this time.


Re: BUG #16760: Standby database missed records for at least 1 table

From
Euler Taveira
Date:
On Wed, 2 Dec 2020 at 17:44, PG Bug reporting form <noreply@postgresql.org> wrote:
Below output from Primary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
Same query's output from 1st Secondary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

I would start this investigation by disabling index usage for this query.

-- check which index has been used
explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
-- disable index usage
begin;
set local enable_indexscan to 0;
set local enable_indexonlyscan to 0;
explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
commit;

If the second EXPLAIN isn't using the index shown in the first EXPLAIN and the
SELECT returns all rows, reindex the index that has been used. If the index is
not the culprit, you should inspect the page(s) that contain those rows using
pageinspect.


--
Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16760: Standby database missed records for at least 1 table

From
Andriy Bartash
Date:

Thank you Euler

Here’s what I have:

  1. Yes, an index was used:

nex=# explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using audit_evs_all_pk on audit_evs_all  (cost=0.29..3.31 rows=1 width=16) (actual time=0.070..0.071 rows=1 loops=1)
   Index Cond: ((audit_ev_id >= 221535154) AND (audit_ev_id <= 221535162))
 Planning Time: 0.352 ms
 Execution Time: 0.113 ms
(4 rows)

  2. Disabled index scans (a Bitmap Index Scan was used anyway) and only 1 record was returned.

nex=# begin;
BEGIN
nex=# set local enable_indexscan to 0;
SET
nex=# set local enable_indexonlyscan to 0;
SET
nex=# explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on audit_evs_all  (cost=1.80..3.32 rows=1 width=16) (actual time=0.115..0.116 rows=1 loops=1)
   Recheck Cond: ((audit_ev_id >= 221535154) AND (audit_ev_id <= 221535162))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on audit_evs_all_pk  (cost=0.00..1.80 rows=1 width=0) (actual time=0.015..0.016 rows=9 loops=1)
         Index Cond: ((audit_ev_id >= 221535154) AND (audit_ev_id <= 221535162))
 Planning Time: 0.295 ms
 Execution Time: 0.146 ms
(7 rows)

nex=# select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

nex=# commit;

  3. Tried to reindex the table and select again (it didn't help; the same 1 record was returned).

nex=# reindex table audit_evs_all;
REINDEX
nex=# select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

  4. Dropped the primary key constraint with its index and tried again (a seq scan was used and 1 record was returned again).

nex=# ALTER TABLE public.audit_evs_all DROP CONSTRAINT audit_evs_all_pk;
ALTER TABLE
nex=# explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on audit_evs_all  (cost=0.00..5311.15 rows=1 width=16) (actual time=10.680..19.632 rows=1 loops=1)
   Filter: ((audit_ev_id >= 221535154) AND (audit_ev_id <= 221535162))
   Rows Removed by Filter: 86276
 Planning Time: 0.376 ms
 Execution Time: 19.667 ms
(5 rows)

nex=# select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
nex-# between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

  5. Now I am looking into pageinspect. I will update you immediately.

 

Much appreciate your help.

Re: BUG #16760: Standby database missed records for at least 1 table

From
Andriy Bartash
Date:

Hi Euler

I inspected the pages where I expected my data to be on the broken standby.

What was done:

  1. Identified the page where my data was (on the primary cluster)

select ctid,audit_ev_id from audit_evs_all  where audit_ev_id between 221535154 and 221535162;

and it returned

ctid,audit_ev_id
"(1979,3)",221535154
"(1980,13)",221535155
"(1980,14)",221535156
"(1980,28)",221535157
"(1980,29)",221535158
"(1982,6)",221535159
"(1982,7)",221535160
"(1982,32)",221535161
"(1982,33)",221535162

Found that page #1982 had 4 records on the primary (audit_ev_id 221535159, 221535160, 221535161 and 221535162).
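The grouping of rows into heap pages can be double-checked by parsing the ctid text; a minimal sketch (ctid values copied from the output above):

```python
# (ctid, audit_ev_id) pairs copied from the query output above.
rows = [
    ("(1979,3)", 221535154), ("(1980,13)", 221535155), ("(1980,14)", 221535156),
    ("(1980,28)", 221535157), ("(1980,29)", 221535158), ("(1982,6)", 221535159),
    ("(1982,7)", 221535160), ("(1982,32)", 221535161), ("(1982,33)", 221535162),
]

def page_of(ctid):
    # A ctid prints as "(page,offset)"; the first field is the heap page number.
    return int(ctid.strip("()").split(",")[0])

by_page = {}
for ctid, ev_id in rows:
    by_page.setdefault(page_of(ctid), []).append(ev_id)

print({page: len(ids) for page, ids in by_page.items()})  # {1979: 1, 1980: 4, 1982: 4}
```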

 

  2. The same query on the broken secondary showed that page #1982 had only 1 record

(1982,33) |   221535162
(1 row)

  3. Managed to restore the postgres cluster from a physical backup as of 24 hours after the secondary became corrupted, and got all records from page #1982

 

SELECT when_created,(ctid::text::point)[0]::bigint AS page_number, audit_ev_id from audit_evs_all  where (ctid::text::point)[0]::bigint = 1982 and when_created <'2020-12-02' order by 1 ;
         when_created          | page_number | audit_ev_id
-------------------------------+-------------+-------------
 2020-11-04 05:39:40.524816+00 |        1982 |   218065111
 2020-11-04 05:39:41.551833+00 |        1982 |   218065112
 2020-11-09 08:57:29.458528+00 |        1982 |   218711001
 2020-11-09 08:57:29.458528+00 |        1982 |   218711000
 2020-11-15 00:40:10.2285+00   |        1982 |   219765000
 2020-11-15 00:40:10.2285+00   |        1982 |   219765001
 2020-11-15 00:40:10.2285+00   |        1982 |   219765003
 2020-11-15 00:40:10.2285+00   |        1982 |   219765004
 2020-11-15 00:40:10.2285+00   |        1982 |   219765005
 2020-11-15 00:40:10.2285+00   |        1982 |   219765006
 2020-11-15 00:40:10.2285+00   |        1982 |   219765007
 2020-11-15 00:40:10.2285+00   |        1982 |   219765008
 2020-11-15 00:40:10.2285+00   |        1982 |   219765009
 2020-11-15 00:40:10.2285+00   |        1982 |   219765010
 2020-11-15 00:40:10.2285+00   |        1982 |   219765011
 2020-11-15 00:40:10.2285+00   |        1982 |   219765002
 2020-11-19 23:16:08.365151+00 |        1982 |   220198480
 2020-11-19 23:16:08.365151+00 |        1982 |   220198479
 2020-11-27 00:33:36.568531+00 |        1982 |   220990024
 2020-11-27 00:33:36.568531+00 |        1982 |   220990023
 2020-11-27 00:33:36.568531+00 |        1982 |   220990025
 2020-12-01 00:00:20.955348+00 |        1982 |   221535161
 2020-12-01 00:00:20.955348+00 |        1982 |   221535162
 2020-12-01 00:00:20.955348+00 |        1982 |   221535160
 2020-12-01 00:00:20.955348+00 |        1982 |   221535159
(25 rows)

  4. The same query on the broken secondary returns 22 records (missing the same records, audit_ev_id 221535159, 221535160 and 221535161)

         when_created          | page_number | audit_ev_id
-------------------------------+-------------+-------------
 2020-11-04 05:39:40.524816+00 |        1982 |   218065111
 2020-11-04 05:39:41.551833+00 |        1982 |   218065112
 2020-11-09 08:57:29.458528+00 |        1982 |   218711001
 2020-11-09 08:57:29.458528+00 |        1982 |   218711000
 2020-11-15 00:40:10.2285+00   |        1982 |   219765000
 2020-11-15 00:40:10.2285+00   |        1982 |   219765001
 2020-11-15 00:40:10.2285+00   |        1982 |   219765006
 2020-11-15 00:40:10.2285+00   |        1982 |   219765007
 2020-11-15 00:40:10.2285+00   |        1982 |   219765008
 2020-11-15 00:40:10.2285+00   |        1982 |   219765009
 2020-11-15 00:40:10.2285+00   |        1982 |   219765010
 2020-11-15 00:40:10.2285+00   |        1982 |   219765011
 2020-11-15 00:40:10.2285+00   |        1982 |   219765002
 2020-11-15 00:40:10.2285+00   |        1982 |   219765003
 2020-11-15 00:40:10.2285+00   |        1982 |   219765004
 2020-11-15 00:40:10.2285+00   |        1982 |   219765005
 2020-11-19 23:16:08.365151+00 |        1982 |   220198479
 2020-11-19 23:16:08.365151+00 |        1982 |   220198480
 2020-11-27 00:33:36.568531+00 |        1982 |   220990025
 2020-11-27 00:33:36.568531+00 |        1982 |   220990024
 2020-11-27 00:33:36.568531+00 |        1982 |   220990023
 2020-12-01 00:00:20.955348+00 |        1982 |   221535162
(22 rows)

Conclusion: the secondary postgres cluster is corrupted; 3 records are missing.
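Diffing the two page-1982 listings mechanically confirms which rows vanished; a small sketch (id sets transcribed from the 25-row and 22-row listings above):

```python
# audit_ev_id values on page 1982 in the restored backup (25 rows),
# transcribed from the listing above.
backup_ids = {218065111, 218065112, 218711000, 218711001,
              *range(219765000, 219765012),          # 219765000 .. 219765011
              220198479, 220198480,
              220990023, 220990024, 220990025,
              221535159, 221535160, 221535161, 221535162}

# The broken secondary's listing (22 rows) lacks exactly three of them.
secondary_ids = backup_ids - {221535159, 221535160, 221535161}

assert len(backup_ids) == 25 and len(secondary_ids) == 22
print(sorted(backup_ids - secondary_ids))  # [221535159, 221535160, 221535161]
```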

 

Does it make sense?

Re: BUG #16760: Standby database missed records for at least 1 table

From
Euler Taveira
Date:
On Tue, 8 Dec 2020 at 21:03, Andriy Bartash <abartash@xmatters.com> wrote:

Hi Euler

I inspected the pages where I expected my data to be on the broken standby.

What was done:

  1. Identified the page where my data was (on the primary cluster)

select ctid,audit_ev_id from audit_evs_all  where audit_ev_id between 221535154 and 221535162;

and it returned

ctid,audit_ev_id
"(1979,3)",221535154
"(1980,13)",221535155
"(1980,14)",221535156
"(1980,28)",221535157
"(1980,29)",221535158
"(1982,6)",221535159
"(1982,7)",221535160
"(1982,32)",221535161
"(1982,33)",221535162

Found that page #1982 had 4 records on the primary (audit_ev_id 221535159, 221535160, 221535161 and 221535162).

If you already have the broken secondary, share the following query results from both servers (requires the pageinspect extension):

SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, to_hex(t_infomask) AS infomask, to_hex(t_infomask2) AS infomask2, t_hoff, t_bits FROM heap_page_items(get_raw_page('audit_evs_all', 1982)) WHERE t_ctid IN ('(1982,6)', '(1982,7)', '(1982,32)');


--
Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16760: Standby database missed records for at least 1 table

From
Andriy Bartash
Date:

Here's the output you requested:

 lp | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 |  t_ctid   | infomask | infomask2 | t_hoff |                              t_bits
----+--------+----------+--------+-----------+--------+----------+-----------+----------+-----------+--------+------------------------------------------------------------------
  6 |   2616 |        1 |    227 | 388763235 |      0 |       80 | (1982,6)  | a03      | 3e        |     32 | 1111100100110000000000000000001000000101010000010101011111000100
  7 |   2304 |        1 |    307 | 388763235 |      0 |        0 | (1982,7)  | a03      | 3e        |     32 | 1111110100110000000000001000001000000101011000010101011111000100
 32 |   2072 |        1 |    227 | 388763235 |      0 |        0 | (1982,32) | a03      | 3e        |     32 | 1111100100110000000000000000001000000101010000010101011111000100
(3 rows)
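For reference, the infomask value a03 reported for all three tuples can be decoded against the heap tuple header flag bits. The flag values below are copied from PostgreSQL's src/include/access/htup_details.h (only a subset is listed); the decoding script itself is an editorial sketch, not part of the thread:

```python
# Heap tuple t_infomask flag bits, from src/include/access/htup_details.h
# (subset; see the header for the full list).
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
}

def decode_infomask(infomask):
    """Return the names of the (known) flag bits set in infomask."""
    return [name for bit, name in sorted(INFOMASK_FLAGS.items()) if infomask & bit]

# 'a03' is the infomask shown for lp 6, 7 and 32 above.
print(decode_infomask(0xA03))
# ['HEAP_HASNULL', 'HEAP_HASVARWIDTH', 'HEAP_XMIN_INVALID', 'HEAP_XMAX_INVALID']
```

The decode itself is mechanical; what those bits mean for visibility on the server this output came from is for the thread to establish.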

 
