Re: BUG #16760: Standby database missed records for at least 1 table - Mailing list pgsql-bugs

From Andriy Bartash
Subject Re: BUG #16760: Standby database missed records for at least 1 table
Date
Msg-id SA2PR02MB7547D9B3F15931FCDBDAFCBABBCD0@SA2PR02MB7547.namprd02.prod.outlook.com
Whole thread Raw
In response to Re: BUG #16760: Standby database missed records for at least 1 table  (Euler Taveira <euler.taveira@2ndquadrant.com>)
Responses Re: BUG #16760: Standby database missed records for at least 1 table
List pgsql-bugs

Hi Euler

I inspected pages where I expected to have my data on broken standby.

What was done:

  1. Identified the page where was my data (on primary cluster)

select ctid,audit_ev_id from audit_evs_all  where audit_ev_id between 221535154 and 221535162;

and it returned

ctid,audit_ev_id

"(1979,3)",221535154

"(1980,13)",221535155

"(1980,14)",221535156

"(1980,28)",221535157

"(1980,29)",221535158

"(1982,6)",221535159

"(1982,7)",221535160

"(1982,32)",221535161

"(1982,33)",221535162

Found that page # 1982 had 4 records on primary (ev_id 221535159, 221535160, 221535161 and 221535162)

 

  1. Same query on Broken secondary showed that page # 1982 had only 1 record

(1982,33) |   221535162

(1 row)

  1. Managed to restore postgres cluster from physical backup as of 24hr after secondary became corrupted and got all records from page # 1982

 

SELECT when_created,(ctid::text::point)[0]::bigint AS page_number, audit_ev_id from audit_evs_all  where (ctid::text::point)[0]::bigint = 1982 and when_created <'2020-12-02' order by 1 ;

         when_created          | page_number | audit_ev_id

-------------------------------+-------------+-------------

2020-11-04 05:39:40.524816+00 |        1982 |   218065111

2020-11-04 05:39:41.551833+00 |        1982 |   218065112

2020-11-09 08:57:29.458528+00 |        1982 |   218711001

2020-11-09 08:57:29.458528+00 |        1982 |   218711000

2020-11-15 00:40:10.2285+00   |        1982 |   219765000

2020-11-15 00:40:10.2285+00   |        1982 |   219765001

2020-11-15 00:40:10.2285+00   |        1982 |   219765003

2020-11-15 00:40:10.2285+00   |        1982 |   219765004

2020-11-15 00:40:10.2285+00   |        1982 |   219765005

2020-11-15 00:40:10.2285+00   |        1982 |   219765006

2020-11-15 00:40:10.2285+00   |        1982 |   219765007

2020-11-15 00:40:10.2285+00   |        1982 |   219765008

2020-11-15 00:40:10.2285+00   |        1982 |   219765009

2020-11-15 00:40:10.2285+00   |        1982 |   219765010

2020-11-15 00:40:10.2285+00   |        1982 |   219765011

2020-11-15 00:40:10.2285+00   |        1982 |   219765002

2020-11-19 23:16:08.365151+00 |        1982 |   220198480

2020-11-19 23:16:08.365151+00 |        1982 |   220198479

2020-11-27 00:33:36.568531+00 |        1982 |   220990024

2020-11-27 00:33:36.568531+00 |        1982 |   220990023

2020-11-27 00:33:36.568531+00 |        1982 |   220990025

2020-12-01 00:00:20.955348+00 |        1982 |   221535161

2020-12-01 00:00:20.955348+00 |        1982 |   221535162

2020-12-01 00:00:20.955348+00 |        1982 |   221535160

2020-12-01 00:00:20.955348+00 |        1982 |   221535159

(25 rows)

  1. Same query on broken secondary returns 22 records (same missed records with ev_id 221535159, 221535160, 221535161)

 

     when_created          | page_number | audit_ev_id

-------------------------------+-------------+-------------

2020-11-04 05:39:40.524816+00 |        1982 |   218065111

2020-11-04 05:39:41.551833+00 |        1982 |   218065112

2020-11-09 08:57:29.458528+00 |        1982 |   218711001

2020-11-09 08:57:29.458528+00 |        1982 |   218711000

2020-11-15 00:40:10.2285+00   |        1982 |   219765000

2020-11-15 00:40:10.2285+00   |        1982 |   219765001

2020-11-15 00:40:10.2285+00   |        1982 |   219765006

2020-11-15 00:40:10.2285+00   |        1982 |   219765007

2020-11-15 00:40:10.2285+00   |        1982 |   219765008

2020-11-15 00:40:10.2285+00   |        1982 |   219765009

2020-11-15 00:40:10.2285+00   |        1982 |   219765010

2020-11-15 00:40:10.2285+00   |        1982 |   219765011

2020-11-15 00:40:10.2285+00   |        1982 |   219765002

2020-11-15 00:40:10.2285+00   |        1982 |   219765003

2020-11-15 00:40:10.2285+00   |        1982 |   219765004

2020-11-15 00:40:10.2285+00   |        1982 |   219765005

2020-11-19 23:16:08.365151+00 |        1982 |   220198479

2020-11-19 23:16:08.365151+00 |        1982 |   220198480

2020-11-27 00:33:36.568531+00 |        1982 |   220990025

2020-11-27 00:33:36.568531+00 |        1982 |   220990024

2020-11-27 00:33:36.568531+00 |        1982 |   220990023

2020-12-01 00:00:20.955348+00 |        1982 |   221535162

(22 rows)

 

Conclusion: Secondary postgres cluster is corrupted, 3 records missed.

 

Does it make sense?

 

 

 

 

 

From: Euler Taveira <euler.taveira@2ndquadrant.com>
Date: Thursday, December 3, 2020 at 10:20 AM
To: Andriy Bartash <abartash@xmatters.com>, pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Subject: Re: BUG #16760: Standby database missed records for at least 1 table

On Wed, 2 Dec 2020 at 17:44, PG Bug reporting form <noreply@postgresql.org> wrote:

Below output from Primary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535154 | 2020-12-01 00:00:20.955348+00
   221535155 | 2020-12-01 00:00:20.955348+00
   221535156 | 2020-12-01 00:00:20.955348+00
   221535157 | 2020-12-01 00:00:20.955348+00
   221535158 | 2020-12-01 00:00:20.955348+00
   221535159 | 2020-12-01 00:00:20.955348+00
   221535160 | 2020-12-01 00:00:20.955348+00
   221535161 | 2020-12-01 00:00:20.955348+00
   221535162 | 2020-12-01 00:00:20.955348+00
(9 rows)
---------------------------------------------------------------
Same query's output from 1st Secondary
select audit_ev_id, when_created from audit_evs_all  where audit_ev_id
between 221535154 and 221535162;
 audit_ev_id |         when_created
-------------+-------------------------------
   221535162 | 2020-12-01 00:00:20.955348+00
(1 row)

I would start this investigation, disabling the index usage for this query.

 

-- check which index has been used

explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id

between 221535154 and 221535162;

-- disable index usage

begin;

set local enable_indexscan to 0;

set local enable_indexonlyscan to 0;

explain (analyze) select audit_ev_id, when_created from audit_evs_all  where audit_ev_id

between 221535154 and 221535162;

select audit_ev_id, when_created from audit_evs_all  where audit_ev_id

between 221535154 and 221535162;

commit;

 

If the second EXPLAIN isn't using the index shown in the first EXPLAIN and the
SELECT returns all rows, reindex the index that has been used. If the index is
not the culprit, you should inspect the page(s) that contain those rows using
pageinspect.

 

 

--

Euler Taveira                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-bugs by date:

Previous
From: Matthias van de Meent
Date:
Subject: Re: BUG #16767: Silent dropping of CONSTRAINT... UNIQUE
Next
From: PG Bug reporting form
Date:
Subject: BUG #16768: PostgreSQL 12.5 - INSERT ON CONFLICT succeeds on data INSERT but does nothing on UPDATE