Home > mailing lists

12.3 replicas falling over during WAL redo - Mailing list pgsql-general

From	Ben Chobot
Subject	12.3 replicas falling over during WAL redo
Date	August 1, 2020 17:55:27
Msg-id	8a8d6728-22ad-61ec-09f9-51f29b9a2f61@silentmedia.com Whole thread Raw
Responses	Re: 12.3 replicas falling over during WAL redo (Alvaro Herrera <alvherre@2ndquadrant.com>)
List	pgsql-general

Tree view

We have a few hundred postgres servers in AWS EC2, all of which do streaming replication to at least two replicas. As we've transitioned our fleet to from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a streaming replica dying during replay. Postgres will log something like:


2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG:  restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL:  incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT:  WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:  startup process (PID 24076) exited with exit code 1

...or:


2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING:  will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC:  failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0

...or:

2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted

Each time, a simple restart of the postgres service will bring the database back to a happy state and it will merrily replicate past the LSN where it had died before. This has never (yet) happened on a primary db, and (so far) always on only one one of the replicas the primary is replicating to, leading me to think there isn't anything actually wrong with my data. Still, this is no way to run a database. We never saw this problem in 9.5, but it happened 3 times just on Friday. We have taken the opportunity to enable checksuming with our move to 12, but this doesn't appear to be related to that, to my untrained eyes.

A lot of suggestions I've heard to fix problems which sound like this involve reindexing, and while we haven't yet tried that, it doesn't seem likely to help. We upgraded from 9.5 to 12 using pglogical, so all our indices were created from 12.3 code. That said, we were running pg_repack in an automated fashion for a bit, which seemed to be causing issues in 12 that we haven't had time to track down and so have currently disabled.

With the non-durability of the problem, I'm out of ideas of what to look for. Suggestions?

pgsql-general by date:

From: Zack Weinberg
Date: 31 July 2020, 18:07:39
Subject: Apparent missed query optimization with self-join and inner grouping

From: Alvaro Herrera
Date: 01 August 2020, 19:35:51
Subject: Re: 12.3 replicas falling over during WAL redo

12.3 replicas falling over during WAL redo - Mailing list pgsql-general

Previous

Next