12.3 replicas falling over during WAL redo - Mailing list pgsql-general

From Ben Chobot
Subject 12.3 replicas falling over during WAL redo
Date
Msg-id 8a8d6728-22ad-61ec-09f9-51f29b9a2f61@silentmedia.com
Whole thread Raw
Responses Re: 12.3 replicas falling over during WAL redo  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-general
We have a few hundred postgres servers in AWS EC2, all of which do streaming replication to at least two replicas. As we've transitioned our fleet to from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a streaming replica dying during replay. Postgres will log something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG:  restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL:  incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT:  WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG:  startup process (PID 24076) exited with exit code 1

...or:

2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING:  will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC:  failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT:  WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0

...or:
2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted

Each time, a simple restart of the postgres service will bring the database back to a happy state and it will merrily replicate past the LSN where it had died before. This has never (yet) happened on a primary db, and (so far) always on only one one of the replicas the primary is replicating to, leading me to think there isn't anything actually wrong with my data. Still, this is no way to run a database. We never saw this problem in 9.5, but it happened 3 times just on Friday. We have taken the opportunity to enable checksuming with our move to 12, but this doesn't appear to be related to that, to my untrained eyes.

A lot of suggestions I've heard to fix problems which sound like this involve reindexing, and while we haven't yet tried that, it doesn't seem likely to help. We upgraded from 9.5 to 12 using pglogical, so all our indices were created from 12.3 code. That said, we were running pg_repack in an automated fashion for a bit, which seemed to be causing issues in 12 that we haven't had time to track down and so have currently disabled.

With the non-durability of the problem, I'm out of ideas of what to look for. Suggestions?

pgsql-general by date:

Previous
From: Zack Weinberg
Date:
Subject: Apparent missed query optimization with self-join and inner grouping
Next
From: Alvaro Herrera
Date:
Subject: Re: 12.3 replicas falling over during WAL redo