Re: 12.3 replicas falling over during WAL redo - Mailing list pgsql-general

From Ben Chobot
Subject Re: 12.3 replicas falling over during WAL redo
Date
Msg-id b64cb215-e03a-9e07-e916-5b216b0c5b1b@silentmedia.com
Whole thread Raw
In response to Re: 12.3 replicas falling over during WAL redo  (Ben Chobot <bench@silentmedia.com>)
Responses Re: 12.3 replicas falling over during WAL redo  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: 12.3 replicas falling over during WAL redo  (Ben Chobot <bench@silentmedia.com>)
List pgsql-general
Ben Chobot wrote on 8/1/20 9:58 AM:
Alvaro Herrera wrote on 8/1/20 9:35 AM:
On 2020-Aug-01, Ben Chobot wrote:

Can you find out what the index is being modified by those LSNs -- is it
always the same index?  Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1] db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1] db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2] db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user= LOG:  startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user= LOG:  terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn: A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0: rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn: A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn: A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0: rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn: A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn: A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn: A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/  2529, tx:   76396250, lsn: A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0: rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at the end is due to the db shutting down, and nothing to be particularly worried about?

I have the specific index that postgres tripped on here and I'll spend some time today trying to see if the other crashes were the same index.

pgsql-general by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: 12.3 replicas falling over during WAL redo
Next
From: Alvaro Herrera
Date:
Subject: Re: 12.3 replicas falling over during WAL redo