Home > mailing lists

Re: 12.3 replicas falling over during WAL redo - Mailing list pgsql-general

From	Ben Chobot
Subject	Re: 12.3 replicas falling over during WAL redo
Date	August 3, 2020 22:11:05
Msg-id	b64cb215-e03a-9e07-e916-5b216b0c5b1b@silentmedia.com Whole thread Raw
In response to	Re: 12.3 replicas falling over during WAL redo (Ben Chobot <bench@silentmedia.com>)
Responses	Re: 12.3 replicas falling over during WAL redo (Alvaro Herrera <alvherre@2ndquadrant.com>) Re: 12.3 replicas falling over during WAL redo (Ben Chobot <bench@silentmedia.com>)
List	pgsql-general

Tree view

Ben Chobot wrote on 8/1/20 9:58 AM:

Alvaro Herrera wrote on 8/1/20 9:35 AM:
On 2020-Aug-01, Ben Chobot wrote:

Can you find out what the index is being modified by those LSNs -- is it
always the same index? Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user= LOG: terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn: A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0: rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn: A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn: A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0: rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn: A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn: A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn: A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/ 2529, tx:   76396250, lsn: A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0: rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at the end is due to the db shutting down, and nothing to be particularly worried about?

I have the specific index that postgres tripped on here and I'll spend some time today trying to see if the other crashes were the same index.

pgsql-general by date:

From: Peter Geoghegan
Date: 03 August 2020, 21:25:08
Subject: Re: 12.3 replicas falling over during WAL redo

From: Alvaro Herrera
Date: 03 August 2020, 22:34:04
Subject: Re: 12.3 replicas falling over during WAL redo

Re: 12.3 replicas falling over during WAL redo - Mailing list pgsql-general

Previous

Next