Thread: 12.3 replicas falling over during WAL redo
We have a few hundred postgres servers in AWS EC2, all of which do streaming replication to at least two replicas. As we've transitioned our fleet from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a streaming replica dying during replay. Postgres will log something like:
2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1
...or:
2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING: will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC: failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
...or:
2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted
Each time, a simple restart of the postgres service will bring the database back to a happy state and it will merrily replicate past the LSN where it had died before. This has never (yet) happened on a primary db, and (so far) always on only one of the replicas the primary is replicating to, leading me to think there isn't anything actually wrong with my data. Still, this is no way to run a database. We never saw this problem in 9.5, but it happened 3 times just on Friday. We took the opportunity to enable checksumming with our move to 12, but to my untrained eye this doesn't appear to be related to that.
A lot of suggestions I've heard for problems that sound like this involve reindexing, and while we haven't yet tried that, it doesn't seem likely to help. We upgraded from 9.5 to 12 using pglogical, so all our indices were created by 12.3 code. That said, we were running pg_repack in an automated fashion for a while; it seemed to be causing issues in 12 that we haven't had time to track down, so we have disabled it for now.
Given how transient the problem is, I'm out of ideas of what to look for. Suggestions?
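Since checksums are now enabled on these clusters, one cheap way to rule out plain on-disk corruption is an offline verification pass with pg_checksums (shipped with PostgreSQL 12). This is only a sketch; the paths are assumptions based on the Ubuntu layout mentioned later in the thread, and the tool requires a cleanly shut down cluster, so it has to run after a clean shutdown rather than against the just-crashed state:

# verify every checksummed page in the cluster; the server must be cleanly shut down first
/usr/lib/postgresql/12/bin/pg_checksums --check -D /var/lib/postgresql/12/main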
On 2020-Aug-01, Ben Chobot wrote:

> We have a few hundred postgres servers in AWS EC2, all of which do streaming
> replication to at least two replicas. As we've transitioned our fleet from
> 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
> streaming replica dying during replay. Postgres will log something like:
>
> 2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
> 2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
> 2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
> 2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1

I've never seen this one.

Can you find out which index is being modified by those LSNs -- is it always the same index? Can you have a look at nearby WAL records that touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

-- Álvaro Herrera
Alvaro Herrera wrote on 8/1/20 9:35 AM:
> Can you find out which index is being modified by those LSNs -- is it
> always the same index? Can you have a look at nearby WAL records that
> touch the same page of the same index in each case?
>
> One possibility is that the storage forgot a previous write.

I'd be happy to, if you tell me how. :)

We're using xfs for our postgres filesystem, on ubuntu bionic. Of course it's always possible there's something wrong in the filesystem or the EBS layer, but that is one thing we have not changed in the migration from 9.5 to 12.3.
At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <bench@silentmedia.com> wrote in:
> We're using xfs for our postgres filesystem, on ubuntu bionic. Of course
> it's always possible there's something wrong in the filesystem or the EBS
> layer, but that is one thing we have not changed in the migration from
> 9.5 to 12.3.

All of the cited log lines seem to point at deleted btree page items. One possibility I can guess at: this can happen if the pages were flushed out during a vacuum after the last checkpoint and full-page writes didn't restore the page to the state it had before the index-item deletion (that is, if full_page_writes were set to off). (If that turns out to be the cause, I'm not sure why it didn't happen on 9.5.)

regards.

-- Kyotaro Horiguchi
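The setting that matters here is full_page_writes on the primary, since full-page images are attached when the WAL is generated there. A minimal way to confirm it, as a sketch; the paths are assumptions based on the Ubuntu/PG 12 layout mentioned elsewhere in the thread:

# on the primary
psql -Atc 'SHOW full_page_writes;'
# or, with the server stopped, straight from the configuration files
grep -Rn 'full_page_writes' /etc/postgresql/12/main/ /var/lib/postgresql/12/main/postgresql.auto.conf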
Kyotaro Horiguchi wrote on 8/2/20 9:39 PM:
> One possibility I can guess at: this can happen if the pages were flushed
> out during a vacuum after the last checkpoint and full-page writes didn't
> restore the page to the state it had before the index-item deletion
> (that is, if full_page_writes were set to off).

We have always had full_page_writes enabled.
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> All of the cited log lines seem to point at deleted btree page items. [...]
> (If that turns out to be the cause, I'm not sure why it didn't happen on 9.5.)

There is also a Heap/HOT_UPDATE log line with similar errors.

-- Peter Geoghegan
Alvaro Herrera wrote on 8/1/20 9:35 AM:
> Can you find out which index is being modified by those LSNs -- is it
> always the same index? Can you have a look at nearby WAL records that
> touch the same page of the same index in each case?
>
> One possibility is that the storage forgot a previous write.
So here is me fumbling my way through this. Hopefully it's useful?
The error for this host was:
2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user= LOG: terminating any other active server processes
So, starting from a filesystem snapshot where the db had crashed and not yet been started, I did:
/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep A0A/AC4204A0
This shows me:
rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
So then I did:
/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree len (rec/tot): 53/ 813, tx: 76393363, lsn: A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0: rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree len (rec/tot): 72/ 72, tx: 76393390, lsn: A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree len (rec/tot): 72/ 72, tx: 76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree len (rec/tot): 72/ 72, tx: 76393992, lsn: A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0: rel 16605/16613/60529051 blk 5999
rmgr: Btree len (rec/tot): 72/ 72, tx: 76394938, lsn: A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree len (rec/tot): 72/ 72, tx: 76394977, lsn: A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
rmgr: Btree len (rec/tot): 72/ 72, tx: 76396075, lsn: A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0: rel 16605/16613/60529051 blk 6519
rmgr: Btree len (rec/tot): 53/ 2529, tx: 76396250, lsn: A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0: rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource manager ID 110 at A0A/AC5411E0
...and I have no idea what I'm looking at. I assume/hope the error at the end is due to the db shutting down, and nothing to be particularly worried about?
I have the specific index that postgres tripped on here, and I'll spend some time today trying to see whether the other crashes were on the same index.
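For what it's worth, mapping the rel field of that pg_waldump output to a relation name can be done from SQL. A minimal sketch, assuming a connection to the database whose OID is 16613 (the middle number of the triple) on the primary or a healthy replica; the database name below is a placeholder:

# rel 16605/16613/60529051 is tablespace OID / database OID / relfilenode
psql -Atc "SELECT datname FROM pg_database WHERE oid = 16613;"
psql -d <database_with_oid_16613> -Atc "SELECT relname, relkind FROM pg_class WHERE relfilenode = 60529051;"
# pg_filenode_relation() does the same lookup given the tablespace and filenode
psql -d <database_with_oid_16613> -Atc "SELECT pg_filenode_relation(16605, 60529051);"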
On 2020-Aug-03, Ben Chobot wrote:

> rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501
>
> So then I did:
>
> /usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB 0000000100000A0A000000AD | grep 16605/16613/60529051

Yep. Looking at the ones in block 6501,

> rmgr: Btree len (rec/tot): 72/ 72, tx: 76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
> rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the item in offset 41; that is, is the LSN in the crashed copy of the page equal to A0A/AB2C43D0? If it's an older value, then the write above was lost for some reason.

> pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource manager ID 110 at A0A/AC5411E0
>
> ...and I have no idea what I'm looking at. I assume/hope the error at the end is due to the db shutting down, and nothing to be particularly worried about?

Yeah.

-- Álvaro Herrera
Alvaro Herrera wrote on 8/3/20 12:34 PM:
> My question was whether the block has received the update that added the
> item in offset 41; that is, is the LSN in the crashed copy of the page
> equal to A0A/AB2C43D0? If it's an older value, then the write above was
> lost for some reason.

How do I tell?
Peter Geoghegan wrote on 8/3/20 11:25 AM:
> There is also a Heap/HOT_UPDATE log line with similar errors.

Yes, and I have the pg_waldump output for it. But that table is quite large, and the transaction containing the LSN in the error log is 1,752 waldump lines long. I'm happy to share whatever would be useful for debugging, but I'm guessing that should be a subset of those lines.
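If it helps narrow that down, the same pg_waldump-plus-grep approach used for the btree case should work here too. A sketch only: the segment names and the <spc>/<db>/<relfilenode> and <blk> placeholders are hypothetical and would come from the failing Heap/HOT_UPDATE record's blkref, not from anything posted so far:

# locate the failing record itself to learn which heap block it touches
/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ <STARTSEG> <ENDSEG> | grep 'AD1/6948E9E0'
# then pull only the records touching that block, rather than the whole 1,752-line transaction
/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ <STARTSEG> <ENDSEG> | grep '<spc>/<db>/<relfilenode>' | grep 'blk <blk>'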
Alvaro Herrera wrote on 8/1/20 9:35 AM:
> Can you find out which index is being modified by those LSNs -- is it
> always the same index? Can you have a look at nearby WAL records that
> touch the same page of the same index in each case?

They turn out to be different indices on different tables.
On 2020-Aug-03, Ben Chobot wrote:

> Alvaro Herrera wrote on 8/3/20 12:34 PM:
> > My question was whether the block has received the update that added
> > the item in offset 41; that is, is the LSN in the crashed copy of the
> > page equal to A0A/AB2C43D0? If it's an older value, then the write
> > above was lost for some reason.
>
> How do I tell?

You can use pageinspect's page_header() function to obtain the page's LSN. You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

then put that binary file in a bytea column, perhaps like

create table page (raw bytea);
insert into page select pg_read_binary_file('/tmp/page');

and with that you can run page_header:

create extension pageinspect;
select h.* from page, page_header(raw) h;

-- Álvaro Herrera
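Pulled together as a single runnable sketch (with the skip-vs-seek correction that comes up later in the thread already applied, and run as a superuser so pg_read_binary_file() may read the absolute /tmp path):

# read block 6501 (8 kB pages) of the index's data file on the crashed replica; run this
# from the directory holding the file (base/<dboid>/ for the default tablespace, or
# pg_tblspc/<spcoid>/PG_12_*/<dboid>/ otherwise). Note skip=, not seek=, so dd skips
# 6501 blocks of the *input* instead of padding the output file.
dd if=60529051 bs=8192 count=1 skip=6501 of=/tmp/page.6501
# then inspect the header from any running database that has pageinspect available
psql -c "CREATE EXTENSION IF NOT EXISTS pageinspect;"
psql -c "CREATE TABLE page (raw bytea);"
psql -c "INSERT INTO page SELECT pg_read_binary_file('/tmp/page.6501');"
psql -c "SELECT h.* FROM page, page_header(raw) h;"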
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> You can use pageinspect's page_header() function to obtain the page's
> LSN. You can use dd to obtain the page from the file,
>
> dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

Ben might find this approach to dumping out a single page image easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

-- Peter Geoghegan
Peter Geoghegan wrote on 8/3/20 3:04 PM:
> Ben might find this approach to dumping out a single page image easier,
> since it doesn't involve relfilenodes or filesystem files

I don't think that works, because it sounds like it requires me to connect to the db, which means I have to start it up. I am (wrongly?) assuming we should gather what we can while the db is still in its just-crashed state.
Still, I seem to have mangled Alvaro's suggestion somehow, because this doesn't look right:
# select h.* from public.page , page_header(raw) h;
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
0/0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
(1 row)
I'll keep trying though.
Alvaro Herrera wrote on 8/3/20 2:34 PM:
> dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

If I use skip instead of seek....
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
A0A/99BA11F8 | -215 | 0 | 180 | 7240 | 8176 | 8192 | 4 | 0
As I understand what we're looking at, this means the WAL stream was assuming this page was last touched by A0A/AB2C43D0, but the page itself thinks it was last touched by A0A/99BA11F8, which means at least one write to the page is missing?
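For anyone double-checking that ordering: page LSNs compare directly as the built-in pg_lsn type, so the comparison can be made explicit on any running Postgres; a trivial sketch:

psql -Atc "SELECT 'A0A/99BA11F8'::pg_lsn < 'A0A/AB2C43D0'::pg_lsn;"
# returns t: the crashed page's LSN predates the INSERT_LEAF at off 41 that it should already reflect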
On 2020-Aug-03, Ben Chobot wrote:

> Alvaro Herrera wrote on 8/3/20 2:34 PM:
> > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501
>
> If I use skip instead of seek....

Argh, yes, I did correct that in my test and forgot to copy and paste.

> lsn          | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> --------------+----------+-------+-------+-------+---------+----------+---------+-----------
> A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0
>
> As I understand what we're looking at, this means the WAL stream was
> assuming this page was last touched by A0A/AB2C43D0, but the page itself
> thinks it was last touched by A0A/99BA11F8, which means at least one write
> to the page is missing?

Yeah, that's exactly what we're seeing. Somehow an older page version was resurrected. Of course, this should never happen.

So my theory has been proved. What now?

-- Álvaro Herrera
On 2020-Aug-03, Alvaro Herrera wrote:

> > As I understand what we're looking at, this means the WAL stream was
> > assuming this page was last touched by A0A/AB2C43D0, but the page itself
> > thinks it was last touched by A0A/99BA11F8, which means at least one write
> > to the page is missing?
>
> Yeah, that's exactly what we're seeing. Somehow an older page version
> was resurrected. Of course, this should never happen.

... although the block should have been in shared buffers, and it is there that the previous WAL record would have been applied -- not necessarily flushed to disk.

-- Álvaro Herrera
At Tue, 4 Aug 2020 09:53:36 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in:
> ... although the block should have been in shared buffers, and it is
> there that the previous WAL record would have been applied -- not
> necessarily flushed to disk.

Yeah. On the other hand, the WAL records shown upthread don't carry an FPW for that block:

> rmgr: Btree len (rec/tot): 72/ 72, tx: 76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501
> rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501

There must be a record for page 6501 carrying an FPW somewhere after the last checkpoint. If no such record is found, something went wrong in deciding whether to attach the FPW.

regards.

-- Kyotaro Horiguchi
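One way to check for that on the crashed replica, as a sketch (<REDO_LSN> is a placeholder for whatever pg_controldata reports; the paths match the ones used earlier in the thread):

# the last restartpoint's REDO location is where replay would restart from
/usr/lib/postgresql/12/bin/pg_controldata /var/lib/postgresql/12/main | grep 'REDO location'
# scan WAL from there for records touching that block; the first one after the
# checkpoint should show "FPW" at the end of its blkref line
/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/ -s <REDO_LSN> | grep '16605/16613/60529051 blk 6501'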
Alvaro Herrera wrote on 8/3/20 4:54 PM:
> So my theory has been proved. What now?

Just to close the loop on this: we haven't seen the issue since we stopped expanding our filesystems by moving LVM extents between devices, so while I don't know exactly where the bug lies, I feel it's quite likely not in Postgres.