Re: [BUG] Panic due to incorrect missingContrecPtr after promotion - Mailing list pgsql-hackers

From Imseih (AWS), Sami
Subject Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date
Msg-id 6A726564-E016-4305-97AB-F0E825B3B53C@amazon.com
Whole thread Raw
In response to Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
List pgsql-hackers
> The server seem to have started as a standby after crashing a
> primary. Is it correct?

Yes, that is correct.

2022-08-05 17:18:51 UTC::@:[359]:LOG:  database system was interrupted; last known up at 2022-08-05 17:08:52 UTC
2022-08-05 17:18:51 UTC::@:[359]:LOG:  creating missing WAL directory "pg_wal/archive_status"
recovering 00000002.history
2022-08-05 17:18:51 UTC::@:[359]:LOG:  entering standby mode
2022-08-05 17:18:51 UTC::@:[359]:LOG:  ### [S] @0/48A3400: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  checkpoint record is at 0/48A3388
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  redo record is at 0/48A3388; shutdown true
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  next transaction ID: 533; next OID: 16395
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  next MultiXactId: 1; next MultiXactOffset: 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  oldest unfrozen transaction ID: 479, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  oldest MultiXactId: 1, in database 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  commit timestamp Xid oldest/newest: 0/0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  transaction ID wrap limit is 2147484126, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  starting up replication slots
2022-08-05 17:18:51 UTC::@:[359]:LOG:  database system was not properly shut down; automatic recovery in progress
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  resetting unlogged relations: cleanup 1 init 0
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  initializing for hot standby
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  recovery snapshots are now enabled
2022-08-05 17:18:51 UTC::@:[359]:LOG:  redo starts at 0/48A3400
2022-08-05 17:18:51 UTC::@:[359]:LOG:  ### [F] @0/10000000: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 17:18:51 UTC::@:[359]:DEBUG:  reached end of WAL in pg_wal, entering archive recovery
2022-08-05 17:18:51 UTC::@:[359]:LOG:  consistent recovery state reached at 0/10000000
2022-08-05 17:18:51 UTC::@:[357]:LOG:  database system is ready to accept read only connections


> Archive recovery ended here. The server should have promoted that
> time..  Do you see some interesting log lines around this time?

The server did promote around that time

2022-08-05 18:38:14 UTC::@:[359]:LOG:  ### [F] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=1, SbyModeReq=1
2022-08-05 18:38:14 UTC::@:[359]:LOG:  received promote request
2022-08-05 18:38:14 UTC::@:[359]:LOG:  redo done at 6/B6CB27A8
2022-08-05 18:38:14 UTC::@:[359]:LOG:  last completed transaction was at log time 2022-08-05 18:38:00.832047+00
recovering 00000001000000060000002D
....
........
2022-08-05 18:38:14 UTC::@:[359]:LOG:  ### [S] @6/B6CB27D0: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
recovering 00000002.history
2022-08-05 18:38:14 UTC::@:[359]:LOG:  selected new timeline ID: 2
2022-08-05 18:38:14 UTC::@:[359]:LOG:  archive recovery complete
recovering 00000001.history
2022-08-05 18:38:14 UTC::@:[357]:LOG:  database system is ready to accept connections
2022-08-05 18:38:15 UTC::@:[367]:LOG:  restartpoint complete: wrote 21388 buffers (2.1%); 0 WAL file(s) added, 9
removed,0 recycled; write=98.394 s, sync=0.041 s, total=98.586 s; sync files=46, longest=0.012 s, average=0.001 s;
distance=1048565kB, estimate=1048584 kB
 
2022-08-05 18:38:15 UTC::@:[367]:LOG:  recovery restart point at 6/5C0003B0

>    But, recovery continues in non-standby mode.  I don't see how come it
>    behaves that way.

But the server crashes sometime after which is why recovery starts.

022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on IPv6 address "::", port 5432
2022-08-05 18:50:13 UTC::@:[357]:LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-08-05 18:50:13 UTC::@:[359]:LOG:  database system was interrupted; last known up at 2022-08-05 18:38:15 UTC
2022-08-05 18:50:13 UTC::@:[359]:LOG:  creating missing WAL directory "pg_wal/archive_status"
recovering 00000003.history
2022-08-05 18:50:13 UTC::@:[359]:LOG:  entering standby mode
recovering 00000002.history
2022-08-05 18:50:13 UTC::@:[359]:LOG:  ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
2022-08-05 18:50:13 UTC::@:[359]:LOG:  database system was not properly shut down; automatic recovery in progress
2022-08-05 18:50:13 UTC::@:[359]:LOG:  redo starts at 6/B80000E8

And a few hours later, is when we see a panic

Thanks

--
Sami Imseih
Amazon Web Services (AWS)



pgsql-hackers by date:

Previous
From: "Jonathan S. Katz"
Date:
Subject: 2022-08-11 release announcement draft
Next
From: Greg Stark
Date:
Subject: Re: shared-memory based stats collector - v70