Thread: BUG #15346: Replica fails to start after the crash

BUG #15346: Replica fails to start after the crash

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15346
Logged by:          Alexander Kukushkin
Email address:      cyberdemn@gmail.com
PostgreSQL version: 9.6.10
Operating system:   Ubuntu
Description:

We run quite a lot of clusters, and from time to time EC2 instances
die, but so far we have never had the problem I describe here.
We first experienced it last week while upgrading to the next minor
versions: a few replicas failed to start with very similar symptoms,
and I simply reinitialized them because I was in a hurry.

But this morning one of the EC2 instances running a replica crashed.
After the crash, a new instance was spawned and the existing EBS volume
(with the existing PGDATA) was attached to it.

Postgres logs before the crash:
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2636,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint complete: wrote 82957 buffers (17.0%); 0
transaction log file(s) added, 3 removed, 7 recycled; write=269.887 s,
sync=0.012 s, total=269.926 s; sync files=105, longe
st=0.002 s, average=0.000 s; distance=168429 kB, estimate=173142
kB",,,,,,,,,""
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2637,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"recovery restart point at AB3/37BD3C60","last completed
transaction was at log time 2018-08-22 05:14:43.022752+00",,,,,,,,""
2018-08-22 05:15:13.065 UTC,,,23838,,5b78e9a0.5d1e,2638,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2639,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint complete: wrote 86588 buffers (17.7%); 0
transaction log file(s) added, 2 removed, 8 recycled; write=269.983 s,
sync=0.017 s, total=270.028 s; sync files=112, longe
st=0.002 s, average=0.000 s; distance=156445 kB, estimate=171472
kB",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2640,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"recovery restart point at AB3/4149B340","last completed
transaction was at log time 2018-08-22 05:19:43.077943+00",,,,,,,,""
2018-08-22 05:20:13.121 UTC,,,23838,,5b78e9a0.5d1e,2641,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""

Here it crashed.

Now postgres fails to start with the following logs:
2018-08-22 06:22:12.994 UTC,,,51,,5b7d0114.33,1,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go
to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:12.996 UTC,,,54,,5b7d0114.36,1,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some
data might be corrupted and you might need to choose an
 earlier recovery target.",,,,,,,""
2018-08-22 06:22:13.232 UTC,,,54,,5b7d0114.36,2,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2018-08-22 06:22:13.420 UTC,,,54,,5b7d0114.36,3,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"restored log file ""00000005.history"" from
archive",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,,,147,"[local]",5b7d0115.93,1,"",2018-08-22
06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.430
UTC,"postgres","postgres",147,"[local]",5b7d0115.93,2,"",2018-08-22 06:22:13
UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,,,149,"[local]",5b7d0115.95,1,"",2018-08-22
06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.435
UTC,"postgres","postgres",149,"[local]",5b7d0115.95,2,"",2018-08-22 06:22:13
UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.912 UTC,,,54,,5b7d0114.36,4,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"restored log file ""0000000500000AB30000004A"" from
archive",,,,,,,,,""
... restores WAL files from S3
2018-08-22 06:22:23.489 UTC,,,54,,5b7d0114.36,22,,2018-08-22 06:22:12
UTC,1/0,0,LOG,00000,"restored log file ""0000000500000AB300000050"" from
archive",,,,,,,,,""
2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12
UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does
not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12
UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog
redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,3,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"startup process (PID 54) was terminated by signal 6:
Aborted",,,,,,,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,4,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2018-08-22 06:22:23.847
UTC,"postgres","postgres",842,"[local]",5b7d011c.34a,3,"idle",2018-08-22
06:22:20 UTC,3/0,0,WARNING,57P02,"terminating connection because of crash of
another server process","The postmaster has commanded this server process to
roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.","In a moment you
should be able to reconnect to the database and repeat your
command.",,,,,,,"Patroni"
2018-08-22 06:22:23.945 UTC,,,51,,5b7d0114.33,5,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2018-08-22 06:22:35.294 UTC,,,1135,,5b7d012b.46f,1,,2018-08-22 06:22:35
UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go
to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:35.295 UTC,,,1138,,5b7d012b.472,1,,2018-08-22 06:22:35
UTC,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.",,,,,,,""


Here are a few log lines from pg_xlogdump 0000000500000AB300000050 around
LSN AB3/50323E78:
rmgr: Heap        len (rec/tot):     54/    54, tx:  518902809, lsn:
AB3/50323D30, prev AB3/50323CE8, desc: LOCK off 18: xid 518902809: flags 0
LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/18053/213284 blk 137366
rmgr: Btree       len (rec/tot):     88/    88, tx:  518902810, lsn:
AB3/50323D68, prev AB3/50323D30, desc: INSERT_LEAF off 66, blkref #0: rel
1663/18055/212936 blk 32947
rmgr: Btree       len (rec/tot):     96/    96, tx:  518902810, lsn:
AB3/50323DC0, prev AB3/50323D68, desc: INSERT_LEAF off 3, blkref #0: rel
1663/18055/212935 blk 6
rmgr: Btree       len (rec/tot):     88/    88, tx:  518902810, lsn:
AB3/50323E20, prev AB3/50323DC0, desc: INSERT_LEAF off 12, blkref #0: rel
1663/18055/212930 blk 25152
rmgr: Btree       len (rec/tot):    436/   547, tx:  518902810, lsn:
AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel
1663/18055/212907 blk 72478 FPW
rmgr: Btree       len (rec/tot):     88/    88, tx:  518902810, lsn:
AB3/503240B8, prev AB3/50323E78, desc: INSERT_LEAF off 2, blkref #0: rel
1663/18055/212907 blk 72478
rmgr: Gist        len (rec/tot):     55/   163, tx:  518902810, lsn:
AB3/50324110, prev AB3/503240B8, desc: PAGE_UPDATE , blkref #0: rel
1663/18055/212906 blk 273288 FPW
rmgr: Btree       len (rec/tot):    160/   160, tx:  518902810, lsn:
AB3/503241B8, prev AB3/50324110, desc: INSERT_LEAF off 8, blkref #0: rel
1663/18055/212892 blk 85620
rmgr: Gist        len (rec/tot):     57/   201, tx:  518902812, lsn:
AB3/50324258, prev AB3/503241B8, desc: PAGE_UPDATE , blkref #0: rel
1663/18053/213364 blk 99265 FPW
rmgr: Btree       len (rec/tot):     55/  2403, tx:  518902810, lsn:
AB3/50324328, prev AB3/50324258, desc: INSERT_LEAF off 13, blkref #0: rel
1663/18055/212893 blk 53544 FPW
rmgr: Btree       len (rec/tot):     55/  1464, tx:  518902812, lsn:
AB3/50324C90, prev AB3/50324328, desc: INSERT_LEAF off 83, blkref #0: rel
1663/18053/213369 blk 41494 FPW
rmgr: Transaction len (rec/tot):     34/    34, tx:  518902810, lsn:
AB3/50325248, prev AB3/50324C90, desc: COMMIT 2018-08-22 05:18:03.670434
UTC

And here is the list of files for the relation in question:
$ ls -al base/18055/212875*
-rw------- 1 postgres postgres 1073741824 Aug 22 05:20 base/18055/212875
-rw------- 1 postgres postgres 1014947840 Aug 22 05:21 base/18055/212875.1
-rw------- 1 postgres postgres     532480 Aug 22 05:21
base/18055/212875_fsm
-rw------- 1 postgres postgres      65536 Aug 22 05:21
base/18055/212875_vm

What I find really strange is the huge difference between the page number
in the postgres log and the block number in the pg_xlogdump output:
"page 179503104 of relation base/18055/212875 does not exist" vs. "rel
1663/18055/212907 blk 72478 FPW"

I will keep this instance around for further investigation and would be
happy to provide some more details if you need.


Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Wed, Aug 22, 2018 at 07:36:58AM +0000, PG Bug reporting form wrote:
> 2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12
> UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does
> not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> items",,,,""
> 2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12
> UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog
> redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""

Once recovery has reached a consistent state, the startup process checks
whether any invalid pages are tracked in a dedicated hash table and
complains loudly about them.  It is not the last record or its
surroundings that matter here, but whether this page was referenced by
one of the records replayed during recovery up to the consistency point.
Do you have, in any of the records from the fetched WAL segments, a
reference to this page?  A page is 8kB, and the page number is
179503104, which is definitely weird as that would require the relation
file to be more than 1000GB.  If the record itself is in bad shape, this
may be a corrupted segment.  As far as I can see you only have one
incorrect page reference (see XLogCheckInvalidPages in xlog.c).
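
For illustration, a quick standalone check of the size that block number
implies (this is not server code; it just assumes the default 8192-byte
BLCKSZ):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint64_t blkno = 179503104;     /* block number from the WARNING */
        uint64_t blcksz = 8192;         /* default BLCKSZ */
        uint64_t bytes = blkno * blcksz;

        /* ~1370 GB, while base/18055/212875 plus its .1 segment is only
         * about 2 GB on disk */
        printf("%llu bytes (~%.0f GB)\n",
               (unsigned long long) bytes,
               bytes / (1024.0 * 1024.0 * 1024.0));
        return 0;
}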

> I will keep this instance around for further investigation and would be
> happy to provide some more details if you need.

That would be nice!
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi,

I've figured out what is going on.
On this server we have a background worker, which is started via
shared_preload_libraries.

In order to debug and reproduce it, I removed everything from the
background worker code except _PG_init, worker_main and a couple of
signal handler functions.

Here is the code:

void
worker_main(Datum main_arg)
{
        pqsignal(SIGHUP, bg_mon_sighup);
        pqsignal(SIGTERM, bg_mon_sigterm);
        if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
                proc_exit(1);
        BackgroundWorkerUnblockSignals();
        BackgroundWorkerInitializeConnection("postgres", NULL);
        while (!got_sigterm)
        {
                int rc = WaitLatch(MyLatch,
                                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                   naptime * 1000L);

                ResetLatch(MyLatch);
                if (rc & WL_POSTMASTER_DEATH)
                        proc_exit(1);
        }

        proc_exit(1);
}

void
_PG_init(void)
{
        BackgroundWorker worker;
        if (!process_shared_preload_libraries_in_progress)
                return;
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = 1;
        worker.bgw_main = worker_main;
        worker.bgw_notify_pid = 0;
        snprintf(worker.bgw_name, BGW_MAXLEN, "my_worker");
        RegisterBackgroundWorker(&worker);
}

Most of this code is taken from "worker_spi.c".
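
The trimmed-out parts are presumably just the usual worker_spi-style
boilerplate; the reconstruction below (includes, the naptime and
got_sigterm variables, and the two signal handlers) is a guess at their
shape, not the actual bg_mon code:

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"

PG_MODULE_MAGIC;

void            _PG_init(void);
void            worker_main(Datum main_arg);

/* hypothetical names and values, only to make the fragment above complete */
static volatile sig_atomic_t got_sighup = false;
static volatile sig_atomic_t got_sigterm = false;
static int      naptime = 1;            /* actual value unknown */

static void
bg_mon_sighup(SIGNAL_ARGS)
{
        int     save_errno = errno;

        got_sighup = true;
        SetLatch(MyLatch);
        errno = save_errno;
}

static void
bg_mon_sigterm(SIGNAL_ARGS)
{
        int     save_errno = errno;

        got_sigterm = true;
        SetLatch(MyLatch);
        errno = save_errno;
}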

Basically, it just initializes a connection to the postgres database and
then sleeps all the time.

If I comment out the 'BackgroundWorkerInitializeConnection("postgres",
NULL);' call, postgres starts without any problem.
Which is very strange, because the background worker itself is not doing anything...

And one more thing, if I add sleep(15) before calling
BackgroundWorkerInitializeConnection, postgres manages to start
successfully.
Is there a very strange race condition here?

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Alvaro Herrera
Date:
On 2018-Aug-22, Alexander Kukushkin wrote:

> If I comment out the 'BackgroundWorkerInitializeConnection("postgres",
> NULL);' call, postgres starts without any problem.
> Which is very strange, because the background worker itself is not doing anything...
> 
> And one more thing, if I add sleep(15) before calling
> BackgroundWorkerInitializeConnection, postgres manages to start
> successfully.
> Is there a very strange race condition here?

Sounds likely.  I suggest to have a look at what's going on inside the
postmaster process when it gets stuck.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:

>
> Sounds likely.  I suggest to have a look at what's going on inside the
> postmaster process when it gets stuck.

Well, it doesn't get stuck; it aborts startup with the message:
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
by signal 6: Aborted",,,,,,,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,4,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2018-08-22 14:26:42.290 UTC,,,28483,,5b7d7282.6f43,5,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Alvaro Herrera
Date:
On 2018-Aug-22, Alexander Kukushkin wrote:

> 2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
> 
> >
> > Sounds likely.  I suggest to have a look at what's going on inside the
> > postmaster process when it gets stuck.
> 
> Well, it doesn't get stuck; it aborts startup with the message:
> 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
> 14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
> base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
> Btree/DELETE: 182 items",,,,""
> 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
> 14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
> pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> items",,,,""
> 2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
> 14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
> by signal 6: Aborted",,,,,,,,,""

Oh, that's weird ... sounds like the fact that the bgworker starts
somehow manages to corrupt the list of invalid pages in the startup
process.  That doesn't make any sense ...

ENOTIME for a closer look ATM, though, sorry.  Maybe you could try
running under valgrind?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15346: Replica fails to start after the crash

From
Dmitry Dolgov
Date:
> On Wed, 22 Aug 2018 at 17:08, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2018-Aug-22, Alexander Kukushkin wrote:
>
> > 2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
> >
> > >
> > > Sounds likely.  I suggest to have a look at what's going on inside the
> > > postmaster process when it gets stuck.
> >
> > Well, it doesn't get stuck; it aborts startup with the message:
> > 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
> > 14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
> > base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
> > Btree/DELETE: 182 items",,,,""
> > 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
> > 14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
> > pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> > items",,,,""
> > 2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
> > 14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
> > by signal 6: Aborted",,,,,,,,,""
>
> Oh, that's weird ... sounds like the fact that the bgworker starts
> somehow manages to corrupt the list of invalid pages in the startup
> process.  That doesn't make any sense ...

We can see that the crash itself happened because XLogReadBufferExtended, at
`if (PageIsNew(page))` (xlogutils.c:512), got a page that apparently had not
been initialized yet, and, since we have reached a consistent state,
log_invalid_page panics.
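
For readers following along, a condensed paraphrase of the relevant
behaviour in 9.6's xlogutils.c (not compilable as-is and not a verbatim
quote, just the shape of the logic):

        /* inside XLogReadBufferExtended(..., blkno, RBM_NORMAL), roughly: */
        if (blkno >= smgrnblocks(smgr, forknum))
        {
                /* beyond EOF: "page ... of relation ... does not exist" */
                log_invalid_page(rnode, forknum, blkno, false);
                return InvalidBuffer;
        }
        page = BufferGetPage(buffer);
        if (PageIsNew(page))
        {
                /* all-zero page: "page ... of relation ... is uninitialized" */
                ReleaseBuffer(buffer);
                log_invalid_page(rnode, forknum, blkno, true);
                return InvalidBuffer;
        }

        /*
         * log_invalid_page() only records the page in invalid_page_tab while
         * consistency has not been reached yet (XLogCheckInvalidPages() then
         * re-checks that table at the consistency point); once
         * reachedConsistency is true it PANICs immediately with "WAL contains
         * references to invalid pages", which is the abort seen here.
         */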

> ENOTIME for a closer look ATM, though, sorry.  Maybe you could try
> running under valgrind?

Could you please elaborate on what we could find using valgrind in this
case, some memory leaks? In any case there is a chance that everything
will look fine, since even just slow tracing under gdb makes this race
condition disappear.


Re: BUG #15346: Replica fails to start after the crash

From
Alvaro Herrera
Date:
On 2018-Aug-22, Dmitry Dolgov wrote:

> > ENOTIME for a closer look ATM, though, sorry.  Maybe you could try
> > running under valgrind?
> 
> Could you please elaborate on what we could find using valgrind in this
> case, some memory leaks? In any case there is a chance that everything
> will look fine, since even just slow tracing under gdb makes this race
> condition disappear.

I don't know.  I was thinking it could detect some invalid write in
shared memory.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Wed, Aug 22, 2018 at 02:50:34PM -0300, Alvaro Herrera wrote:
> On 2018-Aug-22, Dmitry Dolgov wrote:
> I don't know.  I was thinking it could detect some invalid write in
> shared memory.

invalid_page_tab does not use HASH_SHARED_MEM.  What did you have in
mind?

On my side, I have been letting a primary/standby set run for a certain
amount of time with the following running:
- pgbench load on the primary.
- A background worker connecting to a database, just running in a loop
and sleeping.  It uses BgWorkerStart_ConsistentState as start mode.
- The standby gets stopped in immediate mode, and restarted after some
time to let some activity happen after the recovery point has been
reached.

After close to an hour of repeated tests, I am not able to reproduce
that.  Maybe there are some specifics in your schema causing more of
certain types of records to be created.  Could you check if in the set
of WAL segments replayed you have references to the block the hash table
is complaining about?
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi Michael,

It does not abort right after reaching a consistent state, but a little bit later:
"consistent recovery state reached at AB3/4A1B3118"
"WAL contains references to invalid pages",,,,,"xlog redo at
AB3/50323E78 for Btree/DELETE: 182 items"

These are WAL segments 0000000500000AB30000004A and 0000000500000AB300000050,
respectively.

Also I managed to get a backtrace:

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fd3f1deb801 in __GI_abort () at abort.c:79
#2  0x00005644f7094aaf in errfinish (dummy=dummy@entry=0) at
./build/../src/backend/utils/error/elog.c:557
#3  0x00005644f7096a49 in elog_finish (elevel=elevel@entry=22,
fmt=fmt@entry=0x5644f70e5998 "WAL contains references to invalid
pages") at ./build/../src/backend/utils/error/elog.c:1378
#4  0x00005644f6d8ec18 in log_invalid_page (node=...,
forkno=forkno@entry=MAIN_FORKNUM, blkno=blkno@entry=179503104,
present=<optimized out>) at
./build/../src/backend/access/transam/xlogutils.c:95
#5  0x00005644f6d8f168 in XLogReadBufferExtended (rnode=...,
forknum=forknum@entry=MAIN_FORKNUM, blkno=179503104,
mode=mode@entry=RBM_NORMAL) at
./build/../src/backend/access/transam/xlogutils.c:515
#6  0x00005644f6d59448 in btree_xlog_delete_get_latestRemovedXid
(record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:593
#7  btree_xlog_delete (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:682
#8  btree_redo (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:1012
#9  0x00005644f6d83c34 in StartupXLOG () at
./build/../src/backend/access/transam/xlog.c:6967
#10 0x00005644f6f27953 in StartupProcessMain () at
./build/../src/backend/postmaster/startup.c:216
#11 0x00005644f6d925a7 in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7fff1f893a80) at
./build/../src/backend/bootstrap/bootstrap.c:419
#12 0x00005644f6f24983 in StartChildProcess (type=StartupProcess) at
./build/../src/backend/postmaster/postmaster.c:5300
#13 0x00005644f6f27437 in PostmasterMain (argc=16,
argv=0x5644f7d07f10) at
./build/../src/backend/postmaster/postmaster.c:1321
#14 0x00005644f6d08b5b in main (argc=16, argv=0x5644f7d07f10) at
./build/../src/backend/main/main.c:228


From looking at xlogutils.c I figured out that this backtrace is actually
a bit misleading, because execution cannot reach line 515.
So I checked out the REL9_6_10 tag and built it with -O0; now it makes more sense.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f9ae0777801 in __GI_abort () at abort.c:79
#2  0x000055c9f285f56d in errfinish (dummy=0) at elog.c:557
#3  0x000055c9f2861d7c in elog_finish (elevel=22, fmt=0x55c9f28cbff0
"WAL contains references to invalid pages") at elog.c:1378
#4  0x000055c9f2424e76 in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5  0x000055c9f24258ce in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6  0x000055c9f23de8ce in btree_xlog_delete_get_latestRemovedXid
(record=0x55c9f3cbf438) at nbtxlog.c:593
#7  0x000055c9f23dea88 in btree_xlog_delete (record=0x55c9f3cbf438) at
nbtxlog.c:682
#8  0x000055c9f23df857 in btree_redo (record=0x55c9f3cbf438) at nbtxlog.c:1012
#9  0x000055c9f24160e8 in StartupXLOG () at xlog.c:6967
#10 0x000055c9f267d9f6 in StartupProcessMain () at startup.c:216
#11 0x000055c9f242a276 in AuxiliaryProcessMain (argc=2,
argv=0x7ffea746c260) at bootstrap.c:419
#12 0x000055c9f267c93e in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055c9f2677469 in PostmasterMain (argc=16,
argv=0x55c9f3c91690) at postmaster.c:1321
#14 0x000055c9f25c1515 in main (argc=16, argv=0x55c9f3c91690) at main.c:228


> After close to an hour of repeated tests, I am not able to reproduce
> that.  Maybe there are some specifics in your schema causing more of
> certain types of records to be created.  Could you check if in the set
> of WAL segments replayed you have references to the block the hash table
> is complaining about?

I am sorry, but I don't really know where to look :(


Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi,

I think I am getting closer:

If I start postgres without the background worker, the following code path is taken:

//
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtxlog.c;h=c536e224321dbc46574aa76876b9d49fa4b5e9a4;hb=REL9_6_10#l549
        if (CountDBBackends(InvalidOid) == 0)
                return latestRemovedXid;

And it returns from the btree_xlog_delete_get_latestRemovedXid function.

If we have a background worker, then CountDBBackends(InvalidOid)
returns 1, and as a result it aborts.

The comment before the if statement clearly warns about a possible race condition:
        /*
         * If there's nothing running on the standby we don't need to derive a
         * full latestRemovedXid value, so use a fast path out of here.  This
         * returns InvalidTransactionId, and so will conflict with all HS
         * transactions; but since we just worked out that that's zero people,
         * it's OK.
         *
         * XXX There is a race condition here, which is that a new backend might
         * start just after we look.  If so, it cannot need to conflict, but this
         * coding will result in throwing a conflict anyway.
         */
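
For context, what follows that fast path in
btree_xlog_delete_get_latestRemovedXid (condensed from the 9.6 source
linked above, not a verbatim quote) is what ends up dereferencing the
stale index page:

        if (CountDBBackends(InvalidOid) == 0)
                return latestRemovedXid;        /* fast path shown above */

        /* read the index page the DELETE record applies to */
        XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
        ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
        if (!BufferIsValid(ibuffer))
                return InvalidTransactionId;
        ipage = (Page) BufferGetPage(ibuffer);

        /* chase every deleted index tuple to the heap block it points at */
        unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
        for (i = 0; i < xlrec->nitems; i++)
        {
                iitemid = PageGetItemId(ipage, unused[i]);
                itup = (IndexTuple) PageGetItem(ipage, iitemid);
                hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
                hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
                                                 hblkno, RBM_NORMAL);
                /* with a stale ipage, itup is garbage and hblkno is nonsense */
        }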


Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi,


btree_xlog_delete_get_latestRemovedXid is called only when postgres is
already running InHotStandby.
What this function actually does is read the index page from disk and
iterate over its IndexTuple entries to find the corresponding heap
pages.
The problem is that xlrec->nitems = 182 in my case, but the index page
contains only 56 tuples and the rest of the page is filled with zeros:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2  0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3  0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010
"WAL contains references to invalid pages") at elog.c:1378
#4  0x000055937fbd4e8f in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5  0x000055937fbd58e7 in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
#7  0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at
nbtxlog.c:688
#8  0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9  0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2,
argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16,
argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582              * Loop through the deleted index items to obtain the
TransactionId from
583              * the heap items they point to.
584              */
585             unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587             for (i = 0; i < xlrec->nitems; i++)
588             {
589                     /*
590                      * Identify the index tuple about to be deleted
591                      */
(gdb) l
592                     iitemid = PageGetItemId(ipage, unused[i]);
593                     itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595                     /*
596                      * Locate the heap page that the index tuple points at
597                      */
598                     hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
600                     if (!BufferIsValid(hbuffer))
601                     {
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Therefore iitemid points to the beginning of the index page, which
doesn't make sense.

Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!

(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum =
3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special =
8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp =
0x7f06ed9e9598}
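
For reference, the two pd_lsn fields above are just the high and low
32-bit halves of the page LSN; a standalone snippet (not server code) to
convert them to the usual notation:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        uint32_t xlogid = 2739;                 /* pd_lsn.xlogid from gdb */
        uint32_t xrecoff = 1455373160;          /* pd_lsn.xrecoff from gdb */
        uint64_t lsn = ((uint64_t) xlogid << 32) | xrecoff;

        printf("%X/%X\n", xlogid, xrecoff);     /* prints AB3/56BF3B68 */
        printf("%llu\n", (unsigned long long) lsn);
        return 0;
}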

Index page LSN: AB3/56BF3B68
Current LSN: AB3/50323E78

Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.


Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> Why the number of tuples in the xlog is greater than the number of
> tuples on the index page?
> Because this page was already overwritten and its LSN is HIGHER than
> the current LSN!

That's annoying.  Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages.  How was the base backup of this node created?  Please
remember that when taking a base backup from a standby, you should
backup the control file last, as there is no control of end backup with
records available.  So it seems to me that the origin of your problem
comes from an incorrect base backup expectation?

> Is there a way to recover from such a situation? Should the postgres
> in such case do comparison of LSNs and if the LSN on the page is
> higher than the current LSN simply return InvalidTransactionId?
> Apparently, if there are no connections open postgres simply is not
> running this code and it seems ok.

One idea I have would be to copy all the WAL segments up to the point
where the pages to-be-updated are, and let Postgres replay all the local
WALs first.  However it is hard to say if that would be enough, as you
could have more references to pages even newer than the btree one you
just found.
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Andres Freund
Date:

On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
>On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
>> Why the number of tuples in the xlog is greater than the number of
>> tuples on the index page?
>> Because this page was already overwritten and its LSN is HIGHER than
>> the current LSN!
>
>That's annoying.  Because that means that the control file of your
>server maps to a consistent point which is older than some of the
>relation pages.  How was the base backup of this node created?  Please
>remember that when taking a base backup from a standby, you should
>backup the control file last, as there is no control of end backup with
>records available.  So it seems to me that the origin of your problem
>comes from an incorrect base backup expectation?

Uh, where is that "control file last" bit coming from?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi Michael,

> That's annoying.  Because that means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages.  How was the base backup of this node created?  Please
> remember that when taking a base backup from a standby, you should
> backup the control file last, as there is no control of end backup with
> records available.  So it seems to me that the origin of your problem
> comes from an incorrect base backup expectation?

We are running a cluster of 3 nodes (m4.large + EBS volume for
PGDATA) on AWS. The replicas were initialized about a year ago with
pg_basebackup and have been working absolutely fine. In the past year I
did a few minor upgrades with switchover (first upgrade the replicas,
switch over, then upgrade the former primary). The last switchover was
done on August 19th. This instance was working as a replica for about
three days until the sudden crash of the EC2 instance. On the new
instance we attached the existing EBS volume with the existing PGDATA
and tried to start postgres. You can see the consequences in the very
first email.


> One idea I have would be to copy all the WAL segments up to the point
> where the pages to-be-updated are, and let Postgres replay all the local
> WALs first.  However it is hard to say if that would be enough, as you
> could have more references to pages even newer than the btree one you
> just found.

Well, I did some experiments, among them the approach you suggest,
i.e. I commented out restore_command in recovery.conf and copied
quite a few WAL segments into pg_xlog. The results are the same: it
aborts as long as there are connections open :(


Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Stephen Frost
Date:
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> > Why the number of tuples in the xlog is greater than the number of
> > tuples on the index page?
> > Because this page was already overwritten and its LSN is HIGHER than
> > the current LSN!
>
> That's annoying.  Because that means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages.  How was the base backup of this node created?  Please
> remember that when taking a base backup from a standby, you should
> backup the control file last, as there is no control of end backup with
> records available.  So it seems to me that the origin of your problem
> comes from an incorrect base backup expectation?

Yeah, we really need to know how this system was built.  If it was built
using the low-level pg_start/stop_backup, then which API was used: the
old one that creates a backup_label file, or the new API which requires
the user to create the backup_label file?  I haven't followed this
thread very closely, but I wonder if maybe the new API was used, but no
backup_label file was created, making PG think it was doing crash
recovery instead of restoring from a file-level backup...

Thanks!

Stephen

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
> >On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> >> Why the number of tuples in the xlog is greater than the number of
> >> tuples on the index page?
> >> Because this page was already overwritten and its LSN is HIGHER than
> >> the current LSN!
> >
> >That's annoying.  Because that means that the control file of your
> >server maps to a consistent point which is older than some of the
> >relation pages.  How was the base backup of this node created?  Please
> >remember that when taking a base backup from a standby, you should
> >backup the control file last, as there is no control of end backup with
> >records available.  So it seems to me that the origin of your problem
> >comes from an incorrect base backup expectation?
>
> Uh, where is that "control file last" bit coming from?

pg_basebackup copies it last.  The comments should probably be improved
as to *why* but my recollection is that it's, at least in part, to
ensure the new cluster can't be used until it's actually a complete
backup.

Thanks!

Stephen

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Tue, Aug 28, 2018 at 08:23:11AM -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
>> Uh, where is that "control file last" bit coming from?
>
> pg_basebackup copies it last.  The comments should probably be improved
> as to *why* but my recollection is that it's, at least in part, to
> ensure the new cluster can't be used until it's actually a complete
> backup.

What we have now is mainly in basebackup.c.  See 8366c780 which
introduced that.  Stephen has that right: as we cannot rely on an
end-backup record when taking a backup from a standby, copying the
control file last ensures that the consistent point is late enough
that no other pages are inconsistent.  Even with that, I think
that there is still a small race condition but I cannot put my finger on
it now.  I agree that the current comments do a bad job as to why this
happens.  That's actually something I discovered when discussing what
has resulted in f267c1c2.
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> Is there a way to recover from such a situation? Should the postgres
> in such case do comparison of LSNs and if the LSN on the page is
> higher than the current LSN simply return InvalidTransactionId?

Hmm.  That does not sound right to me.  If the page has a header LSN
higher than the one replayed, we should not see it as consistency has
normally been reached.  XLogReadBufferExtended() seems to complain in
your case about a page which should not exist per the information of
your backtrace.  What's the length of relation file at this point?  If
the relation is considered as having less blocks, shouldn't we just
ignore it if we're trying to delete items on it and return
InvalidTransactionId?  I have to admit that I am not the best specialist
with this code.

hblkno also looks insanely high to me if you look at the other blkno
references you mentioned upthread.

> Apparently, if there are no connections open postgres simply is not
> running this code and it seems ok.

Yeah, that's used for standby conflicts.
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi,

2018-08-29 8:17 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> Hmm.  That does not sound right to me.  If the page has a header LSN
> higher than the one replayed, we should not see it as consistency has
> normally been reached.  XLogReadBufferExtended() seems to complain in
> your case about a page which should not exist per the information of
> your backtrace.  What's the length of relation file at this point?  If
> the relation is considered as having less blocks, shouldn't we just
> ignore it if we're trying to delete items on it and return
> InvalidTransactionId?  I have to admit that I am not the best specialist
> with this code.

This is what pg_controldata reports:
Latest checkpoint location:           AB3/4A1B30A8
Prior checkpoint location:            AB3/4A1B30A8
Latest checkpoint's REDO location:    AB3/4149B340
Latest checkpoint's REDO WAL file:    0000000500000AB300000041
Latest checkpoint's TimeLineID:       5
Latest checkpoint's PrevTimeLineID:   5
Latest checkpoint's full_page_writes: on

Minimum recovery ending location:     AB3/4A1B3118
Min recovery ending loc's timeline:   5
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on

Therefore it reaches consistency at AB3/4A1B3118 and starts accepting
connections.
A few moments later it starts replaying 0000000500000AB300000050,
which contains the "Btree/DELETE: 182 items" record at AB3/50323E78.
In pg_xlogdump this record looks like:
rmgr: Btree       len (rec/tot):    436/   547, tx:  518902810, lsn:
AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0:
rel 1663/18055/212907 blk 72478 FPW

212907 -- btree index:
$ ls -al data/base/18055/212907*
-rw------- 1 akukushkin akukushkin 594812928 Aug 22 07:22 data/base/18055/212907
-rw------- 1 akukushkin akukushkin    163840 Aug 22 07:22
data/base/18055/212907_fsm

212875 -- the table:
$ ls -al data/base/18055/212875*
-rw------- 1 akukushkin akukushkin 1073741824 Aug 22 07:20
data/base/18055/212875
-rw------- 1 akukushkin akukushkin 1014947840 Aug 22 07:21
data/base/18055/212875.1
-rw------- 1 akukushkin akukushkin     532480 Aug 22 07:21
data/base/18055/212875_fsm
-rw------- 1 akukushkin akukushkin      65536 Aug 22 07:21
data/base/18055/212875_vm

As you can see its size is only about 2GB.

>
> hblkno also looks insanely high to me if you look at the other blkno
> references you mentioned upthread.

hblkno is calculated from the data which postgres reads from the index
file data/base/18055/212907, block 72478.
This block has only 56 index tuples, while at least 182 are expected;
therefore, starting from tuple 57, the item pointer looks like:
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Basically it points to the beginning of the page, which doesn't make
sense at all.

Why does block 72478 of the index relfile not meet our expectations
(why does it contain so few tuples)?
The answer to this question is in the page header: the LSN written in
the index page header is AB3/56BF3B68.
That can have only one meaning: while postgres was working before the
crash it managed to apply the WAL stream up to at least AB3/56BF3B68,
which is far ahead of "Minimum recovery ending location:     AB3/4A1B3118".

Regards,
--
Alexander Kukushkin


Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Wed, Aug 29, 2018 at 08:59:16AM +0200, Alexander Kukushkin wrote:
> Why does block 72478 of the index relfile not meet our expectations
> (why does it contain so few tuples)?
> The answer to this question is in the page header: the LSN written in
> the index page header is AB3/56BF3B68.
> That can have only one meaning: while postgres was working before the
> crash it managed to apply the WAL stream up to at least AB3/56BF3B68,
> which is far ahead of "Minimum recovery ending location:     AB3/4A1B3118".

Yeah, that's the pinpoint.  Do you know by chance what was the content
of the control file for each standby you have upgraded to 9.6.10 before
starting them with the new binaries?  You mentioned a cluster of three
nodes, so I guess that you have two standbys, and that one of them did
not see the symptoms discussed here, while the other saw them.  Do you
still have the logs of the recovery just after starting the other
standby with 9.4.10 which did not see the symptom?  All your standbys
are using the background worker which would cause the btree deletion
code to be scanned, right?

I am trying to work on a reproducer with a bgworker starting once
recovery has been reached, without success yet.  Does your cluster
generate some XLOG_PARAMETER_CHANGE records?  In some cases, 9.4.8 could
have updated minRecoveryPoint to go backward, which is something that
8d68ee6 has been working on addressing.

Did you also try to use local WAL segments up where AB3/56BF3B68 is
applied, and also have a restore_command so as extra WAL segment fetches
from the archive would happen?
--
Michael

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Alexander Kukushkin
Date:
Hi,

2018-08-29 14:10 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> Yeah, that's the pinpoint.  Do you know by chance what was the content
> of the control file for each standby you have upgraded to 9.6.10 before
> starting them with the new binaries?  You mentioned a cluster of three

No, I don't. Right after the upgrade they started normally and have
been working for a few days. I believe the controlfile was overwritten
a few hundred times before the instance crashed.

> nodes, so I guess that you have two standbys, and that one of them did
> not see the symptoms discussed here, while the other saw them.  Do you

The other node didn't crash and is still working.

> still have the logs of the recovery just after starting the other
> standby with 9.4.10 which did not see the symptom?  All your standbys

I don't think it is really related to the minor upgrade. After the
upgrade the whole cluster was running for about 3 days.
Every day it generates about 2000 WAL segments; the total volume of
daily WAL is very close to the size of the cluster, which is 38GB.


> are using the background worker which would cause the btree deletion
> code to be scanned, right?

Well, any open connection to the database will produce the same
result. In our case we are using Patroni for automatic failover, which
connects immediately after postgres has started and keeps this
connection permanently open. The background worker just happened to be
faster than anything else.

> I am trying to work on a reproducer with a bgworker starting once
> recovery has been reached, without success yet.  Does your cluster
> generate some XLOG_PARAMETER_CHANGE records?  In some cases, 9.4.8 could
> have updated minRecoveryPoint to go backward, which is something that
> 8d68ee6 has been working on addressing.

No, it doesn't.

>
> Did you also try to use local WAL segments up where AB3/56BF3B68 is
> applied, and also have a restore_command so as extra WAL segment fetches
> from the archive would happen?

If there are no connections open, it applies the necessary amount of WAL
segments (with the help of restore_command, of course) and reaches the
real consistency point. After that, it is possible to connect, and the
startup process doesn't abort anymore.


BTW, I am thinking that we should return InvalidTransactionId from the
btree_xlog_delete_get_latestRemovedXid if the index page we read from
disk is newer than the xlog record we are currently processing. Please see
the patch attached.
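
The attached patch itself is not reproduced here; the shape of the idea
is roughly the following, applied right after reading the index page in
btree_xlog_delete_get_latestRemovedXid (a sketch, not the actual patch):

        ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
        if (!BufferIsValid(ibuffer))
                return InvalidTransactionId;
        LockBuffer(ibuffer, BT_READ);
        ipage = (Page) BufferGetPage(ibuffer);

        if (PageGetLSN(ipage) >= record->EndRecPtr)
        {
                /* the on-disk page already carries this change (or a later
                 * one), so its contents cannot be trusted for this record */
                UnlockReleaseBuffer(ibuffer);
                return InvalidTransactionId;
        }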

--
Alexander Kukushkin

Attachment

Re: BUG #15346: Replica fails to start after the crash

From
Michael Paquier
Date:
On Wed, Aug 29, 2018 at 03:05:25PM +0200, Alexander Kukushkin wrote:
> BTW, I am thinking that we should return InvalidTransactionId from the
> btree_xlog_delete_get_latestRemovedXid if the index page we read from
> disk is newer than the xlog record we are currently processing. Please see
> the patch attached.

This is not a solution in my opinion, as you could invalidate activities
of backends connected to the database when the incorrect consistent
point allows connections to come in too early.

What happens if you replay with hot_standby = on up to the latest point,
without any concurrent connections, then issue a checkpoint on the
standby once you have gotten past the point it complains about, and
finally restart the standby with the bgworker?

Another idea I have would be to make the standby promote, issue a
checkpoint on it, and then use pg_rewind as a trick to update the
control file to a point newer than the inconsistency.  As PG < 9.6.10
could make the minimum recovery point go backwards, applying the upgrade
after the consistent point got to an incorrect state would trigger the
failure.
--
Michael

Attachment