Thread: [GENERAL] Is there possibility btree_redo with XLOG_BTREE_DELETE done betweenstandby_redo and the end of backup

is there possibility btree_redo with XLOG_BTREE_DELETE info done between
standby_redo and  the end of backup? I have PostgreSQL 9.3.14 which have
some patches to use and easy to happen.

//checkpoint
[26005-59251d35.6595-712522] 2017-05-24 05:42:22.343 GMT < > DEBUG:  00000:
recovery snapshots are now enabled
[26005-59251d35.6595-712523] 2017-05-24 05:42:22.343 GMT < > CONTEXT:  xlog
redo running xacts: nextXid 3269723 latestCompletedXid 3269721
oldestRunningXid 3269722; 1 xacts: 3269722

void
ProcArrayApplyRecoveryInfo(RunningTransactions running)
{
......
    standbyState = STANDBY_SNAPSHOT_READY;  /  //the standbyState is changed
to STANDBY_SNAPSHOT_READY/
......
}

static void
CheckRecoveryConsistency(void)
{
......
///but the reachedConsistency  is not set true because below condition is
not true./
    if (!reachedConsistency && !ControlFile->backupEndRequired &&
        minRecoveryPoint <= lastReplayedEndRecPtr &&
        XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
    {
        /*
         * Check to see if the XLOG sequence contained any unresolved
         * references to uninitialized pages.
         */
        XLogCheckInvalidPages();

        reachedConsistency = true;
        ereport(LOG,
                (errmsg("consistent recovery state reached at %X/%X",
                        (uint32) (lastReplayedEndRecPtr >> 32),
                        (uint32) lastReplayedEndRecPtr)));
    }
......

}

//inconsistent detected in btree_xlog_delete_get_latestRemovedXid()
[26005-59251d35.6595-726087] 2017-05-24 05:42:22.513 GMT < > WARNING:
01000: btree_xlog_delete_get_latestRemovedXid: cannot operate with
inconsistent data
[26005-59251d35.6595-726088] 2017-05-24 05:42:22.513 GMT < > CONTEXT:  xlog
redo delete: index 1663/16385/11693; iblk 110, heap 1663/16385/11691;

static void
btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
{
......
       / //InHotStandby is true/
    if (InHotStandby) / //#define InHotStandby (standbyState >=
STANDBY_SNAPSHOT_PENDING)/
    {
        TransactionId latestRemovedXid =
btree_xlog_delete_get_latestRemovedXid(xlrec);

        ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
    }
......
}

static TransactionId
btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec)
{
......
        ///PANIC happened  *(I changed the PANIC to WARNING in the log to
see further steps.)*/
    if (!reachedConsistency)
          *elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot
operate with inconsistent data");*
......
}

//backup end
[26005-59251d35.6595-801986] 2017-05-24 05:42:28.564 GMT < > DEBUG:  00000:
end of backup reached
[26005-59251d35.6595-801987] 2017-05-24 05:42:28.564 GMT < > CONTEXT:  xlog
redo backup end: 191/AB1F50E8

static void
CheckRecoveryConsistency(void)
{
......
///This time the reachedConsistency  is set true because below condition
becomes to true./
    if (!reachedConsistency && !ControlFile->backupEndRequired &&
        minRecoveryPoint <= lastReplayedEndRecPtr &&
        XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
    {
        /*
         * Check to see if the XLOG sequence contained any unresolved
         * references to uninitialized pages.
         */
        XLogCheckInvalidPages();

        reachedConsistency = true;
        ereport(LOG,
                (errmsg("consistent recovery state reached at %X/%X",
                        (uint32) (lastReplayedEndRecPtr >> 32),
                        (uint32) lastReplayedEndRecPtr)));
    }
......

}

I'm not sure if this scenario with original PostgreSQL code(9.3.14) would
happen while standby server start up in recovery mode(meanwhile huge data
updating in master server) in real world.
I'm not sure if the problem is because original PostgreSQL code or because
our patches.
So need to give comments from PostgreSQL original code point of view to
analyze the possibility.
Thanks!




--
View this message in context:
http://www.postgresql-archive.org/Is-there-possibility-btree-redo-with-XLOG-BTREE-DELETE-done-between-standby-redo-and-the-end-of-backp-tp5963011.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


y39chen <yan-jack.chen@nokia.com> writes:
> is there possibility btree_redo with XLOG_BTREE_DELETE info done between
> standby_redo and  the end of backup? I have PostgreSQL 9.3.14 which have
> some patches to use and easy to happen.
> ...
> //inconsistent detected in btree_xlog_delete_get_latestRemovedXid()
> [26005-59251d35.6595-726087] 2017-05-24 05:42:22.513 GMT < > WARNING:
> 01000: btree_xlog_delete_get_latestRemovedXid: cannot operate with
> inconsistent data

This looks a whole lot like a bug we fixed in 9.3.15, cf


https://www.postgresql.org/message-id/flat/DB5PR07MB15416C65687A1EA9AC0D26F8D6E00%40DB5PR07MB1541.eurprd07.prod.outlook.com

            regards, tom lane


On Wed, May 24, 2017 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> y39chen <yan-jack.chen@nokia.com> writes:
>> is there possibility btree_redo with XLOG_BTREE_DELETE info done between
>> standby_redo and  the end of backup? I have PostgreSQL 9.3.14 which have
>> some patches to use and easy to happen.
>> ...
>> //inconsistent detected in btree_xlog_delete_get_latestRemovedXid()
>> [26005-59251d35.6595-726087] 2017-05-24 05:42:22.513 GMT < > WARNING:
>> 01000: btree_xlog_delete_get_latestRemovedXid: cannot operate with
>> inconsistent data
>
> This looks a whole lot like a bug we fixed in 9.3.15, cf
>
>
https://www.postgresql.org/message-id/flat/DB5PR07MB15416C65687A1EA9AC0D26F8D6E00%40DB5PR07MB1541.eurprd07.prod.outlook.com

To which commit are you referring here? I don't recall any fixes
related to that, and there is nothing in the git history indicating
so.

Still I don't understand how this PANIC code can be reachable with
community code. CountDBBackends() will normally return 0 if
consistency is not reached, and postmaster will block incoming
connections until a consistent state is reached.
--
Michael


Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, May 24, 2017 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This looks a whole lot like a bug we fixed in 9.3.15, cf
>>
https://www.postgresql.org/message-id/flat/DB5PR07MB15416C65687A1EA9AC0D26F8D6E00%40DB5PR07MB1541.eurprd07.prod.outlook.com

> To which commit are you referring here? I don't recall any fixes
> related to that, and there is nothing in the git history indicating
> so.

Wup, sorry, I quoted the wrong thread mentioning
btree_xlog_delete_get_latestRemovedXid.  The right one is

https://www.postgresql.org/message-id/flat/00F0B2CEF6D0CEF8A90119D4%40eje.credativ.lan

and the patch is

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master [0e0f43d6f] 2016-08-31 08:52:13 -0400
Branch: REL9_6_STABLE Release: REL9_6_0 [3fc489cb3] 2016-08-31 08:52:13 -0400
Branch: REL9_5_STABLE Release: REL9_5_5 [c40bb1155] 2016-08-31 08:52:13 -0400
Branch: REL9_4_STABLE Release: REL9_4_10 [f4e40537e] 2016-08-31 08:52:13 -0400
Branch: REL9_3_STABLE Release: REL9_3_15 [baf111d31] 2016-08-31 08:52:13 -0400
Branch: REL9_2_STABLE Release: REL9_2_19 [823df401d] 2016-08-31 08:52:13 -0400
Branch: REL9_1_STABLE Release: REL9_1_24 [e3439a455] 2016-08-31 08:52:13 -0400

    Prevent starting a standalone backend with standby_mode on.

    This can't really work because standby_mode expects there to be more
    WAL arriving, which there will not ever be because there's no WAL
    receiver process to fetch it.  Moreover, if standby_mode is on then
    hot standby might also be turned on, causing even more strangeness
    because that expects read-only sessions to be executing in parallel.
    Bernd Helmle reported a case where btree_xlog_delete_get_latestRemovedXid
    got confused, but rather than band-aiding individual problems it seems
    best to prevent getting anywhere near this state in the first place.
    Back-patch to all supported branches.

    In passing, also fix some omissions of errcodes in other ereport's in
    readRecoveryCommandFile().

    Michael Paquier (errcode hacking by me)

    Discussion: <00F0B2CEF6D0CEF8A90119D4@eje.credativ.lan>

            regards, tom lane


On Wed, May 24, 2017 at 11:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Wup, sorry, I quoted the wrong thread mentioning
> btree_xlog_delete_get_latestRemovedXid.  The right one is
>
> https://www.postgresql.org/message-id/flat/00F0B2CEF6D0CEF8A90119D4%40eje.credativ.lan
>
> and the patch is
>
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Branch: master [0e0f43d6f] 2016-08-31 08:52:13 -0400
> Branch: REL9_6_STABLE Release: REL9_6_0 [3fc489cb3] 2016-08-31 08:52:13 -0400
> Branch: REL9_5_STABLE Release: REL9_5_5 [c40bb1155] 2016-08-31 08:52:13 -0400
> Branch: REL9_4_STABLE Release: REL9_4_10 [f4e40537e] 2016-08-31 08:52:13 -0400
> Branch: REL9_3_STABLE Release: REL9_3_15 [baf111d31] 2016-08-31 08:52:13 -0400
> Branch: REL9_2_STABLE Release: REL9_2_19 [823df401d] 2016-08-31 08:52:13 -0400
> Branch: REL9_1_STABLE Release: REL9_1_24 [e3439a455] 2016-08-31 08:52:13 -0400
>
>     Prevent starting a standalone backend with standby_mode on.

Ahh. This one... Thanks. That may be related, but no way to say if
that's related with this report as there is no information that a
standalone backend is used. Perhaps that's the case though. So it
could be possible as well that the set of custom patches mentioned is
fishy.
--
Michael


Thank you the comments.
We found the panic happened when adding one of our patch.

static int
ProcessStartupPacket(Port *port, bool SSLdone)
{
......
    /*
     * If we're going to reject the connection due to database state, say so
     * now instead of wasting cycles on an authentication exchange. (This also
     * allows a pg_ping utility to be written.)
     */
    switch (port->canAcceptConnections)
    {
        case CAC_STARTUP:
            ereport(*LOG*,
                    (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                     errmsg("the database system is starting up")));
            break;
......
}

I understand the patch would accept connection while Postgres master is
still recovering. and it is dangerous in general.  When I remove the
patch(be PANIC still). no panic happened. It seems have relationship.
I also noticed there is one post for similar problem discussion.

http://www.postgresql-archive.org/Re-Crash-observed-during-the-start-of-the-Postgres-process-td5958225.html
<http://www.postgresql-archive.org/Re-Crash-observed-during-the-start-of-the-Postgres-process-td5958225.html>

My doubt is Standby is redoing the records in WAL from master. how accept
connection in standby side while recovering would trigger
btree_xlog_delete_get_latestRemovedXid() and panic happen.

I tried to read the postgres code the get the clear map. But still confused.

Are you aware and would you kindly explain the relationship behind? Or give
me some hint how to investigate it?




--
View this message in context:
http://www.postgresql-archive.org/Is-there-possibility-btree-redo-with-XLOG-BTREE-DELETE-done-between-standby-redo-and-the-end-of-backp-tp5963066p5963181.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


On Thu, May 25, 2017 at 5:23 AM, y39chen <yan-jack.chen@nokia.com> wrote:
> My doubt is Standby is redoing the records in WAL from master. how accept
> connection in standby side while recovering would trigger
> btree_xlog_delete_get_latestRemovedXid() and panic happen.

You should look at the relationship between the code of postmaster.c
dealing with updates of pmState and how the startup process (xlog.c)
lets the postmaster know when it can accept incoming connections. Once
a set of conditions is met, the startup process will let the
postmaster know if it is safe to accept connections on a hot standby.
That's a good study and the code is well-commented, so I let you guess
what are those conditions and how they are met during recovery.
--
Michael


y39chen <yan-jack.chen@nokia.com> writes:
> We found the panic happened when adding one of our patch.

>     switch (port->canAcceptConnections)
>     {
>         case CAC_STARTUP:
>             ereport(*LOG*,
>                     (errcode(ERRCODE_CANNOT_CONNECT_NOW),
>                      errmsg("the database system is starting up")));
>             break;

Did you not absorb the advice given in

https://www.postgresql.org/message-id/23381.1494607092@sss.pgh.pa.us

?  The above patch is simply an impossibly bad idea.  Don't do it.

            regards, tom lane


Yeah, I figured out the point(logic). The precondition is should not have any
connections accept while recovering. It is clear to me now. Thank you very
much.

static TransactionId
btree_xlog_delete_get_latestRemovedXid(xl_btree_delete *xlrec)
{
......
    if (*CountDBBackends(InvalidOid)* == 0)
        return latestRemovedXid;

    /*
     * In what follows, we have to examine the previous state of the index
     * page, as well as the heap page(s) it points to.  This is only valid if
     * WAL replay has reached a consistent database state; which means that
     * the preceding check is not just an optimization, but is *necessary*. We
     * won't have let in any user sessions before we reach consistency.
     */
    if (!reachedConsistency)
            elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot
operate with inconsistent data");

......
}



--
View this message in context:
http://www.postgresql-archive.org/Is-there-possibility-btree-redo-with-XLOG-BTREE-DELETE-done-between-standby-redo-and-the-end-of-backp-tp5963066p5963349.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.