[BUG] False indication in pg_stat_replication.sync_state - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject [BUG] False indication in pg_stat_replication.sync_state
Date
Msg-id 20121018.174208.239638814.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: Reduce palloc's in numeric operations.  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: [BUG] False indication in pg_stat_replication.sync_state  (Fujii Masao <masao.fujii@gmail.com>)
List pgsql-hackers
Hello. My colleague found that pg_stat_replication.sync_state
shows false state for some condition.

This prevents Pacemaker from completing fail-over that could
safely be done.

The point is in walsender.c, pg_stat_get_wal_senders() below, (as
of REL9_2_1)
 1555:    if (walsnd->pid != 0) 1556:    { 1557:      /* 1558:       * Treat a standby such as a pg_basebackup
backgroundprocess 1559:       * which always returns an invalid flush location, as an 1560:       * asynchronous
standby.1561:       */
 
! 1562:       sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ? 1563:          0 :
walsnd->sync_standby_priority;

Here, XLogRecPtrIsInvalid(walsnd->flush) is defined as
(walsnd->flush.xrecoff == 0) which becomes true as usual at every
WAL 'file's (not segments) boundary. xrecoff == 0 is certainly
invalid for the start point of WAL *RECORD*, but should be
considered valid in replication stream. This check was introduced
at 9.2.0 and the version up between 9.1.4 and 9.1.5.
| DEBUG:  write 4/0 flush 3/FEFFEC68 apply 3/FEFFEC68| DEBUG:  write 4/0 flush 4/0 apply 3/FEFFEC68| DEBUG:  HOGE:
flush= 3/FEFFEC68 sync_priority[0] = 1| DEBUG:  write 4/111C0 flush 4/0 apply 3/FEFFECC0
 
!| DEBUG:  HOGE: flush = 4/0 sync_priority[0] = 0

This value zero of sync_priority[0] makes sync_status 'async'
errorneously and confuses Pacemaker.

# The log line marked with 'HOGE' above printed by applying the
# patch at the bottom of this message and invoking 'select
# sync_state from pg_stat_replication' periodically. To increase
# the chance to see the symptom, sleep 1 second for 'file'
# boundaries :-)

The Heikki's recent(?) commit
0ab9d1c4b31622e9176472b4276f3e9831e3d6ba which changes the format
of XLogRecPtr from logid:xrecoff struct to 64 bit linear address
would fix the false indication. But I suppose this patch won't be
applied to existing 9.1.x and 9.2.x because of the modification
onto streaming protocol.

As far as I see the patch, it would'nt change the meaning of
XLogRecPtr to change XLogRecPtrIsInvalid from (xrecoff == 0) to
(xrecoff == 0 && xlogid == 0). But this change affects rather
wide portion where handling WAL nevertheless what is needed here
is only to stop the false indication.

On the other hand, pg_basebackup seems return 0/0 as flush and
apply positions so it seems enough only to add xlogid == 0 into
the condition. The patch attached for REL9_2_1 does this and
yields the result following.
| DEBUG:  write 2/FEFFFD48 flush 2/FEFFFD48 apply 2/FEFF7AB0| DEBUG:  write 3/0 flush 2/FEFFFD48 apply 2/FEFF7E88|
DEBUG: write 3/0 flush 3/0 apply 2/FEFFFD48| DEBUG:  HOGE: flush = 2/FEFFFD48 sync_priority[0] = 1| DEBUG:  write
3/E338flush 3/0 apply 2/FEFFFF80
 
!| DEBUG:  HOGE: flush = 3/0 sync_priority[0] = 1

I think this patch should be applied for 9.2.2 and 9.1.7.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

============================================================
===== The patch for this test.
============================================================
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 064ddd5..19f79d1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -618,6 +618,10 @@ ProcessStandbyReplyMessage(void)         reply.flush.xlogid, reply.flush.xrecoff,
reply.apply.xlogid,reply.apply.xrecoff);
 
+    if (reply.write.xrecoff == 0 ||
+        reply.flush.xrecoff == 0)
+        sleep(1);
+    /*     * Update shared state for this WalSender process based on reply data from     * standby.
@@ -1561,7 +1565,10 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)             */            sync_priority[i] =
XLogRecPtrIsInvalid(walsnd->flush)?                0 : walsnd->sync_standby_priority;
 
-
+            elog(DEBUG1, "HOGE: flush = %X/%X sync_priority[%d] = %d",
+                 walsnd->flush.xlogid, walsnd->flush.xrecoff, 
+                 i, sync_priority[i]);
+                        if (walsnd->state == WALSNDSTATE_STREAMING &&                walsnd->sync_standby_priority > 0
&&               (priority == 0 || 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 064ddd5..1d4cbc4 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1555,11 +1555,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)        if (walsnd->pid != 0)        {            /*
-             * Treat a standby such as a pg_basebackup background process
-             * which always returns an invalid flush location, as an
+             * Treat a standby such as a pg_basebackup background process which
+             * always returns 0/0 (InvalidXLogRecPtr) as flush location, as an             * asynchronous standby.
       */
 
-            sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
+            sync_priority[i] = XLByteEQ(walsnd->flush, InvalidXLogRecPtr) ?                0 :
walsnd->sync_standby_priority;           if (walsnd->state == WALSNDSTATE_STREAMING && 

pgsql-hackers by date:

Previous
From: Sebastien FLAESCH
Date:
Subject: Database object names and libpq in UTF-8 locale on Windows
Next
From: Andres Freund
Date:
Subject: Re: Bugs in CREATE/DROP INDEX CONCURRENTLY