Re: [BUGS] BUG #14721: Assertion of synchronous replication - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: [BUGS] BUG #14721: Assertion of synchronous replication
Date
Msg-id CAEepm=3k=gTAn5=X_Qv=hWw9JnUxUMXCzBxTKPaHHXxKkF0+iw@mail.gmail.com
Whole thread Raw
In response to [BUGS] BUG #14721: Assertion of synchronous replication  (const_sunny@126.com)
Responses Re: [BUGS] BUG #14721: Assertion of synchronous replication  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-bugs
On Thu, Jun 29, 2017 at 2:36 PM,  <const_sunny@126.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14721
> Logged by:          Const Zhang
> Email address:      const_sunny@126.com
> PostgreSQL version: 9.6.2
> Operating system:   CentOS7

Hi Const,

Thanks for the detailed report.  What type of CPU is this running on?

>     I have found a bug about synchronous replication.
>     At first, see the stack of the core file.
>
> 1
> (gdb) bt
> 2
> #0 0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
> 3
> #1 0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
> 4
> #2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
> "!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
> "FailedAssertion",
> 5
> fileName=0xcdc140
> "/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
> lineNumber=294) at

I assume this is line 298 in master today.  There is some interesting
IPC going on here.  SyncRepWakeQueue() does three things while holding
SyncRepLock in some other process:
 thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE; SHMQueueDelete(&(thisproc->syncRepLinks));
SetLatch(&(thisproc->procLatch));

Meanwhile SyncRepWaitForLSN() in your process does this:
       /*        * Acquiring the lock is not needed, the latch ensures proper        * barriers. If it looks like we're
done,we must really be done,        * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,        * it
willnever update it again, so we can't be seeing a stale value        * in that case.        */       if
(MyProc->syncRepState== SYNC_REP_WAIT_COMPLETE)           break;
 

... OK, then outside the loop:
   /*    * WalSender has checked our LSN and has removed us from queue. Clean up    * state and leave.  It's OK to
resetthese shared memory fields without    * holding SyncRepLock, because any walsenders will ignore us anyway when
*we're not on the queue.    */   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
 

I wonder your CPU core was able to see syncRepState ==
SYNC_REP_WAIT_COMPLETE, but not yet see MyProc->syncRepLinks's next
and previous members as 0.  When you look at them in your debugger you
see them as zero, but that's a bit later:

> (gdb) p MyProc->syncRepLinks
> 2
> $1 = {prev = 0x0, next = 0x0}

I may be way off base here, and haven't studied all of your report
yet.  But my first thought is: shouldn't SyncRepWakeQueue() do things
the other way around, with a barrier in between, like this:
 SHMQueueDelete(&(thisproc->syncRepLinks)); pg_write_barrier(); thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
SetLatch(&(thisproc->procLatch));

... and then shouldn't SyncRepWaitForLSN() have a pg_read_barrier()
inserted before Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)))?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: const_sunny@126.com
Date:
Subject: [BUGS] BUG #14721: Assertion of synchronous replication
Next
From: "const_sunny@126.com"
Date:
Subject: [BUGS] Assertion of synchronous replication