[BUGS] BUG #14721: Assertion of synchronous replication - Mailing list pgsql-bugs

From const_sunny@126.com
Subject [BUGS] BUG #14721: Assertion of synchronous replication
Date
Msg-id 20170629023623.1480.26508@wrigleys.postgresql.org
Whole thread Raw
Responses Re: [BUGS] BUG #14721: Assertion of synchronous replication  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14721
Logged by:          Const Zhang
Email address:      const_sunny@126.com
PostgreSQL version: 9.6.2
Operating system:   CentOS7
Description:

Hi all!
   I have found a bug about synchronous replication.   At first, see the stack of the core file.

1
(gdb) bt
2
#0 0x00007fe9aab2e1d7 in raise () from /lib64/libc.so.6
3
#1 0x00007fe9aab2f8c8 in abort () from /lib64/libc.so.6
4
#2 0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
"!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
"FailedAssertion",
5
fileName=0xcdc140
"/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
lineNumber=294) at
/home/zl/workspace_pg962/postgres/src/backend/utils/error/assert.c:54
6
#3 0x00000000008c7e94 in SyncRepWaitForLSN (lsn=50435080, commit=1 '\001')
at /home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c:294
7
#4 0x000000000056ed11 in RecordTransactionCommit () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:1343
8
#5 0x0000000000568c96 in CommitTransaction () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2041
9
#6 0x0000000000568717 in CommitTransactionCommand () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2768
10
#7 0x000000000092eb56 in finish_xact_command () at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:2459
11
#8 0x000000000092cb37 in exec_simple_query (query_string=0x25d0de0 "insert
into x values(1,3),(1,4),(1,5),(1,6);") at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:1132
12
#9 0x000000000092bdf0 in PostgresMain (argc=1, argv=0x257f308,
dbname=0x257f168 "postgres", username=0x2550e10 "postgres") at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:4066
13
#10 0x0000000000879426 in BackendRun (port=0x2575650) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:4317
14
#11 0x0000000000878a50 in BackendStartup (port=0x2575650) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:3989
15
#12 0x000000000087509c in ServerLoop () at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1729
16
#13 0x0000000000872612 in PostmasterMain (argc=3, argv=0x254ec80) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1337
17
#14 0x0000000000795228 in main (argc=3, argv=0x254ec80) at
/home/zl/workspace_pg962/postgres/src/backend/main/main.c:228   I think it is impossible when i print something about
theassertion.
 

1
(gdb) p MyProc->syncRepLinks
2
$1 = {prev = 0x0, next = 0x0}   So, what causes this assertion? To solve my doubts, i add some debug
log. See the macro DEBUG_SUNNY as below.

1
/*
2* Wait for synchronous replication, if requested by user.
3*
4* Initially backends start in state SYNC_REP_NOT_WAITING and then
5* change that state to SYNC_REP_WAITING before adding ourselves
6* to the wait queue. During SyncRepWakeQueue() a WALSender changes
7* the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
8* This backend then resets its state to SYNC_REP_NOT_WAITING.
9*
10* 'lsn' represents the LSN to wait for.  'commit' indicates whether this
LSN
11* represents a commit record.  If it doesn't, then we wait only for the
WAL
12* to be flushed if synchronous_commit is set to the higher level of
13* remote_apply, because only commit records provide apply feedback.
14*/
15
void
16
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
17
{
18   char       *new_status = NULL;
19   const char *old_status;
20   int         mode;
21
22   /* Cap the level for anything other than commit to remote flush only.
*/
23   if (commit)
24       mode = SyncRepWaitMode;
25   else
26       mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
27
28   /*
29    * Fast exit if user has not requested sync replication, or there are
no
30    * sync replication standby names defined. Note that those standbys
don't
31    * need to be connected.
32    */
33   if (!SyncRepRequested() || !SyncStandbysDefined())
34       return;
35
36   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
37   Assert(WalSndCtl != NULL);
38
39   LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
40   Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
41
42   /*
43    * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
not
44    * set.  See SyncRepUpdateSyncStandbysDefined.
45    *
46    * Also check that the standby hasn't already replied. Unlikely race
47    * condition but we'll be fetching that cache line anyway so it's
likely
48    * to be a low cost check.
49    */
50   if (!WalSndCtl->sync_standbys_defined ||
51       lsn <= WalSndCtl->lsn[mode])
52   {
53       LWLockRelease(SyncRepLock);
54       return;
55   }
56
57   /*
58    * Set our waitLSN so WALSender will know when to wake us, and add
59    * ourselves to the queue.
60    */
61   MyProc->waitLSN = lsn;
62   MyProc->syncRepState = SYNC_REP_WAITING;
63   SyncRepQueueInsert(mode);
64   Assert(SyncRepQueueIsOrderedByLSN(mode));
65   LWLockRelease(SyncRepLock);
66
67   /* Alter ps display to show waiting for sync rep. */
68   if (update_process_title)
69   {
70       int         len;
71
72       old_status = get_ps_display(&len);
73       new_status = (char *) palloc(len + 32 + 1);
74       memcpy(new_status, old_status, len);
75       sprintf(new_status + len, " waiting for %X/%X",
76               (uint32) (lsn >> 32), (uint32) lsn);
77       set_ps_display(new_status, false);
78       new_status[len] = '\0'; /* truncate off " waiting ..." */
79   }
80
81   /*
82    * Wait for specified LSN to be confirmed.
83    *
84    * Each proc has its own wait latch, so we perform a normal latch
85    * check/wait loop here.
86    */
87   for (;;)
88   {
89       /* Must reset the latch before testing state. */
90       ResetLatch(MyLatch);
91
92       /*
93        * Acquiring the lock is not needed, the latch ensures proper
94        * barriers. If it looks like we're done, we must really be done,
95        * because once walsender changes the state to
SYNC_REP_WAIT_COMPLETE,
96        * it will never update it again, so we can't be seeing a stale
value
97        * in that case.
98        */
99       if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
100           break;
101
102       /*
103        * If a wait for synchronous replication is pending, we can
neither
104        * acknowledge the commit nor raise ERROR or FATAL.  The latter
would
105        * lead the client to believe that the transaction aborted, which
is
106        * not true: it's already committed locally. The former is no good
107        * either: the client has requested synchronous replication, and
is
108        * entitled to assume that an acknowledged commit is also
replicated,
109        * which might not be true. So in this case we issue a WARNING
(which
110        * some clients may be able to interpret) and shut off further
output.
111        * We do NOT reset ProcDiePending, so that the process will die
after
112        * the commit is cleaned up.
113        */
114       if (ProcDiePending)
115       {
116           ereport(WARNING,
117                   (errcode(ERRCODE_ADMIN_SHUTDOWN),
118                    errmsg("canceling the wait for synchronous replication
and terminating connection due to administrator command"),
119                    errdetail("The transaction has already committed
locally, but might not have been replicated to the standby.")));
120           whereToSendOutput = DestNone;
121           SyncRepCancelWait();
122           break;
123       }
124
125       /*
126        * It's unclear what to do if a query cancel interrupt arrives.
We
127        * can't actually abort at this point, but ignoring the interrupt
128        * altogether is not helpful, so we just terminate the wait with a
129        * suitable warning.
130        */
131       if (QueryCancelPending)
132       {
133           QueryCancelPending = false;
134           ereport(WARNING,
135                   (errmsg("canceling wait for synchronous replication due
to user request"),
136                    errdetail("The transaction has already committed
locally, but might not have been replicated to the standby.")));
137           SyncRepCancelWait();
138           break;
139       }
140
141       /*
142        * If the postmaster dies, we'll probably never get an
143        * acknowledgement, because all the wal sender processes will exit.
So
144        * just bail out.
145        */
146       if (!PostmasterIsAlive())
147       {
148           ProcDiePending = true;
149           whereToSendOutput = DestNone;
150           SyncRepCancelWait();
151           break;
152       }
153
154       /*
155        * Wait on latch.  Any condition that should wake us up will set
the
156        * latch, so no need for timeout.
157        */
158       WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
159   }
160
161
#ifdef DEBUG_SUNNY
162   if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
163       ereport(LOG,
164           (errmsg("[SUNNY] It is impossible. [lsn] %X/%X [prev] %p [next]
%p",
165               (uint32) (MyProc->waitLSN >> 32), (uint32)
MyProc->waitLSN,
166               MyProc->syncRepLinks.prev, MyProc->syncRepLinks.next)));
167
#endif
168
169   /*
170    * WalSender has checked our LSN and has removed us from queue. Clean
up
171    * state and leave.  It's OK to reset these shared memory fields
without
172    * holding SyncRepLock, because any walsenders will ignore us anyway
when
173    * we're not on the queue.
174    */
175   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
176   MyProc->syncRepState = SYNC_REP_NOT_WAITING;
177   MyProc->waitLSN = 0;
178
179   if (new_status)
180   {
181       /* Reset ps display */
182       set_ps_display(new_status, false);
183       pfree(new_status);
184   }
185
}
186
187
/*
188* Insert MyProc into the specified SyncRepQueue, maintaining sorted
invariant.
189*
190* Usually we will go at tail of queue, though it's possible that we
arrive
191* here out of order, so start at tail and work back to insertion point.
192*/
193
static void
194
SyncRepQueueInsert(int mode)
195
{
196   PGPROC     *proc;
197
198   Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
199   proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
200                                  &(WalSndCtl->SyncRepQueue[mode]),
201                                  offsetof(PGPROC, syncRepLinks));
202
203   while (proc)
204   {
205       /*
206        * Stop at the queue element that we should after to ensure the
queue
207        * is ordered by LSN.
208        */
209       if (proc->waitLSN < MyProc->waitLSN)
210           break;
211
212       proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
213                                      &(proc->syncRepLinks),
214                                      offsetof(PGPROC, syncRepLinks));
215   }
216
217   if (proc)
218       SHMQueueInsertAfter(&(proc->syncRepLinks),
&(MyProc->syncRepLinks));
219   else
220       SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[mode]),
&(MyProc->syncRepLinks));
221
222
#ifdef DEBUG_SUNNY
223   ereport(LOG,
224       (errmsg("[SUNNY] Insert [lsn] %X/%X [prev] %p [next] %p",
225           (uint32) (MyProc->waitLSN >> 32), (uint32) MyProc->waitLSN,
226           MyProc->syncRepLinks.prev, MyProc->syncRepLinks.next)));
227
#endif
228
}
229
230
/*
231* Walk the specified queue from head.  Set the state of any backends that
232* need to be woken, remove them from the queue, and then wake them.
233* Pass all = true to wake whole queue; otherwise, just wake up to
234* the walsender's LSN.
235*
236* Must hold SyncRepLock.
237*/
238
static int
239
SyncRepWakeQueue(bool all, int mode)
240
{
241   volatile WalSndCtlData *walsndctl = WalSndCtl;
242   PGPROC     *proc = NULL;
243   PGPROC     *thisproc = NULL;
244   int         numprocs = 0;
245
246   Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
247   Assert(SyncRepQueueIsOrderedByLSN(mode));
248
249   proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
250                                  &(WalSndCtl->SyncRepQueue[mode]),
251                                  offsetof(PGPROC, syncRepLinks));
252
253   while (proc)
254   {
255       /*
256        * Assume the queue is ordered by LSN
257        */
258       if (!all && walsndctl->lsn[mode] < proc->waitLSN)
259           return numprocs;
260
261       /*
262        * Move to next proc, so we can delete thisproc from the queue.
263        * thisproc is valid, proc may be NULL after this.
264        */
265       thisproc = proc;
266       proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
267                                      &(proc->syncRepLinks),
268                                      offsetof(PGPROC, syncRepLinks));
269
270       /*
271        * Remove thisproc from queue.
272        */
273       SHMQueueDelete(&(thisproc->syncRepLinks));
274
275       /*
276        * Set state to complete; see SyncRepWaitForLSN() for discussion
of
277        * the various states.
278        */
279       thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
280
281
#ifdef DEBUG_SUNNY
282       ereport(LOG,
283           (errmsg("[SUNNY] Delete [lsn] %X/%X [prev] %p [next] %p",
284               (uint32) (thisproc->waitLSN >> 32), (uint32)
thisproc->waitLSN,
285               thisproc->syncRepLinks.prev,
thisproc->syncRepLinks.next)));
286
#endif
287       /*
288        * Wake only when we have set state and removed from queue.
289        */
290       SetLatch(&(thisproc->procLatch));
291
292       numprocs++;
293   }
294
295   return numprocs;
296
}
   Then i made pressure test by benchmark, and got some log as below.

1
2017-06-28 02:51:33.868
CST,"benchmarksql","postgres",77171,"10.20.16.227:43879",595271c7.12d73,176627,"COMMIT
PREPARED",2017-06-27 22:55:03 CST,769/7909,0,LOG,00000,"[SUNNY] Insert [lsn]
32/7131AB58 [prev] 0x2b614633d6a8 [next] 0x2b61446f1b18",,,,,,"COMMIT
PREPARED 'T11860878'",,,"pgxc"
2
3
2017-06-28 02:51:33.868
CST,"benchmarksql","postgres",77171,"10.20.16.227:43879",595271c7.12d73,176628,"COMMIT
PREPARED waiting for 32/7131AB58",2017-06-27 22:55:03
CST,769/7909,0,LOG,00000,"[SUNNY] It is impossible. [lsn] 32/7131AB58 [prev]
0x2b614633d6a8 [next] 0x2b61446f1b18",,,,,,"COMMIT PREPARED
'T11860878'",,,"pgxc"
4
5
2017-06-28 02:51:33.870
CST,"pgxcn","",52856,"10.20.16.214:43102",59522322.ce78,4758326,"streaming
32/7131AB58",2017-06-27 17:19:30 CST,1/0,0,LOG,00000,"[SUNNY] Delete [lsn]
32/7131AB58 [prev] (nil) [next] (nil)",,,,,,,,,"slave"   You can find the "DELETE" log is later than the "IMPOSSIBLE"
log.What 
conditions does this happen under?


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 At last, i have made this bug reappear by GDB follow these steps.
 
1. In wal sender process, add a breakpoint at code line
"SHMQueueDelete(&(thisproc->syncRepLinks)); " of "SyncRepWakeQueue".

SUNNY
1
/*
2* Walk the specified queue from head.  Set the state of any backends that
3* need to be woken, remove them from the queue, and then wake them.
4* Pass all = true to wake whole queue; otherwise, just wake up to
5* the walsender's LSN.
6*
7* Must hold SyncRepLock.
8*/
9
static int
10
SyncRepWakeQueue(bool all, int mode)
11
{
12   volatile WalSndCtlData *walsndctl = WalSndCtl;
13   PGPROC     *proc = NULL;
14   PGPROC     *thisproc = NULL;
15   int         numprocs = 0;
16
17   Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
18   Assert(SyncRepQueueIsOrderedByLSN(mode));
19
20   proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
21                                  &(WalSndCtl->SyncRepQueue[mode]),
22                                  offsetof(PGPROC, syncRepLinks));
23
24   while (proc)
25   {
26       /*
27        * Assume the queue is ordered by LSN
28        */
29       if (!all && walsndctl->lsn[mode] < proc->waitLSN)
30           return numprocs;
31
32       /*
33        * Move to next proc, so we can delete thisproc from the queue.
34        * thisproc is valid, proc may be NULL after this.
35        */
36       thisproc = proc;
37       proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
38                                      &(proc->syncRepLinks),
39                                      offsetof(PGPROC, syncRepLinks));
40
41       /*
42        * Set state to complete; see SyncRepWaitForLSN() for discussion
of
43        * the various states.
44        */
45       thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;
46
47       /*
48        * Remove thisproc from queue.
49        */
50       SHMQueueDelete(&(thisproc->syncRepLinks));
51
52
#ifdef DEBUG_SUNNY
53       ereport(LOG,
54           (errmsg("[SUNNY] Delete [lsn] %X/%X [prev] %p [next] %p",
55               (uint32) (thisproc->waitLSN >> 32), (uint32)
thisproc->waitLSN,
56               thisproc->syncRepLinks.prev,
thisproc->syncRepLinks.next)));
57
#endif
58       /*
59        * Wake only when we have set state and removed from queue.
60        */
61       SetLatch(&(thisproc->procLatch));
62
63       numprocs++;
64   }
65
66   return numprocs;
67
}
2. In backend process, add a breakpoint at code line "if
(MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)" of "SyncRepWaitForLSN".

1
/*
2* Wait for synchronous replication, if requested by user.
3*
4* Initially backends start in state SYNC_REP_NOT_WAITING and then
5* change that state to SYNC_REP_WAITING before adding ourselves
6* to the wait queue. During SyncRepWakeQueue() a WALSender changes
7* the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
8* This backend then resets its state to SYNC_REP_NOT_WAITING.
9*
10* 'lsn' represents the LSN to wait for.  'commit' indicates whether this
LSN
11* represents a commit record.  If it doesn't, then we wait only for the
WAL
12* to be flushed if synchronous_commit is set to the higher level of
13* remote_apply, because only commit records provide apply feedback.
14*/
15
void
16
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
17
{
18   char       *new_status = NULL;
19   const char *old_status;
20   int         mode;
21
22   /* Cap the level for anything other than commit to remote flush only.
*/
23   if (commit)
24       mode = SyncRepWaitMode;
25   else
26       mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
27
28   /*
29    * Fast exit if user has not requested sync replication, or there are
no
30    * sync replication standby names defined. Note that those standbys
don't
31    * need to be connected.
32    */
33   if (!SyncRepRequested() || !SyncStandbysDefined())
34       return;
35
36   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
37   Assert(WalSndCtl != NULL);
38
39   LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
40   Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
41
42   /*
43    * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
not
44    * set.  See SyncRepUpdateSyncStandbysDefined.
45    *
46    * Also check that the standby hasn't already replied. Unlikely race
47    * condition but we'll be fetching that cache line anyway so it's
likely
48    * to be a low cost check.
49    */
50   if (!WalSndCtl->sync_standbys_defined ||
51       lsn <= WalSndCtl->lsn[mode])
52   {
53       LWLockRelease(SyncRepLock);
54       return;
55   }
56
57   /*
58    * Set our waitLSN so WALSender will know when to wake us, and add
59    * ourselves to the queue.
60    */
61   MyProc->waitLSN = lsn;
62   MyProc->syncRepState = SYNC_REP_WAITING;
63   SyncRepQueueInsert(mode);
64   Assert(SyncRepQueueIsOrderedByLSN(mode));
65   LWLockRelease(SyncRepLock);
66
67   /* Alter ps display to show waiting for sync rep. */
68   if (update_process_title)
69   {
70       int         len;
71
72       old_status = get_ps_display(&len);
73       new_status = (char *) palloc(len + 32 + 1);
74       memcpy(new_status, old_status, len);
75       sprintf(new_status + len, " waiting for %X/%X",
76               (uint32) (lsn >> 32), (uint32) lsn);
77       set_ps_display(new_status, false);
78       new_status[len] = '\0'; /* truncate off " waiting ..." */
79   }
80
81   /*
82    * Wait for specified LSN to be confirmed.
83    *
84    * Each proc has its own wait latch, so we perform a normal latch
85    * check/wait loop here.
86    */
87   for (;;)
88   {
89       /* Must reset the latch before testing state. */
90       ResetLatch(MyLatch);
91
92       /*
93        * Acquiring the lock is not needed, the latch ensures proper
94        * barriers. If it looks like we're done, we must really be done,
95        * because once walsender changes the state to
SYNC_REP_WAIT_COMPLETE,
96        * it will never update it again, so we can't be seeing a stale
value
97        * in that case.
98        */
99       if (MyProc->syncRepState == SYNC_REP_WAIT_COMPLETE)
100           break;
101
102       /*
103        * If a wait for synchronous replication is pending, we can
neither
104        * acknowledge the commit nor raise ERROR or FATAL.  The latter
would
105        * lead the client to believe that the transaction aborted, which
is
106        * not true: it's already committed locally. The former is no good
107        * either: the client has requested synchronous replication, and
is
108        * entitled to assume that an acknowledged commit is also
replicated,
109        * which might not be true. So in this case we issue a WARNING
(which
110        * some clients may be able to interpret) and shut off further
output.
111        * We do NOT reset ProcDiePending, so that the process will die
after
112        * the commit is cleaned up.
113        */
114       if (ProcDiePending)
115       {
116           ereport(WARNING,
117                   (errcode(ERRCODE_ADMIN_SHUTDOWN),
118                    errmsg("canceling the wait for synchronous replication
and terminating connection due to administrator command"),
119                    errdetail("The transaction has already committed
locally, but might not have been replicated to the standby.")));
120           whereToSendOutput = DestNone;
121           SyncRepCancelWait();
122           break;
123       }
124
125       /*
126        * It's unclear what to do if a query cancel interrupt arrives.
We
127        * can't actually abort at this point, but ignoring the interrupt
128        * altogether is not helpful, so we just terminate the wait with a
129        * suitable warning.
130        */
131       if (QueryCancelPending)
132       {
133           QueryCancelPending = false;
134           ereport(WARNING,
135                   (errmsg("canceling wait for synchronous replication due
to user request"),
136                    errdetail("The transaction has already committed
locally, but might not have been replicated to the standby.")));
137           SyncRepCancelWait();
138           break;
139       }
140
141       /*
142        * If the postmaster dies, we'll probably never get an
143        * acknowledgement, because all the wal sender processes will exit.
So
144        * just bail out.
145        */
146       if (!PostmasterIsAlive())
147       {
148           ProcDiePending = true;
149           whereToSendOutput = DestNone;
150           SyncRepCancelWait();
151           break;
152       }
153
154       /*
155        * Wait on latch.  Any condition that should wake us up will set
the
156        * latch, so no need for timeout.
157        */
158       WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
159   }
160
161
#ifdef DEBUG_SUNNY
162   if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
163       ereport(LOG,
164           (errmsg("[SUNNY] It is impossible. [lsn] %X/%X [prev] %p [next]
%p",
165               (uint32) (MyProc->waitLSN >> 32), (uint32)
MyProc->waitLSN,
166               MyProc->syncRepLinks.prev, MyProc->syncRepLinks.next)));
167
#endif
168
169   /*
170    * WalSender has checked our LSN and has removed us from queue. Clean
up
171    * state and leave.  It's OK to reset these shared memory fields
without
172    * holding SyncRepLock, because any walsenders will ignore us anyway
when
173    * we're not on the queue.
174    */
175   Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
176   MyProc->syncRepState = SYNC_REP_NOT_WAITING;
177   MyProc->waitLSN = 0;
178
179   if (new_status)
180   {
181       /* Reset ps display */
182       set_ps_display(new_status, false);
183       pfree(new_status);
184   }
185
}
3. Execute a SQL whatever will generate tansaction log by psql.

.
1
[zl@INTEL175 ~/workspace_pg962/project]$ psql -p 8000 -U postgres
2
psql (9.6.2)
3
Type "help" for help.
4
5
postgres=# insert into x values(1,3),(1,4),(1,5),(1,6);
4. Hold the breakpoint in wal sender process and step next in backend
process. Then a assertion core file will be found.

1
(gdb) bt
2
#0  0x00007fa769cb41d7 in raise () from /lib64/libc.so.6
3
#1  0x00007fa769cb58c8 in abort () from /lib64/libc.so.6
4
#2  0x0000000000af0699 in ExceptionalCondition (conditionName=0xcdc111
"!(SHMQueueIsDetached(&(MyProc->syncRepLinks)))", errorType=0xb6c443
"FailedAssertion", 
5   fileName=0xcdc140
"/home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c",
lineNumber=294) at
/home/zl/workspace_pg962/postgres/src/backend/utils/error/assert.c:54
6
#3  0x00000000008c7e94 in SyncRepWaitForLSN (lsn=50438264, commit=1 '\001')
at /home/zl/workspace_pg962/postgres/src/backend/replication/syncrep.c:294
7
#4  0x000000000056ed11 in RecordTransactionCommit () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:1343
8
#5  0x0000000000568c96 in CommitTransaction () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2041
9
#6  0x0000000000568717 in CommitTransactionCommand () at
/home/zl/workspace_pg962/postgres/src/backend/access/transam/xact.c:2768
10
#7  0x000000000092eb56 in finish_xact_command () at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:2459
11
#8  0x000000000092cb37 in exec_simple_query (query_string=0x1194de0 "insert
into x values(1,3),(1,4),(1,5),(1,6);") at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:1132
12
#9  0x000000000092bdf0 in PostgresMain (argc=1, argv=0x1143308,
dbname=0x1143168 "postgres", username=0x1114e10 "postgres") at
/home/zl/workspace_pg962/postgres/src/backend/tcop/postgres.c:4066
13
#10 0x0000000000879426 in BackendRun (port=0x1139650) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:4317
14
#11 0x0000000000878a50 in BackendStartup (port=0x1139650) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:3989
15
#12 0x000000000087509c in ServerLoop () at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1729
16
#13 0x0000000000872612 in PostmasterMain (argc=3, argv=0x1112c80) at
/home/zl/workspace_pg962/postgres/src/backend/postmaster/postmaster.c:1337
17
#14 0x0000000000795228 in main (argc=3, argv=0x1112c80) at
/home/zl/workspace_pg962/postgres/src/backend/main/main.c:228
18
(gdb) 
   Is this a bug? And how to slove it? Looking forward to your reply.
Thanks!   Sorry about my poor english O(∩_∩)O

                                           Const Sunny                                           2017-6-29
                             


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: bricklen
Date:
Subject: Re: [BUGS] Multi-Master Replication
Next
From: Thomas Munro
Date:
Subject: Re: [BUGS] BUG #14721: Assertion of synchronous replication