Re: [HACKERS] More race conditions in logical replication - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [HACKERS] More race conditions in logical replication
Date
Msg-id 20170725174223.cf2z3venmka4gxta@alvherre.pgsql
Whole thread Raw
In response to Re: [HACKERS] More race conditions in logical replication  (Petr Jelinek <petr.jelinek@2ndquadrant.com>)
Responses Re: [HACKERS] More race conditions in logical replication
Re: [HACKERS] More race conditions in logical replication
List pgsql-hackers
Petr Jelinek wrote:
> On 25/07/17 01:33, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> >> I'm back at looking into this again, after a rather exhausting week.  I
> >> think I have a better grasp of what is going on in this code now,
> > 
> > Here's an updated patch, which I intend to push tomorrow.
> 
> I think the ConditionVariablePrepareToSleep() call in
> ReplicationSlotAcquire() needs to be done inside the mutex lock
> otherwise there is tiny race condition where the process which has slot
> acquired might release the slot between the mutex unlock and the
> ConditionVariablePrepareToSleep() call which means we'll never get
> signaled and ConditionVariableSleep() will wait forever.

Hmm, yeah, that's not good.  However, I didn't like the idea of putting
it inside the locked area, as it's too much code.  Instead I added just
before acquiring the spinlock.  We cancel the sleep unconditionally
later on if we didn't need to sleep after all.

As I see it, we need to backpatch at least parts of this patch.  I've
received reports that in earlier branches pglogical and BDR can
sometimes leave slots behind when removing nodes, and I have a hunch
that it can be explained by the bugs being fixed here.  Now we cannot
use condition variables in back-branches, so we'll need to figure out
how to actually do it ...

> As a side note, the ConditionVariablePrepareToSleep()'s comment could be
> improved because currently it says the only advantage is that we skip
> double-test in the beginning of ConditionVariableSleep(). But that's not
> true, it's essential for preventing race conditions like the one above
> because it puts the current process into waiting list so we can be sure
> it will be signaled on broadcast once ConditionVariablePrepareToSleep()
> has been called.

Hmm.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Testlib.pm vs msys
Next
From: "Joshua D. Drake"
Date:
Subject: Re: [HACKERS] Create language syntax is not proper in pg_dumpall andnot working using pg_upgrade