Re: Review for GetWALAvailability() - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: Review for GetWALAvailability() |
Date | |
Msg-id | 20200617.173058.579037591051032616.horikyota.ntt@gmail.com Whole thread Raw |
In response to | Re: Review for GetWALAvailability() (Fujii Masao <masao.fujii@oss.nttdata.com>) |
Responses |
Re: Review for GetWALAvailability()
|
List | pgsql-hackers |
At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/06/17 12:10, Kyotaro Horiguchi wrote: > > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera > > <alvherre@2ndquadrant.com> wrote in > >> On 2020-Jun-17, Fujii Masao wrote: > >>> On 2020/06/17 3:50, Alvaro Herrera wrote: > >> > >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends. > >>> But do we want to do this? If we want, we should add the note about > >>> this > >>> case into the docs? Otherwise the users would be surprised at > >>> termination > >>> of backends by max_slot_wal_keep_size. I guess that it's basically > >>> rarely > >>> happen, though. > >> > >> Well, if we could distinguish a walsender from a non-walsender > >> process, > >> then maybe it would make sense to leave backends alive. But do we > >> want > >> that? I admit I don't know what would be the reason to have a > >> non-walsender process with an active slot, so I don't have a good > >> opinion on what to do in this case. > > The non-walsender backend is actually doing replication work. It > > rather should be killed? > > I have no better opinion about this. So I agree to leave the logic as > it is > at least for now, i.e., we terminate the process owning the slot > whatever > the type of process is. Agreed. > >>>>> + /* > >>>>> + * Signal to terminate the process using the replication slot. > >>>>> + * > >>>>> + * Try to signal every 100ms until it succeeds. > >>>>> + */ > >>>>> + if (!killed && kill(active_pid, SIGTERM) == 0) > >>>>> + killed = true; > >>>>> + ConditionVariableTimedSleep(&slot->active_cv, 100, > >>>>> + WAIT_EVENT_REPLICATION_SLOT_DROP); > >>>>> + } while (ReplicationSlotIsActive(slot, NULL)); > >>>> > >>>> Note that here you're signalling only once and then sleeping many > >>>> times > >>>> in increments of 100ms -- you're not signalling every 100ms as the > >>>> comment claims -- unless the signal fails, but you don't really expect > >>>> that. On the contrary, I'd claim that the logic is reversed: if the > >>>> signal fails, *then* you should stop signalling. > >>> > >>> You mean; in this code path, signaling fails only when the target > >>> process > >>> disappears just before signaling. So if it fails, slot->active_pid is > >>> expected to become 0 even without signaling more. Right? > >> > >> I guess kill() can also fail if the PID now belongs to a process owned > >> by a different user. > > Yes. This case means that the PostgreSQL process using the slot > disappeared > and the same PID was assigned to non-PostgreSQL process. So if kill() > fails > for this reason, we don't need to kill() again. > > > I think we've disregarded very quick reuse of > >> PIDs, so we needn't concern ourselves with it. > > The first time call to ConditionVariableTimedSleep doen't actually > > sleep, so the loop works as expected. But we may make an extra call > > to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the > > loop would make it better. > > Sorry I failed to understand your point... My point is the ConditionVariableTimedSleep does *not* sleep on the CV first time in this usage. The new version anyway avoids useless kill(2) call, but still may make an extra call to ReplicationSlotAcquireInternal. I think we should call ConditionVariablePrepareToSleep before the sorrounding for statement block. > Anyway, the attached is the updated version of the patch. This fixes > all the issues in InvalidateObsoleteReplicationSlots() that I reported > upthread. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: