Home > mailing lists

Re: Review for GetWALAvailability() - Mailing list pgsql-hackers

From	Fujii Masao
Subject	Re: Review for GetWALAvailability()
Date	June 17, 2020 02:30:41
Msg-id	f279ada0-d5e2-5284-d3fb-46b3943b6c79@oss.nttdata.com Whole thread Raw
In response to	Re: Review for GetWALAvailability() (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses	Re: Review for GetWALAvailability()
List	pgsql-hackers

Tree view


On 2020/06/17 3:50, Alvaro Herrera wrote:
> On 2020-Jun-17, Fujii Masao wrote:
> 
>> While reading InvalidateObsoleteReplicationSlots() code, I found another issue.
>>
>>             ereport(LOG,
>>                     (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
>>                             wspid, NameStr(slotname))));
>>             (void) kill(wspid, SIGTERM);
>>
>> wspid indicates the PID of process using the slot. That process can be
>> a backend, for example, executing pg_replication_slot_advance().
>> So "walsender" in the above log message is not always correct.
> 
> Good point.

So InvalidateObsoleteReplicationSlots() can terminate normal backends.
But do we want to do this? If we want, we should add the note about this
case into the docs? Otherwise the users would be surprised at termination
of backends by max_slot_wal_keep_size. I guess that it's basically rarely
happen, though.


>>             int            wspid = ReplicationSlotAcquire(NameStr(slotname),
>>                                                        SAB_Inquire);
>>
>> Why do we need to call ReplicationSlotAcquire() here and mark the slot as
>> used by the checkpointer? Isn't it enough to check directly the slot's
>> active_pid, instead?
>>
>> Maybe ReplicationSlotAcquire() is necessary because
>> ReplicationSlotRelease() is called later? If so, why do we need to call
>> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
>> active_pid is zero. No?
> 
> I think the point here was that in order to modify the slot you have to
> acquire it -- it's not valid to modify a slot you don't own.

Understood. Thanks!


>> +        /*
>> +         * Signal to terminate the process using the replication slot.
>> +         *
>> +         * Try to signal every 100ms until it succeeds.
>> +         */
>> +        if (!killed && kill(active_pid, SIGTERM) == 0)
>> +            killed = true;
>> +        ConditionVariableTimedSleep(&slot->active_cv, 100,
>> +                                    WAIT_EVENT_REPLICATION_SLOT_DROP);
>> +    } while (ReplicationSlotIsActive(slot, NULL));
> 
> Note that here you're signalling only once and then sleeping many times
> in increments of 100ms -- you're not signalling every 100ms as the
> comment claims -- unless the signal fails, but you don't really expect
> that.  On the contrary, I'd claim that the logic is reversed: if the
> signal fails, *then* you should stop signalling.

You mean; in this code path, signaling fails only when the target process
disappears just before signaling. So if it fails, slot->active_pid is
expected to become 0 even without signaling more. Right?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

pgsql-hackers by date:

From: Tom Lane
Date: 17 June 2020, 01:43:58
Subject: Re: language cleanups in code and docs

From: Alvaro Herrera
Date: 17 June 2020, 02:40:56
Subject: Re: Review for GetWALAvailability()

Re: Review for GetWALAvailability() - Mailing list pgsql-hackers

Previous

Next