Re: [HACKERS] Restricting maximum keep segments by repslots - Mailing list pgsql-hackers
From | Jehan-Guillaume de Rorthais |
---|---|
Subject | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date | |
Msg-id | 20200122174723.4a26ca12@firost Whole thread Raw |
In response to | Re: [HACKERS] Restricting maximum keep segments by repslots (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses |
Re: [HACKERS] Restricting maximum keep segments by repslots
|
List | pgsql-hackers |
Hi, First, it seems you did not reply to Alvaro's concerns in your new set of patch. See: https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql On Tue, 24 Dec 2019 21:26:14 +0900 (JST) Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: [...] > > Indeed, "loosing" is a better match for this state. > > > > However, what's the point of this state from the admin point of view? In > > various situation, the admin will have no time to react immediately and fix > > whatever could help. > > > > How useful is this specific state? > > If we assume "losing" segments as "lost", a segment once "lost" can > return to "keeping" or "streaming" state. That is intuitively > impossible. On the other hand if we assume it as "keeping", it should > not be removed by the next checkpoint but actually it can be > removed. The state "losing" means such a unstable state different from > both "lost" and "keeping". OK, indeed. But I'm still unconfortable with this "unstable" state. It would be better if we could grab a stable state: either "keeping" or "lost". > > + <entry>Availability of WAL records claimed by this > > + slot. <literal>streaming</literal>, <literal>keeping</literal>, > > > > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of > > WAL files claimed by this slot"? > > I choosed "record" since a slot points a record. I'm not sure but I'm > fine with "file". Fixed catalogs.sgml and config.sgml that way. Thanks. > > + <literal>streaming</literal> means that the claimed records are > > + available within max_wal_size. <literal>keeping</literal> means > > > > I wonder if streaming is the appropriate name here. The WALs required might > > be available for streaming, but the slot not active, thus not "streaming". > > What about merging with the "active" field, in the same fashion as > > pg_stat_activity.state? We would have an enum "pg_replication_slots.state" > > with the following states: > > * inactive: non active slot > > * active: activated, required WAL within max_wal_size > > * keeping: activated, max_wal_size is exceeded but still held by replication > > slots or wal_keep_segments. > > * lost: some WAL are definitely lost > > > > Thoughts? > > In the first place, I realized that I was missed a point about the > relationship between max_wal_size and max_slot_wal_keep_size > here. Since the v15 of this patch, GetLsnAvailablity returns > "streaming" when the restart_lsn is within max_wal_size. That behavior > makes sense when max_slot_wal_keep_size > max_wal_size. However, in > the contrary case, restart_lsn could be lost even it is > withinmax_wal_size. So we would see "streaming" (or "normal") even > though restart_lsn is already lost. That is broken. > > In short, the "streaming/normal" state is useless if > max_slot_wal_keep_size < max_wal_size. Good catch! > Finally I used the following wordings. > > (there's no "inactive" wal_state) > > * normal: required WAL within max_wal_size when max_slot_wal_keep_size > is larger than max_wal_size. > * keeping: required segments are held by replication slots or > wal_keep_segments. > > * losing: required segments are about to be removed or may be already > removed but streaming is not dead yet. As I wrote, I'm still uncomfortable with this state. Maybe we should ask other reviewers opinions on this. [...] > > WARNING: some replication slots have lost required WAL segments > > DETAIL: Slot slot_limit_st lost 177 segment(s) > > > > I wonder if this is useful to show these messages for slots that were > > already dead before this checkpoint? > > Makes sense. I changed KeepLogSeg so that it emits the message only on > slot_names changes. Thanks. Bellow some code review. In regard with FindOldestXLogFileSegNo(...): > /* > * Return the oldest WAL segment file. > * > * The returned value is XLogGetLastRemovedSegno() + 1 when the function > * returns a valid value. Otherwise this function scans over WAL files and > * finds the oldest segment at the first time, which could be very slow. > */ > XLogSegNo > FindOldestXLogFileSegNo(void) The comment is not clear to me. I suppose "at the first time" might better be expressed as "if none has been removed since last startup"? Moreover, what about patching XLogGetLastRemovedSegno() itself instead of adding a function? In regard with GetLsnAvailability(...): > /* > * Detect availability of the record at given targetLSN. > * > * targetLSN is restart_lsn of a slot. Wrong argument name. It is called "restart_lsn" in the function declaration. > * restBytes is the pointer to uint64 variable, to store the remaining bytes > * until the slot goes into "losing" state. I'm not convinced with this argument name. What about "remainingBytes"? Note that you use remaining_bytes elsewhere in your patch. > * -1 is stored to restBytes if the values is useless. What about returning a true negative value when the slot is really lost? All in all, I feel like this function is on the fence between being generic because of its name and being slot-only oriented because of the first parameter name, use of "max_slot_wal_keep_size_mb", returned status and "slotPtr". I wonder if it should be more generic and stay here or move to xlogfuncs.c with a more specific name? > * slot limitation is not activated, WAL files are kept unlimitedlllly "unlimitedly"? "infinitely"? "unconditionally"? > /* it is useless for the states below */ > *restBytes = -1; This might be set to the real bytes kept, even if status is "losing". > * The segment is alrady lost or being lost. If the oldest segment is just "already" > if (oldestSeg == restartSeg + 1 && walsender_pid != 0) > return "losing"; I wonder if this should be "oldestSeg > restartSeg"? Many segments can be removed by the next or running checkpoint. And a running walsender can send more than one segment in the meantime I suppose? In regard with GetOldestKeepSegment(...): > static XLogSegNo > GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN, > XLogRecPtr targetLSN, int64 *restBytes) I wonder if minSlotLSN is really useful as a parameter or if it should be fetched from GetOldestKeepSegment() itself? Currently, XLogGetReplicationSlotMinimumLSN() is always called right before GetOldestKeepSegment() just to fill this argument. > walstate = > GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes); I agree with Alvaro: we might want to return an enum and define the related state string here. Or, if we accept negative remaining_bytes, GetLsnAvailability might even only return remaining_bytes and we deduce the state directly from here. Regards,
pgsql-hackers by date: