Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | CAHut+Pup-3tRFsvG_W6pb6UoVnMfFD-q7i5QzXDNAgqph=hA5A@mail.gmail.com Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (vignesh C <vignesh21@gmail.com>) |
List | pgsql-hackers |
Hi Nisha, Some review comments for v71-0001. ====== src/backend/access/transam/xlog.c 1. XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); KeepLogSeg(recptr, &_logSegNo); - if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED, + if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT, _logSegNo, InvalidOid, InvalidTransactionId)) { @@ -7792,7 +7792,7 @@ CreateRestartPoint(int flags) replayPtr = GetXLogReplayRecPtr(&replayTLI); endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr; KeepLogSeg(endptr, &_logSegNo); - if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED, + if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT, _logSegNo, InvalidOid, InvalidTransactionId)) It seems fundamentally strange to me to assign multiple simultaneous causes like this. IMO you can't invalidate something that is invalid already. I gues v71 was an attempt to implement Amit's: > > Can we try and see how the patch looks if we try to invalidate the > > slot due to idle time at the same time when we are trying to > > invalidate due to WAL? But, AFAICT the current code now has a confused mixture of: 'cause' parameter meaning "this is the invalidation cause", versus 'cause' parameter meaning "here is a mask of possible causes" ====== src/backend/replication/slot.c SlotInvalidationCauses[] 2. [RS_INVAL_WAL_REMOVED] = "wal_removed", [RS_INVAL_HORIZON] = "rows_removed", [RS_INVAL_WAL_LEVEL] = "wal_level_insufficient", + [RS_INVAL_IDLE_TIMEOUT] = "idle_timeout", }; By using bit flags in the enum (see slot.h) and designated initializers here in SlotInvalidationCauses[], you'll end up with 9 entries (0-0x08) instead of 4, and the other undesignated entries will be all NULL. Maybe it is intended, but if it is I think it is strange to be indexing by bit flags so at least you should have a comment. If you really need bitflags then perhaps it is better to maintain them in addition to the v70 enum values (??) ~~~ 3. /* Maximum number of invalidation causes */ -#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL +#define RS_INVAL_MAX_CAUSES RS_INVAL_IDLE_TIMEOUT Hmm. The impact of using bit flags has (probably) unintended consequences. e.g. Now you've made the GetSlotInvalidationCause() function worse than before because now it will be iterating over all the undesignated NULL entries of the array when searching for the matching cause. ~~~ 4. + /* Calculate the idle time duration of the slot */ + elapsed_secs = (now - inactive_since) / USECS_PER_SEC; + minutes = elapsed_secs / SECS_PER_MINUTE; + secs = elapsed_secs % SECS_PER_MINUTE; + + /* translator: %s is a GUC variable name */ + appendStringInfo(&err_detail, _("The slot's idle time of %d minutes and %d seconds exceeds the configured \"%s\" duration."), + minutes, secs, "idle_replication_slot_timeout"); Idleness timeout durations defined like 1d aren't going to look pretty using this log format. We already discussed off-list about how to make this better, but not done yet? ~~~ 5. + if (cause & RS_INVAL_HORIZON) + { + if (!SlotIsLogical(s)) break; The meaning of the 'break' here is different to before. Now breaking the entire for-loop instead of just breaking from the switch. (same already posted by Vignesh) ~~~ 6. ReportSlotInvalidation(invalidation_cause, true, active_pid, slotname, restart_lsn, - oldestLSN, snapshotConflictHorizon); + oldestLSN, snapshotConflictHorizon, + inactive_since, now); if (MyBackendType == B_STARTUP) (void) SendProcSignal(active_pid, @@ -1785,7 +1881,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, ReportSlotInvalidation(invalidation_cause, false, active_pid, slotname, restart_lsn, - oldestLSN, snapshotConflictHorizon); + oldestLSN, snapshotConflictHorizon, + inactive_since, now); If the cause was not already (masked with) RS_INVAL_IDLE_TIMEOUT then AFAICT 'now' will still be 0 here. This seems an unexpected quirk, which at best is quite misleading. Even if the code sty like this I felt ReportSlotInvalidation should Assert 'now' must be 0 unless the cause passed was RS_INVAL_IDLE_TIMEOUT. ~~~ CheckPointReplicationSlots: 7. /* - * Flush all replication slots to disk. + * Flush all replication slots to disk. Also, invalidate obsolete slots during + * non-shutdown checkpoint. Since the v70 code was removed in v71, the function now is the same as master. So did we need the function comment change? ====== src/include/replication/slot.h 8. - * When adding a new invalidation cause here, remember to update + * When adding a new invalidation cause here, the value must be powers of 2 + * (e.g., 1, 2, 4...) for proper bitwise operations. Also, remember to update * SlotInvalidationCauses and RS_INVAL_MAX_CAUSES. */ typedef enum ReplicationSlotInvalidationCause { - RS_INVAL_NONE, + RS_INVAL_NONE = 0x00, /* required WAL has been removed */ - RS_INVAL_WAL_REMOVED, + RS_INVAL_WAL_REMOVED = 0x01, /* required rows have been removed */ - RS_INVAL_HORIZON, + RS_INVAL_HORIZON = 0x02, /* wal_level insufficient for slot */ - RS_INVAL_WAL_LEVEL, + RS_INVAL_WAL_LEVEL = 0x04, + /* idle slot timeout has occurred */ + RS_INVAL_IDLE_TIMEOUT = 0x08, } ReplicationSlotInvalidationCause; 8a. IMO enums are intended for discrete values like "red" or "blue", but not combinations of values like "reddy-bluey". AFAIK this kind of usage is not normal and is discouraged in C programming. So if you need bitflags then really the bit flags should be #define etc. ~ 8b. Does it make sense? You can't invalidate something that is already invalid, so what does it even mean to have multiple simultaneous ReplicationSlotInvalidationCause values? AFAICT it was only done like this to CHECK for multiple **possible** causes, but this point is not very clear ~ 8c. This introduces a side-effect that now the char *const SlotInvalidationCauses[] array in slot.c will have 8 entries, half of them NULL. Already mentioned elsewhere. And, this will get increasingly worse if more invalidation reasons get added. 8,16,32,64 mostly unused entries etc... ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: