Thread: Re: Fix assert failure when decoding XLOG_PARAMETER_CHANGE on primary
On Fri, Jan 24, 2025 at 4:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > When a standby replays a XLOG_PARAMETER_CHANGE record that lowers > wal_level from logical, we invalidate all logical slots only when the > standby is in hot standby mode: > > if (InRecovery && InHotStandby && > xlrec.wal_level < WAL_LEVEL_LOGICAL && > wal_level >= WAL_LEVEL_LOGICAL) > InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL, > 0, InvalidOid, > InvalidTransactionId); > > However, it's possible that this record is replayed when not in hot > standby mode and the slot is used after the promotion. In this case, > the following Assert in xlog_decode() fails: > > /* > * This can occur only on a standby, as a primary would > * not allow to restart after changing wal_level < logical > * if there is pre-existing logical slot. > */ Shouldn't we do similar to what this comment indicates on standby? We can disallow to start the server as standby, if the hot_standby is off and there is a pre-existing logical slot. > Assert(RecoveryInProgress()); > ereport(ERROR, > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("logical decoding on standby requires \"wal_level\" >= > \"logical\" on the primary"))); > > Here is brief steps to reproduce this issue: > > 1. setup the primary and the standby servers (with hot_standby=on). > 2. create a logical slot on the standby. > 3. on standby, set hot_standby=off and restart. > 4. on primary, lower wal_level to replica and restart. > 5. promote the standby and execute the logical decoding. > > I've attached a small patch to fix this issue. With the patch, we > invalidate all logical slots even when not in hot standby, that is, > the patch just removes InHotStandby from the condition. Thoughts? > This can fix the scenario you described but I am not sure it is a good idea. We can consider the fix on these lines if you don't like the above suggestion. -- With Regards, Amit Kapila.
On Sun, Feb 2, 2025 at 8:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 24, 2025 at 4:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > When a standby replays a XLOG_PARAMETER_CHANGE record that lowers > > wal_level from logical, we invalidate all logical slots only when the > > standby is in hot standby mode: > > > > if (InRecovery && InHotStandby && > > xlrec.wal_level < WAL_LEVEL_LOGICAL && > > wal_level >= WAL_LEVEL_LOGICAL) > > InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL, > > 0, InvalidOid, > > InvalidTransactionId); > > > > However, it's possible that this record is replayed when not in hot > > standby mode and the slot is used after the promotion. In this case, > > the following Assert in xlog_decode() fails: > > > > /* > > * This can occur only on a standby, as a primary would > > * not allow to restart after changing wal_level < logical > > * if there is pre-existing logical slot. > > */ > > Shouldn't we do similar to what this comment indicates on standby? We > can disallow to start the server as standby, if the hot_standby is off > and there is a pre-existing logical slot. It seems like a better idea. I thought we could pass StandbyMode to StartupReplicationSlots() and check if there is a pre-existing logical slot, but it would break the ABI compatibility. It might not be a problem in practice as StartupReplicationSlots() is normally used only by the startup process. But if we want to avoid that we can introduce a new function for that. > > > Assert(RecoveryInProgress()); > > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("logical decoding on standby requires \"wal_level\" >= > > \"logical\" on the primary"))); > > > > Here is brief steps to reproduce this issue: > > > > 1. setup the primary and the standby servers (with hot_standby=on). > > 2. create a logical slot on the standby. > > 3. on standby, set hot_standby=off and restart. > > 4. on primary, lower wal_level to replica and restart. > > 5. promote the standby and execute the logical decoding. > > > > I've attached a small patch to fix this issue. With the patch, we > > invalidate all logical slots even when not in hot standby, that is, > > the patch just removes InHotStandby from the condition. Thoughts? > > > > This can fix the scenario you described but I am not sure it is a good > idea. We can consider the fix on these lines if you don't like the > above suggestion. Agreed. If your suggestion doesn't work, let's go back to this idea. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 4, 2025 at 12:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Feb 2, 2025 at 8:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 24, 2025 at 4:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > When a standby replays a XLOG_PARAMETER_CHANGE record that lowers > > > wal_level from logical, we invalidate all logical slots only when the > > > standby is in hot standby mode: > > > > > > if (InRecovery && InHotStandby && > > > xlrec.wal_level < WAL_LEVEL_LOGICAL && > > > wal_level >= WAL_LEVEL_LOGICAL) > > > InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL, > > > 0, InvalidOid, > > > InvalidTransactionId); > > > > > > However, it's possible that this record is replayed when not in hot > > > standby mode and the slot is used after the promotion. In this case, > > > the following Assert in xlog_decode() fails: > > > > > > /* > > > * This can occur only on a standby, as a primary would > > > * not allow to restart after changing wal_level < logical > > > * if there is pre-existing logical slot. > > > */ > > > > Shouldn't we do similar to what this comment indicates on standby? We > > can disallow to start the server as standby, if the hot_standby is off > > and there is a pre-existing logical slot. > > It seems like a better idea. I thought we could pass StandbyMode to > StartupReplicationSlots() and check if there is a pre-existing logical > slot, but it would break the ABI compatibility. It might not be a > problem in practice as StartupReplicationSlots() is normally used only > by the startup process. But if we want to avoid that we can introduce > a new function for that. Since StandbyMode is exposed, we don't need to change the function signature. I'll update and submit the patch soon. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Feb 5, 2025 at 2:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've attached the updated patch. The fix needs to be back-patched to > v16 where logical decoding on standby was introduced. > Isn't it better to give the new ERROR near the below code? RestoreSlotFromDisk() { ... if (cp.slotdata.database != InvalidOid && wal_level < WAL_LEVEL_LOGICAL) ereport(FATAL, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("logical replication slot \"%s\" exists, but \"wal_level\" < \"logical\"", NameStr(cp.slotdata.name)), ... } If feasible, this will avoid an additional loop over the slots and a new ERROR will be added at the same place as an existing similar ERROR. -- With Regards, Amit Kapila.
Hi, On Wed, Feb 05, 2025 at 12:08:26PM +0530, Amit Kapila wrote: > On Wed, Feb 5, 2025 at 2:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I've attached the updated patch. The fix needs to be back-patched to > > v16 where logical decoding on standby was introduced. Nice catch and thanks for the patch! I also do prefer the current idea (disallow to start if the hot_standby is off and there is an existing logical slot). > Isn't it better to give the new ERROR near the below code? > RestoreSlotFromDisk() > { > ... > if (cp.slotdata.database != InvalidOid && wal_level < WAL_LEVEL_LOGICAL) > ereport(FATAL, > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("logical replication slot \"%s\" exists, but \"wal_level\" < > \"logical\"", > NameStr(cp.slotdata.name)), > ... > } > > If feasible, this will avoid an additional loop over the slots and a > new ERROR will be added at the same place as an existing similar > ERROR. Agree. + if (SlotIsLogical(s) && !EnableHotStandby) + ereport(FATAL, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("hot standby must be enabled for pre-existing logical replication slots"))); On a primary lowering wal_level < logical, we'd get something like: " FATAL: logical replication slot "logical_slot" exists, but "wal_level" < "logical" " What about being close to it, so something like? " FATAL: logical replication slot "logical_slot" exists, but "hot_standby" = "off" " Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 6, 2025 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've updated the patch accordingly. > Today, again thinking about the proposed fix, I was wondering about the following case. Say, on hot_standby, the user created a logical slot, then shut down hot_standby, turn off the hot_standby flag, and restart standby. This is allowed today but not after the patch. It is possible that the user can promote a non-hot_standby server after its restart in the previous step and can reuse the logical slot. So, is it okay to change this behavior? If not, then we may need to go back to the first fix proposed by you in this thread. -- With Regards, Amit Kapila.
Hi, On Fri, Feb 07, 2025 at 11:00:39AM +0530, Amit Kapila wrote: > On Fri, Feb 7, 2025 at 12:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Feb 5, 2025 at 10:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Feb 6, 2025 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > I've updated the patch accordingly. > > > > > > > > > > Today, again thinking about the proposed fix, I was wondering about > > > the following case. Say, on hot_standby, the user created a logical > > > slot, then shut down hot_standby, turn off the hot_standby flag, and > > > restart standby. This is allowed today but not after the patch. It is > > > possible that the user can promote a non-hot_standby server after its > > > restart in the previous step and can reuse the logical slot. So, is it > > > okay to change this behavior? If not, then we may need to go back to > > > the first fix proposed by you in this thread. > > > > While non hot standby, no one can advance logical slots, meaning the > > standby server ends up accumulating WAL records and also causing the > > bloat on the primary if hot_standby_feedback is enabled. Yeah. I think the bloat on the primary can occur only if there is also a physical replication slot between the two (which is probably the majority of the cases though). > Given this > > outcome, I though that the current patch approach would be reasonable. > > > > Agreed as I also can't think of any case where the user would require > a logical slot on a non-hotstandby server. +1 > > On the other hand, as you pointed out, it would not be good in terms > > of compatibility as we end up limiting potentially existing use cases. > > And it would be prefectly fine if users promote the standby server > > shortly after starting up with hot_standby = off. yeah exactly. And if hot_standby = off is "really" needed then the option would be to remove the logical replication slot on the standby (which is probably a good thing to do with hot_standby = off due to the cons you mentioned above). > True but it sounds like there is more harm than benefit. It seems > reasonable to do this on HEAD. Shall we think of doing it differently > in HEAD and back-branches or let's restrict as your v2 patch is doing > and if by any chance users complain about it we can try to fix it in > another way? I'm tempted to vote for the later so that we can maintain the same code across branches. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Feb 05, 2025 at 10:59:22AM -0800, Masahiko Sawada wrote: > On Tue, Feb 4, 2025 at 11:48 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > + if (SlotIsLogical(s) && !EnableHotStandby) > > + ereport(FATAL, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("hot standby must be enabled for pre-existing logical replication slots"))); > > > > On a primary lowering wal_level < logical, we'd get something like: > > > > " > > FATAL: logical replication slot "logical_slot" exists, but "wal_level" < "logical" > > " > > > > What about being close to it, so something like? > > > > " > > FATAL: logical replication slot "logical_slot" exists, but "hot_standby" = "off" > > " > > Looks good, but I'd like to mention that this is required only on > standbys. So how about the following? > > FATAL: logical replication slot "s" exists on the standby, but > "hot_standby" = "off" > HINT: Change "hot_standby" to be "on". Thanks! Looks good and consistent with the existing wording that can be found in slot.c. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 6, 2025 at 9:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 7, 2025 at 12:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Feb 5, 2025 at 10:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Feb 6, 2025 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > I've updated the patch accordingly. > > > > > > > > > > Today, again thinking about the proposed fix, I was wondering about > > > the following case. Say, on hot_standby, the user created a logical > > > slot, then shut down hot_standby, turn off the hot_standby flag, and > > > restart standby. This is allowed today but not after the patch. It is > > > possible that the user can promote a non-hot_standby server after its > > > restart in the previous step and can reuse the logical slot. So, is it > > > okay to change this behavior? If not, then we may need to go back to > > > the first fix proposed by you in this thread. > > > > While non hot standby, no one can advance logical slots, meaning the > > standby server ends up accumulating WAL records and also causing the > > bloat on the primary if hot_standby_feedback is enabled. Given this > > outcome, I though that the current patch approach would be reasonable. > > > > Agreed as I also can't think of any case where the user would require > a logical slot on a non-hotstandby server. > > > On the other hand, as you pointed out, it would not be good in terms > > of compatibility as we end up limiting potentially existing use cases. > > And it would be prefectly fine if users promote the standby server > > shortly after starting up with hot_standby = off. > > True but it sounds like there is more harm than benefit. It seems > reasonable to do this on HEAD. Shall we think of doing it differently > in HEAD and back-branches or let's restrict as your v2 patch is doing > and if by any chance users complain about it we can try to fix it in > another way? While the latter would be good for maintainability it would not be advisable to change the behavior back and forth in back branches. I'd like to make it clear what point of v2 patch (restring server startup for pre-existing logical slots) is preferable over the first patch (removing InHotStandby condition)? If the approach of the first patch is reasonably good, we could take it both in HEAD and backbranches. > > > > > But, similarly, is > > there any concerns of my proposed fix? > > > > - if (InRecovery && InHotStandby && > + if (InRecovery && > xlrec.wal_level < WAL_LEVEL_LOGICAL && > > Won't the InRecovery be true even for crash-recovery as well? I think > we need to check standby mode here. I think if we check standby mode here (i.e., adding StandbyMode), we would not be able to invalidate logical slots during archive recovery. That is, in the following scenario, we would hit the same assertion failure: 1. setup the primary and the standby servers (with hot_standby=on). 2. create a logical slot on the standby. 3. on standby, set archive recovery settings (setting restore_command, removing standby.signal, and creating recovery.signal etc.). 4. on primary, lower wal_level to replica and restart. 5. start standby (in archive recovery mode). 6. execute the logical decoding after the archive recovery completes. And, you're right that InRecovery is true even for crash-recovery. But is there any case where we invalidate logical slots due to XLOG_PARAMETER_CHANGE during a crash recovery? Also, I guess InRecovery is always true when we call xlog_redo(). Do we need to check it in the first place? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com