Thread: Fix a bug in DecodeAbort() and improve input data check on subscriber.
Hi all, While updating the patch I recently posted[1] to make pg_waldump report replication origin ID, LSN, and timestamp, I found a bug that replication origin timestamp is not set in ROLLBACK PREPARED case. Commit 8bdb1332eb5 (CC'ed Amit) added an argument to ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the caller specified it at the wrong position: @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, if (two_phase) { ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, + SnapBuildInitialConsistentPoint(ctx->snapshot_builder), commit_time, origin_id, origin_lsn, parsed->twophase_gid, true); } @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, { ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, abort_time, origin_id, origin_lsn, + InvalidXLogRecPtr, parsed->twophase_gid, false); } This affects the value of rollback_data.rollback_time on the subscriber, resulting in setting the wrong timestamp to both the replication origin timestamp and the error callback argument on the subscriber. I've attached the patch to fix it. Besides, I think we can improve checking input data on subscribers. This bug was not detected by compilers but it could have been detected if we checked the input data. Looking at logicalrep_read_xxx functions in proto.c, there are some inconsistencies; we check the value of prepare_data->xid in logicalrep_read_prepare_common() but we don't in both logicalrep_read_commit_prepared() and logicalrep_read_rollback_prepared(), and we don't check anything in stream_start/stream_stop. Also, IIUC that since timestamps are always set in prepare/commit prepared/rollback prepared cases we can check them too. I've attached a PoC patch that introduces macros for checking input data and adds some new checks. Since it could be frequently called, I used unlikely() but probably we can also consider replacing elog(ERROR) with assertions. Regards, [1] https://www.postgresql.org/message-id/CAD21AoD2dJfgsdxk4_KciAZMZQoUiCvmV9sDpp8ZuKLtKCNXaA%40mail.gmail.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Tue, Dec 7, 2021 at 6:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi all, > > While updating the patch I recently posted[1] to make pg_waldump > report replication origin ID, LSN, and timestamp, I found a bug that > replication origin timestamp is not set in ROLLBACK PREPARED case. > Commit 8bdb1332eb5 (CC'ed Amit) added an argument to > ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the > caller specified it at the wrong position: > > @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > if (two_phase) > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > + > SnapBuildInitialConsistentPoint(ctx->snapshot_builder), > commit_time, origin_id, origin_lsn, > parsed->twophase_gid, true); > } > @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > abort_time, origin_id, origin_lsn, > + InvalidXLogRecPtr, > parsed->twophase_gid, false); > } > > This affects the value of rollback_data.rollback_time on the > subscriber, resulting in setting the wrong timestamp to both the > replication origin timestamp and the error callback argument on the > subscriber. I've attached the patch to fix it. > Thanks for the report and patches. I see this is a problem and the first patch will fix it. I'll test the same and review another patch as well. -- With Regards, Amit Kapila.
On Tue, Dec 7, 2021 at 6:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi all, > > While updating the patch I recently posted[1] to make pg_waldump > report replication origin ID, LSN, and timestamp, I found a bug that > replication origin timestamp is not set in ROLLBACK PREPARED case. > Commit 8bdb1332eb5 (CC'ed Amit) added an argument to > ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the > caller specified it at the wrong position: > > @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > if (two_phase) > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > + > SnapBuildInitialConsistentPoint(ctx->snapshot_builder), > commit_time, origin_id, origin_lsn, > parsed->twophase_gid, true); > } > @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > abort_time, origin_id, origin_lsn, > + InvalidXLogRecPtr, > parsed->twophase_gid, false); > } > > This affects the value of rollback_data.rollback_time on the > subscriber, resulting in setting the wrong timestamp to both the > replication origin timestamp and the error callback argument on the > subscriber. I've attached the patch to fix it. > > Besides, I think we can improve checking input data on subscribers. > This bug was not detected by compilers but it could have been detected > if we checked the input data. Looking at logicalrep_read_xxx functions > in proto.c, there are some inconsistencies; we check the value of > prepare_data->xid in logicalrep_read_prepare_common() but we don't in > both logicalrep_read_commit_prepared() and > logicalrep_read_rollback_prepared(), and we don't check anything in > stream_start/stream_stop. Also, IIUC that since timestamps are always > set in prepare/commit prepared/rollback prepared cases we can check > them too. I've attached a PoC patch that introduces macros for > checking input data and adds some new checks. Since it could be > frequently called, I used unlikely() but probably we can also consider > replacing elog(ERROR) with assertions. The first patch is required as suggested and fixes the problem. Few comments on the second patch: 1) Should we validate prepare_time and xid in logicalrep_read_begin_prepare. Is this not checked intentionally? 2) Similarly should we validate committime in logicalrep_read_stream_commit? Regards, Vignesh
On Tue, Dec 7, 2021 at 6:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi all, > > While updating the patch I recently posted[1] to make pg_waldump > report replication origin ID, LSN, and timestamp, I found a bug that > replication origin timestamp is not set in ROLLBACK PREPARED case. > Commit 8bdb1332eb5 (CC'ed Amit) added an argument to > ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the > caller specified it at the wrong position: > > @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > if (two_phase) > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > + > SnapBuildInitialConsistentPoint(ctx->snapshot_builder), > commit_time, origin_id, origin_lsn, > parsed->twophase_gid, true); > } > @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > { > ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, > buf->endptr, > abort_time, origin_id, origin_lsn, > + InvalidXLogRecPtr, > parsed->twophase_gid, false); > } > > This affects the value of rollback_data.rollback_time on the > subscriber, resulting in setting the wrong timestamp to both the > replication origin timestamp and the error callback argument on the > subscriber. I've attached the patch to fix it. > Pushed. > Besides, I think we can improve checking input data on subscribers. > This bug was not detected by compilers but it could have been detected > if we checked the input data. Looking at logicalrep_read_xxx functions > in proto.c, there are some inconsistencies; we check the value of > prepare_data->xid in logicalrep_read_prepare_common() but we don't in > both logicalrep_read_commit_prepared() and > logicalrep_read_rollback_prepared(), and we don't check anything in > stream_start/stream_stop. Also, IIUC that since timestamps are always > set in prepare/commit prepared/rollback prepared cases we can check > them too. > I am not sure if we should try adding these validity checks for all kinds of input parameters. I think the original code was checking for lsn's, see logicalrep_read_begin, and at one place we start validating xid and now you are asking for the timestamp. I think it is important to ensure the correctness of these lsn values as the processing of changes on subscriber relies on them. I think for the sake of consistency it is better to check the validity of lsn's but if we want to add for others, we should analyze and add for all the required ones in one shot. -- With Regards, Amit Kapila.