RE: Logical replication timeout problem - Mailing list pgsql-hackers

From wangw.fnst@fujitsu.com
Subject RE: Logical replication timeout problem
Date
Msg-id OS3PR01MB62753A7728A688B7349D92209EFE9@OS3PR01MB6275.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Logical replication timeout problem  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses Re: Logical replication timeout problem  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
On Fri, Jan 6, 2023 at 15:06 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Hi Wang,
> Thanks for working on this. One of our customer faced a similar
> situation when running BDR with PostgreSQL.
> 
> I tested your patch and it solves the problem.
> 
> Please find some review comments below

Thanks for your testing and comments.

> +/*
> + * Helper function for ReorderBufferProcessTXN for updating progress.
> + */
> +static inline void
> +ReorderBufferUpdateProgress(ReorderBuffer *rb, ReorderBufferTXN *txn,
> +                            ReorderBufferChange *change)
> +{
> +    LogicalDecodingContext *ctx = rb->private_data;
> +    static int    changes_count = 0;
> 
> It's not easy to know that a variable is static when reading the code which
> uses it. So it's easy to interpret code wrong. I would probably track it
> through logical decoding context itself OR through a global variable like other
> places where we track the last timestamps. But there's more below on this.

I'm not sure if we need to add global variables or member variables for a
cumulative count that is only used here. How would you feel if I add some
comments when declaring this static variable?

> +
> +    if (!ctx->update_progress)
> +        return;
> +
> +    Assert(!ctx->fast_forward);
> +
> +    /* set output state */
> +    ctx->accept_writes = false;
> +    ctx->write_xid = txn->xid;
> +    ctx->write_location = change->lsn;
> +    ctx->end_xact = false;
> 
> This patch reverts many of the changes of the previous commit which tried to
> fix this issue i.e. 55558df2374. end_xact was introduced by the same commit but
> without much explanation of that in the commit message. Its only user,
> WalSndUpdateProgress(), is probably making a wrong assumption as well.
> 
>      * We don't have a mechanism to get the ack for any LSN other than end
>      * xact LSN from the downstream. So, we track lag only for end of
>      * transaction LSN.
> 
> IIUC, WAL sender tracks the LSN of the last WAL record read in sentPtr which is
> sent downstream through a keep alive message. Downstream may
> acknowledge this
> LSN. So we do get ack for any LSN, not just commit LSN.
> 
> So I propose removing end_xact as well.

We didn't track the lag during a transaction because it could make the
calculations of lag functionality inaccurate. If we track every lsn, it could
fail to record important lsn information because of
WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress).
Please see details in [1] and [2].

Regards,
Wang Wei

[1] -
https://www.postgresql.org/message-id/OS3PR01MB62755D216245199554DDC8DB9EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] -
https://www.postgresql.org/message-id/OS3PR01MB627514AE0B3040D8F55A68B99EEA9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Next
From: Pavel Borisov
Date:
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()