Re: Logical replication timeout problem - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Logical replication timeout problem
Date
Msg-id CAA4eK1KYbj+-=2ZdhqcW9m+xS9dHHKc1e5f2LG=R+Qg2zWbNcw@mail.gmail.com
In response to Re: Logical replication timeout problem  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
Responses RE: Logical replication timeout problem  ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
List pgsql-hackers
On Fri, Jan 6, 2023 at 12:35 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> +
> +    /*
> +     * We don't want to try sending a keepalive message after processing each
> +     * change as that can have overhead. Tests revealed that there is no
> +     * noticeable overhead in doing it after continuously processing 100 or so
> +     * changes.
> +     */
> +#define CHANGES_THRESHOLD 100
>
> I think a time-based threshold makes more sense. What if the timeout was
> nearing and those 100 changes just took a little more time, causing a timeout?
> We already have a time-based threshold in WalSndKeepaliveIfNecessary(), and
> that function is invoked after reading every WAL record in WalSndLoop(). So it
> does not look like an expensive function. If it is expensive, we might want to
> worry about WalSndLoop() as well. Does it make more sense to remove this
> threshold?
>

We have previously tried this for every change [1] and it brings
noticeable overhead. In fact, even doing it for every 10 changes had
some overhead, which is why we arrived at this threshold number. I
don't think it can lead to a timeout due to skipping changes, but if
we see any such report we can further fine-tune this setting or try to
make it time-based; for now, I feel it is safe to use this threshold.
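The count-based throttle being discussed can be sketched as a small standalone
model (the CHANGES_THRESHOLD value and counter reset mirror the quoted patch;
the function name and everything else here are illustrative, not the actual
walsender code):

```c
#include <stdbool.h>

#define CHANGES_THRESHOLD 100   /* from the patch under discussion */

/*
 * Count changes processed since the last keepalive attempt; return true
 * once CHANGES_THRESHOLD changes have been seen, resetting the counter.
 * The caller would then invoke update_progress() to try a keepalive.
 */
static bool
keepalive_attempt_due(int *changes_count)
{
    if (++(*changes_count) >= CHANGES_THRESHOLD)
    {
        *changes_count = 0;
        return true;
    }
    return false;
}
```

The point of the counter is that the (cheap) increment-and-compare runs per
change, while the potentially costly keepalive path runs only once per 100
changes.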

> +
> +    /*
> +     * After continuously processing CHANGES_THRESHOLD changes, we
> +     * try to send a keepalive message if required.
> +     */
> +    if (++changes_count >= CHANGES_THRESHOLD)
> +    {
> +        ctx->update_progress(ctx, ctx->write_location, ctx->write_xid, false);
> +        changes_count = 0;
> +    }
> +}
> +
>
> On the other thread, I mentioned that we don't have a TAP test for it. I agree
> with Amit's opinion there that it's hard to create a test which will time out
> everywhere. I think what we need is a way to control the time required for
> decoding a transaction.
>
> A rough idea is to induce a small sleep after decoding every change. The
> amount of sleep * number of changes will help us estimate and control the
> amount of time taken to decode a transaction. Then we create a transaction
> which will take longer than the timeout threshold to decode. But that's
> significant code. I don't think PostgreSQL has a facility to induce a delay
> at a particular place in the code.
>

Yeah, I don't know how to induce such a delay while decoding changes.
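For what the sleep-based estimate above would imply, the arithmetic can be
sketched as follows (purely hypothetical numbers and function name; as noted,
no such per-change delay hook exists in this code path):

```c
/*
 * Given a hypothetical per-change sleep and a wal_sender_timeout value
 * (both in milliseconds), compute the smallest transaction size whose
 * decode time would exceed the timeout: sleep_ms * nchanges > timeout_ms.
 */
static int
min_changes_to_time_out(int sleep_per_change_ms, int timeout_ms)
{
    return timeout_ms / sleep_per_change_ms + 1;
}
```

For example, with the default wal_sender_timeout of 60s and a hypothetical
10 ms sleep per change, a transaction of 6001 changes would be enough to
exceed the timeout, which is what the proposed TAP test would need to
construct.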

One more thing: I think it would be better to expose a new callback API
via the reorder buffer, as suggested previously [2], similar to the
other reorder buffer APIs, instead of having the reorderbuffer code
invoke the output plugin API directly.
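The callback indirection being proposed might look roughly like this (a
hypothetical sketch; the type and function names are illustrative, not the
actual PostgreSQL API, and the real ReorderBuffer struct has many more
fields):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TransactionId;

struct ReorderBuffer;

/*
 * Hypothetical progress callback type: the layer that owns the output
 * plugin registers this, so reorderbuffer code never calls plugin code
 * directly.
 */
typedef void (*RBUpdateProgressCB) (struct ReorderBuffer *rb,
                                    XLogRecPtr lsn,
                                    TransactionId xid);

typedef struct ReorderBuffer
{
    RBUpdateProgressCB update_progress;  /* set at decoding-context setup */
    void       *private_data;            /* plugin-owned state */
} ReorderBuffer;

/*
 * Wrapper the reorderbuffer code would call, mirroring how its other
 * callbacks are invoked: check for registration, then dispatch.
 */
static void
ReorderBufferUpdateProgress(ReorderBuffer *rb, XLogRecPtr lsn,
                            TransactionId xid)
{
    if (rb->update_progress != NULL)
        rb->update_progress(rb, lsn, xid);
}
```

The design point is layering: the reorder buffer stays ignorant of the output
plugin, and the progress-reporting policy (including any change-count
throttling) lives behind the registered callback.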


[1] - https://www.postgresql.org/message-id/OS3PR01MB6275DFFDAC7A59FA148931529E209%40OS3PR01MB6275.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BfQjndoBOFUn9Wy0hhm3MLyUWEpcT9O7iuCELktfdBiQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.


