Timeout when changes are filtered out by the core during logical replication

From
Ashutosh Bapat
Date:
Hi All,
A customer ran a script dropping a few dozen users in a transaction. Before dropping each user, the script changed the ownership of the tables owned by that user to another user and revoked all of that user's privileges, all in the same transaction. There were a few thousand tables whose privileges and ownership were changed by this transaction. Since all of these changes were to catalog tables, they were filtered out in ReorderBufferProcessTXN()
by the following code
                   if (!RelationIsLogicallyLogged(relation))
                        goto change_done;

I tried to reproduce a similar situation through the attached TAP test. For 500 users and 1000 tables, the transaction takes significant time but logical decoding does not. So with the default 1 min WAL sender and receiver timeouts I could not reproduce the timeout; beyond that, the TAP test itself times out.

But I think the logical receiver can time out this way whenever a transaction is large enough that decoding it takes longer than the timeout. So I think we need to call OutputPluginUpdateProgress() at regular intervals (in terms of time or number of changes) to consume any feedback from the subscriber or send a keep-alive message.
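
To illustrate what I have in mind, here is a rough standalone sketch of the throttling idea. This is not the actual reorderbuffer/walsender code; the names (update_progress_and_keepalive, SKIP_THRESHOLD, Change) are made up for illustration. The point is just that the decode loop counts the changes it skips and fires a progress/keep-alive hook once a threshold is crossed, so the receiver keeps hearing from us even though nothing is sent downstream.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical throttle: how many skipped changes between keep-alives. */
#define SKIP_THRESHOLD 10000

/* Simplified stand-in for a decoded change. */
typedef struct Change
{
	bool	catalog_only;		/* change touches only catalog tables */
} Change;

/* Hypothetical stand-in for OutputPluginUpdateProgress()/keep-alive handling. */
static void
update_progress_and_keepalive(void)
{
	printf("progress update / keep-alive sent\n");
}

/* Very rough analogue, in role only, of the loop in ReorderBufferProcessTXN(). */
static void
process_txn(const Change *changes, size_t nchanges)
{
	size_t	skipped = 0;

	for (size_t i = 0; i < nchanges; i++)
	{
		if (changes[i].catalog_only)
		{
			/*
			 * The change is filtered out before reaching the output plugin,
			 * so nothing is sent downstream.  Still account for it so the
			 * receiver periodically hears that we are alive.
			 */
			if (++skipped >= SKIP_THRESHOLD)
			{
				update_progress_and_keepalive();
				skipped = 0;
			}
			continue;
		}

		/* A non-catalog change would be handed to the output plugin here. */
	}
}

int
main(void)
{
	static Change changes[30000];	/* zero-initialized */

	/* Simulate a large transaction consisting only of catalog changes. */
	for (size_t i = 0; i < sizeof(changes) / sizeof(changes[0]); i++)
		changes[i].catalog_only = true;

	process_txn(changes, sizeof(changes) / sizeof(changes[0]));
	return 0;
}
```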

Following commit
```
commit 87c1dd246af8ace926645900f02886905b889718
Author: Amit Kapila <akapila@postgresql.org>
Date:   Wed May 11 10:12:23 2022 +0530

    Fix the logical replication timeout during large transactions.
   
```
fixed a similar problem when the changes were filtered out by an output plugin, but in this case the changes are never handed to the output plugin at all. If we fix it in core, we may not need to handle it in the output plugin the way that commit does. The commit does not include a test case that I could run to reproduce the timeout.
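
For contrast, here is an equally rough sketch of where that commit's plugin-side throttling lives (again with hypothetical names, not the real pgoutput code): the plugin's change callback can only count and throttle changes that actually reach it, which is exactly what does not happen when ReorderBufferProcessTXN() filters them out earlier.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical throttle used inside the output plugin. */
#define CHANGES_THRESHOLD 100

static int	changes_since_progress = 0;

/* Hypothetical stand-ins for the publication filter and the progress hook. */
static bool
relation_is_published(int relid)
{
	return false;				/* pretend nothing is published */
}

static void
send_progress_or_keepalive(void)
{
	printf("plugin-side progress update\n");
}

/* Rough analogue, in role only, of an output plugin's change callback. */
static void
sketch_change_cb(int relid)
{
	if (!relation_is_published(relid))
	{
		/* The plugin sees the change, skips it, and can throttle here ... */
		if (++changes_since_progress >= CHANGES_THRESHOLD)
		{
			send_progress_or_keepalive();
			changes_since_progress = 0;
		}
		return;
	}

	/* ... otherwise the change is sent to the subscriber. */
	changes_since_progress = 0;
}

int
main(void)
{
	/* Changes filtered out in core never invoke this callback at all. */
	for (int relid = 0; relid < 1000; relid++)
		sketch_change_cb(relid);
	return 0;
}
```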

--
Best Wishes,
Ashutosh
Attachment


Re: Timeout when changes are filtered out by the core during logical replication

From
Amit Kapila
Date:

On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> Hi All,
> A customer ran a script dropping a few dozen users in a transaction. Before dropping each user, the script changed
> the ownership of the tables owned by that user to another user and revoked all of that user's privileges, all in the
> same transaction. There were a few thousand tables whose privileges and ownership were changed by this transaction.
> Since all of these changes were to catalog tables, they were filtered out in ReorderBufferProcessTXN()
> by the following code
>                    if (!RelationIsLogicallyLogged(relation))
>                         goto change_done;
>
> I tried to reproduce a similar situation through the attached TAP test. For 500 users and 1000 tables, the
> transaction takes significant time but logical decoding does not. So with the default 1 min WAL sender and receiver
> timeouts I could not reproduce the timeout; beyond that, the TAP test itself times out.
>
> But I think the logical receiver can time out this way whenever a transaction is large enough that decoding it takes
> longer than the timeout. So I think we need to call OutputPluginUpdateProgress() at regular intervals (in terms of
> time or number of changes) to consume any feedback from the subscriber or send a keep-alive message.
>

I don't think it will be a good idea to directly call
OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
patch under discussion for this problem [1].

> Following commit
> ```
> commit 87c1dd246af8ace926645900f02886905b889718
> Author: Amit Kapila <akapila@postgresql.org>
> Date:   Wed May 11 10:12:23 2022 +0530
>
>     Fix the logical replication timeout during large transactions.
>
>  ```
> fixed a similar problem when the changes were filtered out by an output plugin, but in this case the changes are
> never handed to the output plugin at all. If we fix it in core, we may not need to handle it in the output plugin
> the way that commit does. The commit does not include a test case that I could run to reproduce the timeout.
>

It is not evident how to write a stable test for this because it is
tricky to estimate how many changes are needed for the configured
wal_receiver_timeout to elapse on all the buildfarm machines. If you
have good ideas then feel free to propose a test patch.

[1] - https://www.postgresql.org/message-id/OS3PR01MB62751A8063A9A75A096000D89E3F9%40OS3PR01MB6275.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: Timeout when changes are filtered out by the core during logical replication

From
Ashutosh Bapat
Date:


On Fri, Dec 23, 2022 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Dec 22, 2022 at 6:58 PM Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
> Hi All,
> A customer ran a script dropping a few dozen users in a transaction. Before dropping each user, the script changed the ownership of the tables owned by that user to another user and revoked all of that user's privileges, all in the same transaction. There were a few thousand tables whose privileges and ownership were changed by this transaction. Since all of these changes were to catalog tables, they were filtered out in ReorderBufferProcessTXN()
> by the following code
>                    if (!RelationIsLogicallyLogged(relation))
>                         goto change_done;
>
> I tried to reproduce a similar situation through the attached TAP test. For 500 users and 1000 tables, the transaction takes significant time but logical decoding does not. So with the default 1 min WAL sender and receiver timeouts I could not reproduce the timeout; beyond that, the TAP test itself times out.
>
> But I think the logical receiver can time out this way whenever a transaction is large enough that decoding it takes longer than the timeout. So I think we need to call OutputPluginUpdateProgress() at regular intervals (in terms of time or number of changes) to consume any feedback from the subscriber or send a keep-alive message.
>

I don't think it will be a good idea to directly call
OutputPluginUpdateProgress() from reorderbuffer.c. There is already a
patch under discussion for this problem [1].

Yeah, I don't mean that we should call OutputPluginUpdateProgress() directly; the patch just showed that calling it there helps in some way. Thanks for pointing out the other thread. I have reviewed the patch on that thread and will continue the discussion there.
 

> Following commit
> ```
> commit 87c1dd246af8ace926645900f02886905b889718
> Author: Amit Kapila <akapila@postgresql.org>
> Date:   Wed May 11 10:12:23 2022 +0530
>
>     Fix the logical replication timeout during large transactions.
>
>  ```
> fixed a similar problem when the changes were filtered out by an output plugin, but in this case the changes are never handed to the output plugin at all. If we fix it in core, we may not need to handle it in the output plugin the way that commit does. The commit does not include a test case that I could run to reproduce the timeout.
>

It is not evident how to write a stable test for this because it is
tricky to estimate how many changes are needed for the configured
wal_receiver_timeout to elapse on all the buildfarm machines. If you
have good ideas then feel free to propose a test patch.

Will continue this on the other thread.

--
Best Wishes,
Ashutosh