RE: Logical replication timeout problem - Mailing list pgsql-hackers
From | houzj.fnst@fujitsu.com |
---|---|
Subject | RE: Logical replication timeout problem |
Date | |
Msg-id | OS0PR01MB5716D781C642CCF759C5B50894C89@OS0PR01MB5716.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Logical replication timeout problem (Peter Smith <smithpb2250@gmail.com>) |
Responses |
Re: Logical replication timeout problem
|
List | pgsql-hackers |
On Monday, January 23, 2023 8:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are my review comments for patch v4-0001 > ====== > Commit message > > 2. > > The problem is when there is a DDL in a transaction that generates lots of > temporary data due to rewrite rules, these temporary data will not be processed > by the pgoutput plugin. The previous commit (f95d53e) only fixed timeouts > caused by filtering out changes in pgoutput. Therefore, the previous fix for DML > had no impact on this case. > > ~ > > IMO this still some rewording to say up-front what the the actual problem -- i.e. > an avoidable timeout occuring. > > SUGGESTION (or something like this...) > > When there is a DDL in a transaction that generates lots of temporary data due > to rewrite rules, this temporary data will not be processed by the pgoutput > plugin. This means it is possible for a timeout to occur if a sufficiently long time > elapses since the last pgoutput message. A previous commit (f95d53e) fixed a > similar scenario in this area, but that only fixed timeouts for DML going through > pgoutput, so it did not address this DDL timeout case. Thanks, I changed the commit message as suggested. > ====== > src/backend/replication/logical/logical.c > > 3. update_progress_txn_cb_wrapper > > +/* > + * Update progress callback while processing a transaction. > + * > + * Try to update progress and send a keepalive message during sending > +data of a > + * transaction (and its subtransactions) to the output plugin. > + * > + * For a large transaction, if we don't send any change to the > +downstream for a > + * long time (exceeds the wal_receiver_timeout of standby) then it can timeout. > + * This can happen when all or most of the changes are either not > +published or > + * got filtered out. > + */ > +static void > +update_progress_txn_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN > *txn, > + ReorderBufferChange *change) > > Simplify the "Try to..." paragraph. And other part should also mention about DDL. > > SUGGESTION > > Try send a keepalive message during transaction processing. > > This is done because if we don't send any change to the downstream for a long > time (exceeds the wal_receiver_timeout of standby), then it can timeout. This can > happen for large DDL, or for large transactions when all or most of the changes > are either not published or got filtered out. Changed. > ====== > .../replication/logical/reorderbuffer.c > > 4. ReorderBufferProcessTXN > > @@ -2105,6 +2105,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > > PG_TRY(); > { > + /* > + * Static variable used to accumulate the number of changes while > + * processing txn. > + */ > + static int changes_count = 0; > + > + /* > + * Sending keepalive messages after every change has some overhead, but > + * testing showed there is no noticeable overhead if keepalive is only > + * sent after every ~100 changes. > + */ > +#define CHANGES_THRESHOLD 100 > + > > IMO these can be relocated to be declared/defined inside the "while" > loop -- i.e. closer to where they are being used. Moved into the while loop. Attach the new version patch which addressed above comments. Also attach a simple script which use "refresh matview" to reproduce this timeout problem just in case some one want to try to reproduce this. Best regards, Hou zj
Attachment
pgsql-hackers by date: