Thread: Reduce useless changes before reassembly during logical replication

From: li jie

Hi hackers,

During logical replication, if there is a large write transaction, some
spill files will be written to disk, depending on the setting of
logical_decoding_work_mem.

This behavior effectively avoids OOM, but if a transaction generates a lot of
changes before commit, a large number of spill files can still fill the disk,
for example when updating a TB-scale table. Of course, that much is inevitable.

But I found an inelegant behavior: if the large table being updated is not
published, its changes are still written out as a large number of spill files.
Look at the example below:

publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```

subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```

publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999999) i;
```

Later you will see a large number of spill files in the
"$PGDATA/pg_replslot/mysub/" directory.
```
$ll -sh
total 4.5G
4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-10000000.spill
12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-1000000.spill
17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-11000000.spill
......
```

We can see that table tbl_t1 is not published in mypub, so its changes also
won't be sent downstream because nothing subscribes to them.
After the transaction is reassembled, the pgoutput decoding plugin filters out
changes to these unpublished relations when sending logical changes; see the
function pgoutput_change.
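
For reference, the existing check in pgoutput_change looks roughly like the
sketch below (simplified; details vary across PostgreSQL versions). The point
is that this filtering only happens after the whole transaction has already
been reassembled and each change has been handed to the output plugin:
```
/*
 * Simplified sketch of the existing per-table filter in pgoutput_change();
 * not the exact source, details vary across versions.
 */
static void
pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                Relation relation, ReorderBufferChange *change)
{
    PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
    RelationSyncEntry *relentry;

    relentry = get_rel_sync_entry(data, relation);

    /* Skip actions this publication does not replicate for this table. */
    switch (change->action)
    {
        case REORDER_BUFFER_CHANGE_INSERT:
            if (!relentry->pubactions.pubinsert)
                return;
            break;
        case REORDER_BUFFER_CHANGE_UPDATE:
            if (!relentry->pubactions.pubupdate)
                return;
            break;
        case REORDER_BUFFER_CHANGE_DELETE:
            if (!relentry->pubactions.pubdelete)
                return;
            break;
        default:
            Assert(false);
    }

    /* ... only changes that reach this point are actually sent out ... */
}
```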

Most importantly, if we filter out changes to unpublished relations after
constructing each change but before queuing it into the transaction, can we
reduce the workload of logical decoding and avoid disk or memory growth as
much as possible?

Attached is the patch I used to implement this optimization.

Design:

1. Added a callback LogicalDecodeFilterByRelCB for the output plugin
(a rough sketch of its possible shape is shown after this list).

2. Added an implementation of this callback, pgoutput_table_filter, to the
pgoutput plugin. It is based on the table filter in the pgoutput_change
function: it determines from the publication parameters whether a change
needs to be published, and if not, filters it out.

3. After constructing a change and before queuing it into a transaction,
use RelidByRelfilenumber to obtain the relation associated with the change,
just like obtaining the relation in the ReorderBufferProcessTXN function.

4. The relation may be a TOAST table, and there is no good way to get the
real table's relation from a TOAST relation. Here, I get the real table's
OID through the TOAST relation's name, and then get the real table's relation.

5. This filtering covers INSERT/UPDATE/DELETE. Other change types have not
been considered yet and can be added in the future.
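
For illustration, here is a rough sketch of what the callback type and the
check before queuing could look like. Only the names LogicalDecodeFilterByRelCB
and pgoutput_table_filter come from the design above; the exact signature and
the helper function are assumptions, and the attached patch is authoritative:
```
#include "postgres.h"

#include "replication/logical.h"
#include "replication/reorderbuffer.h"
#include "utils/rel.h"

/*
 * Hypothetical callback type: return true if changes to "relation" do not
 * need to be published and can be dropped before being queued or spilled.
 * The real signature in the attached patch may differ.
 */
typedef bool (*LogicalDecodeFilterByRelCB) (struct LogicalDecodingContext *ctx,
                                            Relation relation,
                                            ReorderBufferChangeType change_type);

/*
 * Hypothetical helper at the queuing site: after the change is constructed
 * and the relation has been resolved via RelidByRelfilenumber(), ask the
 * output plugin (pgoutput_table_filter in the patch) whether to keep it.
 */
static bool
FilterChangeByTable(LogicalDecodingContext *ctx,
                    LogicalDecodeFilterByRelCB filter_cb,
                    Relation relation,
                    ReorderBufferChange *change)
{
    /* No filter installed: keep every change, as today. */
    if (filter_cb == NULL)
        return false;

    return filter_cb(ctx, relation, change->action);
}
```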

Test:
1. Added a test case, 034_table_filter.pl.
2. As in the example above, create two tables: the published table tbl_pub and
   the non-published table tbl_t1.
3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
   pg_ls_replslotdir to record the total size of the slot directory every second.
4. Compare the size of the slot directory at the beginning of the transaction
   (size1), at the end of the transaction (size2), and the average size over
   the entire process (size3).
5. Assert(size1 == size2 == size3).

I sincerely look forward to your feedback.
Regards, lijie

Hi Jie,

> Most importantly, if we filter out unpublished relationship-related
> changes after
> constructing the changes but before queuing the changes into a transaction,
> will it reduce the workload of logical decoding and avoid disk or memory growth
> as much as possible?

Thanks for the report!

Discarding the unused changes as soon as possible looks like a valid
optimization to me, but I'd like more experienced people to double-check it.

> Attached is the patch I used to implement this optimization.

After a quick look at the patch, I found FilterByTable is too expensive
because of the StartTransaction and AbortTransaction calls. With your setup
above, I ran the test below:

insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999100) i;

Profiling the walsender of mypub with perf for 30 seconds, I get:

- 22.04%  1.53%  postgres  postgres  [.] FilterByTable
   - 20.51% FilterByTable
        AbortTransaction
        ResourceOwnerReleaseInternal
        LockReleaseAll
        hash_seq_search

The main part comes from AbortTransaction, and the 20% is not trivial.

From your patch:
+
+    /*
+     * Decoding needs access to syscaches et al., which in turn use
+     * heavyweight locks and such. Thus we need to have enough state around to
+     * keep track of those.  The easiest way is to simply use a transaction
+     * internally.
+ ....
+    using_subtxn = IsTransactionOrTransactionBlock();
+
+    if (using_subtxn)
+        BeginInternalSubTransaction("filter change by table");
+    else
+        StartTransactionCommand();

Actually, FilterByTable here is simpler than full "decoding": we access the
syscache only when get_rel_sync_entry finds an entry whose replicate_valid is
false, and the invalid case should be rare.

What I'm thinking now is that we let get_rel_sync_entry build its own
transaction state *only when it finds an invalid entry*. If the caller has
already built it, like the existing cases in master, nothing happens except a
cheap transaction-state check. Then in the FilterByTable case we just leave it
to get_rel_sync_entry. See the attachment for the idea.
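
To make the idea concrete, a minimal sketch (not the attachment itself, and
eliding the existing lookup and rebuild code) could look like this:
```
/*
 * Sketch only: let get_rel_sync_entry() build transaction state itself, but
 * only on the slow path where the entry is invalid and the publication
 * catalogs actually have to be read via syscache.
 */
static RelationSyncEntry *
get_rel_sync_entry(PGOutputData *data, Relation relation)
{
    RelationSyncEntry *entry;

    /* ... existing hash lookup that locates or creates "entry" ... */

    if (!entry->replicate_valid)
    {
        bool        started_xact = false;

        /*
         * If the caller (e.g. ReorderBufferProcessTXN) already set up a
         * transaction state, this is only a cheap check and nothing happens.
         */
        if (!IsTransactionOrTransactionBlock())
        {
            StartTransactionCommand();
            started_xact = true;
        }

        /* ... existing code that rebuilds the entry from the catalogs ... */

        if (started_xact)
            AbortCurrentTransaction();
    }

    return entry;
}
```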

-- 
Best Regards
Andy Fan

