Hi hackers,
During logical replication, if there is a large write transaction, some
spill files will be written to disk, depending on the setting of
logical_decoding_work_mem.
This behavior can effectively avoid OOM, but if the transaction
generates a lot of change before commit, a large number of files may
fill the disk. For example, you can update a TB-level table.
Of course, this is also inevitable.
But I found an inelegant phenomenon. If the updated large table is not
published, its changes will also be written with a large number of spill files.
Look at an example below:
publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```
subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```
publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999999) i;
```
Later you will see a large number of spill files in the
"/$PGDATA/pg_replslot/mysub/" directory.
```
$ll -sh
total 4.5G
4.0K -rw------- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw------- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-10000000.spill
12M -rw------- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-1000000.spill
17M -rw------- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-11000000.spill
......
```
We can see that table tbl_t1 is not published in mypub. It also won't be sent
downstream because it's not subscribed.
After the transaction is reorganized, the pgoutput decoding plugin filters out
changes to these unpublished relationships when sending logical changes.
See function pgoutput_change.
Most importantly, if we filter out unpublished relationship-related
changes after
constructing the changes but before queuing the changes into a transaction,
will it reduce the workload of logical decoding and avoid disk or memory growth
as much as possible?
Attached is the patch I used to implement this optimization.
Design:
1. Added a callback LogicalDecodeFilterByRelCB for the output plugin.
2. Added this callback function pgoutput_table_filter for the pgoutput plugin.
Its main implementation is based on the table filter in the
pgoutput_change function.
Its main function is to determine whether the change needs to be published based
on the parameters of the publication, and if not, filter it.
3. After constructing a change and before Queue a change into a transaction,
use RelidByRelfilenumber to obtain the relation associated with the change,
just like obtaining the relation in the ReorderBufferProcessTXN function.
4. Relation may be a toast, and there is no good way to get its real
table relation based on toast relation. Here, I get the real table oid
through toast relname, and then get the real table relation.
5. This filtering takes into account INSERT/UPDATE/INSERT. Other
changes have not been considered yet and can be expanded in the future.
Test:
1. Added a test case 034_table_filter.pl
2. Like the case above, create two tables, the published table tbl_pub and
the non-published table tbl_t1
3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
pg_ls_replslotdir to record the total size of the slot directory
every second.
4. Compare the size of the slot directory at the beginning of the
transaction(size1),
the size at the end of the transaction (size2), and the average
size of the entire process(size3).
5. Assert(size1==size2==size3)
Sincerely look forward to your feedback.
Regards, lijie