Thread: Selectively invalidate caches in pgoutput module
Dear hackers,

Hi, this is a fork thread from [1]. I want to propose a small optimization for the logical replication system.

Background
==========

When the ALTER PUBLICATION command is executed, all entries in RelationSyncCache are discarded. This mechanism works correctly but is sometimes not efficient. For example, when ALTER PUBLICATION DROP TABLE is executed,
1) the specific entry in RelationSyncCache is removed, and then
2) all entries are discarded twice.

This happens because the pgoutput plugin registers both a RelcacheCallback (rel_sync_cache_relation_cb) and SyscacheCallbacks (publication_invalidation_cb, rel_sync_cache_publication_cb). Then, when ALTER PUBLICATION ADD/SET/DROP is executed, both the relation cache of the added tables and the syscaches of pg_publication_rel and pg_publication are invalidated. The callback for the relation cache removes an entry from the hash table, while the syscache callbacks walk all entries and invalidate them. However, AFAICS there is no need to invalidate all of them. (A small SQL illustration follows at the end of this mail.)

I grepped the source code and found that this behavior has existed since the initial version. Currently the effect may not be large, but with [1] the impact could become significant, because that patch propagates invalidation messages to all in-progress decoding transactions.

Patch overview
============

Based on the background, the patch avoids dropping all entries in RelationSyncCache when ALTER PUBLICATION is executed. It removes the syscache callbacks for pg_publication_rel and pg_publication_namespace, and avoids discarding entries on pg_publication syscache invalidation.

Apart from the above, this patch also ensures that the relcaches of published tables are invalidated when ALTER PUBLICATION is executed. ADD/SET/DROP already has this mechanism, but ALTER PUBLICATION OWNER TO and RENAME TO do not. Regarding RENAME TO, we currently go through a common function, but the patch replaces it with RenamePublication() to do the invalidations.

What do you think?

[1]: https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
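To make the scenario concrete, here is a minimal, hypothetical illustration (table and publication names are invented). On HEAD, the last command causes every entry in pgoutput's RelationSyncCache to be invalidated, even though only the dropped table's entry needs to go:

```
-- Illustrative only: a publication over several tables.
CREATE TABLE t1 (a int PRIMARY KEY);
CREATE TABLE t2 (a int PRIMARY KEY);
CREATE TABLE t3 (a int PRIMARY KEY);
CREATE PUBLICATION pub FOR TABLE t1, t2, t3;

-- Logically this only affects t3, but currently every RelationSyncCache
-- entry in each active pgoutput decoder is invalidated as well.
ALTER PUBLICATION pub DROP TABLE t3;
```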
Attachment
On Mon, Mar 3, 2025 at 1:27 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Hi, this is a fork thread from [1]. I want to propose a small optimization for logical replication system.
>
> Background
> ==========
>
> When the ALTER PUBLICATION command is executed, all entries in RelationSyncCache will be discarded anyway. This mechanism works well but is sometimes not efficient. For example, when the ALTER PUBLICATION DROP TABLE is executed,
> 1) the specific entry in RelationSyncCache will be removed, and then
> 2) all entries will be discarded twice.
>

In (2) Why twice? Is it because we call rel_sync_cache_publication_cb() first for PUBLICATIONRELMAP and then for PUBLICATIONOID via publication_invalidation_cb()?

> This happens because the pgoutput plugin registers both RelcacheCallback (rel_sync_cache_relation_cb) and SyscacheCallback (publication_invalidation_cb, rel_sync_cache_publication_cb). Then, when ALTER PUBLICATION ADD/SET/DROP is executed, both the relation cache of added tables and the syscache of pg_publication_rel and pg_publication are invalidated.
> The callback for the relation cache will remove an entry from the hash table, and syscache callbacks will look up all entries and invalidate them. However, AFAICS does not need to invalidate all of them.
>

IIUC, you want to say in the above case only (1) would be sufficient, right?

> I grepped source codes and found this happens since the initial version.
>
> Currently the effect of the behavior may not be large,
>

Is it possible to see the impact? I guess if there are a large number of relations (or partitions), then all will get invalidated even if one relation is added/dropped from the publication; if so, this should impact the performance.

> but [1] may affect significantly because it propagates invalidation messages to all in-progress decoding transactions.
>
> Patch overview
> ============
>
> Based on the background, the patch avoids dropping all entries in RelationSyncCache when ALTER PUBLICATION is executed. It removes sys cache callbacks for pg_publication_rel and pg_publication_namespace and avoids discarding entries in sys cache for pg_publication.
>
> Apart from the above, this patch also ensures that relcaches of publishing tables are invalidated when ALTER PUBLICATION is executed. ADD/SET/DROP already has this mechanism, but ALTER PUBLICATION OWNER TO and RENAME TO do not.
> Regarding RENAME TO, now we are using a common function, but it is replaced with RenamePublication() to do invalidations.
>

For Rename/Owner, can we use a new invalidation instead of using relcache invalidation, as that can impact other sessions for no benefit? IIUC, changing the other properties of publication like dropping the table requires us to invalidate even the corresponding relcache entry because it contains the publication descriptor (rd_pubdesc). See RelationBuildPublicationDesc(). Also, it is better to consider doing rename/owner_change related changes in a separate patch as that will make the first patch easier to review and commit.

--
With Regards,
Amit Kapila.
Dear Amit,

> > When the ALTER PUBLICATION command is executed, all entries in RelationSyncCache will be discarded anyway. This mechanism works well but is sometimes not efficient.
> > For example, when the ALTER PUBLICATION DROP TABLE is executed,
> > 1) the specific entry in RelationSyncCache will be removed, and then
> > 2) all entries will be discarded twice.
> >
>
> In (2) Why twice? Is it because we call rel_sync_cache_publication_cb() first for PUBLICATIONRELMAP and then for PUBLICATIONOID via publication_invalidation_cb()?

This part was not correct. I thought that ALTER PUBLICATION DROP TABLE also invalidated the pg_publication syscache, but it does not. So the command first invalidates the specific entry, and then invalidates all entries once.

> > This happens because the pgoutput plugin registers both RelcacheCallback (rel_sync_cache_relation_cb) and SyscacheCallback (publication_invalidation_cb, rel_sync_cache_publication_cb). Then, when ALTER PUBLICATION ADD/SET/DROP is executed, both the relation cache of added tables and the syscache of pg_publication_rel and pg_publication are invalidated.
> > The callback for the relation cache will remove an entry from the hash table, and syscache callbacks will look up all entries and invalidate them. However, AFAICS does not need to invalidate all of them.
> >
>
> IIUC, you want to say in the above case only (1) would be sufficient, right?

Right, this is the main aim of this proposal.

> Is it possible to see the impact? I guess if there are a large number of relations (or partitions), then all will get invalidated even if one relation is added/dropped from the publication; if so, this should impact the performance.

Agreed. I'm planning to run tests and share the results. Please wait...

> For Rename/Owner, can we use a new invalidation instead of using relcache invalidation, as that can impact other sessions for no benefit? IIUC, changing the other properties of publication like dropping the table requires us to invalidate even the corresponding relcache entry because it contains the publication descriptor (rd_pubdesc). See RelationBuildPublicationDesc().

The attached patch set implements that approach.
0001 contains only the part that avoids whole-cache invalidation, for the ADD/SET/DROP commands.
0002 introduces a new type of invalidation message which only invalidates caches in the decoding plugin. A better name is very welcome.
0003 uses that mechanism to avoid discarding relcaches for the RENAME/OWNER cases.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Tuesday, March 4, 2025 7:44 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:

Hi,

> > > When the ALTER PUBLICATION command is executed, all entries in RelationSyncCache will be discarded anyway. This mechanism works well but is sometimes not efficient.
> > > For example, when the ALTER PUBLICATION DROP TABLE is executed,
> > > 1) the specific entry in RelationSyncCache will be removed, and then
> > > 2) all entries will be discarded twice.
> > >
> >
> > In (2) Why twice? Is it because we call rel_sync_cache_publication_cb() first for PUBLICATIONRELMAP and then for PUBLICATIONOID via publication_invalidation_cb()?
>
> This part was not correct. I considered that ALTER PUBLICATION DROP TABLE also invalidated the pg_publication syscache, but it would not do. So...the command firstly invalidate a specific entry, and then invalidate all entries once.
>
> > > This happens because the pgoutput plugin registers both RelcacheCallback (rel_sync_cache_relation_cb) and SyscacheCallback (publication_invalidation_cb, rel_sync_cache_publication_cb). Then, when ALTER PUBLICATION ADD/SET/DROP is executed, both the relation cache of added tables and the syscache of pg_publication_rel and pg_publication are invalidated.
> > > The callback for the relation cache will remove an entry from the hash table, and syscache callbacks will look up all entries and invalidate them. However, AFAICS does not need to invalidate all of them.
> > >
> >
> > IIUC, you want to say in the above case only (1) would be sufficient, right?
>
> Right, this is a main aim of this proposal.
>
> > Is it possible to see the impact? I guess if there are a large number of relations (or partitions), then all will get invalidated even if one relation is added/dropped from the publication; if so, this should impact the performance.
>
> Agreed. I'm planning to do and share results. Please wait...
>
> > For Rename/Owner, can we use a new invalidation instead of using relcache invalidation, as that can impact other sessions for no benefit? IIUC, changing the other properties of publication like dropping the table requires us to invalidate even the corresponding relcache entry because it contains the publication descriptor (rd_pubdesc). See RelationBuildPublicationDesc().
>
> Attached patchset implemented with the approach.
> 0001 contains only part to avoid whole cache invalidation, for ADD/SET/DROP command.

I think the changes proposed in 0001 are reasonable. Basically, any modifications to pg_publication_rel or pg_publication_namespace should trigger relcache invalidation messages. This is necessary for rebuilding the cache stored in RelationData::rd_pubdesc, given that these changes could impact whether a table is published. Following this logic, it should be safe and appropriate to depend on individual relcache invalidation messages to invalidate the relsynccache in pgoutput, rather than invalidating the entire relsynccache for a single catalog change.

Additionally, I've verified the scenario involving partitioned tables to ensure the invalidation behaves correctly. When a parent table is added to a publication (with pubviaroot=true), it should influence the publication status of its child tables as well. And this case works as expected, since invalidation messages are dispatched for all items in the partition tree (GetPubPartitionOptionRelations()).

One nitpick for 0001: the comments atop rel_sync_cache_publication_cb() need updating, as they still reference pg_publication_rel and pg_publication_namespace. Aside from this, 0001 looks good to me.

Best Regards,
Hou zj
Dear Hou,

Thanks for reading the thread!

> > Attached patchset implemented with the approach.
> > 0001 contains only part to avoid whole cache invalidation, for ADD/SET/DROP command.
>
> I think the changes proposed in 0001 are reasonable. Basically, any modifications to pg_publication_rel or pg_publication_namespace should trigger relcache invalidation messages. This is necessary for rebuilding the cache stored in RelationData::rd_pubdesc, given that these changes could impact whether a table is published. Following this logic, it should be safe and appropriate to depend on individual relcache invalidation messages to invalidate the relsynccache in pgoutput, rather than invalidating the entire relsynccache for a single catalog change.

Right. I should have clarified these points in my previous post.

> One nitpick for 0001: the comments atop of the rel_sync_cache_publication_cb() need updating, as they still reference pg_publication_rel and pg_publication_namespace. Aside from this, 0001 looks good to me.

Thanks for pointing that out; fixed. PSA a new version.

Additionally, I found that discarding the whole cache can be avoided in more cases. Currently, all caches are unconditionally removed when pg_namespace is updated. Invalidation is needed to update the qualified names of tables in the replication messages after a schema rename, but IIUC it is enough to do it only for relations that belong to the renamed schema.

IIUC there are two approaches to implement it. The first idea is to introduce a new type of invalidation message. This is similar to what 0002 does. The invalidation message can contain the relids that belong to the renamed schema, and pgoutput can register a callback for it. The second idea is to use a syscache callback. IIUC, the callback is passed the hash value of the OID. Thus, RelSyncCache entries could store the hash value of their schema, and it could be compared in the callback function. I noted the idea as an XXX comment in the code of 0001.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Dear hackers,

I did performance testing with HEAD and v2-0001, and confirmed that it can improve performance by around 20% in a typical case.

Workload
======
I emulated a scenario in which there are many tables to be published and only one table stops and then resumes being published. On HEAD, the ALTER PUBLICATION DROP/ADD commands invalidate all entries in the RelSyncCache, whereas v2 invalidates only the needed one - this means the decoding time should be reduced after patching.

1. Initialized an instance
2. Created a root table and 5000 leaf tables.
3. Created another table which did not have a parent-child relationship.
4. Created a publication which included the root table and the other table.
5. Created a replication slot with the pgoutput plugin.
6. Executed a transaction which would be decoded. In the transaction:
   a. Inserted tuples into all the leaf tables
   b. Altered the publication to drop the other table
   c. Altered the publication again to add the dropped one
   d. Inserted tuples into all the leaf tables again.
7. Measured the decoding time via the SQL interfaces, five times.

The attached script automates the above (a rough SQL sketch of these steps also follows below).

Results
=====
Each cell is the median value of 5 runs. Compared with HEAD, v2-0001 reduces the decoding time by roughly 20%.

head [ms]    v2-0001 [ms]
252          198

I'm planning to do further tests to prove the benefit of the 0002 and 0003 patches.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
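A rough SQL sketch of the workload (the attached script is not reproduced here and may differ; object names, partition bounds, and the partition count are illustrative; wal_level = logical is assumed):

```
-- Steps 2-5: schema, publication, and slot setup.
CREATE TABLE root (a int) PARTITION BY RANGE (a);
DO $$
BEGIN
    FOR i IN 0..4999 LOOP
        EXECUTE format(
            'CREATE TABLE leaf_%s PARTITION OF root FOR VALUES FROM (%s) TO (%s)',
            i, i, i + 1);
    END LOOP;
END $$;
CREATE TABLE independent (a int);
CREATE PUBLICATION pub FOR TABLE root, independent;
SELECT pg_create_logical_replication_slot('test', 'pgoutput');

-- Step 6: the transaction to be decoded.
BEGIN;
INSERT INTO root SELECT generate_series(0, 4999);   -- 6a: touch every leaf
ALTER PUBLICATION pub DROP TABLE independent;       -- 6b
ALTER PUBLICATION pub ADD TABLE independent;        -- 6c
INSERT INTO root SELECT generate_series(0, 4999);   -- 6d: touch every leaf again
COMMIT;

-- Step 7: decode (and time) the changes via the SQL interface.
SELECT count(*) FROM pg_logical_slot_peek_binary_changes('test', NULL, NULL,
    'proto_version', '4', 'publication_names', 'pub');
```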
Attachment
On Wed, Mar 5, 2025 at 2:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> I did a performance testing with HEAD and v2-0001, and confirmed that it could improve performance around 20% in the typical case.
>
> Workload
> ======
> I emulated a scenario that there are many tables to be published and only one table is stop and resume publishing. In HEAD, ALTER PUBLICATION DROP/ADD command invalidates all entries in the RelSyncCache, but v2 invalidates only a needed one - this means the decoding time would be reduced after patching.
>
> 1. Initialized an instance
> 2. Created a root table and 5000 leaf tables.
> 3. Created another table which did not have parent-child relationship.
> 4. Created a publication which included a root table and another table
> 5. Created a replication slot with pgoutput plugin.
> 6. Executed a transaction which would be decoded. In the transaction:
>   a. Inserted tuples to all the leaf tables
>   b. Altered publication to drop another table
>   c. Altered publication again to add the dropped one
>   d. Inserted tuples to all the leaf tables again.
> 7. measured decoding time via SQL interfaces, five times.
>
> Attached script automated above.
>
> Results
> =====
> Each cell is a median value of 5 runs. Compared with HEAD, V2-0001 can reduce the decoding time, relatively 20%.
>

Yeah, this is a good improvement and the patch looks safe to me, so I pushed with minor changes in the comments.

..
>
> I'm planning to do further tests to prove the benefit for 0002, 0003 patches.
>

Ideally, we need the time of only step-6d with and without the patch. Step 6-a is required to build the cache, but in actual workloads via walsender, we won't need that, and also, at the end of SQL API execution, we will forcibly invalidate all caches, so that is also not required. See if you can avoid that in the performance measurement of the next set of patches.

--
With Regards,
Amit Kapila.
Dear Amit, hackers,

> Yeah, this is a good improvement and the patch looks safe to me, so I pushed with minor changes in the comments.

Thanks! PSA a rebased patch.

While considering 0002 further, I found a conceptual bug - the relsync cache could not be rebuilt when the SQL interfaces are used. In the old version the invalidation messages were not recorded in WAL, so the relsync cache could not be discarded if the decoding happened later. In the new version, the invalidation is handled as transactional and serialized to WAL. In more detail, the messages are stored in InvalMessageArrays and consumed at the end of the transaction. For simplification, the subgroup of the message is set to "relcache" when serializing, but IIUC this handling is OK - snapshot invalidation takes the same approach.

I did performance testing with the given patch. It is mostly the same as the previous test, but it uses two publications.

1. Initialize an instance
2. Create a root table and leaf tables. The variable $NUM_PART controls how many partitions exist on the instance.
3. Create another table
4. Create publications (p1, p2). One includes the root table, and the other one includes the independent table.
5. Create a replication slot with the pgoutput plugin.
6. Execute a transaction which would be decoded. In the transaction:
   a. Insert tuples into all the leaf tables
   b. Rename publication p2 to p2_renamed
   c. Rename publication p2_renamed to p2
   d. Insert tuples into all the leaf tables again.
7. Decode all the changes via the SQL interfaces.

(A short SQL sketch of how this differs from the previous workload follows below.)

My expectation is that on HEAD the leaves are cached at a), discarded at b) and c), and cached again at d). In the patched case, only the independent table is discarded at b) and c), so the decoding time should be reduced. The results below are the median of five runs. It seems to be around a 15% improvement.

head [ms]    patched (0001-0003) [ms]
239          200

> Ideally, we need the time of only step-6d with and without the patch. Step 6-a is required to build the cache, but in actual workloads via walsender, we won't need that, and also, at the end of SQL API execution, we will forcibly invalidate all caches, so that is also not required. See, if in the performance measurement of the next set of patches, you can avoid that.

Hmm. It requires additional debug messages. Let me consider.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
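The setup only differs from the earlier sketch in the publication layout and in step 6. Roughly (again with illustrative names, reusing the root/independent tables from the previous sketch):

```
-- Two publications: p1 has the partitioned root, p2 the independent table.
CREATE PUBLICATION p1 FOR TABLE root;
CREATE PUBLICATION p2 FOR TABLE independent;

-- Step 6: inserts with two publication renames in between.
BEGIN;
INSERT INTO root SELECT generate_series(0, 4999);   -- 6a
ALTER PUBLICATION p2 RENAME TO p2_renamed;          -- 6b
ALTER PUBLICATION p2_renamed RENAME TO p2;          -- 6c
INSERT INTO root SELECT generate_series(0, 4999);   -- 6d
COMMIT;
```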
Attachment
On Thu, Mar 6, 2025 at 5:16 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit, hackers,
>
> > Yeah, this is a good improvement and the patch looks safe to me, so I pushed with minor changes in the comments.
>
> Thanks! PSA rebased patch.
>

Few comments:

1. Why do we need to invalidate relsync entries when the owner of its publication changes?

I think the owner change will impact the future Alter Publication ... Add/Drop/Set/Rename operations as that will be allowed only to the new owner (or superusers); otherwise, there shouldn't be an impact on RelSyncCache entries. Am I missing something?

2.
+ if (pubform->puballtables)
+ {
+ CacheInvalidateRelSyncAll();
+ }
+ else
+ {
+ List *relids = NIL;
+ List *schemarelids = NIL;
+
+ /*
+ * For partition table, when we insert data, get_rel_sync_entry is
+ * called and a hash entry is created for the corresponding leaf table.
+ * So invalidating the leaf nodes would be sufficient here.
+ */
+ relids = GetPublicationRelations(pubform->oid,
+ PUBLICATION_PART_LEAF);
+ schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+ PUBLICATION_PART_LEAF);
+
+ relids = list_concat_unique_oid(relids, schemarelids);
+
+ InvalidateRelSyncCaches(relids);
+ }
+
+ CatalogTupleUpdate(rel, &tup->t_self, tup);

Shouldn't we update the catalog tuple before the invalidations?

3.
+ if (pubform->puballtables)
+ {
+ CacheInvalidateRelSyncAll();
+ }
+ else
+ {
+ List *relids = NIL;
+ List *schemarelids = NIL;
+
+ /*
+ * For partition table, when we insert data, get_rel_sync_entry is
+ * called and a hash entry is created for the corresponding leaf table.
+ * So invalidating the leaf nodes would be sufficient here.
+ */
+ relids = GetPublicationRelations(pubform->oid,
+ PUBLICATION_PART_LEAF);
+ schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
+ PUBLICATION_PART_LEAF);
+
+ relids = list_concat_unique_oid(relids, schemarelids);
+
+ InvalidateRelSyncCaches(relids);
+ }

This is duplicated code. Can we write a function to eliminate this duplication?

--
With Regards,
Amit Kapila.
Dear hackers,

> Hmm. It requires additional debug messages. Let me consider.

I did further performance testing to see the cache re-building time, and confirmed that the decoding time of the second INSERT part became around 1/10 with the patch. In this test the number of partitions was also varied. Each cell is the median of 5 runs.

number of partitions    HEAD [microsecond]    v4 patched [microsecond]
--------------------------------------------------------------------------------
1000                    6030                  635
5000                    33456                 3555
15000                   106688                14049

Workload
======
The workload is mostly the same as before, but I ensured that step 6 is done within a single transaction. I attached it just in case.

Used source
========
The attached diff file was applied atop HEAD (d611f8) and the v4 patch set. The patched source outputs the current timestamp (in microseconds) when 1) an INSERT is decoded, 2) an invalidation happens, or 3) WAL processing is finished for the transaction. Also, it has a ratchet mechanism so that it does not log every time: consecutive INSERTs or invalidations are ignored. Typically, the output looks like below:

```
postgres=# SELECT count(*) FROM pg_logical_slot_peek_binary_changes('test', NULL, NULL, 'proto_version', '4', 'publication_names','pub1');
NOTICE:  XXX start inserting. Time 794645780406191
NOTICE:  XXX start invalidating. Time 794645780407423
NOTICE:  XXX start inserting. Time 794645780407448    // (a)
NOTICE:  XXX finish processing. Time 794645780407488   // (b)
 count
-------
     7
(1 row)
```

How I compared
===========
We can obtain the elapsed time of the second INSERT part as (b) - (a). I took the calculated value for each run and noted the median in the table above.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Dear Amit,

Thanks for the comment! PSA a new version.

> Few comments:
> 1. Why do we need to invalidate relsync entries when owner of its publication changes?
>
> I think the owner change will impact the future Alter Publication ... Add/Drop/Set/Rename operations as that will be allowed only to new owner (or super users), otherwise, there shouldn't be an impact on RelSyncCache entries. Am, I missing something?

Actually, I could not find a reason to invalidate for the OWNER command either. I included it to 1) follow the current behavior, and 2) prepare for future updates.

1) - Currently RelSyncCache entries are discarded even by ALTER PUBLICATION OWNER statements. I wanted to preserve that.
2) - In future versions, privileges might be introduced for publications, so that some publications would not be visible to other database users. In that case I feel we would have to modify RelationSyncEntry::pubactions, columns, and exprstate, so the entries would have to be discarded.

For now I removed the invalidations for the OWNER command. Let's revert the change if we find we are missing something.

> 2.
> + if (pubform->puballtables)
> + {
> + CacheInvalidateRelSyncAll();
> + }
> + else
> + {
> + List *relids = NIL;
> + List *schemarelids = NIL;
> +
> + /*
> + * For partition table, when we insert data, get_rel_sync_entry is
> + * called and a hash entry is created for the corresponding leaf table.
> + * So invalidating the leaf nodes would be sufficient here.
> + */
> + relids = GetPublicationRelations(pubform->oid,
> + PUBLICATION_PART_LEAF);
> + schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
> + PUBLICATION_PART_LEAF);
> +
> + relids = list_concat_unique_oid(relids, schemarelids);
> +
> + InvalidateRelSyncCaches(relids);
> + }
> +
> + CatalogTupleUpdate(rel, &tup->t_self, tup);
>
> Shouldn't we need to update the CatalogTuple before invalidations.

Right, fixed.

> 3.
> + if (pubform->puballtables)
> + {
> + CacheInvalidateRelSyncAll();
> + }
> + else
> + {
> + List *relids = NIL;
> + List *schemarelids = NIL;
> +
> + /*
> + * For partition table, when we insert data, get_rel_sync_entry is
> + * called and a hash entry is created for the corresponding leaf table.
> + * So invalidating the leaf nodes would be sufficient here.
> + */
> + relids = GetPublicationRelations(pubform->oid,
> + PUBLICATION_PART_LEAF);
> + schemarelids = GetAllSchemaPublicationRelations(pubform->oid,
> + PUBLICATION_PART_LEAF);
> +
> + relids = list_concat_unique_oid(relids, schemarelids);
> +
> + InvalidateRelSyncCaches(relids);
> + }
>
> This is a duplicate code. Can we write a function to eliminate this duplicacy?

Since that part has been removed from the OWNER command, the duplication is gone. I did not introduce a function for this.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Fri, Mar 7, 2025 at 4:28 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit,
>
> Thanks for the comment! PSA new version.
>

Don't we need to call this invalidation function from InvalidateSystemCachesExtended()?

--
With Regards,
Amit Kapila.
Dear Amit,

> Don't we need to call this invalidation function from InvalidateSystemCachesExtended()?

Hmm. I intentionally did not call the relsync callback functions in InvalidateSystemCachesExtended(), but I have added that now. Please see my analysis below and let me know what you think.

In the first place, InvalidateSystemCachesExtended() can be called by 1) InvalidateSystemCaches(), and 2) AcceptInvalidationMessages().

Regarding 1), it can be reached when a) a parallel worker is launched, b) a decoder starts decoding, or c) a decoder finishes decoding and the decoding context is freed. However, a parallel worker won't do any decoding, the relsync cache is initially empty when decoding starts, and the relsync cache is dropped anyway when the decoding context is removed. Based on these facts I did not call the function.

As for 2), that path exists only in debug builds when debug_discard_caches is set. According to the manual and the code comments, it must discard all caches every time. So, OK, I decided to follow that rule.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
I found that cfbot got angry due to variable shadowing. PSA a fixed version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
Dear hackers,

I did a self-review and updated the patch. PSA the new version. What's new:

1. Fixed a bug where an existing name could be specified in ALTER PUBLICATION ... RENAME. A validation is added in RenamePublication(). (A small illustration follows below.)

Just in case, I want to discuss the case where the renamed publication has thousands of tables. Before the patch, the ALTER PUBLICATION ... RENAME command just invalidated a syscache, and the registered callback invalidated all entries. With the patch, however, all relevant relsync entries are invalidated within the command itself, so its execution time can be longer. I feel this is acceptable because other code behaves similarly; e.g., the ATTACH/DETACH partition commands invalidate the relcaches of the children when the specified relation also has children - even if it has thousands of leaves (see d6f1e16). Also note that all other processes, such as walsenders, can become faster by avoiding discarding their entire caches.

How do you feel about this?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
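For clarity, a tiny example of the behavior the new validation preserves (publication names are invented; the exact error wording may differ):

```
CREATE PUBLICATION pub_a;
CREATE PUBLICATION pub_b;

-- Must keep failing after switching to RenamePublication(), roughly:
--   ERROR:  publication "pub_b" already exists
ALTER PUBLICATION pub_a RENAME TO pub_b;
```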
Attachment
On Monday, March 10, 2025 7:00 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
>
> I did a self-reviewing and updated a patch. PSA new version. What's new:

Thanks for updating the patch. I tested the behavior for partitioned tables and have a comment on this.

> + relids = GetPublicationRelations(pubform->oid,
> + PUBLICATION_PART_LEAF);

Currently, only the leaf partition is invalidated when the published table is partitioned. However, I think pgoutput could cache both the partitioned table and the leaf partition table as relsync entries.

For INSERT/UPDATE/DELETE on a partitioned table, only the leaf partition's relsync entry is used in pgoutput, but TRUNCATE references the parent table's relsync entry. For example[1], if the parent table's relsync entry is not invalidated after a RENAME operation, it results in the TRUNCATE being missed. So I think we should invalidate all the tables in the partition tree by passing PUBLICATION_PART_ALL, which is also consistent with ALTER PUB ADD/SET/DROP TABLE.

[1]--- Example ---
create table test(a int primary key) PARTITION BY RANGE (a);
CREATE TABLE test_1 PARTITION OF test FOR VALUES FROM (0) TO (2);
CREATE TABLE test_2 PARTITION OF test FOR VALUES FROM (2) TO (4);

CREATE PUBLICATION pub;
CREATE PUBLICATION pub2 FOR TABLE test WITH (PUBLISH_VIA_PARTITION_ROOT);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput');

TRUNCATE test;

ALTER PUBLICATION pub RENAME TO pub3;
ALTER PUBLICATION pub2 RENAME TO pub;

TRUNCATE test;

-- I can consume some changes using the following function on HEAD, but got
-- nothing after applying the patch.
SELECT * FROM pg_logical_slot_get_binary_changes('isolation_slot', NULL, NULL, 'proto_version', '4', 'publication_names','pub', 'streaming', 'on');
---

Best Regards,
Hou zj
Dear Hou,

> Currently, only the leaf partition is invalidated when the published table is partitioned. However, I think pgoutput could cache both the partitioned table and the leaf partition table as relsync entries.
>
> For INSERT/UPDATE/DELETE on a partitioned table, only the leaf partition's relsync entry is used in pgoutput, but the TRUNCATE references the parent table's relsync entry.

I think your analysis is correct. PSA a new version. The part below contains my analysis.

In ExecuteTruncate(), if the specified relation has children, all of them are checked via find_all_inheritors() and listed as targets. Also, ExecuteTruncateGuts() serializes both the parent and the children in the XLOG_HEAP_TRUNCATE WAL record. The decoding layer passes the relations as-is. These facts mean that output plugins can have cached entries for both in memory.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Mon, Mar 10, 2025 at 6:42 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > Currently, only the leaf partition is invalidated when the published table is partitioned. However, I think pgoutput could cache both the partitioned table and the leaf partition table as relsync entries.
> >
> > For INSERT/UPDATE/DELETE on a partitioned table, only the leaf partition's relsync entry is used in pgoutput, but the TRUNCATE references the parent table's relsync entry.
>
> I think your analysis is correct. PSA new version. Below part contains my analysis.
>

I have made several cosmetic changes atop the 0001 patch in the attached. Additionally, I fixed an issue in AddRelsyncInvalidationMessage() to consider invalidating all the RelSyncCache entries. Kindly include these in the next version if you find the changes okay.

--
With Regards,
Amit Kapila.
Attachment
On Monday, March 10, 2025 9:12 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
>
> > Currently, only the leaf partition is invalidated when the published table is partitioned. However, I think pgoutput could cache both the partitioned table and the leaf partition table as relsync entries.
> >
> > For INSERT/UPDATE/DELETE on a partitioned table, only the leaf partition's relsync entry is used in pgoutput, but the TRUNCATE references the parent table's relsync entry.
>
> I think your analysis is correct. PSA new version. Below part contains my analysis.
>
> In ExecuteTruncate(), if the specified relation has children, all of them are checked via find_all_inheritors() and listed as target. Also, ExecuteTruncateGuts() serializes both a parent and children in XLOG_HEAP_TRUNCATE WAL record. Decoding layer passes relations as-is. These facts mean that output plugins can store caches on the memory.

Thanks for updating the patch. I have reviewed patch 0001 and did not find issues, aside from a few issues in code comments that were mentioned in Amit's email.

Here are some analyses and insights gathered during the review of 0001:

The functions and variables in the 0001 patch use 'Relsync' (e.g., RegisterRelsyncInvalidation) instead of the longer 'RelsyncCache'. After internal discussions, we think it's OK, as using 'RelsyncCache' could unnecessarily lengthen the names.

Furthermore, considering we're introducing a new invalidation message for the RelationSyncCache within pgoutput, which is an output plugin, we discussed whether a more general invalidation name should be adopted in case other output plugins might use it. However, after reviewing all the plugins listed in the Wiki[1], we did not find any other plugins that reference the built-in publication catalog. Therefore, this scenario appears to be uncommon. Additionally, the current naming convention is sufficiently intuitive for output plugin developers. Hence, we feel it is reasonable to retain the existing names.

For patch 0002, I think the implementation could be improved. The current patch introduces a new function, RenamePublication, to replace the existing generic approach in ExecRenameStmt->AlterObjectRename_internal. However, this creates inconsistency because the original code uses AccessExclusiveLock for publication renaming, making concurrent renaming impossible. The current patch seems to overlook this aspect.

Additionally, introducing a new function results in code duplication, which can be avoided. After further consideration, handling the publication rename separately seems unnecessary, given it requires only sending a few extra invalidation messages. Therefore, I suggest reusing the generic handling and simply extending AlterObjectRename_internal to include the additional invalidation messages.

I've attached a diff with the proposed changes for 0002.

Best Regards,
Hou zj
Attachment
Dear Amit, Hou,

Thanks for the comments! PSA a new version.

> For patch 0002, I think the implementation could be improved. The current patch introduces a new function, RenamePublication, to replace the existing generic approach in ExecRenameStmt->AlterObjectRename_internal. However, this creates inconsistency because the original code uses AccessExclusiveLock for publication renaming, making concurrent renaming impossible. The current patch seems to overlook this aspect.

Oh, I missed that point, thanks.

> Additionally, introducing a new function results in code duplication, which can be avoided. After further consideration, handling the publication rename separately seems unnecessary, given it requires only sending a few extra invalidation messages. Therefore, I suggest reusing the generic handling and simply extending AlterObjectRename_internal to include the additional invalidation messages.
>
> I've attached a diff with the proposed changes for 0002.

Hmm, possible. Previously I introduced a new function to preserve the existing code as much as possible, but that introduced some duplication. Your approach of extending the common function looks nice, apart from two points:
1. Relsync cache invalidations should be done after the catalog update, but the proposed diff did them before that. I preferred to add a new if-statement for the post-catalog-update part.
2. Also, the common check for the name conflict can be reused.

The attached patch addresses the comments from you and Amit [1]. What's new:
* Skip adding new relsync messages when a message for InvalidOid already exists, per the comment from Amit [1].
* Updated some code comments; some of them were pointed out in [1].
* Stopped using RenamePublication() and extended the common function instead.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Tue, Mar 11, 2025 at 5:47 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Attached patch address comments by you and Amit [1]. What's new;
>

Thanks, the patch looks mostly good to me. I have made a few cosmetic changes in the attached and combined both the patches.

The existing Alter Publication ... Rename tests don't test invalidations arising from that command. As this patch changes that code path, it would be good to add a few tests for the same. We can add one for individual relations and another for an ALL TABLES publication.

--
With Regards,
Amit Kapila.
Attachment
Dear Amit,

> Thanks, the patch looks mostly good to me. I have made few cosmetic changes in the attached and combined both the patches.

Thanks, it looks good to me.

> The existing Alter Publication ... Rename tests don't test invalidations arising from that command. As this patch changes that code path, it would be good to add a few tests for the same. We can add one for individual relations and another for ALL Tables publication.

I created a new patch which adds the test code. I added it to 007_ddl.pl, but I would also be fine with introducing a new test file. What do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Wed, Mar 12, 2025 at 5:46 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > Thanks, the patch looks mostly good to me. I have made few cosmetic changes in the attached and combined both the patches.
>
> Thanks, it looks good to me.
>
> > The existing Alter Publication ... Rename tests don't test invalidations arising from that command. As this patch changes that code path, it would be good to add a few tests for the same. We can add one for individual relations and another for ALL Tables publication.
>
> I created new patch which adds a test code.
>

Thanks. I have pushed the patch after minor changes in the test.

--
With Regards,
Amit Kapila.
Hi,

On 2025-03-13 14:13:39 +0530, Amit Kapila wrote:
> On Wed, Mar 12, 2025 at 5:46 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> >
> > > Thanks, the patch looks mostly good to me. I have made few cosmetic changes in the attached and combined both the patches.
> >
> > Thanks, it looks good to me.
> >
> > > The existing Alter Publication ... Rename tests don't test invalidations arising from that command. As this patch changes that code path, it would be good to add a few tests for the same. We can add one for individual relations and another for ALL Tables publication.
> >
> > I created new patch which adds a test code.
> >
>
> Thanks. I have pushed the patch after minor changes in the test.

I think the new tests just failed in CI:
https://cirrus-ci.com/task/5602950271205376?logs=test_world#L268

[13:34:38.562] ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
[13:34:38.562] stderr:
[13:34:38.562] # Failed test 'check replication worked well before renaming a publication'
[13:34:38.562] # at /tmp/cirrus-ci-build/src/test/subscription/t/007_ddl.pl line 93.
[13:34:38.562] # got: ''
[13:34:38.562] # expected: '1'
[13:34:38.562] # Failed test 'check the tuple inserted after the RENAME was not replicated'
[13:34:38.562] # at /tmp/cirrus-ci-build/src/test/subscription/t/007_ddl.pl line 110.
[13:34:38.562] # got: ''
[13:34:38.562] # expected: '1'
[13:34:38.562] # Looks like you failed 2 tests of 8.
[13:34:38.562]
[13:34:38.562] (test program exited with status code 2)
[13:34:38.562] ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
[13:34:38.562]

Greetings,

Andres Freund
Dear Andres,

> I think the new tests just failed in CI:
> https://cirrus-ci.com/task/5602950271205376?logs=test_world#L268

Thanks for reporting, I'll look into it.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Mar 28, 2025 at 6:03 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Andres,
>
> > I think the new tests just failed in CI:
> > https://cirrus-ci.com/task/5602950271205376?logs=test_world#L268
>
> Thanks for reporting, I'll look into it.
>

The problem here is that after ALTER SUBSCRIPTION tap_sub SET PUBLICATION ..., we didn't wait for the new walsender on the publisher to start. We must use wait_for_subscription_sync both after the "CREATE SUBSCRIPTION ..." and the "ALTER SUBSCRIPTION ..." commands and keep copy_data=true to ensure the initial replication is set up between publisher and subscriber. This is how we use these commands at other places.

--
With Regards,
Amit Kapila.
Dear Amit,

> The problem here is that after ALTER SUBSCRIPTION tap_sub SET PUBLICATION ..., we didn't wait for the new walsender on publisher to start. We must use wait_for_subscription_sync both after the "CREATE SUBSCRIPTION ..." and the "ALTER SUBSCRIPTION ..." commands and keep copy_data=true to ensure the initial replication is setup between publisher and subscriber. This is how we use these commands at other places.

Agreed. PSA the patch to fix the issue.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment
On Fri, Mar 28, 2025 at 10:46 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > The problem here is that after ALTER SUBSCRIPTION tap_sub SET PUBLICATION ..., we didn't wait for the new walsender on publisher to start. We must use wait_for_subscription_sync both after the "CREATE SUBSCRIPTION ..." and the "ALTER SUBSCRIPTION ..." commands and keep copy_data=true to ensure the initial replication is setup between publisher and subscriber. This is how we use these commands at other places.
>
> Agreed. PSA the patch to fix the issue.
>

Pushed after slight modification.

--
With Regards,
Amit Kapila.