Re: Initial Schema Sync for Logical Replication - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Initial Schema Sync for Logical Replication |
Date | |
Msg-id | CAD21AoBiLh39X4b6fhpXXJHrsUNa2u_z0Z4V2jj9UZozrNh74w@mail.gmail.com Whole thread Raw |
In response to | Re: Initial Schema Sync for Logical Replication (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Initial Schema Sync for Logical Replication
|
List | pgsql-hackers |
On Tue, May 23, 2023 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Yes, in this approach, we need to dump/restore objects while > > > specifying with fine granularity. Ideally, the table sync worker dumps > > > and restores the table schema, does copy the initial data, and then > > > creates indexes, and triggers and table-related objects are created > > > after that. So if we go with the pg_dump approach to copy the schema > > > of individual tables, we need to change pg_dump (or libpgdump needs to > > > be able to do) to support it. > > > > We have been discussing how to sync schema but I'd like to step back a > > bit and discuss use cases and requirements of this feature. > > > > Suppose that a table belongs to a publication, what objects related to > > the table we want to sync by the initial schema sync features? IOW, do > > we want to sync table's ACLs, tablespace settings, triggers, and > > security labels too? > > > > If we want to replicate the whole database, e.g. when using logical > > replication for major version upgrade, it would be convenient if it > > synchronizes all table-related objects. However, if we have only this > > option, it could be useless in some cases. For example, in a case > > where users have different database users on the subscriber than the > > publisher, they might want to sync only CREATE TABLE, and set ACL etc > > by themselves. In this case, it would not be necessary to sync ACL and > > security labels. > > > > What use case do we want to support by this feature? I think the > > implementation could be varied depending on how to select what objects > > to sync. > > > > One possible idea is to select objects to sync depending on how DDL > > replication is set in the publisher. It's straightforward but I'm not > > sure the design of DDL replication syntax has been decided. Also, even > > if we create a publication with ddl = 'table' option, it's not clear > > to me that we want to sync table-dependent triggers, indexes, and > > rules too by the initial sync feature. > > > > I think it is better to keep the initial sync the same as the > replication. So, if the publication specifies 'table' then we should > just synchronize tables. Otherwise, it will look odd that the initial > sync has synchronized say index-related DDLs but then later > replication didn't replicate it. OTOH, if we want to do initial sync > of table-dependent objects like triggers, indexes, rules, etc. when > the user has specified ddl = 'table' then the replication should also > follow the same. The main reason to exclude the other objects during > replication is to reduce the scope of deparsing patch but if we have a > finite set of objects (say all dependent on the table) then we can > probably try to address those. > We have discussed several ideas of how to synchronize schemas between publisher and subscribers, and the points are summarized in Wiki page[1]. As for the idea of using pg_dump, we were concerned that pg_dump needs to be present along with the server binary if the user needs to use the initial schema synchronization feature. Since these binaries are typically included in different packages, they need to install both. During PGCon we've discussed with some senior hackers that it would be an acceptable limitation for users. When executing CREATE/ALTER SUBSCRIPTION, we check if pg_dump is available and raise an error if not. We've also discussed the idea of using pg_dump_library but no one preferred this idea because of its implementation costs. Therefore, I'm going to do further evaluation for the pg_dump idea. I agree with Amit that the initial schema synchronization should process the same as the DDL replication. We can support only table schemas as the first step. To do that, we need a new switch, say --exclude-table-dependents, in pg_dump to dump only table schemas excluding table-related objects such as triggers and indexes. Then, we can support synchronizing tables and table-related objects such as triggers, indexes, and rules, as the second step, which can be done with the --schema and --table option. Finally, we can synchronize the whole database by using the --schema option. We also need to research how to integrate the initial schema synchronization with tablesync workers. We have a PoC patch[2]. Regards, [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: