Re: Initial Schema Sync for Logical Replication - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Initial Schema Sync for Logical Replication
Date
Msg-id CAD21AoBiLh39X4b6fhpXXJHrsUNa2u_z0Z4V2jj9UZozrNh74w@mail.gmail.com
Whole thread Raw
In response to Re: Initial Schema Sync for Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Initial Schema Sync for Logical Replication
List pgsql-hackers
On Tue, May 23, 2023 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Yes, in this approach, we need to dump/restore objects while
> > > specifying with fine granularity. Ideally, the table sync worker dumps
> > > and restores the table schema, does copy the initial data, and then
> > > creates indexes, and triggers and table-related objects are created
> > > after that. So if we go with the pg_dump approach to copy the schema
> > > of individual tables, we need to change pg_dump (or libpgdump needs to
> > > be able to do) to support it.
> >
> > We have been discussing how to sync schema but I'd like to step back a
> > bit and discuss use cases and requirements of this feature.
> >
> > Suppose that a table belongs to a publication, what objects related to
> > the table we want to sync by the initial schema sync features? IOW, do
> > we want to sync table's ACLs, tablespace settings, triggers, and
> > security labels too?
> >
> > If we want to replicate the whole database, e.g. when using logical
> > replication for major version upgrade, it would be convenient if it
> > synchronizes all table-related objects. However, if we have only this
> > option, it could be useless in some cases. For example, in a case
> > where users have different database users on the subscriber than the
> > publisher, they might want to sync only CREATE TABLE, and set ACL etc
> > by themselves. In this case, it would not be necessary to sync ACL and
> > security labels.
> >
> > What use case do we want to support by this feature? I think the
> > implementation could be varied depending on how to select what objects
> > to sync.
> >
> > One possible idea is to select objects to sync depending on how DDL
> > replication is set in the publisher. It's straightforward but I'm not
> > sure the design of DDL replication syntax has been decided. Also, even
> > if we create a publication with ddl = 'table' option, it's not clear
> > to me that we want to sync table-dependent triggers, indexes, and
> > rules too by the initial sync feature.
> >
>
> I think it is better to keep the initial sync the same as the
> replication. So, if the publication specifies 'table' then we should
> just synchronize tables. Otherwise, it will look odd that the initial
> sync has synchronized say index-related DDLs but then later
> replication didn't replicate it. OTOH, if we want to do initial sync
> of table-dependent objects like triggers, indexes, rules, etc. when
> the user has specified ddl = 'table' then the replication should also
> follow the same. The main reason to exclude the other objects during
> replication is to reduce the scope of deparsing patch but if we have a
> finite set of objects (say all dependent on the table) then we can
> probably try to address those.
>

We have discussed several ideas of how to synchronize schemas between
publisher and subscribers, and the points are summarized in Wiki
page[1]. As for the idea of using pg_dump, we were concerned that
pg_dump needs to be present along with the server binary if the user
needs to use the initial schema synchronization feature. Since these
binaries are typically included in different packages, they need to
install both. During PGCon we've discussed with some senior hackers
that it would be an acceptable limitation for users. When executing
CREATE/ALTER SUBSCRIPTION, we check if pg_dump is available and raise
an error if not. We've also discussed the idea  of using
pg_dump_library but no one preferred this idea because of its
implementation costs. Therefore, I'm going to do further evaluation
for the pg_dump idea.

I agree with Amit that the initial schema synchronization should
process the same as the DDL replication. We can support only table
schemas as the first step. To do that, we need a new switch, say
--exclude-table-dependents, in pg_dump to dump only table schemas
excluding table-related objects such as triggers and indexes. Then, we
can support synchronizing tables and table-related objects such as
triggers, indexes, and rules, as the second step, which can be done
with the --schema and --table option. Finally, we can synchronize the
whole database by using the --schema option.

We also need to research how to integrate the initial schema
synchronization with tablesync workers.  We have a PoC patch[2].

Regards,

[1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync
[2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Next
From: Dilip Kumar
Date:
Subject: Re: Let's make PostgreSQL multi-threaded