Thread: Initial Schema Sync for Logical Replication

Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:

Hi Everyone,

I am working on initial schema sync for logical replication. Currently, users have to
manually create the schema on the subscriber side. The aim of this feature is to add an
option to CREATE SUBSCRIPTION so that the schema sync can be automatic. I am sharing the
design doc below, but there are some corner cases where the design does not work. Please
share your opinion on whether the design can be improved so we can get rid of the corner
cases. This design is loosely based on pglogical.

DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)

 

SQL Changes:-

CREATE SUBSCRIPTION subscription_name
    CONNECTION 'conninfo'
    PUBLICATION publication_name [, ...]
    [ WITH ( subscription_parameter [= value] [, ... ] ) ]

sync_initial_schema (enum) will be added to subscription_parameter. It can have 3 values:-
TABLES, ALL, NONE (default)
With ALL, everything will be synced, including global objects.

 

Restrictions :- sync_initial_schema=ALL can only be used for a publication created with FOR ALL TABLES.
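
For illustration, creating a subscription with the proposed option might look like the
following (sync_initial_schema does not exist yet; the names and connection string are
made up):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub.example.com dbname=appdb user=repuser'
    PUBLICATION mypub
    WITH (sync_initial_schema = TABLES);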

 

Design:-

 

Publisher :-

The publisher has to implement something like `SHOW CREATE TABLE_NAME`; this table
definition will be used by the subscriber to create the exact schema of the table on the
subscriber. One alternative is to do it on the subscriber side itself: we can create a
function similar to describeOneTableDetails and call it from the subscriber. We also need
to maintain the same ownership as on the publisher.
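
As a rough sketch of the proposed interface (neither the command nor its output format
exists today; both are assumptions of this design):

-- Proposed command, run by the subscriber against the publisher:
SHOW CREATE public.t1;
-- would return something like:
--   CREATE TABLE public.t1 (c1 integer NOT NULL, c2 text, PRIMARY KEY (c1));
--   ALTER TABLE public.t1 OWNER TO table_owner;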

 

It should also have turned on publication of DDL commands.

 

Subscriber :-

 

1. In CreateSubscription(), when we create the replication slot (walrcv_create_slot()), we
should use CRS_EXPORT_SNAPSHOT so that we can use this snapshot later in pg_dump.
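
For reference, the walsender already supports exporting a snapshot when a slot is
created; roughly (the slot name below is made up):

-- Replication-protocol command issued over the subscriber's connection:
CREATE_REPLICATION_SLOT "mysub_slot" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- returns: slot_name | consistent_point | snapshot_name | output_plugin
-- snapshot_name is what would later be passed to pg_dump --snapshot.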

 

2. Now we can call pg_dump with the above snapshot from CreateSubscription(). This happens
inside the opts.connect && opts.create_slot if statement. If we fail in this step, we have
to drop the replication slot and create a new one, because we need a snapshot and creating
a replication slot is the way to get one. The reason for running pg_dump with the above
snapshot is that we don't want to execute the DDLs in the WAL twice. With the above
snapshot we get the state of the database as of the replication slot's origin, and any
changes after the snapshot will be in the WAL.

 

We will save the pg_dump output into a file (custom archive format), so the pg_dump
invocation will be similar to

pg_dump "$connection_string" --schema-only --snapshot=xyz -Fc --file=initSchema

 

If sync_initial_schema=TABLES, we don't have to call pg_dump/pg_restore at all; the
tablesync process will take care of it.

 

3. If we have to sync global objects, we also need to call pg_dumpall --globals-only. But
pg_dumpall does not support the --snapshot option, so if the user creates a new global
object between the creation of the replication slot and the run of pg_dumpall, that global
object will be created twice on the subscriber, which will error out the apply process.
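
For example (the role name is made up), assuming global DDL is also replicated under this
design, the subscriber would end up executing the same statement twice:

-- Runs once from the pg_dumpall script, and again from the replicated DDL:
CREATE ROLE app_reader;
-- second execution fails with:
-- ERROR:  role "app_reader" already exists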

 

4. walrcv_disconnect should be called only after pg_dump has finished; otherwise the
snapshot will no longer be valid.

 

5. Users with only the REPLICATION role cannot run pg_dump, so the replication user has to
be a superuser. This is a major problem.

postgres=# create role s4 WITH LOGIN REPLICATION;
CREATE ROLE

$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR:  permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE

 

6. The pg_subscription_rel catalog column srsubstate will get one more state,
SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled, we will set the table state to 'c'.
The above 6 steps will be done even if the subscription is not enabled, as long as connect is true.
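
The new state would then be visible in the catalog like this (the 'c' value is the state
proposed here; everything else exists today):

-- On the subscriber, inspect per-table sync states:
SELECT srrelid::regclass AS relname, srsubstate
FROM pg_subscription_rel;
--   relname  | srsubstate
-- -----------+------------
--  public.t1 | c             <- proposed SUBREL_STATE_CREATE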

 

7. The leader apply process should check whether the initSchema file exists; if it does,
it should call pg_restore. We are not using the --pre-data and --post-data sections the
way pglogical does, because the post-data section assumes the table already has data, but
we will fill the data into the tables at a later stage. pg_restore can be called like this:

pg_restore -d "$connection_string" -1 initSchema

The -1 option executes every command inside a single transaction, so if there is any error
everything is rolled back. pg_restore should be called quite early in the apply process
code, before any tablesync process can be created.
Instead of checking whether the file exists, maybe the pg_subscription catalog can be
extended with a column SyncInitialSchema, and the apply process would check
SyncInitialSchema == SYNC_PENDING.

 

8. The tablesync process should check the state of the table; if it is
SUBREL_STATE_CREATE, it should get the latest definition from the publisher and recreate
the table (we have to recreate the table even if there are no changes). Then it should go
into copy-table mode as usual.

It might seem that tablesync is duplicating work already done by pg_restore. We do it this
way because of concurrent DDLs and the REFRESH PUBLICATION command.

 

Concurrent DDL :-
A user can execute a DDL command on table t1 at the same time the subscriber is trying to
sync it. Pictorial representation: https://imgur.com/a/ivrIEv8 [1]

The tablesync process makes its own connection to the publisher, where it can see a table
state that is in the future with respect to the schema the subscriber restored, which can
introduce conflicts. For example:

CASE 1:- { The publisher removed column b from table t1 while the subscriber was doing
pg_restore (or at any point in the concurrent DDL window described in picture [1]). When
the tablesync process starts its transaction on the publisher, it will request the data of
table t1 including column b, which no longer exists on the publisher. } That is why the
tablesync process asks for the latest definition.

If we say that we will delay the tablesync worker until all the DDL related to table t1
has been applied by the apply process, we can still have a window in which the publisher
issues a DDL command just before tablesync starts its transaction, making the tablesync
and publisher table definitions incompatible (thanks to Masahiko for pointing out this
race condition).

The apply process will skip all DDL/DML related to table t1, and tablesync will apply
those in the catchup phase. Although there is one issue: what will happen to views or
functions which depend on the table? I think they should wait until table_state is
> SUBREL_STATE_CREATE (meaning we have the latest schema definition from the publisher).

There might be corner cases to this approach, or maybe a better way to handle concurrent
DDL. One simple solution might be to disallow DDLs on the publisher until all of the
schema is synced and all tables have state >= SUBREL_STATE_DATASYNC (we can have the
CASE 1 issue even with DDL replication, so we have to wait until all the tables have
table_state > SUBREL_STATE_DATASYNC), but that might be a big window for big databases.

 

 

Refresh publication :-
During refresh publication, the subscriber does not create a new replication slot, hence
we can't run pg_dump with a snapshot which starts from the slot's origin (maybe this is
not an issue at all). In this case it makes more sense for the tablesync worker to do the
schema sync.

 

 

If the community is happy with the above design, I can start working on a prototype.

 

Credits :- This design is inspired by pglogical. Also thanks to Zane, Masahiko, and Amit for reviewing earlier designs.

 

Regards

Sachin Kumar

Amazon Web Services

Re: Initial Schema Sync for Logical Replication

From
Peter Smith
Date:
Hi,

I have a couple of questions.

Q1.

What happens if the subscriber already has some tables present? For
example, I did not see the post saying anything like "Only if the
table does not already exist then it will be created".

On the contrary, the post seemed to say SUBREL_STATE_CREATE 'c' would
*always* be set when this subscriber mode is enabled. And then it
seemed to say the table would *always* get re-created by the tablesync
in this new mode.

Won't this cause problems
- if the user wanted a slightly different subscriber-side table? (eg
some extra columns on the subscriber-side table)
- if there was some pre-existing table data on the subscriber-side
table that you now are about to re-create and clobber?

Or does the idea intend that the CREATE TABLE DDL that will be
executed is like "CREATE TABLE ... IF NOT EXISTS"?

~~~

Q2.

The post says "DDL replication is required for this feature", and "It
should also have turned on publication of DDL commands."

It wasn't entirely clear to me why those must be a requirement. Is
that just to make implementation easier?

Sure, I see that the idea might have some (or maybe a lot?) of common
internal code with the table DDL replication work, but OTOH an
auto-create feature for subscriber tables seems like it might be a
useful feature to have regardless of the value of the publication
'ddl' parameter.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
Hi Peter,

> Hi,
> 
> I have a couple of questions.
> 
> Q1.
> 
> What happens if the subscriber already has some tables present? For
> example, I did not see the post saying anything like "Only if the table does
> not already exist then it will be created".
> 
My assumption was that if the subscriber is doing initial schema sync, it does not have
any conflicting database objects.
> On the contrary, the post seemed to say SUBREL_STATE_CREATE 'c' would
> *always* be set when this subscriber mode is enabled. And then it seemed
> to say the table would *always* get re-created by the tablesync in this new
> mode.
Right
> 
> Won't this cause problems
> - if the user wanted a slightly different subscriber-side table? (eg some extra
> columns on the subscriber-side table)
> - if there was some pre-existing table data on the subscriber-side table that
> you now are about to re-create and clobber?
> 
> Or does the idea intend that the CREATE TABLE DDL that will be executed is
> like "CREATE TABLE ... IF NOT EXISTS"?
> 
pg_dump does not support --if-not-exists, but I think it can be added so that we get a
dump with IF NOT EXISTS.
On the subscriber side we get the table OID list; we can use this to change table_state
to SUBREL_STATE_INIT so that the table won't be recreated.
> ~~~
> 
> Q2.
> 
> The post says. "DDL replication is required for this feature". And "It should
> also have turned on publication of DDL commands."
> 
> It wasn't entirely clear to me why those must be a requirement. Is that just to
> make implementation easier?
DDL replication is needed to handle concurrent DDL, so that we don't have to
worry about schema changes happening while the subscriber is initializing.
If we can block the publisher so that it does not execute DDLs, or if the subscriber can
simply error out when it sees conflicting table information, then we don't need DDL
replication.
Regards
Sachin
> 
> Sure, I see that the idea might have some (or maybe a lot?) of common
> internal code with the table DDL replication work, but OTOH an auto-create
> feature for subscriber tables seems like it might be a useful feature to have
> regardless of the value of the publication 'ddl' parameter.
> 
> ------
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.

Re: Initial Schema Sync for Logical Replication

From
Alvaro Herrera
Date:
On 2023-Mar-15, Kumar, Sachin wrote:

> 1. In  CreateSubscription()  when we create replication slot(walrcv_create_slot()), should
> use CRS_EXPORT_SNAPSHOT, So that we can use this snapshot later in the pg_dump.
> 
> 2.  Now we can call pg_dump with above snapshot from CreateSubscription.

Overall I'm not on board with the idea that logical replication would
depend on pg_dump; that seems like it could run into all sorts of
trouble (what if calling external binaries requires additional security
setup?  what about pg_hba connection requirements? what about
max_connections in tight circumstances?).

It would be much better, I think, to handle this internally in the
publisher instead: similar to how DDL sync would work, except it'd
somehow generate the CREATE statements from the existing tables instead
of waiting for DDL events to occur.  I grant that this does require
writing a bunch of new code for each object type, a lot of which would
duplicate the pg_dump logic, but it would probably be a lot more robust.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Thu, Mar 16, 2023 at 10:27 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > Hi,
> >
> > I have a couple of questions.
> >
> > Q1.
> >
> > What happens if the subscriber already has some tables present? For
> > example, I did not see the post saying anything like "Only if the table does
> > not already exist then it will be created".
> >
> My assumption was the if subscriber is doing initial schema sync , It does not have
> any conflicting database objects.
>

Can't we simply error out in such a case with "obj already exists"?
This would be similar to how we deal with conflicting rows with
unique/primary keys.
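
That is also what a plain restore would naturally produce today, e.g. (table name made up):

-- On the subscriber, where public.t1 already exists:
CREATE TABLE public.t1 (c1 integer PRIMARY KEY, c2 text);
-- ERROR:  relation "t1" already exists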

--
With Regards,
Amit Kapila.



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
Hi Amit,

> From: Amit Kapila <amit.kapila16@gmail.com>
> > > Hi,
> > >
> > > I have a couple of questions.
> > >
> > > Q1.
> > >
> > > What happens if the subscriber already has some tables present? For
> > > example, I did not see the post saying anything like "Only if the
> > > table does not already exist then it will be created".
> > >
> > My assumption was the if subscriber is doing initial schema sync , It
> > does not have any conflicting database objects.
> >
> 
> Can't we simply error out in such a case with "obj already exists"?
> This would be similar to how we deal with conflicting rows with unique/primary
> keys.
Right, this is the default behaviour. We will run pg_restore with --single-transaction,
so if we get an error while executing a CREATE TABLE, the whole pg_restore will fail and
the user will be notified.
Regards
Sachin

RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
Hi Alvaro,

> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> On 2023-Mar-15, Kumar, Sachin wrote:
> 
> > 1. In  CreateSubscription()  when we create replication
> > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> can use this snapshot later in the pg_dump.
> >
> > 2.  Now we can call pg_dump with above snapshot from CreateSubscription.
> 
> Overall I'm not on board with the idea that logical replication would depend on
> pg_dump; that seems like it could run into all sorts of trouble (what if calling
> external binaries requires additional security setup?  what about pg_hba
> connection requirements? what about max_connections in tight
> circumstances?).
> what if calling external binaries requires additional security setup
I am not sure what kind of security restriction would apply in this case; maybe the
pg_dump binary could be changed?
> what about pg_hba connection requirements?
We will use the same connection string that the subscriber process uses to connect to
the publisher.
> what about max_connections in tight circumstances?
Right, that might be an issue, but I don't think it will be a big one. We will create the
dump of the database in the CreateSubscription() function itself, so if we have already
reached max_connections while calling pg_dump, tablesync wouldn't be successful anyway.
> It would be much better, I think, to handle this internally in the publisher instead:
> similar to how DDL sync would work, except it'd somehow generate the CREATE
> statements from the existing tables instead of waiting for DDL events to occur.  I
> grant that this does require writing a bunch of new code for each object type, a
> lot of which would duplicate the pg_dump logic, but it would probably be a lot
> more robust.
Agreed, but we might have lots of code duplication; essentially almost all of the pg_dump
code would need to be duplicated, which might cause issues when modifying/adding new
DDLs.
I am not sure, but maybe it's possible to move the code pg_dump depends on into the
common/ folder to avoid duplication.

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
"Euler Taveira"
Date:
On Mon, Mar 20, 2023, at 10:10 PM, Kumar, Sachin wrote:
> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> On 2023-Mar-15, Kumar, Sachin wrote:

> > 1. In  CreateSubscription()  when we create replication
> > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> can use this snapshot later in the pg_dump.
> >
> > 2.  Now we can call pg_dump with above snapshot from CreateSubscription.

> Overall I'm not on board with the idea that logical replication would depend on
> pg_dump; that seems like it could run into all sorts of trouble (what if calling
> external binaries requires additional security setup?  what about pg_hba
> connection requirements? what about max_connections in tight
> circumstances?).
> what if calling external binaries requires additional security setup
> I am not sure what kind of security restriction would apply in this case, maybe pg_dump
> binary can be changed ?
Using pg_dump as part of this implementation is not acceptable because we
expect the backend to be decoupled from the client. Besides that, pg_dump
provides all table dependencies (such as tablespaces, privileges, security
labels, comments); not all of those dependencies should be replicated. You would
have to exclude them by removing these objects from the TOC before running pg_restore,
or by adding a few pg_dump options to exclude these objects. Another issue is related
to different versions. Let's say the publisher has a version ahead of the
subscriber version; new table syntax can easily break your logical
replication setup. IMO pg_dump doesn't seem like a good solution for initial
synchronization.

Instead, the backend should provide infrastructure to obtain the required DDL
commands for the specific (set of) tables. This can work around the issues from
the previous paragraph:

* you can selectively choose dependencies;
* don't require additional client packages;
* don't need to worry about different versions.

This infrastructure can also be useful for other use cases such as:

* client tools that provide create commands (such as psql, pgAdmin);
* other logical replication solutions;
* other logical backup solutions.


--
Euler Taveira

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Mon, Mar 20, 2023, at 10:10 PM, Kumar, Sachin wrote:
>
> > From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> > On 2023-Mar-15, Kumar, Sachin wrote:
> >
> > > 1. In  CreateSubscription()  when we create replication
> > > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> > can use this snapshot later in the pg_dump.
> > >
> > > 2.  Now we can call pg_dump with above snapshot from CreateSubscription.
> >
> > Overall I'm not on board with the idea that logical replication would depend on
> > pg_dump; that seems like it could run into all sorts of trouble (what if calling
> > external binaries requires additional security setup?  what about pg_hba
> > connection requirements? what about max_connections in tight
> > circumstances?).
> > what if calling external binaries requires additional security setup
> I am not sure what kind of security restriction would apply in this case, maybe pg_dump
> binary can be changed ?
>
> Using pg_dump as part of this implementation is not acceptable because we
> expect the backend to be decoupled from the client. Besides that, pg_dump
> provides all table dependencies (such as tablespaces, privileges, security
> labels, comments); not all dependencies shouldn't be replicated.
>

I agree that in the initial version we may not support sync of all
objects but why that shouldn't be possible in the later versions?

> You should
> exclude them removing these objects from the TOC before running pg_restore or
> adding a few pg_dump options to exclude these objects. Another issue is related
> to different version. Let's say the publisher has a version ahead of the
> subscriber version, a new table syntax can easily break your logical
> replication setup. IMO pg_dump doesn't seem like a good solution for initial
> synchronization.
>
> Instead, the backend should provide infrastructure to obtain the required DDL
> commands for the specific (set of) tables. This can work around the issues from
> the previous paragraph:
>
...
> * don't need to worry about different versions.
>

AFAICU some of the reasons why pg_dump is not allowed to dump from the
newer version are as follows: (a) there could be more columns in the
newer version of the system catalog and then Select * type of stuff
won't work because the client won't have knowledge of additional
columns. (b) the newer version could have new features (represented by
say new columns in existing catalogs or new catalogs) that the older
version of pg_dump has no knowledge of and will fail to get that data
and hence an inconsistent dump. The subscriber will easily be not in
sync due to that.

Now, how do we avoid these problems even if we have our own version of
functionality similar to pg_dump for selected objects? I guess we will
face similar problems. If so, we may need to deny schema sync in any
such case.

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Mon, Mar 20, 2023, at 10:10 PM, Kumar, Sachin wrote:
> >
> > > From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > > Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> > > On 2023-Mar-15, Kumar, Sachin wrote:
> > >
> > > > 1. In  CreateSubscription()  when we create replication
> > > > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> > > can use this snapshot later in the pg_dump.
> > > >
> > > > 2.  Now we can call pg_dump with above snapshot from CreateSubscription.
> > >
> > > Overall I'm not on board with the idea that logical replication would depend on
> > > pg_dump; that seems like it could run into all sorts of trouble (what if calling
> > > external binaries requires additional security setup?  what about pg_hba
> > > connection requirements? what about max_connections in tight
> > > circumstances?).
> > > what if calling external binaries requires additional security setup
> > I am not sure what kind of security restriction would apply in this case, maybe pg_dump
> > binary can be changed ?
> >
> > Using pg_dump as part of this implementation is not acceptable because we
> > expect the backend to be decoupled from the client. Besides that, pg_dump
> > provides all table dependencies (such as tablespaces, privileges, security
> > labels, comments); not all dependencies shouldn't be replicated.
> >
>
> I agree that in the initial version we may not support sync of all
> objects but why that shouldn't be possible in the later versions?
>
> > You should
> > exclude them removing these objects from the TOC before running pg_restore or
> > adding a few pg_dump options to exclude these objects. Another issue is related
> > to different version. Let's say the publisher has a version ahead of the
> > subscriber version, a new table syntax can easily break your logical
> > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > synchronization.
> >
> > Instead, the backend should provide infrastructure to obtain the required DDL
> > commands for the specific (set of) tables. This can work around the issues from
> > the previous paragraph:
> >
> ...
> > * don't need to worry about different versions.
> >
>
> AFAICU some of the reasons why pg_dump is not allowed to dump from the
> newer version are as follows: (a) there could be more columns in the
> newer version of the system catalog and then Select * type of stuff
> won't work because the client won't have knowledge of additional
> columns. (b) the newer version could have new features (represented by
> say new columns in existing catalogs or new catalogs) that the older
> version of pg_dump has no knowledge of and will fail to get that data
> and hence an inconsistent dump. The subscriber will easily be not in
> sync due to that.
>
> Now, how do we avoid these problems even if we have our own version of
> functionality similar to pg_dump for selected objects? I guess we will
> face similar problems.

Right. I think that such functionality needs to return DDL commands
that can be executed on the requested version.

> If so, we may need to deny schema sync in any such case.

Yes. Do we have any concrete use case where the subscriber is an older
version, in the first place?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > > You should
> > > exclude them removing these objects from the TOC before running pg_restore or
> > > adding a few pg_dump options to exclude these objects. Another issue is related
> > > to different version. Let's say the publisher has a version ahead of the
> > > subscriber version, a new table syntax can easily break your logical
> > > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > > synchronization.
> > >
> > > Instead, the backend should provide infrastructure to obtain the required DDL
> > > commands for the specific (set of) tables. This can work around the issues from
> > > the previous paragraph:
> > >
> > ...
> > > * don't need to worry about different versions.
> > >
> >
> > AFAICU some of the reasons why pg_dump is not allowed to dump from the
> > newer version are as follows: (a) there could be more columns in the
> > newer version of the system catalog and then Select * type of stuff
> > won't work because the client won't have knowledge of additional
> > columns. (b) the newer version could have new features (represented by
> > say new columns in existing catalogs or new catalogs) that the older
> > version of pg_dump has no knowledge of and will fail to get that data
> > and hence an inconsistent dump. The subscriber will easily be not in
> > sync due to that.
> >
> > Now, how do we avoid these problems even if we have our own version of
> > functionality similar to pg_dump for selected objects? I guess we will
> > face similar problems.
>
> Right. I think that such functionality needs to return DDL commands
> that can be executed on the requested version.
>
> > If so, we may need to deny schema sync in any such case.
>
> Yes. Do we have any concrete use case where the subscriber is an older
> version, in the first place?
>

As per my understanding, it is mostly due to the reason that it can
work today. Today, during an off-list discussion with Jonathan on this
point, he pointed me to a similar incompatibility in MySQL
replication. See the "SQL incompatibilities" section in doc[1]. Also,
please note that this applies not only to initial sync but also to
schema sync during replication. I don't think it would be feasible to
keep such cross-version compatibility for DDL replication.

Having said above, I don't intend that we must use pg_dump from the
subscriber for the purpose of initial sync. I think the idea at this
stage is to primarily write a POC patch to see what difficulties we
may face. The other options that we could try out are (a) try to
duplicate parts of pg_dump code in some way (by extracting required
code) for the subscription's initial sync, or (b) have a common code
(probably as a library or some other way) for the required
functionality. There could be more possibilities that we may not have
thought of yet. But the main point is that for approaches other than
using pg_dump, we should consider ways to avoid duplicity of various
parts of its code. Due to this, I think before ruling out using
pg_dump, we should be clear about its risks and limitations.

Thoughts?

[1] - https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
[2] - https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com

--
With Regards,
Amit Kapila.



RE: Initial Schema Sync for Logical Replication

From
"houzj.fnst@fujitsu.com"
Date:
On Wednesday, March 22, 2023 1:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these
> > > > objects. Another issue is related to different version. Let's say
> > > > the publisher has a version ahead of the subscriber version, a new
> > > > table syntax can easily break your logical replication setup. IMO
> > > > pg_dump doesn't seem like a good solution for initial synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the
> > > > required DDL commands for the specific (set of) tables. This can
> > > > work around the issues from the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from
> > > the newer version are as follows: (a) there could be more columns in
> > > the newer version of the system catalog and then Select * type of
> > > stuff won't work because the client won't have knowledge of
> > > additional columns. (b) the newer version could have new features
> > > (represented by say new columns in existing catalogs or new
> > > catalogs) that the older version of pg_dump has no knowledge of and
> > > will fail to get that data and hence an inconsistent dump. The
> > > subscriber will easily be not in sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version
> > > of functionality similar to pg_dump for selected objects? I guess we
> > > will face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
> 
> As per my understanding, it is mostly due to the reason that it can work today.
> Today, during an off-list discussion with Jonathan on this point, he pointed me
> to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc[1]. Also, please note that this applies not only
> to initial sync but also to schema sync during replication. I don't think it would
> be feasible to keep such cross-version compatibility for DDL replication.
> 
> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is to
> primarily write a POC patch to see what difficulties we may face. The other
> options that we could try out are (a) try to duplicate parts of pg_dump code in
> some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code (probably
> as a library or some other way) for the required functionality. There could be
> more possibilities that we may not have thought of yet. But the main point is
> that for approaches other than using pg_dump, we should consider ways to
> avoid duplicity of various parts of its code. Due to this, I think before ruling out
> using pg_dump, we should be clear about its risks and limitations.

I thought about some possible problems about the design of using pg_dump.

1) According to the design, it will internally call pg_dump when creating the
subscription, but that requires using a powerful user when calling pg_dump.
Currently, this may not be a problem because create subscription also requires
superuser. But people have recently discussed allowing non-superusers to
create subscriptions[1]; if that is accepted, then it seems not great to
internally use a superuser to call pg_dump while the user creating the
subscription is a non-superuser.

2) I think it's possible that some cloud DB services don't allow users to use
the client commands (pg_dump, ...) directly, and the user that logs in to the
database may not have permission to execute the client commands.

[1] https://www.postgresql.org/message-id/flat/20230308194743.23rmgjgwahh4i4rg%40awork3.anarazel.de

Best Regards,
Hou zj


RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Amit Kapila <amit.kapila16@gmail.com>
> Sent: Wednesday, March 22, 2023 5:16 AM
> To: Masahiko Sawada <sawada.mshk@gmail.com>
> Cc: Euler Taveira <euler@eulerto.com>; Kumar, Sachin
> <ssetiya@amazon.com>; Alvaro Herrera <alvherre@alvh.no-ip.org>; pgsql-
> hackers@lists.postgresql.org; Jonathan S. Katz <jkatz@postgresql.org>
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> 
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com>
> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these
> > > > objects. Another issue is related to different version. Let's say
> > > > the publisher has a version ahead of the subscriber version, a new
> > > > table syntax can easily break your logical replication setup. IMO
> > > > pg_dump doesn't seem like a good solution for initial synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the
> > > > required DDL commands for the specific (set of) tables. This can
> > > > work around the issues from the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from
> > > the newer version are as follows: (a) there could be more columns in
> > > the newer version of the system catalog and then Select * type of
> > > stuff won't work because the client won't have knowledge of
> > > additional columns. (b) the newer version could have new features
> > > (represented by say new columns in existing catalogs or new
> > > catalogs) that the older version of pg_dump has no knowledge of and
> > > will fail to get that data and hence an inconsistent dump. The
> > > subscriber will easily be not in sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version
> > > of functionality similar to pg_dump for selected objects? I guess we
> > > will face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
> 
> As per my understanding, it is mostly due to the reason that it can work
> today. Today, during an off-list discussion with Jonathan on this point, he
> pointed me to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc[1]. Also, please note that this applies not
> only to initial sync but also to schema sync during replication. I don't think it
> would be feasible to keep such cross-version compatibility for DDL
> replication.
> 
> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is to
> primarily write a POC patch to see what difficulties we may face. The other
> options that we could try out are (a) try to duplicate parts of pg_dump code
> in some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code (probably
> as a library or some other way) for the required functionality. There could be
> more possibilities that we may not have thought of yet. But the main point is
> that for approaches other than using pg_dump, we should consider ways to
> avoid duplicity of various parts of its code. Due to this, I think before ruling
> out using pg_dump, we should be clear about its risks and limitations.
> 
> Thoughts?
There is one more thing which needs to be considered: even if we use pg_dump/pg_restore,
we still need a way to get the CREATE TABLE statement for a table if we want to support
concurrent DDLs on the publisher.
>8. TableSync process should check the state of table , if it is SUBREL_STATE_CREATE it should
>get the latest definition from the publisher and recreate the table. (We have to recreate
>the table even if there are no changes). Then it should go into copy table mode as usual.
Unless there is a different way to support concurrent DDLs, or we go for blocking the
publisher till the initial sync is completed.
Regards
Sachin
> 
> [1] - https://dev.mysql.com/doc/refman/8.0/en/replication-
> compatibility.html
> [2] - https://www.postgresql.org/message-
> id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-
> uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com
> 
> --
> With Regards,
> Amit Kapila.

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Wed, Mar 22, 2023 at 2:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running pg_restore or
> > > > adding a few pg_dump options to exclude these objects. Another issue is related
> > > > to different version. Let's say the publisher has a version ahead of the
> > > > subscriber version, a new table syntax can easily break your logical
> > > > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > > > synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the required DDL
> > > > commands for the specific (set of) tables. This can work around the issues from
> > > > the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from the
> > > newer version are as follows: (a) there could be more columns in the
> > > newer version of the system catalog and then Select * type of stuff
> > > won't work because the client won't have knowledge of additional
> > > columns. (b) the newer version could have new features (represented by
> > > say new columns in existing catalogs or new catalogs) that the older
> > > version of pg_dump has no knowledge of and will fail to get that data
> > > and hence an inconsistent dump. The subscriber will easily be not in
> > > sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
>
> As per my understanding, it is mostly due to the reason that it can
> work today. Today, during an off-list discussion with Jonathan on this
> point, he pointed me to a similar incompatibility in MySQL
> replication. See the "SQL incompatibilities" section in doc[1]. Also,
> please note that this applies not only to initial sync but also to
> schema sync during replication. I don't think it would be feasible to
> keep such cross-version compatibility for DDL replication.

Makes sense to me.

> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this
> stage is to primarily write a POC patch to see what difficulties we
> may face. The other options that we could try out are (a) try to
> duplicate parts of pg_dump code in some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code
> (probably as a library or some other way) for the required
> functionality. There could be more possibilities that we may not have
> thought of yet. But the main point is that for approaches other than
> using pg_dump, we should consider ways to avoid duplicity of various
> parts of its code. Due to this, I think before ruling out using
> pg_dump, we should be clear about its risks and limitations.
>
> Thoughts?
>

Agreed. My biggest concern about approaches other than using pg_dump
is the same: the code duplication could increase the maintenance
costs. We should clarify in what respects using pg_dump is not a good
idea, and also analyze alternative ideas in depth.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Zheng Li
Date:
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
>
> As per my understanding, it is mostly due to the reason that it can
> work today. Today, during an off-list discussion with Jonathan on this
> point, he pointed me to a similar incompatibility in MySQL
> replication. See the "SQL incompatibilities" section in doc[1]. Also,
> please note that this applies not only to initial sync but also to
> schema sync during replication. I don't think it would be feasible to
> keep such cross-version compatibility for DDL replication.

I think it's possible to make DDL replication cross-version
compatible, by making the DDL deparser version-aware: the deparsed
JSON blob can have a PG version in it, and the destination server can
process the versioned JSON blob by transforming anything incompatible
according to the original version and its own version.

Regards,
Zane



Re: Initial Schema Sync for Logical Replication

From
"Euler Taveira"
Date:
On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
> Now, how do we avoid these problems even if we have our own version of
> functionality similar to pg_dump for selected objects? I guess we will
> face similar problems. If so, we may need to deny schema sync in any
> such case.
There are 2 approaches for initial DDL synchronization:

1) generate the DDL command on the publisher, stream it and apply it as-is on
the subscriber;
2) generate a DDL representation (JSON, for example) on the publisher, stream
it, transform it into a DDL command on subscriber and apply it.

Option (1) is simpler and faster than option (2) because it does not
require an additional step (transformation). However, option (2) is more
flexible than option (1) because it allows you to create a DDL command even if a
feature was removed from the subscriber and the publisher version is less than
the subscriber version, or a feature was added to the publisher and the
publisher version is greater than the subscriber version. Of course there are
exceptions and it should forbid the transformation (in this case, it can be
controlled by the protocol version -- LOGICALREP_PROTO_FOOBAR_VERSION_NUM). A
decision must be made: simple/restrict vs complex/flexible.

One of the main use cases for logical replication is migration (X -> Y where X
< Y). Postgres generally does not remove features but it might happen (such as
WITH OIDS syntax) and it would break the DDL replication (option 1). In the
downgrade case (X -> Y where X > Y), it might break the DDL replication if a
new syntax is introduced in X. Having said that, IMO option (1) is fragile if
we want to support DDL replication between different Postgres versions. It
might eventually work but there is no guarantee.

Per discussion [1], I think if we agree that Alvaro's DDL deparse patch is
the way to go with DDL replication, it seems wise that it should be used for
initial DDL synchronization as well.





--
Euler Taveira

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Wed, Mar 15, 2023 at 11:12 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> Concurrent DDL :-
>
> User can execute a DDL command to table t1 at the same time when subscriber is trying to sync
> it. pictorial representation https://imgur.com/a/ivrIEv8 [1]
>
> In tablesync process, it makes a connection to the publisher and it sees the
> table state which can be in future wrt to the publisher, which can introduce conflicts.
> For example:-
>
> CASE 1:- { Publisher removed the column b from the table t1 when subscriber was doing pg_restore
> (or any point in concurrent DDL window described in picture [1] ), when tableSync
> process will start transaction on the publisher it will see request data of table t1
> including column b, which does not exist on the publisher.} So that is why tableSync process
> asks for the latest definition.
>
> If we say that we will delay tableSync worker till all the DDL related to table t1 is
> applied by the applier process , we can still have a window when publisher issues a DDL
> command just before tableSync starts its transaction, and therefore making tableSync and
> publisher table definition incompatible (Thanks to Masahiko for pointing out this race
> condition).
>

IIUC, this is possible only if the tablesync process uses a snapshot
different from the one we used to perform the initial schema sync;
otherwise, this shouldn't be a problem. Let me try to explain my
understanding with an example (the LSNs used are just to explain the
problem):

1. Create Table t1(c1, c2); --LSN: 90
2. Insert t1 (1, 1); --LSN 100
3. Insert t1 (2, 2); --LSN 110
4. Alter t1 Add Column c3; --LSN 120
5. Insert t1 (3, 3, 3); --LSN 130

Now, say before starting tablesync worker, apply process performs
initial schema sync and uses a snapshot corresponding to LSN 100. Then
it starts tablesync process to allow the initial copy of data in t1.
Here, if the table sync process tries to establish a new snapshot, it
may get data till LSN 130 and when it will try to copy the same in
subscriber it will fail. Is my understanding correct about the problem
you described? If so, can't we allow tablesync process to use the same
exported snapshot as we used for the initial schema sync and won't
that solve the problem you described?
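
For context, snapshot sharing across sessions already exists at the SQL level, so a
sketch of what reuse could look like (the snapshot id shown is illustrative):

-- Session 1 (e.g. the process doing the schema sync): export its snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();    -- returns e.g. '00000003-0000001B-1'

-- Session 2 (e.g. a tablesync worker): adopt the same snapshot; this only
-- works while session 1's transaction is still open.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';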

>
> Applier process will skip all DDL/DMLs related to the table t1 and tableSync will apply those
> in Catchup phase.
> Although there is one issue what will happen to views/ or functions which depend on the table
> . I think they should wait till table_state is > SUBREL_STATE_CREATE (means we have the latest
> schema definition from the publisher).
> There might be corner cases to this approach or maybe a better way to handle concurrent DDL
> One simple solution might be to disallow DDLs on the publisher till all the schema is
> synced and all tables have state >= SUBREL_STATE_DATASYNC (We can have CASE 1: issue ,
> even with DDL replication, so we have to wait till all the tables have table_state
> > SUBREL_STATE_DATASYNC). Which might be a big window for big databases.
>
> Refresh publication :-
>
> In refresh publication, subscriber does create a new replication slot hence , we can’t run
> pg_dump with a snapshot which starts from origin(maybe this is not an issue at all). In this case
> it makes more sense for tableSync worker to do schema sync.
>

Can you please explain this problem with some examples?

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Thu, Mar 23, 2023 at 2:48 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
>
> Now, how do we avoid these problems even if we have our own version of
> functionality similar to pg_dump for selected objects? I guess we will
> face similar problems. If so, we may need to deny schema sync in any
> such case.
>
> There are 2 approaches for initial DDL synchronization:
>
> 1) generate the DDL command on the publisher, stream it and apply it as-is on
> the subscriber;
> 2) generate a DDL representation (JSON, for example) on the publisher, stream
> it, transform it into a DDL command on subscriber and apply it.
>
> The option (1) is simpler and faster than option (2) because it does not
> require an additional step (transformation). However, option (2) is more
> flexible than option (1) because it allow you to create a DDL command even if a
> feature was removed from the subscriber and the publisher version is less than
> the subscriber version or a feature was added to the publisher and the
> publisher version is greater than the subscriber version.
>

Is this practically possible? Say the publisher has a higher version
that has introduced a new object type corresponding to which it has
either a new catalog or some new columns in the existing catalog. Now,
I don't think the older version of the subscriber can modify the
command received from the publisher so that the same can be applied to
the subscriber because it won't have any knowledge of the new feature.
In the other case where the subscriber is of a newer version, we
anyway should be able to support it with pg_dump as there doesn't
appear to be any restriction with that; am I missing something?

> One of the main use cases for logical replication is migration (X -> Y where X
> < Y).
>

I don't think we need to restrict this case even if we decide to use pg_dump.

>
> Per discussion [1], I think if we agree that the Alvaro's DDL deparse patch is
> the way to go with DDL replication, it seems wise that it should be used for
> initial DDL synchronization as well.
>

Even if we decide to use deparse approach, it would still need to
mimic stuff from pg_dump to construct commands based on only catalog
contents. I am not against using this approach but we shouldn't ignore
the duplicity required in this approach.

--
With Regards,
Amit Kapila.



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Amit Kapila <amit.kapila16@gmail.com>
> IIUC, this is possible only if tablesync process uses a snapshot different than the
> snapshot we have used to perform the initial schema sync, otherwise, this
> shouldn't be a problem. Let me try to explain my understanding with an example
> (the LSNs used are just explain the
> problem):
> 
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
> 
> Now, say before starting tablesync worker, apply process performs initial
> schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> tablesync process to allow the initial copy of data in t1.
> Here, if the table sync process tries to establish a new snapshot, it may get data
> till LSN 130 and when it will try to copy the same in subscriber it will fail. Is my
> understanding correct about the problem you described?
Right
> If so, can't we allow
> tablesync process to use the same exported snapshot as we used for the initial
> schema sync and won't that solve the problem you described?
I think we won't be able to use the same snapshot because the transaction will have been committed.
In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot()
until walrcv_disconnect() is called (I am not sure about this part; maybe walrcv_disconnect()
performs the commit internally?).
So somehow we need to keep this snapshot alive even after the transaction is committed (or delay
committing the transaction; but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so there can
be a restart before tablesync is able to use the same snapshot).
> > Refresh publication :-
> >
> > In refresh publication, subscriber does create a new replication slot
Typo-> subscriber does not
> > hence , we can’t run
> >
> > pg_dump with a snapshot which starts from origin(maybe this is not an
> > issue at all). In this case
> >
> > it makes more sense for tableSync worker to do schema sync.
> >
> 
> Can you please explain this problem with some examples?
I think we can have the same issues as you mentioned:
a new table t1 is added to the publication, and the user does a refresh publication.
pg_dump / pg_restore restores the table definition, but before tablesync
can start, steps 2 to 5 happen on the publisher.
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
And tablesync errors out.
There can be one more issue, since we took the pg_dump without a snapshot (wrt the replication slot)
(I am not 100 percent sure about this).
Let's imagine the apply process is lagging behind the publisher.
Events on the publisher:
1. alter t1 drop column c; LSN 100 <-- apply process tries to execute this DDL
2. alter t1 drop column d; LSN 110
3. insert into t1 values(..); LSN 120 <-- (refresh publication called) pg_dump/restore restores this version
The apply process executing 1 will fail because t1 does not have column c.
Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
"Euler Taveira"
Date:
On Thu, Mar 23, 2023, at 8:44 AM, Amit Kapila wrote:
> On Thu, Mar 23, 2023 at 2:48 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
> >
> > Now, how do we avoid these problems even if we have our own version of
> > functionality similar to pg_dump for selected objects? I guess we will
> > face similar problems. If so, we may need to deny schema sync in any
> > such case.
> >
> > There are 2 approaches for initial DDL synchronization:
> >
> > 1) generate the DDL command on the publisher, stream it and apply it as-is on
> > the subscriber;
> > 2) generate a DDL representation (JSON, for example) on the publisher, stream
> > it, transform it into a DDL command on subscriber and apply it.
> >
> > The option (1) is simpler and faster than option (2) because it does not
> > require an additional step (transformation). However, option (2) is more
> > flexible than option (1) because it allow you to create a DDL command even if a
> > feature was removed from the subscriber and the publisher version is less than
> > the subscriber version or a feature was added to the publisher and the
> > publisher version is greater than the subscriber version.
> >
>
> Is this practically possible? Say the publisher has a higher version
> that has introduced a new object type corresponding to which it has
> either a new catalog or some new columns in the existing catalog. Now,
> I don't think the older version of the subscriber can modify the
> command received from the publisher so that the same can be applied to
> the subscriber because it won't have any knowledge of the new feature.
> In the other case where the subscriber is of a newer version, we
> anyway should be able to support it with pg_dump as there doesn't
> appear to be any restriction with that; am I missing something?
I think so (with some limitations). Since the publisher knows the subscriber
version, the publisher knows whether the subscriber contains the new object
type, so the publisher can decide if this case is critical (and reject the
replication) or optional (and silently not include feature X, because it
is not essential for logical replication). If required, the transformation
should be done on the publisher.

Even if we decide to use the deparse approach, it would still need to
mimic stuff from pg_dump to construct commands based only on catalog
contents. I am not against using this approach but we shouldn't ignore
the duplication this approach requires.
It is fine to share code between pg_dump and this new infrastructure. However,
the old code should coexist to support older versions because the new set of
functions doesn't exist in older server versions. Hence, the duplication would
exist for a long time (if you consider that the current policy is to allow dumps from
9.2, we are talking about 10 years or so). There are some threads [1][2] that
discussed this topic: provide a SQL command based on the catalog
representation. You can probably find other discussions searching for "pg_dump
library" or "getddl".




--
Euler Taveira

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Thu, Mar 23, 2023 at 9:24 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > IIUC, this is possible only if tablesync process uses a snapshot different than the
> > snapshot we have used to perform the initial schema sync, otherwise, this
> > shouldn't be a problem. Let me try to explain my understanding with an example
> > (the LSNs used are just explain the
> > problem):
> >
> > 1. Create Table t1(c1, c2); --LSN: 90
> > 2. Insert t1 (1, 1); --LSN 100
> > 3. Insert t1 (2, 2); --LSN 110
> > 4. Alter t1 Add Column c3; --LSN 120
> > 5. Insert t1 (3, 3, 3); --LSN 130
> >
> > Now, say before starting tablesync worker, apply process performs initial
> > schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> > tablesync process to allow the initial copy of data in t1.
> > Here, if the table sync process tries to establish a new snapshot, it may get data
> > till LSN 130 and when it will try to copy the same in subscriber it will fail. Is my
> > understanding correct about the problem you described?
> Right
> > If so, can't we allow
> > tablesync process to use the same exported snapshot as we used for the initial
> > schema sync and won't that solve the problem you described?
> I think we won't be able to use same snapshot because the transaction will be committed.
> In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot()
> till walrcv_disconnect() is called.(I am not sure about this part maybe walrcv_disconnect() calls
> the commits internally ?).
> So somehow we need to keep this snapshot alive, even after transaction is committed(or delay committing
> the transaction , but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart before
> tableSync is able to use the same snapshot.)
>

Can we think of getting the table data as well along with the schema via
pg_dump? Won't both the schema and the initial data then correspond to
the same snapshot?
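
For illustration, a dump taken without --schema-only under the exported
snapshot would cover both (snapshot and file names are placeholders):

pg_dump 'conninfo' --snapshot=xyz -Fc --file initSchemaAndData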

> > > Refresh publication :-
> > >
> > > In refresh publication, subscriber does create a new replication slot
> Typo-> subscriber does not
> > > hence , we can’t run
> > >
> > > pg_dump with a snapshot which starts from origin(maybe this is not an
> > > issue at all). In this case
> > >
> > > it makes more sense for tableSync worker to do schema sync.
> > >
> >
> > Can you please explain this problem with some examples?
> I think we can have same issues as you mentioned
> New table t1 is added to the publication , User does a refresh publication.
> pg_dump / pg_restore restores the table definition. But before tableSync
> can start,  steps from 2 to 5 happen on the publisher.
> > 1. Create Table t1(c1, c2); --LSN: 90
> > 2. Insert t1 (1, 1); --LSN 100
> > 3. Insert t1 (2, 2); --LSN 110
> > 4. Alter t1 Add Column c3; --LSN 120
> > 5. Insert t1 (3, 3, 3); --LSN 130
> And table sync errors out
> There can be one more issue , since we took the pg_dump without snapshot (wrt to replication slot).
>

To avoid both the problems mentioned for Refresh Publication, we can
do one of the following: (a) create a new slot along with a snapshot
for this operation and drop it afterward; or (b) using the existing
slot, establish a new snapshot using a technique proposed in email
[1].
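
For option (a), a rough sketch over a replication connection (the slot name is
a placeholder; snapshot_name comes back in the command's result row):

CREATE_REPLICATION_SLOT "refresh_tmp" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- run pg_dump --snapshot=<snapshot_name> for the newly added tables
DROP_REPLICATION_SLOT "refresh_tmp";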

Note - Please keep one empty line before and after your inline
responses; otherwise, it is slightly difficult to follow your
response.

[1] - https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com

--
With Regards,
Amit Kapila.



RE: Initial Schema Sync for Logical Replication

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, March 24, 2023 12:02 AM Euler Taveira <euler@eulerto.com>  wrote:
> 
> On Thu, Mar 23, 2023, at 8:44 AM, Amit Kapila wrote:
> > On Thu, Mar 23, 2023 at 2:48 AM Euler Taveira <mailto:euler@eulerto.com> wrote:
> > >
> > > On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems. If so, we may need to deny schema sync in any
> > > such case.
> > >
> > > There are 2 approaches for initial DDL synchronization:
> > >
> > > 1) generate the DDL command on the publisher, stream it and apply it as-is on
> > > the subscriber;
> > > 2) generate a DDL representation (JSON, for example) on the publisher, stream
> > > it, transform it into a DDL command on subscriber and apply it.
> > >
> > > The option (1) is simpler and faster than option (2) because it does not
> > > require an additional step (transformation). However, option (2) is more
> > > flexible than option (1) because it allow you to create a DDL command even if a
> > > feature was removed from the subscriber and the publisher version is less than
> > > the subscriber version or a feature was added to the publisher and the
> > > publisher version is greater than the subscriber version.
> > >
> > 
> > Is this practically possible? Say the publisher has a higher version
> > that has introduced a new object type corresponding to which it has
> > either a new catalog or some new columns in the existing catalog. Now,
> > I don't think the older version of the subscriber can modify the
> > command received from the publisher so that the same can be applied to
> > the subscriber because it won't have any knowledge of the new feature.
> > In the other case where the subscriber is of a newer version, we
> > anyway should be able to support it with pg_dump as there doesn't
> > appear to be any restriction with that, am, I missing something?
> I think so (with some limitations). Since the publisher knows the subscriber
> version, publisher knows that the subscriber does not contain the new object
> type then publisher can decide if this case is critical (and reject the
> replication) or optional (and silently not include the feature X -- because it
> is not essential for logical replication). If required, the transformation
> should be done on the publisher.

I am not sure if it's feasible to support the use case of replicating DDL to an
older subscriber.

First, I think the current publisher doesn't know the version number of the
client (subscriber), so we would need to check the feasibility of that. Also,
making checks based on the client's version number doesn't seem like a good idea.

Besides, I thought about the problems that would happen if we try to support
replicating from a newer PG to an older PG. The following examples assume that
we support DDL replication in the mentioned PG versions.

1) Assume we want to replicate from a newer PG to an older PG where partitioned
   tables have not been introduced. I think even if the publisher is aware of
   that, it doesn't have a good way to transform a partition-related command;
   maybe one could say we can transform that to an inheritance table, but I feel
   that introduces too much complexity.

2) Another example is generated columns: replicating from a newer PG which has
   this feature to an older PG without it. I am concerned that there is no way
   to transform this without causing inconsistent behavior.

Even if we decide to simply skip sending such unsupported commands or skip
applying them, it's likely that the subsequent DML replication will cause
data inconsistency.

So, it seems we cannot completely support this use case; there would be some
limitations. Personally, I am not sure if it's worth introducing complexity to
support it partially.

Best Regards,
Hou zj

RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Amit Kapila <amit.kapila16@gmail.com>
> > I think we won't be able to use same snapshot because the transaction will
> > be committed.
> > In CreateSubscription() we can use the transaction snapshot from
> > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure
> > about this part maybe walrcv_disconnect() calls the commits internally ?).
> > So somehow we need to keep this snapshot alive, even after transaction
> > is committed(or delay committing the transaction , but we can have
> > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > before tableSync is able to use the same snapshot.)
> >
> 
> Can we think of getting the table data as well along with schema via
> pg_dump? Won't then both schema and initial data will correspond to the
> same snapshot?

Right, that will work. Thanks!

> > I think we can have same issues as you mentioned New table t1 is added
> > to the publication , User does a refresh publication.
> > pg_dump / pg_restore restores the table definition. But before
> > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > 5. Insert t1 (3, 3, 3); --LSN 130
> > And table sync errors out
> > There can be one more issue , since we took the pg_dump without
> snapshot (wrt to replication slot).
> >
> 
> To avoid both the problems mentioned for Refresh Publication, we can do
> one of the following: (a) create a new slot along with a snapshot for this
> operation and drop it afterward; or (b) using the existing slot, establish a
> new snapshot using a technique proposed in email [1].
> 

Thanks, I think option (b) will be perfect, since we don’t have to create a new slot.

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
"Euler Taveira"
Date:
On Fri, Mar 24, 2023, at 8:57 AM, houzj.fnst@fujitsu.com wrote:
First, I think the current publisher doesn't know the version number of
client(subscriber) so we need to check the feasibility of same. Also, having
client's version number checks doesn't seem to be a good idea.

walrcv_server_version().

Besides, I thought about the problems that will happen if we try to support
replicating New PG to older PG. The following examples assume that we support the
DDL replication in the mentioned PG.

1) Assume we want to replicate from a newer PG to a older PG where partition
   table has not been introduced. I think even if the publisher is aware of
   that, it doesn't have a good way to transform the partition related command,
   maybe one could say we can transform that to inherit table, but I feel that
   introduces too much complexity.

2) Another example is generated column. To replicate the newer PG which has
   this feature to a older PG without this. I am concerned that is there a way
   to transform this without causing inconsistent behavior.

Even if we decide to simply skip sending such unsupported commands or skip
applying them, then it's likely that the following dml replication will cause
data inconsistency.

As I mentioned in a previous email [1], the publisher can contain code to
decide if it can proceed or not, in case you are doing a downgrade. I said
downgrade but it can also happen if we decide to deprecate a syntax. For
example, when WITH OIDS was deprecated, pg_dump treats it as an acceptable
removal. The transformation can be (dis)allowed by the protocol version or
another constant [2].

So, it seems we cannot completely support this use case, there would be some
limitations. Personally, I am not sure if it's worth introducing complexity to
support it partially.

Limitations are fine; they have different versions. I wouldn't like to forbid
downgrade just because I don't want to maintain compatibility with previous
versions. IMO it is important to be able to downgrade in case of any
incompatibility with an application. You might argue that this isn't possible
due to time or patch size and that there is a workaround for it but I wouldn't
want to close the door for downgrade in the future.

--
Euler Taveira

RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> I am not if it's feasible to support the use case the replicate DDL to old
> subscriber.
>

+1
 
> First, I think the current publisher doesn't know the version number of
> client(subscriber) so we need to check the feasibility of same. Also, having
> client's version number checks doesn't seem to be a good idea.
> 
> Besides, I thought about the problems that will happen if we try to support
> replicating New PG to older PG. The following examples assume that we
> support the DDL replication in the mentioned PG.
> 
> 1) Assume we want to replicate from a newer PG to a older PG where
> partition
>    table has not been introduced. I think even if the publisher is aware of
>    that, it doesn't have a good way to transform the partition related
> command,
>    maybe one could say we can transform that to inherit table, but I feel that
>    introduces too much complexity.
> 
> 2) Another example is generated column. To replicate the newer PG which
> has
>    this feature to a older PG without this. I am concerned that is there a way
>    to transform this without causing inconsistent behavior.
> 
> Even if we decide to simply skip sending such unsupported commands or
> skip applying them, then it's likely that the following dml replication will
> cause data inconsistency.
> 
> So, it seems we cannot completely support this use case, there would be
> some limitations. Personally, I am not sure if it's worth introducing
> complexity to support it partially.
> 

+1

Regards
Sachin



RE: Initial Schema Sync for Logical Replication

From
"houzj.fnst@fujitsu.com"
Date:
On Friday, March 24, 2023 11:01 PM Euler Taveira <euler@eulerto.com>  wrote:

Hi,

> On Fri, Mar 24, 2023, at 8:57 AM, mailto:houzj.fnst@fujitsu.com wrote:
> > First, I think the current publisher doesn't know the version number of
> > client(subscriber) so we need to check the feasibility of same. Also, having
> > client's version number checks doesn't seem to be a good idea.
> 
> walrcv_server_version().

I don't think this function works, as it only shows the server version (e.g.
publisher/walsender).

> > Besides, I thought about the problems that will happen if we try to support
> > replicating New PG to older PG. The following examples assume that we support the
> > DDL replication in the mentioned PG.
> > 
> > 1) Assume we want to replicate from a newer PG to a older PG where partition
> >    table has not been introduced. I think even if the publisher is aware of
> >    that, it doesn't have a good way to transform the partition related command,
> >    maybe one could say we can transform that to inherit table, but I feel that
> >    introduces too much complexity.
> > 
> > 2) Another example is generated column. To replicate the newer PG which has
> >    this feature to a older PG without this. I am concerned that is there a way
> >    to transform this without causing inconsistent behavior.
> > 
> > Even if we decide to simply skip sending such unsupported commands or skip
> > applying them, then it's likely that the following dml replication will cause
> > data inconsistency.
>
> As I mentioned in a previous email [1], the publisher can contain code to
> decide if it can proceed or not, in case you are doing a downgrade. I said
> downgrade but it can also happen if we decide to deprecate a syntax. For
> example, when WITH OIDS was deprecated, pg_dump treats it as an acceptable
> removal. The transformation can be (dis)allowed by the protocol version or
> another constant [2].

If most of the new DDL-related features can't be transformed for an old
subscriber, I don't see a point in supporting this use case.

I think cases like the removal of WITH OIDS are rare enough that we don't need
to worry about them, and they don't affect data consistency. But new DDL
features are different.

It's not only features like partitioning or generated columns; features like
nulls_not_distinct are also tricky to transform without causing inconsistent
behavior.

> > So, it seems we cannot completely support this use case, there would be some
> > limitations. Personally, I am not sure if it's worth introducing complexity to
> > support it partially.
> 
> Limitations are fine; they have different versions. I wouldn't like to forbid
> downgrade just because I don't want to maintain compatibility with previous
> versions. IMO it is important to be able to downgrade in case of any
> incompatibility with an application. You might argue that this isn't possible
> due to time or patch size and that there is a workaround for it but I wouldn't
> want to close the door for downgrade in the future.

The biggest problem is the data inconsistency that it would cause. I am not
aware of a generic solution for replicating newly introduced DDL to an old
subscriber which wouldn't cause data inconsistency. And apart from that, IMO
the complexity and maintainability of the feature also matter.

Best Regards,
Hou zj

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > > I think we won't be able to use same snapshot because the transaction will
> > > be committed.
> > > In CreateSubscription() we can use the transaction snapshot from
> > > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure
> > > about this part maybe walrcv_disconnect() calls the commits internally ?).
> > > So somehow we need to keep this snapshot alive, even after transaction
> > > is committed(or delay committing the transaction , but we can have
> > > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > > before tableSync is able to use the same snapshot.)
> > >
> >
> > Can we think of getting the table data as well along with schema via
> > pg_dump? Won't then both schema and initial data will correspond to the
> > same snapshot?
>
> Right , that will work, Thanks!

While it works, we cannot get the initial data in parallel, no?

>
> > > I think we can have same issues as you mentioned New table t1 is added
> > > to the publication , User does a refresh publication.
> > > pg_dump / pg_restore restores the table definition. But before
> > > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > And table sync errors out
> > > There can be one more issue , since we took the pg_dump without
> > snapshot (wrt to replication slot).
> > >
> >
> > To avoid both the problems mentioned for Refresh Publication, we can do
> > one of the following: (a) create a new slot along with a snapshot for this
> > operation and drop it afterward; or (b) using the existing slot, establish a
> > new snapshot using a technique proposed in email [1].
> >
>
> Thanks, I think option (b) will be perfect, since we don’t have to create a new slot.

Regarding (b), does it mean that the apply worker stops streaming,
requests to create a snapshot, and then resumes streaming?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> >
> > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > I think we won't be able to use same snapshot because the transaction will
> > > > be committed.
> > > > In CreateSubscription() we can use the transaction snapshot from
> > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure
> > > > about this part maybe walrcv_disconnect() calls the commits internally ?).
> > > > So somehow we need to keep this snapshot alive, even after transaction
> > > > is committed(or delay committing the transaction , but we can have
> > > > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > > > before tableSync is able to use the same snapshot.)
> > > >
> > >
> > > Can we think of getting the table data as well along with schema via
> > > pg_dump? Won't then both schema and initial data will correspond to the
> > > same snapshot?
> >
> > Right , that will work, Thanks!
>
> While it works, we cannot get the initial data in parallel, no?
>

Another possibility is that we dump/restore the schema of each table
along with its data. One thing we can explore is whether the parallel
option of dump can be useful here. Do you have any other ideas?

One related idea is that currently, we fetch the table list
corresponding to publications in subscription and create the entries
for those in pg_subscription_rel during Create Subscription, can we
think of postponing that work till after the initial schema sync? We
seem to be already storing publications list in pg_subscription, so it
appears possible if we somehow remember the value of copy_data. If
this is feasible then I think that may give us the flexibility to
perform the initial sync at a later point by the background worker.
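
(For reference, the entries in question are the per-table rows one can see via:

SELECT srsubid, srrelid, srsubstate, srsublsn FROM pg_subscription_rel;

so postponing that work means the apply worker, rather than CREATE
SUBSCRIPTION, would fill in these rows.)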

> >
> > > > I think we can have same issues as you mentioned New table t1 is added
> > > > to the publication , User does a refresh publication.
> > > > pg_dump / pg_restore restores the table definition. But before
> > > > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > > And table sync errors out
> > > > There can be one more issue , since we took the pg_dump without
> > > snapshot (wrt to replication slot).
> > > >
> > >
> > > To avoid both the problems mentioned for Refresh Publication, we can do
> > > one of the following: (a) create a new slot along with a snapshot for this
> > > operation and drop it afterward; or (b) using the existing slot, establish a
> > > new snapshot using a technique proposed in email [1].
> > >
> >
> > Thanks, I think option (b) will be perfect, since we don’t have to create a new slot.
>
> Regarding (b), does it mean that apply worker stops streaming,
> requests to create a snapshot, and then resumes the streaming?
>

Shouldn't this be done by the backend performing a REFRESH publication?

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> > >
> > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > I think we won't be able to use same snapshot because the transaction will
> > > > > be committed.
> > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am not sure
> > > > > about this part maybe walrcv_disconnect() calls the commits internally ?).
> > > > > So somehow we need to keep this snapshot alive, even after transaction
> > > > > is committed(or delay committing the transaction , but we can have
> > > > > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > > > > before tableSync is able to use the same snapshot.)
> > > > >
> > > >
> > > > Can we think of getting the table data as well along with schema via
> > > > pg_dump? Won't then both schema and initial data will correspond to the
> > > > same snapshot?
> > >
> > > Right , that will work, Thanks!
> >
> > While it works, we cannot get the initial data in parallel, no?
> >
>
> Another possibility is that we dump/restore the schema of each table
> along with its data. One thing we can explore is whether the parallel
> option of dump can be useful here. Do you have any other ideas?

A downside of the idea of dumping both table schema and table data
would be that we need to temporarily store data twice the size of the
table (the dump file and the table itself) during the load. One might
think that we can redirect the pg_dump output into the backend so that
it can load it via SPI, but it doesn't work since "COPY tbl FROM
stdin;" doesn't work via SPI. The --inserts option of pg_dump could
help it out but it makes restoration very slow.
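
(For reference, that slow alternative would be along the lines of:

pg_dump 'conninfo' --table=public.t1 --data-only --inserts

which emits plain INSERT commands that could be executed via SPI, at a large
speed cost; table and connection names are placeholders.)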

>
> One related idea is that currently, we fetch the table list
> corresponding to publications in subscription and create the entries
> for those in pg_subscription_rel during Create Subscription, can we
> think of postponing that work till after the initial schema sync? We
> seem to be already storing publications list in pg_subscription, so it
> appears possible if we somehow remember the value of copy_data. If
> this is feasible then I think that may give us the flexibility to
> perform the initial sync at a later point by the background worker.

It sounds possible. With this idea, we will be able to have the apply
worker restore the table schemas (and create pg_subscription_rel
entries) as the first thing. Another point we might need to consider
is that the initial schema sync (i.e. creating tables) and creating
pg_subscription_rel entries need to be done in the same transaction.
Otherwise, we could end up committing only one of the two changes. I think it
depends on how we restore the schema data.

>
> > >
> > > > > I think we can have same issues as you mentioned New table t1 is added
> > > > > to the publication , User does a refresh publication.
> > > > > pg_dump / pg_restore restores the table definition. But before
> > > > > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > > > And table sync errors out
> > > > > There can be one more issue , since we took the pg_dump without
> > > > snapshot (wrt to replication slot).
> > > > >
> > > >
> > > > To avoid both the problems mentioned for Refresh Publication, we can do
> > > > one of the following: (a) create a new slot along with a snapshot for this
> > > > operation and drop it afterward; or (b) using the existing slot, establish a
> > > > new snapshot using a technique proposed in email [1].
> > > >
> > >
> > > Thanks, I think option (b) will be perfect, since we don’t have to create a new slot.
> >
> > Regarding (b), does it mean that apply worker stops streaming,
> > requests to create a snapshot, and then resumes the streaming?
> >
>
> Shouldn't this be done by the backend performing a REFRESH publication?

Hmm, I might be missing something but the idea (b) uses the existing
slot to establish a new snapshot, right? What existing replication
slot do we use for that? I thought it was the one used by the apply
worker.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > I think we won't be able to use same snapshot because the
> > > > > transaction will be committed.
> > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am
> > > > > not sure about this part maybe walrcv_disconnect() calls the commits
> internally ?).
> > > > > So somehow we need to keep this snapshot alive, even after
> > > > > transaction is committed(or delay committing the transaction ,
> > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we
> > > > > can have a restart before tableSync is able to use the same
> > > > > snapshot.)
> > > > >
> > > >
> > > > Can we think of getting the table data as well along with schema
> > > > via pg_dump? Won't then both schema and initial data will
> > > > correspond to the same snapshot?
> > >
> > > Right , that will work, Thanks!
> >
> > While it works, we cannot get the initial data in parallel, no?
> >

I was thinking each tableSync process will call pg_dump --table. This way, if we have N
tableSync processes, we can have N pg_dump --table=table_name invocations running in parallel.
In fact, we can use --schema-only to get the schema and then let COPY take care of data
syncing. We will use the same snapshot for pg_dump as well as for the COPY of the table.
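
A rough sketch of what one such tableSync worker would run (connection strings,
the table name, and the snapshot name xyz are placeholders):

pg_dump 'pub_conninfo' --table=public.t1 --schema-only --snapshot=xyz -Fc --file t1_schema
pg_restore --dbname=sub_db t1_schema

followed by the data copy on the publisher connection under the same snapshot:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT 'xyz';
COPY public.t1 TO STDOUT;
COMMIT;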

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Tue, Mar 28, 2023 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > > > I think we can have same issues as you mentioned New table t1 is added
> > > > > > to the publication , User does a refresh publication.
> > > > > > pg_dump / pg_restore restores the table definition. But before
> > > > > > tableSync can start,  steps from 2 to 5 happen on the publisher.
> > > > > > > 1. Create Table t1(c1, c2); --LSN: 90 2. Insert t1 (1, 1); --LSN 100
> > > > > > > 3. Insert t1 (2, 2); --LSN 110 4. Alter t1 Add Column c3; --LSN 120
> > > > > > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > > > > And table sync errors out
> > > > > > There can be one more issue , since we took the pg_dump without
> > > > > snapshot (wrt to replication slot).
> > > > > >
> > > > >
> > > > > To avoid both the problems mentioned for Refresh Publication, we can do
> > > > > one of the following: (a) create a new slot along with a snapshot for this
> > > > > operation and drop it afterward; or (b) using the existing slot, establish a
> > > > > new snapshot using a technique proposed in email [1].
> > > > >
> > > >
> > > > Thanks, I think option (b) will be perfect, since we don’t have to create a new slot.
> > >
> > > Regarding (b), does it mean that apply worker stops streaming,
> > > requests to create a snapshot, and then resumes the streaming?
> > >
> >
> > Shouldn't this be done by the backend performing a REFRESH publication?
>
> Hmm, I might be missing something but the idea (b) uses the existing
> slot to establish a new snapshot, right? What existing replication
> slot do we use for that? I thought it was the one used by the apply
> worker.
>

Right, it will be the same as the one used by the apply worker. I think if we
decide to do the initial sync via the apply worker then, in this case too, we
need to let the apply worker restart and perform the initial sync as the first
thing.

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > > I think we won't be able to use same snapshot because the
> > > > > > transaction will be committed.
> > > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am
> > > > > > not sure about this part maybe walrcv_disconnect() calls the commits
> > internally ?).
> > > > > > So somehow we need to keep this snapshot alive, even after
> > > > > > transaction is committed(or delay committing the transaction ,
> > > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we
> > > > > > can have a restart before tableSync is able to use the same
> > > > > > snapshot.)
> > > > > >
> > > > >
> > > > > Can we think of getting the table data as well along with schema
> > > > > via pg_dump? Won't then both schema and initial data will
> > > > > correspond to the same snapshot?
> > > >
> > > > Right , that will work, Thanks!
> > >
> > > While it works, we cannot get the initial data in parallel, no?
> > >
>
> I was thinking each TableSync process will call pg_dump --table, This way if we have N
> tableSync process, we can have N pg_dump --table=table_name called in parallel.
> In fact we can use --schema-only to get schema and then let COPY take care of data
> syncing . We will use same snapshot for pg_dump as well as COPY table.

How can we postpone creating the pg_subscription_rel entries until the
tablesync worker starts and does the schema sync? I think that since
pg_subscription_rel entry needs the table OID, we need either to do
the schema sync before creating the entry (i.e., during CREATE
SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
apply worker needs the information of tables to sync in order to
launch the tablesync workers, but it needs to create the table schema
to get that information.

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1Ld9-5ueomE_J5CA6LfRo%3DwemdTrUp5qdBhRFwGT%2BdOUw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > 
> > One related idea is that currently, we fetch the table list
> > corresponding to publications in subscription and create the entries
> > for those in pg_subscription_rel during Create Subscription, can we
> > think of postponing that work till after the initial schema sync? We
> > seem to be already storing publications list in pg_subscription, so it
> > appears possible if we somehow remember the value of copy_data. If
> > this is feasible then I think that may give us the flexibility to
> > perform the initial sync at a later point by the background worker.

Maybe we need to add a column to pg_subscription to store the copy_data state?

> 
> It sounds possible. With this idea, we will be able to have the apply worker
> restore the table schemas (and create pg_subscription_rel
> entries) as the first thing. Another point we might need to consider is that the
> initial schema sync (i.e. creating tables) and creating pg_subscription_rel entries
> need to be done in the same transaction.
> Otherwise, we could end up committing either one change. I think it depends on
> how we restore the schema data.

I think we have to add one more column to pg_subscription to track the initial sync
state {OFF, SCHEMA_DUMPED, SCHEMA_RESTORED, COMPLETED} (COMPLETED will
show that pg_subscription_rel is filled). I don't think we will be able
to do the pg_restore and fill pg_subscription_rel in a single transaction, but we can use
initial_sync_state to resume from where we left off after an abrupt crash/shutdown.

Regards
Sachin




Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> >
> > > > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > > > I think we won't be able to use same snapshot because the
> > > > > > > transaction will be committed.
> > > > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am
> > > > > > > not sure about this part maybe walrcv_disconnect() calls the commits
> > > internally ?).
> > > > > > > So somehow we need to keep this snapshot alive, even after
> > > > > > > transaction is committed(or delay committing the transaction ,
> > > > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we
> > > > > > > can have a restart before tableSync is able to use the same
> > > > > > > snapshot.)
> > > > > > >
> > > > > >
> > > > > > Can we think of getting the table data as well along with schema
> > > > > > via pg_dump? Won't then both schema and initial data will
> > > > > > correspond to the same snapshot?
> > > > >
> > > > > Right , that will work, Thanks!
> > > >
> > > > While it works, we cannot get the initial data in parallel, no?
> > > >
> >
> > I was thinking each TableSync process will call pg_dump --table, This way if we have N
> > tableSync process, we can have N pg_dump --table=table_name called in parallel.
> > In fact we can use --schema-only to get schema and then let COPY take care of data
> > syncing . We will use same snapshot for pg_dump as well as COPY table.
>
> How can we postpone creating the pg_subscription_rel entries until the
> tablesync worker starts and does the schema sync? I think that since
> pg_subscription_rel entry needs the table OID, we need either to do
> the schema sync before creating the entry (i.e, during CREATE
> SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> apply worker needs the information of tables to sync in order to
> launch the tablesync workers, but it needs to create the table schema
> to get that information.

For the above reason, I think that step 6 of the initial proposal won't work.

If we can have the tablesync worker create an entry of
pg_subscription_rel after creating the table, it may give us the
flexibility to perform the initial sync. One idea is that we add a
relname field to pg_subscription_rel so that we can create entries
with relname instead of OID if the table is not created yet. Once the
table is created, we clear the relname field and set the OID of the
table instead. It's not an ideal solution but we might make it simpler
later.

Assuming that it's feasible, I'm considering another approach for the
initial sync in order to address the concurrent DDLs.

The basic idea is to somewhat follow how pg_dump/pg_restore dump and
restore the database data. We divide the synchronization phase
(including both schema and data) up into three phases: pre-data,
table-data, post-data. These mostly follow the --section option of
pg_dump.
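
(For reference, a sketch of how the three phases map onto the pg_dump CLI,
though as noted at the end we probably cannot use pg_dump/pg_restore directly:

pg_dump 'conninfo' --section=pre-data  -Fc --file pre.dump
pg_dump 'conninfo' --section=data      -Fc --file data.dump
pg_dump 'conninfo' --section=post-data -Fc --file post.dump
)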

1. The backend process performing CREATE SUBSCRIPTION creates the
subscription but doesn't create pg_subscription_rel entries yet.

2. Before starting the streaming, the apply worker fetches the table
list from the publisher, creates pg_subscription_rel entries for them,
and dumps+restores database objects that tables could depend on but
that don't depend on tables, such as TYPE, OPERATOR, ACCESS METHOD etc. (i.e.
pre-data).

3. The apply worker launches the tablesync workers for tables that
need to be synchronized.

There might be DDLs executed on the publisher for tables before the
tablesync worker starts. But the apply worker needs to apply DDLs for
pre-data database objects. OTOH, it can ignore DDLs for not-synced-yet
tables and other database objects such as INDEX, TRIGGER, RULE, etc
(i.e. post-data).

4. The tablesync worker creates its replication slot, dumps+restores
the table schema, updates pg_subscription_rel, and performs COPY.

These operations should be done in the same transaction.

5. After finishing COPY, the tablesync worker dumps indexes (and
perhaps constraints) of the table and creates them (which possibly
takes a long time). Then it starts to catch up, same as today. The
apply worker needs to wait for the tablesync worker to catch up.

We need to repeat these steps until we complete the initial data copy
and create indexes for all tables, IOW until every pg_subscription_rel
status becomes READY.

6. If the apply worker confirms all tables are READY, it starts
another sync worker that is responsible for the post-data database
objects such as TRIGGER, RULE, POLICY etc. (i.e. post-data).

While the sync worker is starting up or working, the apply worker
applies changes for pre-data database objects as well as READY tables.

7. Similar to the tablesync worker, this sync worker creates its
replication slot and sets the returned LSN somewhere, say
pg_subscription.

8. The sync worker dumps and restores these objects. This could take
some time since it would need to create FK constraints. Then it starts
to catch up if the apply worker is ahead. The apply worker waits for
the sync worker to catch up.

9. Once the sync worker catches up, the apply worker starts applying
changes for all database objects.

IIUC with this approach, we can resolve the concurrent DDL problem
Sachin mentioned, and indexes (and constraints) are created after the
initial data copy.

The procedures are still very complex and not fully considered yet but
I hope there are some useful things at least for discussion.

Probably we can start with supporting only tables. In this case, we
would not need the post-data phase (i.e. steps 6-9). It seems to me
that we need to have the subscription state somewhere (maybe
pg_subscription) so that the apply worker can figure out the next step.
Since we need to dump and restore different objects on different
timings, we probably cannot directly use pg_dump/pg_restore. I've not
considered how the concurrent REFRESH PUBLICATION works.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > > I was thinking each TableSync process will call pg_dump --table,
> > > This way if we have N tableSync process, we can have N pg_dump --
> table=table_name called in parallel.
> > > In fact we can use --schema-only to get schema and then let COPY
> > > take care of data syncing . We will use same snapshot for pg_dump as well
> as COPY table.
> >
> > How can we postpone creating the pg_subscription_rel entries until the
> > tablesync worker starts and does the schema sync? I think that since
> > pg_subscription_rel entry needs the table OID, we need either to do
> > the schema sync before creating the entry (i.e, during CREATE
> > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > apply worker needs the information of tables to sync in order to
> > launch the tablesync workers, but it needs to create the table schema
> > to get that information.
> 
> For the above reason, I think that step 6 of the initial proposal won't work.
> 
> If we can have the tablesync worker create an entry of pg_subscription_rel after
> creating the table, it may give us the flexibility to perform the initial sync. One
> idea is that we add a relname field to pg_subscription_rel so that we can create
> entries with relname instead of OID if the table is not created yet. Once the
> table is created, we clear the relname field and set the OID of the table instead.
> It's not an ideal solution but we might make it simpler later.
> 
> Assuming that it's feasible, I'm considering another approach for the initial sync
> in order to address the concurrent DDLs.
> 
> The basic idea is to somewhat follow how pg_dump/restore to dump/restore
> the database data. We divide the synchronization phase (including both schema
> and data) up into three phases: pre-data, table-data, post-data. These mostly
> follow the --section option of pg_dump.
> 
> 1. The backend process performing CREATE SUBSCRIPTION creates the
> subscription but doesn't create pg_subscription_rel entries yet.
> 
> 2. Before starting the streaming, the apply worker fetches the table list from the
> publisher, create pg_subscription_rel entries for them, and dumps+restores
> database objects that tables could depend on but don't depend on tables such as
> TYPE, OPERATOR, ACCESS METHOD etc (i.e.
> pre-data).

We will not have the slot's starting snapshot, so somehow we have to get a new snapshot
and skip all the WAL records between the start of the slot and the snapshot creation LSN?

> 
> 3. The apply worker launches the tablesync workers for tables that need to be
> synchronized.
> 
> There might be DDLs executed on the publisher for tables before the tablesync
> worker starts. But the apply worker needs to apply DDLs for pre-data database
> objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database
> objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data).
> 
> 4. The tablesync worker creates its replication slot, dumps+restores the table
> schema, update the pg_subscription_rel, and perform COPY.
> 
> These operations should be done in the same transaction.

pg_restore won't be rollbackable, so we need to maintain state in pg_subscription_rel.

> 
> 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps
> constraints) of the table and creates them (which possibly takes a long time).
> Then it starts to catch up, same as today. The apply worker needs to wait for the
> tablesync worker to catch up.

I don't think we can have a CATCHUP stage. We can have a DDL on the publisher which
adds a new column (and this DDL will be executed by the applier later). Then we get an INSERT,
and because we have the old definition of the table, the insert will fail.

> 
> We need to repeat these steps until we complete the initial data copy and create
> indexes for all tables, IOW until all pg_subscription_rel status becomes READY.
> 
> 6. If the apply worker confirms all tables are READY, it starts another sync
> worker who is responsible for the post-data database objects such as TRIGGER,
> RULE, POLICY etc (i.e. post-data).
> 
> While the sync worker is starting up or working, the apply worker applies
> changes for pre-data database objects as well as READY tables.
We might have some issues if we have a CREATE TABLE like
CREATE TABLE table_name AS SELECT * FROM materialized_view.
> 
> 7. Similar to the tablesync worker, this sync worker creates its replication slot
> and sets the returned LSN somewhere, say pg_subscription.
> 
> 8. The sync worker dumps and restores these objects. Which could take a time
> since it would need to create FK constraints. Then it starts to catch up if the
> apply worker is ahead. The apply worker waits for the sync worker to  catch up.
> 
> 9. Once the sync worker catches up, the apply worker starts applying changes
> for all database objects.
> 
> IIUC with this approach, we can resolve the concurrent DDL problem Sachin
> mentioned, and indexes (and constraints) are created after the initial data copy.
> 
> The procedures are still very complex and not fully considered yet but I hope
> there are some useful things at least for discussion.
> 
> Probably we can start with supporting only tables. In this case, we would not
> need the post-data phase (i.e. step 6-9). It seems to me that we need to have
> the subscription state somewhere (maybe
> pg_subscription) so that the apply worker figure out the next step.
> Since we need to dump and restore different objects on different timings, we
> probably cannot directly use pg_dump/pg_restore. I've not considered how the
> concurrent REFRESH PUBLICATION works.

I think the above prototype will work and will have the least amount of side effects, but
it might be too complex to implement and I am not sure about the corner cases.

I was thinking of other ways of doing the initial sync, which are less complex but each
with its own set of bottlenecks.

On the publisher side:- 
1) Locking the publisher:- The easiest one to implement: the applier process will take an ACCESS SHARE
lock on all the published tables. (We don't have to worry about newly created concurrent tables.)
As tableSync finishes syncing a table, it will release that table's lock, so we will release
table locks in steps (see the sketch below). Users can still perform DML on the tables, but DDL won't be allowed.
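
A minimal sketch of the per-table locking, assuming t1 is one of the published
tables. ACCESS SHARE conflicts only with ACCESS EXCLUSIVE, so DML continues
while most DDL (which needs ACCESS EXCLUSIVE) blocks; note that locks are only
released at transaction end, so releasing them in steps implies one
lock-holding transaction per table:

BEGIN;
LOCK TABLE public.t1 IN ACCESS SHARE MODE;
-- ... wait until tableSync reports t1 as synced ...
COMMIT;  -- releases the lock on t1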

On the subscriber side:-
The main issue is that the tableSync process can see a future version of the table data/schema
relative to the applier process, so we have to find a way to ensure that the tableSync and
applier processes see the same table version.

2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned, we can use pg_dump/
pg_restore [1]. Although it might have the side effect of using double the storage, we can
take a pg_dump of each table separately and delete the dump as soon as the table is synced
(see the sketch below). The tableSync process will read the dump and call pg_restore on the table.
If we crash in the middle of restoring the tables, we can start pg_dump(--clean)/restore again
with the left-out tables.
With this we can reduce space usage, but we might create too many files.
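
A sketch of that per-table cycle (connection and file names are placeholders):

pg_dump 'pub_conninfo' --table=public.t1 -Fc --file t1.dump
pg_restore --dbname=sub_db t1.dump && rm t1.dump    # delete once t1 is synced
# after a crash, redo only the left-out tables, dropping any partial restore:
pg_restore --clean --if-exists --dbname=sub_db t1.dump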

3) Using a publisher snapshot:- The applier process will do pg_dump/pg_restore as usual.
Then the applier process will start a new process P1 which will connect to the
publisher and start a transaction; it will call pg_export_snapshot() to export the
snapshot. The applier process will then take the snapshot string and pass it to the tableSync
process as an argument. tableSync will use this snapshot for its COPY of the table. tableSync
should only do the COPY and then exit, so we won't do any catchup phase in tableSync. After
all tables finish COPY, the transaction will be committed by the P1 process and it will exit.
In the case of a crash/restart we can simply start from the beginning, since nothing is committed
until every table is synced. There are 2 main issues with this approach:
1. I am not sure what side effects we might have on the publisher, since we might have to keep
the transaction open for a long time.
2. The applier process will simply wait till all tables are synced.
Since the applier process won't be able to apply any WAL records till all tables are synced,
maybe instead of creating a new process, the applier process itself can start the transaction /
export the snapshot, and the tableSync processes will use that snapshot. After all tables are
synced it can start WAL streaming.
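
A minimal SQL sketch of the snapshot handoff in (3) (the snapshot name is illustrative):

-- process P1, on the publisher; keeps this transaction open until all tables are synced
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- returns e.g. '00000003-0000001B-1'

-- each tableSync worker, on its own publisher connection
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY public.t1 TO STDOUT;
COMMIT;

-- P1 commits (and exits) only after every table has been copied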

I think approach 3 might be the best way.

[1] https://www.postgresql.org/message-id/CAA4eK1Ld9-5ueomE_J5CA6LfRo%3DwemdTrUp5qdBhRFwGT%2BdOUw%40mail.gmail.com

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Mon, Apr 3, 2023 at 3:54 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Masahiko Sawada <sawada.mshk@gmail.com>
> > > > I was thinking each TableSync process will call pg_dump --table,
> > > > This way if we have N tableSync process, we can have N pg_dump --
> > table=table_name called in parallel.
> > > > In fact we can use --schema-only to get schema and then let COPY
> > > > take care of data syncing . We will use same snapshot for pg_dump as well
> > as COPY table.
> > >
> > > How can we postpone creating the pg_subscription_rel entries until the
> > > tablesync worker starts and does the schema sync? I think that since
> > > pg_subscription_rel entry needs the table OID, we need either to do
> > > the schema sync before creating the entry (i.e, during CREATE
> > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > > apply worker needs the information of tables to sync in order to
> > > launch the tablesync workers, but it needs to create the table schema
> > > to get that information.
> >
> > For the above reason, I think that step 6 of the initial proposal won't work.
> >
> > If we can have the tablesync worker create an entry of pg_subscription_rel after
> > creating the table, it may give us the flexibility to perform the initial sync. One
> > idea is that we add a relname field to pg_subscription_rel so that we can create
> > entries with relname instead of OID if the table is not created yet. Once the
> > table is created, we clear the relname field and set the OID of the table instead.
> > It's not an ideal solution but we might make it simpler later.
> >
> > Assuming that it's feasible, I'm considering another approach for the initial sync
> > in order to address the concurrent DDLs.
> >
> > The basic idea is to somewhat follow how pg_dump/restore to dump/restore
> > the database data. We divide the synchronization phase (including both schema
> > and data) up into three phases: pre-data, table-data, post-data. These mostly
> > follow the --section option of pg_dump.
> >
> > 1. The backend process performing CREATE SUBSCRIPTION creates the
> > subscription but doesn't create pg_subscription_rel entries yet.
> >
> > 2. Before starting the streaming, the apply worker fetches the table list from the
> > publisher, create pg_subscription_rel entries for them, and dumps+restores
> > database objects that tables could depend on but don't depend on tables such as
> > TYPE, OPERATOR, ACCESS METHOD etc (i.e.
> > pre-data).
>
> We will not have slot starting snapshot, So somehow we have to get a new snapshot
> And skip all the wal_log between starting of slot and snapshot creation lsn ? .

Yes. Or we can somehow postpone creating pg_subscription_rel entries
until the tablesync workers create the tables, or we can request the walsender
to establish a new snapshot using the technique proposed in email [1].

>
> >
> > 3. The apply worker launches the tablesync workers for tables that need to be
> > synchronized.
> >
> > There might be DDLs executed on the publisher for tables before the tablesync
> > worker starts. But the apply worker needs to apply DDLs for pre-data database
> > objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database
> > objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data).
> >
> > 4. The tablesync worker creates its replication slot, dumps+restores the table
> > schema, update the pg_subscription_rel, and perform COPY.
> >
> > These operations should be done in the same transaction.
>
> pg_restore wont be rollbackable, So we need to maintain states in pg_subscription_rel.

Yes. But I think it depends on how we restore them. For example, if we
have the tablesync worker somehow restore the table using a new SQL
function returning the table schema, as we discussed, or by executing the
dump file via SPI, we can do that in the same transaction.
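
A rough sketch of that, where get_remote_table_ddl() stands in for such a new
(hypothetical, not-yet-existing) SQL function; the point is only that the
restore and the catalog entry commit or roll back together:

BEGIN;
DO $$
BEGIN
    -- hypothetical function returning the CREATE TABLE text for the remote table
    EXECUTE get_remote_table_ddl('public.t1');
END $$;
-- ... create the pg_subscription_rel entry here, in the same transaction ...
COMMIT;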

>
> >
> > 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps
> > constraints) of the table and creates them (which possibly takes a long time).
> > Then it starts to catch up, same as today. The apply worker needs to wait for the
> > tablesync worker to catch up.
>
> I don’t think we can have CATCHUP stage. We can have a DDL on publisher which
> can add a new column (And this DDL will be executed by applier later). Then we get a INSERT
>  because we have old definition of table, insert will fail.

All DMLs and DDLs associated with the table being synchronized are
applied by the tablesync worker until it catches up with the apply
worker.

>
> >
> > We need to repeat these steps until we complete the initial data copy and create
> > indexes for all tables, IOW until all pg_subscription_rel status becomes READY.
> >
> > 6. If the apply worker confirms all tables are READY, it starts another sync
> > worker who is responsible for the post-data database objects such as TRIGGER,
> > RULE, POLICY etc (i.e. post-data).
> >
> > While the sync worker is starting up or working, the apply worker applies
> > changes for pre-data database objects as well as READY tables.
> We might have some issue if we have create table like
> Create table_name as select * from materialized_view.

Could you elaborate on the scenario where we could have an issue with such DDL?

> >
> > 7. Similar to the tablesync worker, this sync worker creates its replication slot
> > and sets the returned LSN somewhere, say pg_subscription.
> >
> > 8. The sync worker dumps and restores these objects. Which could take a time
> > since it would need to create FK constraints. Then it starts to catch up if the
> > apply worker is ahead. The apply worker waits for the sync worker to  catch up.
> >
> > 9. Once the sync worker catches up, the apply worker starts applying changes
> > for all database objects.
> >
> > IIUC with this approach, we can resolve the concurrent DDL problem Sachin
> > mentioned, and indexes (and constraints) are created after the initial data copy.
> >
> > The procedures are still very complex and not fully considered yet but I hope
> > there are some useful things at least for discussion.
> >
> > Probably we can start with supporting only tables. In this case, we would not
> > need the post-data phase (i.e. step 6-9). It seems to me that we need to have
> > the subscription state somewhere (maybe
> > pg_subscription) so that the apply worker figure out the next step.
> > Since we need to dump and restore different objects on different timings, we
> > probably cannot directly use pg_dump/pg_restore. I've not considered how the
> > concurrent REFRESH PUBLICATION works.
>
> I think above prototype will work and will have least amount of side effects, but
> It might be too complex to implement and I am not sure about corner cases.
>
> I was thinking of other ways of doing Initial Sync , which are less complex but each
> with separate set of bottlenecks
>
> On Publisher Side:-
> 1) Locking the publisher:- Easiest one to implement, applier process will get Access Shared
> lock on the all the published tables. (We don't have to worry newly created concurrent table)
> As tableSync will finish syncing the table, it will release table lock, So we will release
> table locks in steps. Users can still perform DML on tables, but DDLs wont be allowed.

Do you mean that the apply worker acquires table locks and the
tablesync workers release them? If so, how can we implement it?

>
> 2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned we can use pg_dump/
> pg_restore [1], Although it might have side effect of using double storage , we can
> table pg_dump of each table separately and delete the dump as soon as table is synced.
> tableSync process will read the dump and call pg_restore on the table.
> If we crash in middle of restoring the tables we can start pg_dump(--clean)/restore again
> with left out tables.
> With this we can reduce space usage but we might create too many files.

With this idea, who runs pg_dump and pg_restore? And when do we create
pg_subscription_rel entries?

>
> 3) Using publisher snapshot:- Applier process will do pg_dump/pg_restore as usual,
> Then  applier process will start a new process P1 which will connect to
> publisher and start a transaction , it will call pg_export_snapshot() to export the
> snapshot.Then applier process will take snapshot string and pass it to the tableSync process
> as a argument. tableSync will use this snapshot for COPY TABLE. tableSync should only
> do COPY TABLE and then will exit , So we wont do any catchup phase in tableSync. After
> all tables finish COPY table transaction will be committed by P1 process and it will exit.
> In the case of crash/restart we can simple start from beginning since nothing is committed
> till every table is synced. There are 2 main issues with this approach
> 1. I am not sure what side-effects we might have on publisher since we might have to keep
> the transaction open for long time.

I'm concerned that keeping a transaction open until all tables are
synchronized would not be an acceptable downside.

> 2. Applier process will simple wait till all tables are synced.
> since applier process wont be able to apply any wal_logs till all tables are synced
> maybe instead of creating new process Applier process itself can start transaction/
> export snapshot and tableSync process will use that snapshot. After all tables are synced
> it can start wal_streaming.

I think that after users execute REFRESH PUBLICATION, there can be a
mix of non-ready and ready tables in the subscription. In this case,
it's a huge restriction for users that logical replication for the
ready tables stops until all newly-subscribed tables are synchronized.

Regards,

[1] https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com



--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > >
> > > 3. The apply worker launches the tablesync workers for tables that
> > > need to be synchronized.
> > >
> > > There might be DDLs executed on the publisher for tables before the
> > > tablesync worker starts. But the apply worker needs to apply DDLs
> > > for pre-data database objects. OTOH, it can ignore DDLs for
> > > not-synced-yet tables and other database objects such as INDEX,
> TRIGGER, RULE, etc (i.e. post-data).
> > >
> > > 4. The tablesync worker creates its replication slot, dumps+restores
> > > the table schema, update the pg_subscription_rel, and perform COPY.
> > >
> > > These operations should be done in the same transaction.
> >
> > pg_restore wont be rollbackable, So we need to maintain states in
> pg_subscription_rel.
> 
> Yes. But I think it depends on how we restore them. For example, if we have
> the tablesync worker somethow restore the table using a new SQL function
> returning the table schema as we discussed or executing the dump file via
> SPI, we can do that in the same transaction.

okay

> 
> >
> > >
> > > 5. After finishing COPY, the tablesync worker dumps indexes (and
> > > perhaps
> > > constraints) of the table and creates them (which possibly takes a long
> time).
> > > Then it starts to catch up, same as today. The apply worker needs to
> > > wait for the tablesync worker to catch up.
> >
> > I don’t think we can have CATCHUP stage. We can have a DDL on
> > publisher which can add a new column (And this DDL will be executed by
> > applier later). Then we get a INSERT  because we have old definition of
> table, insert will fail.
> 
> All DMLs and DDLs associated with the table being synchronized are applied
> by the tablesync worker until it catches up with the apply worker.

Right, sorry, I forgot that in the above case, if the table definition
changes on the publisher, we will also have the corresponding DDLs.

> 
> >
> > >
> > > We need to repeat these steps until we complete the initial data
> > > copy and create indexes for all tables, IOW until all pg_subscription_rel
> status becomes READY.
> > >
> > > 6. If the apply worker confirms all tables are READY, it starts
> > > another sync worker who is responsible for the post-data database
> > > objects such as TRIGGER, RULE, POLICY etc (i.e. post-data).
> > >
> > > While the sync worker is starting up or working, the apply worker
> > > applies changes for pre-data database objects as well as READY tables.
> > We might have some issue if we have create table like Create
> > table_name as select * from materialized_view.
> 
> Could you elaborate on the scenario where we could have an issue with such
> DDL?

Since the publisher's materialized view has not been created on the
subscriber yet, a DDL that creates a table from that materialized view
will fail. I am not sure how the DDL replication patch handles CREATE
TABLE AS statements; if such a statement is rewritten into a normal
CREATE TABLE, then we won't have any issues.

> 
> > >
> > > 7. Similar to the tablesync worker, this sync worker creates its
> > > replication slot and sets the returned LSN somewhere, say
> pg_subscription.
> > >
> > > 8. The sync worker dumps and restores these objects. Which could
> > > take a time since it would need to create FK constraints. Then it
> > > starts to catch up if the apply worker is ahead. The apply worker waits for
> the sync worker to  catch up.
> > >
> > > 9. Once the sync worker catches up, the apply worker starts applying
> > > changes for all database objects.
> > >
> > > IIUC with this approach, we can resolve the concurrent DDL problem
> > > Sachin mentioned, and indexes (and constraints) are created after the
> initial data copy.
> > >
> > > The procedures are still very complex and not fully considered yet
> > > but I hope there are some useful things at least for discussion.
> > >
> > > Probably we can start with supporting only tables. In this case, we
> > > would not need the post-data phase (i.e. step 6-9). It seems to me
> > > that we need to have the subscription state somewhere (maybe
> > > pg_subscription) so that the apply worker figure out the next step.
> > > Since we need to dump and restore different objects on different
> > > timings, we probably cannot directly use pg_dump/pg_restore. I've
> > > not considered how the concurrent REFRESH PUBLICATION works.
> >
> > I think above prototype will work and will have least amount of side
> > effects, but It might be too complex to implement and I am not sure about
> corner cases.
> >
> > I was thinking of other ways of doing Initial Sync , which are less
> > complex but each with separate set of bottlenecks
> >
> > On Publisher Side:-
> > 1) Locking the publisher:- Easiest one to implement, applier process
> > will get Access Shared lock on the all the published tables. (We don't
> > have to worry newly created concurrent table) As tableSync will finish
> > syncing the table, it will release table lock, So we will release table locks in
> steps. Users can still perform DML on tables, but DDLs wont be allowed.
> 
> Do you mean that the apply worker acquires table locks and the tablesync
> workers release them? If so, how can we implement it?
> 

I think releasing locks in steps would be impossible (given the
Postgres lock implementation), so the applier process has to start a
new transaction and lock all the published tables in ACCESS SHARE
mode. After tableSync is completed, the transaction will be committed
to release the locks. So 1 and 3 are similar: we have to keep one
transaction open till all tables are synced.
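
Concretely, something like the following (plain SQL sketch; table
names are illustrative):

BEGIN;
LOCK TABLE public.t1 IN ACCESS SHARE MODE;
LOCK TABLE public.t2 IN ACCESS SHARE MODE;
-- ... one LOCK per published table; DML still works, DDL is blocked
-- tableSync workers copy the tables meanwhile
COMMIT;  -- all the locks are released together here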

> >
> > 2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned
> we
> > can use pg_dump/ pg_restore [1], Although it might have side effect of
> > using double storage , we can table pg_dump of each table separately and
> delete the dump as soon as table is synced.
> > tableSync process will read the dump and call pg_restore on the table.
> > If we crash in middle of restoring the tables we can start
> > pg_dump(--clean)/restore again with left out tables.
> > With this we can reduce space usage but we might create too many files.
> 
> With this idea, who does pg_dump and pg_restore? and when do we create
> pg_subscription_rel entries?

The applier process will do pg_dump/pg_restore. pg_subscription_rel
entries can be created after pg_restore; we can add a new column with
the relname and keep the OID empty, as you have suggested earlier.

> 
> >
> > 3) Using publisher snapshot:- Applier process will do
> > pg_dump/pg_restore as usual, Then  applier process will start a new
> > process P1 which will connect to publisher and start a transaction ,
> > it will call pg_export_snapshot() to export the snapshot.Then applier
> > process will take snapshot string and pass it to the tableSync process
> > as a argument. tableSync will use this snapshot for COPY TABLE.
> > tableSync should only do COPY TABLE and then will exit , So we wont do
> any catchup phase in tableSync. After all tables finish COPY table transaction
> will be committed by P1 process and it will exit.
> > In the case of crash/restart we can simple start from beginning since
> > nothing is committed till every table is synced. There are 2 main
> > issues with this approach 1. I am not sure what side-effects we might
> > have on publisher since we might have to keep the transaction open for
> long time.
> 
> I'm concerned that it would not be an acceptable downside that we keep a
> transaction open until all tables are synchronized.
> 

Okay. There is one more issue: just using the same snapshot will not
stop DDL modifications to the tables; we need to hold at least an
ACCESS SHARE lock on each table. So the tables will be locked on the
publisher, which makes this essentially the same as 1.
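
For reference, the snapshot sharing itself is possible with existing
SQL-level facilities; a minimal sketch (snapshot string and table name
are illustrative):

-- long-lived transaction (P1 / applier):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();  -- returns e.g. '00000003-0000001B-1'

-- each tableSync connection:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';  -- must come first
COPY public.t1 TO STDOUT;
COMMIT;

but, as said above, this alone does not keep the table definitions
stable.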

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> > >
> > > > > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > > > > I think we won't be able to use same snapshot because the
> > > > > > > > transaction will be committed.
> > > > > > > > In CreateSubscription() we can use the transaction snapshot from
> > > > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am
> > > > > > > > not sure about this part maybe walrcv_disconnect() calls the commits
> > > > internally ?).
> > > > > > > > So somehow we need to keep this snapshot alive, even after
> > > > > > > > transaction is committed(or delay committing the transaction ,
> > > > > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we
> > > > > > > > can have a restart before tableSync is able to use the same
> > > > > > > > snapshot.)
> > > > > > > >
> > > > > > >
> > > > > > > Can we think of getting the table data as well along with schema
> > > > > > > via pg_dump? Won't then both schema and initial data will
> > > > > > > correspond to the same snapshot?
> > > > > >
> > > > > > Right , that will work, Thanks!
> > > > >
> > > > > While it works, we cannot get the initial data in parallel, no?
> > > > >
> > >
> > > I was thinking each TableSync process will call pg_dump --table, This way if we have N
> > > tableSync process, we can have N pg_dump --table=table_name called in parallel.
> > > In fact we can use --schema-only to get schema and then let COPY take care of data
> > > syncing . We will use same snapshot for pg_dump as well as COPY table.
> >
> > How can we postpone creating the pg_subscription_rel entries until the
> > tablesync worker starts and does the schema sync? I think that since
> > pg_subscription_rel entry needs the table OID, we need either to do
> > the schema sync before creating the entry (i.e, during CREATE
> > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > apply worker needs the information of tables to sync in order to
> > launch the tablesync workers, but it needs to create the table schema
> > to get that information.
>
> For the above reason, I think that step 6 of the initial proposal won't work.
>
> If we can have the tablesync worker create an entry of
> pg_subscription_rel after creating the table, it may give us the
> flexibility to perform the initial sync. One idea is that we add a
> relname field to pg_subscription_rel so that we can create entries
> with relname instead of OID if the table is not created yet. Once the
> table is created, we clear the relname field and set the OID of the
> table instead. It's not an ideal solution but we might make it simpler
> later.

While writing a PoC patch, I found some difficulties in this idea.
First, I tried to add schemaname+relname to pg_subscription_rel but I
could not define the primary key of pg_subscription_rel. The primary
key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
also doesn't work. So I tried another idea: that we generate a new OID
for srrelid and the tablesync worker will replace it with the new
table's OID once it creates the table. However, since we use srrelid
in replication slot names, changing srrelid during the initial
schema+data sync is not straightforward (please note that the slot is
created by the tablesync worker but is removed by the apply worker).
Using relname in the slot name instead of srrelid is not a good idea
since it requires that all pg_subscription_rel entries have relname,
and slot names could be duplicated, for example when the relname is
very long and gets truncated.

I'm trying to consider the idea from another angle: the apply worker
fetches the table list and passes the relname to the tablesync worker.
But a problem with this approach is that the table list is not
persisted. If the apply worker restarts during the initial table sync,
it cannot get the same list as before.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > >
> > > How can we postpone creating the pg_subscription_rel entries until the
> > > tablesync worker starts and does the schema sync? I think that since
> > > pg_subscription_rel entry needs the table OID, we need either to do
> > > the schema sync before creating the entry (i.e, during CREATE
> > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > > apply worker needs the information of tables to sync in order to
> > > launch the tablesync workers, but it needs to create the table schema
> > > to get that information.
> >
> > For the above reason, I think that step 6 of the initial proposal won't work.
> >
> > If we can have the tablesync worker create an entry of
> > pg_subscription_rel after creating the table, it may give us the
> > flexibility to perform the initial sync. One idea is that we add a
> > relname field to pg_subscription_rel so that we can create entries
> > with relname instead of OID if the table is not created yet. Once the
> > table is created, we clear the relname field and set the OID of the
> > table instead. It's not an ideal solution but we might make it simpler
> > later.
>
> While writing a PoC patch, I found some difficulties in this idea.
> First, I tried to add schemaname+relname to pg_subscription_rel but I
> could not define the primary key of pg_subscription_rel. The primary
> key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> also doesn't work.
>

Can we think of having a separate catalog table, say
pg_subscription_remote_rel, for this? You can have srsubid,
remote_schema_name, remote_rel_name, etc. We may need some other state
to be maintained during the initial schema sync, where this table can
be used. Basically, this can be used to maintain the state till the
initial schema sync is complete, because we can create a relation
entry in pg_subscription_rel only after the initial schema sync is
complete.
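
To be a bit more concrete, the shape I have in mind is roughly the
following (SQL shown only for illustration; the real definition would
live in a catalog header, and the column names are just suggestions):

CREATE TABLE pg_subscription_remote_rel
(
    srsubid            oid  NOT NULL,  -- references pg_subscription.oid
    remote_schema_name name NOT NULL,
    remote_rel_name    name NOT NULL,
    -- plus whatever per-relation state the initial schema sync needs
    PRIMARY KEY (srsubid, remote_schema_name, remote_rel_name)
);

An entry here would be removed, and the corresponding
pg_subscription_rel entry created, once the initial schema sync of
that relation is complete.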

> So I tried another idea: that we generate a new OID
> for srrelid and the tablesync worker will replace it with the new
> table's OID once it creates the table. However, since we use srrelid
> in replication slot names, changing srrelid during the initial
> schema+data sync is not straightforward (please note that the slot is
> created by the tablesync worker but is removed by the apply worker).
> Using relname in slot name instead of srrelid is not a good idea since
> it requires all pg_subscription_rel entries have relname, and slot
> names could be duplicated, for example, when the relname is very long
> and we cut it.
>
> I'm trying to consider the idea from another angle: the apply worker
> fetches the table list and passes the relname to the tablesync worker.
> But a problem of this approach is that the table list is not
> persisted. If the apply worker restarts during the initial table sync,
> it could not get the same list as before.
>

Agreed, this has some drawbacks. We can try to explore this if the
above idea of the new catalog table doesn't solve this problem.


--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > >
> > > > How can we postpone creating the pg_subscription_rel entries until the
> > > > tablesync worker starts and does the schema sync? I think that since
> > > > pg_subscription_rel entry needs the table OID, we need either to do
> > > > the schema sync before creating the entry (i.e, during CREATE
> > > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > > > apply worker needs the information of tables to sync in order to
> > > > launch the tablesync workers, but it needs to create the table schema
> > > > to get that information.
> > >
> > > For the above reason, I think that step 6 of the initial proposal won't work.
> > >
> > > If we can have the tablesync worker create an entry of
> > > pg_subscription_rel after creating the table, it may give us the
> > > flexibility to perform the initial sync. One idea is that we add a
> > > relname field to pg_subscription_rel so that we can create entries
> > > with relname instead of OID if the table is not created yet. Once the
> > > table is created, we clear the relname field and set the OID of the
> > > table instead. It's not an ideal solution but we might make it simpler
> > > later.
> >
> > While writing a PoC patch, I found some difficulties in this idea.
> > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > could not define the primary key of pg_subscription_rel. The primary
> > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > also doesn't work.
> >
>
> Can we think of having a separate catalog table say
> pg_subscription_remote_rel for this? You can have srsubid,
> remote_schema_name, remote_rel_name, etc. We may need some other state
> to be maintained during the initial schema sync where this table can
> be used. Basically, this can be used to maintain the state till the
> initial schema sync is complete because we can create a relation entry
> in pg_subscritption_rel only after the initial schema sync is
> complete.

It might not be ideal but I guess it works. But I think we need to
modify the name of the replication slot for initial sync, as it
currently includes the OID of the table:

void
ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
                                char *syncslotname, Size szslot)
{
    snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
             relid, GetSystemIdentifier());
}

If we use both schema name and table name, it's possible that slot
names are duplicated if schema and/or table names are long. Another
idea is to use the hash value of schema+table names, but it cannot
completely eliminate that possibility, and probably would make
investigation and debugging hard in case of any failure. Probably we
can use the OID of each entry in pg_subscription_remote_rel instead,
but I'm not sure it's a good idea, mainly because we will end up using
twice as many OIDs as before.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > > While writing a PoC patch, I found some difficulties in this idea.
> > > First, I tried to add schemaname+relname to pg_subscription_rel but
> > > I could not define the primary key of pg_subscription_rel. The
> > > primary key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > Similarly, the primary key on (srsubid, srrelid, schemaname,
> > > relname) also doesn't work.
> > >
> >
> > Can we think of having a separate catalog table say
> > pg_subscription_remote_rel for this? You can have srsubid,
> > remote_schema_name, remote_rel_name, etc. We may need some other
> state
> > to be maintained during the initial schema sync where this table can
> > be used. Basically, this can be used to maintain the state till the
> > initial schema sync is complete because we can create a relation entry
> > in pg_subscritption_rel only after the initial schema sync is
> > complete.
> 
> It might not be ideal but I guess it works. But I think we need to modify the name
> of replication slot for initial sync as it currently includes OID of the table:
> 
> void
> ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
>                                 char *syncslotname, Size szslot) {
>     snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
>              relid, GetSystemIdentifier()); }
> 
> If we use both schema name and table name, it's possible that slot names are
> duplicated if schema and/or table names are long. Another idea is to use the
> hash value of schema+table names, but it cannot completely eliminate that
> possibility, and probably would make investigation and debugging hard in case
> of any failure. Probably we can use the OID of each entry in
> pg_subscription_remote_rel instead, but I'm not sure it's a good idea, mainly
> because we will end up using twice as many OIDs as before.

Maybe we can create a serial primary key for the
pg_subscription_remote_rel table and use this key for creating the
replication slot?
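
Something like the following (illustrative SQL, not the actual catalog
machinery):

CREATE TABLE pg_subscription_remote_rel
(
    srid               bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    srsubid            oid  NOT NULL,
    remote_schema_name name NOT NULL,
    remote_rel_name    name NOT NULL
);

-- the sync slot name would then embed srid rather than the table OID,
-- i.e. "pg_%u_sync_%u_..." built from (suboid, srid)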

Regards
Sachin

RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
I am working on a prototype with the above idea, and will send it for review by Sunday/Monday.

Regards
Sachin

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > >
> > > While writing a PoC patch, I found some difficulties in this idea.
> > > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > > could not define the primary key of pg_subscription_rel. The primary
> > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > > also doesn't work.
> > >
> >
> > Can we think of having a separate catalog table say
> > pg_subscription_remote_rel for this? You can have srsubid,
> > remote_schema_name, remote_rel_name, etc. We may need some other state
> > to be maintained during the initial schema sync where this table can
> > be used. Basically, this can be used to maintain the state till the
> > initial schema sync is complete because we can create a relation entry
> > in pg_subscritption_rel only after the initial schema sync is
> > complete.
>
> It might not be ideal but I guess it works. But I think we need to
> modify the name of replication slot for initial sync as it currently
> includes OID of the table:
>
> void
> ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
>                                 char *syncslotname, Size szslot)
> {
>     snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
>              relid, GetSystemIdentifier());
> }
>
> If we use both schema name and table name, it's possible that slot
> names are duplicated if schema and/or table names are long. Another
> idea is to use the hash value of schema+table names, but it cannot
> completely eliminate that possibility, and probably would make
> investigation and debugging hard in case of any failure. Probably we
> can use the OID of each entry in pg_subscription_remote_rel instead,
> but I'm not sure it's a good idea, mainly because we will end up using
> twice as many OIDs as before.
>

The other possibility is to use worker_pid. To make debugging easier,
we may want to LOG schema_name+rel_name vs slot_name mapping at DEBUG1
log level.

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
I am working on a prototype with the above discussed idea; I think I will send it for initial review by Monday.

Regards
Sachin


Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Thu, Apr 20, 2023 at 9:41 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> I am working on a prototype with above discussed idea, I think I will send it for initial review by Monday.
>

Okay, but which idea are you referring to? The
pg_subscription_remote_rel + worker_pid idea Amit proposed?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > >
> > > > While writing a PoC patch, I found some difficulties in this idea.
> > > > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > > > could not define the primary key of pg_subscription_rel. The primary
> > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > > > also doesn't work.
> > > >
> > >
> > > Can we think of having a separate catalog table say
> > > pg_subscription_remote_rel for this? You can have srsubid,
> > > remote_schema_name, remote_rel_name, etc. We may need some other state
> > > to be maintained during the initial schema sync where this table can
> > > be used. Basically, this can be used to maintain the state till the
> > > initial schema sync is complete because we can create a relation entry
> > > in pg_subscritption_rel only after the initial schema sync is
> > > complete.
> >
> > It might not be ideal but I guess it works. But I think we need to
> > modify the name of replication slot for initial sync as it currently
> > includes OID of the table:
> >
> > void
> > ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> >                                 char *syncslotname, Size szslot)
> > {
> >     snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
> >              relid, GetSystemIdentifier());
> > }
> >
> > If we use both schema name and table name, it's possible that slot
> > names are duplicated if schema and/or table names are long. Another
> > idea is to use the hash value of schema+table names, but it cannot
> > completely eliminate that possibility, and probably would make
> > investigation and debugging hard in case of any failure. Probably we
> > can use the OID of each entry in pg_subscription_remote_rel instead,
> > but I'm not sure it's a good idea, mainly because we will end up using
> > twice as many OIDs as before.
> >
>
> The other possibility is to use worker_pid. To make debugging easier,
> we may want to LOG schema_name+rel_name vs slot_name mapping at DEBUG1
> log level.

Since worker_pid changes after the worker restarts, a new worker
cannot find the slot that had been used, no?

After thinking it over, a better solution would be to add an oid
column, an nspname column, and a relname column to pg_subscription_rel,
with the primary key on the oid. If the table is not yet present on
the subscriber we store the schema name and table name in the catalog;
otherwise we store the local table oid, same as today. The local table
oid will be filled in after the schema sync. The names of the origin
and replication slot the tablesync worker uses are built from this oid
instead of the table oid.
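
In catalog terms the idea is roughly the following (plain SQL only for
illustration; the actual definition is in pg_subscription_rel.h, and
srnspname/srrelname are the column names used in the attached PoC):

CREATE TABLE pg_subscription_rel
(
    oid        oid PRIMARY KEY,  -- also used in origin and slot names
    srsubid    oid NOT NULL,     -- subscription OID
    srrelid    oid,              -- local table OID; NULL until created
    srnspname  name,             -- remote schema name, while srrelid is NULL
    srrelname  name,             -- remote table name, while srrelid is NULL
    srsubstate "char" NOT NULL   -- sync state, as today
);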

I've attached a PoC patch of this idea (a very rough patch with many
TODO comments). It mixes the following changes:

1. Add oid column to the pg_subscription_rel. The oid is used as the
primary key and in the names of origin and slot the tablesync workers
use.

2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet
support for ALTER SUBSCRIPTION).

3. Add a new CRS_EXPORT_USE_SNAPSHOT action so that the same snapshot
can be used by both the walsender and other processes (e.g. pg_dump).
In this patch, the snapshot is exported for pg_dump and is used by the
walsender for COPY.
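
For context, today's replication command lets a slot creation either
export its snapshot for other sessions or use it in the walsender, but
not both, e.g. (slot name illustrative):

CREATE_REPLICATION_SLOT "sub_sync_slot" LOGICAL pgoutput (SNAPSHOT 'export');

The new CRS_EXPORT_USE_SNAPSHOT action is meant to do both at once.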

It seems to work well but there might be a pitfall as I've not fully
implemented it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Initial Schema Sync for Logical Replication

From
"Wei Wang (Fujitsu)"
Date:
On Fri, Apr 21, 2023 at 16:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > >
> > > > > While writing a PoC patch, I found some difficulties in this idea.
> > > > > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > > > > could not define the primary key of pg_subscription_rel. The primary
> > > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > > > > also doesn't work.
> > > > >
> > > >
> > > > Can we think of having a separate catalog table say
> > > > pg_subscription_remote_rel for this? You can have srsubid,
> > > > remote_schema_name, remote_rel_name, etc. We may need some other
> state
> > > > to be maintained during the initial schema sync where this table can
> > > > be used. Basically, this can be used to maintain the state till the
> > > > initial schema sync is complete because we can create a relation entry
> > > > in pg_subscritption_rel only after the initial schema sync is
> > > > complete.
> > >
> > > It might not be ideal but I guess it works. But I think we need to
> > > modify the name of replication slot for initial sync as it currently
> > > includes OID of the table:
> > >
> > > void
> > > ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> > >                                 char *syncslotname, Size szslot)
> > > {
> > >     snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT,
> suboid,
> > >              relid, GetSystemIdentifier());
> > > }
> > >
> > > If we use both schema name and table name, it's possible that slot
> > > names are duplicated if schema and/or table names are long. Another
> > > idea is to use the hash value of schema+table names, but it cannot
> > > completely eliminate that possibility, and probably would make
> > > investigation and debugging hard in case of any failure. Probably we
> > > can use the OID of each entry in pg_subscription_remote_rel instead,
> > > but I'm not sure it's a good idea, mainly because we will end up using
> > > twice as many OIDs as before.
> > >
> >
> > The other possibility is to use worker_pid. To make debugging easier,
> > we may want to LOG schema_name+rel_name vs slot_name mapping at
> DEBUG1
> > log level.
> 
> Since worker_pid changes after the worker restarts, a new worker
> cannot find the slot that had been used, no?
> 
> After thinking it over, a better solution would be that we add an oid
> column, nspname column, and relname column to pg_subscription_rel and
> the primary key on the oid. If the table is not present on the
> subscriber we store the schema name and table name to the catalog, and
> otherwise we store the local table oid same as today. The local table
> oid will be filled after the schema sync. The names of origin and
> replication slot the tablesync worker uses use the oid instead of the
> table oid.
> 
> I've attached a PoC patch of this idea (very rough patch and has many
> TODO comments). It mixes the following changes:
> 
> 1. Add oid column to the pg_subscription_rel. The oid is used as the
> primary key and in the names of origin and slot the tablesync workers
> use.
> 
> 2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet
> support for ALTER SUBSCRIPTION).
> 
> 3. Add CRS_EXPORT_USE_SNAPSHOT new action in order to use the same
> snapshot by both walsender and other processes (e.g. pg_dump). In this
> patch, the snapshot is exported for pg_dump and is used by the
> walsender for COPY.
> 
> It seems to work well but there might be a pitfall as I've not fully
> implemented it.

Thanks for your POC patch.
After reviewing this patch, I have a question below that I want to confirm:

1. In the function synchronize_table_schema:
I think some GUC settings and table-related object SQLs are included in the
pg_dump result, and in this POC these SQLs will all be executed. Do we need to
filter the pg_dump result to only execute the table-schema-related SQLs?
For example, if we have below table schema in the publisher-side:
```
create table tbl(a int, b int);
create index idx_t on tbl (a);
CREATE FUNCTION trigger_func() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
BEGIN INSERT INTO public.tbl VALUES (NEW.*); RETURN NEW; END; $$;
 
CREATE TRIGGER tri_tbl BEFORE INSERT ON public.tbl FOR EACH ROW EXECUTE PROCEDURE trigger_func();
```
The result of pg_dump executed on the subscriber-side:
```
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
SET default_tablespace = '';
SET default_table_access_method = heap;

CREATE TABLE public.tbl (
    a integer,
    b integer
);

ALTER TABLE public.tbl OWNER TO postgres;

CREATE INDEX idx_t ON public.tbl USING btree (a);

CREATE TRIGGER tri_tbl BEFORE INSERT ON public.tbl FOR EACH ROW EXECUTE FUNCTION public.trigger_func();
```
And this will cause an error at `CREATE TRIGGER`, because we did not dump the
function trigger_func.

Regards,
Wang Wei

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Thu, Apr 27, 2023 at 12:02 PM Wei Wang (Fujitsu)
<wangw.fnst@fujitsu.com> wrote:
>
> On Fri, Apr 21, 2023 at 16:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > While writing a PoC patch, I found some difficulties in this idea.
> > > > > > First, I tried to add schemaname+relname to pg_subscription_rel but I
> > > > > > could not define the primary key of pg_subscription_rel. The primary
> > > > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL.
> > > > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname)
> > > > > > also doesn't work.
> > > > > >
> > > > >
> > > > > Can we think of having a separate catalog table say
> > > > > pg_subscription_remote_rel for this? You can have srsubid,
> > > > > remote_schema_name, remote_rel_name, etc. We may need some other
> > state
> > > > > to be maintained during the initial schema sync where this table can
> > > > > be used. Basically, this can be used to maintain the state till the
> > > > > initial schema sync is complete because we can create a relation entry
> > > > > in pg_subscritption_rel only after the initial schema sync is
> > > > > complete.
> > > >
> > > > It might not be ideal but I guess it works. But I think we need to
> > > > modify the name of replication slot for initial sync as it currently
> > > > includes OID of the table:
> > > >
> > > > void
> > > > ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> > > >                                 char *syncslotname, Size szslot)
> > > > {
> > > >     snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT,
> > suboid,
> > > >              relid, GetSystemIdentifier());
> > > > }
> > > >
> > > > If we use both schema name and table name, it's possible that slot
> > > > names are duplicated if schema and/or table names are long. Another
> > > > idea is to use the hash value of schema+table names, but it cannot
> > > > completely eliminate that possibility, and probably would make
> > > > investigation and debugging hard in case of any failure. Probably we
> > > > can use the OID of each entry in pg_subscription_remote_rel instead,
> > > > but I'm not sure it's a good idea, mainly because we will end up using
> > > > twice as many OIDs as before.
> > > >
> > >
> > > The other possibility is to use worker_pid. To make debugging easier,
> > > we may want to LOG schema_name+rel_name vs slot_name mapping at
> > DEBUG1
> > > log level.
> >
> > Since worker_pid changes after the worker restarts, a new worker
> > cannot find the slot that had been used, no?
> >
> > After thinking it over, a better solution would be that we add an oid
> > column, nspname column, and relname column to pg_subscription_rel and
> > the primary key on the oid. If the table is not present on the
> > subscriber we store the schema name and table name to the catalog, and
> > otherwise we store the local table oid same as today. The local table
> > oid will be filled after the schema sync. The names of origin and
> > replication slot the tablesync worker uses use the oid instead of the
> > table oid.
> >
> > I've attached a PoC patch of this idea (very rough patch and has many
> > TODO comments). It mixes the following changes:
> >
> > 1. Add oid column to the pg_subscription_rel. The oid is used as the
> > primary key and in the names of origin and slot the tablesync workers
> > use.
> >
> > 2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet
> > support for ALTER SUBSCRIPTION).
> >
> > 3. Add CRS_EXPORT_USE_SNAPSHOT new action in order to use the same
> > snapshot by both walsender and other processes (e.g. pg_dump). In this
> > patch, the snapshot is exported for pg_dump and is used by the
> > walsender for COPY.
> >
> > It seems to work well but there might be a pitfall as I've not fully
> > implemented it.
>
> Thanks for your POC patch.
> After reviewing this patch, I have a question below that want to confirm:
>
> 1. In the function synchronize_table_schema.
> I think some changes to GUC and table-related object SQLs are included in the
> pg_dump result. And in this POC, these SQLs will be executed. Do we need to
> alter the pg_dump results to only execute the table schema related SQLs?

Yes, in this approach we need to dump/restore objects at a fine
granularity. Ideally, the table sync worker dumps and restores the
table schema, copies the initial data, and then creates indexes;
triggers and other table-related objects are created after that. So if
we go with the pg_dump approach to copy the schema of individual
tables, we need to change pg_dump (or libpgdump needs to be able to do
this) to support it.
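
Schematically, the per-table sequence would look like the following
(plain SQL, purely illustrative, reusing the example objects from
upthread):

CREATE TABLE public.tbl (a int, b int);            -- 1. table schema only
-- 2. COPY public.tbl FROM ... (initial data)
CREATE INDEX idx_t ON public.tbl USING btree (a);  -- 3. indexes after copy
CREATE TRIGGER tri_tbl BEFORE INSERT ON public.tbl
    FOR EACH ROW EXECUTE FUNCTION public.trigger_func();  -- 4. dependents last

and pg_dump currently has no mode that emits just one of these stages
for a single table.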

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yes, in this approach, we need to dump/restore objects while
> specifying with fine granularity. Ideally, the table sync worker dumps
> and restores the table schema, does copy the initial data, and then
> creates indexes, and triggers and table-related objects are created
> after that. So if we go with the pg_dump approach to copy the schema
> of individual tables, we need to change pg_dump (or libpgdump needs to
> be able to do) to support it.

We have been discussing how to sync schema but I'd like to step back a
bit and discuss use cases and requirements of this feature.

Suppose that a table belongs to a publication: what objects related to
the table do we want to sync with the initial schema sync feature?
IOW, do we want to sync the table's ACLs, tablespace settings,
triggers, and security labels too?

If we want to replicate the whole database, e.g. when using logical
replication for major version upgrade, it would be convenient if it
synchronizes all table-related objects. However, if we have only this
option, it could be useless in some cases. For example, in a case
where users have different database users on the subscriber than the
publisher, they might want to sync only CREATE TABLE, and set ACL etc
by themselves. In this case, it would not be necessary to sync ACL and
security labels.

What use case do we want to support with this feature? I think the
implementation could vary depending on how we select which objects to
sync.

One possible idea is to select objects to sync depending on how DDL
replication is set in the publisher. It's straightforward but I'm not
sure the design of DDL replication syntax has been decided. Also, even
if we create a publication with ddl = 'table' option, it's not clear
to me that we want to sync table-dependent triggers, indexes, and
rules too by the initial sync feature.

Second idea is to make it configurable by users so that they can
specify what objects to sync. But it would make the feature complex
and I'm not sure users can use it properly.

Third idea is that since the use case of synchronizing the whole
database can be achievable even by pg_dump(all), we support
synchronizing only tables (+ indexes) in the initial sync feature,
which can not be achievable by pg_dump.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Yes, in this approach, we need to dump/restore objects while
> > specifying with fine granularity. Ideally, the table sync worker dumps
> > and restores the table schema, does copy the initial data, and then
> > creates indexes, and triggers and table-related objects are created
> > after that. So if we go with the pg_dump approach to copy the schema
> > of individual tables, we need to change pg_dump (or libpgdump needs to
> > be able to do) to support it.
>
> We have been discussing how to sync schema but I'd like to step back a
> bit and discuss use cases and requirements of this feature.
>
> Suppose that a table belongs to a publication, what objects related to
> the table we want to sync by the initial schema sync features? IOW, do
> we want to sync table's ACLs, tablespace settings, triggers, and
> security labels too?
>
> If we want to replicate the whole database, e.g. when using logical
> replication for major version upgrade, it would be convenient if it
> synchronizes all table-related objects. However, if we have only this
> option, it could be useless in some cases. For example, in a case
> where users have different database users on the subscriber than the
> publisher, they might want to sync only CREATE TABLE, and set ACL etc
> by themselves. In this case, it would not be necessary to sync ACL and
> security labels.
>
> What use case do we want to support by this feature? I think the
> implementation could be varied depending on how to select what objects
> to sync.
>
> One possible idea is to select objects to sync depending on how DDL
> replication is set in the publisher. It's straightforward but I'm not
> sure the design of DDL replication syntax has been decided. Also, even
> if we create a publication with ddl = 'table' option, it's not clear
> to me that we want to sync table-dependent triggers, indexes, and
> rules too by the initial sync feature.
>

I think it is better to keep the initial sync the same as the
replication. So, if the publication specifies 'table' then we should
just synchronize tables. Otherwise, it will look odd that the initial
sync has synchronized, say, index-related DDLs but the later
replication didn't replicate them. OTOH, if we want to do an initial
sync of table-dependent objects like triggers, indexes, rules, etc.
when the user has specified ddl = 'table', then the replication should
also follow the same. The main reason to exclude the other objects
during replication is to reduce the scope of the deparsing patch, but
if we have a finite set of objects (say, all dependent on the table)
then we can probably try to address those.

> Second idea is to make it configurable by users so that they can
> specify what objects to sync. But it would make the feature complex
> and I'm not sure users can use it properly.
>
> Third idea is that since the use case of synchronizing the whole
> database can be achievable even by pg_dump(all), we support
> synchronizing only tables (+ indexes) in the initial sync feature,
> which can not be achievable by pg_dump.
>

Can't we add some switch to dump only the table and not its dependents
if we want to go with that approach?

--
With Regards,
Amit Kapila.



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Tue, May 23, 2023 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Yes, in this approach, we need to dump/restore objects while
> > > specifying with fine granularity. Ideally, the table sync worker dumps
> > > and restores the table schema, does copy the initial data, and then
> > > creates indexes, and triggers and table-related objects are created
> > > after that. So if we go with the pg_dump approach to copy the schema
> > > of individual tables, we need to change pg_dump (or libpgdump needs to
> > > be able to do) to support it.
> >
> > We have been discussing how to sync schema but I'd like to step back a
> > bit and discuss use cases and requirements of this feature.
> >
> > Suppose that a table belongs to a publication, what objects related to
> > the table we want to sync by the initial schema sync features? IOW, do
> > we want to sync table's ACLs, tablespace settings, triggers, and
> > security labels too?
> >
> > If we want to replicate the whole database, e.g. when using logical
> > replication for major version upgrade, it would be convenient if it
> > synchronizes all table-related objects. However, if we have only this
> > option, it could be useless in some cases. For example, in a case
> > where users have different database users on the subscriber than the
> > publisher, they might want to sync only CREATE TABLE, and set ACL etc
> > by themselves. In this case, it would not be necessary to sync ACL and
> > security labels.
> >
> > What use case do we want to support by this feature? I think the
> > implementation could be varied depending on how to select what objects
> > to sync.
> >
> > One possible idea is to select objects to sync depending on how DDL
> > replication is set in the publisher. It's straightforward but I'm not
> > sure the design of DDL replication syntax has been decided. Also, even
> > if we create a publication with ddl = 'table' option, it's not clear
> > to me that we want to sync table-dependent triggers, indexes, and
> > rules too by the initial sync feature.
> >
>
> I think it is better to keep the initial sync the same as the
> replication. So, if the publication specifies 'table' then we should
> just synchronize tables. Otherwise, it will look odd that the initial
> sync has synchronized say index-related DDLs but then later
> replication didn't replicate it. OTOH, if we want to do initial sync
> of table-dependent objects like triggers, indexes, rules, etc. when
> the user has specified ddl = 'table' then the replication should also
> follow the same. The main reason to exclude the other objects during
> replication is to reduce the scope of deparsing patch but if we have a
> finite set of objects (say all dependent on the table) then we can
> probably try to address those.
>

We have discussed several ideas for how to synchronize schemas between
the publisher and subscribers, and the points are summarized in the
Wiki page[1]. As for the idea of using pg_dump, we were concerned that
pg_dump needs to be present along with the server binary if the user
wants to use the initial schema synchronization feature. Since these
binaries are typically included in different packages, users would
need to install both. During PGCon we discussed this with some senior
hackers, and it would be an acceptable limitation for users. When
executing CREATE/ALTER SUBSCRIPTION, we check if pg_dump is available
and raise an error if not. We also discussed the idea of using
pg_dump_library, but no one preferred this idea because of its
implementation costs. Therefore, I'm going to do further evaluation of
the pg_dump idea.

I agree with Amit that the initial schema synchronization should cover
the same objects as the DDL replication. We can support only table
schemas as the first step. To do that, we need a new switch, say
--exclude-table-dependents, in pg_dump to dump only table schemas,
excluding table-related objects such as triggers and indexes. Then, as
the second step, we can support synchronizing tables and table-related
objects such as triggers, indexes, and rules, which can be done with
the --schema and --table options. Finally, we can synchronize the
whole database by using the --schema option.

We also need to research how to integrate the initial schema
synchronization with tablesync workers.  We have a PoC patch[2].

Regards,

[1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync
[2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Peter Smith
Date:
On Thu, Jun 8, 2023 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
...

> We also need to research how to integrate the initial schema
> synchronization with tablesync workers.  We have a PoC patch[2].
>
> Regards,
>
> [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync
> [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com
>

FYI -- the PoC patch fails to apply using HEAD fetched today.

git apply ../patches_misc/0001-Poc-initial-table-structure-synchronization-in-logic.patch
error: patch failed: src/backend/replication/logical/tablesync.c:1245
error: src/backend/replication/logical/tablesync.c: patch does not apply

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Initial Schema Sync for Logical Replication

From
Peter Smith
Date:
On Thu, Jun 15, 2023 at 4:14 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Jun 8, 2023 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> ...
>
> > We also need to research how to integrate the initial schema
> > synchronization with tablesync workers.  We have a PoC patch[2].
> >
> > Regards,
> >
> > [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync
> > [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com
> >
>
> FYI -- the PoC patch fails to apply using HEAD fetched today.
>
> git apply ../patches_misc/0001-Poc-initial-table-structure-synchronization-in-logic.patch
> error: patch failed: src/backend/replication/logical/tablesync.c:1245
> error: src/backend/replication/logical/tablesync.c: patch does not apply
>

After rebasing the PoC patch locally, I found the 'make check' still
did not pass 100%.

# 2 of 215 tests failed.

Here are the differences:

diff -U3 /home/postgres/oss_postgres_misc/src/test/regress/expected/rules.out
/home/postgres/oss_postgres_misc/src/test/regress/results/rules.out
--- /home/postgres/oss_postgres_misc/src/test/regress/expected/rules.out
   2023-06-02 23:12:32.073864475 +1000
+++ /home/postgres/oss_postgres_misc/src/test/regress/results/rules.out
2023-06-15 16:53:29.352622676 +1000
@@ -2118,14 +2118,14 @@
     su.subname,
     st.pid,
     st.leader_pid,
-    st.relid,
+    st.subrelid,
     st.received_lsn,
     st.last_msg_send_time,
     st.last_msg_receipt_time,
     st.latest_end_lsn,
     st.latest_end_time
    FROM (pg_subscription su
-     LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid, relid,
pid, leader_pid, received_lsn, last_msg_send_time,
last_msg_receipt_time, latest_end_lsn, latest_end_time) ON ((st.subid
= su.oid)));
+     LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid,
subrelid, pid, leader_pid, received_lsn, last_msg_send_time,
last_msg_receipt_time, latest_end_lsn, latest_end_time) ON ((st.subid
= su.oid)));
 pg_stat_subscription_stats| SELECT ss.subid,
     s.subname,
     ss.apply_error_count,
diff -U3 /home/postgres/oss_postgres_misc/src/test/regress/expected/oidjoins.out
/home/postgres/oss_postgres_misc/src/test/regress/results/oidjoins.out
--- /home/postgres/oss_postgres_misc/src/test/regress/expected/oidjoins.out
2022-10-04 15:11:32.457834981 +1100
+++ /home/postgres/oss_postgres_misc/src/test/regress/results/oidjoins.out
 2023-06-15 16:54:07.159839010 +1000
@@ -265,4 +265,3 @@
 NOTICE:  checking pg_subscription {subdbid} => pg_database {oid}
 NOTICE:  checking pg_subscription {subowner} => pg_authid {oid}
 NOTICE:  checking pg_subscription_rel {srsubid} => pg_subscription {oid}
-NOTICE:  checking pg_subscription_rel {srrelid} => pg_class {oid}

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Initial Schema Sync for Logical Replication

From
Peter Smith
Date:
Hi,

Below are my review comments for the PoC patch 0001.

In addition, the patch needed rebasing, and after I rebased it
locally in my private environment there were still test failures:
a) The 'make check' tests fail, but only in a minor way, due to a column name change.
b) The subscription TAP test did not work at all for me -- many errors.

======
Commit message.

1.
- Add oid column to the pg_subscription_rel.
  - use it as the primary key.
  - use it in the names of origin and slot the tablesync workers use.

~

IIUC, there were lots of variables called 'subrelid' referring to this
new 'oid' field, but I found that very confusing alongside the
similarly named 'relid'. I wonder if all of those could be named
something like 'sroid' or 'srid' to reduce the confusion between such
similar names?


======
src/backend/catalog/pg_subscription.c

2. AddSubscriptionRelState

I felt there should be some sanity-check Asserts for the args here,
e.g. we cannot have a valid relid when copy_schema == true, etc.
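
For example (the exact invariant is my assumption, not something
stated by the patch):

Assert(!(copy_schema && OidIsValid(relid)));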

~~~

3.
+ if (nspname)
+ values[Anum_pg_subscription_rel_srnspname - 1] = CStringGetDatum(nspname);
+ else
+ nulls[Anum_pg_subscription_rel_srnspname - 1] = true;
+
+ if (relname)
+ values[Anum_pg_subscription_rel_srrelname - 1] = CStringGetDatum(relname);
+ else
+ nulls[Anum_pg_subscription_rel_srrelname - 1] = true;

Here is where I was wondering: why not pass the nspname and relname
all the time, even for a valid 'relid' (when copy_schema is false)? It
should simplify some code, as well as put more useful/readable
information into the catalog.

~~~

4. UpdateSubscriptionRelRelid

+ /* XXX: need to distinguish from message in UpdateSubscriptionRelState() */
+ if (!HeapTupleIsValid(tup))
+ elog(ERROR, "subscription table %u in subscription %u does not exist",
+ subrelid, subid);

Is that ERROR msg correct? IIUC the 'subrelid' is the Oid of the row
in the catalog -- it is not the "subscription table" Oid.
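
Perhaps something like this instead (the wording is only a suggestion):

elog(ERROR, "pg_subscription_rel entry %u for subscription %u does not exist",
     subrelid, subid);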

~~~

5. UpdateSubscriptionRelState

  if (!HeapTupleIsValid(tup))
  elog(ERROR, "subscription table %u in subscription %u does not exist",
- relid, subid);
+ subrelid, subid);


(ditto previous review comment)

Is that ERROR msg correct? IIUC the subrelid is the Oid of the row in
the catalog -- it is not the "subscription table" Oid.

~~~

6. GetSubscriptoinRelStateByRelid

There is a spelling mistake in this function name

/Subscriptoin/Subscription/

~~~

7.
+ ScanKeyInit(&skey[0],
+ Anum_pg_subscription_rel_srrelid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(relid));
+ ScanKeyInit(&skey[1],
+ Anum_pg_subscription_rel_srsubid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(subid));

Won't it be better to swap the order of these so it matches the
function comment "(srsubid, srrelid)"?

~~~

8.
+ tup = systable_getnext(scan);
+
+
+ if (!HeapTupleIsValid(tup))

Double blank lines

~~~

9.
/* Get palloc'ed SubscriptionRelState of the given subrelid */
SubscriptionRelState *
GetSubscriptionRelByOid(Oid subrelid)

~

There seems to be some function name confusion because the struct is
called SubscriptionRelState and it also has a 'state' field.

e.g. The functions named GetSubscriptionRelStateXXX return only the
state field of the struct. OTOH, this function returns the
SubscriptionRelState* but it is NOT called
GetSubscriptionRelStateByOid (??).

~~~

10. deconstruct_subrelstate

+ /* syncflags */
+ relstate->syncflags =
+ (((subrel_form->srsyncschema) ? SUBREL_SYNC_KIND_SCHEMA : 0) |
+ ((subrel_form->srsyncdata) ? SUBREL_SYNC_KIND_DATA : 0));

Seems excessive parens.
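
For example, the same assignment with fewer parens:

relstate->syncflags =
    (subrel_form->srsyncschema ? SUBREL_SYNC_KIND_SCHEMA : 0) |
    (subrel_form->srsyncdata ? SUBREL_SYNC_KIND_DATA : 0);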

~~~

11.
+ return relstate;
+}
 /*
  * Drop subscription relation mapping. These can be for a particular
  * subscription, or for a particular relation, or both.
  */
 void
-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, Oid subrelid)

~

There is no blank line before this function

~~~

12. RemoveSubscriptionRel

-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, Oid subrelid)
 {

~

IIUC what you called 'subrelid' is the PK, so would it make more sense
for that to be the 1st parameter for this function?

======
src/backend/commands/subscriptioncmds.c

13. struct SubOpts

  bool copy_data;
+ /* XXX: want to choose synchronizing only tables or all objects? */
+ bool copy_schema;

I wonder if it would be more natural to put the 'copy_schema' field
before the 'copy_data' field?

~~~

14. parse_subscription_options

  if (IsSet(supported_opts, SUBOPT_COPY_DATA))
  opts->copy_data = true;
+ if (IsSet(supported_opts, SUBOPT_COPY_SCHEMA))
+ opts->copy_data = true;

14a.
I wonder if it would be more natural to put the COPY_SCHEMA logic
before the COPY_DATA logic?

~

14b.
Is this a bug? Why is this assigning copy_data = true, instead of
copy_schema = true?
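
I assume the intended code is:

if (IsSet(supported_opts, SUBOPT_COPY_SCHEMA))
    opts->copy_schema = true;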

~~~

15.
  opts->specified_opts |= SUBOPT_COPY_DATA;
  opts->copy_data = defGetBoolean(defel);
  }
+ else if (IsSet(supported_opts, SUBOPT_COPY_SCHEMA) &&
+ strcmp(defel->defname, "copy_schema") == 0)
+ {
+ if (IsSet(opts->specified_opts, SUBOPT_COPY_SCHEMA))
+ errorConflictingDefElem(defel, pstate);
+
+ opts->specified_opts |= SUBOPT_COPY_SCHEMA;
+ opts->copy_schema = defGetBoolean(defel);
+ }

I wonder if it would be more natural to put the COPY_SCHEMA logic
before the COPY_DATA logic?

~~~

16.
+ if (opts->copy_schema &&
+ IsSet(opts->specified_opts, SUBOPT_COPY_SCHEMA))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("%s and %s are mutually exclusive options",
+ "connect = false", "copy_schema = true")));
+

I wonder if it would be more natural to put the COPY_SCHEMA logic
before the COPY_DATA logic?

~~~

17. CreateSubscription

  * Set sync state based on if we were asked to do data copy or
  * not.
  */
- table_state = opts.copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY;
+ if (opts.copy_data || opts.copy_schema)
+ table_state = SUBREL_STATE_INIT;
+ else
+ table_state = SUBREL_STATE_READY;

The comment prior to this code needs updating; it still only mentions
"data copy".

~~~

18. AlterSubscription_refresh

+ sub_remove_rels[remove_rel_len].relid = subrelid;
  sub_remove_rels[remove_rel_len++].state = state;
~

Is that right?

IIUC that 'subrelid' is the OID PK of the row in pg_subscription_rel,
which is not the same as the 'relid'.

Shouldn't this be sub_remove_rels[remove_rel_len].relid = relstate->relid;

~~~

19.
+ if (OidIsValid(relstate->relid))
+ ereport(DEBUG1,
+ (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
+ get_namespace_name(get_rel_namespace(relstate->relid)),
+ get_rel_name(relstate->relid),
+ sub->name)));
+ else
+ ereport(DEBUG1,
+ (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
+ relstate->nspname, relstate->relname,
+ sub->name)));

I wondered why can't we just always store nspname and relname even for
the valid 'relid' when there is no copy_schema? Won't that simplify
code such as this?
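
For instance, the if/else above could then collapse to just (a sketch):

ereport(DEBUG1,
    (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
                     relstate->nspname, relstate->relname,
                     sub->name)));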

======
src/backend/replication/logical/launcher.c

20. logicalrep_worker_find

- if (w->in_use && w->subid == subid && w->relid == relid &&
+ if (w->in_use && w->subid == subid && w->subrelid == subrelid &&
  (!only_running || w->proc))
  {

~

Maybe I misunderstand something, but somehow it seems strange to be
checking both the 'subid' and the Oid PK ('subrelid') here. Isn't it
that when subrelid is valid you need to test only 'subrelid' (aka
tablesync) for equality, but when subrelid is InvalidOid (aka not a
tablesync worker) you only need to test subid for equality?
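
In other words, something like this (only a sketch of the rule I have
in mind; the existing only_running check is omitted):

if (OidIsValid(subrelid))
    found = w->in_use && w->subrelid == subrelid;   /* tablesync */
else
    found = w->in_use && w->subid == subid &&
            !OidIsValid(w->subrelid);               /* apply worker */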

~~~

21. logicalrep_worker_launch

  bool is_parallel_apply_worker = (subworker_dsm != DSM_HANDLE_INVALID);

  /* Sanity check - tablesync worker cannot be a subworker */
- Assert(!(is_parallel_apply_worker && OidIsValid(relid)));
+ Assert(!(is_parallel_apply_worker && OidIsValid(subrelid)));

IIUC, this code might be easier to understand if you introduced
another variable:

bool is_tablesync_worker = OidIsValid(subrelid);
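
The quoted Assert would then read, for example:

/* Sanity check - tablesync worker cannot be a subworker */
Assert(!(is_parallel_apply_worker && is_tablesync_worker));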

~~~

22.
+ if (OidIsValid(subrelid) && nsyncworkers >= max_sync_workers_per_subscription)

(ditto previous comment)

~~~

23.
- if (OidIsValid(relid))
+ if (OidIsValid(subrelid))
  snprintf(bgw.bgw_name, BGW_MAXLEN,
- "logical replication worker for subscription %u sync %u", subid, relid);
+ "logical replication worker for subscription %u sync %u", subid, subrelid);

This name seems somehow less useful to the user now. IIUC 'subrelid'
is just the PK of the pg_subscription_rel catalog instead of the
relid. Does this require changes to the documentation that might have
been saying this is the relid?

~~~

24. logicalrep_worker_stop

  * Stop the logical replication worker for subid/relid, if any.
  */
 void
-logicalrep_worker_stop(Oid subid, Oid relid)
+logicalrep_worker_stop(Oid subid, Oid subrelid)

The function comment is still talking about relid.

======
src/backend/replication/logical/snapbuild.c

25. SnapBuildExportSnapshot

-SnapBuildExportSnapshot(SnapBuild *builder)
+SnapBuildExportSnapshot(SnapBuild *builder, bool use_it)

'use_it' does not seem like a good parameter name. At the least, maybe
the function comment can describe the meaning of 'use_it'.

~~~

26.
- /* There doesn't seem to a nice API to set these */
- XactIsoLevel = XACT_REPEATABLE_READ;
- XactReadOnly = true;
+ /* There doesn't seem to a nice API to set these */
+ XactIsoLevel = XACT_REPEATABLE_READ;
+ XactReadOnly = true;
+ }
+ else
+ Assert(IsTransactionBlock());

Although it is not introduced by this patch, since you change the
indent of this line you might as well fix the typo on it at the same
time.

/seem to a nice/seem to be a nice/

======
src/backend/replication/logical/tablesync.c

27. process_syncing_tables_for_sync

  UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
-    MyLogicalRepWorker->relid,
+    MyLogicalRepWorker->subrelid,
     MyLogicalRepWorker->relstate,
     MyLogicalRepWorker->relstate_lsn);

IIUC the 'subrelid' is now the PK. Isn't it better for that to be the 1st param?

~~~

28.

+ if ((syncflags & SUBREL_SYNC_KIND_SCHEMA) != 0)

There are several checks like the code shown above. Would it be better
to have some macro for that expression? Or maybe simply assign this
result to a local variable instead of testing the same thing multiple
times.
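
For example (the macro name is only a suggestion):

#define SyncFlagsHasSchema(flags) (((flags) & SUBREL_SYNC_KIND_SCHEMA) != 0)

or simply:

bool sync_schema = (syncflags & SUBREL_SYNC_KIND_SCHEMA) != 0;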

~~~

29. synchronize_table_schema

FILE *handle;
Oid relid;
Oid nspoid;
StringInfoData command;
StringInfoData querybuf;
char full_path[MAXPGPATH];
char buf[1024];
int ret;

   if (find_my_exec("pg_dump", full_path) < 0)
   elog(ERROR, "\"%s\" was not found", "pg_dump")

~

Something is not quite right with the indentation in this new function.

~~~

30.
+ * XXX what if the table already doesn't exist?

I didn't understand the meaning of the comment. Is it supposed to say
"What if the table already exists?" (??)

======
src/backend/replication/logical/worker.c

31. InitializeApplyWorker

+ {
+ if (OidIsValid(MyLogicalRepWorker->relid))
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for
subscription \"%s\", table \"%s\" has started",
+ MySubscription->name,
+ get_rel_name(MyLogicalRepWorker->relid))));
+ else
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for
subscription \"%s\", relid %u has started",
+ MySubscription->name,
+ MyLogicalRepWorker->subrelid)));
+ }

~

IIUC it doesn't seem right to say "relid %u has started", because
that's not really a relid, is it? I thought it is just the PK Oid of
the row in the catalog.

======
src/include/catalog/pg_subscription_rel.h

32. pg_subscription_rel

+ /* What part do we need to synchronize? */
+ bool srsyncschema;
+ bool srsyncdata;

These aren't really "parts".

SUGGESTION
/* What to synchronize? */

~~~

33.
 typedef struct SubscriptionRelState
 {
+ Oid oid;

Is that the pg_subscription_rel's oid? Maybe it would be better to
call this field 'sroid'? (see the general comment in the commit
message)

======
src/include/replication/walsender.h

34. CRSSnapshotAction

  CRS_EXPORT_SNAPSHOT,
  CRS_NOEXPORT_SNAPSHOT,
- CRS_USE_SNAPSHOT
+ CRS_USE_SNAPSHOT,
+ CRS_EXPORT_USE_SNAPSHOT
 } CRSSnapshotAction;

~

Should CRS_USE_SNAPSHOT be renamed to CRS_NOEXPORT_USE_SNAPSHOT to
have a more consistent naming pattern?
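
i.e. (suggested naming only):

  CRS_EXPORT_SNAPSHOT,
  CRS_NOEXPORT_SNAPSHOT,
  CRS_NOEXPORT_USE_SNAPSHOT,
  CRS_EXPORT_USE_SNAPSHOT
} CRSSnapshotAction;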

======
src/include/replication/worker_internal.h

35.
- /* Used for initial table synchronization. */
+ /*
+ * Used for initial table synchronization.
+ *
+ * relid is an invalid oid if the table is not created on the subscriber
+ * yet.
+ */
+ Oid subrelid;
  Oid relid;

It would be good to have more explanation of the different meanings of
'subrelid' versus 'relid' (see also the general comment suggesting a
rename).

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi,
>
> Below are my review comments for the PoC patch 0001.
>
> In addition,  the patch needed rebasing, and, after I rebased it
> locally in my private environment there were still test failures:
> a) The 'make check' tests fail but only in a minor way due to changes colname
> b) the subscription TAP test did not work at all for me -- many errors.

Thank you for reviewing the patch.

While updating the patch, I realized that the current approach won't
work well, or at least has a problem, with partitioned tables. If a
publication has a partitioned table with publish_via_root = false, the
subscriber launches tablesync workers for its partitions so that each
tablesync worker copies the data of one partition. Similarly, if it has a
partitioned table with publish_via_root = true, the subscriber launches
a tablesync worker for the parent table. With the current design,
since the tablesync worker is responsible for both schema and data
synchronization for its target table, it won't be possible to
synchronize both the parent table's schema and the partitions' schema. For
example, there is no pg_subscription_rel entry for the parent table if
the publication has publish_via_root = false. In addition to that, we
need to be careful about the order of synchronization of the parent
table and its partitions. We cannot start schema synchronization for
partitions before their parent table. So it seems to me that we need to
consider another approach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Wed, Jul 5, 2023 at 11:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi,
> >
> > Below are my review comments for the PoC patch 0001.
> >
> > In addition,  the patch needed rebasing, and, after I rebased it
> > locally in my private environment there were still test failures:
> > a) The 'make check' tests fail but only in a minor way due to changes colname
> > b) the subscription TAP test did not work at all for me -- many errors.
>
> Thank you for reviewing the patch.
>
> While updating the patch, I realized that the current approach won't
> work well or at least has the problem with partition tables. If a
> publication has a partitioned table with publish_via_root = false, the
> subscriber launches tablesync workers for its partitions so that each
> tablesync worker copies data of each partition. Similarly, if it has a
> partition table with publish_via_root = true, the subscriber launches
> a tablesync worker for the parent table. With the current design,
> since the tablesync worker is responsible for both schema and data
> synchronization for the target table, it won't be possible to
> synchronize both the parent table's schema and partitions' schema. For
> example, there is no pg_subscription_rel entry for the parent table if
> the publication has publish_via_root = false. In addition to that, we
> need to be careful about the order of synchronization of the parent
> table and its partitions. We cannot start schema synchronization for
> partitions before its parent table. So it seems to me that we need to
> consider another approach.

So I've implemented a different approach: doing schema synchronization
at CREATE SUBSCRIPTION time. The backend executing CREATE
SUBSCRIPTION uses pg_dump and restores the table schemas, including
both partitioned tables and their partitions regardless of the
publish_via_partition_root option, and then creates
pg_subscription_rel entries for the tables while respecting the
publish_via_partition_root option.

There is a window between the table creations and the tablesync workers
starting to process the tables. If DDLs are executed in this window,
the tablesync worker might fail because the table schema might have
already been changed. We need to mention this note in the
documentation. BTW, I think we will be able to get rid of this
downside if we support DDL replication: DDLs executed in the window
would be applied by the apply worker, which would hand over the data
copy to the tablesync worker at a certain LSN.

I've attached PoC patches. It has regression tests but doesn't have
the documentation yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> So I've implemented a different approach; doing schema synchronization at a
> CREATE SUBSCRIPTION time. The backend executing CREATE SUBSCRIPTION
> uses pg_dump and restores the table schemas including both partitioned tables
> and their partitions regardless of publish_via_partition_root option, and then
> creates pg_subscription_rel entries for tables while respecting
> publish_via_partition_root option.
> 
> There is a window between table creations and the tablesync workers starting to
> process the tables. If DDLs are executed in this window, the tablesync worker
> might fail because the table schema might have already been changed. We need
> to mention this note in the documentation. BTW, I think we will be able to get
> rid of this downside if we support DDL replication. DDLs executed in the window
> are applied by the apply worker and it takes over the data copy to the tablesync
> worker at a certain LSN.

I don’t think that even with DDL replication we will be able to get rid of this window.
There are some issues:
1. Even with the tablesync worker taking over at a certain LSN, the publisher can make more changes until
the tablesync worker acquires a lock on the publisher table via copy table.
2. How will we make sure that the apply worker has caught up with all the changes from the publisher
before it starts the tablesync worker? It can lag behind the publisher.

I think the easiest option would be to just recreate the table; this way we don’t have to worry about
complex race conditions. Tablesync already makes a slot for copying data, and we can use the same slot for
getting an up-to-date table definition; dropping the table won't be very expensive since there won't be any data
in it. The apply worker will skip all the DDLs/DMLs till the table is synced.

Although for partitioned tables we will be able to keep up with published table schema changes only when
publish_via_partition_root is true.

Regards
Sachin
Amazon Web Services: https://aws.amazon.com

Re: Initial Schema Sync for Logical Replication

From
Amit Kapila
Date:
On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi,
> >
> > Below are my review comments for the PoC patch 0001.
> >
> > In addition,  the patch needed rebasing, and, after I rebased it
> > locally in my private environment there were still test failures:
> > a) The 'make check' tests fail but only in a minor way due to changes colname
> > b) the subscription TAP test did not work at all for me -- many errors.
>
> Thank you for reviewing the patch.
>
> While updating the patch, I realized that the current approach won't
> work well or at least has the problem with partition tables. If a
> publication has a partitioned table with publish_via_root = false, the
> subscriber launches tablesync workers for its partitions so that each
> tablesync worker copies data of each partition. Similarly, if it has a
> partition table with publish_via_root = true, the subscriber launches
> a tablesync worker for the parent table. With the current design,
> since the tablesync worker is responsible for both schema and data
> synchronization for the target table, it won't be possible to
> synchronize both the parent table's schema and partitions' schema.
>

I think one possibility to make this design work is that when
publish_via_root is false, we assume that the subscriber already has
the parent table, and then the individual tablesync workers can sync
the schema of the partitions and their data. And when publish_via_root
is true, the tablesync worker is responsible for syncing the parent
and child tables along with the data. Do you think such a mechanism
can address the partition-table-related cases?

--
With Regards,
Amit Kapila.



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:

> From: Amit Kapila <amit.kapila16@gmail.com>
> On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > >
> > > Hi,
> > >
> > > Below are my review comments for the PoC patch 0001.
> > >
> > > In addition,  the patch needed rebasing, and, after I rebased it
> > > locally in my private environment there were still test failures:
> > > a) The 'make check' tests fail but only in a minor way due to
> > > changes colname
> > > b) the subscription TAP test did not work at all for me -- many errors.
> >
> > Thank you for reviewing the patch.
> >
> > While updating the patch, I realized that the current approach won't
> > work well or at least has the problem with partition tables. If a
> > publication has a partitioned table with publish_via_root = false, the
> > subscriber launches tablesync workers for its partitions so that each
> > tablesync worker copies data of each partition. Similarly, if it has a
> > partition table with publish_via_root = true, the subscriber launches
> > a tablesync worker for the parent table. With the current design,
> > since the tablesync worker is responsible for both schema and data
> > synchronization for the target table, it won't be possible to
> > synchronize both the parent table's schema and partitions' schema.
> >
> 
> I think one possibility to make this design work is that when publish_via_root
> is false, then we assume that subscriber already has parent table and then
> the individual tablesync workers can sync the schema of partitions and their
> data.

Since publish_via_partition_root is false by default, users would have to create the parent table by themselves,
which I think is not a good user experience.

> And when publish_via_root is true, then the table sync worker is
> responsible to sync parent and child tables along with data. Do you think
> such a mechanism can address the partition table related cases?
> 

Re: Initial Schema Sync for Logical Replication

From
Masahiko Sawada
Date:
On Mon, Jul 10, 2023 at 8:06 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
>
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Below are my review comments for the PoC patch 0001.
> > > >
> > > > In addition,  the patch needed rebasing, and, after I rebased it
> > > > locally in my private environment there were still test failures:
> > > > a) The 'make check' tests fail but only in a minor way due to
> > > > changes colname
> > > > b) the subscription TAP test did not work at all for me -- many errors.
> > >
> > > Thank you for reviewing the patch.
> > >
> > > While updating the patch, I realized that the current approach won't
> > > work well or at least has the problem with partition tables. If a
> > > publication has a partitioned table with publish_via_root = false, the
> > > subscriber launches tablesync workers for its partitions so that each
> > > tablesync worker copies data of each partition. Similarly, if it has a
> > > partition table with publish_via_root = true, the subscriber launches
> > > a tablesync worker for the parent table. With the current design,
> > > since the tablesync worker is responsible for both schema and data
> > > synchronization for the target table, it won't be possible to
> > > synchronize both the parent table's schema and partitions' schema.
> > >
> >
> > I think one possibility to make this design work is that when publish_via_root
> > is false, then we assume that subscriber already has parent table and then
> > the individual tablesync workers can sync the schema of partitions and their
> > data.
>
> Since publish_via_partition_root is false by default users have to create parent table by themselves
> which I think is not a good user experience.

I have the same concern. I think that users normally use
publish_via_partition_root = false if the partitioned table on the
subscriber consists of the same set of partitions as the publisher's.
And such users would expect both the partitioned table and its
partitions to be synchronized.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
Hi Everyone, based on an internal discussion with Masahiko,
I have implemented concurrent DDL support for initial schema sync.

Concurrent patch workflow:

1. When the TableSync worker creates a replication slot, it will
save the slot LSN into pg_subscription_rel with the
SUBREL_SYNC_SCHEMA_DATA_SYNC state, and it will wait for
its state to be changed to SUBREL_STATE_DATASYNC.

2. The Applier process will apply DDLs till the tablesync LSN, and then
it will change the pg_subscription_rel state to SUBREL_STATE_DATASYNC.

3. The TableSync worker will continue applying pending DMLs/DDLs till it catches up.
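
In code terms, step 1 is roughly the following (a simplified sketch on
my part, assuming the PoC's modified UpdateSubscriptionRelState() and
the existing wait_for_relation_state_change() helper in tablesync.c;
slot_lsn stands for the LSN returned at slot creation):

/* record the slot LSN and new state for the apply worker to see */
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                           MyLogicalRepWorker->subrelid,
                           SUBREL_SYNC_SCHEMA_DATA_SYNC,
                           slot_lsn);

/* block until the apply worker moves us to SUBREL_STATE_DATASYNC */
wait_for_relation_state_change(MyLogicalRepWorker->subrelid,
                               SUBREL_STATE_DATASYNC);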

This patch needs DDL replication to apply concurrent DDLs; I have
cherry-picked this DDL patch [0].

Issues:
1) This needs testing for concurrent DDLs; I am not sure how to make the tablesync process wait so that
concurrent DDLs can be issued on the publisher.
2) In my testing, the created table does not appear on the same connection on the subscriber;
I have to reconnect to see the table.
3) Maybe use different chars for SUBREL_SYNC_SCHEMA_DATA_INIT and SUBREL_SYNC_SCHEMA_DATA_SYNC;
currently they are 'x' and 'y'.
4) I need to add SUBREL_SYNC_SCHEMA_DATA_INIT and SUBREL_SYNC_SCHEMA_DATA_SYNC to
pg_subscription_rel_d.h to make it compile successfully.
5) It only implements concurrent ALTER as of now.

[0] =
https://www.postgresql.org/message-id/OS0PR01MB57163E6487EFF7378CB8E17C9438A%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Attachment

Re: Initial Schema Sync for Logical Replication

From
vignesh C
Date:
On Thu, 31 Aug 2023 at 17:18, Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> Hi Everyone, based on internal discussion with Masahiko
> I have implemented concurrent DDL support for initial schema sync.
>
> Concurrent Patch workflow
>
> 1. When TableSync worker creates a replicaton slot, It will
> save the slot lsn into pg_subscription_rel with
> SUBREL_SYNC_SCHEMA_DATA_SYNC state, and it will wait for
> its state to be SUBREL_STATE_DATASYNC.
>
> 2. Applier process will apply DDLs till tablesync lsn, and then
> it will change pg_subscription_rel state to SUBREL_STATE_DATASYNC.
>
> 3. TableSync will continue applying pending DML/DDls till it catch up.
>
> This patch needs DDL replication to apply concurrent DDLs, I have cherry-
> picked this DDL patch [0]

Can you rebase the patch and post the complete set of required changes
for the concurrent DDL, I will have a look at them.

Regards,
Vignesh



Re: Initial Schema Sync for Logical Replication

From
vignesh C
Date:
On Fri, 7 Jul 2023 at 12:41, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jul 5, 2023 at 11:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Below are my review comments for the PoC patch 0001.
> > >
> > > In addition,  the patch needed rebasing, and, after I rebased it
> > > locally in my private environment there were still test failures:
> > > a) The 'make check' tests fail but only in a minor way due to changes colname
> > > b) the subscription TAP test did not work at all for me -- many errors.
> >
> > Thank you for reviewing the patch.
> >
> > While updating the patch, I realized that the current approach won't
> > work well or at least has the problem with partition tables. If a
> > publication has a partitioned table with publish_via_root = false, the
> > subscriber launches tablesync workers for its partitions so that each
> > tablesync worker copies data of each partition. Similarly, if it has a
> > partition table with publish_via_root = true, the subscriber launches
> > a tablesync worker for the parent table. With the current design,
> > since the tablesync worker is responsible for both schema and data
> > synchronization for the target table, it won't be possible to
> > synchronize both the parent table's schema and partitions' schema. For
> > example, there is no pg_subscription_rel entry for the parent table if
> > the publication has publish_via_root = false. In addition to that, we
> > need to be careful about the order of synchronization of the parent
> > table and its partitions. We cannot start schema synchronization for
> > partitions before its parent table. So it seems to me that we need to
> > consider another approach.
>
> So I've implemented a different approach; doing schema synchronization
> at a CREATE SUBSCRIPTION time. The backend executing CREATE
> SUBSCRIPTION uses pg_dump and restores the table schemas including
> both partitioned tables and their partitions regardless of
> publish_via_partition_root option, and then creates
> pg_subscription_rel entries for tables while respecting
> publish_via_partition_root option.
>
> There is a window between table creations and the tablesync workers
> starting to process the tables. If DDLs are executed in this window,
> the tablesync worker might fail because the table schema might have
> already been changed. We need to mention this note in the
> documentation. BTW, I think we will be able to get rid of this
> downside if we support DDL replication. DDLs executed in the window
> are applied by the apply worker and it takes over the data copy to the
> tablesync worker at a certain LSN.
>
> I've attached PoC patches. It has regression tests but doesn't have
> the documentation yet.

A few thoughts:
1) There might be a scenario where we create multiple subscriptions
with tables overlapping across the subscriptions; in that case, the
table will already be present when the 2nd subscription is being
created. Can we do something in this case:
+               /*
+                * Error if the table is already present on the subscriber. Please note
+                * that concurrent DDLs can create the table as we don't acquire any lock
+                * on the table.
+                *
+                * XXX: do we want to overwrite it (or optionally)?
+                */
+               if (OidIsValid(RangeVarGetRelid(rv, AccessShareLock, true)))
+                       ereport(ERROR,
+                                       (errmsg("existing table %s cannot synchronize table schema",
+                                                       rv->relname)));

2) Should we clean up the replication slot in case of failures?
Currently the replication slot is left over.

3) Is it expected that all of the dependencies like types/domains etc.
should be created by the user before creating a subscription with
copy_schema? Currently we are taking care of creating the sequences
for tables; is this an exception?

4) If a column-list publication is created, currently we are getting
all of the columns; should we get only the specified columns in this
case?

Regards,
Vignesh



RE: Initial Schema Sync for Logical Replication

From
"Kumar, Sachin"
Date:
> From: vignesh C <vignesh21@gmail.com>
> Sent: Thursday, October 19, 2023 10:41 AM
> Can you rebase the patch and post the complete set of required changes for
> the concurrent DDL, I will have a look at them.

Sure, I will try to send the complete rebased patch within a week.

Regards
Sachin