Thread: Initial Schema Sync for Logical Replication
Hi Everyone,
I am working on initial schema sync for logical replication. Currently, users have to
manually create the schema on the subscriber side. The aim of this feature is to add an
option to CREATE SUBSCRIPTION so that the schema sync can be automatic. I am sharing the
design doc below, but there are some corner cases where the design does not work. Please
share your opinion on whether the design can be improved so we can get rid of the corner
cases. This design is loosely based on pglogical.
DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)
SQL Changes:-
CREATE SUBSCRIPTION subscription_name
CONNECTION 'conninfo'
PUBLICATION publication_name [, ...]
[ WITH ( subscription_parameter [= value] [, ... ] ) ]
sync_initial_schema (enum) will be added to subscription_parameter.
It can have 3 values:
TABLES, ALL, NONE (default)
With ALL, everything will be synced, including global objects.
Restrictions: sync_initial_schema=ALL can only be used with publications created FOR ALL TABLES.
Design:-
Publisher :-
The publisher has to implement a `SHOW CREATE TABLE` equivalent; this table definition will
be used by the subscriber to create the exact schema of the table on the subscriber. One
alternative is to do this on the subscriber side itself: we could create a function similar
to describeOneTableDetails and call it from the subscriber. We also need to maintain the
same ownership as on the publisher.
The publisher should also have publication of DDL commands turned on.
Subscriber :-
1. In CreateSubscription(), when we create the replication slot (walrcv_create_slot()), we
should use CRS_EXPORT_SNAPSHOT so that we can use this snapshot later in pg_dump.
2. Now we can call pg_dump with the above snapshot from CreateSubscription(). This happens
inside the opts.connect && opts.create_slot if-statement. If we fail in this step we have
to drop the replication slot and create a new one, because we need the snapshot and
creating a replication slot is the way to get one. The reason for running pg_dump with the
above snapshot is that we don't want to execute the DDLs in the WAL twice. With the above
snapshot we get the state of the database as of the replication slot's origin, and any
changes after the snapshot will be in the WAL.
We will save the pg_dump output into a file (custom archive format), so the pg_dump call
will be similar to
pg_dump --dbname='conninfo' --schema-only --snapshot=xyz -Fc --file=initSchema
If sync_initial_schema=TABLES we don't have to call pg_dump/pg_restore at all; the
tablesync process will take care of it.
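To make step 2 concrete, here is a rough sketch in Python (purely illustrative; the real implementation would be C code inside CreateSubscription()). The pg_dump option names are real, but the connection string, snapshot name, and output file are placeholders:

```python
# Illustrative only: assemble the pg_dump invocation described in step 2.
# The option names are real pg_dump options; the argument values are
# placeholders, not part of the proposal.

def build_pg_dump_cmd(conninfo, snapshot, outfile):
    """argv for a schema-only dump taken under an exported snapshot."""
    return [
        "pg_dump",
        "--dbname", conninfo,
        "--schema-only",          # no table data; tablesync copies rows later
        "--snapshot", snapshot,   # snapshot exported by walrcv_create_slot()
        "--format", "custom",     # -Fc, the archive format pg_restore expects
        "--file", outfile,
    ]

cmd = build_pg_dump_cmd("host=pub dbname=postgres", "00000003-000001A4-1", "initSchema")
print(" ".join(cmd))
```

Actually running the command and the error handling around dropping and recreating the slot on failure are omitted here.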
3. If we have to sync global objects we also need to call pg_dumpall --globals-only. But
pg_dumpall does not support the --snapshot option, so if a user creates a new global object
between the creation of the replication slot and the run of pg_dumpall, that global object
will be created twice on the subscriber, which will make the apply process error out.
4. walrcv_disconnect should be called only after pg_dump has finished; otherwise the
snapshot will no longer be valid.
5. Users with only the REPLICATION role cannot run pg_dump, so the replication user has to
be a superuser. This is a major problem.
postgres=# create role s4 WITH LOGIN Replication;
CREATE ROLE
$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR: permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE
6. The pg_subscription_rel column srsubstate will have one more state,
SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled we will set the table state to 'c'.
The above 6 steps will be done even if the subscription is not enabled, as long as connect
is true.
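A minimal sketch of how the new state from step 6 would fit into the existing srsubstate values (Python, illustrative only; the letters 'i', 'd', 's', 'r' are PostgreSQL's existing states, 'c' is the proposed addition):

```python
# Illustrative model of the proposed srsubstate lifecycle. Not real code:
# in PostgreSQL these are single-char catalog values, compared in C.

SUBREL_STATES = {
    "c": "CREATE",    # proposed: table schema still needs to be (re)created
    "i": "INIT",      # initial state, before data copy
    "d": "DATASYNC",  # data is being copied
    "s": "SYNCDONE",  # copy finished
    "r": "READY",     # streaming changes normally
}

def initial_state(sync_initial_schema):
    # With the new option enabled, tables start in 'c' instead of 'i'.
    return "c" if sync_initial_schema in ("TABLES", "ALL") else "i"

print(initial_state("ALL"))   # 'c'
print(initial_state("NONE"))  # 'i'
```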
7. The leader apply process should check whether the initSchema file exists; if so, it
should call pg_restore. We are not using the --pre-data and --post-data sections the way
pglogical does, because post-data assumes the table already has data, but we will fill the
data into the table at a later stage. pg_restore can be called like this:
pg_restore --dbname='conninfo' -1 file_name
The -1 option executes every command inside one transaction; if there is any error,
everything is rolled back.
pg_restore should be called quite early in the apply process code, before any tablesync
process can be created.
Instead of checking whether the file exists, the pg_subscription catalog could be extended
with a column (say sync_initial_schema) and the apply process would check whether
sync_initial_schema == SYNC_PENDING.
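Step 7's startup check could look roughly like this (illustrative Python; the archive path, the SYNC_PENDING alternative, and the function names are placeholders from the text above, not an existing API):

```python
# Illustrative only: the leader apply worker's startup decision from
# step 7. The real logic would live in C in the apply worker.

import os

def build_pg_restore_cmd(conninfo, archive):
    return [
        "pg_restore",
        "--dbname", conninfo,
        "--single-transaction",  # -1: roll everything back on any error
        archive,
    ]

def maybe_restore(conninfo, archive="initSchema"):
    # Alternative check: pg_subscription.sync_initial_schema == SYNC_PENDING
    if not os.path.exists(archive):
        return None              # nothing to do; tablesync proceeds as usual
    return build_pg_restore_cmd(conninfo, archive)

print(maybe_restore("dbname=sub", "no_such_archive"))  # None
```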
8. The tablesync process should check the state of the table; if it is SUBREL_STATE_CREATE
it should get the latest definition from the publisher and recreate the table. (We have to
recreate the table even if there are no changes.) Then it should go into copy-table mode as
usual.
It might seem that tablesync is duplicating work already done by pg_restore. We do it this
way because of concurrent DDLs and the REFRESH PUBLICATION command.
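The tablesync behaviour of step 8 can be sketched as follows (illustrative Python; the "get latest definition" call is represented by a plain string because the publisher-side facility does not exist yet):

```python
# Illustrative only: tablesync's decision in step 8. A table in the
# proposed 'c' state is re-created from the publisher's current
# definition before the usual COPY phase.

def tablesync_plan(state):
    steps = []
    if state == "c":                  # SUBREL_STATE_CREATE
        steps.append("DROP+CREATE from latest publisher definition")
        state = "d"                   # move on to SUBREL_STATE_DATASYNC
    steps.append("COPY table data")
    steps.append("catchup: apply changes streamed since the snapshot")
    return state, steps

state, steps = tablesync_plan("c")
print(state, steps)
```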
Concurrent DDL :-
A user can execute a DDL command on table t1 at the same time the subscriber is trying to
sync it; pictorial representation: https://imgur.com/a/ivrIEv8 [1]
The tablesync process makes its own connection to the publisher, so it may see a table
state that is ahead of what has been applied on the subscriber so far, which can introduce
conflicts.
For example:-
CASE 1: The publisher removed column b from table t1 while the subscriber was doing
pg_restore (or at any point in the concurrent-DDL window described in picture [1]). When
the tablesync process starts its transaction on the publisher, it will request data of
table t1 including column b, which no longer exists on the publisher. That is why the
tablesync process asks for the latest definition.
Even if we delay the tablesync worker until all the DDL related to table t1 has been
applied by the apply process, we can still have a window where the publisher issues a DDL
command just before tablesync starts its transaction, making the tablesync and publisher
table definitions incompatible (thanks to Masahiko for pointing out this race condition).
The apply process will skip all DDLs/DMLs related to table t1, and tablesync will apply
them in the catchup phase.
There is one remaining issue: what happens to views or functions which depend on the table?
I think they should wait till table_state > SUBREL_STATE_CREATE (meaning we have the latest
schema definition from the publisher).
There might be corner cases to this approach, or maybe a better way to handle concurrent
DDL. One simple solution might be to disallow DDLs on the publisher till all the schema is
synced and all tables have state >= SUBREL_STATE_DATASYNC (we can have the CASE 1 issue
even with DDL replication, so we have to wait till all the tables have
table_state >= SUBREL_STATE_DATASYNC). That might be a big window for big databases.
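The skip-and-catchup division of labour above can be sketched like this (illustrative Python; it mirrors, in a simplified form, the shape of the existing should_apply_changes_for_rel() logic, ignoring the SYNCDONE LSN check present in the real code, and adds the proposed 'c' state):

```python
# Illustrative only: which worker handles a change for a table that is
# still being synchronized. Simplified; not the actual C logic.

SYNC_ORDER = "cidsr"  # proposed 'c' sorts before every existing state

def applier_should_apply(table_state):
    # The leader apply worker handles changes only once the table is
    # READY ('r'); anything earlier is left to the tablesync worker,
    # which replays those changes during its catchup phase.
    return SYNC_ORDER.index(table_state) >= SYNC_ORDER.index("r")

print(applier_should_apply("c"))  # False: tablesync will catch up
print(applier_should_apply("r"))  # True
```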
Refresh publication :-
During REFRESH PUBLICATION the subscriber does not create a new replication slot, hence we
can't run pg_dump with a snapshot which starts from the slot's origin (maybe this is not an
issue at all). In this case it makes more sense for the tablesync worker to do the schema
sync.
If the community is happy with the above design, I can start working on a prototype.
Credits: This design is inspired by pglogical. Also thanks to Zane, Masahiko, and Amit for reviewing earlier designs.
Regards
Sachin Kumar
Amazon Web Services
Hi, I have a couple of questions.

Q1.

What happens if the subscriber already has some tables present? For example, I did not see the post saying anything like "Only if the table does not already exist then it will be created". On the contrary, the post seemed to say SUBREL_STATE_CREATE 'c' would *always* be set when this subscriber mode is enabled. And then it seemed to say the table would *always* get re-created by the tablesync in this new mode.

Won't this cause problems
- if the user wanted a slightly different subscriber-side table? (eg some extra columns on the subscriber-side table)
- if there was some pre-existing table data on the subscriber-side table that you now are about to re-create and clobber?

Or does the idea intend that the CREATE TABLE DDL that will be executed is like "CREATE TABLE ... IF NOT EXISTS"?

~~~

Q2.

The post says "DDL replication is required for this feature". And "It should also have turned on publication of DDL commands."

It wasn't entirely clear to me why those must be a requirement. Is that just to make implementation easier? Sure, I see that the idea might have some (or maybe a lot?) of common internal code with the table DDL replication work, but OTOH an auto-create feature for subscriber tables seems like it might be a useful feature to have regardless of the value of the publication 'ddl' parameter.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.
Hi Peter,

> Hi,
>
> I have a couple of questions.
>
> Q1.
>
> What happens if the subscriber already has some tables present? For
> example, I did not see the post saying anything like "Only if the table does
> not already exist then it will be created".

My assumption was that if the subscriber is doing initial schema sync, it does not have any conflicting database objects.

> On the contrary, the post seemed to say SUBREL_STATE_CREATE 'c' would
> *always* be set when this subscriber mode is enabled. And then it seemed
> to say the table would *always* get re-created by the tablesync in this new
> mode.

Right.

> Won't this cause problems
> - if the user wanted a slightly different subscriber-side table? (eg some extra
> columns on the subscriber-side table)
> - if there was some pre-existing table data on the subscriber-side table that
> you now are about to re-create and clobber?
>
> Or does the idea intend that the CREATE TABLE DDL that will be executed is
> like "CREATE TABLE ... IF NOT EXISTS"?

pg_dump does not support --if-not-exists on its own, but I think it can be added so that we get a dump with IF NOT EXISTS. On the subscriber side we get the table OID list; we can use this to set table_state = SUBREL_STATE_INIT so that the table won't be recreated.

> ~~~
>
> Q2.
>
> The post says. "DDL replication is required for this feature". And "It should
> also have turned on publication of DDL commands."
>
> It wasn't entirely clear to me why those must be a requirement. Is that just to
> make implementation easier?

DDL replication is needed to handle concurrent DDL, so that we don't have to worry about schema changes happening at the same time the subscriber is initializing. If we could block the publisher so that it does not do DDLs, or if the subscriber could simply error out when it sees conflicting table information, then we would not need DDL replication.

Regards
Sachin

> Sure, I see that the idea might have some (or maybe a lot?)
> of common internal code with the table DDL replication work, but OTOH an auto-create
> feature for subscriber tables seems like it might be a useful feature to have
> regardless of the value of the publication 'ddl' parameter.
>
> ------
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.
On 2023-Mar-15, Kumar, Sachin wrote:

> 1. In CreateSubscription() when we create replication slot(walrcv_create_slot()), should
> use CRS_EXPORT_SNAPSHOT, So that we can use this snapshot later in the pg_dump.
>
> 2. Now we can call pg_dump with above snapshot from CreateSubscription.

Overall I'm not on board with the idea that logical replication would depend on pg_dump; that seems like it could run into all sorts of trouble (what if calling external binaries requires additional security setup? what about pg_hba connection requirements? what about max_connections in tight circumstances?).

It would be much better, I think, to handle this internally in the publisher instead: similar to how DDL sync would work, except it'd somehow generate the CREATE statements from the existing tables instead of waiting for DDL events to occur. I grant that this does require writing a bunch of new code for each object type, a lot of which would duplicate the pg_dump logic, but it would probably be a lot more robust.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Mar 16, 2023 at 10:27 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > Hi,
> >
> > I have a couple of questions.
> >
> > Q1.
> >
> > What happens if the subscriber already has some tables present? For
> > example, I did not see the post saying anything like "Only if the table does
> > not already exist then it will be created".
> >
> My assumption was the if subscriber is doing initial schema sync , It does not have
> any conflicting database objects.
>

Can't we simply error out in such a case with "obj already exists"? This would be similar to how we deal with conflicting rows with unique/primary keys.

--
With Regards,
Amit Kapila.
Hi Amit,

> From: Amit Kapila <amit.kapila16@gmail.com>
> > > Hi,
> > >
> > > I have a couple of questions.
> > >
> > > Q1.
> > >
> > > What happens if the subscriber already has some tables present? For
> > > example, I did not see the post saying anything like "Only if the
> > > table does not already exist then it will be created".
> > >
> > My assumption was the if subscriber is doing initial schema sync , It
> > does not have any conflicting database objects.
> >
> Can't we simply error out in such a case with "obj already exists"?
> This would be similar to how we deal with conflicting rows with unique/primary
> keys.

Right, this is the default behaviour. We will run pg_restore with --single-transaction, so if we get an error while executing a CREATE TABLE, the whole pg_restore will fail and the user will be notified.

Regards
Sachin
Hi Alvaro,

> From: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication
> On 2023-Mar-15, Kumar, Sachin wrote:
>
> > 1. In CreateSubscription() when we create replication
> > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we
> > can use this snapshot later in the pg_dump.
> >
> > 2. Now we can call pg_dump with above snapshot from CreateSubscription.
>
> Overall I'm not on board with the idea that logical replication would depend on
> pg_dump; that seems like it could run into all sorts of trouble (what if calling
> external binaries requires additional security setup? what about pg_hba
> connection requirements? what about max_connections in tight
> circumstances?).

> what if calling external binaries requires additional security setup

I am not sure what kind of security restriction would apply in this case; maybe the pg_dump binary could be changed?

> what about pg_hba connection requirements?

We will use the same connection string which the subscriber process uses to connect to the publisher.

> what about max_connections in tight circumstances?

Right, that might be an issue, but I don't think it will be a big one. We create the dump of the database in the CreateSubscription() function itself, so if we have already hit max_connections when calling pg_dump, the tablesync processes would not be successful either.

> It would be much better, I think, to handle this internally in the publisher instead:
> similar to how DDL sync would work, except it'd somehow generate the CREATE
> statements from the existing tables instead of waiting for DDL events to occur. I
> grant that this does require writing a bunch of new code for each object type, a
> lot of which would duplicate the pg_dump logic, but it would probably be a lot
> more robust.

Agreed, but we might have a lot of code duplication; essentially almost all of the pg_dump code would need to be duplicated, which might cause issues when modifying/adding new DDL. I am not sure, but it might be possible to move the relevant pg_dump code to the common/ folder to avoid duplication.

Regards
Sachin
On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote: > > On Mon, Mar 20, 2023, at 10:10 PM, Kumar, Sachin wrote: > > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > > Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication > > On 2023-Mar-15, Kumar, Sachin wrote: > > > > > 1. In CreateSubscription() when we create replication > > > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we > > can use this snapshot later in the pg_dump. > > > > > > 2. Now we can call pg_dump with above snapshot from CreateSubscription. > > > > Overall I'm not on board with the idea that logical replication would depend on > > pg_dump; that seems like it could run into all sorts of trouble (what if calling > > external binaries requires additional security setup? what about pg_hba > > connection requirements? what about max_connections in tight > > circumstances?). > > what if calling external binaries requires additional security setup > I am not sure what kind of security restriction would apply in this case, maybe pg_dump > binary can be changed ? > > Using pg_dump as part of this implementation is not acceptable because we > expect the backend to be decoupled from the client. Besides that, pg_dump > provides all table dependencies (such as tablespaces, privileges, security > labels, comments); not all dependencies shouldn't be replicated. > I agree that in the initial version we may not support sync of all objects but why that shouldn't be possible in the later versions? > You should > exclude them removing these objects from the TOC before running pg_restore or > adding a few pg_dump options to exclude these objects. Another issue is related > to different version. Let's say the publisher has a version ahead of the > subscriber version, a new table syntax can easily break your logical > replication setup. IMO pg_dump doesn't seem like a good solution for initial > synchronization. 
> > Instead, the backend should provide infrastructure to obtain the required DDL > commands for the specific (set of) tables. This can work around the issues from > the previous paragraph: > ... > * don't need to worry about different versions. > AFAICU some of the reasons why pg_dump is not allowed to dump from the newer version are as follows: (a) there could be more columns in the newer version of the system catalog and then Select * type of stuff won't work because the client won't have knowledge of additional columns. (b) the newer version could have new features (represented by say new columns in existing catalogs or new catalogs) that the older version of pg_dump has no knowledge of and will fail to get that data and hence an inconsistent dump. The subscriber will easily be not in sync due to that. Now, how do we avoid these problems even if we have our own version of functionality similar to pg_dump for selected objects? I guess we will face similar problems. If so, we may need to deny schema sync in any such case. -- With Regards, Amit Kapila.
On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote: > > > > On Mon, Mar 20, 2023, at 10:10 PM, Kumar, Sachin wrote: > > > > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > > > Subject: RE: [EXTERNAL]Initial Schema Sync for Logical Replication > > > On 2023-Mar-15, Kumar, Sachin wrote: > > > > > > > 1. In CreateSubscription() when we create replication > > > > slot(walrcv_create_slot()), should use CRS_EXPORT_SNAPSHOT, So that we > > > can use this snapshot later in the pg_dump. > > > > > > > > 2. Now we can call pg_dump with above snapshot from CreateSubscription. > > > > > > Overall I'm not on board with the idea that logical replication would depend on > > > pg_dump; that seems like it could run into all sorts of trouble (what if calling > > > external binaries requires additional security setup? what about pg_hba > > > connection requirements? what about max_connections in tight > > > circumstances?). > > > what if calling external binaries requires additional security setup > > I am not sure what kind of security restriction would apply in this case, maybe pg_dump > > binary can be changed ? > > > > Using pg_dump as part of this implementation is not acceptable because we > > expect the backend to be decoupled from the client. Besides that, pg_dump > > provides all table dependencies (such as tablespaces, privileges, security > > labels, comments); not all dependencies shouldn't be replicated. > > > > I agree that in the initial version we may not support sync of all > objects but why that shouldn't be possible in the later versions? > > > You should > > exclude them removing these objects from the TOC before running pg_restore or > > adding a few pg_dump options to exclude these objects. Another issue is related > > to different version. 
Let's say the publisher has a version ahead of the > > subscriber version, a new table syntax can easily break your logical > > replication setup. IMO pg_dump doesn't seem like a good solution for initial > > synchronization. > > > > Instead, the backend should provide infrastructure to obtain the required DDL > > commands for the specific (set of) tables. This can work around the issues from > > the previous paragraph: > > > ... > > * don't need to worry about different versions. > > > > AFAICU some of the reasons why pg_dump is not allowed to dump from the > newer version are as follows: (a) there could be more columns in the > newer version of the system catalog and then Select * type of stuff > won't work because the client won't have knowledge of additional > columns. (b) the newer version could have new features (represented by > say new columns in existing catalogs or new catalogs) that the older > version of pg_dump has no knowledge of and will fail to get that data > and hence an inconsistent dump. The subscriber will easily be not in > sync due to that. > > Now, how do we avoid these problems even if we have our own version of > functionality similar to pg_dump for selected objects? I guess we will > face similar problems. Right. I think that such functionality needs to return DDL commands that can be executed on the requested version. > If so, we may need to deny schema sync in any such case. Yes. Do we have any concrete use case where the subscriber is an older version, in the first place? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote: > > > > > You should > > > exclude them removing these objects from the TOC before running pg_restore or > > > adding a few pg_dump options to exclude these objects. Another issue is related > > > to different version. Let's say the publisher has a version ahead of the > > > subscriber version, a new table syntax can easily break your logical > > > replication setup. IMO pg_dump doesn't seem like a good solution for initial > > > synchronization. > > > > > > Instead, the backend should provide infrastructure to obtain the required DDL > > > commands for the specific (set of) tables. This can work around the issues from > > > the previous paragraph: > > > > > ... > > > * don't need to worry about different versions. > > > > > > > AFAICU some of the reasons why pg_dump is not allowed to dump from the > > newer version are as follows: (a) there could be more columns in the > > newer version of the system catalog and then Select * type of stuff > > won't work because the client won't have knowledge of additional > > columns. (b) the newer version could have new features (represented by > > say new columns in existing catalogs or new catalogs) that the older > > version of pg_dump has no knowledge of and will fail to get that data > > and hence an inconsistent dump. The subscriber will easily be not in > > sync due to that. > > > > Now, how do we avoid these problems even if we have our own version of > > functionality similar to pg_dump for selected objects? I guess we will > > face similar problems. > > Right. I think that such functionality needs to return DDL commands > that can be executed on the requested version. > > > If so, we may need to deny schema sync in any such case. > > Yes. 
> Do we have any concrete use case where the subscriber is an older
> version, in the first place?
>

As per my understanding, it is mostly due to the reason that it can work today. Today, during an off-list discussion with Jonathan on this point, he pointed me to a similar incompatibility in MySQL replication. See the "SQL incompatibilities" section in doc [1]. Also, please note that this applies not only to initial sync but also to schema sync during replication. I don't think it would be feasible to keep such cross-version compatibility for DDL replication.

Having said that, I don't intend that we must use pg_dump from the subscriber for the purpose of initial sync. I think the idea at this stage is primarily to write a POC patch to see what difficulties we may face. The other options that we could try out are (a) try to duplicate parts of the pg_dump code in some way (by extracting the required code) for the subscription's initial sync, or (b) have common code (probably as a library or some other way) for the required functionality. There could be more possibilities that we may not have thought of yet. But the main point is that for approaches other than using pg_dump, we should consider ways to avoid duplicating various parts of its code. Due to this, I think before ruling out using pg_dump we should be clear about its risks and limitations.

Thoughts?

[1] - https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
[2] - https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Wednesday, March 22, 2023 1:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote: > > > > > > > You should > > > > exclude them removing these objects from the TOC before running > > > > pg_restore or adding a few pg_dump options to exclude these > > > > objects. Another issue is related to different version. Let's say > > > > the publisher has a version ahead of the subscriber version, a new > > > > table syntax can easily break your logical replication setup. IMO > > > > pg_dump doesn't seem like a good solution for initial synchronization. > > > > > > > > Instead, the backend should provide infrastructure to obtain the > > > > required DDL commands for the specific (set of) tables. This can > > > > work around the issues from the previous paragraph: > > > > > > > ... > > > > * don't need to worry about different versions. > > > > > > > > > > AFAICU some of the reasons why pg_dump is not allowed to dump from > > > the newer version are as follows: (a) there could be more columns in > > > the newer version of the system catalog and then Select * type of > > > stuff won't work because the client won't have knowledge of > > > additional columns. (b) the newer version could have new features > > > (represented by say new columns in existing catalogs or new > > > catalogs) that the older version of pg_dump has no knowledge of and > > > will fail to get that data and hence an inconsistent dump. The > > > subscriber will easily be not in sync due to that. > > > > > > Now, how do we avoid these problems even if we have our own version > > > of functionality similar to pg_dump for selected objects? I guess we > > > will face similar problems. > > > > Right. 
I think that such functionality needs to return DDL commands > > that can be executed on the requested version. > > > > > If so, we may need to deny schema sync in any such case. > > > > Yes. Do we have any concrete use case where the subscriber is an older > > version, in the first place? > > > > As per my understanding, it is mostly due to the reason that it can work today. > Today, during an off-list discussion with Jonathan on this point, he pointed me > to a similar incompatibility in MySQL replication. See the "SQL > incompatibilities" section in doc[1]. Also, please note that this applies not only > to initial sync but also to schema sync during replication. I don't think it would > be feasible to keep such cross-version compatibility for DDL replication. > > Having said above, I don't intend that we must use pg_dump from the > subscriber for the purpose of initial sync. I think the idea at this stage is to > primarily write a POC patch to see what difficulties we may face. The other > options that we could try out are (a) try to duplicate parts of pg_dump code in > some way (by extracting required > code) for the subscription's initial sync, or (b) have a common code (probably > as a library or some other way) for the required functionality. There could be > more possibilities that we may not have thought of yet. But the main point is > that for approaches other than using pg_dump, we should consider ways to > avoid duplicity of various parts of its code. Due to this, I think before ruling out > using pg_dump, we should be clear about its risks and limitations. I thought about some possible problems about the design of using pg_dump. 1) According to the design, it will internally call pg_dump when creating subscription, but it requires to use a powerful user when calling pg_dump. Currently, it may not be a problem because create subscription also requires superuser. 
But people have recently discussed allowing non-superusers to create subscriptions [1]; if that is accepted, then it seems not great to internally use a superuser to call pg_dump while the user creating the subscription is a non-superuser.

2) I think it's possible that some cloud DB services don't allow users to run the client commands (pg_dump, ...) directly, and the user that logs in to the database may not have permission to execute the client commands.

[1] https://www.postgresql.org/message-id/flat/20230308194743.23rmgjgwahh4i4rg%40awork3.anarazel.de

Best Regards,
Hou zj
> From: Amit Kapila <amit.kapila16@gmail.com>
> Sent: Wednesday, March 22, 2023 5:16 AM
> To: Masahiko Sawada <sawada.mshk@gmail.com>
> Cc: Euler Taveira <euler@eulerto.com>; Kumar, Sachin <ssetiya@amazon.com>; Alvaro Herrera <alvherre@alvh.no-ip.org>; pgsql-hackers@lists.postgresql.org; Jonathan S. Katz <jkatz@postgresql.org>
> Subject: RE: [EXTERNAL] Initial Schema Sync for Logical Replication
>
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> > > >
> > > > You should exclude them by removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these objects. Another
> > > > issue is related to different versions. Let's say the publisher has a version
> > > > ahead of the subscriber version; new table syntax can easily break your logical
> > > > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > > > synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the required DDL
> > > > commands for the specific (set of) tables. This can work around the issues from
> > > > the previous paragraph:
> > > ...
> > > > * don't need to worry about different versions.
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from the
> > > newer version are as follows: (a) there could be more columns in the newer
> > > version of the system catalog, and then "SELECT *"-type queries won't work
> > > because the client won't have knowledge of the additional columns;
> > > (b) the newer version could have new features (represented by, say, new
> > > columns in existing catalogs or new catalogs) that the older version of
> > > pg_dump has no knowledge of, so it will fail to get that data and hence
> > > produce an inconsistent dump. The subscriber will easily be out of sync
> > > due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
>
> As per my understanding, it is mostly due to the reason that it can work
> today. Today, during an off-list discussion with Jonathan on this point, he
> pointed me to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc [1]. Also, please note that this applies not
> only to initial sync but also to schema sync during replication. I don't think
> it would be feasible to keep such cross-version compatibility for DDL
> replication.
>
> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is
> to primarily write a POC patch to see what difficulties we may face. The other
> options that we could try out are (a) try to duplicate parts of pg_dump code
> in some way (by extracting the required code) for the subscription's initial
> sync, or (b) have common code (probably as a library or some other way) for
> the required functionality. There could be more possibilities that we may not
> have thought of yet. But the main point is that for approaches other than
> using pg_dump, we should consider ways to avoid duplicating various parts of
> its code. Due to this, I think before ruling out using pg_dump, we should be
> clear about its risks and limitations.
>
> Thoughts?

There is one more thing which needs to be considered: even if we use pg_dump/pg_restore, we still need a way to get the CREATE TABLE definition for tables if we want to support concurrent DDLs on the publisher.

> 8. TableSync process should check the state of table, if it is SUBREL_STATE_CREATE it should
> get the latest definition from the publisher and recreate the table. (We have to recreate
> the table even if there are no changes). Then it should go into copy table mode as usual.

Unless there is a different way to support concurrent DDLs, or we go for blocking the publisher till the initial sync is completed.

Regards
Sachin

> [1] - https://dev.mysql.com/doc/refman/8.0/en/replication-compatibility.html
> [2] - https://www.postgresql.org/message-id/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
On Wed, Mar 22, 2023 at 2:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler@eulerto.com> wrote:
> > > >
> > > > You should exclude them by removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these objects. Another
> > > > issue is related to different versions. Let's say the publisher has a version
> > > > ahead of the subscriber version; new table syntax can easily break your logical
> > > > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > > > synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the required DDL
> > > > commands for the specific (set of) tables. This can work around the issues from
> > > > the previous paragraph:
> > > ...
> > > > * don't need to worry about different versions.
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from the
> > > newer version are as follows: (a) there could be more columns in the newer
> > > version of the system catalog, and then "SELECT *"-type queries won't work
> > > because the client won't have knowledge of the additional columns;
> > > (b) the newer version could have new features (represented by, say, new
> > > columns in existing catalogs or new catalogs) that the older version of
> > > pg_dump has no knowledge of, so it will fail to get that data and hence
> > > produce an inconsistent dump. The subscriber will easily be out of sync
> > > due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
>
> As per my understanding, it is mostly due to the reason that it can work
> today. Today, during an off-list discussion with Jonathan on this point, he
> pointed me to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc [1]. Also, please note that this applies
> not only to initial sync but also to schema sync during replication. I don't
> think it would be feasible to keep such cross-version compatibility for DDL
> replication.

Makes sense to me.

> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is
> to primarily write a POC patch to see what difficulties we may face. The
> other options that we could try out are (a) try to duplicate parts of
> pg_dump code in some way (by extracting the required code) for the
> subscription's initial sync, or (b) have common code (probably as a library
> or some other way) for the required functionality. There could be more
> possibilities that we may not have thought of yet. But the main point is
> that for approaches other than using pg_dump, we should consider ways to
> avoid duplicating various parts of its code. Due to this, I think before
> ruling out using pg_dump, we should be clear about its risks and limitations.
>
> Thoughts?

Agreed. My biggest concern about approaches other than using pg_dump is the same: the code duplication could increase maintenance costs. We should clarify which aspects of using pg_dump are not a good idea, and also analyze alternative ideas in depth.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
>
> As per my understanding, it is mostly due to the reason that it can work
> today. Today, during an off-list discussion with Jonathan on this point, he
> pointed me to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc [1]. Also, please note that this applies
> not only to initial sync but also to schema sync during replication. I don't
> think it would be feasible to keep such cross-version compatibility for DDL
> replication.

I think it's possible to make DDL replication cross-version compatible by making the DDL deparser version-aware: the deparsed JSON blob can have a PG version in it, and the destination server can process the versioned JSON blob by transforming anything incompatible according to the original version and its own version.

Regards,
Zane
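Zane's version-aware idea could look roughly like the following toy sketch. This is purely illustrative: the JSON field names, the version numbers, and the single rewrite rule (dropping a `nulls_not_distinct` option, a feature mentioned later in this thread) are invented for the example and are not the actual deparser output format.

```python
# Hypothetical sketch: the deparsed DDL carries the origin server version,
# and the subscriber applies version-specific rewrites before turning the
# blob back into SQL. All names here are invented for illustration.
import json

def deparse_on_publisher(version_num, command, options):
    """Publisher side: wrap the deparsed command with its origin version."""
    return json.dumps({
        "origin_version": version_num,   # e.g. 160000 for PG 16
        "command": command,
        "options": options,
    })

def transform_for_local_version(blob, local_version):
    """Subscriber side: drop or rewrite options the local version lacks."""
    ddl = json.loads(blob)
    if ddl["origin_version"] > local_version:
        # Example rewrite: strip an option introduced after local_version.
        # A real implementation would consult a per-version rule table and
        # reject commands it cannot transform safely.
        ddl["options"] = {k: v for k, v in ddl["options"].items()
                          if k != "nulls_not_distinct"}
    return ddl

blob = deparse_on_publisher(160000, "CREATE UNIQUE INDEX",
                            {"nulls_not_distinct": True})
print(transform_for_local_version(blob, 140000)["options"])  # {}
```

As Hou zj argues downthread, the hard part is not the plumbing above but deciding, per feature, whether any such rewrite can preserve semantics at all.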
On Wed, Mar 15, 2023 at 11:12 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> Concurrent DDL :-
> User can execute a DDL command on table t1 at the same time the subscriber is trying to sync
> it. Pictorial representation: https://imgur.com/a/ivrIEv8 [1]
>
> The tablesync process makes a connection to the publisher, and it sees the
> table state, which can be in the future wrt the publisher, which can introduce conflicts.
> For example:-
>
> CASE 1:- { Publisher removed the column b from the table t1 when the subscriber was doing
> pg_restore (or any point in the concurrent DDL window described in picture [1]); when the
> tableSync process starts a transaction on the publisher it will request data of table t1
> including column b, which does not exist on the publisher.} So that is why the tableSync
> process asks for the latest definition.
> If we say that we will delay the tableSync worker till all the DDL related to table t1 is
> applied by the applier process, we can still have a window when the publisher issues a DDL
> command just before tableSync starts its transaction, thereby making tableSync and the
> publisher table definition incompatible (thanks to Masahiko for pointing out this race
> condition).
>

IIUC, this is possible only if the tablesync process uses a snapshot different than the snapshot we have used to perform the initial schema sync; otherwise, this shouldn't be a problem. Let me try to explain my understanding with an example (the LSNs used are just to explain the problem):

1. Create Table t1(c1, c2); --LSN: 90
2. Insert t1 (1, 1); --LSN 100
3. Insert t1 (2, 2); --LSN 110
4. Alter t1 Add Column c3; --LSN 120
5. Insert t1 (3, 3, 3); --LSN 130

Now, say before starting the tablesync worker, the apply process performs initial schema sync and uses a snapshot corresponding to LSN 100. Then it starts the tablesync process to allow the initial copy of data in t1. Here, if the tablesync process tries to establish a new snapshot, it may get data till LSN 130, and when it tries to copy the same to the subscriber it will fail. Is my understanding correct about the problem you described? If so, can't we allow the tablesync process to use the same exported snapshot as we used for the initial schema sync, and won't that solve the problem you described?

> The applier process will skip all DDL/DMLs related to the table t1, and tableSync will apply
> those in the catchup phase.
> Although there is one issue: what will happen to views or functions which depend on the
> table? I think they should wait till table_state is > SUBREL_STATE_CREATE (meaning we have
> the latest schema definition from the publisher).
> There might be corner cases to this approach, or maybe a better way to handle concurrent DDL.
> One simple solution might be to disallow DDLs on the publisher till all the schema is
> synced and all tables have state >= SUBREL_STATE_DATASYNC (we can have the CASE 1 issue
> even with DDL replication, so we have to wait till all the tables have table_state
> > SUBREL_STATE_DATASYNC), which might be a big window for big databases.
>
>
> Refresh publication :-
> In refresh publication, subscriber does create a new replication slot hence, we can't run
> pg_dump with a snapshot which starts from origin (maybe this is not an issue at all). In this
> case it makes more sense for the tableSync worker to do schema sync.
>

Can you please explain this problem with some examples?

--
With Regards,
Amit Kapila.
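The race described above can be modeled with a toy simulation, using the LSNs and events from Amit's example. This is illustrative only (the bookkeeping is invented; it is not how tablesync actually works): the schema is synced from a snapshot at LSN 100, where t1 has two columns, but a tablesync worker that takes a fresh snapshot sees the row written after the ALTER at LSN 120.

```python
# Toy model: copying rows visible at snapshot_lsn into a table whose
# definition was taken at schema_lsn fails when a later ALTER changed
# the column list. Event data follows the example in the thread.
events = [
    (90,  "ddl", ["c1", "c2"]),        # Create Table t1(c1, c2)
    (100, "row", (1, 1)),              # Insert t1 (1, 1)
    (110, "row", (2, 2)),              # Insert t1 (2, 2)
    (120, "ddl", ["c1", "c2", "c3"]),  # Alter t1 Add Column c3
    (130, "row", (3, 3, 3)),           # Insert t1 (3, 3, 3)
]

def schema_at(lsn):
    """Column list as of the given LSN."""
    cols = None
    for at, kind, payload in events:
        if kind == "ddl" and at <= lsn:
            cols = payload
    return cols

def copy_table(schema_lsn, snapshot_lsn):
    """Copy rows visible at snapshot_lsn into a table defined as of
    schema_lsn; raise on a column-count mismatch, as the server would."""
    cols = schema_at(schema_lsn)
    for at, kind, payload in events:
        if kind == "row" and at <= snapshot_lsn:
            if len(payload) != len(cols):
                raise ValueError(f"row {payload} does not fit columns {cols}")

copy_table(schema_lsn=100, snapshot_lsn=100)      # same snapshot: fine
try:
    copy_table(schema_lsn=100, snapshot_lsn=130)  # fresh snapshot: fails
except ValueError as e:
    print("copy failed:", e)
```

Reusing the exported snapshot (schema_lsn == snapshot_lsn) avoids the failure, which is exactly the fix discussed in the following messages.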
On Thu, Mar 23, 2023 at 2:48 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
>
> Now, how do we avoid these problems even if we have our own version of
> functionality similar to pg_dump for selected objects? I guess we will
> face similar problems. If so, we may need to deny schema sync in any
> such case.
>
> There are 2 approaches for initial DDL synchronization:
>
> 1) generate the DDL command on the publisher, stream it and apply it as-is on
> the subscriber;
> 2) generate a DDL representation (JSON, for example) on the publisher, stream
> it, transform it into a DDL command on the subscriber and apply it.
>
> Option (1) is simpler and faster than option (2) because it does not require
> an additional step (transformation). However, option (2) is more flexible
> than option (1) because it allows you to create a DDL command even if a
> feature was removed from the subscriber and the publisher version is less
> than the subscriber version, or a feature was added to the publisher and the
> publisher version is greater than the subscriber version.
>

Is this practically possible? Say the publisher has a higher version that has introduced a new object type, corresponding to which it has either a new catalog or some new columns in an existing catalog. Now, I don't think the older version of the subscriber can modify the command received from the publisher so that the same can be applied to the subscriber, because it won't have any knowledge of the new feature. In the other case, where the subscriber is of a newer version, we anyway should be able to support it with pg_dump as there doesn't appear to be any restriction with that; am I missing something?

> One of the main use cases for logical replication is migration (X -> Y where
> X < Y).
>

I don't think we need to restrict this case even if we decide to use pg_dump.

> Per discussion [1], I think if we agree that Alvaro's DDL deparse patch is
> the way to go with DDL replication, it seems wise that it should be used for
> initial DDL synchronization as well.
>

Even if we decide to use the deparse approach, it would still need to mimic stuff from pg_dump to construct commands based only on catalog contents. I am not against using this approach, but we shouldn't ignore the duplication required in this approach.

--
With Regards,
Amit Kapila.
> From: Amit Kapila <amit.kapila16@gmail.com>
> IIUC, this is possible only if the tablesync process uses a snapshot different than the
> snapshot we have used to perform the initial schema sync, otherwise, this
> shouldn't be a problem. Let me try to explain my understanding with an example
> (the LSNs used are just to explain the problem):
>
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
>
> Now, say before starting tablesync worker, apply process performs initial
> schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> tablesync process to allow the initial copy of data in t1.
> Here, if the table sync process tries to establish a new snapshot, it may get data
> till LSN 130 and when it will try to copy the same in subscriber it will fail. Is my
> understanding correct about the problem you described?

Right.

> If so, can't we allow
> tablesync process to use the same exported snapshot as we used for the initial
> schema sync and won't that solve the problem you described?

I think we won't be able to use the same snapshot because the transaction will be committed. In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot() till walrcv_disconnect() is called. (I am not sure about this part; maybe walrcv_disconnect() calls the commit internally?) So somehow we need to keep this snapshot alive even after the transaction is committed (or delay committing the transaction; but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart before tableSync is able to use the same snapshot).

> > Refresh publication :-
> >
> > In refresh publication, subscriber does create a new replication slot

Typo -> subscriber does not

> > hence, we can't run
> > pg_dump with a snapshot which starts from origin (maybe this is not an
> > issue at all). In this case
> > it makes more sense for tableSync worker to do schema sync.
>
> Can you please explain this problem with some examples?

I think we can have the same issues as you mentioned: a new table t1 is added to the publication, and the user does a refresh publication. pg_dump/pg_restore restores the table definition, but before tableSync can start, steps 2 to 5 happen on the publisher.

> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130

And table sync errors out.

There can be one more issue, since we took the pg_dump without a snapshot (wrt the replication slot). (I am not 100 percent sure about this.) Let's imagine the applier process is lagging behind the publisher.

Events on publisher:

1. alter t1 drop column c; LSN 100 <-- applier process tries to execute this DDL
2. alter t1 drop column d; LSN 110
3. insert into t1 values(..); LSN 120 <-- (Refresh publication called) pg_dump/restore restores this version

The applier process executing 1 will fail because t1 does not have column c.

Regards
Sachin
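The applier-lag scenario above can be sketched with another toy simulation. It is illustrative only: the remaining column names ("a", "b") are invented, since the example only says that columns c and d were dropped before pg_dump ran at LSN 120.

```python
# Toy model of the applier-lag problem: REFRESH PUBLICATION runs pg_dump
# against the *current* state (after LSN 120, columns c and d already
# dropped), but the apply worker is still behind and replays the DDL
# from LSN 100 against that newer definition.
restored_columns = {"a", "b"}          # state pg_dump saw at LSN 120

def apply_ddl(columns, ddl):
    """Replay one replicated DDL; fail like the server would."""
    op, col = ddl
    if op == "drop column":
        if col not in columns:
            raise LookupError(f'column "{col}" does not exist')
        columns.discard(col)

pending = [("drop column", "c"),       # LSN 100, not yet applied
           ("drop column", "d")]       # LSN 110, not yet applied

try:
    for ddl in pending:
        apply_ddl(restored_columns, ddl)
except LookupError as e:
    print("apply worker errors out:", e)
```

Taking the dump under a snapshot tied to the slot's position (as discussed in the next messages) removes this mismatch, because the restored definition then corresponds to the point the apply worker will start from.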
On Thu, Mar 23, 2023 at 9:24 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > IIUC, this is possible only if tablesync process uses a snapshot different than the
> > snapshot we have used to perform the initial schema sync, otherwise, this
> > shouldn't be a problem. Let me try to explain my understanding with an example
> > (the LSNs used are just to explain the problem):
> >
> > 1. Create Table t1(c1, c2); --LSN: 90
> > 2. Insert t1 (1, 1); --LSN 100
> > 3. Insert t1 (2, 2); --LSN 110
> > 4. Alter t1 Add Column c3; --LSN 120
> > 5. Insert t1 (3, 3, 3); --LSN 130
> >
> > Now, say before starting tablesync worker, apply process performs initial
> > schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> > tablesync process to allow the initial copy of data in t1.
> > Here, if the table sync process tries to establish a new snapshot, it may get data
> > till LSN 130 and when it will try to copy the same in subscriber it will fail. Is my
> > understanding correct about the problem you described?
>
> Right
>
> > If so, can't we allow
> > tablesync process to use the same exported snapshot as we used for the initial
> > schema sync and won't that solve the problem you described?
>
> I think we won't be able to use the same snapshot because the transaction will be committed.
> In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot()
> till walrcv_disconnect() is called. (I am not sure about this part; maybe walrcv_disconnect()
> calls the commit internally?)
> So somehow we need to keep this snapshot alive, even after the transaction is committed
> (or delay committing the transaction, but we can have CREATE SUBSCRIPTION with
> ENABLED=FALSE, so we can have a restart before tableSync is able to use the same snapshot.)
>

Can we think of getting the table data as well along with the schema via pg_dump? Won't both the schema and the initial data then correspond to the same snapshot?

> > > Refresh publication :-
> > >
> > > In refresh publication, subscriber does create a new replication slot
>
> Typo -> subscriber does not
>
> > > hence, we can't run
> > > pg_dump with a snapshot which starts from origin (maybe this is not an
> > > issue at all). In this case
> > > it makes more sense for tableSync worker to do schema sync.
> >
> > Can you please explain this problem with some examples?
>
> I think we can have the same issues as you mentioned.
> New table t1 is added to the publication, user does a refresh publication.
> pg_dump / pg_restore restores the table definition. But before tableSync
> can start, steps from 2 to 5 happen on the publisher.
>
> > 1. Create Table t1(c1, c2); --LSN: 90
> > 2. Insert t1 (1, 1); --LSN 100
> > 3. Insert t1 (2, 2); --LSN 110
> > 4. Alter t1 Add Column c3; --LSN 120
> > 5. Insert t1 (3, 3, 3); --LSN 130
>
> And table sync errors out.
> There can be one more issue, since we took the pg_dump without a snapshot
> (wrt the replication slot).
>

To avoid both the problems mentioned for Refresh Publication, we can do one of the following: (a) create a new slot along with a snapshot for this operation and drop it afterward; or (b) using the existing slot, establish a new snapshot using a technique proposed in email [1].

Note - Please keep one empty line before and after your inline responses; otherwise, it is slightly difficult to understand your response.

[1] - https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com

--
With Regards,
Amit Kapila.
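For reference, option (a) roughly maps onto facilities that already exist: the streaming replication protocol can export a snapshot when a logical slot is created, and pg_dump can be pinned to that snapshot via its --snapshot option. A hedged sketch (the slot name and the snapshot identifier shown are illustrative, not actual output):

```sql
-- On a replication connection, create a temporary slot and export its
-- snapshot (option syntax per the streaming replication protocol):
CREATE_REPLICATION_SLOT "refresh_tmp_slot" TEMPORARY LOGICAL pgoutput (SNAPSHOT 'export');
-- The command result includes an exported snapshot name, which pg_dump
-- can reuse so that schema (and data) are dumped as of that exact point:
--   pg_dump --snapshot='00000003-0000001B-1' ...
-- The snapshot is only valid while this connection stays open; the
-- temporary slot is dropped automatically on disconnect.
```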
On Friday, March 24, 2023 12:02 AM Euler Taveira <euler@eulerto.com> wrote:

Hi,

> On Thu, Mar 23, 2023, at 8:44 AM, Amit Kapila wrote:
> > On Thu, Mar 23, 2023 at 2:48 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Tue, Mar 21, 2023, at 8:18 AM, Amit Kapila wrote:
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems. If so, we may need to deny schema sync in any
> > > such case.
> > >
> > > There are 2 approaches for initial DDL synchronization:
> > >
> > > 1) generate the DDL command on the publisher, stream it and apply it as-is on
> > > the subscriber;
> > > 2) generate a DDL representation (JSON, for example) on the publisher, stream
> > > it, transform it into a DDL command on the subscriber and apply it.
> > >
> > > Option (1) is simpler and faster than option (2) because it does not require
> > > an additional step (transformation). However, option (2) is more flexible
> > > than option (1) because it allows you to create a DDL command even if a
> > > feature was removed from the subscriber and the publisher version is less
> > > than the subscriber version, or a feature was added to the publisher and the
> > > publisher version is greater than the subscriber version.
> >
> > Is this practically possible? Say the publisher has a higher version that has
> > introduced a new object type, corresponding to which it has either a new
> > catalog or some new columns in an existing catalog. Now, I don't think the
> > older version of the subscriber can modify the command received from the
> > publisher so that the same can be applied to the subscriber, because it won't
> > have any knowledge of the new feature.
> > In the other case, where the subscriber is of a newer version, we anyway
> > should be able to support it with pg_dump as there doesn't appear to be any
> > restriction with that; am I missing something?
>
> I think so (with some limitations). Since the publisher knows the subscriber
> version, the publisher knows that the subscriber does not contain the new
> object type, so the publisher can decide if this case is critical (and reject
> the replication) or optional (and silently not include the feature X -- because
> it is not essential for logical replication). If required, the transformation
> should be done on the publisher.

I am not sure it's feasible to support the use case of replicating DDL to an old subscriber.

First, I think the current publisher doesn't know the version number of the client (subscriber), so we need to check the feasibility of the same. Also, having checks on the client's version number doesn't seem to be a good idea.

Besides, I thought about the problems that will happen if we try to support replicating from a new PG to an older PG. The following examples assume that we support DDL replication in the mentioned PG.

1) Assume we want to replicate from a newer PG to an older PG where partitioned tables have not been introduced. I think even if the publisher is aware of that, it doesn't have a good way to transform the partition-related command; maybe one could say we can transform that to an inheritance table, but I feel that introduces too much complexity.

2) Another example is generated columns. To replicate from the newer PG which has this feature to an older PG without it, I am concerned whether there is a way to transform this without causing inconsistent behavior.

Even if we decide to simply skip sending such unsupported commands or skip applying them, then it's likely that the following DML replication will cause data inconsistency.

So, it seems we cannot completely support this use case; there would be some limitations. Personally, I am not sure if it's worth introducing complexity to support it partially.

Best Regards,
Hou zj
> From: Amit Kapila <amit.kapila16@gmail.com>
> > I think we won't be able to use the same snapshot because the transaction will
> > be committed.
> > In CreateSubscription() we can use the transaction snapshot from
> > walrcv_create_slot() till walrcv_disconnect() is called. (I am not sure
> > about this part; maybe walrcv_disconnect() calls the commit internally?)
> > So somehow we need to keep this snapshot alive, even after the transaction
> > is committed (or delay committing the transaction, but we can have
> > CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart
> > before tableSync is able to use the same snapshot.)
> >
>
> Can we think of getting the table data as well along with schema via
> pg_dump? Won't then both schema and initial data correspond to the
> same snapshot?

Right, that will work, thanks!

> > I think we can have the same issues as you mentioned. New table t1 is added
> > to the publication, user does a refresh publication.
> > pg_dump / pg_restore restores the table definition. But before
> > tableSync can start, steps from 2 to 5 happen on the publisher.
> > > 1. Create Table t1(c1, c2); --LSN: 90
> > > 2. Insert t1 (1, 1); --LSN 100
> > > 3. Insert t1 (2, 2); --LSN 110
> > > 4. Alter t1 Add Column c3; --LSN 120
> > > 5. Insert t1 (3, 3, 3); --LSN 130
> > And table sync errors out.
> > There can be one more issue, since we took the pg_dump without a
> > snapshot (wrt the replication slot).
> >
>
> To avoid both the problems mentioned for Refresh Publication, we can do
> one of the following: (a) create a new slot along with a snapshot for this
> operation and drop it afterward; or (b) using the existing slot, establish a
> new snapshot using a technique proposed in email [1].
>

Thanks, I think option (b) will be perfect, since we don't have to create a new slot.

Regards
Sachin
> I am not sure it's feasible to support the use case of replicating DDL to an old
> subscriber.
>

+1

> First, I think the current publisher doesn't know the version number of the
> client (subscriber), so we need to check the feasibility of the same. Also,
> having checks on the client's version number doesn't seem to be a good idea.
>
> Besides, I thought about the problems that will happen if we try to support
> replicating from a new PG to an older PG. The following examples assume that
> we support DDL replication in the mentioned PG.
>
> 1) Assume we want to replicate from a newer PG to an older PG where
> partitioned tables have not been introduced. I think even if the publisher is
> aware of that, it doesn't have a good way to transform the partition-related
> command; maybe one could say we can transform that to an inheritance table,
> but I feel that introduces too much complexity.
>
> 2) Another example is generated columns. To replicate from the newer PG which
> has this feature to an older PG without it, I am concerned whether there is a
> way to transform this without causing inconsistent behavior.
>
> Even if we decide to simply skip sending such unsupported commands or
> skip applying them, then it's likely that the following DML replication will
> cause data inconsistency.
>
> So, it seems we cannot completely support this use case; there would be
> some limitations. Personally, I am not sure if it's worth introducing
> complexity to support it partially.
>

+1

Regards
Sachin
On Friday, March 24, 2023 11:01 PM Euler Taveira <euler@eulerto.com> wrote:

Hi,

> On Fri, Mar 24, 2023, at 8:57 AM, houzj.fnst@fujitsu.com wrote:
> > First, I think the current publisher doesn't know the version number of the
> > client (subscriber), so we need to check the feasibility of the same. Also,
> > having checks on the client's version number doesn't seem to be a good idea.
>
> walrcv_server_version().

I don't think this function works, as it only shows the server version (e.g. publisher/walsender).

> > Besides, I thought about the problems that will happen if we try to support
> > replicating from a new PG to an older PG. The following examples assume that
> > we support DDL replication in the mentioned PG.
> >
> > 1) Assume we want to replicate from a newer PG to an older PG where
> > partitioned tables have not been introduced. I think even if the publisher
> > is aware of that, it doesn't have a good way to transform the
> > partition-related command; maybe one could say we can transform that to an
> > inheritance table, but I feel that introduces too much complexity.
> >
> > 2) Another example is generated columns. To replicate from the newer PG
> > which has this feature to an older PG without it, I am concerned whether
> > there is a way to transform this without causing inconsistent behavior.
> >
> > Even if we decide to simply skip sending such unsupported commands or skip
> > applying them, then it's likely that the following DML replication will
> > cause data inconsistency.
>
> As I mentioned in a previous email [1], the publisher can contain code to
> decide if it can proceed or not, in case you are doing a downgrade. I said
> downgrade but it can also happen if we decide to deprecate a syntax. For
> example, when WITH OIDS was deprecated, pg_dump treats it as an acceptable
> removal. The transformation can be (dis)allowed by the protocol version or
> another constant [2].

If most of the new DDL-related features won't be supported for transformation to an old subscriber, I don't see a point in supporting this use case. I think cases like the removal of WITH OIDS are rare enough that we don't need to worry about them, and they don't affect data consistency. But new DDL features are different. Not only features like partitioning or generated columns; features like nulls_not_distinct are also tricky to transform without causing inconsistent behavior.

> > So, it seems we cannot completely support this use case; there would be some
> > limitations. Personally, I am not sure if it's worth introducing complexity
> > to support it partially.
>
> Limitations are fine; they have different versions. I wouldn't like to forbid
> downgrade just because I don't want to maintain compatibility with previous
> versions. IMO it is important to be able to downgrade in case of any
> incompatibility with an application. You might argue that this isn't possible
> due to time or patch size and that there is a workaround for it, but I
> wouldn't want to close the door for downgrade in the future.

The biggest problem is the data inconsistency that it would cause. I am not aware of a generic solution to replicate newly introduced DDLs to an old subscriber which wouldn't cause data inconsistency. And apart from that, IMO the complexity and maintainability of the feature also matter.

Best Regards,
Hou zj
On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > From: Amit Kapila <amit.kapila16@gmail.com>
> > > I think we won't be able to use the same snapshot because the
> > > transaction will be committed.
> > > In CreateSubscription() we can use the transaction snapshot from
> > > walrcv_create_slot() till walrcv_disconnect() is called. (I am not sure
> > > about this part; maybe walrcv_disconnect() calls the commits
> > > internally?) So somehow we need to keep this snapshot alive even after
> > > the transaction is committed (or delay committing the transaction, but
> > > we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a
> > > restart before tableSync is able to use the same snapshot).
> >
> > Can we think of getting the table data as well along with the schema via
> > pg_dump? Won't then both schema and initial data correspond to the
> > same snapshot?
>
> Right, that will work, Thanks!

While it works, we cannot get the initial data in parallel, no?

> > > I think we can have the same issues as you mentioned: a new table t1
> > > is added to the publication, and the user does a refresh publication.
> > > pg_dump / pg_restore restores the table definition. But before
> > > tableSync can start, steps from 2 to 5 happen on the publisher.
> > > 1. Create Table t1(c1, c2); --LSN: 90
> > > 2. Insert t1 (1, 1); --LSN 100
> > > 3. Insert t1 (2, 2); --LSN 110
> > > 4. Alter t1 Add Column c3; --LSN 120
> > > 5. Insert t1 (3, 3, 3); --LSN 130
> > > And table sync errors out.
> > > There can be one more issue, since we took the pg_dump without a
> > > snapshot (wrt the replication slot).
> >
> > To avoid both the problems mentioned for Refresh Publication, we can do
> > one of the following: (a) create a new slot along with a snapshot for
> > this operation and drop it afterward; or (b) using the existing slot,
> > establish a new snapshot using a technique proposed in email [1].
>
> Thanks, I think option (b) will be perfect, since we don’t have to create
> a new slot.

Regarding (b), does it mean that the apply worker stops streaming,
requests to create a snapshot, and then resumes the streaming?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
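[Editor's sketch] The "schema and data from the same snapshot" idea
discussed above could look roughly like the following. This is only an
illustration: the slot name, snapshot id, and connection strings are made
up, and a real patch would do this internally rather than via the CLI.

```shell
# 1) On a replication connection to the publisher, creating the logical
#    slot can export a snapshot that stays valid while that session's
#    transaction remains open, e.g.:
#      CREATE_REPLICATION_SLOT "sub1_slot" LOGICAL pgoutput (SNAPSHOT 'export');
#    which returns a snapshot name such as '00000003-00000002-1'.
# 2) pg_dump can then dump schema and data as of exactly that snapshot,
#    so both correspond to the slot's start point:
pg_dump --snapshot='00000003-00000002-1' 'dbname=src' | psql 'dbname=dst'
```

The key property is that any change committed after the slot was created is
excluded from the dump but will be streamed through the slot, so nothing is
lost or applied twice.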
On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Mar 24, 2023 at 11:51 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> >
> > > Can we think of getting the table data as well along with the schema
> > > via pg_dump? Won't then both schema and initial data correspond to the
> > > same snapshot?
> >
> > Right, that will work, Thanks!
>
> While it works, we cannot get the initial data in parallel, no?

Another possibility is that we dump/restore the schema of each table
along with its data. One thing we can explore is whether the parallel
option of dump can be useful here. Do you have any other ideas?

One related idea is that currently, we fetch the table list
corresponding to the publications in the subscription and create the
entries for those in pg_subscription_rel during Create Subscription;
can we think of postponing that work till after the initial schema
sync? We seem to be already storing the publications list in
pg_subscription, so it appears possible if we somehow remember the
value of copy_data. If this is feasible then I think that may give us
the flexibility to perform the initial sync at a later point by the
background worker.

> > > To avoid both the problems mentioned for Refresh Publication, we can
> > > do one of the following: (a) create a new slot along with a snapshot
> > > for this operation and drop it afterward; or (b) using the existing
> > > slot, establish a new snapshot using a technique proposed in email [1].
> >
> > Thanks, I think option (b) will be perfect, since we don’t have to
> > create a new slot.
>
> Regarding (b), does it mean that the apply worker stops streaming,
> requests to create a snapshot, and then resumes the streaming?

Shouldn't this be done by the backend performing a REFRESH publication?

--
With Regards,
Amit Kapila.
On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 27, 2023 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > While it works, we cannot get the initial data in parallel, no?
>
> Another possibility is that we dump/restore the schema of each table
> along with its data. One thing we can explore is whether the parallel
> option of dump can be useful here. Do you have any other ideas?

A downside of the idea of dumping both the table schema and the table data
would be that we need to temporarily store data twice the size of the
table (the dump file and the table itself) during the load. One might
think that we can redirect the pg_dump output into the backend so that
it can load it via SPI, but that doesn't work since "COPY tbl FROM
stdin;" doesn't work via SPI. The --inserts option of pg_dump could help
out here, but it makes restoration very slow.

> One related idea is that currently, we fetch the table list
> corresponding to the publications in the subscription and create the
> entries for those in pg_subscription_rel during Create Subscription;
> can we think of postponing that work till after the initial schema
> sync? We seem to be already storing the publications list in
> pg_subscription, so it appears possible if we somehow remember the
> value of copy_data. If this is feasible then I think that may give us
> the flexibility to perform the initial sync at a later point by the
> background worker.

It sounds possible. With this idea, we will be able to have the apply
worker restore the table schemas (and create pg_subscription_rel
entries) as the first thing. Another point we might need to consider is
that the initial schema sync (i.e. creating tables) and creating
pg_subscription_rel entries need to be done in the same transaction.
Otherwise, we could end up committing only one of the two changes. I
think it depends on how we restore the schema data.

> > > > To avoid both the problems mentioned for Refresh Publication, we can
> > > > do one of the following: (a) create a new slot along with a snapshot
> > > > for this operation and drop it afterward; or (b) using the existing
> > > > slot, establish a new snapshot using a technique proposed in email [1].
> > >
> > > Thanks, I think option (b) will be perfect, since we don’t have to
> > > create a new slot.
> >
> > Regarding (b), does it mean that the apply worker stops streaming,
> > requests to create a snapshot, and then resumes the streaming?
>
> Shouldn't this be done by the backend performing a REFRESH publication?

Hmm, I might be missing something, but the idea (b) uses the existing
slot to establish a new snapshot, right? What existing replication
slot do we use for that? I thought it was the one used by the apply
worker.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> > > > From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > I think we won't be able to use the same snapshot because the
> > > > > transaction will be committed.
> > > > > [...]
> > > >
> > > > Can we think of getting the table data as well along with the schema
> > > > via pg_dump? Won't then both schema and initial data correspond to
> > > > the same snapshot?
> > >
> > > Right, that will work, Thanks!
> >
> > While it works, we cannot get the initial data in parallel, no?

I was thinking each tableSync process will call pg_dump --table. This way,
if we have N tableSync processes, we can have N pg_dump --table=table_name
invocations running in parallel. In fact, we can use --schema-only to get
the schema and then let COPY take care of the data syncing. We will use the
same snapshot for pg_dump as well as for COPY table.

Regards
Sachin
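[Editor's sketch] The per-table parallel scheme above could be pictured as
follows. Everything here is illustrative: the table names, snapshot id, and
connection strings are made up, and in the real design each loop body would
be a tablesync worker sharing the slot's exported snapshot, not a shell job.

```shell
# Hypothetical sketch: N workers, each dumping one table's schema with the
# same exported snapshot, then letting COPY move the data under it.
SNAP='00000003-00000002-1'   # snapshot exported when the slot was created
for tbl in t1 t2 t3; do
  (
    pg_dump --schema-only --table="$tbl" --snapshot="$SNAP" 'dbname=src' \
      | psql 'dbname=dst'
    # a real tablesync worker would now run COPY for "$tbl"
    # under the same snapshot
  ) &
done
wait
```

Because every worker pins the same snapshot, each table's schema and its
copied data are mutually consistent even though the dumps run concurrently.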
On Tue, Mar 28, 2023 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Mar 28, 2023 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > > > To avoid both the problems mentioned for Refresh Publication, we
> > > > > > can do one of the following: (a) create a new slot along with a
> > > > > > snapshot for this operation and drop it afterward; or (b) using
> > > > > > the existing slot, establish a new snapshot using a technique
> > > > > > proposed in email [1].
> > > >
> > > > Thanks, I think option (b) will be perfect, since we don’t have to
> > > > create a new slot.
> > >
> > > Regarding (b), does it mean that the apply worker stops streaming,
> > > requests to create a snapshot, and then resumes the streaming?
> >
> > Shouldn't this be done by the backend performing a REFRESH publication?
>
> Hmm, I might be missing something, but the idea (b) uses the existing
> slot to establish a new snapshot, right? What existing replication
> slot do we use for that? I thought it was the one used by the apply
> worker.

Right, it will be the same as the one for the apply worker. I think if we
decide to do the initial sync via the apply worker then in this case also,
we need to let the apply worker restart and perform the initial sync as the
first thing.

--
With Regards,
Amit Kapila.
On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > > While it works, we cannot get the initial data in parallel, no?
>
> I was thinking each tableSync process will call pg_dump --table. This way,
> if we have N tableSync processes, we can have N pg_dump --table=table_name
> invocations running in parallel. In fact, we can use --schema-only to get
> the schema and then let COPY take care of the data syncing. We will use
> the same snapshot for pg_dump as well as for COPY table.

How can we postpone creating the pg_subscription_rel entries until the
tablesync worker starts and does the schema sync? I think that since a
pg_subscription_rel entry needs the table OID, we need either to do
the schema sync before creating the entry (i.e., during CREATE
SUBSCRIPTION) or to postpone creating the entries as Amit proposed[1].
The apply worker needs the information of tables to sync in order to
launch the tablesync workers, but it needs to create the table schema
to get that information.

Regards,

[1] https://www.postgresql.org/message-id/CAA4eK1Ld9-5ueomE_J5CA6LfRo%3DwemdTrUp5qdBhRFwGT%2BdOUw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > One related idea is that currently, we fetch the table list
> > corresponding to the publications in the subscription and create the
> > entries for those in pg_subscription_rel during Create Subscription;
> > can we think of postponing that work till after the initial schema
> > sync? We seem to be already storing the publications list in
> > pg_subscription, so it appears possible if we somehow remember the
> > value of copy_data. If this is feasible then I think that may give us
> > the flexibility to perform the initial sync at a later point by the
> > background worker.

Maybe we need to add a column to pg_subscription to store the copy_data
state?

> It sounds possible. With this idea, we will be able to have the apply
> worker restore the table schemas (and create pg_subscription_rel
> entries) as the first thing. Another point we might need to consider is
> that the initial schema sync (i.e. creating tables) and creating
> pg_subscription_rel entries need to be done in the same transaction.
> Otherwise, we could end up committing only one of the two changes. I
> think it depends on how we restore the schema data.

I think we have to add one more column to pg_subscription to track the
initial sync state {OFF, SCHEMA_DUMPED, SCHEMA_RESTORED, COMPLETED}
(COMPLETED will show that pg_subscription_rel is filled). I don’t think we
will be able to do the pg_restore and fill pg_subscription_rel in a single
transaction, but we can use initial_sync_state to resume from where we left
off after an abrupt crash/shutdown.

Regards
Sachin
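[Editor's sketch] The crash-resume idea above can be illustrated with a tiny
state machine. This is not PostgreSQL code: the state names come from the
email, but the function, the step names, and the mapping are hypothetical,
purely to show how a persisted initial_sync_state lets a restarted worker
skip the steps that already committed.

```python
# Illustrative only: resuming initial sync from a persisted state column.
OFF, SCHEMA_DUMPED, SCHEMA_RESTORED, COMPLETED = (
    "OFF", "SCHEMA_DUMPED", "SCHEMA_RESTORED", "COMPLETED")

# Steps in execution order; the state records the last step that committed.
STEPS = ["dump_schema", "restore_schema", "fill_pg_subscription_rel"]
RESUME_AT = {OFF: 0, SCHEMA_DUMPED: 1, SCHEMA_RESTORED: 2, COMPLETED: 3}

def remaining_steps(state):
    """Return the steps still to run for a subscription in `state`."""
    return STEPS[RESUME_AT[state]:]

# A crash between the restore and the catalog fill reruns only the last step:
print(remaining_steps(SCHEMA_RESTORED))  # ['fill_pg_subscription_rel']
```

The point is that each step only needs to be idempotent up to its own state
transition, rather than all of them sharing one transaction.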
On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote:
> >
> > I was thinking each tableSync process will call pg_dump --table. This
> > way, if we have N tableSync processes, we can have N pg_dump
> > --table=table_name invocations running in parallel. In fact, we can use
> > --schema-only to get the schema and then let COPY take care of the data
> > syncing. We will use the same snapshot for pg_dump as well as for COPY
> > table.
>
> How can we postpone creating the pg_subscription_rel entries until the
> tablesync worker starts and does the schema sync? I think that since a
> pg_subscription_rel entry needs the table OID, we need either to do
> the schema sync before creating the entry (i.e., during CREATE
> SUBSCRIPTION) or to postpone creating the entries as Amit proposed[1].
> The apply worker needs the information of tables to sync in order to
> launch the tablesync workers, but it needs to create the table schema
> to get that information.

For the above reason, I think that step 6 of the initial proposal won't work.

If we can have the tablesync worker create an entry of pg_subscription_rel
after creating the table, it may give us the flexibility to perform the
initial sync. One idea is that we add a relname field to pg_subscription_rel
so that we can create entries with the relname instead of the OID if the
table is not created yet. Once the table is created, we clear the relname
field and set the OID of the table instead. It's not an ideal solution but
we might make it simpler later.

Assuming that it's feasible, I'm considering another approach for the
initial sync in order to address the concurrent DDLs.

The basic idea is to somewhat follow how pg_dump/restore dumps/restores the
database data. We divide the synchronization phase (including both schema
and data) up into three phases: pre-data, table-data, and post-data. These
mostly follow the --section option of pg_dump.

1. The backend process performing CREATE SUBSCRIPTION creates the
subscription but doesn't create pg_subscription_rel entries yet.

2. Before starting the streaming, the apply worker fetches the table list
from the publisher, creates pg_subscription_rel entries for them, and
dumps+restores database objects that tables could depend on but that don't
depend on tables, such as TYPE, OPERATOR, ACCESS METHOD etc. (i.e.
pre-data).

3. The apply worker launches the tablesync workers for tables that need to
be synchronized.

There might be DDLs executed on the publisher for tables before the
tablesync worker starts. But the apply worker needs to apply DDLs for
pre-data database objects. OTOH, it can ignore DDLs for not-synced-yet
tables and other database objects such as INDEX, TRIGGER, RULE, etc. (i.e.
post-data).

4. The tablesync worker creates its replication slot, dumps+restores the
table schema, updates pg_subscription_rel, and performs COPY.

These operations should be done in the same transaction.

5. After finishing COPY, the tablesync worker dumps the indexes (and
perhaps constraints) of the table and creates them (which possibly takes a
long time). Then it starts to catch up, same as today. The apply worker
needs to wait for the tablesync worker to catch up.

We need to repeat these steps until we complete the initial data copy and
create indexes for all tables, IOW until all pg_subscription_rel statuses
become READY.

6. If the apply worker confirms all tables are READY, it starts another
sync worker that is responsible for the post-data database objects such as
TRIGGER, RULE, POLICY etc. (i.e. post-data).

While the sync worker is starting up or working, the apply worker applies
changes for pre-data database objects as well as READY tables.

7. Similar to the tablesync worker, this sync worker creates its
replication slot and sets the returned LSN somewhere, say pg_subscription.

8. The sync worker dumps and restores these objects, which could take time
since it would need to create FK constraints. Then it starts to catch up if
the apply worker is ahead. The apply worker waits for the sync worker to
catch up.

9. Once the sync worker catches up, the apply worker starts applying
changes for all database objects.

IIUC with this approach, we can resolve the concurrent DDL problem Sachin
mentioned, and indexes (and constraints) are created after the initial data
copy.

The procedures are still very complex and not fully considered yet, but I
hope there are some useful things, at least for discussion.

Probably we can start with supporting only tables. In this case, we would
not need the post-data phase (i.e. steps 6-9). It seems to me that we need
to have the subscription state somewhere (maybe pg_subscription) so that
the apply worker figures out the next step. Since we need to dump and
restore different objects at different timings, we probably cannot directly
use pg_dump/pg_restore. I've not considered how a concurrent REFRESH
PUBLICATION works.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
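[Editor's sketch] For reference, the three-phase split described above
roughly mirrors pg_dump's existing --section option, shown here with
placeholder connection strings. Note two caveats: the proposal's pre-data
phase, unlike pg_dump's, would exclude the table definitions themselves
(those are handled per-table by the tablesync workers), and, as stated
above, the patch probably could not invoke pg_dump directly.

```shell
# pre-data: types, operators, access methods, table definitions, etc.
pg_dump --section=pre-data  'dbname=src' | psql 'dbname=dst'
# data: the COPY-based table contents
pg_dump --section=data      'dbname=src' | psql 'dbname=dst'
# post-data: indexes, constraints, triggers, rules
pg_dump --section=post-data 'dbname=src' | psql 'dbname=dst'
```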
> -----Original Message----- > From: Masahiko Sawada <sawada.mshk@gmail.com> > > > I was thinking each TableSync process will call pg_dump --table, > > > This way if we have N tableSync process, we can have N pg_dump -- > table=table_name called in parallel. > > > In fact we can use --schema-only to get schema and then let COPY > > > take care of data syncing . We will use same snapshot for pg_dump as well > as COPY table. > > > > How can we postpone creating the pg_subscription_rel entries until the > > tablesync worker starts and does the schema sync? I think that since > > pg_subscription_rel entry needs the table OID, we need either to do > > the schema sync before creating the entry (i.e, during CREATE > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The > > apply worker needs the information of tables to sync in order to > > launch the tablesync workers, but it needs to create the table schema > > to get that information. > > For the above reason, I think that step 6 of the initial proposal won't work. > > If we can have the tablesync worker create an entry of pg_subscription_rel after > creating the table, it may give us the flexibility to perform the initial sync. One > idea is that we add a relname field to pg_subscription_rel so that we can create > entries with relname instead of OID if the table is not created yet. Once the > table is created, we clear the relname field and set the OID of the table instead. > It's not an ideal solution but we might make it simpler later. > > Assuming that it's feasible, I'm considering another approach for the initial sync > in order to address the concurrent DDLs. > > The basic idea is to somewhat follow how pg_dump/restore to dump/restore > the database data. We divide the synchronization phase (including both schema > and data) up into three phases: pre-data, table-data, post-data. These mostly > follow the --section option of pg_dump. > > 1. 
The backend process performing CREATE SUBSCRIPTION creates the > subscription but doesn't create pg_subscription_rel entries yet. > > 2. Before starting the streaming, the apply worker fetches the table list from the > publisher, create pg_subscription_rel entries for them, and dumps+restores > database objects that tables could depend on but don't depend on tables such as > TYPE, OPERATOR, ACCESS METHOD etc (i.e. > pre-data). We will not have slot starting snapshot, So somehow we have to get a new snapshot And skip all the wal_log between starting of slot and snapshot creation lsn ? . > > 3. The apply worker launches the tablesync workers for tables that need to be > synchronized. > > There might be DDLs executed on the publisher for tables before the tablesync > worker starts. But the apply worker needs to apply DDLs for pre-data database > objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database > objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data). > > 4. The tablesync worker creates its replication slot, dumps+restores the table > schema, update the pg_subscription_rel, and perform COPY. > > These operations should be done in the same transaction. pg_restore wont be rollbackable, So we need to maintain states in pg_subscription_rel. > > 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps > constraints) of the table and creates them (which possibly takes a long time). > Then it starts to catch up, same as today. The apply worker needs to wait for the > tablesync worker to catch up. I don’t think we can have CATCHUP stage. We can have a DDL on publisher which can add a new column (And this DDL will be executed by applier later). Then we get a INSERT because we have old definition of table, insert will fail. > > We need to repeat these steps until we complete the initial data copy and create > indexes for all tables, IOW until all pg_subscription_rel status becomes READY. > > 6. 
If the apply worker confirms all tables are READY, it starts another sync > worker who is responsible for the post-data database objects such as TRIGGER, > RULE, POLICY etc (i.e. post-data). > > While the sync worker is starting up or working, the apply worker applies > changes for pre-data database objects as well as READY tables. We might have some issue if we have create table like Create table_name as select * from materialized_view. > > 7. Similar to the tablesync worker, this sync worker creates its replication slot > and sets the returned LSN somewhere, say pg_subscription. > > 8. The sync worker dumps and restores these objects. Which could take a time > since it would need to create FK constraints. Then it starts to catch up if the > apply worker is ahead. The apply worker waits for the sync worker to catch up. > > 9. Once the sync worker catches up, the apply worker starts applying changes > for all database objects. > > IIUC with this approach, we can resolve the concurrent DDL problem Sachin > mentioned, and indexes (and constraints) are created after the initial data copy. > > The procedures are still very complex and not fully considered yet but I hope > there are some useful things at least for discussion. > > Probably we can start with supporting only tables. In this case, we would not > need the post-data phase (i.e. step 6-9). It seems to me that we need to have > the subscription state somewhere (maybe > pg_subscription) so that the apply worker figure out the next step. > Since we need to dump and restore different objects on different timings, we > probably cannot directly use pg_dump/pg_restore. I've not considered how the > concurrent REFRESH PUBLICATION works. I think above prototype will work and will have least amount of side effects, but It might be too complex to implement and I am not sure about corner cases. 
I was thinking of other ways of doing Initial Sync , which are less complex but each with separate set of bottlenecks On Publisher Side:- 1) Locking the publisher:- Easiest one to implement, applier process will get Access Shared lock on the all the published tables. (We don't have to worry newly created concurrent table) As tableSync will finish syncing the table, it will release table lock, So we will release table locks in steps. Users can still perform DML on tables, but DDLs wont be allowed. On Subscriber Side:- So the main issue is tableSync process can see the future table data/version wrt to the applier process, So we have to find a way to ensure that tableSync/applier process sees same table version. 2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned we can use pg_dump/ pg_restore [1], Although it might have side effect of using double storage , we can table pg_dump of each table separately and delete the dump as soon as table is synced. tableSync process will read the dump and call pg_restore on the table. If we crash in middle of restoring the tables we can start pg_dump(--clean)/restore again with left out tables. With this we can reduce space usage but we might create too many files. 3) Using publisher snapshot:- Applier process will do pg_dump/pg_restore as usual, Then applier process will start a new process P1 which will connect to publisher and start a transaction , it will call pg_export_snapshot() to export the snapshot.Then applier process will take snapshot string and pass it to the tableSync process as a argument. tableSync will use this snapshot for COPY TABLE. tableSync should only do COPY TABLE and then will exit , So we wont do any catchup phase in tableSync. After all tables finish COPY table transaction will be committed by P1 process and it will exit. In the case of crash/restart we can simple start from beginning since nothing is committed till every table is synced. There are 2 main issues with this approach 1. 
I am not sure what side effects we might have on the publisher, since we might have to keep the transaction open for a long time.
2. The applier process will simply wait till all tables are synced.

Since the applier process won't be able to apply any WAL till all tables are synced, maybe instead of creating a new process the applier process itself can start the transaction and export the snapshot, and the tableSync processes will use that snapshot. After all tables are synced it can start WAL streaming.

I think approach no. 3 might be the best way.

[1] https://www.postgresql.org/message-id/CAA4eK1Ld9-5ueomE_J5CA6LfRo%3DwemdTrUp5qdBhRFwGT%2BdOUw%40mail.gmail.com

Regards
Sachin
On Mon, Apr 3, 2023 at 3:54 PM Kumar, Sachin <ssetiya@amazon.com> wrote: > > > > > -----Original Message----- > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > > > I was thinking each TableSync process will call pg_dump --table, > > > > This way if we have N tableSync process, we can have N pg_dump -- > > table=table_name called in parallel. > > > > In fact we can use --schema-only to get schema and then let COPY > > > > take care of data syncing . We will use same snapshot for pg_dump as well > > as COPY table. > > > > > > How can we postpone creating the pg_subscription_rel entries until the > > > tablesync worker starts and does the schema sync? I think that since > > > pg_subscription_rel entry needs the table OID, we need either to do > > > the schema sync before creating the entry (i.e, during CREATE > > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The > > > apply worker needs the information of tables to sync in order to > > > launch the tablesync workers, but it needs to create the table schema > > > to get that information. > > > > For the above reason, I think that step 6 of the initial proposal won't work. > > > > If we can have the tablesync worker create an entry of pg_subscription_rel after > > creating the table, it may give us the flexibility to perform the initial sync. One > > idea is that we add a relname field to pg_subscription_rel so that we can create > > entries with relname instead of OID if the table is not created yet. Once the > > table is created, we clear the relname field and set the OID of the table instead. > > It's not an ideal solution but we might make it simpler later. > > > > Assuming that it's feasible, I'm considering another approach for the initial sync > > in order to address the concurrent DDLs. > > > > The basic idea is to somewhat follow how pg_dump/restore to dump/restore > > the database data. 
We divide the synchronization phase (including both schema > > and data) up into three phases: pre-data, table-data, post-data. These mostly > > follow the --section option of pg_dump. > > > > 1. The backend process performing CREATE SUBSCRIPTION creates the > > subscription but doesn't create pg_subscription_rel entries yet. > > > > 2. Before starting the streaming, the apply worker fetches the table list from the > > publisher, create pg_subscription_rel entries for them, and dumps+restores > > database objects that tables could depend on but don't depend on tables such as > > TYPE, OPERATOR, ACCESS METHOD etc (i.e. > > pre-data). > > We will not have slot starting snapshot, So somehow we have to get a new snapshot > And skip all the wal_log between starting of slot and snapshot creation lsn ? . Yes. Or we can somehow postpone creating pg_subscription_rel entries until the tablesync workers create tables, or we request walsender to establish a new snapshot using a technique proposed in email[1]. > > > > > 3. The apply worker launches the tablesync workers for tables that need to be > > synchronized. > > > > There might be DDLs executed on the publisher for tables before the tablesync > > worker starts. But the apply worker needs to apply DDLs for pre-data database > > objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database > > objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data). > > > > 4. The tablesync worker creates its replication slot, dumps+restores the table > > schema, update the pg_subscription_rel, and perform COPY. > > > > These operations should be done in the same transaction. > > pg_restore wont be rollbackable, So we need to maintain states in pg_subscription_rel. Yes. But I think it depends on how we restore them. 
For example, if we have the tablesync worker somehow restore the table using a new SQL function returning the table schema as we discussed or executing the dump file via SPI, we can do that in the same transaction. > > > > > 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps > > constraints) of the table and creates them (which possibly takes a long time). > > Then it starts to catch up, same as today. The apply worker needs to wait for the > > tablesync worker to catch up. > > I don’t think we can have CATCHUP stage. We can have a DDL on publisher which > can add a new column (And this DDL will be executed by applier later). Then we get a INSERT > because we have old definition of table, insert will fail. All DMLs and DDLs associated with the table being synchronized are applied by the tablesync worker until it catches up with the apply worker. > > > > > We need to repeat these steps until we complete the initial data copy and create > > indexes for all tables, IOW until all pg_subscription_rel status becomes READY. > > > > 6. If the apply worker confirms all tables are READY, it starts another sync > > worker who is responsible for the post-data database objects such as TRIGGER, > > RULE, POLICY etc (i.e. post-data). > > > > While the sync worker is starting up or working, the apply worker applies > > changes for pre-data database objects as well as READY tables. > We might have some issue if we have create table like > Create table_name as select * from materialized_view. Could you elaborate on the scenario where we could have an issue with such DDL? > > > > 7. Similar to the tablesync worker, this sync worker creates its replication slot > > and sets the returned LSN somewhere, say pg_subscription. > > > > 8. The sync worker dumps and restores these objects. Which could take a time > > since it would need to create FK constraints. Then it starts to catch up if the > > apply worker is ahead. 
The apply worker waits for the sync worker to catch up. > > > > 9. Once the sync worker catches up, the apply worker starts applying changes > > for all database objects. > > > > IIUC with this approach, we can resolve the concurrent DDL problem Sachin > > mentioned, and indexes (and constraints) are created after the initial data copy. > > > > The procedures are still very complex and not fully considered yet but I hope > > there are some useful things at least for discussion. > > > > Probably we can start with supporting only tables. In this case, we would not > > need the post-data phase (i.e. step 6-9). It seems to me that we need to have > > the subscription state somewhere (maybe > > pg_subscription) so that the apply worker figure out the next step. > > Since we need to dump and restore different objects on different timings, we > > probably cannot directly use pg_dump/pg_restore. I've not considered how the > > concurrent REFRESH PUBLICATION works. > > I think above prototype will work and will have least amount of side effects, but > It might be too complex to implement and I am not sure about corner cases. > > I was thinking of other ways of doing Initial Sync , which are less complex but each > with separate set of bottlenecks > > On Publisher Side:- > 1) Locking the publisher:- Easiest one to implement, applier process will get Access Shared > lock on the all the published tables. (We don't have to worry newly created concurrent table) > As tableSync will finish syncing the table, it will release table lock, So we will release > table locks in steps. Users can still perform DML on tables, but DDLs wont be allowed. Do you mean that the apply worker acquires table locks and the tablesync workers release them? If so, how can we implement it? 
> > 2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned we can use pg_dump/ > pg_restore [1], Although it might have side effect of using double storage , we can > table pg_dump of each table separately and delete the dump as soon as table is synced. > tableSync process will read the dump and call pg_restore on the table. > If we crash in middle of restoring the tables we can start pg_dump(--clean)/restore again > with left out tables. > With this we can reduce space usage but we might create too many files. With this idea, who does pg_dump and pg_restore? and when do we create pg_subscription_rel entries? > > 3) Using publisher snapshot:- Applier process will do pg_dump/pg_restore as usual, > Then applier process will start a new process P1 which will connect to > publisher and start a transaction , it will call pg_export_snapshot() to export the > snapshot.Then applier process will take snapshot string and pass it to the tableSync process > as a argument. tableSync will use this snapshot for COPY TABLE. tableSync should only > do COPY TABLE and then will exit , So we wont do any catchup phase in tableSync. After > all tables finish COPY table transaction will be committed by P1 process and it will exit. > In the case of crash/restart we can simple start from beginning since nothing is committed > till every table is synced. There are 2 main issues with this approach > 1. I am not sure what side-effects we might have on publisher since we might have to keep > the transaction open for long time. I'm concerned that it would not be an acceptable downside that we keep a transaction open until all tables are synchronized. > 2. Applier process will simple wait till all tables are synced. > since applier process wont be able to apply any wal_logs till all tables are synced > maybe instead of creating new process Applier process itself can start transaction/ > export snapshot and tableSync process will use that snapshot. 
After all tables are synced > it can start wal_streaming. I think that after users execute REFRESH PUBLICATION, there are mixed non-ready and ready tables in the subscription. In this case, it's a huge restriction for users that logical replication for the ready tables stops until all newly-subscribed tables are synchronized. Regards, [1] https://www.postgresql.org/message-id/CAGPVpCRWEVhXa7ovrhuSQofx4to7o22oU9iKtrOgAOtz_%3DY6vg%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
> From: Masahiko Sawada <sawada.mshk@gmail.com> > > > > > > 3. The apply worker launches the tablesync workers for tables that > > > need to be synchronized. > > > > > > There might be DDLs executed on the publisher for tables before the > > > tablesync worker starts. But the apply worker needs to apply DDLs > > > for pre-data database objects. OTOH, it can ignore DDLs for > > > not-synced-yet tables and other database objects such as INDEX, > TRIGGER, RULE, etc (i.e. post-data). > > > > > > 4. The tablesync worker creates its replication slot, dumps+restores > > > the table schema, update the pg_subscription_rel, and perform COPY. > > > > > > These operations should be done in the same transaction. > > > > pg_restore wont be rollbackable, So we need to maintain states in > pg_subscription_rel. > > Yes. But I think it depends on how we restore them. For example, if we have > the tablesync worker somethow restore the table using a new SQL function > returning the table schema as we discussed or executing the dump file via > SPI, we can do that in the same transaction. Okay. > > > > > > > > > 5. After finishing COPY, the tablesync worker dumps indexes (and > > > perhaps > > > constraints) of the table and creates them (which possibly takes a long > time). > > > Then it starts to catch up, same as today. The apply worker needs to > > > wait for the tablesync worker to catch up. > > > > I don’t think we can have CATCHUP stage. We can have a DDL on > > publisher which can add a new column (And this DDL will be executed by > > applier later). Then we get a INSERT because we have old definition of > table, insert will fail. > > All DMLs and DDLs associated with the table being synchronized are applied > by the tablesync worker until it catches up with the apply worker. Right, sorry, I forgot that in the above case, if the definition on the publisher changes, we will also have corresponding DDLs. 
> > > > > > > > > We need to repeat these steps until we complete the initial data > > > copy and create indexes for all tables, IOW until all pg_subscription_rel > status becomes READY. > > > > > > 6. If the apply worker confirms all tables are READY, it starts > > > another sync worker who is responsible for the post-data database > > > objects such as TRIGGER, RULE, POLICY etc (i.e. post-data). > > > > > > While the sync worker is starting up or working, the apply worker > > > applies changes for pre-data database objects as well as READY tables. > > We might have some issue if we have create table like Create > > table_name as select * from materialized_view. > > Could you elaborate on the scenario where we could have an issue with such > DDL? Since the materialized view of the publisher has not been created by the subscriber yet, a DDL which does a CREATE TABLE using a materialized view will fail. I am not sure how the DDL patch is handling CREATE TABLE AS statements. If it is modified to become like a normal CREATE TABLE then we won't have any issues. > > > > > > > 7. Similar to the tablesync worker, this sync worker creates its > > > replication slot and sets the returned LSN somewhere, say > pg_subscription. > > > > > > 8. The sync worker dumps and restores these objects. Which could > > > take a time since it would need to create FK constraints. Then it > > > starts to catch up if the apply worker is ahead. The apply worker waits for > the sync worker to catch up. > > > > > > 9. Once the sync worker catches up, the apply worker starts applying > > > changes for all database objects. > > > > > > IIUC with this approach, we can resolve the concurrent DDL problem > > > Sachin mentioned, and indexes (and constraints) are created after the > initial data copy. > > > > > > The procedures are still very complex and not fully considered yet > > > but I hope there are some useful things at least for discussion. 
> > > > > > Probably we can start with supporting only tables. In this case, we > > > would not need the post-data phase (i.e. step 6-9). It seems to me > > > that we need to have the subscription state somewhere (maybe > > > pg_subscription) so that the apply worker figure out the next step. > > > Since we need to dump and restore different objects on different > > > timings, we probably cannot directly use pg_dump/pg_restore. I've > > > not considered how the concurrent REFRESH PUBLICATION works. > > > > I think above prototype will work and will have least amount of side > > effects, but It might be too complex to implement and I am not sure about > corner cases. > > > > I was thinking of other ways of doing Initial Sync , which are less > > complex but each with separate set of bottlenecks > > > > On Publisher Side:- > > 1) Locking the publisher:- Easiest one to implement, applier process > > will get Access Shared lock on the all the published tables. (We don't > > have to worry newly created concurrent table) As tableSync will finish > > syncing the table, it will release table lock, So we will release table locks in > steps. Users can still perform DML on tables, but DDLs wont be allowed. > > Do you mean that the apply worker acquires table locks and the tablesync > workers release them? If so, how can we implement it? > I think releasing the locks in steps would be impossible (given the postgres lock implementation), so the applier process has to create a new transaction and lock all the published tables in ACCESS SHARE mode, and after tableSync is completed the transaction will be committed to release the locks. So 1 and 3 are similar: we have to keep one transaction open till all tables are synced. > > > > 2) Using pg_dump/pg_restore for schema and data:- As Amit mentioned > we > > can use pg_dump/ pg_restore [1], Although it might have side effect of > > using double storage , we can table pg_dump of each table separately and > delete the dump as soon as table is synced. 
> > tableSync process will read the dump and call pg_restore on the table. > > If we crash in middle of restoring the tables we can start > > pg_dump(--clean)/restore again with left out tables. > > With this we can reduce space usage but we might create too many files. > > With this idea, who does pg_dump and pg_restore? and when do we create > pg_subscription_rel entries? The applier process will do pg_dump/pg_restore. pg_subscription_rel entries can be created after pg_restore; we can create a new column with relname and keep the OID empty, as you have suggested earlier. > > > > > 3) Using publisher snapshot:- Applier process will do > > pg_dump/pg_restore as usual, Then applier process will start a new > > process P1 which will connect to publisher and start a transaction , > > it will call pg_export_snapshot() to export the snapshot.Then applier > > process will take snapshot string and pass it to the tableSync process > > as a argument. tableSync will use this snapshot for COPY TABLE. > > tableSync should only do COPY TABLE and then will exit , So we wont do > any catchup phase in tableSync. After all tables finish COPY table transaction > will be committed by P1 process and it will exit. > > In the case of crash/restart we can simple start from beginning since > > nothing is committed till every table is synced. There are 2 main > > issues with this approach 1. I am not sure what side-effects we might > > have on publisher since we might have to keep the transaction open for > long time. > > I'm concerned that it would not be an acceptable downside that we keep a > transaction open until all tables are synchronized. > Okay. There is one more issue: just using the same snapshot will not stop table DDL modifications; we need to have at least an ACCESS SHARE lock on each table. This will make the tables locked on the publisher, so this is essentially the same as 1.

Regards
Sachin
On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Mar 29, 2023 at 7:57 PM Kumar, Sachin <ssetiya@amazon.com> wrote: > > > > > > > > > > From: Amit Kapila <amit.kapila16@gmail.com> > > > > > > > > I think we won't be able to use same snapshot because the > > > > > > > > transaction will be committed. > > > > > > > > In CreateSubscription() we can use the transaction snapshot from > > > > > > > > walrcv_create_slot() till walrcv_disconnect() is called.(I am > > > > > > > > not sure about this part maybe walrcv_disconnect() calls the commits > > > > internally ?). > > > > > > > > So somehow we need to keep this snapshot alive, even after > > > > > > > > transaction is committed(or delay committing the transaction , > > > > > > > > but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we > > > > > > > > can have a restart before tableSync is able to use the same > > > > > > > > snapshot.) > > > > > > > > > > > > > > > > > > > > > > Can we think of getting the table data as well along with schema > > > > > > > via pg_dump? Won't then both schema and initial data will > > > > > > > correspond to the same snapshot? > > > > > > > > > > > > Right , that will work, Thanks! > > > > > > > > > > While it works, we cannot get the initial data in parallel, no? > > > > > > > > > > > I was thinking each TableSync process will call pg_dump --table, This way if we have N > > > tableSync process, we can have N pg_dump --table=table_name called in parallel. > > > In fact we can use --schema-only to get schema and then let COPY take care of data > > > syncing . We will use same snapshot for pg_dump as well as COPY table. > > > > How can we postpone creating the pg_subscription_rel entries until the > > tablesync worker starts and does the schema sync? 
I think that since > > pg_subscription_rel entry needs the table OID, we need either to do > > the schema sync before creating the entry (i.e, during CREATE > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The > > apply worker needs the information of tables to sync in order to > > launch the tablesync workers, but it needs to create the table schema > > to get that information. > > For the above reason, I think that step 6 of the initial proposal won't work. > > If we can have the tablesync worker create an entry of > pg_subscription_rel after creating the table, it may give us the > flexibility to perform the initial sync. One idea is that we add a > relname field to pg_subscription_rel so that we can create entries > with relname instead of OID if the table is not created yet. Once the > table is created, we clear the relname field and set the OID of the > table instead. It's not an ideal solution but we might make it simpler > later. While writing a PoC patch, I found some difficulties in this idea. First, I tried to add schemaname+relname to pg_subscription_rel but I could not define the primary key of pg_subscription_rel. The primary key on (srsubid, srrelid) doesn't work since srrelid could be NULL. Similarly, the primary key on (srsubid, srrelid, schemaname, relname) also doesn't work. So I tried another idea: that we generate a new OID for srrelid and the tablesync worker will replace it with the new table's OID once it creates the table. However, since we use srrelid in replication slot names, changing srrelid during the initial schema+data sync is not straightforward (please note that the slot is created by the tablesync worker but is removed by the apply worker). Using relname in slot name instead of srrelid is not a good idea since it requires all pg_subscription_rel entries have relname, and slot names could be duplicated, for example, when the relname is very long and we cut it. 
I'm trying to consider the idea from another angle: the apply worker fetches the table list and passes the relname to the tablesync worker. But a problem of this approach is that the table list is not persisted. If the apply worker restarts during the initial table sync, it could not get the same list as before. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > How can we postpone creating the pg_subscription_rel entries until the > > > tablesync worker starts and does the schema sync? I think that since > > > pg_subscription_rel entry needs the table OID, we need either to do > > > the schema sync before creating the entry (i.e, during CREATE > > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The > > > apply worker needs the information of tables to sync in order to > > > launch the tablesync workers, but it needs to create the table schema > > > to get that information. > > > > For the above reason, I think that step 6 of the initial proposal won't work. > > > > If we can have the tablesync worker create an entry of > > pg_subscription_rel after creating the table, it may give us the > > flexibility to perform the initial sync. One idea is that we add a > > relname field to pg_subscription_rel so that we can create entries > > with relname instead of OID if the table is not created yet. Once the > > table is created, we clear the relname field and set the OID of the > > table instead. It's not an ideal solution but we might make it simpler > > later. > > While writing a PoC patch, I found some difficulties in this idea. > First, I tried to add schemaname+relname to pg_subscription_rel but I > could not define the primary key of pg_subscription_rel. The primary > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > also doesn't work. > Can we think of having a separate catalog table say pg_subscription_remote_rel for this? You can have srsubid, remote_schema_name, remote_rel_name, etc. 
We may need some other state to be maintained during the initial schema sync where this table can be used. Basically, this can be used to maintain the state till the initial schema sync is complete because we can create a relation entry in pg_subscription_rel only after the initial schema sync is complete. > So I tried another idea: that we generate a new OID > for srrelid and the tablesync worker will replace it with the new > table's OID once it creates the table. However, since we use srrelid > in replication slot names, changing srrelid during the initial > schema+data sync is not straightforward (please note that the slot is > created by the tablesync worker but is removed by the apply worker). > Using relname in slot name instead of srrelid is not a good idea since > it requires all pg_subscription_rel entries have relname, and slot > names could be duplicated, for example, when the relname is very long > and we cut it. > > I'm trying to consider the idea from another angle: the apply worker > fetches the table list and passes the relname to the tablesync worker. > But a problem of this approach is that the table list is not > persisted. If the apply worker restarts during the initial table sync, > it could not get the same list as before. > Agreed, this has some drawbacks. We can try to explore this if the above idea of the new catalog table doesn't solve this problem. -- With Regards, Amit Kapila.
On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Mar 30, 2023 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Thu, Mar 30, 2023 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > How can we postpone creating the pg_subscription_rel entries until the > > > > tablesync worker starts and does the schema sync? I think that since > > > > pg_subscription_rel entry needs the table OID, we need either to do > > > > the schema sync before creating the entry (i.e, during CREATE > > > > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The > > > > apply worker needs the information of tables to sync in order to > > > > launch the tablesync workers, but it needs to create the table schema > > > > to get that information. > > > > > > For the above reason, I think that step 6 of the initial proposal won't work. > > > > > > If we can have the tablesync worker create an entry of > > > pg_subscription_rel after creating the table, it may give us the > > > flexibility to perform the initial sync. One idea is that we add a > > > relname field to pg_subscription_rel so that we can create entries > > > with relname instead of OID if the table is not created yet. Once the > > > table is created, we clear the relname field and set the OID of the > > > table instead. It's not an ideal solution but we might make it simpler > > > later. > > > > While writing a PoC patch, I found some difficulties in this idea. > > First, I tried to add schemaname+relname to pg_subscription_rel but I > > could not define the primary key of pg_subscription_rel. The primary > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > > also doesn't work. 
> > > > Can we think of having a separate catalog table say > pg_subscription_remote_rel for this? You can have srsubid, > remote_schema_name, remote_rel_name, etc. We may need some other state > to be maintained during the initial schema sync where this table can > be used. Basically, this can be used to maintain the state till the > initial schema sync is complete because we can create a relation entry > in pg_subscritption_rel only after the initial schema sync is > complete. It might not be ideal but I guess it works. But I think we need to modify the name of replication slot for initial sync as it currently includes OID of the table:

void
ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
                                char *syncslotname, Size szslot)
{
    snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid,
             relid, GetSystemIdentifier());
}

If we use both schema name and table name, it's possible that slot names are duplicated if schema and/or table names are long. Another idea is to use the hash value of schema+table names, but it cannot completely eliminate that possibility, and probably would make investigation and debugging hard in case of any failure. Probably we can use the OID of each entry in pg_subscription_remote_rel instead, but I'm not sure it's a good idea, mainly because we will end up using twice as many OIDs as before.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> From: Masahiko Sawada <sawada.mshk@gmail.com> > > > While writing a PoC patch, I found some difficulties in this idea. > > > First, I tried to add schemaname+relname to pg_subscription_rel but > > > I could not define the primary key of pg_subscription_rel. The > > > primary key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > > Similarly, the primary key on (srsubid, srrelid, schemaname, > > > relname) also doesn't work. > > > > > > > Can we think of having a separate catalog table say > > pg_subscription_remote_rel for this? You can have srsubid, > > remote_schema_name, remote_rel_name, etc. We may need some other > state > > to be maintained during the initial schema sync where this table can > > be used. Basically, this can be used to maintain the state till the > > initial schema sync is complete because we can create a relation entry > > in pg_subscritption_rel only after the initial schema sync is > > complete. > > It might not be ideal but I guess it works. But I think we need to modify the name > of replication slot for initial sync as it currently includes OID of the table: > > void > ReplicationSlotNameForTablesync(Oid suboid, Oid relid, > char *syncslotname, Size szslot) { > snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, > relid, GetSystemIdentifier()); } > > If we use both schema name and table name, it's possible that slot names are > duplicated if schema and/or table names are long. Another idea is to use the > hash value of schema+table names, but it cannot completely eliminate that > possibility, and probably would make investigation and debugging hard in case > of any failure. Probably we can use the OID of each entry in > pg_subscription_remote_rel instead, but I'm not sure it's a good idea, mainly > because we will end up using twice as many OIDs as before. Maybe we can create serial primary key for pg_subscription_remote_rel table And use this key for creating replication slot ? Regards Sachin
I am working on a prototype with the above idea, and will send it for review by Sunday/Monday.

Regards
Sachin
On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > While writing a PoC patch, I found some difficulties in this idea. > > > First, I tried to add schemaname+relname to pg_subscription_rel but I > > > could not define the primary key of pg_subscription_rel. The primary > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > > > also doesn't work. > > > > > > > Can we think of having a separate catalog table say > > pg_subscription_remote_rel for this? You can have srsubid, > > remote_schema_name, remote_rel_name, etc. We may need some other state > > to be maintained during the initial schema sync where this table can > > be used. Basically, this can be used to maintain the state till the > > initial schema sync is complete because we can create a relation entry > > in pg_subscritption_rel only after the initial schema sync is > > complete. > > It might not be ideal but I guess it works. But I think we need to > modify the name of replication slot for initial sync as it currently > includes OID of the table: > > void > ReplicationSlotNameForTablesync(Oid suboid, Oid relid, > char *syncslotname, Size szslot) > { > snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, > relid, GetSystemIdentifier()); > } > > If we use both schema name and table name, it's possible that slot > names are duplicated if schema and/or table names are long. Another > idea is to use the hash value of schema+table names, but it cannot > completely eliminate that possibility, and probably would make > investigation and debugging hard in case of any failure. 
Probably we > can use the OID of each entry in pg_subscription_remote_rel instead, > but I'm not sure it's a good idea, mainly because we will end up using > twice as many OIDs as before. > The other possibility is to use worker_pid. To make debugging easier, we may want to LOG schema_name+rel_name vs slot_name mapping at DEBUG1 log level. -- With Regards, Amit Kapila.
I am working on a prototype with the above discussed idea; I think I will send it for initial review by Monday. Regards, Sachin
On Thu, Apr 20, 2023 at 9:41 PM Kumar, Sachin <ssetiya@amazon.com> wrote: > > I am working on a prototype with above discussed idea, I think I will send it for initial review by Monday. > Okay, but which idea are you referring to? The pg_subscription_remote_rel + worker_pid idea Amit proposed? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > While writing a PoC patch, I found some difficulties in this idea. > > > > First, I tried to add schemaname+relname to pg_subscription_rel but I > > > > could not define the primary key of pg_subscription_rel. The primary > > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > > > > also doesn't work. > > > > > > > > > > Can we think of having a separate catalog table say > > > pg_subscription_remote_rel for this? You can have srsubid, > > > remote_schema_name, remote_rel_name, etc. We may need some other state > > > to be maintained during the initial schema sync where this table can > > > be used. Basically, this can be used to maintain the state till the > > > initial schema sync is complete because we can create a relation entry > > > in pg_subscritption_rel only after the initial schema sync is > > > complete. > > > > It might not be ideal but I guess it works. But I think we need to > > modify the name of replication slot for initial sync as it currently > > includes OID of the table: > > > > void > > ReplicationSlotNameForTablesync(Oid suboid, Oid relid, > > char *syncslotname, Size szslot) > > { > > snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, > > relid, GetSystemIdentifier()); > > } > > > > If we use both schema name and table name, it's possible that slot > > names are duplicated if schema and/or table names are long. 
Another > > idea is to use the hash value of schema+table names, but it cannot > > completely eliminate that possibility, and probably would make > > investigation and debugging hard in case of any failure. Probably we > > can use the OID of each entry in pg_subscription_remote_rel instead, > > but I'm not sure it's a good idea, mainly because we will end up using > > twice as many OIDs as before. > > > > The other possibility is to use worker_pid. To make debugging easier, > we may want to LOG schema_name+rel_name vs slot_name mapping at DEBUG1 > log level. Since worker_pid changes after the worker restarts, a new worker cannot find the slot that had been used, no? After thinking it over, a better solution would be that we add an oid column, nspname column, and relname column to pg_subscription_rel and the primary key on the oid. If the table is not present on the subscriber we store the schema name and table name to the catalog, and otherwise we store the local table oid same as today. The local table oid will be filled after the schema sync. The names of origin and replication slot the tablesync worker uses use the oid instead of the table oid. I've attached a PoC patch of this idea (very rough patch and has many TODO comments). It mixes the following changes: 1. Add oid column to the pg_subscription_rel. The oid is used as the primary key and in the names of origin and slot the tablesync workers use. 2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet support for ALTER SUBSCRIPTION). 3. Add CRS_EXPORT_USE_SNAPSHOT new action in order to use the same snapshot by both walsender and other processes (e.g. pg_dump). In this patch, the snapshot is exported for pg_dump and is used by the walsender for COPY. It seems to work well but there might be a pitfall as I've not fully implemented it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 21, 2023 at 16:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > > While writing a PoC patch, I found some difficulties in this idea. > > > > > First, I tried to add schemaname+relname to pg_subscription_rel but I > > > > > could not define the primary key of pg_subscription_rel. The primary > > > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > > > > > also doesn't work. > > > > > > > > > > > > > Can we think of having a separate catalog table say > > > > pg_subscription_remote_rel for this? You can have srsubid, > > > > remote_schema_name, remote_rel_name, etc. We may need some other > state > > > > to be maintained during the initial schema sync where this table can > > > > be used. Basically, this can be used to maintain the state till the > > > > initial schema sync is complete because we can create a relation entry > > > > in pg_subscritption_rel only after the initial schema sync is > > > > complete. > > > > > > It might not be ideal but I guess it works. 
But I think we need to > > > modify the name of replication slot for initial sync as it currently > > > includes OID of the table: > > > > > > void > > > ReplicationSlotNameForTablesync(Oid suboid, Oid relid, > > > char *syncslotname, Size szslot) > > > { > > > snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, > suboid, > > > relid, GetSystemIdentifier()); > > > } > > > > > > If we use both schema name and table name, it's possible that slot > > > names are duplicated if schema and/or table names are long. Another > > > idea is to use the hash value of schema+table names, but it cannot > > > completely eliminate that possibility, and probably would make > > > investigation and debugging hard in case of any failure. Probably we > > > can use the OID of each entry in pg_subscription_remote_rel instead, > > > but I'm not sure it's a good idea, mainly because we will end up using > > > twice as many OIDs as before. > > > > > > > The other possibility is to use worker_pid. To make debugging easier, > > we may want to LOG schema_name+rel_name vs slot_name mapping at > DEBUG1 > > log level. > > Since worker_pid changes after the worker restarts, a new worker > cannot find the slot that had been used, no? > > After thinking it over, a better solution would be that we add an oid > column, nspname column, and relname column to pg_subscription_rel and > the primary key on the oid. If the table is not present on the > subscriber we store the schema name and table name to the catalog, and > otherwise we store the local table oid same as today. The local table > oid will be filled after the schema sync. The names of origin and > replication slot the tablesync worker uses use the oid instead of the > table oid. > > I've attached a PoC patch of this idea (very rough patch and has many > TODO comments). It mixes the following changes: > > 1. Add oid column to the pg_subscription_rel. 
The oid is used as the > primary key and in the names of origin and slot the tablesync workers > use. > > 2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet > support for ALTER SUBSCRIPTION). > > 3. Add CRS_EXPORT_USE_SNAPSHOT new action in order to use the same > snapshot by both walsender and other processes (e.g. pg_dump). In this > patch, the snapshot is exported for pg_dump and is used by the > walsender for COPY. > > It seems to work well but there might be a pitfall as I've not fully > implemented it. Thanks for your PoC patch. After reviewing it, I have a question below that I want to confirm:

1. In the function synchronize_table_schema, I think some changes to GUCs and SQL for table-related objects are included in the pg_dump result, and in this PoC those SQL statements will be executed. Do we need to alter the pg_dump results to execute only the table-schema-related SQL? For example, if we have the below table schema on the publisher side:

```
create table tbl(a int, b int);
create index idx_t on tbl (a);
CREATE FUNCTION trigger_func() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
BEGIN
  INSERT INTO public.tbl VALUES (NEW.*);
  RETURN NEW;
END; $$;
CREATE TRIGGER tri_tbl BEFORE INSERT ON public.tbl FOR EACH ROW EXECUTE PROCEDURE trigger_func();
```

The result of pg_dump executed on the subscriber side:

```
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
SET default_tablespace = '';
SET default_table_access_method = heap;
CREATE TABLE public.tbl (
    a integer,
    b integer
);
ALTER TABLE public.tbl OWNER TO postgres;
CREATE INDEX idx_t ON public.tbl USING btree (a);
CREATE TRIGGER tri_tbl BEFORE INSERT ON public.tbl FOR EACH ROW EXECUTE FUNCTION public.trigger_func();
```

And this will cause an error at `CREATE TRIGGER` because we did not dump the function trigger_func. Regards, Wang Wei
On Thu, Apr 27, 2023 at 12:02 PM Wei Wang (Fujitsu) <wangw.fnst@fujitsu.com> wrote: > > On Fri, Apr 21, 2023 at 16:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Apr 20, 2023 at 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 17, 2023 at 9:12 AM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > > > On Fri, Apr 7, 2023 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 6, 2023 at 6:57 PM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > > > > > While writing a PoC patch, I found some difficulties in this idea. > > > > > > First, I tried to add schemaname+relname to pg_subscription_rel but I > > > > > > could not define the primary key of pg_subscription_rel. The primary > > > > > > key on (srsubid, srrelid) doesn't work since srrelid could be NULL. > > > > > > Similarly, the primary key on (srsubid, srrelid, schemaname, relname) > > > > > > also doesn't work. > > > > > > > > > > > > > > > > Can we think of having a separate catalog table say > > > > > pg_subscription_remote_rel for this? You can have srsubid, > > > > > remote_schema_name, remote_rel_name, etc. We may need some other > > state > > > > > to be maintained during the initial schema sync where this table can > > > > > be used. Basically, this can be used to maintain the state till the > > > > > initial schema sync is complete because we can create a relation entry > > > > > in pg_subscritption_rel only after the initial schema sync is > > > > > complete. > > > > > > > > It might not be ideal but I guess it works. 
But I think we need to > > > > modify the name of replication slot for initial sync as it currently > > > > includes OID of the table: > > > > > > > > void > > > > ReplicationSlotNameForTablesync(Oid suboid, Oid relid, > > > > char *syncslotname, Size szslot) > > > > { > > > > snprintf(syncslotname, szslot, "pg_%u_sync_%u_" UINT64_FORMAT, > > suboid, > > > > relid, GetSystemIdentifier()); > > > > } > > > > > > > > If we use both schema name and table name, it's possible that slot > > > > names are duplicated if schema and/or table names are long. Another > > > > idea is to use the hash value of schema+table names, but it cannot > > > > completely eliminate that possibility, and probably would make > > > > investigation and debugging hard in case of any failure. Probably we > > > > can use the OID of each entry in pg_subscription_remote_rel instead, > > > > but I'm not sure it's a good idea, mainly because we will end up using > > > > twice as many OIDs as before. > > > > > > > > > > The other possibility is to use worker_pid. To make debugging easier, > > > we may want to LOG schema_name+rel_name vs slot_name mapping at > > DEBUG1 > > > log level. > > > > Since worker_pid changes after the worker restarts, a new worker > > cannot find the slot that had been used, no? > > > > After thinking it over, a better solution would be that we add an oid > > column, nspname column, and relname column to pg_subscription_rel and > > the primary key on the oid. If the table is not present on the > > subscriber we store the schema name and table name to the catalog, and > > otherwise we store the local table oid same as today. The local table > > oid will be filled after the schema sync. The names of origin and > > replication slot the tablesync worker uses use the oid instead of the > > table oid. > > > > I've attached a PoC patch of this idea (very rough patch and has many > > TODO comments). It mixes the following changes: > > > > 1. Add oid column to the pg_subscription_rel. 
The oid is used as the > > primary key and in the names of origin and slot the tablesync workers > > use. > > > > 2. Add copy_schema = on/off option to CREATE SUBSCRIPTION (not yet > > support for ALTER SUBSCRIPTION). > > > > 3. Add CRS_EXPORT_USE_SNAPSHOT new action in order to use the same > > snapshot by both walsender and other processes (e.g. pg_dump). In this > > patch, the snapshot is exported for pg_dump and is used by the > > walsender for COPY. > > > > It seems to work well but there might be a pitfall as I've not fully > > implemented it. > > Thanks for your POC patch. > After reviewing this patch, I have a question below that want to confirm: > > 1. In the function synchronize_table_schema. > I think some changes to GUC and table-related object SQLs are included in the > pg_dump result. And in this POC, these SQLs will be executed. Do we need to > alter the pg_dump results to only execute the table schema related SQLs? Yes, in this approach, we need to dump/restore objects while specifying with fine granularity. Ideally, the table sync worker dumps and restores the table schema, does copy the initial data, and then creates indexes, and triggers and table-related objects are created after that. So if we go with the pg_dump approach to copy the schema of individual tables, we need to change pg_dump (or libpgdump needs to be able to do) to support it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Yes, in this approach, we need to dump/restore objects while > specifying with fine granularity. Ideally, the table sync worker dumps > and restores the table schema, does copy the initial data, and then > creates indexes, and triggers and table-related objects are created > after that. So if we go with the pg_dump approach to copy the schema > of individual tables, we need to change pg_dump (or libpgdump needs to > be able to do) to support it. We have been discussing how to sync schema but I'd like to step back a bit and discuss use cases and requirements of this feature. Suppose that a table belongs to a publication, what objects related to the table we want to sync by the initial schema sync features? IOW, do we want to sync table's ACLs, tablespace settings, triggers, and security labels too? If we want to replicate the whole database, e.g. when using logical replication for major version upgrade, it would be convenient if it synchronizes all table-related objects. However, if we have only this option, it could be useless in some cases. For example, in a case where users have different database users on the subscriber than the publisher, they might want to sync only CREATE TABLE, and set ACL etc by themselves. In this case, it would not be necessary to sync ACL and security labels. What use case do we want to support by this feature? I think the implementation could be varied depending on how to select what objects to sync. One possible idea is to select objects to sync depending on how DDL replication is set in the publisher. It's straightforward but I'm not sure the design of DDL replication syntax has been decided. Also, even if we create a publication with ddl = 'table' option, it's not clear to me that we want to sync table-dependent triggers, indexes, and rules too by the initial sync feature. 
Second idea is to make it configurable by users so that they can specify what objects to sync. But it would make the feature complex and I'm not sure users can use it properly. Third idea is that since the use case of synchronizing the whole database can be achievable even by pg_dump(all), we support synchronizing only tables (+ indexes) in the initial sync feature, which can not be achievable by pg_dump. Feedback is very welcome. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Yes, in this approach, we need to dump/restore objects while > > specifying with fine granularity. Ideally, the table sync worker dumps > > and restores the table schema, does copy the initial data, and then > > creates indexes, and triggers and table-related objects are created > > after that. So if we go with the pg_dump approach to copy the schema > > of individual tables, we need to change pg_dump (or libpgdump needs to > > be able to do) to support it. > > We have been discussing how to sync schema but I'd like to step back a > bit and discuss use cases and requirements of this feature. > > Suppose that a table belongs to a publication, what objects related to > the table we want to sync by the initial schema sync features? IOW, do > we want to sync table's ACLs, tablespace settings, triggers, and > security labels too? > > If we want to replicate the whole database, e.g. when using logical > replication for major version upgrade, it would be convenient if it > synchronizes all table-related objects. However, if we have only this > option, it could be useless in some cases. For example, in a case > where users have different database users on the subscriber than the > publisher, they might want to sync only CREATE TABLE, and set ACL etc > by themselves. In this case, it would not be necessary to sync ACL and > security labels. > > What use case do we want to support by this feature? I think the > implementation could be varied depending on how to select what objects > to sync. > > One possible idea is to select objects to sync depending on how DDL > replication is set in the publisher. It's straightforward but I'm not > sure the design of DDL replication syntax has been decided. 
Also, even > if we create a publication with ddl = 'table' option, it's not clear > to me that we want to sync table-dependent triggers, indexes, and > rules too by the initial sync feature. > I think it is better to keep the initial sync the same as the replication. So, if the publication specifies 'table' then we should just synchronize tables. Otherwise, it will look odd that the initial sync has synchronized say index-related DDLs but then later replication didn't replicate it. OTOH, if we want to do initial sync of table-dependent objects like triggers, indexes, rules, etc. when the user has specified ddl = 'table' then the replication should also follow the same. The main reason to exclude the other objects during replication is to reduce the scope of deparsing patch but if we have a finite set of objects (say all dependent on the table) then we can probably try to address those. > Second idea is to make it configurable by users so that they can > specify what objects to sync. But it would make the feature complex > and I'm not sure users can use it properly. > > Third idea is that since the use case of synchronizing the whole > database can be achievable even by pg_dump(all), we support > synchronizing only tables (+ indexes) in the initial sync feature, > which can not be achievable by pg_dump. > Can't we add some switch to dump only the table and not its dependents if we want to go with that approach? -- With Regards, Amit Kapila.
On Tue, May 23, 2023 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, May 22, 2023 at 6:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Apr 28, 2023 at 4:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Yes, in this approach, we need to dump/restore objects while > > > specifying with fine granularity. Ideally, the table sync worker dumps > > > and restores the table schema, does copy the initial data, and then > > > creates indexes, and triggers and table-related objects are created > > > after that. So if we go with the pg_dump approach to copy the schema > > > of individual tables, we need to change pg_dump (or libpgdump needs to > > > be able to do) to support it. > > > > We have been discussing how to sync schema but I'd like to step back a > > bit and discuss use cases and requirements of this feature. > > > > Suppose that a table belongs to a publication, what objects related to > > the table we want to sync by the initial schema sync features? IOW, do > > we want to sync table's ACLs, tablespace settings, triggers, and > > security labels too? > > > > If we want to replicate the whole database, e.g. when using logical > > replication for major version upgrade, it would be convenient if it > > synchronizes all table-related objects. However, if we have only this > > option, it could be useless in some cases. For example, in a case > > where users have different database users on the subscriber than the > > publisher, they might want to sync only CREATE TABLE, and set ACL etc > > by themselves. In this case, it would not be necessary to sync ACL and > > security labels. > > > > What use case do we want to support by this feature? I think the > > implementation could be varied depending on how to select what objects > > to sync. > > > > One possible idea is to select objects to sync depending on how DDL > > replication is set in the publisher. 
It's straightforward but I'm not > > sure the design of DDL replication syntax has been decided. Also, even > > if we create a publication with ddl = 'table' option, it's not clear > > to me that we want to sync table-dependent triggers, indexes, and > > rules too by the initial sync feature. > > > > I think it is better to keep the initial sync the same as the > replication. So, if the publication specifies 'table' then we should > just synchronize tables. Otherwise, it will look odd that the initial > sync has synchronized say index-related DDLs but then later > replication didn't replicate it. OTOH, if we want to do initial sync > of table-dependent objects like triggers, indexes, rules, etc. when > the user has specified ddl = 'table' then the replication should also > follow the same. The main reason to exclude the other objects during > replication is to reduce the scope of deparsing patch but if we have a > finite set of objects (say all dependent on the table) then we can > probably try to address those. > We have discussed several ideas of how to synchronize schemas between publisher and subscribers, and the points are summarized in Wiki page[1]. As for the idea of using pg_dump, we were concerned that pg_dump needs to be present along with the server binary if the user needs to use the initial schema synchronization feature. Since these binaries are typically included in different packages, they need to install both. During PGCon we've discussed with some senior hackers that it would be an acceptable limitation for users. When executing CREATE/ALTER SUBSCRIPTION, we check if pg_dump is available and raise an error if not. We've also discussed the idea of using pg_dump_library but no one preferred this idea because of its implementation costs. Therefore, I'm going to do further evaluation for the pg_dump idea. I agree with Amit that the initial schema synchronization should process the same as the DDL replication. 
We can support only table schemas as the first step. To do that, we need a new switch, say --exclude-table-dependents, in pg_dump to dump only table schemas excluding table-related objects such as triggers and indexes. Then, we can support synchronizing tables and table-related objects such as triggers, indexes, and rules, as the second step, which can be done with the --schema and --table option. Finally, we can synchronize the whole database by using the --schema option. We also need to research how to integrate the initial schema synchronization with tablesync workers. We have a PoC patch[2]. Regards, [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jun 8, 2023 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > ... > We also need to research how to integrate the initial schema > synchronization with tablesync workers. We have a PoC patch[2]. > > Regards, > > [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync > [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com > FYI -- the PoC patch fails to apply using HEAD fetched today. git apply ../patches_misc/0001-Poc-initial-table-structure-synchronization-in-logic.patch error: patch failed: src/backend/replication/logical/tablesync.c:1245 error: src/backend/replication/logical/tablesync.c: patch does not apply ------ Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Jun 15, 2023 at 4:14 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Jun 8, 2023 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > ... > > > We also need to research how to integrate the initial schema > > synchronization with tablesync workers. We have a PoC patch[2]. > > > > Regards, > > > > [1] https://wiki.postgresql.org/wiki/Logical_replication_of_DDLs#Initial_Schema_Sync > > [2] https://www.postgresql.org/message-id/CAD21AoCdfg506__qKz%2BHX8vqfdyKgQ5qeehgqq9bi1L-6p5Pwg%40mail.gmail.com > > > > FYI -- the PoC patch fails to apply using HEAD fetched today. > > git apply ../patches_misc/0001-Poc-initial-table-structure-synchronization-in-logic.patch > error: patch failed: src/backend/replication/logical/tablesync.c:1245 > error: src/backend/replication/logical/tablesync.c: patch does not apply > After rebasing the PoC patch locally, I found the 'make check' still did not pass 100%. # 2 of 215 tests failed. Here are the differences: diff -U3 /home/postgres/oss_postgres_misc/src/test/regress/expected/rules.out /home/postgres/oss_postgres_misc/src/test/regress/results/rules.out --- /home/postgres/oss_postgres_misc/src/test/regress/expected/rules.out 2023-06-02 23:12:32.073864475 +1000 +++ /home/postgres/oss_postgres_misc/src/test/regress/results/rules.out 2023-06-15 16:53:29.352622676 +1000 @@ -2118,14 +2118,14 @@ su.subname, st.pid, st.leader_pid, - st.relid, + st.subrelid, st.received_lsn, st.last_msg_send_time, st.last_msg_receipt_time, st.latest_end_lsn, st.latest_end_time FROM (pg_subscription su - LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid, relid, pid, leader_pid, received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time) ON ((st.subid = su.oid))); + LEFT JOIN pg_stat_get_subscription(NULL::oid) st(subid, subrelid, pid, leader_pid, received_lsn, last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time) ON ((st.subid = su.oid))); pg_stat_subscription_stats| SELECT 
ss.subid, s.subname, ss.apply_error_count, diff -U3 /home/postgres/oss_postgres_misc/src/test/regress/expected/oidjoins.out /home/postgres/oss_postgres_misc/src/test/regress/results/oidjoins.out --- /home/postgres/oss_postgres_misc/src/test/regress/expected/oidjoins.out 2022-10-04 15:11:32.457834981 +1100 +++ /home/postgres/oss_postgres_misc/src/test/regress/results/oidjoins.out 2023-06-15 16:54:07.159839010 +1000 @@ -265,4 +265,3 @@ NOTICE: checking pg_subscription {subdbid} => pg_database {oid} NOTICE: checking pg_subscription {subowner} => pg_authid {oid} NOTICE: checking pg_subscription_rel {srsubid} => pg_subscription {oid} -NOTICE: checking pg_subscription_rel {srrelid} => pg_class {oid} ------ Kind Regards, Peter Smith. Fujitsu Australia
Hi, Below are my review comments for the PoC patch 0001. In addition, the patch needed rebasing, and, after I rebased it locally in my private environment there were still test failures: a) The 'make check' tests fail but only in a minor way due to changes colname b) the subscription TAP test did not work at all for me -- many errors. ====== Commit message. 1. - Add oid column to the pg_subscription_rel. - use it as the primary key. - use it in the names of origin and slot the tablesync workers use. ~ IIUC, I think there were lots of variables called 'subrelid' referring to this new 'oid' field. But, somehow I found that very confusing with the other similarly named 'relid'. I wonder if all those can be named like 'sroid' or 'srid' to reduce the confusion of such similar names? ====== src/backend/catalog/pg_subscription.c 2. AddSubscriptionRelState I felt should be some sanity check Asserts for the args here. E.g. Cannot have valid relid when copy_schema == true, etc. ~~~ 3. + if (nspname) + values[Anum_pg_subscription_rel_srnspname - 1] = CStringGetDatum(nspname); + else + nulls[Anum_pg_subscription_rel_srnspname - 1] = true; + + if (relname) + values[Anum_pg_subscription_rel_srrelname - 1] = CStringGetDatum(relname); + else + nulls[Anum_pg_subscription_rel_srrelname - 1] = true; Here is where I was wondering why not pass the nspname and relname all the time, even for valid 'relid' (when copy_schema is false). It should simplify some code, as well as putting more useful/readable information into the catalog. ~~~ 4. UpdateSubscriptionRelRelid + /* XXX: need to distinguish from message in UpdateSubscriptionRelState() */ + if (!HeapTupleIsValid(tup)) + elog(ERROR, "subscription table %u in subscription %u does not exist", + subrelid, subid); Is that ERROR msg correct? IIUC the 'subrelid' is the Oid of the row in the catalog -- it is not the "subscription table" Oid. ~~~ 5. 
UpdateSubscriptionRelState if (!HeapTupleIsValid(tup)) elog(ERROR, "subscription table %u in subscription %u does not exist", - relid, subid); + subrelid, subid); (ditto previous review comment) Is that ERROR msg correct? IIUC the subrelid is the Oid of the row in the catalog -- it is not the "subscription table" Oid. ~~~ 6. GetSubscriptoinRelStateByRelid There is a spelling mistake in this function name /Subscriptoin/Subscription/ ~~~ 7. + ScanKeyInit(&skey[0], + Anum_pg_subscription_rel_srrelid, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(relid)); + ScanKeyInit(&skey[1], + Anum_pg_subscription_rel_srsubid, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(subid)); Won't it be better to swap the order of these so it matches the function comment "(srsubid, srrelid)". ~~~ 8. + tup = systable_getnext(scan); + + + if (!HeapTupleIsValid(tup)) Double blank lines ~~~ 9. /* Get palloc'ed SubscriptionRelState of the given subrelid */ SubscriptionRelState * GetSubscriptionRelByOid(Oid subrelid) ~ There seems some function name confusion because the struct is called SubscriptionRelState and it also has a 'state' field. e.g. The functions named GetSubscriptionRelStateXXX return only the state field of the struct. OTOH, this function returns the SubscriptionRelState* but it is NOT called GetSubscriptionRelStateByOid (??). ~~~ 10. deconstruct_subrelstate + /* syncflags */ + relstate->syncflags = + (((subrel_form->srsyncschema) ? SUBREL_SYNC_KIND_SCHEMA : 0) | + ((subrel_form->srsyncdata) ? SUBREL_SYNC_KIND_DATA : 0)); Seems excessive parens. ~~~ 11. + return relstate; +} /* * Drop subscription relation mapping. These can be for a particular * subscription, or for a particular relation, or both. */ void -RemoveSubscriptionRel(Oid subid, Oid relid) +RemoveSubscriptionRel(Oid subid, Oid relid, Oid subrelid) ~ There is no blank line before this function ~~~ 12. 
RemoveSubscriptionRel

-RemoveSubscriptionRel(Oid subid, Oid relid)
+RemoveSubscriptionRel(Oid subid, Oid relid, Oid subrelid)
 {

~

IIUC what you called 'subrelid' is the PK, so would it make more sense for that to be the 1st parameter of this function?

======
src/backend/commands/subscriptioncmds.c

13. struct SubOpts

 bool copy_data;
+ /* XXX: want to choose synchronizing only tables or all objects? */
+ bool copy_schema;

I wonder if it would be more natural to put the 'copy_schema' field before the 'copy_data' field?

~~~

14. parse_subscription_options

 if (IsSet(supported_opts, SUBOPT_COPY_DATA))
 opts->copy_data = true;
+ if (IsSet(supported_opts, SUBOPT_COPY_SCHEMA))
+ opts->copy_data = true;

14a. I wonder if it would be more natural to put the COPY_SCHEMA logic before the COPY_DATA logic?

~

14b. Is this a bug? Why is this assigning copy_data = true, instead of copy_schema = true?

~~~

15.

 opts->specified_opts |= SUBOPT_COPY_DATA;
 opts->copy_data = defGetBoolean(defel);
 }
+ else if (IsSet(supported_opts, SUBOPT_COPY_SCHEMA) &&
+ strcmp(defel->defname, "copy_schema") == 0)
+ {
+ if (IsSet(opts->specified_opts, SUBOPT_COPY_SCHEMA))
+ errorConflictingDefElem(defel, pstate);
+
+ opts->specified_opts |= SUBOPT_COPY_SCHEMA;
+ opts->copy_schema = defGetBoolean(defel);
+ }

I wonder if it would be more natural to put the COPY_SCHEMA logic before the COPY_DATA logic?

~~~

16.

+ if (opts->copy_schema &&
+ IsSet(opts->specified_opts, SUBOPT_COPY_SCHEMA))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("%s and %s are mutually exclusive options",
+ "connect = false", "copy_schema = true")));
+

I wonder if it would be more natural to put the COPY_SCHEMA logic before the COPY_DATA logic?

~~~

17. CreateSubscription

 * Set sync state based on if we were asked to do data copy or
 * not.
 */
- table_state = opts.copy_data ? SUBREL_STATE_INIT : SUBREL_STATE_READY;
+ if (opts.copy_data || opts.copy_schema)
+ table_state = SUBREL_STATE_INIT;
+ else
+ table_state = SUBREL_STATE_READY;

The comment prior to this code needs updating; it still only mentions "data copy".

~~~

18. AlterSubscription_refresh

+ sub_remove_rels[remove_rel_len].relid = subrelid;
 sub_remove_rels[remove_rel_len++].state = state;

~

Is that right? IIUC that 'subrelid' is the OID PK of the row in pg_subscription_rel, which is not the same as the 'relid'. Shouldn't this be

sub_remove_rels[remove_rel_len].relid = relstate->relid;

~~~

19.

+ if (OidIsValid(relstate->relid))
+ ereport(DEBUG1,
+ (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
+ get_namespace_name(get_rel_namespace(relstate->relid)),
+ get_rel_name(relstate->relid),
+ sub->name)));
+ else
+ ereport(DEBUG1,
+ (errmsg_internal("table \"%s.%s\" removed from subscription \"%s\"",
+ relstate->nspname, relstate->relname,
+ sub->name)));

I wondered why we can't just always store nspname and relname, even for a valid 'relid' when there is no copy_schema. Won't that simplify code such as this?

======
src/backend/replication/logical/launcher.c

20. logicalrep_worker_find

- if (w->in_use && w->subid == subid && w->relid == relid &&
+ if (w->in_use && w->subid == subid && w->subrelid == subrelid &&
 (!only_running || w->proc))
 {

~

Maybe I misunderstand something, but somehow it seems strange to be checking both the 'subid' and the Oid PK ('subrelid') here. Isn't it that when subrelid is valid you need to test only 'subrelid' (aka tablesync) for equality? But when subrelid is InvalidOid (aka not a tablesync worker) you only need to test subid for equality?

~~~

21.
logicalrep_worker_launch

 bool is_parallel_apply_worker = (subworker_dsm != DSM_HANDLE_INVALID);

 /* Sanity check - tablesync worker cannot be a subworker */
- Assert(!(is_parallel_apply_worker && OidIsValid(relid)));
+ Assert(!(is_parallel_apply_worker && OidIsValid(subrelid)));

IIUC I thought this code might be easier to understand if you introduced another variable

bool is_tablesync_worker = OidIsValid(subrelid);

~~~

22.

+ if (OidIsValid(subrelid) && nsyncworkers >= max_sync_workers_per_subscription)

(ditto previous comment)

~~~

23.

- if (OidIsValid(relid))
+ if (OidIsValid(subrelid))
 snprintf(bgw.bgw_name, BGW_MAXLEN,
- "logical replication worker for subscription %u sync %u", subid, relid);
+ "logical replication worker for subscription %u sync %u", subid, subrelid);

This name seems somehow less useful to the user now. IIUC 'subrelid' is just the PK of the pg_subscription_rel catalog instead of the relid. Does this require changes to the documentation that might have been saying this is the relid?

~~~

24. logicalrep_worker_stop

 * Stop the logical replication worker for subid/relid, if any.
 */
 void
-logicalrep_worker_stop(Oid subid, Oid relid)
+logicalrep_worker_stop(Oid subid, Oid subrelid)

The function comment is still talking about relid.

======
src/backend/replication/logical/snapbuild.c

25. SnapBuildExportSnapshot

-SnapBuildExportSnapshot(SnapBuild *builder)
+SnapBuildExportSnapshot(SnapBuild *builder, bool use_it)

'use_it' does not seem like a good parameter name. At least, maybe the function comment can describe the meaning of use_it.

~~~

26.

- /* There doesn't seem to a nice API to set these */
- XactIsoLevel = XACT_REPEATABLE_READ;
- XactReadOnly = true;
+ /* There doesn't seem to a nice API to set these */
+ XactIsoLevel = XACT_REPEATABLE_READ;
+ XactReadOnly = true;
+ }
+ else
+ Assert(IsTransactionBlock());

Although it is not introduced by this patch, since you change the indent on this line you might as well at the same time fix the typo on this line.
/seem to a nice/seem to be a nice/

======
src/backend/replication/logical/tablesync.c

27. process_syncing_tables_for_sync

 UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- MyLogicalRepWorker->relid,
+ MyLogicalRepWorker->subrelid,
 MyLogicalRepWorker->relstate,
 MyLogicalRepWorker->relstate_lsn);

IIUC the 'subrelid' is now the PK. Isn't it better for that to be the 1st param?

~~~

28.

+ if ((syncflags & SUBREL_SYNC_KIND_SCHEMA) != 0)

There are several checks like the code shown above. Would it be better to have some macro for that expression? Or maybe simply assign this result to a local variable instead of testing the same thing multiple times.

~~~

29. synchronize_table_schema

 FILE *handle;
 Oid relid;
 Oid nspoid;
 StringInfoData command;
 StringInfoData querybuf;
 char full_path[MAXPGPATH];
 char buf[1024];
 int ret;

 if (find_my_exec("pg_dump", full_path) < 0)
 elog(ERROR, "\"%s\" was not found", "pg_dump")

~

Something is not quite right with the indentation in this new function.

~~~

30.

+ * XXX what if the table already doesn't exist?

I didn't understand the meaning of the comment. Is it supposed to say "What if the table already exists?" (??)

======
src/backend/replication/logical/worker.c

31. InitializeApplyWorker

+ {
+ if (OidIsValid(MyLogicalRepWorker->relid))
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has started",
+ MySubscription->name,
+ get_rel_name(MyLogicalRepWorker->relid))));
+ else
+ ereport(LOG,
+ (errmsg("logical replication table synchronization worker for subscription \"%s\", relid %u has started",
+ MySubscription->name,
+ MyLogicalRepWorker->subrelid))));
+ }

~

IIUC it doesn't seem right to say "relid %u has started", because that's not really a relid, is it? I thought it is just the PK Oid of the row in the catalog.

======
src/include/catalog/pg_subscription_rel.h

32. pg_subscription_rel

+ /* What part do we need to synchronize? */
+ bool srsyncschema;
+ bool srsyncdata;

These aren't really "parts".

SUGGESTION
/* What to synchronize? */

~~~

33.

typedef struct SubscriptionRelState
{
+ Oid oid;

Is that the pg_subscription_rel's oid? Maybe it would be better to call this field 'sroid'? (see the general comment in the commit message)

======
src/include/replication/walsender.h

34. CRSSnapshotAction

 CRS_EXPORT_SNAPSHOT,
 CRS_NOEXPORT_SNAPSHOT,
- CRS_USE_SNAPSHOT
+ CRS_USE_SNAPSHOT,
+ CRS_EXPORT_USE_SNAPSHOT
 } CRSSnapshotAction;

~

Should the CRS_USE_SNAPSHOT be renamed to CRS_NOEXPORT_USE_SNAPSHOT to have a more consistent naming pattern?

======
src/include/replication/worker_internal.h

35.

- /* Used for initial table synchronization. */
+ /*
+ * Used for initial table synchronization.
+ *
+ * relid is an invalid oid if the table is not created on the subscriber
+ * yet.
+ */
+ Oid subrelid;
 Oid relid;

It would be good to have more explanation of the different meanings of 'subrelid' versus 'relid' (see also the general comment suggesting to rename this)

------

Kind Regards,
Peter Smith.
Fujitsu Australia
On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi,
>
> Below are my review comments for the PoC patch 0001.
>
> In addition, the patch needed rebasing, and, after I rebased it
> locally in my private environment there were still test failures:
> a) The 'make check' tests fail but only in a minor way due to changes colname
> b) the subscription TAP test did not work at all for me -- many errors.

Thank you for reviewing the patch.

While updating the patch, I realized that the current approach won't work well, or at least has a problem with partitioned tables. If a publication has a partitioned table with publish_via_root = false, the subscriber launches tablesync workers for its partitions so that each tablesync worker copies the data of each partition. Similarly, if it has a partitioned table with publish_via_root = true, the subscriber launches a tablesync worker for the parent table. With the current design, since the tablesync worker is responsible for both schema and data synchronization for the target table, it won't be possible to synchronize both the parent table's schema and the partitions' schema. For example, there is no pg_subscription_rel entry for the parent table if the publication has publish_via_root = false. In addition to that, we need to be careful about the order of synchronization of the parent table and its partitions. We cannot start schema synchronization for partitions before their parent table. So it seems to me that we need to consider another approach.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Jul 5, 2023 at 11:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi, > > > > Below are my review comments for the PoC patch 0001. > > > > In addition, the patch needed rebasing, and, after I rebased it > > locally in my private environment there were still test failures: > > a) The 'make check' tests fail but only in a minor way due to changes colname > > b) the subscription TAP test did not work at all for me -- many errors. > > Thank you for reviewing the patch. > > While updating the patch, I realized that the current approach won't > work well or at least has the problem with partition tables. If a > publication has a partitioned table with publish_via_root = false, the > subscriber launches tablesync workers for its partitions so that each > tablesync worker copies data of each partition. Similarly, if it has a > partition table with publish_via_root = true, the subscriber launches > a tablesync worker for the parent table. With the current design, > since the tablesync worker is responsible for both schema and data > synchronization for the target table, it won't be possible to > synchronize both the parent table's schema and partitions' schema. For > example, there is no pg_subscription_rel entry for the parent table if > the publication has publish_via_root = false. In addition to that, we > need to be careful about the order of synchronization of the parent > table and its partitions. We cannot start schema synchronization for > partitions before its parent table. So it seems to me that we need to > consider another approach. So I've implemented a different approach; doing schema synchronization at a CREATE SUBSCRIPTION time. 
The backend executing CREATE SUBSCRIPTION uses pg_dump and restores the table schemas, including both partitioned tables and their partitions, regardless of the publish_via_partition_root option, and then creates pg_subscription_rel entries for tables while respecting the publish_via_partition_root option.

There is a window between the table creations and the tablesync workers starting to process the tables. If DDLs are executed in this window, the tablesync worker might fail because the table schema might have already been changed. We need to mention this note in the documentation. BTW, I think we will be able to get rid of this downside if we support DDL replication. DDLs executed in the window are applied by the apply worker, and it takes over the data copy to the tablesync worker at a certain LSN.

I've attached PoC patches. They have regression tests but don't have the documentation yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> So I've implemented a different approach; doing schema synchronization at a
> CREATE SUBSCRIPTION time. The backend executing CREATE SUBSCRIPTION
> uses pg_dump and restores the table schemas including both partitioned tables
> and their partitions regardless of publish_via_partition_root option, and then
> creates pg_subscription_rel entries for tables while respecting
> publish_via_partition_root option.
>
> There is a window between table creations and the tablesync workers starting to
> process the tables. If DDLs are executed in this window, the tablesync worker
> might fail because the table schema might have already been changed. We need
> to mention this note in the documentation. BTW, I think we will be able to get
> rid of this downside if we support DDL replication. DDLs executed in the window
> are applied by the apply worker and it takes over the data copy to the tablesync
> worker at a certain LSN.

I don't think even with DDL replication we will be able to get rid of this window. There are some issues:

1. Even with the tablesync worker taking over at a certain LSN, the publisher can make more changes until the tablesync worker acquires a lock on the publisher table via copy table.

2. How will we make sure that the apply worker has caught up with all the changes from the publisher before it starts the tablesync worker? It can lag behind the publisher.

I think the easiest option would be to just recreate the table; this way we don't have to worry about complex race conditions. Tablesync already makes a slot for copying data, and we can use the same slot for getting an up-to-date table definition. Dropping the table won't be very expensive since there won't be any data in it. The apply worker will skip all the DDLs/DMLs till the table is synced. Although for partitioned tables we will be able to keep up with published table schema changes only when publish_via_partition_root is true.

Regards
Sachin
Amazon Web Services: https://aws.amazon.com
On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi, > > > > Below are my review comments for the PoC patch 0001. > > > > In addition, the patch needed rebasing, and, after I rebased it > > locally in my private environment there were still test failures: > > a) The 'make check' tests fail but only in a minor way due to changes colname > > b) the subscription TAP test did not work at all for me -- many errors. > > Thank you for reviewing the patch. > > While updating the patch, I realized that the current approach won't > work well or at least has the problem with partition tables. If a > publication has a partitioned table with publish_via_root = false, the > subscriber launches tablesync workers for its partitions so that each > tablesync worker copies data of each partition. Similarly, if it has a > partition table with publish_via_root = true, the subscriber launches > a tablesync worker for the parent table. With the current design, > since the tablesync worker is responsible for both schema and data > synchronization for the target table, it won't be possible to > synchronize both the parent table's schema and partitions' schema. > I think one possibility to make this design work is that when publish_via_root is false, then we assume that subscriber already has parent table and then the individual tablesync workers can sync the schema of partitions and their data. And when publish_via_root is true, then the table sync worker is responsible to sync parent and child tables along with data. Do you think such a mechanism can address the partition table related cases? -- With Regards, Amit Kapila.
> From: Amit Kapila <amit.kapila16@gmail.com> > On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > > > Hi, > > > > > > Below are my review comments for the PoC patch 0001. > > > > > > In addition, the patch needed rebasing, and, after I rebased it > > > locally in my private environment there were still test failures: > > > a) The 'make check' tests fail but only in a minor way due to > > > changes colname > > > b) the subscription TAP test did not work at all for me -- many errors. > > > > Thank you for reviewing the patch. > > > > While updating the patch, I realized that the current approach won't > > work well or at least has the problem with partition tables. If a > > publication has a partitioned table with publish_via_root = false, the > > subscriber launches tablesync workers for its partitions so that each > > tablesync worker copies data of each partition. Similarly, if it has a > > partition table with publish_via_root = true, the subscriber launches > > a tablesync worker for the parent table. With the current design, > > since the tablesync worker is responsible for both schema and data > > synchronization for the target table, it won't be possible to > > synchronize both the parent table's schema and partitions' schema. > > > > I think one possibility to make this design work is that when publish_via_root > is false, then we assume that subscriber already has parent table and then > the individual tablesync workers can sync the schema of partitions and their > data. Since publish_via_partition_root is false by default users have to create parent table by themselves which I think is not a good user experience. > And when publish_via_root is true, then the table sync worker is > responsible to sync parent and child tables along with data. Do you think > such a mechanism can address the partition table related cases? >
On Mon, Jul 10, 2023 at 8:06 PM Kumar, Sachin <ssetiya@amazon.com> wrote: > > > > > From: Amit Kapila <amit.kapila16@gmail.com> > > On Wed, Jul 5, 2023 at 7:45 AM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> > > wrote: > > > > > > > > Hi, > > > > > > > > Below are my review comments for the PoC patch 0001. > > > > > > > > In addition, the patch needed rebasing, and, after I rebased it > > > > locally in my private environment there were still test failures: > > > > a) The 'make check' tests fail but only in a minor way due to > > > > changes colname > > > > b) the subscription TAP test did not work at all for me -- many errors. > > > > > > Thank you for reviewing the patch. > > > > > > While updating the patch, I realized that the current approach won't > > > work well or at least has the problem with partition tables. If a > > > publication has a partitioned table with publish_via_root = false, the > > > subscriber launches tablesync workers for its partitions so that each > > > tablesync worker copies data of each partition. Similarly, if it has a > > > partition table with publish_via_root = true, the subscriber launches > > > a tablesync worker for the parent table. With the current design, > > > since the tablesync worker is responsible for both schema and data > > > synchronization for the target table, it won't be possible to > > > synchronize both the parent table's schema and partitions' schema. > > > > > > > I think one possibility to make this design work is that when publish_via_root > > is false, then we assume that subscriber already has parent table and then > > the individual tablesync workers can sync the schema of partitions and their > > data. > > Since publish_via_partition_root is false by default users have to create parent table by themselves > which I think is not a good user experience. I have the same concern. 
I think that users normally use publish_via_partition_root = false if the partitioned table on the subscriber consists of the same set of partitions as the publisher's ones. And such users would expect both the partitioned table and its partitions to be synchronized.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi Everyone, based on an internal discussion with Masahiko, I have implemented concurrent DDL support for initial schema sync.

Concurrent patch workflow:

1. When the tablesync worker creates a replication slot, it will save the slot LSN into pg_subscription_rel with the SUBREL_SYNC_SCHEMA_DATA_SYNC state, and it will wait for its state to become SUBREL_STATE_DATASYNC.

2. The apply process will apply DDLs till the tablesync LSN, and then it will change the pg_subscription_rel state to SUBREL_STATE_DATASYNC.

3. The tablesync worker will continue applying pending DMLs/DDLs till it catches up.

This patch needs DDL replication to apply concurrent DDLs; I have cherry-picked this DDL patch [0].

Issues
1) Needs testing for concurrent DDLs; not sure how to make the tablesync process wait so that concurrent DDLs can be issued on the publisher.
2) In my testing the created table does not appear on the same connection on the subscriber; I have to reconnect to see the table.
3) Maybe use different chars for SUBREL_SYNC_SCHEMA_DATA_INIT and SUBREL_SYNC_SCHEMA_DATA_SYNC; currently they are 'x' and 'y'.
4) I need to add SUBREL_SYNC_SCHEMA_DATA_INIT and SUBREL_SYNC_SCHEMA_DATA_SYNC to pg_subscription_rel_d.h to make it compile successfully.
5) It only implements concurrent ALTER as of now.

[0] = https://www.postgresql.org/message-id/OS0PR01MB57163E6487EFF7378CB8E17C9438A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
On Thu, 31 Aug 2023 at 17:18, Kumar, Sachin <ssetiya@amazon.com> wrote: > > Hi Everyone, based on internal discussion with Masahiko > I have implemented concurrent DDL support for initial schema sync. > > Concurrent Patch workflow > > 1. When TableSync worker creates a replicaton slot, It will > save the slot lsn into pg_subscription_rel with > SUBREL_SYNC_SCHEMA_DATA_SYNC state, and it will wait for > its state to be SUBREL_STATE_DATASYNC. > > 2. Applier process will apply DDLs till tablesync lsn, and then > it will change pg_subscription_rel state to SUBREL_STATE_DATASYNC. > > 3. TableSync will continue applying pending DML/DDls till it catch up. > > This patch needs DDL replication to apply concurrent DDLs, I have cherry- > picked this DDL patch [0] Can you rebase the patch and post the complete set of required changes for the concurrent DDL, I will have a look at them. Regards, Vignesh
On Fri, 7 Jul 2023 at 12:41, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jul 5, 2023 at 11:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jun 19, 2023 at 5:29 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Hi, > > > > > > Below are my review comments for the PoC patch 0001. > > > > > > In addition, the patch needed rebasing, and, after I rebased it > > > locally in my private environment there were still test failures: > > > a) The 'make check' tests fail but only in a minor way due to changes colname > > > b) the subscription TAP test did not work at all for me -- many errors. > > > > Thank you for reviewing the patch. > > > > While updating the patch, I realized that the current approach won't > > work well or at least has the problem with partition tables. If a > > publication has a partitioned table with publish_via_root = false, the > > subscriber launches tablesync workers for its partitions so that each > > tablesync worker copies data of each partition. Similarly, if it has a > > partition table with publish_via_root = true, the subscriber launches > > a tablesync worker for the parent table. With the current design, > > since the tablesync worker is responsible for both schema and data > > synchronization for the target table, it won't be possible to > > synchronize both the parent table's schema and partitions' schema. For > > example, there is no pg_subscription_rel entry for the parent table if > > the publication has publish_via_root = false. In addition to that, we > > need to be careful about the order of synchronization of the parent > > table and its partitions. We cannot start schema synchronization for > > partitions before its parent table. So it seems to me that we need to > > consider another approach. > > So I've implemented a different approach; doing schema synchronization > at a CREATE SUBSCRIPTION time. 
The backend executing CREATE > SUBSCRIPTION uses pg_dump and restores the table schemas including > both partitioned tables and their partitions regardless of > publish_via_partition_root option, and then creates > pg_subscription_rel entries for tables while respecting > publish_via_partition_root option. > > There is a window between table creations and the tablesync workers > starting to process the tables. If DDLs are executed in this window, > the tablesync worker might fail because the table schema might have > already been changed. We need to mention this note in the > documentation. BTW, I think we will be able to get rid of this > downside if we support DDL replication. DDLs executed in the window > are applied by the apply worker and it takes over the data copy to the > tablesync worker at a certain LSN. > > I've attached PoC patches. It has regression tests but doesn't have > the documentation yet. Few thoughts: 1) There might be a scenario where we will create multiple subscriptions with the tables overlapping across the subscription, in that case, the table will be present when the 2nd subscription is being created, can we do something in this case: + /* + * Error if the table is already present on the subscriber. Please note + * that concurrent DDLs can create the table as we don't acquire any lock + * on the table. + * + * XXX: do we want to overwrite it (or optionally)? + */ + if (OidIsValid(RangeVarGetRelid(rv, AccessShareLock, true))) + ereport(ERROR, + (errmsg("existing table %s cannot synchronize table schema", + rv->relname))); 2) Should we clean the replication slot in case of failures, currently the replication slot is left over. 3) Is it expected that all of the dependencies like type/domain etc should be created by the user before creating a subscription with copy_schema, currently we are taking care of creating the sequences for tables, is this an exception? 
4) If a column list publication is created, currently we are getting all of the columns, should we get only the specified columns in this case? Regards, Vignesh
> From: vignesh C <vignesh21@gmail.com>
> Sent: Thursday, October 19, 2023 10:41 AM
> Can you rebase the patch and post the complete set of required changes for
> the concurrent DDL, I will have a look at them.

Sure, I will try to send the complete rebased patch within a week.

Regards
Sachin