RE: Initial Schema Sync for Logical Replication - Mailing list pgsql-hackers

From Kumar, Sachin
Subject RE: Initial Schema Sync for Logical Replication
Date
Msg-id b8c228aee5564248ae9c1c6d48811d70@amazon.com
Whole thread Raw
In response to Re: Initial Schema Sync for Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Initial Schema Sync for Logical Replication  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
> From: Amit Kapila <amit.kapila16@gmail.com>
> IIUC, this is possible only if tablesync process uses a snapshot different than the
> snapshot we have used to perform the initial schema sync, otherwise, this
> shouldn't be a problem. Let me try to explain my understanding with an example
> (the LSNs used are just explain the
> problem):
> 
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
> 
> Now, say before starting tablesync worker, apply process performs initial
> schema sync and uses a snapshot corresponding to LSN 100. Then it starts
> tablesync process to allow the initial copy of data in t1.
> Here, if the table sync process tries to establish a new snapshot, it may get data
> till LSN 130 and when it will try to copy the same in subscriber it will fail. Is my
> understanding correct about the problem you described?
Right
> If so, can't we allow
> tablesync process to use the same exported snapshot as we used for the initial
> schema sync and won't that solve the problem you described?
I think we won't be able to use same snapshot because the transaction will be committed.
In CreateSubscription() we can use the transaction snapshot from walrcv_create_slot()
till walrcv_disconnect() is called.(I am not sure about this part maybe walrcv_disconnect() calls
the commits internally ?). 
So somehow we need to keep this snapshot alive, even after transaction is committed(or delay committing
the transaction , but we can have CREATE SUBSCRIPTION with ENABLED=FALSE, so we can have a restart before 
tableSync is able to use the same snapshot.) 
> > Refresh publication :-
> >
> > In refresh publication, subscriber does create a new replication slot
Typo-> subscriber does not
> > hence , we can’t run
> >
> > pg_dump with a snapshot which starts from origin(maybe this is not an
> > issue at all). In this case
> >
> > it makes more sense for tableSync worker to do schema sync.
> >
> 
> Can you please explain this problem with some examples?
I think we can have same issues as you mentioned
New table t1 is added to the publication , User does a refresh publication.
pg_dump / pg_restore restores the table definition. But before tableSync
can start,  steps from 2 to 5 happen on the publisher.
> 1. Create Table t1(c1, c2); --LSN: 90
> 2. Insert t1 (1, 1); --LSN 100
> 3. Insert t1 (2, 2); --LSN 110
> 4. Alter t1 Add Column c3; --LSN 120
> 5. Insert t1 (3, 3, 3); --LSN 130
And table sync errors out
There can be one more issue , since we took the pg_dump without snapshot (wrt to replication slot).
(I am not 100 percent sure about this).
Lets imagine applier process is lagging behind publisher. 
Events on publisher
1. alter t1 drop column c; LSN 100 <-- applier process tries to execute this DDL
2. alter t1 drop column d; LSN 110
3. insert into t1 values(..); LSN 120 <-- (Refresh publication called )pg_dump/restore restores this version
Applier process executing 1 will fail because t1 does not have column c. 
Regards
Sachin

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Non-superuser subscription owners
Next
From: Robert Haas
Date:
Subject: Re: Transparent column encryption