Thread: Single transaction in the tablesync worker?

Single transaction in the tablesync worker?

From
Amit Kapila
Date:
The tablesync worker in logical replication performs the table data
sync in a single transaction which means it will copy the initial data
and then catch up with apply worker in the same transaction. There is
a comment in LogicalRepSyncTableStart ("We want to do the table data
sync in a single transaction.") saying so but I can't find the
concrete theory behind the same. Is there any fundamental problem if
we commit the transaction after initial copy and slot creation in
LogicalRepSyncTableStart and then allow the apply of transactions as
it happens in apply worker? I have tried doing so in the attached (a
quick prototype to test) and didn't find any problems with regression
tests. I have tried a few manual tests as well to see if it works and
didn't find any problem. Now, it is quite possible that it is
mandatory to do the way we are doing currently, or maybe something
else is required to remove this requirement but I think we can do
better with respect to comments in this area.

The reason why I am looking into this area is to support the logical
decoding of prepared transactions. See the problem [1] reported by
Peter Smith. Basically, when we stream prepared transactions in the
tablesync worker, it will simply commit the same due to the
requirement of maintaining a single transaction for the entire
duration of copy and streaming of transactions. Now, we can fix that
problem by disabling the decoding of prepared xacts in tablesync
worker. But that will give rise to a different kind of problem: the
prepare will not be sent by the publisher, but a later commit might
move the LSN to a later position, which will allow it to catch up to
the apply worker. So, now the prepared transaction will be skipped by
both the tablesync and apply workers.

I think apart from unblocking the development of 'logical decoding of
prepared xacts', it will make the code consistent between apply and
tablesync worker and reduce the chances of future bugs in this area.
Basically, it will reduce the checks related to am_tablesync_worker()
at various places in the code.

I see that this code is added as part of commit
7c4f52409a8c7d85ed169bbbc1f6092274d03920 (Logical replication support
for initial data copy).

Thoughts?

[1] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Ashutosh Bapat
Date:
On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> The tablesync worker in logical replication performs the table data
> sync in a single transaction which means it will copy the initial data
> and then catch up with apply worker in the same transaction. There is
> a comment in LogicalRepSyncTableStart ("We want to do the table data
> sync in a single transaction.") saying so but I can't find the
> concrete theory behind the same. Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker? I have tried doing so in the attached (a
> quick prototype to test) and didn't find any problems with regression
> tests. I have tried a few manual tests as well to see if it works and
> didn't find any problem. Now, it is quite possible that it is
> mandatory to do the way we are doing currently, or maybe something
> else is required to remove this requirement but I think we can do
> better with respect to comments in this area.

If we commit the initial copy, the data upto the initial copy's
snapshot will be visible downstream. If we apply the changes by
committing changes per transaction, the data visible to the other
transactions will differ as the apply progresses. You haven't
clarified whether we will respect the transaction boundaries in the
apply log or not. I assume we will. Whereas if we apply all the
changes in one go, other transactions either see the data before
resync or after it without any intermediate states. That will not
violate consistency, I think.

That's all I can think of as the reason behind doing a whole resync as
a single transaction.

-- 
Best Wishes,
Ashutosh Bapat



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The tablesync worker in logical replication performs the table data
> > sync in a single transaction which means it will copy the initial data
> > and then catch up with apply worker in the same transaction. There is
> > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > sync in a single transaction.") saying so but I can't find the
> > concrete theory behind the same. Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker? I have tried doing so in the attached (a
> > quick prototype to test) and didn't find any problems with regression
> > tests. I have tried a few manual tests as well to see if it works and
> > didn't find any problem. Now, it is quite possible that it is
> > mandatory to do the way we are doing currently, or maybe something
> > else is required to remove this requirement but I think we can do
> > better with respect to comments in this area.
>
> If we commit the initial copy, the data upto the initial copy's
> snapshot will be visible downstream. If we apply the changes by
> committing changes per transaction, the data visible to the other
> transactions will differ as the apply progresses.
>

It is not clear what you mean by the above.  The way you have written
appears that you are saying that instead of copying the initial data,
I am saying to copy it transaction-by-transaction. But that is not the
case. I am saying copy the initial data by using REPEATABLE READ
isolation level as we are doing now, commit it and then process
transaction-by-transaction till we reach sync-point (point till where
apply worker has already received the data).
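
To make this concrete, here is a rough sketch, at the
SQL/replication-protocol level, of the flow I am proposing (the slot
and table names are made up for illustration; the walsender commands
are shown as comments because they run on the publisher connection):

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- On the publisher connection (walsender protocol); creating the slot
-- with USE_SNAPSHOT makes the COPY below consistent with the slot:
-- CREATE_REPLICATION_SLOT "tap_sub_sync_t1" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT;
COPY t1 FROM STDIN;  -- the initial data copy
COMMIT;              -- proposed: commit here instead of keeping this
                     -- transaction open for the whole catch-up
-- ... then apply changes from the slot transaction-by-transaction
-- till the sync-point, committing each one as the apply worker does.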

> You haven't
> clarified whether we will respect the transaction boundaries in the
> apply log or not. I assume we will.
>

It will be transaction-by-transaction.

> Whereas if we apply all the
> changes in one go, other transactions either see the data before
> resync or after it without any intermediate states.
>

What is the problem even if the user is able to see the data after the
initial copy?

> That will not
> violate consistency, I think.
>

I am not sure how consistency will be broken.

> That's all I can think of as the reason behind doing a whole resync as
> a single transaction.
>

Thanks for sharing your thoughts.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Craig Ringer
Date:
On Thu, 3 Dec 2020 at 17:25, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Is there any fundamental problem if
> we commit the transaction after initial copy and slot creation in
> LogicalRepSyncTableStart and then allow the apply of transactions as
> it happens in apply worker?

No fundamental problem. Both approaches are fine. Committing the
initial copy then doing the rest in individual txns means an
incomplete sync state for the table becomes visible, which may not be
ideal. Ideally we'd do something like sync the data into a clone of
the table then swap the table relfilenodes out once we're synced up.

IMO the main advantage of committing as we go is that it would let us
use a non-temporary slot and support recovering an incomplete sync and
finishing it after interruption by connection loss, crash, etc. That
would be advantageous for big table syncs or where the sync has lots
of lag to replay. But it means we have to remember sync states, and
give users a way to cancel/abort them. Otherwise forgotten temp slots
for syncs will cause a mess on the upstream.

It also allows the sync slot to advance, freeing any held upstream
resources before the whole sync is done, which is good if the upstream
is busy and generating lots of WAL.
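
(As a side note, the WAL held back by a slot can be estimated on the
upstream with something like the below; pg_wal_lsn_diff and
pg_current_wal_lsn are the PG10+ names:)

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots;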

Finally, committing as we go means we won't exceed the cid increment
limit in a single txn.

> The reason why I am looking into this area is to support the logical
> decoding of prepared transactions. See the problem [1] reported by
> Peter Smith. Basically, when we stream prepared transactions in the
> tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions. Now, we can fix that
> problem by disabling the decoding of prepared xacts in tablesync
> worker.

Tablesync should indeed only receive a txn when the commit arrives, it
should not attempt to handle uncommitted prepared xacts.

> But that will give rise to a different kind of problem: the
> prepare will not be sent by the publisher, but a later commit might
> move the LSN to a later position, which will allow it to catch up to
> the apply worker. So, now the prepared transaction will be skipped by
> both the tablesync and apply workers.

I'm not sure I understand. If what you describe is possible then
there's already a bug in prepared xact handling. Prepared xact commit
progress should be tracked by commit lsn, not by prepare lsn.

Can you set out the ordering of events in more detail?

> I think apart from unblocking the development of 'logical decoding of
> prepared xacts', it will make the code consistent between apply and
> tablesync worker and reduce the chances of future bugs in this area.
> Basically, it will reduce the checks related to am_tablesync_worker()
> at various places in the code.

I think we made similar changes in pglogical to switch to applying
sync work in individual txns.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> > The reason why I am looking into this area is to support the logical
> > decoding of prepared transactions. See the problem [1] reported by
> > Peter Smith. Basically, when we stream prepared transactions in the
> > tablesync worker, it will simply commit the same due to the
> > requirement of maintaining a single transaction for the entire
> > duration of copy and streaming of transactions. Now, we can fix that
> > problem by disabling the decoding of prepared xacts in tablesync
> > worker.
>
> Tablesync should indeed only receive a txn when the commit arrives, it
> should not attempt to handle uncommitted prepared xacts.
>

Why? If we go with the approach of the commit as we go for individual
transactions in the tablesync worker then this shouldn't be a problem.

> > But that will give rise to a different kind of problem: the
> > prepare will not be sent by the publisher, but a later commit might
> > move the LSN to a later position, which will allow it to catch up to
> > the apply worker. So, now the prepared transaction will be skipped by
> > both the tablesync and apply workers.
>
> I'm not sure I understand. If what you describe is possible then
> there's already a bug in prepared xact handling. Prepared xact commit
> progress should be tracked by commit lsn, not by prepare lsn.
>

Oh no, I am talking about commit of some other transaction.

> Can you set out the ordering of events in more detail?
>

Sure. It will be something like the following, where the apply worker
is ahead of the sync worker:

Assume t1 has some data which tablesync worker has to first copy.

tx1
Begin;
Insert into t1....
Prepare Transaction 'foo'

tx2
Begin;
Insert into t1....
Commit

apply worker
• tx1: replays - does not apply anything because
should_apply_changes_for_rel thinks relation is not ready
• tx2: replays - does not apply anything because
should_apply_changes_for_rel thinks relation is not ready

tablesync worker
• tx1: handles: BEGIN - INSERT - PREPARE 'foo';  (but tablesync gets
nothing because say we disable 2-PC for it)
• tx2: handles: BEGIN - INSERT - COMMIT;
• tablesync exits

Now the situation is that the apply worker has skipped the prepared
xact data and the tablesync worker has not received it, so it has not
applied it. Next, when we get COMMIT PREPARED for tx1, it will
silently commit the prepared transaction without any data being
updated. The COMMIT PREPARED won't error out on the subscriber
because the prepare would have been successful even though the data
was skipped via should_apply_changes_for_rel.
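
To see this on the subscriber (assuming a build with the 2PC decoding
patch and the names from the example above), something like this
would show it:

-- After tx1's PREPARE has been replayed by the apply worker:
SELECT gid, prepared FROM pg_prepared_xacts;  -- shows 'foo'
-- After COMMIT PREPARED 'foo' is replayed:
SELECT count(*) FROM t1;  -- tx1's insert is missing, silently skipped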

> > I think apart from unblocking the development of 'logical decoding of
> > prepared xacts', it will make the code consistent between apply and
> > tablesync worker and reduce the chances of future bugs in this area.
> > Basically, it will reduce the checks related to am_tablesync_worker()
> > at various places in the code.
>
> I think we made similar changes in pglogical to switch to applying
> sync work in individual txns.
>

oh, cool. Did you make some additional changes as you have mentioned
in the earlier part of the email?

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> On Thu, 3 Dec 2020 at 17:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Is there any fundamental problem if
> > we commit the transaction after initial copy and slot creation in
> > LogicalRepSyncTableStart and then allow the apply of transactions as
> > it happens in apply worker?
>
> No fundamental problem. Both approaches are fine. Committing the
> initial copy then doing the rest in individual txns means an
> incomplete sync state for the table becomes visible, which may not be
> ideal. Ideally we'd do something like sync the data into a clone of
> the table then swap the table relfilenodes out once we're synced up.
>
> IMO the main advantage of committing as we go is that it would let us
> use a non-temporary slot and support recovering an incomplete sync and
> finishing it after interruption by connection loss, crash, etc. That
> would be advantageous for big table syncs or where the sync has lots
> of lag to replay. But it means we have to remember sync states, and
> give users a way to cancel/abort them. Otherwise forgotten temp slots
> for syncs will cause a mess on the upstream.
>
> It also allows the sync slot to advance, freeing any held upstream
> resources before the whole sync is done, which is good if the upstream
> is busy and generating lots of WAL.
>
> Finally, committing as we go means we won't exceed the cid increment
> limit in a single txn.
>


Yeah, all these are advantages of processing
transaction-by-transaction. IIUC, we need to primarily do two things
to achieve it, one is to have an additional state in the catalog (say
catch up) which will say that the initial copy is done. Then we need
to have a permanent slot using which we can track the progress of the
slot so that after restart (due to crash, connection break, etc.) we
can start from the appropriate position.
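
For example, with a permanent slot the restart position would survive
a crash and could be inspected on the publisher (a sketch; the
sync-slot name pattern here is only illustrative):

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name LIKE '%sync%';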

Apart from the above, I think with the current design of tablesync we
can see partial data of transactions because we allow all the
tablesync workers to run in parallel. Consider the below scenario:

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

Tx1
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl;

CREATE SUBSCRIPTION mysub
         CONNECTION 'host=localhost port=5432 dbname=postgres'
        PUBLICATION mypublication;

Tx2
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
Commit;

Tx3
BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
Commit;

Now, I could see the below results on subscriber:

postgres=# select * from mytbl1;
 id | somedata | text
----+----------+------
(0 rows)


postgres=# select * from mytbl2;
 id | somedata | text
----+----------+------
  1 |        1 | 1
  2 |        1 | 2
  3 |        1 | 3
(3 rows)

Basically, the results for Tx1, Tx2, Tx3 are visible for mytbl2 but
not for mytbl1. To reproduce this I have stopped the tablesync workers
(via debugger) for mytbl1 and mytbl2 in LogicalRepSyncTableStart
before it changes the relstate to SUBREL_STATE_SYNCWAIT. Then allowed
Tx2 and Tx3 to be processed by apply worker and then allowed tablesync
worker for mytbl2 to proceed. After that, I can see the above state.
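
(For anyone trying to reproduce this, the per-table states can be
watched on the subscriber like below; the on-disk states are 'i' =
init, 'd' = data copy, 's' = sync done, 'r' = ready:)

SELECT srrelid::regclass AS rel, srsubstate, srsublsn
FROM pg_subscription_rel;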

Now, won't this behavior be considered as transaction inconsistency
where partial transaction data or later transaction data is visible? I
don't think we can have such a situation on the master (publisher)
node or in physical standby.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 4, 2020 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
> <craig.ringer@enterprisedb.com> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker?
> >
> > No fundamental problem. Both approaches are fine. Committing the
> > initial copy then doing the rest in individual txns means an
> > incomplete sync state for the table becomes visible, which may not be
> > ideal. Ideally we'd do something like sync the data into a clone of
> > the table then swap the table relfilenodes out once we're synced up.
> >
> > IMO the main advantage of committing as we go is that it would let us
> > use a non-temporary slot and support recovering an incomplete sync and
> > finishing it after interruption by connection loss, crash, etc. That
> > would be advantageous for big table syncs or where the sync has lots
> > of lag to replay. But it means we have to remember sync states, and
> > give users a way to cancel/abort them. Otherwise forgotten temp slots
> > for syncs will cause a mess on the upstream.
> >
> > It also allows the sync slot to advance, freeing any held upstream
> > resources before the whole sync is done, which is good if the upstream
> > is busy and generating lots of WAL.
> >
> > Finally, committing as we go means we won't exceed the cid increment
> > limit in a single txn.
> >
>
> Yeah, all these are advantages of processing
> transaction-by-transaction. IIUC, we need to primarily do two things
> to achieve it, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position.
>
> Apart from the above, I think with the current design of tablesync we
> can see partial data of transactions because we allow all the
> tablesync workers to run in parallel. Consider the below scenario:
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> Tx1
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl;
>

oops, the above statement should be CREATE PUBLICATION mypublication
FOR TABLE mytbl1, mytbl2;

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 4, 2020 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 4, 2020 at 7:53 AM Craig Ringer
> <craig.ringer@enterprisedb.com> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:25, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker?
> >
> > No fundamental problem. Both approaches are fine. Committing the
> > initial copy then doing the rest in individual txns means an
> > incomplete sync state for the table becomes visible, which may not be
> > ideal. Ideally we'd do something like sync the data into a clone of
> > the table then swap the table relfilenodes out once we're synced up.
> >
> > IMO the main advantage of committing as we go is that it would let us
> > use a non-temporary slot and support recovering an incomplete sync and
> > finishing it after interruption by connection loss, crash, etc. That
> > would be advantageous for big table syncs or where the sync has lots
> > of lag to replay. But it means we have to remember sync states, and
> > give users a way to cancel/abort them. Otherwise forgotten temp slots
> > for syncs will cause a mess on the upstream.
> >
> > It also allows the sync slot to advance, freeing any held upstream
> > resources before the whole sync is done, which is good if the upstream
> > is busy and generating lots of WAL.
> >
> > Finally, committing as we go means we won't exceed the cid increment
> > limit in a single txn.
> >
>
>
> Yeah, all these are advantages of processing
> transaction-by-transaction. IIUC, we need to primarily do two things
> to achieve it, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position.
>
> Apart from the above, I think with the current design of tablesync we
> can see partial data of transactions because we allow all the
> tablesync workers to run in parallel. Consider the below scenario:
>
..
..
>
> Basically, the results for Tx1, Tx2, Tx3 are visible for mytbl2 but
> not for mytbl1. To reproduce this I have stopped the tablesync workers
> (via debugger) for mytbl1 and mytbl2 in LogicalRepSyncTableStart
> before it changes the relstate to SUBREL_STATE_SYNCWAIT. Then allowed
> Tx2 and Tx3 to be processed by apply worker and then allowed tablesync
> worker for mytbl2 to proceed. After that, I can see the above state.
>
> Now, won't this behavior be considered as transaction inconsistency
> where partial transaction data or later transaction data is visible? I
> don't think we can have such a situation on the master (publisher)
> node or in physical standby.
>

On briefly checking the pglogical code [1], it seems this problem
won't be there in pglogical, because it seems to first copy all the
tables (via pglogical_sync_table) in one process and then catch up
with the apply worker in a transaction-by-transaction manner. Am I
reading it correctly? If so, why did we follow a different approach
for the in-core solution, or is it that pglogical has improved over
time but those improvements can't be implemented in-core because of
some missing features?

[1] - https://github.com/2ndQuadrant/pglogical

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ashutosh Bapat
Date:
On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > The tablesync worker in logical replication performs the table data
> > > sync in a single transaction which means it will copy the initial data
> > > and then catch up with apply worker in the same transaction. There is
> > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > sync in a single transaction.") saying so but I can't find the
> > > concrete theory behind the same. Is there any fundamental problem if
> > > we commit the transaction after initial copy and slot creation in
> > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > it happens in apply worker? I have tried doing so in the attached (a
> > > quick prototype to test) and didn't find any problems with regression
> > > tests. I have tried a few manual tests as well to see if it works and
> > > didn't find any problem. Now, it is quite possible that it is
> > > mandatory to do the way we are doing currently, or maybe something
> > > else is required to remove this requirement but I think we can do
> > > better with respect to comments in this area.
> >
> > If we commit the initial copy, the data upto the initial copy's
> > snapshot will be visible downstream. If we apply the changes by
> > committing changes per transaction, the data visible to the other
> > transactions will differ as the apply progresses.
> >
>
> It is not clear what you mean by the above.  The way you have written
> appears that you are saying that instead of copying the initial data,
> I am saying to copy it transaction-by-transaction. But that is not the
> case. I am saying copy the initial data by using REPEATABLE READ
> isolation level as we are doing now, commit it and then process
> transaction-by-transaction till we reach sync-point (point till where
> apply worker has already received the data).

Craig in his mail has clarified this. The changes after the initial
COPY will be visible before the table sync catches up.

>
> > You haven't
> > clarified whether we will respect the transaction boundaries in the
> > apply log or not. I assume we will.
> >
>
> It will be transaction-by-transaction.
>
> > Whereas if we apply all the
> > changes in one go, other transactions either see the data before
> > resync or after it without any intermediate states.
> >
>
> What is the problem even if the user is able to see the data after the
> initial copy?
>
> > That will not
> > violate consistency, I think.
> >
>
> I am not sure how consistency will be broken.

Some of the transactions applied by apply workers may not have been
applied by the resync and vice versa. If the intermediate states of
the table resync worker are visible, this difference in applied
transactions will result in a loss of consistency if those
transactions change the table being resynced and some other table in
the same transaction. The changes won't be atomically visible.
Thinking more about this, this problem exists today for a table being
resynced, but at least it's only the table being resynced that is
behind the other tables so it's predictable.

-- 
Best Wishes,
Ashutosh Bapat



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > The tablesync worker in logical replication performs the table data
> > > > sync in a single transaction which means it will copy the initial data
> > > > and then catch up with apply worker in the same transaction. There is
> > > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > > sync in a single transaction.") saying so but I can't find the
> > > > concrete theory behind the same. Is there any fundamental problem if
> > > > we commit the transaction after initial copy and slot creation in
> > > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > > it happens in apply worker? I have tried doing so in the attached (a
> > > > quick prototype to test) and didn't find any problems with regression
> > > > tests. I have tried a few manual tests as well to see if it works and
> > > > didn't find any problem. Now, it is quite possible that it is
> > > > mandatory to do the way we are doing currently, or maybe something
> > > > else is required to remove this requirement but I think we can do
> > > > better with respect to comments in this area.
> > >
> > > If we commit the initial copy, the data upto the initial copy's
> > > snapshot will be visible downstream. If we apply the changes by
> > > committing changes per transaction, the data visible to the other
> > > transactions will differ as the apply progresses.
> > >
> >
> > It is not clear what you mean by the above.  The way you have written
> > appears that you are saying that instead of copying the initial data,
> > I am saying to copy it transaction-by-transaction. But that is not the
> > case. I am saying copy the initial data by using REPEATABLE READ
> > isolation level as we are doing now, commit it and then process
> > transaction-by-transaction till we reach sync-point (point till where
> > apply worker has already received the data).
>
> Craig in his mail has clarified this. The changes after the initial
> COPY will be visible before the table sync catches up.
>

I think the problem is not that the changes are visible after COPY;
rather, it is that we don't have a mechanism to restart if it crashes
after COPY unless we do all the sync in one transaction. Assume we
commit after COPY and then process transaction-by-transaction, and it
errors out (due to connection loss) or crashes in between one of the
following transactions after COPY; then after the restart we won't
know from where to start for that relation. This is because the
catalog (pg_subscription_rel) will show the state as 'd' (data is
being copied) and the slot would be gone as it was a temporary slot.
But as mentioned in one of my emails above [1], we can solve these
problems, which Craig also seems to be advocating for, as there are
many advantages of not doing the entire sync (initial copy + stream
changes for that relation) in one single transaction. It will allow
us to support decoding of prepared xacts in the subscriber. Also, it
seems pglogical already processes transaction-by-transaction after
the initial copy. The only thing that is not clear to me is why this
approach wasn't taken initially, and it would probably be better if
the original authors also chimed in to clarify the same.
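
For instance, a permanent sync slot could be created and cleaned up
roughly like this (purely illustrative; the slot name and the use of
the SQL functions instead of the replication protocol are
assumptions):

-- Create a permanent logical slot for the table sync:
SELECT pg_create_logical_replication_slot('pg_16395_sync_16384', 'pgoutput');
-- Once the relation reaches the READY state, it must be dropped
-- explicitly, otherwise it will keep holding back WAL on the publisher:
SELECT pg_drop_replication_slot('pg_16395_sync_16384');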

> >
> > > You haven't
> > > clarified whether we will respect the transaction boundaries in the
> > > apply log or not. I assume we will.
> > >
> >
> > It will be transaction-by-transaction.
> >
> > > Whereas if we apply all the
> > > changes in one go, other transactions either see the data before
> > > resync or after it without any intermediate states.
> > >
> >
> > What is the problem even if the user is able to see the data after the
> > initial copy?
> >
> > > That will not
> > > violate consistency, I think.
> > >
> >
> > I am not sure how consistency will be broken.
>
> Some of the transactions applied by apply workers may not have been
> applied by the resync and vice versa. If the intermediate states of
> the table resync worker are visible, this difference in applied
> transactions will result in a loss of consistency if those
> transactions change the table being resynced and some other table in
> the same transaction. The changes won't be atomically visible.
> Thinking more about this, this problem exists today for a table being
> resynced, but at least it's only the table being resynced that is
> behind the other tables so it's predictable.
>

Yeah, I have already shown that this problem [1] exists today and it
won't be predictable when the number of tables to be synced is more.
I am not sure why but it seems acceptable to original authors that the
data of transactions are visible partially during the initial
synchronization phase for a subscription. I don't see it documented
clearly either.


[1] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Craig Ringer
Date:


On Sat, 5 Dec 2020, 10:03 Amit Kapila, <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Dec 3, 2020 at 7:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Dec 3, 2020 at 7:04 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 3, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > The tablesync worker in logical replication performs the table data
> > > > > sync in a single transaction which means it will copy the initial data
> > > > > and then catch up with apply worker in the same transaction. There is
> > > > > a comment in LogicalRepSyncTableStart ("We want to do the table data
> > > > > sync in a single transaction.") saying so but I can't find the
> > > > > concrete theory behind the same. Is there any fundamental problem if
> > > > > we commit the transaction after initial copy and slot creation in
> > > > > LogicalRepSyncTableStart and then allow the apply of transactions as
> > > > > it happens in apply worker? I have tried doing so in the attached (a
> > > > > quick prototype to test) and didn't find any problems with regression
> > > > > tests. I have tried a few manual tests as well to see if it works and
> > > > > didn't find any problem. Now, it is quite possible that it is
> > > > > mandatory to do the way we are doing currently, or maybe something
> > > > > else is required to remove this requirement but I think we can do
> > > > > better with respect to comments in this area.
> > > >
> > > > If we commit the initial copy, the data upto the initial copy's
> > > > snapshot will be visible downstream. If we apply the changes by
> > > > committing changes per transaction, the data visible to the other
> > > > transactions will differ as the apply progresses.
> > > >
> > >
> > > It is not clear what you mean by the above.  The way you have written
> > > appears that you are saying that instead of copying the initial data,
> > > I am saying to copy it transaction-by-transaction. But that is not the
> > > case. I am saying copy the initial data by using REPEATABLE READ
> > > isolation level as we are doing now, commit it and then process
> > > transaction-by-transaction till we reach sync-point (point till where
> > > apply worker has already received the data).
> >
> > Craig in his mail has clarified this. The changes after the initial
> > COPY will be visible before the table sync catches up.
> >
>
> I think the problem is not that the changes are visible after COPY;
> rather, it is that we don't have a mechanism to restart if it crashes
> after COPY unless we do all the sync in one transaction. Assume we
> commit after COPY and then process transaction-by-transaction, and it
> errors out (due to connection loss) or crashes in between one of the
> following transactions after COPY; then after the restart we won't
> know from where to start for that relation. This is because the
> catalog (pg_subscription_rel) will show the state as 'd' (data is
> being copied) and the slot would be gone as it was a temporary slot.
> But as mentioned in one of my emails above [1], we can solve these
> problems, which Craig also seems to be advocating for, as there are
> many advantages of not doing the entire sync (initial copy + stream
> changes for that relation) in one single transaction. It will allow
> us to support decoding of prepared xacts in the subscriber. Also, it
> seems pglogical already processes transaction-by-transaction after
> the initial copy. The only thing that is not clear to me is why this
> approach wasn't taken initially, and it would probably be better if
> the original authors also chimed in to clarify the same.

It's partly a resource management issue.

Replication origins are a limited resource. We need to use a replication origin for any sync we want to be durable across restarts.

Then again so are slots and we use temp slots for each sync.

If a sync fails, cleanup on the upstream side is simple with a temp slot. With persistent slots we have more risk of creating upstream issues. But then, so long as the subscriber exists it can deal with that. And if the subscriber no longer exists its primary slot is an issue too.

It'd help if we could register pg_shdepend entries between catalog entries and slots, and from a main subscription slot to any extra slots used for resynchronization.

And I should write a patch for a resource retention summarisation view.
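
(For context, the origins in use on the subscriber can be listed as
below; the number of origins is bounded by max_replication_slots,
which is why per-sync origins eat into a limited budget:)

SELECT roident, roname FROM pg_replication_origin;
SELECT local_id, external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;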


> I am not sure why but it seems acceptable to original authors that the
> data of transactions are visible partially during the initial
> synchronization phase for a subscription.

I don't think there's much alternative there.

Pg would need some kind of cross-commit visibility control mechanism that separates durable commit from visibility.

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi,

I wanted to float another idea to solve these tablesync/apply worker problems.

This idea may or may not have merit. Please consider it.

~

Basically, I was wondering why can't the "tablesync" worker just
gather messages in a similar way to how the current streaming feature
gathers messages into a "changes" file, so that they can be replayed
later.

e.g. Imagine if

A) The "tablesync" worker (after the COPY) does not ever apply any of
the incoming messages, but instead it just gobbles them into a
"changes" file until it decides it has reached SYNCDONE state and
exits.

B) Then, when the "apply" worker proceeds, if it detects the existence
of the "changes" file it will replay/apply_dispatch all those gobbled
messages before just continuing as normal.

So
- IIUC this kind of replay is like how the current code stream commit
applies the streamed "changes" file.
- "tablesync" worker would only be doing table sync (COPY) as its name
suggests. Any detected "changes" are recorded and left for the "apply"
worker to handle.
- "tablesync" worker would just operate in single tx with a temporary
slot as per current code
- Then the "apply" worker would be the *only* worker that actually
applies anything. (as its name suggests)

Thoughts?

---

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> On Sat, 5 Dec 2020, 10:03 Amit Kapila, <amit.kapila16@gmail.com> wrote:
>>
>> On Fri, Dec 4, 2020 at 7:12 PM Ashutosh Bapat
>> <ashutosh.bapat.oss@gmail.com> wrote:
>>
>> I think the problem is not that the changes are visible after COPY;
>> rather, it is that we don't have a mechanism to restart if it crashes
>> after COPY unless we do all the sync in one transaction. Assume we
>> commit after COPY and then process transaction-by-transaction, and it
>> errors out (due to connection loss) or crashes in between one of the
>> following transactions after COPY; then after the restart we won't
>> know from where to start for that relation. This is because the
>> catalog (pg_subscription_rel) will show the state as 'd' (data is
>> being copied) and the slot would be gone as it was a temporary slot.
>> But as mentioned in one of my emails above [1], we can solve these
>> problems, which Craig also seems to be advocating for, as there are
>> many advantages of not doing the entire sync (initial copy + stream
>> changes for that relation) in one single transaction. It will allow
>> us to support decoding of prepared xacts in the subscriber. Also, it
>> seems pglogical already processes transaction-by-transaction after
>> the initial copy. The only thing that is not clear to me is why this
>> approach wasn't taken initially, and it would probably be better if
>> the original authors also chimed in to clarify the same.
>
>
> It's partly a resource management issue.
>
> Replication origins are a limited resource. We need to use a
> replication origin for any sync we want to be durable across
> restarts.
>
> Then again so are slots and we use temp slots for each sync.
>
> If a sync fails, cleanup on the upstream side is simple with a temp
> slot. With persistent slots we have more risk of creating upstream
> issues. But then, so long as the subscriber exists it can deal with
> that. And if the subscriber no longer exists its primary slot is an
> issue too.
>

I think if the only issue is slot cleanup, then the same exists today
for the slot created by the apply worker (which I think you are
referring to as the primary slot). This can only happen if the
subscriber goes away without dropping the subscription. Also, if we
are worried about using up too many slots then the slots used by
tablesync workers will probably be freed sooner.

> It'd help if we could register pg_shdepend entries between catalog
> entries and slots, and from a main subscription slot to any extra
> slots used for resynchronization.
>

Which catalog entries you are referring to here?

> And I should write a patch for a resource retention summarisation view.
>

That would be great.

>>
>> I am not sure why but it seems acceptable to original authors that the
>> data of transactions are visible partially during the initial
>> synchronization phase for a subscription.
>
>
> I don't think there's much alternative there.
>

I am not sure about this. I think it is primarily to allow some more
parallelism among apply and sync workers. One primitive way to achieve
parallelism and avoid this problem is to allow the apply worker to
wait till all the tablesync workers are in the DONE state. Then we
will never have an inconsistency problem or the prepared xact problem.
Now, surely if large copies are required for multiple relations then
the partial replay of transactions by the apply worker would be
delayed a bit, but I don't know how much that matters as compared to
the transaction visibility issue, and anyway we would have achieved
the maximum parallelism by allowing copy via multiple workers.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Craig Ringer
Date:
On Mon, 7 Dec 2020 at 11:44, Peter Smith <smithpb2250@gmail.com> wrote:

> Basically, I was wondering why can't the "tablesync" worker just
> gather messages in a similar way to how the current streaming feature
> gathers messages into a "changes" file, so that they can be replayed
> later.


See the related thread "Logical archiving"

https://www.postgresql.org/message-id/20D9328B-A189-43D1-80E2-EB25B9284AD6@yandex-team.ru

where I addressed some parts of this topic in detail earlier today.

> A) The "tablesync" worker (after the COPY) does not ever apply any of
> the incoming messages, but instead it just gobbles them into a
> "changes" file until it decides it has reached SYNCDONE state and
> exits.

This has a few issues.

Most importantly, the sync worker must cooperate with the main apply worker to achieve a consistent end-of-sync cutover. The sync worker must have replayed the pending changes in order to make this cut-over, because the non-sync apply worker will need to start applying changes on top of the resync'd table potentially as soon as the next transaction it starts applying, so it needs to see the rows there.

Doing this would also add another round of write multiplication since the data would get spooled then applied to WAL then heap. Write multiplication is already an issue for logical replication so adding to it isn't particularly desirable without a really compelling reason. With the write multiplication comes disk space management issues for big transactions as well as the obvious performance/throughput impact.

It adds even more latency between upstream commit and downstream apply, something that is again already an issue for logical replication.

Right now we don't have any concept of a durable and locally flushed spool.

It's not impossible to do as you suggest but the cutover requirement makes it far from simple. As discussed in the logical archiving thread I think it'd be good to have something like this, and there are times the write multiplication price would be well worth paying. But it's not easy.

> B) Then, when the "apply" worker proceeds, if it detects the existence
> of the "changes" file it will replay/apply_dispatch all those gobbled
> messages before just continuing as normal.

That's going to introduce a really big stall in the apply worker's progress in many cases. During that time it won't be receiving from upstream (since we don't spool logical changes to disk at this time) so the upstream lag will grow. That will impact synchronous replication, pg_wal size management, catalog bloat, etc. It'll also leave the upstream logical decoding session idle, so when it resumes it may create a spike of I/O and CPU load as it catches up, as well as a spike of network traffic. And depending on how close the upstream write rate is to the max decode speed, network throughput max, and downstream apply speed max, it may take some time to catch up over the resulting lag.

Not a big fan of that approach.

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Dec 7, 2020 at 10:02 AM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
>
> On Mon, 7 Dec 2020 at 11:44, Peter Smith <smithpb2250@gmail.com> wrote:
>>
>>
>> Basically, I was wondering why can't the "tablesync" worker just
>> gather messages in a similar way to how the current streaming feature
>> gathers messages into a "changes" file, so that they can be replayed
>> later.
>>
>
> See the related thread "Logical archiving"
>
> https://www.postgresql.org/message-id/20D9328B-A189-43D1-80E2-EB25B9284AD6@yandex-team.ru
>
> where I addressed some parts of this topic in detail earlier today.
>
>> A) The "tablesync" worker (after the COPY) does not ever apply any of
>> the incoming messages, but instead it just gobbles them into a
>> "changes" file until it decides it has reached SYNCDONE state and
>> exits.
>
>
> This has a few issues.
>
> Most importantly, the sync worker must cooperate with the main apply
> worker to achieve a consistent end-of-sync cutover.
>

In this idea, there is no need to change the end-of-sync cutover. It
will work as it is now. I am not sure what makes you think so.

> The sync worker must have replayed the pending changes in order to
> make this cut-over, because the non-sync apply worker will need to
> start applying changes on top of the resync'd table potentially as
> soon as the next transaction it starts applying, so it needs to see
> the rows there.
>

The change here would be that the apply worker will check for changes
file and if it exists then apply them before it changes the relstate
to SUBREL_STATE_READY in process_syncing_tables_for_apply(). So, it
will not miss seeing any rows.

> Doing this would also add another round of write multiplication
> since the data would get spooled then applied to WAL then heap. Write
> multiplication is already an issue for logical replication so adding
> to it isn't particularly desirable without a really compelling
> reason.
>

It will solve our problem of allowing decoding of prepared xacts in
pgoutput. I have explained the problem above [1]. The other idea which
we discussed is to allow having an additional state in
pg_subscription_rel, make the slot permanent in the tablesync worker,
and then process transaction-by-transaction in the apply worker. Does
that approach sound better? Is there any bigger change involved in
this approach (making the tablesync slot permanent) which I am
missing?

> With the write multiplication comes disk space management issues for
> big transactions as well as the obvious performance/throughput
> impact.
>
> It adds even more latency between upstream commit and downstream
> apply, something that is again already an issue for logical
> replication.
>
> Right now we don't have any concept of a durable and locally flushed spool.
>

I think we have a concept quite close to it for writing changes for
in-progress xacts as done in PG-14. It is not durable but that
shouldn't be a big problem if we allow syncing the changes file.

> It's not impossible to do as you suggest but the cutover requirement
> makes it far from simple. As discussed in the logical archiving
> thread I think it'd be good to have something like this, and there
> are times the write multiplication price would be well worth paying.
> But it's not easy.
>
>> B) Then, when the "apply" worker proceeds, if it detects the existence
>> of the "changes" file it will replay/apply_dispatch all those gobbled
>> messages before just continuing as normal.
>
>
> That's going to introduce a really big stall in the apply worker's
> progress in many cases. During that time it won't be receiving from
> upstream (since we don't spool logical changes to disk at this time)
> so the upstream lag will grow. That will impact synchronous
> replication, pg_wal size management, catalog bloat, etc. It'll also
> leave the upstream logical decoding session idle, so when it resumes
> it may create a spike of I/O and CPU load as it catches up, as well
> as a spike of network traffic. And depending on how close the
> upstream write rate is to the max decode speed, network throughput
> max, and downstream apply speed max, it may take some time to catch
> up over the resulting lag.
>

This is just for the initial tablesync phase. I think it is equivalent
to saying that during basebackup, we need to start physical
replication in parallel. I agree that sometimes it can take a lot of
time to copy large tables but it will be just one time and no worse
than other situations like basebackup.

[1] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
> <craig.ringer@enterprisedb.com> wrote:
> >
>
> >>
> >> I am not sure why but it seems acceptable to original authors that the
> >> data of transactions are visible partially during the initial
> >> synchronization phase for a subscription.
> >
> >
> > I don't think there's much alternative there.
> >
>
> I am not sure about this. I think it is primarily to allow some more
> parallelism among apply and sync workers. One primitive way to achieve
> parallelism and avoid this problem is to allow the apply worker to
> wait till all the tablesync workers are in the DONE state.
>

As the slot of apply worker is created before all the tablesync
workers it should never miss any LSN which tablesync workers would
have processed. Also, the table sync workers should not process any
xact if the apply worker has not processed anything. I think tablesync
currently always processes one transaction (because we call
process_sync_tables at commit of a txn) even if that is not required
to be in sync with the apply worker. This should solve both the
problems (a) visibility of partial transactions (b) allow prepared
transactions because tablesync worker no longer needs to combine
multiple transactions data.

I think the other advantages of this would be that it would reduce the
load (both CPU and I/O) on the publisher-side by allowing the data to
be decoded only once instead of once for each table sync worker and
separately for the apply worker. I think it will use fewer resources
to finish the work.

Is there any flaw in this idea which I am missing?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Dec 7, 2020 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 7, 2020 at 9:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 7, 2020 at 6:20 AM Craig Ringer
> > <craig.ringer@enterprisedb.com> wrote:
> > >
> >
> > >>
> > >> I am not sure why but it seems acceptable to original authors that the
> > >> data of transactions are visible partially during the initial
> > >> synchronization phase for a subscription.
> > >
> > >
> > > I don't think there's much alternative there.
> > >
> >
> > I am not sure about this. I think it is primarily to allow some more
> > parallelism among apply and sync workers. One primitive way to achieve
> > parallelism and avoid this problem is to allow the apply worker to
> > wait till all the tablesync workers are in the DONE state.
> >
>
> As the slot of apply worker is created before all the tablesync
> workers it should never miss any LSN which tablesync workers would
> have processed. Also, the table sync workers should not process any
> xact if the apply worker has not processed anything. I think tablesync
> currently always processes one transaction (because we call
> process_sync_tables at commit of a txn) even if that is not required
> to be in sync with the apply worker.
>

One more thing to consider here is that currently in the tablesync
worker, we create a slot with the CRS_USE_SNAPSHOT option, which
creates a transaction snapshot on the publisher, and then we use the
same snapshot for the copy from the publisher. After this, when we
try to receive the data from the publisher using the same slot, it
will be in sync with the COPY. I think to keep the same consistency
between COPY and the data we receive from the publisher in this
approach, we need to export the snapshot while creating the slot in
the apply worker by using CRS_EXPORT_SNAPSHOT and then have all the
tablesync workers doing the copy use the same snapshot. In tablesync
workers, we can use the SET TRANSACTION SNAPSHOT command after "BEGIN
READ ONLY ISOLATION LEVEL REPEATABLE READ" to achieve it. That way
the COPY will use the same snapshot as is used for receiving the
changes in the apply worker and the data will be in sync.
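
A sketch of that coordination (the slot name and snapshot identifier
below are illustrative):

-- Apply worker, on its replication connection:
--   CREATE_REPLICATION_SLOT "mysub" LOGICAL pgoutput EXPORT_SNAPSHOT;
-- which returns a snapshot name such as '00000003-0000001B-1'.
-- Each tablesync worker's copy could then run as:
BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
COPY mytbl1 TO STDOUT;
COMMIT;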

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Dec 7, 2020 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> As the slot of apply worker is created before all the tablesync
> workers it should never miss any LSN which tablesync workers would
> have processed. Also, the table sync workers should not process any
> xact if the apply worker has not processed anything. I think tablesync
> currently always processes one transaction (because we call
> process_sync_tables at commit of a txn) even if that is not required
> to be in sync with the apply worker. This should solve both the
> problems (a) visibility of partial transactions (b) allow prepared
> transactions because tablesync worker no longer needs to combine
> multiple transactions data.
>
> I think the other advantages of this would be that it would reduce the
> load (both CPU and I/O) on the publisher-side by allowing the data to
> be decoded only once instead of once for each table sync worker and
> separately for the apply worker. I think it will use fewer resources
> to finish the work.

Yes, I observed this same behavior.

IIUC the only way for the tablesync worker to go from CATCHUP mode to
SYNCDONE is via the call to process_sync_tables.

But a side-effect of this is, when messages arrive during this CATCHUP
phase one tx will be getting handled by the tablesync worker before
the process_sync_tables() is ever encountered.

I have created and attached a simple patch which allows the tablesync
to detect if there is anything to do *before* it enters the apply main
loop. Calling process_sync_tables() before the apply main loop  offers
a quick way out so the message handling will not be split
unnecessarily between the workers.

~

The result of the patch is demonstrated by the following test/logs
which are also attached.
Note: I added more logging (not in this patch) to make it easier to
see what is going on.

LOGS1. Current code.
Test: 10 x INSERTS done at CATCHUP time.
Result: tablesync worker does 1 x INSERT, then apply worker skips 1
and does remaining 9 x INSERTs.

LOGS2. Patched code.
Test: Same 10 x INSERTS done at CATCHUP time.
Result: tablesync can exit early. apply worker handles all 10 x INSERTs

LOGS3. Patched code.
Test: 2PC PREPARE then COMMIT PREPARED [1] done at CATCHUP time
psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1,
'foo');PREPARE TRANSACTION 'test_prepared_tab';"
psql -d test_pub -c "COMMIT PREPARED 'test_prepared_tab';"
Result: The PREPARE and COMMIT PREPARED are both handled by the apply
worker. This avoids complications which the split otherwise causes.
[1] 2PC prepare test requires v29 patch from
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Yes, I observed this same behavior.
>
> IIUC the only way for the tablesync worker to go from CATCHUP mode to
> SYNCDONE is via the call to process_sync_tables.
>
> But a side-effect of this is, when messages arrive during this CATCHUP
> phase one tx will be getting handled by the tablesync worker before
> the process_sync_tables() is ever encountered.
>
> I have created and attached a simple patch which allows the tablesync
> to detect if there is anything to do *before* it enters the apply main
> loop. Calling process_sync_tables() before the apply main loop  offers
> a quick way out so the message handling will not be split
> unnecessarily between the workers.
>

Yeah, this demonstrates the idea can work but as mentioned in my
previous email [1] this needs much more work to make the COPY and the
later fetching of changes from the publisher consistent. So, let me
summarize the discussion so far. We wanted to enhance the tablesync
phase of Logical Replication to enable decoding of prepared
transactions [2]. The problem was that when we stream prepared
transactions in the tablesync worker, it simply commits them, due to
the requirement of maintaining a single transaction for the entire
duration of copy and streaming of transactions afterward. We can't
simply disable the decoding of prepared xacts for tablesync workers
because it can skip some of the prepared xacts forever on the
subscriber, as explained in one of the emails above [3]. Now, while
investigating the solutions to enhance tablesync to support decoding
at prepare time, I found that due to the current design of tablesync
we can see partial data of transactions on subscribers, which is also
explained in the email above with an example [4]. This visibility
problem has existed since Logical Replication was introduced in
PostgreSQL, and the only answer I have received so far is that there
doesn't seem to be any other alternative, which I think is not true,
as I have provided one alternative as well.

Next, we have discussed three different solutions, all of which
solve the first problem (allow the tablesync worker to decode
transactions at prepare time) and one of which solves both the first
and second problems (partial transaction data visibility).

Solution-1: Allow the table-sync worker to use multiple transactions.
The reason for doing it in a single transaction is that if, after the
initial COPY, we commit and then crash while streaming changes of
other transactions, the state of the table won't be known after the
restart: as we are using a temporary slot, we don't know from where to
restart syncing the table.

IIUC, we primarily need to do two things to achieve multiple
transactions: one is to have an additional state in the catalog (say
'catchup') which will say that the initial copy is done. Then we need
to have a permanent slot whose progress we can track, so that after a
restart (due to a crash, connection break, etc.) we can start from the
appropriate position. Now, this will allow us to do less work after
recovering from a crash because we will know the restart point. As
Craig mentioned, it also allows the sync slot to advance, freeing any
held upstream resources before the whole sync is done, which is good
if the upstream is busy and generating lots of WAL. Finally,
committing as we go means we won't exceed the cid increment limit in a
single txn.
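
For illustration, a minimal sketch of those two pieces (the state
name, its char value, and the call site are assumptions; a later patch
in this thread calls the state SUBREL_STATE_COPYDONE):

    /* (1) An additional pg_subscription_rel state, after the initial
     *     copy but before catchup completes; value is a placeholder. */
    #define SUBREL_STATE_COPYDONE  'C'  /* initial copy finished */

    /* (2) Create the tablesync slot as permanent (temporary = false)
     *     so its restart position survives a crash or a connection
     *     break. */
    walrcv_create_slot(wrconn, slotname, false /* permanent */,
                       CRS_USE_SNAPSHOT, &relstate_lsn);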

Solution-2: The next solution we discussed is to make the "tablesync"
worker just gather messages after COPY, in a similar way to how the
current streaming of in-progress transactions feature gathers messages
into a "changes" file, so that they can be replayed later by the apply
worker. Now, since we no longer need to replay the individual
transactions in the tablesync worker in a single transaction, the
decoded prepare can be sent to the subscriber. This has some
disadvantages: each transaction processed by the tablesync worker
needs to be durably written to a file, and it can also lead to some
apply lag later when the same is processed by the apply worker.

Solution-3: Allow the table-sync workers to just perform the initial
COPY and then, once the COPY is done for all relations, have the apply
worker stream all the future changes. Now, surely if large copies are
required for multiple relations then the apply worker's partial replay
of transactions would be delayed a bit, but I don't know how much that
matters compared to the transaction visibility issue, and anyway we
would have achieved the maximum parallelism by allowing the copy via
multiple workers. This would reduce the load (both CPU and I/O) on the
publisher side by decoding the data only once, instead of once for
each table sync worker and separately for the apply worker. I think it
will use fewer resources to finish the work.

Currently, in the tablesync worker, we create a slot with the
CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the
publisher, and then we use the same snapshot for the COPY from the
publisher. After this, when we try to receive the data from the
publisher using the same slot, it will be in sync with the COPY. I
think to keep the same consistency between the COPY and the data we
receive from the publisher in this approach, we need to export the
snapshot while creating a slot in the apply worker by using
CRS_EXPORT_SNAPSHOT, and then have all the tablesync workers doing the
copy use that same snapshot. In the tablesync workers, we can use the
SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION
LEVEL REPEATABLE READ" to use the exported snapshot. That way the COPY
will use the same snapshot as is used for receiving the changes in the
apply worker, and the data will be in sync.
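
As a rough sketch of that flow (using the walrcv_* wrappers from
walreceiver.h; the snapshot name below is only an example value):

    /* Apply worker: create its slot and export the slot's snapshot;
     * the create-slot call returns the exported snapshot's name. */
    snapshot_name = walrcv_create_slot(wrconn, slotname,
                                       false /* permanent */,
                                       CRS_EXPORT_SNAPSHOT, &lsn);

    /* Each tablesync worker: before its COPY, on its own publisher
     * connection, import that same snapshot:
     *
     *   BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
     *   SET TRANSACTION SNAPSHOT '00000003-00000002-1';
     *   COPY public.mytable TO STDOUT;  -- sees the slot's snapshot
     *   COMMIT;
     */

Note that an exported snapshot generally remains usable only while the
exporting session keeps it alive on the publisher, which is why the
REFRESH PUBLICATION case below needs something extra.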

Then we also need a way to export a snapshot while the apply worker is
already receiving the changes, because users can run 'ALTER
SUBSCRIPTION name REFRESH PUBLICATION', which allows new tables to be
synced. I think we need to introduce a new command in
exec_replication_command() to export the snapshot from the existing
slot so that the new tablesync worker can use it.


Among the above three solutions, the first two will solve the first
problem (allow the tablesync worker to decode transactions at prepare
time) and the third solution will solve both the first and second
problems (partial transaction data visibility). The third solution
requires quite some redesign of how the Logical Replication work is
synchronized between the apply and tablesync workers, and might turn
out to be a bigger implementation effort. I am tentatively thinking of
going with the first or second solution at this stage; if people later
feel that we need some bigger redesign, then we can go with something
along the lines of Solution-3.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Yes, I observed this same behavior.
> >
> > IIUC the only way for the tablesync worker to go from CATCHUP mode to
> > SYNCDONE is via the call to process_sync_tables.
> >
> > But a side-effect of this is, when messages arrive during this CATCHUP
> > phase one tx will be getting handled by the tablesync worker before
> > the process_sync_tables() is ever encountered.
> >
> > I have created and attached a simple patch which allows the tablesync
> > to detect if there is anything to do *before* it enters the apply main
> > loop. Calling process_sync_tables() before the apply main loop  offers
> > a quick way out so the message handling will not be split
> > unnecessarily between the workers.
> >
>
> Yeah, this demonstrates the idea can work but as mentioned in my
> previous email [1] this needs much more work to make the COPY and the
> later fetching of changes from the publisher consistent. So, let me
> summarize the discussion so far. We wanted to enhance the tablesync
> phase of Logical Replication to enable decoding of prepared
> transactions [2]. The problem was when we stream prepared transactions
> in the tablesync worker, it will simply commit the same due to the
> requirement of maintaining a single transaction for the entire
> duration of copy and streaming of transactions afterward. We can't
> simply disable the decoding of prepared xacts for tablesync workers
> because it can skip some of the prepared xacts forever on subscriber
> as explained in one of the emails above [3]. Now, while investigating
> the solutions to enhance tablesync to support decoding at prepare
> time, I found that due to the current design of tablesync we can see
> partial data of transactions on subscribers which is also explained in
> the email above with an example [4]. This problem of visibility is
> there since the Logical Replication is introduced in PostgreSQL and
> the only answer I got till now is that there doesn't seem to be any
> other alternative which I think is not true and I have provided one
> alternative as well.
>
> Next, we have discussed three different solutions all of which will
> solve the first problem (allow the tablesync worker to decode
> transactions at prepare time) and one of which solves both the first
> and second problem (partial transaction data visibility).
>
> Solution-1: Allow the table-sync worker to use multiple transactions.
> The reason for doing it in a single transaction is that if after
> initial COPY we commit and then crash while streaming changes of other
> transactions, the state of the table won't be known after the restart
> as we are using a temporary slot so we don't know from where to restart
> syncing the table.
>
> IIUC, we need to primarily do two things to achieve multiple
> transactions, one is to have an additional state in the catalog (say
> catch up) which will say that the initial copy is done. Then we need
> to have a permanent slot using which we can track the progress of the
> slot so that after restart (due to crash, connection break, etc.) we
> can start from the appropriate position. Now, this will allow us to do
> less work after recovering from a crash because we will know the
> restart point. As Craig mentioned, it also allows the sync slot to
> advance, freeing any held upstream resources before the whole sync is
> done, which is good if the upstream is busy and generating lots of
> WAL. Finally, committing as we go means we won't exceed the cid
> increment limit in a single txn.
>
> Solution-2: The next solution we discussed is to make "tablesync"
> worker just gather messages after COPY in a similar way to how the
> current streaming of in-progress transaction feature gathers messages
> into a "changes" file so that they can be replayed later by the apply
> worker. Now, here as we don't need to replay the individual
> transactions in tablesync worker in a single transaction, it will
> allow us to send decode prepared to the subscriber. This has some
> disadvantages such as each transaction processed by tablesync worker
> needs to be durably written to file and it can also lead to some apply
> lag later when we process the same by apply worker.
>
> Solution-3: Allow the table-sync workers to just perform initial COPY
> and then once the COPY is done for all relations the apply worker will
> stream all the future changes. Now, surely if large copies are
> required for multiple relations then we would delay a bit to replay
> transactions partially by the apply worker but don't know how much
> that matters as compared to transaction visibility issue and anyway we
> would have achieved the maximum parallelism by allowing copy via
> multiple workers. This would reduce the load (both CPU and I/O) on the
> publisher-side by allowing to decode the data only once instead of for
> each table sync worker once and separately for the apply worker. I
> think it will use fewer resources to finish the work.
>
> Currently, in tablesync worker, we create a slot with CRS_USE_SNAPSHOT
> option which creates a transaction snapshot on the publisher, and then
> we use the same snapshot for COPY from the publisher. After this, when
> we try to receive the data from the publisher using the same slot, it
> will be in sync with the COPY. I think to keep the same consistency
> between COPY and the data we receive from the publisher in this
> approach, we need to export the snapshot while creating a slot in the
> apply worker by using CRS_EXPORT_SNAPSHOT and then use the same
> snapshot by all the tablesync workers doing the copy. In tablesync
> workers, we can use the SET TRANSACTION SNAPSHOT command after "BEGIN
> READ ONLY ISOLATION LEVEL REPEATABLE READ" to use the exported
> snapshot. That way the COPY will use the same snapshot as is used for
> receiving the changes in apply worker and the data will be in sync.
>
> Then we also need a way to export snapshot while the apply worker is
> already receiving the changes because users can use 'ALTER
> SUBSCRIPTION name REFRESH PUBLICATION' which allows new tables to be
> synced. I think we need to introduce a new command in
> exec_replication_command() to export the snapshot from the existing
> slot and then use it by the new tablesync worker.
>
>
> Among the above three solutions, the first two will solve the first
> problem (allow the tablesync worker to decode transactions at prepare
> time) and the third solution will solve both the first and second
> problem (partial transaction data visibility). The third solution
> requires quite some redesign of how the Logical Replication work is
> synchronized between apply and tablesync workers and might turn out to
> be a bigger implementation effort. I am tentatively thinking to go
> with a first or second solution at this stage and anyway if later
> people feel that we need some bigger redesign then we can go with
> something on the lines of Solution-3.
>
> Thoughts?
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com
>
> --

Hi Amit,

- Solution-3 has become too complicated for me to attempt. Anyway, we
may be better off just focusing on eliminating the new problems exposed
by the 2PC work [1], rather than burning too much effort to fix some
other quirk which has apparently existed for years.
[1] https://www.postgresql.org/message-id/CAHut%2BPtm7E5Jj92tJWPtnnjbNjJN60_%3DaGGKYW3h23b7J%3DqeDg%40mail.gmail.com

- Solution-2 has some potential lag problems, and maybe file resource
problems as well. This idea did not get a very favourable response
when I first proposed it.

- This leaves Solution-1 as the best viable option to fix the current
known 2PC trouble.

~~

So I will try to write a patch for the proposed Solution-1.

---
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Dilip Kumar
Date:
On Thu, Dec 10, 2020 at 3:19 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Yes, I observed this same behavior.
> > >
> > > IIUC the only way for the tablesync worker to go from CATCHUP mode to
> > > SYNCDONE is via the call to process_sync_tables.
> > >
> > > But a side-effect of this is, when messages arrive during this CATCHUP
> > > phase one tx will be getting handled by the tablesync worker before
> > > the process_sync_tables() is ever encountered.
> > >
> > > I have created and attached a simple patch which allows the tablesync
> > > to detect if there is anything to do *before* it enters the apply main
> > > loop. Calling process_sync_tables() before the apply main loop  offers
> > > a quick way out so the message handling will not be split
> > > unnecessarily between the workers.
> > >
> >
> > Yeah, this demonstrates the idea can work but as mentioned in my
> > previous email [1] this needs much more work to make the COPY and the
> > later fetching of changes from the publisher consistent. So, let me
> > summarize the discussion so far. We wanted to enhance the tablesync
> > phase of Logical Replication to enable decoding of prepared
> > transactions [2]. The problem was when we stream prepared transactions
> > in the tablesync worker, it will simply commit the same due to the
> > requirement of maintaining a single transaction for the entire
> > duration of copy and streaming of transactions afterward. We can't
> > simply disable the decoding of prepared xacts for tablesync workers
> > because it can skip some of the prepared xacts forever on subscriber
> > as explained in one of the emails above [3]. Now, while investigating
> > the solutions to enhance tablesync to support decoding at prepare
> > time, I found that due to the current design of tablesync we can see
> > partial data of transactions on subscribers which is also explained in
> > the email above with an example [4]. This problem of visibility is
> > there since the Logical Replication is introduced in PostgreSQL and
> > the only answer I got till now is that there doesn't seem to be any
> > other alternative which I think is not true and I have provided one
> > alternative as well.
> >
> > Next, we have discussed three different solutions all of which will
> > solve the first problem (allow the tablesync worker to decode
> > transactions at prepare time) and one of which solves both the first
> > and second problem (partial transaction data visibility).
> >
> > Solution-1: Allow the table-sync worker to use multiple transactions.
> > The reason for doing it in a single transaction is that if after
> > initial COPY we commit and then crash while streaming changes of other
> > transactions, the state of the table won't be known after the restart
> > as we are using a temporary slot so we don't know from where to restart
> > syncing the table.
> >
> > IIUC, we need to primarily do two things to achieve multiple
> > transactions, one is to have an additional state in the catalog (say
> > catch up) which will say that the initial copy is done. Then we need
> > to have a permanent slot using which we can track the progress of the
> > slot so that after restart (due to crash, connection break, etc.) we
> > can start from the appropriate position. Now, this will allow us to do
> > less work after recovering from a crash because we will know the
> > restart point. As Craig mentioned, it also allows the sync slot to
> > advance, freeing any held upstream resources before the whole sync is
> > done, which is good if the upstream is busy and generating lots of
> > WAL. Finally, committing as we go means we won't exceed the cid
> > increment limit in a single txn.
> >
> > Solution-2: The next solution we discussed is to make "tablesync"
> > worker just gather messages after COPY in a similar way to how the
> > current streaming of in-progress transaction feature gathers messages
> > into a "changes" file so that they can be replayed later by the apply
> > worker. Now, here as we don't need to replay the individual
> > transactions in tablesync worker in a single transaction, it will
> > allow us to send decode prepared to the subscriber. This has some
> > disadvantages such as each transaction processed by tablesync worker
> > needs to be durably written to file and it can also lead to some apply
> > lag later when we process the same by apply worker.
> >
> > Solution-3: Allow the table-sync workers to just perform initial COPY
> > and then once the COPY is done for all relations the apply worker will
> > stream all the future changes. Now, surely if large copies are
> > required for multiple relations then we would delay a bit to replay
> > transactions partially by the apply worker but don't know how much
> > that matters as compared to transaction visibility issue and anyway we
> > would have achieved the maximum parallelism by allowing copy via
> > multiple workers. This would reduce the load (both CPU and I/O) on the
> > publisher-side by allowing to decode the data only once instead of for
> > each table sync worker once and separately for the apply worker. I
> > think it will use fewer resources to finish the work.
> >
> > Currently, in tablesync worker, we create a slot with CRS_USE_SNAPSHOT
> > option which creates a transaction snapshot on the publisher, and then
> > we use the same snapshot for COPY from the publisher. After this, when
> > we try to receive the data from the publisher using the same slot, it
> > will be in sync with the COPY. I think to keep the same consistency
> > between COPY and the data we receive from the publisher in this
> > approach, we need to export the snapshot while creating a slot in the
> > apply worker by using CRS_EXPORT_SNAPSHOT and then use the same
> > snapshot by all the tablesync workers doing the copy. In tablesync
> > workers, we can use the SET TRANSACTION SNAPSHOT command after "BEGIN
> > READ ONLY ISOLATION LEVEL REPEATABLE READ" to use the exported
> > snapshot. That way the COPY will use the same snapshot as is used for
> > receiving the changes in apply worker and the data will be in sync.
> >
> > Then we also need a way to export snapshot while the apply worker is
> > already receiving the changes because users can use 'ALTER
> > SUBSCRIPTION name REFRESH PUBLICATION' which allows new tables to be
> > synced. I think we need to introduce a new command in
> > exec_replication_command() to export the snapshot from the existing
> > slot and then use it by the new tablesync worker.
> >
> >
> > Among the above three solutions, the first two will solve the first
> > problem (allow the tablesync worker to decode transactions at prepare
> > time) and the third solution will solve both the first and second
> > problem (partial transaction data visibility). The third solution
> > requires quite some redesign of how the Logical Replication work is
> > synchronized between apply and tablesync workers and might turn out to
> > be a bigger implementation effort. I am tentatively thinking to go
> > with a first or second solution at this stage and anyway if later
> > people feel that we need some bigger redesign then we can go with
> > something on the lines of Solution-3.
> >
> > Thoughts?
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
> > [2] - https://www.postgresql.org/message-id/CAHut+PuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA@mail.gmail.com
> > [3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
> > [4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com
> >
> > --
>
> Hi Amit,
>
> - Solution-3 has become too complicated to be attempted by me. Anyway,
> we may be better to just focus on eliminating the new problems exposed
> by the 2PC work [1], rather than burning too much effort to fix some
> other quirk which apparently has existed for years.
> [1] https://www.postgresql.org/message-id/CAHut%2BPtm7E5Jj92tJWPtnnjbNjJN60_%3DaGGKYW3h23b7J%3DqeDg%40mail.gmail.com
>
> - Solution-2 has some potential lag problems, and maybe file resource
> problems as well. This idea did not get a very favourable response
> when I first proposed it.
>
> - This leaves Solution-1 as the best viable option to fix the current
> known 2PC trouble.
>
> ~~
>
> So I will try to write a patch for the proposed Solution-1.

Yeah, I also think that Solution-1 is the best for solving the 2PC problem.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Dec 10, 2020 at 8:49 PM Peter Smith <smithpb2250@gmail.com> wrote:

> So I will try to write a patch for the proposed Solution-1.
>

Hi Amit.

FYI, here is my v3 WIP patch for the Solution1.

This patch applies onto the v30 patch set [1] from the other 2PC thread:
[1] https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

Although incomplete, it does continue to pass all of make check and the
src/test/subscription TAP tests.

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.


TODO / Known Issues:

* The tablesync replication origin/lsn logic all needs to be updated
so that tablesync knows where to restart based on information held by
the now permanent slot.

* the current implementation of tablesync drop slot (e.g. from DROP
SUBSCRIPTION) or finish_sync_worker regenerates the tablesync slot
name so it knows what slot to drop. The current code may be ok for
normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
(slot_name = newname) it would be unable to find the tablesync
slot. Some redesign may be needed for this part.

* help / comments / cleanup

* There is temporary "!!>>" excessive logging of mine scattered around
which I added to help my testing during development

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v4 WIP patch for the Solution1.

This patch applies onto the v30 patch set [1] from other 2PC thread:
[1] https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

Although incomplete, it does still pass all of make check and the
src/test/subscription TAP tests.

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync now sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for apply worker)

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

TODO / Known Issues:

* the current implementation of tablesync drop slot (e.g. from
DropSubscription or finish_sync_worker) regenerates the tablesync slot
name so it knows what slot to drop. The current code might be ok for
normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
(slot_name = newname) it would be unable to find the tablesync
slot.

* I think if there are crashed tablesync workers then they are not
known to DropSubscription. So this might be a problem for cleaning up
slots and/or origin tracking belonging to those unknown workers.

* help / comments / cleanup

* There is temporary "!!>>" excessive logging of mine scattered around
which I added to help my testing during development

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> TODO / Known Issues:
>
> * the current implementation of tablesync drop slot (e.g. from
> DropSubscription or finish_sync_worker) regenerates the tablesync slot
> name so it knows what slot to drop.
>

If you always drop the slot at finish_sync_worker, then in which case
do you need to drop it during DropSubscription? Is it when the table
sync workers have crashed?

> The current code might be ok for
> normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> (slot_name = newname) it would fail to be able to find the tablesync
> slot.
>

Sure, but the same will be true for the apply worker slot as well. I
agree the problem would be more for table sync workers but I think we
can solve it, see below.

> * I think if there are crashed tablesync workers then they are not
> known to DropSubscription. So this might be a problem to cleanup slots
> and/or origin tracking belonging to those unknown workers.
>

Yeah, I think we can do two things to avoid this and the previous
problem. (a) We can generate the slot_name for the table sync worker
based on only the subscription_id and rel_id. (b) Immediately after
creating the slot, advance the replication origin with the position
(origin_startpos) we get from walrcv_create_slot; this will help us to
start from the right location.
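
Something like this, as a sketch (the slot-name format and the advance
arguments are assumptions):

    /* (a) Deterministic slot name from subscription and relation
     *     OIDs, so a restarted worker (or DropSubscription) can
     *     recompute it without depending on the subscription's
     *     slot_name. */
    snprintf(syncslotname, NAMEDATALEN, "pg_%u_sync_%u",
             MySubscription->oid, MyLogicalRepWorker->relid);

    /* (b) Right after slot creation returns origin_startpos, push the
     *     origin forward so a restart resumes from the slot's start. */
    originid = replorigin_create(originname);
    replorigin_advance(originid, origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */ , true /* WAL log */ );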

Do you see anything which will still not be addressed after doing the above?

I understand why you are trying to create this patch atop the logical
decoding of 2PC patch, but I think it is better to create this as an
independent patch and then use it to test the 2PC problem. Also, please
explain what kind of testing you did to ensure that it works properly
after the table sync worker restarts after a crash.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Dec 19, 2020 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem. Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.
>

Few other comments:
==================
1.
+ * FIXME 3 - Crashed tablesync workers may also have remaining slots
+ * because I don't think such workers are even iterated by this loop,
+ * and nobody else is removing them.
+ */
+ if (slotname)
+ {

The above FIXME is not clear to me. Actually, the crashed workers
should restart, finish their work, and drop the slots. So I am not sure
what exactly this FIXME refers to.

2.
DropSubscription()
{
..
ReplicationSlotDropAtPubNode(
+ NULL,
+ conninfo, /* use conninfo to make a new connection. */
+ subname,
+ syncslotname);
..
}

With the above call, it will form a connection with the publisher and
drop the required slots. I think we need to save the connection info
so that we don't need to connect/disconnect for each slot to be
dropped. Later in this function, we again connect and drop the apply
worker slot. I think we should connect just once and drop the apply
and table sync slots, if any.
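
i.e. something along these lines (a sketch only, reusing the patch's
ReplicationSlotDropAtPubNode with its existing signature):

    /* Connect to the publisher once... */
    wrconn = walrcv_connect(conninfo, true /* logical */, subname, &err);

    /* ...drop each remaining tablesync slot over that connection... */
    foreach(lc, rstates)
    {
        /* (compute syncslotname for this relation here) */
        ReplicationSlotDropAtPubNode(wrconn, conninfo, subname,
                                     syncslotname);
    }

    /* ...then drop the apply worker's slot and disconnect. */
    ReplicationSlotDropAtPubNode(wrconn, conninfo, subname, slotname);
    walrcv_disconnect(wrconn);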

3.
ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
*conninfo, char *subname, char *slotname)
{
..
+ PG_TRY();
..
+ PG_CATCH();
+ {
+ /* NOP. Just gobble any ERROR. */
+ }
+ PG_END_TRY();

Why are we suppressing the error instead of handling the error in
the same way as we do while dropping the apply worker slot in
DropSubscription?

4.
@@ -139,6 +141,28 @@ finish_sync_worker(void)
  get_rel_name(MyLogicalRepWorker->relid))));
  CommitTransactionCommand();

+ /*
+ * Cleanup the tablesync slot.
+ */
+ {
+ extern void ReplicationSlotDropAtPubNode(
+ WalReceiverConn *wrconn_given, char *conninfo, char *subname, char *slotname);

This is not how we export functions at other places?
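
For example, the usual way would be a single declaration in a header
(the file chosen here is just a suggestion):

    /* e.g. in src/include/replication/slot.h */
    extern void ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given,
                                             char *conninfo, char *subname,
                                             char *slotname);

and then #include that header in tablesync.c, rather than writing an
extern declaration inside the body of finish_sync_worker().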

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v5 WIP patch for the Solution1.

This patch still applies onto the v30 patch set [1] from other 2PC thread:
[1] https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

TODO / Known Issues:

* I think if there are crashed tablesync workers, they may not be known
to the current DropSubscription code. This might be a problem for
cleaning up slots and/or origin tracking belonging to those unknown
workers.

* Help / comments / cleanup

* There is temporary "!!>>" excessive logging of mine scattered around
which I added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync worker never got
to the SYNCDONE state (maybe it crashed) then finish_sync_worker may
not be called.
So I think a rogue tablesync slot might still exist during DropSubscription.

>
> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.
(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far. All TAP tests pass, but more testing is needed
for the origin stuff.

>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far I have tested like this - I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row that causes a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and is re-launched. Then I can remove the extra row so
the PK violation does not happen, and the (re-launched) tablesync can
complete and finish normally. The apply worker then takes over.

I have attached some captured/annotated logging of my test scenario
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on)

---
Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Few other comments:
> ==================

Thanks for your feedback.

> 1.
> * FIXME 3 - Crashed tablesync workers may also have remaining slots
> because I don't think
> + * such workers are even iterated by this loop, and nobody else is
> removing them.
> + */
> + if (slotname)
> + {
>
> The above FIXME is not clear to me. Actually, the crashed workers
> should restart, finish their work, and drop the slots. So not sure
> what exactly this FIXME refers to?

Yes, normally if the tablesync can complete it should behave like that.
But I think there are other scenarios where it may be unable to
clean-up after itself. For example:

i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
handled by tablesync can give a PK violation, which will also crash
each re-launched/replacement tablesync worker again and again. This
can be reproduced in the debugger. If the DropSubscription doesn't
clean up the tablesync's slot then nobody will.

ii) Also, the DROP SUBSCRIPTION code has locking (see code comment) "to
ensure that the launcher doesn't restart new worker during dropping
the subscription". So executing DROP SUBSCRIPTION will prevent a newly
crashed tablesync from re-launching, so it won’t be able to take care
of its own slot. If the DropSubscription doesn't clean up that
tablesync's slot then nobody will.

>
> 2.
> DropSubscription()
> {
> ..
> ReplicationSlotDropAtPubNode(
> + NULL,
> + conninfo, /* use conninfo to make a new connection. */
> + subname,
> + syncslotname);
> ..
> }
>
> With the above call, it will form a connection with the publisher and
> drop the required slots. I think we need to save the connection info
> so that we don't need to connect/disconnect for each slot to be
> dropped. Later in this function, we again connect and drop the apply
> worker slot. I think we should connect just once and drop the apply
> and table sync slots, if any.

OK. IIUC this is a suggestion for more efficient connection usage,
rather than an actual bug, right? I have added this suggestion to my
TODO list.

>
> 3.
> ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> *conninfo, char *subname, char *slotname)
> {
> ..
> + PG_TRY();
> ..
> + PG_CATCH();
> + {
> + /* NOP. Just gobble any ERROR. */
> + }
> + PG_END_TRY();
>
> Why are we suppressing the error instead of handling the error in
> the same way as we do while dropping the apply worker slot in
> DropSubscription?

This function is common - it is also called from the tablesync
finish_sync_worker. But in the finish_sync_worker case I wanted to
avoid throwing an ERROR which would cause the tablesync to crash and
relaunch (and crash/relaunch/repeat...) when all it was trying to do
in the first place was just to clean up and exit the process. Perhaps
the error suppression should be conditional, depending on where this
function is called from?

>
> 4.
> @@ -139,6 +141,28 @@ finish_sync_worker(void)
>   get_rel_name(MyLogicalRepWorker->relid))));
>   CommitTransactionCommand();
>
> + /*
> + * Cleanup the tablesync slot.
> + */
> + {
> + extern void ReplicationSlotDropAtPubNode(
> + WalReceiverConn *wrconn_given, char *conninfo, char *subname, char *slotname);
>
> This is not how we export functions at other places?

Fixed in latest v5 patch -
https://www.postgresql.org/message-id/CAHut%2BPvmDJ_EO11_up%3D_cRbOjhdWCMG-n7kF-mdRhjtCHcjHRA%40mail.gmail.com


----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Few other comments:
> > ==================
>
> Thanks for your feedback.
>
> > 1.
> > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > because I don't think
> > + * such workers are even iterated by this loop, and nobody else is
> > removing them.
> > + */
> > + if (slotname)
> > + {
> >
> > The above FIXME is not clear to me. Actually, the crashed workers
> > should restart, finish their work, and drop the slots. So not sure
> > what exactly this FIXME refers to?
>
> Yes, normally if the tablesync can complete it should behave like that.
> But I think there are other scenarios where it may be unable to
> clean-up after itself. For example:
>
> i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> handled by tablesync can give a PK violation which also will crash
> again and again for each re-launched/replacement tablesync worker.
> This can be reproduced in the debugger. If the DropSubscription
> doesn't clean-up the tablesync's slot then nobody will.
>
> > ii) Also DROP SUBSCRIPTION code has locking (see code comment) "to
> ensure that the launcher doesn't restart new worker during dropping
> the subscription".
>

Yeah, I have also read that comment but do you know how it is
preventing relaunch? How does the subscription lock help?

> So executing DROP SUBSCRIPTION will prevent a newly
> crashed tablesync from re-launching, so it won’t be able to take care
> of its own slot. If the DropSubscription doesn't clean-up that
> tablesync's slot then nobody will.
>


> >
> > 2.
> > DropSubscription()
> > {
> > ..
> > ReplicationSlotDropAtPubNode(
> > + NULL,
> > + conninfo, /* use conninfo to make a new connection. */
> > + subname,
> > + syncslotname);
> > ..
> > }
> >
> > With the above call, it will form a connection with the publisher and
> > drop the required slots. I think we need to save the connection info
> > so that we don't need to connect/disconnect for each slot to be
> > dropped. Later in this function, we again connect and drop the apply
> > worker slot. I think we should connect just once and drop the apply
> > and table sync slots, if any.
>
> OK. IIUC this is a suggestion for more efficient connection usage,
> rather than actual bug right?
>

Yes, it is for effective connection usage.

> I have added this suggestion to my TODO
> list.
>
> >
> > 3.
> > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > *conninfo, char *subname, char *slotname)
> > {
> > ..
> > + PG_TRY();
> > ..
> > + PG_CATCH();
> > + {
> > + /* NOP. Just gobble any ERROR. */
> > + }
> > + PG_END_TRY();
> >
> > Why are we suppressing the error instead of handling the error in
> > the same way as we do while dropping the apply worker slot in
> > DropSubscription?
>
> This function is common - it is also called from the tablesync
> finish_sync_worker. But in the finish_sync_worker case I wanted to
> avoid throwing an ERROR which would cause the tablesync to crash and
> relaunch (and crash/relaunch/repeat...) when all it was trying to do
> in the first place was just cleanup and exit the process. Perhaps the
> error suppression should be conditional depending where this function
> is called from?
>

Yeah, that could be one way and if you follow my previous suggestion
this function might change a bit more.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v6 WIP patch for the Solution1.

This patch still applies onto the v30 patch set [1] from other 2PC thread:
[1] https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

TODO / Known Issues:

* Crashed tablesync workers may not be known to the current
DropSubscription code. This might be a problem for cleaning up slots
and/or origin tracking belonging to those unknown workers.

* There seems to be a race condition during DROP SUBSCRIPTION. It
manifests as the TAP test 007 hanging. Logging shows it seems to happen
during replorigin_drop when called from DropSubscription. It is timing
related and quite rare - e.g. it only happens about once in every 10
runs of the subscription TAP tests.

* Help / comments / cleanup

* There is temporary "!!>>" excessive logging of mine scattered around
which I added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > Few other comments:
> > > ==================
> >
> > Thanks for your feedback.
> >
> > > 1.
> > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > because I don't think
> > > + * such workers are even iterated by this loop, and nobody else is
> > > removing them.
> > > + */
> > > + if (slotname)
> > > + {
> > >
> > > The above FIXME is not clear to me. Actually, the crashed workers
> > > should restart, finish their work, and drop the slots. So not sure
> > > what exactly this FIXME refers to?
> >
> > Yes, normally if the tablesync can complete it should behave like that.
> > But I think there are other scenarios where it may be unable to
> > clean-up after itself. For example:
> >
> > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > handled by tablesync can give a PK violation which also will crash
> > again and again for each re-launched/replacement tablesync worker.
> > This can be reproduced in the debugger. If the DropSubscription
> > doesn't clean-up the tablesync's slot then nobody will.
> >
> > ii) Also DROP SUBSCRIPTION code has locking (see code comment) "to
> > ensure that the launcher doesn't restart new worker during dropping
> > the subscription".
> >
>
> Yeah, I have also read that comment but do you know how it is
> preventing relaunch? How does the subscription lock help?

Hmmm. I did see that there is a matching lock in get_subscription_list
of launcher.c, which may be what that code comment was referring to.
But I am also currently unsure how this lock prevents anybody (e.g.
process_syncing_tables_for_apply) from executing another
logicalrep_worker_launch.

>
> > So executing DROP SUBSCRIPTION will prevent a newly
> > crashed tablesync from re-launching, so it won’t be able to take care
> > of its own slot. If the DropSubscription doesn't clean-up that
> > tablesync's slot then nobody will.
> >
>
>
> > >
> > > 2.
> > > DropSubscription()
> > > {
> > > ..
> > > ReplicationSlotDropAtPubNode(
> > > + NULL,
> > > + conninfo, /* use conninfo to make a new connection. */
> > > + subname,
> > > + syncslotname);
> > > ..
> > > }
> > >
> > > With the above call, it will form a connection with the publisher and
> > > drop the required slots. I think we need to save the connection info
> > > so that we don't need to connect/disconnect for each slot to be
> > > dropped. Later in this function, we again connect and drop the apply
> > > worker slot. I think we should connect just once and drop the apply
> > > and table sync slots, if any.
> >
> > OK. IIUC this is a suggestion for more efficient connection usage,
> > rather than actual bug right?
> >
>
> Yes, it is for effective connection usage.
>

I have addressed this in the latest patch [v6]

> >
> > >
> > > 3.
> > > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > > *conninfo, char *subname, char *slotname)
> > > {
> > > ..
> > > + PG_TRY();
> > > ..
> > > + PG_CATCH();
> > > + {
> > > + /* NOP. Just gobble any ERROR. */
> > > + }
> > > + PG_END_TRY();
> > >
> > > Why are we suppressing the error instead of handling the error in
> > > the same way as we do while dropping the apply worker slot in
> > > DropSubscription?
> >
> > This function is common - it is also called from the tablesync
> > finish_sync_worker. But in the finish_sync_worker case I wanted to
> > avoid throwing an ERROR which would cause the tablesync to crash and
> > relaunch (and crash/relaunch/repeat...) when all it was trying to do
> > in the first place was just cleanup and exit the process. Perhaps the
> > error suppression should be conditional depending where this function
> > is called from?
> >
>
> Yeah, that could be one way and if you follow my previous suggestion
> this function might change a bit more.

I have addressed this in the latest patch [v6]

---
[v6] https://www.postgresql.org/message-id/CAHut%2BPuCLty2HGNT6neyOcUmBNxOLo%3DybQ2Yv-nTR4kFY-8QLw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v7 WIP patch for the Solution1.

This patch still applies onto the v30 patch set [1] from other 2PC thread:
[1] https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

* The v7 DropSubscription cleanup code has been rewritten since v6.
The subscription TAP tests have been executed many (7) times now
without observing any of the race problems that I previously reported
seeing when using the v6 patch.

TODO / Known Issues:

* Help / comments / cleanup

* There is temporary "!!>>" excessive logging scattered around which I
added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v8 WIP patch for the Solution1.

This has the same code changes as the v7 patch, but the v8 patch can
be applied to the current PG OSS master code base.

====

Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worker now allows multiple tx instead of a single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

* The DropSubscription cleanup code was changed a lot in v7. The
subscription TAP tests have now been executed 6x without observing any
of the race problems that were sometimes seen with the v6 patch.

TODO / Known Issues:

* Help / comments

* There is temporary "!!>>" excessive logging scattered around which I
added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Dec 23, 2020 at 11:49 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Amit.
>
> PSA my v7 WIP patch for the Solution1.
>

Few comments:
================
1.
+ * Rarely, the DropSubscription may be issued when a tablesync is
+ * still in SYNCDONE but not yet in READY state. If this happens then
+ * the drop slot could fail because it is already dropped.
+ * In this case suppress the drop slot error.
+ *
+ * FIXME - Is there a better way than this?
+ */
+ if (rstate->state != SUBREL_STATE_SYNCDONE)
+ PG_RE_THROW();

So, does this situation happen when we try to drop the subscription
after the state is changed to SYNCDONE but not yet to READY? If so,
then can't we write a function GetSubscriptionNotDoneRelations, similar
to GetSubscriptionNotReadyRelations, where we get a list of relations
that are not yet in the done stage? I think this should be safe because
once we are here we shouldn't be allowed to start a new worker, and the
old workers are already stopped by this function.
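
For example, a skeleton modeled on GetSubscriptionNotReadyRelations()
(the name and the exact set of states to exclude are assumptions):

    /* Get all relations for subscription whose sync is unfinished. */
    List *
    GetSubscriptionNotDoneRelations(Oid subid)
    {
        List       *res = NIL;
        Relation    rel;
        HeapTuple   tup;
        ScanKeyData skey[1];
        SysScanDesc scan;

        rel = table_open(SubscriptionRelRelationId, AccessShareLock);

        ScanKeyInit(&skey[0],
                    Anum_pg_subscription_rel_srsubid,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(subid));

        scan = systable_beginscan(rel, InvalidOid, false, NULL, 1, skey);

        while (HeapTupleIsValid(tup = systable_getnext(scan)))
        {
            Form_pg_subscription_rel subrel;
            SubscriptionRelState *relstate;

            subrel = (Form_pg_subscription_rel) GETSTRUCT(tup);

            /* skip relations whose sync has already completed */
            if (subrel->srsubstate == SUBREL_STATE_SYNCDONE ||
                subrel->srsubstate == SUBREL_STATE_READY)
                continue;

            relstate = (SubscriptionRelState *)
                palloc(sizeof(SubscriptionRelState));
            relstate->relid = subrel->srrelid;
            relstate->state = subrel->srsubstate;
            relstate->lsn = subrel->srsublsn;

            res = lappend(res, relstate);
        }

        systable_endscan(scan);
        table_close(rel, AccessShareLock);

        return res;
    }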

2. Your changes in LogicalRepSyncTableStart() don't seem to be
right. IIUC, you are copying the table in one transaction, then
updating the state to SUBREL_STATE_COPYDONE in another transaction,
and after that doing replorigin_advance. Consider what happens if we
error out after the first txn (in which we have copied the table) is
committed. After the restart, it will again try to copy and lead to an
error. Similarly, consider if we error out after the second
transaction: we won't know where to start decoding from. I think all of
these should be done in a single transaction.
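
i.e. roughly this shape, as a sketch (SUBREL_STATE_COPYDONE is the
patch's new state; the origin handling details are assumptions):

    /*
     * The copy, the COPYDONE state change, and the origin position
     * must all become visible together, so a crash can never leave
     * the catalog claiming more (or less) progress than the data has.
     */
    StartTransactionCommand();

    copy_table(rel);

    UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                               MyLogicalRepWorker->relid,
                               SUBREL_STATE_COPYDONE,
                               *origin_startpos);

    replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */ , true /* WAL log */ );

    CommitTransactionCommand();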

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > Few other comments:
> > > > ==================
> > >
> > > Thanks for your feedback.
> > >
> > > > 1.
> > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > because I don't think
> > > > + * such workers are even iterated by this loop, and nobody else is
> > > > removing them.
> > > > + */
> > > > + if (slotname)
> > > > + {
> > > >
> > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > should restart, finish their work, and drop the slots. So not sure
> > > > what exactly this FIXME refers to?
> > >
> > > Yes, normally if the tablesync can complete it should behave like that.
> > > But I think there are other scenarios where it may be unable to
> > > clean-up after itself. For example:
> > >
> > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > handled by tablesync can give a PK violation which also will crash
> > > again and again for each re-launched/replacement tablesync worker.
> > > This can be reproduced in the debugger. If the DropSubscription
> > > doesn't clean-up the tablesync's slot then nobody will.
> > >
> > > ii) Also DROP SUBSCRIPTION code has locking (see code comment) "to
> > > ensure that the launcher doesn't restart new worker during dropping
> > > the subscription".
> > >
> >
> > Yeah, I have also read that comment but do you know how it is
> > preventing relaunch? How does the subscription lock help?
>
> Hmmm. I did see there is a matching lock in get_subscription_list of
> launcher.c, which may be what that code comment was referring to. But
> I also am currently unsure how this lock prevents anybody (e.g.
> process_syncing_tables_for_apply) from executing another
> logicalrep_worker_launch.
>

process_syncing_tables_for_apply will be called by the apply worker,
and we are stopping the apply worker. So, after that the launcher won't
start a new apply worker because of get_subscription_list(), and if the
apply worker is not started then it won't be able to start a tablesync
worker. So, we need to handle crashed tablesync workers here by
dropping any sync slots they have left behind.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v9 WIP patch for the Solution1 which addresses some recent
review comments, and other minor changes.

====

Features:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* the tablesync worker now allows multiple transactions instead of a single transaction

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* The DropSubscription cleanup code was enhanced in v7 to take care of
crashed sync workers.

* Minor updates to PG docs

TODO / Known Issues:

* Source includes temporary excessive "!!>>" logging which I added to
help testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Dec 23, 2020 at 9:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 22, 2020 at 4:58 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > Few other comments:
> > > > > ==================
> > > >
> > > > Thanks for your feedback.
> > > >
> > > > > 1.
> > > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > > because I don't think
> > > > > + * such workers are even iterated by this loop, and nobody else is
> > > > > removing them.
> > > > > + */
> > > > > + if (slotname)
> > > > > + {
> > > > >
> > > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > > should restart, finish their work, and drop the slots. So not sure
> > > > > what exactly this FIXME refers to?
> > > >
> > > > Yes, normally if the tablesync can complete it should behave like that.
> > > > But I think there are other scenarios where it may be unable to
> > > > clean-up after itself. For example:
> > > >
> > > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > > handled by tablesync can give a PK violation which also will crash
> > > > again and again for each re-launched/replacement tablesync worker.
> > > > This can be reproduced in the debugger. If the DropSubscription
> > > > doesn't clean-up the tablesync's slot then nobody will.
> > > >
> > > > ii) Also DROP SUBSCRIPTION code has locking (see code comment) "to
> > > > ensure that the launcher doesn't restart new worker during dropping
> > > > the subscription".
> > > >
> > >
> > > Yeah, I have also read that comment but do you know how it is
> > > preventing relaunch? How does the subscription lock help?
> >
> > Hmmm. I did see there is a matching lock in get_subscription_list of
> > launcher.c, which may be what that code comment was referring to. But
> > I also am currently unsure how this lock prevents anybody (e.g.
> > process_syncing_tables_for_apply) from executing another
> > logicalrep_worker_launch.
> >
>
> process_syncing_tables_for_apply will be called by the apply worker,
> and we are stopping the apply worker. So, after that the launcher won't
> start a new apply worker because of get_subscription_list(), and if the
> apply worker is not started then it won't be able to start a tablesync
> worker. So, we need to handle crashed tablesync workers here by
> dropping any sync slots they have left behind.

Yes, in the v6 patch code this was a problem in need of handling. But
since the v7 patch the DropSubscription code is now using a separate
GetSubscriptionNotReadyRelations loop to handle the cleanup of
potentially leftover slots from crashed tablesync workers (i.e.
workers that never got to a READY state).
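
For reference, the shape of that DropSubscription cleanup loop is roughly
as follows (a sketch based on the patch description; the exact code in the
patch may differ):

    /* Workers are already stopped, so any remaining per-table slot must
     * be a leftover from a crashed tablesync worker. */
    List     *rstates = GetSubscriptionNotReadyRelations(subid);
    ListCell *lc;

    foreach(lc, rstates)
    {
        SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
        char *syncslotname = ReplicationSlotNameForTablesync(subid,
                                                             rstate->relid);

        /* Drop the (possibly dangling) tablesync slot on the publisher. */
        ReplicationSlotDropAtPubNode(wrconn, syncslotname);
        pfree(syncslotname);
    }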

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Dec 23, 2020 at 8:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 1.
> + * Rarely, the DropSubscription may be issued when a tablesync still
> + * is in SYNCDONE but not yet in READY state. If this happens then
> + * the drop slot could fail because it is already dropped.
> + * In this case suppress the drop slot error.
> + *
> + * FIXME - Is there a better way than this?
> + */
> + if (rstate->state != SUBREL_STATE_SYNCDONE)
> + PG_RE_THROW();
>
> So, does this situation happen when we try to drop the subscription after
> the state has changed to syncdone but not yet to ready? If so, then can't
> we write a function GetSubscriptionNotDoneRelations, similar to
> GetSubscriptionNotReadyRelations, where we get a list of relations that
> are not yet in the done stage? I think this should be safe because once we
> are here we shouldn't be allowed to start a new worker, and old workers are
> already stopped by this function.

Yes, but I don't see how adding such a function is an improvement over
the existing code:
e.g.1. GetSubscriptionNotDoneRelations will include the READY state
(which we don't want) just like GetSubscriptionNotReadyRelations
includes the SYNCDONE state.
e.g.2. Or, something like GetSubscriptionNotDoneAndNotReadyRelations
would be an unnecessary overkill replacement for the current simple
"if".

AFAIK the code is OK as-is. That "FIXME" comment was really meant only
to highlight this for review, rather than to imply something needed to
be fixed. I have removed that "FIXME" comment in the latest patch
[v9].

>
> 2. Your changes in LogicalRepSyncTableStart() don't seem to be
> right. IIUC, you are copying the table in one transaction, then
> updating the state to SUBREL_STATE_COPYDONE in another transaction,
> and after that doing replorigin_advance. Consider what happens if we
> error out after the first txn, in which we copied the table, is
> committed. After the restart, it will try to copy again and that will
> lead to an error. Similarly, consider if we error out after the second
> transaction: we won't know where to start decoding from. I think all of
> these should be done in a single transaction.

Fixed as suggested. Please see latest patch [v9]

---

[v9] https://www.postgresql.org/message-id/CAHut%2BPv8ShLmrjCriVU%2Btprk_9b2kwBxYK2oGSn5Eb__kWVc7A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Dec 30, 2020 at 11:51 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Dec 23, 2020 at 8:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 1.
> > + * Rarely, the DropSubscription may be issued when a tablesync still
> > + * is in SYNCDONE but not yet in READY state. If this happens then
> > + * the drop slot could fail because it is already dropped.
> > + * In this case suppress the drop slot error.
> > + *
> > + * FIXME - Is there a better way than this?
> > + */
> > + if (rstate->state != SUBREL_STATE_SYNCDONE)
> > + PG_RE_THROW();
> >
> > So, does this situation happen when we try to drop the subscription after
> > the state has changed to syncdone but not yet to ready? If so, then can't
> > we write a function GetSubscriptionNotDoneRelations, similar to
> > GetSubscriptionNotReadyRelations, where we get a list of relations that
> > are not yet in the done stage? I think this should be safe because once we
> > are here we shouldn't be allowed to start a new worker, and old workers are
> > already stopped by this function.
>
> Yes, but I don't see how adding such a function is an improvement over
> the existing code:
>

The advantage is that we don't need to use try..catch to deal with
such conditions, which I don't think is a good way to handle such
cases. Also, even after using try...catch, we can still leak the
slots, because the patch drops the slot after changing the state to
syncdone, and if there is any error while dropping the slot, it simply
skips it. So, it is possible that the rel state is syncdone but the
slot still exists, and if we then get an error due to some different
reason we will silently skip it again and allow the subscription to
be dropped.

I think what we should do instead is drop the slot before we change
the rel state to syncdone. Also, if the apply worker fails to drop the
slot, it should try to drop it again after restart. In
DropSubscription, we can then check whether the rel state is not
SYNCDONE or READY, and drop the corresponding slots.
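
In process_syncing_tables_for_sync, that reordering would look roughly
like this (a sketch only; names per the patch, error handling elided):

    if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
        current_lsn >= MyLogicalRepWorker->relstate_lsn)
    {
        char *syncslotname =
            ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
                                            MyLogicalRepWorker->relid);

        /* Drop the tablesync slot FIRST ... */
        ReplicationSlotDropAtPubNode(wrconn, syncslotname);

        /*
         * ... and only then persist SYNCDONE, so that an error while
         * dropping leaves the state below SYNCDONE and a later restart
         * (or DropSubscription) knows the slot may still need dropping.
         */
        UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                                   MyLogicalRepWorker->relid,
                                   SUBREL_STATE_SYNCDONE,
                                   current_lsn);
    }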

> e.g.1. GetSubscriptionNotDoneRelations will include the READY state
> (which we don't want) just like GetSubscriptionNotReadyRelations
> includes the SYNCDONE state.
> e.g.2. Or, something like GetSubscriptionNotDoneAndNotReadyRelations
> would be an unnecessary overkill replacement for the current simple
> "if".
>

Or we can probably modify the function to something like
GetSubscriptionRelationsNotInStates and pass it an array of states
which we don't want.
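
For example, a hypothetical signature (the name and parameters here are
only illustrative):

    /* Return relations of the subscription whose state is in none of
     * the given states. */
    List *
    GetSubscriptionRelationsNotInStates(Oid subid, int nstates,
                                        const char *states);

    /* e.g. fetch relations that are neither SYNCDONE nor READY: */
    char  excluded[] = {SUBREL_STATE_SYNCDONE, SUBREL_STATE_READY};
    List *rstates = GetSubscriptionRelationsNotInStates(subid,
                                                        lengthof(excluded),
                                                        excluded);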

> AFAIK the code is OK as-is.
>

As described above, there are still race conditions where we can leak
slots and also this doesn't look clean.

Few other comments:
=================
1.
+ elog(LOG, "!!>> DropSubscription: dropping the tablesync slot
\"%s\".", syncslotname);
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname);
+ elog(LOG, "!!>> DropSubscription: dropped the tablesync slot
\"%s\".", syncslotname);

...
...

+ elog(LOG, "!!>> finish_sync_worker: dropping the tablesync slot
\"%s\".", syncslotname);
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname);
+ elog(LOG, "!!>> finish_sync_worker: dropped the tablesync slot
\"%s\".", syncslotname);

Remove these and other elogs added to aid debugging or testing. If you
need these for development purposes then move these to separate patch.

2. Remove WIP from the commit message and patch name.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA my v10 patch for the Solution1.

v10 is essentially the same as v9, except now all the temporary "!!>>"
logging has been isolated to a separate (optional) patch 0002.

====

Features:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* the tablesync worker now allows multiple transactions instead of a single transaction

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a re-launched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* the DropSubscription cleanup code was enhanced (v7+) to take care of
crashed sync workers.

* minor updates to PG docs

TODO / Known Issues:

* address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 4, 2021 at 8:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few other comments:
> =================
> 1.
> + elog(LOG, "!!>> DropSubscription: dropping the tablesync slot
> \"%s\".", syncslotname);
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname);
> + elog(LOG, "!!>> DropSubscription: dropped the tablesync slot
> \"%s\".", syncslotname);
>
> ...
> ...
>
> + elog(LOG, "!!>> finish_sync_worker: dropping the tablesync slot
> \"%s\".", syncslotname);
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname);
> + elog(LOG, "!!>> finish_sync_worker: dropped the tablesync slot
> \"%s\".", syncslotname);
>
> Remove these and other elogs added to aid debugging or testing. If you
> need these for development purposes then move these to separate patch.

Fixed in latest patch (v10).

>
> 2. Remove WIP from the commit message and patch name.
>
> --

Fixed in latest patch (v10)

---
v10 = https://www.postgresql.org/message-id/CAHut%2BPuzPmFzk3p4oL9H3nkiY6utFryV9c5dW6kRhCe_RY%3DgnA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Jan 4, 2021 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few other comments:
> =================
>

Few more comments on v9:
======================
1.
+ /* Drop the tablesync slot. */
+ {
+ char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);
+
+ /*
+ * If the subscription slotname is NONE/NULL and the connection to publisher is
+ * broken, but the DropSubscription should still be allowed to complete.
+ * But without a connection it is not possible to drop any tablesync slots.
+ */
+ if (!wrconn)
+ {
+ /* FIXME - OK to just log a warning? */
+ elog(WARNING, "!!>> DropSubscription: no connection. Cannot drop
tablesync slot \"%s\".",
+   syncslotname);
+ }

Why is this not an ERROR? We don't want to keep the table slots
lingering after DropSubscription. If there is any tablesync slot that
needs to be dropped and the publisher is not available then we should
raise an error.

2.
+ /*
+ * Tablesync resource cleanup (slots and origins).
+ *
+ * Any READY-state relations would already have dealt with clean-ups.
+ */
+ {

There is no need to start a separate block '{' here.

3.
+#define SUBREL_STATE_COPYDONE 'C' /* tablesync copy phase is completed */

You can mention in the comments that sublsn will be NULL for this
state as it is mentioned for other similar states. Can we think of
using any letter in lower case for this as all other states are in
lower-case except for this, which makes it look a bit odd? We can use
'f' or 'e' and describe it as 'copy finished' or 'copy end'. I am fine
if you have any better ideas.
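
For instance (purely illustrative, keeping the patch's macro name but
with a lower-case letter):

    #define SUBREL_STATE_COPYDONE 'f'   /* tablesync copy phase finished
                                         * (sublsn NULL) */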

4.
LogicalRepSyncTableStart()
{
..
..
+copy_table_done:
+
+ /* Setup replication origin tracking. */
+ {
+ char originname[NAMEDATALEN];
+ RepOriginId originid;
+
+ snprintf(originname, sizeof(originname), "pg_%u_%u",
MySubscription->oid, MyLogicalRepWorker->relid);
+ originid = replorigin_by_name(originname, true);
+ if (!OidIsValid(originid))
+ {
+ /*
+ * Origin tracking does not exist. Create it now, and advance to LSN
got from walrcv_create_slot.
+ */
+ elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_create
\"%s\".", originname);
+ originid = replorigin_create(originname);
+ elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_session_setup
\"%s\".", originname);
+ replorigin_session_setup(originid);
+ replorigin_session_origin = originid;
+ elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_advance
\"%s\".", originname);
+ replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
+    true /* go backward */ , true /* WAL log */ );
+ }
+ else
+ {
+ /*
+ * Origin tracking already exists.
+ */
+ elog(LOG, "!!>> LogicalRepSyncTableStart: 2 replorigin_session_setup
\"%s\".", originname);
+ replorigin_session_setup(originid);
+ replorigin_session_origin = originid;
+ elog(LOG, "!!>> LogicalRepSyncTableStart: 2
replorigin_session_get_progress \"%s\".", originname);
+ *origin_startpos = replorigin_session_get_progress(false);
+ }
..
..
}

I am not sure if this code is correct because, for the very first time
when the copy is done, we don't expect a replication origin to exist,
whereas this code will silently use an already existing replication
origin, which can lead to a wrong start position for the slot. In such
a case it should error out. I guess we should create the replication
origin before setting the state to copydone. I feel we should even have
a test case for this, as it is not difficult to have a pre-existing
replication origin.
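
A sketch of the first-copy path as suggested (hypothetical; the ereport
text is only illustrative, and advancing before replorigin_session_setup
also matches comment 7 below, avoiding the relaxed acquired_by check):

    /* First copy: a pre-existing origin indicates leftover state. */
    originid = replorigin_by_name(originname, true);
    if (OidIsValid(originid))
        ereport(ERROR,
                (errcode(ERRCODE_DUPLICATE_OBJECT),
                 errmsg("replication origin \"%s\" already exists",
                        originname)));

    originid = replorigin_create(originname);

    /* Advance first, while the origin is not session-acquired. */
    replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */ , true /* WAL log */ );

    replorigin_session_setup(originid);
    replorigin_session_origin = originid;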

5. Is it possible to write a testcase where we fail (say due to pk
violation or some other error) after the initial copy is done, then
remove the conflicting row and allow a copy to be completed? If we
already have any such test then it is fine.

6.
+/*
+ * Drop the replication slot at the publisher node
+ * using the replication connection.
+ */

This comment looks a bit odd. The first line appears to be too short.
We have a limit of 80 chars but this is much less than that.

7.
@@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
  LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);

  /* Make sure it's not used by somebody else */
- if (replication_state->acquired_by != 0)
+ if (replication_state->acquired_by != 0 &&
replication_state->acquired_by != MyProcPid)
  {

I think you won't need this change if you do replorigin_advance before
replorigin_session_setup in your patch.

8.
- * that ensures we won't loose knowledge about that after a crash if the
+ * that ensures we won't lose knowledge about that after a crash if the

It is better to submit this as a separate patch.


-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Wed, Dec 30, 2020 at 5:08 PM Peter Smith <smithpb2250@gmail.com> wrote:

>
> PSA my v9 WIP patch for the Solution1 which addresses some recent
> review comments, and other minor changes.

I did some tests using the test suite prepared by Erik Rijkers in [1]
during the initial design of tablesync.

Back then, they had seen some errors while doing multiple commits in
initial tablesync. So I've rerun the test script on the v9 patch
applied on HEAD and found no errors.
The script runs pgbench, creates a pub/sub on a standby server, and
all of the pgbench tables are replicated to the standby. The contents
of the tables are compared at
the end of each run to make sure they are identical.
I have run it for around 12 hours, and it worked without any errors.
Attaching the script I used.

regards,
Ajin Cherian
Fujitsu Australia

[1]- https://www.postgresql.org/message-id/93d02794068482f96d31b002e0eb248d%40xs4all.nl

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v11 patch for the Tablesync Solution1.

Difference from v10:
- Addresses several recent review comments.
- pg_indent has been run

====

Features:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* the tablesync worker now allows multiple transactions instead of a single transaction

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a re-launched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* the DropSubscription cleanup code was enhanced (v7+) to take care of
crashed sync workers.

* minor updates to PG docs

TODO / Known Issues:

* address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few more comments on v9:
> ======================
> 1.
> + /* Drop the tablesync slot. */
> + {
> + char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);
> +
> + /*
> + * If the subscription slotname is NONE/NULL and the connection to publisher is
> + * broken, but the DropSubscription should still be allowed to complete.
> + * But without a connection it is not possible to drop any tablesync slots.
> + */
> + if (!wrconn)
> + {
> + /* FIXME - OK to just log a warning? */
> + elog(WARNING, "!!>> DropSubscription: no connection. Cannot drop
> tablesync slot \"%s\".",
> +   syncslotname);
> + }
>
> Why is this not an ERROR? We don't want to keep the table slots
> lingering after DropSubscription. If there is any tablesync slot that
> needs to be dropped and the publisher is not available then we should
> raise an error.

Previously there was only the subscription slot. If the connection was
broken and caused an error then it was still possible for the user to
disassociate the subscription from the slot using ALTER SUBSCRIPTION
... SET (slot_name = NONE).  And then (when the slotname is NULL)  the
DropSubscription could complete OK. I expect in that case the Admin
still had some slot clean-up they would need to do on the Publisher
machine.

But now we have the tablesync slots, so if I made them give an ERROR
when the connection is broken then the subscription would become
un-droppable. If you think that having an ERROR and an undroppable
subscription is better than the current WARNING then please let me
know - it is no problem to change it.

> 2.
> + /*
> + * Tablesync resource cleanup (slots and origins).
> + *
> + * Any READY-state relations would already have dealt with clean-ups.
> + */
> + {
>
> There is no need to start a separate block '{' here.

It is written this way so I can declare variables only in the scope where
they are needed. I didn’t see anything in the PG code conventions
discouraging this practice: https://www.postgresql.org/docs/devel/source.html
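
For example, the pattern in question is just a bare block that limits
the variables' scope (as in the patch fragment quoted earlier):

    /* Drop the tablesync slot. */
    {
        char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);

        ReplicationSlotDropAtPubNode(wrconn, syncslotname);
        pfree(syncslotname);
    }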

> 3.
> +#define SUBREL_STATE_COPYDONE 'C' /* tablesync copy phase is completed */
>
> You can mention in the comments that sublsn will be NULL for this
> state as it is mentioned for other similar states. Can we think of
> using any letter in lower case for this as all other states are in
> lower-case except for this, which makes it look a bit odd? We can use
> 'f' or 'e' and describe it as 'copy finished' or 'copy end'. I am fine
> if you have any better ideas.
>

Fixed in latest patch [v11]

> 4.
> LogicalRepSyncTableStart()
> {
> ..
> ..
> +copy_table_done:
> +
> + /* Setup replication origin tracking. */
> + {
> + char originname[NAMEDATALEN];
> + RepOriginId originid;
> +
> + snprintf(originname, sizeof(originname), "pg_%u_%u",
> MySubscription->oid, MyLogicalRepWorker->relid);
> + originid = replorigin_by_name(originname, true);
> + if (!OidIsValid(originid))
> + {
> + /*
> + * Origin tracking does not exist. Create it now, and advance to LSN
> got from walrcv_create_slot.
> + */
> + elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_create
> \"%s\".", originname);
> + originid = replorigin_create(originname);
> + elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_session_setup
> \"%s\".", originname);
> + replorigin_session_setup(originid);
> + replorigin_session_origin = originid;
> + elog(LOG, "!!>> LogicalRepSyncTableStart: 1 replorigin_advance
> \"%s\".", originname);
> + replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
> +    true /* go backward */ , true /* WAL log */ );
> + }
> + else
> + {
> + /*
> + * Origin tracking already exists.
> + */
> + elog(LOG, "!!>> LogicalRepSyncTableStart: 2 replorigin_session_setup
> \"%s\".", originname);
> + replorigin_session_setup(originid);
> + replorigin_session_origin = originid;
> + elog(LOG, "!!>> LogicalRepSyncTableStart: 2
> replorigin_session_get_progress \"%s\".", originname);
> + *origin_startpos = replorigin_session_get_progress(false);
> + }
> ..
> ..
> }
>
> I am not sure if this code is correct because, for the very first time
> when the copy is done, we don't expect a replication origin to exist,
> whereas this code will silently use an already existing replication
> origin, which can lead to a wrong start position for the slot. In such
> a case it should error out. I guess we should create the replication
> origin before setting the state to copydone. I feel we should even have
> a test case for this, as it is not difficult to have a pre-existing
> replication origin.
>

Fixed as suggested in latest patch [v11]

> 5. Is it possible to write a testcase where we fail (say due to pk
> violation or some other error) after the initial copy is done, then
> remove the conflicting row and allow a copy to be completed? If we
> already have any such test then it is fine.
>

Causing a PK violation during the initial copy is not a problem to
test, but doing it after the initial copy is difficult. I have done
exactly this test scenario before but I thought it cannot be
automated. E.g. to cause a PK violation error somewhere between
COPYDONE and SYNCDONE means that the offending insert (the one which
tablesync will fail to replicate) has to be sent while the tablesync
is in CATCHUP mode. But AFAIK that can only be achieved using the
debugger to get the timing right.

> 6.
> +/*
> + * Drop the replication slot at the publisher node
> + * using the replication connection.
> + */
>
> This comment looks a bit odd. The first line appears to be too short.
> We have a limit of 80 chars but this is much less than that.
>

Fixed in latest patch [v11]

> 7.
> @@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
>   LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
>
>   /* Make sure it's not used by somebody else */
> - if (replication_state->acquired_by != 0)
> + if (replication_state->acquired_by != 0 &&
> replication_state->acquired_by != MyProcPid)
>   {
>

TODO

> I think you won't need this change if you do replorigin_advance before
> replorigin_session_setup in your patch.
>
> 8.
> - * that ensures we won't loose knowledge about that after a crash if the
> + * that ensures we won't lose knowledge about that after a crash if the
>
> It is better to submit this as a separate patch.
>

Done. Please see CF entry. https://commitfest.postgresql.org/32/2926/

----
[v11] = https://www.postgresql.org/message-id/CAHut%2BPu0A6TUPgYC-L3BKYQfa_ScL31kOV_3RsB3ActdkL1iBQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Jan 5, 2021 at 3:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few more comments on v9:
> > ======================
> > 1.
> > + /* Drop the tablesync slot. */
> > + {
> > + char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);
> > +
> > + /*
> > + * If the subscription slotname is NONE/NULL and the connection to publisher is
> > + * broken, but the DropSubscription should still be allowed to complete.
> > + * But without a connection it is not possible to drop any tablesync slots.
> > + */
> > + if (!wrconn)
> > + {
> > + /* FIXME - OK to just log a warning? */
> > + elog(WARNING, "!!>> DropSubscription: no connection. Cannot drop
> > tablesync slot \"%s\".",
> > +   syncslotname);
> > + }
> >
> > Why is this not an ERROR? We don't want to keep the table slots
> > lingering after DropSubscription. If there is any tablesync slot that
> > needs to be dropped and the publisher is not available then we should
> > raise an error.
>
> Previously there was only the subscription slot. If the connection was
> broken and caused an error then it was still possible for the user to
> disassociate the subscription from the slot using ALTER SUBSCRIPTION
> ... SET (slot_name = NONE).  And then (when the slotname is NULL)  the
> DropSubscription could complete OK. I expect in that case the Admin
> still had some slot clean-up they would need to do on the Publisher
> machine.
>

I think such an option could probably be used for user-created slots
but it would be difficult for even Admin to know about these
internally created slots associated with the particular subscription.
I would say it is better to ERROR out.

>
> > 2.
> > + /*
> > + * Tablesync resource cleanup (slots and origins).
> > + *
> > + * Any READY-state relations would already have dealt with clean-ups.
> > + */
> > + {
> >
> > There is no need to start a separate block '{' here.
>
> It is written this way so I can declare variables only in the scope where
> they are needed. I didn’t see anything in the PG code conventions
> discouraging this practice: https://www.postgresql.org/docs/devel/source.html
>

But do we encourage such a coding convention for declaring variables? I
find it difficult to read such code. I guess as a one-off we can do
this, but I don't see a compelling need here.

> > 3.
> > +#define SUBREL_STATE_COPYDONE 'C' /* tablesync copy phase is completed */
> >
> > You can mention in the comments that sublsn will be NULL for this
> > state as it is mentioned for other similar states. Can we think of
> > using any letter in lower case for this as all other states are in
> > > lower-case except for this, which makes it look a bit odd? We can use
> > 'f' or 'e' and describe it as 'copy finished' or 'copy end'. I am fine
> > if you have any better ideas.
> >
>
> Fixed in latest patch [v11]
>

It is still not reflected in the docs. See below:
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7651,6 +7651,7 @@ SCRAM-SHA-256$<replaceable><iteration
count></replaceable>:<replaceable>&l
        State code:
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
+       <literal>C</literal> = table data has been copied,
        <literal>s</literal> = synchronized,

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Tue, Jan 5, 2021 at 10:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > 1.
> > > + /* Drop the tablesync slot. */
> > > + {
> > > + char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);
> > > +
> > > + /*
> > > + * If the subscription slotname is NONE/NULL and the connection to publisher is
> > > + * broken, but the DropSubscription should still be allowed to complete.
> > > + * But without a connection it is not possible to drop any tablesync slots.
> > > + */
> > > + if (!wrconn)
> > > + {
> > > + /* FIXME - OK to just log a warning? */
> > > + elog(WARNING, "!!>> DropSubscription: no connection. Cannot drop
> > > tablesync slot \"%s\".",
> > > +   syncslotname);
> > > + }
> > >
> > > Why is this not an ERROR? We don't want to keep the table slots
> > > lingering after DropSubscription. If there is any tablesync slot that
> > > needs to be dropped and the publisher is not available then we should
> > > raise an error.
> >
> > Previously there was only the subscription slot. If the connection was
> > broken and caused an error then it was still possible for the user to
> > disassociate the subscription from the slot using ALTER SUBSCRIPTION
> > ... SET (slot_name = NONE).  And then (when the slotname is NULL)  the
> > DropSubscription could complete OK. I expect in that case the Admin
> > still had some slot clean-up they would need to do on the Publisher
> > machine.
> >
>
> I think such an option could probably be used for user-created slots
> but it would be difficult for even Admin to know about these
> internally created slots associated with the particular subscription.
> I would say it is better to ERROR out.

I am having doubts that ERROR is the best choice here. There is a long
note in https://www.postgresql.org/docs/devel/sql-dropsubscription.html
which describes this problem for the subscription slot and how to
disassociate the name to give a workaround “To proceed in this
situation”.

OTOH if we make the tablesync slot unconditionally ERROR for a broken
connection then there is no way to proceed, and the whole (slot_name =
NONE) workaround becomes ineffectual. Note - the current patch code is
only logging when the user has already disassociated the slot name; of
course normally (when the slot name was not disassociated) table slots
will give ERROR for broken connections.

IMO, if the user has disassociated the slot name then they have
already made their decision that they REALLY DO want to “proceed in
this situation”. So I thought we should let them proceed.

What do you think?

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 6, 2021 at 4:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, Jan 5, 2021 at 10:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > 1.
> > > > + /* Drop the tablesync slot. */
> > > > + {
> > > > + char *syncslotname = ReplicationSlotNameForTablesync(subid, relid);
> > > > +
> > > > + /*
> > > > + * If the subscription slotname is NONE/NULL and the connection to publisher is
> > > > + * broken, but the DropSubscription should still be allowed to complete.
> > > > + * But without a connection it is not possible to drop any tablesync slots.
> > > > + */
> > > > + if (!wrconn)
> > > > + {
> > > > + /* FIXME - OK to just log a warning? */
> > > > + elog(WARNING, "!!>> DropSubscription: no connection. Cannot drop
> > > > tablesync slot \"%s\".",
> > > > +   syncslotname);
> > > > + }
> > > >
> > > > Why is this not an ERROR? We don't want to keep the table slots
> > > > lingering after DropSubscription. If there is any tablesync slot that
> > > > needs to be dropped and the publisher is not available then we should
> > > > raise an error.
> > >
> > > Previously there was only the subscription slot. If the connection was
> > > broken and caused an error then it was still possible for the user to
> > > disassociate the subscription from the slot using ALTER SUBSCRIPTION
> > > ... SET (slot_name = NONE).  And then (when the slotname is NULL)  the
> > > DropSubscription could complete OK. I expect in that case the Admin
> > > still had some slot clean-up they would need to do on the Publisher
> > > machine.
> > >
> >
> > I think such an option could probably be used for user-created slots
> > but it would be difficult for even Admin to know about these
> > internally created slots associated with the particular subscription.
> > I would say it is better to ERROR out.
>
> I am having doubts that ERROR is the best choice here. There is a long
> note in https://www.postgresql.org/docs/devel/sql-dropsubscription.html
> which describes this problem for the subscription slot and how to
> disassociate the name to give a workaround “To proceed in this
> situation”.
>
> OTOH if we make the tablesync slot unconditionally ERROR for a broken
> connection then there is no way to proceed, and the whole (slot_name =
> NONE) workaround becomes ineffectual. Note - the current patch code is
> only logging when the user has already disassociated the slot name; of
> course normally (when the slot name was not disassociated) table slots
> will give ERROR for broken connections.
>
> IMO, if the user has disassociated the slot name then they have
> already made their decision that they REALLY DO want to “proceed in
> this situation”. So I thought we should let them proceed.
>

Okay, if we want to go that way then we should add some documentation
about it. Currently, the slot name used by apply worker is known to
the user because either it is specified by the user or the default is
subscription name, so the user can manually remove that slot later but
that is not true for tablesync slots. I think we need to update both
the Drop Subscription page [1] and logical-replication-subscription
page [2] where we have mentioned temporary slots and in the end "Here
are some scenarios: .." to mention these slots and probably how
their names are generated so that in such special situations users can
drop them manually.

[1] - https://www.postgresql.org/docs/devel/sql-dropsubscription.html
[2] - https://www.postgresql.org/docs/devel/logical-replication-subscription.html

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Jan 5, 2021 at 3:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 5. Is it possible to write a testcase where we fail (say due to pk
> > violation or some other error) after the initial copy is done, then
> > remove the conflicting row and allow a copy to be completed? If we
> > already have any such test then it is fine.
> >
>
> Causing a PK violation during the initial copy is not a problem to
> test, but doing it after the initial copy is difficult. I have done
> exactly this test scenario before but I thought it cannot be
> automated. E.g. to cause a PK violation error somewhere between
> COPYDONE and SYNCDONE means that the offending insert (the one which
> tablesync will fail to replicate) has to be sent while the tablesync
> is in CATCHUP mode. But AFAIK that can only be achieved using the
> debugger to get the timing right.
>

Yeah, I am also not able to think of any way to automate such a test.
I was thinking about what could go wrong if we error out at that
stage. The only thing that could be problematic is if we somehow leave
the slot and replication origin used during the copy dangling. I think if
the tablesync is restarted after an error then we will clean those up,
which will normally be the case, but what if the tablesync worker is not
started again? I think the only possibility of the tablesync worker not
being started again is if, during Alter Subscription ... Refresh Publication,
we remove the corresponding subscription rel (see
AlterSubscription_refresh; I guess it could happen if one has dropped
the relation from the publication). I haven't tested this with your patch,
but if such a possibility exists then we need to think about cleaning up
the slot and origin when we remove the subscription rel. What do you think?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Jan 6, 2021 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 5, 2021 at 3:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 5. Is it possible to write a testcase where we fail (say due to pk
> > > violation or some other error) after the initial copy is done, then
> > > remove the conflicting row and allow a copy to be completed? If we
> > > already have any such test then it is fine.
> > >
> >
> > Causing a PK violation during the initial copy is not a problem to
> > test, but doing it after the initial copy is difficult. I have done
> > exactly this test scenario before but I thought it cannot be
> > automated. E.g. to cause a PK violation error somewhere between
> > COPYDONE and SYNCDONE means that the offending insert (the one which
> > tablesync will fail to replicate) has to be sent while the tablesync
> > is in CATCHUP mode. But AFAIK that can only be achieved using the
> > debugger to get the timing right.
> >
>
> Yeah, I am also not able to think of any way to automate such a test.
> I was thinking about what could go wrong if we error out at that
> stage. The only thing that could be problematic is if we somehow leave
> the slot and replication origin used during the copy dangling. I think if
> the tablesync is restarted after an error then we will clean those up,
> which will normally be the case, but what if the tablesync worker is not
> started again? I think the only possibility of the tablesync worker not
> being started again is if, during Alter Subscription ... Refresh Publication,
> we remove the corresponding subscription rel (see
> AlterSubscription_refresh; I guess it could happen if one has dropped
> the relation from the publication). I haven't tested this with your patch,
> but if such a possibility exists then we need to think about cleaning up
> the slot and origin when we remove the subscription rel. What do you think?
>

I think it makes sense. If there can be a race between the tablesync
re-launching (after error), and the AlterSubscription_refresh removing
some table’s relid from the subscription then there could be lurking
slot/origin tablesync resources (of the removed table) which a
subsequent DROP SUBSCRIPTION cannot discover. I will think more about
how/if it is possible to make this happen. Anyway, I suppose I ought
to refactor/isolate some of the tablesync cleanup code in case it
needs to be commonly called from DropSubscription and/or from
AlterSubscription_refresh.
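
As a starting point, the common helper might look something like this
sketch (the helper name is hypothetical; the slot calls are per the patch,
the origin calls per current sources, and error handling is elided):

    /* Drop the tablesync slot and replication origin for one table. */
    static void
    tablesync_cleanup_for_rel(WalReceiverConn *wrconn, Oid subid, Oid relid)
    {
        char         originname[NAMEDATALEN];
        char        *syncslotname = ReplicationSlotNameForTablesync(subid,
                                                                    relid);
        RepOriginId  originid;

        /* Remove the slot on the publisher, if it still exists. */
        ReplicationSlotDropAtPubNode(wrconn, syncslotname);

        /* Remove the matching origin tracking, if any. */
        snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
        originid = replorigin_by_name(originname, true);
        if (OidIsValid(originid))
            replorigin_drop(originid, false);

        pfree(syncslotname);
    }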

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 6, 2021 at 2:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> I think it makes sense. If there can be a race between the tablesync
> re-launching (after error), and the AlterSubscription_refresh removing
> some table’s relid from the subscription then there could be lurking
> slot/origin tablesync resources (of the removed table) which a
> subsequent DROP SUBSCRIPTION cannot discover. I will think more about
> how/if it is possible to make this happen. Anyway, I suppose I ought
> to refactor/isolate some of the tablesync cleanup code in case it
> needs to be commonly called from DropSubscription and/or from
> AlterSubscription_refresh.
>

Fair enough. BTW, I have analyzed whether we need any modifications to
pg_dump/restore for this patch as this changes the state of one of the
fields in the system table and concluded that we don't need any
change. For subscriptions, we don't dump any of the information from
pg_subscription_rel, rather we just dump subscriptions with the
connect option as false which means users need to enable the
subscription and refresh publication after restore. I have checked
this in the code and tested it as well. The related information is
present in pg_dump doc page [1], see from "When dumping logical
replication subscriptions ....".

[1] - https://www.postgresql.org/docs/devel/app-pgdump.html

--
With Regards,
Amit Kapila.



RE: Single transaction in the tablesync worker?

From
"Hou, Zhijie"
Date:
> PSA the v11 patch for the Tablesync Solution1.
> 
> Difference from v10:
> - Addresses several recent review comments.
> - pg_indent has been run
> 
Hi

I took a look into the patch and have some comments.

1.
  *      So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *      CATCHUP -> SYNCDONE -> READY.
+ *      CATCHUP -> (sync worker TCOPYDONE) -> SYNCDONE -> READY.

I noticed the new state TCOPYDONE is commented as coming between CATCHUP and SYNCDONE,
but it seems SUBREL_STATE_TCOPYDONE is actually set before SUBREL_STATE_SYNCWAIT[1].
Did I miss something here?

[1]-----------------
+    UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
+                               MyLogicalRepWorker->relid,
+                               SUBREL_STATE_TCOPYDONE,
+                               MyLogicalRepWorker->relstate_lsn);
...
    /*
     * We are done with the initial data synchronization, update the state.
     */
    SpinLockAcquire(&MyLogicalRepWorker->relmutex);
    MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
------------------


2.
        <literal>i</literal> = initialize,
        <literal>d</literal> = data is being copied,
+       <literal>C</literal> = table data has been copied,
        <literal>s</literal> = synchronized,
        <literal>r</literal> = ready (normal replication)

+#define SUBREL_STATE_TCOPYDONE    't' /* tablesync copy phase is completed
+                                     * (sublsn NULL) */
The character representing 'data has been copied' in the catalog seems different from the macro definition.


Best regards,
houzj



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Thankyou for the feedback.

On Thu, Jan 7, 2021 at 12:45 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > PSA the v11 patch for the Tablesync Solution1.
> >
> > Difference from v10:
> > - Addresses several recent review comments.
> > - pg_indent has been run
> >
> Hi
>
> I took a look into the patch and have some comments.
>
> 1.
>   *       So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
> - *       CATCHUP -> SYNCDONE -> READY.
> + *       CATCHUP -> (sync worker TCOPYDONE) -> SYNCDONE -> READY.
>
> I noticed the new state TCOPYDONE is commented as coming between CATCHUP and SYNCDONE,
> but it seems SUBREL_STATE_TCOPYDONE is actually set before SUBREL_STATE_SYNCWAIT[1].
> Did I miss something here?
>
> [1]-----------------
> +       UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> +                                                          MyLogicalRepWorker->relid,
> +                                                          SUBREL_STATE_TCOPYDONE,
> +                                                          MyLogicalRepWorker->relstate_lsn);
> ...
>         /*
>          * We are done with the initial data synchronization, update the state.
>          */
>         SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>         MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
> ------------------
>

Thanks for reporting this mistake. I will correct the comment for the
next patch (v12)

>
> 2.
>         <literal>i</literal> = initialize,
>         <literal>d</literal> = data is being copied,
> +       <literal>C</literal> = table data has been copied,
>         <literal>s</literal> = synchronized,
>         <literal>r</literal> = ready (normal replication)
>
> +#define SUBREL_STATE_TCOPYDONE 't' /* tablesync copy phase is completed
> +                                                                        * (sublsn NULL) */
> The character representing 'data has been copied' in the catalog seems different from the macro definition.
>

Yes, same was already previously reported [1]
[1] https://www.postgresql.org/message-id/CAA4eK1Kyi037XZzyrLE71MS2KoMmNSqa6RrQLdSCeeL27gnL%2BA%40mail.gmail.com
It will be fixed in the next patch (v12)

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 6, 2021 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 2:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > I think it makes sense. If there can be a race between the tablesync
> > re-launching (after error), and the AlterSubscription_refresh removing
> > some table’s relid from the subscription then there could be lurking
> > slot/origin tablesync resources (of the removed table) which a
> > subsequent DROP SUBSCRIPTION cannot discover. I will think more about
> > how/if it is possible to make this happen. Anyway, I suppose I ought
> > to refactor/isolate some of the tablesync cleanup code in case it
> > needs to be commonly called from DropSubscription and/or from
> > AlterSubscription_refresh.
> >
>
> Fair enough.
>

I think before implementing, we should once try to reproduce this
case. I understand this is a timing issue and can be reproduced only
with the help of a debugger, but we should do that.

> BTW, I have analyzed whether we need any modifications to
> pg_dump/restore for this patch as this changes the state of one of the
> fields in the system table and concluded that we don't need any
> change. For subscriptions, we don't dump any of the information from
> pg_subscription_rel, rather we just dump subscriptions with the
> connect option as false which means users need to enable the
> subscription and refresh publication after restore. I have checked
> this in the code and tested it as well. The related information is
> present in pg_dump doc page [1], see from "When dumping logical
> replication subscriptions ....".
>

I have further analyzed that we don't need to do anything w.r.t.
pg_upgrade either, because it uses pg_dump/pg_dumpall to dump the
schema info of the old cluster and then restores it into the new cluster.
And, we know that pg_dump ignores the info in pg_subscription_rel, so
we don't need to change anything, as our changes are specific to the
state of one of the columns in pg_subscription_rel. I have not tested
this, but we should test it by having some relations in the not-ready state
and then allowing the old cluster (<=PG13) to be upgraded to the new one
(PG14), both with and without this patch, to see if there is any change in
behavior.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v12 patch for the Tablesync Solution1.

Differences from v11:
  + Added PG docs to mention the tablesync slot
  + Refactored tablesync slot drop (done by
DropSubscription/process_syncing_tables_for_sync)
  + Fixed PG docs mentioning wrong state code
  + Fixed wrong code comment describing TCOPYDONE state

====

Features:

* The tablesync slot is now permanent instead of temporary. The
tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync slot cleanup (drop) code is added for DropSubscription
and for process_syncing_tables_for_sync functions

* The tablesync worker now allows multiple transactions instead of a single transaction

* A new state (SUBREL_STATE_TCOPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_TCOPYDONE then
it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* The tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* The DropSubscription cleanup code was enhanced (v7+) to take care of
any crashed tablesync workers.

* Updates to PG docs

TODO / Known Issues:

* Address review comments

* Patch applies with whitespace warning

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 4, 2021 at 8:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 30, 2020 at 11:51 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Wed, Dec 23, 2020 at 8:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 1.
> > > + * Rarely, the DropSubscription may be issued when a tablesync still
> > > + * is in SYNCDONE but not yet in READY state. If this happens then
> > > + * the drop slot could fail because it is already dropped.
> > > + * In this case suppress the drop slot error.
> > > + *
> > > + * FIXME - Is there a better way than this?
> > > + */
> > > + if (rstate->state != SUBREL_STATE_SYNCDONE)
> > > + PG_RE_THROW();
> > >
> > > So, does this situation happen when we try to drop the subscription after
> > > the state has changed to syncdone but not yet to ready? If so, then can't
> > > we write a function GetSubscriptionNotDoneRelations, similar to
> > > GetSubscriptionNotReadyRelations, where we get a list of relations that
> > > are not yet in the done stage? I think this should be safe because once we
> > > are here we shouldn't be allowed to start a new worker, and old workers are
> > > already stopped by this function.
> >
> > Yes, but I don't see how adding such a function is an improvement over
> > the existing code:
> >
>
> The advantage is that we don't need to use try..catch to deal with
> such conditions which I don't think is a good way to deal with such
> cases. Also, even after using try...catch, still, we can leak the
> slots because the patch drops the slot after changing the state to
> syncdone and if there is any error while dropping the slot, it simply
> skips it. So, it is possible that the rel state is syncdone but the
> slot still exists and we get an error due to some different reason,
> and then we will silently skip it again and allow the subscription to
> be dropped.
>
> I think instead what we should do is to drop the slot before we change
> the rel state to syncdone. Also, if the apply workers fail to drop the
> slot, it should try to again drop it after restart. In
> DropSubscription, we can then check if the rel state is not SYNC or
> READY, we can drop the corresponding slots.
>

Fixed as suggested in latest patch [v12]

----

[v12] = https://www.postgresql.org/message-id/CAHut%2BPsonJzarxSBWkOM%3DMjoEpaq53ShBJoTT9LHJskwP3OvZA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Tue, Jan 5, 2021 at 10:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > > 3.
> > > +#define SUBREL_STATE_COPYDONE 'C' /* tablesync copy phase is completed */
> > >
> > > You can mention in the comments that sublsn will be NULL for this
> > > state as it is mentioned for other similar states. Can we think of
> > > using any letter in lower case for this as all other states are in
> > > lower-case except for this, which makes it look a bit odd? We can use
> > > 'f' or 'e' and describe it as 'copy finished' or 'copy end'. I am fine
> > > if you have any better ideas.
> > >
> >
> > Fixed in latest patch [v11]
> >
>
> It is still not reflected in the docs. See below:
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -7651,6 +7651,7 @@ SCRAM-SHA-256$<replaceable><iteration
> count></replaceable>:<replaceable>&l
>         State code:
>         <literal>i</literal> = initialize,
>         <literal>d</literal> = data is being copied,
> +       <literal>C</literal> = table data has been copied,
>         <literal>s</literal> = synchronized,
>

Fixed in latest patch [v12]

----
[v12] = https://www.postgresql.org/message-id/CAHut%2BPsonJzarxSBWkOM%3DMjoEpaq53ShBJoTT9LHJskwP3OvZA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Jan 6, 2021 at 2:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

> Okay, if we want to go that way then we should add some documentation
> about it. Currently, the slot name used by apply worker is known to
> the user because either it is specified by the user or the default is
> subscription name, so the user can manually remove that slot later but
> that is not true for tablesync slots. I think we need to update both
> the Drop Subscription page [1] and logical-replication-subscription
> page [2] where we have mentioned temporary slots and in the end "Here
> are some scenarios: .." to mention these slots and probably how
> their names are generated so that in such special situations users can
> drop them manually.
>
> [1] - https://www.postgresql.org/docs/devel/sql-dropsubscription.html
> [2] - https://www.postgresql.org/docs/devel/logical-replication-subscription.html
>

PG docs updated in latest patch [v12]

----
[v12] = https://www.postgresql.org/message-id/CAHut%2BPsonJzarxSBWkOM%3DMjoEpaq53ShBJoTT9LHJskwP3OvZA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Jan 7, 2021 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 6, 2021 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 6, 2021 at 2:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > I think it makes sense. If there can be a race between the tablesync
> > > re-launching (after error), and the AlterSubscription_refresh removing
> > > some table’s relid from the subscription then there could be lurking
> > > slot/origin tablesync resources (of the removed table) which a
> > > subsequent DROP SUBSCRIPTION cannot discover. I will think more about
> > > how/if it is possible to make this happen. Anyway, I suppose I ought
> > > to refactor/isolate some of the tablesync cleanup code in case it
> > > needs to be commonly called from DropSubscription and/or from
> > > AlterSubscription_refresh.
> > >
> >
> > Fair enough.
> >
>
> I think before implementing, we should once try to reproduce this
> case. I understand this is a timing issue and can be reproduced only
> with the help of a debugger, but we should do that.

FYI, I was able to reproduce this case in debugger. PSA logs showing details.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

RE: Single transaction in the tablesync worker?

From
"Hou, Zhijie"
Date:
> PSA the v12 patch for the Tablesync Solution1.
> 
> Differences from v11:
>   + Added PG docs to mention the tablesync slot
>   + Refactored tablesync slot drop (done by
> DropSubscription/process_syncing_tables_for_sync)
>   + Fixed PG docs mentioning wrong state code
>   + Fixed wrong code comment describing TCOPYDONE state
> 
Hi

I looked into the new patch and have some comments.

1.
+    /* Setup replication origin tracking. */
①+    originid = replorigin_by_name(originname, true);
+    if (!OidIsValid(originid))
+    {

②+            originid = replorigin_by_name(originname, true);
+            if (originid != InvalidRepOriginId)
+            {

There are two different style code which check whether originid is valid.
Both are fine, but do you think it’s better to use the same style here?


2.
  *         state to SYNCDONE.  There might be zero changes applied between
  *         CATCHUP and SYNCDONE, because the sync worker might be ahead of the
  *         apply worker.
+ *       - The sync worker has a intermediary state TCOPYDONE which comes after
+ *        DATASYNC and before SYNCWAIT. This state indicates that the initial

This comment about TCOPYDONE would be better placed at [1]*, which is between DATASYNC and SYNCWAIT.

 *       - Tablesync worker starts; changes table state from INIT to DATASYNC while
 *         copying.
 [1]*
 *       - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
 *         waits for state change.

3.
+    /*
+     * To build a slot name for the sync work, we are limited to NAMEDATALEN -
+     * 1 characters.
+     *
+     * The name is calculated as pg_%u_sync_%u (3 + 10 + 6 + 10 + '\0'). (It's
+     * actually the NAMEDATALEN on the remote that matters, but this scheme
+     * will also work reasonably if that is different.)
+     */
+    StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");    /* for sanity */
+
+    syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);

The comment says syncslotname is limited to NAMEDATALEN - 1 characters.
But the actual size of it is (3 + 10 + 6 + 10 + '\0') = 30, which is not NAMEDATALEN - 1.
Should we change the comment here?

Best regards,
houzj





Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Jan 7, 2021 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 6, 2021 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 6, 2021 at 2:13 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > I think it makes sense. If there can be a race between the tablesync
> > > > re-launching (after error), and the AlterSubscription_refresh removing
> > > > some table’s relid from the subscription then there could be lurking
> > > > slot/origin tablesync resources (of the removed table) which a
> > > > subsequent DROP SUBSCRIPTION cannot discover. I will think more about
> > > > how/if it is possible to make this happen. Anyway, I suppose I ought
> > > > to refactor/isolate some of the tablesync cleanup code in case it
> > > > needs to be commonly called from DropSubscription and/or from
> > > > AlterSubscription_refresh.
> > > >
> > >
> > > Fair enough.
> > >
> >
> > I think before implementing, we should once try to reproduce this
> > case. I understand this is a timing issue and can be reproduced only
> > with the help of a debugger, but we should do that.
>
> FYI, I was able to reproduce this case in debugger. PSA logs showing details.
>

Thanks for reproducing as I was worried about exactly this case. I
have one question related to logs:

##
## ALTER SUBSCRIPTION to REFRESH the publication

## This blocks on some latch until the tablesync worker dies, then it continues
##

Did you check which exact latch or lock blocks this? It is important
to retain this interlock because otherwise, even if we decide to drop the
slot (and/or origin), the tablesync worker might continue.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v13 patch for the Tablesync Solution1.

Differences from v12:
+ Fixed whitespace errors of v12-0001
+ Modified TCOPYDONE state comment (houzj feedback)
+ WIP fix for AlterSubscription_refresh (Amit feedback)

====

Features:

* The tablesync slot is now permanent instead of temporary. The
tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync slot cleanup (drop) code is added for DropSubscription
and for process_syncing_tables_for_sync functions

* The tablesync worker is now allowing multiple tx instead of single tx

* A new state (SUBREL_STATE_TCOPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_TCOPYDONE then
it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* The tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* The DropSubscription cleanup code was enhanced (v7+) to take care of
any crashed tablesync workers.

* Updates to PG docs

TODO / Known Issues:

* Address review comments

* ALTER PUBLICATION DROP TABLE can mean knowledge of tablesyncs gets
lost, causing resource cleanup to be missed. There is a WIP fix for
this in AlterSubscription_refresh; however, it is not entirely
correct, as there are known race conditions. See FIXME comments.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

On Thu, Jan 7, 2021 at 6:52 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Amit.
>
> PSA the v12 patch for the Tablesync Solution1.
>
> Differences from v11:
>   + Added PG docs to mention the tablesync slot
>   + Refactored tablesync slot drop (done by
> DropSubscription/process_syncing_tables_for_sync)
>   + Fixed PG docs mentioning wrong state code
>   + Fixed wrong code comment describing TCOPYDONE state
>
> ====
>
> Features:
>
> * The tablesync slot is now permanent instead of temporary. The
> tablesync slot name is no longer tied to the Subscription slot name.
>
> * The tablesync slot cleanup (drop) code is added for DropSubscription
> and for process_syncing_tables_for_sync functions
>
> * The tablesync worker is now allowing multiple tx instead of single tx
>
> * A new state (SUBREL_STATE_TCOPYDONE) is persisted after a successful
> copy_table in LogicalRepSyncTableStart.
>
> * If a re-launched tablesync finds state SUBREL_STATE_TCOPYDONE then
> it will bypass the initial copy_table phase.
>
> * Now tablesync sets up replication origin tracking in
> LogicalRepSyncTableStart (similar to what is done for the apply worker). The
> origin is advanced when first created.
>
> * The tablesync replication origin tracking is cleaned up during
> DropSubscription and/or process_syncing_tables_for_apply.
>
> * The DropSubscription cleanup code was enhanced (v7+) to take care of
> any crashed tablesync workers.
>
> * Updates to PG docs
>
> TODO / Known Issues:
>
> * Address review comments
>
> * Patch applies with whitespace warning
>
> ---
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Fri, Jan 8, 2021 at 1:02 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > PSA the v12 patch for the Tablesync Solution1.
> >
> > Differences from v11:
> >   + Added PG docs to mention the tablesync slot
> >   + Refactored tablesync slot drop (done by
> > DropSubscription/process_syncing_tables_for_sync)
> >   + Fixed PG docs mentioning wrong state code
> >   + Fixed wrong code comment describing TCOPYDONE state
> >
> Hi
>
> I looked into the new patch and have some comments.
>
> 1.
> +       /* Setup replication origin tracking. */
> ①+      originid = replorigin_by_name(originname, true);
> +       if (!OidIsValid(originid))
> +       {
>
> ②+                      originid = replorigin_by_name(originname, true);
> +                       if (originid != InvalidRepOriginId)
> +                       {
>
> There are two different style code which check whether originid is valid.
> Both are fine, but do you think it’s better to use the same style here?

Yes. I think the 1st style is better, so I used the OidIsValid style for
all the new code of the patch.
But the check in DropSubscription is an exception; there I used the 2nd
style, but ONLY to be consistent with another originid check which
already existed in that same function.
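
(For illustration only, a minimal sketch of the 1st style; the
replorigin_create fallback here is an assumption about the surrounding
code:)

    originid = replorigin_by_name(originname, true /* missing_ok */ );
    if (!OidIsValid(originid))
        originid = replorigin_create(originname);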

>
>
> 2.
>   *              state to SYNCDONE.  There might be zero changes applied between
>   *              CATCHUP and SYNCDONE, because the sync worker might be ahead of the
>   *              apply worker.
> + *        - The sync worker has a intermediary state TCOPYDONE which comes after
> + *             DATASYNC and before SYNCWAIT. This state indicates that the initial
>
> This comment about TCOPYDONE would be better placed at [1]*, which is between DATASYNC and SYNCWAIT.
>
>  *         - Tablesync worker starts; changes table state from INIT to DATASYNC while
>  *               copying.
>  [1]*
>  *         - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
>  *               waits for state change.
>

Agreed. I have moved the comment per your suggestion (and I also
re-worded it again).
Fixed in latest patch [v13]
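
(The moved comment now reads roughly like the following sketch; the
exact wording is in v13:)

 *       - Tablesync worker starts; changes table state from INIT to DATASYNC
 *         while copying.
 *       - Tablesync worker finishes the copy and sets table state to TCOPYDONE
 *         (an intermediary state; the initial copy is done but sublsn is
 *         still NULL).
 *       - Tablesync worker then sets table state to SYNCWAIT; waits for
 *         state change.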

> 3.
> +       /*
> +        * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> +        * 1 characters.
> +        *
> +        * The name is calculated as pg_%u_sync_%u (3 + 10 + 6 + 10 + '\0'). (It's
> +        * actually the NAMEDATALEN on the remote that matters, but this scheme
> +        * will also work reasonably if that is different.)
> +        */
> +       StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");   /* for sanity */
> +
> +       syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
>
> The comment says syncslotname is limited to NAMEDATALEN - 1 characters.
> But the actual size of it is (3 + 10 + 6 + 10 + '\0') = 30, which is not NAMEDATALEN - 1.
> Should we change the comment here?
>

The comment wording is a remnant from older code which had a
differently formatted slot name.
I think the comment is still valid, albeit maybe unnecessary, since in
the current code the tablesync slot name length is fixed. But I left
the older comment here as a safety reminder in case some future change
would want to modify the slot name. What do you think?

----
[v13] = https://www.postgresql.org/message-id/CAHut%2BPvby4zg6kM1RoGd_j-xs9OtPqZPPVhbiC53gCCRWdNSrw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Jan 8, 2021 at 2:55 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 1:02 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
>
> > 3.
> > +       /*
> > +        * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> > +        * 1 characters.
> > +        *
> > +        * The name is calculated as pg_%u_sync_%u (3 + 10 + 6 + 10 + '\0'). (It's
> > +        * actually the NAMEDATALEN on the remote that matters, but this scheme
> > +        * will also work reasonably if that is different.)
> > +        */
> > +       StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");   /* for sanity */
> > +
> > +       syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> >
> > The comment says syncslotname is limited to NAMEDATALEN - 1 characters.
> > But the actual size of it is (3 + 10 + 6 + 10 + '\0') = 30, which is not NAMEDATALEN - 1.
> > Should we change the comment here?
> >
>
> The comment wording is a remnant from older code which had a
> differently formatted slot name.
> I think the comment is still valid, albeit maybe unnecessary, since in
> the current code the tablesync slot name length is fixed. But I left
> the older comment here as a safety reminder in case some future change
> would want to modify the slot name. What do you think?
>

I find it quite confusing. The comments should reflect the latest
code. You can probably say in some form that the length of slotname
shouldn't exceed NAMEDATALEN because of remote node constraints on
slot name length. Also, probably the StaticAssert on NAMEDATALEN is
not required.
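
(E.g. something along these lines; a sketch of the suggested rewording
only:)

    /*
     * The name of the slot must not exceed NAMEDATALEN - 1 because of the
     * remote node's constraints on slot name length. The format
     * pg_%u_sync_%u yields at most 30 bytes (3 + 10 + 6 + 10 + '\0'), so
     * this is always safe here.
     */
    syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);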

1.
+   <para>
+    Additional table synchronization slots are normally transient, created
+    internally and dropped automatically when they are no longer needed.
+    These table synchronization slots have generated names:
+   <quote><literal>pg_%u_sync_%u</literal></quote> (parameters:
Subscription <parameter>oid</parameter>, Table
<parameter>relid</parameter>)
+   </para>

The last line seems too long. I think we are not strict about the 80-char
limit in docs, but it is good to be close to that; however, this
appears quite long.

2.
+ /*
+ * Cleanup any remaining tablesync resources.
+ */
+ {
+ char originname[NAMEDATALEN];
+ RepOriginId originid;
+ char state;
+ XLogRecPtr statelsn;

I have already mentioned previously that we should not use this new style
of code (using { to localize the scope of variables). I don't
know about others, but I find it difficult to read such code. You
might want to consider moving this whole block to a separate function.

3.
/*
+ * XXX - Should optimize this to avoid multiple
+ * connect/disconnect.
+ */
+ wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);

I think it is better to avoid multiple connect/disconnect here. In
this same function, we have connected to the publisher, we should be
able to use the same connection.

4.
process_syncing_tables_for_sync()
{
..
+ /*
+ * Cleanup the tablesync slot.
+ */
+ syncslotname = ReplicationSlotNameForTablesync(
+    MySubscription->oid,
+    MyLogicalRepWorker->relid);
+ PG_TRY();
+ {
+ elog(DEBUG1, "process_syncing_tables_for_sync: dropping the
tablesync slot \"%s\".", syncslotname);
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname);
+ }
+ PG_FINALLY();
+ {
+ pfree(syncslotname);
+ }
+ PG_END_TRY();
..
}

Both here and in DropSubscription(), it seems we are using
PG_TRY..PG_FINALLY just to free the memory even though
ReplicationSlotDropAtPubNode already has try..finally. Can we arrange
the code to move the allocation of syncslotname inside
ReplicationSlotDropAtPubNode to avoid the additional try..finally? BTW, if
the usage of try..finally here is only to free the memory, I am not
sure if it is required, because I think we will anyway reset the memory
context where this memory is allocated as part of error handling.
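
(One possible arrangement, sketched with illustrative signatures, is to
have the drop routine build and free the name itself:)

    /* Sketch: callers pass the identifiers, not a palloc'd name */
    void
    ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn, Oid suboid, Oid relid)
    {
        char *syncslotname = ReplicationSlotNameForTablesync(suboid, relid);

        /* ... issue DROP_REPLICATION_SLOT for syncslotname on wrconn ... */

        pfree(syncslotname);
    }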

5.
 #define SUBREL_STATE_DATASYNC 'd' /* data is being synchronized (sublsn
  * NULL) */
+#define SUBREL_STATE_TCOPYDONE 't' /* tablesync copy phase is completed
+ * (sublsn NULL) */
 #define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of
  * apply (sublsn set) */

I am not very happy with the new state name SUBREL_STATE_TCOPYDONE, as
it is quite different from the other adjoining state names and somehow
does not go well with the code. How about SUBREL_STATE_ENDCOPY 'e' or
SUBREL_STATE_FINISHEDCOPY 'f'?
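
(I.e. something like:)

    #define SUBREL_STATE_FINISHEDCOPY 'f' /* tablesync copy phase is completed
                                           * (sublsn NULL) */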

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> >
>
> Thanks for reproducing as I was worried about exactly this case. I
> have one question related to logs:
>
> ##
> ## ALTER SUBSCRIPTION to REFRESH the publication
>
> ## This blocks on some latch until the tablesync worker dies, then it continues
> ##
>
> Did you check which exact latch or lock blocks this?
>

I have checked this myself, and the command is waiting on the drop of the
origin till the tablesync worker is finished, because replorigin_drop()
requires state->acquired_by to be 0, which will only be true once the
tablesync worker exits. I think this is the reason you might have
noticed that the command can't be finished until the tablesync worker
died. So this can't serve as an interlock between the ALTER SUBSCRIPTION ..
REFRESH command and the tablesync worker, and to that end it seems you have
the below FIXMEs in the patch:

+ * FIXME - Usually this cleanup would be OK, but will not
+ * always be OK because the logicalrep_worker_stop_at_commit
+ * only "flags" the worker to be stopped in the near future
+ * but meanwhile it may still be running. In this case there
+ * could be a race between the tablesync worker and this code
+ * to see who will succeed with the tablesync drop (and the
+ * loser will ERROR).
+ *
+ * FIXME - Also, checking the state is also not guaranteed
+ * correct because state might be TCOPYDONE when we checked
+ * but has since progressed to SYNDONE
+ */
+
+ if (state == SUBREL_STATE_TCOPYDONE)
+ {

I feel this was okay for the earlier code, but now we need to stop the
tablesync workers before trying to drop the slot, as we do in
DropSubscription. Now, if we do that then that would fix the race
conditions mentioned in the FIXMEs, but still, there are a few more things
I am worried about: (a) What if the launcher again starts the tablesync
worker? One idea could be to acquire AccessExclusiveLock on
SubscriptionRelationId as we do in DropSubscription, which is not a
very good idea, but I can't think of any other good way. (b) The patch
is just checking SUBREL_STATE_TCOPYDONE before dropping the
replication slot, but the slot could be created even before that (in
the SUBREL_STATE_DATASYNC state). One idea could be that we try to drop the
slot and, if we are not able to, simply continue
assuming it didn't exist.
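
(A sketch of idea (b), assuming the drop is issued via walrcv_exec; any
failure is reported as a WARNING and we carry on:)

    StringInfoData cmd;
    WalRcvExecResult *res;

    initStringInfo(&cmd);
    appendStringInfo(&cmd, "DROP_REPLICATION_SLOT %s WAIT",
                     quote_identifier(syncslotname));
    res = walrcv_exec(wrconn, cmd.data, 0, NULL);
    if (res->status != WALRCV_OK_COMMAND)
        ereport(WARNING,
                (errmsg("could not drop the replication slot \"%s\" on publisher",
                        syncslotname)));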

One minor comment:
1.
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);
  MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
  MyLogicalRepWorker->relstate_lsn = current_lsn;
-

Spurious line removal.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Thu, Jan 7, 2021 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > BTW, I have analyzed whether we need any modifications to
> > pg_dump/restore for this patch as this changes the state of one of the
> > fields in the system table and concluded that we don't need any
> > change. For subscriptions, we don't dump any of the information from
> > pg_subscription_rel, rather we just dump subscriptions with the
> > connect option as false which means users need to enable the
> > subscription and refresh publication after restore. I have checked
> > this in the code and tested it as well. The related information is
> > present in pg_dump doc page [1], see from "When dumping logical
> > replication subscriptions ....".
> >
>
> I have further analyzed that we don't need to do anything w.r.t
> pg_upgrade as well because it uses pg_dump/pg_dumpall to dump the
> schema info of the old cluster and then restore it to the new cluster.
> And, we know that pg_dump ignores the info in pg_subscription_rel, so
> we don't need to change anything as our changes are specific to the
> state of one of the columns in pg_subscription_rel. I have not tested
> this but we should test it by having some relations in not_ready state
> and then allow the old cluster (<=PG13) to be upgraded to new (pg14)
> both with and without this patch and see if there is any change in
> behavior.

I have tested this scenario: I stopped a server running PG_13 while the
subscription table sync was in progress.
One of the tables in pg_subscription_rel was still in the 'd' state (DATASYNC):

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate |  srsublsn
---------+---------+------------+------------
   16424 |   16384 | d          |
   16424 |   16390 | r          | 0/247A63D8
   16424 |   16395 | r          | 0/247A6410
   16424 |   16387 | r          | 0/247A6448
(4 rows)

Then I initiated the pg_upgrade to PG_14, both with and without the patch.
I see that the subscription exists but is not enabled:

postgres=# select * from pg_subscription;
  oid  | subdbid | subname | subowner | subenabled | subbinary |
substream |               subconninfo                | subslotname |
subsynccommit | subpublications

-------+---------+---------+----------+------------+-----------+-----------+------------------------------------------+-------------+---------------+-----------------
 16407 |   16401 | tap_sub |       10 | f          | f         | f
    | host=localhost port=6972 dbname=postgres | tap_sub     | off
      | {tap_pub}
(1 row)

and looking at the pg_subscription_rel:

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+----------
(0 rows)

As can be seen, none of the data in pg_subscription_rel has been
copied over. The same behaviour is seen both with and without the
patch.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Jan 11, 2021 at 3:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Jan 7, 2021 at 3:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > > BTW, I have analyzed whether we need any modifications to
> > > pg_dump/restore for this patch as this changes the state of one of the
> > > fields in the system table and concluded that we don't need any
> > > change. For subscriptions, we don't dump any of the information from
> > > pg_subscription_rel, rather we just dump subscriptions with the
> > > connect option as false which means users need to enable the
> > > subscription and refresh publication after restore. I have checked
> > > this in the code and tested it as well. The related information is
> > > present in pg_dump doc page [1], see from "When dumping logical
> > > replication subscriptions ....".
> > >
> >
> > I have further analyzed that we don't need to do anything w.r.t
> > pg_upgrade as well because it uses pg_dump/pg_dumpall to dump the
> > schema info of the old cluster and then restore it to the new cluster.
> > And, we know that pg_dump ignores the info in pg_subscription_rel, so
> > we don't need to change anything as our changes are specific to the
> > state of one of the columns in pg_subscription_rel. I have not tested
> > this but we should test it by having some relations in not_ready state
> > and then allow the old cluster (<=PG13) to be upgraded to new (pg14)
> > both with and without this patch and see if there is any change in
> > behavior.
>
> I have tested this scenario: I stopped a server running PG_13 while the
> subscription table sync was in progress.
>

Thanks for the test. This confirms my analysis and we don't need any
change in pg_dump or pg_upgrade for this patch.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v14 patch for the Tablesync Solution1.

Main differences from v13:
+ Addresses all review comments 1-5, posted 9/Jan [ak9]
+ Addresses review comment 1, posted 11/Jan [ak11]
+ Modifications per suggestion [ak11] to handle race scenarios during
Drop/AlterSubscription
+ Changed LOG to WARNING if DropSubscription is unable to drop the tablesync slot

[ak9] =
https://www.postgresql.org/message-id/CAA4eK1%2BgUBxKcYWg%2BMCC6Qbw-My%2B2wKUct%2BiFtr-_HgundUUBQ%40mail.gmail.com
[ak11] = https://www.postgresql.org/message-id/CAA4eK1KGUt86A7CfuQW6OeDvAhEbVk8VOBJmcoZjrYBn965kOA%40mail.gmail.com

====

Features:

* The tablesync slot is now permanent instead of temporary.

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync slot cleanup (drop) code is added for
DropSubscription, AlterSubscription_refresh and for
process_syncing_tables_for_sync functions. Drop/AlterSubscription will
issue WARNING instead of ERROR in case the slot drop fails.

* The tablesync worker is now allowing multiple tx instead of single tx

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase (see the sketch after
the TODO list below).

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* The tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* The DropSubscription cleanup code was enhanced (v7+) to take care of
any crashed tablesync workers.

* The AlterSubscription_refresh (v14+) is now more similar to
DropSubscription w.r.t. stopping workers for any "removed" tables.

* Updates to PG docs.

TODO / Known Issues:

* Minor review comments
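
As mentioned in the FINISHEDCOPY item above, the bypass in
LogicalRepSyncTableStart has roughly the following shape (a sketch only;
the real logic is in the patch):

    switch (MyLogicalRepWorker->relstate)
    {
        case SUBREL_STATE_INIT:
        case SUBREL_STATE_DATASYNC:
            /*
             * Create the slot, set up origin tracking, do copy_table(),
             * then persist SUBREL_STATE_FINISHEDCOPY and commit.
             */
            break;

        case SUBREL_STATE_FINISHEDCOPY:
            /*
             * A previous worker already copied the data; skip straight to
             * catchup using the existing slot and origin.
             */
            break;

        default:
            break;
    }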

===

Also PSA some detailed logging evidence of some test scenarios
involving Drop/AlterSubscription:
+ Test-20210112-AlterSubscriptionRefresh-ok.txt =
AlterSubscription_refresh which successfully drops a tablesync slot
+ Test-20210112-AlterSubscriptionRefresh-warning.txt =
AlterSubscription_refresh gives WARNING that it cannot drop the
tablesync slot (which no longer exists)
+ Test-20210112-DropSubscription-warning.txt = DropSubscription with a
disassociated slot_name gives a WARNING that it cannot drop the
tablesync slot (due to broken connection)

---
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Jan 9, 2021 at 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 2:55 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 1:02 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> > >
> >
> > > 3.
> > > +       /*
> > > +        * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> > > +        * 1 characters.
> > > +        *
> > > +        * The name is calculated as pg_%u_sync_%u (3 + 10 + 6 + 10 + '\0'). (It's
> > > +        * actually the NAMEDATALEN on the remote that matters, but this scheme
> > > +        * will also work reasonably if that is different.)
> > > +        */
> > > +       StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");   /* for sanity */
> > > +
> > > +       syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > >
> > > The comment says syncslotname is limited to NAMEDATALEN - 1 characters.
> > > But the actual size of it is (3 + 10 + 6 + 10 + '\0') = 30, which is not NAMEDATALEN - 1.
> > > Should we change the comment here?
> > >
> >
> > The comment wording is a remnant from older code which had a
> > differently formatted slot name.
> > I think the comment is still valid, albeit maybe unnecessary, since in
> > the current code the tablesync slot name length is fixed. But I left
> > the older comment here as a safety reminder in case some future change
> > would want to modify the slot name. What do you think?
> >
>
> I find it quite confusing. The comments should reflect the latest
> code. You can probably say in some form that the length of slotname
> shouldn't exceed NAMEDATALEN because of remote node constraints on
> slot name length. Also, probably the StaticAssert on NAMEDATALEN is
> not required.

Modified comment in latest patch [v14]

>
> 1.
> +   <para>
> +    Additional table synchronization slots are normally transient, created
> +    internally and dropped automatically when they are no longer needed.
> +    These table synchronization slots have generated names:
> +   <quote><literal>pg_%u_sync_%u</literal></quote> (parameters:
> Subscription <parameter>oid</parameter>, Table
> <parameter>relid</parameter>)
> +   </para>
>
> The last line seems too long. I think we are not strict about the 80-char
> limit in docs, but it is good to be close to that; however, this
> appears quite long.

Fixed in latest patch [v14]

>
> 2.
> + /*
> + * Cleanup any remaining tablesync resources.
> + */
> + {
> + char originname[NAMEDATALEN];
> + RepOriginId originid;
> + char state;
> + XLogRecPtr statelsn;
>
> I have already mentioned previously that we should not use this new style
> of code (using { to localize the scope of variables). I don't
> know about others, but I find it difficult to read such code. You
> might want to consider moving this whole block to a separate function.
>

Removed extra code block in latest patch [v14]

> 3.
> /*
> + * XXX - Should optimize this to avoid multiple
> + * connect/disconnect.
> + */
> + wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
>
> I think it is better to avoid multiple connect/disconnect here. In
> this same function, we have connected to the publisher, we should be
> able to use the same connection.
>

Fixed in latest patch [v14]

> 4.
> process_syncing_tables_for_sync()
> {
> ..
> + /*
> + * Cleanup the tablesync slot.
> + */
> + syncslotname = ReplicationSlotNameForTablesync(
> +    MySubscription->oid,
> +    MyLogicalRepWorker->relid);
> + PG_TRY();
> + {
> + elog(DEBUG1, "process_syncing_tables_for_sync: dropping the
> tablesync slot \"%s\".", syncslotname);
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname);
> + }
> + PG_FINALLY();
> + {
> + pfree(syncslotname);
> + }
> + PG_END_TRY();
> ..
> }
>
> Both here and in DropSubscription(), it seems we are using
> PG_TRY..PG_FINALLY just to free the memory even though
> ReplicationSlotDropAtPubNode already has try..finally. Can we arrange
> the code to move the allocation of syncslotname inside
> ReplicationSlotDropAtPubNode to avoid the additional try..finally? BTW, if
> the usage of try..finally here is only to free the memory, I am not
> sure if it is required, because I think we will anyway reset the memory
> context where this memory is allocated as part of error handling.
>

Eliminated the need for TRY/FINALLY to free syncslotname in latest patch [v14]

> 5.
>  #define SUBREL_STATE_DATASYNC 'd' /* data is being synchronized (sublsn
>   * NULL) */
> +#define SUBREL_STATE_TCOPYDONE 't' /* tablesync copy phase is completed
> + * (sublsn NULL) */
>  #define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of
>   * apply (sublsn set) */
>
> I am not very happy with the new state name SUBREL_STATE_TCOPYDONE, as
> it is quite different from the other adjoining state names and somehow
> does not go well with the code. How about SUBREL_STATE_ENDCOPY 'e' or
> SUBREL_STATE_FINISHEDCOPY 'f'?
>

Using SUBREL_STATE_FINISHEDCOPY in latest patch [v14]

---
[v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> > >
> >
> > Thanks for reproducing as I was worried about exactly this case. I
> > have one question related to logs:
> >
> > ##
> > ## ALTER SUBSCRIPTION to REFRESH the publication
> >
> > ## This blocks on some latch until the tablesync worker dies, then it continues
> > ##
> >
> > Did you check which exact latch or lock blocks this?
> >
>
> > I have checked this myself, and the command is waiting on the drop of the
> > origin till the tablesync worker is finished, because replorigin_drop()
> > requires state->acquired_by to be 0, which will only be true once the
> > tablesync worker exits. I think this is the reason you might have
> > noticed that the command can't be finished until the tablesync worker
> > died. So this can't serve as an interlock between the ALTER SUBSCRIPTION ..
> > REFRESH command and the tablesync worker, and to that end it seems you have
> > the below FIXMEs in the patch:

I have also seen this same blocking reason before in replorigin_drop().
However, back when I first tested/reproduced the refresh issue
[test-refresh], AlterSubscription_refresh was still the *original*
unchanged code, so at that time it did not have any replorigin_drop()
call at all. In any case, in the latest code [v14] the AlterSubscription
immediately stops the workers, so this question may be moot.

>
> + * FIXME - Usually this cleanup would be OK, but will not
> + * always be OK because the logicalrep_worker_stop_at_commit
> + * only "flags" the worker to be stopped in the near future
> + * but meanwhile it may still be running. In this case there
> + * could be a race between the tablesync worker and this code
> + * to see who will succeed with the tablesync drop (and the
> + * loser will ERROR).
> + *
> + * FIXME - Also, checking the state is also not guaranteed
> + * correct because state might be TCOPYDONE when we checked
> + * but has since progressed to SYNDONE
> + */
> +
> + if (state == SUBREL_STATE_TCOPYDONE)
> + {
>
> I feel this was okay for the earlier code, but now we need to stop the
> tablesync workers before trying to drop the slot, as we do in
> DropSubscription. Now, if we do that then that would fix the race
> conditions mentioned in the FIXMEs, but still, there are a few more things
> I am worried about: (a) What if the launcher again starts the tablesync
> worker? One idea could be to acquire AccessExclusiveLock on
> SubscriptionRelationId as we do in DropSubscription, which is not a
> very good idea, but I can't think of any other good way. (b) The patch
> is just checking SUBREL_STATE_TCOPYDONE before dropping the
> replication slot, but the slot could be created even before that (in
> the SUBREL_STATE_DATASYNC state). One idea could be that we try to drop the
> slot and, if we are not able to, simply continue
> assuming it didn't exist.

The code was modified in the latest patch [v14] along the lines suggested.

The workers for removed tables are now immediately stopped (like
DropSubscription does). Although I did include the AccessExclusiveLock
as (a) suggested, AFAIK this was actually ineffective at preventing
the workers from relaunching. Instead, I am using
logicalrep_worker_stop_at_commit to do this - testing shows it
working ok. Please see the code and latest test logs [v14] for
details.

Also, Drop/AlterSubscription will now only give a WARNING if unable
to drop the slots, as per suggestion (b). This is also tested [v14].

>
> One minor comment:
> 1.
> + SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>   MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
>   MyLogicalRepWorker->relstate_lsn = current_lsn;
> -
>
> Spurious line removal.

Fixed in latest patch [v14]

----
[v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[test-refresh]
https://www.postgresql.org/message-id/CAHut%2BPv7YW7AyO_Q_nf9kzogcJcDFQNe8FBP6yXdzowMz3dY_Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Single transaction in the tablesync worker?

From
"Hou, Zhijie"
Date:
> Also PSA some detailed logging evidence of some test scenarios involving
> Drop/AlterSubscription:
> + Test-20210112-AlterSubscriptionRefresh-ok.txt =
> AlterSubscription_refresh which successfully drops a tablesync slot
> + Test-20210112-AlterSubscriptionRefresh-warning.txt =
> AlterSubscription_refresh gives WARNING that it cannot drop the tablesync
> slot (which no longer exists)
> + Test-20210112-DropSubscription-warning.txt = DropSubscription with a
> disassociated slot_name gives a WARNING that it cannot drop the tablesync
> slot (due to broken connection)

Hi

> * The AlterSubscription_refresh (v14+) is now more similar to DropSubscription w.r.t. stopping workers for any "removed" tables.

I have an issue about the above feature.

With the patch, it seems the worker is not stopped in the case of [1].
I probably missed something; should we stop the worker in such a case?

[1] https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com

Best regards,
houzj




Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 7.
> @@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
>   LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
>
>   /* Make sure it's not used by somebody else */
> - if (replication_state->acquired_by != 0)
> + if (replication_state->acquired_by != 0 &&
> replication_state->acquired_by != MyProcPid)
>   {
>
> I think you won't need this change if you do replorigin_advance before
> replorigin_session_setup in your patch.
>

As you know, replorigin_session_setup sets
replication_state->acquired_by to the current PID. So without this
change, replorigin_advance rejects that same slot state, thinking
that it is already active for a different process. The root problem is
that the same process/PID calling both functions would hang. So this
patch change allows the replorigin_advance code to be called by the
process itself.

IIUC that acquired_by check condition is like a sanity check for the
originid passed. The patched code does just what the comment says:
"/* Make sure it's not used by somebody else */"
Doesn't "somebody else" mean "anyone but me" (i.e. anyone but MyProcPid)?

Also, “setup” of a thing generally comes before usage of that thing,
so won't it seem strange to follow the suggestion and deliberately
call the "setup" function 2nd instead of 1st?

Can you please explain why is it better to do it the suggested way
(switch the calls around) than keep the patch code? Probably there is
a good reason but I am just not understanding it.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Jan 13, 2021 at 1:07 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > Also PSA some detailed logging evidence of some test scenarios involving
> > Drop/AlterSubscription:
> > + Test-20210112-AlterSubscriptionRefresh-ok.txt =
> > AlterSubscription_refresh which successfully drops a tablesync slot
> > + Test-20210112-AlterSubscriptionRefresh-warning.txt =
> > AlterSubscription_refresh gives WARNING that it cannot drop the tablesync
> > slot (which no longer exists)
> > + Test-20210112-DropSubscription-warning.txt = DropSubscription with a
> > disassociated slot_name gives a WARNING that it cannot drop the tablesync
> > slot (due to broken connection)
>
> Hi
>
> > * The AlterSubscription_refresh (v14+) is now more similar to DropSubscription w.r.t. stopping workers for any "removed" tables.
>
> I have an issue about the above feature.
>
> With the patch, it seems the worker is not stopped in the case of [1].
> I probably missed something; should we stop the worker in such a case?
>
> [1] https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com
>

I am not exactly sure of the concern. (If the extra info below does
not help, can you please describe your concern in more detail.)

This [v14] patch code/feature refers only to the immediate
stopping of the *** "tablesync" *** worker (if any) for each
table being removed from the subscription. It has nothing to say about
the "apply" worker of the subscription, which continues replicating as
before.

OTOH, I think the other mail problem is not really related to the
"tablesync" workers. As you can see (e.g. steps 7,8,9,10 of [2]), that
problem is described as continuing over multiple transactions to
replicate unexpected rows - I think this could only be done by the
subscription "apply" worker, and is after the "tablesync" worker has
gone away.

So AFAIK these are 2 quite unrelated problems, and would be solved
independently.

It just happens that they are both exposed using ALTER SUBSCRIPTION
... REFRESH PUBLICATION;

----
[v14] = https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[2] = https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Single transaction in the tablesync worker?

From
"Hou, Zhijie"
Date:
> I am not exactly sure of the concern. (If the extra info below does not
> help can you please describe your concern with more details).
> 
> This [v14] patch code/feature refers only to the immediate stopping
> of the *** "tablesync" *** worker (if any) for each table being
> removed from the subscription. It has nothing to say about the "apply" worker
> of the subscription, which continues replicating as before.
> 
> OTOH, I think the other mail problem is not really related to the "tablesync"
> workers. As you can see (e.g. steps 7,8,9,10 of [2]), that problem is
> described as continuing over multiple transactions to replicate unexpected
> rows - I think this could only be done by the subscription "apply" worker,
> and is after the "tablesync" worker has gone away.
> 
> So AFAIK these are 2 quite unrelated problems, and would be solved
> independently.
> 
> It just happens that they are both exposed using ALTER SUBSCRIPTION ...
> REFRESH PUBLICATION;

So sorry for the confusion; you are right that these are 2 quite unrelated problems.
I misunderstood the 'stop the worker' here.


+                /* Immediately stop the worker. */
+                logicalrep_worker_stop_at_commit(subid, relid); /* prevent re-launching */
+                logicalrep_worker_stop(subid, relid); /* stop immediately */

Do you think we can add some comments to describe what type of "worker" is stopped here? (sync worker here)
And should we add some more comments to explain the reason for the "Immediately stop" here? It may look easier to
understand.
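
(E.g. the comments could read something like this sketch:)

    /*
     * Immediately stop the TABLESYNC worker (if any) for this removed
     * table. The stop_at_commit call additionally flags the worker so
     * that the launcher will not re-launch it before our transaction
     * commits.
     */
    logicalrep_worker_stop_at_commit(subid, relid); /* prevent re-launching */
    logicalrep_worker_stop(subid, relid);           /* stop immediately */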

Best regards,
Houzj



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 13, 2021 at 1:30 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > I am not exactly sure of the concern. (If the extra info below does not
> > help can you please describe your concern with more details).
> >
> > This [v14] patch code/feature refers only to the immediate stopping
> > of the *** "tablesync" *** worker (if any) for each table being
> > removed from the subscription. It has nothing to say about the "apply" worker
> > of the subscription, which continues replicating as before.
> >
> > OTOH, I think the other mail problem is not really related to the "tablesync"
> > workers. As you can see (e.g. steps 7,8,9,10 of [2]), that problem is
> > described as continuing over multiple transactions to replicate unexpected
> > rows - I think this could only be done by the subscription "apply" worker,
> > and is after the "tablesync" worker has gone away.
> >
> > So AFAIK these are 2 quite unrelated problems, and would be solved
> > independently.
> >
> > It just happens that they are both exposed using ALTER SUBSCRIPTION ...
> > REFRESH PUBLICATION;
>
> So sorry for the confusion; you are right that these are 2 quite unrelated problems.
> I misunderstood the 'stop the worker' here.
>
>
> +                               /* Immediately stop the worker. */
> +                               logicalrep_worker_stop_at_commit(subid, relid); /* prevent re-launching */
> +                               logicalrep_worker_stop(subid, relid); /* stop immediately */
>
> Do you think we can add some comments to describe what type of "worker" is stopped here? (sync worker here)
> And should we add some more comments to explain the reason for the "Immediately stop" here? It may look easier to
> understand.
>

Another thing related to this: why do we need to call both
logicalrep_worker_stop_at_commit() and logicalrep_worker_stop()?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 13, 2021 at 11:18 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 7.
> > @@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
> >   LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
> >
> >   /* Make sure it's not used by somebody else */
> > - if (replication_state->acquired_by != 0)
> > + if (replication_state->acquired_by != 0 &&
> > replication_state->acquired_by != MyProcPid)
> >   {
> >
> > I think you won't need this change if you do replorigin_advance before
> > replorigin_session_setup in your patch.
> >
>
> As you know, replorigin_session_setup sets
> replication_state->acquired_by to the current PID. So without this
> change, replorigin_advance rejects that same slot state, thinking
> that it is already active for a different process. The root problem is
> that the same process/PID calling both functions would hang.
>

I think the hang happens only if we call the unchanged replorigin_advance
after the session_setup API, right?

> So this
> patch change allows the replorigin_advance code to be called by the
> process itself.
>
> IIUC that acquired_by check condition is like a sanity check for the
> originid passed. The patched code does just what the comment says:
> "/* Make sure it's not used by somebody else */"
> Doesn't "somebody else" mean "anyone but me" (i.e. anyone but MyProcPid)?
>
> Also, “setup” of a thing generally comes before usage of that thing,
> so won't it seem strange to follow the suggestion and deliberately
> call the "setup" function 2nd instead of 1st?
>
> Can you please explain why is it better to do it the suggested way
> (switch the calls around) than keep the patch code? Probably there is
> a good reason but I am just not understanding it.
>

Because there is no requirement for the origin_advance API to be called
after session setup. Session setup is required to mark the node as
replaying from a remote node (see [1]), whereas origin_advance is used
for setting up the initial location or setting a new location (see [2],
pg_replication_origin_advance).

Now here, after creating the origin, we need to set up the initial
location, and it seems fine to call origin_advance before
session_setup. In short, I don't see any problem with your
change in replorigin_advance, but OTOH, I don't see the need for it
either. So, let's try to avoid that change unless we can't do
without it.

Also, another thing is that we need to take RowExclusiveLock on
pg_replication_origin before calling replorigin_advance, as written in
the comments atop it. See its usage in pg_replication_origin_advance.
Also, write comments on why we need to use replorigin_advance here
(something like: we need to WAL-log this for the purpose of
recovery).
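
(To illustrate the suggested order, a sketch assuming originname and
origin_startpos are already in scope:)

    originid = replorigin_create(originname);

    /*
     * Advance to the remote start position before the session setup, and
     * WAL-log it so the initial location survives crash recovery.
     */
    LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
    replorigin_advance(originid, origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */ , true /* WAL log */ );
    UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);

    replorigin_session_setup(originid);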

[1] - https://www.postgresql.org/docs/devel/replication-origins.html
[2] - https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Jan 12, 2021 at 6:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > FYI, I was able to reproduce this case in debugger. PSA logs showing details.
> > > >
> > >
> > > Thanks for reproducing as I was worried about exactly this case. I
> > > have one question related to logs:
> > >
> > > ##
> > > ## ALTER SUBSCRIPTION to REFRESH the publication
> > >
> > > ## This blocks on some latch until the tablesync worker dies, then it continues
> > > ##
> > >
> > > Did you check which exact latch or lock blocks this?
> > >
> >
> > I have checked this myself, and the command is waiting on the drop of the
> > origin till the tablesync worker is finished, because replorigin_drop()
> > requires state->acquired_by to be 0, which will only be true once the
> > tablesync worker exits. I think this is the reason you might have
> > noticed that the command can't be finished until the tablesync worker
> > died. So this can't serve as an interlock between the ALTER SUBSCRIPTION ..
> > REFRESH command and the tablesync worker, and to that end it seems you have
> > the below FIXMEs in the patch:
>
> I have also seen this same blocking reason before in replorigin_drop().
> However, back when I first tested/reproduced the refresh issue
> [test-refresh], AlterSubscription_refresh was still the *original*
> unchanged code, so at that time it did not have any replorigin_drop()
> call at all. In any case, in the latest code [v14] the AlterSubscription
> immediately stops the workers, so this question may be moot.
>
> >
> > + * FIXME - Usually this cleanup would be OK, but will not
> > + * always be OK because the logicalrep_worker_stop_at_commit
> > + * only "flags" the worker to be stopped in the near future
> > + * but meanwhile it may still be running. In this case there
> > + * could be a race between the tablesync worker and this code
> > + * to see who will succeed with the tablesync drop (and the
> > + * loser will ERROR).
> > + *
> > + * FIXME - Also, checking the state is also not guaranteed
> > + * correct because state might be TCOPYDONE when we checked
> > + * but has since progressed to SYNDONE
> > + */
> > +
> > + if (state == SUBREL_STATE_TCOPYDONE)
> > + {
> >
> > I feel this was okay for the earlier code, but now we need to stop the
> > tablesync workers before trying to drop the slot, as we do in
> > DropSubscription. Now, if we do that then that would fix the race
> > conditions mentioned in the FIXMEs, but still, there are a few more things
> > I am worried about: (a) What if the launcher again starts the tablesync
> > worker? One idea could be to acquire AccessExclusiveLock on
> > SubscriptionRelationId as we do in DropSubscription, which is not a
> > very good idea, but I can't think of any other good way. (b) The patch
> > is just checking SUBREL_STATE_TCOPYDONE before dropping the
> > replication slot, but the slot could be created even before that (in
> > the SUBREL_STATE_DATASYNC state). One idea could be that we try to drop the
> > slot and, if we are not able to, simply continue
> > assuming it didn't exist.
>
> The code was modified in the latest patch [v14] along the lines suggested.
>
> The workers for removed tables are now immediately stopped (like
> DropSubscription does). Although I did include the AccessExclusiveLock
> as (a) suggested, AFAIK this was actually ineffective at preventing
> the workers from relaunching.
>

The reason why it was ineffective is that you are locking
SubscriptionRelationId, which protects against the relaunch of apply
workers, not tablesync workers. But in the current form even acquiring the
SubscriptionRelRelationId lock won't serve the purpose, because
process_syncing_tables_for_apply() doesn't always acquire it before
relaunching the tablesync workers. However, if we acquire
SubscriptionRelRelationId in process_syncing_tables_for_apply() then
it would prevent the relaunch of workers, but I am not sure if that is a
good idea. Can you think of some other way?

> Instead, I am using
> logicalrep_worker_stop_at_commit to do this - testing shows it
> working ok. Please see the code and latest test logs [v14] for
> details.
>

There is still a window where it can relaunch. Basically, after you
stop the worker in AlterSubscription_refresh and until the commit
happens, the apply worker can relaunch the tablesync workers. I don't see,
code-wise, how we can protect against that. And if the tablesync workers are
restarted after we stopped them, the purpose won't be achieved, because
they can recreate or try to reuse the slot which we have dropped.

The other issue with the current code could be that after we drop the
slot and origin, what if the transaction (in which we are doing the Alter
Subscription) is rolled back? Basically, the workers will be relaunched,
and they would assume that the slot should be there, but the slot won't be
present. I have thought of dropping the slot at commit time after we
stop the workers, but again I am not sure if that is a good idea, because at
that point we don't want to establish the connection with the
publisher.

I think this needs some more thought.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v15 patch for the Tablesync Solution1.

Main differences from v14:
+ Addresses review comment, posted 13/Jan [ak13]

[ak13] = https://www.postgresql.org/message-id/CAA4eK1KzNbudfwmJD-ureYigX6sNyCU6YgHscg29xWoZG6osvA%40mail.gmail.com

====

Features:

* The tablesync slot is now permanent instead of temporary.

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync slot cleanup (drop) code is added for
DropSubscription, AlterSubscription_refresh and for
process_syncing_tables_for_sync functions. Drop/AlterSubscription will
issue WARNING instead of ERROR in case the slot drop fails.

* The tablesync worker is now allowing multiple tx instead of single tx

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* The tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply.

* The DropSubscription cleanup code was enhanced (v7+) to take care of
any crashed tablesync workers.

* The AlterSubscription_refresh (v14+) is now more similar to
DropSubscription w.r.t. stopping tablesync workers for any "removed"
tables.

* Updates to PG docs.

TODO / Known Issues:

* The AlterSubscription_refresh tablesync cleanup code still has some
problems [1]
[1] = https://www.postgresql.org/message-id/CAA4eK1JuwZF7FHM%2BEPjWdVh%3DXaz-7Eo-G0TByMjWeUU32Xue3w%40mail.gmail.com

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Jan 13, 2021 at 9:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 13, 2021 at 11:18 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > 7.
> > > @@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
> > >   LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
> > >
> > >   /* Make sure it's not used by somebody else */
> > > - if (replication_state->acquired_by != 0)
> > > + if (replication_state->acquired_by != 0 &&
> > > replication_state->acquired_by != MyProcPid)
> > >   {
> > >
> > > I think you won't need this change if you do replorigin_advance before
> > > replorigin_session_setup in your patch.
> > >
> >
> > As you know, replorigin_session_setup sets
> > replication_state->acquired_by to the current PID. So without this
> > change, replorigin_advance rejects that same slot state, thinking
> > that it is already active for a different process. The root problem is
> > that the same process/PID calling both functions would hang.
> >
>
> I think the hang happens only if we call the unchanged replorigin_advance
> after the session_setup API, right?
>
> > So this
> > patch change allows the replorigin_advance code to be called by the
> > process itself.
> >
> > IIUC that acquired_by check condition is like a sanity check for the
> > originid passed. The patched code does just what the comment says:
> > "/* Make sure it's not used by somebody else */"
> > Doesn't "somebody else" mean "anyone but me" (i.e. anyone but MyProcPid)?
> >
> > Also, "setup" of a thing generally comes before usage of that thing,
> > so won't it seem strange (as per the suggestion) to deliberately
> > call the "setup" function 2nd instead of 1st?
> >
> > Can you please explain why is it better to do it the suggested way
> > (switch the calls around) than keep the patch code? Probably there is
> > a good reason but I am just not understanding it.
> >
>
> Because there is no requirement for origin_advance API to be called
> after session setup. Session setup is required to mark the node as
> replaying from a remote node, see [1] whereas origin_advance is used
> for setting up the initial location or setting a new location, see [2]
> (pg_replication_origin_advance).
>
> Now here, after creating the origin, we need to set up the initial
> location and it seems fine to call origin_advance before
> session_setup. In short, I don't see any problem with your
> change in replorigin_advance but, OTOH, I don't see the need for it
> either. So, let's try to avoid that change unless we can't do
> without it.
>
> Also, another thing is we need to take RowExclusiveLock on
> pg_replication_origin as written in comments atop replorigin_advance
> before calling it. See its usage in pg_replication_origin_advance.
> Also, write comments on why we need to use replorigin_advance here
> (... something, like we need to WAL log this for the purpose of
> recovery...).
>

Modified in latest patch [v15].
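
For reference, the advance now looks roughly like this (a sketch using
the origin API names; see the patch for the exact code):

/* Set up the initial start point; this is WAL logged for the purpose
 * of recovery. The locks prevent the replication origin from
 * vanishing while advancing. */
LockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);
replorigin_advance(originid, origin_startpos, InvalidXLogRecPtr,
                   true /* go backward */ , true /* WAL log */ );
UnlockRelationOid(ReplicationOriginRelationId, RowExclusiveLock);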

----
[v15] = https://www.postgresql.org/message-id/CAHut%2BPu3he2rOWjbXcNUO6z3aH2LYzW03KV%2BfiMWim49qW9etQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Jan 13, 2021 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 12, 2021 at 6:17 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The workers for removed tables are now immediately stopped (like
> > DropSubscription does). Although I did include the AccessExclusiveLock
> > as (a) suggested, AFAIK this was actually ineffective at preventing
> > the workers relaunching.
> >
>
> The reason why it was ineffective is that you are locking
> SubscriptionRelationId which is to protect relaunch of apply workers
> not tablesync workers. But in current form even acquiring
> SubscriptionRelRelationId lock won't serve the purpose because
> process_syncing_tables_for_apply() doesn't always acquire it before
> relaunching the tablesync workers. However, if we acquire
> SubscriptionRelRelationId in process_syncing_tables_for_apply() then
> it would prevent relaunch of workers but not sure if that is a good
> idea. Can you think of some other way?
>
> > Instead, I am using
> > logicalrep_worker_stop_at_commit to do this - testing shows it as
> > working ok. Please see the code and latest test logs [v14] for
> > details.
> >
>
> There is still a window where it can relaunch. Basically, after you
> stop the worker in AlterSubscription_refresh and till the commit
> happens apply worker can relaunch the tablesync workers. I don't see
> code-wise how we can protect that. And if the tablesync workers are
> restarted after we stopped them, the purpose won't be achieved because
> it can recreate or try to reuse the slot which we have dropped.
>
> The other issue with the current code could be that after we drop the
> slot and origin what if the transaction (in which we are doing Alter
> Subscription) is rolledback? Basically, the workers will be relaunched
> and it would assume that slot should be there but the slot won't be
> present. I have thought of dropping the slot at commit time after we
> stop the workers but again not sure if that is a good idea because at
> that point we don't want to establish the connection with the
> publisher.
>
> I think this needs some more thought.
>

I have another idea to solve this problem. Instead of having Alter
Subscription drop the slot/origin, we can let the tablesync worker do
it. Basically, we need to register SignalHandlerForShutdownRequest as
the SIGTERM handler and then later check the ShutdownRequestPending
flag in the tablesync worker. If the flag is set, then we can drop the
slot/origin and allow the process to exit cleanly.
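
To illustrate the general shape of this idea, here is a minimal
standalone sketch of the flag-based shutdown pattern (plain C with
illustrative names, not actual PostgreSQL code; in the server the
handler and flag already exist as SignalHandlerForShutdownRequest and
ShutdownRequestPending in postmaster/interrupt.h):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t keeps the write async-signal-safe. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
handle_sigterm(int signo)
{
    shutdown_requested = 1;     /* only set the flag; defer the real work */
}

int
main(void)
{
    signal(SIGTERM, handle_sigterm);

    for (;;)
    {
        if (shutdown_requested)
        {
            /* Here the tablesync worker would drop its slot/origin. */
            printf("dropping slot/origin, exiting cleanly\n");
            exit(0);
        }
        sleep(1);               /* stand-in for the worker's main loop */
    }
}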

This will obviate the need to take the lock and all sort of rollback
problems. If this works out well then I think we can use this for
DropSubscription as well but that is a matter of separate patch.

Thoughts?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v16 patch for the Tablesync Solution1.

Main differences from v15:
+ Tablesync cleanups for DropSubscription/AlterSubscription_refresh are
re-implemented via a ProcessInterrupts function

====

Features:

* The tablesync slot is now permanent instead of temporary.

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync worker now allows multiple transactions instead of a single transaction.

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply
worker). The origin is advanced when first created.

* Cleanup of tablesync resources:
- The tablesync slot cleanup (drop) code is added to the
process_syncing_tables_for_sync function.
- The tablesync replication origin tracking is cleaned up by
process_syncing_tables_for_apply.
- A tablesync function to clean up its own slot/origin is called from
ProcessInterrupts. This is indirectly invoked by
DropSubscription/AlterSubscription when they signal the tablesync
worker to stop.

* Updates to PG docs.

TODO / Known Issues:

* Race condition observed in "make check" may be related to this patch.

* Add test cases.

---

Please also see some test scenario logging which shows the new
tablesync cleanup function getting called as a result of
Drop/AlterSubscription.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v17 patch for the Tablesync Solution1.

Main differences from v16:
+ Small refactor for DropSubscription to correct the "make check" deadlock
+ Added test case
+ Some comment wording

====

Features:

* The tablesync slot is now permanent instead of temporary.

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync worker now allows multiple transactions instead of a single transaction.

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply
worker). The origin is advanced when first created.

* Cleanup of tablesync resources:
- The tablesync slot cleanup (drop) code is added to the
process_syncing_tables_for_sync function.
- The tablesync replication origin tracking is cleaned up by
process_syncing_tables_for_apply.
- A tablesync function to clean up its own slot/origin is called from
ProcessInterrupts. This is indirectly invoked by
DropSubscription/AlterSubscription when they signal the tablesync
worker to stop.

* Updates to PG docs.

* New TAP test case

TODO / Known Issues:

* None known.

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Jan 19, 2021 at 2:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Amit.
>
> PSA the v17 patch for the Tablesync Solution1.
>

Thanks for the updated patch. Below are a few comments:
1. Why are we changing the scope of PG_TRY in DropSubscription()?
Also, it might be better to keep the replication slot drop part as it
is.

2.
- *    - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
- * waits for state change.
+ *    - Tablesync worker does initial table copy; there is a
FINISHEDCOPY state to
+ * indicate when the copy phase has completed, so if the worker crashes
+ * before reaching SYNCDONE the copy will not be re-attempted.

In the last line, shouldn't the state be FINISHEDCOPY instead of SYNCDONE?

3.
+void
+tablesync_cleanup_at_interrupt(void)
+{
+ bool drop_slot_needed;
+ char originname[NAMEDATALEN] = {0};
+ RepOriginId originid;
+ TimeLineID tli;
+ Oid subid = MySubscription->oid;
+ Oid relid = MyLogicalRepWorker->relid;
+
+ elog(DEBUG1,
+ "tablesync_cleanup_at_interrupt for relid = %d",
+ MyLogicalRepWorker->relid);

The function name and message make it sound like we drop the slot
and origin at any interrupt. Isn't it better to name it
tablesync_cleanup_at_shutdown()?

4.
+ drop_slot_needed =
+ wrconn != NULL &&
+ MyLogicalRepWorker->relstate != SUBREL_STATE_SYNCDONE &&
+ MyLogicalRepWorker->relstate != SUBREL_STATE_READY;
+
+ if (drop_slot_needed)
+ {
+ char syncslotname[NAMEDATALEN] = {0};
+ bool missing_ok = true; /* no ERROR if slot is missing. */

I think we can avoid using missing_ok and drop_slot_needed variables.

5. Can we drop the origin along with the slot in
process_syncing_tables_for_sync() instead of
process_syncing_tables_for_apply()? I think this is possible because
of the other changes you made in origin.c. Also, if possible, we can
try to use the same code to drop the slot and origin in
tablesync_cleanup_at_interrupt and process_syncing_tables_for_sync.

6.
+ if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+ {
+ /*
+ * The COPY phase was previously done, but tablesync then crashed/etc
+ * before it was able to finish normally.
+ */

There seems to be a typo (crashed/etc) in the above comment.

7.
+# check for occurrence of the expected error
+poll_output_until("replication slot \"$slotname\" already exists")
+    or die "no error stop for the pre-existing origin";

In this test, isn't it better to check for datasync state like below?
004_sync.pl has some other similar test.
my $started_query = "SELECT srsubstate = 'd' FROM pg_subscription_rel;";
$node_subscriber->poll_query_until('postgres', $started_query)
  or die "Timed out while waiting for subscriber to start sync";

Is there a reason why we can't use the existing way to check for
failure in this case?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Jan 21, 2021 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 19, 2021 at 2:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Amit.
> >
> > PSA the v17 patch for the Tablesync Solution1.
> >
>
> Thanks for the updated patch. Below are a few comments:
>

One more comment:

In LogicalRepSyncTableStart(), you are trying to remove the slot on
failure of the copy, which won't work if the publisher is down. If that
happens on restart of the tablesync worker, we will retry creating the
slot with the same name and it will fail because the previous slot is
still not removed from the publisher. I think the same problem can
happen if, after an error in the tablesync worker, we drop the
subscription before the tablesync worker gets a chance to restart. So,
to avoid these problems, can we use a TEMPORARY slot for tablesync
workers as previously? If I remember correctly, the main problem was
that we don't know where to start decoding if we fail in the catchup
phase. But for that, origins should be sufficient: if we fail before
the copy then anyway we have to create a new slot and origin, but if we
fail after the copy then we can use the start_decoding_position from
the origin. So before the copy, we still need to use CRS_USE_SNAPSHOT
while creating a temporary slot, but if we are already in FINISHEDCOPY
state at the start of the tablesync worker then create a slot with the
CRS_NOEXPORT_SNAPSHOT option, use the origin's start_pos, and proceed
decoding changes from that point onwards, similar to how the apply
worker currently works.
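
A rough sketch of the proposed branching, assuming the
walrcv_create_slot() and origin APIs of that time (variable names and
placement are illustrative, not the actual patch):

if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
{
    /* Copy is already done: no snapshot is needed, and decoding should
     * resume from the position recorded in the replication origin. */
    walrcv_create_slot(wrconn, slotname, true /* temporary */,
                       CRS_NOEXPORT_SNAPSHOT, &lsn);
    origin_startpos = replorigin_session_get_progress(false);
}
else
{
    /* Fresh start: the slot's exported snapshot drives the initial COPY. */
    walrcv_create_slot(wrconn, slotname, true /* temporary */,
                       CRS_USE_SNAPSHOT, &origin_startpos);
}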

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v18 patch for the Tablesync Solution1.

Main differences from v17:
+ Design change to use TEMPORARY tablesync slots [ak0122], which means
much of the v17 slot cleanup code became unnecessary.
+ Small refactor in LogicalRepSyncTableStart to fix a deadlock scenario.
+ Addressing some review comments [ak0121].

[ak0121]
https://www.postgresql.org/message-id/CAA4eK1LGxuB_RTfZ2HLJT76wv%3DFLV6UPqT%2BFWkiDg61rvQkkmQ%40mail.gmail.com
[ak0122] https://www.postgresql.org/message-id/CAA4eK1LS0_mdVx2zG3cS%2BH88FJiwyS3kZi7zxijJ_gEuw2uQ2g%40mail.gmail.com

====

Features:

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync worker now allows multiple transactions instead of a single transaction.

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply
worker). The origin is advanced when first created.

* The tablesync replication origin tracking record is cleaned up by:
- process_syncing_tables_for_apply
- DropSubscription
- AlterSubscription_refresh

* Updates to PG docs.

* New TAP test case

Known Issues:

* None.

---
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Fri, Jan 22, 2021 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 21, 2021 at 3:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 19, 2021 at 2:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Hi Amit.
> > >
> > > PSA the v17 patch for the Tablesync Solution1.
> > >
> >
> > Thanks for the updated patch. Below are a few comments:
> >
>
> One more comment:
>
> In LogicalRepSyncTableStart(), you are trying to remove the slot on
> failure of the copy, which won't work if the publisher is down. If that
> happens on restart of the tablesync worker, we will retry creating the
> slot with the same name and it will fail because the previous slot is
> still not removed from the publisher. I think the same problem can
> happen if, after an error in the tablesync worker, we drop the
> subscription before the tablesync worker gets a chance to restart. So,
> to avoid these problems, can we use a TEMPORARY slot for tablesync
> workers as previously? If I remember correctly, the main problem was
> that we don't know where to start decoding if we fail in the catchup
> phase. But for that, origins should be sufficient: if we fail before
> the copy then anyway we have to create a new slot and origin, but if we
> fail after the copy then we can use the start_decoding_position from
> the origin. So before the copy, we still need to use CRS_USE_SNAPSHOT
> while creating a temporary slot, but if we are already in FINISHEDCOPY
> state at the start of the tablesync worker then create a slot with the
> CRS_NOEXPORT_SNAPSHOT option, use the origin's start_pos, and proceed
> decoding changes from that point onwards, similar to how the apply
> worker currently works.
>

OK. The code is modified as suggested in the latest patch [v18].
Now that tablesync slots are temporary, quite a lot of the cleanup code
from the previous patch (v17) is no longer required and has been
removed.

----
[v18] =
https://www.postgresql.org/message-id/CAHut%2BPvm0R%3DMn_uVN_JhK0scE54V6%2BEDGHJg1WYJx0Q8HX_mkQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Jan 21, 2021 at 9:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 19, 2021 at 2:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Amit.
> >
> > PSA the v17 patch for the Tablesync Solution1.
> >
>
> Thanks for the updated patch. Below are a few comments:
> 1. Why are we changing the scope of PG_TRY in DropSubscription()?
> Also, it might be better to keep the replication slot drop part as it
> is.
>

The latest patch [v18] was re-designed to make the tablesync slots
TEMPORARY [ak0122], so this code in DropSubscription is modified a
lot. This review comment is not applicable anymore.

> 2.
> - *    - Tablesync worker finishes the copy and sets table state to SYNCWAIT;
> - * waits for state change.
> + *    - Tablesync worker does initial table copy; there is a
> FINISHEDCOPY state to
> + * indicate when the copy phase has completed, so if the worker crashes
> + * before reaching SYNCDONE the copy will not be re-attempted.
>
> In the last line, shouldn't the state be FINISHEDCOPY instead of SYNCDONE?
>

OK. The code comment was correct, but maybe confusing. I have reworded
it in the latest patch [v18].

> 3.
> +void
> +tablesync_cleanup_at_interrupt(void)
> +{
> + bool drop_slot_needed;
> + char originname[NAMEDATALEN] = {0};
> + RepOriginId originid;
> + TimeLineID tli;
> + Oid subid = MySubscription->oid;
> + Oid relid = MyLogicalRepWorker->relid;
> +
> + elog(DEBUG1,
> + "tablesync_cleanup_at_interrupt for relid = %d",
> + MyLogicalRepWorker->relid);
>
> The function name and message make it sound like we drop the slot
> and origin at any interrupt. Isn't it better to name it
> tablesync_cleanup_at_shutdown()?
>

The latest patch [v18] was re-designed to make the tablesync slots
TEMPORARY [ak0122], so this cleanup function is removed. This review
comment is not applicable anymore.

> 4.
> + drop_slot_needed =
> + wrconn != NULL &&
> + MyLogicalRepWorker->relstate != SUBREL_STATE_SYNCDONE &&
> + MyLogicalRepWorker->relstate != SUBREL_STATE_READY;
> +
> + if (drop_slot_needed)
> + {
> + char syncslotname[NAMEDATALEN] = {0};
> + bool missing_ok = true; /* no ERROR if slot is missing. */
>
> I think we can avoid using missing_ok and drop_slot_needed variables.
>

The latest patch [v18] was re-designed to make the tablesync slots
TEMPORARY [ak0122], so this code no longer exists. This review comment
is not applicable anymore.

> 5. Can we drop the origin along with the slot in
> process_syncing_tables_for_sync() instead of
> process_syncing_tables_for_apply()? I think this is possible because
> of the other changes you made in origin.c. Also, if possible, we can
> try to use the same code to drop the slot and origin in
> tablesync_cleanup_at_interrupt and process_syncing_tables_for_sync.
>

No, the origin tracking cannot be dropped by the tablesync worker for
the normal use-case, even with my modified origin.c; it would fail
during the commit transaction because replorigin_session_advance would
find that the asserted origin id was no longer there.

Also, the latest patch [v18] was re-designed to make the tablesync
slots TEMPORARY [ak0122], so the tablesync_cleanup_at_interrupt
function no longer exists (and so the origin.c change of v17 has also
been removed).

> 6.
> + if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
> + {
> + /*
> + * The COPY phase was previously done, but tablesync then crashed/etc
> + * before it was able to finish normally.
> + */
>
> There seems to be a typo (crashed/etc) in the above comment.
>

OK. Fixed in latest patch [v18].

----
[ak0122] =
https://www.postgresql.org/message-id/CAA4eK1LS0_mdVx2zG3cS%2BH88FJiwyS3kZi7zxijJ_gEuw2uQ2g%40mail.gmail.com
[v18] =
https://www.postgresql.org/message-id/CAHut%2BPvm0R%3DMn_uVN_JhK0scE54V6%2BEDGHJg1WYJx0Q8HX_mkQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Thu, Jan 21, 2021 at 9:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> 7.
> +# check for occurrence of the expected error
> +poll_output_until("replication slot \"$slotname\" already exists")
> +    or die "no error stop for the pre-existing origin";
>
> In this test, isn't it better to check for datasync state like below?
> 004_sync.pl has some other similar test.
> my $started_query = "SELECT srsubstate = 'd' FROM pg_subscription_rel;";
> $node_subscriber->poll_query_until('postgres', $started_query)
>   or die "Timed out while waiting for subscriber to start sync";
>
> Is there a reason why we can't use the existing way to check for
> failure in this case?

Since the new design now uses temporary slots, is this test case still
required? If it is, I can change it accordingly.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Jan 23, 2021 at 8:37 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Jan 21, 2021 at 9:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 7.
> > +# check for occurrence of the expected error
> > +poll_output_until("replication slot \"$slotname\" already exists")
> > +    or die "no error stop for the pre-existing origin";
> >
> > In this test, isn't it better to check for datasync state like below?
> > 004_sync.pl has some other similar test.
> > my $started_query = "SELECT srsubstate = 'd' FROM pg_subscription_rel;";
> > $node_subscriber->poll_query_until('postgres', $started_query)
> >   or die "Timed out while waiting for subscriber to start sync";
> >
> > Is there a reason why we can't use the existing way to check for
> > failure in this case?
>
> Since the new design now uses temporary slots, is this test case still
> required?
>

I think so. But do you have any reason to believe that it won't be
required anymore?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Sat, Jan 23, 2021 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

>
> I think so. But do you have any reason to believe that it won't be
> required anymore?

A temporary slot will not clash with a permanent slot of the same name.

regards,
Ajin Cherian
Fujitsu



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> PSA the v18 patch for the Tablesync Solution1.
>

Few comments:
=============
1.
- *   So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
- *   CATCHUP -> SYNCDONE -> READY.
+ *   So the state progression is always: INIT -> DATASYNC ->
+ *   (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.

I don't think we need to be specific here that sync worker sets
FINISHEDCOPY state.

2.
@@ -98,11 +102,16 @@
 #include "miscadmin.h"
 #include "parser/parse_relation.h"
 #include "pgstat.h"
+#include "postmaster/interrupt.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
+#include "replication/logicalworker.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
+#include "replication/slot.h"

I don't think the above includes are required. They seem to be a
remnant of the previous approach.

3.
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
- Assert(IsTransactionState());
+ bool sync_done = false;

  SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
+ current_lsn >= MyLogicalRepWorker->relstate_lsn;
+ SpinLockRelease(&MyLogicalRepWorker->relmutex);

- if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
- current_lsn >= MyLogicalRepWorker->relstate_lsn)
+ if (sync_done)
  {
  TimeLineID tli;

+ /*
+ * Change state to SYNCDONE.
+ */
+ SpinLockAcquire(&MyLogicalRepWorker->relmutex);

Why do we need these changes? If you have done it for
code-readability purposes then we can consider this as a separate
patch, because I don't see why these are required w.r.t. this patch.

4.
- /*
- * To build a slot name for the sync work, we are limited to NAMEDATALEN -
- * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
- * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
- * NAMEDATALEN on the remote that matters, but this scheme will also work
- * reasonably if that is different.)
- */
- StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
- slotname = psprintf("%.*s_%u_sync_%u",
- NAMEDATALEN - 28,
- MySubscription->slotname,
- MySubscription->oid,
- MyLogicalRepWorker->relid);
+ /* Calculate the name of the tablesync slot. */
+ slotname = ReplicationSlotNameForTablesync(
+    MySubscription->oid,
+    MyLogicalRepWorker->relid);

What is the reason for changing the slot name calculation? If there is
any particular reason, then we can add a comment to indicate why we
can't include the subscription's slotname in this calculation.

5.
This is WAL
+ * logged for for the purpose of recovery. Locks are to prevent the
+ * replication origin from vanishing while advancing.

/for for/for

6.
+ /* Remove the tablesync's origin tracking if exists. */
+ snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
+ originid = replorigin_by_name(originname, true);
+ if (originid != InvalidRepOriginId)
+ {
+ elog(DEBUG1, "DropSubscription: dropping origin tracking for
\"%s\"", originname);

I don't think we need this and the DEBUG1 message in
AlterSubscription_refresh. IT is fine to print this information for
background workers like in apply-worker but not sure if need it here.
The DropSubscription drops the origin of apply worker but it doesn't
use such a DEBUG message so I guess we don't it for tablesync origins
as well.

7. Have you tested with the new patch the scenario where we crash
after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
replication using the new temporary slot? Here, we need to test the
case where during the catchup phase we have received a few commits and
then the tablesync worker crashes/errors out. Basically, check whether
the replication is continued from the same point. I understand that
this can only be tested by adding some logs and we might not be able
to write a test for it.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
FYI - I have done some long-running testing using the current patch [v18].

1. The src/test/subscription TAP tests:
- Subscription TAP tests were executed in a loop X 150 iterations.
- Duration 5 hrs.
- All iterations report "Result: PASS"

2. The postgres "make check" tests:
- make check was executed in a loop X 150 iterations.
- Duration 2 hrs.
- All iterations report "All 202 tests passed"

---
[v18] https://www.postgresql.org/message-id/CAHut%2BPvm0R%3DMn_uVN_JhK0scE54V6%2BEDGHJg1WYJx0Q8HX_mkQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > PSA the v18 patch for the Tablesync Solution1.
> >
>
> Few comments:
> =============
> 1.
> - *   So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
> - *   CATCHUP -> SYNCDONE -> READY.
> + *   So the state progression is always: INIT -> DATASYNC ->
> + *   (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
>
> I don't think we need to be specific here that sync worker sets
> FINISHEDCOPY state.
>

This was meant to indicate that *only* the sync worker knows about the
FINISHEDCOPY state, whereas all the other states are either known
(assigned and/or used) by *both* kinds of workers. But, I can remove
it if you feel that distinction is not useful.

> 4.
> - /*
> - * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> - * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
> - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
> - * NAMEDATALEN on the remote that matters, but this scheme will also work
> - * reasonably if that is different.)
> - */
> - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
> - slotname = psprintf("%.*s_%u_sync_%u",
> - NAMEDATALEN - 28,
> - MySubscription->slotname,
> - MySubscription->oid,
> - MyLogicalRepWorker->relid);
> + /* Calculate the name of the tablesync slot. */
> + slotname = ReplicationSlotNameForTablesync(
> +    MySubscription->oid,
> +    MyLogicalRepWorker->relid);
>
> What is the reason for changing the slot name calculation? If there is
> any particular reason, then we can add a comment to indicate why we
> can't include the subscription's slotname in this calculation.
>

The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION)
and so including the subscription slot name as part of the tablesync
slot name was considered to be:
a) possibly risky/undefined, if the subscription slot_name = NONE
b) confusing, if we end up using 2 different slot names for the same
tablesync (e.g. if the subscription slot name is changed before a sync
worker is re-launched).
And since this subscription slot name part is not necessary for
uniqueness anyway, it was removed from the tablesync slot name to
eliminate those concerns.

Also, the tablesync slot name calculation was encapsulated as a
separate function because previously (i.e. before v18) it was used by
various other cleanup code. I still like it better as a function, but
now it is only called from one place, so we could put that code back
inline if you prefer how it was.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sun, Jan 24, 2021 at 5:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > 4.
> > - /*
> > - * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> > - * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
> > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
> > - * NAMEDATALEN on the remote that matters, but this scheme will also work
> > - * reasonably if that is different.)
> > - */
> > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
> > - slotname = psprintf("%.*s_%u_sync_%u",
> > - NAMEDATALEN - 28,
> > - MySubscription->slotname,
> > - MySubscription->oid,
> > - MyLogicalRepWorker->relid);
> > + /* Calculate the name of the tablesync slot. */
> > + slotname = ReplicationSlotNameForTablesync(
> > +    MySubscription->oid,
> > +    MyLogicalRepWorker->relid);
> >
> > What is the reason for changing the slot name calculation? If there is
> > any particular reason, then we can add a comment to indicate why we
> > can't include the subscription's slotname in this calculation.
> >
>
> The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION)
> and so including the subscription slot name as part of the tablesync
> slot name was considered to be:
> a) possibly risky/undefined, if the subscription slot_name = NONE
> b) confusing, if we end up using 2 different slot names for the same
> tablesync (e.g. if the subscription slot name is changed before a sync
> worker is re-launched).
> And since this subscription slot name part is not necessary for
> uniqueness anyway, it was removed from the tablesync slot name to
> eliminate those concerns.
>
> Also, the tablesync slot name calculation was encapsulated as a
> separate function because previously (i.e. before v18) it was used by
> various other cleanup code. I still like it better as a function, but
> now it is only called from one place, so we could put that code back
> inline if you prefer how it was.

It turns out those (a/b) concerns I wrote above may be unfounded,
because it seems it is not possible to alter the slot_name to NONE
unless the subscription is first DISABLED.
So probably I can revert all this tablesync slot name calculation back
to how it originally was in OSS HEAD if you want.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v19 patch for the Tablesync Solution1.

Main differences from v18:
+ Patch has been rebased off HEAD @ 24/Jan
+ Addressing some review comments [ak0123]

[ak0123] https://www.postgresql.org/message-id/CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g%40mail.gmail.com

====

Features:

* The tablesync worker now allows multiple transactions instead of a single transaction.

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply
worker). The origin is advanced when first created.

* The tablesync replication origin tracking record is cleaned up by:
- process_syncing_tables_for_apply
- DropSubscription
- AlterSubscription_refresh

* Updates to PG docs.

* New TAP test case.

Known Issues:

* None.

---
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Jan 25, 2021 at 6:15 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sun, Jan 24, 2021 at 5:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > 4.
> > > - /*
> > > - * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> > > - * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
> > > - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
> > > - * NAMEDATALEN on the remote that matters, but this scheme will also work
> > > - * reasonably if that is different.)
> > > - */
> > > - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
> > > - slotname = psprintf("%.*s_%u_sync_%u",
> > > - NAMEDATALEN - 28,
> > > - MySubscription->slotname,
> > > - MySubscription->oid,
> > > - MyLogicalRepWorker->relid);
> > > + /* Calculate the name of the tablesync slot. */
> > > + slotname = ReplicationSlotNameForTablesync(
> > > +    MySubscription->oid,
> > > +    MyLogicalRepWorker->relid);
> > >
> > > What is the reason for changing the slot name calculation? If there is
> > > any particular reason, then we can add a comment to indicate why we
> > > can't include the subscription's slotname in this calculation.
> > >
> >
> > The subscription slot name may be changed (e.g. ALTER SUBSCRIPTION)
> > and so including the subscription slot name as part of the tablesync
> > slot name was considered to be:
> > a) possibly risky/undefined, if the subscription slot_name = NONE
> > b) confusing, if we end up using 2 different slot names for the same
> > tablesync (e.g. if the subscription slot name is changed before a sync
> > worker is re-launched).
> > And since this subscription slot name part is not necessary for
> > uniqueness anyway, it was removed from the tablesync slot name to
> > eliminate those concerns.
> >
> > Also, the tablesync slot name calculation was encapsulated as a
> > separate function because previously (i.e. before v18) it was used by
> > various other cleanup code. I still like it better as a function, but
> > now it is only called from one place, so we could put that code back
> > inline if you prefer how it was.
>
> It turns out those (a/b) concerns I wrote above may be unfounded,
> because it seems it is not possible to alter the slot_name to NONE
> unless the subscription is first DISABLED.
>

Yeah, but I think the user can still change to some other predefined
slot_name. However, I guess it doesn't matter unless it can lead to
what you have mentioned in (a). As that can't happen, it is probably
better to take out that change from the patch. I see your point about
moving this calculation to a separate function, but I am not sure if it
is worth it unless we have to call it from multiple places or it
simplifies the existing code.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 2.
> @@ -98,11 +102,16 @@
>  #include "miscadmin.h"
>  #include "parser/parse_relation.h"
>  #include "pgstat.h"
> +#include "postmaster/interrupt.h"
>  #include "replication/logicallauncher.h"
>  #include "replication/logicalrelation.h"
> +#include "replication/logicalworker.h"
>  #include "replication/walreceiver.h"
>  #include "replication/worker_internal.h"
> +#include "replication/slot.h"
>
> I don't think the above includes are required. They seem to be a
> remnant of the previous approach.
>

OK. Fixed in the latest patch [v19].

> 3.
>  process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>  {
> - Assert(IsTransactionState());
> + bool sync_done = false;
>
>   SpinLockAcquire(&MyLogicalRepWorker->relmutex);
> + sync_done = MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
> + current_lsn >= MyLogicalRepWorker->relstate_lsn;
> + SpinLockRelease(&MyLogicalRepWorker->relmutex);
>
> - if (MyLogicalRepWorker->relstate == SUBREL_STATE_CATCHUP &&
> - current_lsn >= MyLogicalRepWorker->relstate_lsn)
> + if (sync_done)
>   {
>   TimeLineID tli;
>
> + /*
> + * Change state to SYNCDONE.
> + */
> + SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>
> Why do we need these changes? If you have done it for
> code-readability purposes then we can consider this as a separate
> patch, because I don't see why these are required w.r.t. this patch.
>

Yes, it was for code readability in v17 when this function used to be
much larger. But it is no longer necessary and has been reverted in
the latest patch [v19].

> 4.
> - /*
> - * To build a slot name for the sync work, we are limited to NAMEDATALEN -
> - * 1 characters.  We cut the original slot name to NAMEDATALEN - 28 chars
> - * and append _%u_sync_%u (1 + 10 + 6 + 10 + '\0').  (It's actually the
> - * NAMEDATALEN on the remote that matters, but this scheme will also work
> - * reasonably if that is different.)
> - */
> - StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small"); /* for sanity */
> - slotname = psprintf("%.*s_%u_sync_%u",
> - NAMEDATALEN - 28,
> - MySubscription->slotname,
> - MySubscription->oid,
> - MyLogicalRepWorker->relid);
> + /* Calculate the name of the tablesync slot. */
> + slotname = ReplicationSlotNameForTablesync(
> +    MySubscription->oid,
> +    MyLogicalRepWorker->relid);
>
> What is the reason for changing the slot name calculation? If there is
> any particular reason, then we can add a comment to indicate why we
> can't include the subscription's slotname in this calculation.
>

The tablesync slot name changes were not strictly necessary, so the
code is all reverted to be the same as OSS HEAD now in the latest
patch [v19].

> 5.
> This is WAL
> + * logged for for the purpose of recovery. Locks are to prevent the
> + * replication origin from vanishing while advancing.
>
> /for for/for
>

OK. Fixed in the latest patch [v19].

> 6.
> + /* Remove the tablesync's origin tracking if exists. */
> + snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
> + originid = replorigin_by_name(originname, true);
> + if (originid != InvalidRepOriginId)
> + {
> + elog(DEBUG1, "DropSubscription: dropping origin tracking for
> \"%s\"", originname);
>
> I don't think we need this and the DEBUG1 message in
> AlterSubscription_refresh. It is fine to print this information for
> background workers like the apply worker, but I am not sure if we need
> it here. DropSubscription drops the origin of the apply worker but
> doesn't use such a DEBUG message, so I guess we don't need it for
> tablesync origins as well.
>

OK. These DEBUG1 logs are removed in the latest patch [v19].

----
[v19] https://www.postgresql.org/message-id/CAHut%2BPsj7Xm8C1LbqeAbk-3duyS8xXJtL9TiGaeu3P8g272mAA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sun, Jan 24, 2021 at 12:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Few comments:
> > =============
> > 1.
> > - *   So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
> > - *   CATCHUP -> SYNCDONE -> READY.
> > + *   So the state progression is always: INIT -> DATASYNC ->
> > + *   (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
> >
> > I don't think we need to be specific here that sync worker sets
> > FINISHEDCOPY state.
> >
>
> This was meant to indicate that *only* the sync worker knows about the
> FINISHEDCOPY state, whereas all the other states are either known
> (assigned and/or used) by *both* kinds of workers. But, I can remove
> it if you feel that distinction is not useful.
>

Okay, but I feel you can mention that in the description you have
added for FINISHEDCOPY state. It looks a bit odd here and the message
you want to convey is also not that clear.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Jan 23, 2021 at 11:08 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 3:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > I think so. But do you have any reason to believe that it won't be
> > required anymore?
>
> A temporary slot will not clash with a permanent slot of the same name.
>

I have tried below and it seems to be clashing:
postgres=# SELECT 'init' FROM
pg_create_logical_replication_slot('test_slot2', 'test_decoding');
 ?column?
----------
 init
(1 row)

postgres=# SELECT 'init' FROM
pg_create_logical_replication_slot('test_slot2', 'test_decoding',
true);
ERROR:  replication slot "test_slot2" already exists

Note that the third parameter in the second statement above indicates
whether it is a temporary slot or not. What am I missing?
-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Jan 25, 2021 at 8:23 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2.
> > @@ -98,11 +102,16 @@
> >  #include "miscadmin.h"
> >  #include "parser/parse_relation.h"
> >  #include "pgstat.h"
> > +#include "postmaster/interrupt.h"
> >  #include "replication/logicallauncher.h"
> >  #include "replication/logicalrelation.h"
> > +#include "replication/logicalworker.h"
> >  #include "replication/walreceiver.h"
> >  #include "replication/worker_internal.h"
> > +#include "replication/slot.h"
> >
> > I don't think the above includes are required. They seem to the
> > remnant of the previous approach.
> >
>
> OK. Fixed in the latest patch [v19].
>

You seem to have forgotten to remove #include "replication/slot.h".
Check it; if it is not required then remove that as well.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Jan 25, 2021 at 8:03 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Amit.
>
> PSA the v19 patch for the Tablesync Solution1.
>

I see one race condition in this patch where we try to drop the origin
via the apply process and DropSubscription. I think it can lead to the
error "cache lookup failed for replication origin with oid %u". The
same problem can happen via the exposed API pg_replication_origin_drop,
but probably nobody has faced this issue because it is not used
concurrently. I think for the purposes of this patch we can try to
suppress such an error either via try..catch, or by adding a missing_ok
argument to the replorigin_drop API, or we can just note in the
comments that such a race exists. Additionally, we should try to start
a new thread about the existence of this problem in
pg_replication_origin_drop. What do you think?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v20 patch for the Tablesync Solution1.

Main differences from v19:
+ Updated TAP test [ak0123-7]
+ Fixed comment [ak0125-1]
+ Removed redundant header [ak0125-2]
+ Protection against race condition [ak0125-race]

[ak0123-7] https://www.postgresql.org/message-id/CAA4eK1JhpuwujrV6ABMmZ3jXfW37ssZnJ3fikrY7rRdvoEmu_g%40mail.gmail.com
[ak0125-1]
https://www.postgresql.org/message-id/CAA4eK1JmP2VVpH2%3DO%3D5BBbuH7gyQtWn40aXp_Jyjn1%2BKggfq8A%40mail.gmail.com
[ak0125-2]
https://www.postgresql.org/message-id/CAA4eK1L1j5sfBgHb0-H-%2B2quBstsA3hMcDfP-4vLuU-UF43nXQ%40mail.gmail.com
[ak0125-race]
https://www.postgresql.org/message-id/CAA4eK1%2ByeLwBCkTvTdPM-hSk1fr6jT8KJc362CN8zrGztq_JqQ%40mail.gmail.com

====

Features:

* The tablesync worker now allows multiple transactions instead of a single transaction.

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply
worker). The origin is advanced when first created.

* The tablesync replication origin tracking record is cleaned up by:
- process_syncing_tables_for_apply
- DropSubscription
- AlterSubscription_refresh

* Updates to PG docs.

* New TAP test case.

Known Issues:

* Some records arriving between FINISHEDCOPY and SYNCDONE state may be
lost (currently under investigation).

---

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Jan 21, 2021 at 9:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 7.
> +# check for occurrence of the expected error
> +poll_output_until("replication slot \"$slotname\" already exists")
> +    or die "no error stop for the pre-existing origin";
>
> In this test, isn't it better to check for datasync state like below?
> 004_sync.pl has some other similar test.
> my $started_query = "SELECT srsubstate = 'd' FROM pg_subscription_rel;";
> $node_subscriber->poll_query_until('postgres', $started_query)
>   or die "Timed out while waiting for subscriber to start sync";
>
> Is there a reason why we can't use the existing way to check for
> failure in this case?
>

The TAP test is updated in the latest patch [v20].

----
[v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 25, 2021 at 1:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jan 24, 2021 at 12:24 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Few comments:
> > > =============
> > > 1.
> > > - *   So the state progression is always: INIT -> DATASYNC -> SYNCWAIT ->
> > > - *   CATCHUP -> SYNCDONE -> READY.
> > > + *   So the state progression is always: INIT -> DATASYNC ->
> > > + *   (sync worker FINISHEDCOPY) -> SYNCWAIT -> CATCHUP -> SYNCDONE -> READY.
> > >
> > > I don't think we need to be specific here that sync worker sets
> > > FINISHEDCOPY state.
> > >
> >
> > This was meant to indicate that *only* the sync worker knows about the
> > FINISHEDCOPY state, whereas all the other states are either known
> > (assigned and/or used) by *both* kinds of workers. But, I can remove
> > it if you feel that distinction is not useful.
> >
>
> Okay, but I feel you can mention that in the description you have
> added for FINISHEDCOPY state. It looks a bit odd here and the message
> you want to convey is also not that clear.
>

The comment is updated in the latest patch [v20].

----
[v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 25, 2021 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 25, 2021 at 8:23 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sat, Jan 23, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 2.
> > > @@ -98,11 +102,16 @@
> > >  #include "miscadmin.h"
> > >  #include "parser/parse_relation.h"
> > >  #include "pgstat.h"
> > > +#include "postmaster/interrupt.h"
> > >  #include "replication/logicallauncher.h"
> > >  #include "replication/logicalrelation.h"
> > > +#include "replication/logicalworker.h"
> > >  #include "replication/walreceiver.h"
> > >  #include "replication/worker_internal.h"
> > > +#include "replication/slot.h"
> > >
> > > I don't think the above includes are required. They seem to the
> > > remnant of the previous approach.
> > >
> >
> > OK. Fixed in the latest patch [v19].
> >
>
> You seem to have forgotten to remove #include "replication/slot.h".
> Check it; if it is not required then remove that as well.
>

Fixed in the latest patch [v20].

----
[v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 25, 2021 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 25, 2021 at 8:03 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Amit.
> >
> > PSA the v19 patch for the Tablesync Solution1.
> >
>
> I see one race condition in this patch where we try to drop the origin
> via the apply process and DropSubscription. I think it can lead to the
> error "cache lookup failed for replication origin with oid %u". The
> same problem can happen via the exposed API pg_replication_origin_drop,
> but probably nobody has faced this issue because it is not used
> concurrently. I think for the purposes of this patch we can try to
> suppress such an error either via try..catch, or by adding a missing_ok
> argument to the replorigin_drop API, or we can just note in the
> comments that such a race exists.

OK. This has been isolated to a common function called from 3 places.
The potential race ERROR is suppressed by TRY/CATCH.
Please see the code of the latest patch [v20].
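
The suppression is shaped roughly like this (a sketch; the exact code
is in the patch):

/* Drop the tablesync origin, tolerating a concurrent drop. */
PG_TRY();
{
    replorigin_drop(originid, false /* nowait */ );
}
PG_CATCH();
{
    /* Another process may have dropped the origin first; discard the
     * "cache lookup failed ..." error in that case. */
    FlushErrorState();
}
PG_END_TRY();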

> Additionally, we should try to start a new thread about the
> existence of this problem in pg_replication_origin_drop. What do you
> think?

OK. It is on my TODO list.

----
[v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Jan 25, 2021 at 4:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 25, 2021 at 8:03 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Hi Amit.
> >
> > PSA the v19 patch for the Tablesync Solution1.
> >
>
> I see one race condition in this patch where we try to drop the origin
> via the apply process and DropSubscription. I think it can lead to the
> error "cache lookup failed for replication origin with oid %u". The
> same problem can happen via the exposed API pg_replication_origin_drop,
> but probably nobody has faced this issue because it is not used
> concurrently. I think for the purposes of this patch we can try to
> suppress such an error either via try..catch, or by adding a missing_ok
> argument to the replorigin_drop API, or we can just note in the
> comments that such a race exists. Additionally, we should try to start
> a new thread about the existence of this problem in
> pg_replication_origin_drop. What do you think?

OK. A new thread [ps0127] for this problem was started.

---
[ps0127] =
https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > PSA the v18 patch for the Tablesync Solution1.
>
> 7. Have you tested with the new patch the scenario where we crash
> after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
> replication using the new temporary slot? Here, we need to test the
> case where during the catchup phase we have received a few commits and
> then the tablesync worker crashes/errors out. Basically, check whether
> the replication is continued from the same point.
>

I have tested this and it didn't work; see the example below.

Publisher-side
================
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl1;

Subscriber-side
================
- Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
worker stops.

CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));


CREATE SUBSCRIPTION mysub
         CONNECTION 'host=localhost port=5432 dbname=postgres'
        PUBLICATION mypublication;

During debug, stop after we mark FINISHEDCOPY state.

Publisher-side
================
INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT INTO mytbl1(somedata, text) VALUES (1, 4);


Subscriber-side
================
- Have a breakpoint in apply_dispatch
- continue in debugger;
- After we replay the first commit (which will be for values (1,3)),
note down the origin position in apply_handle_commit_internal and
somehow error out. I have forced the debugger to jump to the last line
in apply_dispatch where the error is raised.
- After the error, again the tablesync worker is restarted and it
starts from the position noted in the previous step
- It exits without replaying the WAL for (1,4)

So, on the subscriber-side, you will see 3 records. Fourth is missing.
Now, if you insert more records on the publisher, it will anyway
replay those but the fourth one got missing.

The temporary slots didn't seem to work because after the crash we
created the new temporary slot again and asked it to start decoding
from the point we noted in origin_lsn. The publisher didn't hold the
required WAL, as our slot was temporary, so it started sending from
some later point. We retain WAL based on the slots' restart_lsn
positions and wal_keep_size. For our case, the positions of the slots
matter, and as we have created temporary slots, there is no way for the
publisher to save that WAL.

In this particular case, even if the WAL had been there, we only pass
the start_decoding_at position but don't pass restart_lsn, so it picked
a random location (the current insert position in WAL) which is ahead
of the start_decoding_at point, and so it never sent the required
fourth record. Now, I don't think it would work even if we somehow sent
the correct restart_lsn, because, as I wrote earlier, there is no
guarantee that the earlier WAL would have been saved.

At this point, I can't think of any way to fix this problem except for
going back to the previous approach of permanent slots, but let me know
if you have any ideas to salvage this approach.

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v21 patch for the Tablesync Solution1.

Main differences from v20:
+ Rebased to latest OSS HEAD @ 27/Jan
+ v21 is a merge of patches [v17] and [v20], which was made
necessary when it was found [ak0127] that the v20 usage of TEMPORARY
tablesync slots did not work correctly. v21 reverts to using PERMANENT
tablesync slots, the same as implemented in v17, while retaining other
review comment fixes made for v18, v19, v20.

----
[v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com
[v20] https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com
[ak0127] https://www.postgresql.org/message-id/CAA4eK1LDsj9kw4FbWAw3CMHyVsjafgDum03cYy-wpGmor%3D8-1w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > PSA the v18 patch for the Tablesync Solution1.
> >
> > 7. Have you tested with the new patch the scenario where we crash
> > after FINISHEDCOPY and before SYNCDONE? Is it able to pick up the
> > replication using the new temporary slot? Here, we need to test the
> > case where during the catchup phase we have received a few commits and
> > then the tablesync worker crashes/errors out. Basically, check whether
> > the replication is continued from the same point.
> >
>
> I have tested this and it didn't work, see the below example.
>
> Publisher-side
> ================
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl1;
>
> Subscriber-side
> ================
> - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> worker stops.
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
>
> CREATE SUBSCRIPTION mysub
>          CONNECTION 'host=localhost port=5432 dbname=postgres'
>         PUBLICATION mypublication;
>
> During debug, stop after we mark FINISHEDCOPY state.
>
> Publisher-side
> ================
> INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
>
>
> Subscriber-side
> ================
> - Have a breakpoint in apply_dispatch
> - continue in debugger;
> - After we replay first commit (which will be for values(1,3), note
> down the origin position in apply_handle_commit_internal and somehow
> error out. I have forced the debugger to set to the last line in
> apply_dispatch where the error is raised.
> - After the error, again the tablesync worker is restarted and it
> starts from the position noted in the previous step
> - It exits without replaying the WAL for (1,4)
>
> So, on the subscriber-side, you will see 3 records. Fourth is missing.
> Now, if you insert more records on the publisher, it will anyway
> replay those but the fourth one got missing.
>
> The temporary slots didn't seem to work because we created again the
> new temporary slot after the crash and ask it to start decoding from
> the point we noted in origin_lsn. The publisher didn’t hold the
> required WAL as our slot was temporary so it started sending from some
> later point. We retain WAL based on the slots restart_lsn position and
> wal_keep_size. For our case, the positions of the slots will matter
> and as we have created temporary slots, there is no way for a
> publisher to save that WAL.
>
> In this particular case, even if the WAL would have been there we only
> pass the start_decoding_at position but didn’t pass restart_lsn, so it
> picked a random location (current insert position in WAL) which is
> ahead of start_decoding_at point so it never sent the required fourth
> record. Now, I don’t think it will work even if somehow sent the
> correct restart_lsn because of what I wrote earlier that there is no
> guarantee that the earlier WAL would have been saved.
>
> At this point, I can't think of any way to fix this problem except for
> going back to the previous approach of permanent slots but let me know
> if you have any ideas to salvage this approach?
>

OK. The latest patch [v21] now restores the permanent slot (and slot
cleanup) approach as it was implemented in an earlier version [v17].
Please note that this change also re-introduces some potential slot
cleanup problems in some race scenarios. These will be addressed by
future patches.

----
[v17] https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com
[v21] https://www.postgresql.org/message-id/CAHut%2BPvzHRRA_5O0R8KZCb1tVe1mBVPxFtmttXJnmuOmAegoWA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Jan 28, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > PSA the v18 patch for the Tablesync Solution1.
> > >
> > > 7. Have you tested with the new patch the scenario where we crash
> > > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> > > replication using the new temporary slot? Here, we need to test the
> > > case where during the catchup phase we have received few commits and
> > > then the tablesync worker is crashed/errored out? Basically, check if
> > > the replication is continued from the same point?
> > >
> >
> > I have tested this and it didn't work, see the below example.
> >
> > Publisher-side
> > ================
> > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> >
> > BEGIN;
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> > COMMIT;
> >
> > CREATE PUBLICATION mypublication FOR TABLE mytbl1;
> >
> > Subscriber-side
> > ================
> > - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> > worker stops.
> >
> > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> >
> >
> > CREATE SUBSCRIPTION mysub
> >          CONNECTION 'host=localhost port=5432 dbname=postgres'
> >         PUBLICATION mypublication;
> >
> > During debug, stop after we mark FINISHEDCOPY state.
> >
> > Publisher-side
> > ================
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> > INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
> >
> >
> > Subscriber-side
> > ================
> > - Have a breakpoint in apply_dispatch
> > - continue in debugger;
> > - After we replay first commit (which will be for values(1,3), note
> > down the origin position in apply_handle_commit_internal and somehow
> > error out. I have forced the debugger to set to the last line in
> > apply_dispatch where the error is raised.
> > - After the error, again the tablesync worker is restarted and it
> > starts from the position noted in the previous step
> > - It exits without replaying the WAL for (1,4)
> >
> > So, on the subscriber-side, you will see 3 records. Fourth is missing.
> > Now, if you insert more records on the publisher, it will anyway
> > replay those but the fourth one got missing.
> >
...
> >
> > At this point, I can't think of any way to fix this problem except for
> > going back to the previous approach of permanent slots but let me know
> > if you have any ideas to salvage this approach?
> >
>
> OK. The latest patch [v21] now restores the permanent slot (and slot
> cleanup) approach as it was implemented in an earlier version [v17].
> Please note that this change also re-introduces some potential slot
> cleanup problems for some race scenarios.
>

I am able to reproduce the race condition where the slot/origin will
remain on the publisher node even when the corresponding subscription
is dropped. Basically, if we error out in the 'catchup' phase in the
tablesync worker, it will normally restart and clean up the
slot/origin; but if in the meantime we have dropped the subscription
and stopped the apply worker, then the slot and origin will probably
be left dangling on the publisher.

I have used exactly the same test procedure as was used to expose the
problem with the temporary slots, with some minor changes as mentioned
below:
Subscriber-side
================
- Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
worker stops.
- Have a while(1) loop in wait_for_relation_state_change so that we
can control apply worker via debugger at the right time.

Subscriber-side
================
- Have a breakpoint in apply_dispatch
- continue in debugger;
- After we replay the first commit, somehow error out. I have forced
the debugger to jump to the last line in apply_dispatch where the
error is raised.
- Now, the tablesync worker won't restart because the apply worker is
looping in wait_for_relation_state_change.
- Execute DropSubscription;
- We can allow the apply worker to continue by skipping the while(1);
it will then exit because DropSubscription will have sent it a
terminate signal.

After the above steps, check the publisher (select * from
pg_replication_slots) and you will find the dangling tablesync slot.
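
Illustratively, the dangling slot shows up there as an inactive
logical slot whose name follows the generated tablesync pattern
(pg_%u_sync_%u, per this patch series):

SELECT slot_name, slot_type, active
FROM pg_replication_slots
WHERE slot_name ~ '^pg_[0-9]+_sync_[0-9]+$';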

I think to solve the above problem we should drop the tablesync
slot/origin at Drop/Alter Subscription time, and additionally we need
to ensure that the apply worker doesn't let tablesync workers restart
(or that they do no work that accesses the slots, because the slots
are dropped) once we have stopped them. To ensure that, I think we
need to make the following changes:

1. Take an AccessExclusiveLock on pg_subscription_rel during Alter
(before calling RemoveSubscriptionRel) and don't release it till
transaction end (do table_close with NoLock), similar to
DropSubscription.
2. Take a share lock (AccessShareLock) in GetSubscriptionRelState (it
gets called from LogicalRepSyncTableStart); we can release this lock
at the end of that function. This will ensure that even if the
tablesync worker is restarted, it will be blocked till the transaction
performing the Alter commits.
3. Make the Alter command not run in a transaction block so that we
don't keep locks for a long time, and also because of the slot-related
operations, similar to DropSubscription.

Few comments on v21:
===================
1.
DropSubscription()
{
..
- /* Clean up dependencies */
+ /* Clean up dependencies. */
  deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
..
}

The above change seems unnecessary w.r.t. the current patch.

2.
DropSubscription()
{
..
  /*
- * If there is no slot associated with the subscription, we can finish
- * here.
+ * If there is a slot associated with the subscription, then drop the
+ * replication slot at the publisher node using the replication
+ * connection.
  */
- if (!slotname)
+ if (slotname)
  {
- table_close(rel, NoLock);
- return;
..
}

What is the reason for this change? Can't we keep the check in its
existing form?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Hi Amit.

PSA the v22 patch for the Tablesync Solution1.

Differences from v21:
+ Patch is rebased to latest OSS HEAD @ 29/Jan.
+ Includes new code as suggested [ak0128] to ensure no dangling slots
at Drop/AlterSubscription.
+ Removes the slot/origin cleanup done by the process interrupt logic
(the cleanup_at_shutdown function).
+ Addresses some minor review comments.

----
[ak0128] https://www.postgresql.org/message-id/CAA4eK1LMYXZY1SpzgW-WyFdy%2BFTMZ4BMz1dj0rT2rxGv-zLwFA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Jan 28, 2021 at 9:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 12:32 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > >
> > > > > PSA the v18 patch for the Tablesync Solution1.
> > > >
> > > > 7. Have you tested with the new patch the scenario where we crash
> > > > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> > > > replication using the new temporary slot? Here, we need to test the
> > > > case where during the catchup phase we have received few commits and
> > > > then the tablesync worker is crashed/errored out? Basically, check if
> > > > the replication is continued from the same point?
> > > >
> > >
> > > I have tested this and it didn't work, see the below example.
> > >
> > > Publisher-side
> > > ================
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > > BEGIN;
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> > > COMMIT;
> > >
> > > CREATE PUBLICATION mypublication FOR TABLE mytbl1;
> > >
> > > Subscriber-side
> > > ================
> > > - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> > > worker stops.
> > >
> > > CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
> > >
> > >
> > > CREATE SUBSCRIPTION mysub
> > >          CONNECTION 'host=localhost port=5432 dbname=postgres'
> > >         PUBLICATION mypublication;
> > >
> > > During debug, stop after we mark FINISHEDCOPY state.
> > >
> > > Publisher-side
> > > ================
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> > > INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
> > >
> > >
> > > Subscriber-side
> > > ================
> > > - Have a breakpoint in apply_dispatch
> > > - continue in debugger;
> > > - After we replay first commit (which will be for values(1,3), note
> > > down the origin position in apply_handle_commit_internal and somehow
> > > error out. I have forced the debugger to set to the last line in
> > > apply_dispatch where the error is raised.
> > > - After the error, again the tablesync worker is restarted and it
> > > starts from the position noted in the previous step
> > > - It exits without replaying the WAL for (1,4)
> > >
> > > So, on the subscriber-side, you will see 3 records. Fourth is missing.
> > > Now, if you insert more records on the publisher, it will anyway
> > > replay those but the fourth one got missing.
> > >
> ...
> > >
> > > At this point, I can't think of any way to fix this problem except for
> > > going back to the previous approach of permanent slots but let me know
> > > if you have any ideas to salvage this approach?
> > >
> >
> > OK. The latest patch [v21] now restores the permanent slot (and slot
> > cleanup) approach as it was implemented in an earlier version [v17].
> > Please note that this change also re-introduces some potential slot
> > cleanup problems for some race scenarios.
> >
>
> I am able to reproduce the race condition where slot/origin will
> remain on the publisher node even when the corresponding subscription
> is dropped. Basically, if we error out in the 'catchup' phase in
> tablesync worker then either it will restart and cleanup slot/origin
> or if in the meantime we have dropped the subscription and stopped
> apply worker then probably the slot and origin will be dangling on the
> publisher.
>
> I have used exactly the same test procedure as was used to expose the
> problem in the temporary slots with some minor changes as mentioned
> below:
> Subscriber-side
> ================
> - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> worker stops.
> - Have a while(1) loop in wait_for_relation_state_change so that we
> can control apply worker via debugger at the right time.
>
> Subscriber-side
> ================
> - Have a breakpoint in apply_dispatch
> - continue in debugger;
> - After we replay first commit somehow error out. I have forced the
> debugger to set to the last line in apply_dispatch where the error is
> raised.
> - Now, the table sync worker won't restart because the apply worker is
> looping in wait_for_relation_state_change.
> - Execute DropSubscription;
> - We can allow apply worker to continue by skipping the while(1) and
> it will exit because DropSubscription would have sent a terminate
> signal.
>
> After the above steps, check the publisher (select * from
> pg_replication_slots) and you will find the dangling tablesync slot.
>
> I think to solve the above problem we should drop tablesync
> slot/origin at the Drop/Alter Subscription time and additionally we
> need to ensure that apply worker doesn't let tablesync workers restart
> (or it must not do any work to access the slot because the slots are
> dropped) once we stopped them. To ensure that, I think we need to make
> the following changes:
>
> 1. Take AccessExclusivelock on subscription_rel during Alter (before
> calling RemoveSubscriptionRel) and don't release it till transaction
> end (do table_close with NoLock) similar to DropSubscription.
> 2. Take share lock (AccessShareLock) in GetSubscriptionRelState (it
> gets called from logicalrepsyncstartworker), we can release this lock
> at the end of that function. This will ensure that even if the
> tablesync worker is restarted, it will be blocked till the transaction
> performing Alter will commit.
> 3. Make Alter command to not run in xact block so that we don't keep
> locks for a longer time and second for the slots related stuff similar
> to dropsubscription.
>

OK. The latest patch [v22] changes the code as suggested above.

> Few comments on v21:
> ===================
> 1.
> DropSubscription()
> {
> ..
> - /* Clean up dependencies */
> + /* Clean up dependencies. */
>   deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
> ..
> }
>
> The above change seems unnecessary w.r.t current patch.
>

OK. Modified in patch [v22].

> 2.
> DropSubscription()
> {
> ..
>   /*
> - * If there is no slot associated with the subscription, we can finish
> - * here.
> + * If there is a slot associated with the subscription, then drop the
> + * replication slot at the publisher node using the replication
> + * connection.
>   */
> - if (!slotname)
> + if (slotname)
>   {
> - table_close(rel, NoLock);
> - return;
> ..
> }
>
> What is the reason for this change? Can't we keep the check in its
> existing form?
>

I think the above comment is no longer applicable in the latest patch
[v22]. An early exit for a null slotname is not desirable anymore; we
still need to process all the tablesync slots/origins regardless.

----
[v22] https://www.postgresql.org/message-id/CAHut%2BPtrAVrtjc8srASTeUhbJtviw0Up-bzFSc14Ss%3DmAMxz9g%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Jan 29, 2021 at 4:07 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
>
> Differences from v21:
> + Patch is rebased to latest OSS HEAD @ 29/Jan.
> + Includes new code as suggested [ak0128] to ensure no dangling slots
> at Drop/AlterSubscription.
> + Removes the slot/origin cleanup down by process interrupt logic
> (cleanup_at_shutdown function).
> + Addresses some minor review comments.
>

I have made the below changes in the patch. Let me know what you think
about these?
1. It was a bit difficult to understand the code in DropSubscription
so I have rearranged the code to match the way we are doing in HEAD
where we drop the slots at the end after finishing all the other
cleanup.
2. In AlterSubscription_refresh(), we can't allow workers to be
stopped at commit time, because by then we would have already dropped
the slots and a worker could access a dropped slot. We need to stop
the workers before dropping slots. This makes all the code related to
logicalrep_worker_stop_at_commit redundant.
3. In AlterSubscription_refresh(), we need to acquire the lock on
pg_subscription_rel only when we try to remove any subscription rel.
4. Added/Changed quite a few comments.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have made the below changes in the patch. Let me know what you think
> about these?
> 1. It was a bit difficult to understand the code in DropSubscription
> so I have rearranged the code to match the way we are doing in HEAD
> where we drop the slots at the end after finishing all the other
> cleanup.

There was a reason why the v22 logic was different from HEAD.

The broken connection leaves dangling slots, which is unavoidable.
But, whereas the user knows the name of the subscription slot (they
named it), there is no easy way for them to know the names of the
remaining tablesync slots unless we log them.

That is why the v22 code was written to process the tablesync slots
even for wrconn == NULL, so their names could be logged:
elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
syncslotname);

The v23 patch removed this dangling slot name information, so it makes
it difficult for the user to know which tablesync slots to clean up.

> 2. In AlterSubscription_refresh(), we can't allow workers to be
> stopped at commit time as we have already dropped the slots because
> the worker can access the dropped slot. We need to stop the workers
> before dropping slots. This makes all the code related to
> logicalrep_worker_stop_at_commit redundant.

OK.

> 3. In AlterSubscription_refresh(), we need to acquire the lock on
> pg_subscription_rel only when we try to remove any subscription rel.

+ if (!sub_rel_locked)
+ {
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+ sub_rel_locked = true;
+ }

OK. But the sub_rel_locked bool is not really necessary. Why not just
check for rel == NULL? e.g.
if (!rel)
    rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

> 4. Added/Changed quite a few comments.
>

@@ -1042,6 +1115,31 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
  }
  list_free(subworkers);

+ /*
+ * Tablesync resource cleanup (slots and origins).

The comment is misleading; this code is only dropping origins.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

> 2. In AlterSubscription_refresh(), we can't allow workers to be
> stopped at commit time as we have already dropped the slots because
> the worker can access the dropped slot. We need to stop the workers
> before dropping slots. This makes all the code related to
> logicalrep_worker_stop_at_commit redundant.

@@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
  Oid relid;
 } LogicalRepWorkerId;

-typedef struct StopWorkersData
-{
- int nestDepth; /* Sub-transaction nest level */
- List    *workers; /* List of LogicalRepWorkerId */
- struct StopWorkersData *parent; /* This need not be an immediate
- * subtransaction parent */
-} StopWorkersData;

Since v23 removes that typedef from the code, don't you also have to
remove it from src/tools/pgindent/typedefs.list?

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have made the below changes in the patch. Let me know what you think
> > about these?
> > 1. It was a bit difficult to understand the code in DropSubscription
> > so I have rearranged the code to match the way we are doing in HEAD
> > where we drop the slots at the end after finishing all the other
> > cleanup.
>
> There was a reason why the v22 logic was different from HEAD.
>
> The broken connection leaves dangling slots which is unavoidable.
>

I think this is true only when the user has specifically requested it
by the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
Otherwise, we give an error on a broken connection. Also, if that is
true, then is there a reason to pass missing_ok as true while dropping
tablesync slots?
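
(For reference, the escape hatch being referred to is the documented
sequence, shown here with an illustrative subscription name:

ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (slot_name = NONE);
DROP SUBSCRIPTION mysub;

after which any remaining slots on the publisher must be dropped
manually.)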


> But,
> whereas the user knows the name of the Subscription slot (they named
> it), there is no easy way for them to know the names of the remaining
> tablesync slots unless we log them.
>
> That is why the v22 code was written to process the tablesync slots
> even for wrconn == NULL so their name could be logged:
> elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> syncslotname);
>
> The v23 patch removed this dangling slot name information, so it makes
> it difficult for the user to know what tablesync slots to cleanup.
>

Okay, then can we think of combining it with the existing replication
slot error? I think that might produce a very long message, so another
idea could be to LOG a separate WARNING for each such slot just before
giving the error.

> > 2. In AlterSubscription_refresh(), we can't allow workers to be
> > stopped at commit time as we have already dropped the slots because
> > the worker can access the dropped slot. We need to stop the workers
> > before dropping slots. This makes all the code related to
> > logicalrep_worker_stop_at_commit redundant.
>
> OK.
>
> > 3. In AlterSubscription_refresh(), we need to acquire the lock on
> > pg_subscription_rel only when we try to remove any subscription rel.
>
> + if (!sub_rel_locked)
> + {
> + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
> + sub_rel_locked = true;
> + }
>
> OK. But the sub_rel_locked bool is not really necessary. Why not just
> check for rel == NULL? e.g.
> if (!rel)
>     rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
>

Okay, that seems to be better, will change accordingly.

> > 4. Added/Changed quite a few comments.
> >
>
> @@ -1042,6 +1115,31 @@ DropSubscription(DropSubscriptionStmt *stmt,
> bool isTopLevel)
>   }
>   list_free(subworkers);
>
> + /*
> + * Tablesync resource cleanup (slots and origins).
>
> The comment is misleading; this code is only dropping origins.
>

Okay, will change to something like: "Cleanup of tablesync replication origins."

> @@ -73,20 +73,6 @@ typedef struct LogicalRepWorkerId
>   Oid relid;
>  } LogicalRepWorkerId;
>
> -typedef struct StopWorkersData
> -{
> - int nestDepth; /* Sub-transaction nest level */
> - List    *workers; /* List of LogicalRepWorkerId */
> - struct StopWorkersData *parent; /* This need not be an immediate
> - * subtransaction parent */
> -} StopWorkersData;
>
> Since v23 removes that typedef from the code, don't you also have to
> remove it from src/tools/pgindent/typedefs.list?
>

Yeah.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I have made the below changes in the patch. Let me know what you think
> > > about these?
> > > 1. It was a bit difficult to understand the code in DropSubscription
> > > so I have rearranged the code to match the way we are doing in HEAD
> > > where we drop the slots at the end after finishing all the other
> > > cleanup.
> >
> > There was a reason why the v22 logic was different from HEAD.
> >
> > The broken connection leaves dangling slots which is unavoidable.
> >
>
> I think this is true only when the user specifically requested it by
> the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> Otherwise, we give an error on a broken connection. Also, if that is
> true then is there a reason to pass missing_ok as true while dropping
> tablesync slots?
>

AFAIK there is always a potential race with DropSubscription dropping
slots. DropSubscription might be running at exactly the same time the
tablesync worker has just dropped the very same tablesync slot. By
saying missing_ok = true, DropSubscription would not give an ERROR in
such a case, so at least the DROP SUBSCRIPTION would not fail with an
unexpected error.

>
> > But,
> > whereas the user knows the name of the Subscription slot (they named
> > it), there is no easy way for them to know the names of the remaining
> > tablesync slots unless we log them.
> >
> > That is why the v22 code was written to process the tablesync slots
> > even for wrconn == NULL so their name could be logged:
> > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > syncslotname);
> >
> > The v23 patch removed this dangling slot name information, so it makes
> > it difficult for the user to know what tablesync slots to cleanup.
> >
>
> Okay, then can we think of combining with the existing error of the
> replication slot? I think that might produce a very long message, so
> another idea could be to LOG a separate WARNING for each such slot
> just before giving the error.

There may be many subscribed tables, so I agree that combining it into
one message might be too long. Yes, we can add another loop to output
the necessary information. But isn't logging a WARNING for each
tablesync slot before the subscription slot ERROR exactly the
behaviour which already existed in v22? IIUC the DropSubscription
refactoring in v23 was not done for any functional change, but was
intended only to make the code simpler; but how is that goal achieved
if v23 ends up needing 3 loops where v22 only needed 1 loop to do the
same thing?

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > I have made the below changes in the patch. Let me know what you think
> > > > about these?
> > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > where we drop the slots at the end after finishing all the other
> > > > cleanup.
> > >
> > > There was a reason why the v22 logic was different from HEAD.
> > >
> > > The broken connection leaves dangling slots which is unavoidable.
> > >
> >
> > I think this is true only when the user specifically requested it by
> > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > Otherwise, we give an error on a broken connection. Also, if that is
> > true then is there a reason to pass missing_ok as true while dropping
> > tablesync slots?
> >
>
> AFAIK there is always a potential race with DropSubscription dropping
> slots. The DropSubscription might be running at exactly the same time
> the apply worker has just dropped the very same tablesync slot.
>

We stopped the workers before getting the list of NotReady relations
and then we try to drop the corresponding slots. So, how can such a
race condition happen? Note that, because we have a lock on
pg_subscription, there is no chance that the workers can restart till
the transaction ends.

> By
> saying missing_ok = true it means DropSubscription would not give
> ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> with an unexpected error.
>
> >
> > > But,
> > > whereas the user knows the name of the Subscription slot (they named
> > > it), there is no easy way for them to know the names of the remaining
> > > tablesync slots unless we log them.
> > >
> > > That is why the v22 code was written to process the tablesync slots
> > > even for wrconn == NULL so their name could be logged:
> > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > syncslotname);
> > >
> > > The v23 patch removed this dangling slot name information, so it makes
> > > it difficult for the user to know what tablesync slots to cleanup.
> > >
> >
> > Okay, then can we think of combining with the existing error of the
> > replication slot? I think that might produce a very long message, so
> > another idea could be to LOG a separate WARNING for each such slot
> > just before giving the error.
>
> There may be many subscribed tables so I agree combining to one
> message might be too long. Yes, we can add another loop to output the
> necessary information. But, isn’t logging each tablesync slot WARNING
> before the subscription slot ERROR exactly the behaviour which already
> existed in v22. IIUC the DropSubscription refactoring in V23 was not
> done for any functional change, but was intended only to make the code
> simpler, but how is that goal achieved if v23 ends up needing 3 loops
> where v22 only needed 1 loop to do the same thing?
>

No, there is a functionality change as well. The way the code is in
v22 can easily lead to a problem where we have dropped the slots but
then get an error while removing origins or an entry from
pg_subscription_rel. In such cases, we won't be able to roll back the
drop of the slots, but the other database operations will be rolled
back. This is the reason we have to drop the slots at the end. We need
to ensure the same thing for AlterSubscription_refresh. Does this make
sense now?

--
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > I have made the below changes in the patch. Let me know what you think
> > > > > about these?
> > > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > > where we drop the slots at the end after finishing all the other
> > > > > cleanup.
> > > >
> > > > There was a reason why the v22 logic was different from HEAD.
> > > >
> > > > The broken connection leaves dangling slots which is unavoidable.
> > > >
> > >
> > > I think this is true only when the user specifically requested it by
> > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > Otherwise, we give an error on a broken connection. Also, if that is
> > > true then is there a reason to pass missing_ok as true while dropping
> > > tablesync slots?
> > >
> >
> > AFAIK there is always a potential race with DropSubscription dropping
> > slots. The DropSubscription might be running at exactly the same time
> > the apply worker has just dropped the very same tablesync slot.
> >
>
> We stopped the workers before getting a list of NotReady relations and
> then we try to drop the corresponding slots. So, how such a race
> condition can happen?
>

I think it is possible that the state is still not SYNCDONE but the
slot is already dropped, so here we should be prepared for a missing
slot.


-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Feb 1, 2021 at 6:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > On Sun, Jan 31, 2021 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > I have made the below changes in the patch. Let me know what you think
> > > > > about these?
> > > > > 1. It was a bit difficult to understand the code in DropSubscription
> > > > > so I have rearranged the code to match the way we are doing in HEAD
> > > > > where we drop the slots at the end after finishing all the other
> > > > > cleanup.
> > > >
> > > > There was a reason why the v22 logic was different from HEAD.
> > > >
> > > > The broken connection leaves dangling slots which is unavoidable.
> > > >
> > >
> > > I think this is true only when the user specifically requested it by
> > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > Otherwise, we give an error on a broken connection. Also, if that is
> > > true then is there a reason to pass missing_ok as true while dropping
> > > tablesync slots?
> > >
> >
> > AFAIK there is always a potential race with DropSubscription dropping
> > slots. The DropSubscription might be running at exactly the same time
> > the apply worker has just dropped the very same tablesync slot.
> >
>
> We stopped the workers before getting a list of NotReady relations and
> then we try to drop the corresponding slots. So, how such a race
> condition can happen? Note, because we have a lock on pg_subscrition,
> there is no chance that the workers can restart till the transaction
> end.

OK. I think I was forgetting that logicalrep_worker_stop would also go
into a loop waiting for the worker process to die. So even if the
tablesync worker does simultaneously drop its own slot, I think it
will certainly at least be in SYNCDONE state before DropSubscription
does anything else with that worker.

>
> > By
> > saying missing_ok = true it means DropSubscription would not give
> > ERROR in such a case, so at least the DROP SUBSCRIPTION would not fail
> > with an unexpected error.
> >
> > >
> > > > But,
> > > > whereas the user knows the name of the Subscription slot (they named
> > > > it), there is no easy way for them to know the names of the remaining
> > > > tablesync slots unless we log them.
> > > >
> > > > That is why the v22 code was written to process the tablesync slots
> > > > even for wrconn == NULL so their name could be logged:
> > > > elog(WARNING, "no connection; cannot drop tablesync slot \"%s\".",
> > > > syncslotname);
> > > >
> > > > The v23 patch removed this dangling slot name information, so it makes
> > > > it difficult for the user to know what tablesync slots to cleanup.
> > > >
> > >
> > > Okay, then can we think of combining with the existing error of the
> > > replication slot? I think that might produce a very long message, so
> > > another idea could be to LOG a separate WARNING for each such slot
> > > just before giving the error.
> >
> > There may be many subscribed tables so I agree combining to one
> > message might be too long. Yes, we can add another loop to output the
> > necessary information. But, isn’t logging each tablesync slot WARNING
> > before the subscription slot ERROR exactly the behaviour which already
> > existed in v22. IIUC the DropSubscription refactoring in V23 was not
> > done for any functional change, but was intended only to make the code
> > simpler, but how is that goal achieved if v23 ends up needing 3 loops
> > where v22 only needed 1 loop to do the same thing?
> >
>
> No, there is a functionality change as well. The way we have code in
> v22 can easily lead to a problem when we have dropped the slots but
> get an error while removing origins or an entry from subscription rel.
> In such cases, we won't be able to rollback the drop of slots but the
> other database operations will be rolled back. This is the reason we
> have to drop the slots at the end. We need to ensure the same thing
> for AlterSubscription_refresh. Does this make sense now?
>

OK.

----
Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 11:23 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > > I think this is true only when the user specifically requested it by
> > > > the use of "ALTER SUBSCRIPTION ... SET (slot_name = NONE)", right?
> > > > Otherwise, we give an error on a broken connection. Also, if that is
> > > > true then is there a reason to pass missing_ok as true while dropping
> > > > tablesync slots?
> > > >
> > >
> > > AFAIK there is always a potential race with DropSubscription dropping
> > > slots. The DropSubscription might be running at exactly the same time
> > > the apply worker has just dropped the very same tablesync slot.
> > >
> >
> > We stopped the workers before getting a list of NotReady relations and
> > then we try to drop the corresponding slots. So, how such a race
> > condition can happen? Note, because we have a lock on pg_subscrition,
> > there is no chance that the workers can restart till the transaction
> > end.
>
> OK. I think I was forgetting the logicalrep_worker_stop would also go
> into a loop waiting for the worker process to die. So even if the
> tablesync worker does simultaneously drop it's own slot, I think it
> will certainly at least be in SYNCDONE state before DropSubscription
> does anything else with that worker.
>

How is that ensured? We don't have anything like HOLD_INTERRUPTS
between the time we drop the slot and the time we update the rel state
to SYNCDONE. So, isn't it possible that after we have dropped the slot
and before we update the state, a SIGTERM signal arrives and leads to
worker exit?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > slots. The DropSubscription might be running at exactly the same time
> > > > the apply worker has just dropped the very same tablesync slot.
> > > >
> > >
> > > We stopped the workers before getting a list of NotReady relations and
> > > then we try to drop the corresponding slots. So, how such a race
> > > condition can happen? Note, because we have a lock on pg_subscrition,
> > > there is no chance that the workers can restart till the transaction
> > > end.
> >
> > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > into a loop waiting for the worker process to die. So even if the
> > tablesync worker does simultaneously drop it's own slot, I think it
> > will certainly at least be in SYNCDONE state before DropSubscription
> > does anything else with that worker.
> >
>
> How is that ensured? We don't have anything like HOLD_INTERRUPTS
> between the time dropped the slot and updated rel state as SYNCDONE.
> So, isn't it possible that after we dropped the slot and before we
> update the state, the SIGTERM signal arrives and led to worker exit?
>

The worker has the SIGTERM handler of "die". IIUC the "die" function
doesn't normally do anything except set some flags to say please die
at the next convenient opportunity. My understanding is that the
worker process will not actually exit until it next executes
CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
and *really* die. So even if the SIGTERM signal arrives immediately
after the slot is dropped, the tablesync will still become SYNCDONE.
Is this understanding wrong?

But your scenario could still be possible if "die" exited immediately
(e.g. only in single user mode?).

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 1:08 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > > slots. The DropSubscription might be running at exactly the same time
> > > > > the apply worker has just dropped the very same tablesync slot.
> > > > >
> > > >
> > > > We stopped the workers before getting a list of NotReady relations and
> > > > then we try to drop the corresponding slots. So, how such a race
> > > > condition can happen? Note, because we have a lock on pg_subscrition,
> > > > there is no chance that the workers can restart till the transaction
> > > > end.
> > >
> > > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > > into a loop waiting for the worker process to die. So even if the
> > > tablesync worker does simultaneously drop it's own slot, I think it
> > > will certainly at least be in SYNCDONE state before DropSubscription
> > > does anything else with that worker.
> > >
> >
> > How is that ensured? We don't have anything like HOLD_INTERRUPTS
> > between the time dropped the slot and updated rel state as SYNCDONE.
> > So, isn't it possible that after we dropped the slot and before we
> > update the state, the SIGTERM signal arrives and led to worker exit?
> >
>
> The worker has the SIGTERM handler of "die". IIUC the "die" function
> doesn't normally do anything except set some flags to say please die
> at the next convenient opportunity. My understanding is that the
> worker process will not actually exit until it next executes
> CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
> and *really* die. So even if the SIGTERM signal arrives immediately
> after the slot is dropped, the tablesync will still become SYNCDONE.
> Is this wrong understanding?
>
> But your scenario could still be possible if "die" exited immediately
> (e.g. only in single user mode?).
>

I think it is possible without that as well. There are many calls
in between those two operations which can internally call
CHECK_FOR_INTERRUPTS. One of the flows where such a possibility exists
is
UpdateSubscriptionRelState->SearchSysCacheCopy2->SearchSysCacheCopy->SearchSysCache->SearchCatCache->SearchCatCacheInternal->SearchCatCacheMiss->systable_getnext.
This can internally call heapgetpage, where we have
CHECK_FOR_INTERRUPTS. I think even if today there were no CFI call, we
can't guarantee that for the future, as the calls used are quite
common. So, we probably need the missing_ok flag in DropSubscription.

One more point: in the tablesync code you are calling
ReplicationSlotDropAtPubNode with missing_ok as false. What if we get
an error after that and before we have marked the state as SYNCDONE? I
guess it will then always error out in ReplicationSlotDropAtPubNode,
because we had already dropped the slot.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 1, 2021 at 11:23 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 3:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 9:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > No, there is a functionality change as well. The way we have code in
> > v22 can easily lead to a problem when we have dropped the slots but
> > get an error while removing origins or an entry from subscription rel.
> > In such cases, we won't be able to rollback the drop of slots but the
> > other database operations will be rolled back. This is the reason we
> > have to drop the slots at the end. We need to ensure the same thing
> > for AlterSubscription_refresh. Does this make sense now?
> >
>
> OK.
>

I have updated the patch to display a WARNING for each of the
tablesync slots during DropSubscription. As discussed, I have moved
the slot-drop-related code towards the end in
AlterSubscription_refresh. Apart from this, I have fixed one more
issue in the tablesync code wherein, after catching the exception, we
were not clearing the transaction state on the publisher; see the
changes in LogicalRepSyncTableStart. I have also fixed the other
comments raised by you. Additionally, I have removed the test because
it was creating a slot with the same name as the tablesync worker's,
and the tablesync worker removed it due to the new logic in
LogicalRepSyncTableStart. Earlier, it was not failing because of the
bug in that code, which I have fixed in the attached.

I wonder whether we should restrict creating slots with the prefix pg_
because we are internally creating slots with those names? I think
this was a problem previously also. We already prohibit it for a few
other objects like origins, schemas, etc.; see the usage of
IsReservedName.
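
(Illustrative: assuming wal_level = logical, nothing currently stops a
user session from taking such a name first, e.g.

SELECT pg_create_logical_replication_slot('pg_16394_sync_16385', 'pgoutput');

where the oids here are hypothetical values that could collide with a
future tablesync slot name.)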




--
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I have updated the patch to display WARNING for each of the tablesync
> slots during DropSubscription. As discussed, I have moved the drop
> slot related code towards the end in AlterSubscription_refresh. Apart
> from this, I have fixed one more issue in tablesync code where in
> after catching the exception we were not clearing the transaction
> state on the publisher, see changes in LogicalRepSyncTableStart. I
> have also fixed other comments raised by you.

Here are some additional feedback comments about the v24 patch:

~~

ReportSlotConnectionError:

1,2,3,4.
+ foreach(lc, rstates)
+ {
+ SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
+ Oid relid = rstate->relid;
+
+ /* Only cleanup resources of tablesync workers */
+ if (!OidIsValid(relid))
+ continue;
+
+ /*
+ * Caller needs to ensure that we have appropriate locks so that
+ * relstate doesn't change underneath us.
+ */
+ if (rstate->state != SUBREL_STATE_SYNCDONE)
+ {
+ char syncslotname[NAMEDATALEN] = { 0 };
+
+ ReplicationSlotNameForTablesync(subid, relid, syncslotname);
+ elog(WARNING, "could not drop tablesync replication slot \"%s\"",
+ syncslotname);
+
+ }
+ }

1. I wonder if "rstates" would be better named something like
"not_ready_rstates"; otherwise it is not apparent what states are in
this list.

2. The comment "/* Only cleanup resources of tablesync workers */" is
not quite correct because there is no cleanup happening here. Maybe
change to:
if (!OidIsValid(relid))
continue; /* not a tablesync worker */

3. Maybe the "appropriate locks" comment can say what locks are the
"appropriate" ones?

4. Spurious blank line after the elog?

~~

AlterSubscription_refresh:

5.
+ /*
+ * Drop the tablesync slot. This has to be at the end because
otherwise if there
+ * is an error while doing the database operations we won't be able to rollback
+ * dropped slot.
+ */

Maybe "Drop the tablesync slot." should say "Drop the tablesync slots
associated with removed tables."

~~

DropSubscription:

6.
+ /*
+ * Cleanup of tablesync replication origins.
+ *
+ * Any READY-state relations would already have dealt with clean-ups.
+ *
+ * Note that the state can't change because we have already stopped both
+ * the apply and tablesync workers and they can't restart because of
+ * exclusive lock on the subscription.
+ */
+ rstates = GetSubscriptionNotReadyRelations(subid);
+ foreach(lc, rstates)

I wonder if "rstates" would be better named as "not_ready_rstates",
because it is used in several places where not READY is assumed.

7.
+ {
+ if (!slotname)
+ {
+ /* be tidy */
+ list_free(rstates);
+ return;
+ }
+ else
+ {
+ ReportSlotConnectionError(rstates, subid, slotname, err);
+ }
+
+ }

Spurious blank line above?

8.
The new logic of calling ReportSlotConnectionError seems to expect
that the user has encountered some connection error, and *after* that
they have assigned slot_name = NONE as a workaround. In this scenario
the code looks OK, since the names of any dangling tablesync slots
were logged at the time of the error.

But I am wondering about the case where the user might have set
slot_name = NONE *before* the connection is broken. In this scenario,
there is no ERROR, so if there are still (is it possible?) dangling
tablesync slots, their names never get logged at all. So how can the
user know what to delete?

~~

> Additionally, I have
> removed the test because it was creating the same name slot as the
> tablesync worker and tablesync worker removed the same due to new
> logic in LogicalRepSyncStart. Earlier, it was not failing because of
> the bug in that code which I have fixed in the attached.

Wasn't causing a tablesync slot clash and seeing if it could recover
the whole point of that test? Why not just keep it, and modify the
test to make it work again? Isn't it still valuable, because at least
it would exercise the code through the PG_CATCH which otherwise may
not get executed by any other test?

>
> I wonder whether we should restrict creating slots with prefix pg_
> because we are internally creating slots with those names? I think
> this was a problem previously also. We already prohibit it for few
> other objects like origins, schema, etc., see the usage of
> IsReservedName.
>

Yes, we could restrict the create slot API if you really wanted to.
But IMO it is implausible that a user could "accidentally" clash with
an internal tablesync slot name, so in practice this change might not
help much, but it might make it more difficult to test some scenarios.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 2, 2021 at 8:29 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I have updated the patch to display WARNING for each of the tablesync
> > slots during DropSubscription. As discussed, I have moved the drop
> > slot related code towards the end in AlterSubscription_refresh. Apart
> > from this, I have fixed one more issue in tablesync code where in
> > after catching the exception we were not clearing the transaction
> > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > have also fixed other comments raised by you.
>
> Here are some additional feedback comments about the v24 patch:
>
> ~~
>
> ReportSlotConnectionError:
>
> 1,2,3,4.
> + foreach(lc, rstates)
> + {
> + SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
> + Oid relid = rstate->relid;
> +
> + /* Only cleanup resources of tablesync workers */
> + if (!OidIsValid(relid))
> + continue;
> +
> + /*
> + * Caller needs to ensure that we have appropriate locks so that
> + * relstate doesn't change underneath us.
> + */
> + if (rstate->state != SUBREL_STATE_SYNCDONE)
> + {
> + char syncslotname[NAMEDATALEN] = { 0 };
> +
> + ReplicationSlotNameForTablesync(subid, relid, syncslotname);
> + elog(WARNING, "could not drop tablesync replication slot \"%s\"",
> + syncslotname);
> +
> + }
> + }
>
> 1. I wonder if "rstates" would be better named something like
> "not_ready_rstates", otherwise it is not apparent what states are in
> this list
>

I don't know if that would be better; it is used in the same way in
the existing code. I find the current naming succinct.

> 2. The comment "/* Only cleanup resources of tablesync workers */" is
> not quite correct because there is no cleanup happening here. Maybe
> change to:
> if (!OidIsValid(relid))
> continue; /* not a tablesync worker */
>

Aren't we trying to clean up the tablesync slots here? So, I don't see
the comment as irrelevant.

> 3. Maybe the "appropriate locks" comment can say what locks are the
> "appropriate" ones?
>
> 4. Spurious blank line after the elog?
>

Will fix both the above.

> ~~
>
> AlterSubscription_refresh:
>
> 5.
> + /*
> + * Drop the tablesync slot. This has to be at the end because
> otherwise if there
> + * is an error while doing the database operations we won't be able to rollback
> + * dropped slot.
> + */
>
> Maybe "Drop the tablesync slot." should say "Drop the tablesync slots
> associated with removed tables."
>

makes sense, will fix.

> ~~
>
> DropSubscription:
>
> 6.
> + /*
> + * Cleanup of tablesync replication origins.
> + *
> + * Any READY-state relations would already have dealt with clean-ups.
> + *
> + * Note that the state can't change because we have already stopped both
> + * the apply and tablesync workers and they can't restart because of
> + * exclusive lock on the subscription.
> + */
> + rstates = GetSubscriptionNotReadyRelations(subid);
> + foreach(lc, rstates)
>
> I wonder if "rstates" would be better named as "not_ready_rstates",
> because it is used in several places where not READY is assumed.
>

Same response as above for similar comment.

> 7.
> + {
> + if (!slotname)
> + {
> + /* be tidy */
> + list_free(rstates);
> + return;
> + }
> + else
> + {
> + ReportSlotConnectionError(rstates, subid, slotname, err);
> + }
> +
> + }
>
> Spurious blank line above?
>

Will fix.

> 8.
> The new logic of calling the ReportSlotConnectionError seems to be
> expecting that the user has encountered some connection error, and
> *after* that they have assigned slot_name = NONE as a workaround. In
> this scenario the code looks ok since names of any dangling tablesync
> slots were being logged at the time of the error.
>
> But I am wondering what about where the user might have set slot_name
> = NONE *before* the connection is broken. In this scenario, there is
> no ERROR, so if there are still (is it possible?) dangling tablesync
> slots, their names are never getting logged at all. So how can the
> user know what to delete?
>

It has been mentioned in the docs that the user is responsible for
cleaning those up manually in such a case. The patch also describes
how the names are generated, which can help the user remove them:
+    These table synchronization slots have generated names:
+    <quote><literal>pg_%u_sync_%u</literal></quote> (parameters: Subscription
+    <parameter>oid</parameter>, Table <parameter>relid</parameter>)
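
A sketch of what that manual cleanup could look like (run before the
subscription is dropped; the state letters follow
pg_subscription_rel.srsubstate, and the oids are whatever your
catalogs contain):

-- Subscriber-side: compute the generated tablesync slot names
SELECT 'pg_' || srsubid || '_sync_' || srrelid AS syncslotname
FROM pg_subscription_rel
WHERE srsubstate NOT IN ('s', 'r');  -- not yet SYNCDONE/READY

-- Publisher-side: drop each returned slot, e.g.
SELECT pg_drop_replication_slot('pg_16394_sync_16385');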

I think if the user changes the slot_name associated with the
subscription, it would be their responsibility to clean up the
previously associated slot. This is currently the case with the main
subscription slot as well. I think it won't be advisable for the user
to change slot_name except in some rare cases where the system might
be stuck, like the one for which we are giving a WARNING and
providing a hint to set the slot_name to NONE.


> ~~
>
> > Additionally, I have
> > removed the test because it was creating the same name slot as the
> > tablesync worker and tablesync worker removed the same due to new
> > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > the bug in that code which I have fixed in the attached.
>
> Wasn't causing a tablesync slot clash and seeing if it could recover
> the point of that test? Why not just keep, and modify the test to make
> it work again?
>

We can do that, but my other worry was that we might want to reserve
slot names that start with pg_.

> Isn't it still valuable because at least it would
> execute the code through the PG_CATCH which otherwise may not get
> executed by any other test?
>

It is valuable, but IIRC there was a test (in subscription/004_sync.pl)
where a PK violation happens during the copy, which will lead to
coverage of the code in the CATCH block.

> >
> > I wonder whether we should restrict creating slots with prefix pg_
> > because we are internally creating slots with those names? I think
> > this was a problem previously also. We already prohibit it for few
> > other objects like origins, schema, etc., see the usage of
> > IsReservedName.
> >
>
> Yes, we could restrict the create slot API if you really wanted to.
> But, IMO it is implausible that a user could "accidentally" clash with
> the internal tablesync slot name, so in practice maybe this change
> would not help much but it might make it more difficult to test some
> scenarios.
>

Isn't the same true for origins?
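
If we did decide to restrict it, the check might look something like
this (just a sketch; the placement inside ReplicationSlotValidateName
and the exact error wording are my assumptions):

/*
 * Sketch only: reject user-created slot names with the reserved "pg_"
 * prefix, mirroring what IsReservedName() already enforces for origins,
 * schemas, etc.
 */
if (IsReservedName(name))
{
    ereport(elevel,
            (errcode(ERRCODE_RESERVED_NAME),
             errmsg("replication slot name \"%s\" is reserved", name),
             errdetail("Slot names beginning with \"pg_\" are reserved for internal use.")));
    return false;
}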

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have updated the patch to display WARNING for each of the tablesync
> slots during DropSubscription. As discussed, I have moved the drop
> slot related code towards the end in AlterSubscription_refresh. Apart
> from this, I have fixed one more issue in tablesync code where in
> after catching the exception we were not clearing the transaction
> state on the publisher, see changes in LogicalRepSyncTableStart. I
> have also fixed other comments raised by you. Additionally, I have
> removed the test because it was creating the same name slot as the
> tablesync worker and tablesync worker removed the same due to new
> logic in LogicalRepSyncStart. Earlier, it was not failing because of
> the bug in that code which I have fixed in the attached.
>

I was testing this patch. I had a table on the subscriber which had a
row that would cause a PK constraint violation during the table copy.
This results in the subscriber trying to roll back the table copy and
failing.

2021-02-01 23:28:16.041 EST [23738] LOG:  logical replication apply
worker for subscription "tap_sub" has started
2021-02-01 23:28:16.051 EST [23740] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-01 23:28:21.118 EST [23740] ERROR:  table copy could not
rollback transaction on publisher
2021-02-01 23:28:21.118 EST [23740] DETAIL:  The error was: another
command is already in progress
2021-02-01 23:28:21.122 EST [8028] LOG:  background worker "logical
replication worker" (PID 23740) exited with exit code 1
2021-02-01 23:28:21.125 EST [23908] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-01 23:28:21.138 EST [23908] ERROR:  could not create
replication slot "pg_16398_sync_16384": ERROR:  replication slot
"pg_16398_sync_16384" already exists
2021-02-01 23:28:21.139 EST [8028] LOG:  background worker "logical
replication worker" (PID 23908) exited with exit code 1
2021-02-01 23:28:26.168 EST [24048] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-01 23:28:34.244 EST [24048] ERROR:  table copy could not
rollback transaction on publisher
2021-02-01 23:28:34.244 EST [24048] DETAIL:  The error was: another
command is already in progress
2021-02-01 23:28:34.251 EST [8028] LOG:  background worker "logical
replication worker" (PID 24048) exited with exit code 1
2021-02-01 23:28:34.254 EST [24337] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-01 23:28:34.263 EST [24337] ERROR:  could not create
replication slot "pg_16398_sync_16384": ERROR:  replication slot
"pg_16398_sync_16384" already exists
2021-02-01 23:28:34.264 EST [8028] LOG:  background worker "logical
replication worker" (PID 24337) exited with exit code 1

And one more thing I see is that now we error out in PG_CATCH() in
LogicalRepSyncTableStart() with the above error and, as a result, the
tablesync slot is not dropped, causing the slot creation to fail on
the next restart.
I think this can be avoided. We could either attempt a rollback only
on specific failures, or drop the slot prior to erroring out.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
Another failure I see in my testing

On publisher create a big enough table:
publisher:
postgres=# CREATE TABLE tab_rep (a int primary key);CREATE TABLE
postgres=# INSERT INTO tab_rep SELECT generate_series(1,1000000);
INSERT 0 1000000
postgres=# CREATE PUBLICATION tap_pub FOR ALL TABLES;
CREATE PUBLICATION

Subscriber:
postgres=# CREATE TABLE tab_rep (a int primary key);
CREATE TABLE
postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);

The subscription was created but not enabled. Now the below two
commands on the subscriber should be issued quickly, with no delay
between them:

postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub disable;
ALTER SUBSCRIPTION

This leaves the below state for the pg_subscription rel:
postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
---------+---------+------------+----------
   16395 |   16384 | f          |
(1 row)

The rel is in the SUBREL_STATE_FINISHEDCOPY state.

Meanwhile on the publisher, looking at the slots created:

postgres=# select * from pg_replication_slots;
      slot_name      |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
---------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
 tap_sub             | pgoutput | logical   |  13859 | postgres | f         | f      |            |      |          517 | 0/9303660   | 0/9303698           | reserved   |
 pg_16395_sync_16384 | pgoutput | logical   |  13859 | postgres | f         | f      |            |      |          517 | 0/9303660   | 0/9303698           | reserved   |
(2 rows)


There are two slots, the main slot as well as the tablesync slot, drop
the table, re-enable the subscription and then drop the subscription

Now on the subscriber:
postgres=# drop table tab_rep;
DROP TABLE
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION
postgres=# drop subscription tap_sub ;
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION

We see the tablesync slot dangling in the publisher:
postgres=# select * from pg_replication_slots;
      slot_name      |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
---------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
 pg_16395_sync_16384 | pgoutput | logical   |  13859 | postgres | f         | f      |            |      |          517 | 0/9303660   | 0/9303698           | reserved   |
(1 row)

Dropping the table meant that, after the tablesync worker was
restarted, it had no idea about the old slot it had created, as the
slot name uses the relid of the dropped table.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have updated the patch to display WARNING for each of the tablesync
> slots during DropSubscription. As discussed, I have moved the drop
> slot related code towards the end in AlterSubscription_refresh. Apart
> from this, I have fixed one more issue in tablesync code where in
> after catching the exception we were not clearing the transaction
> state on the publisher, see changes in LogicalRepSyncTableStart.
...

I know that in another email [ac0202] Ajin has reported some problem
he found related to this new (LogicalRepSyncTableStart PG_CATCH) code
for some different use-case, but for my test scenario of a "broken
connection during a table copy" the code did appear to be working
properly.

PSA detailed logs which show the test steps and output for this
""broken connection during a table copy" scenario.

----
[ac0202] https://www.postgresql.org/message-id/CAFPTHDaZw5o%2BwMbv3aveOzuLyz_rqZebXAj59rDKTJbwXFPYgw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to display WARNING for each of the tablesync
> > slots during DropSubscription. As discussed, I have moved the drop
> > slot related code towards the end in AlterSubscription_refresh. Apart
> > from this, I have fixed one more issue in tablesync code where in
> > after catching the exception we were not clearing the transaction
> > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > have also fixed other comments raised by you. Additionally, I have
> > removed the test because it was creating the same name slot as the
> > tablesync worker and tablesync worker removed the same due to new
> > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > the bug in that code which I have fixed in the attached.
> >
>
> I was testing this patch. I had a table on the subscriber which had a
> row that would cause a PK constraint
> violation during the table copy. This is resulting in the subscriber
> trying to rollback the table copy and failing.
>

I am not getting this error. I have tried the below test:
Publisher
===========
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE PUBLICATION mypublication FOR TABLE mytbl1;

Subscriber
=============
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

BEGIN;
INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
COMMIT;

CREATE SUBSCRIPTION mysub
         CONNECTION 'host=localhost port=5432 dbname=postgres'
        PUBLICATION mypublication;

It generates the PK violation the first time and then I removed the
conflicting rows in the subscriber and it passed. See logs below.

2021-02-02 13:51:34.316 IST [20796] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
started
2021-02-02 13:52:43.625 IST [20796] ERROR:  duplicate key value
violates unique constraint "mytbl1_pkey"
2021-02-02 13:52:43.625 IST [20796] DETAIL:  Key (id)=(1) already exists.
2021-02-02 13:52:43.625 IST [20796] CONTEXT:  COPY mytbl1, line 1
2021-02-02 13:52:43.695 IST [27840] LOG:  background worker "logical
replication worker" (PID 20796) exited with exit code 1
2021-02-02 13:52:43.884 IST [6260] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
started
2021-02-02 13:53:54.680 IST [6260] LOG:  logical replication table
synchronization worker for subscription "mysub", table "mytbl1" has
finished

Also, a similar test exists in 004_sync.pl; is that also failing for
you? Can you please provide detailed steps that led to this failure?

>
> And one more thing I see is that now we error out in PG_CATCH() in
> LogicalRepSyncTableStart() with the above error and, as a result, the
> tablesync slot is not dropped, causing the slot creation to fail on
> the next restart.
> I think this can be avoided. We could either attempt a rollback only
> on specific failures, or drop the slot prior to erroring out.
>

Hmm, we have to first roll back before attempting any other operation
because the transaction on the publisher is in an errored state.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
tried a simple test where I do a DROP TABLE with very bad timing for
the tablesync worker. It seems that doing this can cause the sync
worker's MyLogicalRepWorker->relid to become invalid.

In my test this caused a stack trace within some logging, but I
imagine other bad things can happen if the tablesync worker can be
executed with an invalid relid.

Possibly this is an existing PG bug which has just never been seen
before; the ereport which has failed here is not new code.

PSA the log for the test steps and the stack trace details.

----
[ac0202] https://www.postgresql.org/message-id/CAFPTHDYzjaNfzsFHpER9idAPB8v5j%3DSUbWL0AKj5iVy0BKbTpg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Tue, Feb 2, 2021 at 7:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 10:34 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Mon, Feb 1, 2021 at 11:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I have updated the patch to display WARNING for each of the tablesync
> > > slots during DropSubscription. As discussed, I have moved the drop
> > > slot related code towards the end in AlterSubscription_refresh. Apart
> > > from this, I have fixed one more issue in tablesync code where in
> > > after catching the exception we were not clearing the transaction
> > > state on the publisher, see changes in LogicalRepSyncTableStart. I
> > > have also fixed other comments raised by you. Additionally, I have
> > > removed the test because it was creating the same name slot as the
> > > tablesync worker and tablesync worker removed the same due to new
> > > logic in LogicalRepSyncStart. Earlier, it was not failing because of
> > > the bug in that code which I have fixed in the attached.
> > >
> >
> > I was testing this patch. I had a table on the subscriber which had a
> > row that would cause a PK constraint
> > violation during the table copy. This is resulting in the subscriber
> > trying to rollback the table copy and failing.
> >
>
> I am not getting this error. I have tried by below test:

I am sorry, my above steps were not correct. I think the reason for
the failure I was seeing was some other steps I did prior to this. I
will recreate this and update you with the appropriate steps.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 2, 2021 at 11:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Another failure I see in my testing
>

The problem here is that we are allowing to drop the table when table
synchronization is still in progress and then we don't have any way to
know the corresponding slot or origin. I think we can try to drop the
slot and origin as well but that is not a good idea because slots once
dropped won't be rolled back. So, I have added a fix to disallow the
drop of the table when table synchronization is still in progress.
Apart from that, I have fixed comments raised by Peter as discussed
above and made some additional changes in comments, code (code changes
are cosmetic), and docs.
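
The guard could look roughly like this (a sketch only; the variable
names, error code, and helper calls are assumptions):

/*
 * Sketch of the check in RemoveSubscriptionRel(): refuse to remove the
 * relation mapping while its table synchronization is still in
 * progress, since at that point the tablesync slot and origin would be
 * orphaned.
 */
if (subrel->srsubstate != SUBREL_STATE_READY &&
    subrel->srsubstate != SUBREL_STATE_SYNCDONE)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("could not drop relation mapping for subscription \"%s\"",
                    get_subscription_name(subrel->srsubid, false)),
             errdetail("Table synchronization for relation \"%s\" is in progress and is in state \"%c\".",
                       get_rel_name(subrel->srrelid), subrel->srsubstate),
             errhint("Use ALTER SUBSCRIPTION ... ENABLE to enable subscription if not already enabled or use DROP SUBSCRIPTION ... to drop the subscription.")));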

Let me know if the issue reported is fixed or not?

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> tried a simple test where I do a DROP TABLE with very bad timing for
> the tablesync worker. It seems that doing this can cause the sync
> worker's MyLogicalRepWorker->relid to become invalid.
>

I think this should be fixed by latest patch because I have disallowed
drop of a table when its synchronization is in progress. You can check
once and let me know if the issue still exists?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > tried a simple test where I do a DROP TABLE with very bad timing for
> > the tablesync worker. It seems that doing this can cause the sync
> > worker's MyLogicalRepWorker->relid to become invalid.
> >
>
> I think this should be fixed by latest patch because I have disallowed
> drop of a table when its synchronization is in progress. You can check
> once and let me know if the issue still exists?
>

FYI - I confirmed that the problem scenario that I reported yesterday
is no longer possible because now the V25 patch is disallowing the
DROP TABLE while the tablesync is still running.

PSA my test logs showing it is now working as expected.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Wed, Feb 3, 2021 at 12:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

> The problem here is that we are allowing to drop the table when table
> synchronization is still in progress and then we don't have any way to
> know the corresponding slot or origin. I think we can try to drop the
> slot and origin as well but that is not a good idea because slots once
> dropped won't be rolled back. So, I have added a fix to disallow the
> drop of the table when table synchronization is still in progress.
> Apart from that, I have fixed comments raised by Peter as discussed
> above and made some additional changes in comments, code (code changes
> are cosmetic), and docs.
>
> Let me know if the issue reported is fixed or not?

Yes, the issue is fixed; now the table drop results in an error.

postgres=# drop table tab_rep ;
ERROR:  could not drop relation mapping for subscription "tap_sub"
DETAIL:  Table synchronization for relation "tab_rep" is in progress
and is in state "f".
HINT:  Use ALTER SUBSCRIPTION ... ENABLE to enable subscription if not
already enabled or use DROP SUBSCRIPTION ... to drop the subscription.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Feb 3, 2021 at 6:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > the tablesync worker. It seems that doing this can cause the sync
> > > worker's MyLogicalRepWorker->relid to become invalid.
> > >
> >
> > I think this should be fixed by latest patch because I have disallowed
> > drop of a table when its synchronization is in progress. You can check
> > once and let me know if the issue still exists?
> >
>
> FYI - I confirmed that the problem scenario that I reported yesterday
> is no longer possible because now the V25 patch is disallowing the
> DROP TABLE while the tablesync is still running.
>

Thanks for the confirmation. BTW, can you please check if we can
reproduce that problem without this patch? If so, we might want to
apply this fix irrespective of this patch. If not, why not?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Feb 3, 2021 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 6:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > > the tablesync worker. It seems that doing this can cause the sync
> > > > worker's MyLogicalRepWorker->relid to become invalid.
> > > >
> > >
> > > I think this should be fixed by latest patch because I have disallowed
> > > drop of a table when its synchronization is in progress. You can check
> > > once and let me know if the issue still exists?
> > >
> >
> > FYI - I confirmed that the problem scenario that I reported yesterday
> > is no longer possible because now the V25 patch is disallowing the
> > DROP TABLE while the tablesync is still running.
> >
>
> Thanks for the confirmation. BTW, can you please check if we can
> reproduce that problem without this patch? If so, we might want to
> apply this fix irrespective of this patch. If not, why not?
>

Yes, this was an existing postgres bug. It is independent of the patch.

I can reproduce exactly the same stacktrace using the HEAD src pulled @ 3/Feb.

PSA my test logs showing the details.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Tue, Feb 2, 2021 at 9:03 PM Ajin Cherian <itsajin@gmail.com> wrote:

> I am sorry, my above steps were not correct. I think the reason for
> the failure I was seeing was some other steps I did prior to this. I
> will recreate this and update you with the appropriate steps.

The correct steps are as follows:

Publisher:

postgres=# CREATE TABLE tab_rep (a int primary key);
CREATE TABLE
postgres=# INSERT INTO tab_rep SELECT generate_series(1,1000000);
INSERT 0 1000000
postgres=# CREATE PUBLICATION tap_pub FOR ALL TABLES;
CREATE PUBLICATION

Subscriber:
postgres=# CREATE TABLE tab_rep (a int primary key);
CREATE TABLE
postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION

Allow the tablesync to complete and then drop the subscription. The
table remains full, so restarting the subscription should fail with a
constraint violation during tablesync, but it does not.


Subscriber:
postgres=# drop subscription tap_sub ;
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION
postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION

This takes the subscriber into an error loop, with no mention of what
the actual error was:

2021-02-02 05:01:34.698 EST [1549] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-02 05:01:34.739 EST [1549] ERROR:  table copy could not
rollback transaction on publisher
2021-02-02 05:01:34.739 EST [1549] DETAIL:  The error was: another
command is already in progress
2021-02-02 05:01:34.740 EST [8028] LOG:  background worker "logical
replication worker" (PID 1549) exited with exit code 1
2021-02-02 05:01:40.107 EST [1711] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started
2021-02-02 05:01:40.121 EST [1711] ERROR:  could not create
replication slot "pg_16479_sync_16435": ERROR:  replication slot
"pg_16479_sync_16435" already exists
2021-02-02 05:01:40.121 EST [8028] LOG:  background worker "logical
replication worker" (PID 1711) exited with exit code 1
2021-02-02 05:01:45.140 EST [1891] LOG:  logical replication table
synchronization worker for subscription "tap_sub", table "tab_rep" has
started

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Wed, Feb 3, 2021 at 2:51 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 6:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Wed, Feb 3, 2021 at 12:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 2, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > > >
> > > > > After seeing Ajin's test [ac0202] which did a DROP TABLE, I have also
> > > > > tried a simple test where I do a DROP TABLE with very bad timing for
> > > > > the tablesync worker. It seems that doing this can cause the sync
> > > > > worker's MyLogicalRepWorker->relid to become invalid.
> > > > >
> > > >
> > > > I think this should be fixed by latest patch because I have disallowed
> > > > drop of a table when its synchronization is in progress. You can check
> > > > once and let me know if the issue still exists?
> > > >
> > >
> > > FYI - I confirmed that the problem scenario that I reported yesterday
> > > is no longer possible because now the V25 patch is disallowing the
> > > DROP TABLE while the tablesync is still running.
> > >
> >
> > Thanks for the confirmation. BTW, can you please check if we can
> > reproduce that problem without this patch? If so, we might want to
> > apply this fix irrespective of this patch. If not, why not?
> >
>
> Yes, this was an existing postgres bug. It is independent of the patch.
>
> I can reproduce exactly the same stacktrace using the HEAD src pulled @ 3/Feb.
>
> PSA my test logs showing the details.
>

Since this is an existing PG bug independent of this patch, I spawned
a new thread [ps0202] to deal with this problem.

----
[ps0202]
https://www.postgresql.org/message-id/CAHut%2BPu7Z4a%3Domo%2BTvK5Gub2hxcJ7-3%2BBu1FO_%2B%2BfpFTW0oQfQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Feb 3, 2021 at 1:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 9:03 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> > I am sorry, my above steps were not correct. I think the reason for
> > the failure I was seeing were some other steps I did prior to this. I
> > will recreate this and update you with the appropriate steps.
>
> The correct steps are as follows:
>
> Publisher:
>
> postgres=# CREATE TABLE tab_rep (a int primary key);
> CREATE TABLE
> postgres=# INSERT INTO tab_rep SELECT generate_series(1,1000000);
> INSERT 0 1000000
> postgres=# CREATE PUBLICATION tap_pub FOR ALL TABLES;
> CREATE PUBLICATION
>
> Subscriber:
> postgres=# CREATE TABLE tab_rep (a int primary key);
> CREATE TABLE
> postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
> dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
> NOTICE:  created replication slot "tap_sub" on publisher
> CREATE SUBSCRIPTION
> postgres=# ALTER SUBSCRIPTION tap_sub enable;
> ALTER SUBSCRIPTION
>
> Allow the tablesync to complete and then drop the subscription. The
> table remains full, so restarting the subscription should fail with a
> constraint violation during tablesync, but it does not.
>
>
> Subscriber:
> postgres=# drop subscription tap_sub ;
> NOTICE:  dropped replication slot "tap_sub" on publisher
> DROP SUBSCRIPTION
> postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
> dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
> NOTICE:  created replication slot "tap_sub" on publisher
> CREATE SUBSCRIPTION
> postgres=# ALTER SUBSCRIPTION tap_sub enable;
> ALTER SUBSCRIPTION
>
> This takes the subscriber into an error loop, with no mention of what
> the actual error was:
>

Thanks for the report. The problem here was that the error occurred
when we were trying to copy the large data. Now, before fetching the
entire data we issued a rollback that led to this problem. I think the
alternative here could be to first fetch the entire data when the
error occurred and then issue the following commands. Instead, I have
modified the patch to perform 'drop_replication_slot' in the beginning
if the relstate is datasync.  Do let me know if you can think of a
better way to fix this?
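
Roughly, the change amounts to something like this near the start of
LogicalRepSyncTableStart (a sketch, not the exact patch hunk):

/*
 * Sketch: a previous attempt failed after creating the slot (the
 * relation is still in DATASYNC state), so drop any slot left behind
 * before trying to create it again.  This avoids the "replication
 * slot ... already exists" failure loop shown above.
 */
if (MyLogicalRepWorker->relstate == SUBREL_STATE_DATASYNC)
{
    char        syncslotname[NAMEDATALEN] = {0};

    ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
                                    MyLogicalRepWorker->relid,
                                    syncslotname);
    ReplicationSlotDropAtPubNode(wrconn, syncslotname, true /* missing_ok */);
}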

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Wed, Feb 3, 2021 at 11:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Thanks for the report. The problem here was that the error occurred
> when we were trying to copy the large data. Now, before fetching the
> entire data we issued a rollback that led to this problem. I think the
> alternative here could be to first fetch the entire data when the
> error occurred then issue the following commands. Instead, I have
> modified the patch to perform 'drop_replication_slot' in the beginning
> if the relstate is datasync.  Do let me know if you can think of a
> better way to fix this?

I have verified that the problem is not seen after this patch. I also
agree with the approach taken for the fix.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Feb 4, 2021 at 9:55 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 11:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Thanks for the report. The problem here was that the error occurred
> > when we were trying to copy the large data. Now, before fetching the
> > entire data we issued a rollback that led to this problem. I think the
> > alternative here could be to first fetch the entire data when the
> > error occurred then issue the following commands. Instead, I have
> > modified the patch to perform 'drop_replication_slot' in the beginning
> > if the relstate is datasync.  Do let me know if you can think of a
> > better way to fix this?
>
> I have verified that the problem is not seen after this patch. I also
> agree with the approach taken for the fix.
>

Thanks. I have fixed one of the issues reported by me earlier [1]
wherein the tablesync worker can repeatedly fail if after dropping the
slot there is an error while updating the SYNCDONE state in the
database. I have moved the drop of the slot just before commit of the
transaction where we are marking the state as SYNCDONE. Additionally,
I have removed unnecessary includes in tablesync.c, updated the docs
for Alter Subscription, and updated the comments at various places in
the patch. I have also updated the commit message this time.

I am still not very happy with the way we handle concurrent drop
origins but probably that would be addressed by the other patch Peter
is working on [2].

[1] - https://www.postgresql.org/message-id/CAA4eK1JdWv84nMyCpTboBURjj70y3BfO1xdy8SYPRqNxtH7TEA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAHut%2BPsW6%2B7Ucb1sxjSNBaSYPGAVzQFbAEg4y1KsYQiGCnyGeQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
...

> Thanks. I have fixed one of the issues reported by me earlier [1]
> wherein the tablesync worker can repeatedly fail if after dropping the
> slot there is an error while updating the SYNCDONE state in the
> database. I have moved the drop of the slot just before commit of the
> transaction where we are marking the state as SYNCDONE. Additionally,
> I have removed unnecessary includes in tablesync.c, updated the docs
> for Alter Subscription, and updated the comments at various places in
> the patch. I have also updated the commit message this time.
>

Below are my feedback comments for V27 (nothing functional)

~~

1.
V27 Commit message:
For the initial table data synchronization in logical replication, we use
a single transaction to copy the entire table and then synchronizes the
position in the stream with the main apply worker.

Typo:
"synchronizes" -> "synchronize"

~~

2.
@@ -48,6 +48,23 @@ ALTER SUBSCRIPTION <replaceable
class="parameter">name</replaceable> RENAME TO <
    (Currently, all subscription owners must be superusers, so the owner checks
    will be bypassed in practice.  But this might change in the future.)
   </para>
+
+  <para>
+   When refreshing a publication we remove the relations that are no longer
+   part of the publication and we also remove the tablesync slots if there are
+   any. It is necessary to remove tablesync slots so that the resources
+   allocated for the subscription on the remote host are released. If due to
+   network breakdown or some other error, we are not able to remove the slots,
+   we give WARNING and the user needs to manually remove such slots later as
+   otherwise, they will continue to reserve WAL and might eventually cause
+   the disk to fill up. See also <xref
linkend="logical-replication-subscription-slot"/>.
+  </para>

I think the content is good, but the 1st-person wording seemed strange.
e.g.
"we are not able to remove the slots, we give WARNING and the user needs..."
Maybe it should be like:
"... PostgreSQL is unable to remove the slots, so a WARNING is
reported. The user needs... "

~~

3.
@@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
bool copy_data)
...
+ * XXX If there is a network break down while dropping the

"network break down" -> "network breakdown"

~~

4.
@@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
  (errmsg("could not connect to the publisher: %s", err)));
...
+ * XXX We could also instead try to drop the slot, last time we failed
+ * but for that, we might need to clean up the copy state as it might
+ * be in the middle of fetching the rows. Also, if there is a network
+ * break down then it wouldn't have succeeded so trying it next time
+ * seems like a better bet.

"network break down" -> "network breakdown"

~~

5.
@@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
cacheid, uint32 hashvalue)
...
+
+ /*
+ * Cleanup the tablesync slot.
+ *
+ * This has to be done after updating the state because otherwise if
+ * there is an error while doing the database operations we won't be
+ * able to rollback dropped slot.
+ */
+ ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
+ MyLogicalRepWorker->relid,
+ syncslotname);
+
+ ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
+

Should this comment also describe why the missing_ok is false for this case?

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Feb 5, 2021 at 7:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> ...
>
> > Thanks. I have fixed one of the issues reported by me earlier [1]
> > wherein the tablesync worker can repeatedly fail if after dropping the
> > slot there is an error while updating the SYNCDONE state in the
> > database. I have moved the drop of the slot just before commit of the
> > transaction where we are marking the state as SYNCDONE. Additionally,
> > I have removed unnecessary includes in tablesync.c, updated the docs
> > for Alter Subscription, and updated the comments at various places in
> > the patch. I have also updated the commit message this time.
> >
>
> Below are my feedback comments for V27 (nothing functional)
>
> ~~
>
> 1.
> V27 Commit message:
> For the initial table data synchronization in logical replication, we use
> a single transaction to copy the entire table and then synchronizes the
> position in the stream with the main apply worker.
>
> Typo:
> "synchronizes" -> "synchronize"
>

Fixed, and added a note that the Alter Sub .. Refresh .. command can't
be executed inside a transaction block.

> ~~
>
> 2.
> @@ -48,6 +48,23 @@ ALTER SUBSCRIPTION <replaceable
> class="parameter">name</replaceable> RENAME TO <
>     (Currently, all subscription owners must be superusers, so the owner checks
>     will be bypassed in practice.  But this might change in the future.)
>    </para>
> +
> +  <para>
> +   When refreshing a publication we remove the relations that are no longer
> +   part of the publication and we also remove the tablesync slots if there are
> +   any. It is necessary to remove tablesync slots so that the resources
> +   allocated for the subscription on the remote host are released. If due to
> +   network breakdown or some other error, we are not able to remove the slots,
> +   we give WARNING and the user needs to manually remove such slots later as
> +   otherwise, they will continue to reserve WAL and might eventually cause
> +   the disk to fill up. See also <xref
> linkend="logical-replication-subscription-slot"/>.
> +  </para>
>
> I think the content is good, but the 1st-person wording seemed strange.
> e.g.
> "we are not able to remove the slots, we give WARNING and the user needs..."
> Maybe it should be like:
> "... PostgreSQL is unable to remove the slots, so a WARNING is
> reported. The user needs... "
>

Changed as per suggestion with a minor tweak.

> ~~
>
> 3.
> @@ -566,107 +569,197 @@ AlterSubscription_refresh(Subscription *sub,
> bool copy_data)
> ...
> + * XXX If there is a network break down while dropping the
>
> "network break down" -> "network breakdown"
>
> ~~
>
> 4.
> @@ -872,7 +970,48 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
>   (errmsg("could not connect to the publisher: %s", err)));
> ...
> + * XXX We could also instead try to drop the slot, last time we failed
> + * but for that, we might need to clean up the copy state as it might
> + * be in the middle of fetching the rows. Also, if there is a network
> + * break down then it wouldn't have succeeded so trying it next time
> + * seems like a better bet.
>
> "network break down" -> "network breakdown"
>

Changed as per suggestion.

> ~~
>
> 5.
> @@ -269,26 +313,47 @@ invalidate_syncing_table_states(Datum arg, int
> cacheid, uint32 hashvalue)
> ...
> +
> + /*
> + * Cleanup the tablesync slot.
> + *
> + * This has to be done after updating the state because otherwise if
> + * there is an error while doing the database operations we won't be
> + * able to rollback dropped slot.
> + */
> + ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
> + MyLogicalRepWorker->relid,
> + syncslotname);
> +
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname, false /* missing_ok */);
> +
>
> Should this comment also describe why the missing_ok is false for this case?
>

Yeah that makes sense, so added a comment.

Additionally, I have changed the error code in RemoveSubscriptionRel
and moved the setup of the origin before copy_table in
LogicalRepSyncTableStart to avoid redoing the copy due to an error in
setting up the origin. I have made a few comment changes as well.

-- 
With Regards,
Amit Kapila.

Attachment

RE: Single transaction in the tablesync worker?

From
"osumi.takamichi@fujitsu.com"
Date:
Hello



On Friday, February 5, 2021 2:23 PM Amit Kapila <amit.kapila16@gmail.com>
> On Fri, Feb 5, 2021 at 7:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Feb 4, 2021 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > ...
> >
> > > Thanks. I have fixed one of the issues reported by me earlier [1]
> > > wherein the tablesync worker can repeatedly fail if after dropping
> > > the slot there is an error while updating the SYNCDONE state in the
> > > database. I have moved the drop of the slot just before commit of
> > > the transaction where we are marking the state as SYNCDONE.
> > > Additionally, I have removed unnecessary includes in tablesync.c,
> > > updated the docs for Alter Subscription, and updated the comments at
> > > various places in the patch. I have also updated the commit message this
> time.
> > >
> >
> > Below are my feedback comments for V27 (nothing functional)
> >
> > ~~
> >
> > 1.
> > V27 Commit message:
> > For the initial table data synchronization in logical replication, we
> > use a single transaction to copy the entire table and then
> > synchronizes the position in the stream with the main apply worker.
> >
> > Typo:
> > "synchronizes" -> "synchronize"
> >
> 
> Fixed and added a note about Alter Sub .. Refresh .. command can't be
> executed in the transaction block.
Thank you for the updates.

We need to add some tests to prove that the new checks of AlterSubscription() work.
I chose TAP tests as we need to set connect = true for the subscription.
If it can contribute to the development, please utilize this.
I used v28 to check my patch and it works as we expect.


Best Regards,
    Takamichi Osumi

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Feb 5, 2021 at 12:36 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> We need to add some tests to prove the new checks of AlterSubscription() work.
> I chose TAP tests as we need to set connect = true for the subscription.
> When it can contribute to the development, please utilize this.
> I used v28 to check my patch and works as we expect.
>

Thanks for writing the tests, but I don't understand why you need to
set connect = true for this test. I have tried the below with 'connect
= false' and it seems to be working:
postgres=# CREATE SUBSCRIPTION mysub
postgres-#          CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-#         PUBLICATION mypublication WITH (connect = false);
WARNING:  tables were not subscribed, you will have to run ALTER
SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
CREATE SUBSCRIPTION
postgres=# Begin;
BEGIN
postgres=*# Alter Subscription mysub Refresh Publication;
ERROR:  ALTER SUBSCRIPTION ... REFRESH is not allowed for disabled subscriptions

So, if possible lets write this test in src/test/regress/sql/subscription.sql.

I have another idea for a test case: what if we write a test such that
it fails with a PK violation on copy and then drops the subscription?
Then check that there isn't any dangling slot on the publisher. This is
similar to a test in subscription/t/004_sync.pl; we can use some of
that framework but have a separate test for this.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
I did some basic cross-version testing: publisher on PG13 with
subscriber on PG14, and publisher on PG14 with subscriber on PG13.
I did some basic operations (CREATE, ALTER, and STOP subscriptions)
and it seemed to work fine, with no errors.

regards,
Ajin Cherian
Fujitsu Australia.



Re: Single transaction in the tablesync worker?

From
Petr Jelinek
Date:
Hi,

We had a somewhat high-level discussion about these patches with Amit
off-list, so I decided to also take a look at the actual code.

My main concern originally was the potential for left-over slots on the
publisher, but I think the state now is relatively okay, with a couple
of corner cases that are documented and don't seem much worse than the
main slot.

I wonder if we should mention the max_slot_wal_keep_size GUC in the 
table sync docs though.

Another thing that might need documentation is that the visibility
of changes done by table sync is no longer isolated, in that table
contents will show intermediate progress to other backends, rather than
switching from nothing to a state consistent with the rest of the
replication.


Some minor comments about code:

> +        else if (res->status == WALRCV_ERROR && missing_ok)
> +        {
> +            /* WARNING. Error, but missing_ok = true. */
> +            ereport(WARNING,

I wonder if we need to add an error code to the WalRcvExecResult and
check for the appropriate ones here, because this can for example
return an error because of a timeout, not because the slot is missing.
Not sure if it matters for current callers though (but then maybe don't
call the param missing_ok?).
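
For example, something like this (a sketch of the idea only; the field
name and the specific error code checked are assumptions):

typedef struct WalRcvExecResult
{
    WalRcvExecStatus status;
    int         sqlstate;       /* assumed new field: remote SQLSTATE, 0 if none */
    char       *err;
    TupleDesc   tupledesc;
    Tuplestorestate *tuplestore;
} WalRcvExecResult;

/*
 * A caller could then treat only a genuinely missing slot as benign,
 * instead of also swallowing e.g. a timeout.
 */
if (res->status == WALRCV_ERROR &&
    missing_ok &&
    res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
{
    /* Slot did not exist, but missing_ok = true, so only a WARNING. */
    ereport(WARNING,
            (errmsg("could not drop the replication slot \"%s\" on publisher",
                    slotname),
             errdetail("The error was: %s", res->err)));
}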


> +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
> +{
> +    if (syncslotname)
> +        sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
> +    else
> +        syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> +
> +    return syncslotname;
> +}

Given that we are now explicitly dropping slots, what happens here if we 
have 2 different downstreams that happen to get same suboid and reloid, 
will one of them drop the slot of the other one? Previously with the
cleanup being left to temp slot we'd at maximum got error when creating 
it but with the new logic in LogicalRepSyncTableStart it feels like we 
could get into situation where 2 downstreams are fighting over slot no?


-- 
Petr



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> > +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
> > +{
> > +     if (syncslotname)
> > +             sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
> > +     else
> > +             syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > +
> > +     return syncslotname;
> > +}
>
> Given that we are now explicitly dropping slots, what happens here if we
> have 2 different downstreams that happen to get same suboid and reloid,
> will one of them drop the slot of the other one? Previously with the
> cleanup being left to temp slot we'd at maximum got error when creating
> it but with the new logic in LogicalRepSyncTableStart it feels like we
> could get into situation where 2 downstreams are fighting over slot no?
>

The PG docs [1] say "there is only one copy of pg_subscription per
cluster, not one per database". IIUC that means it is not possible for
2 different subscriptions to have the same suboid. And if the suboid
is globally unique then the syncslotname is also unique. Is that
understanding not correct?

-----
[1] https://www.postgresql.org/docs/devel/catalog-pg-subscription.html

Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Sat, Feb 6, 2021 at 6:22 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
> >
> > > +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
> > > +{
> > > +     if (syncslotname)
> > > +             sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
> > > +     else
> > > +             syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > > +
> > > +     return syncslotname;
> > > +}
> >
> > Given that we are now explicitly dropping slots, what happens here if we
> > have 2 different downstreams that happen to get same suboid and reloid,
> > will one of the drop the slot of the other one? Previously with the
> > cleanup being left to temp slot we'd at maximum got error when creating
> > it but with the new logic in LogicalRepSyncTableStart it feels like we
> > could get into situation where 2 downstreams are fighting over slot no?
> >

I think so. See if the alternative suggested below works, or let me
know if you have any other suggestions for the same.

>
> The PG docs [1] says "there is only one copy of pg_subscription per
> cluster, not one per database". IIUC that means it is not possible for
> 2 different subscriptions to have the same suboid.
>

I think he is talking about two different clusters having separate
subscriptions that point to the same publisher. In different clusters,
we can get the same subid/relid. I think we need a cluster-wide unique
identifier to distinguish among different subscribers. How about using
the system_identifier stored in the control file (we can use
GetSystemIdentifier to retrieve it).  I think one concern could be
that adding that to slot name could exceed the max length of slot
(NAMEDATALEN -1) but I don't think that is the case here
(pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0')). Note last
is system_identifier in this scheme.
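
For concreteness, a sketch of that naming (assuming it simply replaces
the name construction in the ReplicationSlotNameForTablesync code
quoted above):

/*
 * Sketch: include the subscriber cluster's system identifier so that
 * two clusters which happen to allocate the same suboid/relid cannot
 * fight over one slot on the publisher.  GetSystemIdentifier() reads
 * the value from the control file.
 */
snprintf(syncslotname, NAMEDATALEN,
         "pg_%u_sync_%u_" UINT64_FORMAT,
         suboid, relid, GetSystemIdentifier());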

Do you guys think that works, or let me know if you have any other
better idea? Petr, is there a reason why such an identifier was not
considered originally? Is there any risk in it?

-- 
With Regards,
Amit Kapila.



RE: Single transaction in the tablesync worker?

From
"osumi.takamichi@fujitsu.com"
Date:
Hi


On Friday, February 5, 2021 5:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Feb 5, 2021 at 12:36 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > We need to add some tests to prove the new checks of AlterSubscription()
> work.
> > I chose TAP tests as we need to set connect = true for the subscription.
> > When it can contribute to the development, please utilize this.
> > I used v28 to check my patch and works as we expect.
> >
> 
> Thanks for writing the tests but I don't understand why you need to set
> connect = true for this test? I have tried below '... with connect = false' and it
> seems to be working:
> postgres=# CREATE SUBSCRIPTION mysub
> postgres-#          CONNECTION 'host=localhost port=5432
> dbname=postgres'
> postgres-#         PUBLICATION mypublication WITH (connect = false);
> WARNING:  tables were not subscribed, you will have to run ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables CREATE
> SUBSCRIPTION postgres=# Begin; BEGIN postgres=*# Alter Subscription
> mysub Refresh Publication;
> ERROR:  ALTER SUBSCRIPTION ... REFRESH is not allowed for disabled
> subscriptions
> 
> So, if possible lets write this test in src/test/regress/sql/subscription.sql.
OK. I changed the place to write the tests for those.

 
> I have another idea for a test case: What if we write a test such that it fails PK
> violation on copy and then drop the subscription. Then check there shouldn't
> be any dangling slot on the publisher? This is similar to a test in
> subscription/t/004_sync.pl, we can use some of that framework but have a
> separate test for this.
I've added this PK violation test to the attached tests.
The patch works with v28 and caused no failures during the regression tests.

Best Regards,
    Takamichi Osumi


Attachment

Re: Single transaction in the tablesync worker?

From
Petr Jelinek
Date:
On 06/02/2021 06:07, Amit Kapila wrote:
> On Sat, Feb 6, 2021 at 6:22 AM Peter Smith <smithpb2250@gmail.com> wrote:
>> On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
>> <petr.jelinek@enterprisedb.com> wrote:
>>>> +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
>>>> +{
>>>> +     if (syncslotname)
>>>> +             sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
>>>> +     else
>>>> +             syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
>>>> +
>>>> +     return syncslotname;
>>>> +}
>>> Given that we are now explicitly dropping slots, what happens here if we
>>> have 2 different downstreams that happen to get same suboid and reloid,
>>> will one of them drop the slot of the other one? Previously with the
>>> cleanup being left to temp slot we'd at maximum got error when creating
>>> it but with the new logic in LogicalRepSyncTableStart it feels like we
>>> could get into situation where 2 downstreams are fighting over slot no?
>>>
> I think so. See, if the alternative suggested below works or if you
> have any other suggestions for the same?
>
>> The PG docs [1] says "there is only one copy of pg_subscription per
>> cluster, not one per database". IIUC that means it is not possible for
>> 2 different subscriptions to have the same suboid.
>>
> I think he is talking about two different clusters having separate
> subscriptions that point to the same publisher. In different clusters,
> we can get the same subid/relid. I think we need a cluster-wide unique
> identifier to distinguish among different subscribers. How about using
> the system_identifier stored in the control file (we can use
> GetSystemIdentifier to retrieve it).  I think one concern could be
> that adding that to slot name could exceed the max length of slot
> (NAMEDATALEN -1) but I don't think that is the case here
> (pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0')). Note last
> is system_identifier in this scheme.


Yep, that's what I meant, and system_identifier seems like a good choice to me.


> Do you guys think that works, or let me know if you have any other
> better idea? Petr, is there a reason why such an identifier was not
> considered originally? Is there any risk in it?


Originally it was not considered likely because it's all based on
pglogical/BDR work, where ids are hashes of stuff that's unique across a
group of instances, not counter-based like Oids in PostgreSQL, and I
simply didn't realize it could be a problem until reading this patch :)


-- 
Petr Jelinek




Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> Hi,
>
> Some minor comments about code:
>
> > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > +             {
> > +                     /* WARNING. Error, but missing_ok = true. */
> > +                     ereport(WARNING,
>
> I wonder if we need to add error code to the WalRcvExecResult and check
> for the appropriate ones here. Because this can for example return error
> because of timeout, not because slot is missing. Not sure if it matters
> for current callers though (but then maybe don't call the param
> missing_ok?).

You are right. The way we are using this function has evolved beyond
the original intention.
Probably renaming the param to something like "error_ok" would be more
appropriate now.

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sun, Feb 7, 2021 at 2:38 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
> >
> > Hi,
> >
> > Some minor comments about code:
> >
> > > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > > +             {
> > > +                     /* WARNING. Error, but missing_ok = true. */
> > > +                     ereport(WARNING,
> >
> > I wonder if we need to add error code to the WalRcvExecResult and check
> > for the appropriate ones here. Because this can for example return error
> > because of timeout, not because slot is missing. Not sure if it matters
> > for current callers though (but then maybe don't call the param
> > missing_ok?).
>
> You are right. The way we are using this function has evolved beyond
> the original intention.
> Probably renaming the param to something like "error_ok" would be more
> appropriate now.
>

PSA a patch (apply on top of V28) to change the misleading param name.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Sat, Feb 6, 2021 at 6:30 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> Hi
>
>
> On Friday, February 5, 2021 5:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Feb 5, 2021 at 12:36 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > We need to add some tests to prove the new checks of AlterSubscription()
> > work.
> > > I chose TAP tests as we need to set connect = true for the subscription.
> > > When it can contribute to the development, please utilize this.
> > > I used v28 to check my patch and works as we expect.
> > >
> >
> > Thanks for writing the tests, but I don't understand why you need to set
> > connect = true for this test. I have tried the below with '... connect =
> > false' and it seems to be working:
> > postgres=# CREATE SUBSCRIPTION mysub
> > postgres-#          CONNECTION 'host=localhost port=5432 dbname=postgres'
> > postgres-#          PUBLICATION mypublication WITH (connect = false);
> > WARNING:  tables were not subscribed, you will have to run ALTER
> > SUBSCRIPTION ... REFRESH PUBLICATION to subscribe the tables
> > CREATE SUBSCRIPTION
> > postgres=# Begin;
> > BEGIN
> > postgres=*# Alter Subscription mysub Refresh Publication;
> > ERROR:  ALTER SUBSCRIPTION ... REFRESH is not allowed for disabled
> > subscriptions
> >
> > So, if possible, let's write this test in src/test/regress/sql/subscription.sql.
> OK. I moved the tests there accordingly.
>
>
> > I have another idea for a test case: What if we write a test such that it
> > fails with a PK violation on copy and then drops the subscription? Then
> > check that there shouldn't be any dangling slot on the publisher. This is
> > similar to a test in subscription/t/004_sync.pl; we can use some of that
> > framework but have a separate test for this.
> I've added this PK violation test to the attached tests.
> The patch works with v28 and caused no failures during the regression tests.
>

I checked this patch. It applied cleanly on top of V28, and all tests passed OK.

Here are two feedback comments.

1. For the regression test there are 2 x SQL and 1 x function test. I
thought that to cover all the combinations there should be another
function test, e.g.
Tests ALTER … REFRESH
Tests ALTER … (refresh = true)
Tests ALTER … (refresh = true) in a function
Tests ALTER … REFRESH in a function  <== this combination is not being
tested??

2. For the 004 test case I know the test needs some PK constraint violation:
# Check if DROP SUBSCRIPTION cleans up slots on the publisher side
# when the subscriber is stuck on data copy for constraint

But it is not clear to me what the exact cause of that PK violation
was. I think you must be relying on data that is left over from some
previous test case, but I am not sure which one. Can you make the
comment more detailed to say *how* the PK violation is happening - e.g.
something to say which rows, in which table, and inserted by whom?

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 8, 2021 at 8:06 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sat, Feb 6, 2021 at 6:30 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> >
> > > I have another idea for a test case: What if we write a test such that it
> > > fails with a PK violation on copy and then drops the subscription? Then
> > > check that there shouldn't be any dangling slot on the publisher. This is
> > > similar to a test in subscription/t/004_sync.pl; we can use some of that
> > > framework but have a separate test for this.
> > I've added this PK violation test to the attached tests.
> > The patch works with v28 and caused no failures during the regression tests.
> >
>
> I checked this patch. It applied cleanly on top of V28, and all tests passed OK.
>
> Here are two feedback comments.
>
> 1. For the regression test there are 2 x SQL and 1 x function test. I
> thought that to cover all the combinations there should be another
> function test, e.g.
> Tests ALTER … REFRESH
> Tests ALTER … (refresh = true)
> Tests ALTER … (refresh = true) in a function
> Tests ALTER … REFRESH in a function  <== this combination is not being
> tested??
>

I am not sure whether there is much value in adding more to this set
of negative test cases unless it really covers a different code path,
which I don't think will happen if we add more tests here.

--
With Regards,
Amit Kapila.



RE: Single transaction in the tablesync worker?

From
"osumi.takamichi@fujitsu.com"
Date:
Hello


On Mon, Feb 8, 2021 12:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 8, 2021 at 8:06 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> >
> > On Sat, Feb 6, 2021 at 6:30 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com> wrote:
> > >
> > > > I have another idea for a test case: What if we write a test such
> > > > that it fails with a PK violation on copy and then drops the
> > > > subscription? Then check that there shouldn't be any dangling slot
> > > > on the publisher. This is similar to a test in
> > > > subscription/t/004_sync.pl; we can use some of that framework but
> > > > have a separate test for this.
> > > I've added this PK violation test to the attached tests.
> > > The patch works with v28 and caused no failures during the regression tests.
> > >
> >
> > I checked this patch. It applied cleanly on top of V28, and all tests passed
> OK.
> >
> > Here are two feedback comments.
> >
> > 1. For the regression test there are 2 x SQL and 1 x function test. I
> > thought that to cover all the combinations there should be another
> > function test, e.g.
> > Tests ALTER … REFRESH
> > Tests ALTER … (refresh = true)
> > Tests ALTER … (refresh = true) in a function
> > Tests ALTER … REFRESH in a function  <== this combination is not being
> > tested??
> >
> 
> I am not sure whether there is much value in adding more to this set of
> negative test cases unless it really covers a different code path, which I
> don't think will happen if we add more tests here.
Yeah, I agree. Accordingly, I didn't fix that part.


On Mon, Feb 8, 2021 11:36 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 2. For the 004 test case I know the test needs some PK constraint violation:
> # Check if DROP SUBSCRIPTION cleans up slots on the publisher side
> # when the subscriber is stuck on data copy for constraint
> 
> But it is not clear to me what the exact cause of that PK violation was. I
> think you must be relying on data that is left over from some previous test
> case, but I am not sure which one. Can you make the comment more detailed
> to say *how* the PK violation is happening - e.g. something to say which
> rows, in which table, and inserted by whom?
I added some comments to clarify how the PK violation happens.
Please have a look.


Best Regards,
    Takamichi Osumi

Attachment

RE: Single transaction in the tablesync worker?

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, February 8, 2021 1:44 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com>
> On Mon, Feb 8, 2021 12:40 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > On Mon, Feb 8, 2021 at 8:06 AM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > >
> > > On Sat, Feb 6, 2021 at 6:30 PM osumi.takamichi@fujitsu.com
> > > <osumi.takamichi@fujitsu.com> wrote:
> > > >
> > > > > I have another idea for a test case: What if we write a test such
> > > > > that it fails with a PK violation on copy and then drops the
> > > > > subscription? Then check that there shouldn't be any dangling slot
> > > > > on the publisher. This is similar to a test in
> > > > > subscription/t/004_sync.pl; we can use some of that framework but
> > > > > have a separate test for this.
> > > > I've added this PK violation test to the attached tests.
> > > > The patch works with v28 and caused no failures during the regression tests.
> > > >
> > >
> > > I checked this patch. It applied cleanly on top of V28, and all
> > > tests passed
> > OK.
> > >
> > > Here are two feedback comments.
> > >
> > > 1. For the regression test there are 2 x SQL and 1 x function test. I
> > > thought that to cover all the combinations there should be another
> > > function test, e.g.
> > > Tests ALTER … REFRESH
> > > Tests ALTER … (refresh = true)
> > > Tests ALTER … (refresh = true) in a function
> > > Tests ALTER … REFRESH in a function  <== this combination is not being
> > > tested??
> > >
> >
> > I am not sure whether there is much value in adding more to this set
> > of negative test cases unless it really covers a different code path,
> > which I don't think will happen if we add more tests here.
> Yeah, I agree. Accordingly, I didn't fix that part.
> 
> 
> On Mon, Feb 8, 2021 11:36 AM Peter Smith <smithpb2250@gmail.com>
> wrote:
> > 2. For the 004 test case I know the test needs some PK constraint violation:
> > # Check if DROP SUBSCRIPTION cleans up slots on the publisher side
> > # when the subscriber is stuck on data copy for constraint
> >
> > But it is not clear to me what the exact cause of that PK violation was.
> > I think you must be relying on data that is left over from some previous
> > test case, but I am not sure which one. Can you make the comment more
> > detailed to say *how* the PK violation is happening - e.g. something to
> > say which rows, in which table, and inserted by whom?
> I added some comments to clarify how the PK violation happens.
> Please have a look.
Sorry, I had a typo in the tests of subscription.sql in v2.
I used 'foo' for the first test, "ALTER SUBSCRIPTION mytest SET PUBLICATION foo WITH (refresh = true)", in v02,
but I should have used 'mypub' to make this test clearly independent from previous tests.
Attached is the fixed version.

Best Regards,
    Takamichi Osumi


Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Feb 5, 2021 at 8:40 PM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> Hi,
>
> We had a bit high-level discussion about this patches with Amit
> off-list, so I decided to also take a look at the actual code.
>

Thanks for the discussion and a follow-up review.

> My main concern originally was the potential for left-over slots on the
> publisher, but I think the state now is relatively okay, with a couple of
> corner cases that are documented and don't seem much worse than the main
> slot.
>
> I wonder if we should mention the max_slot_wal_keep_size GUC in the
> table sync docs though.
>

I have added a reference to this in ALTER SUBSCRIPTION, where we
mention the risk of leftover slots. Let me know if you have something
else in mind.

> Another thing that might need documentation is that the visibility
> of changes done by table sync is no longer isolated, in that table
> contents will show intermediate progress to other backends rather than
> switching from nothing to a state consistent with the rest of replication.
>

Agreed and updated the docs accordingly.

>
> Some minor comments about code:
>
> > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > +             {
> > +                     /* WARNING. Error, but missing_ok = true. */
> > +                     ereport(WARNING,
>
> I wonder if we need to add error code to the WalRcvExecResult and check
> for the appropriate ones here. Because this can for example return error
> because of timeout, not because slot is missing.
>

I think there are both pros and cons of distinguishing the error
("slot does not exist") from others. The benefit is that if there is a
network glitch then the user can probably retry the Alter/Drop commands
and they will be successful next time. OTOH, say the network is broken
for a long time and the user wants to proceed: there won't be any way
to proceed for the Alter Subscription ... Refresh or Drop commands. So
by giving a WARNING at least we can provide a way to proceed, and they
can drop such slots later. We have mentioned this in the docs as well.
I think we can go either way here; let me know what you think is the
better way?

> Not sure if it matters
> for current callers though (but then maybe don't call the param
> missing_ok?).
>

Sure, if we decide not to change the behavior as suggested by you then
this makes sense.

>
> > +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
> > +{
> > +     if (syncslotname)
> > +             sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
> > +     else
> > +             syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > +
> > +     return syncslotname;
> > +}
>
> Given that we are now explicitly dropping slots, what happens here if we
> have 2 different downstreams that happen to get the same suboid and reloid;
> will one of them drop the slot of the other one? Previously, with the
> cleanup being left to the temp slot, we'd at maximum have gotten an error
> when creating it, but with the new logic in LogicalRepSyncTableStart it
> feels like we could get into a situation where 2 downstreams are fighting
> over the slot, no?
>

As discussed, I added the system_identifier to distinguish subscriptions
between different clusters.

Apart from fixing the above comment, I have integrated it with the new
replorigin_drop_by_name() API being discussed in the thread [1] and
posted that patch just for ease. I have also integrated Osumi-San's
test case patch with minor modifications.

[1] - https://www.postgresql.org/message-id/CAA4eK1L7mLhY%3DwyCB0qsEGUpfzWfncDSS9_0a4Co%2BN0GUyNGNQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Mon, Feb 8, 2021 at 12:22 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
> On Monday, February 8, 2021 1:44 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com>
> > On Mon, Feb 8, 2021 11:36 AM Peter Smith <smithpb2250@gmail.com>
> > wrote:
> > > 2. For the 004 test case I know the test needs some PK constraint violation:
> > > # Check if DROP SUBSCRIPTION cleans up slots on the publisher side
> > > # when the subscriber is stuck on data copy for constraint
> > >
> > > But it is not clear to me what the exact cause of that PK violation
> > > was. I think you must be relying on data that is left over from some
> > > previous test case, but I am not sure which one. Can you make the
> > > comment more detailed to say *how* the PK violation is happening -
> > > e.g. something to say which rows, in which table, and inserted by whom?
> > I added some comments to clarify how the PK violation happens.
> > Please have a look.
> Sorry, I had a typo in the tests of subscription.sql in v2.
> I used 'foo' for the first test, "ALTER SUBSCRIPTION mytest SET PUBLICATION foo WITH (refresh = true)", in v02,
> but I should have used 'mypub' to make this test clearly independent from previous tests.
> Attached is the fixed version.
>

Thanks. I have integrated this into the main patch with minor
modifications in the comments. The main change I have made is to
remove the test that checked that there are two slots remaining
after the initial sync failure. This is because on restart of the
tablesync worker we again try to drop the slot, so we can't guarantee
that the tablesync slot will remain. I think this is a timing issue,
so it might not have occurred on your machine, but I could reproduce
it by repeated runs of the tests you provided.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Mon, Feb 8, 2021 at 11:42 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sun, Feb 7, 2021 at 2:38 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
> > <petr.jelinek@enterprisedb.com> wrote:
> > >
> > > Hi,
> > >
> > > Some minor comments about code:
> > >
> > > > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > > > +             {
> > > > +                     /* WARNING. Error, but missing_ok = true. */
> > > > +                     ereport(WARNING,
> > >
> > > I wonder if we need to add error code to the WalRcvExecResult and check
> > > for the appropriate ones here. Because this can for example return error
> > > because of timeout, not because slot is missing. Not sure if it matters
> > > for current callers though (but then maybe don't call the param
> > > > missing_ok?).
> >
> > You are right. The way we are using this function has evolved beyond
> > the original intention.
> > Probably renaming the param to something like "error_ok" would be more
> > appropriate now.
> >
>
> PSA a patch (apply on top of V28) to change the misleading param name.
>

PSA an alternative patch. This one adds a new member to
WalRcvExecResult and so is able to detect the "slot does not exist"
error. This patch also applies on top of V28, if you want it.
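
For reference, the shape of that change is roughly the following (a
sketch of the new member only; the exact field layout in the patch may
differ):

typedef struct WalRcvExecResult
{
	WalRcvExecStatus status;
	int			sqlstate;		/* new member (sketch): remote SQLSTATE */
	char	   *err;
	TupleDesc	tupledesc;
	Tuplestorestate *tuplestore;
} WalRcvExecResult;

With that, callers can test e.g. res->sqlstate == ERRCODE_UNDEFINED_OBJECT
instead of treating every WALRCV_ERROR as a missing slot.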

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

RE: Single transaction in the tablesync worker?

From
"osumi.takamichi@fujitsu.com"
Date:
On Mon, Feb 8, 2021 8:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 8, 2021 at 12:22 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > On Monday, February 8, 2021 1:44 PM osumi.takamichi@fujitsu.com
> > <osumi.takamichi@fujitsu.com>
> > > On Mon, Feb 8, 2021 11:36 AM Peter Smith <smithpb2250@gmail.com>
> > > wrote:
> > > > 2. For the 004 test case I know the test needs some PK constraint
> > > > violation:
> > > > # Check if DROP SUBSCRIPTION cleans up slots on the publisher side
> > > > # when the subscriber is stuck on data copy for constraint
> > > >
> > > > But it is not clear to me what the exact cause of that PK violation
> > > > was. I think you must be relying on data that is left over from some
> > > > previous test case, but I am not sure which one. Can you make the
> > > > comment more detailed to say *how* the PK violation is happening -
> > > > e.g. something to say which rows, in which table, and inserted by whom?
> > > I added some comments to clarify how the PK violation happens.
> > > Please have a look.
> > Sorry, I had a typo in the tests of subscription.sql in v2.
> > I used 'foo' for the first test, "ALTER SUBSCRIPTION mytest SET
> > PUBLICATION foo WITH (refresh = true)", in v02, but I should have used
> > 'mypub' to make this test clearly independent from previous tests.
> > Attached is the fixed version.
> >
> 
> Thanks. I have integrated this into the main patch with minor
> modifications in the comments. The main change I have made is to remove
> the test that checked that there are two slots remaining after the
> initial sync failure. This is because on restart of the tablesync worker
> we again try to drop the slot, so we can't guarantee that the tablesync
> slot will remain. I think this is a timing issue, so it might not have
> occurred on your machine, but I could reproduce it by repeated runs of
> the tests you provided.
OK. I understand. Thank you so much for modifying it and integrating
it into the main patch.


Best Regards,
    Takamichi Osumi

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Here are my feedback comments for the V29 patch.

====

FILE: logical-replication.sgml

+    slots have generated names:
<quote><literal>pg_%u_sync_%u_%llu</literal></quote>
+    (parameters: Subscription <parameter>oid</parameter>,
+    Table <parameter>relid</parameter>, system
identifier<parameter>sysid</parameter>)
+   </para>

1.
There is a missing space before the sysid parameter.

=====

FILE: subscriptioncmds.c

+ * SUBREL_STATE_FINISHEDCOPY. The apply worker can also
+ * concurrently try to drop the origin and by this time the
+ * origin might be already removed. For these reasons,
+ * passing missing_ok = true from here.
+ */
+ snprintf(originname, sizeof(originname), "pg_%u_%u", sub->oid, relid);
+ replorigin_drop_by_name(originname, true, false);
+ }

2.
Don't really need to say "from here".
(same comment applies in multiple places, in this file and in tablesync.c)

3.
Previously the tablesync origin name format was encapsulated in a
common function. IMO it was cleaner/safer how it was before, instead
of having the same "pg_%u_%u" cut/pasted and scattered in many places.
(same comment applies in multiple places, in this file and in tablesync.c)

4.
Calls like replorigin_drop_by_name(originname, true, false); make it
unnecessarily hard to read the code when the boolean params are neither
named as variables nor commented. I noticed on another thread [et0205]
there was an idea that having no names/comments is fine because it is
anyway not difficult to figure out when using a "modern IDE", but since
my review tools are only "vi" and "meld" I beg to differ with that
justification.
(same comment applies in multiple places, in this file and in tablesync.c)

[et0205] https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com

=====

FILE: tablesync.c

5.
Previously there was a function tablesync_replorigin_drop which
encapsulated the tablesync origin name formatting. I thought that was
better than the V29 code, which now has the same formatting scattered
over many places; see the sketch below.
(same comment applies for worker_internal.h)
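
For points 3 and 5, the kind of encapsulation being asked for is roughly
this (a sketch; the helper name and signature are illustrative, not
necessarily what the patch ends up with):

/* Keep the tablesync origin name format in one place. */
static void
ReplicationOriginNameForTablesync(Oid suboid, Oid relid,
								  char *originname, int szorgname)
{
	snprintf(originname, szorgname, "pg_%u_%u", suboid, relid);
}

Callers would then build the name via the helper before calling
replorigin_drop_by_name(), instead of repeating the "pg_%u_%u" format.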

+ * Determine the tablesync slot name.
+ *
+ * The name must not exceed NAMEDATALEN - 1 because of remote node constraints
+ * on slot name length. We do append system_identifier to avoid slot_name
+ * collision with subscriptions in other clusters. With current scheme
+ * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0'), the maximum
+ * length of slot_name will be 50.
+ *
+ * The returned slot name is either:
+ * - stored in the supplied buffer (syncslotname), or
+ * - palloc'ed in current memory context (if syncslotname = NULL).
+ *
+ * Note: We don't use the subscription slot name as part of tablesync slot name
+ * because we are responsible for cleaning up these slots and it could become
+ * impossible to recalculate what name to cleanup if the subscription slot name
+ * had changed.
+ */
+char *
+ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
+                                char syncslotname[NAMEDATALEN])
+{
+	if (syncslotname)
+		sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+				GetSystemIdentifier());
+	else
+		syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
+								GetSystemIdentifier());
+
+	return syncslotname;
+}

6.
"We do append" --> "We append"
"With current scheme" -> "With the current scheme"

7.
Maybe consider just assigning GetSystemIdentifier() to a static
instead of calling that function for every slot?
static uint64 sysid = GetSystemIdentifier();
IIUC the sysid value is never going to change for a process, right?
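
(Note that C does not allow a function-scope static to be initialized
with a function call, so the caching would have to be lazy - a sketch:

	static uint64 sysid = 0;

	if (sysid == 0)
		sysid = GetSystemIdentifier();	/* process-constant, cached once */

The value is indeed fixed for the life of the process.)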

------
Kind Regards,
Peter Smith.
Fujitsu Australia

On Mon, Feb 8, 2021 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 5, 2021 at 8:40 PM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
> >
> > Hi,
> >
> > We had a bit high-level discussion about this patches with Amit
> > off-list, so I decided to also take a look at the actual code.
> >
>
> Thanks for the discussion and a follow-up review.
>
> > My main concern originally was the potential for left-over slots on the
> > publisher, but I think the state now is relatively okay, with a couple of
> > corner cases that are documented and don't seem much worse than the main
> > slot.
> >
> > I wonder if we should mention the max_slot_wal_keep_size GUC in the
> > table sync docs though.
> >
>
> I have added the reference of this in Alter Subscription where we
> mentioned the risk of leftover slots. Let me know if you have
> something else in mind?
>
> > Another thing that might need documentation is that the visibility
> > of changes done by table sync is no longer isolated, in that table
> > contents will show intermediate progress to other backends rather than
> > switching from nothing to a state consistent with the rest of replication.
> >
>
> Agreed and updated the docs accordingly.
>
> >
> > Some minor comments about code:
> >
> > > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > > +             {
> > > +                     /* WARNING. Error, but missing_ok = true. */
> > > +                     ereport(WARNING,
> >
> > I wonder if we need to add error code to the WalRcvExecResult and check
> > for the appropriate ones here. Because this can for example return error
> > because of timeout, not because slot is missing.
> >
>
> I think there are both pros and cons of distinguishing the error
> ("slot does not exist") from others. The benefit is that if there is a
> network glitch then the user can probably retry the Alter/Drop commands
> and they will be successful next time. OTOH, say the network is broken
> for a long time and the user wants to proceed: there won't be any way
> to proceed for the Alter Subscription ... Refresh or Drop commands. So
> by giving a WARNING at least we can provide a way to proceed, and they
> can drop such slots later. We have mentioned this in the docs as well.
> I think we can go either way here; let me know what you think is the
> better way?
>
> > Not sure if it matters
> > for current callers though (but then maybe don't call the param
> > missing_ok?).
> >
>
> Sure, if we decide not to change the behavior as suggested by you then
> this makes sense.
>
> >
> > > +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char syncslotname[NAMEDATALEN])
> > > +{
> > > +     if (syncslotname)
> > > +             sprintf(syncslotname, "pg_%u_sync_%u", suboid, relid);
> > > +     else
> > > +             syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > > +
> > > +     return syncslotname;
> > > +}
> >
> > Given that we are now explicitly dropping slots, what happens here if we
> > have 2 different downstreams that happen to get the same suboid and
> > reloid; will one of them drop the slot of the other one? Previously, with
> > the cleanup being left to the temp slot, we'd at maximum have gotten an
> > error when creating it, but with the new logic in LogicalRepSyncTableStart
> > it feels like we could get into a situation where 2 downstreams are
> > fighting over the slot, no?
> >
>
> As discussed, added system_identifier to distinguish subscriptions
> between different clusters.
>
> Apart from fixing the above comment, I have integrated it with the new
> replorigin_drop_by_name() API being discussed in the thread [1] and
> posted that patch just for ease. I have also integrated Osumi-San's
> test case patch with minor modifications.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1L7mLhY%3DwyCB0qsEGUpfzWfncDSS9_0a4Co%2BN0GUyNGNQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
More V29 Feedback

FILE: alter_subscription.sgml

8.
+  <para>
+   Commands <command>ALTER SUBSCRIPTION ... REFRESH ..</command> and
+   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ..</command> with refresh
+   option as true cannot be executed inside a transaction block.
+  </para>

My guess is those two lots of double dots ("..") were probably meant
to be an ellipsis ("...").

----
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
Looking at the V29-style tablesync slot names, they now appear like this:

WARNING:  could not drop tablesync replication slot
"pg_16397_sync_16389_6927117142022745645"

That is in the order subid + relid + sysid.

Now that I see it in a message it seems a bit strange with the sysid
just tacked onto the end like that.

I am wondering if reordering from parent to child might be more natural,
e.g. sysid + subid + relid gives a more intuitive name IMO.

So in this example it would be "pg_sync_6927117142022745645_16397_16389"

Thoughts?
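
In code terms that would just be a reordering of the format arguments,
e.g. (illustrative only, reusing the patch's UINT64_FORMAT convention):

	sprintf(syncslotname, "pg_sync_" UINT64_FORMAT "_%u_%u",
			GetSystemIdentifier(), suboid, relid);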

----
Kind Regards,
Peter Smith
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
When looking at the DropSubscription code I noticed that there is a
small difference between the HEAD code and the V29 code when slot_name
= NONE.

HEAD does
------
    if (!slotname)
    {
        table_close(rel, NoLock);
        return;
    }
------

V29 does
------
        if (!slotname)
        {
            /* be tidy */
            list_free(rstates);
            return;
        }
------

Isn't the V29 code missing a table_close(rel, NoLock) there?
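
Presumably the fix is just to close the relation before returning, i.e.
something like this sketch:

		if (!slotname)
		{
			/* be tidy */
			list_free(rstates);
			table_close(rel, NoLock);
			return;
		}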

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 9, 2021 at 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my feedback comments for the V29 patch.
>

Thanks.

>
> 3.
> Previously the tablesync origin name format was encapsulated in a
> common function. IMO it was cleaner/safer how it was before, instead
> of the same "pg_%u_%u" cut/paste and scattered in many places.
> (same comment applies multiple places, in this file and in tablesync.c)
>
> 4.
> Calls like replorigin_drop_by_name(originname, true, false); make it
> unnecessarily hard to read code when the boolean params are neither
> named as variables nor commented. I noticed on another thread [et0205]
> there was an idea that having no name/comments is fine because anyway
> it is not difficult to figure out when using a "modern IDE", but since
> my review tools are only "vi" and "meld" I beg to differ with that
> justification.
> (same comment applies multiple places, in this file and in tablesync.c)
>

It might be a bit more convenient for you, but for most others I think
it would be noise. Personally, I find the code more readable without
such name comments; they just break the flow of the code unless you
want to study the value of each param in detail.

> [et0205] https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com
>
> =====
>
> FILE: tablesync.c
>
> 5.
> Previously there was a function tablesync_replorigin_drop which was
> encapsulating the tablesync origin name formatting. I thought that was
> better than the V29 code which now has the same formatting scattered
> over many places.
> (same comment applies for worker_internal.h)
>

Isn't this the same as what you want to say in point-3?

>
> 7.
> Maybe consider just assigning GetSystemIdentifier() to a static
> instead of calling that function for every slot?
> static uint64 sysid = GetSystemIdentifier();
> IIUC the sysid value is never going to change for a process, right?
>

That's right, but I am not sure if there is much value in saving one
call here by introducing an extra variable.

I'll fix the other comments you raised.

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 9, 2021 at 1:37 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Looking at the V29 style tablesync slot names now they appear like this:
>
> WARNING:  could not drop tablesync replication slot
> "pg_16397_sync_16389_6927117142022745645"
> That is in the order subid +  relid + sysid
>
> Now that I see it in a message it seems a bit strange with the sysid
> just tacked onto the end like that.
>
> I am wondering if reordering of parent to child might be more natural.
> e.g sysid + subid + relid gives a more intuitive name IMO.
>
> So in this example it would be "pg_sync_6927117142022745645_16397_16389"
>

I have kept the order based on the importance of each parameter. Say
the user sees this message in the server log of the subscriber, either
for the purpose of tracking the origin's progress or for errors; the
sysid parameter won't be of much use there, and they will mostly be
looking at subid and relid. OTOH, if for some reason this parameter
appears in the publisher logs, then sysid might be helpful.

Petr, anyone else, do you have any opinion on this matter?

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Tue, Feb 9, 2021 at 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Here are my feedback comments for the V29 patch.
>
> ====
>
> FILE: logical-replication.sgml
>
> +    slots have generated names:
> <quote><literal>pg_%u_sync_%u_%llu</literal></quote>
> +    (parameters: Subscription <parameter>oid</parameter>,
> +    Table <parameter>relid</parameter>, system
> identifier<parameter>sysid</parameter>)
> +   </para>
>
> 1.
> There is a missing space before the sysid parameter.
>
> =====
>
> FILE: subscriptioncmds.c
>
> + * SUBREL_STATE_FINISHEDCOPY. The apply worker can also
> + * concurrently try to drop the origin and by this time the
> + * origin might be already removed. For these reasons,
> + * passing missing_ok = true from here.
> + */
> + snprintf(originname, sizeof(originname), "pg_%u_%u", sub->oid, relid);
> + replorigin_drop_by_name(originname, true, false);
> + }
>
> 2.
> Don't really need to say "from here".
> (same comment applies multiple places, in this file and in tablesync.c)
>
> 3.
> Previously the tablesync origin name format was encapsulated in a
> common function. IMO it was cleaner/safer how it was before, instead
> of the same "pg_%u_%u" cut/paste and scattered in many places.
> (same comment applies multiple places, in this file and in tablesync.c)
>

Fixed all three of the above comments.

> 4.
> Calls like replorigin_drop_by_name(originname, true, false); make it
> unnecessarily hard to read code when the boolean params are neither
> named as variables nor commented. I noticed on another thread [et0205]
> there was an idea that having no name/comments is fine because anyway
> it is not difficult to figure out when using a "modern IDE", but since
> my review tools are only "vi" and "meld" I beg to differ with that
> justification.
> (same comment applies multiple places, in this file and in tablesync.c)
>

Already responded to it separately. I went ahead and removed such
comments from other places in the patch.

> [et0205] https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com
>
> =====
>
> FILE: tablesync.c
>
> 5.
> Previously there was a function tablesync_replorigin_drop which was
> encapsulating the tablesync origin name formatting. I thought that was
> better than the V29 code which now has the same formatting scattered
> over many places.
> (same comment applies for worker_internal.h)
>

I am not sure what you are expecting here that is different from point-3?

> + * Determine the tablesync slot name.
> + *
> + * The name must not exceed NAMEDATALEN - 1 because of remote node constraints
> + * on slot name length. We do append system_identifier to avoid slot_name
> + * collision with subscriptions in other clusters. With current scheme
> + * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0'), the maximum
> + * length of slot_name will be 50.
> + *
> + * The returned slot name is either:
> + * - stored in the supplied buffer (syncslotname), or
> + * - palloc'ed in current memory context (if syncslotname = NULL).
> + *
> + * Note: We don't use the subscription slot name as part of tablesync slot name
> + * because we are responsible for cleaning up these slots and it could become
> + * impossible to recalculate what name to cleanup if the subscription slot name
> + * had changed.
> + */
> +char *
> +ReplicationSlotNameForTablesync(Oid suboid, Oid relid,
> +                                char syncslotname[NAMEDATALEN])
> +{
> +	if (syncslotname)
> +		sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
> +				GetSystemIdentifier());
> +	else
> +		syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
> +								GetSystemIdentifier());
> +
> +	return syncslotname;
> +}
>
> 6.
> "We do append" --> "We append"
> "With current scheme" -> "With the current scheme"
>

Fixed.

> 7.
> Maybe consider just assigning GetSystemIdentifier() to a static
> instead of calling that function for every slot?
> static uint64 sysid = GetSystemIdentifier();
> IIUC the sysid value is never going to change for a process, right?
>

Already responded.

> FILE: alter_subscription.sgml
>
> 8.
> +  <para>
> +   Commands <command>ALTER SUBSCRIPTION ... REFRESH ..</command> and
> +   <command>ALTER SUBSCRIPTION ... SET PUBLICATION ..</command> with refresh
> +   option as true cannot be executed inside a transaction block.
> +  </para>
>
> My guess is those two lots of double dots ("..") were probably meant
> to be an ellipsis ("...").
>

Fixed; for the first one I completed the command by adding PUBLICATION.

>
> When looking at the DropSubscription code I noticed that there is a
> small difference between the HEAD code and the V29 code when slot_name
> = NONE.
>
> HEAD does
> ------
>     if (!slotname)
>     {
>         table_close(rel, NoLock);
>         return;
>     }
> ------
>
> V29 does
> ------
>         if (!slotname)
>         {
>             /* be tidy */
>             list_free(rstates);
>             return;
>         }
> ------
>
> Isn't the V29 code missing a table_close(rel, NoLock) there?
>

Yes, good catch. Fixed.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Tue, Feb 9, 2021 at 8:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 9, 2021 at 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Here are my feedback comments for the V29 patch.
> >
>
> Thanks.
>
> >
> > 3.
> > Previously the tablesync origin name format was encapsulated in a
> > common function. IMO it was cleaner/safer how it was before, instead
> > of the same "pg_%u_%u" cut/paste and scattered in many places.
> > (same comment applies multiple places, in this file and in tablesync.c)

OK. I confirmed it is fixed in V30.

But I noticed that the new function name is not quite consistent with
the existing function for the slot name, e.g.
ReplicationSlotNameForTablesync versus
ReplicationOriginNameForTableSync (note "TableSync" instead of
"Tablesync").

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Feb 8, 2021 at 11:42 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sun, Feb 7, 2021 at 2:38 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
> > > <petr.jelinek@enterprisedb.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Some minor comments about code:
> > > >
> > > > > +             else if (res->status == WALRCV_ERROR && missing_ok)
> > > > > +             {
> > > > > +                     /* WARNING. Error, but missing_ok = true. */
> > > > > +                     ereport(WARNING,
> > > >
> > > > I wonder if we need to add error code to the WalRcvExecResult and check
> > > > for the appropriate ones here. Because this can for example return error
> > > > because of timeout, not because slot is missing. Not sure if it matters
> > > > for current callers though (but then maybe don't call the param
> > > > missing_ok?).
> > >
> > > You are right. The way we are using this function has evolved beyond
> > > the original intention.
> > > Probably renaming the param to something like "error_ok" would be more
> > > appropriate now.
> > >
> >
> > PSA a patch (apply on top of V28) to change the misleading param name.
> >
>
> PSA an alternative patch. This one adds a new member to
> WalRcvExecResult and so is able to detect the "slot does not exist"
> error. This patch also applies on top of V28, if you want it.
>

PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
some PG doc updates).
This applies OK on top of v30 of the main patch.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:

> PSA an alternative patch. This one adds a new member to
> WalRcvExecResult and so is able to detect the "slot does not exist"
> error. This patch also applies on top of V28, if you want it.

Did some testing with this patch on top of v29. I could see that now,
while dropping the subscription, if the tablesync slot does not exist
on the publisher, then it gives a warning
but the command does not fail.

postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub disable;
ALTER SUBSCRIPTION
=== here, the tablesync slot exists on the publisher but I go and
=== manually drop it.

postgres=# drop subscription tap_sub;
WARNING:  could not drop the replication slot
"pg_16401_sync_16389_6927117142022745645" on publisher
DETAIL:  The error was: ERROR:  replication slot
"pg_16401_sync_16389_6927117142022745645" does not exist
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION

I have a minor comment on the error message: the "The error was:" part
seems a bit redundant here. Maybe remove it, so that it looks like:

WARNING:  could not drop the replication slot
"pg_16401_sync_16389_6927117142022745645" on publisher
DETAIL:  ERROR:  replication slot
"pg_16401_sync_16389_6927117142022745645" does not exist

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
> some PG doc updates).
> This applies OK on top of v30 of the main patch.
>

Thanks, I have integrated these changes into the main patch and
additionally made some changes to comments and docs. I have also fixed
the function name inconsistency issue you reported and ran pgindent.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Peter Smith
Date:
I have reviewed the latest patch (V31) again.

I found only a few minor nitpick issues, not worth listing.

Then I ran the subscription TAP tests 50x in a loop as a kind of
stress test. That ran for 2.5hrs and the result was all 50x 'Result:
PASS'.

So V31 looks good to me.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Petr Jelinek
Date:
On 10 Feb 2021, at 06:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>
>> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>>
>>
>> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
>> some PG doc updates).
>> This applies OK on top of v30 of the main patch.
>>
>
> Thanks, I have integrated these changes into the main patch and
> additionally made some changes to comments and docs. I have also fixed
> the function name inconsistency issue you reported and ran pgindent.

One thing:

> +        else if (res->status == WALRCV_ERROR &&
> +                 missing_ok &&
> +                 res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
> +        {
> +            /* WARNING. Error, but missing_ok = true. */
> +            ereport(WARNING,
>                      (errmsg("could not drop the replication slot \"%s\" on publisher",
>                              slotname),
>                       errdetail("The error was: %s", res->err)));

Hmm, why is this a WARNING? We mostly call it with missing_ok = true when
the slot is not expected to be there, so it does not seem correct to
report it as a warning?

--
Petr


Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Feb 11, 2021 at 1:51 PM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> On 10 Feb 2021, at 06:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >>
> >> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >>>
> >>
> >> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
> >> some PG doc updates).
> >> This applies OK on top of v30 of the main patch.
> >>
> >
> > Thanks, I have integrated these changes into the main patch and
> > additionally made some changes to comments and docs. I have also fixed
> > the function name inconsistency issue you reported and ran pgindent.
>
> One thing:
>
> > +             else if (res->status == WALRCV_ERROR &&
> > +                              missing_ok &&
> > +                              res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
> > +             {
> > +                     /* WARNING. Error, but missing_ok = true. */
> > +                     ereport(WARNING,
> >                                       (errmsg("could not drop the replication slot \"%s\" on publisher",
> >                                                       slotname),
> >                                        errdetail("The error was: %s", res->err)));
>
> Hmm, why is this a WARNING? We mostly call it with missing_ok = true when
> the slot is not expected to be there, so it does not seem correct to
> report it as a warning?
>

WARNING is for the cases where we don't always expect slots to exist
and we don't want to stop the operation because of that. For example, in
DropSubscription, for some of the rel states (like SUBREL_STATE_INIT
and SUBREL_STATE_DATASYNC), the slot won't exist. Similarly, say we
fail (due to a network error) after removing some of the slots; next
time, we will again try to drop the already dropped slots and fail. For
these reasons, we need to use WARNING. Similarly, for tablesync workers,
when we are initially trying to drop the slot there is no certainty
that it exists, so we can't throw an ERROR and stop the operation there.
There are other cases, like when the tablesync worker has finished
syncing the table; there we will raise an ERROR if the slot doesn't
exist. Does this make sense?
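
To illustrate the pattern being described (a rough sketch only, not the
committed loop; the helper names follow the patch but the details may
differ, and the surrounding DropSubscription locals rstates, subid and
wrconn are assumed):

	ListCell   *lc;

	/*
	 * While dropping the subscription, try to drop each tablesync slot
	 * but tolerate its absence: for rel states like SUBREL_STATE_INIT
	 * and SUBREL_STATE_DATASYNC the slot may never have been created,
	 * or a previously failed attempt may already have removed it.
	 */
	foreach(lc, rstates)
	{
		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
		char		syncslotname[NAMEDATALEN] = {0};

		ReplicationSlotNameForTablesync(subid, rstate->relid, syncslotname);
		ReplicationSlotDropAtPubNode(wrconn, syncslotname, true /* missing_ok */);
	}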

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Petr Jelinek
Date:
On 11 Feb 2021, at 10:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 11, 2021 at 1:51 PM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
>>
>> On 10 Feb 2021, at 06:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>>>
>>>> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>>>>
>>>>
>>>> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
>>>> some PG doc updates).
>>>> This applies OK on top of v30 of the main patch.
>>>>
>>>
>>> Thanks, I have integrated these changes into the main patch and
>>> additionally made some changes to comments and docs. I have also fixed
>>> the function name inconsistency issue you reported and ran pgindent.
>>
>> One thing:
>>
>>> +             else if (res->status == WALRCV_ERROR &&
>>> +                              missing_ok &&
>>> +                              res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
>>> +             {
>>> +                     /* WARNING. Error, but missing_ok = true. */
>>> +                     ereport(WARNING,
>>>                                      (errmsg("could not drop the replication slot \"%s\" on publisher",
>>>                                                      slotname),
>>>                                       errdetail("The error was: %s", res->err)));
>>
>> Hmm, why is this a WARNING? We mostly call it with missing_ok = true when
>> the slot is not expected to be there, so it does not seem correct to
>> report it as a warning?
>>
>
> WARNING is for the cases where we don't always expect slots to exist
> and we don't want to stop the operation due to it. For example, in
> DropSubscription, for some of the rel states like (SUBREL_STATE_INIT
> and SUBREL_STATE_DATASYNC), the slot won't exist. Similarly, say if we
> fail (due to network error) after removing some of the slots, next
> time, it will again try to drop already dropped slots and fail. For
> these reasons, we need to use WARNING. Similarly for tablesync workers
> when we are trying to initially drop the slot there is no certainty
> that it exists, so we can't throw ERROR and stop the operation there.
> There are other cases like when the table sync worker has finished
> syncing the table, there we will raise an ERROR if the slot doesn't
> exist. Does this make sense?

Well, I was thinking it could be NOTICE or LOG to be honest; WARNING seems
unnecessarily scary for those use cases to me.

—
Petr




Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Feb 11, 2021 at 3:20 PM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> On 11 Feb 2021, at 10:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 11, 2021 at 1:51 PM Petr Jelinek
> > <petr.jelinek@enterprisedb.com> wrote:
> >>
> >> On 10 Feb 2021, at 06:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>>
> >>> On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >>>>
> >>>> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >>>>>
> >>>>
> >>>> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
> >>>> some PG doc updates).
> >>>> This applies OK on top of v30 of the main patch.
> >>>>
> >>>
> >>> Thanks, I have integrated these changes into the main patch and
> >>> additionally made some changes to comments and docs. I have also fixed
> >>> the function name inconsistency issue you reported and ran pgindent.
> >>
> >> One thing:
> >>
> >>> +             else if (res->status == WALRCV_ERROR &&
> >>> +                              missing_ok &&
> >>> +                              res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
> >>> +             {
> >>> +                     /* WARNING. Error, but missing_ok = true. */
> >>> +                     ereport(WARNING,
> >>>                                      (errmsg("could not drop the replication slot \"%s\" on publisher",
> >>>                                                      slotname),
> >>>                                       errdetail("The error was: %s", res->err)));
> >>
> >> Hmm, why is this a WARNING? We mostly call it with missing_ok = true
> >> when the slot is not expected to be there, so it does not seem correct
> >> to report it as a warning?
> >>
> >
> > WARNING is for the cases where we don't always expect slots to exist
> > and we don't want to stop the operation due to it. For example, in
> > DropSubscription, for some of the rel states like (SUBREL_STATE_INIT
> > and SUBREL_STATE_DATASYNC), the slot won't exist. Similarly, say if we
> > fail (due to network error) after removing some of the slots, next
> > time, it will again try to drop already dropped slots and fail. For
> > these reasons, we need to use WARNING. Similarly for tablesync workers
> > when we are trying to initially drop the slot there is no certainty
> > that it exists, so we can't throw ERROR and stop the operation there.
> > There are other cases like when the table sync worker has finished
> > syncing the table, there we will raise an ERROR if the slot doesn't
> > exist. Does this make sense?
>
> Well, I was thinking it could be NOTICE or LOG to be honest; WARNING
> seems unnecessarily scary for those use cases to me.
>

I am fine with LOG and will make that change. Do you have any more
comments or want to spend more time on this patch before we call it
good?
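
That change would presumably just demote the elevel in that branch, i.e.
a sketch:

			/* Error, but missing_ok = true; log and carry on. */
			ereport(LOG,
					(errmsg("could not drop the replication slot \"%s\" on publisher",
							slotname),
					 errdetail("The error was: %s", res->err)));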

-- 
With Regards,
Amit Kapila.



Re: Single transaction in the tablesync worker?

From
Petr Jelinek
Date:
On 11 Feb 2021, at 10:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 11, 2021 at 3:20 PM Petr Jelinek
> <petr.jelinek@enterprisedb.com> wrote:
>>
>> On 11 Feb 2021, at 10:42, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Thu, Feb 11, 2021 at 1:51 PM Petr Jelinek
>>> <petr.jelinek@enterprisedb.com> wrote:
>>>>
>>>> On 10 Feb 2021, at 06:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>>
>>>>> On Wed, Feb 10, 2021 at 7:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith <smithpb2250@gmail.com> wrote:
>>>>>>>
>>>>>>
>>>>>> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
>>>>>> some PG doc updates).
>>>>>> This applies OK on top of v30 of the main patch.
>>>>>>
>>>>>
>>>>> Thanks, I have integrated these changes into the main patch and
>>>>> additionally made some changes to comments and docs. I have also fixed
>>>>> the function name inconsistency issue you reported and ran pgindent.
>>>>
>>>> One thing:
>>>>
>>>>> +             else if (res->status == WALRCV_ERROR &&
>>>>> +                              missing_ok &&
>>>>> +                              res->sqlstate == ERRCODE_UNDEFINED_OBJECT)
>>>>> +             {
>>>>> +                     /* WARNING. Error, but missing_ok = true. */
>>>>> +                     ereport(WARNING,
>>>>>                                     (errmsg("could not drop the replication slot \"%s\" on publisher",
>>>>>                                                     slotname),
>>>>>                                      errdetail("The error was: %s", res->err)));
>>>>
>>>> Hmm, why is this a WARNING? We mostly call it with missing_ok = true
>>>> when the slot is not expected to be there, so it does not seem correct
>>>> to report it as a warning?
>>>>
>>>
>>> WARNING is for the cases where we don't always expect slots to exist
>>> and we don't want to stop the operation due to it. For example, in
>>> DropSubscription, for some of the rel states like (SUBREL_STATE_INIT
>>> and SUBREL_STATE_DATASYNC), the slot won't exist. Similarly, say if we
>>> fail (due to network error) after removing some of the slots, next
>>> time, it will again try to drop already dropped slots and fail. For
>>> these reasons, we need to use WARNING. Similarly for tablesync workers
>>> when we are trying to initially drop the slot there is no certainty
>>> that it exists, so we can't throw ERROR and stop the operation there.
>>> There are other cases like when the table sync worker has finished
>>> syncing the table, there we will raise an ERROR if the slot doesn't
>>> exist. Does this make sense?
>>
>> Well, I was thinking it could be NOTICE or LOG to be honest; WARNING
>> seems unnecessarily scary for those use cases to me.
>>
>
> I am fine with LOG and will make that change. Do you have any more
> comments or want to spend more time on this patch before we call it
> good?

I am good, thanks!

—
Petr


Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Thu, Feb 11, 2021 at 3:32 PM Petr Jelinek
<petr.jelinek@enterprisedb.com> wrote:
>
> On 11 Feb 2021, at 10:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >> Well, I was thinking it could be NOTICE or LOG to be honest; WARNING
> >> seems unnecessarily scary for those use cases to me.
> >>
> >
> > I am fine with LOG and will make that change. Do you have any more
> > comments or want to spend more time on this patch before we call it
> > good?
>
> I am good, thanks!
>

Okay, attached an updated patch with only that change.

-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Thu, Feb 11, 2021 at 10:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Okay, attached an updated patch with only that change.

I ran Erik's test suite [1] on this patch overnight and found no
errors. No more comments from me. The patch looks good.

regards,
Ajin Cherian
Fujitsu Australia

[1]- https://www.postgresql.org/message-id/93d02794068482f96d31b002e0eb248d%40xs4all.nl



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Feb 12, 2021 at 7:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Feb 11, 2021 at 10:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Okay, attached an updated patch with only that change.
>
> I ran Erik's test suite [1] on this patch overnight and found no
> errors. No more comments from me. The patch looks good.
>

Thanks, I have pushed the patch but am getting one failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-02-12%2002%3A28%3A12

The reason seems to be that we are trying to connect and
max_wal_senders is set to zero. I think we can write this without
trying to connect. The attached patch fixes the problem for me. What
do you think?


-- 
With Regards,
Amit Kapila.

Attachment

Re: Single transaction in the tablesync worker?

From
Ajin Cherian
Date:
On Fri, Feb 12, 2021 at 2:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

>
> Thanks, I have pushed the patch but am getting one failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-02-12%2002%3A28%3A12
>
> The reason seems to be that we are trying to connect and
> max_wal_senders is set to zero. I think we can write this without
> trying to connect. The attached patch fixes the problem for me. What
> do you think?

Verified this with installcheck and a modified configuration with
wal_level = minimal and max_wal_senders = 0.
Tests passed. The changes look good to me.

regards,
Ajin Cherian
Fujitsu Australia



Re: Single transaction in the tablesync worker?

From
Amit Kapila
Date:
On Fri, Feb 12, 2021 at 10:08 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 2:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Thanks, I have pushed the patch but getting one failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2021-02-12%2002%3A28%3A12
> >
> > The reason seems to be that we are trying to connect and
> > max_wal_senders is set to zero. I think we can write this without
> > trying to connect. The attached patch fixes the problem for me. What
> > do you think?
>
> Verified this with installcheck and a modified configuration with
> wal_level = minimal and max_wal_senders = 0.
> Tests passed. The changes look good to me.
>

Thanks, I have pushed the fix and the latest run of 'thorntail' has passed.

-- 
With Regards,
Amit Kapila.