Thread: In logical replication concurrent update of partition key creates a duplicate record on standby.
In logical replication concurrent update of partition key creates a duplicate record on standby.
From
amul sul
Date:
Hi,

If an update of the partition key involves tuple movement from one partition to another, it is carried out as a separate delete on one partition and an insert on the other partition.

In logical replication, if an update is performed on the master and the standby at the same moment, the replication worker tries to replay the delete + insert operations on the standby. While replaying the master's changes on the standby, the worker logs a "concurrent update, retrying" message for the delete operation (because the standby's own update has already deleted the tuple) and moves forward to replay the next insert operation. The standby's update also performed the same delete + insert as part of its partition-key update, with the result that two records end up on the standby.

Here is a quick demonstration:

== NODE 1 ==
postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)

== NODE 2 ==
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)

-- Now attach GDB to the replication worker & break on the heap_lock_tuple() call

== NODE 1 ==
postgres=# update foo set a=2, b='node1_update' where a=1;
UPDATE 1

<---- replication worker hits breakpoint on heap_lock_tuple() ---->

== NODE 2 ==
postgres=# update foo set a=2, b='node2_update' where a=1;
UPDATE 1

<---- continue replication worker ---->

postgres=# 2018-02-07 13:32:46.307 IST [81613] LOG:  concurrent update, retrying

== NODE 1 ==
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node1_update
(1 row)

== NODE 2 ==
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node2_update
 foo2     | 2 | node1_update
(2 rows)

=== Script to create partitioned table, publication & subscription ===

-- node1
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;
ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;

-- node2
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost dbname=postgres' PUBLICATION update_row_mov_pub;

== END ==

Here is a link to the previous discussion:
https://postgr.es/m/CAAJ_b97w_GGV-k4ErxWTpz5sAgFJ4aUYMx0khfySvANmWRzsag@mail.gmail.com

Regards,
Amul Sul
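[Editorial note: the race above can be pictured with a toy model of the apply worker's replay loop. This is plain Python, not PostgreSQL code; the partition/row shapes and the skip-on-missing-tuple behaviour are simplified stand-ins for what the worker and heap_lock_tuple() actually do.]

```python
# Toy model: a DELETE whose tuple is already gone is skipped (the worker
# logs "concurrent update, retrying" and moves on), but the companion
# INSERT of the partition-key update is still applied.

def replay(partitions, changes):
    """Replay decoded changes the way the apply worker does."""
    for op, part, row in changes:
        if op == "DELETE":
            if row in partitions[part]:
                partitions[part].remove(row)
            # else: tuple already deleted by a local update -> skipped
        elif op == "INSERT":
            partitions[part].append(row)
    return partitions

# Standby state after its own local update already moved the row foo1 -> foo2.
standby = {"foo1": [], "foo2": [(2, "node2_update")]}

# The master's partition-key update arrives decoded as DELETE + INSERT.
master_changes = [
    ("DELETE", "foo1", (1, "initial insert")),  # tuple gone: skipped
    ("INSERT", "foo2", (2, "node1_update")),    # still applied
]

replay(standby, master_changes)
print(standby["foo2"])  # two rows where logically there should be one
```

Running this leaves foo2 with both node1_update and node2_update rows, mirroring the (2 rows) result on NODE 2 above.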
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Amit Khandekar
Date:
On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
> If an update of partition key involves tuple movement from one partition to
> another partition then there will be a separate delete on one partition and
> insert on the other partition made.
>
> [...] in a result there will be two records inserted on standby.

A quick think about how to resolve this makes me wonder whether we can manage to pass some information through logical decoding indicating that the delete is part of a partition-key update. This is analogous to how we set some information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.

I guess, at node 2 where this issue reproduces, it can also be reproduced if there is a non-partition-key UPDATE going on and the tuple gets deleted as part of the replay of the partition-key update? That UPDATE will skip its work, thinking that the tuple is deleted. This is similar to what happens now in the case of local concurrent updates, for which a fix is being worked on.
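[Editorial note: the local-tuple trick being referred to can be sketched in miniature. This is a Python stand-in for the C-level mechanism, with simplified names; PostgreSQL's real ItemPointerData lives in the tuple header and this is not the server's actual definition.]

```python
# A deleting backend can stamp the old tuple's t_ctid so that concurrent
# updaters can distinguish "row moved to another partition" from an
# ordinary update chain.  Simplified stand-ins for the C structures:

INVALID_BLOCK_NUMBER = 0xFFFFFFFF  # stand-in for InvalidBlockNumber

class ItemPointer:
    """Simplified stand-in for a heap tuple's t_ctid (block, offset)."""
    def __init__(self, blkid, posid):
        self.ip_blkid = blkid
        self.ip_posid = posid

def mark_moved_to_partition(ctid):
    # Performed by the delete half of a partition-key update.
    ctid.ip_blkid = INVALID_BLOCK_NUMBER

def tuple_moved_to_partition(ctid):
    # Checked by a concurrent updater before chasing the ctid chain.
    return ctid.ip_blkid == INVALID_BLOCK_NUMBER

ctid = ItemPointer(42, 7)
assert not tuple_moved_to_partition(ctid)
mark_moved_to_partition(ctid)
assert tuple_moved_to_partition(ctid)
```

The proposal in the mail is to carry an analogous marker through logical decoding, so the subscriber can tell this kind of delete apart from a plain one.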
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
amul sul
Date:
On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> A quick thinking on how to resolve this makes me wonder if we can
> manage to pass some information through logical decoding that the
> delete is part of a partition key update. This is analogous to how we
> set some information locally in the tuple by setting
> tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.

+1. Also, if the worker fails to replay the delete operation on the standby, we need to decide what the next step will be: should we skip the following insert operation, error out, or something else?

> I guess, at the node 2 where this issue reproduces, this issue can
> also be reproduced if there is a non-partition-key UPDATE going on,
> and the tuple gets deleted as part of the replaying of partition-key
> update ?

Yes, you are correct: at node 2 the reported issue is also reproducible without the update of the partition key.

== NODE 2 ==
postgres=# update foo set b='node2_update' where a=1;
UPDATE 1
postgres=# select * from foo;
 a |      b
---+--------------
 1 | node2_update
(1 row)

<---- continue replication worker ---->

postgres=# 2018-02-07 15:26:53.323 IST [86449] LOG:  concurrent update, retrying
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo1     | 1 | node2_update
 foo2     | 2 | node1_update
(2 rows)

Regards,
Amul Sul
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Craig Ringer
Date:
On 7 February 2018 at 17:33, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> A quick thinking on how to resolve this makes me wonder if we can
> manage to pass some information through logical decoding that the
> delete is part of a partition key update. This is analogous to how we
> set some information locally in the tuple by setting
> tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.
We already do something similar for UPDATEs that change the REPLICA IDENTITY; we include the old key in extra WAL.

The main question is whether the required knowledge is available at a suitable level.
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Amit Kapila
Date:
On Wed, Feb 7, 2018 at 3:42 PM, amul sul <sulamul@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> A quick thinking on how to resolve this makes me wonder if we can
>> manage to pass some information through logical decoding that the
>> delete is part of a partition key update. [...]
>
> +1,

I also mentioned the same thing in the other thread [1], but I think that alone won't solve the dual-record problem you are seeing. I think we need to do something for the next insert, as you are suggesting.

> also if worker failed to reply delete operation on standby then
> we need to decide what will be the next step, should we skip follow
> insert operation or error out or something else.

That would be tricky; do you see any simple way of doing either of those?

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BHopDbA3h0oYXE1kuhsU0rLT-hONeeS0SoG36YpeSnGw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Amit Kapila
Date:
On Wed, Feb 7, 2018 at 6:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I also mentioned the same thing in the other thread [1], but I think
> that alone won't solve the dual record problem as you are seeing. I
> think we need to do something for next insert as you are suggesting.

Can you please check what the behavior was before the update tuple routing patch (commit 2f178441) went in?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
amul sul
Date:
On Wed, Feb 7, 2018 at 6:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> also if worker failed to reply delete operation on standby then
>> we need to decide what will be the next step, should we skip follow
>> insert operation or error out or something else.
>
> That would be tricky, do you see any simple way of doing either of those.

Not really. As ExecUpdate does for an update of the partition key, if the delete fails then the subsequent insert could be skipped. But you are correct, it might be trickier than I think -- there is no guarantee that the next insert operation the replication worker is trying to replay is part of the partition-key update mechanism. How can one identify that an insert operation on one relation is related to a previous delete operation on some other relation?

Regards,
Amul
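[Editorial note: one way to picture the pairing problem raised here. If the decoded stream carried a flag marking both halves of a partition-key update, the apply worker could drop the companion insert whenever the flagged delete found the tuple already gone. A toy sketch; the "part_key_update" flag and the change layout are purely hypothetical.]

```python
def apply(changes, table_has_tuple):
    """Replay decoded changes; swallow an INSERT that is the second half
    of a partition-key update whose DELETE found the tuple already gone."""
    applied = []
    skip_companion_insert = False
    for change in changes:
        if change["op"] == "DELETE":
            if table_has_tuple(change["row"]):
                applied.append(change)
            elif change.get("part_key_update"):   # hypothetical flag
                skip_companion_insert = True      # drop the paired insert too
        elif change["op"] == "INSERT":
            if skip_companion_insert and change.get("part_key_update"):
                skip_companion_insert = False     # companion swallowed
            else:
                applied.append(change)
    return applied

changes = [
    {"op": "DELETE", "row": 1, "part_key_update": True},
    {"op": "INSERT", "row": 2, "part_key_update": True},
]
# Tuple already deleted by a local concurrent update: both halves skipped.
assert apply(changes, lambda row: False) == []
# No concurrent update: both halves replay normally.
assert len(apply(changes, lambda row: True)) == 2
```

Without such a flag, the worker has no way to tell this insert from an unrelated one, which is exactly the difficulty the mail points out.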
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
amul sul
Date:
On Thu, Feb 8, 2018 at 5:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can you please once check what was the behavior before Update Tuple
> routing patch (Commit-id: 2f178441) went in?

Before this commit, such an update would fail with the following error:

postgres=# update foo set a=2, b='node1_update' where a=1;
ERROR:  new row for relation "foo1" violates partition constraint
DETAIL:  Failing row contains (2, node1_update).

Regards,
Amul
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Peter Eisentraut
Date:
On 2/8/18 10:54, amul sul wrote:
> [...] there is no guarantee that the next insert operation
> which replication worker trying to replicate is part of the update of partition
> key mechanism. How can one identify that an insert operation on one relation is
> related to previously deleting operation on some other relation?

I think you somehow need to stitch this back together in logical decoding and publish it as an update operation. Otherwise, wrong things happen. For example, what happens to a publication that is configured to only publish inserts? What happens to update triggers on the receiving table? What if the subscriber side is partitioned differently?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
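[Editorial note: the stitching idea might look roughly like this in a decoder. A toy sketch only; the "part_key_update" marker is hypothetical and real reorder-buffer changes carry far more state.]

```python
def stitch(changes):
    """Combine a flagged DELETE + following INSERT into a single UPDATE,
    so downstream consumers (insert-only publications, update triggers,
    differently partitioned subscribers) see the operation that actually
    happened."""
    out = []
    pending_delete = None
    for c in changes:
        if c["op"] == "DELETE" and c.get("part_key_update"):
            pending_delete = c            # hold until the insert arrives
        elif c["op"] == "INSERT" and pending_delete is not None:
            out.append({"op": "UPDATE",
                        "old": pending_delete["row"],
                        "new": c["row"]})
            pending_delete = None
        else:
            out.append(c)                 # unrelated change: pass through
    return out

decoded = [
    {"op": "DELETE", "row": (1, "initial insert"), "part_key_update": True},
    {"op": "INSERT", "row": (2, "node1_update")},
]
assert stitch(decoded) == [
    {"op": "UPDATE", "old": (1, "initial insert"), "new": (2, "node1_update")}
]
```

Published as an UPDATE, the change would then go through the normal update conflict path on the subscriber instead of a skippable delete plus an unconditional insert.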
Re: In logical replication concurrent update of partition key creates a duplicate record on standby.
From
Amit Kapila
Date:
On Tue, Feb 13, 2018 at 5:25 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> I think you somehow need to stitch this back together in logical
> decoding and publish it as an update operation.

I am not sure what you have in mind, but it seems difficult to piggyback this on the existing delete operation because the delete and insert are separate operations. What might work is a separate, special update WAL record which comprises both the delete and the insert. Now, that is easier said than done, because the current update tuple routing mechanism uses the existing infrastructure for delete and insert, and if we want to use a different WAL record, we might need to change that infrastructure as well to avoid writing three WAL records (one for the delete, one for the insert, one for the special update) for this operation.

> Otherwise, wrong things
> happen. For example, what happens to a publication that is configured
> to only publish inserts? What happens to update triggers on the
> receiving table? What if the subscriber side is partitioned differently?

All of these seem to be valid points and would get addressed if we decide to tinker with the WAL format and logical decoding for this operation. OTOH, we might want to document this behavior for v11, as we are discussing for some other related issues [1][2].

[1] - https://www.postgresql.org/message-id/CA%2BTgmoY_h%2B3J46zShEZD0_KLRHa1NsJkGrC4Ou%3DBqt%3DKRboHtg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CA%2BTgmoacGfUSWSMVRci-duVFSGOoevgq43mSY9Sztd1RRhiHjg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com