Thread: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

[HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
Hi All,

Attaching a POC patch that throws an error in the case of a concurrent update
to an already deleted tuple due to an UPDATE of the partition key[1].

In a normal update, the new tuple is linked to the old one via ctid, forming
a chain of tuple versions, but an UPDATE of the partition key[1] moves the tuple
from one partition to another partition table, which breaks that chain.

Consider the following[2] concurrent update case, where one session tries to
update a row that is locked by another session whose concurrent update moves
the tuple to another partition.

create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'ABC');

 ----------- session 1 -----------
postgres=# begin;
BEGIN
postgres=# update foo set a=2 where a=1;
UPDATE 1


 ----------- session 2  -----------
postgres=# update foo set b='EFG' where a=1;

 ….. wait state ……

----------- session 1 -----------
postgres=# commit;
COMMIT

 ----------- session 2  -----------
UPDATE 0

This UPDATE 0 is the problematic part; Greg Stark's mail[3] explains
why we need an error here.

To throw an error we need an indicator that the targeted row has already
been moved to another partition, and for that, setting ctid.ip_blkid to
InvalidBlockNumber looks like a viable option for now.

The attached patch incorporates the following logic suggested by Amit
Kapila[4]:

"We can pass a flag say row_moved (or require_row_movement) to heap_delete
which will in turn set InvalidBlockId in ctid instead of setting it to
self. Then the
ExecUpdate needs to check for the same and return an error when heap_update is
not successful (result != HeapTupleMayBeUpdated)."
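
In code terms, the suggested logic amounts to roughly the following sketch
(simplified; the attached patch is the authoritative version, and the exact
error wording and placement vary by call site):

/* In heap_delete(), when the delete is part of a tuple movement: */
if (row_moved)
    BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);

/* In callers such as ExecUpdate(), on a concurrent-update failure: */
if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));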

1] https://postgr.es/m/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
2] With https://postgr.es/m/CAJ3gD9fzD4jBpv+zXqZYnW=h9JXUFG9E7NGdA9gR_JJbOj=Q5A@mail.gmail.com
patch applied.
3] https://postgr.es/m/CAM-w4HPis7rbnwi%2BoXjnouqMSRAC5DeVcMdxEXTMfDos1kaYPQ%40mail.gmail.com
4] https://postgr.es/m/CAA4eK1KEZQ%2BCyXbBzfn1jFHoEfa_OemDLhLyy7xfD1QUZLo1DQ%40mail.gmail.com

Regards,
Amul


Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching a POC patch that throws an error in the case of a concurrent update
> to an already deleted tuple due to an UPDATE of the partition key[1].
>
> In a normal update, the new tuple is linked to the old one via ctid, forming
> a chain of tuple versions, but an UPDATE of the partition key[1] moves the tuple
> from one partition to another partition table, which breaks that chain.

This patch needs a rebase.  It has one whitespace-only hunk that
should possibly be excluded.

I think the basic idea of this is sound.  Either you or Amit need to
document the behavior in the user-facing documentation, and it needs
tests that hit every single one of the new error checks you've added
(obviously, the tests will only work in combination with Amit's
patch).  The isolation tester should be sufficient to write such tests.

It needs some more extensive comments as well.  The fact that we're
assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
and should at least be documented in itemptr.h in the comments for the
ItemPointerData structure.
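
Something along these lines in the ItemPointerData comments would do (the
wording here is only a sketch, not what the patch must say):

/*
 * Note: t_ctid normally points either to itself (for the latest version)
 * or to a newer version of the same row.  As a special case, when a row
 * is moved to another partition by an UPDATE of the partition key,
 * ip_blkid of the old tuple's t_ctid is set to InvalidBlockNumber,
 * meaning that the version chain continues in a different relation and
 * cannot be followed via ctid.
 */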

I suspect ExecOnConflictUpdate needs an update for the
HeapTupleUpdated case similar to what you've done elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>> Attaching a POC patch that throws an error in the case of a concurrent update
>> to an already deleted tuple due to an UPDATE of the partition key[1].
>>
>> In a normal update, the new tuple is linked to the old one via ctid, forming
>> a chain of tuple versions, but an UPDATE of the partition key[1] moves the tuple
>> from one partition to another partition table, which breaks that chain.
>
> This patch needs a rebase.  It has one whitespace-only hunk that
> should possibly be excluded.
>
Thanks for looking at the patch.

> I think the basic idea of this is sound.  Either you or Amit need to
> document the behavior in the user-facing documentation, and it needs
> tests that hit every single one of the new error checks you've added
> (obviously, the tests will only work in combination with Amit's
> patch).  The isolation tester should be sufficient to write such tests.
>
> It needs some more extensive comments as well.  The fact that we're
> assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
> and should at least be documented in itemptr.h in the comments for the
> ItemPointerData structure.
>
> I suspect ExecOnConflictUpdate needs an update for the
> HeapTupleUpdated case similar to what you've done elsewhere.
>

The UPDATE of partition key v25[1] patch has documented this concurrent scenario;
I need to rework that documentation paragraph to include this behaviour and will
discuss it with Amit.

The attached 0001 patch includes the error check in 8 functions; out of those 8
I was able to build isolation tests for 4 -- ExecUpdate, ExecDelete,
GetTupleForTrigger & ExecLockRows.

The remaining ones are EvalPlanQualFetch, ExecOnConflictUpdate,
RelationFindReplTupleByIndex & RelationFindReplTupleSeq.  Note that the check in
RelationFindReplTupleByIndex & RelationFindReplTupleSeq emits a LOG, not an
ERROR.

Any help/suggestions on building tests for these functions would be much appreciated.


1] http://postgr.es/m/CAJ3gD9f4Um99sOJmVSEPj783VWw5GbXkgU3OWcYZJv=ipjTkAw@mail.gmail.com


Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Thu, Nov 23, 2017 at 6:48 AM, amul sul <sulamul@gmail.com> wrote:
> The remaining ones are EvalPlanQualFetch, ExecOnConflictUpdate,
> RelationFindReplTupleByIndex & RelationFindReplTupleSeq.  Note that the check in
> RelationFindReplTupleByIndex & RelationFindReplTupleSeq emits a LOG, not an
> ERROR.

The first one is going to come up when you have, for example, two
concurrent updates targeting the same row, and the second one when you
have an ON CONFLICT UPDATE clause.  I guess the latter two are
probably related to logical replication, and maybe not easy to test
via an automated regression test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Alvaro Herrera
Date:
A typo in all the messages the patch adds:
"to an another" -> "to another"

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
> On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>>> Attaching a POC patch that throws an error in the case of a concurrent update
>>> to an already deleted tuple due to an UPDATE of the partition key[1].
>>>
>>> In a normal update, the new tuple is linked to the old one via ctid, forming
>>> a chain of tuple versions, but an UPDATE of the partition key[1] moves the tuple
>>> from one partition to another partition table, which breaks that chain.
>>
>> This patch needs a rebase.  It has one whitespace-only hunk that
>> should possibly be excluded.
>>
> Thanks for looking at the patch.
>
>> I think the basic idea of this is sound.  Either you or Amit need to
>> document the behavior in the user-facing documentation, and it needs
>> tests that hit every single one of the new error checks you've added
>> (obviously, the tests will only work in combination with Amit's
>> patch).  The isolation tester should be sufficient to write such tests.
>>
>> It needs some more extensive comments as well.  The fact that we're
>> assigning a meaning to ip_blkid -> InvalidBlockNumber is a big deal,
>> and should at least be documented in itemptr.h in the comments for the
>> ItemPointerData structure.
>>
>> I suspect ExecOnConflictUpdate needs an update for the
>> HeapTupleUpdated case similar to what you've done elsewhere.
>>
>
> The UPDATE of partition key v25[1] patch has documented this concurrent scenario;
> I need to rework that documentation paragraph to include this behaviour and will
> discuss it with Amit.
>
> The attached 0001 patch includes the error check in 8 functions; out of those 8
> I was able to build isolation tests for 4 -- ExecUpdate, ExecDelete,
> GetTupleForTrigger & ExecLockRows.
>

Few comments:

1.
@@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));

+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));

Why do you think we need this check in the OnConflictUpdate path?  I
think we don't need it here because we are going to relinquish this version
of the tuple and will start again and might fetch some other row
version.  Also, we don't support Insert .. On Conflict Update with
partitioned tables, see[1], which is also an indication that at the
very least we don't need it now.

2.
@@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation relation, int lockmode,
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));

+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be updated was already moved to an another partition due to concurrent update")));

..
..
+++ b/src/backend/executor/nodeLockRows.c
@@ -218,6 +218,11 @@ lnext:
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));

+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to an another partition due to concurrent update")));
+

At some places after heap_lock_tuple the error message says "tuple to
be updated .." and other places it says "tuple to be locked ..".  Can
we use the same message consistently?  I think it would be better to
use the second one.

3.
}
+
  /* tuple already deleted; nothing to do */
  return NULL;

Spurious whitespace.

4.  There is no need to use *POC* in the name of the patch.  I think
this is no longer a POC patch.

[1] - https://www.postgresql.org/message-id/7ff1e8ec-dc39-96b1-7f47-ff5965dceeac%40lab.ntt.co.jp

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
>> On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>>>>
[...]
> Few comments:
>
Thanks for looking at the patch, please find my comments inline:

> 1.
> @@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>   ereport(ERROR,
>   (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>   errmsg("could not serialize access due to concurrent update")));
> + if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tuple to be updated was already moved to an another
> partition due to concurrent update")));
>
> Why do you think we need this check in the OnConflictUpdate path?  I
> think we don't need it here because we are going to relinquish this version
> of the tuple and will start again and might fetch some other row
> version.  Also, we don't support Insert .. On Conflict Update with
> partitioned tables, see[1], which is also an indication that at the
> very least we don't need it now.
>
Agreed, but even though this case is not going to arise anytime soon,
shouldn't we have a check for an invalid block id?  IMHO, we should have
this check and either an error report or an assert, thoughts?

> 2.
> @@ -2709,6 +2709,10 @@ EvalPlanQualFetch(EState *estate, Relation
> relation, int lockmode,
>   ereport(ERROR,
>   (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>   errmsg("could not serialize access due to concurrent update")));
> + if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tuple to be updated was already moved to an another
> partition due to concurrent update")));
>
> ..
> ..
> +++ b/src/backend/executor/nodeLockRows.c
> @@ -218,6 +218,11 @@ lnext:
>   ereport(ERROR,
>   (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>   errmsg("could not serialize access due to concurrent update")));
> + if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tuple to be locked was already moved to an another partition
> due to concurrent update")));
> +
>
> At some places after heap_lock_tuple the error message says "tuple to
> be updated .." and other places it says "tuple to be locked ..".  Can
> we use the same message consistently?  I think it would be better to
> use the second one.
>
Okay, will use "tuple to be locked"

> 3.
> }
> +
>   /* tuple already deleted; nothing to do */
>   return NULL;
>
> Spurious whitespace.
>
Sorry about that, will fix this.

> 4.  There is no need to use *POC* in the name of the patch.  I think
> this is no longer a POC patch.
>
Understood.

Regards,
Amul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Fri, Nov 24, 2017 at 9:37 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> A typo in all the messages the patch adds:
> "to an another" -> "to another"
>

Thanks for looking into the patch; will fix in the next version.

Regards,
Amul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Tue, Nov 28, 2017 at 5:58 PM, amul sul <sulamul@gmail.com> wrote:
> On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>>>>>
> [...]
>> Few comments:
>>
> Thanks for looking at the patch, please find my comments inline:
>
>> 1.
>> @@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>>   ereport(ERROR,
>>   (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>   errmsg("could not serialize access due to concurrent update")));
>> + if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>> + ereport(ERROR,
>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> + errmsg("tuple to be updated was already moved to an another
>> partition due to concurrent update")));
>>
>> Why do you think we need this check in the OnConflictUpdate path?  I
>> think we don't need it here because we are going to relinquish this version
>> of the tuple and will start again and might fetch some other row
>> version.  Also, we don't support Insert .. On Conflict Update with
>> partitioned tables, see[1], which is also an indication that at the
>> very least we don't need it now.
>>
> Agreed, but even though this case is not going to arise anytime soon,
> shouldn't we have a check for an invalid block id?  IMHO, we should have
> this check and either an error report or an assert, thoughts?
>

I feel adding code which can't be hit (even if it is error handling)
is not a good idea.  I think having an Assert should be okay, but
please write comments to explain the reason for adding an Assert.
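
Something like this minimal sketch is what I have in mind (the comment
wording is only illustrative):

/*
 * INSERT .. ON CONFLICT DO UPDATE is not supported on partitioned
 * tables yet, so a concurrent update can never have moved the tuple
 * to another partition here.  Assert that, instead of adding an
 * unreachable error path.
 */
Assert(BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))));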

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Wed, Nov 29, 2017 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 28, 2017 at 5:58 PM, amul sul <sulamul@gmail.com> wrote:
>> On Sat, Nov 25, 2017 at 11:39 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Thu, Nov 23, 2017 at 5:18 PM, amul sul <sulamul@gmail.com> wrote:
>>>> On Sat, Nov 11, 2017 at 1:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> On Wed, Sep 27, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>>>>>>
>> [...]
>>> Few comments:
>>>
>> Thanks for looking at the patch, please find my comments inline:
>>
>>> 1.
>>> @@ -1480,6 +1493,10 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
>>>   ereport(ERROR,
>>>   (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>>   errmsg("could not serialize access due to concurrent update")));
>>> + if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>>> + errmsg("tuple to be updated was already moved to an another
>>> partition due to concurrent update")));
>>>
>>> Why do you think we need this check in the OnConflictUpdate path?  I
>>> think we don't need it here because we are going to relinquish this version
>>> of the tuple and will start again and might fetch some other row
>>> version.  Also, we don't support Insert .. On Conflict Update with
>>> partitioned tables, see[1], which is also an indication that at the
>>> very least we don't need it now.
>>>
>> Agreed, but even though this case is not going to arise anytime soon,
>> shouldn't we have a check for an invalid block id?  IMHO, we should have
>> this check and either an error report or an assert, thoughts?
>>
>
> I feel adding code which can't be hit (even if it is error handling)
> is not a good idea.  I think having an Assert should be okay, but
> please write comments to explain the reason for adding an Assert.
>

Agreed, updated in the attached patch.  Patch 0001 also includes your
previous review comment[1] and the typo correction suggested by Alvaro[2].

Patch 0002 is still missing tests for the EvalPlanQualFetch() function.  I think
we could skip those, because the direct/indirect callers of EvalPlanQualFetch() --
GetTupleForTrigger, ExecDelete, ExecUpdate & ExecLockRows -- get the required
test coverage in the attached patch.

1] https://postgr.es/m/CAA4eK1LQS6TmsGaEwR9HgF-9TZTHxrdAELuX6wOZBDbbjOfDjQ@mail.gmail.com
2] https://postgr.es/m/20171124160756.eyljpmpfzwd6jmnr@alvherre.pgsql

Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Stephen Frost
Date:
Amul,

* amul sul (sulamul@gmail.com) wrote:
> Agreed, updated in the attached patch.  Patch 0001 also includes your
> previous review comment[1] and the typo correction suggested by Alvaro[2].

Looks like this needs to be rebased (though the failures aren't too bad,
from what I'm seeing), so going to mark this back to Waiting For Author.
Hopefully this also helps to wake this thread up a bit and get another
review of it.

Thanks!

Stephen

Attachment
On Thu, Jan 11, 2018 at 8:06 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Amul,
>
> * amul sul (sulamul@gmail.com) wrote:
>> Agreed, updated in the attached patch.  Patch 0001 also includes your
>> previous review comment[1] and the typo correction suggested by Alvaro[2].
>
> Looks like this needs to be rebased (though the failures aren't too bad,
> from what I'm seeing), so going to mark this back to Waiting For Author.
> Hopefully this also helps to wake this thread up a bit and get another
> review of it.
>

Thanks for looking at this thread; attached herewith is an updated patch rebased
on the 'UPDATE of partition key v35' patch[1].


Regards,
Amul


1] https://postgr.es/m/CAJ3gD9dixkkMzNnnP1CaZ1H17-U17ok_sVbjZZo+wnB=rJH6yg@mail.gmail.com

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
>
> Thanks for looking at this thread; attached herewith is an updated patch rebased
> on the 'UPDATE of partition key v35' patch[1].
>

  ExecDelete(mtstate, tupleid, oldtuple, planSlot, epqstate, estate,
-   &tuple_deleted, false, false);
+   &tuple_deleted, false, false, true);

  /*
  * For some reason if DELETE didn't happen (e.g. trigger prevented
@@ -1292,6 +1299,11 @@ lreplace:;
  ereport(ERROR,
  (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  errmsg("could not serialize access due to concurrent update")));
+ if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition
due to concurrent update")));

I had asked to change the message "tuple to be updated .." after the
heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
message in nodeModifyTable.c.

Have you verified the changes in execReplication.c in some way?  It is
not clear to me how you ensure that the special value
(InvalidBlockNumber) is set in the CTID for a delete operation via
logical replication.  The logical replication worker uses the function
ExecSimpleRelationDelete to perform the delete, and there is no way it
can pass the correct value of row_moved to heap_delete.  Am I missing
something due to which we don't need to do this?
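
The call path I am looking at is roughly (simplified):

ExecSimpleRelationDelete()
  -> simple_heap_delete(rel, tid)
       -> heap_delete(rel, tid, ..., row_moved = false)

Nothing in this path knows whether the replicated DELETE was in fact half
of a cross-partition UPDATE on the publisher, so row_moved is always false
and ip_blkid never gets invalidated on the subscriber.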

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
[....]
> I had asked to change the message "tuple to be updated .." after the
> heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
> message in nodeModifyTable.c.
>

Understood, fixed in the attached version; sorry I'd missed it.

> Have you verified the changes in execReplication.c in some way?  It is
> not clear to me how you ensure that the special value
> (InvalidBlockNumber) is set in the CTID for a delete operation via
> logical replication.  The logical replication worker uses the function
> ExecSimpleRelationDelete to perform the delete, and there is no way it
> can pass the correct value of row_moved to heap_delete.  Am I missing
> something due to which we don't need to do this?
>

You are correct: from ExecSimpleRelationDelete, heap_delete will always
receive row_moved = false and InvalidBlockNumber will never be set.

I didn't find any test case that hits the changes in execReplication.c.  I am
not sure what we are supposed to do here, nor how much worse having a LOG is.

What do you think: should we add an assert like in EvalPlanQualFetch() here, or
is the current LOG message fine?

Thanks for the review.

Regards,
Amul Sul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Wed, Jan 24, 2018 at 12:44 PM, amul sul <sulamul@gmail.com> wrote:
> On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
> [....]
>> I had asked to change the message "tuple to be updated .." after the
>> heap_lock_tuple calls, not in nodeModifyTable.c, so please revert the
>> message in nodeModifyTable.c.
>>
>
> Understood, fixed in the attached version, sorry I'd missed it.
>
>> Have you verified the changes in execReplication.c in some way?  It is
>> not clear to me how you ensure that the special value
>> (InvalidBlockNumber) is set in the CTID for a delete operation via
>> logical replication.  The logical replication worker uses the function
>> ExecSimpleRelationDelete to perform the delete, and there is no way it
>> can pass the correct value of row_moved to heap_delete.  Am I missing
>> something due to which we don't need to do this?
>>
>
> You are correct: from ExecSimpleRelationDelete, heap_delete will always
> receive row_moved = false and InvalidBlockNumber will never be set.
>

So, this means that in case of logical replication, it won't generate
the error this patch is trying to introduce.  I think if we want to
handle this we need some changes in WAL and logical decoding as well.

Robert, others, what do you think?  I am not very comfortable leaving
this unaddressed, if we don't want to do anything about it, at least
we should document it.

> I didn't find any test case that hits the changes in execReplication.c.  I am
> not sure what we are supposed to do here, nor how much worse having a LOG is.
>

I think you can manually (via debugger) hit this by using
PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
you need to do is in node-1, create a partitioned table and subscribe
it on node-2.  Now, perform an Update on node-1, then stop the logical
replication worker before it calls heap_lock_tuple.  Now, in node-2,
update the same row such that it moves the row.  Now, continue the
logical replication worker.  I think it should hit your new code, if
not then we need to think of some other way.

> What do you think: should we add an assert like in EvalPlanQualFetch() here, or
> is the current LOG message fine?
>

I think first let's try to hit this case.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
Hi Amit,

Sorry for the delayed response.

On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jan 24, 2018 at 12:44 PM, amul sul <sulamul@gmail.com> wrote:
>> On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Jan 12, 2018 at 11:43 AM, amul sul <sulamul@gmail.com> wrote:
[....]
> I think you can manually (via debugger) hit this by using
> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
> you need to do is in node-1, create a partitioned table and subscribe
> it on node-2.  Now, perform an Update on node-1, then stop the logical
> replication worker before it calls heap_lock_tuple.  Now, in node-2,
> update the same row such that it moves the row.  Now, continue the
> logical replication worker.  I think it should hit your new code, if
> not then we need to think of some other way.
>

I am able to hit the new LOG message using the above steps.  Thanks a lot for
the step-by-step guide, I really needed that.

One strange behavior I found in logical replication, which is reproducible
without the attached patch as well -- when I updated on node2 while keeping a
breakpoint before the heap_lock_tuple call in the replication worker, I saw
that a duplicate row was inserted on node2, see this:

== NODE 1 ==

postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


=== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


== NODE 1 ==

postgres=# update foo set a=2, b='node1_update' where a=1;
UPDATE 1

<---- BREAK POINT BEFORE heap_lock_tuple IN replication worker  --->


== NODE 2 ==

postgres=#  update foo set a=2, b='node2_update' where a=1;

<---- RELEASE BREAK POINT --->

postgres=# 2018-02-02 12:35:45.050 IST [91786] LOG:  tuple to be
locked was already moved to another partition due to concurrent
update, retrying

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node2_update
 foo2     | 2 | node1_update
(2 rows)


== NODE 1 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node1_update
(1 row)

I am thinking of reporting this in a separate thread, but am not sure whether
this is already known behaviour or not.

== schema to reproduce above case ==
-- node1
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'initial insert');
CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;
ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;

-- node2
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost
dbname=postgres' PUBLICATION update_row_mov_pub;
== END==

Updated patch attached -- corrects the changes in execReplication.c.

Regards,
Amul Sul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> So, this means that in case of logical replication, it won't generate
> the error this patch is trying to introduce.  I think if we want to
> handle this we need some changes in WAL and logical decoding as well.
>
> Robert, others, what do you think?  I am not very comfortable leaving
> this unaddressed, if we don't want to do anything about it, at least
> we should document it.

As I said on the other thread, I'm not sure how reasonable it really
is to try to do anything about this.  For both this issue and the one
you raised there, I think we'd need to introduce a new WAL record type
that represents a delete from one table and an insert into another that
should be considered as a single operation. I'm not keen on that idea,
but you can make an argument that it's the Right Thing To Do.  I would
be more inclined, at least for v11, to just document that the
delete+insert will be replayed separately on replicas.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [....]
>> I think you can manually (via debugger) hit this by using
>> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
>> you need to do is in node-1, create a partitioned table and subscribe
>> it on node-2.  Now, perform an Update on node-1, then stop the logical
>> replication worker before it calls heap_lock_tuple.  Now, in node-2,
>> update the same row such that it moves the row.  Now, continue the
>> logical replication worker.  I think it should hit your new code, if
>> not then we need to think of some other way.
>>
>
> I am able to hit the new LOG message using the above steps.  Thanks a lot for
> the step-by-step guide, I really needed that.
>
> One strange behavior I found in logical replication, which is reproducible
> without the attached patch as well -- when I updated on node2 while keeping a
> breakpoint before the heap_lock_tuple call in the replication worker, I saw
> that a duplicate row was inserted on node2, see this:
>
..
>
> I am thinking of reporting this in a separate thread, but am not sure whether
> this is already known behaviour or not.
>

I think it is worth discussing this behavior in a separate thread.
However, if possible, try to reproduce it without partitioning and
then report it.

>
> Updated patch attached -- corrects the changes in execReplication.c.
>

Your changes look correct to me.

I wonder what will be the behavior of this patch with
wal_consistency_checking [1].  I think it will generate a failure as
there is nothing in WAL to replay it.  Can you once try it?  If we see
a failure with wal consistency checker, then we need to think whether
(a) we want to deal with it by logging this information, or (b) do we
want to mask it or (c) something else?
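
For reference, the developer option I mean can be enabled with a
postgresql.conf setting such as the following (it has to be set where the
WAL is generated, i.e. on the master; the standby then verifies the pages
during redo):

wal_consistency_checking = 'heap'    # or 'all', at a higher WAL volume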


[1] -  https://www.postgresql.org/docs/devel/static/runtime-config-developer.html


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [....]
>>> I think you can manually (via debugger) hit this by using
>>> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
>>> you need to do is in node-1, create a partitioned table and subscribe
>>> it on node-2.  Now, perform an Update on node-1, then stop the logical
>>> replication worker before it calls heap_lock_tuple.  Now, in node-2,
>>> update the same row such that it moves the row.  Now, continue the
>>> logical replication worker.  I think it should hit your new code, if
>>> not then we need to think of some other way.
>>>
>>
>> I am able to hit the new LOG message using the above steps.  Thanks a lot for
>> the step-by-step guide, I really needed that.
>>
>> One strange behavior I found in logical replication, which is reproducible
>> without the attached patch as well -- when I updated on node2 while keeping a
>> breakpoint before the heap_lock_tuple call in the replication worker, I saw
>> that a duplicate row was inserted on node2, see this:
>>
> ..
>>
>> I am thinking of reporting this in a separate thread, but am not sure whether
>> this is already known behaviour or not.
>>
>
> I think it is worth discussing this behavior in a separate thread.
> However, if possible, try to reproduce it without partitioning and
> then report it.
>
Logical replication behavior for a normal table is as expected; this happens
only with a partitioned table.  I will start a new thread for this on -hackers.

>>
>> Updated patch attached -- corrects the changes in execReplication.c.
>>
>
> Your changes look correct to me.
>
> I wonder what will be the behavior of this patch with
> wal_consistency_checking [1].  I think it will generate a failure as
> there is nothing in WAL to replay it.  Can you once try it?  If we see
> a failure with wal consistency checker, then we need to think whether
> (a) we want to deal with it by logging this information, or (b) do we
> want to mask it or (c) something else?
>
>
> [1] -  https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
>

Yes, you are correct; the standby stopped with the following error:

 FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
 CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
 LOG:  startup process (PID 22791) exited with exit code 1
 LOG:  terminating any other active server processes
 LOG:  database system is shut down

I have tested a warm standby replication setup using the attached script.  Without
the wal_consistency_checking setting it works fine & data from master to standby is
replicated as expected; if this guarantee is enough then I think we could skip this
error in the wal consistency check for such deleted tuples (I guess option (b)
that you suggested), thoughts?

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> [....]
>>>> I think you can manually (via debugger) hit this by using
>>>> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
>>>> you need to do is in node-1, create a partitioned table and subscribe
>>>> it on node-2.  Now, perform an Update on node-1, then stop the logical
>>>> replication worker before it calls heap_lock_tuple.  Now, in node-2,
>>>> update the same row such that it moves the row.  Now, continue the
>>>> logical replication worker.  I think it should hit your new code, if
>>>> not then we need to think of some other way.
>>>>
>>>
>>> I am able to hit the new LOG message using the above steps.  Thanks a lot for
>>> the step-by-step guide, I really needed that.
>>>
>>> One strange behavior I found in logical replication, which is reproducible
>>> without the attached patch as well -- when I updated on node2 while keeping a
>>> breakpoint before the heap_lock_tuple call in the replication worker, I saw
>>> that a duplicate row was inserted on node2, see this:
>>>
>> ..
>>>
>>> I am thinking of reporting this in a separate thread, but am not sure whether
>>> this is already known behaviour or not.
>>>
>>
>> I think it is worth discussing this behavior in a separate thread.
>> However, if possible, try to reproduce it without partitioning and
>> then report it.
>>
> Logical replication behavior for a normal table is as expected; this happens
> only with a partitioned table.  I will start a new thread for this on -hackers.
>
Posted on -hackers:
https://postgr.es/m/CAAJ_b94bYxLsX0erZXVH-anQPbWqcYUPWX4xVRa1YJY=Ph60ZQ@mail.gmail.com

Regards,
Amul Sul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Sat, Feb 3, 2018 at 4:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> So, this means that in case of logical replication, it won't generate
>> the error this patch is trying to introduce.  I think if we want to
>> handle this we need some changes in WAL and logical decoding as well.
>>
>> Robert, others, what do you think?  I am not very comfortable leaving
>> this unaddressed, if we don't want to do anything about it, at least
>> we should document it.
>
> As I said on the other thread, I'm not sure how reasonable it really
> is to try to do anything about this.  For both this issue and the one
> you raised there, I think we'd need to introduce a new WAL record type
> that represents a delete from one table and an insert into another that
> should be considered as a single operation.
>

I think to solve the issue in this thread, a flag should be sufficient
that logical replication can use to set InvalidBlockNumber in the CTID
for such deletes.

> I'm not keen on that idea,
> but you can make an argument that it's the Right Thing To Do.  I would
> be more inclined, at least for v11, to just document that the
> delete+insert will be replayed separately on replicas.
>

Even if we do what you are suggesting, we need something in WAL
(probably a flag to indicate this special type of delete); otherwise,
the wal consistency checker will fail.  Another idea would be to mask the
ctid change so that the wal consistency checker doesn't cry.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Feb 2, 2018 at 2:11 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[....]
>>
>> I wonder what will be the behavior of this patch with
>> wal_consistency_checking [1].  I think it will generate a failure as
>> there is nothing in WAL to replay it.  Can you once try it?  If we see
>> a failure with wal consistency checker, then we need to think whether
>> (a) we want to deal with it by logging this information, or (b) do we
>> want to mask it or (c) something else?
>>
>>
>> [1] -  https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
>>
>
> Yes, you are correct; the standby stopped with the following error:
>
>  FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
>  CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
>  LOG:  startup process (PID 22791) exited with exit code 1
>  LOG:  terminating any other active server processes
>  LOG:  database system is shut down
>
> I have tested a warm standby replication setup using the attached script.  Without
> the wal_consistency_checking setting it works fine & data from master to standby is
> replicated as expected; if this guarantee is enough then I think we could skip this
> error in the wal consistency check for such deleted tuples (I guess option (b)
> that you suggested), thoughts?

I tried to mask ctid.ip_blkid if it is set to InvalidBlockNumber, with the
following change in heap_mask:

------------- PATCH -------------
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 682f4f07a8..e7c011f9a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9323,6 +9323,10 @@ heap_mask(char *pagedata, BlockNumber blkno)
             */
            if (HeapTupleHeaderIsSpeculative(page_htup))
                ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+           /* TODO : comments ?  */
+           if (!BlockNumberIsValid(BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
+               BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
        }

        /*
------------- END -------------

The test script[1] works as expected with this change, but I don't have much
confidence in it due to my lack of knowledge of the wal_consistency_checking
routine.  Any suggestions/comments would be much appreciated, thanks!

[1] https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+PT8+ioWRPjnnFHe3eqw@mail.gmail.com

Regards,
Amul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Wed, Feb 7, 2018 at 6:13 PM, amul sul <sulamul@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
>> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Yes, you are correct; the standby stopped with the following error:
>>
>>  FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
>>  CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
>>  LOG:  startup process (PID 22791) exited with exit code 1
>>  LOG:  terminating any other active server processes
>>  LOG:  database system is shut down
>>
>> I have tested a warm standby replication setup using the attached script.  Without
>> the wal_consistency_checking setting it works fine & data from master to standby is
>> replicated as expected; if this guarantee is enough then I think we could skip this
>> error in the wal consistency check for such deleted tuples (I guess option (b)
>> that you suggested), thoughts?
>
> I tried to mask ctid.ip_blkid if it is set to InvalidBlockNumber, with the
> following change in heap_mask:
>

Your change appears fine to me.  I think one can set both block number
and offset as we do for  HeapTupleHeaderIsSpeculative, but the way you
have done it looks good to me.  Kindly include it in the next version
of your patch by adding the missing comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Tue, Feb 13, 2018 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 6:13 PM, amul sul <sulamul@gmail.com> wrote:
>> On Tue, Feb 6, 2018 at 7:05 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> Yes, you are correct; the standby stopped with the following error:
>>>
>>>  FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
>>>  CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
>>>  LOG:  startup process (PID 22791) exited with exit code 1
>>>  LOG:  terminating any other active server processes
>>>  LOG:  database system is shut down
>>>
>>> I have tested a warm standby replication setup using the attached script.  Without
>>> the wal_consistency_checking setting it works fine & data from master to standby is
>>> replicated as expected; if this guarantee is enough then I think we could skip this
>>> error in the wal consistency check for such deleted tuples (I guess option (b)
>>> that you suggested), thoughts?
>>
>> I tried to mask ctid.ip_blkid if it is set to InvalidBlockNumber, with the
>> following change in heap_mask:
>>
>
> Your change appears fine to me.  I think one can set both block number
> and offset as we do for  HeapTupleHeaderIsSpeculative, but the way you
> have done it looks good to me.  Kindly include it in the next version
> of your patch by adding the missing comment.
>

Thanks for the confirmation, updated patch attached.

Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Your change appears fine to me.  I think one can set both block number
>> and offset as we do for  HeapTupleHeaderIsSpeculative, but the way you
>> have done it looks good to me.  Kindly include it in the next version
>> of your patch by adding the missing comment.
>>
>
> Thanks for the confirmation, updated patch attached.
>

+# Concurrency error from GetTupleForTrigger
+# Concurrency error from ExecLockRows

I think you don't need to mention the above sentences in the spec files.
Apart from that, your patch looks good to me.  I have marked it as
Ready For Committer.

Notes for Committer -
1. We might need some changes in the update-tuple-routing mechanism if we
decide to do anything for the bug [1] discussed in the nearby thread,
but as that is not directly related to this patch, we can move ahead.
2. I think it is better to document that for update tuple routing the
delete+insert will be replayed separately on replicas.  I leave this
to the discretion of the committer.

[1] - https://www.postgresql.org/message-id/CAAJ_b94bYxLsX0erZXVH-anQPbWqcYUPWX4xVRa1YJY%3DPh60ZQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Rajkumar Raghuwanshi
Date:
On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> +# Concurrency error from GetTupleForTrigger
> +# Concurrency error from ExecLockRows
>
> I think you don't need to mention the above sentences in the spec files.
> Apart from that, your patch looks good to me.  I have marked it as
> Ready For Committer.

I too have tested this feature with the isolation framework and it looks good to me.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Andres Freund
Date:
Hi,

On 2018-02-13 12:41:26 +0530, amul sul wrote:
> From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
> From: Amul Sul <sulamul@gmail.com>
> Date: Tue, 13 Feb 2018 12:37:52 +0530
> Subject: [PATCH 1/2] Invalidate ip_blkid v5
> 
> v5:
>  - Added code in heap_mask to skip wal_consistency_checking[7]
>  - Fixed previous todos.
> 
> v5-wip2:
>  - Minor changes in RelationFindReplTupleByIndex() and
>    RelationFindReplTupleSeq()
> 
>  - TODO;
>    Same as the previous
> 
> v5-wip: Update w.r.t Amit Kapila's comments[6].
>  - Reverted error message in nodeModifyTable.c from 'tuple to be locked'
>    to 'tuple to be updated'.
> 
>  - TODO:
>  1. Yet to made a decision of having LOG/ELOG/ASSERT in the
>     RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
> 
> v4: Rebased on "UPDATE of partition key v35" patch[5].
> 
> v3: Update w.r.t Amit Kapila's[3] & Alvaro Herrera[4] comments
>  - typo in all error message and comment : "to an another" -> "to another"
>  - error message change : "tuple to be updated" -> "tuple to be locked"
>  - In ExecOnConflictUpdate(), error report converted into assert &
>   comments added.
> 
> v2: Updated w.r.t Robert review comments[2]
>  - Updated couple of comment of heap_delete argument and ItemPointerData
>  - Added same concurrent update error logic in ExecOnConflictUpdate,
>    RelationFindReplTupleByIndex and RelationFindReplTupleSeq
> 
> v1: Initial version -- as per Amit Kapila's suggestions[1]
>  - When tuple is being moved to another partition then ip_blkid in the
>    tuple header mark to InvalidBlockNumber.

Very nice and instructive to keep this in a submission's commit message.



> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 8a846e7dba..f4560ee9cb 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
>   *    crosscheck - if not InvalidSnapshot, also check tuple against this
>   *    wait - true if should wait for any conflicting update to commit/abort
>   *    hufd - output parameter, filled in failure cases (see below)
> + *    row_moved - true iff the tuple is being moved to another partition
> + *                table due to an update of partition key. Otherwise, false.
>   *

I don't think 'row_moved' is a good variable name for this. Moving a row
in our heap format can mean a lot of things. Maybe 'to_other_part' or
'changing_part'?


> +    /*
> +     * Sets a block identifier to the InvalidBlockNumber to indicate such an
> +     * update being moved tuple to another partition.
> +     */
> +    if (row_moved)
> +        BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);

The parens here are set in a bit weird way. I assume that's from copying
it out of ItemPointerSet()?  Why aren't you just using ItemPointerSetBlockNumber()?


I think it'd be better if we followed the example of speculative inserts
and created an equivalent of HeapTupleHeaderSetSpeculativeToken. That'd
be a heck of a lot easier to grep for...


> @@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
>               */
>              if (HeapTupleHeaderIsSpeculative(page_htup))
>                  ItemPointerSet(&page_htup->t_ctid, blkno, off);
> +
> +            /*
> +             * For a deleted tuple, a block identifier is set to the

I think this 'the' is superfluous.


> +             * InvalidBlockNumber to indicate that the tuple has been moved to
> +             * another partition due to an update of partition key.

But I think it should be 'the partition key'.


> +             * Like speculative tuple, to ignore any inconsistency set block
> +             * identifier to current block number.

This doesn't quite parse.


> +             */
> +            if (!BlockNumberIsValid(
> +                    BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
> +                BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
>          }

That formatting looks wrong. I think it should be replaced by a macro
like the one mentioned above.


>          /*
> diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> index 160d941c00..a770531e14 100644
> --- a/src/backend/commands/trigger.c
> +++ b/src/backend/commands/trigger.c
> @@ -3071,6 +3071,11 @@ ltrmark:;
>                      ereport(ERROR,
>                              (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>                               errmsg("could not serialize access due to concurrent update")));
> +                if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> +                    ereport(ERROR,
> +                            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

Yes, given that we repeat this in multiple places, I *definitely* want
to see this wrapped in a macro with a descriptive name.
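
E.g. something along these lines (the names are just a sketch, not a final
API):

#define ItemPointerSetMovedPartitions(pointer) \
	BlockIdSet(&((pointer)->ip_blkid), InvalidBlockNumber)

#define ItemPointerIndicatesMovedPartitions(pointer) \
	(!BlockNumberIsValid(BlockIdGetBlockNumber(&((pointer)->ip_blkid))))

so the check sites shrink to
if (ItemPointerIndicatesMovedPartitions(&hufd.ctid)) ereport(...);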


> diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
> index 7961b4be6a..b07b7092de 100644
> --- a/src/backend/executor/nodeLockRows.c
> +++ b/src/backend/executor/nodeLockRows.c
> @@ -218,6 +218,11 @@ lnext:
>                      ereport(ERROR,
>                              (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>                               errmsg("could not serialize access due to concurrent update")));
> +                if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> +                    ereport(ERROR,
> +                            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
> +

Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
logic to retry serialization failures, and this kind of thing is going
to be resolved by retrying, no?


> diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
> new file mode 100644

I'd like to see tests that show various interactions with ON CONFLICT.

Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
>> diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
>> index 7961b4be6a..b07b7092de 100644
>> --- a/src/backend/executor/nodeLockRows.c
>> +++ b/src/backend/executor/nodeLockRows.c
>> @@ -218,6 +218,11 @@ lnext:
>>                                       ereport(ERROR,
>>                                                       (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>                                                        errmsg("could not serialize access due to concurrent update")));
>> +                             if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>> +                                     ereport(ERROR,
>> +                                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                                      errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>> +
>
> Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
> ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
> logic to retry serialization failures, and this kind of thing is going
> to be resolved by retrying, no?
>

I think it depends: in some cases a retry can help in deleting the
required tuple, but in other cases, like when the user tries to perform
the delete on a particular partition, it won't be successful as the
tuple would have been moved.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Pavan Deolasee
Date:


On Wed, Feb 28, 2018 at 12:38 PM, Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
> On Wed, Feb 14, 2018 at 5:44 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> +# Concurrency error from GetTupleForTrigger
>> +# Concurrency error from ExecLockRows
>>
>> I think you don't need to mention the above sentences in the spec files.
>> Apart from that, your patch looks good to me.  I have marked it as
>> Ready For Committer.
>
> I too have tested this feature with the isolation framework and it looks good to me.


It looks to me that we are trying to fix only one issue here with concurrent updates. What happens if a non-partition-key column is first updated and then a second session updates the partition key?

For example, with your patches applied:

CREATE TABLE pa_target (key integer, val text)
    PARTITION BY LIST (key);
CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
INSERT INTO pa_target VALUES (1, 'initial1');

session1:
BEGIN;
UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
UPDATE 1
postgres=# SELECT * FROM pa_target ;
 key |             val             
-----+-----------------------------
   1 | initial1 updated by update1
(1 row)

session2:
UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE key = 1
<blocks>

session1:
postgres=# COMMIT;
COMMIT

<session1 unblocks and completes its UPDATE>

postgres=# SELECT * FROM pa_target ;
 key |             val             
-----+-----------------------------
   2 | initial1 updated by update2
(1 row)

Ouch. The committed updates by session1 are overwritten by session2. This clearly violates the rules that the rest of the system obeys and is not acceptable IMHO.

Clearly, ExecUpdate(), while moving rows between partitions, is missing out on re-constructing the to-be-updated tuple based on the latest tuple in the update chain. Instead, it's simply deleting the latest tuple and inserting a new tuple in the new partition based on the old tuple. That's simply wrong.

I haven't really thought carefully about whether this should be a separate patch, but it warrants attention. We should at least think through all the different concurrency aspects of partition key updates and aim for a holistic solution, instead of fixing one problem at a time. This probably also shows that isolation tests for partition key updates are either missing (I haven't checked) or need more work.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Pavan Deolasee
Date:

On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:

Thanks for the confirmation, updated patch attached.


I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not do anything to deal with the fact that t_ctid may no longer point to itself to mark the end of the chain. I just can't see how that would work. But if it does, it needs a good amount of comments explaining why, and most likely updating comments at other places where chain following is done. For example, how is this code in heap_get_latest_tid() still valid? Aren't we setting "ctid" to some invalid value here?

2302         /*
2303          * If there's a valid t_ctid link, follow it, else we're done.
2304          */
2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
2308         {
2309             UnlockReleaseBuffer(buffer);
2310             break;
2311         }
2312 
2313         ctid = tp.t_data->t_ctid;

This is just one example. I am almost certain there are many such cases that will require careful attention.

What happens if a partition key update deletes a row, but the operation is aborted? Do we need any special handling for that case?

I am actually worried that we're tinkering with ip_blkid to handle one corner case of detecting a partition key update. This is going to change the on-disk format and probably needs more careful attention. Are we certain that we would never require update-chain following when partition keys are updated? If so, can we think about some other mechanism which actually leaves behind <new_partition, new_ctid>? I am not saying we should do that, but it warrants a thought. Maybe it was discussed somewhere else and ruled out. I happened to notice this patch because of the bug I encountered.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Khandekar
Date:
On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> For example, with your patches applied:
>
> CREATE TABLE pa_target (key integer, val text)
>     PARTITION BY LIST (key);
> CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
> CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
> INSERT INTO pa_target VALUES (1, 'initial1');
>
> session1:
> BEGIN;
> UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
> UPDATE 1
> postgres=# SELECT * FROM pa_target ;
>  key |             val
> -----+-----------------------------
>    1 | initial1 updated by update1
> (1 row)
>
> session2:
> UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
> key = 1
> <blocks>
>
> session1:
> postgres=# COMMIT;
> COMMIT
>
> <session1 unblocks and completes its UPDATE>
>
> postgres=# SELECT * FROM pa_target ;
>  key |             val
> -----+-----------------------------
>    2 | initial1 updated by update2
> (1 row)
>
> Ouch. The committed updates by session1 are overwritten by session2. This
> clearly violates the rules that rest of the system obeys and is not
> acceptable IMHO.
>
> Clearly, ExecUpdate() while moving rows between partitions is missing out on
> re-constructing the to-be-updated tuple, based on the latest tuple in the
> update chain. Instead, it's simply deleting the latest tuple and inserting a
> new tuple in the new partition based on the old tuple. That's simply wrong.

You are right. This needs to be fixed. This is a different issue from
the particular one being worked on in this thread, and the two
issues have different fixes.

Like you said, the tuple needs to be reconstructed when ExecDelete()
finds that the row has been updated by another transaction. We should
send back this information from ExecDelete() (I think the tupleid
parameter gets updated in this case), and then in ExecUpdate() we
should goto lreplace, so that the row is fetched back, similar to
what happens when heap_update() finds that the tuple was updated.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
> On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
>>
>> Thanks for the confirmation, updated patch attached.
>>
>
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
> do anything to deal with the fact that t_ctid may no longer point to itself
> to mark end of the chain. I just can't see how that would work.
>

I think it is not that the patch doesn't care about the end of the chain.
 For example, ctid pointing to itself is used to ensure that for
deleted rows nothing more needs to be done, as in the check below from the
ExecUpdate/ExecDelete code path.
if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
..
}
..

It will deal with such cases by checking for an invalid block id before these
checks.  So, we should be fine in such cases.
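Put concretely, the intended order of checks is roughly as follows (a
sketch assembled from the patch hunks quoted in this thread, not the
exact patch text):

/* after heap_update()/heap_delete() reports HeapTupleUpdated */

/*
 * A ctid carrying InvalidBlockNumber means the row was already moved
 * to another partition: there is no update chain to follow, so report
 * that case before the usual chain-following logic.
 */
if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("tuple to be updated was already moved to another partition due to concurrent update")));

if (!ItemPointerEquals(tupleid, &hufd.ctid))
{
    /* the row was updated in place; fetch and recheck the new version */
}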


> But if it
> does, it needs good amount of comments explaining why and most likely
> updating comments at other places where chain following is done. For
> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
> setting "ctid" to some invalid value here?
>
> 2302         /*
> 2303          * If there's a valid t_ctid link, follow it, else we're done.
> 2304          */
> 2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
> 2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
> 2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
> 2308         {
> 2309             UnlockReleaseBuffer(buffer);
> 2310             break;
> 2311         }
> 2312
> 2313         ctid = tp.t_data->t_ctid;
>

I have not tested, but it seems this could be problematic; I feel
we could deal with such cases by checking for an invalid block id in the
above if check.  Another such case is in EvalPlanQualFetch.

> This is just one example. I am almost certain there are many such cases that
> will require careful attention.
>

Right, I think we should be able to detect and fix such cases.

> What happens if a partition key update deletes a row, but the operation is
> aborted? Do we need any special handling for that case?
>

If the transaction is aborted, then future updates would update the
ctid to a new row; do you see any problem with it?

> I am actually worried that we're tinkering with ip_blkid to handle one
> corner case of detecting partition key update. This is going to change
> on-disk format and probably need more careful attention. Are we certain that
> we would never require update-chain following when partition keys are
> updated?
>

I think we should never need update-chain following when the row is
moved from one partition to another; for other updates, we don't
change anything on the tuple.

> If so, can we think about some other mechanism which actually even
> leaves behind <new_partition, new_ctid>? I am not saying we should do that,
> but it warrants a thought.
>

Oh, this would be a much bigger disk-format change and needs much more
thought, e.g. where would we store the new partition information?

>  May be it was discussed somewhere else and ruled
> out.

There were a couple of other options discussed in the original thread
"UPDATE of partition key".  One of them was to have an additional bit
on the tuple, but we found reusing the ctid a better approach.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Mar 8, 2018 at 11:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> For example, with your patches applied:
>>
>> CREATE TABLE pa_target (key integer, val text)
>>     PARTITION BY LIST (key);
>> CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
>> CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
>> INSERT INTO pa_target VALUES (1, 'initial1');
>>
>> session1:
>> BEGIN;
>> UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
>> UPDATE 1
>> postgres=# SELECT * FROM pa_target ;
>>  key |             val
>> -----+-----------------------------
>>    1 | initial1 updated by update1
>> (1 row)
>>
>> session2:
>> UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
>> key = 1
>> <blocks>
>>
>> session1:
>> postgres=# COMMIT;
>> COMMIT
>>
>> <session1 unblocks and completes its UPDATE>
>>
>> postgres=# SELECT * FROM pa_target ;
>>  key |             val
>> -----+-----------------------------
>>    2 | initial1 updated by update2
>> (1 row)
>>
>> Ouch. The committed updates by session1 are overwritten by session2. This
>> clearly violates the rules that rest of the system obeys and is not
>> acceptable IMHO.
>>
>> Clearly, ExecUpdate() while moving rows between partitions is missing out on
>> re-constructing the to-be-updated tuple, based on the latest tuple in the
>> update chain. Instead, it's simply deleting the latest tuple and inserting a
>> new tuple in the new partition based on the old tuple. That's simply wrong.
>
> You are right. This need to be fixed. This is a different issue than
> the particular one that is being worked upon in this thread, and both
> these issues have different fixes.
>

I also think that this is a bug in the original patch and is not
directly related to the patch being discussed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Khandekar
Date:
On 8 March 2018 at 12:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 11:57 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 8 March 2018 at 09:15, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>>> For example, with your patches applied:
>>>
>>> CREATE TABLE pa_target (key integer, val text)
>>>     PARTITION BY LIST (key);
>>> CREATE TABLE part1 PARTITION OF pa_target FOR VALUES IN (1);
>>> CREATE TABLE part2 PARTITION OF pa_target FOR VALUES IN (2);
>>> INSERT INTO pa_target VALUES (1, 'initial1');
>>>
>>> session1:
>>> BEGIN;
>>> UPDATE pa_target SET val = val || ' updated by update1' WHERE key = 1;
>>> UPDATE 1
>>> postgres=# SELECT * FROM pa_target ;
>>>  key |             val
>>> -----+-----------------------------
>>>    1 | initial1 updated by update1
>>> (1 row)
>>>
>>> session2:
>>> UPDATE pa_target SET val = val || ' updated by update2', key = key + 1 WHERE
>>> key = 1
>>> <blocks>
>>>
>>> session1:
>>> postgres=# COMMIT;
>>> COMMIT
>>>
>>> <session1 unblocks and completes its UPDATE>
>>>
>>> postgres=# SELECT * FROM pa_target ;
>>>  key |             val
>>> -----+-----------------------------
>>>    2 | initial1 updated by update2
>>> (1 row)
>>>
>>> Ouch. The committed updates by session1 are overwritten by session2. This
>>> clearly violates the rules that rest of the system obeys and is not
>>> acceptable IMHO.
>>>
>>> Clearly, ExecUpdate() while moving rows between partitions is missing out on
>>> re-constructing the to-be-updated tuple, based on the latest tuple in the
>>> update chain. Instead, it's simply deleting the latest tuple and inserting a
>>> new tuple in the new partition based on the old tuple. That's simply wrong.
>>
>> You are right. This need to be fixed. This is a different issue than
>> the particular one that is being worked upon in this thread, and both
>> these issues have different fixes.
>>
>
> I also think that this is a bug in the original patch and won't be
> directly related to the patch being discussed.

Yes. Will submit a patch for this in a separate thread.





-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Pavan Deolasee
Date:


On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
> On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
>>
>> Thanks for the confirmation, updated patch attached.
>>
>
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
> do anything to deal with the fact that t_ctid may no longer point to itself
> to mark end of the chain. I just can't see how that would work.
>

I think it is not that the patch doesn't care about the end of the chain.
 For example, ctid pointing to itself is used to ensure that for
deleted rows nothing more needs to be done, as in the check below from the
ExecUpdate/ExecDelete code path.

Yeah, but it only looks at places where it needs to detect deleted tuples and thus wants to throw an error. I am worried about other places where it is assumed that the ctid points to a valid-looking tid, self or otherwise. I see no such places being either updated or commented.

Now maybe there is no danger because of other protections in place, but it looks hazardous.
 

I have not tested, but it seems this could be problematic; I feel
we could deal with such cases by checking for an invalid block id in the
above if check.  Another such case is in EvalPlanQualFetch.


Right.
 

> What happens if a partition key update deletes a row, but the operation is
> aborted? Do we need any special handling for that case?
>

If the transaction is aborted, then future updates would update the
ctid to a new row; do you see any problem with it?

I don't know. Maybe there is none. But it needs to be explained why it's not a problem.
 

> I am actually worried that we're tinkering with ip_blkid to handle one
> corner case of detecting partition key update. This is going to change
> on-disk format and probably need more careful attention. Are we certain that
> we would never require update-chain following when partition keys are
> updated?
>

I think we should never need update-chain following when the row is
moved from one partition to another; for other updates, we don't
change anything on the tuple.

I am not sure I follow. I understand that it's probably a tough problem to follow the update chain from one partition to another. But why do you think we would never need that? What if someone wants to improve on the restriction this patch is imposing and actually implement partition key UPDATEs the way we do for regular tables, i.e. instead of throwing an error, we actually update/delete the row in the new partition?
 

> If so, can we think about some other mechanism which actually even
> leaves behind <new_partition, new_ctid>? I am not saying we should do that,
> but it warrants a thought.
>

Oh, this would be a much bigger disk-format change and needs much more
thought, e.g. where would we store the new partition information?

Yeah, but the disk format will probably change just once. Or maybe this can be done locally to a partition table without requiring any disk-format changes? Like adding a nullable hidden column in each partition to store the forward pointer?
 
Thanks,
Pavan

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Amit Kapila
Date:
On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>> <pavan.deolasee@gmail.com> wrote:
>> >
>> > On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
>> >>
>> >> Thanks for the confirmation, updated patch attached.
>> >>
>> >
>> > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>> > not
>> > do anything to deal with the fact that t_ctid may no longer point to
>> > itself
>> > to mark end of the chain. I just can't see how that would work.
>> >
>>
>> I think it is not that patch doesn't care about the end of the chain.
>>  For example, ctid pointing to itself is used to ensure that for
>> deleted rows, nothing more needs to be done like below check in the
>> ExecUpdate/ExecDelete code path.
>
>
> Yeah, but it only looks for places where it needs to detect deleted tuples
> and thus wants to throw an error. I am worried about other places where it
> is assumed that the ctid points to a valid looking tid, self or otherwise. I
> see no such places being either updated or commented.
>
> Now may be there is no danger because of other protections in place, but it
> looks hazardous.
>

Right, I feel we need some tests to prove it.  From the code, I can
see we need checks in a few more places (like the ones mentioned in
the previous email) apart from the ones this patch has added.

>>
>>
>> I have not tested, but it seems this could be problematic, but I feel
>> we could deal with such cases by checking invalid block id in the
>> above if check.  Another one such case is in EvalPlanQualFetch
>>
>
> Right.
>

Amul, can you please look into the scenario being discussed and see if
you can write a test to see the behavior.

>>
>>
>> > What happens if a partition key update deletes a row, but the operation
>> > is
>> > aborted? Do we need any special handling for that case?
>> >
>>
>> If the transaction is aborted than future updates would update the
>> ctid to a new row, do you see any problem with it?
>
>
> I don't know. May be there is none. But it needs to explained why it's not a
> problem.
>

Sure, I guess in that case we need to explain in the comments why it would
be okay after an abort.

>>
>>
>> > I am actually worried that we're tinkering with ip_blkid to handle one
>> > corner case of detecting partition key update. This is going to change
>> > on-disk format and probably need more careful attention. Are we certain
>> > that
>> > we would never require update-chain following when partition keys are
>> > updated?
>> >
>>
>> I think we should never need update-chain following when the row is
>> moved from one partition to another partition, otherwise, we don't
>> change anything on the tuple.
>
>
> I am not sure I follow. I understand that it's probably a tough problem to
> follow update chain from one partition to another. But why do you think we
> would never need that? What if someone wants to improve on the restriction
> this patch is imposing and actually implement partition key UPDATEs the way
> we do for regular tables i.e. instead of throwing error, we actually
> update/delete the row in the new partition?
>

I think even if we want to lift this restriction, storing a ctid link
into another partition appears to be a major change that somebody could
take up for this feature later.  We had some discussion on this matter earlier,
where Robert and Greg seem to have said something like that as well.
See [1][2].  I think one way could be that if updates/deletes encounter
InvalidBlkID, they can use the metadata of the partitioned table to re-find the
row.  We already had a discussion on this point in the original thread
"UPDATE of partition key" and agreed that throwing an error is the better
way to deal with it.



[1] - https://www.postgresql.org/message-id/CAM-w4HPis7rbnwi%2BoXjnouqMSRAC5DeVcMdxEXTMfDos1kaYPQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CA%2BTgmoY1W-jaS0vH8f%3D5xKQB3EWj5L0XcBf6P7WB7JqbKB3tSQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Thu, Mar 8, 2018 at 3:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 12:52 PM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
>> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>>
>>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>> <pavan.deolasee@gmail.com> wrote:
>>> >
>>> > On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
>>> >>
>>> >> Thanks for the confirmation, updated patch attached.
>>> >>
>>> >
>>> > I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>>> > not
>>> > do anything to deal with the fact that t_ctid may no longer point to
>>> > itself
>>> > to mark end of the chain. I just can't see how that would work.
>>> >
>>>
>>> I think it is not that patch doesn't care about the end of the chain.
>>>  For example, ctid pointing to itself is used to ensure that for
>>> deleted rows, nothing more needs to be done like below check in the
>>> ExecUpdate/ExecDelete code path.
>>
>>
>> Yeah, but it only looks for places where it needs to detect deleted tuples
>> and thus wants to throw an error. I am worried about other places where it
>> is assumed that the ctid points to a valid looking tid, self or otherwise. I
>> see no such places being either updated or commented.
>>
>> Now may be there is no danger because of other protections in place, but it
>> looks hazardous.
>>
>
> Right, I feel we need some tests to prove it, I think as per code I
> can see we need checks in few more places (like the ones mentioned in
> the previous email) apart from where this patch has added.
>
>>>
>>>
>>> I have not tested, but it seems this could be problematic, but I feel
>>> we could deal with such cases by checking invalid block id in the
>>> above if check.  Another one such case is in EvalPlanQualFetch
>>>
>>
>> Right.
>>
>
> Amul, can you please look into the scenario being discussed and see if
> you can write a test to see the behavior.
>

Sure, I'll try.

Regards,
Amul


On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
[.....]
>
>> But if it
>> does, it needs good amount of comments explaining why and most likely
>> updating comments at other places where chain following is done. For
>> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
>> setting "ctid" to some invalid value here?
>>
>> 2302         /*
>> 2303          * If there's a valid t_ctid link, follow it, else we're done.
>> 2304          */
>> 2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
>> 2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
>> 2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
>> 2308         {
>> 2309             UnlockReleaseBuffer(buffer);
>> 2310             break;
>> 2311         }
>> 2312
>> 2313         ctid = tp.t_data->t_ctid;
>>
>
> I have not tested, but it seems this could be problematic, but I feel
> we could deal with such cases by checking invalid block id in the
> above if check.  Another one such case is in EvalPlanQualFetch
>

I tried the following test to hit this code and found that the situation is not
as bad as feared.

heap_get_latest_tid() will follow the chain and return the latest tid only if the
current tuple satisfies the visibility check (via HeapTupleSatisfiesVisibility); in
our case it doesn't, so we are safe here, but I agree with Amit -- it is better
to add an invalid block id check.
In EvalPlanQualFetch() the invalid block id check is already there before the
ItemPointerEquals call.
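Concretely, the break condition in heap_get_latest_tid() could be
extended along these lines (a sketch of the idea, not the final patch
text):

/*
 * If there's a valid t_ctid link, follow it, else we're done.  A ctid
 * carrying InvalidBlockNumber marks a row that was moved to another
 * partition, so treat it as the end of the chain as well.
 */
if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
    HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
    !BlockNumberIsValid(BlockIdGetBlockNumber(&(tp.t_data->t_ctid.ip_blkid))) ||
    ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
{
    UnlockReleaseBuffer(buffer);
    break;
}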

=== TEST ===
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'Initial record');
update foo set b= b || ' -> update1' where a=1;
update foo set b= b || ' -> update2' where a=1;

postgres=# select tableoid::regclass, ctid, * from foo;
 tableoid | ctid  | a |                  b
----------+-------+---+--------------------------------------
 foo1     | (0,3) | 1 | Initial record -> update1 -> update2
(1 row)

postgres=# select currtid2('foo1','(0,1)');
 currtid2
----------
 (0,3)
(1 row)

postgres=# select tableoid::regclass, ctid, * from foo;
 tableoid | ctid  | a |                      b
----------+-------+---+----------------------------------------------
 foo2     | (0,1) | 2 | Initial record -> update1 -> update2-> moved
(1 row)

postgres=# select currtid2('foo1','(0,1)');
 currtid2
----------
 (0,1)
(1 row)

=== END ===


>> This is just one example. I am almost certain there are many such cases that
>> will require careful attention.
>>
>
> Right, I think we should be able to detect and fix such cases.
>

Will look carefully into the places where an ItemPointerEquals() call is
made for heap tuples.

Regards,
Amul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 12:34 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
> do anything to deal with the fact that t_ctid may no longer point to itself
> to mark end of the chain. I just can't see how that would work. But if it
> does, it needs good amount of comments explaining why and most likely
> updating comments at other places where chain following is done. For
> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
> setting "ctid" to some invalid value here?

So the general idea of the patch is that this new kind of marking
marks the CTID chain as "broken" and that code which cares about
following CTID chains forward can see that it's reached a point where
the chain is broken and throw an error saying "hey, I can't do the
stuff we normally do in concurrency scenarions, because the CTID chain
got broken by a cross-partition update".

I don't think it's practical to actually make CTID links across
partitions work.  Certainly not in time for v11.  If somebody wants to
try that at some point in the future, cool.  But that's moving the
goalposts an awfully long way.  When this was discussed about a year
ago, my understanding is that there was a consensus that doing nothing
was not acceptable, but that throwing an error in the cases where
anomalies would have happened was good enough.  I don't think anyone
argued that we had to be able to perfectly mimic the usual EPQ
semantics as a condition of having update tuple routing.  That's
setting the bar at a level that we're not going to be able to reach in
the next couple of weeks.  I suppose we could still decide that if we
can't have that, we don't want update tuple routing at all, but I
think that's an overreaction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
> not do anything to deal with the fact that t_ctid may no longer point to
> itself to mark end of the chain. I just can't see how that would work.
> ...
> I am actually worried that we're tinkering with ip_blkid to handle one
> corner case of detecting partition key update. This is going to change
> on-disk format and probably need more careful attention.

You know, either one of those alone would be scary as hell.  Both in
one patch seem to me to be sufficient reason to reject it outright.
Not only will it be an unending source of bugs, but it's chewing up
far too much of what few remaining degrees-of-freedom we have in the
on-disk format ... for a single purpose that hasn't even been sold as
something we have to have.

Find another way.

            regards, tom lane


Robert Haas <robertmhaas@gmail.com> writes:
> ... I suppose we could still decide that if we
> can't have that, we don't want update tuple routing at all, but I
> think that's an overreaction.

Between this thread and
<CAJ3gD9fRbEzDqdeDq1jxqZUb47kJn+tQ7=Bcgjc8quqKsDViKQ@mail.gmail.com>
I am getting the distinct impression that that feature wasn't ready
to be committed.  I think that reverting it for v11 is definitely
an option that needs to be kept on the table.

            regards, tom lane


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does
>> not do anything to deal with the fact that t_ctid may no longer point to
>> itself to mark end of the chain. I just can't see how that would work.
>> ...
>> I am actually worried that we're tinkering with ip_blkid to handle one
>> corner case of detecting partition key update. This is going to change
>> on-disk format and probably need more careful attention.
>
> You know, either one of those alone would be scary as hell.  Both in
> one patch seem to me to be sufficient reason to reject it outright.
> Not only will it be an unending source of bugs, but it's chewing up
> far too much of what few remaining degrees-of-freedom we have in the
> on-disk format ... for a single purpose that hasn't even been sold as
> something we have to have.

I agree that it isn't clear that it's worth making a change to the
on-disk format for this feature.  I made the argument when it was
first proposed that we should just document that there would be
anomalies with cross-partition updates that didn't occur otherwise.
However, multiple people thought that it was worth burning one of our
precious few remaining infomask bits in order to throw an error in
that case rather than just silently having an anomaly, and that's why
this patch got written.  It's not too late to decide that we'd rather
not do that after all.

However, there's no such thing as a free lunch.  We can't use the CTID
field to point to a CTID in another table because there's no room to
include the identity of the other table in the field.  We can't widen
it to make room because that would break on-disk compatibility and
bloat our already-too-big tuple headers.  So, we cannot make it work
like it does when the updates are confined to a single partition.
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely.  I voted for (1), but the consensus was (2).
I think that (3) will make a lot of people sad; it's a very good
feature.  If we want to have (2), then we've got to have some way to
mark a tuple that was deleted as part of a cross-partition update, and
that requires a change to the on-disk format.

In short, the two things that you are claiming are prohibitively scary
if done in the same patch look to me like they're actually just one
thing, and that one thing is something which absolutely has to be done
in order to implement the design most community members favored in the
original discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> Therefore, the only options are (1) ignore the problem, and let a
> cross-partition update look entirely like a delete+insert, (2) try to
> throw some error in the case where this introduces user-visible
> anomalies that wouldn't be visible otherwise, or (3) revert update
> tuple routing entirely.  I voted for (1), but the consensus was (2).

FWIW, I would also vote for (1), especially if the only way to do (2)
is stuff as outright scary as this.  I would far rather have (3) than
this, because IMO, what we are looking at right now is going to make
the fallout from multixacts look like a pleasant day at the beach.

            regards, tom lane


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Therefore, the only options are (1) ignore the problem, and let a
>> cross-partition update look entirely like a delete+insert, (2) try to
>> throw some error in the case where this introduces user-visible
>> anomalies that wouldn't be visible otherwise, or (3) revert update
>> tuple routing entirely.  I voted for (1), but the consensus was (2).
>
> FWIW, I would also vote for (1), especially if the only way to do (2)
> is stuff as outright scary as this.  I would far rather have (3) than
> this, because IMO, what we are looking at right now is going to make
> the fallout from multixacts look like a pleasant day at the beach.

Whoa.  Well, that would clearly be bad, but I don't understand why you
find this so scary.  Can you explain further?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Pavan Deolasee
Date:


On Thu, Mar 8, 2018 at 10:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

However, there's no such thing as a free lunch.  We can't use the CTID
field to point to a CTID in another table because there's no room to
include the identity of the other table in the field.  We can't widen
it to make room because that would break on-disk compatibility and
bloat our already-too-big tuple headers.  So, we cannot make it work
like it does when the updates are confined to a single partition.
Therefore, the only options are (1) ignore the problem, and let a
cross-partition update look entirely like a delete+insert, (2) try to
throw some error in the case where this introduces user-visible
anomalies that wouldn't be visible otherwise, or (3) revert update
tuple routing entirely.  I voted for (1), but the consensus was (2).
I think that (3) will make a lot of people sad; it's a very good
feature. 

I am definitely not suggesting to do #3, though I agree with Tom that the option is on the table. Maybe it is the two back-to-back bugs in this area that make me worried and raise questions about the amount of testing the feature has got. In addition, making such a significant on-disk change for one corner case, for which even #1 might be acceptable, seems a lot. If we at all want to go in that direction, I would suggest considering a patch that I wrote last year to free up additional bits from the ctid field (as part of WARM). I know Tom did not like that either, but at the very least, it provides us a lot more room for future work, with the same amount of risk.
 
If we want to have (2), then we've got to have some way to
mark a tuple that was deleted as part of a cross-partition update, and
that requires a change to the on-disk format.

I think the question is: isn't there an alternate way to achieve the same result? One alternate way would be to do what I suggested above, i.e. free up more bits and use one of those. Another way would be to add a hidden column to the partition table, when it is created or when it is attached as a partition. This only penalises the partition tables, but keeps the rest of the system out of it. Obviously, if this column is added when the table is attached as a partition, as against at table creation time, then the old tuple may not have room to store this additional field. Maybe we can handle that by double-updating the tuple? That seems bad, but then it only impacts the case when a partition key is updated. And we can clearly document the performance implications of that operation. I am not sure how common this case is going to be anyway. With this hidden column, we can even store a pointer to another partition and do something with that, if at all needed.

That's just one idea. Of course, I haven't thought about it for more than 10 minutes, so most likely I have missed out on details and it's probably a stupid idea after all. But there could be other ideas too. And even if we can't find one, my vote would be to settle for #1 instead of trying to do #2.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Robert Haas
Date:
On Thu, Mar 8, 2018 at 12:25 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> I think the question is: isn't there an alternate way to achieve the same
> result? One alternate way would be to do what I suggested above i.e. free up
> more bits and use one of those.

That's certainly possible, but TBH the CTID field seems like a pretty
good choice for this particular feature.  I mean, we're essentially
trying to indicate that the CTID link is not valid, so using an
invalid value in the CTID field seems like a pretty natural choice.
We could use, say, an infomask bit to indicate that the CTID link is
not valid, but an infomask bit is more precious.  Any two-valued
property can be represented by an infomask bit, but using the CTID
field is only possible for properties that can't be true at the same
time that the CTID field needs to be valid.  So it makes sense that
this property, which can't be true at the same time the CTID field
needs to be valid, should try to use an otherwise-unused bit pattern
for the CTID field itself.

> Another way would be to add a hidden column
> to the partition table, when it is created or when it is attached as a
> partition. This only penalises the partition tables, but keeps rest of the
> system out of it. Obviously, if this column is added when the table is
> attached as a partition, as against at table creation time, then the old
> tuple may not have room to store this additional field. May be we can handle
> that by double updating the tuple? That seems bad, but then it only impacts
> the case when a partition key is updated. And we can clearly document
> performance implications of that operation. I am not sure how common this
> case is going to be anyways. With this hidden column, we can even store a
> pointer to another partition and do something with that, if at all needed.

Sure, but that would mean that partitioned tables would get bigger as
compared with unpartitioned tables, it would break backward
compatibility with v10, and it would require a major redesign of the
system -- the list of "system" columns is deeply embedded in the
system design and previous proposals to add to it have not been met
with wild applause.

> That's just one idea. Of course, I haven't thought about it for more than
> 10mins, so most likely I may have missed out on details and it's probably a
> stupid idea afterall. But there could be other ideas too. And even if we
> can't find one, my vote would be to settle for #1 instead of trying to do
> #2.

Fair enough.  I don't really see a reason why we can't make #2 work.
Obviously, the patch touches the on-disk format and is therefore scary
-- that's why I thought it should be broken out of the main update
tuple routing patch -- but it's far less of a structural change than
Alvaro's multixact work or the WARM stuff, at least according to my
current understanding.  Tom said he thinks it's riskier than the
multixact stuff but I don't see why that should be the case.  That had
widespread impacts on vacuuming and checkpointing that are not at
issue here.  Still, there's no question that it's a scary patch and if
the consensus is now that we don't need it -- so be it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> FWIW, I would also vote for (1), especially if the only way to do (2)
>> is stuff as outright scary as this.  I would far rather have (3) than
>> this, because IMO, what we are looking at right now is going to make
>> the fallout from multixacts look like a pleasant day at the beach.

> Whoa.  Well, that would clearly be bad, but I don't understand why you
> find this so scary.  Can you explain further?

Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
was particularly afraid of multixacts when that went in, and look at all
the trouble we've had with that.  Breaking fundamental invariants like
"ctid points to this tuple or its update successor" is going to cause
trouble.  There's a lot of code that knows that; more than knows the
details of what's in xmax, I believe.

I would've been happier about expending an infomask bit towards this
purpose.  Just eyeing what we've got, I can't help noticing that
HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
in a partitioned table.  Perhaps making these tests depend on
partitioned-ness would be unworkably messy, but it's worth thinking
about.

            regards, tom lane


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Andres Freund
Date:

On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> FWIW, I would also vote for (1), especially if the only way to do
>(2)
>>> is stuff as outright scary as this.  I would far rather have (3)
>than
>>> this, because IMO, what we are looking at right now is going to make
>>> the fallout from multixacts look like a pleasant day at the beach.
>
>> Whoa.  Well, that would clearly be bad, but I don't understand why
>you
>> find this so scary.  Can you explain further?
>
>Possibly I'm crying wolf; it's hard to be sure.  But I recall that
>nobody
>was particularly afraid of multixacts when that went in, and look at
>all
>the trouble we've had with that.  Breaking fundamental invariants like
>"ctid points to this tuple or its update successor" is going to cause
>trouble.  There's a lot of code that knows that; more than knows the
>details of what's in xmax, I believe.
>
>I would've been happier about expending an infomask bit towards this
>purpose.  Just eyeing what we've got, I can't help noticing that
>HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
>in a partitioned table.  Perhaps making these tests depend on
>partitioned-ness would be unworkably messy, but it's worth thinking
>about.

We're pretty much doing so for speculative lock IDs/upsert already. Which doesn't seem to have caused a lot of
problems.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Andres Freund
Date:

On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> FWIW, I would also vote for (1), especially if the only way to do
>(2)
>>> is stuff as outright scary as this.  I would far rather have (3)
>than
>>> this, because IMO, what we are looking at right now is going to make
>>> the fallout from multixacts look like a pleasant day at the beach.
>
>> Whoa.  Well, that would clearly be bad, but I don't understand why
>you
>> find this so scary.  Can you explain further?
>
>Possibly I'm crying wolf; it's hard to be sure.  But I recall that
>nobody
>was particularly afraid of multixacts when that went in, and look at
>all
>the trouble we've had with that.  Breaking fundamental invariants like
>"ctid points to this tuple or its update successor" is going to cause
>trouble.  There's a lot of code that knows that; more than knows the
>details of what's in xmax, I believe.

I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted
update tuple, which might long since have been replaced by an independent row. That's why we have all these
updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Andres Freund <andres@anarazel.de> writes:
> On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Breaking fundamental invariants like
>> "ctid points to this tuple or its update successor" is going to cause
>> trouble.  There's a lot of code that knows that; more than knows the
>> details of what's in xmax, I believe.

> I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted
> update tuple, which might long since have been replaced by an independent row. That's why we have all these
> updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?

No.  In those situations, the conclusion is that the current tuple is
live, which is exactly the wrong conclusion for a cross-partition update.
Or at least it might be the wrong conclusion ... I wonder how this patch
works if the updating transaction aborted.

            regards, tom lane


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
On 2018-03-08 14:25:59 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On March 8, 2018 10:46:53 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Breaking fundamental invariants like
> >> "ctid points to this tuple or its update successor" is going to cause
> >> trouble.  There's a lot of code that knows that; more than knows the
> >> details of what's in xmax, I believe.
> 
> > I don't think this is that big a problem. All code already needs to handle the case where ctid points to an aborted
> > update tuple, which might long since have been replaced by an independent row. That's why we have all these
> > updated.xmax == new.xmin checks. Which will, without any changes, catch the proposed scheme, no?
> 
> No.  In those situations, the conclusion is that the current tuple is
> live, which is exactly the wrong conclusion for a cross-partition
> update.

I don't see the problem you're seeing here. Visibility decisions and
ctid chaining aren't really done in the same way. And in the cases we do
want different behaviour for updates that cross partition boundaries,
the patch adds the error messages. What I was trying to say is not that
we don't need to touch any of those paths, but that there's code to
handle bogus ctid values already. That really wasn't the case for
multixacts (in fact, they broke this check in multiple places).
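
(The updated.xmax == new.xmin checks referred to above have this shape,
as for instance in EvalPlanQualFetch -- a sketch, with priorXmax being
the xmax of the version whose t_ctid link we followed:)

/*
 * The next version's xmin must match the prior version's xmax;
 * otherwise the updater aborted and the TID was reused by an
 * unrelated tuple, so the chain effectively ends here.
 */
if (!TransactionIdEquals(HeapTupleHeaderGetXmin(tuple.t_data),
                         priorXmax))
{
    ReleaseBuffer(buffer);
    return NULL;
}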


> Or at least it might be the wrong conclusion ... I wonder how this patch
> works if the updating transaction aborted.

If the updating transaction aborted, HTSU will return
HeapTupleMayBeUpdated and we can just go ahead and allow an update?

Greetings,

Andres Freund


Hi Andres,

Thanks for your time and the review comments/suggestions.

On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-02-13 12:41:26 +0530, amul sul wrote:
>> From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
>> From: Amul Sul <sulamul@gmail.com>
>> Date: Tue, 13 Feb 2018 12:37:52 +0530
>> Subject: [PATCH 1/2] Invalidate ip_blkid v5
[....]
>
> Very nice and instructive to keep this in a submission's commit message.
>

Thank you.

>
>
>> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
>> index 8a846e7dba..f4560ee9cb 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
>>   *   crosscheck - if not InvalidSnapshot, also check tuple against this
>>   *   wait - true if should wait for any conflicting update to commit/abort
>>   *   hufd - output parameter, filled in failure cases (see below)
>> + *   row_moved - true iff the tuple is being moved to another partition
>> + *                           table due to an update of partition key. Otherwise, false.
>>   *
>
> I don't think 'row_moved' is a good variable name for this. Moving a row
> in our heap format can mean a lot of things. Maybe 'to_other_part' or
> 'changing_part'?
>

Okay, renamed to 'changing_part' in the attached version.

>
>> +     /*
>> +      * Sets a block identifier to the InvalidBlockNumber to indicate such an
>> +      * update being moved tuple to another partition.
>> +      */
>> +     if (row_moved)
>> +             BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), InvalidBlockNumber);
>
> The parens here are set in a bit weird way. I assume that's from copying
> it out of ItemPointerSet()?  Why aren't you just using ItemPointerSetBlockNumber()?
>
>
> I think it'd be better if we followed the example of speculative inserts
> and created an equivalent of HeapTupleHeaderSetSpeculativeToken. That'd
> be a heck of a lot easier to grep for...
>

Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber and
ItemPointerValidBlockNumber macros, though not exactly the same as
HeapTupleHeaderSetSpeculativeToken. Do let me know your thoughts/suggestions.
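For reference, the macros could look roughly like this (a sketch of the
intent; the exact spelling in the attached patch may differ):

/*
 * A tuple moved to another partition carries InvalidBlockNumber in its
 * t_ctid.  These wrappers make that convention greppable, in the
 * spirit of HeapTupleHeaderSetSpeculativeToken.
 */
#define HeapTupleHeaderValidBlockNumber(tup) \
    BlockNumberIsValid(BlockIdGetBlockNumber(&((tup)->t_ctid.ip_blkid)))

#define HeapTupleHeaderSetBlockNumber(tup, blkno) \
    BlockIdSet(&((tup)->t_ctid.ip_blkid), (blkno))

#define ItemPointerValidBlockNumber(pointer) \
    BlockNumberIsValid(BlockIdGetBlockNumber(&((pointer)->ip_blkid)))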

>
>> @@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
>>                        */
>>                       if (HeapTupleHeaderIsSpeculative(page_htup))
>>                               ItemPointerSet(&page_htup->t_ctid, blkno, off);
>> +
>> +                     /*
>> +                      * For a deleted tuple, a block identifier is set to the
>
> I think this 'the' is superfluous.
>

Fixed in the attached version.

>
>> +                      * InvalidBlockNumber to indicate that the tuple has been moved to
>> +                      * another partition due to an update of partition key.
>
> But I think it should be 'the partition key'.
>

Fixed in the attached version.

>
>> +                      * Like speculative tuple, to ignore any inconsistency set block
>> +                      * identifier to current block number.
>
> This doesn't quite parse.
>

I tried to explain it a little more; any help or suggestions to improve it
further would be appreciated.

>
>> +                      */
>> +                     if (!BlockNumberIsValid(
>> +                                     BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)))))
>> +                             BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
>>               }
>
> That formatting looks wrong. I think it should be replaced by a macro
> like mentioned above.
>

Used the HeapTupleHeaderValidBlockNumber & HeapTupleHeaderSetBlockNumber
macros in the attached version.

>
>>               /*
>> diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
>> index 160d941c00..a770531e14 100644
>> --- a/src/backend/commands/trigger.c
>> +++ b/src/backend/commands/trigger.c
>> @@ -3071,6 +3071,11 @@ ltrmark:;
>>                                       ereport(ERROR,
>>                                                       (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>                                                        errmsg("could not serialize access due to concurrent update")));
>> +                             if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>> +                                     ereport(ERROR,
>> +                                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                                      errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>
> Yes, given that we repeat this in multiple places, I *definitely* want
> to see this wrapped in a macro with a descriptive name.
>

Used the ItemPointerValidBlockNumber macro at all such places.

>
>> diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
>> index 7961b4be6a..b07b7092de 100644
>> --- a/src/backend/executor/nodeLockRows.c
>> +++ b/src/backend/executor/nodeLockRows.c
>> @@ -218,6 +218,11 @@ lnext:
>>                                       ereport(ERROR,
>>                                                       (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>                                                        errmsg("could not serialize access due to concurrent update")));
>> +                             if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>> +                                     ereport(ERROR,
>> +                                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                                      errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>> +
>
> Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
> ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
> logic to retry serialization failures, and this kind of thing is going
> to resolved by retrying, no?
>

No change for now; any comments on Amit's response[1]?

>
>> diff --git a/src/test/isolation/expected/partition-key-update-1.out b/src/test/isolation/expected/partition-key-update-1.out
>> new file mode 100644
>
> I'd like to see tests that show various interactions with ON CONFLICT.
>

I've added an isolation test for the ON CONFLICT DO NOTHING case only; ON CONFLICT DO
UPDATE is not yet supported for a partitioned table[2].  But we can exercise that
path with update row movement if we test ON CONFLICT DO UPDATE on the leaf table,
as in the attached TRIAL-on-conflict-do-update-wip.patch -- thoughts?

In addition, I have added the invalid block number check at a few places, as
discussed in [3].  I also added the check in heap_lock_updated_tuple,
rewrite_heap_tuple & EvalPlanQualFetch, where ItemPointerEquals() is used
to conclude that a tuple has been updated/deleted, but I have yet to figure out a
way to hit these changes manually, so I am marking the patch as WIP.

Regards,
Amul

1] https://postgr.es/m/CAA4eK1KFfm4PBbshNSikdru3Qpt8hUoKQWtBYjdVE2R7U9f6iA@mail.gmail.com
2] https://postgr.es/m/20180228004602.cwdyralmg5ejdqkq@alvherre.pgsql
3] https://postgr.es/m/CAAJ_b97BBkRWFowGRs9VNzFykoK0ikGB1yYEsWfYK8xR5enSrw@mail.gmail.com

Attachment
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>>
>> On Tue, Feb 13, 2018 at 12:41 PM, amul sul <sulamul@gmail.com> wrote:
>>>
>>> Thanks for the confirmation, updated patch attached.
>>>
>>
>> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
>> do anything to deal with the fact that t_ctid may no longer point to itself
>> to mark end of the chain. I just can't see how that would work.
>>
>
> I think it is not that patch doesn't care about the end of the chain.
>  For example, ctid pointing to itself is used to ensure that for
> deleted rows, nothing more needs to be done like below check in the
> ExecUpdate/ExecDelete code path.
> if (!ItemPointerEquals(tupleid, &hufd.ctid))
> {
> ..
> }
> ..
>
> It will deal with such cases by checking invalidblockid before these
> checks.  So, we should be fine in such cases.
>
>
>> But if it
>> does, it needs good amount of comments explaining why and most likely
>> updating comments at other places where chain following is done. For
>> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
>> setting "ctid" to some invalid value here?
>>
>> 2302         /*
>> 2303          * If there's a valid t_ctid link, follow it, else we're done.
>> 2304          */
>> 2305         if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
>> 2306             HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
>> 2307             ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
>> 2308         {
>> 2309             UnlockReleaseBuffer(buffer);
>> 2310             break;
>> 2311         }
>> 2312
>> 2313         ctid = tp.t_data->t_ctid;
>>
>
> I have not tested it, but it seems this could be problematic; I feel
> we could deal with such cases by checking for an invalid block id in the
> above if check.  Another such case is in EvalPlanQualFetch
>
>> This is just one example. I am almost certain there are many such cases that
>> will require careful attention.
>>
>
> Right, I think we should be able to detect and fix such cases.
>

I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted.  With the proposed patch
ItemPointerEquals() will no longer work as before; we require an additional check
for updated/deleted tuples, proposed in the latest patch[1]. Do let me know
your thoughts/suggestions on this, thanks.

Regards,
Amul

1] https://postgr.es/m/CAAJ_b96mw5xn5oSQgxpgn5dWFRs1j7OebpHRmXkdSNY+70yYEw@mail.gmail.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>
>>> This is just one example. I am almost certain there are many such cases that
>>> will require careful attention.
>>>
>>
>> Right, I think we should be able to detect and fix such cases.
>>
>
> I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
> EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
> used to check whether a tuple has been updated/deleted.  With the proposed patch
> ItemPointerEquals() will no longer work as before; we require an additional check
> for updated/deleted tuples, proposed in the latest patch[1]. Do let me know
> your thoughts/suggestions on this, thanks.
>

I think you have identified the places correctly.  I have a few
suggestions, though.

1.
- if (!ItemPointerEquals(&tuple->t_self, ctid))
+ if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
+   (!ItemPointerValidBlockNumber(ctid) &&
+    (ItemPointerGetOffsetNumber(&tuple->t_self) ==   /* TODO: Condn. should be macro? */
+ ItemPointerGetOffsetNumber(ctid)))))

Can't we write this and similar tests as:
ItemPointerValidBlockNumber(ctid) &&
!ItemPointerEquals(&tuple->t_self, ctid)?  It will be a bit simpler to
understand and serves the purpose.

2.

if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
  ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
+ !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
  HeapTupleHeaderIsOnlyLocked(mytup.t_data))

I think it is better to keep the check for
HeapTupleHeaderValidBlockNumber before ItemPointerEquals, as it
will first validate that the block number is valid and only then compare the
complete CTID.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
>> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>>
>>>> This is just one example. I am almost certain there are many such cases that
>>>> will require careful attention.
>>>>
>>>
>>> Right, I think we should be able to detect and fix such cases.
>>>
>>
>> I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
>> EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
>> used to check whether a tuple has been updated/deleted.  With the proposed patch
>> ItemPointerEquals() will no longer work as before; we require an additional check
>> for updated/deleted tuples, proposed in the latest patch[1]. Do let me know
>> your thoughts/suggestions on this, thanks.
>>
>
> I think you have identified the places correctly.  I have a few
> suggestions, though.
>
> 1.
> - if (!ItemPointerEquals(&tuple->t_self, ctid))
> + if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
> +   (!ItemPointerValidBlockNumber(ctid) &&
> +    (ItemPointerGetOffsetNumber(&tuple->t_self) ==   /* TODO: Condn. should be macro? */
> + ItemPointerGetOffsetNumber(ctid)))))
>
> Can't we write this and similar tests as:
> ItemPointerValidBlockNumber(ctid) &&
> !ItemPointerEquals(&tuple->t_self, ctid)?  It will be a bit simpler to
> understand and serves the purpose.
>

Yes, you are correct, we need not worry about offset matching -- the invalid block
number check and ItemPointerEquals are more than enough to conclude whether the tuple
has been deleted or not.  Will change the condition accordingly in the next version.

> 2.
>
> if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
>   ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
> + !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
>   HeapTupleHeaderIsOnlyLocked(mytup.t_data))
>
> I think it is better to keep the check for
> HeapTupleHeaderValidBlockNumber before ItemPointerEquals, as it
> will first validate that the block number is valid and only then compare the
> complete CTID.

Sure, will do that.

Thanks for the confirmation and suggestions.

Regards,
Amul


On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
> On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Mar 9, 2018 at 3:18 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>>>>
>>>>> This is just one example. I am almost certain there are many such cases that
>>>>> will require careful attention.
>>>>>
>>>>
>>>> Right, I think we should be able to detect and fix such cases.
>>>>
>>>
>>> I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
>>> EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
>>> used to check whether a tuple has been updated/deleted.  With the proposed patch
>>> ItemPointerEquals() will no longer work as before; we require an additional check
>>> for updated/deleted tuples, proposed in the latest patch[1]. Do let me know
>>> your thoughts/suggestions on this, thanks.
>>>
>>
>> I think you have identified the places correctly.  I have a few
>> suggestions, though.
>>
>> 1.
>> - if (!ItemPointerEquals(&tuple->t_self, ctid))
>> + if (!(ItemPointerEquals(&tuple->t_self, ctid) ||
>> +   (!ItemPointerValidBlockNumber(ctid) &&
>> +    (ItemPointerGetOffsetNumber(&tuple->t_self) ==   /* TODO: Condn. should be macro? */
>> + ItemPointerGetOffsetNumber(ctid)))))
>>
>> Can't we write this and similar tests as:
>> ItemPointerValidBlockNumber(ctid) &&
>> !ItemPointerEquals(&tuple->t_self, ctid)?  It will be a bit simpler to
>> understand and serves the purpose.
>>
>
> Yes, you are correct, we need not worry about offset matching -- the invalid block
> number check and ItemPointerEquals are more than enough to conclude whether the tuple
> has been deleted or not.  Will change the condition accordingly in the next version.
>
>> 2.
>>
>> if (mytup.t_data->t_infomask & HEAP_XMAX_INVALID ||
>>   ItemPointerEquals(&mytup.t_self, &mytup.t_data->t_ctid) ||
>> + !HeapTupleHeaderValidBlockNumber(mytup.t_data) ||
>>   HeapTupleHeaderIsOnlyLocked(mytup.t_data))
>>
>> I think it is better to keep the check for
>> HeapTupleHeaderValidBlockNumber before ItemPointerEquals, as it
>> will first validate that the block number is valid and only then compare the
>> complete CTID.
>
> Sure, will do that.

I did the aforementioned changes in the attached patch, thanks.

Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Mon, Mar 12, 2018 at 6:33 PM, amul sul <sulamul@gmail.com> wrote:
> On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
>> On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> complete CTID.
>>
>> Sure, will do that.
>
> I did the aforementioned changes in the attached patch, thanks.
>

--- a/src/include/storage/itemptr.h
+++ b/src/include/storage/itemptr.h
@@ -23,7 +23,9 @@
  * This is a pointer to an item within a disk page of a known file
  * (for example, a cross-link from an index to its parent table).
  * blkid tells us which block, posid tells us which entry in the linp
- * (ItemIdData) array we want.
+ * (ItemIdData) array we want.  blkid is marked InvalidBlockNumber when
+ * a tuple is moved to another partition relation due to an update of
+ * the partition key.

I think instead of updating this description in itemptr.h, you should
update it in htup_details.h, where we already have a description of
t_ctid.  After this patch, the t_ctid column value in the heap_page_items
function will show as InvalidBlockNumber, and in the documentation
we refer to htup_details.h.  Other than that, the latest version
looks good to me.
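
That is, something along these lines next to the existing t_ctid notes
in htup_details.h (the wording here is only a sketch, not proposed final text):

/*
 * t_ctid is the current TID of this tuple or the TID of a newer tuple
 * version.  Sketch of the addition: if the tuple was moved to another
 * partition by an UPDATE of the partition key, t_ctid is set with an
 * invalid block number and no longer points to an update successor in
 * this relation.
 */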

I have marked this patch as RFC as this is a small change, hope you
can update the patch soon.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Sat, Mar 17, 2018 at 4:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 12, 2018 at 6:33 PM, amul sul <sulamul@gmail.com> wrote:
>> On Mon, Mar 12, 2018 at 11:45 AM, amul sul <sulamul@gmail.com> wrote:
>>> On Sat, Mar 10, 2018 at 5:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> complete CTID.
>>>
>>> Sure, will do that.
>>
>> I did the aforementioned changes in the attached patch, thanks.
>>
>
> --- a/src/include/storage/itemptr.h
> +++ b/src/include/storage/itemptr.h
> @@ -23,7 +23,9 @@
>   * This is a pointer to an item within a disk page of a known file
>   * (for example, a cross-link from an index to its parent table).
>   * blkid tells us which block, posid tells us which entry in the linp
> - * (ItemIdData) array we want.
> + * (ItemIdData) array we want.  blkid is marked InvalidBlockNumber when
> + * a tuple is moved to another partition relation due to an update of
> + * the partition key.
>
> I think instead of updating this description in itemptr.h, you should
> update it in htup_details.h, where we already have a description of
> t_ctid.  After this patch, the t_ctid column value in the heap_page_items
> function will show as InvalidBlockNumber, and in the documentation
> we refer to htup_details.h.  Other than that, the latest version
> looks good to me.
>

Okay, fixed in the attached version.

> I have marked this patch as RFC as this is a small change, hope you
> can update the patch soon.
>

Thank you, updated patch attached.

Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
Hi,

On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Mar 8, 2018 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> FWIW, I would also vote for (1), especially if the only way to do (2)
> >> is stuff as outright scary as this.  I would far rather have (3) than
> >> this, because IMO, what we are looking at right now is going to make
> >> the fallout from multixacts look like a pleasant day at the beach.
> 
> > Whoa.  Well, that would clearly be bad, but I don't understand why you
> > find this so scary.  Can you explain further?
> 
> Possibly I'm crying wolf; it's hard to be sure.  But I recall that nobody
> was particularly afraid of multixacts when that went in, and look at all
> the trouble we've had with that.  Breaking fundamental invariants like
> "ctid points to this tuple or its update successor" is going to cause
> trouble.  There's a lot of code that knows that; more than knows the
> details of what's in xmax, I believe.

Given, as explained nearby, we already do store transient data in the
ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
even a whiff of trouble, I'm currently not inclined to see a huge issue
here.  It'd be great if you could expand on your concerns here a bit, we
gotta figure out a way forward.

I think the proposed patch needs some polish (I'm e.g. unhappy with the
naming of the new macros etc), but I think otherwise it's a reasonable
attempt at solving the problem.


> I would've been happier about expending an infomask bit towards this
> purpose.  Just eyeing what we've got, I can't help noticing that
> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
> in a partitioned table.  Perhaps making these tests depend on
> partitioned-ness would be unworkably messy, but it's worth thinking
> about.

They previously couldn't be set together IIRC, so we could just test (mask &
(HEAP_MOVED_OFF | HEAP_MOVED_IN)) == (HEAP_MOVED_OFF | HEAP_MOVED_IN), but
that'd be permanently eating two infomask bits -- for something that
doesn't in general have to be able to live on tuples, just on (at?) the
"deleted tuple at end of a chain".

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-03-08 13:46:53 -0500, Tom Lane wrote:
>> ... Breaking fundamental invariants like
>> "ctid points to this tuple or its update successor" is going to cause
>> trouble.  There's a lot of code that knows that; more than knows the
>> details of what's in xmax, I believe.

> Given, as explained nearby, we already do store transient data in the
> ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> even a whiff of trouble, I'm currently not inclined to see a huge issue
> here.  It'd be great if you could expand on your concerns here a bit, we
> gotta figure out a way forward.

Just what I said.  There's a lot of code that knows how to follow tuple
update chains, probably not all of it in core, and this will break it.
But only in seldom-exercised corner cases, which is the worst of all
possible worlds from a reliability standpoint.  (I don't think ON CONFLICT
is a counterexample because, IIUC, it's not a persistent state.)

Given that there are other ways we could attack it, I think throwing away
this particular invariant is an unnecessarily risky solution.

>> I would've been happier about expending an infomask bit towards this
>> purpose.  Just eyeing what we've got, I can't help noticing that
>> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
>> in a partitioned table.  Perhaps making these tests depend on
>> partitioned-ness would be unworkably messy, but it's worth thinking
>> about.

> They previously couldn't be set together IIRC, so we could just test (mask &
> (HEAP_MOVED_OFF | HEAP_MOVED_IN)) == (HEAP_MOVED_OFF | HEAP_MOVED_IN), but
> that'd be permanently eating two infomask bits.

Hmm.  That objection only matters if we have realistic intentions of
reclaiming those bits in future, which I've not heard anyone making
serious effort towards.  Rather than messing with the definition of ctid,
I'd be happier with saying that they're never going to be reclaimed, but
at least we're getting one bit's worth of real use out of them.

            regards, tom lane


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
Hi,

On 2018-03-28 13:52:24 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Given, as explained nearby, we already do store transient data in the
> > ctid for speculative insertions (i.e. ON CONFLICT), and it hasn't caused
> > even a whiff of trouble, I'm currently not inclined to see a huge issue
> > here.  It'd be great if you could expand on your concerns here a bit, we
> > gotta figure out a way forward.
> 
> Just what I said.  There's a lot of code that knows how to follow tuple
> update chains, probably not all of it in core, and this will break it.
> But only in seldom-exercised corner cases, which is the worst of all
> possible worlds from a reliability standpoint.

How will it break it? They'll see an invalid ctid and conclude that the
tuple is dead? Without any changes that's already something that can
happen if a later tuple in the chain has been pruned away.  Sure, that
code won't realize it should error out because the tuple is now in a
different partition, but neither would an infomask bit.

I think my big problem is that I just don't see what the worst that can
happen is. We'd potentially see a broken ctid chain, something that very
commonly happens, and consider the tuple to be invisible.  That seems
pretty sane behaviour for unadapted code, and not any worse than other
potential solutions.


> (I don't think ON CONFLICT is a counterexample because, IIUC, it's not
> a persistent state.)

Hm, it can be persistent if we error out at the right moment. But it's
not super common to encounter that over a long time, I grant you
that. Not that this'd be super persistent either; vacuum/pruning would
normally remove the tuple as well, it's dead after all.


> >> I would've been happier about expending an infomask bit towards this
> >> purpose.  Just eyeing what we've got, I can't help noticing that
> >> HEAP_MOVED_OFF/HEAP_MOVED_IN couldn't possibly be set in any tuple
> >> in a partitioned table.  Perhaps making these tests depend on
> >> partitioned-ness would be unworkably messy, but it's worth thinking
> >> about.
> 
> > They previously couldn't be set together IIRC, so we could just test (mask &
> > (HEAP_MOVED_OFF | HEAP_MOVED_IN)) == (HEAP_MOVED_OFF | HEAP_MOVED_IN), but
> > that'd be permanently eating two infomask bits.
> 
> Hmm.  That objection only matters if we have realistic intentions of
> reclaiming those bits in future, which I've not heard anyone making
> serious effort towards.

I plan to submit a patch early in v12 that keeps track of the last time a
table has been fully scanned (and when it was created), with part of the
goal being debuggability and part being the ability to reclaim things like
these bits.


Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Robert Haas
Date:
On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
> How will it break it? They'll see an invalid ctid and conclude that the
> tuple is dead? Without any changes that's already something that can
> happen if a later tuple in the chain has been pruned away.  Sure, that
> code won't realize it should error out because the tuple is now in a
> different partition, but neither would an infomask bit.
>
> I think my big problem is that I just don't see what the worst that can
> happen is. We'd potentially see a broken ctid chain, something that very
> commonly happens, and consider the tuple to be invisible.  That seems
> pretty sane behaviour for unadapted code, and not any worse than other
> potential solutions.

This is more or less my feeling as well.  I think it's better to
conserve our limited supply of infomask bits as much as we can, and I
do think that we should try to reclaim HEAP_MOVED_IN and
HEAP_MOVED_OFF in the future instead of defining the combination of
the two of them to mean something now.

The only scenario in which I can see this patch really leading to
disaster is if there's some previous release out there where the bit
pattern chosen by this patch has some other, incompatible meaning.  As
far as we know, that's not the case: this bit pattern was previously
unused.  Code seeing that bit pattern could potentially therefore (1)
barf on the valid CTID, but the whole point of this is to throw an
ERROR anyway, so if that happens then we're getting basically the
right behavior with the wrong error message or (2) just treat it as a
broken CTID link, in which case the result should be pretty much the
same as if this patch hadn't been committed in the first place.

Where the multixact patch really caused us a lot of trouble is that
the implications weren't just for the heap itself -- the relevant
SLRUs became subject to new retention requirements which in turn
affected vacuum, autovacuum, and checkpoint behavior.  There is no
similar problem here -- the flag indicating the problematic situation,
however it ends up being stored, doesn't point to any external data.
Now, that doesn't mean there can't be some other kind of problem with
this patch, but I don't think that we should block the patch on the
theory that it might have an undiscovered problem that destroys the
entire PostgreSQL ecosystem with no theory as to what that problem
might actually be.  Modulo implementation quality, I think the risk
level of this patch is somewhat but not vastly higher than
37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
previously-unused bit pattern in the tuple header.  The reason I think
this one might be somewhat riskier is because AFAICS it's not so easy
to make sure we've found all the code, even in core, that might care,
as it was in that case; and also because updates happen more than
freezing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
> On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> >> diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
> >> index 7961b4be6a..b07b7092de 100644
> >> --- a/src/backend/executor/nodeLockRows.c
> >> +++ b/src/backend/executor/nodeLockRows.c
> >> @@ -218,6 +218,11 @@ lnext:
> >>                                       ereport(ERROR,
> >>                                                       (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
> >>                                                        errmsg("could not serialize access due to concurrent update")));
> >> +                             if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
> >> +                                     ereport(ERROR,
> >> +                                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >> +                                                      errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
> >> +
> >
> > Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
> > ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
> > logic to retry serialization failures, and this kind of thing is going
> > to resolved by retrying, no?
> >
> 
> I think it depends, in some cases retry can help in deleting the
> required tuple, but in other cases like when the user tries to perform
> delete on a particular partition table, it won't be successful as the
> tuple would have been moved.

So? In that case the retry will not find the tuple, which'll also
resolve the issue. Preventing frameworks from dealing with this seems
like a way worse issue than that.

Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Wed, Apr 4, 2018 at 4:31 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-03-06 19:57:03 +0530, Amit Kapila wrote:
>> On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> >> diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
>> >> index 7961b4be6a..b07b7092de 100644
>> >> --- a/src/backend/executor/nodeLockRows.c
>> >> +++ b/src/backend/executor/nodeLockRows.c
>> >> @@ -218,6 +218,11 @@ lnext:
>> >>                                       ereport(ERROR,
>> >>                                                       (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>> >>                                                        errmsg("could not serialize access due to concurrent update")));
>> >> +                             if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid))))
>> >> +                                     ereport(ERROR,
>> >> +                                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> >> +                                                      errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>> >> +
>> >
>> > Why are we using ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE rather than
>> > ERRCODE_T_R_SERIALIZATION_FAILURE?  A lot of frameworks have builtin
>> > logic to retry serialization failures, and this kind of thing is going
>> > to resolved by retrying, no?
>> >
>>
>> I think it depends, in some cases retry can help in deleting the
>> required tuple, but in other cases like when the user tries to perform
>> delete on a particular partition table, it won't be successful as the
>> tuple would have been moved.
>
> So? In that case the retry will not find the tuple, which'll also
> resolve the issue. Preventing frameworks from dealing with this seems
> like a way worse issue than that.
>

The idea was just that apps should get an error so that they can
take appropriate action (either retry or whatever they want); we don't
want to silently make it a no-delete op.  The error the patch is
currently throwing appears similar to what we already do in a
delete/update operation, with the difference that here we are trying to
delete a moved tuple.

heap_delete()
{
..
if (result == HeapTupleInvisible)
{
UnlockReleaseBuffer(buffer);
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("attempted to delete invisible tuple")));
}
..
}

I think if we want users to always retry on this operation, then
ERRCODE_T_R_SERIALIZATION_FAILURE is a better error code.
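
In that case the ereport would simply become (same message, different
SQLSTATE, so that client-side retry logic keyed on serialization failures
can catch it):

ereport(ERROR,
        (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
         errmsg("tuple to be locked was already moved to another partition due to concurrent update")));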

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
Hi,

On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
> On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
> > How will it break it? They'll see an invalid ctid and conclude that the
> > tuple is dead? Without any changes that's already something that can
> > happen if a later tuple in the chain has been pruned away.  Sure, that
> > code won't realize it should error out because the tuple is now in a
> > different partition, but neither would an infomask bit.
> >
> > I think my big problem is that I just don't see what the worst that can
> > happen is. We'd potentially see a broken ctid chain, something that very
> > commonly happens, and consider the tuple to be invisible.  That seems
> > pretty sane behaviour for unadapted code, and not any worse than other
> > potential solutions.
>
> This is more or less my feeling as well.  I think it's better to
> conserve our limited supply of infomask bits as much as we can, and I
> do think that we should try to reclaim HEAP_MOVED_IN and
> HEAP_MOVED_OFF in the future instead of defining the combination of
> the two of them to mean something now.

Yep.

It'd also make locking more complicated or require keeping more
information around in HeapUpdateFailureData. In a number of places we
currently release the buffer pin before switching over heap_lock_tuple
etc. results, or there's not even a way to get at the infomask currently
(heap_update failing).


> Modulo implementation quality, I think the risk
> level of this patch is somewhat but not vastly higher than
> 37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
> previously-unused bit pattern in the tuple header.

Personally I think that change was vastly riskier, because it affected
freezing and wraparounds. Which is something we've repeatedly gotten
wrong.


> The reason I think this one might be somewhat riskier is because
> AFAICS it's not so easy to make sure we've found all the code, even in
> core, that might care, as it was in that case; and also because
> updates happen more than freezing.

But the consequences of not catching a changed piece of code are fairly
harmless. And I'd say things that happen more often are actually easier
to validate than something that, with default settings, requires hours of
testing...

I've attached a noticeably editorialized patch:

- I'm uncomfortable with the "moved" information not being crash-safe /
  replicated. Thus I added a new flag to preserve it, and removed the
  masking of the moved bit in the ctid from heap_mask().

- renamed macros to not mention valid / invalid block numbers, but
  rather
  HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
  and
  ItemPointerSetMovedPartitions /  ItemPointerIndicatesMovedPartitions

  I'm not wedded to these names, but I'll be adamant that they're not
  talking about invalid block numbers; that makes the code harder to
  understand imo.

- removed new assertion from heap_get_latest_tid(), it's wrong for the
  case where all row versions are invisible.

- editorialized comments a bit

- added a few more assertions


I went through the existing code
a) to make sure that no checks were missed
b) to evaluate what the consequences would be when chasing chains
c) to evaluate what the consequences would be when we miss erroring out

WRT b), it's usually just superfluous extra work if the new checks
weren't there. I went through all callers accessing xmax (via GetRawXmax
and GetUpdateXid):

b)
- heap rewrites will keep a tuple in hashtable till end of run, then
  reset the ctid to self. No real corruption, but we'd not detect
  further errors when attempting to follow chain.
- EvalPlanQualFetch would fail to abort loop, attempt to fetch
  tuple. This'll extend the relation by a single page, because P_NEW ==
  InvalidBlockNumber.
- heap_prune_chain - no changes needed (delete isn't recursed through)
- heap_get_root_tuples - same
- heap_hot_search_buffer - only continues over hot updates
- heap_lock_tuple (and subsidiary routines) - same as EvalPlanQualFetch,
  would then return HeapTupleUpdated.

c)
- GetTupleForTrigger - the proper error wouldn't be raised, instead a
  NULL tuple would be passed to the trigger
- EvalPlanQualFetch - a NULL tuple would be returned after the
  consequences above
- RelationFindReplTupleBy* - wrong error message
- ExecLockRows - no error would be raised, continue normally
- ExecDelete() - tuple ignored without error
- ExecUpdate() - same


Questions:

- I'm not perfectly happy with
  "tuple to be locked was already moved to another partition due to concurrent update"
  as the error message. If somebody has a better suggestion.
- should heap_get_latest_tid() error out when the chain ends in a moved
  tuple?  I personally think it doesn't matter much, the functionality
  is so bonkers and underspecified that it doesn't matter anyway ;)
- I'm not that happy with the number of added spec test files with
  number postfixes. Can't we combine them into a single file?
- as remarked elsewhere on this thread, I think the used errcode should
  be a serialization failure

Greetings,

Andres Freund

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>
> I've attached a noticeably editorialized patch:
>
> - I'm uncomfortable with the "moved" information not being crash-safe /
>   replicated. Thus I added a new flag to preserve it, and removed the
>   masking of the moved bit in the ctid from heap_mask().
>
> - renamed macros to not mention valid / invalid block numbers, but
>   rather
>   HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
>   and
>   ItemPointerSetMovedPartitions /  ItemPointerIndicatesMovedPartitions
>
>   I'm not wedded to these names, but I'll be adamant that they're not
>   talking about invalid block numbers; that makes the code harder to
>   understand imo.
>

The new names for macros make the code easier to understand.

> - removed new assertion from heap_get_latest_tid(), it's wrong for the
>   case where all row versions are invisible.
>

Why?  tid is both an input and an output parameter.  The input tid is
valid and is verified at the top of the function; now, if no row
version is visible, then it should have the same value as the passed tid.
I am not saying that it was super important to have that assertion,
but if it is valid then it can catch a case where we might have missed
checking a tuple which has an invalid block number (essentially the case
introduced by the patch).

I assume you are talking about the below assertion:

+
+ /* Make sure that the return value has a valid block number */
+ Assert(ItemPointerValidBlockNumber(tid));


>
> Questions:
>
> - I'm not perfectly happy with
>   "tuple to be locked was already moved to another partition due to concurrent update"
>   as the error message. If somebody has a better suggestion.
>

I don't have any better suggestion, but I have noticed a small
inconsistency in the message.  In case of delete, the message is
"tuple to be updated was ...". I think here it should be "tuple to be
deleted was ...".

> - should heap_get_latest_tid() error out when the chain ends in a moved
>   tuple?

Won't the same question apply to the similar usage in
EvalPlanQualFetch and heap_lock_updated_tuple_rec?  In
EvalPlanQualFetch, we consider such a tuple to be deleted and will
silently miss/skip it, which seems contradictory to the places where we
have detected such a situation and raised an error.  In
heap_lock_updated_tuple_rec, we will skip locking the versions of a
tuple after we encounter a tuple version that is moved to another
partition.

> - I'm not that happy with the number of added spec test files with
>   number postfixes. Can't we combine them into a single file?

+1 for doing so.

> - as remarked elsewhere on this thread, I think the used errcode should
>   be a serialization failure
>

No problem.  I was saying upthread that the error code used has some
precedent in the code for similar usage, but we have precedent for the
serialization failure error code as well, so it should be okay to use
it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
"David G. Johnston"
Date:
On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>>
>> Questions:
>>
>> - I'm not perfectly happy with
>>   "tuple to be locked was already moved to another partition due to concurrent update"
>>   as the error message. If somebody has a better suggestion.
>>
>
> I don't have any better suggestion, but I have noticed a small
> inconsistency in the message.  In case of delete, the message is
> "tuple to be updated was ...". I think here it should be "tuple to be
> deleted was ..."

The whole "moved to another partition" explains why and seems better placed in the errdetail.  The error itself should indicate which attempted action failed.  And the attempted action for the end user usually isn't the scope of "locked tuple" - it's the insert or update, the locking is a side effect (why).

David J.

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Thu, Apr 5, 2018 at 10:40 AM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>> >
>>
>> > Questions:
>> >
>> > - I'm not perfectly happy with
>> >   "tuple to be locked was already moved to another partition due to
>> > concurrent update"
>> >   as the error message. If somebody has a better suggestions.
>> >
>>
>> I don't have any better suggestion, but I have noticed a small
>> inconsistency in the message.  In case of delete, the message is
>> "tuple to be updated was ...". I think here it should be "tuple to be
>> deleted was ..."
>
>
> The whole "moved to another partition" explains why and seems better placed
> in the errdetail.  The error itself should indicate which attempted action
> failed.  And the attempted action for the end user usually isn't the scope
> of "locked tuple" - it's the insert or update, the locking is a side effect
> (why).
>

I don't think locking is just a side effect; it will be used when the
user tries to lock a tuple via "Select .. For Key Share".



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Pavan Deolasee
Date:


On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:


I've attached a noticeably editorialized patch:



+           /*
+            * As long as we don't support an UPDATE of INSERT ON CONFLICT for
+            * a partitioned table we shouldn't reach to a case where tuple to
+            * be lock is moved to another partition due to concurrent update
+            * of the partition key.
+            */
+           Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
+
 
This is no longer true; at least not entirely. We still don't support ON CONFLICT DO UPDATE to move a row to a different partition, but otherwise it works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715.

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
On 2018-04-04 22:10:06 -0700, David G. Johnston wrote:
> On Wednesday, April 4, 2018, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> > On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
> > >
> >
> > > Questions:
> > >
> > > - I'm not perfectly happy with
> > >   "tuple to be locked was already moved to another partition due to
> > concurrent update"
> > >   as the error message. If somebody has a better suggestions.
> > >
> >
> > I don't have any better suggestion, but I have noticed a small
> > inconsistency in the message.  In case of delete, the message is
> > "tuple to be updated was ...". I think here it should be "tuple to be
> > deleted was ..."
> >
> 
> The whole "moved to another partition" explains why and seems better placed
> in the errdetail.  The error itself should indicate which attempted action
> failed.  And the attempted action for the end user usually isn't the scope
> of "locked tuple" - it's the insert or update, the locking is a side effect
> (why).

Well, update/delete have their own messages; I don't think you can get
this for inserts (there'd be no tuple to follow across EPQ). The case I
copied from above was locking a tuple, hence the reference to that.

I don't agree with moving "moved to another partition" to errdetail,
that's *the* crucial detail. If there's anything in the error message,
it should be that.

Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Alvaro Herrera
Date:
Pavan Deolasee wrote:
> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:

> +           /*
> +            * As long as we don't support an UPDATE of INSERT ON CONFLICT for
> +            * a partitioned table we shouldn't reach to a case where tuple to
> +            * be lock is moved to another partition due to concurrent update
> +            * of the partition key.
> +            */
> +           Assert(!ItemPointerIndicatesMovedPartitions(&hufd.ctid));
> +
> 
> This is no longer true; at least not entirely. We still don't support ON
> CONFLICT DO UPDATE to move a row to a different partition, but otherwise it
> works now. See 555ee77a9668e3f1b03307055b5027e13bf1a715.

Right.  So I think the assert() should remain, but the comment should
say "As long as we don't support moving a tuple to a different partition
during INSERT ON CONFLICT DO UPDATE on a partitioned table, ..."

FWIW I think the code flow is easier to read with the renamed macros.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
On 2018-04-05 10:17:59 +0530, Amit Kapila wrote:
> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
> Why?  tid is both an input and an output parameter.  The input tid is
> valid and is verified at the top of the function; now, if no row
> version is visible, then it should have the same value as the passed tid.
> I am not saying that it was super important to have that assertion,
> but if it is valid then it can catch a case where we might have missed
> checking a tuple which has an invalid block number (essentially the case
> introduced by the patch).

You're right. It's bonkers that the output parameter isn't set to an
invalid value if the tuple isn't found. Makes the whole function
entirely useless.

> > - I'm not perfectly happy with
> >   "tuple to be locked was already moved to another partition due to concurrent update"
> >   as the error message. If somebody has a better suggestion.
> >
> 
> I don't have any better suggestion, but I have noticed a small
> inconsistency in the message.  In case of delete, the message is
> "tuple to be updated was ...". I think here it should be "tuple to be
> deleted was ...".

Yea, I noticed that too. Note that the message a few lines up is
similarly wrong:
ereport(ERROR,
        (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
         errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
         errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));


> > - should heap_get_latest_tid() error out when the chain ends in a moved
> >   tuple?
> 
> Won't the same question apply to the similar usage in
> EvalPlanQualFetch and heap_lock_updated_tuple_rec?

I don't think so?


> In EvalPlanQualFetch, we consider such a tuple to be deleted and will
> silently miss/skip it which seems contradictory to the places where we
> have detected such a situation and raised an error.

if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));


> In heap_lock_updated_tuple_rec, we will skip locking the versions of a
> tuple after we encounter a tuple version that is moved to another
> partition.

I don't think that's true? We'll not lock *any* tuple in that case, but
return HeapTupleUpdated. Which callers then interpret in whatever way
they need to?

Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Fri, Apr 6, 2018 at 1:13 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-05 10:17:59 +0530, Amit Kapila wrote:
>> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>> Why?  tid is both an input and an output parameter.  The input tid is
>> valid and is verified at the top of the function; now, if no row
>> version is visible, then it should have the same value as the passed tid.
>> I am not saying that it was super important to have that assertion,
>> but if it is valid then it can catch a case where we might have missed
>> checking a tuple which has an invalid block number (essentially the case
>> introduced by the patch).
>
> You're right. It's bonkers that the output parameter isn't set to an
> invalid value if the tuple isn't found. Makes the whole function
> entirely useless.
>

Yeah, kind of, but I think the same is noted in the function comments and
the caller is prepared to deal with it.  See the comments atop
heap_get_latest_tid ("Note that it will not be changed if no version of
the row passes the snapshot test.").  The caller (TidNext) will just
ignore such a TID and move to the next TID.  I think you have a point that
one might have designed it differently by setting the output value to an
invalid value, which would let the caller detect it easily.  In short,
it's just a matter of choice whether we want to have the Assert as Amul
has it in his patch or just leave it out.  It should be okay either way.

>
>
>> > - should heap_get_latest_tid() error out when the chain ends in a moved
>> >   tuple?
>>
>> Won't the same question apply to the similar usage in
>> EvalPlanQualFetch and heap_lock_updated_tuple_rec?
>
> I don't think so?
>
>
>> In EvalPlanQualFetch, we consider such a tuple to be deleted and will
>> silently miss/skip it which seems contradictory to the places where we
>> have detected such a situation and raised an error.
>
> if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
>         ereport(ERROR,
>                         (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                          errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>
>

I was talking about the case when the tuple version is not visible aka
the below code:

/*
* If we get here, the tuple was found but failed SnapshotDirty.
..
*/
if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data) ||
ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
{
/* deleted, so forget about it */
ReleaseBuffer(buffer);
return NULL;
}

Normally, if the tuple had been updated such that it landed in
the same partition, the chain would have continued, but now,
because the tuple is moved to another partition, we end the chain
without letting the user know about it.  See below for the
repercussion of the same.

>> In heap_lock_updated_tuple_rec, we will skip locking the versions of a
>> tuple after we encounter a tuple version that is moved to another
>> partition.
>
> I don't think that's true? We'll not lock *any* tuple in that case,
>

I think what will happen is that we will end up locking some versions
in the chain and then silently skipping others.  See the example below:

Setup
-----------
postgres=# create table t1(c1 int, c2 varchar) partition by range(c1);
CREATE TABLE
postgres=# create table t1_part_1 partition of t1 for values from (1) to (100);
CREATE TABLE
postgres=# create table t1_part_2 partition of t1 for values from (100) to (200);
CREATE TABLE
postgres=# insert into t1 values(1, 'aaa');
INSERT 0 1

Session-1
---------------
postgres=# begin;
BEGIN
postgres=# update t1 set c2='bbb' where c1=1;
UPDATE 1
postgres=# update t1 set c2='ccc' where c1=1;
UPDATE 1
postgres=# update t1 set c1=102 where c1=1;
UPDATE 1

Session-2
----------------
postgres=# begin;
BEGIN
postgres=# select * from t1 where c1=1 for key share;


Here, the Session-2 will lock one of the tuple versions and then wait
for Session-1 to end (as there is a conflicting update).  Now, commit
the transaction in Session-1.

Session-1
---------------
Commit;

Now the Session-2 will skip the latest version of a tuple as it is
moved to another partition.

Session-2
----------------
postgres=# select * from t1 where c1=1 for key share;
 c1 | c2
----+----
(0 rows)

The end result is that Session-2 will lock one of the versions of the
tuple and silently skip locking the latest version of the tuple.  I
feel that is slightly confusing behavior compared to the current
behavior, i.e. when tuple updates land in the same partition.

I think if we return an error in EvalPlanQualFetch at the place
mentioned above, the behavior will be sane.
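
Concretely, on the SnapshotDirty-failure path quoted above, something
like this (a sketch; the macro is from your patch):

if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data))
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("tuple to be locked was already moved to another partition due to concurrent update")));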

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Thu, Apr 5, 2018 at 10:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>>
[...]
>>
>> Questions:
>>
>> - I'm not perfectly happy with
>>   "tuple to be locked was already moved to another partition due to concurrent update"
>>   as the error message. If somebody has a better suggestion.
>>
>
> I don't have any better suggestion, but I have noticed a small
> inconsistency in the message.  In case of delete, the message is
> "tuple to be updated was ...". I think here it should be "tuple to be
> deleted was ...".
>

+1, will do the error message change in ExecDelete.

>> - should heap_get_latest_tid() error out when the chain ends in a moved
>>   tuple?
>
> Won't the same question apply to the similar usage in
> EvalPlanQualFetch and heap_lock_updated_tuple_rec?  In
> EvalPlanQualFetch, we consider such a tuple to be deleted and will
> silently miss/skip it which seems contradictory to the places where we
> have detected such a situation and raised an error.  In
> heap_lock_updated_tuple_rec, we will skip locking the versions of a
> tuple after we encounter a tuple version that is moved to another
> partition.
>
>> - I'm not that happy with the number of added spec test files with
>>   number postfixes. Can't we combine them into a single file?
>
> +1 for doing so.

Agreed, we could combine specs 1/2/3 into a single file, which does the error
check; for specs 4/5, imho, let them be, as they check a different
scenario (ON CONFLICT DO NOTHING on the moved tuple) and also
resemble the existing ON CONFLICT isolation tests.

Will post a rebased version of Andres' patch[1] including the aforementioned
changes within an hour, thanks


1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de

Regards,
Amul


On Fri, Apr 6, 2018 at 12:07 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Apr 5, 2018 at 10:17 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
>>>
> [...]
>>>
>>> Questions:
>>>
>>> - I'm not perfectly happy with
>>>   "tuple to be locked was already moved to another partition due to concurrent update"
>>>   as the error message. If somebody has a better suggestion.
>>>
>>
>> I don't have any better suggestion, but I have noticed a small
>> inconsistency in the message.  In case of delete, the message is
>> "tuple to be updated was ...". I think here it should be "tuple to be
>> deleted was ...".
>>
>
> +1, will do the error message change in ExecDelete.
>
>>> - should heap_get_latest_tid() error out when the chain ends in a moved
>>>   tuple?
>>
>> Won't the same question apply to the similar usage in
>> EvalPlanQualFetch and heap_lock_updated_tuple_rec?  In
>> EvalPlanQualFetch, we consider such a tuple to be deleted and will
>> silently miss/skip it which seems contradictory to the places where we
>> have detected such a situation and raised an error.  In
>> heap_lock_updated_tuple_rec, we will skip locking the versions of a
>> tuple after we encounter a tuple version that is moved to another
>> partition.
>>
>>> - I'm not that happy with the number of added spec test files with
>>>   number postfixes. Can't we combine them into a single file?
>>
>> +1 for doing so.
>
> Agreed, we could combine specs 1/2/3 into a single file, which does the error
> check; for specs 4/5, imho, let them be, as they check a different
> scenario (ON CONFLICT DO NOTHING on the moved tuple) and also
> resemble the existing ON CONFLICT isolation tests.
>
> Will post a rebased version of Andres' patch[1] including the aforementioned
> changes within an hour, thanks
>
>
> 1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
>

Updated patch attached.

Regards,
Amul

Attachment
On Thu, Apr 5, 2018 at 7:14 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
>> On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
[....]
> I've attached a noticeably editorialized patch:
>
> - I'm uncomfortable with the "moved" information not being crash-safe /
>   replicated. Thus I added a new flag to preserve it, and removed the
>   masking of the moved bit in the ctid from heap_mask().
>
> - renamed macros to not mention valid / invalid block numbers, but
>   rather
>   HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
>   and
>   ItemPointerSetMovedPartitions /  ItemPointerIndicatesMovedPartitions
>
>   I'm not wedded to these names, but I'll be adamant that they're not
>   talking about invalid block numbers; that makes the code harder to
>   understand imo.
>

These names are much better than before, thanks.

One concern -- instead of xxxMovedPartitions can we have
xxxPartitionChanged or xxxChangedPartition?

xxxMovedPartitions looks (at least to me) like the partitions themselves are
moved. In other databases there is a maintenance command to move a partition
from one tablespace to another; the current naming is fine as long as we don't
support the same, but if we do then these names will be confusing.
Comments/thoughts?

Regards,
Amul


Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Amit Kapila
Date:
On Fri, Apr 6, 2018 at 12:50 PM, amul sul <sulamul@gmail.com> wrote:
>
> Updated patch attached.
>

+ if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("tuple to be locked was already moved to another partition
due to concurrent update")));

As suggested by Andres, I think you should change the error code to
serialization failure, i.e. ERRCODE_T_R_SERIALIZATION_FAILURE.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


On Fri, Apr 6, 2018 at 1:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Apr 6, 2018 at 12:50 PM, amul sul <sulamul@gmail.com> wrote:
>>
>> Updated patch attached.
>>
>
> + if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
>
> As suggested by Andres, I think you should change the error code to
> serialization failure, i.e. ERRCODE_T_R_SERIALIZATION_FAILURE.
>

Thanks for the reminder -- fixed in the attached version.

Regards,
Amul

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE ofpartition key

From
Andres Freund
Date:
Hi Tom, All,

On 2018-04-06 14:19:02 +0530, amul sul wrote:
> Thanks for the reminder -- fixed in the attached version.

Tom, this seems to be the best approach for fixing the visibility issues
around this. I've spent a good chunk of time looking at corruption
issues like the ones you feared (see [1]) and I'm not particularly
concerned.  I'm currently planning to go ahead with this, do you want to
"veto" that (informally, not formally)?

I'll go through this again tomorrow morning.

[1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de


> v9:
>  It's the rebased version of Andres Freund's patch v8[1] with the
>  following additional changes:
>  3. Argument changing_part of heap_delete renamed to ChangingPart to be
>     consistent with ExecDelete

FWIW, I'd left it as it was before because the two functions have a bit
different coding style, and the capitalization seemed more fitting in
the surrounding context.

> +test: partition-key-update-1
> +test: partition-key-update-2
> +test: partition-key-update-3

Can you give these more descriptive names please (or further combine them)?

Greetings,

Andres Freund


On Sat, Apr 7, 2018 at 9:08 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi Tom, All,
>
> On 2018-04-06 14:19:02 +0530, amul sul wrote:
>> Thanks for the reminder -- fixed in the attached version.
>
> Tom, this seems to be the best approach for fixing the visibility issues
> around this. I've spent a good chunk of time looking at corruption
> issues like the ones you feared (see [1]) and I'm not particularly
> concerned.  I'm currently planning to go ahead with this, do you want to
> "veto" that (informally, not formally)?
>
> I'll go through this again tomorrow morning.
>
> [1] https://postgr.es/m/20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
>
>
>> v9:
>>  It's the rebased version of Andres Freund's patch v8[1] with the
>>  following additional changes:
>>  3. Argument changing_part of heap_delete renamed to ChangingPart to be
>>     consistent with ExecDelete
>
> FWIW, I'd left it as it was before because the two functions have a bit
> different coding style, and the capitalization seemed more fitting in
> the surrounding context.
>
>> +test: partition-key-update-1
>> +test: partition-key-update-2
>> +test: partition-key-update-3
>
> Can you give these more descriptive names please (or further combine them)?
>

As I explained above, further combining might not be a good option. For
the descriptive names I have the following suggestions, but I am afraid of
the length of the test case names:

+test: concurrent-partition-key-update.out

This test does the serialization failure check (a rough sketch of such a
spec follows below).

+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-1
+test: concurrent-partition-key-update-and-insert-conflict-do-nothing-2

Both are testing partition key update behaviour with the insert on
conflict do nothing case.
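
For illustration, the serialization-failure spec would look roughly like
the following (a minimal sketch only, with table and step names invented
here rather than taken from the attached spec):

setup
{
  CREATE TABLE foo (a int, b text) PARTITION BY LIST (a);
  CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
  CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);
  INSERT INTO foo VALUES (1, 'ABC');
}

teardown
{
  DROP TABLE foo;
}

session "s1"
step "s1b"  { BEGIN; }
step "s1u"  { UPDATE foo SET a = 2 WHERE a = 1; }
step "s1c"  { COMMIT; }

session "s2"
step "s2u"  { UPDATE foo SET b = 'EFG' WHERE a = 1; }

permutation "s1b" "s1u" "s2u" "s1c"

The expected output is that s2u blocks behind s1u and, once s1c commits,
fails with "tuple to be locked was already moved to another partition due
to concurrent update".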

Attached is a patch that does the renaming of these tests -- it needs to
be applied on top of the v10 patch[1].

Regards,
Amul

1] https://postgr.es/m/CAAJ_b94X5Y_zdTN=BGdZie+hM4p6qW70-XCJhFYaCUO0OfF=aQ@mail.gmail.com

Attachment

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Alvaro Herrera
Date:
amul sul wrote:
> On Sat, Apr 7, 2018 at 9:08 AM, Andres Freund <andres@anarazel.de> wrote:

> >> +test: partition-key-update-1
> >> +test: partition-key-update-2
> >> +test: partition-key-update-3
> >
> > Can you give these more descriptive names please (or further combine them)?
> 
> As I explained above, further combining might not be a good option. For
> more descriptive names I have the following suggestions, but I am afraid
> of the length of the test case names:
> 
> +test: concurrent-partition-key-update.out
> 
> This test does the serialization failure check.
> 
> +test: concurrent-partition-key-update-and-insert-conflict-do-nothing-1
> +test: concurrent-partition-key-update-and-insert-conflict-do-nothing-2

Yikes.  I'd rather have the original names, and each test's purpose
stated in a comment in the spec file itself.
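
For example (illustrative only), something like this at the top of each
spec would do:

# partition-key-update-1
#
# Verify that locking a row that a concurrent UPDATE has moved to another
# partition raises a serialization failure.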

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Andres Freund
Date:
On 2018-04-06 09:41:07 +0530, Amit Kapila wrote:
> >> Won't the same question apply to the similar usage in
> >> EvalPlanQualFetch and heap_lock_updated_tuple_rec.
> >
> > I don't think so?
> >
> >
> >> In EvalPlanQualFetch, we consider such a tuple to be deleted and will
> >> silently miss/skip it, which seems contradictory to the places where we
> >> have detected such a situation and raised an error.
> >
> > if (ItemPointerIndicatesMovedPartitions(&hufd.ctid))
> >         ereport(ERROR,
> >                         (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                          errmsg("tuple to be locked was already moved to another partition due to concurrent update")));
> >
> >
> 
> I was talking about the case when the tuple version is not visible, aka
> the below code:

> I think if we return an error in EvalPlanQualFetch at the place
> mentioned above, the behavior will be sane.

I think you're right. I've adapted the code, added a bunch of tests.
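
The additional error path ends up roughly like this (a sketch of the
shape only; see the patch for the exact hunk):

    /* check whether next version would be in a different partition */
    if (HeapTupleHeaderIndicatesMovedPartitions(tuple.t_data))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("tuple to be locked was already moved to another partition due to concurrent update")));

    /* check whether tuple has been deleted */
    if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
    {
        /* deleted, so forget about it */
        ReleaseBuffer(buffer);
        return NULL;
    }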

Greetings,

Andres Freund


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
Andres Freund
Date:
On 2018-04-07 20:13:36 +0530, amul sul wrote:
> Attached is the patch does the renaming of this tests -- need to apply
> to the top of  v10 patch[1].

These indeed are a bit too long, so I went with the numbers.  I've
pushed the patch now. Two changes:
- I've added one more error path to EvalPlanQualFetch, as suggested by
  Amit. Added tests for that too.
- renamed '*_rang_*' table names in the tests to '*_range_*'.

Thanks!

- Andres

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From
amul sul
Date:
On Sun, Apr 8, 2018 at 2:04 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-07 20:13:36 +0530, amul sul wrote:
>> Attached is the patch does the renaming of this tests -- need to apply
>> to the top of  v10 patch[1].
>
> These indeed are a bit too long, so I went with the numbers.  I've
> pushed the patch now. Two changes:
> - I've added one more error path to EvalPlanQualFetch, as suggested by
>   Amit. Added tests for that too.
> - renamed '*_rang_*' table names in the tests to '*_range_*'.
>

Thanks for the review and commit -- I appreciate your and Amit Kapila's
help in getting the patch to a committable stage.  Thanks to all those
who looked into the patch and provided valuable input.

FWIW, here is the commit:
https://postgr.es/m/E1f4uWV-00057x-0a@gemulon.postgresql.org

Regards,
Amul