Thread: default partition and concurrent attach partition

default partition and concurrent attach partition

From
Amit Langote
Date:
Hi,

Starting a new thread to discuss a bug related to $subject that Hao Wu
reported on thread titled "ALTER TABLE .. DETACH PARTITION
CONCURRENTLY" [1].  I have been able to reproduce the bug using steps
that Hao gave in that email:

create table tpart (i int, j int) partition by range(i);
create table tpart_1 (like tpart);
create table tpart_2 (like tpart);
create table tpart_default (like tpart);
alter table tpart attach partition tpart_1 for values from (0) to (100);
alter table tpart attach partition tpart_default default;
insert into tpart_2 values (110,110), (120,120), (150,150);

Session 1:

begin;
alter table tpart attach partition tpart_2 for values from (100) to (200);

Session 2:

begin;
insert into tpart values (110,110), (120,120), (150,150);
<blocks waiting for the concurrent attach to finish>

Session 1:

end;

Session 2:

select tableoid::regclass, * from tpart;
end;

The select will show that the rows inserted by session 2 went into
tpart_default, even though, now that tpart_2 has been successfully
attached, they do not actually belong there.
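
For illustration, the output of that select then looks something like
this (showing only the rows inserted by session 2; the rows loaded
directly into tpart_2 during setup also appear, under tableoid tpart_2):

    tableoid    |  i  |  j
----------------+-----+-----
 tpart_default  | 110 | 110
 tpart_default  | 120 | 120
 tpart_default  | 150 | 150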

The problem is that when session 2 inserts those rows into tpart, it
only knows about two partitions, tpart_1 and tpart_default, of which it
selects tpart_default to insert the rows into.  When it tries to lock
tpart_default to perform the insert, it has to wait for session 1 to
release the lock taken on tpart_default during the attach command.
Once unblocked, it proceeds to finish the insert without rechecking the
partition constraint, which by then has been updated as a result of a
new partition having been added to the parent.

Note that we don't normally check the partition constraint when
inserting a row into a partition if the insert occurs via tuple
routing, which makes sense for non-default partitions whose partition
constraint cannot change due to concurrent activity.  But this test
case has shown that the assumption is not safe for a default partition
whose constraint is a function of other partitions that exist as of
when the insert occurs.
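
To see this concretely, note that a default partition's implicit
partition constraint is derived from the bounds of its siblings.  For
illustration (the constraint text below is paraphrased rather than
copied from a real server), it can be inspected with:

select pg_get_partition_constraintdef('tpart_default'::regclass);

Before tpart_2 is attached, it is roughly

    NOT (i >= 0 AND i < 100)

and after the attach it becomes roughly

    NOT ((i >= 0 AND i < 100) OR (i >= 100 AND i < 200))

so a row with i = 110 satisfies the old constraint but violates the
new one.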

By the way, if you reverse the order of operations between session 1
and 2 such that the insert by session 2 occurs first and then the
attach by session 1, then you will correctly get this error from the
attach command:

ERROR:  updated partition constraint for default partition
"tpart_default" would be violated by some row

Attached is a patch to fix things on the insert side.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


[1]
https://www.postgresql.org/message-id/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0%40DM5PR0501MB3910.namprd05.prod.outlook.com

Attachment

Re: default partition and concurrent attach partition

From
Amit Langote
Date:
On Thu, Sep 3, 2020 at 6:50 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi,
>
> Starting a new thread to discuss a bug related to $subject that Hao Wu
> reported on thread titled "ALTER TABLE .. DETACH PARTITION
> CONCURRENTLY" [1].  I have been able to reproduce the bug using steps
> that Hao gave in that email:
>
> create table tpart (i int, j int) partition by range(i);
> create table tpart_1 (like tpart);
> create table tpart_2 (like tpart);
> create table tpart_default (like tpart);
> alter table tpart attach partition tpart_1 for values from (0) to (100);
> alter table tpart attach partition tpart_default default;
> insert into tpart_2 values (110,110), (120,120), (150,150);
>
> Session 1:
>
> begin;
> alter table tpart attach partition tpart_2 for values from (100) to (200);
>
> Session 2:
>
> begin;
> insert into tpart values (110,110), (120,120), (150,150);
> <blocks waiting for the concurrent attach to finish>
>
> Session 1:
>
> end;
>
> Session 2:
>
> select tableoid::regclass, * from tpart;
> end;
>
> The select will show that the rows inserted by session 2 went into
> tpart_default, even though, now that tpart_2 has been successfully
> attached, they do not actually belong there.
>
> The problem is that when session 2 inserts those rows into tpart, it
> only knows about two partitions, tpart_1 and tpart_default, of which it
> selects tpart_default to insert the rows into.  When it tries to lock
> tpart_default to perform the insert, it has to wait for session 1 to
> release the lock taken on tpart_default during the attach command.
> Once unblocked, it proceeds to finish the insert without rechecking the
> partition constraint, which by then has been updated as a result of a
> new partition having been added to the parent.
>
> Note that we don't normally check the partition constraint when
> inserting a row into a partition if the insert occurs via tuple
> routing, which makes sense for non-default partitions whose partition
> constraint cannot change due to concurrent activity.  But this test
> case has shown that the assumption is not safe for a default partition
> whose constraint is a function of other partitions that exist as of
> when the insert occurs.
>
> By the way, if you reverse the order of operations between session 1
> and 2 such that the insert by session 2 occurs first and then the
> attach by session 1, then you will correctly get this error from the
> attach command:
>
> ERROR:  updated partition constraint for default partition
> "tpart_default" would be violated by some row
>
> Attached is a patch to fix things on the insert side.

Forgot to mention that the problem exists as of v12 (commit: 898e5e329).

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
Thanks for this fix!  Looking into it now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-03, Alvaro Herrera wrote:

> +    /*
> +     * If setting up a PartitionDispatch for a sub-partitioned table, we may
> +     * also need a fake ResultRelInfo for checking the partition constraint
> +     * later; set that up now.
> +     */
> +    if (parent_pd)
> +    {
> +        ResultRelInfo *rri = makeNode(ResultRelInfo);
> +
> +        InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
> +        proute->nonleaf_partitions[dispatchidx] = rri;
> +    }
> +

Heh, I had intended to remove the attachment before sending, because I
thought I was seeing a problem with this proposed coding of mine.  But
since I sent it by mistake, I'll explain: I think this will fail if we
have a partitioned default partition and we insert directly into it
while a partition is being attached to its parent.  I think that kind
of scenario deserves its own test case.
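
To make that concrete, the scenario I have in mind is something like
this (table names made up; this is only a sketch, not the test case
itself):

create table atab (i int) partition by range (i);
create table atab_1 (like atab);
create table atab_2 (like atab);
create table atab_def (like atab) partition by range (i);
create table atab_def_1 partition of atab_def default;
alter table atab attach partition atab_1 for values from (0) to (100);
alter table atab attach partition atab_def default;

Session 1:

begin;
alter table atab attach partition atab_2 for values from (100) to (200);

Session 2:

insert into atab_def values (150);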

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
Also, I should have pointed out that ExecInsert doesn't actually check
the partition constraint except in very specific cases; what it does is
expect that the partition routing code got it right.  So the comment
you're adding about that is wrong, and it misled me into changing
your code in a way that actually caused a bug -- hence my proposed
rewording.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Amit Langote
Date:
Hi Alvaro,

On Fri, Sep 4, 2020 at 6:28 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Also, I should have pointed out that ExecInsert doesn't actually check
> the partition constraint except in very specific cases; what it does is
> expect that the partition routing code got it right.  So the comment
> you're adding about that is wrong, and it misled me into changing
> your code in a way that actually caused a bug -- hence my proposed
> rewording.

Thanks for taking a look.

        /*
         * If this partition is the default one, we must check its partition
-        * constraint, because it may have changed due to partitions being
-        * added to the parent concurrently.  We do the check here instead of
-        * in ExecInsert(), because we would not want to miss checking the
-        * constraint of any nonleaf partitions as they never make it to
-        * ExecInsert().
+        * constraint now, which may have changed due to partitions being
+        * added to the parent concurrently.

I am fine with your rewording but let me explain how I ended up
writing what I wrote:

At first I had pondered tweaking the following code in ExecInsert() to
fix this bug:

        /*
         * Also check the tuple against the partition constraint, if there is
         * one; except that if we got here via tuple-routing, we don't need to
         * if there's no BR trigger defined on the partition.
         */
        if (resultRelInfo->ri_PartitionCheck &&
            (resultRelInfo->ri_PartitionRoot == NULL ||
             (resultRelInfo->ri_TrigDesc &&
              resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
            ExecPartitionCheck(resultRelInfo, slot, estate, true);

I gave up because I concluded that there isn't enough information at
this place in the code to determine whether the partition is a default
partition, so I moved my attention to ExecFindPartition(), where we
have access to the parent's PartitionDesc, which is enough to do so.
In the process of modifying ExecFindPartition() to fix the bug, I
realized that default partitions can be partitioned and that
sub-partitioned partitions never reach ExecInsert().  So, besides the
earlier-mentioned inconvenience of fixing this bug in ExecInsert(),
there is also the problem that such a fix would have missed
sub-partitioned default partitions.  I thought someone besides me would
also wonder why ExecInsert() is not touched in this fix, hence the
comment.

FWIW, I still prefer "minimally valid ResultRelInfo" to "fake
ResultRelInfo", because those aren't really fake, are they?  They are
as valid as any other ResultRelInfo as far as I can see.  I said
"minimally valid" because a fully valid partition ResultRelInfo is one
made by ExecInitPartitionInfo(), which I avoided using here, thinking
that would be overkill.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-04, Amit Langote wrote:

Hello

> FWIW, I still prefer "minimally valid ResultRelInfo" to "fake
> ResultRelInfo", because those aren't really fake, are they?  They are
> as valid as any other ResultRelInfo as far as I can see.  I said
> "minimally valid" because a fully valid partition ResultRelInfo is one
> made by ExecInitPartitionInfo(), which I avoided using here, thinking
> that would be overkill.

Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
which means that ExecBuildSlotValueDescription will misbehave if the
partitioned default partition has a different column order than its
parent.  That can be evidenced by changing the setup block in the
isolation test as in the attached; and you'll get an undesirable error
like this:

2020-09-07 19:00:49.513 -03 [12981] ERROR:  attribute 2 of type record has wrong type
2020-09-07 19:00:49.513 -03 [12981] DETAIL:  Table has type text, but query expects integer.
2020-09-07 19:00:49.513 -03 [12981] STATEMENT:  insert into attach_tab values (110, 'eleven and five twenties');
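
The setup change I mean is along these lines (a paraphrase, not the
actual diff; column names and types are guessed from the messages
above): give the default partition, and its own partition, a column
order different from the parent's, e.g.

create table attach_tab (i int, j text) partition by range (i);
create table attach_tab_1 (like attach_tab);
create table attach_tab_default (j text, i int) partition by range (i);
create table attach_tab_default_1 (like attach_tab_default);
alter table attach_tab_default attach partition attach_tab_default_1 default;
alter table attach_tab attach partition attach_tab_1 for values from (0) to (100);
alter table attach_tab attach partition attach_tab_default default;

and then run the same concurrent attach/insert dance as before, with
the insert going through attach_tab.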

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-07, Alvaro Herrera wrote:

> Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
> which means that ExecBuildSlotValueDescription will misbehave if the
> partitioned default partition has a different column order than its
> parent.  That can be evidenced by changing the setup block in the
> isolation test as in the attached; and you'll get an undesirable error
> like this:
> 
> 2020-09-07 19:00:49.513 -03 [12981] ERROR:  attribute 2 of type record has wrong type
> 2020-09-07 19:00:49.513 -03 [12981] DETAIL:  Table has type text, but query expects integer.
> 2020-09-07 19:00:49.513 -03 [12981] STATEMENT:  insert into attach_tab values (110, 'eleven and five twenties');

... and I sent before completing.  I'm not sure what a good fix for this
is.  We could try to initialize the resultRelInfo honestly, or we could
set ri_RangeTableIndex to some invalid value that will ... eh ... do
something else.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
Ah, it looks like we can get away with initializing the RRI to 0, and
then explicitly handle that case in ExecPartitionCheckEmitError, as in
the attached (which means reindenting, but I left it alone to make it
easy to read).  It kinda sucks because we don't report the tuple that
causes the error, but

a) it's a very unlikely case anyway
b) it's better than the bogus error message
c) it's better than some hypothetical crash
d) nobody uses partitioned default partitions anyway
e) nobody uses differing column order anyway

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-07, Alvaro Herrera wrote:

> Ah, it looks like we can get away with initializing the RRI to 0, and
> then explicitly handle that case in ExecPartitionCheckEmitError, as in
> the attached (which means reindenting, but I left it alone to make it
> easy to read).

Well, that was silly -- this seems fixable by mapping the tuple columns
prior to ExecPartitionCheck, which is achievable with something like

        if (partidx == partdesc->boundinfo->default_index)
        {
            TupleTableSlot *tempslot;

            if (dispatch->tupmap != NULL)
            {
                tempslot = MakeSingleTupleTableSlot(RelationGetDescr(rri->ri_RelationDesc),
                                                                   &TTSOpsHeapTuple);
                tempslot = execute_attr_map_slot(dispatch->tupmap, slot, tempslot);
            }
            else
                tempslot = slot;
            ExecPartitionCheck(rri, tempslot, estate, true);
            if (dispatch->tupmap != NULL)
                ExecDropSingleTupleTableSlot(tempslot);
        }

though this exact incantation, apart from being pretty horrible, doesn't
actually work (other than to fix this specific test case -- it crashes
elsewhere.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Amit Langote
Date:
Hi Alvaro,

On Tue, Sep 8, 2020 at 8:44 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Sep-07, Alvaro Herrera wrote:
>
> > Ah, it looks like we can get away with initializing the RRI to 0, and
> > then explicitly handle that case in ExecPartitionCheckEmitError, as in
> > the attached (which means reindenting, but I left it alone to make it
> > easy to read).

At this point, I think it may be clear that ri_RangeTableIndex being
set to a dummy value isn't problematic.  I checked the code paths you
mentioned upthread, but I don't see anything that would be broken by
setting it to a dummy value of 1.

Moving on from that...

> Well, that was silly -- this seems fixable by mapping the tuple columns
> prior to ExecPartitionCheck, which is achievable with something like
>
>                 if (partidx == partdesc->boundinfo->default_index)
>                 {
>                         TupleTableSlot *tempslot;
>
>                         if (dispatch->tupmap != NULL)
>                         {
>                                 tempslot = MakeSingleTupleTableSlot(RelationGetDescr(rri->ri_RelationDesc),
>                                                                     &TTSOpsHeapTuple);
>                                 tempslot = execute_attr_map_slot(dispatch->tupmap, slot, tempslot);
>                         }
>                         else
>                                 tempslot = slot;
>                         ExecPartitionCheck(rri, tempslot, estate, true);
>                         if (dispatch->tupmap != NULL)
>                                 ExecDropSingleTupleTableSlot(tempslot);
>                 }
>
> though this exact incantation, apart from being pretty horrible, doesn't
> actually work (other than to fix this specific test case -- it crashes
> elsewhere.)

Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
TupleDesc matches the partition's.  Actually we do have such dedicated
slots for partitions around (for both sub-partitioned and leaf
partitions), so we can simply use them instead of creating one from
scratch for every use.  It did take some code rearrangement to
actually make that work though.

Attached is the latest patch including those changes.  Also, I have
merged your earlier "fixes" patch and updated the isolation test to
exercise a case with a sub-partitioned default partition, as well as
varying attribute numbers.  The patch applies as-is to both HEAD and
the v13 branch, but needs slight changes to apply to the v12 branch,
so I am also attaching one for that branch.



--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
Hello Amit,

On 2020-Sep-08, Amit Langote wrote:

> On Tue, Sep 8, 2020 at 8:44 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2020-Sep-07, Alvaro Herrera wrote:
> >
> > > Ah, it looks like we can get away with initializing the RRI to 0, and
> > > then explicitly handle that case in ExecPartitionCheckEmitError, as in
> > > the attached (which means reindenting, but I left it alone to make it
> > > easy to read).
> 
> At this point, I think it may be clear that ri_RangeTableIndex being
> set to a dummy value isn't problematic.

Yep ... I misled myself.

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's.  Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions),

Yeah, that's what I was looking for.

> so we can simply use them instead of creating one from
> scratch for every use.  It did take some code rearrangement to
> actually make that work though.

Thanks.  It doesn't look too bad ... I'd say it even looks easier to
read now in terms of code structure.

> Attached is the latest patch including those changes.  Also, I have
> merged your earlier "fixes" patch and updated the isolation test to
> exercise a case with a sub-partitioned default partition, as well as
> varying attribute numbers.  The patch applies as-is to both HEAD and
> the v13 branch, but needs slight changes to apply to the v12 branch,
> so I am also attaching one for that branch.

Thanks, will dispatch it shortly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
Hi,

Andres added to CC because of the TTS interface: apparently calling
slot_getallattrs() on a virtual slot raises the error "getsomeattrs is
not required to be called on a virtual tuple table slot".  I'm thinking
that this exposes implementation details that a caller should not need
to know about; patch 0001 fixes that at the problematic caller by
making the slot_getallattrs() call conditional on the slot not being
virtual, but I wonder if the better fix isn't to remove the elog(ERROR)
at tts_virtual_getsomeattrs.

Moving on from that -- this is a version of Amit's previous patch that I
like better.  I think the "prev_myslot" thing was a bit ugly, but it
turns out that with the above fix we can clear the slot before creating
the new one, making things more sensible.  I also changed the "do {}
while ()" into a simple "while {}" loop, which is sensible since the
condition is true on loop entrance.  Minor comment rewording also.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-08, Alvaro Herrera wrote:

> Andres added to CC because of the TTS interface: apparently calling
> slot_getallattrs() on a virtual slot raises the error "getsomeattrs is
> not required to be called on a virtual tuple table slot".  I'm thinking
> that this exposes implementation details that a caller should not need
> to know about; patch 0001 fixes that at the problematic caller by
> making the slot_getallattrs() call conditional on the slot not being
> virtual, but I wonder if the better fix isn't to remove the elog(ERROR)
> at tts_virtual_getsomeattrs.

Actually I misread the code, so this is all bogus.  Nevermind ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Alvaro Herrera
Date:
On 2020-Sep-08, Amit Langote wrote:

> Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> TupleDesc matches the partition's.  Actually we do have such dedicated
> slots for partitions around (for both sub-partitioned and leaf
> partitions), so we can simply use them instead of creating one from
> scratch for every use.  It did take some code rearrangement to
> actually make that work though.

Pushed, thanks for working on this fix.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: default partition and concurrent attach partition

From
Amit Langote
Date:
On Wed, Sep 9, 2020 at 7:41 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Sep-08, Amit Langote wrote:
>
> > Yeah, we need to make sure that ExecPartitionCheck gets a slot whose
> > TupleDesc matches the partition's.  Actually we do have such dedicated
> > slots for partitions around (for both sub-partitioned and leaf
> > partitions), so we can simply use them instead of creating one from
> > scratch for every use.  It did take some code rearrangement to
> > actually make that work though.
>
> Pushed, thanks for working on this fix.

Thanks.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com