Thread: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.

There was a lot of great discussion which ended up in Robert completing
a much sought implementation of non-blocking ATTACH.  DETACH was
discussed too because it was a goal initially, but eventually dropped
from that patch altogether. Nonetheless, that thread provided a lot of
useful input to this implementation.  Important ones:

[1] https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@mail.gmail.com
[2] https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@mail.gmail.com
[3] https://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com

Attached is a patch that implements
ALTER TABLE ... DETACH PARTITION .. CONCURRENTLY.

In the previous thread we were able to implement the concurrent model
without the extra keyword.  For this one I think that won't work; my
implementation works in two transactions so there's a restriction that
you can't run it in a transaction block.  Also, there's a wait phase
that makes it slower than the non-concurrent one.  Those two drawbacks
make me think that it's better to keep both modes available, just like
we offer both CREATE INDEX and CREATE INDEX CONCURRENTLY.
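For concreteness, the two forms would look like this (table names here are just placeholders):

ALTER TABLE parent DETACH PARTITION part;               -- existing blocking form
ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;  -- new form; cannot run in a transaction block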

Why two transactions?  The reason is that in order for this to work, we
make a catalog change (mark it detached), and commit so that all
concurrent transactions can see the change.  A second transaction waits
for anybody who holds any lock on the partitioned table and grabs Access
Exclusive on the partition (which now no one cares about, if they're
looking at the partitioned table), where the DDL action on the partition
can be completed.

ALTER TABLE is normally unable to run in two transactions.  I hacked it
(0001) so that the relation can be closed and reopened in the Exec phase
(by having the rel as part of AlteredTableInfo: when ATRewriteCatalogs
returns, it uses that pointer to close the rel).  It turns out that this
is sufficient to make that work.  This means that ALTER TABLE DETACH
CONCURRENTLY cannot work as part of a multi-command ALTER TABLE, but
that's already enforced by the grammar anyway.

DETACH CONCURRENTLY doesn't work if a default partition exists.  It's
just too problematic a case; you would still need to have AEL on the
default partition.
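For example (hypothetical names), with a default partition attached the
CONCURRENTLY form is rejected and you have to fall back to the plain form:

ALTER TABLE parent DETACH PARTITION part CONCURRENTLY;  -- not allowed while a default partition exists
ALTER TABLE parent DETACH PARTITION part;               -- still possible, but blocking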


I haven't yet experimented with queries running in a standby in tandem
with a detach.

-- 
Álvaro Herrera

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Aug-03, Alvaro Herrera wrote:

> There was a lot of great discussion which ended up in Robert completing
> a much sought implementation of non-blocking ATTACH.  DETACH was
> discussed too because it was a goal initially, but eventually dropped
> from that patch altogether. Nonetheless, that thread provided a lot of
> useful input to this implementation.  Important ones:
> 
> [1] https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@mail.gmail.com
> [2] https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@mail.gmail.com
> [3] https://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com

There was some discussion about having a version number in the partition
descriptor somewhere as a means to implement this.  I couldn't figure
out how that would work, or what the version number would be attached
to.  Surely the idea wasn't to increment the version number of every
partition other than the one being detached?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Aug-03, Alvaro Herrera wrote:

> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

I forgot to mention.  If for whatever reason the second transaction
fails (say the user aborts it or there is a crash), then the partition
is still marked as detached, so no queries would see it; but all the
ancillary catalog data remains.  Just like when CREATE INDEX
CONCURRENTLY fails, you keep an invalid index that must be dropped; but in
this case the changes that remain to be done are much more extensive, so a
purely manual fix is out of the question.  So there's another DDL command to be invoked,

ALTER TABLE parent DETACH PARTITION part FINALIZE;

which will complete the detach action.

If we had UNDO then perhaps it would be possible to register an action
so that the detach is completed automatically.  But for now this seems
sufficient.


Another aspect worth mentioning is constraints.  In the patch, I create
a CHECK constraint to stand for the partition constraint that's going to
logically disappear.  This was mentioned as a potential problem in one
of Robert's emails (I didn't actually verify that this is a problem).
However, a funny thing is that if a constraint already exists, you get a
dupe, so after a few rounds of attach/detach you can see them pile up.
I'll have to fix this at some point.  But also, I need to think about
whether foreign keys have similar problems, since they are also used by
the optimizer.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 7:49 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

Is there a more detailed theory of operation of this patch somewhere?
What exactly do you mean by marking it detached? Committing the change
makes it possible for all concurrent transactions to see the change,
but does not guarantee that they will. If you can't guarantee that,
then I'm not sure how you can guarantee that they will behave sanely.
Even if you can, it's not clear what the sane behavior is: what
happens when a tuple gets routed to an ex-partition? What happens when
an ex-partition needs to be scanned? What prevents problems if a
partition is detached, possibly modified, and then reattached,
possibly with different partition bounds?

I think the two-transaction approach is interesting and I can imagine
that it possibly solves some problems, but it's not clear to me
exactly which problems it solves or how it does so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Aug-04, Robert Haas wrote:

> On Mon, Aug 3, 2020 at 7:49 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Why two transactions?  The reason is that in order for this to work, we
> > make a catalog change (mark it detached), and commit so that all
> > concurrent transactions can see the change.  A second transaction waits
> > for anybody who holds any lock on the partitioned table and grabs Access
> > Exclusive on the partition (which now no one cares about, if they're
> > looking at the partitioned table), where the DDL action on the partition
> > can be completed.
> 
> Is there a more detailed theory of operation of this patch somewhere?
> What exactly do you mean by marking it detached? Committing the change
> makes it possible for all concurrent transactions to see the change,
> but does not guarantee that they will. If you can't guarantee that,
> then I'm not sure how you can guarantee that they will behave sanely.

Sorry for the long delay.  I haven't written up the theory of operation.
I suppose it is complicated enough that it should be documented
somewhere.

To mark it detached means to set pg_inherits.inhdetached=true.  That
column name is a bit of a misnomer, since that column really means "is
in the process of being detached"; the pg_inherits row goes away
entirely once the detach process completes.  This mark guarantees that
everyone will see that row because the detaching session waits long
enough after committing the first transaction and before deleting the
pg_inherits row, until everyone else has moved on.
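(For illustration: after the first transaction commits, the intermediate state
could be inspected with a query like the one below.  The parent table name is
just a placeholder; the column is the one described above.)

  select inhrelid::regclass as partition, inhdetached
  from pg_inherits
  where inhparent = 'parent'::regclass;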

The other point is that the partition directory code can be asked to
include detached partitions, or not to.  The executor does the former,
and the planner does the latter.  This allows a transient period during
which the partition descriptor returned to planner and executor is
different; this makes the situation equivalent to what would have
happened if the partition was attached during the operation: in the executor
we would detect that there is an additional partition that was not seen
by the planner, and we already know how to deal with that situation, thanks to
your handling of the ATTACH code.

> Even if you can, it's not clear what the sane behavior is: what
> happens when a tuple gets routed to an ex-partition? What happens when
> an ex-partition needs to be scanned? 

During the transient period, any transaction that obtained a partition
descriptor before the inhdetached mark is committed should be able to
get the tuple routing done successfully, but transactions using later
snapshots should get their insertions rejected.  Reads should behave in
the same way.

> What prevents problems if a partition is detached, possibly modified,
> and then reattached, possibly with different partition bounds?

This should not be a problem, because the partition needs to be fully
detached before it can be attached again.  And if the partition bounds
are different, that won't be a problem, because the previous partition
bounds won't be present in the pg_class row.  Of course, the new
partition bounds will be checked against the existing contents.
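(In other words, a hypothetical sequence like this is fine: the re-ATTACH
revalidates the existing rows against whatever bounds are given.)

  alter table parent detach partition part concurrently;
  alter table parent attach partition part for values from (200) to (300);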

There is one fly in the ointment though, which is that if you cancel the
wait and then immediately do the ALTER TABLE DETACH FINALIZE without
waiting for as long as the original execution would have waited, you
might end up killing the partition ahead of time.  One solution to this
would be to cause the FINALIZE action to wait again at start.  This
would cause it to take even longer, but it would be safer.  (If you
don't want it to take longer, you can just not cancel it in the first
place.)  This is not a problem if the server crashes in between (which
is the scenario I had in mind when doing the FINALIZE thing), because of
course no transaction can continue to run across a crash.


I'm going to see if I can get the new delay_execution module to help
test this, to verify whether my claims are true.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> To mark it detached means to set pg_inherits.inhdetached=true.  That
> column name is a bit of a misnomer, since that column really means "is
> in the process of being detached"; the pg_inherits row goes away
> entirely once the detach process completes.  This mark guarantees that
> everyone will see that row because the detaching session waits long
> enough after committing the first transaction and before deleting the
> pg_inherits row, until everyone else has moved on.

OK. Do you just wait until the XID of the transaction that set
inhdetached is all-visible, or how do you do it?

> The other point is that the partition directory code can be asked to
> include detached partitions, or not to.  The executor does the former,
> and the planner does the latter.  This allows a transient period during
> which the partition descriptor returned to planner and executor is
> different; this makes the situation equivalent to what would have
> happened if the partition was attached during the operation: in executor
> we would detect that there is an additional partition that was not seen
> by the planner, and we already know how to deal with that situation by
> your handling of the ATTACH code.

Ah ha! That is quite clever and I don't think that I would have
thought of it. So all the plans that were created before you set
inhdetached=true have to be guaranteed to be invalidated or gone
altogether before you can actually delete the pg_inherits row. It
seems like it ought to be possible to ensure that, though I am not
sure of the details exactly. Is it sufficient to wait for all
transactions that have locked the table to go away? I'm not sure
exactly how this stuff interacts with the plan cache.

> There is one fly in the ointment though, which is that if you cancel the
> wait and then immediately do the ALTER TABLE DETACH FINALIZE without
> waiting for as long as the original execution would have waited, you
> might end up killing the partition ahead of time.  One solution to this
> would be to cause the FINALIZE action to wait again at start.  This
> would cause it to take even longer, but it would be safer.  (If you
> don't want it to take longer, you can just not cancel it in the first
> place.)  This is not a problem if the server crashes in between (which
> is the scenario I had in mind when doing the FINALIZE thing), because of
> course no transaction can continue to run across a crash.

Yeah, it sounds like this will require some solution, but I agree that
just waiting "longer" seems acceptable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Hao Wu
Date:
Hi hacker,

I tested the patch provided by Alvaro. It does not seem to work well in REPEATABLE READ.

gpadmin=*# \d+ tpart
                             Partitioned table "public.tpart"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: tpart_1 FOR VALUES FROM (0) TO (100),
            tpart_2 FOR VALUES FROM (100) TO (200)

begin isolation level repeatable read;
BEGIN
gpadmin=*# select * from tpart;
  i  |  j
-----+-----
  10 |  10
  50 |  50
 110 | 110
 120 | 120
 150 | 150
(5 rows)
-- Here, run `ALTER TABLE tpart DETACH PARTITION tpart_2 CONCURRENTLY`
-- but only complete the first transaction.

-- the tuples from tpart_2 are gone.
gpadmin=*# select * from tpart;
 i  | j
----+----
 10 | 10
 50 | 50
(2 rows)

gpadmin=*# \d+ tpart_2
                                  Table "public.tpart_2"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition of: tpart FOR VALUES FROM (100) TO (200)
Partition constraint: ((i IS NOT NULL) AND (i >= 100) AND (i < 200))
Access method: heap

-- the part tpart_2 is not showed as DETACHED
gpadmin=*# \d+ tpart
                             Partitioned table "public.tpart"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 i      | integer |           |          |         | plain   |              |
 j      | integer |           |          |         | plain   |              |
Partition key: RANGE (i)
Partitions: tpart_1 FOR VALUES FROM (0) TO (100),
            tpart_2 FOR VALUES FROM (100) TO (200)

-- next, the insert failed. It's OK.
gpadmin=*# insert into tpart values(130,130);
ERROR:  no partition of relation "tpart" found for row
DETAIL:  Partition key of the failing row contains (i) = (130).


Is this expected behavior?

Regards,
Hao Wu


From: Robert Haas <robertmhaas@gmail.com>
Sent: Thursday, August 27, 2020 11:46 PM
To: Alvaro Herrera <alvherre@2ndquadrant.com>
Cc: Pg Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
 
On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> To mark it detached means to set pg_inherits.inhdetached=true.  That
> column name is a bit of a misnomer, since that column really means "is
> in the process of being detached"; the pg_inherits row goes away
> entirely once the detach process completes.  This mark guarantees that
> everyone will see that row because the detaching session waits long
> enough after committing the first transaction and before deleting the
> pg_inherits row, until everyone else has moved on.

OK. Do you just wait until the XID of the transaction that set
inhdetached is all-visible, or how do you do it?

> The other point is that the partition directory code can be asked to
> include detached partitions, or not to.  The executor does the former,
> and the planner does the latter.  This allows a transient period during
> which the partition descriptor returned to planner and executor is
> different; this makes the situation equivalent to what would have
> happened if the partition was attached during the operation: in executor
> we would detect that there is an additional partition that was not seen
> by the planner, and we already know how to deal with that situation by
> your handling of the ATTACH code.

Ah ha! That is quite clever and I don't think that I would have
thought of it. So all the plans that were created before you set
inhdetached=true have to be guaranteed to be invalidated or gone
altogether before you can actually delete the pg_inherits row. It
seems like it ought to be possible to ensure that, though I am not
sure of the details exactly. Is it sufficient to wait for all
transactions that have locked the table to go away? I'm not sure
exactly how this stuff interacts with the plan cache.

> There is one fly in the ointment though, which is that if you cancel the
> wait and then immediately do the ALTER TABLE DETACH FINALIZE without
> waiting for as long as the original execution would have waited, you
> might end up killing the partition ahead of time.  One solution to this
> would be to cause the FINALIZE action to wait again at start.  This
> would cause it to take even longer, but it would be safer.  (If you
> don't want it to take longer, you can just not cancel it in the first
> place.)  This is not a problem if the server crashes in between (which
> is the scenario I had in mind when doing the FINALIZE thing), because of
> course no transaction can continue to run across a crash.

Yeah, it sounds like this will require some solution, but I agree that
just waiting "longer" seems acceptable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Aug-27, Robert Haas wrote:

> On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > To mark it detached means to set pg_inherits.inhdetached=true.  That
> > column name is a bit of a misnomer, since that column really means "is
> > in the process of being detached"; the pg_inherits row goes away
> > entirely once the detach process completes.  This mark guarantees that
> > everyone will see that row because the detaching session waits long
> > enough after committing the first transaction and before deleting the
> > pg_inherits row, until everyone else has moved on.
> 
> OK. Do you just wait until the XID of the transaction that set
> inhdetached is all-visible, or how do you do it?

I'm just doing WaitForLockers( ... AccessExclusiveLock ...)  on the
partitioned table at the start of the second transaction.  That will
wait until all lockers that have obtained a partition descriptor with
the old definition are gone.  Note we don't actually lock the
partitioned table with that lock level.

In the second transaction we additionally obtain AccessExclusiveLock on
the partition itself, but that's after nobody sees it as a partition
anymore.  That lock level is needed for some of the internal DDL
changes, and should not cause problems.

I thought about using WaitForOlderSnapshots() instead of waiting for
lockers, but it didn't seem to solve any actual problem.

Note that on closing the first transaction, the locks on both tables are
released.  This avoids the deadlock hazards because of the lock upgrades
that would otherwise occur.  This means that the tables could be dropped
or changed in the meantime.  The case where either relation is dropped
is handled by using try_relation_open() in the second transaction; if
either table is gone, then we can just mark the operation as completed.
This part is a bit fuzzy.  One thing that should probably be done is
have a few operations (such as other ALTER TABLE) raise an error when
run on a table with inhdetached=true, because that might get things out
of step and potentially cause other problems.  I've not done that yet.  

> So all the plans that were created before you set
> inhdetached=true have to be guaranteed to be invalidated or gone
> altogether before you can actually delete the pg_inherits row. It
> seems like it ought to be possible to ensure that, though I am not
> sure of the details exactly. Is it sufficient to wait for all
> transactions that have locked the table to go away? I'm not sure
> exactly how this stuff interacts with the plan cache.

Hmm, any cached plan should be released with relcache inval events, per
PlanCacheRelCallback().  There are some comments in plancache.h about
"unsaved" cached plans that I don't really understand :-(

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Hao Wu
Date:
Not related to DETACH PARTITION, but I found a bug in ATTACH PARTITION.
Here are the steps to reproduce the issue:

create table tpart(i int, j int) partition by range(i);
create table tpart_1(like tpart);
create table tpart_2(like tpart);
create table tpart_default(like tpart);
alter table tpart attach partition tpart_1 for values from(0) to (100);
alter table tpart attach partition tpart_default default;
insert into tpart_2 values(110,110),(120,120),(150,150);

1: begin;
1: alter table tpart attach partition tpart_2 for values from(100) to (200);
2: begin;
-- insert will be blocked by ALTER TABLE ATTACH PARTITION
2: insert into tpart values (110,110),(120,120),(150,150);
1: end;
2: select * from tpart_default;
2: end;

After that the partition tpart_default contains (110,110),(120,120),(150,150)
inserted in session 2, which obviously violates the partition constraint.

Regards,
Hao Wu

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
Hi Hao,

On Wed, Sep 2, 2020 at 5:25 PM Hao Wu <hawu@vmware.com> wrote:
>
> Not related to DETACH PARTITION, but I found a bug in ATTACH PARTITION.
> Here are the steps to reproduce the issue:
>
> create table tpart(i int, j int) partition by range(i);
> create table tpart_1(like tpart);
> create table tpart_2(like tpart);
> create table tpart_default(like tpart);
> alter table tpart attach partition tpart_1 for values from(0) to (100);
> alter table tpart attach partition tpart_default default;
> insert into tpart_2 values(110,110),(120,120),(150,150);
> 1: begin;
> 1: alter table tpart attach partition tpart_2 for values from(100) to (200);
> 2: begin;
> -- insert will be blocked by ALTER TABLE ATTACH PARTITION
> 2: insert into tpart values (110,110),(120,120),(150,150);
> 1: end;
> 2: select * from tpart_default;
> 2: end;
>
>
> After that the partition tpart_default contains (110,110),(120,120),(150,150)
> inserted in session 2, which obviously violates the partition constraint.

Thanks for the report.  That looks like a bug.

I have started another thread to discuss this bug and a patch to fix
it to keep the discussion here focused on the new feature.

Subject: default partition and concurrent attach partition
https://www.postgresql.org/message-id/CA%2BHiwqFqBmcSSap4sFnCBUEL_VfOMmEKaQ3gwUhyfa4c7J_-nA%40mail.gmail.com

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Aug-31, Hao Wu wrote:

> I tested the patch provided by Alvaro. It does not seem to work well in REPEATABLE READ.

> -- the tuples from tpart_2 are gone.
> gpadmin=*# select * from tpart;
>  i  | j
> ----+----
>  10 | 10
>  50 | 50
> (2 rows)

Interesting example, thanks.  It seems this can be fixed without
breaking anything else by changing the planner so that it includes
detached partitions when we are in a snapshot-isolation transaction.
Indeed, the results from the detach-partition-concurrently-1.spec
isolation test are more satisfying with this change.

The attached v2, on current master, includes that change, as well as
fixes a couple of silly bugs in the previous one.

(Because of experimenting with git workflow I did not keep the 0001
part split in this one, but that part is unchanged from v1.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Robert Haas
Date:
On Thu, Sep 10, 2020 at 4:54 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Interesting example, thanks.  It seems this can be fixed without
> breaking anything else by changing the planner so that it includes
> detached partitions when we are in a snapshot-isolation transaction.
> Indeed, the results from the detach-partition-concurrently-1.spec
> isolation test are more satisfying with this change.

Hmm, so I think the idea here is that since we're out-waiting plans
with the old partition descriptor by waiting for lock release, it's OK
for anyone who has a lock to keep using the old partition descriptor
as long as they continuously hold the lock. Is that right? I can't
think of a hole in that logic, but it's probably worth noting in the
comments, in case someone is tempted to change the way that we
out-wait plans with the old partition descriptor to some other
mechanism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
Hi Alvaro,

Studying the patch to understand how it works.

On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

+   /*
+    * Concurrent mode has to work harder; first we add a new constraint to the
+    * partition that matches the partition constraint.  The reason for this is
+    * that the planner may have made optimizations that depend on the
+    * constraint.  XXX Isn't it sufficient to invalidate the partition's
+    * relcache entry?
...
+       /* Add constraint on partition, equivalent to the partition constraint */
+       n = makeNode(Constraint);
+       n->contype = CONSTR_CHECK;
+       n->conname = NULL;
+       n->location = -1;
+       n->is_no_inherit = false;
+       n->raw_expr = NULL;
+       n->cooked_expr = nodeToString(make_ands_explicit(RelationGetPartitionQual(partRel)));
+       n->initially_valid = true;
+       n->skip_validation = true;
+       /* It's a re-add, since it nominally already exists */
+       ATAddCheckConstraint(wqueue, tab, partRel, n,
+                            true, false, true, ShareUpdateExclusiveLock);

I suspect that we don't really need this defensive constraint.  I mean
even after committing the 1st transaction, the partition being
detached still has relispartition set to true and
RelationGetPartitionQual() still returns the partition constraint, so
it seems the constraint being added above is duplicative.  Moreover,
the constraint is not removed as part of any cleaning up after the
DETACH process, so repeated attach/detach of the same partition
results in the constraints piling up:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
\d foo2
                Table "public.foo2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES FROM (2) TO (3)
Check constraints:
    "foo2_a_check" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)
    "foo2_a_check1" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)

Noticed a bug/typo in the patched RelationBuildPartitionDesc():

-   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
+                                       include_detached);

You're passing NoLock for include_detached which means you never
actually end up including detached partitions from here.  I think it
is due to this bug that partition-concurrent-attach test fails in my
run.  Also, the error seen in the following hunk of
detach-partition-concurrently-1 test is not intentional:

+starting permutation: s1brr s1prep s1s s2d s1s s1exec2 s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f(int) AS INSERT INTO d_listp VALUES ($1);
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2 CONCURRENTLY;
<waiting ...>
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s1exec2: EXECUTE f(2); DEALLOCATE f;
+step s2d: <... completed>
+error in steps s1exec2 s2d: ERROR:  no partition of relation
"d_listp" found for row
+step s1c: COMMIT;

As you're intending to make the executor always *include* detached
partitions, the insert should be able route (2) to d_listp2, the
detached partition.  It's the bug mentioned above that causes the
executor's partition descriptor build to falsely fail to include the
detached partition.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Sep-23, Amit Langote wrote:

Hi Amit, thanks for reviewing this patch!

> On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I suspect that we don't really need this defensive constraint.  I mean
> even after committing the 1st transaction, the partition being
> detached still has relispartition set to true and
> RelationGetPartitionQual() still returns the partition constraint, so
> it seems the constraint being added above is duplicative.

Ah, thanks for thinking through that.  I had vague thoughts about this
being unnecessary in the current mechanics, but hadn't fully
materialized the thought.  (The patch was completely different a few
unposted iterations ago).

> Moreover, the constraint is not removed as part of any cleaning up
> after the DETACH process, so repeated attach/detach of the same
> partition results in the constraints piling up:

Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
didn't worry too much about it because I was thinking I'd rather get rid
of the constraint addition in the first place.

> Noticed a bug/typo in the patched RelationBuildPartitionDesc():
> 
> -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> +                                       include_detached);
> 
> You're passing NoLock for include_detached which means you never
> actually end up including detached partitions from here.

I fixed this in the version I posted on Sept 10.  I think you were
reading the version posted at the start of this thread.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
Hi Alvaro,

Sorry I totally failed to see the v2 you had posted and a couple of
other emails where you mentioned the issues I brought up.

On Thu, Sep 24, 2020 at 12:23 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2020-Sep-23, Amit Langote wrote:
> > I suspect that we don't really need this defensive constraint.  I mean
> > even after committing the 1st transaction, the partition being
> > detached still has relispartition set to true and
> > RelationGetPartitionQual() still returns the partition constraint, so
> > it seems the constraint being added above is duplicative.
>
> Ah, thanks for thinking through that.  I had vague thoughts about this
> being unnecessary in the current mechanics, but hadn't fully
> materialized the thought.  (The patch was completely different a few
> unposted iterations ago).
>
> > Moreover, the constraint is not removed as part of any cleaning up
> > after the DETACH process, so repeated attach/detach of the same
> > partition results in the constraints piling up:
>
> Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
> didn't worry too much about it because I was thinking I'd rather get rid
> of the constraint addition in the first place.

Okay, gotcha.

> > Noticed a bug/typo in the patched RelationBuildPartitionDesc():
> >
> > -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> > +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> > +                                       include_detached);
> >
> > You're passing NoLock for include_detached which means you never
> > actually end up including detached partitions from here.
>
> I fixed this in the version I posted on Sept 10.  I think you were
> reading the version posted at the start of this thread.

I am trying the v2 now and I can confirm that those problems are now fixed.

However, I am a bit curious about including detached partitions in
some cases while not in others, which can result in (to me)
surprising behavior as follows:

Session 1:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);

...attach GDB and set breakpoint so as to block right after finishing
the 1st transaction of DETACH PARTITION CONCURRENTLY...
alter table foo detach partition foo2 concurrently;
<hits breakpoint, wait...>

Session 2:

begin;
insert into foo values (2);  -- ok
select * from foo;
select * from foo;  -- ?!
 a | b
---+---
(0 rows)

Maybe, it's fine to just always exclude detached partitions, although
perhaps I am missing some corner cases that you have thought of?

Also, I noticed that looking up a parent's partitions via
RelationBuildPartitionDesc or directly will inspect inhdetached to
include or exclude partitions, but checking if a child table is a
partition of a given parent table via get_partition_parent doesn't.
Now if you fix get_partition_parent() to also take into account
inhdetached, for example, to return InvalidOid if true, then the
callers would need to not consider the child table a valid partition.
So, RelationGetPartitionQual() on a detached partition should actually
return NIL, making my earlier claim about not needing the defensive
CHECK constraint invalid.  But maybe that's fine if all places agree
on a detached partition not being a valid partition anymore?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Michael Paquier
Date:
On Thu, Sep 24, 2020 at 12:51:52PM +0900, Amit Langote wrote:
> Also, I noticed that looking up a parent's partitions via
> RelationBuildPartitionDesc or directly will inspect inhdetached to
> include or exclude partitions, but checking if a child table is a
> partition of a given parent table via get_partition_parent doesn't.
> Now if you fix get_partition_parent() to also take into account
> inhdetached, for example, to return InvalidOid if true, then the
> callers would need to not consider the child table a valid partition.
> So, RelationGetPartitionQual() on a detached partition should actually
> return NIL, making my earlier claim about not needing the defensive
> CHECK constraint invalid.  But maybe that's fine if all places agree
> on a detached partition not being a valid partition anymore?

It would be good to get that answered, and while on it please note
that the patch needs a rebase.
--
Michael

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Andy Fan
Date:
Hi Alvaro:

On Tue, Aug 4, 2020 at 7:49 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.

I wonder if it would be possible to implement the detach with a NOWAIT option:

ALTER TABLE ... DETACH PARTITION .. [NOWAIT]

If it can't get the lock, raise "resource is busy" immediately, without blocking others;
this could be the default behavior.  If people do want to keep trying, they can set
a ddl_lock_timeout to some interval; in this case it will still block others (so it
can't be as good as what you are doing, but very simple), however the user
would know exactly what would happen and can coordinate with their
application accordingly.  I'm sorry if this is a bit off-topic
or has been discussed already.

--
Best Regards
Andy Fan

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Oct-15, Andy Fan wrote:

> I wonder if it would be possible to implement the detach with a NOWAIT option:
> 
> ALTER TABLE ... DETACH PARTITION .. [NOWAIT]
> 
> If it can't get the lock, raise "resource is busy" immediately,
> without blocking others; this could be the default behavior.  If
> people do want to keep trying, they can set a ddl_lock_timeout to
> some interval; in this case it will still block others (so it can't
> be as good as what you are doing, but very simple), however the user
> would know exactly what would happen and can coordinate with their
> application accordingly.  I'm sorry if this is a bit
> off-topic or has been discussed already.

Hi.  I don't think this has been discussed, but it doesn't really solve
the use case I want to address -- in many cases where the tables are
continuously busy, this would lead to starvation.  I think the proposal
to make the algorithm work with a reduced lock level is much more useful.

I think you can already do NOWAIT behavior, with LOCK TABLE .. NOWAIT
followed by DETACH PARTITION, perhaps with a nonzero statement timeout.
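Something like this, for example (names are placeholders; just a sketch of the
idea, not tested):

SET statement_timeout = '1s';
BEGIN;
LOCK TABLE parent IN ACCESS EXCLUSIVE MODE NOWAIT;
ALTER TABLE parent DETACH PARTITION part1;
COMMIT;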



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
David Rowley
Date:
On Thu, 15 Oct 2020 at 14:04, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> I wonder if it would be possible to implement the detach with a NOWAIT option:
>
> ALTER TABLE ... DETACH PARTITION .. [NOWAIT]
>
> If it can't get the lock, raise "resource is busy" immediately, without blocking others;
> this could be the default behavior.  If people do want to keep trying, they can set
> a ddl_lock_timeout to some interval; in this case it will still block others (so it
> can't be as good as what you are doing, but very simple), however the user
> would know exactly what would happen and can coordinate with their
> application accordingly.  I'm sorry if this is a bit off-topic
> or has been discussed already.

How would that differ from setting a low lock_timeout and running the DDL?

I think what Alvaro wants to avoid is taking the AEL in the first
place. When you have multiple long overlapping queries to the
partitioned table, there may never be a point in time where there are zero
locks on the table. It does not sound like your idea would help with
that.

David



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Andy Fan
Date:
Hi David/Alvaro:

On Thu, Oct 15, 2020 at 9:09 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 15 Oct 2020 at 14:04, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> I wonder if it would be possible to implement the detach with a NOWAIT option:
>
> ALTER TABLE ... DETACH PARTITION .. [NOWAIT]
>
> If it can't get the lock, raise "resource is busy" immediately, without blocking others;
> this could be the default behavior.  If people do want to keep trying, they can set
> a ddl_lock_timeout to some interval; in this case it will still block others (so it
> can't be as good as what you are doing, but very simple), however the user
> would know exactly what would happen and can coordinate with their
> application accordingly.  I'm sorry if this is a bit off-topic
> or has been discussed already.

How would that differ from setting a low lock_timeout and running the DDL?
 
They are exactly the same (I didn't know about this parameter when I sent the email).
 
I think what Alvaro wants to avoid is taking the AEL in the first
place.

I agree with this; that's why I said "so it can't be as good as what you are doing".
 
When you have multiple long overlapping queries to the
partitioned table, there may never be a point in time where there are zero
locks on the table. It does not sound like your idea would help with that.
 

Based on my current knowledge, "detach" will take an exclusive lock
and it will have higher priority than later waiters; it has to wait for the lock
holders before it (say, session 1), and at the same time it blocks all the other
waiters that request a lock after it, even if their lock mode is compatible with session 1's.
So "detach" can probably get its lock in a short time (unless some long transaction
runs before it).  I'm not sure if I have some misunderstanding here.

Overall I'd be +1 for this patch. 

-- 
Best Regards
Andy Fan

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Sep-24, Amit Langote wrote:

Hello Amit,

> Sorry I totally failed to see the v2 you had posted and a couple of
> other emails where you mentioned the issues I brought up.

No worries, I appreciate you reviewing this.

> However, I am a bit curious about including detached partitions in
> some cases while not in other, which can result in a (to me)
> surprising behavior as follows:
[ snip ]
> begin;
> insert into foo values (2);  -- ok
> select * from foo;
> select * from foo;  -- ?!
>  a | b
> ---+---
> (0 rows)
> 
> Maybe, it's fine to just always exclude detached partitions, although
> perhaps I am missing some corner cases that you have thought of?

Well, this particular case can be fixed by changing
ExecInitPartitionDispatchInfo specifically, from including detached
partitions to excluding them, as in the attached version.  Given your
example I think there is a fairly good case that they should be excluded
there.  I can't think of a case that this change breaks.

However I'm not sure that excluding them everywhere is sensible.  There
are currently two cases where they are included (search for calls to
CreatePartitionDirectory if you're curious).  One is snapshot-isolation
transactions (repeatable read and serializable) in
set_relation_partition_info, per the example from Hao Wu.  If we simply
exclude detached partitions there, repeatable read no longer works
properly; rows will just go missing for no apparent reason.  I don't
think this is acceptable.

The other case is ExecCreatePartitionPruneState().  The whole point of
including detached partitions here is to make them available for queries
that were planned before the detach and executed after the detach.  My
fear is that the pruning plan will contain references (from planner) to
partitions that the executor doesn't know about.  If there are extra
partitions at the executor side, it shouldn't harm anything (and it
shouldn't change query results); but I'm not sure that things will work
OK if partitions seen by the planner disappear from under the executor.

I'm posting this version just as a fresh rebase -- it is not yet
intended for commit.  I haven't touched the constraint stuff.  I still
think that that can be removed before commit, which is a simple change;
but even if not, the problem with the duplicated constraints should be
easy to fix.

Again, my thanks for reviewing.

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Here's an updated version of this patch.

Apart from rebasing to current master, I made the following changes:

* On the first transaction (the one that marks the partition as
detached), the partition is locked with ShareLock rather than
ShareUpdateExclusiveLock.  This means that DML is not allowed anymore.
This seems a reasonable restriction to me; surely by the time you're
detaching the partition you're not inserting data into it anymore.

* In ExecInitPartitionDispatchInfo, the previous version always
excluded detached partitions.  Now it does include them in isolation
level repeatable read and higher.  Considering the point above, this
sounds a bit contradictory: you shouldn't be inserting new tuples in
partitions being detached.  But if you do, it makes more sense: in RR
two queries that insert tuples in the same partition would not fail
mid-transaction.  (In a read-committed transaction, the second query
does fail, but to me that does not sound surprising.)

* ALTER TABLE .. DETACH PARTITION FINALIZE now waits for concurrent old
snapshots, as previously discussed. This should ensure that the user
doesn't just cancel the wait during the second transaction by Ctrl-C and
run FINALIZE immediately afterwards, which I claimed would bring
inconsistency.

* Avoid creating the CHECK constraint if an identical one already
exists.

(Note I do not remove the constraint on ATTACH.  That seems pointless.)

Still to do: test this using the new hook added by 6f0b632f083b.

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Anastasia Lubennikova
Date:
On 04.11.2020 02:56, Alvaro Herrera wrote:
> Here's an updated version of this patch.
>
> Apart from rebasing to current master, I made the following changes:
>
> * On the first transaction (the one that marks the partition as
> detached), the partition is locked with ShareLock rather than
> ShareUpdateExclusiveLock.  This means that DML is not allowed anymore.
> This seems a reasonable restriction to me; surely by the time you're
> detaching the partition you're not inserting data into it anymore.
>
> * In ExecInitPartitionDispatchInfo, the previous version always
> excluded detached partitions.  Now it does include them in isolation
> level repeatable read and higher.  Considering the point above, this
> sounds a bit contradictory: you shouldn't be inserting new tuples in
> partitions being detached.  But if you do, it makes more sense: in RR
> two queries that insert tuples in the same partition would not fail
> mid-transaction.  (In a read-committed transaction, the second query
> does fail, but to me that does not sound surprising.)
>
> * ALTER TABLE .. DETACH PARTITION FINALIZE now waits for concurrent old
> snapshots, as previously discussed. This should ensure that the user
> doesn't just cancel the wait during the second transaction by Ctrl-C and
> run FINALIZE immediately afterwards, which I claimed would bring
> inconsistency.
>
> * Avoid creating the CHECK constraint if an identical one already
> exists.
>
> (Note I do not remove the constraint on ATTACH.  That seems pointless.)
>
> Still to do: test this using the new hook added by 6f0b632f083b.

Status update for a commitfest entry.

The commitfest is nearing the end and this thread is "Waiting on Author".
As far as I can see, the last message contains a patch. Is there anything
left to work on, or does it need review now? Alvaro, are you planning to
continue working on it?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2020-Nov-30, Anastasia Lubennikova wrote:

> The commitfest is nearing the end and this thread is "Waiting on Author".
> As far as I see the last message contains a patch. Is there anything left to
> work on or it needs review now? Alvaro, are you planning to continue working
> on it?

Thanks Anastasia.  I marked it as needs review.



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Andy Fan
Date:


On Tue, Aug 4, 2020 at 7:49 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.


This would be a great feature.  Since we can't handle thousands of partitions
very well, users may agree to detach some old partitions automatically, and the blocking
issue here would be a big blocker for that solution.  Thanks for working on this!

 
There was a lot of great discussion which ended up in Robert completing
a much sought implementation of non-blocking ATTACH.  DETACH was
discussed too because it was a goal initially, but eventually dropped
from that patch altogether. Nonetheless, that thread provided a lot of
useful input to this implementation.  Important ones:

[1] https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w@mail.gmail.com
[2] https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=LJmug@mail.gmail.com
[3] https://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com

Attached is a patch that implements
ALTER TABLE ... DETACH PARTITION .. CONCURRENTLY.

In the previous thread we were able to implement the concurrent model
without the extra keyword.  For this one I think that won't work; my
implementation works in two transactions so there's a restriction that
you can't run it in a transaction block.  Also, there's a wait phase
that makes it slower than the non-concurrent one.  Those two drawbacks
make me think that it's better to keep both modes available, just like
we offer both CREATE INDEX and CREATE INDEX CONCURRENTLY.

Why two transactions?  The reason is that in order for this to work, we
make a catalog change (mark it detached), and commit so that all
concurrent transactions can see the change.  A second transaction waits
for anybody who holds any lock on the partitioned table and grabs Access
Exclusive on the partition (which now no one cares about, if they're
looking at the partitioned table), where the DDL action on the partition
can be completed.

ALTER TABLE is normally unable to run in two transactions.  I hacked it
(0001) so that the relation can be closed and reopened in the Exec phase
(by having the rel as part of AlteredTableInfo: when ATRewriteCatalogs
returns, it uses that pointer to close the rel).  It turns out that this
is sufficient to make that work.  This means that ALTER TABLE DETACH
CONCURRENTLY cannot work as part of a multi-command ALTER TABLE, but
that's already enforced by the grammar anyway.

DETACH CONCURRENTLY doesn't work if a default partition exists.  It's
just too problematic a case; you would still need to have AEL on the
default partition.


I haven't yet experimented with queries running in a standby in tandem
with a detach.

--
Álvaro Herrera


--
Best Regards
Andy Fan

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> I added that test as promised, and I couldn't find any problems, so I
> have pushed it.

Buildfarm testing suggests there's an issue under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-03-29%2018%3A14%3A24

specifically

diff -U3 /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
--- /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out	2021-03-29 20:14:21.258199311 +0200
+++ /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out	2021-03-30 18:54:34.272839428 +0200
@@ -324,6 +324,7 @@
 1
 2
 step s1insert: insert into d4_fk values (1);
+ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
 step s1c: commit;

 starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s s1insert s1c

            regards, tom lane



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Mon, Apr 12, 2021 at 6:20 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Mar-31, Tom Lane wrote:
>
> > diff -U3 /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > --- /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out	2021-03-29 20:14:21.258199311 +0200
> > +++ /home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out	2021-03-30 18:54:34.272839428 +0200
 
> > @@ -324,6 +324,7 @@
> >  1
> >  2
> >  step s1insert: insert into d4_fk values (1);
> > +ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
> >  step s1c: commit;
> >
> >  starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s s1insert s1c
>
> Hmm, actually, looking at this closely, I think the expected output is
> bogus and trilobite is doing the right thing by throwing this error
> here.  The real question is why isn't this case behaving in that way in
> every *other* animal.

Indeed.

I can see a wrong behavior of RelationGetPartitionDesc() in a case
that resembles the above test case.

drop table if exists d4_primary, d4_primary1, d4_fk, d4_pid;
create table d4_primary (a int primary key) partition by list (a);
create table d4_primary1 partition of d4_primary for values in (1);
create table d4_primary2 partition of d4_primary for values in (2);
insert into d4_primary values (1);
insert into d4_primary values (2);
create table d4_fk (a int references d4_primary);
insert into d4_fk values (2);
create table d4_pid (pid int);

Session 1:
begin isolation level serializable;
select * from d4_primary;
 a
---
 1
 2
(2 rows)

Session 2:
alter table d4_primary detach partition d4_primary1 concurrently;
<waits>

Session 1:
-- can see d4_primary1 as detached at this point, though still scans!
select * from d4_primary;
 a
---
 1
 2
(2 rows)
insert into d4_fk values (1);
INSERT 0 1
end;

Session 2:
<alter-finishes>
ALTER TABLE

Session 1:
-- FK violated
select * from d4_primary;
 a
---
 2
(1 row)
select * from d4_fk;
 a
---
 1
(1 row)

The 2nd select that session 1 performs adds d4_primary1, whose detach
it *sees* is pending, to the PartitionDesc, but without setting its
includes_detached.  The subsequent insert's RI query, because it uses
that PartitionDesc as-is, returns 1 as being present in d4_primary,
leading to the insert succeeding.  When session 1's transaction
commits, the waiting ALTER proceeds with committing the 2nd part of
the DETACH, without having a chance again to check if the DETACH would
lead to the foreign key of d4_fk being violated.

It seems problematic to me that the logic of setting includes_detached
is oblivious of the special check in find_inheritance_children() to
decide whether "force"-include a detach-pending child based on
cross-checking its pg_inherit tuple's xmin against the active
snapshot.  Maybe fixing that would help, although I haven't tried that
myself yet.
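
To make that concrete, here is a tiny standalone model (simplified types
and a fake snapshot test, not the actual backend code) of how the two
would need to move together -- the force-include decision and the
includes_detached flag:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/* stand-in for "does this xmin still appear in progress to the snapshot?" */
static bool
xmin_in_progress(TransactionId xmin, TransactionId snapshot_xmax)
{
    return xmin >= snapshot_xmax;
}

/* decide whether a detach-pending child belongs in the visible set */
static bool
include_detach_pending_child(bool include_detached,
                             TransactionId inherits_xmin,
                             TransactionId snapshot_xmax,
                             bool *includes_detached)
{
    if (include_detached ||
        xmin_in_progress(inherits_xmin, snapshot_xmax))
    {
        /*
         * Whenever a detach-pending child ends up included, the descriptor
         * built from the result must say so, or a later caller that wants
         * detached partitions omitted will happily reuse it.
         */
        *includes_detached = true;
        return true;
    }
    return false;               /* the detach is visible to us; omit it */
}

int
main(void)
{
    bool    flagged = false;

    /* detach marked by xid 750, which a snapshot with xmax 740 cannot see */
    printf("included: %d\n",
           include_detach_pending_child(false, 750, 740, &flagged));
    printf("includes_detached: %d\n", flagged);
    return 0;
}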

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Mon, Apr 12, 2021 at 4:42 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Mon, Apr 12, 2021 at 6:20 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2021-Mar-31, Tom Lane wrote:
> >
> > > diff -U3
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > > ---
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
2021-03-29 20:14:21.258199311 +0200 
> > > +++
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
  2021-03-30 18:54:34.272839428 +0200 
> > > @@ -324,6 +324,7 @@
> > >  1
> > >  2
> > >  step s1insert: insert into d4_fk values (1);
> > > +ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
> > >  step s1c: commit;
> > >
> > >  starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s s1insert s1c
> >
> > Hmm, actually, looking at this closely, I think the expected output is
> > bogus and trilobite is doing the right thing by throwing this error
> > here.  The real question is why isn't this case behaving in that way in
> > every *other* animal.
>
> Indeed.
>
> I can see a wrong behavior of RelationGetPartitionDesc() in a case
> that resembles the above test case.
>
> drop table if exists d4_primary, d4_primary1, d4_fk, d4_pid;
> create table d4_primary (a int primary key) partition by list (a);
> create table d4_primary1 partition of d4_primary for values in (1);
> create table d4_primary2 partition of d4_primary for values in (2);
> insert into d4_primary values (1);
> insert into d4_primary values (2);
> create table d4_fk (a int references d4_primary);
> insert into d4_fk values (2);
> create table d4_pid (pid int);
>
> Session 1:
> begin isolation level serializable;
> select * from d4_primary;
>  a
> ---
>  1
>  2
> (2 rows)
>
> Session 2:
> alter table d4_primary detach partition d4_primary1 concurrently;
> <waits>
>
> Session 1:
> -- can see d4_primary1 as detached at this point, though still scans!
> select * from d4_primary;
>  a
> ---
>  1
>  2
> (2 rows)
> insert into d4_fk values (1);
> INSERT 0 1
> end;
>
> Session 2:
> <alter-finishes>
> ALTER TABLE
>
> Session 1:
> -- FK violated
> select * from d4_primary;
>  a
> ---
>  2
> (1 row)
> select * from d4_fk;
>  a
> ---
>  1
> (1 row)
>
> The 2nd select that session 1 performs adds d4_primary1, whose detach
> it *sees* is pending, to the PartitionDesc, but without setting its
> includes_detached.  The subsequent insert's RI query, because it uses
> that PartitionDesc as-is, returns 1 as being present in d4_primary,
> leading to the insert succeeding.  When session 1's transaction
> commits, the waiting ALTER proceeds with committing the 2nd part of
> the DETACH, without having a chance again to check if the DETACH would
> lead to the foreign key of d4_fk being violated.
>
> It seems problematic to me that the logic of setting includes_detached
> is oblivious of the special check in find_inheritance_children() to
> decide whether to "force"-include a detach-pending child based on
> cross-checking its pg_inherits tuple's xmin against the active
> snapshot.  Maybe fixing that would help, although I haven't tried that
> myself yet.

I tried that in the attached.  It is indeed the above failing
isolation test whose output needed to be adjusted.

While at it, I tried rewording the comment around that special
visibility check done to force-include detached partitions, as I got
confused by the way it's worded currently.  Actually, it may be a good
idea to add some comments around the intended include_detached
behavior in the places where PartitionDesc is used; e.g.
set_relation_partition_info() lacks a one-liner on why it's okay for
the planner to not see detached partitions.  Or perhaps, a comment for
includes_detached of PartitionDesc should describe the various cases
in which it is true and the cases in which it is not.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Mon, Apr 12, 2021 at 6:23 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Sun, Apr 11, 2021 at 05:20:35PM -0400, Alvaro Herrera wrote:
> > On 2021-Mar-31, Tom Lane wrote:
> >
> > > diff -U3
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
> > > ---
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
2021-03-29 20:14:21.258199311 +0200 
> > > +++
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
2021-03-30 18:54:34.272839428 +0200 
> > > @@ -324,6 +324,7 @@
> > >  1
> > >  2
> > >  step s1insert: insert into d4_fk values (1);
> > > +ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"
> > >  step s1c: commit;
> > >
> > >  starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s s1insert s1c
> >
> > Hmm, actually, looking at this closely, I think the expected output is
> > bogus and trilobite is doing the right thing by throwing this error
> > here.  The real question is why isn't this case behaving in that way in
> > every *other* animal.
>
> I was looking/thinking at it, and wondered whether it could be a race condition
> involving pg_cancel_backend()

I thought about it some and couldn't come up with an explanation as to
why a pg_cancel_backend() race might be a problem.

Actually it occurred to me this morning that CLOBBER_CACHE_ALWAYS is
what exposed this problem on this animal (not sure if other such
animals did too though).  With CLOBBER_CACHE_ALWAYS, a PartitionDesc
will be built afresh on most uses.  In this particular case, the RI
query executed by the insert has to build a new one (for d4_primary),
correctly excluding the detach-pending partition (d4_primary1) per the
snapshot with which it is run.  In normal builds, it would reuse the
one built by an earlier query in the transaction, which does include
the detach-pending partition, thus allowing the insert trying to
insert a row referencing that partition to succeed.  There is a
provision in RelationGetPartitionDesc() to rebuild if any
detach-pending partitions in the existing copy of PartitionDesc are
not to be seen by the current query, but a bug mentioned in my earlier
reply prevents that from kicking in.


--
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-13, Amit Langote wrote:

> Actually it occurred to me this morning that CLOBBER_CACHE_ALWAYS is
> what exposed this problem on this animal (not sure if other such
> animals did too though).  With CLOBBER_CACHE_ALWAYS, a PartitionDesc
> will be built afresh on most uses.  In this particular case, the RI
> query executed by the insert has to build a new one (for d4_primary),
> correctly excluding the detach-pending partition (d4_primary1) per the
> snapshot with which it is run.  In normal builds, it would reuse the
> one built by an earlier query in the transaction, which does include
> the detach-pending partition, thus allowing the insert trying to
> insert a row referencing that partition to succeed.  There is a
> provision in RelationGetPartitionDesc() to rebuild if any
> detach-pending partitions in the existing copy of PartitionDesc are
> not to be seen by the current query, but a bug mentioned in my earlier
> reply prevents that from kicking in.

Right -- that explanation makes perfect sense: the problem stems from
the fact that the cached copy of the partition descriptor may not be
valid, depending on the visibility of detached partitions for the
operation that requests the descriptor.  I think your patch is a
critical part of
a complete solution, but one thing is missing: we don't actually know
that the detached partitions we see now are the same detached partitions
we saw a moment ago.  After all, a partitioned table can have several
partitions in the process of being detached; so if you just go with the
boolean "does it have any detached or not" bit, you could end up with a
descriptor that doesn't include/ignore *all* detached partitions, just
the older one(s).

I think you could fix this aspect easily by decreeing that you can have
only one partition-being-detached at a time.  So if you try to
DETACH CONCURRENTLY and there's another one in that state, raise an
error.  Maybe for simplicity we should do that anyway.

But I think there's another hidden assumption in your patch, which is
that the descriptor is rebuilt every now and then *anyway* because the
flag for detached flips between parser and executor, and because we send
invalidation messages for each detach.  I don't think we would ever
change things that would break this flipping (it sounds like planner and
executor necessarily have to be doing things differently all the time),
but it seems fragile as heck.  I would feel much safer if we just
avoided caching the wrong thing ... or perhaps kept a separate cache
entry (one descriptor including detached partitions, another one not), to avoid
pointless rebuilds.
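
Just to illustrate that last idea, a toy model (invented names and types,
not a patch) of keeping the two descriptors side by side, each one rebuilt
only when its own slot is empty or has been invalidated:

#include <stdbool.h>
#include <stdio.h>

typedef struct PartitionDescData { int nparts; } PartitionDescData;
typedef PartitionDescData *PartitionDesc;

/* two cached descriptors side by side (names invented for this sketch) */
typedef struct RelCacheEntry
{
    PartitionDesc partdesc;             /* includes detach-pending partitions */
    PartitionDesc partdesc_nodetached;  /* omits them; NULL until needed */
} RelCacheEntry;

static PartitionDescData all_parts = {3};
static PartitionDescData live_parts = {2};

/* stand-in for rebuilding a descriptor from the catalogs */
static PartitionDesc
build_partdesc(bool omit_detached)
{
    return omit_detached ? &live_parts : &all_parts;
}

static PartitionDesc
get_partdesc(RelCacheEntry *rel, bool omit_detached)
{
    PartitionDesc *slot = omit_detached ?
        &rel->partdesc_nodetached : &rel->partdesc;

    /* rebuild only if this particular flavor is missing; the other one stays */
    if (*slot == NULL)
        *slot = build_partdesc(omit_detached);
    return *slot;
}

int
main(void)
{
    RelCacheEntry rel = {0};

    printf("all: %d, omitting detached: %d\n",
           get_partdesc(&rel, false)->nparts,
           get_partdesc(&rel, true)->nparts);
    return 0;
}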

-- 
Álvaro Herrera       Valdivia, Chile



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
OK so after mulling this over for a long time, here's a patch for this.
It solves the problem by making the partition descriptor no longer be
cached if a detached partition is omitted.  Any transaction that needs a
partition descriptor that excludes detached partitions will have to
recreate the partdesc from scratch.  To support this, I changed the
output boolean semantics: instead of "does this partdesc include any
detached partitions" as in your patch, it now is "are there any detached
partitions".  But whenever no detached partitions exist, or when all
partitions including detached ones are requested, the cached descriptor
is returned (because that necessarily has to be correct).  The main
difference from your patch is that we always keep the descriptor
in the cache and don't rebuild anything, unless a detached partition is
present.

I flipped the find_inheritance_children() input boolean, from
"include_detached" to "omit_detached".  This is more natural, given the
internal behavior.  You could argue to propagate that naming change to
the partdesc.h API and PartitionDirectory, but I don't think there's a
need for that.

I ran all the detach-partition-concurrently tests under
debug_invalidate_system_caches_always=1 and everything passes.

I experimented with keeping a separate cached partition descriptor that
omits the detached partition, but that brings back some trouble; I
couldn't find a way to invalidate such a cached entry in a reasonable
way.  I have the patch for that, if somebody wants to play with it.

-- 
Álvaro Herrera       Valdivia, Chile
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Actually I had a silly bug in the version that attempted to cache a
partdesc that omits detached partitions.  This one, while not fully
baked, seems to work correctly (on top of the previous one).

The thing that I don't fully understand is why we have to require that
the regular one has been built first.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
While the approach in the previous email does pass the tests, I think
(but couldn't find a test case to prove) it does so coincidentally, not
because it is correct.  If I make the test for "detached exist" use the
cached omits-partitions-partdesc, it does fail, because we had
previously cached one that was not yet omitting the partition.  So what
I said earlier in the thread stands: the set of partitions that are
considered detached changes depending on what the active snapshot is,
and therefore we *must not* cache any such descriptor.

So I backtracked to my previous proposal, which saves in relcache only
the partdesc that includes all partitions.  If any partdesc is built
that omits partitions being detached, that one must be rebuilt afresh
each time.  And to avoid potentially saving a lot of single-use
partdescs in CacheMemoryContext, in the attached second patch (which I
attach separately only to make it more obvious to review) I store such
partdescs in PortalContext.

Barring objections, I will get this pushed early tomorrow.

-- 
Álvaro Herrera       Valdivia, Chile
"Just treat us the way you want to be treated + some extra allowance
 for ignorance."                                    (Michael Brusser)

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
(Sorry about being away from this for over a week.)

On Thu, Apr 22, 2021 at 5:39 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> While the approach in the previous email does pass the tests, I think
> (but couldn't find a test case to prove) it does so coincidentally, not
> because it is correct.  If I make the test for "detached exist" use the
> cached omits-partitions-partdesc, it does fail, because we had
> previously cached one that was not yet omitting the partition.  So what
> I said earlier in the thread stands: the set of partitions that are
> considered detached changes depending on what the active snapshot is,
> and therefore we *must not* cache any such descriptor.
>
> So I backtracked to my previous proposal, which saves in relcache only
> the partdesc that includes all partitions.  If any partdesc is built
> that omits partitions being detached, that one must be rebuilt afresh
> each time.  And to avoid potentially saving a lot of single-use
> partdescs in CacheMemoryContext, in the attached second patch (which I
> attach separately only to make it more obvious to review) I store such
> partdescs in PortalContext.
>
> Barring objections, I will get this pushed early tomorrow.

Thanks for updating the patch.  I have mostly cosmetic comments.

Reading through the latest one, seeing both include_detached and
omit_detached being used in different parts of the code makes it a bit
hard to keep in mind what a given code path is doing wrt detached
partitions.  How about making it all omit_detached?

         * Cope with partitions concurrently being detached.  When we see a
-        * partition marked "detach pending", we only include it in the set of
-        * visible partitions if caller requested all detached partitions, or
-        * if its pg_inherits tuple's xmin is still visible to the active
-        * snapshot.
+        * partition marked "detach pending", we omit it from the returned
+        * descriptor if caller requested that and the tuple's xmin does not
+        * appear in progress to the active snapshot.

It seems odd for a comment in find_inheritance_children() to talk
about the "descriptor".   Maybe the earlier "set of visible
partitions" wording was fine?

-        * The reason for this check is that we want to avoid seeing the
+        * The reason for this hack is that we want to avoid seeing the
         * partition as alive in RI queries during REPEATABLE READ or
<snip>
+        * SERIALIZABLE transactions.

The comment doesn't quite make it clear why it is the RI query case
that necessitates this hack in the first place.  Maybe the relation to
what's going on with the partdesc

+   if (likely(rel->rd_partdesc &&
+              (!rel->rd_partdesc->detached_exist || include_detached)))
+       return rel->rd_partdesc;

I think it would help to have a comment about what's going on here, in
addition to the description you already wrote for
PartitionDescData.detached_exist.  Maybe something along the lines of:

===
Under normal circumstances, we just return the partdesc that was
already built.  However, if the partdesc was built at a time when
there existed detach-pending partition(s), which the current caller
would rather not see (omit_detached), then we build one afresh
omitting any such partitions and return that one.
RelationBuildPartitionDesc() makes sure that any such partdescs will
disappear when the query finishes.
===

That's maybe a bit verbose but I am sure you will find a way to write
it more succinctly.

BTW, I do feel a bit alarmed by the potential performance impact of
this.  If detached_exist of a cached partdesc is true, then RI queries
invoked during a bulk DML operation would have to rebuild one for
every tuple to be checked, right?  I haven't tried an actual example
yet though.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-22, Amit Langote wrote:

> On Thu, Apr 22, 2021 at 5:39 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Reading through the latest one, seeing both include_detached and
> omit_detached being used in different parts of the code makes it a bit
> hard to keep in mind what a given code path is doing wrt detached
> partitions.  How about making it all omit_detached?

Yeah, I hesitated but wanted to do that too.  Done.

>          * Cope with partitions concurrently being detached.  When we see a
> -        * partition marked "detach pending", we only include it in the set of
> -        * visible partitions if caller requested all detached partitions, or
> -        * if its pg_inherits tuple's xmin is still visible to the active
> -        * snapshot.
> +        * partition marked "detach pending", we omit it from the returned
> +        * descriptor if caller requested that and the tuple's xmin does not
> +        * appear in progress to the active snapshot.
> 
> It seems odd for a comment in find_inheritance_children() to talk
> about the "descriptor".   Maybe the earlier "set of visible
> partitions" wording was fine?

Absolutely -- done that way.

> -        * The reason for this check is that we want to avoid seeing the
> +        * The reason for this hack is that we want to avoid seeing the
>          * partition as alive in RI queries during REPEATABLE READ or
> <snip>
> +        * SERIALIZABLE transactions.
> 
> The comment doesn't quite make it clear why it is the RI query case
> that necessitates this hack in the first place.

I added "such queries use a different snapshot than the one used by
regular (user) queries."  I hope that's sufficient.

> Maybe the relation to what's going on with the partdesc
> 
> +   if (likely(rel->rd_partdesc &&
> +              (!rel->rd_partdesc->detached_exist || include_detached)))
> +       return rel->rd_partdesc;
> 
> I think it would help to have a comment about what's going on here, in
> addition to the description you already wrote for
> PartitionDescData.detached_exist.  Maybe something along the lines of:
> 
> ===
> Under normal circumstances, we just return the partdesc that was
> already built.  However, if the partdesc was built at a time when
> there existed detach-pending partition(s), which the current caller
> would rather not see (omit_detached), then we build one afresh
> omitting any such partitions and return that one.
> RelationBuildPartitionDesc() makes sure that any such partdescs will
> disappear when the query finishes.
> ===
> 
> That's maybe a bit verbose but I am sure you will find a way to write
> it more succinctly.

I added some text in this spot, and also wrote some more in the comment
above RelationGetPartitionDesc and RelationBuildPartitionDesc.

> BTW, I do feel a bit alarmed by the potential performance impact of
> this.  If detached_exist of a cached partdesc is true, then RI queries
> invoked during a bulk DML operation would have to rebuild one for
> every tuple to be checked, right?  I haven't tried an actual example
> yet though.

Yeah, I was scared about that too (which is why I insisted on trying to
add a cached copy of the partdesc omitting detached partitions).  But
AFAICS what happens is that the plan for the RI query gets cached after
a few tries; so we do build the partdesc a few times at first, but later
we use the cached plan and so we no longer use that one.  So at least in
the normal cases this isn't a serious problem that I can see.

I pushed it now.  Thanks for your help,

-- 
Álvaro Herrera       Valdivia, Chile
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."         (ncm, http://lwn.net/Articles/174769/)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Fri, Apr 23, 2021 at 4:26 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Apr-22, Amit Langote wrote:
> > -        * The reason for this check is that we want to avoid seeing the
> > +        * The reason for this hack is that we want to avoid seeing the
> >          * partition as alive in RI queries during REPEATABLE READ or
> > <snip>
> > +        * SERIALIZABLE transactions.
> >
> > The comment doesn't quite make it clear why it is the RI query case
> > that necessitates this hack in the first place.
>
> I added "such queries use a different snapshot than the one used by
> regular (user) queries."  I hope that's sufficient.

Yeah, that makes sense.

> > Maybe the relation to what's going on with the partdesc

(I had to leave my desk while in the middle of typing this, but I
forget what I was going to add :()

> > BTW, I do feel a bit alarmed by the potential performance impact of
> > this.  If detached_exist of a cached partdesc is true, then RI queries
> > invoked during a bulk DML operation would have to rebuild one for
> > every tuple to be checked, right?  I haven't tried an actual example
> > yet though.
>
> Yeah, I was scared about that too (which is why I insisted on trying to
> add a cached copy of the partdesc omitting detached partitions).  But
> AFAICS what happens is that the plan for the RI query gets cached after
> a few tries; so we do build the partdesc a few times at first, but later
> we use the cached plan and so we no longer use that one.  So at least in
> the normal cases this isn't a serious problem that I can see.

Actually, ri_triggers.c (or really plancache.c) is not very good at
caching the plan when querying partitioned tables; it always chooses
to replan because a generic plan, even with runtime pruning built into
it, looks very expensive compared to a custom one.  Now that's a
problem we will have to fix sooner than later, but until then we have
to work around it.

Here is an example that shows the problem:

create unlogged table pk_parted (a int primary key) partition by range (a);
select 'create unlogged table pk_parted_' || i || ' partition of
pk_parted for values from (' || (i-1) * 1000 + 1 || ') to (' ||  i *
1000 + 1 || ');' from generate_series(1, 1000) i;
\gexec
create unlogged table fk (a int references pk_parted);
insert into pk_parted select generate_series(1, 10000);
begin;
select * from pk_parted where a = 1;

In another session:

alter table pk_parted detach partition pk_parted_1000 concurrently;
<blocks; cancel using ctrl-c>

Back in the 1st session:

end;
insert into fk select generate_series(1, 10000);
INSERT 0 10000
Time: 47400.792 ms (00:47.401)

The insert took unusually long, because the PartitionDesc for
pk_parted had to be built exactly 10000 times, because there's a
detach-pending partition lying around.  There is also a danger of an
OOM with such an insert because of leaking into PortalContext the
memory of every PartitionDesc thus built, especially with larger
counts of PK partitions and rows inserted into the FK table.

Also, I noticed that all queries on pk_parted, not just the RI
queries, have to build the PartitionDesc every time, so take that much
longer:

-- note the planning time
explain analyze select * from pk_parted where a = 1;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted  (cost=0.28..8.29 rows=1 width=4) (actual time=0.016..0.017 rows=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 7.543 ms
 Execution Time: 0.044 ms
(5 rows)

Finalizing the detach makes the insert and the query finish in normal
time, because the PartitionDesc can be cached again:

alter table pk_parted detach partition pk_parted_1000 finalize;
insert into fk select generate_series(1, 10000);
INSERT 0 10000
Time: 855.336 ms

explain analyze select * from pk_parted where a = 1;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using pk_parted_1_pkey on pk_parted_1 pk_parted  (cost=0.28..8.29 rows=1 width=4) (actual time=0.033..0.036 rows=1 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 1
 Planning Time: 0.202 ms
 Execution Time: 0.075 ms
(5 rows)

I am afraid we may have to fix this in the code after all, because
there does not seem to be a good way to explain this away in the
documentation.  If I read correctly, you did try an approach of
caching the PartitionDesc that we currently don't, no?

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-23, Amit Langote wrote:

> Back in the 1st session:
> 
> end;
> insert into fk select generate_series(1, 10000);
> INSERT 0 10000
> Time: 47400.792 ms (00:47.401)

I guess I was wrong about that ... the example I tried didn't have 1000s
of partitions, and I used debug print-outs to show when a new partdesc
was being rebuilt, and only six were occurring.  I'm not sure why my
case behaves so differently from yours, but clearly from the timing this
is not good.

> I am afraid we may have to fix this in the code after all, because
> there does not seem a good way to explain this away in the
> documentation.  

Yeah, looking at this case, I agree that it needs a fix of some kind.

> If I read correctly, you did try an approach of caching the
> PartitionDesc that we currently don't, no?

I think the patch I posted was too simple.  I think a real fix requires
us to keep track of exactly in what way the partdesc is outdated, so
that we can compare to the current situation when deciding whether to use that
partdesc or build a new one.  For example, we could keep track of a list
of OIDs of detached partitions (and it simplifies things a lot if we allow
only a single partition in this situation, because it's either zero OIDs
or one OID); or we can save the Xmin of the pg_inherits tuple for the
detached partition (and we can compare that Xmin to our current active
snapshot and decide whether to use that partdesc or not).
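
In rough pseudo-C, the xmin idea would look something like this (a
standalone sketch with invented names and a fake visibility test, not
actual backend code):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId 0

typedef struct PartitionDescData { int nparts; } PartitionDescData;
typedef PartitionDescData *PartitionDesc;

typedef struct RelCacheEntry
{
    PartitionDesc   partdesc_nodetached;        /* cached, may be NULL */
    TransactionId   partdesc_nodetached_xmin;   /* detach's pg_inherits xmin */
} RelCacheEntry;

/* stand-in for the real snapshot test */
static bool
xid_visible_to_snapshot(TransactionId xid, TransactionId snapshot_xmax)
{
    return xid != InvalidTransactionId && xid < snapshot_xmax;
}

static bool
can_reuse_nodetached(const RelCacheEntry *rel, TransactionId snapshot_xmax)
{
    /*
     * Reusable only if we know which detach the cached descriptor omitted
     * and that detach still looks committed (i.e. visible) to the snapshot
     * of the query at hand; otherwise a rebuild is needed.
     */
    return rel->partdesc_nodetached != NULL &&
        xid_visible_to_snapshot(rel->partdesc_nodetached_xmin, snapshot_xmax);
}

int
main(void)
{
    PartitionDescData d = {2};
    RelCacheEntry rel = {&d, 750};      /* built omitting detach by xid 750 */

    printf("reuse with newer snapshot: %d\n", can_reuse_nodetached(&rel, 800));
    printf("reuse with older snapshot: %d\n", can_reuse_nodetached(&rel, 700));
    return 0;
}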

I'll experiment with this a little more and propose a patch later today.

I don't think it's too much of a problem to state that you need to
finalize one detach before you can do the next one.  After all, with
regular detach, you'd have the tables exclusively locked so you can't do
them in parallel anyway.  (It also increases the chances that people
will finalize detach operations that went unnoticed.)

-- 
Álvaro Herrera                            39°49'30"S 73°17'W



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-23, Alvaro Herrera wrote:

> I think the patch I posted was too simple.  I think a real fix requires
> us to keep track of exactly in what way the partdesc is outdated, so
> that we can compare to the current situation when deciding whether to use that
> partdesc or build a new one.  For example, we could keep track of a list
> of OIDs of detached partitions (and it simplifies things a lot if we allow
> only a single partition in this situation, because it's either zero OIDs
> or one OID); or we can save the Xmin of the pg_inherits tuple for the
> detached partition (and we can compare that Xmin to our current active
> snapshot and decide whether to use that partdesc or not).
> 
> I'll experiment with this a little more and propose a patch later today.

This (POC-quality) seems to do the trick.

(I restored the API of find_inheritance_children, because it was getting
a little obnoxious.  I haven't thought this through but I think we
should do something like it.)

> I don't think it's too much of a problem to state that you need to
> finalize one detach before you can do the next one.  After all, with
> regular detach, you'd have the tables exclusively locked so you can't do
> them in parallel anyway.  (It also increases the chances that people
> will finalize detach operations that went unnoticed.)

I haven't added a mechanism to verify this; but with asserts on, this
patch will crash if you have more than one.  I think the behavior is not
necessarily sane with asserts off, since you'll get an arbitrary
detach-Xmin assigned to the partdesc, depending on catalog scan order.

-- 
Álvaro Herrera       Valdivia, Chile
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
Hi Alvaro,

On Sat, Apr 24, 2021 at 8:31 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Apr-23, Alvaro Herrera wrote:
> > I think the patch I posted was too simple.  I think a real fix requires
> > us to keep track of exactly in what way the partdesc is outdated, so
> > that we can compare to the current situation in deciding to use that
> > partdesc or build a new one.  For example, we could keep track of a list
> > of OIDs of detached partitions (and it simplifies things a lot if allow
> > only a single partition in this situation, because it's either zero OIDs
> > or one OID); or we can save the Xmin of the pg_inherits tuple for the
> > detached partition (and we can compare that Xmin to our current active
> > snapshot and decide whether to use that partdesc or not).
> >
> > I'll experiment with this a little more and propose a patch later today.
>
> This (POC-quality) seems to do the trick.

Thanks for the patch.

> (I restored the API of find_inheritance_children, because it was getting
> a little obnoxious.  I haven't thought this through but I think we
> should do something like it.)

+1.

> > I don't think it's too much of a problem to state that you need to
> > finalize one detach before you can do the next one.  After all, with
> > regular detach, you'd have the tables exclusively locked so you can't do
> > them in parallel anyway.  (It also increases the chances that people
> > will finalize detach operations that went unnoticed.)

That sounds reasonable.

> I haven't added a mechanism to verify this; but with asserts on, this
> patch will crash if you have more than one.  I think the behavior is not
> necessarily sane with asserts off, since you'll get an arbitrary
> detach-Xmin assigned to the partdesc, depending on catalog scan order.

Maybe this is an ignorant question but is the plan to add an elog() in
this code path or a check (and an ereport()) somewhere in
ATExecDetachPartition() to prevent more than one partition ending up
in detach-pending state?

Please allow me to study the patch a bit more closely and get back tomorrow.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Hello Amit,

On 2021-Apr-26, Amit Langote wrote:

> On Sat, Apr 24, 2021 at 8:31 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > I haven't added a mechanism to verify this; but with asserts on, this
> > patch will crash if you have more than one.  I think the behavior is not
> > necessarily sane with asserts off, since you'll get an arbitrary
> > detach-Xmin assigned to the partdesc, depending on catalog scan order.
> 
> Maybe this is an ignorant question but is the plan to add an elog() in
> this code path or a check (and an ereport()) somewhere in
> ATExecDetachPartition() to prevent more than one partition ending up
> in detach-pending state?

Yeah, that's what I'm planning to do.

> Please allow me to study the patch a bit more closely and get back tomorrow.

Sure, thanks!

-- 
Álvaro Herrera       Valdivia, Chile
"But static content is just dynamic content that isn't moving!"
                http://smylers.hates-software.com/2007/08/15/fe244d0c.html



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-26, Alvaro Herrera wrote:

> > Please allow me to study the patch a bit more closely and get back tomorrow.
> 
> Sure, thanks!

Here's a more polished version.

After trying the version with the elog(ERROR) when two detached
partitions are present, I decided against it; it is unhelpful because
it doesn't let you build partition descriptors for anything.  So I made
it an elog(WARNING) (not an ereport, note), and keep the most recent
pg_inherits.xmin value.  This is not great, but it lets you out of the
situation by finalizing one of the detaches.

The other check (at ALTER TABLE .. DETACH time) is an ereport(ERROR) and
should make the first one unreachable.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
Sorry, I forgot to update some comments in that version.  Fixed here.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
Thanks for the updated patch.  I've been reading it, but I noticed a
bug in 8aba9322511f, which I thought you'd want to know to make a note
of when committing this one.

So we decided in 8aba9322511f that it is okay to make the memory
context in which a transient partdesc is allocated a child of
PortalContext so that it disappears when the portal does.  But
PortalContext is apparently NULL when the planner runs, at least in
the "simple" query protocol, so any transient partdescs built by the
planner would effectively leak.  Not good.

With this patch, partdesc_nodetached is no longer transient, so the
problem doesn't exist.

I will write more about the updated patch in a bit...

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Tue, Apr 27, 2021 at 4:34 PM Amit Langote <amitlangote09@gmail.com> wrote:
> Thanks for the updated patch.  I've been reading it, but I noticed a
> bug in 8aba9322511f, which I thought you'd want to know to make a note
> of when committing this one.
>
> So we decided in 8aba9322511f that it is okay to make the memory
> context in which a transient partdesc is allocated a child of
> PortalContext so that it disappears when the portal does.  But
> PortalContext is apparently NULL when the planner runs, at least in
> the "simple" query protocol, so any transient partdescs built by the
> planner would effectively leak.  Not good.
>
> With this patch, partdesc_nodetached is no longer transient, so the
> problem doesn't exist.
>
> I will write more about the updated patch in a bit...

The first thing that struck me about partdesc_nodetached is that it's
not handled in RelationClearRelation(), where we (re)set a regular
partdesc to NULL so that the next RelationGetPartitionDesc() has to
build it from scratch.  I think partdesc_nodetached and the xmin
should likewise be reset.  Also in load_relcache_init_file(), although
nothing serious there.

That said, I think I may have found a couple of practical problems
with partdesc_nodetached, or more precisely with having it
side-by-side with regular partdesc.  Maybe they can be fixed, so the
problems are not necessarily deal breakers for the patch's main idea.  The
problems can be seen when different queries in a serializable
transaction have to use both the regular partdesc and
partdesc_nodetached in a given relcache entry.  For example, try the
following after first creating a situation where the table p has a
detach-pending partition p2 (for values in (2)) and a live partition p1
(for values in (1)).

begin isolation level serializable;
insert into p values (1);
select * from p where a = 1;
insert into p values (1);

The 1st insert succeeds but the 2nd fails with:

ERROR:  no partition of relation "p" found for row
DETAIL:  Partition key of the failing row contains (a) = (1).

I haven't analyzed this error very closely but there is another
situation which causes a crash due to what appears to be a conflict
with rd_pdcxt's design:

-- run in a new session
begin isolation level serializable;
select * from p where a = 1;
insert into p values (1);
select * from p where a = 1;

The 2nd select crashes:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The crash occurs because the planner gets handed a stale copy of
partdesc_nodetached for the 2nd select.  It gets stale, because the
context it's allocated in gets made a child of rd_pdcxt, which is in
turn assigned the context of the regular partdesc when it is built for
the insert query.  Any child contexts of rd_pdcxt are deleted as soon
as the Relation's refcount goes to zero, taking partdesc_nodetached
with them.  Note it is this code in
RelationBuildPartitionDesc():

    /*
     * But first, a kluge: if there's an old rd_pdcxt, it contains an old
     * partition descriptor that may still be referenced somewhere.
     * Preserve it, while not leaking it, by reattaching it as a child
     * context of the new rd_pdcxt.  Eventually it will get dropped by
     * either RelationClose or RelationClearRelation.
     */
    if (rel->rd_pdcxt != NULL)
        MemoryContextSetParent(rel->rd_pdcxt, new_pdcxt);
    rel->rd_pdcxt = new_pdcxt;

I think we may need a separate context for partdesc_nodetached, likely
with the same kludges as rd_pdcxt.  Maybe the first problem will go
away with that as well.

Few other minor things I noticed:

+    * Store it into relcache.  For snapshots built excluding detached
+    * partitions, which we save separately, we also record the
+    * pg_inherits.xmin of the detached partition that was omitted; this
+    * informs a future potential user of such a cached snapshot.

The "snapshot" in the 1st and the last sentence should be "partdesc"?

+ * We keep two partdescs in relcache: rd_partdesc_nodetached excludes
+ * partitions marked concurrently being detached, while rd_partdesc includes
+ * them.

IMHO, describing rd_partdesc first in the sentence would be better.
Like: rd_partdesc includes all partitions including any that are being
concurrently detached, while rd_partdesc_nodetached excludes them.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-27, Amit Langote wrote:

> On Tue, Apr 27, 2021 at 4:34 PM Amit Langote <amitlangote09@gmail.com> wrote:

> I think we may need a separate context for partdesc_nodetached, likely
> with the same kludges as rd_pdcxt.  Maybe the first problem will go
> away with that as well.

Ooh, seems I completely misunderstood what RelationClose was doing.  I
thought it was deleting the whole rd_pdcxt, *including* the "current"
partdesc.  But that's not at all what it does: it only deletes the
*children* memcontexts, so the partdesc that is currently valid remains
valid.  I agree that your proposed fix appears to be promising, in that
a separate "context tree" rd_pddcxt (?) can be used for this.  I'll try
it out now.

> Few other minor things I noticed:
> 
> +    * Store it into relcache.  For snapshots built excluding detached
> +    * partitions, which we save separately, we also record the
> +    * pg_inherits.xmin of the detached partition that was omitted; this
> +    * informs a future potential user of such a cached snapshot.
> 
> The "snapshot" in the 1st and the last sentence should be "partdesc"?

Doh, yeah.

> + * We keep two partdescs in relcache: rd_partdesc_nodetached excludes
> + * partitions marked concurrently being detached, while rd_partdesc includes
> + * them.
> 
> IMHO, describing rd_partdesc first in the sentence would be better.
> Like: rd_partdesc includes all partitions including any that are being
> concurrently detached, while rd_partdesc_nodetached excludes them.

Makes sense.

-- 
Álvaro Herrera       Valdivia, Chile



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
This v3 handles things as you suggested and works correctly AFAICT.  I'm
going to add some more test cases to verify the behavior in the
scenarios you showed, and get them to run under cache-clobber options to
make sure it's good.

Thanks!

-- 
Álvaro Herrera       Valdivia, Chile

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-Apr-27, Alvaro Herrera wrote:

> This v3 handles things as you suggested and works correctly AFAICT.  I'm
> > going to add some more test cases to verify the behavior in the
> scenarios you showed, and get them to run under cache-clobber options to
> make sure it's good.

Yep, it seems to work.  Strangely, the new isolation case doesn't
actually crash before the fix -- it merely throws a memory allocation
error.

-- 
Álvaro Herrera       Valdivia, Chile
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Wed, Apr 28, 2021 at 8:32 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Apr-27, Alvaro Herrera wrote:
>
> > This v3 handles things as you suggested and works correctly AFAICT.  I'm
> > going to add some more tests cases to verify the behavior in the
> > scenarios you showed, and get them to run under cache-clobber options to
> > make sure it's good.
>
> Yep, it seems to work.  Strangely, the new isolation case doesn't
> actually crash before the fix -- it merely throws a memory allocation
> error.

Thanks.  Yeah, it does seem to work.

I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
value 0. While you seem to be already aware of that, because otherwise
you wouldn't have added TransactionIdIsValid(...) to the condition in
RelationGetPartitionDesc(), the comments nearby don't mention why such
a thing might happen.  Also, I guess it can't be helped that the
partdesc_nodetached will have to be leaked when the xmin is 0, but
that shouldn't be as problematic as the case we discussed earlier.

+   /*
+    * But first, a kluge: if there's an old context for this type of
+    * descriptor, it contains an old partition descriptor that may still be
+    * referenced somewhere.  Preserve it, while not leaking it, by
+    * reattaching it as a child context of the new one.  Eventually it will
+    * get dropped by either RelationClose or RelationClearRelation.
+    *
+    * We keep the regular partdesc in rd_pdcxt, and the partdesc-excluding-
+    * detached-partitions in rd_pddcxt.
+    */
+   context = is_omit ? &rel->rd_pddcxt : &rel->rd_pdcxt;
+   if (*context != NULL)
+       MemoryContextSetParent(*context, new_pdcxt);
+   *context = new_pdcxt;

Would it be a bit more readable to just duplicate this stanza in the
blocks that assign to rd_partdesc_nodetached and rd_partdesc,
respectively?  That's not much code to duplicate and it'd be easier to
see which context is for which partdesc.

+   TransactionId rd_partdesc_nodetached_xmin;  /* xmin for the above */

Could you please expand this description a bit?

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Álvaro Herrera
Date:
Thanks for re-reviewing! This one I hope is the last version.

On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote:
I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
value 0. While you seem to be already aware of that, because otherwise
you wouldn't have added TransactionIdIsValid(...) to the condition in
RelationGetPartitionDesc(), the comments nearby don't mention why such
a thing might happen.  Also, I guess it can't be helped that the
partdesc_nodetached will have to be leaked when the xmin is 0, but
that shouldn't be as problematic as the case we discussed earlier.

The only case I am aware of where that can happen is if the pg_inherits tuple is frozen. (That's exactly what the affected test case was testing, note the "VACUUM FREEZE pg_inherits" there).  So that test case blew up immediately; but I think the real-world chances that people are going to be doing that are pretty low, so I'm not really concerned about the leak.

Would it be a bit more readable to just duplicate this stanza in the
blocks that assign to rd_partdesc_nodetached and rd_partdesc,
respectively?  That's not much code to duplicate and it'd be easier to
see which context is for which partdesc.

Sure .. that's how I first wrote this code.  We don't use that style much, so I'm OK with backing out of it.

+   TransactionId rd_partdesc_nodetached_xmin;  /* xmin for the above */

Could you please expand this description a bit?

Done.
Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Álvaro Herrera
Date:
Pushed that now, with a one-line addition to the docs that only one
partition can be marked detached.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
(Thanks for committing the fix.)

On Thu, Apr 29, 2021 at 1:11 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On Wed, Apr 28, 2021, at 10:21 AM, Amit Langote wrote:
> I noticed that rd_partdesc_nodetached_xmin can sometimes end up with
> value 0. While you seem to be already aware of that, because otherwise
> you wouldn't have added TransactionIdIsValid(...) to the condition in
> RelationGetPartitionDesc(), the comments nearby don't mention why such
> a thing might happen.  Also, I guess it can't be helped that the
> partdesc_nodetached will have to be leaked when the xmin is 0, but
> that shouldn't be as problematic as the case we discussed earlier.
>
>
> The only case I am aware of where that can happen is if the pg_inherits tuple is frozen. (That's exactly what the
> affected test case was testing, note the "VACUUM FREEZE pg_inherits" there).  So that test case blew up immediately; but
> I think the real-world chances that people are going to be doing that are pretty low, so I'm not really concerned about
> the leak.

The case I was looking at is when a partition detach appears as
in-progress to a serializable transaction.  If the caller wants to
omit detached partitions, such a partition ends up in
rd_partdesc_nodetached, with the corresponding xmin being set to 0 due
to the way find_inheritance_children_extended() sets *detached_xmin.
The next query in the transaction that wants to omit detached
partitions, seeing rd_partdesc_nodetached_xmin being invalid, rebuilds
the partdesc, again including that partition because the snapshot
wouldn't have changed, and so on until the transaction ends.  Now,
this can perhaps be "fixed" by making
find_inheritance_children_extended() set the xmin outside the
snapshot-checking block, but maybe there's no need to address this on
priority.

Rather, a point that bothers me a bit is that we're including a
detached partition in the partdesc labeled "nodetached" in this
particular case.  Maybe we should avoid that by considering in this
scenario that no detached partitions exist for this transaction and
so initialize rd_partdesc, instead of rd_partdesc_nodetached.  That
will let us avoid the situations where the xmin is left in an invalid
state.  Maybe like the attached (it also fixes a couple of
typos/thinkos in the previous commit).

Note that we still end up in the same situation as before, where each
query in the serializable transaction that sees the detach as
in-progress has to rebuild the partition descriptor omitting the
detached partitions, even when it's clear that the detach-in-progress
partition will be included every time.
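
To illustrate the idea (only a simplified standalone model with invented
names, not the code in the attached patch): after a rebuild that was asked
to omit detached partitions, we would look at whether anything was
actually omitted before deciding which slot to store the result in:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;
#define InvalidTransactionId 0

typedef struct PartitionDescData { int nparts; } PartitionDescData;
typedef PartitionDescData *PartitionDesc;

typedef struct RelCacheEntry
{
    PartitionDesc   partdesc;                   /* includes detach-pending */
    PartitionDesc   partdesc_nodetached;        /* omits them */
    TransactionId   partdesc_nodetached_xmin;   /* detach's pg_inherits xmin */
} RelCacheEntry;

static void
store_built_partdesc(RelCacheEntry *rel, PartitionDesc newdesc,
                     bool omitted_any, TransactionId detached_xmin)
{
    if (omitted_any)
    {
        /* a detach-pending partition was really left out */
        rel->partdesc_nodetached = newdesc;
        rel->partdesc_nodetached_xmin = detached_xmin;
    }
    else
    {
        /*
         * Nothing was omitted after all (e.g. the detach still appears
         * in-progress to our snapshot), so this is the full descriptor.
         */
        rel->partdesc = newdesc;
    }
}

int
main(void)
{
    RelCacheEntry rel = {0};
    PartitionDescData d = {3};

    store_built_partdesc(&rel, &d, false, InvalidTransactionId);
    printf("stored as full descriptor: %d\n", rel.partdesc != NULL);
    return 0;
}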

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Pavel Luzanov
Date:
Hello,

I found this in the documentation, section '5.11.3. Partitioning Using Inheritance'[1]:
"Some operations require a stronger lock when using declarative partitioning than when using table inheritance. For example, removing a partition from a partitioned table requires taking an ACCESS EXCLUSIVE lock on the parent table, whereas a SHARE UPDATE EXCLUSIVE lock is enough in the case of regular inheritance."

This point is no longer valid
with some restrictions. If the table has a default partition, then removing a partition still requires taking an ACCESS EXCLUSIVE lock.

Maybe it makes sense to add some details about DETACH CONCURRENTLY to the section '5.11.2.2. Partition Maintenance' and completely remove this point?

1. https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Wed, May 5, 2021 at 7:59 PM Pavel Luzanov <p.luzanov@postgrespro.ru> wrote:
> I found this in the documentation, section '5.11.3. Partitioning Using Inheritance'[1]:
> "Some operations require a stronger lock when using declarative partitioning than when using table inheritance. For
example,removing a partition from a partitioned table requires taking an ACCESS EXCLUSIVE lock on the parent table,
whereasa SHARE UPDATE EXCLUSIVE lock is enough in the case of regular inheritance." 
>
> This point is no longer valid with some restrictions. If the table has a default partition, then removing a partition
stillrequires taking an ACCESS EXCLUSIVE lock. 
>
> May be make sense to add some details about DETACH CONCURRENTLY to the section '5.11.2.2. Partition Maintenance' and
completelyremove this point? 
>
> 1. https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE

That makes sense, thanks for noticing.

How about the attached?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Pavel Luzanov
Date:
On 06.05.2021 08:35, Amit Langote wrote:
> On Wed, May 5, 2021 at 7:59 PM Pavel Luzanov 
> <p.luzanov@postgrespro.ru> wrote:
>> I found this in the documentation, section '5.11.3. Partitioning 
>> Using Inheritance'[1]: "Some operations require a stronger lock when 
>> using declarative partitioning than when using table inheritance. For 
>> example, removing a partition from a partitioned table requires 
>> taking an ACCESS EXCLUSIVE lock on the parent table, whereas a SHARE 
>> UPDATE EXCLUSIVE lock is enough in the case of regular inheritance." 
>> This point is no longer valid with some restrictions. If the table 
>> has a default partition, then removing a partition still requires 
>> taking an ACCESS EXCLUSIVE lock. Maybe it makes sense to add some 
>> details about DETACH CONCURRENTLY to the section '5.11.2.2. Partition 
>> Maintenance' and completely remove this point? 1. 
>> https://www.postgresql.org/docs/devel/ddl-partitioning.html#DDL-PARTITIONING-USING-INHERITANCE
> That makes sense, thanks for noticing. How about the attached? 

I like it.
Especially the link to ALTER TABLE; this avoids duplication of all 
the nuances of the DETACH .. CONCURRENTLY.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Álvaro Herrera
Date:
On 2021-Apr-30, Amit Langote wrote:

> The case I was looking at is when a partition detach appears as
> in-progress to a serializable transaction.

Yeah, I was exceedingly sloppy on my reasoning about this, and you're
right that that's what actually happens rather than what I said.

> If the caller wants to omit detached partitions, such a partition ends
> up in rd_partdesc_nodetached, with the corresponding xmin being set to
> 0 due to the way find_inheritance_children_extended() sets
> *detached_xmin.  The next query in the transaction that wants to omit
> detached partitions, seeing rd_partdesc_nodetached_xmin being invalid,
> rebuilds the partdesc, again including that partition because the
> snapshot wouldn't have changed, and so on until the transaction ends.
> Now, this can perhaps be "fixed" by making
> find_inheritance_children_extended() set the xmin outside the
> snapshot-checking block, but maybe there's no need to address this on
> priority.

Hmm.  See below.

> Rather, a point that bothers me a bit is that we're including a
> detached partition in the partdesc labeled "nodetached" in this
> particular case.  Maybe we should avoid that by considering in this
> scenario that no detached partitions exist for this transaction and
> so initialize rd_partdesc, instead of rd_partdesc_nodetached.  That
> will let us avoid the situations where the xmin is left in an invalid
> state.  Maybe like the attached (it also fixes a couple of
> typos/thinkos in the previous commit).

Makes sense -- applied, thanks.

> Note that we still end up in the same situation as before, where each
> query in the serializable transaction that sees the detach as
> in-progress has to rebuild the partition descriptor omitting the
> detached partitions, even when it's clear that the detach-in-progress
> partition will be included every time.

Yeah, you're right that there is a performance hole in the case where a
partition pending detach exists and you're using repeatable read
transactions.  I didn't see it as terribly critical since it's supposed
to be very transient, but I may be wrong.

-- 
Álvaro Herrera       Valdivia, Chile
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Álvaro Herrera
Date:
On 2021-May-05, Pavel Luzanov wrote:

> Hello,
> 
> I found this in the documentation, section '5.11.3. Partitioning Using
> Inheritance'[1]:
> "Some operations require a stronger lock when using declarative partitioning
> than when using table inheritance. For example, removing a partition from a
> partitioned table requires taking an ACCESS EXCLUSIVE lock on the parent
> table, whereas a SHARE UPDATE EXCLUSIVE lock is enough in the case of
> regular inheritance."
> 
> This point is no longer valid with some restrictions. If the table has a
> default partition, then removing a partition still requires taking an ACCESS
> EXCLUSIVE lock.

Hmm, are there any other operations for which the partitioning command
takes a stronger lock than the corresponding legacy inheritance command?
If there aren't any, then it's okay to delete this paragraph as in the
proposed patch.  But if there are any, then I think we should change the
example to mention that other operation.  I'm not sure what's a good way
to verify that, though.

Also, it remains true that without CONCURRENTLY the DETACH operation
takes AEL.  I'm not sure it's worth pointing this out in this paragraph.
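
Just to spell out the difference for anyone reading along (table names are
made up; the lock levels are the documented ones for the two forms):

  -- Plain DETACH: takes ACCESS EXCLUSIVE on the parent, may run inside a
  -- transaction block, and also works when a default partition exists.
  BEGIN;
  ALTER TABLE parted DETACH PARTITION parted_1;
  ROLLBACK;

  -- DETACH CONCURRENTLY: only SHARE UPDATE EXCLUSIVE on the parent during
  -- the first phase, but it cannot run in a transaction block and is
  -- rejected if the partitioned table has a default partition.
  ALTER TABLE parted DETACH PARTITION parted_1 CONCURRENTLY;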

> Maybe it makes sense to add some details about DETACH CONCURRENTLY to the
> section '5.11.2.2. Partition Maintenance' and completely remove this point?

Maybe you're right, though.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"Las mujeres son como hondas:  mientras más resistencia tienen,
 más lejos puedes llegar con ellas"  (Jonas Nightingale, Leap of Faith)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Álvaro Herrera
Date:
On 2021-May-06, Amit Langote wrote:

> That makes sense, thanks for noticing.
> 
> How about the attached?

I tweaked the linkage; as submitted, the text in the link contained what
is in the <term> tag, so literally it had:

  ... see DETACH PARTITION partition_name [ CONCURRENTLY | FINALIZE ] for
  details ...

which didn't look very nice.  So I made it use <link> instead of xref
and wrote the "ALTER TABLE .. DETACH PARTITION" text.  I first tried to
fix it by adding an "xreflabel" attribute, but I didn't like it because the
text was not set in a fixed-width font.

I also tweaked the wording to match the surrounding text a bit better,
at least IMO.  Feel free to suggest improvements.

Thanks!

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"That sort of implies that there are Emacs keystrokes which aren't obscure.
I've been using it daily for 2 years now and have yet to discover any key
sequence which makes any sense."                        (Paul Thomas)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Fri, May 7, 2021 at 2:13 AM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Apr-30, Amit Langote wrote:
>
> > The case I was looking at is when a partition detach appears as
> > in-progress to a serializable transaction.
>
> Yeah, I was exceedingly sloppy on my reasoning about this, and you're
> right that that's what actually happens rather than what I said.
>
> > If the caller wants to omit detached partitions, such a partition ends
> > up in rd_partdesc_nodetached, with the corresponding xmin being set to
> > 0 due to the way find_inheritance_children_extended() sets
> > *detached_xmin.  The next query in the transaction that wants to omit
> > detached partitions, seeing rd_partdesc_nodetached_xmin being invalid,
> > rebuilds the partdesc, again including that partition because the
> > snapshot wouldn't have changed, and so on until the transaction ends.
> > Now, this can perhaps be "fixed" by making
> > find_inheritance_children_extended() set the xmin outside the
> > snapshot-checking block, but maybe there's no need to address this on
> > priority.
>
> Hmm.  See below.
>
> > Rather, a point that bothers me a bit is that we're including a
> > detached partition in the partdesc labeled "nodetached" in this
> > particular case.  Maybe we should avoid that by considering in this
> > scenario that no detached partitions exist for this transaction and
> > so initialize rd_partdesc, instead of rd_partdesc_nodetached.  That
> > will let us avoid the situations where the xmin is left in invalid
> > state.  Maybe like the attached (it also fixes a couple of
> > typos/thinkos in the previous commit).
>
> Makes sense -- applied, thanks.

Thank you.

> > Note that we still end up in the same situation as before where each
> > query in the serializable transaction that sees the detach as
> > in-progress has to rebuild the partition descriptor omitting the
> > detached partitions, even when it's clear that the detach-in-progress
> > partition will be included every time.
>
> Yeah, you're right that there is a performance hole in the case where a
> partition pending detach exists and you're using repeatable read
> transactions.  I didn't see it as terribly critical since it's supposed
> to be very transient, but I may be wrong.

Yeah, I'd hope so too.  I think RR transactions would have to be
concurrent with an interrupted DETACH CONCURRENTLY to suffer the
performance hit and that does kind of make this a rarely occurring
case.
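
For reference, when a concurrent detach is interrupted like that, the
partition stays in the detach-pending state until someone completes it;
a sketch (names made up):

  -- complete a previously interrupted concurrent detach
  ALTER TABLE parted DETACH PARTITION parted_1 FINALIZE;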

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Noah Misch
Date:
On Wed, Apr 21, 2021 at 04:38:55PM -0400, Alvaro Herrera wrote:

[fix to let CLOBBER_CACHE_ALWAYS pass]

> Barring objections, I will get this pushed early tomorrow.

prairiedog and wrasse failed a $SUBJECT test after this (commit 8aba932).
Also, some non-CLOBBER_CACHE_ALWAYS animals failed a test before the fix.
These IsolationCheck failures match detach-partition-concurrently[^\n]*FAILED:

  sysname   │      snapshot       │ branch │ bfurl
────────────┼─────────────────────┼────────┼────────────────────────────────────────────────────────────────────────────────────────────────
 hyrax      │ 2021-03-27 07:29:34 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-03-27%2007%3A29%3A34
 topminnow  │ 2021-03-28 20:37:38 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-03-28%2020%3A37%3A38
 trilobite  │ 2021-03-29 18:14:24 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-03-29%2018%3A14%3A24
 hyrax      │ 2021-04-01 07:21:10 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-01%2007%3A21%3A10
 dragonet   │ 2021-04-01 19:48:03 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2021-04-01%2019%3A48%3A03
 avocet     │ 2021-04-05 15:45:56 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2021-04-05%2015%3A45%3A56
 hyrax      │ 2021-04-06 07:15:16 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-06%2007%3A15%3A16
 hyrax      │ 2021-04-11 07:25:50 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-11%2007%3A25%3A50
 hyrax      │ 2021-04-20 18:25:37 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-20%2018%3A25%3A37
 wrasse     │ 2021-04-21 10:38:32 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-21%2010%3A38%3A32
 prairiedog │ 2021-04-25 22:05:48 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-04-25%2022%3A05%3A48
 wrasse     │ 2021-05-11 03:43:40 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-05-11%2003%3A43%3A40
(12 rows)



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Amit Langote
Date:
On Mon, May 24, 2021 at 6:07 PM Noah Misch <noah@leadboat.com> wrote:
> On Wed, Apr 21, 2021 at 04:38:55PM -0400, Alvaro Herrera wrote:
>
> [fix to let CLOBBER_CACHE_ALWAYS pass]
>
> > Barring objections, I will get this pushed early tomorrow.
>
> prairiedog and wrasse failed a $SUBJECT test after this (commit 8aba932).
> Also, some non-CLOBBER_CACHE_ALWAYS animals failed a test before the fix.
> These IsolationCheck failures match detach-partition-concurrently[^\n]*FAILED:

FWIW, all 4 detach-partition-concurrently suites passed for me on a
build of the latest HEAD, with CPPFLAGS = -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE -DCLOBBER_CACHE_ALWAYS -D_GNU_SOURCE

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From
Alvaro Herrera
Date:
On 2021-May-24, Noah Misch wrote:

> prairiedog and wrasse failed a $SUBJECT test after this (commit 8aba932).
> Also, some non-CLOBBER_CACHE_ALWAYS animals failed a test before the fix.
> These IsolationCheck failures match detach-partition-concurrently[^\n]*FAILED:
> 
>   sysname   │      snapshot       │ branch │ bfurl
> ────────────┼─────────────────────┼────────┼──────────

Checking this list, these three failures can be explained by the
detach-partition-concurrently-3 test that was just patched.

>  wrasse     │ 2021-04-21 10:38:32 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-21%2010%3A38%3A32
>  prairiedog │ 2021-04-25 22:05:48 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-04-25%2022%3A05%3A48
>  wrasse     │ 2021-05-11 03:43:40 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-05-11%2003%3A43%3A40

Next there's a bunch whose error message is the same one we had seen
earlier in this thread; these animals are all CLOBBER_CACHE_ALWAYS:

 step s1insert: insert into d4_fk values (1);
 +ERROR:  insert or update on table "d4_fk" violates foreign key constraint "d4_fk_a_fkey"

>  hyrax      │ 2021-03-27 07:29:34 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-03-27%2007%3A29%3A34
>  trilobite  │ 2021-03-29 18:14:24 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-03-29%2018%3A14%3A24
>  hyrax      │ 2021-04-01 07:21:10 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-01%2007%3A21%3A10
>  avocet     │ 2021-04-05 15:45:56 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=avocet&dt=2021-04-05%2015%3A45%3A56
>  hyrax      │ 2021-04-06 07:15:16 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-06%2007%3A15%3A16
>  hyrax      │ 2021-04-11 07:25:50 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-11%2007%3A25%3A50
>  hyrax      │ 2021-04-20 18:25:37 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-04-20%2018%3A25%3A37

This is fine, because the fix commit 8aba932 is dated April 22 and these
failures all predate that.


And finally there are these two:

>  topminnow  │ 2021-03-28 20:37:38 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-03-28%2020%3A37%3A38
>  dragonet   │ 2021-04-01 19:48:03 │ HEAD   │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2021-04-01%2019%3A48%3A03

(neither of them a CCA animal), which are exposing the same problem in
detach-partition-concurrently-4 that we just fixed in
detach-partition-concurrently-3, so we should apply the same fix: add a
no-op step right after the cancel to prevent the error report from
changing.  I'll go do that after grabbing some coffee.

Thanks for digging into the reports!

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)