Thread: BUG #18371: Wrong constraint left behind when detaching a hash partition concurrently

The following bug has been logged on the website:

Bug reference:      18371
Logged by:          Fei Changhong
Email address:      feichanghong@qq.com
PostgreSQL version: 16.2
Operating system:   CentOS 7, kernel version 5.10.13
Description:

Hi all,

When I tried to insert data into a table that had just been detached, I
encountered an error: "ERROR: could not open relation with OID 16384".

The environment information is as follows:
PostgreSQL branch: master.
commit: bcdfa5f2e2f274caeed20b2f986012a9cb6a259c

This issue can be reproduced with the following steps:
```
create table t(a int) partition by hash (a);
create table t_p1 partition of t for values with (modulus 2, remainder 0);
alter table t detach partition t_p1 concurrently ;
drop table t;
insert into t_p1 select 1;
```
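The last step then fails like this (the OID in the message is the dropped
parent's OID and will differ between runs):
```
postgres=# insert into t_p1 select 1;
ERROR:  could not open relation with OID 16384
```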

When detaching a partition concurrently, the function DetachAddConstraintIfNeeded
is called to convert the partition constraint of the detached partition into a
regular constraint, and it is not dropped at the end of the detach process.
This can work normally on range partitions. However, the constraint on hash
partitions uses satisfies_hash_partition with the OID of the parent table, and
the newly created constraint does not take effect. For example, in the following
case, although there is a t_p1_a_check constraint on t_p1, it is still possible
to perform an insert:
```
create table t(a int) partition by hash (a);
create table t_p1 partition of t for values with (modulus 2, remainder 0);
alter table t detach partition t_p1 concurrently ;
postgres=# \d+ t_p1
                                          Table "public.t_p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           |          |         | plain   |             |              | 
Check constraints:
    "t_p1_a_check" CHECK (satisfies_hash_partition('16384'::oid, 2, 0, a))
Access method: heap
postgres=# insert into t_p1 select 1;
INSERT 0 1
postgres=#
```
If the parent table has already been dropped, an error will occur during
constraint checking.
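
For comparison, the constraint left behind on a range partition does not
reference the parent's OID at all, so it keeps working even after the parent is
dropped (a rough sketch; the exact constraint text may differ slightly):
```
create table r(a int) partition by range (a);
create table r_p1 partition of r for values from (0) to (10);
alter table r detach partition r_p1 concurrently;
-- \d r_p1 now shows something like:
--     Check constraints:
--         "r_p1_a_check" CHECK (a IS NOT NULL AND a >= 0 AND a < 10)
drop table r;
insert into r_p1 select 1;  -- works: the check does not depend on r's OID
```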

Based on the analysis above, should the added constraint for a hash partition
be dropped after detachment?
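
For what it's worth, the immediate workaround is simply to drop the leftover
constraint by hand (a sketch, continuing from the failing case above; the
constraint name comes from the \d+ output):
```
alter table t_p1 drop constraint t_p1_a_check;
insert into t_p1 select 1;  -- now succeeds even though the old parent is gone
```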



> This can work normally on range partitions. However, the constraint on hash
> partitions uses satisfies_hash_partition with the OID of the parent table, and
> the newly created constraint does not take effect. For example, in the following
> case, although there is a t_p1_a_check constraint on t_p1, it is still possible
> to perform an insert:
What I said here is wrong: the constraints on the hash partition will also take
effect. But this constraint depends on the OID of the parent partition.

> Based on the analysis above, should the added constraint for a hash partition
> be dropped after detachment?
I have initially implemented this logic in the attached patch and added a
testcase. I really hope that developers can give me some suggestions.


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
 
Attachment
On Thu, Feb 29, 2024 at 02:21:58PM +0800, feichanghong wrote:
> What I said here is wrong: the constraints on the hash partition will also take
> effect. But this constraint depends on the OID of the parent partition.

Something is weird with the format of the messages you have sent, by
the way.

> > Based on the analysis above, should the added constraint for a hash partition
> > be dropped after detachment?
>
> I have initially implemented this logic in the attached patch and added a
> testcase. I really hope that developers can give me some suggestions.

I am not much of a fan of relying on ATExecDropConstraint(), where the
logic introduced is a copy-paste of get_constraint_name() to retrieve
the name of the constraint to drop.  If anything, we ought to rely more on
dropconstraint_internal() instead, using the pg_constraint tuple based
on the OID of the constraint, and not do a cross-operation with the
constraint name.

But actually, why drop the constraint at all?
DetachAddConstraintIfNeeded() re-adds the constraint as it is, so
instead of re-adding it as-is and then dropping it again, wouldn't it
be better to just *not* re-add it to begin with if we don't need it at
all?
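
For reference, with that approach the detached partition would simply come out
with no check constraint at all, e.g. (hypothetical psql output if the
constraint is never added):
```
postgres=# \d t_p1
                Table "public.t_p1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
```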

This reproduces down to 14, for what I am assuming is an issue
introduced around 7b357cc6ae55.  Alvaro, what do you think?
--
Michael

Attachment
On 2024-Mar-05, Michael Paquier wrote:

> This reproduces down to 14, for what I am assuming is an issue
> introduced around 7b357cc6ae55.  Alvaro, what do you think?

Yeah, I have this on my list of things to fix this month.  I haven't
looked at it closely, but I will soon.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los cuentos de hadas no dan al niño su primera idea sobre los monstruos.
Lo que le dan es su primera idea de la posible derrota del monstruo."
                                                   (G. K. Chesterton)



On 2024/3/5, 11:13, "Michael Paquier" <michael@paquier.xyz> wrote:
    Something is weird with the format of the messages you have sent, by
    the way. 

Ah ha! This must be a problem with my email client; I will try another one.

    I am not much of a fan of relying on ATExecDropConstraint(), where the
    logic introduced is a copy-paste of get_constraint_name() to retrieve
    the name of the constraint to drop.  If anything, we ought to rely more on
    dropconstraint_internal() instead, using the pg_constraint tuple based
    on the OID of the constraint, and not do a cross-operation with the
    constraint name.

    But actually, why drop the constraint at all?
    DetachAddConstraintIfNeeded() re-adds the constraint as it is, so
    instead of re-adding it as-is and then dropping it again, wouldn't it
    be better to just *not* re-add it to begin with if we don't need it at
    all?

After verification, the partition constraint on the partition being detached
remains in effect until the second transaction commits. The newly added
constraint seems to exist only to speed up a later re-attach operation.
Perhaps I've missed something?

So, for hash partitioning, it would be better not to re-add the constraint.
Perhaps we should filter out hash partitions in DetachAddConstraintIfNeeded().

--
Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.
 

On Tue, Mar 05, 2024 at 08:37:27AM +0100, Alvaro Herrera wrote:
> Yeah, I have this on my list of things to fix this month.  I haven't
> looked at it closely, but I will soon.

Cool, thanks!
--
Michael

Attachment
On 2024-Feb-29, feichanghong wrote:

> > This can work normally on range partitions. However, the constraint on hash
> > partitions uses satisfies_hash_partition with the OID of the parent table, and
> > the newly created constraint does not take effect. For example, in the following
> > case, although there is a t_p1_a_check constraint on t_p1, it is still possible
> > to perform an insert:
> What I said here is wrong: the constraints on the hash partition will also take
> effect. But this constraint depends on the OID of the parent partition.

We should definitely not have this constraint on hash-partition tables
after the detach.  However, I wonder if instead of adding it and later
removing it as you propose, it wouldn't be better to just not add it in
the first place.  As a first step, I tried commenting it out and found that
no interesting test fails (only alter_table.sql fails, and only because
the constraint is not there when the test looks for it specifically).

The current code does not explain *why* we have to add this constraint,
and I had forgotten, so I went to look at the first patch submission in
that thread [1] and saw this comment there:

+   /*
+    * Concurrent mode has to work harder; first we add a new constraint to the
+    * partition that matches the partition constraint.  The reason for this is
+    * that the planner may have made optimizations that depend on the
+    * constraint.  XXX Isn't it sufficient to invalidate the partition's
+    * relcache entry?


I'm trying to figure out whether it's possible that the planner would
make optimizations based on the hashing function.  Quite possibly it
won't.  If that's so, then we should just not make the constraint at
all, which would make the fix even simpler.  I also wonder, maybe that
XXX comment (which I removed before committing the patch) is right and
we don't actually need the constraint with _any_ partition strategy, not
just hash.
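
One concrete way a plain CHECK constraint can affect a plan is constraint
exclusion, which for an ordinary table only kicks in with
constraint_exclusion = on; a rough sketch (plan shape may vary):
```
create table standalone_p1 (a int, check (a >= 0 and a < 10));
set constraint_exclusion = on;
explain (costs off) select * from standalone_p1 where a = 100;
-- The WHERE clause contradicts the CHECK constraint, so the scan is skipped:
--  Result
--    One-Time Filter: false
```
A satisfies_hash_partition(...) call is opaque to the predicate prover except
as an exact match, so it seems unlikely to help in a case like this.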


[1] https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Hi Alvaro,

On Jul 16, 2024, at 03:35, alvherre <alvherre@alvh.no-ip.org> wrote:

> We should definitely not have this constraint on hash-partition tables
> after the detach.  However, I wonder if instead of adding it and later
> removing it as you propose, it wouldn't be better to just not add it in
> the first place.  As a first step, I tried commenting it out and found that
> no interesting test fails (only alter_table.sql fails, and only because
> the constraint is not there when the test looks for it specifically).
> 
> The current code does not explain *why* we have to add this constraint,
> and I had forgotten, so I went to look at the first patch submission in
> that thread [1] and saw this comment there:
> 
> +   /*
> +    * Concurrent mode has to work harder; first we add a new constraint to the
> +    * partition that matches the partition constraint.  The reason for this is
> +    * that the planner may have made optimizations that depend on the
> +    * constraint.  XXX Isn't it sufficient to invalidate the partition's
> +    * relcache entry?
> 
> I'm trying to figure out whether it's possible that the planner would
> make optimizations based on the hashing function.  Quite possibly it
> won't.  If that's so, then we should just not make the constraint at
> all, which would make the fix even simpler.  I also wonder, maybe that
> XXX comment (which I removed before committing the patch) is right and
> we don't actually need the constraint with _any_ partition strategy, not
> just hash.

I agree with your point of view. There is no need to add an extra partition
constraint for "DETACH PARTITION CONCURRENTLY". In the first transaction, we
merely set inhdetachpending to true without actually dropping the existing
partition constraint.
At the start of the second transaction, we wait for all queries that were
planned with the partition to finish. Then, we acquire an AccessExclusiveLock
on the sub-partition and drop the original partition constraint. Most likely,
the newly added constraint is useless for the optimizer.

The only useful scenario I can think of is that, if we were to attach this
partition back to the parent table using the same strategy, it might reduce
the partition constraint checks.
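
Concretely, that scenario is something like the following sketch (continuing
from the earlier example, with t still around): since the leftover check
constraint implies the partition constraint, ATTACH PARTITION should be able
to skip the full validation scan of t_p1.
```
-- Re-attach with the same modulus/remainder; the existing t_p1_a_check
-- constraint already implies the partition constraint, so the validation scan
-- of t_p1 should be skipped.
alter table t attach partition t_p1 for values with (modulus 2, remainder 0);
```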

Best Regards,
Fei Changhong



alvherre <alvherre@alvh.no-ip.org> wrote on Tue, Jul 16, 2024 at 03:35:
> On 2024-Feb-29, feichanghong wrote:
> 
> > > This can work normally on range partitions. However, the constraint on hash
> > > partitions uses satisfies_hash_partition with the OID of the parent table, and
> > > the newly created constraint does not take effect. For example, in the following
> > > case, although there is a t_p1_a_check constraint on t_p1, it is still possible
> > > to perform an insert:
> > What I said here is wrong: the constraints on the hash partition will also take
> > effect. But this constraint depends on the OID of the parent partition.
> 
> We should definitely not have this constraint on hash-partition tables
> after the detach.  However, I wonder if instead of adding it and later
> removing it as you propose, it wouldn't be better to just not add it in
> the first place.  As a first step, I tried commenting it out and found that
> no interesting test fails (only alter_table.sql fails, and only because
> the constraint is not there when the test looks for it specifically).
> 
> The current code does not explain *why* we have to add this constraint,
> and I had forgotten, so I went to look at the first patch submission in
> that thread [1] and saw this comment there:
> 
> +   /*
> +    * Concurrent mode has to work harder; first we add a new constraint to the
> +    * partition that matches the partition constraint.  The reason for this is
> +    * that the planner may have made optimizations that depend on the
> +    * constraint.  XXX Isn't it sufficient to invalidate the partition's
> +    * relcache entry?
> 
> I'm trying to figure out whether it's possible that the planner would
> make optimizations based on the hashing function.  Quite possibly it
> won't.  If that's so, then we should just not make the constraint at
> all, which would make the fix even simpler.  I also wonder, maybe that
> XXX comment (which I removed before committing the patch) is right and
> we don't actually need the constraint with _any_ partition strategy, not
> just hash.

The planner does partition pruning based on partbound. If the new constraint
is not added when detaching concurrently, can we make sure that queries see
consistent results?
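
For reference, the pruning I mean looks like this (a sketch; which partition
survives depends on the hashed value):
```
create table ht(a int) partition by hash (a);
create table ht_p0 partition of ht for values with (modulus 2, remainder 0);
create table ht_p1 partition of ht for values with (modulus 2, remainder 1);
explain (costs off) select * from ht where a = 1;
-- Only one of ht_p0/ht_p1 shows up in the plan; the other is pruned using the
-- partition bound (partbound), not a check constraint.
```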


--
Tender Wang


alvherre <alvherre@alvh.no-ip.org> wrote on Tue, Jul 16, 2024 at 03:35:
> On 2024-Feb-29, feichanghong wrote:
> 
> > > This can work normally on range partitions. However, the constraint on hash
> > > partitions uses satisfies_hash_partition with the OID of the parent table, and
> > > the newly created constraint does not take effect. For example, in the following
> > > case, although there is a t_p1_a_check constraint on t_p1, it is still possible
> > > to perform an insert:
> > What I said here is wrong: the constraints on the hash partition will also take
> > effect. But this constraint depends on the OID of the parent partition.
> 
> We should definitely not have this constraint on hash-partition tables
> after the detach.  However, I wonder if instead of adding it and later
> removing it as you propose, it wouldn't be better to just not add it in
> the first place.  As a first step, I tried commenting it out and found that
> no interesting test fails (only alter_table.sql fails, and only because
> the constraint is not there when the test looks for it specifically).
> 
> The current code does not explain *why* we have to add this constraint,
> and I had forgotten, so I went to look at the first patch submission in
> that thread [1] and saw this comment there:
> 
> +   /*
> +    * Concurrent mode has to work harder; first we add a new constraint to the
> +    * partition that matches the partition constraint.  The reason for this is
> +    * that the planner may have made optimizations that depend on the
> +    * constraint.  XXX Isn't it sufficient to invalidate the partition's
> +    * relcache entry?
> 
> I'm trying to figure out whether it's possible that the planner would
> make optimizations based on the hashing function.  Quite possibly it
> won't.  If that's so, then we should just not make the constraint at
> all, which would make the fix even simpler.  I also wonder, maybe that
> XXX comment (which I removed before committing the patch) is right and
> we don't actually need the constraint with _any_ partition strategy, not
> just hash.

I picked up this issue again. I skimmed through the code and I agree with your
point: we don't actually need the constraint with any partition strategy, not
just hash.

If old queries are still running when we do the detach, the detach session is
stopped at WaitForLockersMultiple().  Other sessions can't insert tuples that
violate the partition constraint, because the partition being detached still
has its partition constraint.
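
A rough multi-session sketch of that sequence (session labels and the inserted
value are illustrative only):
```
-- Session 1: an old transaction that used t before the detach started
begin;
select * from t;

-- Session 2: the detach; its first transaction commits, then it blocks in
-- WaitForLockersMultiple() until session 1's transaction is gone
alter table t detach partition t_p1 concurrently;

-- Session 3: a direct insert into t_p1 is still checked against the partition
-- constraint while the detach is pending, so a value that does not hash to
-- remainder 0 is rejected
insert into t_p1 select 1;

-- Session 1: once this commits, session 2's detach completes
commit;
```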

--
Thanks,
Tender Wang