Thread: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15672
Logged by:          Jianing Yang
Email address:      jianingy.yang@gmail.com
PostgreSQL version: 11.2
Operating system:   Ubuntu 18.10
Description:

Reproduce steps:

1. create a partition table with the following constraints
    a. with a unique key on partition key and a varchar type field
    b. using hash partition
2. alter the  length of the varchar type field
3. drop the partition table in a transaction
4. crash

Screenshot:
→ pgcli -h localhost -p 5555 -U postgres -W
Server: PostgreSQL 11.2
Version: 2.0.1
Chat: https://gitter.im/dbcli/pgcli
Mail: https://groups.google.com/forum/#!forum/pgcli
Home: http://pgcli.com

postgres@localhost:postgres> show server_version;
                               
+-------------------------------+
| server_version                |
|-------------------------------|
| 11.2 (Debian 11.2-1.pgdg90+1) |
+-------------------------------+
SHOW
Time: 0.011s
postgres@localhost:postgres> create table users(user_id int, name varchar
(64), unique (user_id, name)) partition by hash(user_id);
                                                                
CREATE TABLE
Time: 0.004s
postgres@localhost:postgres> create table users_000 partition of users for
values with (modulus 2, remainder 0);
                                                               
CREATE TABLE
Time: 0.012s
postgres@localhost:postgres> create table users_001 partition of users for
values with (modulus 2, remainder 1);
                                                               
CREATE TABLE
Time: 0.012s
postgres@localhost:postgres> alter table users alter column name type
varchar(127);                         
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
ALTER TABLE
Time: 0.007s
postgres@localhost:postgres> \d users;
                               
+----------+------------------------+-------------+
| Column   | Type                   | Modifiers   |
|----------+------------------------+-------------|
| user_id  | integer                |             |
| name     | character varying(127) |             |
+----------+------------------------+-------------+
Indexes:
    "users_user_id_name_key" UNIQUE CONSTRAINT, btree (user_id, name)
Partition key: HASH (user_id)
Number of partitions 2: (Use \d+ to list them.)

Time: 0.012s
postgres@localhost:postgres> begin;
                               
BEGIN
Time: 0.001s
postgres@localhost:postgres> drop table users;
                               
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
DROP TABLE
Time: 0.002s
postgres@localhost:postgres> commit;   
Connection reset. Reconnect (Y/n):


Server Log:

2019-03-06 14:59:26.101 UTC [61] ERROR:  SMgrRelation hashtable corrupted
2019-03-06 14:59:26.101 UTC [61] STATEMENT:  commit
2019-03-06 14:59:26.101 UTC [61] WARNING:  AbortTransaction while in COMMIT
state
2019-03-06 14:59:26.101 UTC [61] PANIC:  cannot abort transaction 573, it
was already committed
2019-03-06 14:59:26.178 UTC [1] LOG:  server process (PID 61) was terminated
by signal 6: Aborted
2019-03-06 14:59:26.178 UTC [1] DETAIL:  Failed process was running:
commit
2019-03-06 14:59:26.178 UTC [1] LOG:  terminating any other active server
processes
2019-03-06 14:59:26.178 UTC [58] WARNING:  terminating connection because of
crash of another server process
2019-03-06 14:59:26.178 UTC [58] DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2019-03-06 14:59:26.178 UTC [58] HINT:  In a moment you should be able to
reconnect to the database and repeat your command.
2019-03-06 14:59:26.179 UTC [1] LOG:  all server processes terminated;
reinitializing
2019-03-06 14:59:26.186 UTC [68] LOG:  database system was interrupted; last
known up at 2019-03-06 14:58:30 UTC
2019-03-06 14:59:26.212 UTC [68] LOG:  database system was not properly shut
down; automatic recovery in progress
2019-03-06 14:59:26.214 UTC [68] LOG:  redo starts at 0/1650278
2019-03-06 14:59:26.215 UTC [68] FATAL:  SMgrRelation hashtable corrupted
2019-03-06 14:59:26.215 UTC [68] CONTEXT:  WAL redo at 0/16768C8 for
Transaction/COMMIT: 2019-03-06 14:59:26.099739+00; rels: base/13067/16389
base/13067/16387 base/13067/16394 base/13067/16387; inval msgs: catcache 41
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 50 catcache 49 catcache 19 catcache 32
catcache 7 catcache 6 catcache 7 catcache 6 catcache 50 catcache 49 catcache
74 catcache 73 catcache 74 catcache 73 catcache 7 catcache 6 catcache 7
catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
50 catcache 49 catcache 19 catcache 32 catcache 7 catcache 6 catcache 7
catcache 6 catcache 50 catcache 49 catcache 74 catcache 73 catcache 74
catcache 73 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache
6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache
7 catcache 6 catcache 7 catcache 6 catcache 50 catcache 49 catcache 19
catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 50 catcache
49 catcache 74 catcache 73 catcache 74 catcache 73 relcache 16384 snapshot
2608 snapshot 2608 relcache 16399 relcache 16384 snapshot 2608 snapshot 2608
snapshot 2608 relcache 16389 relcache 16384 snapshot 2608 snapshot 2608
relcache 16401 relcache 16389 snapshot 2608 snapshot 2608 snapshot 2608
relcache 16394 relcache 16384 snapshot 2608 snapshot 2608 relcache 16403
relcache 16394 snapshot 2608 snapshot 2608 snapshot 2608
2019-03-06 14:59:26.216 UTC [1] LOG:  startup process (PID 68) exited with
exit code 1
2019-03-06 14:59:26.216 UTC [1] LOG:  aborting startup due to startup
process failure
2019-03-06 14:59:26.217 UTC [1] LOG:  database system is shut down


Affected Server Version:

11.1
11.2


Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Michael Paquier
Date:
On Wed, Mar 06, 2019 at 03:06:53PM +0000, PG Bug reporting form wrote:
> 1. create a partition table with the following constraints
>     a. with a unique key on partition key and a varchar type field
>     b. using hash partition
> 2. alter the  length of the varchar type field
> 3. drop the partition table in a transaction
> 4. crash

I can reproduce the failure easily, not on HEAD but with
REL_11_STABLE:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f585729b535 in __GI_abort () at abort.c:79
#2  0x000055eef597e60a in errfinish (dummy=0) at elog.c:555
#3  0x000055eef5980c50 in elog_finish (elevel=22, fmt=0x55eef5a41408
"cannot abort transaction %u, it was already committed") at
elog.c:1376
#4  0x000055eef5479647 in RecordTransactionAbort (isSubXact=false) at
xact.c:1580
#5  0x000055eef547a6c0 in AbortTransaction () at xact.c:2602
#6  0x000055eef547aef4 in AbortCurrentTransaction () at xact.c:3104

That's worth an investigation, SMgrRelationHash is getting messed up
which causes the transaction commit to fail where it should not.
--
Michael

Attachment

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
Hi,

On 2019/03/07 9:03, Michael Paquier wrote:
> On Wed, Mar 06, 2019 at 03:06:53PM +0000, PG Bug reporting form wrote:
>> 1. create a partition table with the following constraints
>>     a. with a unique key on partition key and a varchar type field
>>     b. using hash partition
>> 2. alter the  length of the varchar type field
>> 3. drop the partition table in a transaction
>> 4. crash
> 
> I can reproduce the failure easily, not on HEAD but with
> REL_11_STABLE:

Same here.  I could reproduce it with 11.0.

> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007f585729b535 in __GI_abort () at abort.c:79
> #2  0x000055eef597e60a in errfinish (dummy=0) at elog.c:555
> #3  0x000055eef5980c50 in elog_finish (elevel=22, fmt=0x55eef5a41408
> "cannot abort transaction %u, it was already committed") at
> elog.c:1376
> #4  0x000055eef5479647 in RecordTransactionAbort (isSubXact=false) at
> xact.c:1580
> #5  0x000055eef547a6c0 in AbortTransaction () at xact.c:2602
> #6  0x000055eef547aef4 in AbortCurrentTransaction () at xact.c:3104
> 
> That's worth an investigation, SMgrRelationHash is getting messed up
> which causes the transaction commit to fail where it should not.

Looking at what was causing the SMgrRelationHash corruption, it seems
there were entries with same/duplicated relnode value in pendingDeletes list.

The problem start when ALTER TABLE users ALTER COLUMN is executed.
create table users(user_id int, name varchar(64), unique (user_id, name))
partition by list(user_id);

create table users_000 partition of users for values in(0);
create table users_001 partition of users for values in(1);
select relname, relfilenode from pg_class where relname like 'users%';
          relname           │ relfilenode
────────────────────────────┼─────────────
 users                      │       16441
 users_000                  │       16446
 users_000_user_id_name_key │       16449
 users_001                  │       16451
 users_001_user_id_name_key │       16454
 users_user_id_name_key     │       16444
(6 rows)

alter table users alter column name type varchar(127);
select relname, relfilenode from pg_class where relname like 'users%';
          relname           │ relfilenode
────────────────────────────┼─────────────
 users                      │       16441
 users_000                  │       16446
 users_000_user_id_name_key │       16444    <===  duplicated
 users_001                  │       16451
 users_001_user_id_name_key │       16444    <===  duplicated
 users_user_id_name_key     │       16444    <===  duplicated
(6 rows)

Ran out off time...

Thanks,
Amit



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> The problem start when ALTER TABLE users ALTER COLUMN is executed.
> create table users(user_id int, name varchar(64), unique (user_id, name))
> partition by list(user_id);
>
> create table users_000 partition of users for values in(0);
> create table users_001 partition of users for values in(1);
> select relname, relfilenode from pg_class where relname like 'users%';
>           relname           │ relfilenode
> ────────────────────────────┼─────────────
>  users                      │       16441
>  users_000                  │       16446
>  users_000_user_id_name_key │       16449
>  users_001                  │       16451
>  users_001_user_id_name_key │       16454
>  users_user_id_name_key     │       16444
> (6 rows)
>
> alter table users alter column name type varchar(127);
> select relname, relfilenode from pg_class where relname like 'users%';
>           relname           │ relfilenode
> ────────────────────────────┼─────────────
>  users                      │       16441
>  users_000                  │       16446
>  users_000_user_id_name_key │       16444    <===  duplicated
>  users_001                  │       16451
>  users_001_user_id_name_key │       16444    <===  duplicated
>  users_user_id_name_key     │       16444    <===  duplicated
> (6 rows)

I checked why users_000's and user_0001's indexes end up reusing
users_user_id_name_key's relfilenode.  At the surface, it's because
DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
<old-parents-index's-relfilenode> in IndexStmt, which is recursively
passed down to DefineIndex(<child-indexes-to-be-recreated>).  This
DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
parent rel.  This surface problem may be solved in DefineIndex() by
just resetting oldNode in each child IndexStmt before recursing, but
that means child indexes are recreated with new relfilenodes.  That
solves the immediate problem of relfilenodes being wrongly duplicated,
that's leading to madness such as SMgrRelationHash corruption being
seen in the original bug report.

But, the root problem seems to be that ATPostAlterTypeCleanup() on
child tables isn't setting up their own
DefineIndex(<child-index-to-be-rewritten>) step.  That's because the
parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
constraint due to dependencies (+ CCI).  So, ATExecAlterColumnType()
on child relations isn't able to find the constraint on the individual
child relations to turn into their own
DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
each relation's ATPostAlterTypeCleanup() independently, child's
recreated indexes will be able to reuse their old relfilenodes and
everything will be fine.  But maybe that will require significant
overhaul of how this post-alter-type-cleanup occurs?

Thanks,
Amit


Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/03/07 20:36, Amit Langote wrote:
> On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> The problem start when ALTER TABLE users ALTER COLUMN is executed.
>> create table users(user_id int, name varchar(64), unique (user_id, name))
>> partition by list(user_id);
>>
>> create table users_000 partition of users for values in(0);
>> create table users_001 partition of users for values in(1);
>> select relname, relfilenode from pg_class where relname like 'users%';
>>           relname           │ relfilenode
>> ────────────────────────────┼─────────────
>>  users                      │       16441
>>  users_000                  │       16446
>>  users_000_user_id_name_key │       16449
>>  users_001                  │       16451
>>  users_001_user_id_name_key │       16454
>>  users_user_id_name_key     │       16444
>> (6 rows)
>>
>> alter table users alter column name type varchar(127);
>> select relname, relfilenode from pg_class where relname like 'users%';
>>           relname           │ relfilenode
>> ────────────────────────────┼─────────────
>>  users                      │       16441
>>  users_000                  │       16446
>>  users_000_user_id_name_key │       16444    <===  duplicated
>>  users_001                  │       16451
>>  users_001_user_id_name_key │       16444    <===  duplicated
>>  users_user_id_name_key     │       16444    <===  duplicated
>> (6 rows)
> 
> I checked why users_000's and user_0001's indexes end up reusing
> users_user_id_name_key's relfilenode.  At the surface, it's because
> DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
> <old-parents-index's-relfilenode> in IndexStmt, which is recursively
> passed down to DefineIndex(<child-indexes-to-be-recreated>).  This
> DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
> parent rel.  This surface problem may be solved in DefineIndex() by
> just resetting oldNode in each child IndexStmt before recursing, but
> that means child indexes are recreated with new relfilenodes.  That
> solves the immediate problem of relfilenodes being wrongly duplicated,
> that's leading to madness such as SMgrRelationHash corruption being
> seen in the original bug report.

This doesn't happen in HEAD, because in HEAD we got 807ae415c5, which
changed things so that partitioned relations always have their relfilenode
set to 0.  So, there's no question of parent's relfilenode being passed to
children and hence being duplicated.

But that also means child indexes are unnecessarily rewritten, that is,
with new relfilenodes.

> But, the root problem seems to be that ATPostAlterTypeCleanup() on
> child tables isn't setting up their own
> DefineIndex(<child-index-to-be-rewritten>) step.  That's because the
> parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
> constraint due to dependencies (+ CCI).  So, ATExecAlterColumnType()
> on child relations isn't able to find the constraint on the individual
> child relations to turn into their own
> DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
> each relation's ATPostAlterTypeCleanup() independently, child's
> recreated indexes will be able to reuse their old relfilenodes and
> everything will be fine.  But maybe that will require significant
> overhaul of how this post-alter-type-cleanup occurs?

We could try to solve this dividing ATPostAlterTypeCleanup processing into
two functions:

1 The first one runs right after ATExecAlterColumnType() is finished for a
given table (like it does today), which then runs ATPostAlterTypeParse to
generate commands for constraints and/or indexes to re-add.  This part
won't drop the old constraints/indexes just yet, so child
constraints/indexes will remain for ATExecAlterColumnType to see when
executed for the children.

2. Dropping the old constraints/indexes is the responsibility of the 2nd
part, which runs just before executing ATExecReAddIndex or
ATExecAddConstraint (with is_readd = true), so that the new constraints
don't collide with the existing ones.

This arrangement allows step 1 to generate the commands to recreate even
the child indexes such that the old relfilenode can be be preserved by
setting IndexStmt.oldNode.

Attached patch is a very rough sketch, which fails some regression tests,
but I ran out of time today.

Thanks,
Amit

Attachment

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Michael Paquier
Date:
On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
> Adding it to the section for older bugs sounds fine to me.  Thanks for
> doing so.

I have begun looking at this issue.  Hopefully I'll be able to provide
an update soon.
--
Michael

Attachment

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Michael Paquier
Date:
On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
> Adding it to the section for older bugs sounds fine to me.  Thanks for
> doing so.

I have begun looking at this issue.  Hopefully I'll be able to provide
an update soon.
--
Michael

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/11 15:57, Michael Paquier wrote:
> On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
>> Adding it to the section for older bugs sounds fine to me.  Thanks for
>> doing so.
> 
> I have begun looking at this issue.  Hopefully I'll be able to provide
> an update soon.

Great, thanks.

Regards,
Amit





Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/11 15:57, Michael Paquier wrote:
> On Wed, Mar 27, 2019 at 11:56:20AM +0900, Michael Paquier wrote:
>> Adding it to the section for older bugs sounds fine to me.  Thanks for
>> doing so.
> 
> I have begun looking at this issue.  Hopefully I'll be able to provide
> an update soon.

Great, thanks.

Regards,
Amit





ISTM we could work around the problem with the attached, which I think
is a good change independently of anything else.

There is still an issue, which manifests in both 11 and HEAD, namely
that the code also propagates the parent index's comment to any child
indexes.  You can see that with this extended test case:

create table users(user_id int, name varchar(64), unique (user_id, name)) partition by hash(user_id);
comment on index users_user_id_name_key is 'parent index';
create table users_000 partition of users for values with (modulus 2, remainder 0);
  
create table users_001 partition of users for values with (modulus 2, remainder 1);
  

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
alter table users alter column name type varchar(127);
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';

which gives me (in 11, with this patch)

...
          relname           | relfilenode | obj_description
----------------------------+-------------+-----------------
 users                      |       89389 |
 users_000                  |       89394 |
 users_000_user_id_name_key |       89397 |
 users_001                  |       89399 |
 users_001_user_id_name_key |       89402 |
 users_user_id_name_key     |       89392 | parent index
(6 rows)

ALTER TABLE
          relname           | relfilenode | obj_description
----------------------------+-------------+-----------------
 users                      |       89389 |
 users_000                  |       89394 |
 users_000_user_id_name_key |       89406 | parent index
 users_001                  |       89399 |
 users_001_user_id_name_key |       89408 | parent index
 users_user_id_name_key     |       89404 | parent index
(6 rows)

However, I doubt that that's bad enough to justify a major rewrite
of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
I wouldn't be too unhappy to leave it to v13.

            regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..c5c543f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -252,8 +252,17 @@ CheckIndexCompatible(Oid oldId,
     if (!ret)
         return false;

-    /* For polymorphic opcintype, column type changes break compatibility. */
+    /* We now need to actually look at the index rel. */
     irel = index_open(oldId, AccessShareLock);    /* caller probably has a lock */
+
+    /* If it's a partitioned index, there is no storage to share. */
+    if (irel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+    {
+        index_close(irel, NoLock);
+        return false;
+    }
+
+    /* For polymorphic opcintype, column type changes break compatibility. */
     for (i = 0; i < old_natts; i++)
     {
         if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&

ISTM we could work around the problem with the attached, which I think
is a good change independently of anything else.

There is still an issue, which manifests in both 11 and HEAD, namely
that the code also propagates the parent index's comment to any child
indexes.  You can see that with this extended test case:

create table users(user_id int, name varchar(64), unique (user_id, name)) partition by hash(user_id);
comment on index users_user_id_name_key is 'parent index';
create table users_000 partition of users for values with (modulus 2, remainder 0);
  
create table users_001 partition of users for values with (modulus 2, remainder 1);
  

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
alter table users alter column name type varchar(127);
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';

which gives me (in 11, with this patch)

...
          relname           | relfilenode | obj_description
----------------------------+-------------+-----------------
 users                      |       89389 |
 users_000                  |       89394 |
 users_000_user_id_name_key |       89397 |
 users_001                  |       89399 |
 users_001_user_id_name_key |       89402 |
 users_user_id_name_key     |       89392 | parent index
(6 rows)

ALTER TABLE
          relname           | relfilenode | obj_description
----------------------------+-------------+-----------------
 users                      |       89389 |
 users_000                  |       89394 |
 users_000_user_id_name_key |       89406 | parent index
 users_001                  |       89399 |
 users_001_user_id_name_key |       89408 | parent index
 users_user_id_name_key     |       89404 | parent index
(6 rows)

However, I doubt that that's bad enough to justify a major rewrite
of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
I wouldn't be too unhappy to leave it to v13.

            regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..c5c543f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -252,8 +252,17 @@ CheckIndexCompatible(Oid oldId,
     if (!ret)
         return false;

-    /* For polymorphic opcintype, column type changes break compatibility. */
+    /* We now need to actually look at the index rel. */
     irel = index_open(oldId, AccessShareLock);    /* caller probably has a lock */
+
+    /* If it's a partitioned index, there is no storage to share. */
+    if (irel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+    {
+        index_close(irel, NoLock);
+        return false;
+    }
+
+    /* For polymorphic opcintype, column type changes break compatibility. */
     for (i = 0; i < old_natts; i++)
     {
         if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
Thanks for looking at this.

On 2019/04/24 7:03, Tom Lane wrote:
> ISTM we could work around the problem with the attached, which I think
> is a good change independently of anything else.

Agreed.

> There is still an issue, which manifests in both 11 and HEAD, namely
> that the code also propagates the parent index's comment to any child
> indexes.  You can see that with this extended test case:
> 
> create table users(user_id int, name varchar(64), unique (user_id, name)) partition by hash(user_id);    
> comment on index users_user_id_name_key is 'parent index';
> create table users_000 partition of users for values with (modulus 2, remainder 0);
   
 
> create table users_001 partition of users for values with (modulus 2, remainder 1);
   
 
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
> alter table users alter column name type varchar(127);                         
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
> 
> which gives me (in 11, with this patch)
> 
> ...
>           relname           | relfilenode | obj_description 
> ----------------------------+-------------+-----------------
>  users                      |       89389 | 
>  users_000                  |       89394 | 
>  users_000_user_id_name_key |       89397 | 
>  users_001                  |       89399 | 
>  users_001_user_id_name_key |       89402 | 
>  users_user_id_name_key     |       89392 | parent index
> (6 rows)
> 
> ALTER TABLE
>           relname           | relfilenode | obj_description 
> ----------------------------+-------------+-----------------
>  users                      |       89389 | 
>  users_000                  |       89394 | 
>  users_000_user_id_name_key |       89406 | parent index
>  users_001                  |       89399 | 
>  users_001_user_id_name_key |       89408 | parent index
>  users_user_id_name_key     |       89404 | parent index
> (6 rows)

This may be seen as slightly worse if the child indexes had their own
comments, which would get overwritten by the parent's.

create table pp (a int, b text, unique (a, b)) partition by list (a);
create table pp1 partition of pp for values in (1);
create table pp2 partition of pp for values in (2);
comment on index pp_a_b_key is 'parent index';
comment on index pp1_a_b_key is 'child index 1';
comment on index pp2_a_b_key is 'child index 2';
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp          │       16420 │
 pp1         │       16425 │
 pp1_a_b_key │       16428 │ child index 1
 pp2         │       16433 │
 pp2_a_b_key │       16436 │ child index 2
 pp_a_b_key  │       16423 │ parent index
(6 rows)

alter table pp alter b type varchar(128);
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp          │       16420 │
 pp1         │       16447 │
 pp1_a_b_key │       16450 │ parent index
 pp2         │       16451 │
 pp2_a_b_key │       16454 │ parent index
 pp_a_b_key  │       16441 │ parent index
(6 rows)

> However, I doubt that that's bad enough to justify a major rewrite
> of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
> I wouldn't be too unhappy to leave it to v13.

Yeah, it's probably decent amount of code churn to undertake as a bug-fix.

Thanks,
Amit




Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
Thanks for looking at this.

On 2019/04/24 7:03, Tom Lane wrote:
> ISTM we could work around the problem with the attached, which I think
> is a good change independently of anything else.

Agreed.

> There is still an issue, which manifests in both 11 and HEAD, namely
> that the code also propagates the parent index's comment to any child
> indexes.  You can see that with this extended test case:
> 
> create table users(user_id int, name varchar(64), unique (user_id, name)) partition by hash(user_id);    
> comment on index users_user_id_name_key is 'parent index';
> create table users_000 partition of users for values with (modulus 2, remainder 0);
   
 
> create table users_001 partition of users for values with (modulus 2, remainder 1);
   
 
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
> alter table users alter column name type varchar(127);                         
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class where relname like 'users%';
> 
> which gives me (in 11, with this patch)
> 
> ...
>           relname           | relfilenode | obj_description 
> ----------------------------+-------------+-----------------
>  users                      |       89389 | 
>  users_000                  |       89394 | 
>  users_000_user_id_name_key |       89397 | 
>  users_001                  |       89399 | 
>  users_001_user_id_name_key |       89402 | 
>  users_user_id_name_key     |       89392 | parent index
> (6 rows)
> 
> ALTER TABLE
>           relname           | relfilenode | obj_description 
> ----------------------------+-------------+-----------------
>  users                      |       89389 | 
>  users_000                  |       89394 | 
>  users_000_user_id_name_key |       89406 | parent index
>  users_001                  |       89399 | 
>  users_001_user_id_name_key |       89408 | parent index
>  users_user_id_name_key     |       89404 | parent index
> (6 rows)

This may be seen as slightly worse if the child indexes had their own
comments, which would get overwritten by the parent's.

create table pp (a int, b text, unique (a, b)) partition by list (a);
create table pp1 partition of pp for values in (1);
create table pp2 partition of pp for values in (2);
comment on index pp_a_b_key is 'parent index';
comment on index pp1_a_b_key is 'child index 1';
comment on index pp2_a_b_key is 'child index 2';
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp          │       16420 │
 pp1         │       16425 │
 pp1_a_b_key │       16428 │ child index 1
 pp2         │       16433 │
 pp2_a_b_key │       16436 │ child index 2
 pp_a_b_key  │       16423 │ parent index
(6 rows)

alter table pp alter b type varchar(128);
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp          │       16420 │
 pp1         │       16447 │
 pp1_a_b_key │       16450 │ parent index
 pp2         │       16451 │
 pp2_a_b_key │       16454 │ parent index
 pp_a_b_key  │       16441 │ parent index
(6 rows)

> However, I doubt that that's bad enough to justify a major rewrite
> of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
> I wouldn't be too unhappy to leave it to v13.

Yeah, it's probably decent amount of code churn to undertake as a bug-fix.

Thanks,
Amit




Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/04/24 7:03, Tom Lane wrote:
>> ISTM we could work around the problem with the attached, which I think
>> is a good change independently of anything else.

> Agreed.

After thinking about it more, it seems like a bad idea to put the check
in CheckIndexCompatible; that could interfere with any other use of the
function to match up potential child indexes.  (I wonder why this test
isn't the same as what DefineIndex uses to spot potential child indexes,
BTW --- it uses a completely separate function CompareIndexInfo, which
seems both wasteful and trouble waiting to happen.)

So I think we should test the relkind in TryReuseIndex, instead.
I think it would be a good idea to *also* do what you suggested
upthread and have DefineIndex clear the field when cloning an
IndexStmt to create a child; no good could possibly come of
passing that down when we intend to create a new index.

In short, I think the right short-term fix is as attached (plus
a new regression test case based on the submitted example).

Longer term, it's clearly bad that we fail to reuse child indexes
in this scenario; the point about mangled comments is minor by
comparison.  I'm inclined to think that what we want to do is
*not* recurse when creating the parent index, but to initially
make it NOT VALID, and then do ALTER ATTACH PARTITION with each
re-used child index.  This would successfully reproduce the
previous state in case only some of the partitions have attached
indexes, which I don't think works correctly right now.

BTW, I hadn't ever looked closely at what the index reuse code
does, and now that I have, my heart just sinks.  I think that
logic needs to be nuked from orbit.  RelationPreserveStorage was
never meant to be abused in this way; I invite you to contemplate
whether it's not a problem that it doesn't check the backend and
nestLevel fields of PendingRelDelete entries before killing them.
(In its originally-designed use for mapped rels at transaction end,
that wasn't a problem, but I'm afraid that it may be one now.)

The right way to do this IMO would be something closer to the
heap-swap logic in cluster.c, where we exchange the relfilenodes
of two live indexes, rather than what is happening now.  Or for
that matter, do we really need to delete the old indexes at all?

None of that looks very practical for v12, let alone back-patching
to v11, though.

            regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a1c91b5..002a8b8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1125,6 +1125,18 @@ DefineIndex(Oid relationId,
                     ListCell   *lc;

                     /*
+                     * We can't use the same index name for the child index,
+                     * so clear idxname to let the recursive invocation choose
+                     * a new name.  Likewise, the existing target relation
+                     * field is wrong, and if indexOid or oldNode are set,
+                     * they mustn't be applied to the child either.
+                     */
+                    childStmt->idxname = NULL;
+                    childStmt->relation = NULL;
+                    childStmt->indexOid = InvalidOid;
+                    childStmt->oldNode = InvalidOid;
+
+                    /*
                      * Adjust any Vars (both in expressions and in the index's
                      * WHERE clause) to match the partition's column numbering
                      * in case it's different from the parent's.
@@ -1155,8 +1167,6 @@ DefineIndex(Oid relationId,
                     if (found_whole_row)
                         elog(ERROR, "cannot convert whole-row table reference");

-                    childStmt->idxname = NULL;
-                    childStmt->relation = NULL;
                     DefineIndex(childRelid, childStmt,
                                 InvalidOid, /* no predefined OID */
                                 indexRelationId,    /* this is our child */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index aa7328e..66f863f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11454,7 +11454,9 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
     {
         Relation    irel = index_open(oldId, NoLock);

-        stmt->oldNode = irel->rd_node.relNode;
+        /* If it's a partitioned index, there is no storage to share. */
+        if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+            stmt->oldNode = irel->rd_node.relNode;
         index_close(irel, NoLock);
     }
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..4b7520b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1001,6 +1001,19 @@ DefineIndex(Oid relationId,
                     ListCell   *lc;

                     /*
+                     * We can't use the same index name for the child index,
+                     * so clear idxname to let the recursive invocation choose
+                     * a new name.  Likewise, the existing target relation
+                     * field is wrong, and if indexOid or oldNode are set,
+                     * they mustn't be applied to the child either.
+                     */
+                    childStmt->idxname = NULL;
+                    childStmt->relation = NULL;
+                    childStmt->relationId = childRelid;
+                    childStmt->indexOid = InvalidOid;
+                    childStmt->oldNode = InvalidOid;
+
+                    /*
                      * Adjust any Vars (both in expressions and in the index's
                      * WHERE clause) to match the partition's column numbering
                      * in case it's different from the parent's.
@@ -1031,8 +1044,6 @@ DefineIndex(Oid relationId,
                     if (found_whole_row)
                         elog(ERROR, "cannot convert whole-row table reference");

-                    childStmt->idxname = NULL;
-                    childStmt->relationId = childRelid;
                     DefineIndex(childRelid, childStmt,
                                 InvalidOid, /* no predefined OID */
                                 indexRelationId,    /* this is our child */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 65ede33..865fc42 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10696,7 +10696,9 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
     {
         Relation    irel = index_open(oldId, NoLock);

-        stmt->oldNode = irel->rd_node.relNode;
+        /* If it's a partitioned index, there is no storage to share. */
+        if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+            stmt->oldNode = irel->rd_node.relNode;
         index_close(irel, NoLock);
     }
 }

Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2019/04/24 7:03, Tom Lane wrote:
>> ISTM we could work around the problem with the attached, which I think
>> is a good change independently of anything else.

> Agreed.

After thinking about it more, it seems like a bad idea to put the check
in CheckIndexCompatible; that could interfere with any other use of the
function to match up potential child indexes.  (I wonder why this test
isn't the same as what DefineIndex uses to spot potential child indexes,
BTW --- it uses a completely separate function CompareIndexInfo, which
seems both wasteful and trouble waiting to happen.)

So I think we should test the relkind in TryReuseIndex, instead.
I think it would be a good idea to *also* do what you suggested
upthread and have DefineIndex clear the field when cloning an
IndexStmt to create a child; no good could possibly come of
passing that down when we intend to create a new index.

In short, I think the right short-term fix is as attached (plus
a new regression test case based on the submitted example).

Longer term, it's clearly bad that we fail to reuse child indexes
in this scenario; the point about mangled comments is minor by
comparison.  I'm inclined to think that what we want to do is
*not* recurse when creating the parent index, but to initially
make it NOT VALID, and then do ALTER ATTACH PARTITION with each
re-used child index.  This would successfully reproduce the
previous state in case only some of the partitions have attached
indexes, which I don't think works correctly right now.

BTW, I hadn't ever looked closely at what the index reuse code
does, and now that I have, my heart just sinks.  I think that
logic needs to be nuked from orbit.  RelationPreserveStorage was
never meant to be abused in this way; I invite you to contemplate
whether it's not a problem that it doesn't check the backend and
nestLevel fields of PendingRelDelete entries before killing them.
(In its originally-designed use for mapped rels at transaction end,
that wasn't a problem, but I'm afraid that it may be one now.)

The right way to do this IMO would be something closer to the
heap-swap logic in cluster.c, where we exchange the relfilenodes
of two live indexes, rather than what is happening now.  Or for
that matter, do we really need to delete the old indexes at all?

None of that looks very practical for v12, let alone back-patching
to v11, though.

            regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a1c91b5..002a8b8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1125,6 +1125,18 @@ DefineIndex(Oid relationId,
                     ListCell   *lc;

                     /*
+                     * We can't use the same index name for the child index,
+                     * so clear idxname to let the recursive invocation choose
+                     * a new name.  Likewise, the existing target relation
+                     * field is wrong, and if indexOid or oldNode are set,
+                     * they mustn't be applied to the child either.
+                     */
+                    childStmt->idxname = NULL;
+                    childStmt->relation = NULL;
+                    childStmt->indexOid = InvalidOid;
+                    childStmt->oldNode = InvalidOid;
+
+                    /*
                      * Adjust any Vars (both in expressions and in the index's
                      * WHERE clause) to match the partition's column numbering
                      * in case it's different from the parent's.
@@ -1155,8 +1167,6 @@ DefineIndex(Oid relationId,
                     if (found_whole_row)
                         elog(ERROR, "cannot convert whole-row table reference");

-                    childStmt->idxname = NULL;
-                    childStmt->relation = NULL;
                     DefineIndex(childRelid, childStmt,
                                 InvalidOid, /* no predefined OID */
                                 indexRelationId,    /* this is our child */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index aa7328e..66f863f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11454,7 +11454,9 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
     {
         Relation    irel = index_open(oldId, NoLock);

-        stmt->oldNode = irel->rd_node.relNode;
+        /* If it's a partitioned index, there is no storage to share. */
+        if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+            stmt->oldNode = irel->rd_node.relNode;
         index_close(irel, NoLock);
     }
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..4b7520b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1001,6 +1001,19 @@ DefineIndex(Oid relationId,
                     ListCell   *lc;

                     /*
+                     * We can't use the same index name for the child index,
+                     * so clear idxname to let the recursive invocation choose
+                     * a new name.  Likewise, the existing target relation
+                     * field is wrong, and if indexOid or oldNode are set,
+                     * they mustn't be applied to the child either.
+                     */
+                    childStmt->idxname = NULL;
+                    childStmt->relation = NULL;
+                    childStmt->relationId = childRelid;
+                    childStmt->indexOid = InvalidOid;
+                    childStmt->oldNode = InvalidOid;
+
+                    /*
                      * Adjust any Vars (both in expressions and in the index's
                      * WHERE clause) to match the partition's column numbering
                      * in case it's different from the parent's.
@@ -1031,8 +1044,6 @@ DefineIndex(Oid relationId,
                     if (found_whole_row)
                         elog(ERROR, "cannot convert whole-row table reference");

-                    childStmt->idxname = NULL;
-                    childStmt->relationId = childRelid;
                     DefineIndex(childRelid, childStmt,
                                 InvalidOid, /* no predefined OID */
                                 indexRelationId,    /* this is our child */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 65ede33..865fc42 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10696,7 +10696,9 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
     {
         Relation    irel = index_open(oldId, NoLock);

-        stmt->oldNode = irel->rd_node.relNode;
+        /* If it's a partitioned index, there is no storage to share. */
+        if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+            stmt->oldNode = irel->rd_node.relNode;
         index_close(irel, NoLock);
     }
 }

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 8:27, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2019/04/24 7:03, Tom Lane wrote:
>>> ISTM we could work around the problem with the attached, which I think
>>> is a good change independently of anything else.
> 
>> Agreed.
> 
> After thinking about it more, it seems like a bad idea to put the check
> in CheckIndexCompatible; that could interfere with any other use of the
> function to match up potential child indexes.

CheckIndexCompatible() seems to be invented solely for the purposes of
post-ALTER-COLUMN-TYPE, which is what I get from the header comment of
that function:

 * This is tailored to the needs of ALTER TABLE ALTER TYPE, which recreates
 * any indexes that depended on a changing column from their pg_get_indexdef
 * or pg_get_constraintdef definitions.  We omit some of the sanity checks of
 * DefineIndex.  We assume that the old and new indexes have the same number
 * of columns and that if one has an expression column or predicate, both do.
 * Errors arising from the attribute list still apply.

Given that, it may have been better to keep the function local to
tablecmds.c to avoid potential misuse; I see no other callers of this
function beside TryReuseIndex(), which in turn is only used by
post-ALTER-COLUMN-TYPE processing.

> (I wonder why this test
> isn't the same as what DefineIndex uses to spot potential child indexes,
> BTW --- it uses a completely separate function CompareIndexInfo, which
> seems both wasteful and trouble waiting to happen.)

I don't think that CheckIndexCompatible's job is to spot child indexes
compatible with a given parent index; that's CompareIndexInfo's job.  The
latter was invented for the partition index DDL code, complete with
provisions to do mapping between parent and child attributes before
matching if their TupleDescs are different.

CheckIndexCompatible, as mentioned above, is to do the errands of
ATPostAlterTypeParse().

/*
 * CheckIndexCompatible
 *      Determine whether an existing index definition is compatible with a
 *      prospective index definition, such that the existing index storage
 *      could become the storage of the new index, avoiding a rebuild.

So, maybe a bigger (different?) charter than CompareIndexInfo's, or so I
think.  Although, they may be able share the code.

> So I think we should test the relkind in TryReuseIndex, instead.

Which would work too.  Or we could just not call TryReuseIndex() if the
the index in question is partitioned.

> I think it would be a good idea to *also* do what you suggested
> upthread and have DefineIndex clear the field when cloning an
> IndexStmt to create a child; no good could possibly come of
> passing that down when we intend to create a new index.

Note that oldNode is only set by TryReuseIndex() today.  Being careful in
DefineIndex might be a good idea anyway.

> In short, I think the right short-term fix is as attached (plus
> a new regression test case based on the submitted example).

Sounds good.

> Longer term, it's clearly bad that we fail to reuse child indexes
> in this scenario; the point about mangled comments is minor by
> comparison.

Agreed.

> I'm inclined to think that what we want to do is
> *not* recurse when creating the parent index, but to initially
> make it NOT VALID, and then do ALTER ATTACH PARTITION with each
> re-used child index.  This would successfully reproduce the
> previous state in case only some of the partitions have attached
> indexes, which I don't think works correctly right now.

Well, an index in the parent must be present in all partitions, so I don't
understand which case you're thinking of.

> BTW, I hadn't ever looked closely at what the index reuse code
> does, and now that I have, my heart just sinks.  I think that
> logic needs to be nuked from orbit.  RelationPreserveStorage was
> never meant to be abused in this way; I invite you to contemplate
> whether it's not a problem that it doesn't check the backend and
> nestLevel fields of PendingRelDelete entries before killing them.
> (In its originally-designed use for mapped rels at transaction end,
> that wasn't a problem, but I'm afraid that it may be one now.)
> 
> The right way to do this IMO would be something closer to the
> heap-swap logic in cluster.c, where we exchange the relfilenodes
> of two live indexes, rather than what is happening now.  Or for
> that matter, do we really need to delete the old indexes at all?

Yeah, we wouldn't need TryReuseIndex and subsequent
RelationPreserveStorage if we hadn't dropped the old indexes to begin
with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
recreate the index from scratch using an IndexStmt generated from the old
index definition.  I guess We can get rid of IndexStmt.oldNode too.

> None of that looks very practical for v12, let alone back-patching
> to v11, though.

Yes.

Thanks,
Amit




Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 8:27, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2019/04/24 7:03, Tom Lane wrote:
>>> ISTM we could work around the problem with the attached, which I think
>>> is a good change independently of anything else.
> 
>> Agreed.
> 
> After thinking about it more, it seems like a bad idea to put the check
> in CheckIndexCompatible; that could interfere with any other use of the
> function to match up potential child indexes.

CheckIndexCompatible() seems to be invented solely for the purposes of
post-ALTER-COLUMN-TYPE, which is what I get from the header comment of
that function:

 * This is tailored to the needs of ALTER TABLE ALTER TYPE, which recreates
 * any indexes that depended on a changing column from their pg_get_indexdef
 * or pg_get_constraintdef definitions.  We omit some of the sanity checks of
 * DefineIndex.  We assume that the old and new indexes have the same number
 * of columns and that if one has an expression column or predicate, both do.
 * Errors arising from the attribute list still apply.

Given that, it may have been better to keep the function local to
tablecmds.c to avoid potential misuse; I see no other callers of this
function beside TryReuseIndex(), which in turn is only used by
post-ALTER-COLUMN-TYPE processing.

> (I wonder why this test
> isn't the same as what DefineIndex uses to spot potential child indexes,
> BTW --- it uses a completely separate function CompareIndexInfo, which
> seems both wasteful and trouble waiting to happen.)

I don't think that CheckIndexCompatible's job is to spot child indexes
compatible with a given parent index; that's CompareIndexInfo's job.  The
latter was invented for the partition index DDL code, complete with
provisions to do mapping between parent and child attributes before
matching if their TupleDescs are different.

CheckIndexCompatible, as mentioned above, is to do the errands of
ATPostAlterTypeParse().

/*
 * CheckIndexCompatible
 *      Determine whether an existing index definition is compatible with a
 *      prospective index definition, such that the existing index storage
 *      could become the storage of the new index, avoiding a rebuild.

So, maybe a bigger (different?) charter than CompareIndexInfo's, or so I
think.  Although, they may be able share the code.

> So I think we should test the relkind in TryReuseIndex, instead.

Which would work too.  Or we could just not call TryReuseIndex() if the
the index in question is partitioned.

> I think it would be a good idea to *also* do what you suggested
> upthread and have DefineIndex clear the field when cloning an
> IndexStmt to create a child; no good could possibly come of
> passing that down when we intend to create a new index.

Note that oldNode is only set by TryReuseIndex() today.  Being careful in
DefineIndex might be a good idea anyway.

> In short, I think the right short-term fix is as attached (plus
> a new regression test case based on the submitted example).

Sounds good.

> Longer term, it's clearly bad that we fail to reuse child indexes
> in this scenario; the point about mangled comments is minor by
> comparison.

Agreed.

> I'm inclined to think that what we want to do is
> *not* recurse when creating the parent index, but to initially
> make it NOT VALID, and then do ALTER ATTACH PARTITION with each
> re-used child index.  This would successfully reproduce the
> previous state in case only some of the partitions have attached
> indexes, which I don't think works correctly right now.

Well, an index in the parent must be present in all partitions, so I don't
understand which case you're thinking of.

> BTW, I hadn't ever looked closely at what the index reuse code
> does, and now that I have, my heart just sinks.  I think that
> logic needs to be nuked from orbit.  RelationPreserveStorage was
> never meant to be abused in this way; I invite you to contemplate
> whether it's not a problem that it doesn't check the backend and
> nestLevel fields of PendingRelDelete entries before killing them.
> (In its originally-designed use for mapped rels at transaction end,
> that wasn't a problem, but I'm afraid that it may be one now.)
> 
> The right way to do this IMO would be something closer to the
> heap-swap logic in cluster.c, where we exchange the relfilenodes
> of two live indexes, rather than what is happening now.  Or for
> that matter, do we really need to delete the old indexes at all?

Yeah, we wouldn't need TryReuseIndex and subsequent
RelationPreserveStorage if we hadn't dropped the old indexes to begin
with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
recreate the index from scratch using an IndexStmt generated from the old
index definition.  I guess We can get rid of IndexStmt.oldNode too.

> None of that looks very practical for v12, let alone back-patching
> to v11, though.

Yes.

Thanks,
Amit




Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 11:21, Amit Langote wrote:
> On 2019/04/25 8:27, Tom Lane wrote:
>> BTW, I hadn't ever looked closely at what the index reuse code
>> does, and now that I have, my heart just sinks.  I think that
>> logic needs to be nuked from orbit.  RelationPreserveStorage was
>> never meant to be abused in this way; I invite you to contemplate
>> whether it's not a problem that it doesn't check the backend and
>> nestLevel fields of PendingRelDelete entries before killing them.
>> (In its originally-designed use for mapped rels at transaction end,
>> that wasn't a problem, but I'm afraid that it may be one now.)
>>
>> The right way to do this IMO would be something closer to the
>> heap-swap logic in cluster.c, where we exchange the relfilenodes
>> of two live indexes, rather than what is happening now.  Or for
>> that matter, do we really need to delete the old indexes at all?
> 
> Yeah, we wouldn't need TryReuseIndex and subsequent
> RelationPreserveStorage if we hadn't dropped the old indexes to begin
> with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
> TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
> a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
> recreate the index from scratch using an IndexStmt generated from the old
> index definition.  I guess We can get rid of IndexStmt.oldNode too.

Thinking on this more and growing confident that we could indeed avoid
drop index + recreate-it-while-preserving-storage, instead by just not
doing anything when CheckIndexCompatible says the old index will be fine
despite ALTER TYPE, but only if the table is not rewritten.  I gave this a
try and came up with the attached patch.  It fixes the bug related to
partitioned indexes (the originally reported one) and then some.

Basically, I aimed to rewrite the code in ATPostAlterTypeCleanup and
ATPostAlterTypeParse such that we no longer have to rely on an
implementation based on setting "oldNode" to preserve old indexes.  With
the attached, for the cases in which the table won't be rewritten and
hence the indexes not rebuilt, ATPostAlterTypeParse() simply won't queue a
AT_ReAddIndex command to rebuild index while preserving the storage.  That
means both ATAddIndex() and DefineIndex can be freed of the duty of
looking out for the "oldNode" case, because that case no longer exists.

Another main change is that inherited (!conislocal) constraints are now
recognized by ATPostAlterTypeParse directly, instead of
ATPostAlterTypeCleanup checking for them and skipping
ATPostAlterTypeParse() as a whole for such constraints.  For one, I had to
make that change to make the above-described approach work.  Also, doing
that allowed to fix another bug whereby the comments of child constraints
would go away when they're reconstructed.  Notice what happens on
un-patched PG 11:

create table pp (a int, b text, unique (a, b), c varchar(64)) partition by
list (a);
create table pp1 partition of pp for values in (1);
create table pp2 partition of pp for values in (2);
alter table pp add constraint c_chk check (c <> '');
comment on constraint c_chk ON pp is 'parent check constraint';
comment on constraint c_chk ON pp1 is 'child check constraint 1';
comment on constraint c_chk ON pp2 is 'child check constraint 2';
select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

alter table pp alter c type varchar(64);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼─────────────────────────
 c_chk   │ parent check constraint
 c_chk   │
 c_chk   │
(3 rows)

The patch fixes that with some surgery of RebuildConstraintComment
combined with aforementioned changes.  With the patch:

alter table pp alter c type varchar(64);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

alter table pp alter c type varchar(128);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

Also for index comments, but only in the case when indexes are not rebuilt.

comment on index pp_a_b_key is 'parent index';
comment on index pp1_a_b_key is 'child index 1';
comment on index pp2_a_b_key is 'child index 2';

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17280 │ child index 1
 pp2_a_b_key │       17284 │ child index 2
 pp_a_b_key  │       17271 │ parent index
(3 rows)

-- no rewrite, indexes untouched, comments preserved
alter table pp alter b type varchar(128);

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17280 │ child index 1
 pp2_a_b_key │       17284 │ child index 2
 pp_a_b_key  │       17271 │ parent index
(3 rows)

-- table rewritten, indexes rebuild, child indexes' comments gone
alter table pp alter b type varchar(64);

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17294 │
 pp2_a_b_key │       17298 │
 pp_a_b_key  │       17285 │ parent index
(3 rows)


I've also added tests for both the originally reported bug and the comment
ones.

The patch applies to PG 11.

Thanks,
Amit

Attachment

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 11:21, Amit Langote wrote:
> On 2019/04/25 8:27, Tom Lane wrote:
>> BTW, I hadn't ever looked closely at what the index reuse code
>> does, and now that I have, my heart just sinks.  I think that
>> logic needs to be nuked from orbit.  RelationPreserveStorage was
>> never meant to be abused in this way; I invite you to contemplate
>> whether it's not a problem that it doesn't check the backend and
>> nestLevel fields of PendingRelDelete entries before killing them.
>> (In its originally-designed use for mapped rels at transaction end,
>> that wasn't a problem, but I'm afraid that it may be one now.)
>>
>> The right way to do this IMO would be something closer to the
>> heap-swap logic in cluster.c, where we exchange the relfilenodes
>> of two live indexes, rather than what is happening now.  Or for
>> that matter, do we really need to delete the old indexes at all?
> 
> Yeah, we wouldn't need TryReuseIndex and subsequent
> RelationPreserveStorage if we hadn't dropped the old indexes to begin
> with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
> TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
> a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
> recreate the index from scratch using an IndexStmt generated from the old
> index definition.  I guess We can get rid of IndexStmt.oldNode too.

Thinking on this more and growing confident that we could indeed avoid
drop index + recreate-it-while-preserving-storage, instead by just not
doing anything when CheckIndexCompatible says the old index will be fine
despite ALTER TYPE, but only if the table is not rewritten.  I gave this a
try and came up with the attached patch.  It fixes the bug related to
partitioned indexes (the originally reported one) and then some.

Basically, I aimed to rewrite the code in ATPostAlterTypeCleanup and
ATPostAlterTypeParse such that we no longer have to rely on an
implementation based on setting "oldNode" to preserve old indexes.  With
the attached, for the cases in which the table won't be rewritten and
hence the indexes not rebuilt, ATPostAlterTypeParse() simply won't queue a
AT_ReAddIndex command to rebuild index while preserving the storage.  That
means both ATAddIndex() and DefineIndex can be freed of the duty of
looking out for the "oldNode" case, because that case no longer exists.

Another main change is that inherited (!conislocal) constraints are now
recognized by ATPostAlterTypeParse directly, instead of
ATPostAlterTypeCleanup checking for them and skipping
ATPostAlterTypeParse() as a whole for such constraints.  For one, I had to
make that change to make the above-described approach work.  Also, doing
that allowed to fix another bug whereby the comments of child constraints
would go away when they're reconstructed.  Notice what happens on
un-patched PG 11:

create table pp (a int, b text, unique (a, b), c varchar(64)) partition by
list (a);
create table pp1 partition of pp for values in (1);
create table pp2 partition of pp for values in (2);
alter table pp add constraint c_chk check (c <> '');
comment on constraint c_chk ON pp is 'parent check constraint';
comment on constraint c_chk ON pp1 is 'child check constraint 1';
comment on constraint c_chk ON pp2 is 'child check constraint 2';
select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

alter table pp alter c type varchar(64);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼─────────────────────────
 c_chk   │ parent check constraint
 c_chk   │
 c_chk   │
(3 rows)

The patch fixes that with some surgery of RebuildConstraintComment
combined with aforementioned changes.  With the patch:

alter table pp alter c type varchar(64);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

alter table pp alter c type varchar(128);

select conname, obj_description(oid, 'pg_constraint') from pg_constraint
where conname = 'c_chk';
 conname │     obj_description
─────────┼──────────────────────────
 c_chk   │ parent check constraint
 c_chk   │ child check constraint 1
 c_chk   │ child check constraint 2
(3 rows)

Also for index comments, but only in the case when indexes are not rebuilt.

comment on index pp_a_b_key is 'parent index';
comment on index pp1_a_b_key is 'child index 1';
comment on index pp2_a_b_key is 'child index 2';

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17280 │ child index 1
 pp2_a_b_key │       17284 │ child index 2
 pp_a_b_key  │       17271 │ parent index
(3 rows)

-- no rewrite, indexes untouched, comments preserved
alter table pp alter b type varchar(128);

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17280 │ child index 1
 pp2_a_b_key │       17284 │ child index 2
 pp_a_b_key  │       17271 │ parent index
(3 rows)

-- table rewritten, indexes rebuild, child indexes' comments gone
alter table pp alter b type varchar(64);

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%key';
   relname   │ relfilenode │ obj_description
─────────────┼─────────────┼─────────────────
 pp1_a_b_key │       17294 │
 pp2_a_b_key │       17298 │
 pp_a_b_key  │       17285 │ parent index
(3 rows)


I've also added tests for both the originally reported bug and the comment
ones.

The patch applies to PG 11.

Thanks,
Amit

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Alvaro Herrera
Date:
Haven't read the patch, but I tried applying it on top of my tablespace
fixing patch ... and my first report is that this query in regress fails
(three times):

 select conname, obj_description(oid, 'pg_constraint') from pg_constraint where conname = 'c_chk' order by 1, 2;
  conname |            obj_description            
 ---------+---------------------------------------
+ c_chk   | alttype_cleanup_idx check constraint
  c_chk   | alttype_cleanup_idx1 check constraint
  c_chk   | alttype_cleanup_idx2 check constraint
- c_chk   | alttype_cleanup_idx check constraint
 (3 rows)

I think you should use 'ORDER BY 2 COLLATE "C"' to avoid the problem.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Alvaro Herrera
Date:
Haven't read the patch, but I tried applying it on top of my tablespace
fixing patch ... and my first report is that this query in regress fails
(three times):

 select conname, obj_description(oid, 'pg_constraint') from pg_constraint where conname = 'c_chk' order by 1, 2;
  conname |            obj_description            
 ---------+---------------------------------------
+ c_chk   | alttype_cleanup_idx check constraint
  c_chk   | alttype_cleanup_idx1 check constraint
  c_chk   | alttype_cleanup_idx2 check constraint
- c_chk   | alttype_cleanup_idx check constraint
 (3 rows)

I think you should use 'ORDER BY 2 COLLATE "C"' to avoid the problem.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On Thu, Apr 25, 2019 at 10:46 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Haven't read the patch, but I tried applying it on top of my tablespace
> fixing patch ... and my first report is that this query in regress fails
> (three times):
>
>  select conname, obj_description(oid, 'pg_constraint') from pg_constraint where conname = 'c_chk' order by 1, 2;
>   conname |            obj_description
>  ---------+---------------------------------------
> + c_chk   | alttype_cleanup_idx check constraint
>   c_chk   | alttype_cleanup_idx1 check constraint
>   c_chk   | alttype_cleanup_idx2 check constraint
> - c_chk   | alttype_cleanup_idx check constraint
>  (3 rows)
>
> I think you should use 'ORDER BY 2 COLLATE "C"' to avoid the problem.

Oops, will do.  Thanks.

Regards,
Amit



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On Thu, Apr 25, 2019 at 10:46 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Haven't read the patch, but I tried applying it on top of my tablespace
> fixing patch ... and my first report is that this query in regress fails
> (three times):
>
>  select conname, obj_description(oid, 'pg_constraint') from pg_constraint where conname = 'c_chk' order by 1, 2;
>   conname |            obj_description
>  ---------+---------------------------------------
> + c_chk   | alttype_cleanup_idx check constraint
>   c_chk   | alttype_cleanup_idx1 check constraint
>   c_chk   | alttype_cleanup_idx2 check constraint
> - c_chk   | alttype_cleanup_idx check constraint
>  (3 rows)
>
> I think you should use 'ORDER BY 2 COLLATE "C"' to avoid the problem.

Oops, will do.  Thanks.

Regards,
Amit



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 19:02, Amit Langote wrote:
> On 2019/04/25 11:21, Amit Langote wrote:
>> On 2019/04/25 8:27, Tom Lane wrote:
>>> BTW, I hadn't ever looked closely at what the index reuse code
>>> does, and now that I have, my heart just sinks.  I think that
>>> logic needs to be nuked from orbit.  RelationPreserveStorage was
>>> never meant to be abused in this way; I invite you to contemplate
>>> whether it's not a problem that it doesn't check the backend and
>>> nestLevel fields of PendingRelDelete entries before killing them.
>>> (In its originally-designed use for mapped rels at transaction end,
>>> that wasn't a problem, but I'm afraid that it may be one now.)
>>>
>>> The right way to do this IMO would be something closer to the
>>> heap-swap logic in cluster.c, where we exchange the relfilenodes
>>> of two live indexes, rather than what is happening now.  Or for
>>> that matter, do we really need to delete the old indexes at all?
>>
>> Yeah, we wouldn't need TryReuseIndex and subsequent
>> RelationPreserveStorage if we hadn't dropped the old indexes to begin
>> with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
>> TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
>> a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
>> recreate the index from scratch using an IndexStmt generated from the old
>> index definition.  I guess We can get rid of IndexStmt.oldNode too.
> 
> Thinking on this more and growing confident that we could indeed avoid
> drop index + recreate-it-while-preserving-storage, instead by just not
> doing anything when CheckIndexCompatible says the old index will be fine
> despite ALTER TYPE, but only if the table is not rewritten.  I gave this a
> try and came up with the attached patch.  It fixes the bug related to
> partitioned indexes (the originally reported one) and then some.
> 
> Basically, I aimed to rewrite the code in ATPostAlterTypeCleanup and
> ATPostAlterTypeParse such that we no longer have to rely on an
> implementation based on setting "oldNode" to preserve old indexes.  With
> the attached, for the cases in which the table won't be rewritten and
> hence the indexes not rebuilt, ATPostAlterTypeParse() simply won't queue a
> AT_ReAddIndex command to rebuild index while preserving the storage.  That
> means both ATAddIndex() and DefineIndex can be freed of the duty of
> looking out for the "oldNode" case, because that case no longer exists.
> 
> Another main change is that inherited (!conislocal) constraints are now
> recognized by ATPostAlterTypeParse directly, instead of
> ATPostAlterTypeCleanup checking for them and skipping
> ATPostAlterTypeParse() as a whole for such constraints.  For one, I had to
> make that change to make the above-described approach work.  Also, doing
> that allowed to fix another bug whereby the comments of child constraints
> would go away when they're reconstructed.  Notice what happens on
> un-patched PG 11:
> 
> create table pp (a int, b text, unique (a, b), c varchar(64)) partition by
> list (a);
> create table pp1 partition of pp for values in (1);
> create table pp2 partition of pp for values in (2);
> alter table pp add constraint c_chk check (c <> '');
> comment on constraint c_chk ON pp is 'parent check constraint';
> comment on constraint c_chk ON pp1 is 'child check constraint 1';
> comment on constraint c_chk ON pp2 is 'child check constraint 2';
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> alter table pp alter c type varchar(64);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼─────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │
>  c_chk   │
> (3 rows)
> 
> The patch fixes that with some surgery of RebuildConstraintComment
> combined with aforementioned changes.  With the patch:
> 
> alter table pp alter c type varchar(64);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> alter table pp alter c type varchar(128);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> Also for index comments, but only in the case when indexes are not rebuilt.
> 
> comment on index pp_a_b_key is 'parent index';
> comment on index pp1_a_b_key is 'child index 1';
> comment on index pp2_a_b_key is 'child index 2';
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17280 │ child index 1
>  pp2_a_b_key │       17284 │ child index 2
>  pp_a_b_key  │       17271 │ parent index
> (3 rows)
> 
> -- no rewrite, indexes untouched, comments preserved
> alter table pp alter b type varchar(128);
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17280 │ child index 1
>  pp2_a_b_key │       17284 │ child index 2
>  pp_a_b_key  │       17271 │ parent index
> (3 rows)
> 
> -- table rewritten, indexes rebuild, child indexes' comments gone
> alter table pp alter b type varchar(64);
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17294 │
>  pp2_a_b_key │       17298 │
>  pp_a_b_key  │       17285 │ parent index
> (3 rows)
> 
> 
> I've also added tests for both the originally reported bug and the comment
> ones.
> 
> The patch applies to PG 11.

Per Alvaro's report, regression tests added weren't portable.  Fixed that
in the attached updated patch.

Thanks,
Amit

Attachment

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Amit Langote
Date:
On 2019/04/25 19:02, Amit Langote wrote:
> On 2019/04/25 11:21, Amit Langote wrote:
>> On 2019/04/25 8:27, Tom Lane wrote:
>>> BTW, I hadn't ever looked closely at what the index reuse code
>>> does, and now that I have, my heart just sinks.  I think that
>>> logic needs to be nuked from orbit.  RelationPreserveStorage was
>>> never meant to be abused in this way; I invite you to contemplate
>>> whether it's not a problem that it doesn't check the backend and
>>> nestLevel fields of PendingRelDelete entries before killing them.
>>> (In its originally-designed use for mapped rels at transaction end,
>>> that wasn't a problem, but I'm afraid that it may be one now.)
>>>
>>> The right way to do this IMO would be something closer to the
>>> heap-swap logic in cluster.c, where we exchange the relfilenodes
>>> of two live indexes, rather than what is happening now.  Or for
>>> that matter, do we really need to delete the old indexes at all?
>>
>> Yeah, we wouldn't need TryReuseIndex and subsequent
>> RelationPreserveStorage if we hadn't dropped the old indexes to begin
>> with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
>> TYPE is still incompatible (CheckIndexCompatible) and if it is, don't add
>> a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
>> recreate the index from scratch using an IndexStmt generated from the old
>> index definition.  I guess We can get rid of IndexStmt.oldNode too.
> 
> Thinking on this more and growing confident that we could indeed avoid
> drop index + recreate-it-while-preserving-storage, instead by just not
> doing anything when CheckIndexCompatible says the old index will be fine
> despite ALTER TYPE, but only if the table is not rewritten.  I gave this a
> try and came up with the attached patch.  It fixes the bug related to
> partitioned indexes (the originally reported one) and then some.
> 
> Basically, I aimed to rewrite the code in ATPostAlterTypeCleanup and
> ATPostAlterTypeParse such that we no longer have to rely on an
> implementation based on setting "oldNode" to preserve old indexes.  With
> the attached, for the cases in which the table won't be rewritten and
> hence the indexes not rebuilt, ATPostAlterTypeParse() simply won't queue a
> AT_ReAddIndex command to rebuild index while preserving the storage.  That
> means both ATAddIndex() and DefineIndex can be freed of the duty of
> looking out for the "oldNode" case, because that case no longer exists.
> 
> Another main change is that inherited (!conislocal) constraints are now
> recognized by ATPostAlterTypeParse directly, instead of
> ATPostAlterTypeCleanup checking for them and skipping
> ATPostAlterTypeParse() as a whole for such constraints.  For one, I had to
> make that change to make the above-described approach work.  Also, doing
> that allowed to fix another bug whereby the comments of child constraints
> would go away when they're reconstructed.  Notice what happens on
> un-patched PG 11:
> 
> create table pp (a int, b text, unique (a, b), c varchar(64)) partition by
> list (a);
> create table pp1 partition of pp for values in (1);
> create table pp2 partition of pp for values in (2);
> alter table pp add constraint c_chk check (c <> '');
> comment on constraint c_chk ON pp is 'parent check constraint';
> comment on constraint c_chk ON pp1 is 'child check constraint 1';
> comment on constraint c_chk ON pp2 is 'child check constraint 2';
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> alter table pp alter c type varchar(64);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼─────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │
>  c_chk   │
> (3 rows)
> 
> The patch fixes that with some surgery of RebuildConstraintComment
> combined with aforementioned changes.  With the patch:
> 
> alter table pp alter c type varchar(64);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> alter table pp alter c type varchar(128);
> 
> select conname, obj_description(oid, 'pg_constraint') from pg_constraint
> where conname = 'c_chk';
>  conname │     obj_description
> ─────────┼──────────────────────────
>  c_chk   │ parent check constraint
>  c_chk   │ child check constraint 1
>  c_chk   │ child check constraint 2
> (3 rows)
> 
> Also for index comments, but only in the case when indexes are not rebuilt.
> 
> comment on index pp_a_b_key is 'parent index';
> comment on index pp1_a_b_key is 'child index 1';
> comment on index pp2_a_b_key is 'child index 2';
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17280 │ child index 1
>  pp2_a_b_key │       17284 │ child index 2
>  pp_a_b_key  │       17271 │ parent index
> (3 rows)
> 
> -- no rewrite, indexes untouched, comments preserved
> alter table pp alter b type varchar(128);
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17280 │ child index 1
>  pp2_a_b_key │       17284 │ child index 2
>  pp_a_b_key  │       17271 │ parent index
> (3 rows)
> 
> -- table rewritten, indexes rebuild, child indexes' comments gone
> alter table pp alter b type varchar(64);
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
> where relname like 'pp%key';
>    relname   │ relfilenode │ obj_description
> ─────────────┼─────────────┼─────────────────
>  pp1_a_b_key │       17294 │
>  pp2_a_b_key │       17298 │
>  pp_a_b_key  │       17285 │ parent index
> (3 rows)
> 
> 
> I've also added tests for both the originally reported bug and the comment
> ones.
> 
> The patch applies to PG 11.

Per Alvaro's report, regression tests added weren't portable.  Fixed that
in the attached updated patch.

Thanks,
Amit

Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Alvaro Herrera
Date:
Hi,

Please trim the quoted text in your reply.

On 2019-Apr-26, Amit Langote wrote:

> Per Alvaro's report, regression tests added weren't portable.  Fixed that
> in the attached updated patch.

Um, this one doesn't apply because of yesterday's 87259588d0ab.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping apartition table

From
Alvaro Herrera
Date:
Hi,

Please trim the quoted text in your reply.

On 2019-Apr-26, Amit Langote wrote:

> Per Alvaro's report, regression tests added weren't portable.  Fixed that
> in the attached updated patch.

Um, this one doesn't apply because of yesterday's 87259588d0ab.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Um, this one doesn't apply because of yesterday's 87259588d0ab.

Before we spend too much time on minutiae, we should ask ourselves whether
this patch is even going in the right direction.  I'm not sure.

One point is that if we simply adopt the old index as-is, we won't see
any updates in its metadata.  An example is that if we have an index
on a varchar(10) column, and we alter the column to varchar(12),
the current behavior is to generate a new index that agrees with that:

regression=# create table pp(f1 varchar(10) unique);
CREATE TABLE
regression=# \d pp_f1_key
              Index "public.pp_f1_key"
 Column |         Type          | Key? | Definition 
--------+-----------------------+------+------------
 f1     | character varying(10) | yes  | f1
unique, btree, for table "public.pp"

regression=# alter table pp alter column f1 type varchar(12);
ALTER TABLE
regression=# \d pp_f1_key
              Index "public.pp_f1_key"
 Column |         Type          | Key? | Definition 
--------+-----------------------+------+------------
 f1     | character varying(12) | yes  | f1
unique, btree, for table "public.pp"

With this patch, I believe, the index column will still claim to be
varchar(10).  Is that OK?  It might not actually break anything
right now, but at the very least it's likely to be confusing.
Also, it'd essentially render the declared types/typmods of index
columns untrustworthy, which seems like something that would come
back to bite us.

            regards, tom lane



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Um, this one doesn't apply because of yesterday's 87259588d0ab.

Before we spend too much time on minutiae, we should ask ourselves whether
this patch is even going in the right direction.  I'm not sure.

One point is that if we simply adopt the old index as-is, we won't see
any updates in its metadata.  An example is that if we have an index
on a varchar(10) column, and we alter the column to varchar(12),
the current behavior is to generate a new index that agrees with that:

regression=# create table pp(f1 varchar(10) unique);
CREATE TABLE
regression=# \d pp_f1_key
              Index "public.pp_f1_key"
 Column |         Type          | Key? | Definition 
--------+-----------------------+------+------------
 f1     | character varying(10) | yes  | f1
unique, btree, for table "public.pp"

regression=# alter table pp alter column f1 type varchar(12);
ALTER TABLE
regression=# \d pp_f1_key
              Index "public.pp_f1_key"
 Column |         Type          | Key? | Definition 
--------+-----------------------+------+------------
 f1     | character varying(12) | yes  | f1
unique, btree, for table "public.pp"

With this patch, I believe, the index column will still claim to be
varchar(10).  Is that OK?  It might not actually break anything
right now, but at the very least it's likely to be confusing.
Also, it'd essentially render the declared types/typmods of index
columns untrustworthy, which seems like something that would come
back to bite us.

            regards, tom lane



I went ahead and pushed the stopgap patches, along with regression tests
based on yours.  The tests show the current (i.e. wrong) behavior for
index comment and relfilenode reuse.  I think that whenever we fix that,
we can just adjust the expected output instead of adding more tests.

            regards, tom lane



I went ahead and pushed the stopgap patches, along with regression tests
based on yours.  The tests show the current (i.e. wrong) behavior for
index comment and relfilenode reuse.  I think that whenever we fix that,
we can just adjust the expected output instead of adding more tests.

            regards, tom lane



On 2019/04/27 3:57, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Um, this one doesn't apply because of yesterday's 87259588d0ab.
> 
> Before we spend too much time on minutiae, we should ask ourselves whether
> this patch is even going in the right direction.  I'm not sure.
> 
> One point is that if we simply adopt the old index as-is, we won't see
> any updates in its metadata.  An example is that if we have an index
> on a varchar(10) column, and we alter the column to varchar(12),
> the current behavior is to generate a new index that agrees with that:
> 
> regression=# create table pp(f1 varchar(10) unique);
> CREATE TABLE
> regression=# \d pp_f1_key
>               Index "public.pp_f1_key"
>  Column |         Type          | Key? | Definition 
> --------+-----------------------+------+------------
>  f1     | character varying(10) | yes  | f1
> unique, btree, for table "public.pp"
> 
> regression=# alter table pp alter column f1 type varchar(12);
> ALTER TABLE
> regression=# \d pp_f1_key
>               Index "public.pp_f1_key"
>  Column |         Type          | Key? | Definition 
> --------+-----------------------+------+------------
>  f1     | character varying(12) | yes  | f1
> unique, btree, for table "public.pp"
> 
> With this patch, I believe, the index column will still claim to be
> varchar(10).

You're right, that's what happens.

> Is that OK?  It might not actually break anything
> right now, but at the very least it's likely to be confusing.
> Also, it'd essentially render the declared types/typmods of index
> columns untrustworthy, which seems like something that would come
> back to bite us.

That's definitely misleading.

Still, I think it'd be nice if we didn't have to do full-blown
DefineIndex() in this case if only to update the pg_attribute tuples of
the index relation.  Maybe we could update them directly in the ALTER
COLUMN TYPE's code path?

Thanks,
Amit




On 2019/04/27 3:57, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Um, this one doesn't apply because of yesterday's 87259588d0ab.
> 
> Before we spend too much time on minutiae, we should ask ourselves whether
> this patch is even going in the right direction.  I'm not sure.
> 
> One point is that if we simply adopt the old index as-is, we won't see
> any updates in its metadata.  An example is that if we have an index
> on a varchar(10) column, and we alter the column to varchar(12),
> the current behavior is to generate a new index that agrees with that:
> 
> regression=# create table pp(f1 varchar(10) unique);
> CREATE TABLE
> regression=# \d pp_f1_key
>               Index "public.pp_f1_key"
>  Column |         Type          | Key? | Definition 
> --------+-----------------------+------+------------
>  f1     | character varying(10) | yes  | f1
> unique, btree, for table "public.pp"
> 
> regression=# alter table pp alter column f1 type varchar(12);
> ALTER TABLE
> regression=# \d pp_f1_key
>               Index "public.pp_f1_key"
>  Column |         Type          | Key? | Definition 
> --------+-----------------------+------+------------
>  f1     | character varying(12) | yes  | f1
> unique, btree, for table "public.pp"
> 
> With this patch, I believe, the index column will still claim to be
> varchar(10).

You're right, that's what happens.

> Is that OK?  It might not actually break anything
> right now, but at the very least it's likely to be confusing.
> Also, it'd essentially render the declared types/typmods of index
> columns untrustworthy, which seems like something that would come
> back to bite us.

That's definitely misleading.

Still, I think it'd be nice if we didn't have to do full-blown
DefineIndex() in this case if only to update the pg_attribute tuples of
the index relation.  Maybe we could update them directly in the ALTER
COLUMN TYPE's code path?

Thanks,
Amit