Thread: BUG #15528: on v11.0 version still get error "ERROR: catalog ismissing 1 attribute(s) for relid 6855092"

The following bug has been logged on the website:

Bug reference:      15528
Logged by:          zhou xiaowei
Email address:      110876189@qq.com
PostgreSQL version: 11.0
Operating system:   Linux 3.10.0-327.36.58.4.x86_64
Description:

Hi,my friend.
I dump out 27969 tables from PG v9.6.8 one by one, use "pg_dump -t xxx -f
xxx.sql". Then start 300 threads to parallel import these .sql  into
v11.0.
after data imported completely, parallel create constraint for every
table.
when all completely, execute "vacuum full",I got error "ERROR:  catalog is
missing 1 attribute(s) for relid 6855092", which occured on 11beta2 solved
by BUG #15309.

Today, I want to make sure which step cause this issue, I only start 300
threads to parallel import these .sql, not create constraint for them.
when execute "vacuum full", I got error "ERROR:  duplicate key value
violates unique constraint "pg_class_relname_nsp_index""
the details:
ERROR:  duplicate key value violates unique constraint
"pg_class_relname_nsp_index"
DETAIL:  Key (relname, relnamespace)=(pg_toast_17589_index, 99) already
exists.
STATEMENT:  vacuum full;

after "vacuum full" error,I got different info about pg_toast_17589_index
from pg_class:
ncc_1126=# select relname from pg_class where relname =
'pg_toast_17589_index';
       relname        
----------------------
 pg_toast_17589_index
 pg_toast_17589_index
(2 rows)

ncc_1126=# select * from pg_class where relname = 'pg_toast_17589_index';
       relname        | relnamespace | reltype | reloftype | relowner |
relam | relfilenode | reltablespace | relpages | reltuples | relallvisible |
reltoastrelid | relhasindex | relisshared | relpersistence | relkind |
relnatts | relchecks | relhasoids | relhasrules | relhastriggers |
relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated |
relreplident | relispartition | relrewrite | relfrozenxid | relminmxid |
relacl | reloptions | relpartbound 

----------------------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+------------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
 pg_toast_17589_index |           99 |       0 |         0 |    16396 |
403 |     1739800 |             0 |        1 |         0 |             0 |
          0 | f           | f           | p              | i       |
2 |         0 | f          | f           | f              | f              |
f              | f                   | t              | n            | f
         |          0 |            0 |          0 |        |            | 
 pg_toast_178529      |           99 |  178537 |         0 |    16396 |
0 |     1807123 |             0 |        0 |         0 |             0 |
        0 | t           | f           | p              | t       |        3
|         0 | f          | f           | f              | f              | f
             | f                   | t              | n            | f
       |          0 |        96029 |          1 |        |            | 
(2 rows)

two querys got different content for column 'relname'. 

.sql's content as follow:
--
-- PostgreSQL database dump
--

-- Dumped from database version 9.6.8
-- Dumped by pg_dump version 9.6.8

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

SET default_with_oids = false;

--
-- Name: hrtm_myquestion; Type: TABLE; Schema: ncc_gh; Owner: ncc_gh
--

CREATE TABLE ncc_gh.hrtm_myquestion (
xxxx
);

ALTER TABLE ncc_gh.hrtm_myquestion OWNER TO ncc_gh;

--
-- Data for Name: hrtm_myquestion; Type: TABLE DATA; Schema: ncc_gh; Owner:
ncc_gh
--

COPY ncc_gh.hrtm_myquestion (......
\.
--
-- PostgreSQL database dump complete
--

what happened? is this a bug?
To get more details for this issue, how can I do it?


=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> To get more details for this issue, how can I do it?

It sounds like you have a script for causing this problem,
would you share that?

            regards, tom lane


all script same as attached .sql file.these .sql are customer's data,I'm forbided to get out.

---Original---
From: "Tom Lane"<tgl@sss.pgh.pa.us>
Date: Fri, Nov 30, 2018 23:01 PM
To: "110876189"<110876189@qq.com>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: BUG #15528: on v11.0 version still get error "ERROR: catalog is missing 1 attribute(s) for relid 6855092"

PG Bug reporting form <noreply@postgresql.org> writes:
> To get more details for this issue, how can I do it?

It sounds like you have a script for causing this problem,
would you share that?

regards, tom lane
On Fri, Nov 30, 2018 at 12:17 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> I dump out 27969 tables from PG v9.6.8 one by one, use "pg_dump -t xxx -f
> xxx.sql". Then start 300 threads to parallel import these .sql  into
> v11.0.
> after data imported completely, parallel create constraint for every
> table.
> when all completely, execute "vacuum full",I got error "ERROR:  catalog is
> missing 1 attribute(s) for relid 6855092", which occured on 11beta2 solved
> by BUG #15309.

BUG #15309 could potentially have quite a few symptoms, including this one.

> Today, I want to make sure which step cause this issue, I only start 300
> threads to parallel import these .sql, not create constraint for them.

It seems like you're asking about how to reproduce corruption with BUG
#15309 on a pre-release version of Postgres 11 (a v11 without the fix
-- commit 9353d94a). I believe that you are not actually reporting a
new bug. Is this understanding correct? Is there a new bug?

Either way, you might find it interesting to see the result of this
query, which relies on the v11 amcheck extension (so "CREATE EXTENSION
amcheck" first):

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If that doesn't raise an error, you could try the same query, but
remove "AND n.nspname = 'pg_catalog'". That will take considerably
longer, but probably won't be intolerable.
-- 
Peter Geoghegan


On Sat, Dec 01, 2018 at 10:49:41AM +0800, 小威 wrote:
> All script same as attached .sql file.these .sql are customer's
> data,I'm forbided to get out.

A self-contained test case is most welcome in those cases, particularly
for such class of bugs.  It could be possible to filter out any
sensitive data with a similar schema, or fake data.  What you are
proposing upthread has very limited information, and the schema of the
relation(s) involved can become very important.
--
Michael

Attachment
the database no inclulde amcheck extension.so I don't care how to product BUG#15039 by amcheck.I only worry my issue has some relation with it.next week,I'll continue to work on my issue.do you have some suggest to help quickly demarcate it?I have no idea now,except goto do many test.

---Original---
From: "Peter Geoghegan"<pg@bowt.ie>
Date: Sat, Dec 1, 2018 11:25 AM
To: "PostgreSQL mailing lists"<pgsql-bugs@lists.postgresql.org>;"110876189"<110876189@qq.com>;
Subject: Re: BUG #15528: on v11.0 version still get error "ERROR: catalog is missing 1 attribute(s) for relid 6855092"

On Fri, Nov 30, 2018 at 12:17 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> I dump out 27969 tables from PG v9.6.8 one by one, use "pg_dump -t xxx -f
> xxx.sql". Then start 300 threads to parallel import these .sql  into
> v11.0.
> after data imported completely, parallel create constraint for every
> table.
> when all completely, execute "vacuum full",I got error "ERROR:  catalog is
> missing 1 attribute(s) for relid 6855092", which occured on 11beta2 solved
> by BUG #15309.

BUG #15309 could potentially have quite a few symptoms, including this one.

> Today, I want to make sure which step cause this issue, I only start 300
> threads to parallel import these .sql, not create constraint for them.

It seems like you're asking about how to reproduce corruption with BUG
#15309 on a pre-release version of Postgres 11 (a v11 without the fix
-- commit 9353d94a). I believe that you are not actually reporting a
new bug. Is this understanding correct? Is there a new bug?

Either way, you might find it interesting to see the result of this
query, which relies on the v11 amcheck extension (so "CREATE EXTENSION
amcheck" first):

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true),
c.relname,
c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

If that doesn't raise an error, you could try the same query, but
remove "AND n.nspname = 'pg_catalog'". That will take considerably
longer, but probably won't be intolerable.
--
Peter Geoghegan
., s
On Fri, Nov 30, 2018 at 8:52 PM 小威 <zxw110876189@qq.com> wrote:
> the database no inclulde amcheck extension.so I don't care how to product BUG#15039 by amcheck.I only worry my issue
hassome relation with it.next week,I'll continue to work on my issue.do you have some suggest to help quickly demarcate
it?Ihave no idea now,except goto do many test. 

Sorry, I do not understand your questions.

You have not told us if you believe there is a new bug or not. Do you?

What you have shown at least looks very much like BUG#15039. It also
seems like your problem is probably down to BUG#15039 because you said
you were using a beta version of 11 at one point, and because your
test case involved a VACUUM FULL of a system catalog where parallel
CREATE INDEX is used (i.e. the specific factors required for BUG#15039
to cause corruption).

What is your goal?

--
Peter Geoghegan


I'm working on 11.0, no beta version.and  my db include BUG#15039 patch,but still occur this issue. actually,I don't known what happened.I only sure there still has a bug in my case.

---Original---
From: "Peter Geoghegan"<pg@bowt.ie>
Date: Sat, Dec 1, 2018 13:03 PM
To: "zxw110876189"<zxw110876189@qq.com>;
Cc: "PostgreSQL mailing lists"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: BUG #15528: on v11.0 version still get error "ERROR: catalog is missing 1 attribute(s) for relid 6855092"

., s
On Fri, Nov 30, 2018 at 8:52 PM 小威 <zxw110876189@qq.com> wrote:
> the database no inclulde amcheck extension.so I don't care how to product BUG#15039 by amcheck.I only worry my issue has some relation with it.next week,I'll continue to work on my issue.do you have some suggest to help quickly demarcate it?I have no idea now,except goto do many test.

Sorry, I do not understand your questions.

You have not told us if you believe there is a new bug or not. Do you?

What you have shown at least looks very much like BUG#15039. It also
seems like your problem is probably down to BUG#15039 because you said
you were using a beta version of 11 at one point, and because your
test case involved a VACUUM FULL of a system catalog where parallel
CREATE INDEX is used (i.e. the specific factors required for BUG#15039
to cause corruption).

What is your goal?

--
Peter Geoghegan
On Fri, Nov 30, 2018 at 9:19 PM 小威 <zxw110876189@qq.com> wrote:
> I'm working on 11.0, no beta version.and  my db include BUG#15039 patch,but still occur this issue. actually,I don't
knownwhat happened.I only sure there still has a bug in my case. 

I think that you'll probably need to provide a script to reproduce the
issue. Can you make the data anonymous before sending? Alternatively,
you can send it off-list. I promise that I will treat the application
specific parts as confidential.

Using amcheck would let us know if the index is completely corrupt
(e.g. tuples are in the wrong order), or if the duplicate is garbage
that has somehow become visible again, a symptom of bugs which is also
possible. It's easy to use amcheck. Maybe there is another bug that is
similar to #15309, but we need to establish if the index is truly
corrupt, and what it looks like. It would also be good to use
contrib/pageinspect to look at any blocks that amcheck shows in an
error message.

--
Peter Geoghegan


On Fri, Nov 30, 2018 at 9:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Using amcheck would let us know if the index is completely corrupt
> (e.g. tuples are in the wrong order), or if the duplicate is garbage
> that has somehow become visible again, a symptom of bugs which is also
> possible. It's easy to use amcheck. Maybe there is another bug that is
> similar to #15309, but we need to establish if the index is truly
> corrupt, and what it looks like. It would also be good to use
> contrib/pageinspect to look at any blocks that amcheck shows in an
> error message.

Also, can you confirm that the problem goes away when
max_parallel_maintenance_workers is set to 0? That's what we saw with
bug #15309.

-- 
Peter Geoghegan


when set max_parallel_maintenance_workers = 0,no issue(vacuum full ok)now.before it my 2 times test were all failed.now I set it to 2 again and set log_min_message = debug5,log_min_error_statement=debug5,try to get more details.

---Original---
From: "Peter Geoghegan"<pg@bowt.ie>
Date: Sat, Dec 1, 2018 14:14 PM
To: "zxw110876189"<zxw110876189@qq.com>;
Cc: "PostgreSQL mailing lists"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: BUG #15528: on v11.0 version still get error "ERROR: catalog is missing 1 attribute(s) for relid 6855092"

On Fri, Nov 30, 2018 at 9:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Using amcheck would let us know if the index is completely corrupt
> (e.g. tuples are in the wrong order), or if the duplicate is garbage
> that has somehow become visible again, a symptom of bugs which is also
> possible. It's easy to use amcheck. Maybe there is another bug that is
> similar to #15309, but we need to establish if the index is truly
> corrupt, and what it looks like. It would also be good to use
> contrib/pageinspect to look at any blocks that amcheck shows in an
> error message.

Also, can you confirm that the problem goes away when
max_parallel_maintenance_workers is set to 0? That's what we saw with
bug #15309.

--
Peter Geoghegan
On Mon, Dec 3, 2018 at 6:48 PM 小威 <zxw110876189@qq.com> wrote:
> when set max_parallel_maintenance_workers = 0,no issue(vacuum full ok)now.before it my 2 times test were all
failed.nowI set it to 2 again and set log_min_message = debug5,log_min_error_statement=debug5,try to get more details. 

That would be helpful. I also encourage you to install
contrib/amcheck, and perform the verification I asked about. amcheck
is not a third party extension -- it's maintained as part of the
PostgreSQL codebase. You probably just have to install a separate OS
package, and then you can "CREATE EXTENSION amcheck" from psql. The
package that you need is probably called something like
"postgresql-contrib-11", and will almost certainly be available from
the same package repository that the server (e.g. "postgresql-11")
came from. It should be easy for you.

Thank you
--
Peter Geoghegan