Thread: invalid non-zero objectSubId for object class

invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:
On a 12.3 AWS RDS instance, I get the following error when trying to drop either of two tables:

dev=> drop table current_flight;
ERROR:  invalid non-zero objectSubId for object class 297108
dev=> drop table flight;
ERROR:  invalid non-zero objectSubId for object class 297108

I can create and drop new tables, but these two I cannot drop.

Both tables were at one point replicated using pglogical, but I've removed them from their replication sets and nothing depends on them that I can see.  The replication of other tables is still proceeding normally.

Anyone have any thoughts on this?  The db was also upgraded seemingly successfully from 12.2 a couple of days ago and otherwise seems to be working fine.

Thanks for any pointers!

-Michel


Re: invalid non-zero objectSubId for object class

From
Tom Lane
Date:
Michel Pelletier <pelletier.michel@gmail.com> writes:
> On a 12.3 AWS RDS instance, I get the following error when trying to drop
> either of two tables:

> dev=> drop table current_flight;
> ERROR:  invalid non-zero objectSubId for object class 297108
> dev=> drop table flight;
> ERROR:  invalid non-zero objectSubId for object class 297108

This looks like corrupt data in pg_depend, specifically an entry or
entries with classid or refclassid = 297108, which should not happen
(the classid should always be the OID of one of a short list of system
catalogs).  You could try poking around in pg_depend to see if you
can identify any obviously-bogus rows.

No idea how it got that way.  Have you had any database crashes or the
like?

            regards, tom lane



Re: invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:


On Thu, Jul 9, 2020 at 4:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michel Pelletier <pelletier.michel@gmail.com> writes:
> On a 12.3 AWS RDS instance, I get the following error when trying to drop
> either of two tables:

> dev=> drop table current_flight;
> ERROR:  invalid non-zero objectSubId for object class 297108
> dev=> drop table flight;
> ERROR:  invalid non-zero objectSubId for object class 297108

This looks like corrupt data in pg_depend, specifically an entry or
entries with classid or refclassid = 297108, which should not happen
(the classid should always be the OID of one of a short list of system
catalogs).  You could try poking around in pg_depend to see if you
can identify any obviously-bogus rows.

Hi Tom, thanks for getting back so quick:

I don't seem to have either:

dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

I'm not sure what a bogus row would look like.
 
No idea how it got that way.  Have you had any database crashes or the
like?

No crashes, but a restart and one upgrade.  On Sunday and Monday, at exactly UTC midnight  we run a cron job to create a new partition for an unrelated table and attach it to a pglogical replication set.   I updated the procedure on saturday to create two new partitions for two unrelated tables, and that somehow caused an issue on 12.2 / pglogical 2.3.0 that caused an error, but not a crash.  What's puzzling is that the two partition creation still worked, and replicated to all downstream consumers, but from that point on replication ceased and consumers logged the error in the link below:


This spooled up changes on the RDS primary until it filled up the storage. On sunday we resized the instance and restarted, and reinitialized the pglogical setup which restarted replication.  On monday the error happened again at midnight, and we restarted replication and upgraded to 12.3/2.3.1 on tuesday as recommended in the issue.  It has thus run till now without error and has been replicating nicely so have assumed that issue is fixed.

Neither of these two tables are involved in the midnight job, they're no longer used and I was hoping to clean them up.  I guess my concern should be, is there additional possible corruption I can check for?  And if that's ok is there some manual intervention I can do to drop the tables?

Thanks,

-Michel
 

                        regards, tom lane

Re: invalid non-zero objectSubId for object class

From
Alvaro Herrera
Date:
On 2020-Jul-09, Michel Pelletier wrote:

> Hi Tom, thanks for getting back so quick:
> 
> I don't seem to have either:
> 
> dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
> ---------+-------+----------+------------+----------+-------------+---------
> (0 rows)

Maybe try looking in pg_shdepend?  Or pg_description?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: invalid non-zero objectSubId for object class

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jul-09, Michel Pelletier wrote:
>> I don't seem to have either:
>>
>> dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
>> classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
>> ---------+-------+----------+------------+----------+-------------+---------
>> (0 rows)

> Maybe try looking in pg_shdepend?  Or pg_description?

I think it has to be pg_depend, because there'd be no reason to
be calling getObjectClass() on a entry from one of those other
catalogs during a DROP TABLE.

So my thoughts are now turning towards there being corruption
in pg_depend's indexes as well as the heap.  Try doing the above
query after turning enable_indexscan and enable_bitmapscan off.

            regards, tom lane



Re: invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:


On Thu, Jul 9, 2020 at 5:26 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jul-09, Michel Pelletier wrote:

> Hi Tom, thanks for getting back so quick:
>
> I don't seem to have either:
>
> dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
> ---------+-------+----------+------------+----------+-------------+---------
> (0 rows)

Maybe try looking in pg_shdepend?  Or pg_description?

Hi Álvaro,

It doesn't occur in either of those either
dev=> select * from pg_description \g | grep 297108
dev=> select * from pg_shdepend \g | grep 297108

-Michel
 


--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:


On Thu, Jul 9, 2020 at 5:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jul-09, Michel Pelletier wrote:
>> I don't seem to have either:
>>
>> dev=> select * from pg_depend where classid = 297108 or refclassid = 297108;
>> classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
>> ---------+-------+----------+------------+----------+-------------+---------
>> (0 rows)

> Maybe try looking in pg_shdepend?  Or pg_description?

I think it has to be pg_depend, because there'd be no reason to
be calling getObjectClass() on a entry from one of those other
catalogs during a DROP TABLE.

So my thoughts are now turning towards there being corruption
in pg_depend's indexes as well as the heap.  Try doing the above
query after turning enable_indexscan and enable_bitmapscan off.

Unfortunately still no refs there:

dev=> set enable_indexscan = off;
SET
dev=> set enable_bitmapscan = off;
SET
dev=> select * from pg_depend \g | grep 297108
dev=> select * from pg_depend where classid = 297108::oid or refclassid = 297108::oid;
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
(0 rows)

dev=> explain select * from pg_depend where classid = 297108::oid or refclassid = 297108::oid;
                              QUERY PLAN                              
-----------------------------------------------------------------------
 Seq Scan on pg_depend  (cost=0.00..336.87 rows=2 width=25)
   Filter: ((classid = '297108'::oid) OR (refclassid = '297108'::oid))
(2 rows)

-Michel
 

                        regards, tom lane

Re: invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:


I restored a snapshot and I can drop the tables there, so we'll likely proceed  to swap the replicas over tomorrow.  I have this corrupted instance i can continue to debug on if necessary.  There seem to be some other issues now that we're investigating, like a max(timestamp) query on the old instance is doing a seqscan on all the partitions instead of using an index that's there, the new instance from the snapshot correctly uses the index, so I suspect further corruption.

Thanks for your help!

-Michel

Re: invalid non-zero objectSubId for object class

From
Alvaro Herrera
Date:
On 2020-Jul-09, Michel Pelletier wrote:

> I restored a snapshot and I can drop the tables there, so we'll likely
> proceed  to swap the replicas over tomorrow.  I have this corrupted
> instance i can continue to debug on if necessary.  There seem to be some
> other issues now that we're investigating, like a max(timestamp) query on
> the old instance is doing a seqscan on all the partitions instead of using
> an index that's there, the new instance from the snapshot correctly uses
> the index, so I suspect further corruption.

This being RDS makes it very difficult to debug.  For example I would
suggest a debugger to find the problem, but you don't have access to
that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: invalid non-zero objectSubId for object class

From
Michel Pelletier
Date:
On Fri, Jul 10, 2020 at 5:37 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jul-09, Michel Pelletier wrote:

> I restored a snapshot and I can drop the tables there, so we'll likely
> proceed  to swap the replicas over tomorrow.  I have this corrupted
> instance i can continue to debug on if necessary.  There seem to be some
> other issues now that we're investigating, like a max(timestamp) query on
> the old instance is doing a seqscan on all the partitions instead of using
> an index that's there, the new instance from the snapshot correctly uses
> the index, so I suspect further corruption.

This being RDS makes it very difficult to debug.  For example I would
suggest a debugger to find the problem, but you don't have access to
that.

Yes agreed, we'll probably terminate the instance today if there's no further lines of investigation.  I know this isn't their forum, but I don't get why AWS pins all their extension versions to the postgres version, If you use 12.2 on rds, you get pglogical 2.3.0.  If you want 2.3.1, you have to upgrade to pg 12.3.  I don't get that logic, shouldn't 2.3.1 be the current version on both?  What's the point of them supporting 12.2 if they ship an extension version that 2ndQ says you shouldn't use?

Anyway, just musing out loud, don't expect any kind of answer to that here. :)

Thanks again everyone,

-Michel



--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: invalid non-zero objectSubId for object class

From
Adrian Klaver
Date:
On 7/10/20 8:06 AM, Michel Pelletier wrote:
> On Fri, Jul 10, 2020 at 5:37 AM Alvaro Herrera <alvherre@2ndquadrant.com 
> <mailto:alvherre@2ndquadrant.com>> wrote:
> 
>     On 2020-Jul-09, Michel Pelletier wrote:
> 
>      > I restored a snapshot and I can drop the tables there, so we'll
>     likely
>      > proceed  to swap the replicas over tomorrow.  I have this corrupted
>      > instance i can continue to debug on if necessary.  There seem to
>     be some
>      > other issues now that we're investigating, like a max(timestamp)
>     query on
>      > the old instance is doing a seqscan on all the partitions instead
>     of using
>      > an index that's there, the new instance from the snapshot
>     correctly uses
>      > the index, so I suspect further corruption.
> 
>     This being RDS makes it very difficult to debug.  For example I would
>     suggest a debugger to find the problem, but you don't have access to
>     that.
> 
> 
> Yes agreed, we'll probably terminate the instance today if there's no 
> further lines of investigation.  I know this isn't their forum, but I 
> don't get why AWS pins all their extension versions to the postgres 
> version, If you use 12.2 on rds, you get pglogical 2.3.0.  If you want 
> 2.3.1, you have to upgrade to pg 12.3.  I don't get that logic, 
> shouldn't 2.3.1 be the current version on both?  What's the point of 
> them supporting 12.2 if they ship an extension version that 2ndQ says 
> you shouldn't use?
> 
> Anyway, just musing out loud, don't expect any kind of answer to that 
> here. :)

Already asked and not really answered:

https://www.postgresql.org/message-id/A51655C0-BDD0-4F52-8DC5-2672AADA32D0%40gmail.com

> 
> Thanks again everyone,
> 
> -Michel
> 
> 
> 
>     -- 
>     Álvaro Herrera https://www.2ndQuadrant.com/
>     PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com