Thread: Btree or not btree? That is the question

Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

04 June 2012, 13:30:03

We have a 8.3.18 system (yes, the same one from the previous
thread, finally upgraded!) that gave us this error yesterday:

ERROR:  index "pg_class_oid_index" is not a btree

It seems like this happened from just a normal SELECT query, and did
not occur again (and this is a very busy system) The pg_class table
does gets vacuumed and reindexed often. All appeared fine when I
did some manual investigation.

Since this only happened this one time, can I chalk it up to
some random rare race condition having to do with a reindex?
Or is there anything else I should be looking for or can
proactively do? No other system catalog problems have been seen
before it or since.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Tom Lane

Date:

04 June 2012, 15:10:04

Greg Sabino Mullane <greg@endpoint.com> writes:
> We have a 8.3.18 system (yes, the same one from the previous 
> thread, finally upgraded!) that gave us this error yesterday:

> ERROR:  index "pg_class_oid_index" is not a btree

That means you got bogus data while reading the metapage.
I'm beginning to wonder about the hardware on this server ...
        regards, tom lane

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

06 June 2012, 17:10:13

On Mon, Jun 04, 2012 at 02:09:44PM -0400, Tom Lane wrote:
> Greg Sabino Mullane <greg@endpoint.com> writes:
> > We have a 8.3.18 system (yes, the same one from the previous
> > thread, finally upgraded!) that gave us this error yesterday:
>
> > ERROR:  index "pg_class_oid_index" is not a btree
>
> That means you got bogus data while reading the metapage.
> I'm beginning to wonder about the hardware on this server ...

Thanks for the reply. Me too. This is on a cluster[1], so we
do have the luxury of testing the hardware on each box and
shuffling things around. I'm also thinking we need to get away
from the near-constant updating of pg_class, just as a good
measure.

[1] Bruce, yet another usage! :)

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

21 June 2012, 14:46:09

> > ERROR:  index "pg_class_oid_index" is not a btree
>
> That means you got bogus data while reading the metapage.
> I'm beginning to wonder about the hardware on this server ...

This happened again, and this time I went back through
the logs and found that it is always the exact same query causing
the issue. I also found it occuring on different servers,
which rules out RAM anyway (still shared disk, so those are suspect).
This query also sometimes gives errors like this:

ERROR:  could not read block 3 of relation 1663/1554846571/3925298284:  read only 0 of 8192 bytes

However, the final number changes: these are invariably temporary relations.
The query itself is a GROUP BY over a large view and the explain plan is
107 rows, with nothing esoteric about it. Most of the tables used are
fairly common ones. I'm trying to duplicate on a non-production box, without
success so far, and I'm loath to run it on production as it sometimes
causes multiple backends to freeze up and requires a forceful restart.

Any ideas on how to carefully debug this? There are a couple of quicksorts
when I explain analyze on a non-prod system, which I am guessing where
the temp tables come from (work_mem is 24MB). I'm not sure I understand
what could be causing both the 'read 0' and btree errors for the
same query - bad blocks on disk for one of the underlying tables?
I'll work next on checking each of the tables the view is using.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Tom Lane

Date:

21 June 2012, 15:37:35

Greg Sabino Mullane <greg@endpoint.com> writes:
>>> ERROR:  index "pg_class_oid_index" is not a btree
>> That means you got bogus data while reading the metapage.
>> I'm beginning to wonder about the hardware on this server ...

> This happened again, and this time I went back through 
> the logs and found that it is always the exact same query causing 
> the issue. I also found it occuring on different servers, 
> which rules out RAM anyway (still shared disk, so those are suspect).
> This query also sometimes gives errors like this:
> ERROR:  could not read block 3 of relation 1663/1554846571/3925298284: 
>   read only 0 of 8192 bytes
> However, the final number changes: these are invariably temporary relations. 

Oh really ... okay, maybe it is a software problem then.

> Any ideas on how to carefully debug this? There are a couple of quicksorts 
> when I explain analyze on a non-prod system, which I am guessing where 
> the temp tables come from (work_mem is 24MB).

No, that error message is complaining about an attempt to read an
actual, named, relation (could be temp, no way to be sure from this
info).  A sort might generate temp files but those don't have that
kind of name.

> I'm not sure I understand 
> what could be causing both the 'read 0' and btree errors for the 
> same query - bad blocks on disk for one of the underlying tables?

The "could not read" thing looks more like an attempt to fetch an
invalid TID; you could get such an error for instance if you had an
index that referenced a tuple in block 3, but the table on disk isn't
that long.  So one possible theory for what's happening here is that
once in a while we get confused about which shared buffer holds which
disk block, and either find the wrong block entirely when looking for
pg_class_oid_index (the first case) or grab a page of the wrong index
in the second case.  This theory would be more plausible if you're
wrong about the second-case tables being temp, though, because if they
are temp then their indexes would be kept in local buffers not shared
buffers, making it much harder to believe in a single bug causing both
symptoms.

One possible mechanism for confusion of that sort would be if the
spinlock code wasn't quite right, or the compiler was incorrectly
moving loads/stores into or out of locked sections.  So it might be
time to ask exactly what kind of hardware this is, which compiler
PG was built with, etc.

On the other hand, if the issue were of that sort then it ought to
affect all buffers more or less at random; so if you're consistently
seeing exactly these symptoms (in particular, if it's always
pg_class_oid_index that's complained of), then I'm not sure I believe
this theory either.

Which PG version again?  Are you in the habit of doing VACUUM FULLs
on system catalogs, and if so do these glitches correlate at all
with such activities?
        regards, tom lane

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

21 June 2012, 17:49:30

I dug through the logs and found some other occurances of
the "could not read block" errors. Some on dirt simple
SELECT queries. Nothing else has generated the btree
error yet. About 35 found in the last month.

> This theory would be more plausible if you're wrong about the second-case
> tables being temp, though, because if they are temp then their indexes
> would be kept in local buffers not shared buffers, making it much harder
> to believe in a single bug causing both symptoms.

I grepped the last month of logs and found about 20 instances of
that error: none of the relfilenodes given shows up in pg_class,
even for that dirt simple SELECT.

> One possible mechanism for confusion of that sort would be if the
> spinlock code wasn't quite right, or the compiler was incorrectly
> moving loads/stores into or out of locked sections.  So it might be
> time to ask exactly what kind of hardware this is, which compiler
> PG was built with, etc.

Quad core AMD Opteron. RHEL. Compiled with gcc with all the options
(basically the standard compilation e.g. --build=x86_64-redhat-linux-gnu)
I can give you more details offlist if it will help.

> On the other hand, if the issue were of that sort then it ought to
> affect all buffers more or less at random; so if you're consistently
> seeing exactly these symptoms (in particular, if it's always
> pg_class_oid_index that's complained of), then I'm not sure I believe
> this theory either.

I've never seen any other index for the btree error, but it has only
happened a grand total of 3 times ever. The other error appears to
be fairly random, except that the one particular query that gives
the btree error always seems to give one version or the other.

> Which PG version again?  Are you in the habit of doing VACUUM FULLs
> on system catalogs, and if so do these glitches correlate at all
> with such activities?

Heh. 8.3.18. Yes, very heavy vac fulls (and reindexes) of the system
catalogs. Cron-driven, and depends on the time of day and if any
DDL is running (if so, it does not run), but probably on average
pg_class is vacfulled and reindexed twice an hour during the times
this happens (which is, during normal business hours). There is a lot
in churn in pg_class, pg_attribute, and pg_depend in particular from
all the temp stuff being created and torn down all day, as well as
some Bucardo pg_class updating.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

09 July 2012, 16:32:01

I've expanded my searching a bit, to see if I can find any other
correlations. One thing that seems to happen about 10 times a day
is an error of this sort:

ERROR:  could not open relation with OID 1554847326

In this case, the OID in question always exists, and corresponds to
one of a handful of particularly busy tables. Sometimes the query
does not even touch the OID mentioned directly: in the above example,
the SQL was an update to table A that had a FK to table B, and the
OID above is for table B. The queries themselves vary: I've not found any
common factor yet.

These errors have been happening a long time, and obviously don't cause the
same database-hosed-must-restart issue the btree does, but it is still
a little disconcerting. Although 10 times out of > 20 million transactions
per day is at least an extremely rare event :) It is definitely NOT correlated to
system table reindexing, but does seem to be roughly correlated to how busy
things are in general. We've not been able to duplicate on a non-prod test
system yet either, which points to either hardware or (more likely) a failure
to completely simulate the high activity level of prod.

No idea if this related to the relatively recent btree errors, but figured
I would get it out there. There is also an even rarer sprinkling of:

ERROR:  relation with OID 3924107573 does not exist

but I figured that was probably a variant of the first error.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Tom Lane

Date:

09 July 2012, 17:02:37

Greg Sabino Mullane <greg@endpoint.com> writes:
> I've expanded my searching a bit, to see if I can find any other 
> correlations. One thing that seems to happen about 10 times a day 
> is an error of this sort:

> ERROR:  could not open relation with OID 1554847326

Is that the *entire* message?  No details?  Could you crank up the
log verbosity so we can get file and line number, at least?
        regards, tom lane

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

09 July 2012, 17:21:39

> > ERROR:  could not open relation with OID 1554847326
>
> Is that the *entire* message?  No details?  Could you crank up the
> log verbosity so we can get file and line number, at least?

Yes, that's the entire thing, other than the statement line after it.
In this particular case:

STATEMENT: UPDATE mytable SET foobar = 'T' WHERE id IN ($1,$2,$3,$4)

I'll start the wheels rolling for upping the verbosity.

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

09 July 2012, 17:37:03

> Could you crank up the log verbosity so we can get
> file and line number, at least?

First hit since the change:

ERROR:  XX000: could not open relation with OID 1554847444
LOCATION:  relation_open, heapam.c:879
STATEMENT:  SELECT
...

Will leave the verbosity up and see if it occurs in the same place.


--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8

Re: Btree or not btree? That is the question

From

Greg Sabino Mullane

Date:

10 July 2012, 14:35:51

On Mon, Jul 09, 2012 at 04:02:13PM -0400, Tom Lane wrote:
> ... Could you crank up the log verbosity so we can get
> file and line number, at least?

Here is what the increased verbosity reveals in aggregate. This
is about an 18-hour span, covering 12.5M transactions, on
version 8.3.18:

(13 times) Various OIDs that do exist cannot be found:
ERROR:  XX000: could not open relation with OID 1554847444
LOCATION:  relation_open, heapam.c:879

(21 times) Relations that do exist cannot be found:
ERROR:  42P01: relation "foobar" does not exist
LOCATION:  RangeVarGetRelid, namespace.c:273

(1 time) Qualified relation that exists cannot be found:
ERROR:  42P01: relation "public.foobar" does not exist
LOCATION:  RangeVarGetRelid, namespace.c:268

(5 times) Failure to read a block:
XX001: could not read block 3 of relation 1663/1554846571/4184054438: read only 0 of 8192 bytes
LOCATION:  mdread, md.c:631

(5 times) Cache lookup failure:
XX000: cache lookup failed for relation 1554847255
LOCATION:  has_subclass, plancat.c:921

--
Greg Sabino Mullane greg@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8