Thread: index corruption?

index corruption?

From
Laurette Cisneros
Date:
This is the second time I've seen this.

PostgreSQL version: 7.3.2

This particular table is empty.  I'm trying to read it in a Perl script.
The problem doesn't reproduce reliably (I have a script that creates the
database by copying table data from another database).

This is the error in the pgsql log:
2003-02-13 16:21:42 [8843]   ERROR:  Index external_signstops_pkey is not a btree
2003-02-13 16:21:42 [8843]   ERROR:  current transaction is aborted, queries ignored until end of transaction block

Any ideas?

-- 
Laurette Cisneros, L.D.
The Database Group
(510) 420-3137
NextBus Information Systems, Inc.
www.nextbus.com
----------------------------------
"No man is wise enough by himself"
-- Titus Maccius Plautus   (254 BC - 184 BC), Miles Gloriosus



Re: index corruption?

From
Tom Lane
Date:
Laurette Cisneros <laurette@nextbus.com> writes:
> This is the error in the pgsql log:
> 2003-02-13 16:21:42 [8843]   ERROR:  Index external_signstops_pkey is not a
> btree

This says that one of two fields that should never change, in fixed
positions in the first block of a btree index, didn't have the right
values.  I am not aware of any PG bugs that could overwrite those
fields.  I think the most likely bet is that you've got hardware
issues ... have you run memory and disk diagnostics lately?
        regards, tom lane
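
[Editor's illustration] The kind of check described above can be sketched in a few lines. This is not PostgreSQL's actual code: the block layout, the flag bit, and the helper names (make_first_block, check_first_block) are hypothetical, and the two fields are assumed here to be a magic number and a page-type flag. The magic constant is, as far as I know, the value PostgreSQL's nbtree.h uses for BTREE_MAGIC, but treat that as an assumption too.

```python
import struct

# Assumed, simplified layout for illustration only. A real btree metapage
# has a page header and more fields than the two shown here.
BLOCK_SIZE = 8192
MAGIC = 0x053162      # assumed to match BTREE_MAGIC in PostgreSQL's nbtree.h
META_FLAG = 0x0008    # hypothetical "this block is the metapage" flag bit

# Two unsigned 32-bit fields (magic, flags) at offset 0, little-endian.
HEADER = struct.Struct("<II")


def make_first_block(magic=MAGIC, flags=META_FLAG):
    """Build a toy first block with the two fixed fields at offset 0."""
    block = bytearray(BLOCK_SIZE)
    HEADER.pack_into(block, 0, magic, flags)
    return block


def check_first_block(block, index_name):
    """Raise if either fixed field does not hold its expected value."""
    magic, flags = HEADER.unpack_from(block, 0)
    if magic != MAGIC or not (flags & META_FLAG):
        raise ValueError(f"Index {index_name} is not a btree")


good = make_first_block()
check_first_block(good, "external_signstops_pkey")  # passes silently

# Simulate a stray write that clobbers the first four bytes of the block.
bad = make_first_block()
bad[0:4] = b"\x00\x00\x00\x00"
try:
    check_first_block(bad, "external_signstops_pkey")
except ValueError as exc:
    print(exc)  # Index external_signstops_pkey is not a btree
```

The point of the sketch is only that the check reads constants at fixed offsets, so any write that lands on those bytes, from whatever source, produces the same error message.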


Re: index corruption?

From
"Ed L."
Date:
On Feb 13, 2003, Tom Lane wrote:
>
> Laurette Cisneros <laurette@nextbus.com> writes:
> > This is the error in the pgsql log:
> > 2003-02-13 16:21:42 [8843]   ERROR:  Index external_signstops_pkey is
> > not a btree
>
> This says that one of two fields that should never change, in fixed
> positions in the first block of a btree index, didn't have the right
> values.  I am not aware of any PG bugs that could overwrite those
> fields.  I think the most likely bet is that you've got hardware
> issues ... have you run memory and disk diagnostics lately?

I am seeing this same problem on two separate machines, one brand new, one
older.  Not sure yet what is causing it, but seems pretty unlikely that it
is hardware-related.

Ed



Re: index corruption?

From
"Ed L."
Date:
On Monday March 31 2003 3:38, Ed L. wrote:
> On Feb 13, 2003, Tom Lane wrote:
> > Laurette Cisneros <laurette@nextbus.com> writes:
> > > This is the error in the pgsql log:
> > > 2003-02-13 16:21:42 [8843]   ERROR:  Index external_signstops_pkey is
> > > not a btree
> >
> > This says that one of two fields that should never change, in fixed
> > positions in the first block of a btree index, didn't have the right
> > values.  I am not aware of any PG bugs that could overwrite those
> > fields.  I think the most likely bet is that you've got hardware
> > issues ... have you run memory and disk diagnostics lately?
>
> I am seeing this same problem on two separate machines, one brand new,
> one older.  Not sure yet what is causing it, but seems pretty unlikely
> that it is hardware-related.

I am dabbling for the first time with a (crashing) C trigger, so that may be
the culprit here.

Ed



Re: index corruption?

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
>> I am seeing this same problem on two separate machines, one brand new,
>> one older.  Not sure yet what is causing it, but seems pretty unlikely
>> that it is hardware-related.

> I am dabbling for the first time with a (crashing) C trigger, so that may be 
> the culprit here.

Could well be, although past experience has been that crashes in C code
seldom lead directly to disk corruption.  (First, the bogus code has to
overwrite a shared disk buffer.  If you follow what I consider the
better path of not making your shared buffers a large fraction of the
address space, the odds of a wild store happening to hit a disk buffer
aren't high.  Second, once it's corrupted a shared buffer, it has to
contrive to cause that buffer to get written out before the core dump
occurs --- in most cases, the fact that the postmaster abandons the
contents of shared memory after a backend crash protects us from this
kind of failure.)

When you find the problem, please take note of whether there's something
involved that increases the chances of corruption getting to disk.  We
might want to try to do something about it ...
        regards, tom lane
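
[Editor's illustration] The first point above, that a wild store is unlikely to hit a disk buffer when shared buffers are a small fraction of the address space, can be put into rough numbers. This is a back-of-the-envelope sketch under loud assumptions: a 32-bit address space, 8 kB buffers, and a stray write whose target is uniformly random. Real wild stores are not uniform, and most of the address space is unmapped, so the figures only show how the ratio scales. The function name hit_probability and the sample buffer counts are made up for the example.

```python
from fractions import Fraction

BLOCK_SIZE = 8192          # assumed bytes per shared buffer
ADDRESS_SPACE = 2 ** 32    # assumed 32-bit process address space


def hit_probability(shared_buffers, block_size=BLOCK_SIZE,
                    address_space=ADDRESS_SPACE):
    """Chance that one uniformly random stray write lands in the buffer pool.

    Uniformity is an unrealistic simplifying assumption.
    """
    return Fraction(shared_buffers * block_size, address_space)


# Illustrative buffer counts only, not recommendations.
for n in (64, 8192, 131072):
    p = hit_probability(n)
    print(f"shared_buffers={n:>6}: {float(p):.6f}")
```

Under these assumptions the chance grows linearly with the buffer count, which is the reason given above for keeping shared buffers a small share of the address space.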



Re: index corruption?

From
"Ed L."
Date:
On Monday March 31 2003 3:54, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> >> I am seeing this same problem on two separate machines, one brand new,
> >> one older.  Not sure yet what is causing it, but seems pretty unlikely
> >> that it is hardware-related.
> >
> > I am dabbling for the first time with a (crashing) C trigger, so that
> > may be the culprit here.
>
> Could well be, although past experience has been that crashes in C code
> seldom lead directly to disk corruption.  (First, the bogus code has to
> overwrite a shared disk buffer.  If you follow what I consider the
> better path of not making your shared buffers a large fraction of the
> address space, the odds of a wild store happening to hit a disk buffer
> aren't high.  Second, once it's corrupted a shared buffer, it has to
> contrive to cause that buffer to get written out before the core dump
> occurs --- in most cases, the fact that the postmaster abandons the
> contents of shared memory after a backend crash protects us from this
> kind of failure.)
>
> When you find the problem, please take note of whether there's something
> involved that increases the chances of corruption getting to disk.  We
> might want to try to do something about it ...

It is definitely due to some rogue trigger code.  Not sure what exactly, but
if I remove a certain code segment the problem disappears.

Ed



Re: index corruption?

From
"scott.marlowe"
Date:
On Mon, 31 Mar 2003, Ed L. wrote:

> On Feb 13, 2003, Tom Lane wrote:
> >
> > Laurette Cisneros <laurette@nextbus.com> writes:
> > > This is the error in the pgsql log:
> > > 2003-02-13 16:21:42 [8843]   ERROR:  Index external_signstops_pkey is
> > > not a btree
> >
> > This says that one of two fields that should never change, in fixed
> > positions in the first block of a btree index, didn't have the right
> > values.  I am not aware of any PG bugs that could overwrite those
> > fields.  I think the most likely bet is that you've got hardware
> > issues ... have you run memory and disk diagnostics lately?
> 
> I am seeing this same problem on two separate machines, one brand new, one 
> older.  Not sure yet what is causing it, but seems pretty unlikely that it 
> is hardware-related.

Until you've tested them, the likelihood is unimportant.  If you've tested
the boxes, and the memory tests good and the hard drives test good, then
there is still likely to be another explanation, like a runaway kernel bug
that writes somewhere it shouldn't every fifth eon or two.

If you haven't tested the boxes, their reliability is part of the NULL
set. :-)



Re: index corruption?

From
"Ed L."
Date:
On Monday March 31 2003 4:15, Ed L. wrote:
> On Monday March 31 2003 3:54, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > >> I am seeing this same problem on two separate machines, one brand
> > >> new, one older.  Not sure yet what is causing it, but seems pretty
> > >> unlikely that it is hardware-related.
> > >
> > > I am dabbling for the first time with a (crashing) C trigger, so that
> > > may be the culprit here.
> >
> > Could well be, although past experience has been that crashes in C code
> > seldom lead directly to disk corruption.  (First, the bogus code has to
> > overwrite a shared disk buffer.  If you follow what I consider the
> > better path of not making your shared buffers a large fraction of the
> > address space, the odds of a wild store happening to hit a disk buffer
> > aren't high.  Second, once it's corrupted a shared buffer, it has to
> > contrive to cause that buffer to get written out before the core dump
> > occurs --- in most cases, the fact that the postmaster abandons the
> > contents of shared memory after a backend crash protects us from this
> > kind of failure.)
> >
> > When you find the problem, please take note of whether there's
> > something involved that increases the chances of corruption getting to
> > disk.  We might want to try to do something about it ...

Well, I fixed it but cannot now remember exactly what change did it amidst a
bunch of rewrites of some existing stuff, and I cannot get back to that
state from here.  :(  It was definitely arising from some funky C trigger
code of my own making.

Ed