Thread: page corruption bug

page corruption bug

From
"A Palmblad"
Date:
========================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
========================================================================


Your name  : Adam Palmblad
Your email address : adampalmblad@yahoo.ca


System Configuration
---------------------
  Architecture (example: Intel Pentium)   : dual AMD 64s (242)

  Operating System (example: Linux 2.4.18)  : Gentoo Linux, kernel 2.6.3-Gentoo-r2, XFS file system

  PostgreSQL version (example: PostgreSQL-7.4.2):   PostgreSQL-7.4.2 (64-bit compile)

  Compiler used (example:  gcc 2.95.2)  : gcc 3.3.3


Please enter a FULL description of your problem:
------------------------------------------------
We are having a recurring problem with page corruption in our database.  We
need to add over 3 million records a day to our database, and we have found
that we start getting page header corruption errors after around 12-15
million records.  These errors show up both in tables and in indexes, and
generally occur only in our largest tables.  This is a new server; when it
was set up, some basic hardware tests were done, and they checked out okay.
The data in the databases is critical to our business; having to rebuild a
table and reinsert the data every few days is not an acceptable solution.

Another error was just noted, reading as follows: ERROR: Couldn't open segment 1 of relation: XXXX (target block 746874992): No such file or directory.

Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
Insert 15 million records into a table using the COPY command; we are
running COPY with files of 60,000 lines each to insert the data.
Then do a VACUUM or a similar operation that visits every page of the table.
An invalid page header error may occur.
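For anyone trying to reproduce this, a minimal sketch of the load pattern
described above is given below.  It generates one 60,000-line tab-separated
batch file suitable for COPY; the table name (test_records) and its columns
are hypothetical, chosen only for illustration, and the psql commands are
shown as comments since the exact schema was not posted.

```python
def write_batch(path, start_id, lines_per_file=60000):
    """Write one tab-separated batch file of lines_per_file rows,
    with ids starting at start_id.  Returns the number of rows written."""
    with open(path, "w") as f:
        for i in range(start_id, start_id + lines_per_file):
            # Tab-separated row: an integer id and a text payload
            f.write("%d\tpayload-%d\n" % (i, i))
    return lines_per_file

if __name__ == "__main__":
    # Generate the first batch; repeat ~250 times (with increasing
    # start_id) to reach the 15 million rows mentioned in the report.
    write_batch("/tmp/batch_0.tsv", 0)
    # In psql, each batch would then be loaded with something like:
    #   COPY test_records FROM '/tmp/batch_0.tsv';
    # and after all batches are loaded:
    #   VACUUM VERBOSE test_records;
    # watching for "invalid page header" errors in the output.
```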


If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------
Has anyone else had this problem?  Would it be better for us to try a
different file system or kernel?  Should postgres be recompiled in 32-bit mode?

Re: page corruption bug

From
Tom Lane
Date:
"A Palmblad" <adampalmblad@yahoo.ca> writes:
> We are having a recurring problem with page corruption in our
> database.

The symptoms you describe are indistinguishable from those seen with
flaky hardware.  I'd strongly suggest doing more extensive testing of
both RAM and disks.  memtest86 and badblocks are the least common
denominator for test programs, though I think you can get better ones
if you're willing to pay.  (In particular, I do not know if memtest86
can reach all of RAM in a 64-bit machine; it may be 32-bit-only...)

The software setup (dual AMD's and a 64-bit compile) is a bit off the
beaten track, but if you did have a porting problem these are not the
sort of symptoms I'd expect.  My money is on a hardware fault.

I'll even go out on a limb and suggest that it's probably bad RAM rather
than drives; the behavior seems consistent with flaky RAM in an address
range that doesn't get used until the kernel has managed to fill up most
of memory.

> Another error was just noted, reading as follows: ERROR: Couldn't open
> segment 1 of relation: XXXX (target block 746874992): No such file or directory.

Likely explanation is a trashed block pointer in an index entry.  Again,
not too surprising if hardware is flaky.

            regards, tom lane