Thread: possible database corruption

possible database corruption

From: Chris Anderson
Date: 12 January 2001, 21:13:21

============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================


Your name        :    Chris Anderson
Your email address    :    chris@journyx.com


System Configuration
---------------------
  Architecture (example: Intel Pentium)      : Intel Pentium 3 (x2)

  Operating System (example: Linux 2.0.26 ELF)     : Linux 2.2.14 (SMP)

  PostgreSQL version (example: PostgreSQL-7.0):   PostgreSQL-7.0.3

  Compiler used (example:  gcc 2.8.0)        : egcs-2.91.66


Please enter a FULL description of your problem:
------------------------------------------------

We are using postgresql as the backend for an online service where we host
a web based application for customers. Each customer has their own copy of
the application server (written in python) which maintains three
persistent connections to postgres.

We presently have a single postgres instance on a dedicated machine which
maintains 94 databases and around 280 connections. These connections are
initiated from four additional servers which provide the application
to the customer. These machines are all running the same version of linux
as the database server, however their pgres clients are only at version
6.5.

This solution has worked very well for us in the past, but now we are
experiencing very strange behavior which seems to be the result of
periodic corruption in the database files.

Sometimes immediately after we create a new database, it will somehow
become corrupted and trying to access it will cause postmaster to crash,
thereby killing everyone else's connections. Note that not all types of
accesses will cause it to crash, however a vacuum will almost always do
the trick. Actually, selects and inserts usually work just fine. However,
it does tend to lead toward a general instability in the server, and we
see postgres crashes quite regularly after it happens.

We cannot predict when this will happen, though we've been seeing it
almost weekly now, but once it does happen any new databases created will
exhibit the exact same behavior every time.

Once this happens, the only way I've been able to recover from the problem
seems to be to wipe the data directory and restore from a pg_dump.
Deleting the offending database and recreating it will not do the trick.

The server itself has never locked up, there are no known filesystem
errors, and I have been very careful to clean up any lingering shm stuff
before reinvoking postmaster.

When postmaster dies, it does dump a core, which I can provide. The stack
trace looks like this:

-- begin gdb output --

GNU gdb 19991004
Copyright 1998 Free Software Foundation, Inc.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `/usr/local/pgres/bin/postgres localhost postgres d'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /lib/libutil.so.1...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /usr/lib/libncurses.so.4...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
Reading symbols from /lib/libnss_files.so.2...done.
#0  0x81253ea in GetRawDatabaseInfo ()
(gdb) where
#0  0x81253ea in GetRawDatabaseInfo ()
#1  0x8125016 in InitPostgres ()
#2  0x80ebed5 in PostgresMain ()
#3  0x80d6652 in DoBackend ()
#4  0x80d6231 in BackendStartup ()
#5  0x80d55ea in ServerLoop ()
#6  0x80d5074 in PostmasterMain ()
#7  0x80ab866 in main ()
#8  0x401049cb in __libc_start_main (main=0x80ab800 <main>, argc=6,
argv=0xbffffb64, init=0x8064084 <_init>,
    fini=0x812a0cc <_fini>, rtld_fini=0x4000ae60 <_dl_fini>,
stack_end=0xbffffb5c) at ../sysdeps/generic/libc-start.c:92

-- end gdb output --

Needless to say this is quite disconcerting, and absolutely _any_ input
you could provide would be invaluable.

Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

As I mentioned above, it is difficult to predict when it will start
happening, however we have only ever seen this once we started getting the
number of connections pretty high.

If it is significant, postmaster is started with the following options:

su -l postgres -c '/usr/local/pgres/bin/postmaster -i -N 512 -B 2048 2>&1
> /var/log/postgres.log


If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

Well, I know how to repair it, but what I am most interested in is how to
prevent it, or at least how to debug what may be causing the problem in
the first place.

Re: possible database corruption

From: Tom Lane
Date: 12 January 2001, 21:28:39

Chris Anderson <chris@journyx.com> writes:
> [ occasional crash in GetRawDatabaseInfo() ]

Try applying the following patch, which corresponds to a bug that I
noticed in GetRawDatabaseInfo a couple months ago: it looks at one
tuple-pointer slot too many in each page of pg_database.  Normally the
bogus slot will be ignored because it doesn't have the LP_USED bit set,
but if you create and drop databases a lot, it's possible there would
be garbage there.

If this doesn't help, please recompile the backend with -g, so that we
can see a more detailed stack backtrace.

            regards, tom lane

*** src/backend/utils/misc/database.c~    Wed Apr 12 13:16:07 2000
--- src/backend/utils/misc/database.c    Fri Jan 12 21:25:24 2001
***************
*** 180,186 ****
          max = PageGetMaxOffsetNumber(pg);

          /* look at each tuple on the page */
!         for (i = 0; i <= max; i++)
          {
              int            offset;

--- 180,186 ----
          max = PageGetMaxOffsetNumber(pg);

          /* look at each tuple on the page */
!         for (i = 0; i < max; i++)
          {
              int            offset;
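
The off-by-one described in the reply can be illustrated with a small
standalone sketch. It is not the backend code: DemoItemId, DEMO_LP_USED,
and the three functions are hypothetical stand-ins, and the array is
deliberately one element longer than `max` so that the extra read stays
in bounds for the demonstration. It assumes, as the reply states, that
the slots are indexed from 0 and that `max` is the number of valid slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are not the real
 * PostgreSQL definitions. */
#define DEMO_LP_USED 0x01

typedef struct
{
    unsigned    flags;  /* DEMO_LP_USED set: slot is treated as a live tuple */
} DemoItemId;

/* Pre-patch loop shape: visits indexes 0..max, that is, max + 1 slots. */
static int
count_used_inclusive(const DemoItemId *slots, int max)
{
    int         i;
    int         used = 0;

    assert(slots != NULL);
    for (i = 0; i <= max; i++)
        if (slots[i].flags & DEMO_LP_USED)
            used++;
    return used;
}

/* Patched loop shape: visits indexes 0..max-1, exactly max slots. */
static int
count_used_exclusive(const DemoItemId *slots, int max)
{
    int         i;
    int         used = 0;

    assert(slots != NULL);
    for (i = 0; i < max; i++)
        if (slots[i].flags & DEMO_LP_USED)
            used++;
    return used;
}

/* Run both loop shapes over the same made-up page image. */
static void
demo_counts(int *inclusive, int *exclusive)
{
    /* slots[0..2] are the valid slots; slots[3] models leftover garbage
     * past the end that happens to have the used bit set. */
    const DemoItemId slots[4] = {
        {DEMO_LP_USED}, {0}, {DEMO_LP_USED}, {0xFF}
    };
    const int   max = 3;

    *inclusive = count_used_inclusive(slots, max);
    *exclusive = count_used_exclusive(slots, max);
}
```

With this made-up data the inclusive loop treats the garbage slot as a
live tuple and the exclusive loop does not. As long as the extra slot
lacks the used bit, both loops agree, which fits the observation that
the problem only shows up intermittently.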