Thread: memory corruption bug

memory corruption bug

From: Alan Stange
If PostgreSQL failed to compile on your computer or you found a bug that
is likely to be specific to one platform then please fill out this form
and e-mail it to pgsql-ports@postgresql.org.

To report any other bug, fill out the form below and e-mail it to
pgsql-bugs@postgresql.org.

If you not only found the problem but solved it and generated a patch
then e-mail it to pgsql-patches@postgresql.org instead.  Please use the
command "diff -c" to generate the patch.

You may also enter a bug report at http://www.postgresql.org/ instead of
e-mail-ing this form.

============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================


Your name        :Alan Stange
Your email address    :stange@rentec.com


System Configuration
---------------------
  Architecture (example: Intel Pentium)      : Sun UltraSparc IIICu

  Operating System (example: Linux 2.4.18)     : Solaris 9

  PostgreSQL version (example: PostgreSQL-7.4.1):   PostgreSQL-7.4.1

  Compiler used (example:  gcc 2.95.2)        : Sun spro9.


Please enter a FULL description of your problem:
------------------------------------------------
Memory corruption is occurring that causes the malloc routines to die with SIGBUS.  This is typically
caused by a routine writing past the end of malloc()'d memory, which corrupts the metadata used by
malloc(), free(), realloc(), etc.
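
For illustration only, here is a minimal sketch of how an overrun past the end of a malloc()'d block can be detected with a guard pattern. The names guarded_malloc, guarded_check, guarded_free and GUARD are hypothetical; none of this is PostgreSQL code, and it is not how the Solaris allocator works internally.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical guard pattern stored just past the caller's bytes. */
static const unsigned char GUARD[8] = {
    0xDE, 0xAD, 0xBE, 0xEF, 0xDE, 0xAD, 0xBE, 0xEF
};

/* Header placed in front of the user region so the requested size is
 * known when the guard is checked.  The union keeps the user pointer
 * aligned for any object type. */
typedef union {
    size_t      size;
    max_align_t align;
} guard_header;

/* Allocate `size` usable bytes followed by a copy of GUARD. */
void *guarded_malloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(guard_header) - sizeof GUARD)
        return NULL;                    /* request would overflow */

    guard_header *h = malloc(sizeof(guard_header) + size + sizeof GUARD);
    if (h == NULL)
        return NULL;

    h->size = size;
    unsigned char *user = (unsigned char *)(h + 1);
    memcpy(user + size, GUARD, sizeof GUARD);
    return user;
}

/* Return 1 if the guard after the block is intact, 0 if it was overwritten. */
int guarded_check(const void *p)
{
    const guard_header *h = (const guard_header *)p - 1;
    const unsigned char *user = p;
    return memcmp(user + h->size, GUARD, sizeof GUARD) == 0;
}

/* Release a block obtained from guarded_malloc. */
void guarded_free(void *p)
{
    if (p != NULL)
        free((guard_header *)p - 1);
}
```

Calling guarded_check() at points of interest narrows down when the guard was clobbered. With a plain malloc() the same stray write lands in the allocator's own bookkeeping and only shows up at some later malloc() or free() call, as in the trace below.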

This happens about once a day.

program terminated by signal BUS (invalid address alignment)
0xff043694: realfree+0x023c:    ld       [%o2 + 8], %o0
(dbx) where
=>[1] realfree(0x703b50, 0x782c0, 0x51, 0xff0bc000, 0x3c, 0x18cf08), at 0xff043694
  [2] cleanfree(0x0, 0x9, 0xff0c26ec, 0x10018, 0x703b50, 0x0), at 0xff043d90
  [3] _malloc_unlocked(0x20018, 0x79270, 0x0, 0xff0bc000, 0x0, 0x0), at 0xff042ebc
  [4] malloc(0x20018, 0x20007, 0x0, 0x0, 0x0, 0x0), at 0xff042dac
  [5] AllocSetAlloc(0x5cadf8, 0x20000, 0x0, 0x0, 0x0, 0x2000), at 0x24fdfc
  [6] AllocSetRealloc(0x5cadf8, 0x604608, 0x20000, 0x604608, 0x2000, 0x604600), at 0x25053c
  [7] enlargeStringInfo(0xffbfed48, 0x10473, 0x20000, 0x10474, 0x100, 0x1), at 0x10fb10
  [8] pq_getmessage(0xffbfed48, 0x0, 0x0, 0x0, 0x2ff910, 0x604608), at 0x1181bc
  [9] SocketBackend(0xffbfed48, 0x310400, 0x51, 0x30000, 0x3c, 0x18cf08), at 0x18d2bc
  [10] PostgresMain(0x190400, 0x310c00, 0x1, 0x315000, 0x310c00, 0x1), at 0x19193c
  [11] BackendFork(0x364b18, 0x356848, 0x2d043c, 0x2d044c, 0x310400, 0x319958), at 0x15b988
  [12] BackendStartup(0x364b18, 0x7f7f7c00, 0x7f7f7c00, 0x3541d8, 0x6e0c3a8f, 0x349c00), at 0x15b170
  [13] ServerLoop(0xc0, 0x310640, 0x364b18, 0x320e64, 0x1, 0x1), at 0x159814
  [14] PostmasterMain(0x15b800, 0x15a500, 0xffffffff, 0x349000, 0x354000, 0x315000), at 0x159048
  [15] main(0x1, 0x35608c, 0x3543d8, 0x3541d8, 0x0, 0x278), at 0x119bf4


These bugs are hard to fix.  The actual corruption could have occurred much earlier in the execution.

We had the same error with the gcc compilers as well.  We're using the Sun compilers as the resulting
PG binaries are much faster.


Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------
Sadly, we have no reproducible test case right now.




If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------

Typically one would use a tool like purify, bcheck, etc., to find the offending code.

Re: memory corruption bug

From: Tom Lane
Alan Stange <stange@rentec.com> writes:
> There's memory corruption occurring causing the malloc routines to
> SIGBUS.
> ...
> Sadly, we have no reproducible test case right now.

Without a test case there's not much we can do :-(.  The backtrace you
show is during client query collection, which is a quite well-tested
code path and hardly likely to contain such errors, so I would venture
that the memory clobber occurred previously.  Meaning that this trace
gives no useful data about where it is.

Some versions of Solaris are known to contain a broken vsnprintf()
library routine that can result in memory clobbers.  I don't recall
which versions exactly, but I think it was only in 64-bit builds;
is yours 64-bit?  Anyway you might try undef'ing HAVE_VSNPRINTF and
HAVE_SNPRINTF in src/include/pg_config.h after configure (to force
use of our own emulations of those routines) to see if that helps.
Also try scanning the pgsql-bugs archives for other mentions of Solaris
to see if there are other such issues ...
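
As a sketch of the edit described above (the exact layout of the generated file is an assumption), the two symbols can be undefined near the end of src/include/pg_config.h, after the lines where configure defined them:

```c
/* src/include/pg_config.h, edited by hand after running configure.
 * Assumption: these lines are placed after the existing HAVE_SNPRINTF
 * and HAVE_VSNPRINTF definitions, so the #undef takes effect. */
#undef HAVE_SNPRINTF
#undef HAVE_VSNPRINTF
```

Re-running configure regenerates the file and would discard this change, and the tree should be rebuilt from clean afterwards so that every object file sees the same settings.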

> Typically one would use a tool like purify, bcheck, etc., to find the
> offending code.

Without being able to duplicate the crash, it's unlikely anyone else
could find the problem.  We haven't heard similar reports, so this is
either a platform-specific issue or a consequence of some corner case
that you are exercising and other people aren't.  Either way, the most
likely way to learn something from Purify is for you to run it ...

            regards, tom lane