Re: PosgreSQL is crashing with a signal 11 - Bug? - Mailing list pgsql-bugs

From Rafael Martinez Guerrero
Subject Re: PosgreSQL is crashing with a signal 11 - Bug?
Msg-id 1095072069.31640.137.camel@bbking.uio.no
In response to Re: PosgreSQL is crashing with a signal 11 - Bug?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: PosgreSQL is crashing with a signal 11 - Bug?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
> > how can att[i]->attlen possibly change in the interim?  but
> > data_length looks corrupted, too.
>
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together.  So I wouldn't put too much stock in the
> reported line number.
>
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases.  Have you checked to see what the
> page-on-disk looks like?  I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>

Hi Tom,

Kjetil will answer you about this.

In the meantime, we got new core dumps while taking a backup of the same
database.

Some more info I got from the department in charge of this database:
-----------------------------------------------------------
We make a backup of our production server every 15 minutes.  Recently,
we've seen behaviour like this:

  [12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  [12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no
  [12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  pg_dump: ERROR:  MemoryContextAlloc: invalid request size 1577058307
  pg_dump: lost synchronization with server, resetting connection
  pg_dump: SQL command to dump the contents of table "paid_quota_history" failed: PQendcopy() failed.
  pg_dump: Error message from server: pg_dump: The command was: COPY public.paid_quota_history (job_id, transaction_type, person_id, tstamp, update_by, update_program, pageunits_free, pageunits_paid, pageunits_total) TO stdout;
  pg_dumpall: pg_dump failed on cerebrum_prod, exiting
  [12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

Every subsequent backup fails with the same message, and then
suddenly:

  [12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
  [12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

To me this looks like a cache somewhere that contained incorrect data
when it was read.  This cache was somehow flushed two hours later, and
fresh data was read from disk.

Could this be a postgres problem, or is it hardware/kernel related?
Upgrading from 7.3.5 to 7.3.7 and then to 7.4.5 did not help.  We have
now moved the database between 3 different Dell 2650 servers, and
replaced memory chips on one system once.  Lately one or more postgres
processes have received signal 11 at least once a day.  The problems
started about a week ago after stable production for about 9 months.

The backup failures above were accompanied by 4 core-dumps.  Backtrace
follows:

  #0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
  #1  0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe) at numeric.c:2673
  #2  0x08171927 in numeric_out (fcinfo=0xbfffc2d0) at numeric.c:373
  #3  0x081aa81d in FunctionCall3 (flinfo=0x82cc4e8, arg1=3221209808, arg2=3221209808, arg3=3221209808) at fmgr.c:1016
  #4  0x080c78fb in CopyTo (rel=0xb6800bd0, attnumlist=0x82cb4a0, binary=0 '\0', oids=0 '\0', delim=0x82232a8 "\t", null_print=0x81fc95d "\\N") at copy.c:1096
  #5  0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920
  #6  0x081507c5 in PortalRunUtility (portal=0x82bdfd8, query=0x82ba220, dest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:772
  #7  0x08150a3e in PortalRunMulti (portal=0x82bdfd8, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:836
  #8  0x0815033c in PortalRun (portal=0x82bdfd8, count=2147483647, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:494
  #9  0x0814d5f8 in exec_simple_query (query_string=0x82b9bc0 "COPY public.change_log (tstamp, change_id, subject_entity, change_type_id, dest_entity, change_params, change_by, change_program, description) TO stdout;") at postgres.c:873
  #10 0x0814f660 in PostgresMain (argc=4, argv=0x82701b8, username=0x8270188 "postgres") at postgres.c:2868
  #11 0x0812f5ab in BackendFork (port=0x827d0a0) at postmaster.c:2564
  #12 0x0812f09e in BackendStartup (port=0x827d0a0) at postmaster.c:2207
  #13 0x0812d95f in ServerLoop () at postmaster.c:1119
  #14 0x0812d305 in PostmasterMain (argc=3, argv=0x826e1c0) at postmaster.c:897
  #15 0x08104f10 in main (argc=3, argv=0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell 2650-specific causes.
-----------------------------------------------------------
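
For anyone comparing these numbers with the earlier dumps, here is a
small Python sketch that splits a 32-bit word into its high byte and
its low 24 bits.  The two input values are copied from the pg_dump
error and from frame #5 of the backtrace above; the helper name
split_length_word is made up for illustration, and treating the low
24 bits as the "intended" value is only an assumption, not something
the dumps prove.

```python
def split_length_word(value):
    """Split a 32-bit word into (high byte, low 24 bits)."""
    value &= 0xFFFFFFFF          # keep only the low 32 bits
    return value >> 24, value & 0x00FFFFFF


# Values copied from the report above; labels are descriptive only.
suspects = {
    "MemoryContextAlloc request size": 1577058307,
    "DoCopy stmt argument (frame #5)": 0x2F000004,
}

for label, word in suspects.items():
    high, low = split_length_word(word)
    print(f"{label}: 0x{word:08x} -> high byte 0x{high:02x}, low 24 bits {low}")
```

The request size comes out as 0x5e000003 (high byte 0x5e, remainder 3)
and the stmt argument as 0x2f000004 (high byte 0x2f, remainder 4), so
in both cases a small value appears to have a stray high byte on top.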

-- 
 Rafael Martinez, <r.m.guerrero@usit.uio.no>
 Center for Information Technology Services
 University of Oslo, Norway
