Re: PosgreSQL is crashing with a signal 11 - Bug? - Mailing list pgsql-bugs
From | Rafael Martinez Guerrero |
---|---|
Subject | Re: PosgreSQL is crashing with a signal 11 - Bug? |
Date | |
Msg-id | 1095072069.31640.137.camel@bbking.uio.no |
In response to | Re: PosgreSQL is crashing with a signal 11 - Bug? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: PosgreSQL is crashing with a signal 11 - Bug?
|
List | pgsql-bugs |
On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
> > how can att[i]->attlen possibly change in the interim? but
> > data_length looks corrupted, too.
>
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together. So I wouldn't put too much stock in the
> reported line number.
>
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases. Have you checked to see what the
> page-on-disk looks like? I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>

Hi Tom,

Kjetil will answer you about this.

In the meantime we got new core dumps when taking a backup of the same
database.

Some more info I got from the department in charge of this database:

-----------------------------------------------------------
We make a backup of our production server every 15 minutes. Recently,
we've seen behaviour like this:

[12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
[12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no
[12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
pg_dump: ERROR: MemoryContextAlloc: invalid request size 1577058307
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "paid_quota_history" failed: PQendcopy() failed.
pg_dump: Error message from server:
pg_dump: The command was: COPY public.paid_quota_history (job_id, transaction_type, person_id, tstamp, update_by, update_program, pageunits_free, pageunits_paid, pageunits_total) TO stdout;
pg_dumpall: pg_dump failed on cerebrum_prod, exiting
[12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

Every consecutive backup fails with the same message, and then suddenly:

[12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on cerebellum.uio.no
[12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on cerebellum.uio.no

To me this looks like a cache somewhere that upon read contained some
incorrect data. This cache was somehow flushed two hours later, and
fresh data was read from disk. Could this be a postgres problem, or is
it hardware/kernel related?

Upgrading from 7.3.5 to 7.3.7 to 7.4.5 does not help. We have now moved
the database between 3 different Dell2650 servers, and replaced memory
chips on one system once. Lately one or more postgres processes have
received signal 11 at least once a day. The problems started about a
week ago after stable production for about 9 months.

The backup failures above were accompanied by 4 core dumps.
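A note on the request size in the error above: writing it in hexadecimal makes it easier to compare with the earlier observation about a damaged high byte. The following is only a minimal sketch, assuming the failed allocation size came from a 32-bit length word (the thread suggests this but does not confirm it). The helper name `split_length_word` is hypothetical and not part of PostgreSQL.

```python
from typing import Tuple


def split_length_word(value: int) -> Tuple[int, int]:
    """Split an unsigned 32-bit value into (high byte, low 24 bits).

    Hypothetical helper for illustration only.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    return value >> 24, value & 0x00FFFFFF


# Request size taken from the pg_dump error message in this report.
request_size = 1577058307
high, low = split_length_word(request_size)
print(f"0x{request_size:08x}: high byte 0x{high:02x}, low 24 bits {low}")
# prints "0x5e000003: high byte 0x5e, low 24 bits 3"
```

If the low 24 bits are the plausible part of the length, then only the top byte looks wrong here, which would be consistent with single-byte damage. That reading is an interpretation, not something the logs prove.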
Backtrace follows:

#0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1  0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe) at numeric.c:2673
#2  0x08171927 in numeric_out (fcinfo=0xbfffc2d0) at numeric.c:373
#3  0x081aa81d in FunctionCall3 (flinfo=0x82cc4e8, arg1=3221209808, arg2=3221209808, arg3=3221209808) at fmgr.c:1016
#4  0x080c78fb in CopyTo (rel=0xb6800bd0, attnumlist=0x82cb4a0, binary=0 '\0', oids=0 '\0', delim=0x82232a8 "\t", null_print=0x81fc95d "\\N") at copy.c:1096
#5  0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920
#6  0x081507c5 in PortalRunUtility (portal=0x82bdfd8, query=0x82ba220, dest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:772
#7  0x08150a3e in PortalRunMulti (portal=0x82bdfd8, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:836
#8  0x0815033c in PortalRun (portal=0x82bdfd8, count=2147483647, dest=0x82ba1d8, altdest=0x82ba1d8, completionTag=0xbfffc650 "") at pquery.c:494
#9  0x0814d5f8 in exec_simple_query (query_string=0x82b9bc0 "COPY public.change_log (tstamp, change_id, subject_entity, change_type_id, dest_entity, change_params, change_by, change_program, description) TO stdout;") at postgres.c:873
#10 0x0814f660 in PostgresMain (argc=4, argv=0x82701b8, username=0x8270188 "postgres") at postgres.c:2868
#11 0x0812f5ab in BackendFork (port=0x827d0a0) at postmaster.c:2564
#12 0x0812f09e in BackendStartup (port=0x827d0a0) at postmaster.c:2207
#13 0x0812d95f in ServerLoop () at postmaster.c:1119
#14 0x0812d305 in PostmasterMain (argc=3, argv=0x826e1c0) at postmaster.c:897
#15 0x08104f10 in main (argc=3, argv=0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell2650 specific causes.
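One way to look for the same byte pattern in the backtrace is to scan its hexadecimal values for a given high byte. This is a rough sketch only: the function name `values_with_high_byte` is hypothetical, the two sample frames are copied from the backtrace above, and argument values printed by gdb for an optimized build may not be reliable.

```python
import re
from typing import List

HEX_VALUE = re.compile(r"0x[0-9a-fA-F]+")


def values_with_high_byte(text: str, high_byte: int) -> List[str]:
    """Return hex literals in `text` whose top byte (as 32-bit) equals high_byte.

    Hypothetical helper for illustration only.
    """
    hits = []
    for literal in HEX_VALUE.findall(text):
        value = int(literal, 16)
        if value <= 0xFFFFFFFF and (value >> 24) == high_byte:
            hits.append(literal)
    return hits


# Two frames copied from the backtrace in this report.
frames = "\n".join([
    "#1 0x08174880 in set_var_from_num (num=0xb7021d24, dest=0x87b432fe) at numeric.c:2673",
    "#5 0x080c7021 in DoCopy (stmt=0x2f000004) at copy.c:920",
])

print(values_with_high_byte(frames, 0x2F))
# prints "['0x2f000004']"
```

The `stmt` value in frame #5 is the only match in these two frames. Whether that reflects the same corruption or is just an artifact of how gdb reconstructs arguments cannot be decided from the backtrace alone.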
-----------------------------------------------------------

--
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway
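Regarding the quoted question about where the damaged byte sits within the page: given a byte offset into a relation file, the block number and the offset inside that block follow from a division by the block size. The sketch below assumes the default PostgreSQL block size of 8192 bytes (a build-time setting, so it may differ on this server). The function name `locate_in_page` and the example offset are hypothetical and are not data from this report.

```python
from typing import Tuple

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def locate_in_page(file_offset: int, block_size: int = BLCKSZ) -> Tuple[int, int]:
    """Return (block number, offset within block) for a byte offset in a file.

    Hypothetical helper for illustration only.
    """
    if file_offset < 0:
        raise ValueError("file offset must be non-negative")
    return divmod(file_offset, block_size)


# Made-up offset, chosen only to show the arithmetic.
example_offset = 5 * BLCKSZ + 0x0FFF
block, within = locate_in_page(example_offset)
print(f"block {block}, offset within page 0x{within:04x}")
# prints "block 5, offset within page 0x0fff"
```

An offset of 0x0fff is the last byte of the first 4 kB of an 8 kB page, so a repeat at that position would point at a 4 kB boundary rather than at anything specific to the table contents.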