From: Clemens Wehrmann
Subject: Debugging a postgresql server crash.
Msg-id: 200502181813.27714.clemens.wehrmann@ciao-de.com
List: pgsql-general
I am trying to debug a PostgreSQL server crash and am out of my depth.  A
backend dies with a segmentation fault when it reads certain data.  I've
tracked down one such place (though there may be others).  I'm wondering
whether a more experienced eye can tell from the output below if I'm looking
at table corruption.  pg_dump also crashes the server.  The filesystem claims
to be consistent and I haven't found any bad blocks yet.
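
(Since pg_dump also dies, presumably on the same table, a narrower way to
confirm that, rather than dumping the whole database, might be a COPY of just
this table; the output path below is only a placeholder.  If the row really is
the culprit, this should kill the backend too.)

copy imp_product to '/tmp/imp_product.out';  -- server-side file, so superuser only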

Thanks much to anyone who reads on and the many whose questions and answers
have been so helpful to this lurker in the past.

Cheers,
--clemens


======
Set up:
======
FreeBSD hog6 5.3-RELEASE
dual amd64
amr(RAID10) MegaRAID
postgresql 7.4.5 from ports

=======
The table
=======
catfang_de_broken=# \d imp_product
                             Table "public.imp_product"
   Column    |  Type   |                          Modifiers
-------------+---------+-------------------------------------------------------------
 product_id  | integer | not null default (nextval('product_id_seq'::text))::integer
 category_id | integer | not null
 name        | text    | not null
 created     | date    | not null
 current     | boolean |
 active      | boolean | not null default true
 action      | integer | not null
Indexes:
    "imp_product_pkey" primary key, btree (product_id)
    "imp_product_action_idx" btree ("action")
    "imp_product_category_idx" btree (category_id)
Check constraints:
    "imp_product_category_id" CHECK (category_id <> 0)
    "imp_product_product_id" CHECK (product_id <> 0)
Foreign-key constraints:
    "$1" FOREIGN KEY (category_id) REFERENCES imp_category(category_id)

=====================================================
Selecting works for every row up to offset 129832; the next
row (offset 129833) kills the backend.
=====================================================
catfang_de_broken=# select * from imp_product offset 129832 limit 1;
 product_id | category_id |   name   |  created   | current | active | action
------------+-------------+----------+------------+---------+--------+--------
     761377 |       11996 | KGN 7060 | 2000-11-21 |         | t      |      0
(1 row)
catfang_de_broken=# select * from imp_product offset 129833 limit 1;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
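
(The ctid-only query in the system-tables section further down survives this
same offset, which suggests the crash depends on which columns actually get
extracted.  A column-restricted probe of the suspect row might therefore
narrow it to a particular column; I haven't verified the outcome, that is
exactly what the probe would test.  product_id and category_id sit ahead of
the text column name, and attribute extraction walks a tuple left to right.)

select product_id, category_id from imp_product offset 129833 limit 1;
select name from imp_product offset 129833 limit 1;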

======================================================
The same two queries as recorded in the server log.  No error messages appear
prior to the crash.
======================================================
Feb 18 13:02:11 hog6 postgres[94548]: [130033-3] STATEMENT:  select * from
imp_product offset 129832 limit 1;
Feb 18 13:02:11 hog6 postgres[94548]: [130034-1] LOG:  00000: statement:
select * from imp_product offset 129833 limit 1;
Feb 18 13:02:11 hog6 postgres[94548]: [130034-2] LOCATION:  pg_parse_query,
postgres.c:464
Feb 18 13:02:11 hog6 postgres[94548]: [130034-3] STATEMENT:  select * from
imp_product offset 129833 limit 1;
Feb 18 13:02:34 hog6 postgres[94330]: [1-1] LOG:  00000: server process (PID
94548) was terminated by signal 11
Feb 18 13:02:34 hog6 postgres[94330]: [1-2] LOCATION:  LogChildExit,
postmaster.c:2087
Feb 18 13:02:34 hog6 postgres[94330]: [2-1] LOG:  00000: terminating any other
active server processes
Feb 18 13:02:34 hog6 postgres[94330]: [2-2] LOCATION:  CleanupProc,
postmaster.c:2008
Feb 18 13:02:34 hog6 postgres[94330]: [3-1] LOG:  00000: all server processes
terminated; reinitializing

============================
Running truss shows the file the backend is reading when it crashes
("/dbdata/pgsql/data/base/17142/17234").
Is there anything to be inferred from the mmap or munmap?
============================

 2431: open("/dbdata/pgsql/data/base/17142/17234",0x2,0600) = 35 (0x23)
 2431: lseek(35,0x0,0x343332)                    = 144023552 (0x895a000)
 2431: open("/dbdata/pgsql/data/base/17142/16652",0x2,0600) = 36 (0x24)
 2431: read(0x24,0x801e83680,0x2000)             = 8192 (0x2000)
 2431: read(0x24,0x801e85680,0x2000)             = 8192 (0x2000)
 2431: break(0x8b0000)                           = 0 (0x0)
 2431: open("/dbdata/pgsql/data/base/17142/1247",0x2,0600) = 37 (0x25)
 2431: read(0x25,0x801e87680,0x2000)             = 8192 (0x2000)
 2431: read(0x25,0x801e89680,0x2000)             = 8192 (0x2000)
 2431: lseek(35,0x0,0x3734)                      = 0 (0x0)
 2431: read(0x23,0x801e8b680,0x2000)             = 8192 (0x2000)
 2431: read(0x23,0x801e8d680,0x2000)             = 8192 (0x2000)
 2431: read(0x23,0x801e8f680,0x2000)             = 8192 (0x2000)
...
 2431: read(0x23,0x801c87680,0x2000)             = 8192 (0x2000)
 2431: read(0x23,0x801c89680,0x2000)             = 8192 (0x2000)
 2431: break(0x20cd5000)                         = 0 (0x0)
 2431: mmap(0x0,1064960,(0x3)PROT_READ|PROT_WRITE,(0x1002)MAP_ANON|
MAP_PRIVATE,4294967295,0x0) = 34394505216 (0x802128000)
 2431: munmap(0x80075b000,0x1000)                = 0 (0x0)
SIGNAL 11 (SIGSEGV)
SIGNAL 11 (SIGSEGV)
Process stopped because of:  16
process exit, rval = 139
SIGNAL 20 (SIGCHLD)
 2391: select(6,{3 4 5},0x0,0x0,{120 0})         ERR#4 'Interrupted system
call'
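
(If the first lseek on fd 35 (0x23 in the read calls) is the usual end-of-file
size check, the relation is 144023552 bytes, which divides evenly into 8 kB
pages, so the suspect block 4743 is nowhere near the end of the file.  The
0x2000-byte reads on the same descriptor would then just be the sequential
scan pulling pages.)

select 144023552 / 8192;  -- 17581 pages of 8192 bytes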

=======================
The system catalogs confirm that the file above belongs to the table we've
isolated (/dbdata/pgsql/data/base/17142/17234 is imp_product), and ctid puts
the row at offset 129833 in block 4743.
=======================
catfang_de_broken=# select relfilenode from pg_class where
relname='imp_product';
 relfilenode
-------------
       17234
catfang_de_broken=# select ctid from imp_product offset 129832 limit 1;
   ctid
-----------
 (4743,60)
(1 row)

catfang_de_broken=# select ctid from imp_product offset 129833 limit 1;
   ctid
-----------
 (4743,61)
(1 row)
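
(Tying the path and the block number together: the 17142 directory in the
path should be this database's OID, and the ctid block number times the 8 kB
page size matches the "Block Offset" pg_filedump reports for block 4743
further down.)

select oid from pg_database where datname = current_database();
-- expected to return 17142, i.e. base/17142/17234 is imp_product in this database

select 4743 * 8192;
-- 38854656 = 0x0250e000, the Block Offset shown below for block 4743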

=================================
Copying the file to a new inode does not change anything.  So am I right to
conclude table corruption and start developing a deep mistrust of this
relatively new server?  Or do I need to look more widely?  I tried to look at
the block indicated above using pg_filedump, but I don't know enough to see
anything obviously wrong.  I'm hoping one of you might.
=================================
./pg_filedump -R 4743 -i -f /dbdata/pgsql/data/base/17142/17234
Block 4743 ********************************************************
<Header> -----
 Block Offset: 0x0250e000         Offsets: Lower     320 (0x0140)
 Block: Size 8192  Version    1            Upper     344 (0x0158)
 LSN:  logid     19 recoff 0xca5ad618      Special  8192 (0x2000)
 Items:   75                   Free Space:   24
 Length (including item array): 324

  0000: 13000000 18d65aca 0f000000 40015801  ......Z.....@.X.
  0010: 00200120 b09f9800 d89ea801 70826001  . . ........p.`.
  0020: 589e0001 089e9800 709d3001 289d8800  X.......p.0.(...
  0030: b89cd800 689ca000 109ca800 d09b8000  ....h...........
  0040: 689bd000 089bc000 989ad800 509a8800  h...........P...
  0050: e899c800 9899a000 3899c000 c098f000  ........8.......
  0060: 6898a800 8897b801 2097d000 b096d800  h....... .......
  0070: 4896c800 c0816001 e095c800 8895a800  H.....`.........
  0080: 1095f000 b094b800 4894d000 d893e000  ........H.......
  0090: 6893d800 0893b800 a092c800 3892c800  h...........8...
  00a0: d891c000 8891a000 1091f000 b090c000  ................
  00b0: 4890d000 e08fc800 688fe800 008fd000  H.......h.......
  00c0: a08ec000 388ec800 d08dc800 608dd800  ....8.......`...
  00d0: 008dc000 a08cb800 388cd000 d08bc800  ........8.......
  00e0: 708bc000 188bb000 d88a8000 908a8800  p...............
  00f0: 508a8000 088a8800 c8898000 88898000  P...............
  0100: 48898000 00898800 b8888800 78888000  H...........x...
  0110: 0888d800 9887d800 00873001 50866001  ..........0.P.`.
  0120: a8855001 3885e000 5881c800 b8840001  ..P.8...X.......
  0130: 5084c800 d0830001 7883a800 2083b000  P.......x... ...
  0140: 00000000                             ....

<Data> ------
 Item   1 -- Length:   76  Offset: 8112 (0x1fb0)  Flags: USED
  XID: min (733)  CMIN|XMAX: 14342668  CMAX|XVAC: 1
  Block Id: 14787  linp Index: 21   Attributes: 7   Size: 32
  infomask: 0x0513 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_COMMITTED)
  t_bits: [0]: 0x6f

  1fb0: dd020000 0cdada00 01000000 0000c339  ...............9
  1fc0: 15000700 1305206f 00000000 10193d06  ...... o......=.
  1fd0: 06740600 8e2e0000 16000000 41736d61  .t..........Asma
  1fe0: 72202d20 59616972 2044616c 616c0000  r - Yair Dalal..
  1ff0: 9b060000 01000000 00000000           ............

...

 Item  60 -- Length:   64  Offset: 2376 (0x0948)  Flags: USED
  XID: min (733)  CMIN|XMAX: 0  CMAX|XVAC: 0
  Block Id: 4743  linp Index: 60   Attributes: 7   Size: 32
  infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
  t_bits: [0]: 0x6f

  0948: dd020000 00000000 00000000 00008712  ................
  0958: 3c000700 1309206f 00000000 4b193d06  <..... o....K.=.
  0968: 219e0b00 dc2e0000 0c000000 4b474e20  !...........KGN
  0978: 37303630 45010000 01000000 00000000  7060E...........

 Item  61 -- Length:   68  Offset: 2304 (0x0900)  Flags: USED
  XID: min (733)  CMIN|XMAX: 0  CMAX|XVAC: 0
  Block Id: 4743  linp Index: 61   Attributes: 7   Size: 32
  infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
  t_bits: [0]: 0x6f

  0900: dd020000 00000000 00000000 00008712  ................
  0910: 3d000700 1309206f 00000000 4c193d06  =..... o....L.=.
  0920: f29c0b00 dc2e0000 0d000041 46414220  ...........AFAB
  0930: 33325853 33000000 52050000 01000000  32XS3...R.......
  0940: 00000000                             ....
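
(For anyone else reading the hex: a text value is stored as a little-endian
4-byte length word, equal to the string length plus 4, followed by the bytes.
The two names visible above decode cleanly that way:)

select length('Asmar - Yair Dalal') + 4;  -- 22 = 0x16, the 16000000 word at 0x1fd8 in item 1
select length('KGN 7060') + 4;            -- 12 = 0x0c, the 0c000000 word at 0x0970 in item 60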

--
Clemens Wehrmann
