Re: Debugging a postgresql server crash. - Mailing list pgsql-general

From Tom Lane
Subject Re: Debugging a postgresql server crash.
Msg-id 19694.1108750514@sss.pgh.pa.us
In response to Debugging a postgresql server crash.  (Clemens Wehrmann <clemens.wehrmann@ciao-de.com>)
List pgsql-general
Clemens Wehrmann <clemens.wehrmann@ciao-de.com> writes:
> I am trying to debug a postgresql server crash and am out of my depth.

I think you're doing pretty well, actually ...

> Selecting works for records where offset < 129833, but is broken for the next
> record.

You mean offset < 129831, no?  At least that's what your example implies:

> catfang_de_broken=# select * from imp_product offset 129832 limit 1;
>  product_id | category_id |   name   |  created   | current | active | action
> ------------+-------------+----------+------------+---------+--------+--------
>      761377 |       11996 | KGN 7060 | 2000-11-21 |         | t      |      0
> (1 row)
> catfang_de_broken=# select * from imp_product offset 129831 limit 1;
> FATAL:  terminating connection due to administrator command

It does look like a corrupted-data problem to me.

> catfang_de_broken=# select ctid from imp_product offset 129832 limit 1;
>    ctid
> -----------
>  (4743,60)
> (1 row)

OK, so the bad row is the one before (4743,60).

>  Item   1 -- Length:   76  Offset: 8112 (0x1fb0)  Flags: USED
>   XID: min (733)  CMIN|XMAX: 14342668  CMAX|XVAC: 1
>   Block Id: 14787  linp Index: 21   Attributes: 7   Size: 32
>   infomask: 0x0513 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_COMMITTED)
>   t_bits: [0]: 0x6f

>   1fb0: dd020000 0cdada00 01000000 0000c339  ...............9
>   1fc0: 15000700 1305206f 00000000 10193d06  ...... o......=.
>   1fd0: 06740600 8e2e0000 16000000 41736d61  .t..........Asma
>   1fe0: 72202d20 59616972 2044616c 616c0000  r - Yair Dalal..
>   1ff0: 9b060000 01000000 00000000           ............

> ...

>  Item  60 -- Length:   64  Offset: 2376 (0x0948)  Flags: USED
>   XID: min (733)  CMIN|XMAX: 0  CMAX|XVAC: 0
>   Block Id: 4743  linp Index: 60   Attributes: 7   Size: 32
>   infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
>   t_bits: [0]: 0x6f

>   0948: dd020000 00000000 00000000 00008712  ................
>   0958: 3c000700 1309206f 00000000 4b193d06  <..... o....K.=.
>   0968: 219e0b00 dc2e0000 0c000000 4b474e20  !...........KGN
>   0978: 37303630 45010000 01000000 00000000  7060E...........
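The quoted hex can be checked by hand against what psql printed. The Python sketch below decodes the bytes of item 60 (and the header of item 1). It assumes the heap tuple header layout implied by pg_filedump's labels, as in PostgreSQL 8.0-era sources (t_xmin, an overlaid CMIN|XMAX word, an overlaid CMAX|XVAC word, t_ctid, t_natts, t_infomask, t_hoff, t_bits), on a little-endian machine. The infomask bit values and the column types (int4, int4, a varlena string, date, unknown, bool, int4) are assumptions inferred from the dump and the SELECT output; `decode_header`, `decode_row`, and `COLUMNS` are hypothetical names for illustration only.

```python
import datetime
import struct

# Bytes copied from the quoted pg_filedump output, in on-disk order.
ITEM_1_HEADER = bytes.fromhex(
    "dd020000 0cdada00 01000000 0000c339 15000700 1305206f")
ITEM_60 = bytes.fromhex(
    "dd020000 00000000 00000000 00008712 3c000700 1309206f 00000000 4b193d06"
    "219e0b00 dc2e0000 0c000000 4b474e20 37303630 45010000 01000000 00000000")

# Assumed 24-byte header: t_xmin, CMIN|XMAX, CMAX|XVAC, t_ctid (bi_hi, bi_lo,
# ip_posid), t_natts, t_infomask, t_hoff, first byte of t_bits.
HEADER = struct.Struct("<IIIHHHhHBB")

# Assumed subset of infomask bits, matching the names pg_filedump printed.
INFOMASK_FLAGS = [
    (0x0001, "HASNULL"), (0x0002, "HASVARWIDTH"), (0x0010, "HASOID"),
    (0x0100, "XMIN_COMMITTED"), (0x0200, "XMIN_INVALID"),
    (0x0400, "XMAX_COMMITTED"), (0x0800, "XMAX_INVALID"),
]


def decode_header(raw: bytes) -> dict:
    (xmin, field2, field3, bi_hi, bi_lo, posid,
     natts, infomask, hoff, bits0) = HEADER.unpack_from(raw, 0)
    return {
        "xmin": xmin,
        "cmin_xmax": field2,
        "cmax_xvac": field3,
        "ctid": ((bi_hi << 16) | bi_lo, posid),  # block number is split in two
        "natts": natts,
        "infomask": [name for bit, name in INFOMASK_FLAGS if infomask & bit],
        "hoff": hoff,
        "t_bits": bits0,  # enough for up to 8 attributes
    }


# Hypothetical column types guessed from the SELECT output; "current" is
# None because its type is unknown (it is NULL in item 60).
COLUMNS = [("product_id", "int4"), ("category_id", "int4"), ("name", "varlena"),
           ("created", "date"), ("current", None), ("active", "bool"),
           ("action", "int4")]
PG_EPOCH = datetime.date(2000, 1, 1)  # dates are stored as days since 2000-01-01


def align4(off: int) -> int:
    return (off + 3) & ~3


def decode_row(raw: bytes) -> dict:
    hdr = decode_header(raw)
    off = hdr["hoff"]  # user data starts at t_hoff (the OID sits just before it)
    row = {}
    for i, (name, kind) in enumerate(COLUMNS):
        present = (hdr["t_bits"] >> i) & 1  # a clear bit means NULL
        if not present:
            row[name] = None
            continue
        if kind is None:
            raise ValueError(f"no decoder for non-NULL column {name!r}")
        if kind == "bool":  # 1 byte, no alignment
            row[name] = raw[off] != 0
            off += 1
            continue
        off = align4(off)  # int4, date and 4-byte varlena headers are int-aligned
        if kind == "varlena":
            # Length word counts itself; assumes an inline, uncompressed value
            # in a single-byte encoding.
            (length,) = struct.unpack_from("<I", raw, off)
            length &= 0x3FFFFFFF
            row[name] = raw[off + 4:off + length].decode("latin-1")
            off += length
            continue
        (value,) = struct.unpack_from("<i", raw, off)
        off += 4
        row[name] = value if kind == "int4" else PG_EPOCH + datetime.timedelta(days=value)
    return row


if __name__ == "__main__":
    print(decode_header(ITEM_1_HEADER))
    print(decode_row(ITEM_60))
```

If the assumptions hold, item 60 decodes to 761377, 11996, "KGN 7060", 2000-11-21, NULL, true, 0, the same row psql returned at offset 129832, and item 1's header reproduces pg_filedump's xmin 733, CMIN|XMAX 14342668 and t_ctid (14787,21). A clean decode of item 60 is consistent with the damage lying in an earlier item.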


Unfortunately, you stripped out the part of this dump that's actually
interesting (the items between 1 and 60) ...

If all the rows before item 60 are deleted, it could be that the
corrupted data is actually in a page before this one.  You should check
the CTID of the last row you can retrieve before offset 129831 in order
to know for sure what range of items needs to be looked at.
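One way to apply that: fetch the last readable ctid with something like `select ctid from imp_product offset 129830 limit 1;` and pair it with (4743,60), the row just after the failure. Assuming a plain sequential scan, which visits blocks in ascending order and line pointers in ascending order within each block, the bad tuple lies strictly between those two ctids. The Python sketch below turns such a pair into the item ranges worth dumping; the helper names and the ctid "(4742,17)" used in the example are hypothetical stand-ins, not values from this thread.

```python
import re

CTID_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_ctid(text: str) -> tuple[int, int]:
    """Parse psql's ctid display, e.g. "(4743,60)", into (block, item)."""
    match = CTID_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a ctid: {text!r}")
    return int(match.group(1)), int(match.group(2))


def suspect_ranges(last_good: str, next_good: str) -> list[str]:
    """Line pointers a plain seqscan visits strictly between two ctids.

    Assumes ascending block order and ascending item order within a block.
    Some of these slots may be unused or hold dead tuples; the bad row is
    the one visible tuple among them.
    """
    (b1, i1), (b2, i2) = parse_ctid(last_good), parse_ctid(next_good)
    if (b1, i1) >= (b2, i2):
        raise ValueError("last_good must come before next_good in scan order")
    ranges = []
    if b1 == b2:
        if i2 - i1 > 1:
            ranges.append(f"block {b1}, items {i1 + 1}-{i2 - 1}")
        return ranges
    ranges.append(f"block {b1}, items after {i1}")
    if b2 - b1 > 1:
        ranges.append(f"blocks {b1 + 1}-{b2 - 1}, all items")
    if i2 > 1:
        ranges.append(f"block {b2}, items 1-{i2 - 1}")
    return ranges


if __name__ == "__main__":
    # "(4742,17)" is a made-up stand-in for the ctid at offset 129830.
    for line in suspect_ranges("(4742,17)", "(4743,60)"):
        print(line)
```

If the last readable row turns out to be in block 4743 itself, the result collapses to a single run of items in that page; if it is in an earlier block, the earlier page has to be dumped as well.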

            regards, tom lane
