Re: i'm really desperate: invalid memory alloc request - Mailing list pgsql-general

From Janning Vygen
Subject Re: i'm really desperate: invalid memory alloc request
Date
Msg-id 200410041913.24845.vygen@gmx.de
In response to Re: i'm really desperate: invalid memory alloc request  (Richard Huxton <dev@archonet.com>)
List pgsql-general
On Friday, 1 October 2004 10:56, Richard Huxton wrote:
> Janning Vygen wrote:
> > Tonight my database got corrupted. Before that it worked fine.
> > In the morning some SQL queries failed. It seems only one table was
> > affected. I stopped all web access and tried to back up the current
> > database:
> >
> > pg_dump: ERROR:  invalid memory alloc request size 0
> > pg_dump: SQL command to dump the contents of table "fragentipps" failed:
> > PQendcopy() failed.
> > pg_dump: Error message from server: ERROR:  invalid memory alloc request
> > size 0
> > pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name,
> > fr_id, aw_antworttext) TO stdout;
>
> Does it do this consistently at the same place?

Yes. It happens in one table, whenever I select a certain row. How can
something like this happen?
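
To pin down which row triggers the error, one option is a binary search over
"read the first N rows". This is only a rough sketch under assumptions:
rows_readable and first_bad_row are hypothetical helper names, the database
and table names are passed in as arguments, and it relies on a plain
sequential scan returning rows in the same order on every run, which
PostgreSQL does not guarantee without ORDER BY.

```shell
# rows_readable DB TABLE N
# Succeeds if the first N rows of TABLE can be read and printed without error.
# (Hypothetical helper; relies on psql -c returning non-zero when the query fails.)
rows_readable() {
    psql -d "$1" -c "SELECT * FROM $2 LIMIT $3" > /dev/null 2>&1
}

# first_bad_row DB TABLE TOTAL
# Prints the 1-based position of the first row that cannot be read,
# assuming that reading all TOTAL rows is already known to fail.
first_bad_row() {
    lo=0      # reading the first $lo rows is known to succeed
    hi=$3     # reading the first $hi rows is known to fail
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if rows_readable "$1" "$2" "$mid"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    echo "$hi"
}
```

If it prints P, then "SELECT ... FROM fragentipps LIMIT 1 OFFSET P-1" (selecting
only the key columns) should show which row it is, provided those columns are
still readable.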

> > I tried to recover from a backup which was made just before clustering,
> > but I got
> > ERROR:  index row requires 77768 bytes, maximum size is 8191
>
> There are a few steps - you've already done the first
>   1. Stop PG and take a full copy of the data/ directory
>   2. Check your installation - make sure you don't have multiple
>      versions of pg_dump/libraries/etc installed
>   3. Try dumping individual tables (pg_dump -t table1 ...)
>   4. Reindex/repair files
>   5. Check hardware to make sure it doesn't happen again.
>
> Once you've dumped as many individual tables as you can, you can even
> try selecting data to a file avoiding certain rows if they are causing
> the problem.
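
Step 3 above could be scripted roughly like this. It is a sketch, not a tested
recovery procedure: dump_tables is a hypothetical helper name, and the database
name, output directory and table list are whatever applies locally. It only
uses "pg_dump -t TABLE DBNAME" as suggested above, writes one file per table,
and reports which tables failed.

```shell
# dump_tables DB OUTDIR TABLE...
# Dumps each table into OUTDIR/TABLE.sql. For tables where pg_dump fails,
# the error output is kept in OUTDIR/TABLE.err and the table is listed
# in the final "failed:" line.
dump_tables() {
    db=$1
    outdir=$2
    shift 2
    failed=""
    for t in "$@"; do
        if pg_dump -t "$t" "$db" > "$outdir/$t.sql" 2> "$outdir/$t.err"; then
            rm -f "$outdir/$t.err"      # dump succeeded, no error log needed
        else
            failed="$failed $t"         # remember the table for manual recovery
        fi
    done
    echo "failed:$failed"
}
```

For example, "dump_tables mydb /some/backup/dir table1 table2 fragentipps"
(all names made up apart from fragentipps). Tables listed as failed are the
ones where rows would have to be selected to a file by hand, leaving out the
damaged ones.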

OK, I can recreate most of the data. My main questions now are:
- Why can things like this happen?
- How often do they happen?

> There's more you can do after that, but let's see how that works out.
>
> PS - your next mail mentions sig11 which usually implies hardware
> problems, so don't forget to test the machine thoroughly once this is over.

First I ran the long SMART self-test:

*************
 === START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line   Completed without error       00%      4097         -
*************

and

*************
# smartctl -Hc /dev/hda
smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]
*************

So SMART tells me that everything is fine, but my system messages log shows:

*************
Oct  2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 62 to 61
Oct  2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Oct  2 14:59:00 p15154389 /USR/SBIN/CRON[11428]: (root) CMD ( rm
-f /var/spool/cron/lastrun/cron.hourly)
Oct  2 15:19:55 p15154389 -- MARK --
Oct  2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 61 to 63
Oct  2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 61 to 63
Oct  2 15:31:22 p15154389 su: pam_unix2: session finished for user root,
service su
Oct  2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 63 to 61
Oct  2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 63 to 61
*************

I don't know what this means. After that I ran memtest via a serial console
for hours and hours, but no errors were found!
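
One way to put the smartd messages above in context is to compare each
attribute's normalized VALUE with its THRESH in the output of "smartctl -A".
As far as I understand, smartd logs every change in the normalized value, and
an attribute only counts as failed once VALUE drops to THRESH or below. The
filter below is a small sketch: smart_failing_attrs is a hypothetical helper
name, and it assumes the usual attribute table layout with the ID in column 1,
the name in column 2, VALUE in column 4 and THRESH in column 6.

```shell
# smart_failing_attrs
# Reads "smartctl -A" output on stdin and prints "ID NAME VALUE THRESH" for
# every attribute whose normalized VALUE is at or below a non-zero THRESH.
# Assumed layout: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE ... RAW_VALUE
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 9 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
        print $1, $2, $4, $6
    }'
}
```

It would be used as "smartctl -A /dev/hda | smart_failing_attrs"; empty output
means no attribute is at or below its threshold.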

It's a little bit strange. It would feel much better if the hard disk or the
memory had turned out to be damaged.

So what could be the reason for the SIG11?
Is it safe to use this machine again after testing memory and hardware?

kind regards
janning
