Thread: i'm really desperate: invalid memory alloc request size 0

i'm really desperate: invalid memory alloc request size 0

From
Janning Vygen
Date:
Hi,

tonight my database got corruppted. before it worked fine.

since two days i do the following tasks every night

psql -c 'CLUSTER;' $DBNAME
psql -c 'VACUUM FULL ANALYZE;' $DBNAME

before these opertaions i stop all web access. The last months i only did a
"VACUUM ANALYZE" each night and didn't get any failures.

in the morning some sql queries failed. it seems only one table was affected.

i stopped all web access and tried to backup the current database:


pg_dump: ERROR:  invalid memory alloc request size 0
pg_dump: SQL command to dump the contents of table "fragentipps" failed:
PQendcopy() failed.
pg_dump: Error message from server: ERROR:  invalid memory alloc request size
0
pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name,
fr_id, aw_antworttext) TO stdout;

so i did:
/etc/init.d/postgresql stop
cp -rp /home/postgres/data /home/postgres/datafailure

and i tried to recover from backup which was made just before clustering but i
got
ERROR:  index row requires 77768 bytes, maximum size is 8191

is there any chance to get my database keep going again?

pg version is 7.4.2


kind regards
janning

Re: i'm really desperate: invalid memory alloc request size 0

From
Janning Vygen
Date:
Am Freitag, 1. Oktober 2004 09:49 schrieben Sie:
> Hi,
>
> tonight my database got corruppted. before it worked fine.
>
> since two days i do the following tasks every night
>
> psql -c 'CLUSTER;' $DBNAME
> psql -c 'VACUUM FULL ANALYZE;' $DBNAME
>
> before these opertaions i stop all web access. The last months i only did a
> "VACUUM ANALYZE" each night and didn't get any failures.
>
> in the morning some sql queries failed. it seems only one table was
> affected.
>
> i stopped all web access and tried to backup the current database:
>
>
> pg_dump: ERROR:  invalid memory alloc request size 0
> pg_dump: SQL command to dump the contents of table "fragentipps" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  invalid memory alloc request
> size 0
> pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name,
> fr_id, aw_antworttext) TO stdout;
>
> so i did:
> /etc/init.d/postgresql stop
> cp -rp /home/postgres/data /home/postgres/datafailure
>
> and i tried to recover from backup which was made just before clustering
> but i got
> ERROR:  index row requires 77768 bytes, maximum size is 8191
>
> is there any chance to get my database keep going again?
>
> pg version is 7.4.2
>
>
> kind regards
> janning


some additional information from my log files:

LOG:  server process (PID 24227) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another ser
ver process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat
your command.


[...]


LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2004-10-01 04:59:46 CEST
LOG:  checkpoint record is at 6A/A9142BB4
LOG:  redo record is at 6A/A9016D68; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 202721693; next OID: 352799
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 6A/A9016D68
LOG:  record with zero length at 6A/A931CEE0
LOG:  redo done at 6A/A931CEBC
LOG:  recycled transaction log file "0000006A000000A6"
LOG:  recycled transaction log file "0000006A000000A7"
LOG:  removing transaction log file "0000006A000000A8"
LOG:  database system is ready
LOG:  server process (PID 24235) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another ser
ver process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat
your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another ser
ver process exited abnormally and possibly corrupted shared memory.


please help me...

janning

Re: i'm really desperate: invalid memory alloc request

From
Richard Huxton
Date:
Janning Vygen wrote:
> Hi,
>
> tonight my database got corruppted. before it worked fine.
>
> since two days i do the following tasks every night
>
> psql -c 'CLUSTER;' $DBNAME
> psql -c 'VACUUM FULL ANALYZE;' $DBNAME
>
> before these opertaions i stop all web access. The last months i only did a
> "VACUUM ANALYZE" each night and didn't get any failures.
>
> in the morning some sql queries failed. it seems only one table was affected.
> i stopped all web access and tried to backup the current database:
  >
>
> pg_dump: ERROR:  invalid memory alloc request size 0
> pg_dump: SQL command to dump the contents of table "fragentipps" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  invalid memory alloc request size
> 0
> pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name,
> fr_id, aw_antworttext) TO stdout;

Does it do this consistently at the same place?

> so i did:
> /etc/init.d/postgresql stop
> cp -rp /home/postgres/data /home/postgres/datafailure

Good - we know we've got a copy of everything.

> and i tried to recover from backup which was made just before clustering but i
> got
> ERROR:  index row requires 77768 bytes, maximum size is 8191
>
> is there any chance to get my database keep going again?

There are a few steps - you've already done the first
  1. Stop PG and take a full copy of the data/ directory
  2. Check your installation - make sure you don't have multiple
     versions of pg_dump/libraries/etc installed
  3. Try dumping individual tables (pg_dump -t table1 ...)
  4. Reindex/repair files
  5. Check hardware to make sure it doesn't happen again.

Once you've dumped as many individual tables as you can, you can even
try selecting data to a file avoiding certain rows if they are causing
the problem.

Then, stop PG and restart a standalone backend. This will let you
reindex the database, in case a corrupt index is the problem. The
REINDEX and postgres documentation details this, also check the mailing
list archives.

There's more you can do after that, but let's see how that works out.

PS - your next mail mentions sig11 which usually implies hardware
problems, so don't forget to test the machine thoroughly once this is over.

> pg version is 7.4.2

Download 7.4.5 - that's got the latest bugfixes in it.

--
   Richard Huxton
   Archonet Ltd

Re: i'm really desperate: invalid memory alloc request size 0

From
Janning Vygen
Date:
Hi Richard,

i feared all db gurus are asleep at the moment.

Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton:
> PS - your next mail mentions sig11 which usually implies hardware
> problems, so don't forget to test the machine thoroughly once this is over.

You saved my life!! Nothing less!

This was a great help cause i never thought that it could be a hardware
problem. I took a dump from last night and tried to recover on the original
machine. it didnt work as i wrote. but when i tried to install it on another
machine it just worked fine.

So everything is up and running.

I still have all the corrupt files in place and now  i try to determine what
went wrong.

As it is obviously a hardware problem, my question is now: how can i check my
hardware (disk)?

How can i get informed next time when things are going wrong?

Ok i will come up with a lot of questions as soon as i had another coffee
because i never want to feel so helpless again.

There really should be a section in the manual like "desaster recovery" which
shows some tricks and methods.

> > pg version is 7.4.2
>
> Download 7.4.5 - that's got the latest bugfixes in it.

i will as soon as my nerves are cooling down again :-)

kind regard,
janning

> --
>    Richard Huxton
>    Archonet Ltd

Re: i'm really desperate: invalid memory alloc request

From
Richard Huxton
Date:
Janning Vygen wrote:
> Hi Richard,
>
> i feared all db gurus are asleep at the moment.

They are, that's why you've got me :-)

> Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton:
>
>>PS - your next mail mentions sig11 which usually implies hardware
>>problems, so don't forget to test the machine thoroughly once this is over.
>
> You saved my life!! Nothing less!
>
> This was a great help cause i never thought that it could be a hardware
> problem. I took a dump from last night and tried to recover on the original
> machine. it didnt work as i wrote. but when i tried to install it on another
> machine it just worked fine.
>
> So everything is up and running.
>
> I still have all the corrupt files in place and now  i try to determine what
> went wrong.
>
> As it is obviously a hardware problem, my question is now: how can i check my
> hardware (disk)?
>
> How can i get informed next time when things are going wrong?

Well, it might be memory too. You probably want to run memtest86 for a
day or two. Bonnie++ is disk performance rather than testing, but will
stress the system.
   http://www.linuxtested.com/linux_tools.html

Many modern drives offer SMART disk monitoring - google for tools to
display the relevant statistics.

> Ok i will come up with a lot of questions as soon as i had another coffee
> because i never want to feel so helpless again.
>
> There really should be a section in the manual like "desaster recovery" which
> shows some tricks and methods.

It doesn't happen often enough to warrant a chapter, but someone should
write something step-by-step.

>>>pg version is 7.4.2
>>
>>Download 7.4.5 - that's got the latest bugfixes in it.
>
> i will as soon as my nerves are cooling down again :-)

Nice to be able to stop screaming, isn't it ;-)

--
   Richard Huxton
   Archonet Ltd

Re: i'm really desperate: invalid memory alloc request

From
Janning Vygen
Date:
Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton:
> Janning Vygen wrote:
> > tonight my database got corruppted. before it worked fine.
> > in the morning some sql queries failed. it seems only one table was
> > affected. i stopped all web access and tried to backup the current
> > database:
> >
> > pg_dump: ERROR:  invalid memory alloc request size 0
> > pg_dump: SQL command to dump the contents of table "fragentipps" failed:
> > PQendcopy() failed.
> > pg_dump: Error message from server: ERROR:  invalid memory alloc request
> > size 0
> > pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name,
> > fr_id, aw_antworttext) TO stdout;
>
> Does it do this consistently at the same place?

Yes. It is in one table if i select a certain row. How can stuff like this can
happen?

> > i tried to recover from backup which was made just before clustering
> > but i got
> > ERROR:  index row requires 77768 bytes, maximum size is 8191
>
> There are a few steps - you've already done the first
>   1. Stop PG and take a full copy of the data/ directory
>   2. Check your installation - make sure you don't have multiple
>      versions of pg_dump/libraries/etc installed
>   3. Try dumping individual tables (pg_dump -t table1 ...)
>   4. Reindex/repair files
>   5. Check hardware to make sure it doesn't happen again.
>
> Once you've dumped as many individual tables as you can, you can even
> try selecting data to a file avoiding certain rows if they are causing
> the problem.

Ok, i can recreate most of the data. My main question is now:
- Why does things like this can happen?
- how often do they happen?

> There's more you can do after that, but let's see how that works out.
>
> PS - your next mail mentions sig11 which usually implies hardware
> problems, so don't forget to test the machine thoroughly once this is over.

first i ran the long smart selftest:

*************
 === START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)
LBA_of_first_error
# 1  Extended off-line   Completed without error       00%      4097         -
*************

AND

*************
# smartctl -Hc /dev/hda
smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]
*************

so SMART tells me that everything is fine. but in my messages

*************
Oct  2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 62 to 61
Oct  2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Oct  2 14:59:00 p15154389 /USR/SBIN/CRON[11428]: (root) CMD ( rm
-f /var/spool/cron/lastrun/cron.hourly)
Oct  2 15:19:55 p15154389 -- MARK --
Oct  2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 61 to 63
Oct  2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 61 to 63
Oct  2 15:31:22 p15154389 su: pam_unix2: session finished for user root,
service su
Oct  2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure
Attribute: 1 Raw_Read_Error_Rate changed from 63 to 61
Oct  2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage
Attribute: 195 Hardware_ECC_Recovered changed from 63 to 61
*************

don't know what it means. after that i run memtest via a serial console for
hours and hours but no errors where found!

Its a little bit strange. It would feel much nicer if harddisk oder memory
were damaged.

so what could be the reason for SIG11??
is it save to use this machine again after testing memory and hardware?

kind regards
janning