Thread: i'm really desperate: invalid memory alloc request size 0
Hi, tonight my database got corruppted. before it worked fine. since two days i do the following tasks every night psql -c 'CLUSTER;' $DBNAME psql -c 'VACUUM FULL ANALYZE;' $DBNAME before these opertaions i stop all web access. The last months i only did a "VACUUM ANALYZE" each night and didn't get any failures. in the morning some sql queries failed. it seems only one table was affected. i stopped all web access and tried to backup the current database: pg_dump: ERROR: invalid memory alloc request size 0 pg_dump: SQL command to dump the contents of table "fragentipps" failed: PQendcopy() failed. pg_dump: Error message from server: ERROR: invalid memory alloc request size 0 pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name, fr_id, aw_antworttext) TO stdout; so i did: /etc/init.d/postgresql stop cp -rp /home/postgres/data /home/postgres/datafailure and i tried to recover from backup which was made just before clustering but i got ERROR: index row requires 77768 bytes, maximum size is 8191 is there any chance to get my database keep going again? pg version is 7.4.2 kind regards janning
Am Freitag, 1. Oktober 2004 09:49 schrieben Sie: > Hi, > > tonight my database got corruppted. before it worked fine. > > since two days i do the following tasks every night > > psql -c 'CLUSTER;' $DBNAME > psql -c 'VACUUM FULL ANALYZE;' $DBNAME > > before these opertaions i stop all web access. The last months i only did a > "VACUUM ANALYZE" each night and didn't get any failures. > > in the morning some sql queries failed. it seems only one table was > affected. > > i stopped all web access and tried to backup the current database: > > > pg_dump: ERROR: invalid memory alloc request size 0 > pg_dump: SQL command to dump the contents of table "fragentipps" failed: > PQendcopy() failed. > pg_dump: Error message from server: ERROR: invalid memory alloc request > size 0 > pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name, > fr_id, aw_antworttext) TO stdout; > > so i did: > /etc/init.d/postgresql stop > cp -rp /home/postgres/data /home/postgres/datafailure > > and i tried to recover from backup which was made just before clustering > but i got > ERROR: index row requires 77768 bytes, maximum size is 8191 > > is there any chance to get my database keep going again? > > pg version is 7.4.2 > > > kind regards > janning some additional information from my log files: LOG: server process (PID 24227) was terminated by signal 11 LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another ser ver process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. [...] LOG: all server processes terminated; reinitializing LOG: database system was interrupted at 2004-10-01 04:59:46 CEST LOG: checkpoint record is at 6A/A9142BB4 LOG: redo record is at 6A/A9016D68; undo record is at 0/0; shutdown FALSE LOG: next transaction ID: 202721693; next OID: 352799 LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 6A/A9016D68 LOG: record with zero length at 6A/A931CEE0 LOG: redo done at 6A/A931CEBC LOG: recycled transaction log file "0000006A000000A6" LOG: recycled transaction log file "0000006A000000A7" LOG: removing transaction log file "0000006A000000A8" LOG: database system is ready LOG: server process (PID 24235) was terminated by signal 11 LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another ser ver process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another ser ver process exited abnormally and possibly corrupted shared memory. please help me... janning
Janning Vygen wrote: > Hi, > > tonight my database got corruppted. before it worked fine. > > since two days i do the following tasks every night > > psql -c 'CLUSTER;' $DBNAME > psql -c 'VACUUM FULL ANALYZE;' $DBNAME > > before these opertaions i stop all web access. The last months i only did a > "VACUUM ANALYZE" each night and didn't get any failures. > > in the morning some sql queries failed. it seems only one table was affected. > i stopped all web access and tried to backup the current database: > > > pg_dump: ERROR: invalid memory alloc request size 0 > pg_dump: SQL command to dump the contents of table "fragentipps" failed: > PQendcopy() failed. > pg_dump: Error message from server: ERROR: invalid memory alloc request size > 0 > pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name, > fr_id, aw_antworttext) TO stdout; Does it do this consistently at the same place? > so i did: > /etc/init.d/postgresql stop > cp -rp /home/postgres/data /home/postgres/datafailure Good - we know we've got a copy of everything. > and i tried to recover from backup which was made just before clustering but i > got > ERROR: index row requires 77768 bytes, maximum size is 8191 > > is there any chance to get my database keep going again? There are a few steps - you've already done the first 1. Stop PG and take a full copy of the data/ directory 2. Check your installation - make sure you don't have multiple versions of pg_dump/libraries/etc installed 3. Try dumping individual tables (pg_dump -t table1 ...) 4. Reindex/repair files 5. Check hardware to make sure it doesn't happen again. Once you've dumped as many individual tables as you can, you can even try selecting data to a file avoiding certain rows if they are causing the problem. Then, stop PG and restart a standalone backend. This will let you reindex the database, in case a corrupt index is the problem. The REINDEX and postgres documentation details this, also check the mailing list archives. There's more you can do after that, but let's see how that works out. PS - your next mail mentions sig11 which usually implies hardware problems, so don't forget to test the machine thoroughly once this is over. > pg version is 7.4.2 Download 7.4.5 - that's got the latest bugfixes in it. -- Richard Huxton Archonet Ltd
Hi Richard, i feared all db gurus are asleep at the moment. Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton: > PS - your next mail mentions sig11 which usually implies hardware > problems, so don't forget to test the machine thoroughly once this is over. You saved my life!! Nothing less! This was a great help cause i never thought that it could be a hardware problem. I took a dump from last night and tried to recover on the original machine. it didnt work as i wrote. but when i tried to install it on another machine it just worked fine. So everything is up and running. I still have all the corrupt files in place and now i try to determine what went wrong. As it is obviously a hardware problem, my question is now: how can i check my hardware (disk)? How can i get informed next time when things are going wrong? Ok i will come up with a lot of questions as soon as i had another coffee because i never want to feel so helpless again. There really should be a section in the manual like "desaster recovery" which shows some tricks and methods. > > pg version is 7.4.2 > > Download 7.4.5 - that's got the latest bugfixes in it. i will as soon as my nerves are cooling down again :-) kind regard, janning > -- > Richard Huxton > Archonet Ltd
Janning Vygen wrote: > Hi Richard, > > i feared all db gurus are asleep at the moment. They are, that's why you've got me :-) > Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton: > >>PS - your next mail mentions sig11 which usually implies hardware >>problems, so don't forget to test the machine thoroughly once this is over. > > You saved my life!! Nothing less! > > This was a great help cause i never thought that it could be a hardware > problem. I took a dump from last night and tried to recover on the original > machine. it didnt work as i wrote. but when i tried to install it on another > machine it just worked fine. > > So everything is up and running. > > I still have all the corrupt files in place and now i try to determine what > went wrong. > > As it is obviously a hardware problem, my question is now: how can i check my > hardware (disk)? > > How can i get informed next time when things are going wrong? Well, it might be memory too. You probably want to run memtest86 for a day or two. Bonnie++ is disk performance rather than testing, but will stress the system. http://www.linuxtested.com/linux_tools.html Many modern drives offer SMART disk monitoring - google for tools to display the relevant statistics. > Ok i will come up with a lot of questions as soon as i had another coffee > because i never want to feel so helpless again. > > There really should be a section in the manual like "desaster recovery" which > shows some tricks and methods. It doesn't happen often enough to warrant a chapter, but someone should write something step-by-step. >>>pg version is 7.4.2 >> >>Download 7.4.5 - that's got the latest bugfixes in it. > > i will as soon as my nerves are cooling down again :-) Nice to be able to stop screaming, isn't it ;-) -- Richard Huxton Archonet Ltd
Am Freitag, 1. Oktober 2004 10:56 schrieb Richard Huxton: > Janning Vygen wrote: > > tonight my database got corruppted. before it worked fine. > > in the morning some sql queries failed. it seems only one table was > > affected. i stopped all web access and tried to backup the current > > database: > > > > pg_dump: ERROR: invalid memory alloc request size 0 > > pg_dump: SQL command to dump the contents of table "fragentipps" failed: > > PQendcopy() failed. > > pg_dump: Error message from server: ERROR: invalid memory alloc request > > size 0 > > pg_dump: The command was: COPY public.fragentipps (tr_kurzname, mg_name, > > fr_id, aw_antworttext) TO stdout; > > Does it do this consistently at the same place? Yes. It is in one table if i select a certain row. How can stuff like this can happen? > > i tried to recover from backup which was made just before clustering > > but i got > > ERROR: index row requires 77768 bytes, maximum size is 8191 > > There are a few steps - you've already done the first > 1. Stop PG and take a full copy of the data/ directory > 2. Check your installation - make sure you don't have multiple > versions of pg_dump/libraries/etc installed > 3. Try dumping individual tables (pg_dump -t table1 ...) > 4. Reindex/repair files > 5. Check hardware to make sure it doesn't happen again. > > Once you've dumped as many individual tables as you can, you can even > try selecting data to a file avoiding certain rows if they are causing > the problem. Ok, i can recreate most of the data. My main question is now: - Why does things like this can happen? - how often do they happen? > There's more you can do after that, but let's see how that works out. > > PS - your next mail mentions sig11 which usually implies hardware > problems, so don't forget to test the machine thoroughly once this is over. first i ran the long smart selftest: ************* === START OF READ SMART DATA SECTION === SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended off-line Completed without error 00% 4097 - ************* AND ************* # smartctl -Hc /dev/hda smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED [...] ************* so SMART tells me that everything is fine. but in my messages ************* Oct 2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 62 to 61 Oct 2 14:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61 Oct 2 14:59:00 p15154389 /USR/SBIN/CRON[11428]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly) Oct 2 15:19:55 p15154389 -- MARK -- Oct 2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 61 to 63 Oct 2 15:20:46 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 63 Oct 2 15:31:22 p15154389 su: pam_unix2: session finished for user root, service su Oct 2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 63 to 61 Oct 2 15:50:45 p15154389 smartd[11205]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 61 ************* don't know what it means. after that i run memtest via a serial console for hours and hours but no errors where found! Its a little bit strange. It would feel much nicer if harddisk oder memory were damaged. so what could be the reason for SIG11?? is it save to use this machine again after testing memory and hardware? kind regards janning