Thread: pg_xlog disk full error, i need help
Hi,

I do a nightly CLUSTER and VACUUM on one of my production databases.

Yesterday morning the VACUUM process was still running after 8 hours. That was very unusual and I didn't know exactly what to do, so I tried to stop the process. When that didn't work, I killed -9 the VACUUM process. I restarted the database and everything worked fine again. I knew this was NOT a good idea, but I had to find a quick solution, and it did work, at least.

Tonight something very strange happened before or while the CLUSTER ran:

PANIC: could not write to file "/home/postgres/data/pg_xlog/xlogtemp.6434": No space left on device
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
connection to server was lost
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
connection to server was lost

My disk had run full with 100 GB (!) of data/pg_xlog/ files. I deleted some files on the same partition, after which I had 3 GB of free space again. Then I tried to start the postmaster. The startup process logged this:

LOG: database system shutdown was interrupted at 2005-03-28 09:33:15 CEST
LOG: checkpoint record is at F/EE0F0010
LOG: redo record is at F/EC007900; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 46558173; next OID: 58970
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at F/EC007900

That looks fine, as it says "automatic recovery in progress", but there have been no more log entries since startup, and my process table says:

 8495 pts/0 S 0:00 /usr/local/pgsql/bin/postmaster -D /home/postgres/data
 8498 pts/0 S 0:00 postgres: stats buffer process
 8499 pts/0 S 0:00 postgres: stats collector process
 8500 pts/0 D 5:15 postgres: startup subprocess

and top says:

 8500 postgres 15 0 131m 131m 131m D 18.9 6.5 5:18.26 postmaster

so the postmaster is still working.

How long will it work on this problem? Can I expect to have everything working correctly after this startup process, or should I stop it and use a backup (which I hope is usable and not corrupt)?

I am kind of helpless in this situation, as I don't know much about all the underlying storage, WAL and xlog things. Maybe I could just delete all the files in this directory?

Can anybody give me some hints on what to do or how to ask? I am really desperate at the moment.

kind regards,
Janning

Please excuse bad English and typos, I am kind of nervous at the moment.
Hi Janning!

You need to expand your pg_xlog partition. If you use reiserfs, you can do this with resize_reiserfs; if you use ext2/ext3, you can try resize2fs.

If you need to repartition your filesystem, from my own experience I recommend that you use LVM. With LVM you can expand easily: add more disks to your volume group and grow your logical volume. This has already saved me some hours of maintenance more than once.
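Roughly, the steps look like this -- only a sketch; the device and volume names (/dev/sdb1, vg0, lvdata) are examples that you must adapt to your own setup, and check whether your kernel and filesystem support online resizing, otherwise unmount first:

  pvcreate /dev/sdb1                 # prepare the new disk for LVM
  vgextend vg0 /dev/sdb1             # add it to the existing volume group
  lvextend -L +50G /dev/vg0/lvdata   # grow the logical volume holding pg_xlog
  resize2fs /dev/vg0/lvdata          # grow ext2/ext3 (use resize_reiserfs for reiserfs)

Regards,

Gustavo Franklin Nóbrega
Infrastructure and Databases
Planae Tecnologia da Informação
(+55) 14 2106-3514
http://www.planae.com.br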
On Mon, 28 Mar 2005 08:46:09 -0300, Gustavo Franklin Nóbrega - Planae <gfnobrega@planae.com.br> wrote:
> You need to expand your pg_xlog partition. [...]

Or free up space by deleting something non-critical from that partition.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                          netllama@gmail.com
LlamaLand                            http://netllama.linux-sxs.org
"Janning Vygen" <vygen@gmx.de> writes: > My disk was running full with 100 GB (!) of data/pg_xlog/ files. The only way for pg_xlog to bloat vastly beyond what it's supposed to be (which is to say, about twice your checkpoint_segments setting) is if checkpoints are somehow blocked from happening. The only mechanism I know about for that is that in 7.4.* (maybe 7.3.* too) a very large btree CREATE INDEX or REINDEX operation can block checkpoints until it completes. Did you have something like that going on? Anyway, replaying that much log is gonna take awhile :-(. I think you have only two choices: 1. Grin and bear it. 2. Kill the replay process, then use pg_resetxlog to throw away the xlog. Then pray you didn't lose anything critical by doing so. If you know that there was nothing going on except the supposed index build, then you can be pretty sure that #2 will lose nothing except the incomplete index, so it might be a workable alternative. regards, tom lane
On Monday, 28 March 2005 13:46, Gustavo Franklin Nóbrega - Planae wrote:
> You need to expand your pg_xlog partition. If you use reiserfs, you can
> do this with resize_reiserfs; if you use ext2/ext3, you can try resize2fs.

This is not an option for me at the moment, because my disk is only 120 GB. With 100 GB of xlog files I don't have any space left at all, besides some small amounts of disk space on other partitions like /usr etc.

> If you need to repartition your filesystem, from my own experience I
> recommend that you use LVM. With LVM you can expand easily: add more disks
> to your volume group and grow your logical volume.

Yes, I really need to take a look at LVM after cooling down again.

kind regards,
janning

--
PLANWERK 6 websolutions
Herzogstraße 85, 40215 Düsseldorf
Tel.: 0211-6015919 Fax: 0211-6015917
http://www.planwerk6.de
On Monday, 28 March 2005 18:06, Tom Lane wrote:
> "Janning Vygen" <vygen@gmx.de> writes:
> > My disk was running full with 100 GB (!) of data/pg_xlog/ files.
>
> The only way for pg_xlog to bloat vastly beyond what it's supposed to be
> (which is to say, about twice your checkpoint_segments setting) is if
> checkpoints are somehow blocked from happening. The only mechanism I
> know about for that is that in 7.4.* (maybe 7.3.* too) a very large
> btree CREATE INDEX or REINDEX operation can block checkpoints until it
> completes. Did you have something like that going on?

First of all, I am running 7.4. A CLUSTER was running, which to me is somewhat similar to REINDEX, isn't it? And the night before, I killed -9 my nightly VACUUM process, which had not returned after 6 hours or so. First I tried to stop the postmaster with the init.d script, which didn't work at all. I think killing this VACUUM process was not a good idea. 24 hours after killing that process, this ugly xlog thing happened while executing CLUSTER. The pg_dump right before the CLUSTER worked fine.

Besides the 100 GB of xlog, another strange thing is that I had about 42 GB in the directory data/base. It should be about 4 GB, and I vacuum and cluster every night.

> Anyway, replaying that much log is gonna take awhile :-(. I think you
> have only two choices:
> 1. Grin and bear it.

I tried, for several hours.

> 2. Kill the replay process, then use pg_resetxlog to throw away the xlog.
> Then pray you didn't lose anything critical by doing so.

I killed the process and used a database backup from just before the error occurred.

> If you know that there was nothing going on except the supposed index
> build, then you can be pretty sure that #2 will lose nothing except the
> incomplete index, so it might be a workable alternative.

When it comes to trouble with PostgreSQL, I always have the feeling of not knowing enough of the stuff that is NOT in the docs. I had another ugly situation a year ago, and when in trouble it's very difficult to act calm. Isn't there more information about troubleshooting than reading the PostgreSQL code and the list archives? I am not an expert DBA (I wouldn't call myself a DBA at all, besides the fact that I am actually doing the administration). But I am willing to learn.

kind regards,
janning

--
PLANWERK 6 websolutions
Herzogstraße 85, 40215 Düsseldorf
Tel.: 0211-6015919 Fax: 0211-6015917
http://www.planwerk6.de
Janning Vygen <vygen@gmx.de> writes:
> It looks like something/someone wrote so much data in my table that the
> nightly clustering process just didn't succeed because of disk full failure
> after writing too many pg_xlog files. The writing of so many pg_xlog files
> now makes more sense to me when clustering 68 GByte of data.

Yeah, CLUSTER proceeds by rewriting the table and then invoking REINDEX on each index, so the checkpoint lockout problem will apply. Tell you the truth, the best and perhaps only answer for you is to update to 8.0, where that problem is solved.

> How do i get the tablename using this filenode? (restarting the database is
> not an option)

Sure it is. pg_resetxlog will allow you to restart ... possibly you will lose some transactions, but if the only thing going on was the CLUSTER, nothing of value will be lost.

What I would expect to be happening in a CLUSTER is that there would be an "old" file plus a "new" file of similar size, for both the table itself and each index that's been processed (up to the point of failure, where you will have a partially-written new index). After restart with this method, you will find only the "old" files listed in pg_class. You'll want to manually delete the unreferenced "new" files.
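Once the database is up again, mapping a filenode back to a table is just a pg_class lookup -- the number below is only a placeholder, substitute the file name you see on disk:

  SELECT relname, relkind FROM pg_class WHERE relfilenode = 12345;

Any file in that database's subdirectory whose number matches no row in pg_class is one of the orphaned "new" files and can be removed.

			regards, tom lane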
On Tuesday, 29 March 2005 16:37, Tom Lane wrote:
> > How do i get the tablename using this filenode? (restarting the database
> > is not an option)
>
> Sure it is. pg_resetxlog will allow you to restart ... possibly you
> will lose some transactions, but if the only thing going on was the
> CLUSTER, nothing of value will be lost.

* I just can't restart it: I gzipped all the pg_xlog files in the crashed database to have enough space to get my backup running. As my database server is not on my LAN, I can't download 100 GB of files, and I can't gunzip all the pg_xlog files again to start it on the same server. I could delete all the files in my pg_xlog directory and then try to start the database with another compiled instance of PostgreSQL, but as this is a production database, that's not a good idea. And I can't move 100 GB (or even only the 60 GB in the base directory) to another server, because no server has enough space, nor is transferring 60 GB very cheap. So I have no idea how to get it running again.

* What I really want to know is how to interpret the results of pg_filedump. I didn't find any documentation besides the README. Most of the output is easy to understand, some of it is not. For example: what does "Flags: USED" in a data item mean?

* Anyway: I am still confused about how my table could get this big overnight. I stop all database activity by shutting down apache in a nightly cronjob, and then my nightly job runs:

  pg_dump -Fc $DBNAME > $BACKUP_FILE
  psql -c 'SELECT update_tspt_aktuell();' $DBNAME
  psql -c 'CLUSTER;' $DBNAME

a) The dump file is just fine, and the one table is not as big as 60 GB! The whole base directory after reinstalling is 1.4 GB.
b) The function is this:

----------------------------
CREATE OR REPLACE FUNCTION update_tspt_aktuell () RETURNS integer
LANGUAGE 'plpgsql' AS '
DECLARE
  var_count integer;
BEGIN
  UPDATE Tippspieltage SET tspt_aktuell = false WHERE tspt_aktuell;

  UPDATE Tippspieltage SET tspt_aktuell = true
  FROM (
    SELECT DISTINCT ON (tspt2sp.tr_kurzname)
           tspt2sp.tr_kurzname,
           tspt2sp.tspt_sort,
           MIN(abs(EXTRACT(epoch FROM date_trunc(''day'', sp.sp_termin) - CURRENT_DATE))) AS timediff
    FROM Tippspieltage2Spiele AS tspt2sp
         LEFT JOIN Spiele AS sp USING (sp_id)
    GROUP BY tspt2sp.tr_kurzname, tspt2sp.tspt_sort
    ORDER BY tspt2sp.tr_kurzname, timediff ASC, tspt2sp.tspt_sort DESC
  ) AS tspt_akt
  WHERE Tippspieltage.tr_kurzname = tspt_akt.tr_kurzname
    AND Tippspieltage.tspt_sort = tspt_akt.tspt_sort;

  GET DIAGNOSTICS var_count = ROW_COUNT;
  RETURN var_count;
END;
';
----------------------------

My cron mail reports success and the number of modified rows:

 update_tspt_aktuell
---------------------
                5872
(1 row)

So there is no reason I can see for this function to produce 60 GB of data.

c) After this function, the CLUSTER fails, starting with:

  PANIC: could not write to file "/home/postgres/data/pg_xlog/xlogtemp.24223": No space left on device

The cron job took 2:15 hours to run and report this failure. I guess the clustering is somehow broken (maybe because I use many multi-column natural keys).

* Conclusion: I think I will stop clustering every night until I have upgraded to 8.0, but it leaves me very unsatisfied not to know the reason for this kind of db failure (or human error or whatever it is).

> What I would expect to be happening in a CLUSTER is that there would be
> an "old" file plus a "new" file of similar size, for both the table
> itself and each index that's been processed (up to the point of failure,
> where you will have a partially-written new index). After restart with
> this method, you will find only the "old" files listed in pg_class.
> You'll want to manually delete the unreferenced "new" files.

OK, but the clustering should not grow the file from 500 MB to 64 GB (split across many files).

Another strange thing: I looked at one of my database servers, and it has checkpoint_segments = 30 but 62 files in pg_xlog instead of 2*30+1. Normally I wouldn't care, but maybe there is something wrong there too.
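For reference, this is how I counted (paths as on my server; the expected ceiling, if I understood you correctly, is about 2 * checkpoint_segments + 1 segment files):

  psql -c 'SHOW checkpoint_segments;' $DBNAME
  ls /home/postgres/data/pg_xlog | wc -l

I really appreciate your help so far. It would be even nicer if you could send some more thoughts on my questions, as I am willing to learn.

kind regards,
janning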