Thread: pg_xlog disk full error, i need help

pg_xlog disk full error, i need help

From
"Janning Vygen"
Date:
Hi,

I run a nightly CLUSTER and VACUUM on one of my production databases.

Yesterday morning the vacuum process was still running after 8 hours.
That was very unusual and I didn't know exactly what to do, so I tried to stop
the process. When that didn't work, I did a kill -9 on the vacuum process. I
restarted the database and everything worked fine again. I knew that this was
NOT a good idea, but I had to find a quick solution, and it did work, at least.

Tonight something very strange happened before or while the CLUSTER was
running:

PANIC:  could not write to file "/home/postgres/data/pg_xlog/xlogtemp.6434":
No space left on device
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while
processing the request. connection to server was lost
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while
processing the request. connection to server was lost

My disk had filled up with 100 GB (!) of data/pg_xlog/ files. I deleted some
files on the same partition, after which I had 3 GB of free space again. Then
I tried to start the postmaster:

The startup process logged this:

LOG:  database system shutdown was interrupted at 2005-03-28 09:33:15 CEST
LOG:  checkpoint record is at F/EE0F0010
LOG:  redo record is at F/EC007900; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 46558173; next OID: 58970
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at F/EC007900

That looks fine, as it says "automatic recovery in progress", but there have
been no more log entries since startup, and my process table says:
 8495 pts/0    S      0:00 /usr/local/pgsql/bin/postmaster -D
/home/postgres/data
 8498 pts/0    S      0:00 postgres: stats buffer process
 8499 pts/0    S      0:00 postgres: stats collector process
 8500 pts/0    D      5:15 postgres: startup subprocess

and top says
 8500 postgres  15   0  131m 131m 131m D 18.9  6.5   5:18.26 postmaster

So the postmaster is still working.

How long will it keep working on this problem? Can I expect to have everything
working correctly after this startup process, or should I stop it and use a
backup (which I hope is usable and not corrupt)?

I feel kind of helpless in this situation, as I don't know much about the
underlying storage, WAL, and xlog internals. Maybe I could just delete all the
files in this directory?

Can anybody give me some hints on what to do or how to ask?

I am really desperate at the moment.

kind regards,
Janning

Please excuse the bad English and typos; I am kind of nervous at the moment.


Re: pg_xlog disk full error, i need help

From
Gustavo Franklin Nóbrega - Planae
Date:
Hi Janning!

    You need to expand your pg_xlog partition. If you use reiserfs, you can
do this with resize_reiserfs. If you use ext2/ext3, you can try resize2fs.

    If you need to repartition your filesystem, from my own experience I
recommend using LVM. With LVM you can expand easily: add more disks to your
volume group and grow your logical volume. This has already saved me hours of
maintenance more than once.

Best regards,

Gustavo Franklin Nóbrega
Infraestrutura e Banco de Dados
Planae Tecnologia da Informação
(+55) 14 2106-3514
http://www.planae.com.br
----- Original Message -----
From: "Janning Vygen" <vygen@gmx.de>
To: <pgsql-general@postgresql.org>
Sent: Monday, March 28, 2005 7:19 AM
Subject: [GENERAL] pg_xlog disk full error, i need help


> [original message quoted in full; snipped]
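For readers in a similar spot, the LVM route Gustavo describes usually boils down to a few commands. This is a sketch only; the device and volume names below are hypothetical, not from this thread, so substitute your own volume group and logical volume:

```shell
# Sketch only: /dev/sdb1, vg0 and pgdata are example names.
pvcreate /dev/sdb1                  # initialise the new disk for LVM
vgextend vg0 /dev/sdb1              # add it to the existing volume group
lvextend -L +50G /dev/vg0/pgdata    # grow the logical volume
resize2fs /dev/vg0/pgdata           # then grow an ext2/ext3 filesystem
# (for reiserfs: resize_reiserfs /dev/vg0/pgdata)
```

Depending on the filesystem and kernel, the resize step may require the filesystem to be unmounted first.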


Re: pg_xlog disk full error, i need help

From
Lonni J Friedman
Date:
On Mon, 28 Mar 2005 08:46:09 -0300, Gustavo Franklin Nóbrega - Planae
<gfnobrega@planae.com.br> wrote:
> Hi Janning!
>
>     You need to expand your pg_xlog partition. If you use reiserfs, you can
> do this with resize_reiserfs. If you use ext2/ext3 you may try resize2fs.
>
> [remainder of quoted message snipped]

Or free up space by deleting something non-critical from that partition.


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama@gmail.com
LlamaLand                       http://netllama.linux-sxs.org

Re: pg_xlog disk full error, i need help

From
Tom Lane
Date:
"Janning Vygen" <vygen@gmx.de> writes:
> My disk was running full with 100 GB (!) of data/pg_xlog/ files.

The only way for pg_xlog to bloat vastly beyond what it's supposed to be
(which is to say, about twice your checkpoint_segments setting) is if
checkpoints are somehow blocked from happening.  The only mechanism I
know about for that is that in 7.4.* (maybe 7.3.* too) a very large
btree CREATE INDEX or REINDEX operation can block checkpoints until it
completes.  Did you have something like that going on?

Anyway, replaying that much log is gonna take awhile :-(.  I think you
have only two choices:
1. Grin and bear it.
2. Kill the replay process, then use pg_resetxlog to throw away the xlog.
   Then pray you didn't lose anything critical by doing so.

If you know that there was nothing going on except the supposed index
build, then you can be pretty sure that #2 will lose nothing except the
incomplete index, so it might be a workable alternative.

            regards, tom lane
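As a sketch of what option #2 involves (paths and the startup PID follow the ones shown earlier in the thread; the backup destination is hypothetical; pg_resetxlog discards WAL, so a filesystem-level copy first is strongly advisable):

```shell
# Sketch only -- verify paths and PIDs on your own system before running.
# 1. Stop the stuck startup process cleanly (no kill -9).
kill 8500

# 2. Take a filesystem-level safety copy of the whole data directory.
tar czf /backup/data-before-reset.tar.gz /home/postgres/data

# 3. Discard the write-ahead log (run as the postgres user).
/usr/local/pgsql/bin/pg_resetxlog /home/postgres/data

# 4. Restart and sanity-check the data before reopening to applications.
/usr/local/pgsql/bin/postmaster -D /home/postgres/data
```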

Re: pg_xlog disk full error, i need help

From
"Janning Vygen"
Date:
On Monday, 28 March 2005 13:46, Gustavo Franklin Nóbrega - Planae wrote:
> Hi Janning!
>
>     You need to expand your pg_xlog partition. If you use reiserfs, you can
> do this with resize_reiserfs. If you use ext2/ext3 you may try resize2fs.

This is not an option for me at the moment, because my disk is only 120 GB.
With 100 GB of xlog files I don't have any free space at all, apart from some
small amount of disk space on other partitions like /usr.

>     If you need to repartition your filesystem, by myself experience, I
> recommend to you to use LVM. With LVM, you can expand easily, add more disk
> to you volume group and grow your logical volume. This already saved me
> some hours of maintenance a time.

Yes, I really need to take a look at LVM after cooling down again.

kind regards,
janning


--
PLANWERK 6 websolutions
Herzogstraße 85, 40215 Düsseldorf
Tel.: 0211-6015919  Fax: 0211-6015917
http://www.planwerk6.de

Re: pg_xlog disk full error, i need help

From
"Janning Vygen"
Date:
On Monday, 28 March 2005 18:06, Tom Lane wrote:
> "Janning Vygen" <vygen@gmx.de> writes:
> > My disk was running full with 100 GB (!) of data/pg_xlog/ files.
>
> The only way for pg_xlog to bloat vastly beyond what it's supposed to be
> (which is to say, about twice your checkpoint_segments setting) is if
> checkpoints are somehow blocked from happening.  The only mechanism I
> know about for that is that in 7.4.* (maybe 7.3.* too) a very large
> btree CREATE INDEX or REINDEX operation can block checkpoints until it
> completes.  Did you have something like that going on?

First of all, I am running 7.4. A CLUSTER was running, which to me is somewhat
similar to REINDEX, isn't it? And the night before, I killed -9 my nightly
vacuum process, which had not returned after 6 hours or so. First I tried to
stop the postmaster with the init.d script, which didn't work at all. I think
that killing this vacuum process was not a good idea. 24 hours after killing
this process, this ugly xlog thing happened while executing CLUSTER. And the
pg_dump right before the CLUSTER worked fine.

Besides the 100 GB of xlog, another strange thing is that I had about 42 GB in
the directory data/base. It should be about 4 GB, and I vacuum and cluster
every night.

> Anyway, replaying that much log is gonna take awhile :-(.  I think you
> have only two choices:
> 1. Grin and bear it.

I tried for several hours.

> 2. Kill the replay process, then use pg_resetxlog to throw away the xlog.
>    Then pray you didn't lose anything critical by doing so.

I killed the process and used a database backup from just before the error
occurred.

> If you know that there was nothing going on except the supposed index
> build, then you can be pretty sure that #2 will lose nothing except the
> incomplete index, so it might be a workable alternative.

When it comes to trouble with PostgreSQL, I always have the feeling of not
knowing enough of the stuff which is NOT in the docs. I had another ugly
situation a year ago, and when in trouble it's very difficult to act calm.
Isn't there more information about troubleshooting than reading the PostgreSQL
code and the archives? I am not an expert DBA (I wouldn't call myself a DBA at
all, apart from the fact that I am actually doing the administration), but I
am willing to learn.

kind regards,
janning

--
PLANWERK 6 websolutions
Herzogstraße 85, 40215 Düsseldorf
Tel.: 0211-6015919  Fax: 0211-6015917
http://www.planwerk6.de

Re: pg_xlog disk full error, i need help

From
Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
> On Monday, 28 March 2005 18:06, Tom Lane wrote:
>> The only way for pg_xlog to bloat vastly beyond what it's supposed to be
>> (which is to say, about twice your checkpoint_segments setting) is if
>> checkpoints are somehow blocked from happening.  The only mechanism I
>> know about for that is that in 7.4.* (maybe 7.3.* too) a very large
>> btree CREATE INDEX or REINDEX operation can block checkpoints until it
>> completes.  Did you have something like that going on?

> It looks like something/someone wrote so much data in my table that the
> nightly clustering process just didn't succeed because of disk full failure
> after writing too many pg_xlog files. The writing of so many pg_xlog files
> now makes more sense to me when clustering 68 GByte of data.

Yeah, CLUSTER proceeds by rewriting the table and then invoking REINDEX
on each index, so the checkpoint lockout problem will apply.  Tell you
the truth, the best and perhaps only answer for you is to update to 8.0
where that problem is solved.

> How do i get the tablename using this filenode? (restarting the database is
> not an option)

Sure it is.  pg_resetxlog will allow you to restart ... possibly you
will lose some transactions, but if the only thing going on was the
CLUSTER, nothing of value will be lost.

What I would expect to be happening in a CLUSTER is that there would be
an "old" file plus a "new" file of similar size, for both the table
itself and each index that's been processed (up to the point of failure,
where you will have a partially-written new index).  After restart with
this method, you will find only the "old" files listed in pg_class.
You'll want to manually delete the unreferenced "new" files.

            regards, tom lane
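The pg_class lookup implied here can be written as a query along these lines (12345 is a placeholder for a numeric file name found under data/base/<database-oid>/):

```sql
-- Rows that come back are live relations; file names matching no
-- relfilenode are the orphaned "new" files left by the failed CLUSTER.
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 12345;
```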

Re: pg_xlog disk full error, i need help

From
Janning Vygen
Date:
On Tuesday, 29 March 2005 16:37, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> > [earlier exchange snipped]
> >
> > It looks like something/someone wrote so much data in my table that the
> > nightly clustering process just didn't succeed because of disk full
> > failure after writing too many pg_xlog files. The writing of so many
> > pg_xlog files now makes more sense to me when clustering 68 GByte of
> > data.
>
> Yeah, CLUSTER proceeds by rewriting the table and then invoking REINDEX
> on each index, so the checkpoint lockout problem will apply.  Tell you
> the truth, the best and perhaps only answer for you is to update to 8.0
> where that problem is solved.
>
> > How do i get the tablename using this filenode? (restarting the database
> > is not an option)
>
> Sure it is.  pg_resetxlog will allow you to restart ... possibly you
> will lose some transactions, but if the only thing going on was the
> CLUSTER, nothing of value will be lost.

* I just can't restart it:

I gzipped all the pg_xlog files in the crashed database to have enough space
to get my backup running.

As my database server is not on my LAN, I can't download 100 GB of files, and
I can't gunzip all the pg_xlog files again to start it on the same server.

I could delete all the files in my pg_xlog directory and then try to start the
database with another compiled instance of PostgreSQL, but as this is a
production database, that's not a good idea.

And I can't move 100 GB (or even the 60 GB in the base directory) to another
server, because no server has enough space, nor is transferring 60 GB cheap. I
have no idea how to get it running again.

* But what I really want to know is how to interpret the results of
pg_filedump. I didn't find any documentation besides the README. Most of the
output is easy to understand, but some is not. For example: what does
"Flags: USED" mean in a data item?

* Anyway: I am still confused about how my table could get this big overnight.

I stop all database activity by shutting down apache in a nightly cronjob, and
then my nightly job runs:

  pg_dump -Fc $DBNAME > $BACKUP_FILE
  psql -c 'SELECT update_tspt_aktuell();' $DBNAME
  psql -c 'CLUSTER;' $DBNAME

a) The dump file is just fine, and the one table is not as big as 60 GB! The
whole base directory after reinstalling is 1.4 GB.

b) The function is this:
----------------------------
CREATE OR REPLACE FUNCTION update_tspt_aktuell () RETURNS integer LANGUAGE
'plpgsql' AS '
DECLARE
  var_count integer;
BEGIN

UPDATE Tippspieltage SET tspt_aktuell = false WHERE tspt_aktuell;
UPDATE Tippspieltage SET tspt_aktuell = true
FROM
  (
    SELECT DISTINCT ON (tspt2sp.tr_kurzname)
      tspt2sp.tr_kurzname,
      tspt2sp.tspt_sort,
      MIN(abs(EXTRACT(epoch FROM date_trunc(''day'', sp.sp_termin) -
CURRENT_DATE))) AS timediff
    FROM
      Tippspieltage2Spiele AS tspt2sp
      LEFT JOIN Spiele AS sp USING (sp_id)
    GROUP BY
      tspt2sp.tr_kurzname,
      tspt2sp.tspt_sort
    ORDER BY
      tspt2sp.tr_kurzname,
      timediff ASC,
      tspt2sp.tspt_sort DESC
  ) as tspt_akt
WHERE
  Tippspieltage.tr_kurzname = tspt_akt.tr_kurzname
  AND Tippspieltage.tspt_sort = tspt_akt.tspt_sort
;

GET DIAGNOSTICS var_count = ROW_COUNT;
RETURN var_count;
END;
';
----------------------------
My cron mail reports success and the number of modified rows:

 update_tspt_aktuell
---------------------
                5872
(1 row)

So there is no reason I can see for this function to have produced 60 GB of
data.

c) After this function, the CLUSTER fails, starting with
"PANIC:  could not write to file
"/home/postgres/data/pg_xlog/xlogtemp.24223": No space left on device"

The cron job took 2:15 hours to run and to report this failure. I guess the
clustering is somehow broken (maybe because I use many multi-column natural
keys).

* Conclusion: I think I will stop clustering every night until I have upgraded
to 8.0, but it leaves me very unsatisfied not to know the reason for this kind
of db failure (or human error or whatever it is).

> What I would expect to be happening in a CLUSTER is that there would be
> an "old" file plus a "new" file of similar size, for both the table
> itself and each index that's been processed (up to the point of failure,
> where you will have a partially-written new index).  After restart with
> this method, you will find only the "old" files listed in pg_class.
> You'll want to manually delete the unreferenced "new" files.

OK, but the clustering should not grow the file from 500 MB to 64 GB (split
across many files).

Another strange thing: I looked at one of my database servers, and it has
   checkpoint_segments = 30
and 62 files in pg_xlog instead of 2*30+1. Normally I wouldn't care, but maybe
there is something wrong.
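For reference, the ceiling discussed earlier in the thread (roughly 2 * checkpoint_segments + 1 WAL segment files under 7.4) works out as:

```shell
# Expected steady-state maximum number of WAL segment files in pg_xlog:
# 2 * checkpoint_segments + 1, here with checkpoint_segments = 30.
echo $((2 * 30 + 1))    # prints 61
```

A count slightly above this figure right after heavy activity is not unusual, since segments are only recycled at checkpoint time.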

I really appreciate your help so far. It would be even nicer to get some more
thoughts on my questions, as I am willing to learn.

kind regards,
janning