Re: pg_xlog disk full error, i need help - Mailing list pgsql-general

From Janning Vygen
Subject Re: pg_xlog disk full error, i need help
Date
Msg-id 200503292041.01234.vygen@gmx.de
In response to Re: pg_xlog disk full error, i need help  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Tuesday, 29 March 2005 16:37, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> > On Monday, 28 March 2005 18:06, Tom Lane wrote:
> >> The only way for pg_xlog to bloat vastly beyond what it's supposed to be
> >> (which is to say, about twice your checkpoint_segments setting) is if
> >> checkpoints are somehow blocked from happening.  The only mechanism I
> >> know about for that is that in 7.4.* (maybe 7.3.* too) a very large
> >> btree CREATE INDEX or REINDEX operation can block checkpoints until it
> >> completes.  Did you have something like that going on?
> >
> > It looks like something/someone wrote so much data in my table that the
> > nightly clustering process just didn't succeed because of disk full
> > failure after writing too many pg_xlog files. The writing of so many
> > pg_xlog files now makes more sense to me when clustering 68 GByte of
> > data.
>
> Yeah, CLUSTER proceeds by rewriting the table and then invoking REINDEX
> on each index, so the checkpoint lockout problem will apply.  Tell you
> the truth, the best and perhaps only answer for you is to update to 8.0
> where that problem is solved.
>
> > How do i get the tablename using this filenode? (restarting the database
> > is not an option)
>
> Sure it is.  pg_resetxlog will allow you to restart ... possibly you
> will lose some transactions, but if the only thing going on was the
> CLUSTER, nothing of value will be lost.
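
For the record: once a server is up again, mapping a filenode back to a
relation name should be possible with a catalog query along these lines (only
a sketch; the filenode value is a placeholder, and you have to be connected to
the database the file belongs to):

  -- Look up which relation owns a given on-disk filenode.
  -- 123456 is a placeholder for the number in the file name.
  SELECT relname, relkind, relpages
  FROM pg_class
  WHERE relfilenode = 123456;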

* I just can't restart it:

I gzipped all the pg_xlog files in the crashed database to have enough space to
get my backup running.

As my database server is not in my LAN, I can't download 100 GB of files, and I
can't gunzip all the pg_xlog files again to start it on the same server.

So I could delete all files in my pg_xlog directory and then try to start the
database with another compiled instance of PostgreSQL. But as this is a
production database, that's not a good idea.

And I can't move 100 GB (or even just the 60 GB in the base directory) to
another server, because no server has enough space, nor is transferring 60 GB
very cheap. I have no idea how to get it running again.

* But what I really want to know is how to interpret the results of
pg_filedump. I didn't find any documentation besides the README. Most of the
output is easy to understand, but some of it is not. For example: what does
"Flags: USED" mean in a data item?

* Anyway: I am still confused about how my table could get this big overnight.

I stop all database activity by shutting down Apache in a nightly cron job,
and then my nightly job runs:

  pg_dump -Fc $DBNAME > $BACKUP_FILE
  psql -c 'SELECT update_tspt_aktuell();' $DBNAME
  psql -c 'CLUSTER;' $DBNAME

a) The dump file is just fine, and the one table is not as big as 60 GB! The
whole base directory after reinstalling is 1.4 GB.

b) The function is this:
----------------------------
CREATE OR REPLACE FUNCTION update_tspt_aktuell () RETURNS integer LANGUAGE
'plpgsql' AS '
DECLARE
  var_count integer;
BEGIN

UPDATE Tippspieltage SET tspt_aktuell = false WHERE tspt_aktuell;
UPDATE Tippspieltage SET tspt_aktuell = true
FROM
  (
    SELECT DISTINCT ON (tspt2sp.tr_kurzname)
      tspt2sp.tr_kurzname,
      tspt2sp.tspt_sort,
      MIN(abs(EXTRACT(epoch FROM date_trunc(''day'', sp.sp_termin) -
CURRENT_DATE))) AS timediff
    FROM
      Tippspieltage2Spiele AS tspt2sp
      LEFT JOIN Spiele AS sp USING (sp_id)
    GROUP BY
      tspt2sp.tr_kurzname,
      tspt2sp.tspt_sort
    ORDER BY
      tspt2sp.tr_kurzname,
      timediff ASC,
      tspt2sp.tspt_sort DESC
  ) as tspt_akt
WHERE
  Tippspieltage.tr_kurzname = tspt_akt.tr_kurzname
  AND Tippspieltage.tspt_sort = tspt_akt.tspt_sort
;

GET DIAGNOSTICS var_count = ROW_COUNT;
RETURN var_count;
END;
';
----------------------------
My cron mail reports success and the number of modified rows:

 update_tspt_aktuell
---------------------
                5872
(1 row)

So I can see no reason why this function should have produced 60 GB of data.
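
To see which relation is actually eating the space, a query on
pg_class.relpages should work on 7.4 (only a sketch; relpages is an estimate
that is refreshed by VACUUM/ANALYZE, so the numbers are approximate):

  -- Approximate on-disk size per relation (tables, indexes, toast), largest first.
  -- relpages counts 8 kB blocks and is only updated by VACUUM/ANALYZE.
  SELECT relname, relfilenode, relkind,
         relpages * 8 / 1024 AS approx_mb
  FROM pg_class
  ORDER BY relpages DESC
  LIMIT 20;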

c) After this function runs, the CLUSTER fails. The clustering fails starting with
"PANIC:  could not write to file
"/home/postgres/data/pg_xlog/xlogtemp.24223": No space left on device"

The cron job took 2:15 hours to run and to report this failure. I guess the
clustering is somehow broken (maybe because I use many multi-column natural
keys).
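
One way to narrow that down might be to cluster the tables one at a time
instead of running the bare CLUSTER, and to watch pg_xlog while each one runs
(just an idea; the index name below is a placeholder, not my real schema):

  -- Cluster a single table on an explicit index instead of the database-wide CLUSTER.
  -- tippspieltage_pkey stands in for whatever index the table is clustered on.
  CLUSTER tippspieltage_pkey ON tippspieltage;
  -- Re-cluster a table on the index recorded by a previous CLUSTER run:
  CLUSTER spiele;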

* Conclusion: I think I will stop clustering every night until I have upgraded
to 8.0, but it leaves me very unsatisfied not to know the reason for this kind
of DB failure (or human error or whatever it is).

> What I would expect to be happening in a CLUSTER is that there would be
> an "old" file plus a "new" file of similar size, for both the table
> itself and each index that's been processed (up to the point of failure,
> where you will have a partially-written new index).  After restart with
> this method, you will find only the "old" files listed in pg_class.
> You'll want to manually delete the unreferenced "new" files.

OK, but the clustering should not grow the file from 500 MB to 64 GB
(split across many segment files).
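
To find the unreferenced "new" files after such a restart, comparing the file
names in the database's base/<oid> directory against the filenodes still known
to the catalog should do it (sketch only):

  -- Every filenode the catalog still references; a file in base/<database oid>/
  -- whose name (ignoring .1, .2, ... segment suffixes) is not in this list
  -- is a leftover that can be removed.
  SELECT relfilenode, relname, relkind
  FROM pg_class
  ORDER BY relfilenode;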

Another strange thing: I looked at one of my database servers, and it has
   checkpoint_segments = 30
but 62 files in pg_xlog instead of the expected 2*30+1 = 61. Normally I
wouldn't care, but maybe there is something wrong.

I really appreciate your help so far. It would be even nicer if you could share
some more thoughts on my questions, as I am willing to learn.

kind regards,
janning

