Thread: URGENT: Out of disk space pg_xlog

URGENT: Out of disk space pg_xlog

From
"Jeremy Haile"
Date:
I created a 10GB partition for pg_xlog and ran out of disk space today
during a long running update.  My checkpoint_segments is set to 12, but
there are 622 files in pg_xlog.  What size should the pg_xlog partition
be?

Postmaster is currently not starting up (critical for my organization)
and reports "FATAL: The database system is starting up" .

The log reports:
2006-12-22 10:50:09 LOG:  checkpoint record is at 2E/87A323C8
2006-12-22 10:50:09 LOG:  redo record is at 2E/8729A6E8; undo record is
at 0/0; shutdown FALSE
2006-12-22 10:50:09 LOG:  next transaction ID: 0/25144015; next OID:
140986
2006-12-22 10:50:09 LOG:  next MultiXactId: 12149; next MultiXactOffset:
24306
2006-12-22 10:50:09 LOG:  database system was not properly shut down;
automatic recovery in progress
2006-12-22 10:50:09 LOG:  redo starts at 2E/8729A6E8


This has been running for 20 minutes.  What can I do?  Please help!

Re: URGENT: Out of disk space pg_xlog

From
"Jeremy Haile"
Date:
Sorry for my rushed posting, as I was in a bit of a panic.

We moved the pg_xlog directory over to a 70GB partition, and after 15-20
minutes the automatic recovery finished.  Everything is working fine
now.

I would still appreciate a PG guru explaining how to estimate size for a
pg_xlog partition.  It seems like it can vary considerably depending on
how intensive your current transactions are.  Is there a way to
determine a maximum?

On Fri, 22 Dec 2006 11:06:46 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
> I created a 10GB partition for pg_xlog and ran out of disk space today
> during a long running update.  My checkpoint_segments is set to 12, but
> there are 622 files in pg_xlog.  What size should the pg_xlog partition
> be?
>
> Postmaster is currently not starting up (critical for my organization)
> and reports "FATAL: The database system is starting up" .
>
> The log reports:
> 2006-12-22 10:50:09 LOG:  checkpoint record is at 2E/87A323C8
> 2006-12-22 10:50:09 LOG:  redo record is at 2E/8729A6E8; undo record is
> at 0/0; shutdown FALSE
> 2006-12-22 10:50:09 LOG:  next transaction ID: 0/25144015; next OID:
> 140986
> 2006-12-22 10:50:09 LOG:  next MultiXactId: 12149; next MultiXactOffset:
> 24306
> 2006-12-22 10:50:09 LOG:  database system was not properly shut down;
> automatic recovery in progress
> 2006-12-22 10:50:09 LOG:  redo starts at 2E/8729A6E8
>
>
> This has been running for 20 minutes.  What can I do?  Please help!
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

Re: URGENT: Out of disk space pg_xlog

From
"Joshua D. Drake"
Date:
> 2006-12-22 10:50:09 LOG:  database system was not properly shut down;
> automatic recovery in progress
> 2006-12-22 10:50:09 LOG:  redo starts at 2E/8729A6E8
>
>
> This has been running for 20 minutes.  What can I do?  Please help!

1. Turn off postgresql.
2. Make tar backup of entire thing
3. Move pg_xlog somehwere that has space
4. ln postgresql to new pg_xlog directory
5. Start postgresql
6. Look for errors
7. Report back

Sincerely.

Joshua D. Drake

>
--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




Re: URGENT: Out of disk space pg_xlog

From
"Simon Riggs"
Date:
On Fri, 2006-12-22 at 11:52 -0500, Jeremy Haile wrote:

> I would still appreciate ... explaining how to estimate size for a
> pg_xlog partition.  It seems like it can vary considerably depending on
> how intensive your current transactions are.  Is there a way to
> determine a maximum?

There should be at most 2*checkpoint_segments+1 files in pg_xlog, which
are 16MB each. So you shouldn't be having a problem.

If there are more than this, it could be because you have
currently/previously had archive_command set and the archive command
failed to execute correctly, or the database was shutdown/crashed prior
to the archive commands being executed.

IIRC there was a bug that allowed this to happen, but that was some time
ago.

Perhaps you could show us the dir listing, so we can check that there is
not a new problem emerging? Can you also show us the contents of the
pg_xlog/archive_status directory? Thanks.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: URGENT: Out of disk space pg_xlog

From
"Kevin Grittner"
Date:
As I understand it, the log space accumulates for the oldest transaction
which is still running, and all transactions which started after it.  I
don't think there is any particular limit besides available disk space.
Long running transactions can cause various problems, including table
and index bloat which can degrade performance.  You should probably look
at whether the long running transaction could be broken down into a
number of smaller ones.

-Kevin


>>> On Fri, Dec 22, 2006 at 10:52 AM, in message
<1166806378.10592.281708161@webmail.messagingengine.com>, "Jeremy
Haile"
<jhaile@fastmail.fm> wrote:
> Sorry for my rushed posting, as I was in a bit of a panic.
>
> We moved the pg_xlog directory over to a 70GB partition, and after
15- 20
> minutes the automatic recovery finished.  Everything is working fine
> now.
>
> I would still appreciate a PG guru explaining how to estimate size
for a
> pg_xlog partition.  It seems like it can vary considerably depending
on
> how intensive your current transactions are.  Is there a way to
> determine a maximum?
>
> On Fri, 22 Dec 2006 11:06:46 - 0500, "Jeremy Haile"
<jhaile@fastmail.fm>
> said:
>> I created a 10GB partition for pg_xlog and ran out of disk space
today
>> during a long running update.  My checkpoint_segments is set to 12,
but
>> there are 622 files in pg_xlog.  What size should the pg_xlog
partition
>> be?
>>
>> Postmaster is currently not starting up (critical for my
organization)
>> and reports "FATAL: The database system is starting up" .
>>
>> The log reports:
>> 2006- 12- 22 10:50:09 LOG:  checkpoint record is at 2E/87A323C8
>> 2006- 12- 22 10:50:09 LOG:  redo record is at 2E/8729A6E8; undo
record is
>> at 0/0; shutdown FALSE
>> 2006- 12- 22 10:50:09 LOG:  next transaction ID: 0/25144015; next
OID:
>> 140986
>> 2006- 12- 22 10:50:09 LOG:  next MultiXactId: 12149; next
MultiXactOffset:
>> 24306
>> 2006- 12- 22 10:50:09 LOG:  database system was not properly shut
down;
>> automatic recovery in progress
>> 2006- 12- 22 10:50:09 LOG:  redo starts at 2E/8729A6E8
>>
>>
>> This has been running for 20 minutes.  What can I do?  Please help!
>>
>> --------------------------- (end of
broadcast)---------------------------
>> TIP 4: Have you searched our list archives?
>>
>>                http://archives.postgresql.org
>
> --------------------------- (end of
broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings


Re: URGENT: Out of disk space pg_xlog

From
"Jeremy Haile"
Date:
The archive_status directory is empty.  I've never seen any files in
there and I've never set archive_command.

Well, the problem has since resolved, but here is what is in the
directory now.  Previously there were hundreds of files, but these
disappeared after Postgres performed the automatic recovery.

12/22/2006  11:16 AM        16,777,216 0000000100000030000000D2
12/22/2006  11:17 AM        16,777,216 0000000100000030000000D3
12/22/2006  11:17 AM        16,777,216 0000000100000030000000D4
12/22/2006  11:17 AM        16,777,216 0000000100000030000000D5
12/22/2006  11:18 AM        16,777,216 0000000100000030000000D6
12/22/2006  11:19 AM        16,777,216 0000000100000030000000D7
12/22/2006  11:19 AM        16,777,216 0000000100000030000000D8
12/22/2006  11:19 AM        16,777,216 0000000100000030000000D9
12/22/2006  11:19 AM        16,777,216 0000000100000030000000DA
12/22/2006  11:21 AM        16,777,216 0000000100000030000000DB
12/22/2006  10:07 AM        16,777,216 0000000100000030000000DC
12/22/2006  10:07 AM        16,777,216 0000000100000030000000DD
12/22/2006  10:07 AM        16,777,216 0000000100000030000000DE
12/22/2006  10:33 AM        16,777,216 0000000100000030000000DF
12/22/2006  10:08 AM        16,777,216 0000000100000030000000E0
12/22/2006  10:32 AM        16,777,216 0000000100000030000000E1
12/22/2006  10:08 AM        16,777,216 0000000100000030000000E2
12/22/2006  10:08 AM        16,777,216 0000000100000030000000E3
12/22/2006  10:17 AM        16,777,216 0000000100000030000000E4
12/22/2006  10:11 AM        16,777,216 0000000100000030000000E5
12/22/2006  11:10 AM        16,777,216 0000000100000030000000E6
12/22/2006  11:11 AM        16,777,216 0000000100000030000000E7
12/22/2006  11:15 AM        16,777,216 0000000100000030000000E8
12/22/2006  11:15 AM        16,777,216 0000000100000030000000E9
12/22/2006  11:15 AM        16,777,216 0000000100000030000000EA
12/22/2006  11:16 AM        16,777,216 0000000100000030000000EB
12/22/2006  11:16 AM        16,777,216 0000000100000030000000EC
12/22/2006  11:16 AM        16,777,216 0000000100000030000000ED
12/18/2006  08:52 PM    <DIR>          archive_status
              28 File(s)    469,762,048 bytes
               3 Dir(s)  10,206,756,864 bytes free

On Fri, 22 Dec 2006 17:02:43 +0000, "Simon Riggs"
<simon@2ndquadrant.com> said:
> On Fri, 2006-12-22 at 11:52 -0500, Jeremy Haile wrote:
>
> > I would still appreciate ... explaining how to estimate size for a
> > pg_xlog partition.  It seems like it can vary considerably depending on
> > how intensive your current transactions are.  Is there a way to
> > determine a maximum?
>
> There should be at most 2*checkpoint_segments+1 files in pg_xlog, which
> are 16MB each. So you shouldn't be having a problem.
>
> If there are more than this, it could be because you have
> currently/previously had archive_command set and the archive command
> failed to execute correctly, or the database was shutdown/crashed prior
> to the archive commands being executed.
>
> IIRC there was a bug that allowed this to happen, but that was some time
> ago.
>
> Perhaps you could show us the dir listing, so we can check that there is
> not a new problem emerging? Can you also show us the contents of the
> pg_xlog/archive_status directory? Thanks.
>
> --
>   Simon Riggs
>   EnterpriseDB   http://www.enterprisedb.com
>
>

Re: URGENT: Out of disk space pg_xlog

From
"Simon Riggs"
Date:
On Fri, 2006-12-22 at 12:30 -0500, Jeremy Haile wrote:
> The archive_status directory is empty.  I've never seen any files in
> there and I've never set archive_command.
>
> Well, the problem has since resolved, but here is what is in the
> directory now.  Previously there were hundreds of files, but these
> disappeared after Postgres performed the automatic recovery.

What were you doing before the server crashed?

Did you previously have checkpoint_segments set higher? When/how was it
reduced?

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: URGENT: Out of disk space pg_xlog

From
"Jeremy Haile"
Date:
checkpoint_segments has been set at 12 for a while and was never set
higher than that. (before that it was set to the PG default - 3 I think)

Before the server crashed I was running an update that updates a boolean
flag on two large tables (10 million rows each) for transactions older
than today (roughly 80% of the rows)  The transaction ran for a long
time and I assume is what caused the pg_xlog to fill up.

On Fri, 22 Dec 2006 17:36:39 +0000, "Simon Riggs"
<simon@2ndquadrant.com> said:
> On Fri, 2006-12-22 at 12:30 -0500, Jeremy Haile wrote:
> > The archive_status directory is empty.  I've never seen any files in
> > there and I've never set archive_command.
> >
> > Well, the problem has since resolved, but here is what is in the
> > directory now.  Previously there were hundreds of files, but these
> > disappeared after Postgres performed the automatic recovery.
>
> What were you doing before the server crashed?
>
> Did you previously have checkpoint_segments set higher? When/how was it
> reduced?
>
> --
>   Simon Riggs
>   EnterpriseDB   http://www.enterprisedb.com
>
>

Re: URGENT: Out of disk space pg_xlog

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> As I understand it, the log space accumulates for the oldest transaction
> which is still running, and all transactions which started after it.

No, pg_xlog can be truncated as soon as a checkpoint occurs.  If Jeremy
wasn't using archive_command then the only possible explanation for
bloated pg_xlog is that checkpoints were failing.  Which is not unlikely
if the *data* partition runs out of space.  Were there gripes in the log
before the system crash?  The scenario we've seen in the past is

* data partition out of space, so writes fail
* each time Postgres attempts a checkpoint, writes fail, so the
  checkpoint fails.  No data loss at this point, the dirty buffers
  just stay in memory.
* pg_xlog bloats because we can't truncate away old segments
* eventually pg_xlog runs out of space, at which point we PANIC
  and can't continue running the database

Once you free some space on the data partition and restart, you should
be good to go --- there will be no loss of committed transactions, since
all the operations are in pg_xlog.  Might take a little while to replay
all that log though :-(

            regards, tom lane

Re: URGENT: Out of disk space pg_xlog

From
"Kevin Grittner"
Date:
>>> On Fri, Dec 22, 2006 at 12:14 PM, in message
<26238.1166811258@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> As I understand it, the log space accumulates for the oldest
transaction
>> which is still running, and all transactions which started after
it.
>
> No, pg_xlog can be truncated as soon as a checkpoint occurs.

Thanks.  Good to know.  I had missed that.

> If Jeremy
> wasn't using archive_command then the only possible explanation for
> bloated pg_xlog is that checkpoints were failing.  Which is not
unlikely
> if the *data* partition runs out of space.  Were there gripes in the
log
> before the system crash?  The scenario we've seen in the past is
>
> * data partition out of space, so writes fail
> * each time Postgres attempts a checkpoint, writes fail, so the
>   checkpoint fails.  No data loss at this point, the dirty buffers
>   just stay in memory.
> * pg_xlog bloats because we can't truncate away old segments

So, at this point, if space is freed on the data partition somehow,
Postgres recovers with no problems?  (i.e.,, the database is still
running and no requests have been terminated abnormally due to the space
problems?)

> * eventually pg_xlog runs out of space, at which point we PANIC
>   and can't continue running the database
>
> Once you free some space on the data partition and restart, you
should
> be good to go ---  there will be no loss of committed transactions,
since
> all the operations are in pg_xlog.  Might take a little while to
replay
> all that log though :- (

Just to confirm what I would assume at this point -- non-committed
transactions should roll back cleanly; it is reasonable to assume no
corruption at this point?

Thanks,

-Kevin



Re: URGENT: Out of disk space pg_xlog

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> before the system crash?  The scenario we've seen in the past is
>>
>> * data partition out of space, so writes fail
>> * each time Postgres attempts a checkpoint, writes fail, so the
>> checkpoint fails.  No data loss at this point, the dirty buffers
>> just stay in memory.
>> * pg_xlog bloats because we can't truncate away old segments

> So, at this point, if space is freed on the data partition somehow,
> Postgres recovers with no problems?  (i.e.,, the database is still
> running and no requests have been terminated abnormally due to the space
> problems?)

Right, no committed transactions have been lost.  Depending on what you
are doing, you might see individual transactions fail due to
out-of-space --- an INSERT/UPDATE that couldn't find free space within
its table would probably fail while trying to extend the table, and
anything requiring a large temp file would fail.

> Just to confirm what I would assume at this point -- non-committed
> transactions should roll back cleanly; it is reasonable to assume no
> corruption at this point?

Yeah, I would expect no problems.

            regards, tom lane

Re: URGENT: Out of disk space pg_xlog

From
ohp@pyrenet.fr
Date:
On Fri, 22 Dec 2006, Tom Lane wrote:

> Date: Fri, 22 Dec 2006 13:14:18 -0500
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: Kevin Grittner <Kevin.Grittner@wicourts.gov>
> Cc: Jeremy Haile <jhaile@fastmail.fm>, pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] URGENT: Out of disk space pg_xlog
>
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > As I understand it, the log space accumulates for the oldest transaction
> > which is still running, and all transactions which started after it.
>
> No, pg_xlog can be truncated as soon as a checkpoint occurs.

Even for currently running transactions ?

My understanding was that checkpoint was syncing data files for commited
transactions.

What happens to pg_xlogs when a transaction updates M of rows/tables and
runs for hours?
[snip]
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
regards
--
Olivier PRENANT                    Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges                +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                       +33-6-07-63-80-64 (GSM)
FRANCE                          Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)

Re: URGENT: Out of disk space pg_xlog

From
Tom Lane
Date:
ohp@pyrenet.fr writes:
> On Fri, 22 Dec 2006, Tom Lane wrote:
>> No, pg_xlog can be truncated as soon as a checkpoint occurs.

> Even for currently running transactions ?

Yes.

            regards, tom lane

Re: URGENT: Out of disk space pg_xlog

From
Alvaro Herrera
Date:
ohp@pyrenet.fr wrote:
> On Fri, 22 Dec 2006, Tom Lane wrote:
>
> > Date: Fri, 22 Dec 2006 13:14:18 -0500
> > From: Tom Lane <tgl@sss.pgh.pa.us>
> > To: Kevin Grittner <Kevin.Grittner@wicourts.gov>
> > Cc: Jeremy Haile <jhaile@fastmail.fm>, pgsql-performance@postgresql.org
> > Subject: Re: [PERFORM] URGENT: Out of disk space pg_xlog
> >
> > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > > As I understand it, the log space accumulates for the oldest transaction
> > > which is still running, and all transactions which started after it.
> >
> > No, pg_xlog can be truncated as soon as a checkpoint occurs.
>
> Even for currently running transactions ?
>
> My understanding was that checkpoint was syncing data files for commited
> transactions.

No, it syncs data files for all transactions, even those currently
running.

> What happens to pg_xlogs when a transaction updates M of rows/tables and
> runs for hours?

They get recycled as the update goes.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: URGENT: Out of disk space pg_xlog

From
"Steinar H. Gunderson"
Date:
On Fri, Dec 22, 2006 at 07:47:05PM +0100, ohp@pyrenet.fr wrote:
>> No, pg_xlog can be truncated as soon as a checkpoint occurs.
> Even for currently running transactions ?

Isn't that the entire point of having checkpoints in the first place? :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: URGENT: Out of disk space pg_xlog

From
Bruce Momjian
Date:
Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > As I understand it, the log space accumulates for the oldest transaction
> > which is still running, and all transactions which started after it.
>
> No, pg_xlog can be truncated as soon as a checkpoint occurs.  If Jeremy
> wasn't using archive_command then the only possible explanation for
> bloated pg_xlog is that checkpoints were failing.  Which is not unlikely
> if the *data* partition runs out of space.  Were there gripes in the log
> before the system crash?  The scenario we've seen in the past is
>
> * data partition out of space, so writes fail
> * each time Postgres attempts a checkpoint, writes fail, so the
>   checkpoint fails.  No data loss at this point, the dirty buffers
>   just stay in memory.
> * pg_xlog bloats because we can't truncate away old segments
> * eventually pg_xlog runs out of space, at which point we PANIC
>   and can't continue running the database
>
> Once you free some space on the data partition and restart, you should
> be good to go --- there will be no loss of committed transactions, since
> all the operations are in pg_xlog.  Might take a little while to replay
> all that log though :-(

Amazing that all works.  What I did not see is confirmation from the
user that the data directory filled up _before_ pg_xlog filled up.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: URGENT: Out of disk space pg_xlog

From
ohp@pyrenet.fr
Date:
Thanks for your reply.

I learnt something again!
On Fri, 22 Dec 2006, Alvaro Herrera wrote:

> Date: Fri, 22 Dec 2006 17:25:43 -0300
> From: Alvaro Herrera <alvherre@commandprompt.com>
> To: ohp@pyrenet.fr
> Cc: Tom Lane <tgl@sss.pgh.pa.us>,
>      Kevin Grittner <Kevin.Grittner@wicourts.gov>,
>      Jeremy Haile <jhaile@fastmail.fm>, pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] URGENT: Out of disk space pg_xlog
>
> ohp@pyrenet.fr wrote:
> > On Fri, 22 Dec 2006, Tom Lane wrote:
> >
> > > Date: Fri, 22 Dec 2006 13:14:18 -0500
> > > From: Tom Lane <tgl@sss.pgh.pa.us>
> > > To: Kevin Grittner <Kevin.Grittner@wicourts.gov>
> > > Cc: Jeremy Haile <jhaile@fastmail.fm>, pgsql-performance@postgresql.org
> > > Subject: Re: [PERFORM] URGENT: Out of disk space pg_xlog
> > >
> > > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > > > As I understand it, the log space accumulates for the oldest transaction
> > > > which is still running, and all transactions which started after it.
> > >
> > > No, pg_xlog can be truncated as soon as a checkpoint occurs.
> >
> > Even for currently running transactions ?
> >
> > My understanding was that checkpoint was syncing data files for commited
> > transactions.
>
> No, it syncs data files for all transactions, even those currently
> running.
>
> > What happens to pg_xlogs when a transaction updates M of rows/tables and
> > runs for hours?
>
> They get recycled as the update goes.
>
>

--
Olivier PRENANT                    Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges                +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                       +33-6-07-63-80-64 (GSM)
FRANCE                          Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)

Re: URGENT: Out of disk space pg_xlog

From
"Jeremy Haile"
Date:
> > Once you free some space on the data partition and restart, you should
> > be good to go --- there will be no loss of committed transactions, since
> > all the operations are in pg_xlog.  Might take a little while to replay
> > all that log though :-(
>
> Amazing that all works.  What I did not see is confirmation from the
> user that the data directory filled up _before_ pg_xlog filled up.

After I freed up space on the pg_xlog partition and restarted, it took
some time to replay all of the log (15-20 minutes) and everything
recovered with no data corruption!  However, the theory about the data
partition filling up first didn't happen in my case.  The data partition
was (and still is) less than 50% utilized.  My pg_xlog files typically
run around 400MB, but with the long running update filled up the entire
10GB partition.  (which is now a 70 GB partition)

So, I'm still not sure what caused the problem.  When I get back to work
(or maybe sooner), I'll take a look in the PG logs and post anything
that looks suspicious here.  Thanks for all of your comments and
suggestions.  Even though I haven't figured out the root of the problem
yet, they've been very informative.

Re: URGENT: Out of disk space pg_xlog

From
Bruce Momjian
Date:
Jeremy Haile wrote:
> > > Once you free some space on the data partition and restart, you should
> > > be good to go --- there will be no loss of committed transactions, since
> > > all the operations are in pg_xlog.  Might take a little while to replay
> > > all that log though :-(
> >
> > Amazing that all works.  What I did not see is confirmation from the
> > user that the data directory filled up _before_ pg_xlog filled up.
>
> After I freed up space on the pg_xlog partition and restarted, it took
> some time to replay all of the log (15-20 minutes) and everything
> recovered with no data corruption!  However, the theory about the data
> partition filling up first didn't happen in my case.  The data partition
> was (and still is) less than 50% utilized.  My pg_xlog files typically
> run around 400MB, but with the long running update filled up the entire
> 10GB partition.  (which is now a 70 GB partition)
>
> So, I'm still not sure what caused the problem.  When I get back to work
> (or maybe sooner), I'll take a look in the PG logs and post anything
> that looks suspicious here.  Thanks for all of your comments and
> suggestions.  Even though I haven't figured out the root of the problem
> yet, they've been very informative.

The bottom line is that we know of now cases where a long-running
transaction would delay recycling of the WAL files, so there is
certainly something not understood here.

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: URGENT: Out of disk space pg_xlog

From
"Simon Riggs"
Date:
On Sat, 2006-12-23 at 13:13 -0500, Bruce Momjian wrote:

> The bottom line is that we know of now cases where a long-running
> transaction would delay recycling of the WAL files, so there is
> certainly something not understood here.

We can see from all of this that a checkpoint definitely didn't occur.
Tom's causal chain was just one way that could have happened, there
could well be others.

I've noticed previously that a checkpoint can be starved out when trying
to acquire the CheckpointStartLock. I've witnessed a two minute delay
plus in obtaining the lock in the face of heavy transactions.

If wal_buffers is small enough, WAL write rate high enough and the
transaction rate high enough, a long queue can form for the
WALWriteLock, which ensures that the CheckpointStartLock would queue
indefinitely.

I've tried implementing a queueable shared lock for the
CheckpointStartLock. That helps the checkpoint, but it harms performance
of other transactions waiting to commit, so I let that idea go.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com