Re: Startup process thrashing - Mailing list pgsql-general

From Greg Smith
Subject Re: Startup process thrashing
Date
Msg-id Pine.GSO.4.64.0812102257331.28476@westnet.com
Whole thread Raw
In response to Startup process thrashing  (Phillip Berry <pberry@stellaconcepts.com>)
Responses Re: Startup process thrashing
Re: Startup process thrashing
List pgsql-general
On Thu, 11 Dec 2008, Phillip Berry wrote:

> I've got a bit of a problem.  It started last night when postgres
> (8.1.9) went down citing the need for a vacuum full to be done due to
> the transaction log needing to wraparound.

Not exactly.  What it said was "To avoid a database shutdown, execute a
full-database VACUUM".  In that context, "full" means you vacuum
everything in the database, but only regular VACUUM is needed.  VACUUM
FULL, as you learned the hard way, is a more intensive operation, and it's
not needed to resolve the problem you started with.  It's a somewhat
unfortunately worded HINT.

> During the vacuum of the larger of the databases a few hours in it
> failed, it's filled up the 18GB pg_xlog partition with over 1000 wal
> files.  Due to running out of space the vacuum failed.

Ouch.  Are you running PITR recovery by setting archive_command?  Did you
set checkpoint_segments to some very high value?  1000 WAL files is not
normal, curious how you ended up with so many of them.

> When I came in this morning I attempted to start postgres using the normal init script, and now it's
> stuck.  The startup process is thrashing the disks and working hard, pg_controldata says it's in
> recovery, but it's been going for over two hours.

It takes a long time to sort through 1000 WAL files and figure out if the
database is consistent for every transactions mentioned there.  If your
VACUUM FULL ran for several hours and kicked out 1000 of them, it would be
reasonable to expect the cleanup to also take many hours.

> My question is where I should go from here?  Should i kill the startup script, clear out the excess
> wal files, start the standalone server and try vacuum again?

Deleting the WAL files like that will leave your database completely
trashed.  The utility that's provided to do the job you're considering is
pg_resetxlog:

http://www.postgresql.org/docs/8.1/static/app-pgresetxlog.html

Which is slightly safer, but note the dire warnings there.  You are very
likely to get some sort of data corruption if you do that, and you won't
know where it is.  You'll be facing a dump and restore to sort that out,
and if you think the server startup is taking a while on a 156GB database
you're really not going to be happy with how long a restore takes.

The safest thing you can do here is just wait for the server to finish
recovery so it starts up.  Watch the system activity with something like
vmstat.  If the server process is busy using the CPU and it's doing stuff
with the disks, if you have evidence it's making progress, you'll be hard
pressed to execute any manual recovery that's any safer or more efficient
than that is.

Someone else may be able to point you toward better estimating how far
it's got left to go, I haven't ever been stuck in your position for long
enough before to figure that out myself.  Good luck.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

pgsql-general by date:

Previous
From: Greg Smith
Date:
Subject: Re: The future of Solaris?
Next
From: "Scott Marlowe"
Date:
Subject: Re: Data Replication