Thread: full data disk -- any chance of recovery

full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
An enthusiastic person in our content department went and did a silly thing ...

Well, he went and fired off an update that consumed all of the remaining disk space on two runtime servers.

We've fallen back to a hot spare and I am faced with trying to retrieve these machines by Tuesday morning when we
expect some increase in traffic.

Postgres version is 7.4; the only thing in the /data directory is postgres data and related files:

$ du
3632    ./gex_runtime/base/1
4468    ./gex_runtime/base/17141
0       ./gex_runtime/base/138602992/pgsql_tmp
32682348        ./gex_runtime/base/138602992
32690448        ./gex_runtime/base
340     ./gex_runtime/global
492120  ./gex_runtime/pg_xlog
7660    ./gex_runtime/pg_clog
33190592        ./gex_runtime
0       ./bkup
33190592        .

The log is saying:
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2006-01-01 23:20:19 WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2006-01-01 23:20:19 LOG:  could not close temporary statistics file
"/data/postgres/gex_runtime/global/pgstat.tmp.1413": No space left on device

Available space is:
$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             32850580   3137552  28044280  11% /
/dev/sdb1             35001508  33223500        16 100% /data

Any suggestions? Falling back to the last known state is fine, but just in case I am making a backup of the remaining
database to build a replacement.

And yes, I did foresee this and did warn management repeatedly, and yet somehow the advice falls on deaf ears. Go figure.
I guess maybe because it isn't management that had a hole kicked in a 3-day weekend.

Greg Williamson
DBA (for now at least)
GlobeXplorer LLC

Re: full data disk -- any chance of recovery

From
"Jeff Frost"
Date:
Greg, I'm not sure what you're looking for in the way of suggestions.  Do
you just want to be able to start this postgres server up and remove some
data?  Easiest way I see to accomplish that given the information you
provided is to move pg_xlog to the sda disk and symlink it to the data dir.


In general terms, it would go like this:

Stop postmaster
cd /data/gex_runtime
mv pg_xlog /
ln -s /pg_xlog
Start postmaster

The commands may vary depending on OS.
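
As a rough sketch of the steps above (not a tested recovery procedure): it assumes the postmaster is already stopped, and that the data directory is /data/postgres/gex_runtime as the log paths suggest, rather than the /data/gex_runtime used in the cd above. The helper name relocate_xlog and the /pg_xlog target are illustrative only.

```shell
#!/bin/sh
# Sketch only: move pg_xlog to a roomier filesystem and leave a symlink behind.
# relocate_xlog is a hypothetical helper, not part of PostgreSQL.
relocate_xlog() {
    datadir=$1   # data directory, e.g. /data/postgres/gex_runtime (assumed)
    target=$2    # new home for the WAL files, e.g. /pg_xlog (assumed)

    # Refuse to run while the server appears to be up. A stale pid file left
    # behind by kill -9 has to be checked and removed by hand.
    if [ -e "$datadir/postmaster.pid" ]; then
        echo "postmaster.pid exists; stop the postmaster first" >&2
        return 1
    fi
    # Nothing to do if pg_xlog is missing or has already been relocated.
    if [ -L "$datadir/pg_xlog" ] || [ ! -d "$datadir/pg_xlog" ]; then
        echo "pg_xlog is missing or already a symlink" >&2
        return 1
    fi
    # Never overwrite an existing target.
    if [ -e "$target" ]; then
        echo "$target already exists" >&2
        return 1
    fi
    mv "$datadir/pg_xlog" "$target" || return 1
    ln -s "$target" "$datadir/pg_xlog"
}

# Usage (with the server stopped, as a user who can write to both locations):
#   relocate_xlog /data/postgres/gex_runtime /pg_xlog
```

The target has to end up owned by the postgres user, otherwise the postmaster will not be able to write WAL after the restart.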

That would also give you better performance if sda and sdb are actually
separate physical disks.  However, that's only going to give you about 500MB
of free space, so I see bigger disks in your future.  A vacuum full might
recover a bit of space as well if you've got any bloat.

The question I have is this: Is your database read-only?  Otherwise,
bringing these machines back up probably isn't too useful as they are now
out of sync with the new primary (your old hot spare).

Good luck!

-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Gregory S. Williamson
Sent: Sunday, January 01, 2006 11:28 PM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
Jeff --

Thanks for the suggestion -- I think this fills the bill except that the postmaster won't quit because it has no space
(at least that is how I interpret it). These are all linux boxes with the same architecture (2 CPUs, 2 gigs of RAM, disks
not adequate for a database: QED).

I had an urgent priority in November to upgrade these beasts, but the best laid plans o' mice and men, etc. etc.

These servers are mostly read-only for spatial data so falling back to the last known state (e.g. before the current
transaction) would work perfectly.

But I'm still making a copy o' one of the two hot spares (one of which is now in play), just in case.

Have a good {day|afternoon|evening|night}!

Greg

-----Original Message-----
From:    jeff@glacier.frostconsultingllc.com on behalf of Jeff Frost
Sent:    Sun 1/1/2006 11:49 PM
To:    Gregory S. Williamson; pgsql-admin@postgresql.org
Cc:
Subject:    RE: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
"Jeff Frost"
Date:
Greg,

Does pg_ctl stop -m immediate stop the postmaster for you?
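
For context, pg_ctl has three shutdown modes (smart, fast, immediate), and they can be tried in order of increasing force. The following is a minimal sketch of that escalation, assuming pg_ctl is on the PATH. The PG_CTL variable and the stop_escalating function are names made up for this example.

```shell
#!/bin/sh
# Sketch only: try pg_ctl's shutdown modes from gentlest to harshest.
# PG_CTL and stop_escalating are illustrative names, not PostgreSQL features.
PG_CTL=${PG_CTL:-pg_ctl}

stop_escalating() {
    datadir=$1
    for mode in smart fast immediate; do
        echo "trying: $PG_CTL stop -D $datadir -m $mode"
        # pg_ctl exits non-zero when the server did not shut down.
        if "$PG_CTL" stop -D "$datadir" -m "$mode"; then
            echo "stopped with -m $mode"
            return 0
        fi
    done
    echo "all shutdown modes failed" >&2
    return 1
}

# Usage (as the postgres user):
#   stop_escalating /data/postgres/gex_runtime
```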

----
Jeff Frost, Owner       <jeff@frostconsultingllc.com>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954


-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Gregory S. Williamson
Sent: Sunday, January 01, 2006 11:58 PM
To: Jeff Frost; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
You wrote:
>
> Greg,
>
> Does pg_ctl stop -m immediate stop the postmaster for you?

I tried
su - postgres -c '/apps/pgsql-7.4/bin/pg_ctl stop -D /data/postgres/gex_runtime -m immediate'

on one of the two hozed servers and that's (I think) what got this:

2006-01-01 23:20:19 WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

And it's been like that for a while.

While the other server (unstopped) shows only:

2006-01-02 00:30:01 LOG:  could not close temporary statistics file
"/data/postgres/gex_runtime/global/pgstat.tmp.1453": No space left on device
2006-01-02 00:33:54 ERROR:  could not access status of transaction 0
DETAIL:  could not write to file "/data/postgres/gex_runtime/pg_clog/0AFA" at offset 196608: No space left on device

G


Re: full data disk -- any chance of recovery

From
Jeff Frost
Date:
Seems like you're going to have to kill -9.

On Mon, 2 Jan 2006, Gregory S. Williamson wrote:

> I tried
> su - postgres -c '/apps/pgsql-7.4/bin/pg_ctl stop -D /data/postgres/gex_runtime -m immediate'
>
> on one of the two hozed servers and that's (I think) what got this:

Re: full data disk -- any chance of recovery

From
Tom Lane
Date:
"Gregory S. Williamson" <gsw@globexplorer.com> writes:
> 2006-01-02 00:30:01 LOG:  could not close temporary statistics file "/data/postgres/gex_runtime/global/pgstat.tmp.1453": No space left on device
> 2006-01-02 00:33:54 ERROR:  could not access status of transaction 0
> DETAIL:  could not write to file "/data/postgres/gex_runtime/pg_clog/0AFA" at offset 196608: No space left on device

Just kill -9 all the postgres processes; everything you need should be
safely down in the WAL files.

You might not have to move pg_xlog --- the first thing to do is see
if there are any large temp files hanging about in the pgsql_tmp
subdirectories.  Anything you see in there can be shot on sight once
the postmaster is stopped (actually, recent versions of the postmaster
will do it for you on restart, but don't remember about 7.4).
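
A small sketch of that check, for anyone following along. It only reports sizes and deletes nothing. The function name list_pgsql_tmp is made up for this example, and the base/<database oid>/pgsql_tmp layout is taken from the du output at the top of the thread (where pgsql_tmp showed 0, so temp files were apparently not the culprit in this case).

```shell
#!/bin/sh
# Sketch only: report the size (in KB) of every pgsql_tmp directory
# under a data directory. list_pgsql_tmp is a hypothetical helper.
list_pgsql_tmp() {
    datadir=$1
    # One pgsql_tmp directory may exist per database under base/<oid>/.
    find "$datadir/base" -type d -name pgsql_tmp -exec du -sk {} \;
}

# Usage:
#   list_pgsql_tmp /data/postgres/gex_runtime
```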

Which PG release is this exactly (7.4.what)?  This misbehavior reminds
me of a bug that we fixed in 7.4.2.

            regards, tom lane

Re: full data disk -- any chance of recovery

From
Ben Kim
Date:
Just curious: I guess the problem now is not simply the full disk, but
supposing the full disk were the only problem, what would happen if we moved
some old files temporarily from pg_xlog/* to somewhere else to free up
some disk space? (On mine, I guess I can get about 75 MB, leaving the most
recent ones: say, dated today.)

Regards,

Ben Kim
Developer
http://benix.tamu.edu



Re: full data disk -- any chance of recovery

From
"Joshua D. Drake"
Date:
Ben Kim wrote:

>Just curious, I guess the problem is not simply the disk full now, but
>supposing the disk full is the only problem, what would happen if we move
>some old files temporarily from pg_xlog/* to somewhere else and free up
>some disk space? (On mine, I guess I can get about 75 MB, leaving the most
>recent ones: say, dated today.)
>
>

Uhmmm don't do that :). You need to find something else. The pg_xlog is
your transaction logs.

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/


Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
I'll check into the temp files and the like in a bit -- the output from version() says:
 PostgreSQL 7.4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)
(1 row)

so I am not sure if this is 7.4.2 -- I have some documentation, though, that says it is 7.4.2, so I think this beast may be
of that flavor.

Thanks,

Greg

-----Original Message-----
From:    Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent:    Mon 1/2/2006 9:45 AM
To:    Gregory S. Williamson
Cc:    Jeff Frost; pgsql-admin@postgresql.org
Subject:    Re: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
Tom Lane
Date:
"Gregory S. Williamson" <gsw@globexplorer.com> writes:
> I'll check into the temp files and the like in a bit -- the output from version() says:
>  PostgreSQL 7.4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)
> (1 row)

> so I am not sure if this 7.4.2 --

If it were 7.4.2 it would say so.  You are in desperate need of an
update, as there are half a dozen known data-loss issues that are
corrected in the 7.4.x update series.  The one that I now think bit
you is just one of them.

            regards, tom lane

Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
Ah well, figures. If only ops had listened to me, we'd be on 8.1 right now.

Thanks anyway, as always, for the sage advice.

G

-----Original Message-----
From:    Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent:    Mon 1/2/2006 2:11 PM
To:    Gregory S. Williamson
Cc:    Jeff Frost; pgsql-admin@postgresql.org
Subject:    Re: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
Tomaz Borstnar
Date:
Jeff Frost wrote:
> Seems like you're going to have to kill -9.

Yeah, this is bad :( Seems like kill -9 is needed when disk is full. Tested on *BSD jails.

Tomaž

Re: full data disk -- any chance of recovery

From
Tom Lane
Date:
Tomaz Borstnar <tomaz.borstnar@over.net> writes:
> Jeff Frost pravi:
>> Seems like you're going to have to kill -9.

> Yeah, this is bad :( Seems like kill -9 is needed when disk is full. Tested on *BSD jails.

With what PG version?  And what behavior did you see exactly?

            regards, tom lane

Re: full data disk -- any chance of recovery

From
"Jim C. Nasby"
Date:
On Mon, Jan 02, 2006 at 12:45:29PM -0500, Tom Lane wrote:
> "Gregory S. Williamson" <gsw@globexplorer.com> writes:
> > 2006-01-02 00:30:01 LOG:  could not close temporary statistics file "/data/postgres/gex_runtime/global/pgstat.tmp.1453": No space left on device
> > 2006-01-02 00:33:54 ERROR:  could not access status of transaction 0
> > DETAIL:  could not write to file "/data/postgres/gex_runtime/pg_clog/0AFA" at offset 196608: No space left on device
>
> Just kill -9 all the postgres processes; everything you need should be
> safely down in the WAL files.

Another alternative: most Unix filesystems reserve some space, so that
there is still some free space left even if df is reporting 100%. On
FreeBSD, you can change the amount of reserved space with tunefs -m, but
you should read the caveats in man tunefs.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
FWIW,

I can at least report the resolution of the original problem.

I went sleuthing and found some core files in the ./base/13860299 directory. Deleting those freed up some gigabytes of
space (each core was 1-2 gigs).
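
A sketch of how such a search might look. It assumes the dumps are named "core" or "core.<pid>", which depends on the kernel's settings, and the function name find_core_files is made up. It only lists files with their size in KB, so they can be inspected (or kept for a backtrace) before anything is deleted.

```shell
#!/bin/sh
# Sketch only: list core dumps under a data directory with their size in KB.
# find_core_files is a hypothetical helper; the file-name patterns are an
# assumption about how this system names its core files.
find_core_files() {
    datadir=$1
    find "$datadir" -type f \( -name core -o -name 'core.*' \) -exec du -k {} \;
}

# Usage:
#   find_core_files /data/postgres/gex_runtime
```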

The server that I had tried to stop with the "-m immediate" command did in fact then go offline; it came up with a few
complaints; I ran a vacuum on all of the databases in that instance and our content manager was able to do his update
(the update was such that reapplying it to any given row didn't hurt anything; these were massive updates changing
copyright-related info and the like). So far the database has passed all sanity checks and is back online.

The server that I left alone was responsive, i.e. psql could connect and do queries, but there were a few tables it
refused to have anything to do with, complaining about missing xlog files. I brought it down with "-m fast" mode,
restarted it and it also seems now to be fine. (Knock on simulated woodgrain)

Lessons learned:
  a) upgrade to current revisions whenever possible -- old software is a hand grenade waiting to go off.
  b) look for core files and delete them if you don't need them -- I was not expecting to find them in a data directory
so this was a bit of a surprise.
  c) don't run out of disk space (duh)
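
For lesson (c), a minimal monitoring sketch, assuming a POSIX df and awk. The helper names usage_pct and disk_over_threshold and the 90% threshold are arbitrary choices for illustration, not anything from our setup.

```shell
#!/bin/sh
# Sketch only: warn before a filesystem fills up.
# usage_pct and disk_over_threshold are hypothetical helper names.

# Read `df -P` output on stdin and print the Use% of the first filesystem line.
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the filesystem holding $1 is at or above $2 percent used.
disk_over_threshold() {
    pct=$(df -P "$1" | usage_pct)
    [ -n "$pct" ] && [ "$pct" -ge "$2" ]
}

# Example: run from cron and alert at 90% (the threshold is an arbitrary choice).
#   disk_over_threshold /data 90 && echo "WARNING: /data is nearly full"
```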

Thanks to all who helped me. I might be able to get a server to test on with a different release of postgres if that
would be useful, although we are strictly a linux shop and Dell x86 servers are what I mostly can get my hands on
(running 2.4.21-0.13mdkenterprise).

Greg W.

-----Original Message-----
From:    pgsql-admin-owner@postgresql.org on behalf of Tom Lane
Sent:    Tue 1/3/2006 10:38 AM
To:    Tomaz Borstnar
Cc:    pgsql-admin@postgresql.org
Subject:    Re: [ADMIN] full data disk -- any chance of recovery

Re: full data disk -- any chance of recovery

From
Jeff Frost
Date:
On Tue, 3 Jan 2006, Jim C. Nasby wrote:

> Another alternative: most unix filesistems actually set it up so that
> there is still some free space left even if it's reporting 100%. On
> FreeBSD, you can change the amount of reserved space with tunefs -m, but
> you should read the caveats in man tunefs.

Jim, excellent thought!  And on Linux at least you can change it with the
filesystem still mounted:

tune2fs -m 0 /dev/sdb1

would probably do the trick.

You might want to set it back after you're done though. :-)  Default appears
to be 5 on my machine.
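
The df figures Greg posted for /dev/sdb1 look consistent with this: the total minus Used minus Available leaves roughly 1.7 GB unaccounted for, about 5% of the filesystem. I am assuming that gap is the root-reserved space; the thread does not confirm it. The arithmetic, as a quick shell sketch:

```shell
#!/bin/sh
# Figures for /dev/sdb1 copied from the df output earlier in the thread (1K blocks).
total=35001508
used=33223500
avail=16

# Space that is neither used nor available to ordinary users.
reserved_kb=$((total - used - avail))
reserved_mb=$((reserved_kb / 1024))
# Share of the filesystem in tenths of a percent, to stay in integer math.
reserved_tenths=$((reserved_kb * 1000 / total))

echo "reserved: ${reserved_kb} KB (~${reserved_mb} MB, ~$((reserved_tenths / 10)).$((reserved_tenths % 10))%)"
```

Since the postmaster does not run as root, it cannot use that reserve until the percentage is lowered.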

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: full data disk -- any chance of recovery

From
"Jim C. Nasby"
Date:
On Tue, Jan 03, 2006 at 05:17:45PM -0800, Gregory S. Williamson wrote:
> FWIW,
>
> I can at least report the resolution of the original problem.
>
> I went sleuthing and found some core files in the ./base/13860299 directory. Deleting those freed up some gigabytes
> of space (each core was 1-2 gigs).

Might want to turn off dumping of core files; I believe man ulimit is
the place to look.

>   a) upgrade to current revisions whenever possible -- old software is a hand grenade waiting to go off.

Well, at least in the case of PostgreSQL, it's generally not critical to
upgrade major (x.y) versions quickly. But you often do want to upgrade
minor (x.y.z) versions, as they often contain bug fixes. But 7.4.x is
getting pretty old.

>   c) don't run out of disk space (duh)

There have actually been fixes to make it less of an issue when you do
run out of disk space. See item a. :)
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: full data disk -- any chance of recovery

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Tue, Jan 03, 2006 at 05:17:45PM -0800, Gregory S. Williamson wrote:
>> I went sleuthing and found some core files in the ./base/13860299 directory. Deleting those freed up some gigabytes
>> of space (each core was 1-2 gigs).

> Might want to turn off dumping of core files; I believe man ulimit is
> the place to look.

Actually, as a developer I would've first wanted to look into the core
files and try to see why they showed up in the first place.  A gdb stack
trace would often tell something useful (... if not to you, then to
someone on the -hackers list ...).  Cleaning up after a problem is fine,
but don't destroy the evidence until you've learned as much as you can
towards preventing the problem from happening again.

I spend a remarkably large fraction of my time advising people to enable
core-dumping on platforms that disable it by default, so you'll
certainly not ever see me advising anyone to turn it off on a platform
where it is default ;-)

Having said all that, +1 to the point about staying up-to-date in
whichever PG release series you are using.  We do not spend time on
making dot-releases because we have nothing to do on a Saturday
afternoon ... an update is put out because it fixes one or more pretty
serious bugs.  Sure, there is some risk of a regression in a
dot-release, but it's small.  As best I recall at the moment, we've had
only one or two regressions in dot-releases in the eight or so years
I've been around the project.

            regards, tom lane

Re: full data disk -- any chance of recovery

From
"Gregory S. Williamson"
Date:
Tom Lane conjured forth the following characters:

>
> > Might want to turn off dumping of core files; I believe man ulimit is
> > the place to look.
>
> Actually, as a developer I would've first wanted to look into the core
> files and try to see why they showed up in the first place.  A gdb stack
> trace would often tell something useful (... if not to you, then to
> someone on the -hackers list ...).  Cleaning up after a problem is fine,
> but don't destroy the evidence until you've learned as much as you can
> towards preventing the problem from happening again.

We're a month or so from switching to 8.1, so I am sure that we'll have another core file which can be kept. To tell the truth,
I haven't pursued this much because
 (a) it's an old revision and if time is to be spent swatting bugs it is better spent on current software, and besides
it may be a result of something already fixed;
 (b) we're almost certain that this is a result of catastrophic failures in postGIS/GEOS under load. We typically see a
few connections go crazy and eat up all the RAM and CPU time; sometimes we've had to reboot to get things calm again.
Our testing of 8.0 w/ postGIS 1.0 led us to conclude that we will see far less of this, in that when we replayed a day's
traffic to the databases we saw no errors versus dozens from the same traffic on 7.4.

When I do find another core I'll let people know if you care; the chances are quite good that we find the opportunity.

Alas, moving large systems in a company sometimes requires the subtle skills of a cat herder combined with the social
tact of an offensive linebacker.

Thanks again,

G




Re: full data disk -- any chance of recovery

From
Tomaz Borstnar
Date:
Tom Lane wrote:
> Tomaz Borstnar <tomaz.borstnar@over.net> writes:
>> Jeff Frost pravi:
>>> Seems like you're going to have to kill -9.
>
>> Yeah, this is bad :( Seems like kill -9 is needed when disk is full. Tested on *BSD jails.
>
> With what PG version?
8.0.x for sure.

> And what behavior did you see exactly?

PostgreSQL was running inside a jail. All was fine until the partition filled up; at that point kill -9 was the only
option to stop PostgreSQL in the jail. It reported stopping by administrative command, but it did not exit; kill -9
was the only solution short of rebooting.


Tomaž