Thread: [HACKERS] Orphaned files in base/[oid]
Hi all;
I am trying to track down a problem we are seeing that looks very similar to bug #12050, and would certainly consider trying to contribute a fix if we agree on one. (I am not sure we can, so absent that, the next question is whether it makes sense to create a utility to fix the problem when it comes up so that a dump/restore is not needed).
The system:
PostgreSQL 9.6.3
Gentoo Linux.
Problem:
The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble with disk space. Querying pg_database_size, as well as du on the subdirectory of base/ show total usage to be around 3.8TB. Summing up the size of the relations in pg_class though shows around 2.1TB.
Initial troubleshooting found around 150 GB of space in pg_temp which had never been cleared and was at least several days old. Restarting the server cleared these up.
Poking around the base/[oid] directory, I found a large number of files which did not correspond with a pg_class entry. One of the apparent relations was nearly 1TB in size.
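(Editorial note: the orphan hunt described above can be rehearsed against a fabricated directory tree before touching a live cluster. This is a hedged sketch, not tooling from the thread; the data directory and file names are made up. On a real 9.6 system the temp files live under base/pgsql_tmp and you would add something like -mtime +1 so only files too old to belong to a live backend are listed.)

```shell
#!/bin/sh
# Sketch: locate leftover temp files under a (fabricated) data directory.
# In 9.6 temp files are named pgsql_tmpPID.N under base/pgsql_tmp/.
PGDATA=$(mktemp -d)
mkdir -p "$PGDATA/base/pgsql_tmp"
touch "$PGDATA/base/pgsql_tmp/pgsql_tmp12345.0"
touch "$PGDATA/base/pgsql_tmp/pgsql_tmp12345.1"
# List candidate leftovers; -type f skips the pgsql_tmp directory itself.
find "$PGDATA/base" -name 'pgsql_tmp*' -type f
```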
What I think happened:
I think various pg_temp/* and orphaned relation files (In base/[oid]) were created when PostgreSQL crashed due to running out of space in various operations including creating materialised views.
--
So my question is if there is a way we can safely clean these up on server restart? If not does it make sense to try to create a utility that can connect to PostgreSQL, seek out valid files, and delete the rest?
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
On Aug 14, 2017 14:12, "Chris Travers" <chris.travers@adjust.com> wrote:
> I am trying to track down a problem we are seeing that looks very
> similar to bug #12050 [...]
Ok I have identified one case where symptoms I am seeing can be reproduced. I am currently working on a Mac so there may be quirks in my repro. However....
When the WAL writer runs out of disk space no cleanup is done.
So I will be looking at possible solutions next.
--
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
Hi,

On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> Problem:
> The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
> with disk space. Querying pg_database_size, as well as du on the
> subdirectory of base/ show total usage to be around 3.8TB. Summing up the
> size of the relations in pg_class though shows around 2.1TB.
>
> Initial troubleshooting found around 150 GB of space in pg_temp which had
> never been cleared and was at least several days old. Restarting the
> server cleared these up.
>
> Poking around the base/[oid] directory, I found a large number of files
> which did not correspond with a pg_class entry. One of the apparent
> relations was nearly 1TB in size.
>
> What I think happened:
> I think various pg_temp/* and orphaned relation files (In base/[oid]) were
> created when PostgreSQL crashed due to running out of space in various
> operations including creating materialised views.
>
> So my question is if there is a way we can safely clean these up on server
> restart? If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

I think the fix here is to call RemovePgTempFiles() during
crash-restarts, instead of just full starts. The previously stated need
to be able to inspect temp files after a crash can be less impactfully
fulfilled with restart_after_crash = false.

Greetings,

Andres Freund
On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
> I think the fix here is to call RemovePgTempFiles() during
> crash-restarts, instead of just full starts. The previously stated need
> to be able to inspect temp files after a crash can be less impactfully
> fulfilled with restart_after_crash = false.

But that only clears temp files, right?

I am less concerned about the temp files because a restart clears them.

The bigger issue I see is with the orphaned base files. It looks like
files in base/[oid] don't get cleaned up either, if I read my output
correctly, and that would explain why we saw 1.7TB of discrepancy between
relations and database size. Safety-wise it seems like the best way out
of that is a dump/restore, but doing that with a 2.1TB database is
annoying.
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
Chris Travers <chris.travers@adjust.com> writes:
> On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think the fix here is to call RemovePgTempFiles() during
>> crash-restarts, instead of just full starts. The previously stated need
>> to be able to inspect temp files after a crash can be less impactfully
>> fulfilled with restart_after_crash = false.

> But that only clears temp files right?
> I am less concerned about the temp files because a restart clears them.

It will clear temp files, and also temp tables.

> The bigger issue I see are with the orphaned base files.

It would be possible to have orphaned non-temp tables if you'd suffered
a crash during the transaction that created those tables. Ordinarily
a newly-created table file wouldn't be that large, but if your workflow
created tables and shoved boatloads of data into them in the same
transaction, it's not so hard to see this becoming an issue.

The core problem with zapping non-temp table files is that you can't
do that unless you're sure you have consistent, up-to-date pg_class
data that nobody else is busy adding to. It's hard to see an external
application being able to do that safely. You certainly can't do it
at the point in the postmaster startup cycle where we currently do
the other things --- for those, we rely only on filesystem naming
conventions to identify what to zap.

        regards, tom lane
On 2017-08-14 14:40:46 -0400, Tom Lane wrote:
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to. It's hard to see an external
> application being able to do that safely. You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

I think there are some possibilities to close the gap here. We could
e.g. have <relfilenode>.delete_on_crash marker files that get installed
when creating a new persistent relfilenode. If we set up things so they
get deleted post commit, but inside the critical section, we could rely
on them being present in case of crash, but consistently removed during
WAL replay. At the end of recovery, iterate over the whole datadir and
nuke all relations with marker files present.

I first thought that'd cost an additional fsync per relation
created. But I think we actually can delay that to a pre-commit phase,
if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
them just before checkpoint (via the usual delayed fsync mechanism).
That'd still require an XLogFlush(), but that seems hard to avoid unless
we just don't create relations on FS level until buffers are
evicted and/or BufferSync().

Alternatively we could do something without marker files, with some
added complexity: Keep track of all "uncommitted new files" in memory,
and log them every checkpoint. Commit/abort records clear elements of
that list. Since we always start replay at the beginning of a
checkpoint, we'd always reach a moment with such an up2date list of
pending-action files before reaching end-of-recovery. At
end-of-recovery we can delete all unconfirmed files. To avoid
out-of-memory due to too many tracked relations, we'd possibly still
have to have marker files...

Regards,

Andres
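(Editorial note: the marker-file lifecycle proposed above can be illustrated outside the backend with plain filesystem operations. This is a toy sketch, not backend code; the file names are fabricated, and only the .delete_on_crash suffix is taken from the proposal.)

```shell
#!/bin/sh
# Toy illustration of the .delete_on_crash idea: a relation file is
# created together with a marker; commit removes the marker; crash
# recovery deletes every relation whose marker survived.
DIR=$(mktemp -d)
# Transaction 1: creates 16384 and commits (marker removed).
touch "$DIR/16384" "$DIR/16384.delete_on_crash"
rm "$DIR/16384.delete_on_crash"
# Transaction 2: creates 16385 and crashes (marker left behind).
touch "$DIR/16385" "$DIR/16385.delete_on_crash"
# "Crash recovery": nuke every relation that still has a marker.
for m in "$DIR"/*.delete_on_crash; do
  [ -e "$m" ] || continue
  rm -f "${m%.delete_on_crash}" "$m"
done
ls "$DIR"
# prints: 16384
```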
On Mon, Aug 14, 2017 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It would be possible to have orphaned non-temp tables if you'd suffered
> a crash during the transaction that created those tables. Ordinarily
> a newly-created table file wouldn't be that large, but if your workflow
> created tables and shoved boatloads of data into them in the same
> transaction, it's not so hard to see this becoming an issue.

I think the working theory is that these were very likely a number of
very large (multi-hundred-GB) materialised views.

> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to. It's hard to see an external
> application being able to do that safely. You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

Yeah, that occurred to me. At this point I would settle for something I
could run with Postgres in single-user mode, although that is very far
from ideal. So what I wonder is whether at least a short-term solution
might be a utility that starts Postgres in single-user mode, with the
requirement that PostgreSQL is otherwise not running during the run.

I am certainly not feeling qualified at present to attempt the more
advanced solutions, but that much I might be able to do.
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
On Tue, Aug 15, 2017 at 3:56 AM, Andres Freund <andres@anarazel.de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

I agree that an approach like that has value for the problem defined in
this thread.

> I first thought that'd cost an additional fsync per relation
> created. But I think we actually can delay that to a pre-commit phase,
> if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
> them just before checkpoint (via the usual delayed fsync mechanism).
> That'd still require an XLogFlush(), but that seems hard to avoid unless
> we just don't create relations on FS level until buffers are
> evicted and/or BufferSync().

Yeah, that should work as well.
--
Michael
There's another side to this, and I am not sure it is a backend crash.
Here is what I did to reproduce:
2 virtual disk images: 100 MB for main data, 40 MB for WAL. work_mem set to 256 MB. The idea is to test different out-of-space conditions.
Create table as ...; drop table; select pg_size_pretty(pg_current_xlog_location() - '0/0');
I played around with parameters to determine how different kinds of out of space errors were handled.
1. running out of temp space was cleaned up without a server restart needed.
2. A relation running out of disk space *seemed* to get cleaned up.
3. Running out of WAL space left *both* temp and non-temp files.
I wonder about a different solution. Would it be possible to special case vacuum to check for and remove (or just move to where they can be removed) files when vacuuming pg_class? At the point we are vacuuming pg_class, we ought to be able to know that a relfilenode shouldn't be used anymore, right?
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
Chris Travers <chris.travers@adjust.com> writes:
> I wonder about a different solution. Would it be possible to special case
> vacuum to check for and remove (or just move to where they can be removed)
> files when vacuuming pg_class? At the point we are vacuuming pg_class, we
> ought to be able to know that a relfilenode shouldn't be used anymore,
> right?

I don't think so. It's not clear to me whether you have in mind "scan
pg_class, collect relfilenodes from all live tuples, then zap all files
not in that set" or "when removing a dead tuple, zap the relfilenode
it mentions". But neither one works. The first case has a race condition
against new pg_class entries. As for the second, the existence of a dead
tuple bearing relfilenode N isn't evidence that some other live tuple
can't have relfilenode N.

Another problem for the second solution is that in the case you're worried
about (ie, PANIC due to out-of-WAL-space during relation's creating
transaction), there's no very good reason to expect that the relation's
pg_class tuple ever made it to disk at all.

A traditional low-tech answer to this has been to keep the WAL on a
separate volume from the main data store, so that it's protected from
out-of-space conditions in the main store and temp areas. The space
needs for WAL proper are generally much more predictable than the main
store, so it's easier to keep the dedicated space from overflowing.
(Stalled replication/archiving processes can be hazardous to your
health in this scenario, though, if they prevent prompt recycling of
WAL files.)

        regards, tom lane
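(Editorial note: with the server stopped, moving WAL to a dedicated volume is mechanically a move plus a symlink. A sketch against fabricated paths; pg_xlog is the 9.6 directory name, renamed pg_wal in v10.)

```shell
#!/bin/sh
# Sketch: relocate pg_xlog onto a dedicated volume via a symlink.
# Both directories here are fabricated temp dirs standing in for the
# real data directory and the real WAL volume.
PGDATA=$(mktemp -d); WALVOL=$(mktemp -d)
mkdir -p "$PGDATA/pg_xlog"
touch "$PGDATA/pg_xlog/000000010000000000000001"
mv "$PGDATA/pg_xlog" "$WALVOL/pg_xlog"
ln -s "$WALVOL/pg_xlog" "$PGDATA/pg_xlog"
# The segment is now reachable through the symlink:
ls "$PGDATA/pg_xlog/"
# prints: 000000010000000000000001
```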
On Tue, Aug 15, 2017 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Chris Travers <chris.travers@adjust.com> writes:
>> I wonder about a different solution. Would it be possible to special case
>> vacuum to check for and remove (or just move to where they can be removed)
>> files when vacuuming pg_class? At the point we are vacuuming pg_class, we
>> ought to be able to know that a relfilenode shouldn't be used anymore,
>> right?
>
> I don't think so. It's not clear to me whether you have in mind "scan
> pg_class, collect relfilenodes from all live tuples, then zap all files
> not in that set" or "when removing a dead tuple, zap the relfilenode
> it mentions". But neither one works. The first case has a race condition
> against new pg_class entries. As for the second, the existence of a dead
> tuple bearing relfilenode N isn't evidence that some other live tuple
> can't have relfilenode N.

Ah, because if the file never made it onto disk, the number could be re-used.

> Another problem for the second solution is that in the case you're worried
> about (ie, PANIC due to out-of-WAL-space during relation's creating
> transaction), there's no very good reason to expect that the relation's
> pg_class tuple ever made it to disk at all.
>
> A traditional low-tech answer to this has been to keep the WAL on a
> separate volume from the main data store, so that it's protected from
> out-of-space conditions in the main store and temp areas. The space
> needs for WAL proper are generally much more predictable than the main
> store, so it's easier to keep the dedicated space from overflowing.
> (Stalled replication/archiving processes can be hazardous to your
> health in this scenario, though, if they prevent prompt recycling of
> WAL files.)
Yeah, most of our dbs here have WAL on a separate volume, but not this system. This system is also unusual in that disk usage varies wildly. (I am not 100% sure that this is the only case which causes it, though I can reproduce it consistently when the WAL writer runs out of disk space, with symptoms exactly matching what I found.)
So for now that leaves my fallback approach as a way to fix it when I see it.
I have written a shell script which does the following:
1. starts Postgres in single user mode against a data directory, or dies (it won't run if Postgres appears to be running already)
2. gets the oid of the current database
3. lists all files consisting of only digits in the base/[dboid] directory
4. asks Postgres (in single user mode again) for all relfilenodes and oids of tables (in my testing both were required because there were some cases where relfilenodes were not set for some system relations)
5. loops through the file nodes gathered, checks them against the relfilenode entries, and zaps $f, ${f}_*, and ${f}.*. Currently, for testing, "zaps" means moving the files to a lostnfound folder for inspection after the script runs. The logic here is not perfect and is very slightly under-inclusive, but better that than the other way.
Then we can start Postgres again. I cannot find a better way to avoid race conditions, I guess. At any rate, it sounds like preventing the problem more generally may be beyond what I would feel comfortable attempting as a patch at my current level of familiarity with the source code.
The full script is included inline below my signature in case it is of interest to anyone on the list.
Best Regards,
Chris Travers
Database Administrator
---
#!/bin/bash
datadir=$1
database=$2
pg_ctl -D $datadir stop
dboidfile="$PWD/cleanupdb.oid"
reloidfile="$PWD/refilenodes.list"
echo "COPY (select oid from pg_database where datname = current_database()) TO '$dboidfile'" | postgres --single -D $datadir $database > /dev/null
if (($?))
then
echo "FATAL: Could not start Postgres in single user mode"
exit 1
fi
dboid=`cat $dboidfile`
filenodes=`(cd $datadir/base/$dboid; ls [0-9]*[0-9] | grep -v '\.' | sort -n)`
#echo $filenodes
echo "COPY (select relfilenode from pg_class union select oid as relfilenode from pg_class) TO '$reloidfile'" | postgres --single -D $datadir $database > /dev/null
relfilenodes=`cat $reloidfile`
#echo $relfilenodes
if [[ -z $relfilenodes ]]
then
echo "FATAL: did not get any relfilenodes"
exit 2
fi
mkdir lostnfound;
for f in $filenodes
do
if [[ -z `echo $relfilenodes | grep -w $f` ]]
then
echo moving $f to lostnfound
mv $datadir/base/$dboid/$f lostnfound
mv $datadir/base/$dboid/${f}_* lostnfound 2> /dev/null
mv $datadir/base/$dboid/${f}.* lostnfound 2> /dev/null
fi
done
rm $dboidfile
rm $reloidfile
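(Editorial note: the grep loop above does a quadratic membership test; the same set difference can be expressed with comm(1) on the two sorted lists. A self-contained sketch with hard-coded stand-ins for the ls and psql output; the filenode values are fabricated.)

```shell
#!/bin/sh
# Sketch: orphans = (filenodes on disk) minus (filenodes in pg_class).
TMP=$(mktemp -d)
printf '%s\n' 16384 16388 16402 99999 | sort > "$TMP/on_disk"
printf '%s\n' 16384 16388 16402       | sort > "$TMP/in_pg_class"
# comm -23 prints lines unique to the first file, i.e. the orphans.
comm -23 "$TMP/on_disk" "$TMP/in_pg_class"
# prints: 99999
```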
So having thought about this a bit more, and having had some real-world experience with the script now, I have an idea that might work and some questions to make it succeed.
My thinking is to add a new form of vacuum called VACUUM FSCK.
This would:
1. lock pg_class in exclusive mode (or do I need exclusive access?), as this is needed to solve the race conditions. As I see it, this brings the database to a screeching halt concurrency-wise (but unlike my script it would allow other databases to be accessed normally).
2. read the files where the name consists of only digits out of the filesystem and compare with oids from pg_class and relfilenodes
3. Any file not found in that list would then be unlinked, along with any files matching that pattern followed by an underscore or period.
This would mean that the following cases would not be handled:
If the first extent is gone but later extents are present, we check only the first extent and so would not see the later ones. Same goes for visibility maps and other helper files.
If you add a file in the directory which has a name like 34F3A222BC, that would never get cleaned up because it contains non-digits.
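(Editorial note: the digit-only rule in step 2 and the underscore/period rule in step 3 amount to a simple name classifier. A sketch over fabricated sample names; not backend code.)

```shell
#!/bin/sh
# Sketch: classify directory entries the way the proposal describes.
# All-digit names are first segments; digits followed by '.' or '_'
# are extents or forks (.1, _fsm, _vm); everything else is ignored.
for f in 16384 16384.1 16384_fsm 16384_vm 34F3A222BC pg_internal.init; do
  if echo "$f" | grep -qE '^[0-9]+$'; then
    echo "$f: base segment"
  elif echo "$f" | grep -qE '^[0-9]+[._]'; then
    echo "$f: fork or extent"
  else
    echo "$f: ignored"
  fi
done
```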
So this leads to the following questions:
1. Is locking pg_class enough to avoid race conditions? Is exclusive mode sufficient or do I need exclusive access mode?
2. would it be preferable to move the file to a directory rather than unlinking it?
3. Should I perform any sort of check on the tables at the end to make sure everything is ok?
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
On Mon, Aug 14, 2017 at 2:56 PM, Andres Freund <andres@anarazel.de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

At the risk of being predictable, I think we should add an undo
subsystem instead of continuing to create ad-hoc solutions to problems
like this. (Of course, that's being worked on by Thomas, Amit, and
others.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-08-16 14:20:02 +0200, Chris Travers wrote:
> So having throught about this a bit more, and having had some real-world
> experience with the script now, I have an idea that might work and some
> questions to make it succeed.
>
> My thinking is to add a new form of vacuum called VACUUM FSCK.
>
> This would:
> 1. lock pg_class in exclusive mode (or do I need exclusive access?), as
> this is needed to solve the race conditions. As I see, this seems to bring
> the database to a screeching halt concurrency-wise (but unlike my script
> would allow other databases to be accessed normally).
> 2. read the files where the name consists of only digits out of the
> filesystem and compare with oids from pg_class and relfilenodes
> 3. Any file not found in that list would then unlink it, as well as any
> files with the patter followed by an underscore or period.
>
> This would mean that the following cases would not be handled:
>
> If you have the first extent gone but later extents are present we check on
> the first extant, and so would not see the later ones. Same goes for
> visibility maps and other helper files.
>
> If you add a file in the directory which has a name like 34F3A222BC, that
> would never get cleaned up because it contains non-digits.
>
> So this leads to the following questions:
>
> 1. Is locking pg_class enough to avoid race conditions? Is exclusive mode
> sufficient or do I need exclusive access mode?
> 2. would it be preferable to move the file to a directory rather than
> unlinking it?
> 3. Should I perform any sort of check on the tables at the end to make
> sure everything is ok?

I think this entirely is the wrong approach. We shouldn't add weird
check commands that require locks on pg_class, we should avoid leaving
the orphaned files in the first place. I've upthread outlined
approaches for how to do so.

Greetings,

Andres Freund
On Wed, Aug 16, 2017 at 7:15 PM, Andres Freund <andres@anarazel.de> wrote:
> I think this entirely is the wrong approach. We shouldn't add weird
> check commands that require locks on pg_class, we should avoid leaving
> the orphaned files in the first place. I've upthread outlined
> approaches for how to do so.
And in the meantime, I suppose there is no reason the approach I have outlined cannot be used by an external program. I think it is sane to draw the line at having the db responsible for cleaning up after itself when something goes wrong, and leave external programs to do after-the-fact review and cleanup.
Best Regards,
Chris Travers
Database Administrator
Saarbrücker Straße 37a, 10405 Berlin
On 14.08.2017 23:56, Andres Freund wrote:
> Alternatively we could do something without marker files, with some
> added complexity: Keep track of all "uncommitted new files" in memory,
> and log them every checkpoint. Commit/abort records clear elements of
> that list. Since we always start replay at the beginning of a
> checkpoint, we'd always reach a moment with such an up2date list of
> pending-action files before reaching end-of-recovery. At end-of-recovery
> we can delete all unconfirmed files. To avoid out-of-memory due to too
> many tracked relations, we'd possibly still have to have marker files...

Hi, hackers.

I'm sorry, but I want to bump this thread, because there is still no good
solution to the problem. I see there are a few threads with undo-based
approaches, which look preferable but have some pitfalls. Is there any
chance of returning to the non-undo approaches partially discussed here?

What do you think about the following solutions?

1) Make `pendingDeletes` shared and let the postmaster clean all garbage
in case of a child process dying.
Cons: Does not work in case of the postmaster dying. Must take care of
`pendingDeletes` pointer validity.

2) Catch and store all records with relfilenodes during WAL replay, and
delete all orphaned nodes at the end of replaying.
Cons: The final delete may use an incomplete list of nodes, as there may
have been something before the latest checkpoint. Also the general
opacity - we remove something without a corresponding WAL record (or
possibly do it in the wrong place in general).

3) This way is close to the one I quoted and is a combination of the two
above. `pendingDeletes` is shared. Each checkpoint creates a WAL record
with a list of open transactions and created nodes. WAL replay can use
this list as a base, adding nodes to it from newer records. The final
delete operation has a complete list of orphaned nodes.
Cons: Complexity(?). Others(?). Can it work?

Are any of these approaches still relevant?

I'm sorry for the last html-formatted message. Our web-based app is too
smart.

--
Regards,
Alex Go, C developer
goa@arenadata.io, www.arenadata.tech