Thread: Cleaning up unreferenced table files
Here's a patch for the TODO item "Remove unreferenced table files created by transactions that were in-progress when the server terminated abruptly." It adds a new function, CleanupStaleRelFiles, that scans through the data directory and removes all table files that are not mentioned in pg_class of the corresponding database. CleanupStaleRelFiles is called after WAL recovery. Actually, the patch doesn't currently delete the files, just issues a warning. Testing is easier if the files don't keep getting deleted :). The patch also adds a GetTablespacePath function similar to GetDatabasePath that constructs the path to a tablespace symbolic link. commands/tablespace.c is modified to use it, in addition to the new CleanupStaleRelFiles function. - Heikki
Attachment
Heikki Linnakangas <hlinnaka@iki.fi> writes: > Here's a patch for the TODO item "Remove unreferenced table files created by transactions > that were in-progress when the server terminated abruptly." xlog.c is a fairly random place to put that functionality. Didn't it strike any warning bells for you when you had to add so many new #includes? I'm not entirely sure where this should go, but not there. regards, tom lane
On Sat, 5 Mar 2005, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> Here's a patch for the TODO item "Remove unreferenced table files created by transactions >> that were in-progress when the server terminated abruptly." > > xlog.c is a fairly random place to put that functionality. Didn't it > strike any warning bells for you when you had to add so many new > #includes? I'm not entirely sure where this should go, but not there. Yeah actually it did, but I forgot about it along the way. How about putting it in a file of its own in backend/catalog? There's some code that also deals with the data directories. Or straight into backend/storage. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Sat, 5 Mar 2005, Tom Lane wrote: >> xlog.c is a fairly random place to put that functionality. Didn't it >> strike any warning bells for you when you had to add so many new >> #includes? I'm not entirely sure where this should go, but not there. > Yeah actually it did, but I forgot about it along the way. How about > putting it in a file of its own in backend/catalog? There's some code that > also deals with the data directories. Or straight into backend/storage. Actually, you could make some case for putting it in utils/init/ beside flatfiles.c, which has got much the same sort of issues to deal with. I think though that we ought to first consider the question of whether we *want* this functionality. On reflection I'm fairly nervous about the idea of actually deleting anything during startup --- seems like a good recipe for turning small failures into large failures. But if we're not going to delete anything then it's questionable whether we need to code it like this at all. It'd certainly be easier and safer to examine these tables after the system is up and running normally. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Sat, 5 Mar 2005, Tom Lane wrote: > >> xlog.c is a fairly random place to put that functionality. Didn't it > >> strike any warning bells for you when you had to add so many new > >> #includes? I'm not entirely sure where this should go, but not there. > > > Yeah actually it did, but I forgot about it along the way. How about > > putting it in a file of its own in backend/catalog? There's some code that > > also deals with the data directories. Or straight into backend/storage. > > Actually, you could make some case for putting it in utils/init/ beside > flatfiles.c, which has got much the same sort of issues to deal with. > > I think though that we ought to first consider the question of whether > we *want* this functionality. On reflection I'm fairly nervous about > the idea of actually deleting anything during startup --- seems like a > good recipe for turning small failures into large failures. But if > we're not going to delete anything then it's questionable whether we > need to code it like this at all. It'd certainly be easier and safer to > examine these tables after the system is up and running normally. Let's discuss this. The patch as submitted checks for unreferenced files on bootup and reports them in the log file, but does not delete them. That seems like the proper behavior. I think we delete from pgsql_tmp on bootup, but we _know_ those aren't referenced. What other user interface would trigger this if we did it after startup? Wouldn't we have to lock pg_class against VACUUM while we scan the file system, and are we sure we do things in pg_class or the file system first consistently? It seems much more prone to error doing it while the system is running. I guess I am happy with just reporting during startup like the patch does now. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, 25 Apr 2005, Bruce Momjian wrote: > Tom Lane wrote: ... >> I think though that we ought to first consider the question of whether >> we *want* this functionality. On reflection I'm fairly nervous about >> the idea of actually deleting anything during startup --- seems like a >> good recipe for turning small failures into large failures. But if >> we're not going to delete anything then it's questionable whether we >> need to code it like this at all. It'd certainly be easier and safer to >> examine these tables after the system is up and running normally. > > Let's discuss this. The patch as submitted checks for unreferenced > files on bootup and reports them in the log file, but does not delete > them. That seems like the proper behavior. I think we delete from > pgsql_tmp on bootup, but we _know_ those aren't referenced. > > What other user interface would trigger this if we did it after startup? > Wouldn't we have to lock pg_class against VACUUM while we scan the file > system, and are we sure we do things in pg_class or the file system > first consistently? It seems much more prone to error doing it while > the system is running. I agree. Also, you can only have stale files after a backend crash, since they are normally cleaned up at the end of transaction. If it was a separate program or command, the administrator would have to be aware of the issue. Otherwise, he wouldn't know he needs to run it after a crash. I feel that crashes that leaves behind stale files are rare. You would need an application that creates/drops tables as part of normal operation. Some kind of a large batch load might do that: BEGIN, CREATE TABLE foo, COPY 1 GB of data, COMMIT. The nasty thing right now is, you might end up with 1 GB of wasted disk space, and never even know it. > I guess I am happy with just reporting during startup like the patch > does now. Ok. I'll fix the design issues Tom addressed earlier, add documentation, and resubmit. We can come back to this after a release or two, when we have more confidence in the feature. Maybe we'll also get some feedback on how often those stale files occur in practice. - Heikki
On Tue, 26 Apr 2005, Heikki Linnakangas wrote: > On Mon, 25 Apr 2005, Bruce Momjian wrote: ... >> I guess I am happy with just reporting during startup like the patch >> does now. > > Ok. I'll fix the design issues Tom addressed earlier, add documentation, and > resubmit. Here you go. The new functionality is now separated in a new file backend/utils/init/cleanup.c. There was code in many places that constructs the path to a tablespace directory. I refactored that into a new function called GetTablespacePath and put it next to GetDatabasePath in catalog.c. I added a section under the "Routine Database Maintenance Tasks" that basically gives a heads up that these notifications can appear in the log after a crash. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > I feel that crashes that leaves behind stale files are rare. Indeed, and getting more so all the time ... which makes me question the value of doing anything about this at all. regards, tom lane
On Tue, 26 Apr 2005, Alvaro Herrera wrote: > You forgot the attachment? Damn. It happens time after time... - Heikki
Attachment
On Tue, 26 Apr 2005, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> I feel that crashes that leaves behind stale files are rare. > > Indeed, and getting more so all the time ... How so? Have changes been made in those parts of the code? > which makes me question the value of doing anything about this at all. What bothers me is that we currently have no means of knowing if it happens and how often it happens, so we're just guessing that "it's rare". How rare? We don't know. If nobody ever runs into this issue in production, and this whole exercise turns out to be completely unnecessary, at least we'll know. That alone makes me feel better. The only drawback of the patch that I can see is the performance impact on recovery. And I think the time it takes to scan the data directories isn't much compared to WAL replay. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Tue, 26 Apr 2005, Tom Lane wrote: >> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>> I feel that crashes that leaves behind stale files are rare. >> >> Indeed, and getting more so all the time ... > How so? Have changes been made in those parts of the code? No, just that the overall reliability of Postgres keeps improving. At the time that TODO entry was created, I don't think we even had the ability to roll back table creates/drops properly, so there were scenarios in which unreferenced files could be left behind without even assuming any software error. And the prevalence of backend crashes was way higher than it is now, too. So I think a good argument could be made that the TODO item isn't nearly as important as it was at the time. > If nobody ever runs into this issue in production, and this whole exercise > turns out to be completely unnecessary, at least we'll know. That alone > makes me feel better. We will know no such thing, unless the patch is made to announce the problem so intrusively that people are certain to see it *and report it to us*. Which I don't think will be acceptable. regards, tom lane
On Wed, 27 Apr 2005, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Tue, 26 Apr 2005, Tom Lane wrote: >>> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>>> I feel that crashes that leaves behind stale files are rare. >>> >>> Indeed, and getting more so all the time ... > >> How so? Have changes been made in those parts of the code? > > No, just that the overall reliability of Postgres keeps improving. > > At the time that TODO entry was created, I don't think we even had the > ability to roll back table creates/drops properly, so there were > scenarios in which unreferenced files could be left behind without even > assuming any software error. And the prevalence of backend crashes was > way higher than it is now, too. So I think a good argument could be > made that the TODO item isn't nearly as important as it was at the time. *shrug*. I won't push any harder if you feel strongly against the patch. Should we just remove the TODO item? And/or document the issue? - Heikki
Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Tue, 26 Apr 2005, Tom Lane wrote: > >> Heikki Linnakangas <hlinnaka@iki.fi> writes: > >>> I feel that crashes that leaves behind stale files are rare. > >> > >> Indeed, and getting more so all the time ... > > > How so? Have changes been made in those parts of the code? > > No, just that the overall reliability of Postgres keeps improving. > > At the time that TODO entry was created, I don't think we even had the > ability to roll back table creates/drops properly, so there were > scenarios in which unreferenced files could be left behind without even > assuming any software error. And the prevalence of backend crashes was > way higher than it is now, too. So I think a good argument could be > made that the TODO item isn't nearly as important as it was at the time. > > > If nobody ever runs into this issue in production, and this whole exercise > > turns out to be completely unnecessary, at least we'll know. That alone > > makes me feel better. > > We will know no such thing, unless the patch is made to announce the > problem so intrusively that people are certain to see it *and report it > to us*. Which I don't think will be acceptable. Well, if putting it in the server logs isn't enough, I don't know what is. I think we do need the patch, at least to find out if there is an issue we don't know about. I don't see the hard in it. Heikki, would you time startup and tell us what percentage of time is taken by the routines? Or I can do it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I think we do need the patch, at least to find out if there is an issue > we don't know about. My point is that we won't find out anything, because we will have no idea if people are noticing the log entries at all, much less telling us about 'em. Of course if they do tell us then we'd know something, but what I am expecting is a vast silence. We will not be able to tell the difference between "no problem" and "very infrequent problem" any better than we can now. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I think we do need the patch, at least to find out if there is an issue > > we don't know about. > > My point is that we won't find out anything, because we will have no > idea if people are noticing the log entries at all, much less telling > us about 'em. Of course if they do tell us then we'd know something, > but what I am expecting is a vast silence. We will not be able to tell > the difference between "no problem" and "very infrequent problem" any > better than we can now. I think people do look at their logs. We are certainly better off finding it than having no reporting at all. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Uh, you forgot to add cleanup.h. --------------------------------------------------------------------------- Heikki Linnakangas wrote: > On Tue, 26 Apr 2005, Alvaro Herrera wrote: > > > You forgot the attachment? > > Damn. It happens time after time... > > - Heikki Content-Description: [ Attachment, skipping... ] -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
How embarrasing... I hope it's all there now. On Thu, 28 Apr 2005, Bruce Momjian wrote: > Uh, you forgot to add cleanup.h. > > --------------------------------------------------------------------------- > > Heikki Linnakangas wrote: >> On Tue, 26 Apr 2005, Alvaro Herrera wrote: >> >>> You forgot the attachment? >> >> Damn. It happens time after time... >> >> - Heikki > > Content-Description: > > [ Attachment, skipping... ] > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > - Heikki
Attachment
Applied. --------------------------------------------------------------------------- pgman wrote: > Tom Lane wrote: > > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > > On Tue, 26 Apr 2005, Tom Lane wrote: > > >> Heikki Linnakangas <hlinnaka@iki.fi> writes: > > >>> I feel that crashes that leaves behind stale files are rare. > > >> > > >> Indeed, and getting more so all the time ... > > > > > How so? Have changes been made in those parts of the code? > > > > No, just that the overall reliability of Postgres keeps improving. > > > > At the time that TODO entry was created, I don't think we even had the > > ability to roll back table creates/drops properly, so there were > > scenarios in which unreferenced files could be left behind without even > > assuming any software error. And the prevalence of backend crashes was > > way higher than it is now, too. So I think a good argument could be > > made that the TODO item isn't nearly as important as it was at the time. > > Agreed, but I don't think we are ever going to have the file system obey > the same semantics as the database, so there will always be corner cases > where we have the possibility for unreferenced files. > > > > If nobody ever runs into this issue in production, and this whole exercise > > > turns out to be completely unnecessary, at least we'll know. That alone > > > makes me feel better. > > > > We will know no such thing, unless the patch is made to announce the > > problem so intrusively that people are certain to see it *and report it > > to us*. Which I don't think will be acceptable. > > You are right that checking only after a server crash isn't going to > help us very much, because the odds someone is going to look at the logs > just at that moment are slim, and if they stop and start the server > again, the message will not appear so they will think it is fixed. > > Though the problem is caused by a server crash, it remains even after a > clean startup. My new version of the patch checks not just after a > server crash, but every time the server starts up. Here are some > timings: > > 3dbs without patch 0m0.230s 300 files > 3dbs with patch 0m0.230s 300 files > 100dbs with patch 0m0.270s 10000 files > > (The timing test started the server and waited for the message "database > system is ready".) > > You can see that for three empty databases (only system tables/indexes), > the patch has no impact. For 100 databases, or 10,000 files, there is a > 17% increase in startup time, which seems acceptable. > > The message printed in the logs is: > > LOG: The table or index file "/u/pgsql/data/base/1/100001" is stale and can be safely removed > > I made a few changes to the patch. I changed 'clean' to 'check' because > I don't think we are ever going to remove the files on startup (If we > did we could do the cleanup just on crash recovery startup.) I changed > the message from NOTICE to LOG, I made it run on all startups and not > just crash recovery, and it was missing FreeDir() calls. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > Index: doc/src/sgml/maintenance.sgml > =================================================================== > RCS file: /cvsroot/pgsql/doc/src/sgml/maintenance.sgml,v > retrieving revision 1.41 > diff -c -c -r1.41 maintenance.sgml > *** doc/src/sgml/maintenance.sgml 20 Feb 2005 02:21:26 -0000 1.41 > --- doc/src/sgml/maintenance.sgml 1 May 2005 04:30:48 -0000 > *************** > *** 474,479 **** > --- 474,496 ---- > </para> > </sect1> > > + <sect1 id="check-files-after-crash"> > + <title>Check files after crash</title> > + > + <indexterm zone="check-files-after-crash"> > + <primary>stale file</primary> > + </indexterm> > + > + <para> > + <productname>PostgreSQL</productname> recovers automatically after crash > + using the write-ahead log (see <xref linkend="wal">) and no manual > + operations are normally needed. However, if there was a transaction running > + when the crash occured that created or dropped a relation, the > + transaction might have left a stale file in the data directory. If this > + happens, you will get a notice in the log file stating which files can be > + deleted. > + </para> > + </sect1> > > <sect1 id="logfile-maintenance"> > <title>Log File Maintenance</title> > Index: src/backend/access/transam/xlog.c > =================================================================== > RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v > retrieving revision 1.189 > diff -c -c -r1.189 xlog.c > *** src/backend/access/transam/xlog.c 28 Apr 2005 21:47:10 -0000 1.189 > --- src/backend/access/transam/xlog.c 1 May 2005 04:30:52 -0000 > *************** > *** 43,48 **** > --- 43,49 ---- > #include "utils/builtins.h" > #include "utils/guc.h" > #include "utils/relcache.h" > + #include "utils/flatfiles.h" > > > /* > *************** > *** 4525,4530 **** > --- 4526,4533 ---- > > CreateCheckPoint(true, true); > > + CheckStaleRelFiles(); > + > /* > * Close down recovery environment > */ > *************** > *** 4536,4541 **** > --- 4539,4550 ---- > */ > remove_backup_label(); > } > + else > + { > + XLogInitRelationCache(); > + CheckStaleRelFiles(); > + XLogCloseRelationCache(); > + } > > /* > * Preallocate additional log files, if wanted. > Index: src/backend/catalog/catalog.c > =================================================================== > RCS file: /cvsroot/pgsql/src/backend/catalog/catalog.c,v > retrieving revision 1.59 > diff -c -c -r1.59 catalog.c > *** src/backend/catalog/catalog.c 14 Apr 2005 20:03:23 -0000 1.59 > --- src/backend/catalog/catalog.c 1 May 2005 04:30:52 -0000 > *************** > *** 106,111 **** > --- 106,144 ---- > return path; > } > > + /* > + * GetTablespacePath - construct path to a tablespace symbolic link > + * > + * Result is a palloc'd string. > + * > + * XXX this must agree with relpath and GetDatabasePath! > + */ > + char * > + GetTablespacePath(Oid spcNode) > + { > + int pathlen; > + char *path; > + > + Assert(spcNode != GLOBALTABLESPACE_OID); > + > + if (spcNode == DEFAULTTABLESPACE_OID) > + { > + /* The default tablespace is {datadir}/base */ > + pathlen = strlen(DataDir) + 5 + 1; > + path = (char *) palloc(pathlen); > + snprintf(path, pathlen, "%s/base", > + DataDir); > + } > + else > + { > + /* All other tablespaces have symlinks in pg_tblspc */ > + pathlen = strlen(DataDir) + 11 + OIDCHARS + 1; > + path = (char *) palloc(pathlen); > + snprintf(path, pathlen, "%s/pg_tblspc/%u", > + DataDir, spcNode); > + } > + return path; > + } > > /* > * IsSystemRelation > Index: src/backend/commands/tablespace.c > =================================================================== > RCS file: /cvsroot/pgsql/src/backend/commands/tablespace.c,v > retrieving revision 1.17 > diff -c -c -r1.17 tablespace.c > *** src/backend/commands/tablespace.c 14 Apr 2005 20:03:24 -0000 1.17 > --- src/backend/commands/tablespace.c 1 May 2005 04:30:53 -0000 > *************** > *** 341,348 **** > /* > * All seems well, create the symlink > */ > ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, tablespaceoid); > > if (symlink(location, linkloc) < 0) > ereport(ERROR, > --- 341,347 ---- > /* > * All seems well, create the symlink > */ > ! linkloc = GetTablespacePath(tablespaceoid); > > if (symlink(location, linkloc) < 0) > ereport(ERROR, > *************** > *** 495,502 **** > char *subfile; > struct stat st; > > ! location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(location, "%s/pg_tblspc/%u", DataDir, tablespaceoid); > > /* > * Check if the tablespace still contains any files. We try to rmdir > --- 494,500 ---- > char *subfile; > struct stat st; > > ! location = GetTablespacePath(tablespaceoid); > > /* > * Check if the tablespace still contains any files. We try to rmdir > *************** > *** 1036,1043 **** > set_short_version(location); > > /* Create the symlink if not already present */ > ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, xlrec->ts_id); > > if (symlink(location, linkloc) < 0) > { > --- 1034,1040 ---- > set_short_version(location); > > /* Create the symlink if not already present */ > ! linkloc = GetTablespacePath(xlrec->ts_id); > > if (symlink(location, linkloc) < 0) > { > Index: src/backend/utils/adt/misc.c > =================================================================== > RCS file: /cvsroot/pgsql/src/backend/utils/adt/misc.c,v > retrieving revision 1.40 > diff -c -c -r1.40 misc.c > *** src/backend/utils/adt/misc.c 31 Dec 2004 22:01:22 -0000 1.40 > --- src/backend/utils/adt/misc.c 1 May 2005 04:30:53 -0000 > *************** > *** 26,31 **** > --- 26,32 ---- > #include "funcapi.h" > #include "catalog/pg_type.h" > #include "catalog/pg_tablespace.h" > + #include "catalog/catalog.h" > > #define atooid(x) ((Oid) strtoul((x), NULL, 10)) > > *************** > *** 144,154 **** > > fctx = palloc(sizeof(ts_db_fctx)); > > - /* > - * size = path length + tablespace dirname length + 2 dir sep > - * chars + oid + terminator > - */ > - fctx->location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > if (tablespaceOid == GLOBALTABLESPACE_OID) > { > fctx->dirdesc = NULL; > --- 145,150 ---- > *************** > *** 157,168 **** > } > else > { > ! if (tablespaceOid == DEFAULTTABLESPACE_OID) > ! sprintf(fctx->location, "%s/base", DataDir); > ! else > ! sprintf(fctx->location, "%s/pg_tblspc/%u", DataDir, > ! tablespaceOid); > ! > fctx->dirdesc = AllocateDir(fctx->location); > > if (!fctx->dirdesc) > --- 153,159 ---- > } > else > { > ! fctx->location = GetTablespacePath(tablespaceOid); > fctx->dirdesc = AllocateDir(fctx->location); > > if (!fctx->dirdesc) > Index: src/backend/utils/init/Makefile > =================================================================== > RCS file: /cvsroot/pgsql/src/backend/utils/init/Makefile,v > retrieving revision 1.18 > diff -c -c -r1.18 Makefile > *** src/backend/utils/init/Makefile 20 Feb 2005 02:22:00 -0000 1.18 > --- src/backend/utils/init/Makefile 1 May 2005 04:30:53 -0000 > *************** > *** 12,18 **** > top_builddir = ../../../.. > include $(top_builddir)/src/Makefile.global > > ! OBJS = flatfiles.o globals.o miscinit.o postinit.o > > all: SUBSYS.o > > --- 12,18 ---- > top_builddir = ../../../.. > include $(top_builddir)/src/Makefile.global > > ! OBJS = flatfiles.o globals.o miscinit.o postinit.o checkfiles.o > > all: SUBSYS.o > > Index: src/include/catalog/catalog.h > =================================================================== > RCS file: /cvsroot/pgsql/src/include/catalog/catalog.h,v > retrieving revision 1.30 > diff -c -c -r1.30 catalog.h > *** src/include/catalog/catalog.h 31 Dec 2004 22:03:24 -0000 1.30 > --- src/include/catalog/catalog.h 1 May 2005 04:30:54 -0000 > *************** > *** 19,24 **** > --- 19,25 ---- > > extern char *relpath(RelFileNode rnode); > extern char *GetDatabasePath(Oid dbNode, Oid spcNode); > + extern char *GetTablespacePath(Oid spcNode); > > extern bool IsSystemRelation(Relation relation); > extern bool IsToastRelation(Relation relation); > Index: src/include/utils/flatfiles.h > =================================================================== > RCS file: /cvsroot/pgsql/src/include/utils/flatfiles.h,v > retrieving revision 1.1 > diff -c -c -r1.1 flatfiles.h > *** src/include/utils/flatfiles.h 20 Feb 2005 02:22:07 -0000 1.1 > --- src/include/utils/flatfiles.h 1 May 2005 04:30:54 -0000 > *************** > *** 30,33 **** > --- 30,36 ---- > > extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS); > > + /* from checkfiles.c */ > + extern void CheckStaleRelFiles(void); > + > #endif /* FLATFILES_H */ > /*------------------------------------------------------------------------- > * > * checkfiles.c > * support to clean up stale relation files on crash recovery > * > * If a backend crashes while in a transaction that has created or > * deleted a relfilenode, a stale file can be left over in the data > * directory. This file contains routines to clean up those stale > * files on recovery. > * > * $PostgreSQL$ > * > *------------------------------------------------------------------------- > */ > #include "postgres.h" > > #include "storage/fd.h" > > #include "utils/flatfiles.h" > #include "miscadmin.h" > #include "catalog/pg_tablespace.h" > #include "catalog/catalog.h" > #include "access/skey.h" > #include "utils/fmgroids.h" > #include "access/relscan.h" > #include "access/heapam.h" > #include "utils/resowner.h" > > static void CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid); > static void CheckStaleRelFilesFromTablespace(Oid tablespaceoid); > > /* Like AllocateDir, but ereports on failure */ > static DIR * > AllocateDirChecked(char *path) > { > DIR *dirdesc = AllocateDir(path); > if (dirdesc == NULL) > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not open directory \"%s\": %m", > path))); > return dirdesc; > } > > /* > * Scan through all tablespaces for relations left over > * by aborted transactions. > * > * For example, if a transaction issues > * BEGIN; CREATE TABLE foobar (); > * and then the backend crashes, the file is left in the > * tablespace until CheckStaleRelFiles deletes it. > */ > void > CheckStaleRelFiles(void) > { > DIR *dirdesc; > struct dirent *de; > char *path; > int pathlen; > > pathlen = strlen(DataDir) + 11 + 1; > path = (char *) palloc(pathlen); > snprintf(path, pathlen, "%s/pg_tblspc/", DataDir); > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid tablespaceoid; > > /* Check that the directory name looks like valid tablespace link. */ > tablespaceoid = (Oid) strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > CheckStaleRelFilesFromTablespace(tablespaceoid); > } > FreeDir(dirdesc); > pfree(path); > > CheckStaleRelFilesFromTablespace(DEFAULTTABLESPACE_OID); > } > > /* Scan a specific tablespace for stale relations */ > static void > CheckStaleRelFilesFromTablespace(Oid tablespaceoid) > { > DIR *dirdesc; > struct dirent *de; > char *path; > > path = GetTablespacePath(tablespaceoid); > > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid dboid; > > dboid = (Oid) strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > CheckStaleRelFilesFrom(tablespaceoid, dboid); > } > FreeDir(dirdesc); > pfree(path); > } > > /* Scan a specific database in a specific tablespace for stale relations. > * > * First, pg_class for the database is opened, and the relfilenodes of all > * relations mentioned there are stored in a hash table. > * > * Then the directory is scanned. Every file in the directory that's not > * found in pg_class (the hash table) is logged. > */ > static void > CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid) > { > DIR *dirdesc; > struct dirent *de; > HASHCTL hashctl; > HTAB *relfilenodeHash; > MemoryContext mcxt; > RelFileNode rnode; > char *path; > > /* We create a private memory context so that we can easily deallocate > * the hash table and its contents > */ > mcxt = AllocSetContextCreate(TopMemoryContext, "CheckStaleRelFiles", > ALLOCSET_DEFAULT_MINSIZE, > ALLOCSET_DEFAULT_INITSIZE, > ALLOCSET_DEFAULT_MAXSIZE); > > hashctl.hash = tag_hash; > /* The entry contents is not used for anything, we just check > * if an oid is in the hash table or not. */ > hashctl.keysize = sizeof(Oid); > hashctl.entrysize = 1; > hashctl.hcxt = mcxt; > relfilenodeHash = hash_create("relfilenodeHash", 100, &hashctl, > HASH_FUNCTION > | HASH_ELEM | HASH_CONTEXT); > > /* Read all relfilenodes from pg_class into the hash table */ > { > ResourceOwner owner, oldowner; > Relation rel; > HeapScanDesc scan; > HeapTuple tuple; > > /* Need a resowner to keep the heapam and buffer code happy */ > owner = ResourceOwnerCreate(NULL, "CheckStaleRelFiles"); > oldowner = CurrentResourceOwner; > CurrentResourceOwner = owner; > > rnode.spcNode = tablespaceoid; > rnode.dbNode = dboid; > rnode.relNode = RelationRelationId; > rel = XLogOpenRelation(true, 0 , rnode); > > scan = heap_beginscan(rel, SnapshotNow, 0, NULL); > while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) > { > Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple); > > hash_search(relfilenodeHash, &classform->relfilenode, > HASH_ENTER, NULL); > } > heap_endscan(scan); > > XLogCloseRelation(rnode); > CurrentResourceOwner = oldowner; > ResourceOwnerDelete(owner); > } > > /* Scan the directory */ > path = GetDatabasePath(dboid, tablespaceoid); > > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid relfilenode; > > relfilenode = strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > { > /* Filename was a valid number, check if pg_class knows > about it */ > if(hash_search(relfilenodeHash, &relfilenode, > HASH_FIND, NULL) == NULL) > { > char *filepath; > rnode.spcNode = tablespaceoid; > rnode.dbNode = dboid; > rnode.relNode = relfilenode; > > filepath = relpath(rnode); > > ereport(LOG, > (errcode_for_file_access(), > errmsg("The table or index file \"%s\" is stale and can be safely removed", > filepath))); > pfree(filepath); > } > } > } > FreeDir(dirdesc); > pfree(path); > hash_destroy(relfilenodeHash); > MemoryContextDelete(mcxt); > } -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Heikki, Good patch. ...The patch makes no mention of temporary files, which are left there after a crash. Temp files are only removed on a normal startup, whereas the patch invokes removal on both normal startup and crash recovery. That doesn't make much sense... Also, temp file deletion happens in the postmaster, not at the end of recovery in xlog.c. Could we rationalise those two? One place, one action for both? I'd rather have this also as an administrator command rather than as something automatically run after crash recovery. That way I have my debugging opportunity, but the admin can remove them without restarting the server. Same code, just a Function instead... Best Regards, Simon Riggs reference from fd.c: (this is not a patch) /* * Remove temporary files left over from a prior postmaster session * * This should be called during postmaster startup. It will forcibly * remove any leftover files created by OpenTemporaryFile. * * NOTE: we could, but don't, call this during a post-backend-crash restart * cycle. The argument for not doing it is that someone might want to examine * the temp files for debugging purposes. This does however mean that * OpenTemporaryFile had better allow for collision with an existing temp * file name. */ void RemovePgTempFiles(void)
FYI, his patch is in CVS and now only _reports_ unreferenced files, and it happens on recovery and normal startup. --------------------------------------------------------------------------- Simon Riggs wrote: > Heikki, > > Good patch. > > ...The patch makes no mention of temporary files, which are left there > after a crash. > > Temp files are only removed on a normal startup, whereas the patch > invokes removal on both normal startup and crash recovery. That doesn't > make much sense... > > Also, temp file deletion happens in the postmaster, not at the end of > recovery in xlog.c. > > Could we rationalise those two? One place, one action for both? > > I'd rather have this also as an administrator command rather than as > something automatically run after crash recovery. That way I have my > debugging opportunity, but the admin can remove them without restarting > the server. > > Same code, just a Function instead... > > Best Regards, Simon Riggs > > reference from fd.c: (this is not a patch) > > /* > * Remove temporary files left over from a prior postmaster session > * > * This should be called during postmaster startup. It will forcibly > * remove any leftover files created by OpenTemporaryFile. > * > * NOTE: we could, but don't, call this during a post-backend-crash > restart > * cycle. The argument for not doing it is that someone might want to > examine > * the temp files for debugging purposes. This does however mean that > * OpenTemporaryFile had better allow for collision with an existing > temp > * file name. > */ > void > RemovePgTempFiles(void) > > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Applied. Now that I've had a chance to look at it, this patch is thoroughly broken. Problems observed in a quick review: 1. It doesn't work at all for non-default tablespaces: it will claim that every file in such a tablespace is stale. The fact that it does that rather than failing entirely is accidental. It tries to read the database's pg_class in the target tablespace whether it's there or not. Because the system is still in recovery mode, the low-level routines allow the access to the nonexistent pg_class table to pass --- in fact they think they should create the file, so after it runs there's a bogus empty "1259" file in each such tablespace (which of course it complains about, too). The code then proceeds to think that pg_class is empty so of course everything draws a warning. 2. It's not robust against stale subdirectories of a tablespace (ie, subdirs corresponding to a nonexistent database) --- again, it'll try to read a nonexistent pg_class. Then it'll produce a bunch of off-target complaint messages. 3. It's assuming that relfilenode is unique database-wide, when no such assumption is safe. We only have a guarantee that it's unique tablespace-wide. 4. It fails to examine table segment files (such as "nnn.1"). These should be complained of when the "nnn" doesn't match any hash entry. 5. It will load every relfilenode value in pg_class into the hashtable whether it's meaningful or not. There should be a check on relkind. 6. I don't think relying on strtol to decide if a filename is entirely numeric is very safe. Note all the extra defenses in pg_atoi against various platform-specific misbehaviors of strtol. Personally I'd use a strspn test instead. 7. There are no checks for readdir failure (compare any other readdir loop in the backend). See also Simon Riggs' complaints that the circumstances under which it's done are pretty randomly selected. (One particular thing that I think is a bad idea is to do this in a standalone backend. Any sort of corruption in any db's pg_class would render it impossible to start up.) To fix the first three problems, and also avoid the performance problem of multiply rescanning a database's pg_class for each of its tablespaces, I would suggest that the hashtable entries be widened to RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then there should be one iteration over pg_database to learn the OIDs and default tablespaces of each database; with that you can read pg_class from its correct location for each database and load all the entries into the hashtable. Then you iterate through the tablespaces looking for stuff not present in the hashtable. You might also want to build a list or hashtable of known database OIDs, so that you can recognize a stale subdirectory immediately and issue a direct complaint about it without even recursing into it. regards, tom lane
On Thu, 5 May 2005, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Applied. > > Now that I've had a chance to look at it, this patch is thoroughly > broken. Problems observed in a quick review: > > 1. It doesn't work at all for non-default tablespaces: it will > claim that every file in such a tablespace is stale. The fact > that it does that rather than failing entirely is accidental. > It tries to read the database's pg_class in the target tablespace > whether it's there or not. Because the system is still in recovery > mode, the low-level routines allow the access to the nonexistent > pg_class table to pass --- in fact they think they should create > the file, so after it runs there's a bogus empty "1259" file in each > such tablespace (which of course it complains about, too). The code > then proceeds to think that pg_class is empty so of course everything > draws a warning. > > 2. It's not robust against stale subdirectories of a tablespace > (ie, subdirs corresponding to a nonexistent database) --- again, > it'll try to read a nonexistent pg_class. Then it'll produce a > bunch of off-target complaint messages. > > 3. It's assuming that relfilenode is unique database-wide, when no > such assumption is safe. We only have a guarantee that it's unique > tablespace-wide. > > 4. It fails to examine table segment files (such as "nnn.1"). These > should be complained of when the "nnn" doesn't match any hash entry. > > 5. It will load every relfilenode value in pg_class into the hashtable > whether it's meaningful or not. There should be a check on relkind. > > 6. I don't think relying on strtol to decide if a filename is entirely > numeric is very safe. Note all the extra defenses in pg_atoi against > various platform-specific misbehaviors of strtol. Personally I'd use a > strspn test instead. > I'll fix 1-6 according to your suggestions, and send another patch. It shows how little experience I have with multiple database and tablespace management. > 7. There are no checks for readdir failure (compare any other readdir > loop in the backend). I couldn't figure out what you meant. The readdir code is the same as anywhere else. Also, man page (Linux) says that readdir returns NULL on error, and that is checked. > See also Simon Riggs' complaints that the circumstances under which it's > done are pretty randomly selected. (One particular thing that I think > is a bad idea is to do this in a standalone backend. Any sort of > corruption in any db's pg_class would render it impossible to start up.) I'd agree with Simons complaints if we actually deleted the files. But since we only report them, it's a good idea to report them on every startup, otherwise the DBA might think that the stale files are not there anymore since the system isn't complaining about them anymore. The original patch only ran the check on crash recovery, but Bruce changed it to run on startup as well, for the above reason. I agree, though, that it's a bad idea to do it in standalone mode. I'll add a check for that. Also it probably shouldn't stop the startup even if some pg_class is corrupt. Other databases could be fine. > To fix the first three problems, and also avoid the performance problem > of multiply rescanning a database's pg_class for each of its > tablespaces, I would suggest that the hashtable entries be widened to > RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then > there should be one iteration over pg_database to learn the OIDs and > default tablespaces of each database; with that you can read pg_class > from its correct location for each database and load all the entries > into the hashtable. Then you iterate through the tablespaces looking > for stuff not present in the hashtable. You might also want to build a > list or hashtable of known database OIDs, so that you can recognize a > stale subdirectory immediately and issue a direct complaint about it > without even recursing into it. > > regards, tom lane > - Heikki
Maybe we should take a different approach to the problem: 1. Create new file with an extension to mark that it's not yet committed (eg. 1234.notcommitted) 2. ... 3. Take CheckpointStartLock 4. Write commit record to WAL, with list of created files. 5. rename created file (1234.notcommitted -> 1234). 6. Release CheckpointStartLock This would guarantee that after successful WAL replay, all files in the data directory with .notcommitted extension can be safely deleted. No need to read pg_database or pg_class. We would take a performance hit because of the additional rename and fsync step. Also, we must somehow make sure that the new file or the directory it's in is fsynced on checkpoint to make sure that the rename is flushed to disk. A variant of the scheme would be to create two files on step 1. One would be the actual relfile (1234) and the other would an empty marker file (1234.notcommitted). That way the smgr code wouldn't have to care it the file is new or not when opening it. - Heikki On Thu, 5 May 2005, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Applied. > > Now that I've had a chance to look at it, this patch is thoroughly > broken. Problems observed in a quick review: > > 1. It doesn't work at all for non-default tablespaces: it will > claim that every file in such a tablespace is stale. The fact > that it does that rather than failing entirely is accidental. > It tries to read the database's pg_class in the target tablespace > whether it's there or not. Because the system is still in recovery > mode, the low-level routines allow the access to the nonexistent > pg_class table to pass --- in fact they think they should create > the file, so after it runs there's a bogus empty "1259" file in each > such tablespace (which of course it complains about, too). The code > then proceeds to think that pg_class is empty so of course everything > draws a warning. > > 2. It's not robust against stale subdirectories of a tablespace > (ie, subdirs corresponding to a nonexistent database) --- again, > it'll try to read a nonexistent pg_class. Then it'll produce a > bunch of off-target complaint messages. > > 3. It's assuming that relfilenode is unique database-wide, when no > such assumption is safe. We only have a guarantee that it's unique > tablespace-wide. > > 4. It fails to examine table segment files (such as "nnn.1"). These > should be complained of when the "nnn" doesn't match any hash entry. > > 5. It will load every relfilenode value in pg_class into the hashtable > whether it's meaningful or not. There should be a check on relkind. > > 6. I don't think relying on strtol to decide if a filename is entirely > numeric is very safe. Note all the extra defenses in pg_atoi against > various platform-specific misbehaviors of strtol. Personally I'd use a > strspn test instead. > > 7. There are no checks for readdir failure (compare any other readdir > loop in the backend). > > See also Simon Riggs' complaints that the circumstances under which it's > done are pretty randomly selected. (One particular thing that I think > is a bad idea is to do this in a standalone backend. Any sort of > corruption in any db's pg_class would render it impossible to start up.) > > To fix the first three problems, and also avoid the performance problem > of multiply rescanning a database's pg_class for each of its > tablespaces, I would suggest that the hashtable entries be widened to > RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then > there should be one iteration over pg_database to learn the OIDs and > default tablespaces of each database; with that you can read pg_class > from its correct location for each database and load all the entries > into the hashtable. Then you iterate through the tablespaces looking > for stuff not present in the hashtable. You might also want to build a > list or hashtable of known database OIDs, so that you can recognize a > stale subdirectory immediately and issue a direct complaint about it > without even recursing into it. > > regards, tom lane > - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > Maybe we should take a different approach to the problem: > 1. Create new file with an extension to mark that it's not > yet committed (eg. 1234.notcommitted) This is pushing the problem into the wrong place, viz the lowest-level file access routines, which will now all have to know about .notcommitted status. It also creates race conditions --- think about backend A trying to commit file 1234 at about the same time that backend B is trying to flush some dirty buffers belonging to that file. But most importantly, it doesn't handle the file-deletion case. regards, tom lane