Thread: Re: [PATCHES] Cleaning up unreferenced table files
Maybe we should take a different approach to the problem:

1. Create the new file with an extension to mark that it's not yet
   committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. Rename the created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock

This would guarantee that after successful WAL replay, all files in the
data directory with a .notcommitted extension can be safely deleted. No
need to read pg_database or pg_class.

We would take a performance hit because of the additional rename and
fsync step. Also, we must somehow make sure that the new file, or the
directory it's in, is fsynced on checkpoint, to make sure that the
rename is flushed to disk.

A variant of the scheme would be to create two files in step 1. One
would be the actual relfile (1234) and the other would be an empty
marker file (1234.notcommitted). That way the smgr code wouldn't have
to care whether the file is new or not when opening it.

- Heikki

On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken.  Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will claim
> that every file in such a tablespace is stale.  The fact that it does
> that rather than failing entirely is accidental.  It tries to read the
> database's pg_class in the target tablespace whether it's there or
> not.  Because the system is still in recovery mode, the low-level
> routines allow the access to the nonexistent pg_class table to pass
> --- in fact they think they should create the file, so after it runs
> there's a bogus empty "1259" file in each such tablespace (which of
> course it complains about, too).  The code then proceeds to think that
> pg_class is empty, so of course everything draws a warning.
>
> 2.
> It's not robust against stale subdirectories of a tablespace (ie,
> subdirs corresponding to a nonexistent database) --- again, it'll try
> to read a nonexistent pg_class.  Then it'll produce a bunch of
> off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe.  We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1").  These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the
> hashtable whether it's meaningful or not.  There should be a check on
> relkind.
>
> 6. I don't think relying on strtol to decide if a filename is
> entirely numeric is very safe.  Note all the extra defenses in
> pg_atoi against various platform-specific misbehaviors of strtol.
> Personally I'd use a strspn test instead.
>
> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).
>
> See also Simon Riggs' complaints that the circumstances under which
> it's done are pretty randomly selected.  (One particular thing that I
> think is a bad idea is to do this in a standalone backend.  Any sort
> of corruption in any db's pg_class would render it impossible to
> start up.)
>
> To fix the first three problems, and also avoid the performance
> problem of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).
> Then there should be one iteration over pg_database to learn the OIDs
> and default tablespaces of each database; with that you can read
> pg_class from its correct location for each database and load all the
> entries into the hashtable.  Then you iterate through the tablespaces
> looking for stuff not present in the hashtable.
> You might also want to build a list or hashtable of known database
> OIDs, so that you can recognize a stale subdirectory immediately and
> issue a direct complaint about it without even recursing into it.
>
> 			regards, tom lane

- Heikki
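Tom's points 6 and 7 lend themselves to a short sketch. This is a
minimal standalone C illustration, not actual backend code: the helper
names are hypothetical, the strspn test stands in for the strtol check
he objects to, and the readdir loop distinguishes end-of-directory from
failure by resetting and checking errno, as the backend's other readdir
loops do.

```c
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Point 6: decide whether a filename is entirely numeric with strspn,
 * avoiding strtol's platform-specific quirks (leading whitespace,
 * signs, partial parses). */
static bool
filename_is_numeric(const char *name)
{
    size_t      len = strlen(name);

    return len > 0 && strspn(name, "0123456789") == len;
}

/* Point 7: a readdir loop must check for failure, not just for NULL.
 * Returns the number of all-numeric entries, or -1 on any error. */
static int
count_numeric_entries(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    int         count = 0;

    if (dir == NULL)
        return -1;

    errno = 0;
    while ((de = readdir(dir)) != NULL)
    {
        if (filename_is_numeric(de->d_name))
            count++;
        errno = 0;              /* reset before the next readdir call */
    }
    if (errno != 0)             /* readdir failed, not end-of-directory */
        count = -1;

    closedir(dir);
    return count;
}
```

Note that `filename_is_numeric` rejects segment files like "1234.1" as
well, so (per point 4) segment names would need to be split on '.'
before the base is looked up in the hashtable.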
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Maybe we should take a different approach to the problem:
> 1. Create new file with an extension to mark that it's not
> yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status.  It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.

But most importantly, it doesn't handle the file-deletion case.

			regards, tom lane
On Sat, 7 May 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Maybe we should take a different approach to the problem:
>> 1. Create new file with an extension to mark that it's not
>> yet committed (eg. 1234.notcommitted)
>
> This is pushing the problem into the wrong place, viz the lowest-level
> file access routines, which will now all have to know about
> .notcommitted status.  It also creates race conditions --- think about
> backend A trying to commit file 1234 at about the same time that
> backend B is trying to flush some dirty buffers belonging to that
> file.

True. With the rename variant, it might indeed get messy.

Consider the variant with extra marker files. In that case, backend B
doesn't have to know about the .notcommitted status to flush the
buffers.

> But most importantly, it doesn't handle the file-deletion case.

File deletions are easy to handle: just write the list of pending
deletions to WAL on commit.

To recap, we have two slightly different scenarios:

a) Delete a file, write commit record, crash
b) Create a file, crash

Just WAL-logging the deletions on commit would take care of (a). The
.notcommitted mechanism would take care of (b).

- Heikki
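Scenario (a) can be sketched in a few lines of standalone C. All names
here are hypothetical, not backend code: each transaction accumulates
the files it intends to unlink, the commit record would carry that
list, and the actual unlinks happen only after the record is safely
flushed, so replay can redo them after a crash.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING_DELETES 16
#define PENDING_PATH_LEN    64

/* Files this transaction has scheduled for deletion.  The files stay
 * on disk until commit, so the transaction can still roll back. */
static char pending_deletes[MAX_PENDING_DELETES][PENDING_PATH_LEN];
static int  n_pending_deletes = 0;

/* Called where a relation drop is executed. */
static bool
schedule_unlink(const char *path)
{
    if (n_pending_deletes >= MAX_PENDING_DELETES)
        return false;
    strncpy(pending_deletes[n_pending_deletes], path,
            PENDING_PATH_LEN - 1);
    pending_deletes[n_pending_deletes][PENDING_PATH_LEN - 1] = '\0';
    n_pending_deletes++;
    return true;
}

/* At commit: the number of entries that would be copied into the
 * commit WAL record before the unlinks are performed. */
static int
pending_delete_count(void)
{
    return n_pending_deletes;
}
```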
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Consider the variant with extra marker files. In that case, backend B
> doesn't have to know about the .notcommitted status to flush the
> buffers.

[ shrug ]  It's still broken, and the reason is that there's no
equivalent of fsync for directory operations.  Consider

	A creates 1234 and 1234.notcommitted
	A commits
	B performs a checkpoint
	crash

all before A manages to delete 1234.notcommitted, or at least before
that deletion has made its way to disk.  Upon restart, only WAL events
after the checkpoint will be replayed, so 1234.notcommitted doesn't go
away, and then you've got a problem.

To fix this there would need to be a way (1) for B to be aware of the
pending file deletion and (2) for B to delay committing the checkpoint
until the directory update is surely down on disk.  Your proposal
doesn't provide for (1), and even if we fixed that, I know of no
portable kernel API for (2).  fsync isn't applicable.

While your original patch is buggy, it's at least fixable and has
localized, limited impact.  I don't think these schemes are safe at all
--- they put a great deal more weight on the semantics of the
filesystem than I care to do.

			regards, tom lane
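For context on point (2): the closest approximation to "fsync for
directory operations" on many Unixes is to fsync a file descriptor
opened on the directory itself. As Tom says, nothing guarantees this
portably --- POSIX does not require an fsync on a directory fd to flush
the directory's entries --- so the sketch below is illustrative, not a
portable fix.

```c
#include <fcntl.h>
#include <unistd.h>

/* Attempt to force a directory's entry changes (renames, unlinks,
 * creates) down to disk by fsyncing the directory itself.  Linux and
 * most modern BSDs honor this, but POSIX does not promise it does
 * anything, which is exactly the portability problem under discussion.
 * Returns 0 on success, -1 on error. */
static int
fsync_dir(const char *dirpath)
{
    int         fd = open(dirpath, O_RDONLY);
    int         rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}
```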
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > Consider the variant with extra marker files. In that case, backend B
> > doesn't have to know about the .notcommitted status to flush the
> > buffers.
>
> [ shrug ] It's still broken, and the reason is that there's no
> equivalent of fsync for directory operations.  Consider

Traditionally that's because directory operations were always
synchronous, and hence didn't need to be fsynced. I think this is still
true; other systems like qmail's maildir still depend on this, for
example.

-- 
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> [ shrug ] It's still broken, and the reason is that there's no
>> equivalent of fsync for directory operations.  Consider

> Traditionally that's because directory operations were always
> synchronous, and hence didn't need to be fsynced.

That might be true with respect to the process requesting the directory
operation ... but I think you missed the point entirely.

			regards, tom lane
On Sun, 8 May 2005, Tom Lane wrote:

> While your original patch is buggy, it's at least fixable and has
> localized, limited impact.  I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.

I'm going to try this some more, because I feel that a scheme like this
that doesn't rely on scanning pg_class and the file system would in
fact be safer. The key is to A) obey the "WAL first" rule, and B)
remember information about file creations over a checkpoint. The
problem with my previous suggestion was that it didn't reliably
accomplish either :).

Right now we break the WAL rule because the file creation is recorded
after the file is created, and the record is not flushed. The trivial
way to fix that is to write and flush the xlog record before actually
creating the file (for a more optimized way to do it, see the end of
this message). Then we could trust that there aren't any files in the
data directory that don't have a corresponding record in WAL.

But that's not enough. If a checkpoint occurs after the file is
created, but before the transaction ends, WAL replay doesn't see the
file creation record. That's why we need a mechanism to carry the
information over the checkpoint.

We could do that by extending the ForwardFsyncRequest function, or by
creating something similar to it. When a backend writes the file
creation WAL record, it also sends a message to the bgwriter that says
"I'm xid 1234, and I have just created file foobar/1234" (while holding
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs, like it
keeps a list of pending fsync operations. On checkpoint, the
checkpointer scans the list, removes entries for transactions that have
already ended, and attaches the remaining list to the checkpoint
record.
WAL replay would start with the xid/file list in the checkpoint record,
and update it during the replay whenever a file creation or a
transaction commit/rollback record is seen. On a rollback record, files
created by that transaction are deleted. At the end of WAL replay, the
files that are left in the list belong to transactions that implicitly
aborted, and can be deleted. If we don't want to extend the checkpoint
record, a separate WAL record works too.

Now, the more optimized way to do A: delay the actual file creation
until the file is first written to. The write needs to be WAL-logged
anyway, so we would just piggyback on that.

Implemented this way, I don't think there would be a significant
performance hit from the scheme. We would create more
ForwardFsyncRequest traffic, but not much compared to the block fsync
requests we have right now.

BTW: If we allowed mdopen to create the file if it doesn't exist
already, would we need the current file creation xlog record for
anything? (I'm not suggesting we do that, just trying to get more
insight.)

- Heikki
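The replay-side bookkeeping described above can be sketched as a
standalone C fragment. Everything here is hypothetical naming, not
bgwriter or xlog code: replay remembers each (xid, file) creation,
forgets an entry when that transaction's commit or rollback record is
seen (on rollback the caller would also unlink the file), and whatever
remains at end of replay belongs to implicitly aborted transactions and
can be deleted.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING_CREATES 32

typedef struct PendingCreate
{
    unsigned int xid;           /* creating transaction */
    char         path[64];      /* file it created */
    bool         in_use;
} PendingCreate;

static PendingCreate pending[MAX_PENDING_CREATES];

/* Replay of a file-creation record, or an entry carried over in the
 * checkpoint record's xid/file list. */
static bool
remember_creation(unsigned int xid, const char *path)
{
    for (int i = 0; i < MAX_PENDING_CREATES; i++)
    {
        if (!pending[i].in_use)
        {
            pending[i].xid = xid;
            strncpy(pending[i].path, path, sizeof(pending[i].path) - 1);
            pending[i].path[sizeof(pending[i].path) - 1] = '\0';
            pending[i].in_use = true;
            return true;
        }
    }
    return false;               /* list full */
}

/* Replay of a commit or rollback record for xid: either way the
 * entries leave the list. */
static void
forget_transaction(unsigned int xid)
{
    for (int i = 0; i < MAX_PENDING_CREATES; i++)
    {
        if (pending[i].in_use && pending[i].xid == xid)
            pending[i].in_use = false;
    }
}

/* At end of replay: entries still present were created by implicitly
 * aborted transactions; their files can be deleted. */
static int
leftover_creations(void)
{
    int         count = 0;

    for (int i = 0; i < MAX_PENDING_CREATES; i++)
        if (pending[i].in_use)
            count++;
    return count;
}
```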
Heikki Linnakangas wrote:
> On Sun, 8 May 2005, Tom Lane wrote:
>
> > While your original patch is buggy, it's at least fixable and has
> > localized, limited impact.  I don't think these schemes are safe
> > at all --- they put a great deal more weight on the semantics of
> > the filesystem than I care to do.
>
> I'm going to try this some more, because I feel that a scheme like this
> that doesn't rely on scanning pg_class and the file system would in fact
> be safer.

The current code is nice and localized and doesn't add any burden on
our existing code, which is already complicated enough. I think we
either fix checkfiles.c, or we remove it and decide it isn't worth
checking for unreferenced files.

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sun, 8 May 2005, Tom Lane wrote:
>> While your original patch is buggy, it's at least fixable and has
>> localized, limited impact.  I don't think these schemes are safe
>> at all --- they put a great deal more weight on the semantics of
>> the filesystem than I care to do.

> I'm going to try this some more, because I feel that a scheme like this
> that doesn't rely on scanning pg_class and the file system would in fact
> be safer.

I think this proposal is getting more and more invasive and expensive,
and it's all to solve a problem that we do not even know is worth
spending any time on.  I *really* think this is the wrong direction to
take.  Aside from the required effort and risk of breaking things, the
original patch incurred cost only during crash recovery; this is
pushing costs into the normal code paths.

> Delay the actual file creation until it's first written to. The write
> needs to be WAL logged anyway, so we would just piggyback on that.

This is a bad idea, since by then it's (potentially) too late to roll
back the creating transaction if the creation fails.  Consider for
instance a tablespace directory that's mispermissioned read-only, or
some such.  I'd rather have the CREATE TABLE fail than a later INSERT.
(Admittedly, we can't completely guarantee that an INSERT won't hit
some kind of filesystem-level problem, but it's still something to try
to avoid.)

Also, the "first write" actually comes from mdextend, which is not a
WAL-logged operation AFAIR.  Some rethinking of that would be necessary
before this would have any chance of working.

			regards, tom lane
On Tue, 10 May 2005, Bruce Momjian wrote:

> The current code is nice and localized and doesn't add any burden on our
> existing code, which is already complicated enough. I think we either
> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> unreferenced files.

Let's pull the patch for now. I still think we should check for
unreferenced files, but it needs some more thought, and it's by no
means urgent. If I have time later on, I might write a patch to
implement the WAL-based scheme to see how much change to existing code
it really would need.

- Heikki
Heikki Linnakangas wrote:
> On Tue, 10 May 2005, Bruce Momjian wrote:
>
> > The current code is nice and localized and doesn't add any burden on our
> > existing code, which is already complicated enough. I think we either
> > fix checkfiles.c, or we remove it and decide it isn't worth checking for
> > unreferenced files.
>
> Let's pull the patch for now.

Removed, and TODO item restored.

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Tue, 10 May 2005, Bruce Momjian wrote:
>> The current code is nice and localized and doesn't add any burden on our
>> existing code, which is already complicated enough. I think we either
>> fix checkfiles.c, or we remove it and decide it isn't worth checking for
>> unreferenced files.

> Let's pull the patch for now.

FWIW, I was OK with the idea of adding something similar to the given
patch to find out whether we had a problem or not.  With sufficient
evidence that lost files are a big problem, I'd be in favor of a
mechanism of the kind proposed in Heikki's latest messages.  The
disconnect for me at the moment is that there's no evidence to justify
that amount of effort/risk.  A startup-time patch would have provided
that evidence, or else have proven that it's not worth spending more
time on.

			regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Tue, 10 May 2005, Bruce Momjian wrote:
> >> The current code is nice and localized and doesn't add any burden on our
> >> existing code, which is already complicated enough. I think we either
> >> fix checkfiles.c, or we remove it and decide it isn't worth checking for
> >> unreferenced files.
>
> > Let's pull the patch for now.
>
> FWIW, I was OK with the idea of adding something similar to the given
> patch to find out whether we had a problem or not.  With sufficient
> evidence that lost files are a big problem, I'd be in favor of a
> mechanism of the kind proposed in Heikki's latest messages.  The
> disconnect for me at the moment is that there's no evidence to justify
> that amount of effort/risk.  A startup-time patch would have provided
> that evidence, or else have proven that it's not worth spending more
> time on.

Agreed.

Imagine a backend creates a table file, then the operating system
crashes. I assume WAL wasn't fsync'ed, so there is no way that WAL can
discover that unreferenced file. While I think WAL can correct some
cases, I don't think it can correct them all, so it seems it is
necessary to check the file system against pg_class to catch all the
cases. The transaction and file system semantics are just different and
need to be checked against each other.

-- 
  Bruce Momjian                     |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us            |  (610) 359-1001
  +  If your life is a hard drive,  |  13 Roberts Road
  +  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073